The DRS feature is a little more advanced than HA but, for all but the smallest environments, no less important: keeping performance balanced across many VMs in a dynamic environment is an ever-present concern, and one that, without DRS, could easily consume one or more administrators full-time. Leverage DRS to keep the environment running as smoothly as possible.
Not Preparing for New Hardware
One of the biggest changes many administrators don’t plan for is new hardware (new CPUs with more advanced capabilities, or a switch between Intel and AMD). The problem is that you end up with “islands of vMotion compatibility,” where VMs, once started, can only be moved to some of the other servers in the cluster. This severely limits what DRS can do to load-balance the cluster.
Solutions: This issue can be addressed in several ways:
- Build separate clusters for AMD and Intel (if you have both CPU architectures – better yet, stick with a single CPU vendor) to solve the CPU vendor issue.
- Always enable Enhanced vMotion Compatibility (EVC) on every cluster so that, as new nodes are added, they will be “dumbed down” to the level of your existing hosts.
- As old hosts are removed from a cluster, remember to raise the EVC level to expose the capabilities of the newer hosts; the EVC baseline can never be higher than the lowest CPU generation in the cluster.
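To make the “lowest common denominator” behavior concrete, here is a minimal sketch in plain Python (not a VMware API; the generation names and their ordering are illustrative examples only):

```python
# Illustrative sketch of how an EVC baseline works: the cluster exposes
# only the CPU feature set of its oldest-generation host. Generation
# names and their ordering here are examples, not an official list.
EVC_LEVELS = ["merom", "penryn", "nehalem", "westmere", "sandybridge"]

def cluster_evc_baseline(host_levels):
    """Highest baseline every host supports = level of the oldest host."""
    if not host_levels:
        raise ValueError("cluster has no hosts")
    return min(host_levels, key=EVC_LEVELS.index)

# A newer host joining an EVC cluster is "dumbed down" to the baseline:
print(cluster_evc_baseline(["sandybridge", "nehalem", "westmere"]))  # nehalem
# Once the oldest host is retired, the baseline can be raised:
print(cluster_evc_baseline(["sandybridge", "westmere"]))  # westmere
```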
“I’m Smarter than DRS” Mentality
This is a common mentality for administrators who are new to DRS – they don’t trust it, or they want to know where every VM is at all times. I once had a student who said his security department mandated documentation of which host each VM was located on – very silly in a virtual environment, which is designed to be dynamic. In other cases, administrators simply believe they can balance the load better than DRS can.
Solution: Let DRS run in Fully Automated mode. You are not smarter than DRS. There are just too many VMs to watch, and you can’t always watch them, but DRS will check on the load balance of the cluster every five minutes and will automatically load balance as conditions change.
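The “too many VMs to watch” point can be made concrete. DRS evaluates cluster balance on a regular interval using a standard-deviation-style metric of host load; the toy Python below sketches that idea (it is not the actual DRS algorithm, which weighs resource entitlements rather than raw utilization):

```python
import statistics

def cluster_imbalance(host_loads):
    """Population standard deviation of per-host load (0.0 = perfectly even)."""
    return statistics.pstdev(host_loads)

def needs_rebalance(host_loads, target_imbalance):
    """Recommend migrations only when imbalance exceeds the target."""
    return cluster_imbalance(host_loads) > target_imbalance

# One hot host and two quiet ones trip the check; an even spread does not.
print(needs_rebalance([0.9, 0.2, 0.4], 0.2))  # True
print(needs_rebalance([0.5, 0.5, 0.5], 0.2))  # False
```

A person might spot the hot host once; a periodic check like this catches it every interval, around the clock.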
Setting the Migration Threshold Too Aggressively
One of the mistakes new administrators often make with DRS is setting the Migration Threshold too aggressively. This value is set on a five-point scale from Conservative to Aggressive. Conservative only implements Priority 1 (five-star) recommendations, namely: the host is going into maintenance mode, reservations on the host exceed the host’s capacity, or affinity rules are violated. Priorities 2 – 5 (four- to one-star recommendations, respectively) take performance into account, applying higher-priority recommendations when the cluster is more out of balance and lower-priority recommendations when the difference between nodes is smaller.
Many administrators think that they want to be as aggressive as possible to be as balanced as possible, but remember that there is a trade-off between being perfectly balanced and the cost of achieving that balance; in other words, the cost of vMotion. Doing too much vMotion may actually cost more than the benefits of being perfectly balanced.
Solution: Set the threshold to the mid-point – Priority 3 – unless the load is fairly static. Analyze cluster performance and recommendations and adjust as necessary.
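The trade-off behind this recommendation amounts to a simple cost/benefit check, sketched below in Python (the model and numbers are illustrative, not how DRS actually scores migrations):

```python
def worth_migrating(benefit_per_interval, stable_intervals, vmotion_cost):
    """A migration pays off only if the balance gain, accumulated over the
    time the load pattern stays stable, exceeds the one-time vMotion cost."""
    return benefit_per_interval * stable_intervals > vmotion_cost

# A volatile load (stable for ~2 intervals) doesn't justify the move;
# a steady load (stable for ~12 intervals) does.
print(worth_migrating(1.0, 2, 5.0))   # False
print(worth_migrating(1.0, 12, 5.0))  # True
```

This is why an aggressive threshold hurts volatile workloads most: the load shifts again before the migration has paid for itself.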
Non-optimal Sizing of Clusters
A cluster (HA or DRS) can have up to 32 nodes, but just because you can doesn’t mean you should. Very small clusters give DRS few options for load balancing and often incur higher HA overhead, reducing the capacity available to run VMs.
On the other hand, very large clusters may be fine from an HA perspective. In vSphere 5, there is one master node and the rest are slaves, but any slave can be promoted to master if the master fails, so large cluster sizes are okay. If you run version 4 or below, however, you may wish to use a smaller cluster, as there is a maximum of five primary nodes, with the remaining nodes being secondaries, and secondary nodes are usually not automatically promoted to primary when a primary fails. This matters because, if all primary nodes are down, HA will not automatically restart anything.
DRS clusters are another matter, however. The problem is that the larger the cluster and the more VMs in the cluster, the more possible scenarios vCenter has to analyze, dramatically increasing the load on that server.
Solution: Many experts recommend putting between 16 and 24 hosts in a cluster as a good balance between the reduced overhead for HA and the increased load on the vCenter for DRS. If you use linked clones, such as with View, the maximum cluster size is eight nodes.
HA and DRS
Finally, there’s an issue that affects both HA and DRS. Optimizing VMs through the proper use of reservations, limits, and shares is a more time-consuming and challenging task than many previously listed, but it will pay dividends day in and day out.
Overuse of Reservations, Limits, and Affinities
One of the powerful features in vSphere is the ability to guarantee a certain level of resources (via reservations) or to cap consumption (via limits) for VMs. Doing so, however, reduces the options that HA and DRS have for load balancing and restarting VMs. Affinity rules, while convenient and sometimes necessary for HA, performance, or licensing reasons, add even more constraints on HA and DRS.
Solution: Use shares whenever possible instead of using reservations and limits, and minimize the use of affinity (VM-to-VM, as well as VM-to-Host) rules to give HA and DRS the most possible options. If limits and reservations are needed, implement them at the resource pool level whenever possible instead of at the individual VM level.
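The key difference is that shares divide capacity proportionally only under contention, while a reservation fences capacity off even when the VM is idle. A minimal Python sketch of proportional-share allocation (an illustrative model with made-up VM names and numbers, not a vSphere API):

```python
def allocate_by_shares(capacity_mhz, shares):
    """Under contention, each VM receives capacity in proportion to its
    shares; nothing is permanently withheld, unlike a reservation."""
    total = sum(shares.values())
    return {vm: capacity_mhz * s / total for vm, s in shares.items()}

# With 12,000 MHz contended, "db" (4000 shares) gets twice "web" (2000):
alloc = allocate_by_shares(12000, {"web": 2000, "db": 4000, "batch": 2000})
print(alloc)  # {'web': 3000.0, 'db': 6000.0, 'batch': 3000.0}
```

When contention ends, shares stop mattering and every VM can use whatever it needs, which is exactly the flexibility HA and DRS want.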
Reposted from Ten vSphere HA and DRS Misconfiguration Issues