VMware's vSphere and vCenter product family is a popular and powerful virtualization suite. This post focuses on ten of the biggest mistakes people make when configuring the High Availability (HA) and Distributed Resource Scheduler (DRS) features. We’ll begin by looking at five common HA issues, then we’ll look at four common DRS issues, then conclude with an issue that affects both HA and DRS.
HA is included in almost every edition of vSphere, including one of the small business bundles (Essentials Plus). The impact of an ESXi host failure is much bigger than the loss of a single server in the traditional world, because many virtual machines (VMs) are affected at once. Thus, it is very important to get HA designed and configured correctly.
Purchasing Differently Configured Servers
One of the common mistakes people make is buying differently sized servers (more CPU and/or memory in some servers than others) and placing them in the same cluster. This is often done with the idea that some VMs require a lot more resources than others, and the big, powerful servers are more expensive than several smaller servers. The problem with this thinking is that HA is pessimistic and assumes that the largest servers will fail, so it sets aside enough capacity to cover the loss of the biggest host rather than an average one.
Solution: Either buy servers that are configured the same (or at least similarly) or create a couple of different clusters, with each cluster having servers configured the same. Some people also implement affinity rules to keep the big VMs on designated servers, but this impacts DRS – we’ll cover that issue later.
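To make the cost of mixed host sizes concrete, here is a minimal sketch that compares the capacity left over when HA plans for the loss of the largest host. The host sizes are made-up illustration values, and the function is a hypothetical helper, not part of any VMware tool.

```python
def usable_after_worst_case(host_memory_gb, failures=1):
    """Return the memory left if the `failures` largest hosts are lost."""
    ordered = sorted(host_memory_gb, reverse=True)
    return sum(ordered[failures:])


# Two hypothetical clusters with the same total memory (896 GB).
mixed = [512, 128, 128, 128]    # one big host, three small ones
uniform = [224, 224, 224, 224]  # identically sized hosts

mixed_usable = usable_after_worst_case(mixed)
uniform_usable = usable_after_worst_case(uniform)

print(mixed_usable)    # 384
print(uniform_usable)  # 672
```

With the same amount of purchased memory, the uniform cluster keeps far more capacity available for running VMs under the pessimistic assumption.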
Insufficient Hosts to Run All VMs Accounting for HA Overhead
When budgets are tight, many administrators size their environments with just enough resources to run all the VMs that are needed. They forget to account for the overhead HA imposes to guarantee that sufficient resources exist to restart the VMs from a failed host (or multiple hosts, if you are pessimistic). VMware’s best practice is to always leave Admission Control enabled so that HA automatically sets aside resources to restart VMs after a host failure.
Solution: Plan for the HA overhead and purchase sufficient hardware to cover the resources required by the VMs in the environment plus the overhead for HA.
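A rough way to plan for this overhead is sketched below. It assumes identically sized hosts and looks at memory only; the function name and the numbers are hypothetical, and a real sizing exercise would also cover CPU and growth.

```python
import math


def hosts_to_buy(vm_demand_gb, host_capacity_gb, failures_to_tolerate=1):
    """Hosts needed to run all VMs plus spare hosts for HA failover."""
    hosts_for_vms = math.ceil(vm_demand_gb / host_capacity_gb)
    return hosts_for_vms + failures_to_tolerate


# Example: 700 GB of VM memory demand on hypothetical 256 GB hosts.
without_ha = hosts_to_buy(700, 256, failures_to_tolerate=0)
with_ha = hosts_to_buy(700, 256, failures_to_tolerate=1)

print(without_ha)  # 3
print(with_ha)     # 4
```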
Using the Host Failures Cluster Tolerates Policy
Recall that there are three admission control policies, namely:
- Host failures the cluster tolerates: Originally the only option for HA, this policy assumes the loss of a specified number of hosts (one to four in versions 3 and 4, up to 31 in vSphere 5).
- Percentage of cluster resources reserved as failover spare capacity: Introduced in vSphere 4, this option sets aside a specified percentage of both CPU and memory resources from the total in the cluster for failover use; vSphere 5 improved this option by allowing different percentages to be specified for CPU and memory.
- Specify failover hosts: This policy specifies a standby host that runs all the time but is never used for running VMs unless a host in the cluster fails. It was introduced in vSphere 4 and upgraded in version 5 by allowing multiple hosts to be specified.
As described previously, HA is pessimistic and always assumes the largest host will fail, so it reserves more resources than are usually needed if the hosts are sized differently (though, per the first issue, we don’t recommend that). The Host failures the cluster tolerates policy also uses a concept called slots to reserve the right amount of spare capacity, but it takes a “one size fits all” approach: the largest CPU reservation and the largest memory reservation among the VMs become the slot size for all VMs.
Solution: Use the VMware-recommended policy of percentage of cluster resources reserved as failover spare capacity instead, which takes a percentage of the entire cluster’s resources and uses the actual reservations on each VM instead of the largest reservation.
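The difference between the two approaches can be illustrated with a simplified sketch. The VM names and reservations are invented, and the calculation deliberately ignores details such as memory overhead and default slot sizes, so treat it as an approximation of the idea rather than HA's exact algorithm.

```python
# Hypothetical VMs with their CPU (MHz) and memory (MB) reservations.
vms = [
    {"name": "db01", "cpu_mhz": 4000, "mem_mb": 16384},
    {"name": "web01", "cpu_mhz": 500, "mem_mb": 1024},
    {"name": "web02", "cpu_mhz": 500, "mem_mb": 1024},
]

# Slot-based view: every VM is charged the largest reservation seen.
slot_cpu = max(vm["cpu_mhz"] for vm in vms)
slot_mem = max(vm["mem_mb"] for vm in vms)
slot_based_cpu = slot_cpu * len(vms)
slot_based_mem = slot_mem * len(vms)

# Percentage-based view: each VM is charged its own reservation.
actual_cpu = sum(vm["cpu_mhz"] for vm in vms)
actual_mem = sum(vm["mem_mb"] for vm in vms)

print(slot_based_cpu, slot_based_mem)  # 12000 49152
print(actual_cpu, actual_mem)          # 5000 18432
```

A single VM with a large reservation inflates the slot size for everything else, which is why the slot-based policy tends to reserve far more than the VMs actually need.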
Forgetting to Update the Percentage Admission Control Policy as Cluster Grows
If the Percentage of cluster resources reserved as failover spare capacity policy is used (as suggested), it is important to reserve the correct amount of CPU and memory based on the needs of the VMs and the size of the cluster. For example, in a two-node cluster, the loss of one of the nodes removes half of the cluster resources (assuming they are sized the same). Thus, the percentage may be set to 50. However, if additional nodes are added to the cluster later, that value is probably too high and should be reduced to take into account the additional node(s) and the number of simultaneous failures expected. For example, with four nodes, planning for the loss of one node suggests setting the percentage to 25, while planning for two failures calls for 50 percent.
Solution: Go back and recalculate the appropriate value in your cluster whenever hosts are added to or removed from the cluster.
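The recalculation can be sketched as a small helper. It assumes all hosts are sized the same, as in the examples above, and rounds up so the reservation never falls short; the function name is hypothetical.

```python
def failover_percentage(host_count, failures_to_tolerate=1):
    """Percentage of cluster resources to reserve, rounded up."""
    if not 0 < failures_to_tolerate < host_count:
        raise ValueError("failures must be at least 1 and fewer than the host count")
    # Integer ceiling of 100 * failures / hosts.
    return (100 * failures_to_tolerate + host_count - 1) // host_count


print(failover_percentage(2))     # 50
print(failover_percentage(4))     # 25
print(failover_percentage(4, 2))  # 50
print(failover_percentage(3))     # 34
```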
Configuring VM Restart Priorities Inefficiently
One of the settings in an HA cluster is the default restart priority of VMs after a host failure. It defaults to Medium, but can be set to Low, Medium, or High, or to Disabled if most VMs should not be restarted after a host failure.
Solution: Consider setting the cluster default for restart priority to Low, which leaves two higher levels available for more important VMs. For example, infrastructure VMs such as domain controllers or DNS servers might be the highest priority (set those VMs to High), followed by critical services such as database or e-mail servers (set those VMs to Medium), with the rest of the VMs left at the default (Low). Any VMs that don’t need to be restarted can be set to Disabled to save resources after a host failure.
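The effect of this tiering can be illustrated with a short sketch that orders VMs by their effective priority. The VM names are invented, and the code only models the ordering described above; it is not how HA itself is implemented or configured.

```python
PRIORITY_ORDER = {"High": 0, "Medium": 1, "Low": 2}


def restart_order(vm_priorities, cluster_default="Low"):
    """Order VM names by restart priority, skipping Disabled VMs.

    A value of None means the VM inherits the cluster default.
    """
    effective = {
        name: (priority or cluster_default)
        for name, priority in vm_priorities.items()
    }
    restartable = [name for name, p in effective.items() if p != "Disabled"]
    # Sort by priority first, then by name for a stable, readable order.
    return sorted(restartable, key=lambda n: (PRIORITY_ORDER[effective[n]], n))


# Hypothetical per-VM overrides on a cluster whose default is Low.
overrides = {
    "dc01": "High",
    "dns01": "High",
    "sql01": "Medium",
    "mail01": "Medium",
    "file01": None,        # inherits the cluster default
    "test01": "Disabled",  # never restarted
}

order = restart_order(overrides)
print(order)  # ['dc01', 'dns01', 'mail01', 'sql01', 'file01']
```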
Reprinted with permission from Ten vSphere HA and DRS Misconfiguration Issues