How to Enable VMware HA for High Availability


After discussing how to use VMware’s vMotion, which moves virtual machines off hosts so that planned maintenance can be performed, and learning about DRS clusters, which help balance workloads between ESXi hosts, a clear question emerged: What about unplanned downtime? What can we do to reduce the impact of a host failing without prior notice? VMware has an answer to this question as well, in the form of VMware High Availability (HA).

If an ESXi server fails, the VMs hosted on it will go down or at least become unreachable. VMware HA responds by restarting the VMs on other hosts in the cluster that survived the failure. This assumes that protected VMs are hosted on shared storage that is accessible to other hosts in the cluster.

[Image: VMware diagram illustrating HA restarting VMs on surviving hosts in the cluster.]

Like vMotion, HA is included in vSphere licenses starting with Essentials Plus. This means that only the Essentials and free editions do not offer this feature.

VMware HA can also monitor the operating system inside the VM and restart it in case of failure.

To enable VMware HA you first need to create a new cluster and turn on vSphere HA for it. In order for HA to work, Host Monitoring must be enabled as it drives ESXi hosts in the cluster to exchange network heartbeats to indicate that they are alive and well.

Any time you need to perform network maintenance, it is a good idea to disable Host Monitoring first, so that HA does not respond by powering off and restarting VMs in the cluster.

[Image: New Cluster]

One of the most common mistakes with an HA cluster is fully utilizing the resources of every host in it. If an ESXi server fails, the VMs hosted on it need to be restarted somewhere. If all other hosts in the cluster are already fully utilized, they will have a hard time accommodating the homeless VMs. After a failure, some VMs may not find a host to restart on; at best, every VM in the cluster will suffer performance problems.

To avoid this scenario, it is strongly advised to always keep some capacity in the cluster reserved for HA failover. Admission control enforces this by refusing to power on any new VM if the resources left in the cluster would no longer satisfy the configured values.

Admission Control Policies

The first type of policy reserves a specified number of hosts in the cluster for failover. This can be very wasteful if you do not have a large number of hosts in your cluster. For example, in a two-node cluster, setting aside a full server's worth of resources is a huge waste. Even in a four-node cluster, one complete host's worth of resources is 25% of the total capacity.
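The arithmetic behind that waste is easy to check. The following is a minimal sketch that assumes all hosts in the cluster are the same size; the function name is hypothetical and not part of any VMware tool.

```python
def failover_host_share(total_hosts: int, failover_hosts: int = 1) -> float:
    """Fraction of cluster capacity set aside when whole hosts are reserved.

    Assumes every host contributes the same amount of CPU and memory.
    """
    if total_hosts <= 0:
        raise ValueError("total_hosts must be positive")
    if not 0 <= failover_hosts <= total_hosts:
        raise ValueError("failover_hosts must be between 0 and total_hosts")
    return failover_hosts / total_hosts


if __name__ == "__main__":
    # The cost of one reserved host shrinks as the cluster grows.
    for hosts in (2, 4, 8, 16):
        share = failover_host_share(hosts)
        print(f"{hosts:>2} hosts: {share:.1%} of capacity reserved")
```

With two hosts, half of the cluster sits idle; with sixteen, the same single-host reservation costs only a small fraction of total capacity.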

It is recommended instead to set aside a reasonable percentage of memory and CPU capacity. When calculating this percentage, keep in mind that not all VMs are equally critical. Some VMs can even stay down if an ESXi server fails.
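To make the percentage policy concrete, here is a simplified sketch of the kind of check admission control performs before a VM is powered on. It is an illustration only: the class, the function, and the numbers are hypothetical, and the real calculation in vSphere is more involved than a comparison of raw totals.

```python
from dataclasses import dataclass


@dataclass
class ClusterCapacity:
    """Hypothetical summary of a cluster's resources."""
    total_cpu_mhz: int
    total_mem_mb: int
    used_cpu_mhz: int  # CPU already committed to powered-on VMs
    used_mem_mb: int   # memory already committed to powered-on VMs


def can_power_on(cluster: ClusterCapacity, vm_cpu_mhz: int, vm_mem_mb: int,
                 reserve_cpu_pct: float, reserve_mem_pct: float) -> bool:
    """Return True if the new VM fits without eating into the failover reserve."""
    cpu_limit = cluster.total_cpu_mhz * (100 - reserve_cpu_pct) / 100
    mem_limit = cluster.total_mem_mb * (100 - reserve_mem_pct) / 100
    return (cluster.used_cpu_mhz + vm_cpu_mhz <= cpu_limit
            and cluster.used_mem_mb + vm_mem_mb <= mem_limit)


if __name__ == "__main__":
    cluster = ClusterCapacity(total_cpu_mhz=40_000, total_mem_mb=262_144,
                              used_cpu_mhz=28_000, used_mem_mb=180_000)
    # With 25% reserved, only 30,000 MHz and 196,608 MB are usable.
    print(can_power_on(cluster, 2_000, 8_192, 25, 25))  # small VM
    print(can_power_on(cluster, 4_000, 8_192, 25, 25))  # larger VM
```

Lowering the reserved percentages admits more VMs, at the cost of less headroom when a host fails.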

[Image: HA Cluster]

VM Monitoring

VM Monitoring is a very useful feature, especially when a VM runs poorly optimized code that can cause it to stop responding. I used to get frequent calls asking me to restart a particular VM that kept freezing. After I enabled VM Monitoring, HA solved the problem for me.

VM Monitoring detects heartbeats from VMware Tools installed inside the VM. Those heartbeats have nothing to do with virtual machine network traffic; they are sent directly from the tools to the host VMkernel. You can even take it one step further and use APIs to monitor the state of applications running inside the VM.
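The idea of heartbeat-based detection can be sketched in a few lines. This is a toy model, not how VMware Tools or the VMkernel are implemented: the class name, the 30-second interval, and the VM names are all assumptions made for illustration.

```python
from typing import Dict, List


class HeartbeatMonitor:
    """Tracks the last heartbeat seen from each VM (hypothetical model)."""

    def __init__(self, failure_interval_s: float = 30.0) -> None:
        # Assumed value: how long a VM may stay silent before it is flagged.
        self.failure_interval_s = failure_interval_s
        self._last_seen: Dict[str, float] = {}

    def record(self, vm_name: str, timestamp: float) -> None:
        """Store the time of the most recent heartbeat from a VM."""
        self._last_seen[vm_name] = timestamp

    def unresponsive(self, now: float) -> List[str]:
        """Return the VMs that have been silent longer than the interval."""
        return sorted(
            vm for vm, seen in self._last_seen.items()
            if now - seen > self.failure_interval_s
        )


if __name__ == "__main__":
    monitor = HeartbeatMonitor(failure_interval_s=30.0)
    monitor.record("web01", 100.0)
    monitor.record("db01", 125.0)
    # At t=140, web01 has been silent for 40 s and db01 for 15 s.
    print(monitor.unresponsive(now=140.0))
```

A VM flagged this way is a candidate for a restart, which is what replaced the manual phone calls described above.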

[Image: Edit Cluster Settings]

Datastore Heartbeat

In vSphere versions earlier than 5.0, HA relied exclusively on management network heartbeats to determine whether a host had failed. If your host's management network is separate from your virtual machine network, the former can fail while the latter keeps serving VMs. Regardless, HA would assume that the host had failed and restart those VMs elsewhere.

With datastore heartbeats, a host signals that it is alive by writing data to shared storage, where other hosts (mainly the cluster master) can see that it is still running even if the management network has failed. This prevents the master from triggering an HA response and restarting the running machines elsewhere.
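The master's reasoning can be summarized as a small decision function. This is a deliberately simplified sketch of the logic described above, with hypothetical names; the actual HA agent considers more signals than these two.

```python
from enum import Enum


class HostState(Enum):
    ALIVE = "alive"
    RUNNING_BUT_UNREACHABLE = "running but unreachable over the management network"
    FAILED = "failed"


def classify_host(network_heartbeat: bool, datastore_heartbeat: bool) -> HostState:
    """Combine both heartbeat types into one verdict about a host."""
    if network_heartbeat:
        return HostState.ALIVE
    if datastore_heartbeat:
        # No network heartbeat, but the host still writes to shared storage.
        return HostState.RUNNING_BUT_UNREACHABLE
    return HostState.FAILED


def should_restart_vms(state: HostState) -> bool:
    """Only a host that is silent on both channels triggers a restart."""
    return state is HostState.FAILED


if __name__ == "__main__":
    for net, ds in [(True, True), (False, True), (False, False)]:
        state = classify_host(net, ds)
        print(f"network={net}, datastore={ds}: {state.value}, "
              f"restart={should_restart_vms(state)}")
```

Without the datastore check, the second case would be indistinguishable from the third, which is the pre-5.0 behavior.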

[Image: vSphere HA]

Host Isolation Response

What HA does in response to host isolation is very important. If a host is running but cannot exchange heartbeats on the management network, it pings the cluster isolation address (its gateway by default). If pinging the address does not work, the host considers itself isolated.

You then have three options: leave the VMs running, power them off and fail over, or shut them down gracefully and then fail over. The correct choice depends greatly on your network topology:
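Seen from the host's side, the isolation test reduces to two questions. The sketch below assumes a caller-supplied `ping` function, so it can be run without touching a real network; the function name and the addresses (taken from the documentation range 192.0.2.0/24) are placeholders.

```python
from typing import Callable, Iterable


def is_isolated(receiving_heartbeats: bool,
                isolation_addresses: Iterable[str],
                ping: Callable[[str], bool]) -> bool:
    """Return True if the host should consider itself isolated.

    `ping` is a hypothetical callable that returns True when an address answers.
    """
    if receiving_heartbeats:
        return False
    # Isolated only if none of the isolation addresses responds.
    return not any(ping(address) for address in isolation_addresses)


if __name__ == "__main__":
    reachable = {"192.0.2.1"}

    def fake_ping(address: str) -> bool:
        return address in reachable

    print(is_isolated(False, ["192.0.2.1"], fake_ping))    # gateway answers
    print(is_isolated(False, ["192.0.2.254"], fake_ping))  # nothing answers
```

Passing several addresses makes the check more robust, because a single unreachable device no longer decides the outcome.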

If the management traffic does not share the same network as virtual machine traffic and iSCSI traffic (if you use that to connect the host to its datastores) then you’re probably better off leaving the VMs running since losing the management network is not likely to affect the services they offer.

However, if management traffic shares the same network as the VMs, then you may want to shut them down and fail over to another host, as the VMs will not be accessible from the network if they stay on the isolated host.

Yet if the same network is also used for iSCSI storage traffic, there is no point in wasting five minutes trying to shut the VMs down gracefully. You may want HA to minimize downtime and immediately power the VMs off to restart them on another host. On the other hand, you may prefer to leave the machines on in the hope that the network problem is solved soon and the VMs recover without losing any work.
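The three cases above can be written down as a small rule of thumb. This is only a sketch of the article's reasoning, not an official recommendation engine: the function and its parameters are hypothetical, and the case where management shares a network with storage but not with VMs is not covered by the text, so the sketch treats it like the shared-storage case.

```python
from enum import Enum


class IsolationResponse(Enum):
    LEAVE_POWERED_ON = "leave powered on"
    SHUT_DOWN = "shut down, then fail over"
    POWER_OFF = "power off, then fail over"


def suggest_response(mgmt_shares_vm_network: bool,
                     mgmt_shares_storage_network: bool) -> IsolationResponse:
    """Suggest an isolation response from the network layout."""
    if mgmt_shares_storage_network:
        # Storage is probably gone too, so a graceful shutdown is unlikely to work.
        return IsolationResponse.POWER_OFF
    if mgmt_shares_vm_network:
        # VMs are unreachable but storage still works: shut down cleanly.
        return IsolationResponse.SHUT_DOWN
    # Only management is affected; the VMs keep serving clients.
    return IsolationResponse.LEAVE_POWERED_ON


if __name__ == "__main__":
    print(suggest_response(False, False).value)
    print(suggest_response(True, False).value)
    print(suggest_response(True, True).value)
```

As noted above, leaving the VMs powered on remains a defensible choice even in the last case if you expect the network to recover quickly.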

[Image: Shutdown then Failover]

Advanced Options

HA’s advanced options are more often used and much better documented than DRS’. The following is a good example:

Since HA pings the VMkernel gateway by default to determine whether a host is isolated, a big problem can arise if the gateway is a virtual router hosted on one of the ESXi servers in the cluster. The ESXi server hosting the router VM will always think that it is connected, because it hosts the isolation address itself, while the other ESXi servers will believe that they were disconnected from the network when in reality they were not. In fact, even if the gateway is a hardware router, losing that gateway may confuse all the ESXi servers.

The solution is very simple: use one or more reliable addresses as the isolation address instead. Core switches in your datacenter can be a great alternative. Just remember to add “das.usedefaultisolationaddress=false” so that HA stops pinging the default gateway, and list the replacement addresses as das.isolationaddress1, das.isolationaddress2, and so on.
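As an illustration, the helper below assembles the set of advanced options described above from a list of addresses. The option names come from the article; the helper itself and the sample addresses (from the documentation range 192.0.2.0/24) are hypothetical, and the resulting key/value pairs would still have to be entered in the cluster's advanced options by hand or by your own tooling.

```python
from typing import Dict, List


def isolation_options(addresses: List[str],
                      use_default_gateway: bool = False) -> Dict[str, str]:
    """Build HA advanced options for custom isolation addresses."""
    if not addresses and not use_default_gateway:
        raise ValueError("at least one isolation address is required "
                         "when the default gateway is not used")
    options = {
        "das.usedefaultisolationaddress": str(use_default_gateway).lower(),
    }
    # Number the addresses from 1, following the article's naming.
    for index, address in enumerate(addresses, start=1):
        options[f"das.isolationaddress{index}"] = address
    return options


if __name__ == "__main__":
    # Placeholder addresses standing in for two core switches.
    for key, value in isolation_options(["192.0.2.10", "192.0.2.11"]).items():
        print(f"{key} = {value}")
```

Using two addresses on different devices avoids making a single switch the new single point of failure for isolation detection.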

[Image: HA-DRS Cluster]

Conclusion

HA is great at restoring normal operation automatically in a very short time. The benefits are without a doubt very valuable for the organization. Calling an admin at home outside normal working hours to connect to the datacenter and restart VMs manually is a far greater waste of time and effort.

However, VMware HA may not be adequate for everything. Some special applications cannot afford any downtime. For those applications, the wait for the VMs to be restarted on another host is unacceptable, and the loss of data that was being processed at the time of failure may be tremendous. For those applications VMware developed Fault Tolerance (FT), which will be the subject of our next article.
