Super HA with VMware Fault Tolerance

VMware Fault Tolerance

We know by now that VMware HA provides high availability to our virtual machines in case of a host failure or unplanned downtime by restarting the affected VMs on the surviving hosts in the cluster. By definition, the machine has to become unavailable before HA kicks in: this means an interruption of service measured in a few minutes, which may be unacceptable for some mission-critical applications. Point-of-sale systems, a critical database, or a messaging server are good examples of such applications; hence the need for zero downtime and VMware FT.

Traditionally, clustering at the OS and/or the application level has been the accepted solution for protecting those services. However, clustering can be much more complex to implement and has its limitations. For example, you need a cluster-aware application, and cluster-capable operating systems are usually much more expensive.

I once implemented RADIUS wireless authentication for a company whose Wi-Fi access points could be configured to authenticate against only one RADIUS server. If that RADIUS server went down for any reason (such as an HA event), no one would be able to access the network until it was brought back up. I could have explored clustering RADIUS, but enabling FT for this VM seemed like the easiest solution, as it is transparent to both applications and end users.

VMware FT
A diagram by VMware that shows FT providing zero-downtime, zero-data-loss continuous availability for your most critical applications

VMware Fault Tolerance protects a VM by creating a second copy of it running on another host and keeping the two in lockstep, with the secondary replaying everything the primary does to stay in complete sync. This mandates a set of strict and special requirements.

A recent CPU that is vLockstep capable is required, and the hosts need to be of the same CPU family. vMotion is much less strict in this regard, as you can enable EVC to run the VMs on a common denominator. With VMware FT there are no compromises; even the difference in CPU clock speed between the hosts needs to be within 400 MHz.

All hosts must be running the same build of VMware ESXi, which is not really an issue, as ideally all your hosts should be updated to the latest version you feel comfortable with (or that your hardware is compatible with).
Shared storage is also a requirement, as with VMware HA. In fact, although FT is configured per VM, the VM must reside in an HA-enabled cluster.

Fault Tolerance requires a 1 Gbps network card and a VMkernel port enabled for FT logging (which we will configure in this article). VMware recommends dedicating a physical NIC to this task, as every transaction performed on the primary VM must be copied and replayed on the secondary VM. Naturally, very low latency is required to keep the two VMs in lockstep; hence they need to be in the same geographical site.
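
To get a quick picture of whether your hosts meet the CPU and build requirements, you can query them programmatically. The following is a minimal sketch using pyVmomi (VMware's Python SDK); the vCenter address, credentials, and host names are placeholders you would replace with your own.

```python
# Minimal sketch: list each host's CPU model, clock speed, and ESXi build
# so you can spot mismatches before attempting FT. Connection details are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host="vcenter.lab.local", user="administrator@vsphere.local",
                  pwd="secret", sslContext=ssl._create_unverified_context())  # lab only
content = si.RetrieveContent()

view = content.viewManager.CreateContainerView(content.rootFolder, [vim.HostSystem], True)
for host in view.view:
    hw, product = host.summary.hardware, host.summary.config.product
    print(f"{host.name}: {hw.cpuModel.strip()}, {hw.cpuMhz} MHz, "
          f"{product.fullName} (build {product.build})")
view.Destroy()
Disconnect(si)
```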

With vSphere 5.1, this advanced feature became available in the Standard edition, whereas before it was only available at the Enterprise and Enterprise Plus levels. Still, there is no joy for the typical SMB, as even the Essentials Plus edition does not offer VMware FT.

Not supporting multicore is by far the biggest drawback of FT: critical VMs are usually heavily loaded and benefit from multiple vCPUs (vSMP). You seldom see a database server, for example, with a single core.

There is no overprovisioning: all virtual disks must be thick, and the amount of memory configured for the VM will be reserved for both the primary and the secondary VMs on their respective hosts.
In addition, some virtual hardware devices must not be connected to the VM, as USB, sound, and physical devices cannot be replayed with vLockstep.
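
Before turning FT on, it can save time to check a VM against these constraints up front. Below is a minimal pre-check sketch in pyVmomi, assuming the same connection pattern as above; the VM name Win7-POS is a hypothetical placeholder.

```python
# Minimal pre-check sketch: flag configurations that legacy FT cannot protect
# (multiple vCPUs, thin disks, USB/sound devices). "Win7-POS" is a hypothetical VM name.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host="vcenter.lab.local", user="administrator@vsphere.local",
                  pwd="secret", sslContext=ssl._create_unverified_context())  # lab only
content = si.RetrieveContent()

view = content.viewManager.CreateContainerView(content.rootFolder, [vim.VirtualMachine], True)
vm = next(v for v in view.view if v.name == "Win7-POS")

issues = []
if vm.config.hardware.numCPU > 1:
    issues.append("more than one vCPU (legacy FT supports a single vCPU only)")
for dev in vm.config.hardware.device:
    if isinstance(dev, vim.vm.device.VirtualDisk) and getattr(dev.backing, "thinProvisioned", False):
        issues.append(f"thin-provisioned disk: {dev.deviceInfo.label}")
    if isinstance(dev, (vim.vm.device.VirtualUSB, vim.vm.device.VirtualSoundCard)):
        issues.append(f"unsupported device: {dev.deviceInfo.label}")

print("\n".join(issues) if issues else "No obvious FT blockers found")
view.Destroy()
Disconnect(si)
```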

DRS will be automatically disabled for the primary and secondary VMs (while staying active for the rest of the cluster), as we need to minimize moving those VMs around.
Also, snapshots of the VM are not allowed, you cannot use Storage vMotion on its disks, and you cannot create linked clones from it.

Some Best Practices

It is generally recommended to have at least three hosts, so that if the ESXi host running either the primary or the secondary VM fails, you have a place to recreate it and keep the VM FT-protected.
Also, it is recommended not to run more than eight FT-protected VMs on the same host, as this strains the FT-enabled NIC.

Actually, VMware recommends a dedicated 10 Gbps NIC for this purpose, and it seems that 10 Gbps will become a requirement for vSphere 6.0 and onward with the introduction of the second generation of FT, which will finally be able to support multiple vCPUs.
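
If you want to see how close each host is to the eight-VM guideline, a short inventory pass does the job. This is a sketch in pyVmomi, again with placeholder connection details; it assumes that any VM with config.ftInfo populated is a member of an FT pair.

```python
# Sketch: count FT pair members per host to compare against the eight-VM guideline.
# Assumes a VM with config.ftInfo populated belongs to an FT pair.
import ssl
from collections import Counter
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host="vcenter.lab.local", user="administrator@vsphere.local",
                  pwd="secret", sslContext=ssl._create_unverified_context())  # lab only
content = si.RetrieveContent()

counts = Counter()
view = content.viewManager.CreateContainerView(content.rootFolder, [vim.VirtualMachine], True)
for vm in view.view:
    if vm.config is not None and vm.config.ftInfo is not None:
        counts[vm.runtime.host.name] += 1

for host_name, n in sorted(counts.items()):
    print(f"{host_name}: {n} FT-related VM(s)")
view.Destroy()
Disconnect(si)
```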

To make enabling FT on the VM much faster, some suggest inflating any thin disks before starting the process. Others suggest using Storage vMotion as a means to change the type of the virtual disks. I would go with the first option, as it is easier, but you may need the second if you do not have enough free space on the VM's current datastore.

Inflate Disks
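
If you prefer to inflate a thin disk ahead of time, the vSphere API exposes this through the virtual disk manager. The sketch below uses pyVmomi's InflateVirtualDisk_Task; the datastore path and datacenter lookup are placeholders for your environment, and the VM should be powered off while its disk is inflated.

```python
# Sketch: inflate a thin disk ahead of enabling FT using the virtual disk manager.
# The datastore path is a placeholder; keep the VM powered off during the operation.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

si = SmartConnect(host="vcenter.lab.local", user="administrator@vsphere.local",
                  pwd="secret", sslContext=ssl._create_unverified_context())  # lab only
content = si.RetrieveContent()

dc = next(e for e in content.rootFolder.childEntity if isinstance(e, vim.Datacenter))
task = content.virtualDiskManager.InflateVirtualDisk_Task(
    name="[datastore1] Win7-POS/Win7-POS.vmdk",  # placeholder path to the thin disk
    datacenter=dc)
WaitForTask(task)
Disconnect(si)
```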

SiteSurvey Tool is No Longer Supported

For vSphere 5.0 and earlier, there was a nice vSphere Client plugin from VMware, SiteSurvey, that examined your cluster for compatibility with FT. The tool produced an HTML report pointing out configuration that needed to be changed or hardware that might need to be replaced.

As of vSphere 5.1, this tool is no longer supported, and there is no alternative from VMware. To be honest, I personally do not find the requirements so complex that they need a special tool to check them before starting. It is certainly much simpler than configuring Microsoft Cluster Service (MSCS) or Oracle Real Application Clusters (RAC).

So, How is it Done?

The first step is to enable Fault Tolerance logging on a VMkernel port on all hosts where you want to have FT-protected VMs. Again, best practices dictate a dedicated VMkernel port on a dedicated network card (although this is not a strict requirement).

Fault Tolerance Logging
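
You would normally do this from the vSphere Client, but the same setting can be scripted. Here is a minimal pyVmomi sketch that tags an existing VMkernel port for FT logging; the host name esxi01.lab.local and the port vmk2 are assumptions for illustration.

```python
# Sketch: tag an existing VMkernel port for Fault Tolerance logging on one host.
# "esxi01.lab.local" and "vmk2" are assumptions; adjust to your environment.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host="vcenter.lab.local", user="administrator@vsphere.local",
                  pwd="secret", sslContext=ssl._create_unverified_context())  # lab only
content = si.RetrieveContent()

view = content.viewManager.CreateContainerView(content.rootFolder, [vim.HostSystem], True)
host = next(h for h in view.view if h.name == "esxi01.lab.local")

nic_mgr = host.configManager.virtualNicManager
nic_mgr.SelectVnicForNicType("faultToleranceLogging", "vmk2")   # enable FT logging on vmk2

# Verify which VMkernel port is now selected for FT logging
net_cfg = nic_mgr.QueryNetConfig("faultToleranceLogging")
print(net_cfg.selectedVnic)
view.Destroy()
Disconnect(si)
```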

Now let us protect a critical Windows 7 VM, which is something Microsoft clustering cannot do, as only server operating systems provide clustering (as of Windows Server 2012, clustering is supported in the Standard edition; before that, you had to buy the much more expensive Enterprise edition to get clustering).

Turn on Fault Tolerance
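
Turning FT on is a right-click away in the client, and it can also be done through the API. The sketch below calls CreateSecondaryVM_Task on the VM via pyVmomi; the VM name is a placeholder, and passing no host lets vCenter choose where to place the secondary.

```python
# Sketch: turn on FT for a VM by asking vCenter to create its secondary.
# "Win7-POS" is a placeholder; host=None lets vCenter pick the secondary's placement.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

si = SmartConnect(host="vcenter.lab.local", user="administrator@vsphere.local",
                  pwd="secret", sslContext=ssl._create_unverified_context())  # lab only
content = si.RetrieveContent()

view = content.viewManager.CreateContainerView(content.rootFolder, [vim.VirtualMachine], True)
vm = next(v for v in view.view if v.name == "Win7-POS")

WaitForTask(vm.CreateSecondaryVM_Task(host=None))   # equivalent of "Turn On Fault Tolerance"
print(vm.runtime.faultToleranceState)
view.Destroy()
Disconnect(si)
```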

Since I did not inflate this VM's virtual disks, VMware FT will do it for me. It will also disable DRS on this VM and will reserve the configured memory size for both the original VM and its mirror.

Disable DRS

For a while, you will (hopefully) not see anything more than a progress indicator on the left, which shows that vSphere is enabling FT for your critical Windows 7 VM. If it succeeds, you will notice that you have two mirrored VMs: a primary in dark blue and a secondary in shadowy gray (each on a different host).

Related Objects

If you turn on the primary virtual machine, vSphere will also start the secondary VM on the other host.

Top Level Objects
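
Powering on the pair can be scripted as well. This is a sketch with the same placeholder VM name, assuming you hold a reference to the primary; vCenter takes care of starting the secondary.

```python
# Sketch: power on the primary; vCenter starts the secondary on another host.
# "Win7-POS" is a placeholder VM name.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

si = SmartConnect(host="vcenter.lab.local", user="administrator@vsphere.local",
                  pwd="secret", sslContext=ssl._create_unverified_context())  # lab only
content = si.RetrieveContent()

view = content.viewManager.CreateContainerView(content.rootFolder, [vim.VirtualMachine], True)
vm = next(v for v in view.view if v.name == "Win7-POS")

WaitForTask(vm.PowerOnVM_Task())
print(vm.runtime.powerState, vm.runtime.faultToleranceState)
view.Destroy()
Disconnect(si)
```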

Notice that we now have new information on the Summary tab of the virtual machine: the status of Fault Tolerance protection, the resources consumed by the secondary VM, and how many microseconds the secondary's virtual time lags behind the physical time (the vLockstep Interval).

Summary
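
The same details can be read through the API. The sketch below pulls the FT state and role plus the quickStats counters; the quickStats field names come from the vSphere API's VirtualMachineQuickStats and their availability may vary by version, so treat this as an assumption to verify in your environment.

```python
# Sketch: read the FT details shown on the Summary tab through the API.
# quickStats FT fields are assumed from VirtualMachineQuickStats; availability may vary.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host="vcenter.lab.local", user="administrator@vsphere.local",
                  pwd="secret", sslContext=ssl._create_unverified_context())  # lab only
content = si.RetrieveContent()

view = content.viewManager.CreateContainerView(content.rootFolder, [vim.VirtualMachine], True)
vm = next(v for v in view.view if v.name == "Win7-POS")

print("FT state:         ", vm.runtime.faultToleranceState)   # e.g. 'running'
print("FT role:          ", vm.config.ftInfo.role)            # 1 = primary
print("FT log bandwidth: ", vm.summary.quickStats.ftLogBandwidth)
print("Secondary latency:", vm.summary.quickStats.ftSecondaryLatency)  # the vLockstep lag
view.Destroy()
Disconnect(si)
```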

Conclusion

VMware FT may seem demanding on the requirements and constraints side, but as we saw, those requirements are easily met, and the steps needed to enable it are very simple compared to other fault-tolerance products. It may even be much cheaper to implement.

The beauty of this technology is that nobody except the virtualization admin will sense it, as it requires zero modification to the operating system or application while providing users with zero downtime for their most critical services.

This makes VMware FT a valuable tool on your tool belt, but not yet the solution for everything. The requirements and constraints may be hard to overcome depending on your needs and configuration.
