Continuous Replication via DAGs: 10,000 ft View of Exchange High Availability


Redundancy and Resiliency: two words that sound better to my ears than Recovery. With a mission-critical solution like Exchange, building in redundancy and resiliency wherever possible has always been a key design focus. In this vein, legacy Exchange designs placed the database on a RAID 5 stripe with parity to ensure continued access even if a single disk failed. RAID 1 mirroring was encouraged for transaction logs as well. When affordable, a cluster with an active and a passive server sharing a storage array could be implemented to ensure server resiliency. And for quite some time there have been more costly third-party options for increasing Exchange availability.

Through advances in both hardware and software technology the Exchange Team began offering a new solution with Exchange 2007. This solution involves seeding the database into another location and then utilizing “continuous replication” of the transaction logs to keep the passive copies up-to-date while the active copy is ready to serve users.
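To make the seed-then-replay idea concrete, here is a toy Python sketch. It is not how Exchange implements replication; the `DatabaseCopy` class, the `seed` helper, and the mailbox keys are hypothetical names invented for illustration. It only shows that a passive copy that starts from a seed and replays the same logs in the same order ends up matching the active copy.

```python
from dataclasses import dataclass, field


@dataclass
class DatabaseCopy:
    """Toy stand-in for a database copy (hypothetical, not an Exchange API)."""
    name: str
    state: dict = field(default_factory=dict)
    last_replayed: int = 0  # sequence number of the last log applied

    def apply_log(self, log_seq: int, key: str, value: str) -> None:
        # Logs must be applied strictly in sequence, with no gaps.
        if log_seq != self.last_replayed + 1:
            raise ValueError("logs must be replayed in order")
        self.state[key] = value
        self.last_replayed = log_seq


def seed(active: DatabaseCopy, name: str) -> DatabaseCopy:
    """Create a passive copy as a point-in-time copy of the active one."""
    return DatabaseCopy(name=name, state=dict(active.state),
                        last_replayed=active.last_replayed)


active = DatabaseCopy("active")
active.apply_log(1, "inbox/alice", "welcome")

# Seed the passive copy once, then keep it current by shipping logs.
passive = seed(active, "passive")

shipped_logs = [(2, "inbox/bob", "hello"), (3, "inbox/alice", "meeting at 10")]
for seq, key, value in shipped_logs:
    active.apply_log(seq, key, value)   # change lands on the active copy
    passive.apply_log(seq, key, value)  # same log is shipped and replayed

copies_match = passive.state == active.state
print(copies_match)
```

Because the passive copy only ever needs the logs written since it was seeded, keeping it current is far cheaper than copying the whole database again.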

Exchange 2007 had a variety of different “continuous replication” options while Exchange 2010 offers a single solution called a Database Availability Group (DAG). A DAG offers redundancy in the form of multiple copies of your Exchange data, while offering resiliency in the form of built-in features to ensure single points of failure do not have us restoring from a backup. Configuring these DAGs begins with creating a DAG within your Exchange environment and then joining members to that DAG. You then establish passive copies and/or lagged copies (and you can have up to 16 copies). In the event of a disk failure/corruption, a server failure, or even a site failure, you have both failover and switchover options in play, depending on the severity of the failure and your design.

DAG Example

There are obviously some stipulations for the design and deployment of a DAG. For example, you must have the Mailbox server role installed on an Enterprise edition of Windows Server 2008 or 2008 R2. In addition, to ensure you have a means of providing ‘quorum’ you need, at a minimum, a third server to act as a witness between the two Mailbox servers. The concept is half+1: a majority of the voters must be able to confirm that the active has gone down and, with quorum, that majority has the authority to fail over to another system.
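The half+1 arithmetic can be sketched in a few lines of Python. This is a simplified model of majority voting only, not the actual logic of the Windows failover clustering service; the function names `votes_needed` and `has_quorum` are made up for this example.

```python
def votes_needed(total_voters: int) -> int:
    """Smallest strict majority: half of the voters (rounded down) plus one."""
    return total_voters // 2 + 1


def has_quorum(total_voters: int, reachable_voters: int) -> bool:
    """True when the voters that can still see each other form a majority."""
    return reachable_voters >= votes_needed(total_voters)


# Two Mailbox servers plus one witness gives three voters in total.
voters = 3
print(votes_needed(voters))      # 2
print(has_quorum(voters, 2))     # True: surviving server + witness
print(has_quorum(voters, 1))     # False: an isolated server cannot decide alone
```

With only two voters, a single failure would leave one vote out of two, which is not a majority. That is why the witness is needed in a two-server design.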

We keep focusing on disaster but even everyday maintenance can be handled easily when you have the ability to shift the active copy of the database to another server while you take one offline for necessary hardware and software upgrades and such.

By having redundant copies of the data spread across multiple servers, perhaps even located in multiple sites, you can keep your email services up and running and available for your users through failures. The extent of failure that your environment can withstand will depend upon the design and implementation of your DAGs, as well as the strategic redundancy of non-mailbox server roles.

Note: If your organization has a single site or a primary datacenter, you might consider utilizing a secondary datacenter for disaster recovery. In times past when we considered disaster recovery we only thought of the arduous task of recovering from a backup. However, through this process of continuous replication and replay into the seeded database it is possible to switch to a secondary datacenter relatively quickly (with a few commands and some DNS changes) so long as you have planned appropriately for these moments (which may simply be an extended blackout, or may be a disaster of either natural or manmade causes).

In addition to protecting your mailboxes through DAGs and your non-mailbox roles through redundant role deployment combined with hardware load balancers and so forth, you must also be sure to SPoF-proof your network (SPoF: Single Point of Failure). Multiple NICs, switches, routers, WAN connections… these are all essential. Performing latency tests on WAN connections to branch offices and/or secondary datacenters is also important for maintaining solid connectivity should the “lights go out” on your primary as a result of one disaster or another.

Exchange has advanced tremendously in the years since it was first released, and High Availability options like DAGs are among the finest enhancements to date. At this point we have merely scratched the surface in this 10,000-foot overview of the solution itself and some of the concerns. Before deploying this solution in a production environment it is essential that you gain a deeper knowledge of the subject. A great deal has been provided by the Exchange Team and can be found on the Exchange Team blog and in TechNet articles that explore the very depths of HA.

Here are two excellent examples:

I just finished a course on Exchange High Availability that I personally created. It takes you through the terminology and concepts of built-in HA, into simple DAG deployment, and then into a very complex multi-site double DAG design with a total of 16 systems being used to demonstrate the configuration and inner workings of Exchange high availability.
