NetApp Tech OnTap
     

Roundtable: DR for Microsoft Applications with VMware SRM and NetApp

The February 2010 issue of Tech OnTap featured an article on virtualizing Microsoft applications using VMware®, NetApp®, and Cisco technologies. As a follow-on to that article, Tech OnTap sat down with Wen Yu of VMware and Larry Touchette from NetApp to dig into the details of disaster recovery for Microsoft® applications and learn why the adoption rate for DR is higher in VMware/NetApp environments.

Tech OnTap: People seem hesitant to use online disaster recovery widely in Microsoft application environments. What are the factors that you find contribute to this?


Wen Yu (VMware): At VMware, we typically find that there are three key reasons. The first, and perhaps the biggest, is the cost of doing DR. You don’t just need a second facility; you also need additional servers, network gear, and twice the storage. These costs can be prohibitive regardless of whether you’re working with physical or virtual servers.

Second, there has traditionally been a high degree of complexity associated with performing DR, especially in physical server environments and even more so when you try to implement DR across multiple applications. You can end up with a confusing combination of products and technologies to get the job done. A lot of products out there also require that you have an almost identical configuration on both sides, adding to the cost.

Finally, the network bandwidth required to achieve an adequate RPO can be a limitation for many. A lot of Windows® shops may not have the necessary bandwidth in place to do replication and may hesitate to invest in the bandwidth needed to make it feasible.

The joint solution that NetApp, Cisco, and VMware created addresses a lot of these issues.

Larry Touchette (NetApp): To elaborate on Wen’s last thought, NetApp and VMware take a lot of the cost and complexity out of DR, so you can deploy a solution that covers a much greater number of applications—your entire VMware environment if you want. Some joint customers have been able to offset, or even completely finance, storage for a DR environment with the money saved by using NetApp deduplication on primary VMware storage. Anyone who’s been reading Tech OnTap fairly regularly knows about the benefits of NetApp deduplication in combination with VMware. [This article is a good starting point to learn more about VMware DR and dedupe.—TOT Editor]

When it comes to bandwidth savings, NetApp SnapMirror® technology, when used in combination with NetApp deduplication, replicates only unique blocks, so it’s very bandwidth efficient. NetApp recently added SnapMirror compression to Data ONTAP® for WAN acceleration, which makes it even better suited to environments where bandwidth is limited. It can lower bandwidth utilization by up to 70%, depending on the compressibility of your data. [SnapMirror compression was discussed in detail in a recent Tech OnTap article. —TOT Editor]
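To make the bandwidth point concrete, here is a minimal back-of-the-envelope sketch in Python. It is not a NetApp sizing tool: the function name, the 50 GB change rate, and the one-hour replication window are hypothetical, and the 70% figure is the best case quoted above, not a guaranteed result.

```python
def required_mbps(changed_gb, window_hours, compression_savings=0.0):
    """Average link rate (megabits/s) needed to ship changed_gb within window_hours.

    changed_gb          -- unique changed data to replicate, in decimal GB (assumed input)
    window_hours        -- time allowed for the transfer, e.g. the replication interval
    compression_savings -- fraction of bytes removed by compression (0.7 = 70%)
    """
    if not 0.0 <= compression_savings < 1.0:
        raise ValueError("compression_savings must be in [0, 1)")
    if window_hours <= 0:
        raise ValueError("window_hours must be positive")
    # 1 decimal GB = 8,000 megabits
    megabits = changed_gb * (1.0 - compression_savings) * 8 * 1000
    return megabits / (window_hours * 3600)


if __name__ == "__main__":
    # Hypothetical workload: 50 GB of unique changed blocks per hourly update.
    print(f"uncompressed: {required_mbps(50, 1):.1f} Mbit/s")       # 111.1
    print(f"70% savings:  {required_mbps(50, 1, 0.7):.1f} Mbit/s")  # 33.3
```

The model covers only the average rate; bursts, protocol overhead, and other WAN traffic would all raise the real requirement.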

The adoption of DR in joint VMware/NetApp environments is quite high. I think these factors are driving that higher adoption.

TOT: Why would someone choose to use VMware Site Recovery Manager (SRM) in a VMware/NetApp DR configuration?


Wen: The most critical part of DR for virtualized application servers is the execution of the steps necessary to connect, inventory, reconfigure, and power up virtual machines at your DR site. Manual execution of these tasks can be complicated and error prone, especially when you’ve got dependencies that require one VM to start before another. Scripts can be written to try to automate DR processes and address these problems, but they are often costly to implement and difficult to maintain.

Site Recovery Manager simplifies the management of the entire DR process, including discovery and configuration, failover, and DR testing. The recovery plan you create during SRM setup lets you define the entire recovery sequence ahead of time. Built-in discovery capabilities and close integration with vCenter accelerate the process.

Once a plan exists, it can be executed automatically with minimal user intervention. SRM enables all necessary steps to be performed and virtual machines to be started in the correct order. For example, virtual machines supporting the infrastructure such as Active Directory® (AD) and DNS servers can be started first, followed by database servers, application servers, and then Web servers.
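The ordering problem Wen describes is a dependency sort. The Python sketch below illustrates the idea only; it does not use or represent SRM's own interfaces, and the VM names and dependency map are hypothetical. It groups VMs into "waves" so that each VM starts only after everything it depends on is already up.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical VM names; each VM maps to the set of VMs that must start first.
DEPENDS_ON = {
    "ad-dns-01": set(),
    "sql-01": {"ad-dns-01"},
    "exchange-01": {"ad-dns-01"},
    "app-01": {"sql-01"},
    "web-01": {"app-01"},
}


def boot_waves(depends_on):
    """Group VMs into waves; every VM in a wave depends only on earlier waves."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(boot_waves(DEPENDS_ON), start=1):
        print(f"wave {number}: {', '.join(wave)}")
    # wave 1: ad-dns-01
    # wave 2: exchange-01, sql-01
    # wave 3: app-01
    # wave 4: web-01
```

A hand-maintained script has to keep a map like this current as VMs are added and removed, which is part of the maintenance burden mentioned earlier.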

The ability to perform testing is another big advantage. With most DR solutions it is nearly impossible to test without disrupting normal production operations and ongoing replication. SRM and NetApp make it easy and efficient to perform DR testing. For example, one thing you have to do is create an isolated testing network (so that you don’t inadvertently have two active instances of each VM on your corporate network). SRM automates the process so your tests stay isolated.

Larry: Using NetApp FlexClone® technology in combination with SRM DR testing, you can bring up your DR site and run tests there without using a huge amount of additional storage and without disrupting ongoing replication between sites or operations at the primary site. This gives you an easy way to run tests to validate DR without impacting the production site or agreed-upon SLAs.

Some replication solutions require twice the capacity at the DR site, because they create full replicas of the storage so that replication can continue while you’re performing the test. Making those full copies takes a lot of time and capacity, which limits how long you can keep the test environment around and how often you can perform tests. Using FlexClone significantly reduces the amount of storage needed and accelerates the process.

 

Figure 1) Incremental storage requirement for DR testing with FlexClone.
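The comparison in Figure 1 can be approximated with simple arithmetic. The Python sketch below is a deliberately simplified model with hypothetical numbers: it assumes a clone consumes space only for blocks written during the test and ignores metadata and Snapshot overhead.

```python
def dr_test_storage_gb(replica_gb, change_fraction_during_test):
    """Return (full_copy_gb, clone_gb): extra DR-site capacity needed for a test.

    replica_gb                  -- size of the replicated data at the DR site
    change_fraction_during_test -- share of blocks the test VMs overwrite (assumed)
    """
    if replica_gb < 0:
        raise ValueError("replica_gb must be non-negative")
    if not 0.0 <= change_fraction_during_test <= 1.0:
        raise ValueError("change_fraction_during_test must be in [0, 1]")
    full_copy_gb = replica_gb  # a second, complete copy of the replica
    clone_gb = replica_gb * change_fraction_during_test  # only the changed blocks
    return full_copy_gb, clone_gb


if __name__ == "__main__":
    # Hypothetical site: 10 TB replicated, 2% of blocks modified during the test.
    full_copy, clone = dr_test_storage_gb(10_000, 0.02)
    print(f"full copy: {full_copy:,.0f} GB extra")
    print(f"clone:     {clone:,.0f} GB extra")
```

Under these assumptions the clone needs a small fraction of the capacity of a full copy, and that gap grows with the size of the replica.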

TOT: So what are the major considerations for someone who wants to deploy a DR solution using VMware SRM and NetApp storage?


Wen: From the standpoint of SRM, there are a number of considerations. First of all, you’ve got to have a VMware vCenter server at each site, along with a Microsoft SQL Server to store the SRM database and servers running supported versions of ESX.

The primary and recovery sites must be connected by a reliable IP network, and the recovery site should also have access to the same public and private networks as the primary site. Last but not least, the recovery site should have up-to-date Active Directory and DNS servers.

When it comes to the actual replication between sites, SRM relies on storage—in this case, NetApp—to do that. Customers running tier-1 applications can achieve a zero RPO by configuring SnapMirror to replicate synchronously. In addition to replication, maintaining consistency at both the OS and application level is key.

Larry: NetApp uses a number of components to provide consistent replication for both the VMs themselves and for Microsoft applications (Exchange, SQL Server, and SharePoint Server). The key consideration for both VMs and applications is that it’s not enough to simply replicate the data periodically; it has to be in an application-consistent state from which each component can be restarted. We’ve described the whole approach in some detail in a recent tech report. Wen reviewed this TR to make sure we had the VMware information correct.

The VMs reside in shared datastores, either VMFS (FC or iSCSI LUNs) or NFS. NetApp SnapManager for Virtual Infrastructure provides consistent Snapshot™ copies and replication for VM data.

A key design element is that we keep application data separate from VM datastores by storing it in physical-mode RDM LUNs (either iSCSI or FC). This allows us to use the NetApp SnapManager suite of products to create consistent recovery points for each application, and we can also have different replication schedules for each application to accommodate different RPOs by creating different numbers of recovery points.

 

Figure 2) Replication architecture.

TOT: We did a lot of work to make it possible to have multiple recovery points from which to restart applications. Can you tell our readers a little more about that?


Larry: NetApp SnapManager products for SQL Server, Exchange, and SharePoint increase flexibility by allowing the creation and verification of multiple recovery points replicated to the recovery site. The SnapManager applications create full backups, which are verified to be application consistent, plus more frequent backups that include only the incremental logs of changes that have occurred between full backups. These incremental backups are referred to as frequent recovery point, or FRP, backups. Adjusting the time between FRP backups provides the flexibility to set the desired RPO for each application separately.
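The relationship between the FRP interval and the achievable RPO can be sketched in a few lines of Python. This is an illustration only, not SnapManager's scheduler: the function names, dates, and intervals are hypothetical, the full-backup interval is assumed to be a multiple of the FRP interval, and replication lag to the recovery site is ignored.

```python
from datetime import datetime, timedelta


def recovery_points(start, end, full_every, frp_every):
    """Return (timestamp, kind) pairs, where kind is 'full' or 'frp'."""
    points = []
    t = start
    while t <= end:
        # A point that falls on a full-backup boundary is a full backup.
        kind = "full" if (t - start) % full_every == timedelta(0) else "frp"
        points.append((t, kind))
        t += frp_every
    return points


def exposure(points, failure_time):
    """Time between the last recovery point at or before the failure and the failure."""
    earlier = [t for t, _ in points if t <= failure_time]
    if not earlier:
        raise ValueError("no recovery point precedes the failure")
    return failure_time - max(earlier)


if __name__ == "__main__":
    start = datetime(2010, 3, 1, 0, 0)
    points = recovery_points(
        start,
        start + timedelta(hours=12),
        full_every=timedelta(hours=6),
        frp_every=timedelta(minutes=15),
    )
    failure = datetime(2010, 3, 1, 10, 40)
    print(len(points), "recovery points")               # 49 recovery points
    print("data at risk:", exposure(points, failure))   # data at risk: 0:10:00
```

In this model the worst-case exposure approaches one FRP interval, so shortening that interval for a given application tightens its RPO without changing the schedule of the others.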

If any issues are detected with the recovered application data at the recovery site, individual applications may be reverted to any previous recovery point. When an application is reverted in this way, SnapManager can roll forward any uncommitted database logs, which prevents the loss of new data that was written at the recovery site after failover.

SRM allows you to insert custom commands into your recovery plan. We use this capability to execute a command in the recovery plan that configures SnapDrive® to enable VMs running at the DR site to see the full history of backups that were replicated from the production site. For those with access to the NetApp NOW™ (NetApp on the Web) site, this process is described more fully in KB56952.

TOT: Can one of you explain the importance of Active Directory in an SRM environment?


Wen: Microsoft applications are highly dependent on Active Directory and DNS for correct operation, so it’s really critical to have this configured correctly at your recovery site. When you perform DR testing, you also have to be certain to provide a correctly configured and up-to-date Active Directory server on the isolated test network. When you fail back to the primary site from the recovery site, you again have to be certain to deal correctly with Active Directory/DNS servers. If you fail to do so, you may experience update sequence number (USN) rollback problems and Active Directory database corruption. These problems are described more fully in Microsoft knowledge base article 875495.

The easiest way to make sure that Active Directory is correct at the recovery site is to maintain at least one Active Directory server at the recovery site that is synchronized with the primary site.

For DR testing, you have to clone this AD server just prior to running the DR test. Once the cloning is done, but before powering on the VM, make sure the cloned AD server is connected only to the DR test network. After the AD VM is powered on in the test network, the five FSMO (Flexible Single Master Operations) roles in the Active Directory forest must be seized according to the procedure described in Microsoft knowledge base article 255504.

This cloning process is not necessary when a real failover occurs, but seizing the FSMO roles is still required and must be done manually. Once you’ve recovered from your disaster—whatever it is—and prior to failback, you must reestablish Active Directory services at the original site. This can be done by recovering the AD servers at that site and forcing them to resynchronize with the newer AD servers at the recovery site or by establishing new AD servers.

All of these actions are covered in a fair amount of detail in NetApp TR-3822, which Larry mentioned previously.

TOT: To wrap up, can you both talk a little about the methods available for failing back to the original site?


Wen: As I just suggested, the first step once your original site is up and running is to get Active Directory up. SRM doesn’t provide fully automated reversal and failback yet, but we still recommend that you use SRM to do the failback by reconfiguring the software to fail in the opposite direction.

Larry: In order to fail back you’ve got to synchronize the data between the recovery site and the original site. SnapMirror relationships are easy to reverse and resynchronize. The resynchronization process will depend somewhat on the failure that occurred. If the original storage wasn’t destroyed in the disaster, SnapMirror will only have to replicate the delta—the changes that occurred while the original site was offline. Otherwise, a full resync will be required. Of course, NetApp deduplication and SnapMirror compression can reduce the WAN impact in either case. Dedupe reduces the total amount of data in your VMware environment by eliminating the duplication that results from having many, many copies of the same guest operating systems, and compression makes sure that any data that is transmitted over the WAN uses the least bandwidth possible.

We hope this roundtable summary has been helpful, and we would love to hear what you think about this article. For complete details on the topics discussed, see TR-3822.



Wen Yu
Senior Technical Alliance Manager
VMware

Wen has been with VMware for over five years, supporting and evangelizing virtualization products for continuous availability, disaster recovery, and desktop. He is currently a member of the Infrastructure Alliance Technology Team.

Larry Touchette
Technical Marketing Engineer
NetApp

Larry has been with NetApp for nine years, supporting, implementing, and designing NetApp storage and disaster recovery solutions. He is currently a part of the NetApp Server Virtualization Technical Marketing Team.

 