Using VMware Site Recovery Manager to Simplify DR
Nothing is scarier than the prospect of having to recover an entire site after a disaster, and the addition of virtual infrastructure to your environment may further complicate the situation. VMware® Site Recovery Manager (SRM) is designed to simplify and accelerate disaster recovery for VMware infrastructures, and it also includes nondisruptive testing so you can make sure your site recovery plan will work before you need to use it.
SRM provides automated virtual infrastructure failover for virtual machines (VMs) and servers. Rather than providing its own data replication mechanism for moving VM data to the DR site, it relies on the existing replication capabilities provided by storage vendors, such as NetApp® SnapMirror® technology. NetApp has worked closely with VMware so that SRM can take full advantage of SnapMirror and other NetApp technologies.
In this article I’m going to discuss the challenges of DR planning and explain how VMware SRM—in conjunction with NetApp storage functionality—can greatly simplify DR for virtual infrastructure.
Disaster Recovery Planning
Planning and execution are the most crucial aspects of a disaster recovery scenario. There are many natural, human-caused, and computer-driven disasters that can affect data availability. Here are a few of the most common problems with DR planning today.
No plan at all. For some, the cost and complexity of DR is simply too much to address given current resource and budget constraints. There is no time available for planned downtime, and the process of making—and, more importantly, testing—a real DR plan is continuously delayed.
Inability to meet your recovery point objective (RPO) and recovery time objective (RTO). For many of you, the RPO and RTO your business operations require are not being met because your plan depends on expensive infrastructure, time-consuming restores, and/or system installations from scratch. The resources needed to implement and run a test plan are just too great, especially if a true disaster has never occurred at your company.
Administration vs. RPO/RTO. When failing over business operations to recover from a disaster, there are many steps that are manual and time consuming. Even when custom scripts are written to simplify some of these processes, it is the processes that must be followed that determine the real RTO any DR solution can deliver. Consider the flow of a typical disaster recovery process:
1. A situation occurs that requires failover to a DR site. (This could result from a power outage that is too long for the business to withstand without failing over or a disaster that causes the loss of data and/or equipment at the production site.)
2. The DR team takes the necessary steps to confirm the disaster and makes the decision to fail over.
3. Assuming that the necessary testing was done to confirm that data replication was successful and that the DR site is in a usable state:
   - Replicated storage must be presented to ESX systems at the DR site.
   - The systems must be attached to the storage.
   - The correct virtual machines must be added to the inventory of the ESX systems.
   - If the DR site is on a different network segment than the production site, each VM may need to be reconfigured for the new network.
   - IT staff must ensure the environment is brought up properly, with systems and services being made available in the proper order to prevent naming/IP conflicts.
4. Once the DR environment is ready, business is supported by the equipment at the DR site. Eventually, the production site will be available again or lost equipment will be replaced.
5. Changes applied to data while the DR site was supporting business operations will sometimes need to be replicated back to the primary site. (Replication must be reversed to accomplish this.)
6. The processes that took place in step 3 above must be performed again.
7. Once the production environment has been recovered, the original replication schedule and relationships must be reestablished (from the production site to the DR site).
In general, DR processes can be lengthy, time consuming, and prone to human error. Essentially the same processes—with the same problems—have to be repeated to return operations to the primary site. Nevertheless, a DR solution is an important insurance policy for any business. Periodic testing of the DR plan is a must if your solution is to be relied on.
Such testing typically involves repeating steps 1–7 above on a scheduled basis. This can be costly (in terms of man hours and downtime), and can be a disruption to normal IT operations. Because of physical environment limitations and the difficulty of performing DR testing, most sites are able to do so only a few times a year at most, and some can’t do testing in a realistic manner at all. SRM can automate the DR process and reduce the chance that human error will affect disaster recovery.
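To make the link between process and RTO concrete, here is a minimal Python sketch that treats the manual failover flow as a list of sequential steps and adds up their durations. The `Step` class, the step names, and every minute value are hypothetical placeholders chosen for illustration; they are not measurements from any real site.

```python
from dataclasses import dataclass


@dataclass
class Step:
    name: str
    minutes: int   # placeholder estimate, not a measurement
    manual: bool   # True when a person performs the step by hand


def total_minutes(steps):
    # Steps run one after another, so elapsed time is the sum.
    return sum(step.minutes for step in steps)


def manual_minutes(steps):
    # Time spent in steps that depend on a person getting it right.
    return sum(step.minutes for step in steps if step.manual)


# Hypothetical runbook; all durations are made-up illustration values.
runbook = [
    Step("confirm disaster and decide to fail over", 30, True),
    Step("present replicated storage to ESX hosts", 45, True),
    Step("attach hosts to storage (custom script)", 20, False),
    Step("add VMs to inventory", 60, True),
    Step("reconfigure VM networking", 40, True),
    Step("bring up systems in the proper order", 25, True),
]

print(total_minutes(runbook))   # 220
print(manual_minutes(runbook))  # 200
```

With placeholder numbers like these, scripting a single step barely moves the total, which is the point made above: the achievable RTO is set by the whole chain of steps, not by any one of them.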
SRM Basics
The most difficult and time-consuming part of DR failover in a VMware environment is the execution of the steps necessary to connect, inventory, reconfigure, and power up virtual machines at a DR site. VMware has solved these problems with the introduction of Site Recovery Manager, which simplifies the management of the entire DR process.
SRM has three main components:
- Discovery/configuration
- Failover
- Test
The most critical part of DR is planning the response you will provide when a disaster strikes. SRM allows you to preprogram your disaster response; this is one of the most powerful benefits of the solution. Powering hundreds of servers up or down can be an enormous challenge, especially in the middle of a crisis. Bringing systems online in a specific order is even more of a challenge.
The recovery plan you create during the setup phase of SRM configuration allows you to preconfigure the entire plan. SRM allows you to create the plan in a short period of time because of its built-in discovery capabilities and close integration with VirtualCenter. Once a plan exists, it can be executed flawlessly with minimal user intervention. (You simply start the recovery process if and when a disaster strikes.) Building such a plan using traditional DR approaches can take months, even years, and involves many error-prone manual steps.
The SRM automated test capability lets you test DR without disrupting ongoing replication (for DR) and without affecting SLAs, even while various virtual machines are booted and data sets are tested. Replicated data sets can also be cloned and used for development testing and so on.
Using NetApp’s cloning technology in combination with the SRM tool to execute virtual machine operations, you can bring a site online and run tests there without impacting production—all from a central location with little intervention. This is a powerful way to run a set of tests to validate a specific application or data set without impacting the production site or agreed-upon SLAs.
SRM and NetApp Storage
SRM enables two separate VMware environments, the primary site and the DR site, to communicate with each other. VMs at the primary site can be quickly and easily collected into Protection Groups that share common resources and can be recovered together. These Protection Groups are configured into Recovery Plans at the DR site.
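The relationship between VMs, Protection Groups, and Recovery Plans can be pictured with a small Python data model. This is only a sketch for illustration, not SRM's API: the class names, the VM names, and the choice of a single replicated datastore as the shared resource are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class VM:
    name: str
    datastore: str


@dataclass
class ProtectionGroup:
    name: str
    datastore: str  # assumed shared resource: one replicated datastore
    vms: list = field(default_factory=list)

    def add(self, vm):
        # Only VMs that share the group's resource can be recovered together.
        if vm.datastore != self.datastore:
            raise ValueError(f"{vm.name} is not on {self.datastore}")
        self.vms.append(vm)


@dataclass
class RecoveryPlan:
    name: str
    groups: list = field(default_factory=list)

    def vms_to_recover(self):
        # A plan covers every VM in every group assigned to it.
        return [vm.name for group in self.groups for vm in group.vms]


# Hypothetical names, for illustration only.
group = ProtectionGroup("finance", datastore="ds_finance")
group.add(VM("fin-db01", "ds_finance"))
group.add(VM("fin-app01", "ds_finance"))
plan = RecoveryPlan("recover-finance", groups=[group])

print(plan.vms_to_recover())  # ['fin-db01', 'fin-app01']
```

Grouping by shared resource means the unit of recovery lines up with the unit of replication, so a plan never has to recover only part of a mirrored volume.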
In a NetApp storage environment, SnapMirror can be configured to replicate VMs from local NetApp storage to a remote system, resulting in read-only mirrors at remote locations. An advantage of SnapMirror is that it gives you great flexibility in configuring the storage for your DR site, significantly reducing the cost of DR storage. Many replication solutions require you to have an identical storage configuration in both locations. With SnapMirror—while you are required to have NetApp storage at both sites—you can mirror high-end to low-end platforms, FC disk to SATA disk, and Fibre Channel SAN to iSCSI.
VMware SRM discovers these relationships and allows you to preprogram a response, including promoting mirrors to make them writable, mounting file systems, and booting systems.
Upon execution of a DR plan in a NetApp storage environment, SRM will:
- Quiesce and break NetApp SnapMirror relationships.
- Map the LUNs to existing igroups (igroups are tables of worldwide port names—WWPNs—of hosts that have access to LUNs).
- Trigger DR ESX hosts to rescan and detect storage.
- Reconfigure VMs as defined for the network at the DR site.
- Power on VMs in the order defined in the Recovery Plan.
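The failover sequence above can be sketched as an ordered list of steps that stops at the first failure, followed by a priority-ordered power-on. The functions `run_plan` and `power_on_order`, the VM names, and the priorities are hypothetical, and the storage steps are stubbed with lambdas rather than calling any real NetApp or VMware interface.

```python
def power_on_order(vm_priority):
    # Lower number boots first; ties are broken by name for a stable order.
    ordered = sorted(vm_priority.items(), key=lambda item: (item[1], item[0]))
    return [name for name, _ in ordered]


def run_plan(steps):
    # Run steps in order. Return (completed step names, failed step or None).
    completed = []
    for name, action in steps:
        if not action():
            return completed, name
        completed.append(name)
    return completed, None


# Hypothetical VM names and boot priorities.
priorities = {"app01": 2, "db01": 1, "web01": 3}
booted = []


def power_on_vms():
    booted.extend(power_on_order(priorities))
    return True


# Each stub stands in for work done by SRM and the storage system.
failover_steps = [
    ("quiesce and break SnapMirror relationships", lambda: True),
    ("map LUNs to existing igroups", lambda: True),
    ("rescan storage on DR ESX hosts", lambda: True),
    ("reconfigure VMs for the DR network", lambda: True),
    ("power on VMs in plan order", power_on_vms),
]

completed, failed = run_plan(failover_steps)
print(failed)  # None
print(booted)  # ['db01', 'app01', 'web01']
```

Stopping at the first failed step reflects the dependency between the steps: there is no point rescanning hosts for LUNs that were never mapped.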
Upon execution of a DR test in a NetApp environment, SRM will:
- Create FlexClone® volumes of the FlexVol® volumes on the DR storage system.
- Map the LUNs contained within these FlexVol volumes to existing igroups.
- Trigger DR ESX hosts to rescan and detect the storage.
- Connect VM network adapters to a private Test Bubble network.
- Reconfigure VMs as defined for the network at the DR site.
- Power on the VMs in the order defined in the Recovery Plan.
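A test run differs from a real failover only in how storage is made writable and in where the VMs are connected. The Python sketch below builds both step lists from the two sequences described in this section so the difference is easy to see. The function names and mode strings are assumptions made for illustration.

```python
FAILOVER = "failover"
TEST = "test"


def build_steps(mode):
    # Return the ordered step list for a failover or a test run.
    if mode == FAILOVER:
        storage = [
            "quiesce and break SnapMirror relationships",
            "map LUNs to existing igroups",
        ]
        network = []
    elif mode == TEST:
        storage = [
            "create FlexClone volumes of the DR FlexVol volumes",
            "map LUNs in the cloned volumes to existing igroups",
        ]
        network = ["connect VM network adapters to the private test network"]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return (
        storage
        + ["rescan storage on DR ESX hosts"]
        + network
        + ["reconfigure VMs for the DR network", "power on VMs in plan order"]
    )


def replication_interrupted(mode):
    # A failover breaks the mirror; a test works on clones and leaves it alone.
    return mode == FAILOVER


for mode in (FAILOVER, TEST):
    print(mode, len(build_steps(mode)), replication_interrupted(mode))
# failover 5 True
# test 6 False
```

Because the test path works on clones and an isolated network, the SnapMirror relationship keeps replicating while the test runs.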
Conclusion
In summary, by automating the resource-intensive and/or manual parts of DR planning, failover, and test, such as mapping VMs to storage, booting in the right sequence, and taking care of IP addressing and naming schemes, SRM greatly simplifies DR in virtual environments. The incremental cost of protecting a VM is almost zero from an operational perspective. The only real costs are the disk space at the destination site and bandwidth to handle the data change rate of that VM. Virtual servers and storage are joined with minimal administrative headache. Your DR plan and recovery procedures can be created in the least amount of time using the least amount of resources.
Darrin Chapman, Data Protection Subject Matter Expert and Technical Marketing Manager, NetApp
Darrin Chapman is the person you turn to for just about any question involving disaster recovery or backup and recovery at NetApp. He's been involved with almost every NetApp best practices guide on data protection since 2002 and in his spare time designs training courses for customers and NetApp technical staff. Originally schooled as an electrical engineer, Darrin spent several years in systems architecture at AT&T, Nortel, and EMC.