NetApp Tech OnTap
     

How to Bulletproof Your DR Plan

According to the United States Department of Homeland Security, only one in four U.S. companies has a disaster recovery (DR) plan. Of those companies, only 30% actually test their plans. The frightening fact is that most companies are probably underprotected, and those of us who believe we have an adequate plan in place should probably think again.

At Datalink, we’ve been helping companies address complex information storage, management, and protection challenges for more than 20 years. We’ve developed extensive data protection expertise and best practices that have been encapsulated in our data continuity and enhanced data recovery practices.

About 75% of new Datalink customers come to us without a viable disaster recovery plan in hand. A DR plan is a natural extension of your backup and recovery procedures, and should be part of a comprehensive data protection plan that includes backup/recovery, business continuity, and disaster recovery. In this article I summarize some of the rules of thumb we use to help customers with this type of planning. Specifically, I describe:

  • How to tell if your DR plan is robust enough
  • Technologies that can make DR planning simpler
  • The value of integrated data protection

Is Your Plan Robust Enough?

With the rate at which business practices are changing and data storage is growing at many companies, a DR plan that was adequate when it was created may not be adequate now. At Datalink, we often see gaps between business requirements and IT infrastructure. These gaps can be technology related, process related, and, most recently, staff related. With the state of the economy, companies are stretching staff as far as possible, causing gaps that can affect DR readiness.

For example, we recently worked with a midsized manufacturing company to enhance its disaster recovery. The company did a business impact analysis (BIA) a number of years ago, and reached the conclusion that it had 40 applications that were necessary to run the business. However, our analysis showed that there were actually about 70 applications that were necessary. To be fair to the company, some of the additional applications may reflect changes that occurred since the initial assessment, while others were probably ones the original analysis missed.

Use a Process-Oriented Approach to Identify Critical Applications
We use a process-oriented approach to help us determine which applications are critical to a business. A manufacturing company, for example, must be able to do essentially four things to function:

  • Order and receive raw materials
  • Pay for raw materials
  • Produce and ship product
  • Collect payment for products shipped

For the above example, we identified the 70 applications by mapping each application from a process perspective and asking whether that application was necessary to accomplish one or more of the functions listed above. You can perform a similar analysis for your own company.
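As a rough illustration of that mapping exercise, the sketch below tags each application with the business functions it supports and keeps the ones that support at least one of them. The application names and the function labels are hypothetical placeholders, not the customer's actual inventory, and a real analysis would draw on interviews with the business owners rather than a hand-built table.

```python
# The four core functions of a manufacturing business, from the list above.
CORE_FUNCTIONS = {
    "order_and_receive_raw_materials",
    "pay_for_raw_materials",
    "produce_and_ship_product",
    "collect_payment",
}

# Hypothetical inventory: application name -> business functions it supports.
APP_FUNCTIONS = {
    "purchasing_system": {"order_and_receive_raw_materials"},
    "accounts_payable": {"pay_for_raw_materials"},
    "shop_floor_scheduler": {"produce_and_ship_product"},
    "invoicing": {"collect_payment"},
    "cafeteria_menu_site": set(),  # supports none of the core functions
}


def critical_applications(app_functions, core_functions):
    """Return the sorted names of applications that support at least one core function."""
    return sorted(
        app
        for app, functions in app_functions.items()
        if functions & core_functions
    )


def uncovered_functions(app_functions, core_functions):
    """Return core functions that no application in the inventory supports."""
    covered = set().union(*app_functions.values())
    return core_functions - covered


print("Critical applications:", critical_applications(APP_FUNCTIONS, CORE_FUNCTIONS))
print("Functions with no supporting application:",
      uncovered_functions(APP_FUNCTIONS, CORE_FUNCTIONS))
```

A core function with no supporting application usually means the inventory is incomplete, which is a useful check to run before relying on the list.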

Remember Infrastructure Applications
Don’t overlook critical infrastructure applications. There are a number of “level 0” applications that have to be running, such as Active Directory, DNS, phone systems, and networking. If DNS isn’t running, that critical business application probably isn’t running either.

Establish Recovery Objectives for Each Application
Once you’ve got your list of applications, you need to set up objectives for each one. Typically, there are four key metrics to think about:

  • Availability. This is the time an application or data set is available for use. It is frequently measured as a percentage. For example, 99% uptime corresponds to just under 88 hours of downtime per year, while 99.999% uptime corresponds to just over 5 minutes per year.
  • Recovery Time Objective (RTO). This defines how much time it should take to recover if a failure occurs. An RTO of 20 minutes means an application or data set will be back online within 20 minutes of a failure. Note that a long RTO is inconsistent with an aggressive availability goal.
  • Recovery Point Objective (RPO). In most cases, it’s prohibitively expensive to eliminate all chance of data loss when a failure occurs. RPO defines the maximum amount of data you are willing to lose. For instance, an RPO of one hour means you will be able to restore an application or data set to a point no more than an hour prior to the time the outage occurred. (Note that for most applications such an objective would imply hourly backups.) RPO does not directly affect availability or RTO, except in cases in which an aggressive (short) RPO goal lengthens the time needed to complete the recovery process.
  • Data Retention. Data retention metrics define how long a given data set must be retained to meet backup needs or compliance requirements. You may have data retention metrics for both online and offline (tape) copies of data. For instance, you may keep a particular backup online for one month and then retain it on tape for a year.
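The availability arithmetic above is easy to reproduce for any target. The following is a minimal sketch, assuming a 365-day year; the function names are illustrative only. It also includes a simple check of whether a given interval between recovery points can satisfy an RPO, on the assumption that the worst-case loss is one full interval.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years are ignored


def annual_downtime_hours(availability_percent):
    """Return the downtime per year implied by an availability percentage."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


def meets_rpo(recovery_point_interval_minutes, rpo_minutes):
    """Return True if the interval between recovery points is no longer than the RPO.

    Assumes the worst-case data loss equals one full interval.
    """
    return recovery_point_interval_minutes <= rpo_minutes


for target in (99.0, 99.9, 99.99, 99.999):
    hours = annual_downtime_hours(target)
    print(f"{target}% uptime -> {hours:.2f} hours "
          f"({hours * 60:.1f} minutes) of downtime per year")

print("Hourly backups meet a 1-hour RPO:", meets_rpo(60, 60))
print("Nightly backups meet a 1-hour RPO:", meets_rpo(24 * 60, 60))
```

The 99% and 99.999% rows work out to about 87.6 hours and about 5.3 minutes, which agrees with the figures quoted in the availability bullet.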

Rather than having different objectives for each application, you’ll probably want to establish a few different “tiers” of service and sort your applications into those tiers. You’ll also need to map the dependencies between applications and establish the order in which applications will restart. Dependencies that seem obvious under normal operations—like starting the database before the database application—are easy to overlook in a crisis, and even small mistakes can cost you a lot of valuable time.
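One way to make the restart order explicit is to record each dependency and let a topological sort produce a valid sequence. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later); the application names and their dependencies are hypothetical examples.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency map: application -> applications that must be running first.
DEPENDS_ON = {
    "dns": set(),
    "active_directory": {"dns"},
    "database": {"dns", "active_directory"},
    "erp_application": {"database"},
    "invoicing": {"erp_application"},
}


def restart_order(depends_on):
    """Return one valid startup order in which every dependency starts first.

    Raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(depends_on).static_order())


order = restart_order(DEPENDS_ON)
print(" -> ".join(order))
```

Keeping the map in a file alongside the DR runbook means the order can be regenerated whenever an application is added, and a circular dependency is reported as an error instead of being discovered during a recovery.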

Coordinate Your Activities
To really succeed in DR planning, you have to make a coordinated, interdepartmental effort that spans all your lines of business. Unfortunately, no one has yet devised a software tool that can take the place of doing careful and thoughtful legwork up front. If IT isn’t talking to the business side of the house, your results will suffer.

Leverage Technology to Optimize Your Infrastructure

A number of tools and technologies will help you as you analyze your environment and design a workable infrastructure for DR.

Use Tools for Discovery
A variety of software tools can help you understand application, storage, and network utilization and trends. Judicious use of these tools not only helps you analyze your environment up front, but also helps you fine-tune your design so that capacity and bandwidth are adequate.

At Datalink we often use NetApp SANscreen® for storage discovery. Among its many functions, SANscreen maps service paths between applications running on servers and data stored on storage systems. SANscreen doesn’t require any agents to be installed, so you can get it up and running and get results quickly. We also use Riverbed Cascade and similar tools to look at network utilization, performance, and behavior.

Understanding your operation at a quantitative level will help determine the types of DR solutions you need or which DR solutions you can support.

Deduplicate and/or Compress
The cost of providing a network pipe big enough to handle your replication traffic can be prohibitive. Using primary storage deduplication, with or without LAN compression, can significantly reduce your bandwidth requirements, potentially reducing the size of the network pipe you need. For example, applying NetApp deduplication to your virtual server environment can reduce the storage space it consumes by up to 90%. That translates to large bandwidth savings during SnapMirror® replication. In general, applying deduplication to your storage volumes can produce savings ranging from 25% to 90% depending on data type.
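To get a feel for what those savings could mean for link sizing, here is a back-of-the-envelope sketch. It assumes, purely for illustration, that the savings rate seen on the storage volumes carries over unchanged to the replication traffic and that the link is fully usable for the whole replication window. The 500 GB daily change rate and 8-hour window are made-up inputs, not measurements.

```python
def required_mbps(daily_change_gb, reduction_fraction, window_hours):
    """Return the sustained link speed (megabits per second) needed to
    replicate one day's changed data within the replication window."""
    if not 0 <= reduction_fraction < 1:
        raise ValueError("reduction_fraction must be in [0, 1)")
    if window_hours <= 0:
        raise ValueError("window_hours must be positive")
    data_to_send_gb = daily_change_gb * (1 - reduction_fraction)
    megabits = data_to_send_gb * 8 * 1000  # 1 GB = 8,000 megabits (decimal units)
    return megabits / (window_hours * 3600)


# Hypothetical inputs: 500 GB of changed data per day, 8-hour replication window.
for reduction in (0.0, 0.25, 0.90):
    mbps = required_mbps(500, reduction, 8)
    print(f"{reduction:.0%} reduction -> about {mbps:.1f} Mbit/s sustained")
```

Real replication traffic also depends on protocol overhead, link contention, and how change is distributed over the day, so treat the result as a starting point for the discovery tools described earlier.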

Consolidate and Virtualize
If you haven’t yet done so, consolidating and virtualizing your servers, networks, and storage can have a big impact on your ability to accomplish DR. Eliminating complexity not only reduces your data center operating costs, it makes it easier to know what you have and where it is—which is essential to your data protection and DR strategies.

The latest server virtualization technologies may also provide integrated capabilities that simplify restarting virtual machines and applications at a secondary site. (A recent Tech OnTap article discussed the use of VMware® Site Recovery Manager and NetApp storage.)

Another emerging trend is consolidating your networks onto a single Ethernet fabric. Again, this reduces costs, makes network management simpler, and can make it easier to meet network bandwidth needs for DR. The arrival of Fibre Channel over Ethernet (FCoE) makes it possible to accommodate existing FC storage within a unified network fabric.

Some advantages of virtualized storage are discussed in the following section.

Integrated Data Protection Can Simplify Life

From the standpoint of someone who designs DR implementations for a living, the ability to have all your data protection and disaster recovery functions integrated into your underlying storage can be a crucial design element. Instead of having different solutions for each data set, integrated data protection allows you to apply the same consistent methods across all your data.

NetApp storage integrates a full range of data protection capabilities, including space-efficient Snapshot™ technology, SnapVault® for disk-based backup, SnapMirror for DR, and MetroCluster for continuous data availability. (These technologies are described in more detail in two companion articles in this issue, one specifically focusing on integrated data protection and the other a case study on MetroCluster.)

As I mentioned above, it almost always makes sense to define several tiers of service with different data protection and disaster recovery capabilities for each tier. Because of the range of capabilities that NetApp offers, we design in NetApp solutions whenever possible to meet these requirements.

If you’ve already got a large investment in other storage systems that lack the integrated data protection features of NetApp, the NetApp V-Series allows you to continue to use that storage while providing the benefits of NetApp capabilities. Datalink has had good success using the V-Series in several recent engagements. (The V-Series is also featured in the MetroCluster case study in this issue.)

Finally, the only way you can be sure your DR plan is right is to test it periodically, and NetApp FlexClone® technology makes the testing process much easier. You don’t need nearly as much extra storage and the process is far less disruptive to production activity.

In a typical DR testing scenario, all the data for the test must be copied to another set of disks before testing begins. That means you need two times the storage space right off the bat, and you need to make time-consuming copies before you can start testing.

With FlexClone, you can make space-efficient, writable clones of any or all of your DR volumes; additional space is only consumed as you make changes to the cloned volume. These FlexClone volumes allow you to capture a static view of your DR data at a fixed point in time without disrupting ongoing updates or requiring massive amounts of additional storage.
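The difference in test storage can be sketched with simple arithmetic. The model below is deliberately simplified: it assumes a clone consumes space only for the data changed during the test and ignores any metadata overhead. The 10 TB data set and 5% change rate are hypothetical figures.

```python
def full_copy_test_space_gb(dr_data_gb):
    """Extra space needed when the test works on a full copy of the DR data."""
    return dr_data_gb


def clone_test_space_gb(dr_data_gb, changed_fraction):
    """Extra space needed when the test works on a writable clone.

    Simplified model: only data changed during the test consumes space.
    """
    if not 0 <= changed_fraction <= 1:
        raise ValueError("changed_fraction must be between 0 and 1")
    return dr_data_gb * changed_fraction


dr_data_gb = 10_000      # hypothetical: 10 TB of DR data
changed_fraction = 0.05  # hypothetical: the test modifies 5% of it

print("Full copy:", full_copy_test_space_gb(dr_data_gb), "GB of extra space")
print("Clone:", clone_test_space_gb(dr_data_gb, changed_fraction), "GB of extra space")
```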

Using FlexClone you can reduce the time it takes for DR testing from 24 hours or more down to a few hours, because the process is fast, reliable, efficient, and far less resource intensive.

A DR site also represents a substantial investment in resources. Using FlexClone, you can leverage those resources for other tasks, such as development, data mining, and QA, without negatively impacting DR readiness.

Test and Update Your Plan Regularly

As the previous section suggests, regular testing is critical to being ready should a disaster occur. I recommend that you test your DR plan at least annually—more often if your rates of business change and/or storage growth are unusually high. If possible, use automated tools to track trends in storage capacity, network bandwidth, and so on.

Finally, you should plan an annual review of your business requirements and update your DR plan and capabilities accordingly. This keeps your plan viable and helps it continue to meet your company’s overall needs in the face of growth and change.

Figure 1) Value of data vs. the level of protection needed.

How Do You Get Started?

If you need help getting started with your DR planning effort or updating your existing plan, a lot of resources are out there to help you; you don’t have to reinvent the wheel. The U.S. government provides checklists and other resources on its ready.gov Web site that can serve as a good starting point.

If you’re still intimidated, consider getting some outside help. Your server, network, storage, and application vendors may have additional resources they can provide (free and fee based), and services companies such as Datalink are also available to help. Spending a few extra dollars can pay a big dividend in protecting your business and your peace of mind.


Joshua Konkle

James Mason
National Practice Lead and Storage Architect
Data Continuity Lead
Datalink

James has spent the last 5 years of his 25-year IT career at Datalink; the last 10 years of his career have been focused almost exclusively on disaster recovery. Designing the largest SAN solution in the history of Datalink is among his recent accomplishments. Past employers include IBM and Convex.

 