NetApp Tech OnTap
     

Case Study: Continuous Data Availability

Combining NetApp MetroCluster and V-Series

For some applications, availability is paramount. For energy providers in the southwestern United States, certain critical applications are mandated to achieve availability of 99.999% or better. Regulations require my company and others to operate critical energy applications from a secondary site one week per quarter to demonstrate that DR capabilities are fully functional.

Four months before we implemented the NetApp® MetroCluster solution described in this article we had an incident that caused our critical application to be down for over two hours and to lose four hours’ worth of data—a nightmare scenario. We needed to make a change, and we needed to make it fast.
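To put the 99.999% target and that outage in perspective, the arithmetic below converts an availability percentage into a yearly downtime budget. It is a minimal sketch, assuming a 365-day year; the function and variable names are illustrative and do not come from any NetApp tool.

```python
# Convert an availability target into an allowed downtime budget.
# Assumption: a 365-day year; all names here are illustrative only.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_budget_minutes(availability_percent: float) -> float:
    """Return the minutes of downtime per year permitted by the target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


budget = downtime_budget_minutes(99.999)
outage_minutes = 2 * 60  # the incident kept the application down for over two hours

print(f"Yearly budget at 99.999%: {budget:.2f} minutes")
print(f"A two-hour outage equals about {outage_minutes / budget:.0f} years of budget")
```

At five nines the budget works out to a little over five minutes a year, so a single two-hour outage consumes roughly two decades of allowed downtime.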

In this article, I’ll describe how a combination of MetroCluster and NetApp V-Series virtualization systems addressed our availability problems while allowing us to preserve our existing investment in Hitachi storage. I’ll include details of our implementation and discuss the lessons we’ve learned since implementing the NetApp solution.

In a future article, I’ll describe how we use NetApp and NFS to support our diverse server virtualization environment, which includes both VMware® and Oracle® VM. Twenty months ago, we didn’t have any NetApp storage. Today, we’ve grown to almost half a petabyte of NetApp storage with no end in sight.

Why NetApp MetroCluster and V-Series?

For five years prior to implementing MetroCluster, our critical application ran on a Windows® server sitting on a secure network. The application itself operates by doing batch processing against a database and then writing output files to local storage. These output files are then accessed by users of the application.

Double-Take software was used to periodically replicate an image of that server to a second site. Even in the best-case scenario, Double-Take failover couldn’t meet our 99.999% availability objective. When the outage that I mentioned above occurred, there was a problem with the most recent round of replication, which caused even longer downtime than expected along with significant data loss.

Our internal customer wanted a solution with application servers running on top of WebLogic in a dual configuration: one application server at site A and a second application server at site B, about 30 kilometers away. A load balancer sitting in front of them would then spread the workload across both servers under normal operation. (We have a fast WAN link between the two sites, so it would not pose a problem when requests from site A go to the site B server or vice versa.)
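The dual-server arrangement can be sketched as a round-robin rotation that skips any server marked unhealthy. This is only an illustration of the idea, not the configuration of the load balancer we deployed: the server names, the class, and the health flags are all hypothetical, and a real load balancer would probe server health itself.

```python
# Round-robin load balancing across two application servers, skipping any
# server marked unhealthy.
# Assumption: server names and health flags are hypothetical placeholders.
from itertools import cycle


class TwoSiteBalancer:
    def __init__(self, servers):
        self.healthy = {name: True for name in servers}
        self._order = cycle(servers)

    def mark(self, name, is_healthy):
        """Record the result of a (hypothetical) health check."""
        self.healthy[name] = is_healthy

    def next_server(self):
        """Return the next healthy server in rotation."""
        for _ in range(len(self.healthy)):
            candidate = next(self._order)
            if self.healthy[candidate]:
                return candidate
        raise RuntimeError("no healthy application server available")


balancer = TwoSiteBalancer(["app-site-a", "app-site-b"])
normal = [balancer.next_server() for _ in range(4)]

balancer.mark("app-site-a", False)  # simulate losing the site A server
degraded = [balancer.next_server() for _ in range(4)]

print("Normal operation:", normal)
print("Site A down:     ", degraded)
```

Under normal operation the requests alternate between the two sites; once site A is marked down, every request goes to site B. That only works if both servers see the same output files, which is why the shared volume matters.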

For this to work correctly, the output files would then need to be written to a shared volume accessible by both servers. We had about 15TB of Hitachi SAN storage available within this isolated and secure application environment, but Hitachi didn’t really have a workable solution to meet our availability needs, so we started looking at NAS gateways.

We’d recently deployed a NetApp FAS3070 with CIFS shares, and we were very happy with the way it performed, so we went to NetApp to see if there was a way to provide a high-availability CIFS share. Ultimately, NetApp helped us decide on a solution using two NetApp V-Series virtualization systems (one in each site) front-ending our existing Hitachi storage to preserve that investment in disk. NetApp MetroCluster provided synchronous mirroring between the two V-Series systems, allowing the same CIFS shares to be accessible at both sites with no delays and instantaneous failover with zero data loss should a problem occur.

We looked at a variety of technologies from other vendors, some of which had lower up-front costs, but the back-end complexity and management required by those solutions would have made our ongoing costs much higher.

Deployment Details

A high-level overview of our MetroCluster environment is shown in Figure 1.

Figure 1) Overview of MetroCluster/V-Series implementation.

As you can see, almost everything is redundant, including the Fibre Channel links between sites, to ensure the greatest possible availability. MetroCluster on the NetApp V3020 nodes creates an exact replica of all shared volumes at both sites. We actually have three separate shares: one for production, one for development, and a third for testing.

The V-Series systems allow us to utilize all NetApp capabilities—including MetroCluster—with our existing storage. Our Hitachi storage systems were already providing storage for a database component of the application. Because this environment is isolated from the rest of our operations, we were unable to leverage the unused storage on the two Hitachi systems (15TB free space on each) for other purposes, so it made sense to use it to accommodate these CIFS shares (1.5TB) rather than buy NetApp storage.

Initial Testing

To test the configuration before entering production operation, we simulated a wide variety of failures, including pulling the plug on one storage node, disconnecting networks and Fibre Channel connections, turning off one of the two Cisco directors, and so on. Failover behavior was immediate and correct in all cases. We also tested the manual failover capabilities to make site B the primary rather than site A. As noted, we have to perform this switch at least every quarter. In all our testing, everything worked as expected and we never encountered a single problem.

Results So Far

We’ve had zero issues since we installed the MetroCluster configuration. The Fibre Channel connections between sites have gone down on occasion due to scheduled maintenance, and MetroCluster always resynchronizes itself without intervention. We get an e-mail from MetroCluster saying that the connection has gone down. When it comes back up, we get a second notification that the connection is up and that MetroCluster is resynchronizing.

In addition to supporting MetroCluster, the V-Series systems allow us to take advantage of other NetApp functions such as Snapshot™ on our Hitachi disk storage. Our backup and recovery strategy across our entire IT infrastructure (not just MetroCluster) is to move away from tape toward online backups. To that end, we create regular Snapshot copies of our MetroCluster shares. Because of synchronous mirroring, that gives us Snapshot copies at both sites should recovery be necessary after any type of failure. Outside the MetroCluster configuration, we use SnapMirror® to provide similar backup and DR functionality for less critical applications where asynchronous replication is appropriate.

Lessons Learned

The only real problem we had during the installation of MetroCluster was that we didn’t fully understand our existing Fibre Channel connections between the two sites at the outset. We “inherited” these networks, and we didn’t have accurate data about the configuration. We ended up having to learn on the fly how the network was configured, which wasted valuable time.

My recommendation to anyone undertaking a MetroCluster implementation is to do a full assessment with NetApp up front to work out the specifics and validate the configuration. Some of the initial information we supplied to NetApp was not completely correct, resulting in implementation delays. In my estimation, a presales professional services engagement would have made the implementation a lot smoother and in hindsight would have been money well spent.

Conclusion

NetApp technology has allowed us to take our entire IT infrastructure in what has been a very productive direction for the company. Today, our entire 480TB of NetApp storage is managed by a single full-time person. The maintenance contracts on the Hitachi storage in the MetroCluster configuration are currently coming up for renewal, and we’re moving to replace those expensive systems with two additional NetApp clusters to provide storage for the database back end of the application. We’ll add disk shelves to our existing V-Series cluster to provide storage for the CIFS shares we mirror with MetroCluster.

We’ll also be looking at implementing deduplication on our MetroCluster shares. We use deduplication in the rest of our environment with savings up to 90% in server virtualization environments and savings of 35% on non-MetroCluster CIFS shares and 60% to 65% on NFS shares. Needless to say, any time you’re mirroring data, reducing the total amount of data you have to store can yield big storage savings, not to mention savings on the bandwidth necessary to mirror that data between locations.
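A rough calculation shows why deduplication matters more when data is mirrored. The sketch below assumes the 1.5TB of CIFS shares mentioned earlier and borrows the 35% savings rate we see on non-MetroCluster CIFS shares; the actual savings on the mirrored shares are not yet known, and the function name is illustrative.

```python
# Estimate raw capacity consumed by a synchronously mirrored data set,
# before and after deduplication.
# Assumptions: 1.5TB is the CIFS share size from this article; the 35%
# savings rate is the figure for non-MetroCluster CIFS shares and may not
# hold for the mirrored shares.


def mirrored_capacity_tb(logical_tb, savings_fraction, copies=2):
    """Capacity needed across all mirror copies after deduplication."""
    return logical_tb * (1 - savings_fraction) * copies


before = mirrored_capacity_tb(1.5, 0.0)
after = mirrored_capacity_tb(1.5, 0.35)

print(f"Without deduplication: {before:.2f} TB across both sites")
print(f"With 35% savings:      {after:.2f} TB across both sites")
print(f"Capacity saved:        {before - after:.2f} TB")
```

Because every block is stored once per site, each terabyte removed by deduplication is saved twice.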

In the longer term, we’re looking at adding a third site for DR at least 150 miles away. With this configuration, we would have MetroCluster to provide continuous data availability, plus a SnapMirror copy of the data at the third location to provide full disaster recovery should a regional disaster affect operations at both our primary and secondary sites. This is consistent with NetApp best practices for critical applications.


Joshua Konkle

Dave Larson
Supervisor of Infrastructure Architecture

Prior to joining his present employer seven years ago, Dave worked in a variety of IT positions for Digital Equipment Corporation and other companies and gained broad storage experience with products from leading vendors, including EMC, Hitachi, and HP StorageWorks. In his current role he manages his company’s SAN, UNIX®, and Oracle Database teams, giving him a unique and broad perspective on IT infrastructure challenges.

 