100 Year Archiving
Are You Ready?
Pick up a book from 100 years ago and you can probably read it without much trouble, but pick up a backup tape that’s just 10 or 20 years old and it’s likely to be much more problematic. Even if you have the right hardware to read the tape (and the tape itself is still good), you need to know what format it was written in, and you need an application that can make sense of the data. The situation can get very complicated, and it only gets worse as time passes.
At this point you may be thinking, “That’s okay, I don’t need to keep data that long.” Think again.
Figure 1) Data retention requirements.
Source: SNIA Survey
In 2007, the Storage Networking Industry Association (SNIA) completed a comprehensive survey in which we talked to hundreds of individuals in a wide variety of organizations from countries around the world. An astonishing 80% said they have information that they must keep over 50 years, and 68% said they must keep data over 100 years. A full 70% also reported that they are highly dissatisfied with their perceived ability to read retained information in 50 years. What type of data were they most concerned about? E-mail, customer records, business application data, and databases—the kinds of information that most of us deal with every day.
Have I got your attention?
In this article, I’ll explain the challenges of long-term archiving, discuss a few best practices that you can use now, and talk about efforts that are under way through the SNIA Long Term Archive and Compliance Storage Initiative (LTACSI), which I chair.
What Are the Long Term Archiving Challenges?
Figure 2 makes clear the challenge of long-term data retention.
Figure 2) Typical lifetimes of storage systems, applications, and physical media versus information retention.
Simply put, the time that we need to retain information (even using the more modest figure of 50 years) far exceeds the typical lifespan of storage systems (disk or tape) and applications. Even the physical media start to degrade and may become unreadable long before the retention period expires.
The current practice is to migrate data—both physically and logically—every 3 to 5 years. Physical migration requires moving information from one physical storage system to another or from one media format to another to maintain physical readability, accessibility, and integrity. Drivers for this type of migration include media failure, media or storage system obsolescence, system changes, and cost of operations (people, power, space).
Logical migration requires moving information from one logical format to another—such as from an old version of an application to a new version—to preserve readability and interpretability. Drivers may include changing application formats, obsolete applications, and mergers. Inhibitors to both types of migrations include cost, complexity, sheer volume of information, and lack of time and/or budget.
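To put the 3-to-5-year cycle in perspective, the short sketch below counts how many migrations a single piece of information would go through before its retention period ends. It is only an illustration: the function name `migrations_needed` is hypothetical, and it assumes one migration falls due at the end of every full cycle, with none needed once the retention period is over.

```python
def migrations_needed(retention_years: int, cycle_years: int) -> int:
    """Count migrations that fall due before the retention period ends.

    Assumption: data is written at year 0 and migrated at every multiple
    of cycle_years that is strictly less than retention_years.
    """
    return (retention_years - 1) // cycle_years


if __name__ == "__main__":
    for retention in (50, 100):
        for cycle in (3, 5):
            count = migrations_needed(retention, cycle)
            print(f"{retention}-year retention, {cycle}-year cycle: {count} migrations")
```

Under these assumptions, 50-year retention means 9 to 16 migrations and 100-year retention means 19 to 33, and that is for the physical and the logical migration cycle separately.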
The SNIA survey mentioned previously concluded that logical and physical migration simply do not scale cost-effectively to meet current and future needs. In fact, only 30% of those surveyed migrate disk-based data every 3 to 5 years. Clearly, new approaches are required to meet the legal, regulatory, business, cost, and scalability requirements for long-term digital information retention.
Interim Solutions
The limitations imposed by current storage systems and applications are not going to go away in the near term. What then should you do today to address long-term retention? The best current recommendation is to implement formal lifecycle management processes for your applications, operations, and data repositories to address the effective management of data through its useful life. Best practices should include:
- Close collaboration among all stakeholders (IT, RIM, legal, business, security) to ensure that all needs are addressed
- Clear identification of all existing assets and resources
- Classification of information so that retention needs can be determined
- Establishment of requirements for retention, protection, security, compliance, and so on
- Service implementation to meet requirements
- Measurement and improvement
Helpful practices may include:
- Classifying information into a few common buckets
- Setting retention periods and deleting expired data
- Controlling the number of copies of data that you maintain for data protection
- Setting policies for audits and performing them
- Using standards-based storage platforms
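As a rough illustration of the first two practices in the list above, the sketch below classifies records into a few buckets, attaches a retention period to each bucket, and reports which records have expired. The bucket names, the retention periods, and the `Record` type are all hypothetical; real values have to come from your legal, RIM, and business stakeholders.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical buckets and retention periods (in years), for illustration only.
RETENTION_YEARS = {
    "email": 7,
    "business_application": 15,
    "customer_record": 50,
}


@dataclass(frozen=True)
class Record:
    name: str
    bucket: str
    created: date


def expiry_date(record: Record) -> date:
    """Return the date on which the record's retention period ends."""
    target_year = record.created.year + RETENTION_YEARS[record.bucket]
    try:
        return record.created.replace(year=target_year)
    except ValueError:
        # Created on Feb 29 and the target year is not a leap year.
        return record.created.replace(year=target_year, day=28)


def expired(records: list[Record], today: date) -> list[Record]:
    """Return the records whose retention period has ended as of `today`."""
    return [r for r in records if expiry_date(r) <= today]


if __name__ == "__main__":
    inventory = [
        Record("q1-newsletter.eml", "email", date(2000, 1, 1)),
        Record("account-1001.xml", "customer_record", date(2000, 1, 1)),
    ]
    for record in expired(inventory, today=date(2008, 1, 1)):
        print(f"{record.name} expired on {expiry_date(record)}")
```

Keeping the number of buckets small, as the list suggests, is what keeps a table like `RETENTION_YEARS` reviewable by non-IT stakeholders.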
Your long-term preservation policy should identify your business, legal, and compliance goals and include a description of the best practices to which each storage repository adheres, including both physical and logical migration. The goal for physical migration should be to move from fixed-term (3 to 5 years) migration to an “as needed” strategy. Federated, standards-based, and virtualized systems (such as NetApp® storage running Data ONTAP® GX) can help minimize the disruption, complexity, and labor involved in migration.
For logical migration, you must be able to maintain authenticity—proof that the data is what it was originally. Again, you should migrate only as needed, and you may want to consider other options for retaining some data: transform it into a standard format (XML, PDF, etc.), archive a hard copy if appropriate, or use microfilm.
If this seems complicated, it is. You still have to execute both physical and logical migration on an as-needed basis, and the two events may not coincide. But there are currently few options to ensure that data retained over the long term will remain readable. Fortunately, some significant efforts are underway to help address this situation.
Standards Efforts
It’s probably clear that the storage industry hasn’t done a lot to address the problem of long-term data retention thus far. Today’s archiving applications use proprietary data formats that effectively lock you into a solution and that could further complicate migration efforts in the future. That’s all about to change.
For long-term archiving (15+ years), the biggest challenge is logical migration. The physical migration situation can be adequately addressed with effective lifecycle management processes and current standards-based storage technology, as opposed to proprietary storage formats. The situation should improve further as vendors begin to focus more attention on hardware that can meet long-term storage needs. (For an example, see the sidebar, Collaborative Research into Long Term Archive.)
Logical migration, on the other hand, remains application-specific, making automation of key processes more difficult. Full “preservation” requires more than just keeping the data readable and interpretable; it requires long-term retention of the data with metadata that includes its provenance, its reference information (context), and mechanisms to ensure its integrity and authenticity.
To this end, the SNIA LTACSI proposed that SNIA form a Long-Term Digital Information Retention and Preservation Technical Working Group to look into encapsulation (see sidebar). Encapsulation would define a “preservation-oriented” logical container consisting of the content (the data) and associated preservation metadata.
Encapsulation could be modeled on the Archival Information Package (AIP) defined by the Open Archival Information System (OAIS) reference model. Figure 3 shows the content of an OAIS AIP container.
Figure 3) OAIS AIP includes both the content of the information to be stored and metadata describing that content. (Source: SNIA)
Encapsulation implies “self-contained” because a container holds the information’s data, its metadata, reference information, integrity and authenticity checks, access controls, and logs. This content makes the container portable and independent of storage. It allows the container to be managed independently of the application, in accordance with the requirements you’ve established for the information.
Encapsulation is “self-describing” because the container could be interpreted by different types of systems and because it can include readers, allowing the contents to be interpreted independently of the application. This capability is important for long-term preservation. Encapsulation provides a standard format that any application can understand and that in theory allows many types of applications (ECM, legal, migration, preservation, and so on) to access archived content.
Figure 4) Logically, encapsulation creates a standard data layer that fits between the bit layer (physical media) and the application. (Source: SNIA)
The goal is to eliminate the need for frequent logical migrations so that organizations can continue to access and use archived data as necessary over long periods without the overhead and complexity of regularly updating data to accommodate application changes.
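To make the container idea more concrete, here is a minimal sketch of a self-contained package that carries content together with preservation metadata (provenance, reference information, and an integrity check) and can be verified without the originating application. This is not the SNIA encapsulation format or the OAIS AIP schema; the JSON layout, the field names, and the functions `build_container` and `verify_container` are assumptions made purely for illustration.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_container(content: bytes, provenance: str, context: str) -> str:
    """Package content and preservation metadata as one JSON document."""
    container = {
        # Content is hex-encoded so the whole container is plain text.
        "content_hex": content.hex(),
        "metadata": {
            "provenance": provenance,
            "reference_information": context,
            "fixity": {
                "algorithm": "sha256",
                "digest": hashlib.sha256(content).hexdigest(),
            },
            "packaged_at": datetime.now(timezone.utc).isoformat(),
        },
    }
    return json.dumps(container, indent=2)


def verify_container(serialized: str) -> bool:
    """Recompute the digest named in the container and compare it."""
    container = json.loads(serialized)
    content = bytes.fromhex(container["content_hex"])
    fixity = container["metadata"]["fixity"]
    actual = hashlib.new(fixity["algorithm"], content).hexdigest()
    return actual == fixity["digest"]


if __name__ == "__main__":
    package = build_container(
        b"Invoice 1001: 3 units shipped",
        provenance="Exported from a hypothetical order-entry system",
        context="Plain-text invoice summary, UTF-8",
    )
    print("intact:", verify_container(package))
```

Because the container names its own digest algorithm and describes its own content, a future system needs only a JSON parser and a hash library to check integrity. That is the self-contained, self-describing property described above, in miniature. A real preservation container would also need access controls, logs, and a way to establish authenticity (for example, digital signatures), which this sketch omits.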
NetApp and Long-Term Archiving
From a hardware perspective, NetApp has long recognized that physical data migrations—whether they are for archival or other purposes—are complicated and disruptive. For this reason, NetApp is moving toward a scale-out hardware architecture that supports tiered storage—including write once, read many (WORM) volumes for compliance needs—to ease migration of data from one tier to another in a nondisruptive manner. This architecture allows the transparent incorporation of new storage building blocks (physical media, storage controllers) alongside existing storage, greatly simplifying the physical migration process.
To meet archive and compliance requirements, NetApp open SnapLock® technology enables the creation of WORM volumes on NetApp storage to meet corporate governance and regulatory requirements without requiring physically separate storage systems. NetApp works with industry-leading archive partners, such as Symantec, Zantaz, and CommVault, to deliver solutions that leverage the unique features of NetApp hardware and software technology and is also collaborating with these partners on long-term solutions.
From the standpoint of logical migration, NetApp knows that solving near-term archiving problems is only part of the solution. We recognized the need for industry standards early and have been a key contributor to standardization efforts. In my role at NetApp, one of my chief responsibilities is to chair LTACSI, a cooperative effort of end users, IT professionals, vendors, integrators, and service providers with interest in addressing the challenges of long-term digital information retention, archiving, and compliance-related storage practices.
What Should You Be Doing Now?
The most important thing is to begin taking steps now to avoid ending up in a crisis situation with terabytes of data requiring physical and/or logical migration. The best way to do this is to follow the guidelines described in “Interim Solutions,” including the use of open standards wherever possible. Open standards give you many more options when it comes to migration and help prevent lock-in.
If your organization has not done so already, consider implementing data classification to better understand your data and to support lifecycle management. Then look for solutions, both hardware and software, that can enforce policies and simplify the physical migration process. By taking these steps now, you’ll be well positioned to take advantage of new long-term archival standards as they take shape in coming years.
Gary Zasman
Worldwide Practice Director
NetApp
Gary serves as the chair of the SNIA Long Term Archive and Compliance Storage Initiative (LTACSI). He also spearheads the development of the NetApp worldwide practice for business applications and database integration. Before joining NetApp in 2006, Gary held a variety of positions with leading storage vendors focused on the development of ILM solutions and consulting practices. In 2001, a team that Gary worked with was selected as a finalist for the prestigious Computerworld Smithsonian Award for developing a visual history digital archive.