NetApp Tech OnTap
     

Dedupe Unstructured Data for Up to 70% Space Savings

Over 50% of the data stored at most companies is unstructured, consisting of a wide variety of file types stored in home directories, shared directories, document management systems, and other locations. If your company has knowledge workers, whether for business tasks such as finance, accounting, and marketing or for engineering tasks such as software development or electronic design, your unstructured data includes a huge quantity of files, which might include files from Microsoft® Office, software configuration management (SCM), product lifecycle management (PLM), or electronic design automation (EDA). Many business and engineering applications produce large amounts of unstructured data.

Your company might have developed good processes for managing and planning for the structured data stored in databases; however, provisioning storage for an explosion of unstructured data in the face of tight IT budgets might be another matter. Your IT organization is constantly on the lookout for ways to reduce the cost of this storage, even more so given the economic changes of 2008.

This is where NetApp® Data ONTAP® deduplication (“dedupe”) comes in. By consolidating your storage and using deduplication to identify and eliminate duplicate blocks from your primary storage pool, you can recover 30% to 50% or more of your storage space. That translates directly into deferred spending for additional storage as well as a decreased rate of storage purchases in the future.

This article describes the challenges associated with unstructured data, explains what you can do to overcome these challenges, and highlights the space savings you can expect when applying deduplication to a number of business and engineering applications.

Unstructured Data Challenges

The greatest challenge presented by unstructured data is that it grows faster than many IT organizations can cope with. The challenge is further complicated by a number of factors:

  • Unstructured data might be widely dispersed. Some data is on direct-attached storage (DAS) on servers, while other data resides on storage area networks (SANs). Other data resides in NFS data stores and on Windows® file servers. Yet other unstructured data of value to your organization might be stored on disk storage in workstations, desktops, and laptops.
  • Much of this storage is underutilized. With different types of storage and perhaps multiple storage systems, some systems are inevitably underutilized in capacity, performance, or both. At the same time, a small number of storage systems might be oversubscribed in capacity or performance, with no easy way to balance needs between storage systems.
  • Data protection and security are problematic. In these environments, it’s sometimes hard to ascertain what is stored where, let alone provide adequate security and backup for it. While ascertaining unstructured data types is not a focus of this article, following the guidelines under “Consolidate Your Unstructured Data Storage Systems” is a great start.
  • A large percentage of the data is redundant. You might be consuming a large amount of valuable storage for copies of files that are similar, if not identical. However, this redundancy also represents a significant opportunity if you can find a way to exploit it.

The solutions to these unstructured data problems are to (1) consolidate your unstructured data as much as possible and (2) eliminate the redundancy. The question is how best to accomplish these objectives.

Consolidate Your Unstructured Data Storage Systems

To improve management of your unstructured data, you must first consolidate existing storage systems into one or a few storage systems. You can start by analyzing each storage system in your environment to ascertain the data you have on each storage system, by platform, data type, how it is accessed (storage protocol or protocols), who uses it (applications and end users), and what level of performance is required.

You’re creating an unstructured data map, which can be used to improve storage capacity utilization. You’ll use that data map to consolidate your unstructured data onto as few storage systems as possible. The NetApp unified storage architecture supports all popular NAS and SAN storage protocols (for example, NFSv4 and FCoE), so when you choose NetApp for your unstructured data management, you can consolidate your unstructured data onto a single storage system. Now you can support simultaneous access by all your Windows, Linux®, and UNIX® operating systems or any other platforms you might have.
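The unstructured data map described above could be captured in a simple record per data set. The following is a minimal sketch, not a NetApp tool: the `DataSetRecord` fields, the example systems, and the capacities are all hypothetical and only mirror the attributes listed in the text (platform, data type, protocols, users, and required performance).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class DataSetRecord:
    """One row of a hypothetical unstructured data map."""
    storage_system: str   # where the data lives today
    platform: str         # for example "Windows", "Linux", "UNIX"
    data_type: str        # for example "home directories", "EDA"
    protocols: list[str]  # how the data is accessed, for example ["CIFS"]
    users: list[str]      # applications and end users
    performance: str      # required performance level, for example "high"
    used_gb: float        # capacity currently consumed


def used_gb_by_data_type(records: list[DataSetRecord]) -> dict[str, float]:
    """Total the consumed capacity for each data type across all systems."""
    totals: dict[str, float] = defaultdict(float)
    for record in records:
        totals[record.data_type] += record.used_gb
    return dict(totals)


def systems_by_data_type(records: list[DataSetRecord]) -> dict[str, set[str]]:
    """List which storage systems hold each data type today."""
    systems: dict[str, set[str]] = defaultdict(set)
    for record in records:
        systems[record.data_type].add(record.storage_system)
    return dict(systems)


# Hypothetical inventory used only to illustrate the idea.
example_map = [
    DataSetRecord("filer-a", "Windows", "home directories", ["CIFS"],
                  ["finance", "marketing"], "medium", 400.0),
    DataSetRecord("das-eng-01", "Linux", "home directories", ["NFS"],
                  ["engineering"], "medium", 300.0),
    DataSetRecord("san-eda", "Linux", "EDA", ["NFS"],
                  ["design tools"], "high", 1200.0),
]

if __name__ == "__main__":
    print(used_gb_by_data_type(example_map))
    print(systems_by_data_type(example_map))
```

A data type that appears on several systems, such as the home directories in this example, is a natural first candidate for consolidation.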

There are key advantages to consolidating your data in this way:

  • It delivers great flexibility. You can easily allocate storage where it’s needed without the headaches created by having separate “silos” of storage.
  • Capacity planning is improved. You only need to monitor and provision a single storage system rather than a collection of DAS, NAS, and SAN.
  • Consolidation enables maximum elimination of redundancy. To achieve the highest operational efficiency and make it possible to recover the most storage space with deduplication, all your unstructured data has to be in the same place.

Deduplication Yields Significant Savings

You might want to take steps to begin eliminating redundancy as you consolidate your unstructured data onto a single storage system. Eliminating redundancy can reduce the amount of storage you need at the outset to establish your consolidated storage pool. For instance, if you know that your engineering home directories have many copies of the same files (which they likely do because engineers have a “low-risk profile” that encourages them to keep originals and copies), you would move that data to your consolidated storage and deduplicate it to recover space before moving the next unstructured data set and repeating the process.
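The effect of deduplicating each data set before moving the next one can be illustrated with simple arithmetic. This is a rough sketch under stated assumptions: the data set names, sizes, and savings fractions are hypothetical, each data set is assumed to land at full size and shrink only after deduplication runs, and any duplication between data sets is ignored.

```python
from dataclasses import dataclass


@dataclass
class DataSet:
    name: str
    size_gb: float           # size before deduplication
    expected_savings: float  # assumed fraction saved, between 0 and 1


def peak_capacity_staged(data_sets: list[DataSet]) -> float:
    """Peak space needed when each set is deduplicated before the next arrives."""
    stored = 0.0
    peak = 0.0
    for ds in data_sets:
        # The incoming set first occupies its full size next to the stored data.
        peak = max(peak, stored + ds.size_gb)
        # After deduplication only the unique portion remains.
        stored += ds.size_gb * (1.0 - ds.expected_savings)
    return peak


def peak_capacity_all_at_once(data_sets: list[DataSet]) -> float:
    """Peak space needed when everything is copied before any deduplication."""
    return sum(ds.size_gb for ds in data_sets)


# Hypothetical migration plan used only for illustration.
plan = [
    DataSet("engineering home directories", 1000.0, 0.35),
    DataSet("shared directories", 800.0, 0.35),
    DataSet("document archives", 600.0, 0.25),
]

if __name__ == "__main__":
    print(f"staged:      {peak_capacity_staged(plan):.0f} GB")
    print(f"all at once: {peak_capacity_all_at_once(plan):.0f} GB")
```

In this sketch the staged approach never needs more than the already deduplicated data plus one incoming data set, which is why it lowers the capacity required at the outset.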

The results you achieve will depend on the method of deduplication. For deduplication of primary storage, there are few products currently available other than NetApp deduplication. Alex McDonald of NetApp provided a comparison of various approaches to storage space savings in a recent blog post.

Of the available products, some only eliminate identical copies of files. NetApp deduplication works at the block level, so it can achieve a significant level of deduplication when multiple versions of a file exist. For example, imagine two copies of a 10MB file that differ by a single block. File-level deduplication would have no effect because the files are different, so you would still need 20MB of storage. Block-level deduplication would deduplicate all but the changed block, so you could store both files in 10MB plus one block.
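The 10MB example above can be reproduced with a small sketch. This is not how Data ONTAP implements deduplication; it only compares the two approaches by hashing whole files versus fixed-size blocks. The 4KB block size, the SHA-256 fingerprints, and the helper names are assumptions made for illustration.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size in bytes, for illustration only


def split_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Split data into fixed-size blocks (the last block may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def file_level_bytes(files: list[bytes]) -> int:
    """Bytes stored when only byte-identical files are collapsed."""
    unique = {}
    for data in files:
        unique[hashlib.sha256(data).digest()] = len(data)
    return sum(unique.values())


def block_level_bytes(files: list[bytes], block_size: int = BLOCK_SIZE) -> int:
    """Bytes stored when each distinct block is kept only once."""
    unique = {}
    for data in files:
        for block in split_blocks(data, block_size):
            unique[hashlib.sha256(block).digest()] = len(block)
    return sum(unique.values())


def make_example_files() -> list[bytes]:
    """Build a 10MB file and a copy that differs in exactly one block."""
    block_count = (10 * 1024 * 1024) // BLOCK_SIZE
    # Every block gets distinct content so blocks inside one file never match.
    blocks = [i.to_bytes(4, "big") * (BLOCK_SIZE // 4) for i in range(block_count)]
    original = b"".join(blocks)
    blocks[100] = b"\xff" * BLOCK_SIZE  # change a single block in the copy
    modified = b"".join(blocks)
    return [original, modified]


if __name__ == "__main__":
    files = make_example_files()
    print("file-level: ", file_level_bytes(files), "bytes")
    print("block-level:", block_level_bytes(files), "bytes")
```

With these inputs the file-level count is the full 20MB, while the block-level count is 10MB plus one 4KB block, matching the reasoning in the text.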



Figure 1) Virtual server farm and network configuration.

NetApp deduplication is built into the NetApp Data ONTAP operating environment and is completely independent of the storage protocol you use. Deduplication works on all NetApp volumes whether they are accessed with SAN or NAS protocols; it can be applied to both production and archival data, and it is completely transparent to end users and applications. Moreover, as long as the NetApp storage system is online, data is rehydrated and available to client systems. By contrast, off-primary-storage deduplication products can fail, leaving storage inaccessible until the off-primary-storage appliance can be recovered. The technical details of NetApp deduplication have been described in a previous Tech OnTap article.

By consolidating and regularly deduplicating your data, you can reduce the storage needed for your unstructured data and defer additional storage purchases. If you’re wondering whether your unstructured data can benefit from deduplication, NetApp offers a Space Savings Estimation Tool (SSET) that will crawl through an NFS or a CIFS volume and estimate approximately how much space you can save. The following section explores how much space you will typically save by using deduplication in a variety of environments.

Potential Savings with NetApp Deduplication

NetApp has been measuring the benefits of deduplication in real-world environments since deduplication was introduced. Many Tech OnTap articles have focused particularly on the benefits of deduplication in VMware® environments, which have an inherently high level of file duplication owing to the nearly identical operating system environments used by each virtual machine. The following table summarizes results to date.

Table 1) Typical results for deduplicating various types of unstructured data.

Data Type               Typical Space Savings   Range
Backup Data             90%                     85–95%
VMware VMs              70%                     50–90%
Database Backups        55%                     40–70%
Home Directories        35%                     20–50%
CIFS Shares             35%                     20–50%
E-mail Archives         30%                     20–60%
Mixed Enterprise Data   30%                     20–40%
Document Archives       25%                     20–30%
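
The typical values in Table 1 can be combined into a rough blended estimate for a mixed environment. The following sketch is only arithmetic over the table; the capacity mix in `example_mix` is hypothetical, and actual results depend on the data.

```python
# Typical space savings copied from Table 1, expressed as fractions.
TYPICAL_SAVINGS = {
    "Backup Data": 0.90,
    "VMware VMs": 0.70,
    "Database Backups": 0.55,
    "Home Directories": 0.35,
    "CIFS Shares": 0.35,
    "E-mail Archives": 0.30,
    "Mixed Enterprise Data": 0.30,
    "Document Archives": 0.25,
}


def estimate_savings(capacity_tb: dict[str, float]) -> tuple[float, float]:
    """Return (TB saved, blended savings fraction) for a capacity mix."""
    total = sum(capacity_tb.values())
    saved = sum(tb * TYPICAL_SAVINGS[name] for name, tb in capacity_tb.items())
    return saved, (saved / total if total else 0.0)


# Hypothetical environment: capacity in TB per data type.
example_mix = {
    "Home Directories": 10.0,
    "VMware VMs": 4.0,
    "E-mail Archives": 6.0,
}

if __name__ == "__main__":
    saved_tb, blended = estimate_savings(example_mix)
    print(f"Estimated savings: {saved_tb:.1f} TB ({blended:.1%} of the total)")
```

Because the blended figure is weighted by capacity, a large volume with modest savings can contribute more recovered space than a small volume with a high savings percentage.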

Recently, NetApp has been investigating the benefits of deduplication on the repositories of unstructured file data created by some of the most popular engineering and scientific applications, including Siemens Teamcenter PLM software, IBM Rational ClearCase SCM software, and Schlumberger Petrel software for seismic data analysis.

Teamcenter from Siemens PLM Software is one of the leading product lifecycle management solutions on the market. Teamcenter utilizes a relatively small metadata database combined with a large “vault,” where engineering design files are stored. Every time an engineer saves a design within Teamcenter, a complete copy of that design file is saved in the vault, even if the change made to the design is minor. As a result, Teamcenter is a great candidate for deduplication.

NetApp worked closely with Siemens PLM to assess the value of deduplication in a Teamcenter environment using Siemens’s performance and scalability benchmarking tool, which simulates the creation of multiple revisions of many design files, as would occur during normal use. Deduplication of the resulting vault yielded a 57% space savings. Results in the real world might be even higher because in many cases the number of file revisions is likely to be greater than the number simulated. You can learn more about this in a recent technical report.

IBM Rational ClearCase is a leading software configuration management solution. Similar to Teamcenter, ClearCase consists of a metadata database in combination with a large “versioned object base,” or VOB, where files are stored. Unless you are using ClearCase to store binary files as well as source files, it is normally fairly efficient in terms of the way it uses storage.

Deduplication might be useful with ClearCase in situations where a copy of a VOB needs to be made. In addition, preliminary results in a laboratory environment suggest space savings of 40% or more using deduplication in a ClearCase environment when whole files are stored. (See the sidebar for details on the potential advantages of this approach.)

Schlumberger Petrel. This application is used for seismic data interpretation, reservoir visualization, and simulation workflows in the field of oil and gas exploration and production. Like the previous two applications, it creates project directories that contain huge numbers of files. As users create, distribute, and archive data, duplicate data objects are stored across multiple storage devices. NetApp observed space savings of approximately 48% by applying deduplication to such project directories. This is described in more detail in a recent white paper.

Table 2) Deduplication results with several engineering applications.

Application              Typical Space Savings
Siemens Teamcenter       57%+
IBM Rational ClearCase   40% (for whole file storage)
Schlumberger Petrel      48%

Other applications. A wide variety of scientific and engineering applications create large repositories of unstructured data, exhibiting behavior similar to that in the three examples cited here. Electronic design automation (EDA) is one example. Applying deduplication to any application that shows similar behavior would be expected to yield similar results.

A Real-World Example

The value of NetApp deduplication in a mixed environment with both business and engineering data was recently illustrated at Atlanta-based Polysius Corporation, which designs and enhances new and existing cement plants. Polysius was experiencing up to 30% annual increases in its storage requirements. By applying deduplication to its mix of AutoCAD files, Microsoft Office documents, and other unstructured data, Polysius was able to recover 47% of its storage space. Some volumes showed reductions as high as 70%. As a result, the company expects to defer new storage purchases for at least six to eight months and has been able to double the time period it retains backup data on disk. Read the Polysius success story for more details.

Conclusion

If you’re struggling to cope with large amounts of unstructured data, using the consolidate, deduplicate, and repeat approach might help. Begin by consolidating your unstructured data as much as possible—ideally onto just one or a few storage systems—and apply some form of deduplication to eliminate redundancy. Do this in steps or small projects to improve data efficiency of NetApp Data ONTAP fabric-attached storage.

Using primary storage system deduplication to reduce your total storage requirements enables you to reduce capital expenses related to storage acquisitions. Moreover, you can improve data management and reduce operational expenses by managing fewer storage systems. For instance, if you cut the average storage per volume by a third, you’ll significantly reduce the time and media needed for full backups and recovery.

The NetApp unified storage architecture makes it possible to accommodate all your unstructured data (iSCSI or Fibre Channel SAN, CIFS, or NFS) on a single storage system and deduplicate data at the block level for maximum space savings. Applying NetApp deduplication to typical home directory shares and volumes might result in savings ranging up to 50%. Deduplication of engineering application data stores for popular applications such as Siemens Teamcenter and IBM Rational ClearCase can yield space savings from 40% to 60% or higher.

Got opinions about Deduplication of Unstructured Data?

Ask questions, exchange ideas, and share your thoughts online in NetApp communities.

Joshua Konkle
Technical Evangelist for NAS and Engineering Applications
NetApp

Joshua champions technologies and solutions that help customers be more productive. His background includes both UNIX and Windows experience, with emphasis on security. He has spoken on numerous storage-related and security-related topics at various industry and technical venues.

Erik Mulder
Solution Marketing Manager
NetApp

Erik has over 20 years of experience in high-tech marketing with expertise in CAD/CAM, CAE, and related areas. Erik works closely with product management, services, and partners to create and promote NetApp solutions for engineering application environments.

 