NetApp Tech OnTap
     

Deduplicating Backup Data Streams with the NetApp VTL

If you’ve been a regular Tech OnTap reader over the past year or so, you’ve probably read many articles about NetApp’s deduplication technology for our FAS storage systems. This article describes a new and different deduplication technology—deduplication on our VTL systems. The results are the same—saving space by removing redundant data—but how we accomplish this on the NetApp® VTL is quite different.

The deduplication technology used in the NetApp VTL is a completely new technology designed to meet the unique deduplication requirements of backup data formats such as those generated by common backup applications like Symantec™ NetBackup™, Tivoli Storage Manager, and so on. Deduplication on the NetApp VTL couples new and unique deduplication algorithms with proven NetApp VTL features such as NetApp’s high-performance hardware compression to achieve outstanding storage efficiency with space savings of 20:1 or higher for your backup data sets.




Figure 1) The NetApp VTL deduplication and compression pipeline.

Important Considerations for VTL Deduplication

Virtual tape libraries take the place of traditional tape libraries in your data center, allowing multiple backup data streams to be written directly to disk storage at extremely high speed. Providing deduplication in such an environment results in some unique considerations.

Alignment Independence. The biggest requirement for VTL deduplication is that it has to be independent of data alignment. Duplicate data in the formats generated by most backup applications can appear at any offset in a backup data stream, so the deduplication analysis must be performed in such a way that duplicate data can be located wherever it occurs and for however long it occurs. It’s not merely a question of identifying duplicate, fixed-size blocks.
I’ll describe the details of how the NetApp VTL detects duplication in a later section. For now, suffice it to say that the NetApp VTL is able to detect duplication without regard for the offset at which it starts or ends.
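To make the alignment problem concrete, here is a minimal sketch (not NetApp code) of what happens when duplicate detection relies only on fixed-size blocks. The block size, the `block_hashes` helper, and the 3-byte header are all assumptions chosen for illustration.

```python
import hashlib
import random

BLOCK = 4096  # hypothetical fixed block size, chosen only for illustration

def block_hashes(data: bytes, block: int = BLOCK) -> set:
    """Hash each fixed-size block that starts on a block boundary."""
    return {
        hashlib.sha256(data[i:i + block]).digest()
        for i in range(0, len(data) - block + 1, block)
    }

rng = random.Random(1)
n = 16 * BLOCK
payload = rng.getrandbits(8 * n).to_bytes(n, "little")  # stand-in for backed-up data

first_backup = payload
second_backup = b"HDR" + payload  # identical data, shifted by a 3-byte header

# Same data at the same offsets: every block is recognized as a duplicate.
aligned_hits = len(block_hashes(first_backup) & block_hashes(payload))
# Same data shifted by three bytes: block boundaries no longer line up.
shifted_hits = len(block_hashes(first_backup) & block_hashes(second_backup))
print(aligned_hits, shifted_hits)
```

In this sketch the unshifted comparison recognizes all 16 blocks, while the shifted stream shares no block hashes with the first backup even though it contains exactly the same payload. That is the failure an alignment-independent approach has to avoid.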

Format Independence. Some deduplication algorithms take the approach of reverse engineering backup data formats and deduplicating data based on prior knowledge of its format. The potential problem with such an approach is that there are a lot of possible formats out there.
There are at least half a dozen popular backup applications right now, all with distinct data formats, and individual applications may also have separate data formats for different data types (database versus file data and so on). These formats are often proprietary and subject to change without notice.
Recognizing a format-dependent approach as potentially problematic, the NetApp VTL processes a backup (or any) data stream as opaque data and performs deduplication without depending on an ability to decode the data stream.

Inline or Postprocessing? Another important consideration is whether to perform deduplication inline as each data stream is received, or to write those data streams to disk and do deduplication in a postprocessing mode. The inline approach is more efficient from a storage perspective (duplicate data never reaches disk). On the other hand, performing deduplication processing inline runs the risk of slowing down the applications writing the data streams and therefore increasing the time needed for backups to complete.

For the NetApp VTL, we chose a “best of both worlds” approach. The NetApp VTL deduplication algorithm is designed to be “rate adaptive.” It either runs inline or performs postprocessing, depending on the throughput requirements of the backup application. It is designed to automatically switch modes of operation as the backup workload changes. This rate-adaptive capability will be rolled out in stages. The initial implementation is confined to postprocessing to maintain high data ingest rates. Even though deduplication is performed only on a postprocessed basis in the initial release, some aspects of deduplication processing are performed inline and are rate-adaptively switched between inline and postprocessing modes (see the description of anchor generation later in this article).
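The idea of rate-adaptive switching can be sketched as a simple policy decision. This is only an illustration under stated assumptions: the function name `choose_mode`, the throughput figures, and the single-threshold rule are hypothetical and do not describe the actual NetApp policy engine.

```python
def choose_mode(ingest_mb_per_s: float, inline_budget_mb_per_s: float) -> str:
    """Pick a processing mode for the next interval (hypothetical policy).

    Work is done inline only while the observed ingest rate stays within
    the rate the system could sustain with inline processing enabled;
    otherwise the work is deferred to postprocessing.
    """
    if ingest_mb_per_s <= inline_budget_mb_per_s:
        return "inline"
    return "postprocess"

# Made-up ingest rates sampled over four intervals of a backup window.
observed_rates = [120.0, 480.0, 950.0, 300.0]
modes = [choose_mode(rate, inline_budget_mb_per_s=500.0) for rate in observed_rates]
print(modes)
```

The point of the sketch is only that the mode is re-evaluated as the workload changes, so a burst of incoming data pushes work to postprocessing and a quieter period allows it to move back inline.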

Hashes or Byte-Level Comparison? A long-standing debate in the deduplication world is whether a “hash collision” (two different pieces of data that produce an identical hash value) will occur and cause unique data to be discarded. The NetApp VTL is not susceptible to this potential problem. Like most algorithms, it uses hashes to identify potential duplication, but then it does a complete byte-by-byte comparison so that unique data is never lost or corrupted.

Compression? Another important question is whether you can combine the benefits of deduplication and data compression. The NetApp VTL gives you the full benefit of both deduplication and hardware-based compression. (The hardware compression implementation in the NetApp VTL was described in a previous Tech OnTap article.) Other vendors provide deduplication plus software-based compression, but only NetApp combines the benefits of deduplication with hardware-accelerated compression (Figure 1).

To Deduplicate or Not to Deduplicate? A final consideration is whether to deduplicate incoming data at all. Deduplication necessarily incurs a significant processing cost (regardless of the implementation), and it may not make sense to deduplicate data sets where the benefit is likely to be minimal. For example, incremental backups with short retention periods may not result in enough space savings to justify the overhead of deduplicating them.

NetApp lets you turn deduplication on or off on a per virtual library basis (a single NetApp VTL can be configured with multiple virtual libraries). With some other VTL vendors, deduplication is always mandatory.

The NetApp VTL Deduplication Algorithm

The NetApp VTL uses both variable block sizes and variable byte offsets, along with advanced techniques such as skip filters, to identify duplicate data at any offset and to maximize effectiveness when deduplicating backup data streams. The design lets you take advantage of our unique policy-driven, rate-adaptive methodology, which attempts to make sure that backup windows are met by automatically switching between partial inline deduplication processing (anchor generation) and full postprocessing, depending on the rate at which data needs to be ingested.

NetApp has developed this technology in house specifically for our VTL. Unlike other deduplication technologies that have been tacked on to an existing VTL code base, it is an intrinsic part of the core VTL software. Several patents are pending on the algorithms we have created.

The NetApp VTL deduplication algorithm relies on four key technologies:

  • Anchor generation
  • Grow by compare
  • Skips
  • Hardware compression

Anchor Generation. This technology is used to perform initial markup of “interesting” data points in a data stream. Some of these “interesting” data points become “anchor points,” which are used as a starting point to identify identical segments of data. Anchor generation uses a fast and efficient rolling hash function, and it is typically performed inline (while data is being ingested). If the location of duplicate data changes from one backup to the next, any unique anchors that the data contains move with it, making rapid identification possible.

Anchors are stored on disk and are aggressively cached for high-speed lookup. Our algorithm typically generates about three anchors for every 64K of data. Anchors with identical hash-IDs indicate a high likelihood of duplicate data in the data streams in which those anchors reside, but they are not considered matches until a byte compare is performed.
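The following sketch shows one common way to pick content-defined anchors with a rolling hash. It is not the NetApp algorithm, whose details are not given here: the polynomial hash, the 48-byte window, and the selection rule are assumptions. Only the target density (about three anchors per 64K) comes from the text above.

```python
import random

WINDOW = 48             # bytes covered by the rolling hash (assumed)
BASE = 257              # polynomial base (assumed)
MOD = (1 << 61) - 1     # large prime modulus (assumed)
DIVISOR = 65536 // 3    # aims for roughly three anchors per 64K of data

def find_anchors(data: bytes) -> dict:
    """Return {window_start: hash} for windows whose hash is 'interesting'."""
    anchors = {}
    if len(data) < WINDOW:
        return anchors
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    h = 0
    for b in data[:WINDOW]:
        h = (h * BASE + b) % MOD
    start = 0
    while True:
        if h % DIVISOR == 0:          # selection rule: keep a sparse subset
            anchors[start] = h
        if start + WINDOW >= len(data):
            break
        # Slide one byte: drop data[start], append data[start + WINDOW].
        h = ((h - data[start] * top) * BASE + data[start + WINDOW]) % MOD
        start += 1
    return anchors

rng = random.Random(7)
n = 512 * 1024
payload = rng.getrandbits(8 * n).to_bytes(n, "little")

shift = 13
anchors_a = find_anchors(payload)
anchors_b = find_anchors(b"x" * shift + payload)  # same data at a new offset

# Every anchor found in the original reappears, moved by exactly `shift` bytes.
moved = {pos + shift: h for pos, h in anchors_a.items()}
print(len(anchors_a), all(anchors_b.get(p) == h for p, h in moved.items()))
```

Because each hash depends only on the bytes inside its window, the anchors travel with the data when it shifts. Matching hash values are still only candidates; a byte compare decides whether the data is really a duplicate.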

Grow by Compare (GBC). When an anchor with a duplicate value is found, a sequential byte compare is performed to determine that the data matches with 100% certainty. The comparison is performed both forward and backward from the anchor point to determine the full length of the matching data segment.

GBC is unique in that it allows duplicate sequences of arbitrary length to be eliminated—not just fixed-length blocks, but true byte-level granularity with infinitely variable length. When a data segment is found that matches data already stored, that data is replaced with a reference to the already stored segment.
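The grow-by-compare step described above can be sketched as follows. This is an illustration under assumptions, not the product's code: the function name, its signature, and the sample byte strings are hypothetical, and the anchor is simply a known position in each stream.

```python
def grow_by_compare(new: bytes, new_pos: int, old: bytes, old_pos: int, length: int):
    """Verify a candidate anchor match byte for byte, then grow it both ways.

    `new_pos` and `old_pos` are the offsets of anchors with equal hashes,
    each covering `length` bytes. Returns (new_start, old_start, match_len),
    or None when the bytes differ, which is how a hash collision is rejected.
    """
    if new[new_pos:new_pos + length] != old[old_pos:old_pos + length]:
        return None

    # Grow backward while the preceding bytes are identical.
    back = 0
    while (new_pos - back > 0 and old_pos - back > 0
           and new[new_pos - back - 1] == old[old_pos - back - 1]):
        back += 1

    # Grow forward while the following bytes are identical.
    fwd = length
    while (new_pos + fwd < len(new) and old_pos + fwd < len(old)
           and new[new_pos + fwd] == old[old_pos + fwd]):
        fwd += 1

    return new_pos - back, old_pos - back, back + fwd

old = b"HEADER-A|the quick brown fox jumps over the lazy dog|TRAILER-A"
new = b"HDR-B|the quick brown fox jumps over the lazy dog|END"

match = grow_by_compare(new, new.index(b"brown"), old, old.index(b"brown"), len(b"brown"))
new_start, old_start, match_len = match
print(new[new_start:new_start + match_len])
```

Starting from a five-byte anchor, the match grows to cover the whole shared segment between the differing headers and trailers, at different offsets in the two streams. In a real system the matched extent in the new stream would then be replaced by a reference to the stored copy.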


Figure 2) Video illustrating the concept of anchor generation and grow by compare.


Because GBC compares every byte of a candidate match before treating it as a duplicate, it is safer than the purely hash-based algorithms used by other deduplication vendors, which can lose data if a hash collision occurs.

Skips. This technique efficiently accommodates the small variations that can occur between otherwise duplicate data segments as the result of file headers and backup application metadata. Skips make GBC extremely efficient by filtering out the changing metadata embedded in the data stream and leaving behind long extents of raw backup data that can be efficiently read from disks. Skips virtually eliminate the performance problems suffered by competing solutions, caused by the data-fragmenting effect of their deduplication algorithms.
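One way to picture skips is a comparison that tolerates a short run of differing bytes and then resynchronizes. The sketch below is a simplified assumption, not the NetApp skip filter: it treats the changed metadata as the same length in both streams, and the `MAX_SKIP` and `RESYNC` limits are made-up values.

```python
MAX_SKIP = 16  # longest run of differing bytes tolerated (assumed)
RESYNC = 8     # bytes that must match again after a skip (assumed)

def compare_with_skips(new: bytes, old: bytes) -> list:
    """Return the (start, end) extents that match, skipping short differences."""
    limit = min(len(new), len(old))
    extents = []
    i = start = 0
    while i < limit:
        if new[i] == old[i]:
            i += 1
            continue
        # Mismatch: close the current extent, then try to skip past it.
        if i > start:
            extents.append((start, i))
        for gap in range(1, MAX_SKIP + 1):
            j = i + gap
            if j + RESYNC <= limit and new[j:j + RESYNC] == old[j:j + RESYNC]:
                break
        else:
            return extents  # no resynchronization point: stop growing here
        i = start = j
    if i > start:
        extents.append((start, i))
    return extents

old = b"file data AAAA [ts=1001] more file data BBBB"
new = b"file data AAAA [ts=2002] more file data BBBB"
extents = compare_with_skips(new, old)
print([new[s:e] for s, e in extents])
```

Here a changed timestamp (a stand-in for backup metadata) is stepped over, so the data on either side is still recognized as two long matching extents rather than ending the match at the first differing byte.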


Figure 3) Video illustrating the concept of skips.


Hardware-Based Compression. All unique data is written to disk through the hardware-based compression cards of the NearStore VTL. These cards use an efficient conventional compression algorithm that typically doubles the amount of data that can be stored on disk and that can be used either with or without deduplication. Hardware compression works hand in hand with postprocessing deduplication. Because all data is compressed at ingest, only about half as much needs to be stored on disk before deduplication takes place.

Conclusion

NetApp has taken great care to create an efficient deduplication methodology that is uniquely suited to the requirements of a VTL. This technology:

  • Is independent of backup format, so it works with any data stream
  • Provides both inline and postprocessing modes of operation
  • Avoids the possibility of data loss resulting from hash collisions
  • Works with hardware compression for maximum performance
  • Can be turned off to avoid processing overhead on data sets where it isn’t necessary

NetApp has taken a unique approach to achieve alignment independence that combines unique anchors created by a rolling hash comparison, identification of duplicate data segments by comparing in both directions from identical anchors, and the ability to skip embedded file headers and backup metadata in otherwise duplicate data for maximum efficiency.

Keith Brown
Director of Technology
NetApp

Keith has been with NetApp for over 11 years and currently works in the Data Protection and Retention Group. He has had technical and marketing responsibilities for numerous NetApp products and technologies. Currently, Keith focuses mainly on the Data ONTAP™ Snapshot™ based backup and replication products, NetApp’s data deduplication technologies, and the NetApp VTL.

 