| June 2008 (PDF) |
How Safe Is Deduplication?

Unless you’ve had your head in the sand recently, you’re probably well aware that deduplication is hot. It seems like every storage vendor you’ve ever heard of (and many more you haven’t) is touting deduplication technology as a way to reduce the cost of disk-to-disk backup.
In this article, I’m going to take a look at deduplication technology in light of two criteria: how accurately duplicate data is identified, and how reliable the underlying hardware and software are. I’ll also explain the choices that NetApp has made to enhance the reliability of our own deduplication technology. Because we support deduplication for both primary and secondary storage, whereas most other vendors provide deduplication only for backups, data safety is of the utmost importance for us.

Identifying Duplicate Data

Most existing deduplication products operate at the block level: each new block is compared against previously stored blocks to determine whether an identical block already exists. If it does, the “new” block is discarded in favor of a pointer to the stored block.

Reliability of Underlying Hardware and Software

Deduplication is only as reliable as the underlying hardware and software. In fact, although it may not be immediately obvious, reliability becomes even more crucial with deduplication: a single stored block may be referenced many times, so losing that block damages everything that points to it.
There is a wide variety of deduplication products out there. Some are software only and may use a variety of underlying hardware; some include both hardware and software (possibly obtained from a variety of sources through licensing or OEM arrangements). Before making a decision, you should assess how mature the software is, how robust the underlying hardware is, and how well the two integrate.

NetApp Reliability

With NetApp® storage, deduplication is an integral part of the Data ONTAP® operating environment that runs across our entire product line. Data ONTAP has been under continuous co-development with NetApp hardware platforms for more than 15 years. Unique features of the NetApp WAFL® technology actually simplify the implementation of deduplication and make it possible to deduplicate any stored data, not just backup data.

Proven reliability features in NetApp hardware and software result in data availability of more than 99.999% as measured across the NetApp installed base. A recent analyst report describes the NetApp methodology and many of the features that contribute to NetApp’s reliability.

One example of our attention to detail involves the well-known fact that disk drive bit errors can develop over time, or even during the manufacture of disk drives. Every drive has built-in error correction (ECC) that detects and usually corrects such bit errors. If a string of errors is too great to be handled by ECC, the drive reports that the sector is unreadable, at which point RAID algorithms reconstruct the data from the information stored on other sectors. NetApp, however, also uses a checksum scheme for further protection: we use an additional portion of the drive as overhead to store checksums that move with the data through the system, so that we can verify that what is read back matches exactly what was written. In essence, this provides a third level of protection beyond drive ECC and RAID.
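To make the idea of checksums that travel with the data concrete, here is a minimal sketch in Python. It is not NetApp's implementation: the `ChecksummedStore` class, its method names, and the choice of CRC32 are all assumptions made purely for illustration. It only shows the general pattern of storing a checksum alongside each block on write and verifying it on read.

```python
import zlib


class ChecksummedStore:
    """Toy in-memory block store (hypothetical, for illustration only)."""

    def __init__(self):
        # block number -> (data, checksum computed at write time)
        self._blocks = {}

    def write(self, block_no: int, data: bytes) -> None:
        # Keep a CRC32 next to the data; CRC32 is an assumed stand-in here.
        self._blocks[block_no] = (data, zlib.crc32(data))

    def read(self, block_no: int) -> bytes:
        data, stored = self._blocks[block_no]
        # Recompute the checksum and refuse to return data that changed.
        if zlib.crc32(data) != stored:
            raise IOError(f"checksum mismatch on block {block_no}")
        return data

    def corrupt(self, block_no: int, offset: int) -> None:
        """Flip one bit of the stored data to simulate silent corruption."""
        data, stored = self._blocks[block_no]
        damaged = bytearray(data)
        damaged[offset] ^= 0x01
        self._blocks[block_no] = (bytes(damaged), stored)
```

In this sketch a read of a corrupted block raises an error instead of silently returning bad data. In a real system that error would be the signal to rebuild the block from redundant information, such as RAID parity.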
To protect the reliability of data committed to disk, NetApp also developed RAID-DP™, a high-performance, dual-parity RAID 6 implementation that protects against double disk failures without sacrificing write performance. You can read more about RAID-DP and other NetApp enhancements to protect against misbehaving disk drives in a previous Tech OnTap article.

Conclusion

To protect your backup data, deduplication technology must use appropriate algorithms to avoid discarding unique data blocks and also provide the fundamental hardware and software reliability necessary to safely store deduplicated data for later recovery. Because NetApp deduplication technology is used for primary data stores as well as for backup data, we take extra care to protect data reliability. NetApp deduplication uses a combination of fingerprints plus byte-by-byte block comparisons so that unique data blocks are never erroneously deleted due to hash collisions. Deduplicated data is stored on NetApp storage systems using hardware and operating software that have been proven reliable and resilient through years of field deployment, so you can be confident that when it comes time to recover data, you’ll get back the data you backed up.
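The combination of fingerprints and byte-by-byte comparison mentioned above can be sketched in a few lines of Python. This is an illustrative toy, not the Data ONTAP implementation: the `DedupStore` class, its methods, and the use of SHA-256 as the default fingerprint are assumptions. The fingerprint only narrows the search to candidate blocks; a block is treated as a duplicate only if its bytes match a stored block exactly.

```python
import hashlib
from typing import Callable, Dict, List


def sha256_fingerprint(block: bytes) -> bytes:
    # Assumed default fingerprint; any hash function could be plugged in.
    return hashlib.sha256(block).digest()


class DedupStore:
    """Toy block-level deduplicating store (hypothetical, illustration only)."""

    def __init__(self, fingerprint: Callable[[bytes], bytes] = sha256_fingerprint):
        self._fingerprint = fingerprint
        self._blocks: List[bytes] = []            # physical blocks actually kept
        self._index: Dict[bytes, List[int]] = {}  # fingerprint -> candidate block ids

    def write(self, block: bytes) -> int:
        """Store a block and return the id (pointer) of its physical copy."""
        candidates = self._index.setdefault(self._fingerprint(block), [])
        for block_id in candidates:
            # Byte-by-byte check: a matching fingerprint alone is not trusted.
            if self._blocks[block_id] == block:
                return block_id
        # No exact match: the block is unique, so keep it.
        self._blocks.append(block)
        block_id = len(self._blocks) - 1
        candidates.append(block_id)
        return block_id

    def read(self, block_id: int) -> bytes:
        return self._blocks[block_id]

    def physical_blocks(self) -> int:
        return len(self._blocks)
```

To see why the byte comparison matters, construct the store with a deliberately weak fingerprint such as `DedupStore(fingerprint=lambda b: b[:1])`. The blocks `b"aaaa"` and `b"abcd"` then collide on their fingerprint, yet both are kept as separate physical blocks, while a second write of `b"aaaa"` returns the pointer to the existing copy.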