What Is Data Deduplication – Deduplication Meaning

Data deduplication is a process that eliminates redundant copies of data and can significantly reduce storage capacity requirements.

Deduplication can be run as an inline process as the data is being written into the storage system and/or as a background process to eliminate duplicates after the data is written to disk.

At NetApp, deduplication is a zero-data-loss technology that runs both as an inline process and as a background process. The inline process runs opportunistically so that it doesn’t interfere with client operations, and the background process runs comprehensively to maximize savings. Deduplication is turned on by default, and the system automatically runs it on all volumes and aggregates without any manual intervention.

Deduplication adds minimal performance overhead, because it runs in a dedicated efficiency domain that is separate from the client read/write domain. It runs behind the scenes, regardless of which application is running or how the data is accessed (NAS or SAN).

Deduplication savings are maintained as data moves around: when the data is replicated to a DR site, when it’s backed up to a vault, or when it moves between on-premises, hybrid cloud, and public cloud environments.

Deduplication reduces the amount of physical storage required for a volume by discarding duplicate data blocks.

How does deduplication work?

Deduplication operates at the 4KB block level within an entire FlexVol® volume and among all the volumes in the aggregate, eliminating duplicate data blocks and storing only unique data blocks.

The core enabling technology of deduplication is fingerprints: digital signatures computed for every 4KB data block.
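
To make block-level fingerprinting concrete, here is a minimal Python sketch. It is an illustration only: the use of SHA-256 and the helper names `split_blocks`, `fingerprint`, and `count_blocks` are assumptions, since the hash function and internal interfaces that NetApp uses are not described here.

```python
import hashlib

BLOCK_SIZE = 4096  # 4KB, the block size named in the text


def split_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Split data into fixed-size blocks, zero-padding the final block."""
    blocks = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        blocks.append(block.ljust(block_size, b"\0"))
    return blocks


def fingerprint(block: bytes) -> str:
    """Compute a digital signature for one block.

    SHA-256 is an assumption chosen for illustration, not the
    fingerprint function of any particular storage system.
    """
    return hashlib.sha256(block).hexdigest()


def count_blocks(data: bytes) -> tuple[int, int]:
    """Return (total blocks, blocks with distinct fingerprints)."""
    blocks = split_blocks(data)
    distinct = {fingerprint(block) for block in blocks}
    return len(blocks), len(distinct)


# Three identical 4KB blocks plus one different block.
sample = b"A" * BLOCK_SIZE * 3 + b"B" * BLOCK_SIZE
total, distinct = count_blocks(sample)
```

In this sample, four logical blocks reduce to two distinct fingerprints, which is the kind of redundancy that deduplication exploits.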

When data is written to the system, the inline deduplication engine scans the incoming blocks, creates a fingerprint for each block, and stores the fingerprint in a hash store (an in-memory data structure).

After the fingerprint is computed, a lookup is performed in the hash store. Upon a fingerprint match in the hash store, the data block corresponding to the duplicate fingerprint (the donor block) is searched for in cache memory:

If the donor block is found, a byte-by-byte comparison is done between the current data block (the recipient block) and the donor block to verify an exact match. On verification, the recipient block is shared with the matching donor block without an actual write of the recipient block to disk. Only metadata is updated to track the sharing details.
If the donor block is not found in cache memory, it is prefetched from disk into the cache, and the same byte-by-byte comparison is done to verify an exact match. On verification, the recipient block is marked as a duplicate without an actual write to disk. Metadata is updated to track the sharing details.
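
The inline write path described above can be sketched in Python as follows. This is a simplified model under stated assumptions, not NetApp's implementation: the class `InlineDedupStore`, its dictionaries, the reference counts, and the choice of SHA-256 are all hypothetical and stand in for the hash store, cache memory, disk, and sharing metadata.

```python
import hashlib


class InlineDedupStore:
    """Toy model of inline deduplication (all names are hypothetical)."""

    def __init__(self):
        self.hash_store = {}  # fingerprint -> physical block id (donor)
        self.disk = {}        # physical block id -> block bytes
        self.cache = {}       # physical block id -> block bytes (may be a subset of disk)
        self.refcount = {}    # physical block id -> number of logical references
        self.logical = []     # logical block index -> physical block id
        self.next_id = 0

    def _read_donor(self, block_id: int) -> bytes:
        # If the donor is not in cache, prefetch it from disk first.
        if block_id not in self.cache:
            self.cache[block_id] = self.disk[block_id]
        return self.cache[block_id]

    def write(self, block: bytes) -> bool:
        """Write one block; return True if it was shared with a donor."""
        fp = hashlib.sha256(block).digest()  # assumed fingerprint function
        donor_id = self.hash_store.get(fp)
        # A fingerprint match is only a candidate: verify byte by byte.
        if donor_id is not None and self._read_donor(donor_id) == block:
            self.refcount[donor_id] += 1   # metadata update only
            self.logical.append(donor_id)  # no new physical write
            return True
        # No verified match: store the block as new unique data.
        new_id = self.next_id
        self.next_id += 1
        self.disk[new_id] = block
        self.cache[new_id] = block
        self.refcount[new_id] = 1
        self.hash_store.setdefault(fp, new_id)
        self.logical.append(new_id)
        return False

    def read(self, index: int) -> bytes:
        """Read a logical block through the sharing metadata."""
        return self.disk[self.logical[index]]
```

Writing the same 4KB block twice to this model stores it once on the simulated disk and raises the donor's reference count to 2, while `read` still returns the original bytes for both logical positions.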

The background deduplication engine works in the same way. It scans all the data blocks in the aggregate and eliminates duplicates by comparing fingerprints of the blocks and by doing a byte-by-byte comparison to eliminate any false positives. This procedure also ensures that there is no data loss during the deduplication operation.
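
A background pass of this kind can be sketched as a scan over blocks that are already stored. The function `background_dedup` and its parameters are hypothetical, and the default SHA-256 fingerprint is an assumption. The `fingerprint` argument is replaceable so that a deliberately weak function can be passed in to show how the byte-by-byte comparison rejects false positives.

```python
import hashlib
from typing import Callable


def sha256_fingerprint(block: bytes) -> bytes:
    # Assumed fingerprint function, for illustration only.
    return hashlib.sha256(block).digest()


def background_dedup(
    blocks: list[bytes],
    fingerprint: Callable[[bytes], bytes] = sha256_fingerprint,
) -> tuple[dict[int, bytes], list[int]]:
    """Scan stored blocks and share verified duplicates.

    Returns (kept, mapping): `kept` holds the physical blocks that remain,
    keyed by original position, and `mapping[i]` names the kept block that
    logical block i now points to.
    """
    candidates: dict[bytes, list[int]] = {}  # fingerprint -> kept block ids
    kept: dict[int, bytes] = {}
    mapping: list[int] = []
    for index, block in enumerate(blocks):
        fp = fingerprint(block)
        # Fingerprint match is a hint; byte-by-byte equality is the proof.
        donor = next(
            (i for i in candidates.get(fp, []) if kept[i] == block), None
        )
        if donor is None:
            kept[index] = block
            candidates.setdefault(fp, []).append(index)
            mapping.append(index)
        else:
            mapping.append(donor)
    return kept, mapping


# A weak fingerprint (first byte only) forces a collision between
# b"aa" and b"ab"; the byte comparison keeps them as separate blocks.
stored = [b"aa", b"ab", b"aa"]
kept, mapping = background_dedup(stored, fingerprint=lambda b: b[:1])
restored = [kept[i] for i in mapping]
```

Because every sharing decision is backed by an exact comparison, `restored` equals the original `stored` list even when fingerprints collide, which mirrors the no-data-loss property described above.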

Benefits of NetApp deduplication

There are some significant advantages to NetApp® deduplication:

Operates on NetApp or third-party primary, secondary, and archive storage
Application independent
Protocol independent
Minimal overhead
Works on NetApp AFF and FAS systems
Byte-by-byte validation
Can be applied to new data or to data previously stored in volumes and LUNs
Can run during off-peak times
Integrated with other NetApp storage efficiency technologies
Savings due to deduplication can be inherited when using NetApp SnapMirror® replication technology or Flash Cache™ intelligent caching
Free of charge

Deduplication use cases

Deduplication is useful regardless of workload type. Maximum benefit is seen in virtual environments where multiple virtual machines are used for test/dev and application deployments.

Virtual desktop infrastructure (VDI) is another very good candidate for deduplication, because the amount of duplicate data among desktops is very high.

Some relational databases such as Oracle and SQL do not benefit greatly from deduplication, because each database record often carries a unique key, which prevents the deduplication engine from identifying the blocks that hold those records as duplicates.
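
The contrast between cloned desktops and keyed database records can be illustrated with a toy calculation. The data below is synthetic, and the function `dedup_savings`, the 8-byte key layout, and the use of SHA-256 are assumptions made for the example. Real savings depend on the actual data and on how records are laid out in blocks.

```python
import hashlib

BLOCK_SIZE = 4096  # 4KB blocks, as in the text


def dedup_savings(blocks: list[bytes]) -> float:
    """Fraction of blocks that block-level deduplication could eliminate."""
    distinct = {hashlib.sha256(block).digest() for block in blocks}
    return 1 - len(distinct) / len(blocks)


payload = b"x" * (BLOCK_SIZE - 8)

# VDI-like case: 100 byte-identical copies of the same block.
cloned = [b"\0" * 8 + payload for _ in range(100)]

# Database-like case: the same payload, but each block starts with a
# unique 8-byte key, so no two blocks are byte-identical.
keyed = [key.to_bytes(8, "big") + payload for key in range(100)]

cloned_savings = dedup_savings(cloned)
keyed_savings = dedup_savings(keyed)
```

In this synthetic setup the cloned blocks collapse to a single stored block, while the keyed blocks yield no savings at all, even though more than 99% of each block's content is the same.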

Configuring deduplication

Deduplication is automatically enabled on all new volumes and aggregates on AFF systems. On other systems, deduplication can be enabled on a per-volume and/or per-aggregate basis.

Once enabled, the system automatically runs both inline and background operations to maximize savings.