How to Deploy Data Deduplication: Top 10 Things to Know

Data deduplication technologies identify and eliminate redundant data in the enterprise, decreasing the amount of storage capacity required. Vendors offer deduplication solutions that can improve the performance, reliability and efficiency of an IT organization's data backup and recovery efforts. Here, Knowledge Center contributor Jeffrey Tofano explains the top 10 things IT professionals must know before deploying a data deduplication solution in their company.


Data deduplication continues to be one of the hot topics across IT for reducing operational costs associated with managing and protecting data. Gartner predicts that by 2012, deduplication technology will be applied to 75 percent of all backups. While this technology will continue to demonstrate tremendous benefit in backup, it is evolving into a disruptive technology across all tiers of storage. Deduplication alone is not a panacea for all storage challenges. Rather, it will increasingly become a critical feature driving the evolution of all data management and protection tasks.

Here are 10 "must-knows" that IT professionals should understand when considering deployment of a data deduplication solution.

1. Data deduplication offers disruptive reduction in the cost of physical storage.

As the amount of data organizations must manage grows exponentially, companies are spending more resources managing and protecting multiple copies of data. To reduce the storage footprint, deduplication technologies systematically examine data sets, storing references to unique data items rather than storing physical copies.

Unlike other reduction technologies, true deduplication technologies are not limited to examining past versions of particular data sets. Instead, redundancy checks are performed against all ingested data to maximize potential reduction ratios. Depending on the type of data and storage policies, deduplicating systems can reduce the physical storage needed by a factor of 20 or more when compared to conventional storage systems.
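To make the idea of storing references rather than physical copies concrete, here is a minimal sketch in Python. It is an illustration only, not any vendor's implementation: the DedupStore class, its method names, the fixed chunk size and the choice of SHA-256 as the fingerprint are all assumptions made for the example.

```python
import hashlib


class DedupStore:
    """Toy chunk store (hypothetical): each unique chunk is kept once."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size
        self.chunks = {}        # fingerprint -> chunk bytes (stored once)
        self.logical_bytes = 0  # total bytes written by callers

    def put(self, data: bytes) -> list:
        """Store a data set and return its recipe (a list of references)."""
        recipe = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            digest = hashlib.sha256(chunk).digest()
            # Only chunks never seen before consume physical space.
            self.chunks.setdefault(digest, chunk)
            recipe.append(digest)
        self.logical_bytes += len(data)
        return recipe

    def get(self, recipe: list) -> bytes:
        """Rebuild a data set by following its references."""
        return b"".join(self.chunks[d] for d in recipe)

    def physical_bytes(self) -> int:
        return sum(len(c) for c in self.chunks.values())

    def reduction_ratio(self) -> float:
        return self.logical_bytes / self.physical_bytes()
```

Because every data set is checked against the same chunk table, a second data set that repeats chunks from the first adds references but little or no physical storage. The ratio reported by reduction_ratio() depends entirely on how redundant the ingested data is.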

2. Data deduplication offers disruptive reduction in the cost of data transfer.

As data is moved between deduplicating storage systems, transfers of redundant data sets can be largely eliminated. Deduplication systems maintain block level "fingerprints" that can be efficiently negotiated between endpoints to filter out data already shared so that only unique items need be transferred.

The transfer reductions attainable from the use of deduplication-optimized transfers generally are of the same order as physical storage reductions: 20x or more. Deduplication-optimized transfers will become increasingly critical to all data protection and management tasks moving forward.
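The fingerprint negotiation described above can be sketched roughly as follows. This is a simplified, single-process illustration under stated assumptions: the replicate function and its parameters are hypothetical, SHA-256 stands in for whatever fingerprint a real system uses, and the small cost of exchanging the fingerprints themselves is not counted.

```python
import hashlib


def fingerprint(chunk: bytes) -> bytes:
    # Assumption: SHA-256 stands in for the system's real fingerprint.
    return hashlib.sha256(chunk).digest()


def replicate(chunks, target):
    """Copy a chunked data set into `target` (dict: fingerprint -> chunk).

    Returns (recipe, bytes_sent), where bytes_sent counts only chunk
    payloads that actually had to cross the wire.
    """
    recipe = [fingerprint(c) for c in chunks]

    # Step 1: the sender offers fingerprints; the receiver answers with
    # the ones it does not already hold.
    wanted = {fp for fp in recipe if fp not in target}

    # Step 2: only the missing chunks are transferred, each at most once.
    bytes_sent = 0
    for fp, chunk in zip(recipe, chunks):
        if fp in wanted:
            target[fp] = chunk
            wanted.discard(fp)
            bytes_sent += len(chunk)
    return recipe, bytes_sent
```

On a first transfer nearly everything is sent. On later transfers of similar data, most fingerprints are already known to the receiver, so the payload shrinks accordingly.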

3. Data deduplication differs from older data reduction technologies.

Because deduplication has become the industry buzz, clever marketing has recast compression, byte differential and even file-level single instancing as deduplication technology. However, true deduplication technologies exhibit three key differentiating traits. First, the scope of deduplication reduction is not limited to a single data set or versions of that data set. Rather, each data set is deduplicated against all other stored data, regardless of type or origin.

Second, the granularity of comparison is sub-file level and typically small (a few kilobytes or less). Third, as data is examined, a small, globally unique fingerprint for each data chunk is computed that serves as the proxy for the actual data. These fingerprints can be quickly examined to determine redundancy and can be used locally or remotely across systems. All of this enables true deduplication technologies to deliver far greater reduction rates and a more granular distributed foundation, compared to other reduction technologies.

4. Variable-length methods are better than fixed-length methods.

Deduplication divides large data sets into numerous small chunks that are checked against a global chunk repository to detect and eliminate duplicates. Fixed-length methods support only one predetermined chunk size, while variable-length schemes allow the chunk size to vary based on observations of the overall data set's structure.

Variable schemes typically offer much better reduction ratios for two reasons. First, because the boundaries of chunks are not fixed, changes (such as the insertion of a small number of bytes) affect only the targeted chunk(s) but not adjacent ones, thereby avoiding the ripple effect inherent in fixed-length methods. Second, the size of all chunks is allowed to vary based on observed redundancies in the data set, allowing more granular comparisons for better matches.
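The ripple effect can be illustrated with a small sketch that compares fixed-length chunking against a simple content-defined (variable-length) scheme. The function names, window size and divisor below are assumptions chosen for the example. For clarity the window checksum is recomputed at every position with zlib.crc32, whereas production systems typically use a rolling hash and enforce minimum and maximum chunk sizes.

```python
import hashlib
import zlib


def fixed_chunks(data: bytes, size: int = 64) -> list:
    """Cut at fixed offsets, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def variable_chunks(data: bytes, window: int = 16, divisor: int = 64) -> list:
    """Cut wherever the checksum of the last `window` bytes hits a target.

    Boundaries depend only on nearby content, so they move along with
    the data when bytes are inserted or removed upstream.
    """
    chunks, start = [], 0
    for i in range(window, len(data) + 1):
        if zlib.crc32(data[i - window:i]) % divisor == 0:
            chunks.append(data[start:i])
            start = i
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


def count_new_chunks(original: bytes, edited: bytes, chunker) -> int:
    """Number of distinct chunks in `edited` not already in `original`."""
    known = {hashlib.sha256(c).digest() for c in chunker(original)}
    fresh = {hashlib.sha256(c).digest() for c in chunker(edited)}
    return len(fresh - known)
```

If a single byte is inserted near the start of a buffer, count_new_chunks with fixed_chunks reports that almost every later chunk is new, because every later boundary has shifted. With variable_chunks, only the chunks around the insertion point change and the rest still match.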

5. Data integrity and hash collision worries are ill-founded.

Because deduplication internally replaces redundant copies of data with references to a single copy, some worry that any data integrity breach, including rare but possible hash collisions, could erroneously affect all referencing data sets. However, today's deduplication systems typically support multiple features to prevent collisions and assure excellent integrity properties. Hash collision issues are often dealt with through the use of multiple, different hashes for fingerprints or even store-time byte level comparisons. These are used to detect collisions and assure that the proper data is always stored and retrieved.

Similarly, deduplication systems leverage the complex nature of hash-based fingerprints for data integrity purposes, using them to provide additional end-to-end integrity checks and drive complex recovery schemes that often interact with required block-level RAID subsystems.
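As a rough illustration of the store-time byte-level comparison mentioned above, the sketch below confirms every fingerprint match against the stored bytes before treating a chunk as a duplicate. The VerifyingStore class and the deliberately weak weak_fingerprint function are hypothetical and exist only so that collisions are easy to trigger in a demonstration. A real system would use a strong hash, for which collisions are extraordinarily unlikely.

```python
class VerifyingStore:
    """Toy store (hypothetical) that never trusts a fingerprint alone."""

    def __init__(self, fingerprint):
        self.fingerprint = fingerprint
        self.buckets = {}    # fingerprint -> list of distinct chunks
        self.collisions = 0  # times two different chunks shared one

    def put(self, chunk: bytes):
        """Store a chunk and return a reference (fingerprint, index)."""
        fp = self.fingerprint(chunk)
        bucket = self.buckets.setdefault(fp, [])
        for idx, existing in enumerate(bucket):
            if existing == chunk:
                # Byte-level comparison confirms a true duplicate.
                return (fp, idx)
        if bucket:
            # Same fingerprint, different bytes: a detected collision.
            self.collisions += 1
        bucket.append(chunk)
        return (fp, len(bucket) - 1)

    def get(self, ref) -> bytes:
        fp, idx = ref
        return self.buckets[fp][idx]


def weak_fingerprint(chunk: bytes) -> int:
    # Intentionally poor hash so that b"ab" and b"ba" collide.
    return sum(chunk) % 256
```

With this arrangement, two different chunks that happen to share a fingerprint are stored separately and retrieved correctly, at the cost of reading the existing chunk back for comparison whenever a fingerprint matches.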