How to Deploy Data Deduplication: Top 10 Things to Know

 
 
By Jeffrey Tofano  |  Posted 2008-10-15

Data deduplication technologies identify and eliminate redundant data across the enterprise, reducing the amount of storage capacity required and improving the performance, reliability and efficiency of an IT organization's backup and recovery efforts. Here, Knowledge Center contributor Jeffrey Tofano explains the top 10 things IT professionals must know before deploying a data deduplication solution in their company.

Data deduplication continues to be one of the hot topics across IT for reducing operational costs associated with managing and protecting data. Gartner predicts that by 2012, deduplication technology will be applied to 75 percent of all backups. While this technology will continue to demonstrate tremendous benefit in backup, it is evolving into a disruptive technology across all tiers of storage. Deduplication alone is not a panacea for all storage challenges. Rather, it will increasingly become a critical feature driving the evolution of all data management and protection tasks.

Here are 10 "must-knows" that IT professionals should understand when considering deployment of a data deduplication solution.

1. Data deduplication offers disruptive reduction in the cost of physical storage.

As the amount of data organizations must manage grows exponentially, companies are spending more resources managing and protecting multiple copies of data. To reduce the storage footprint, deduplication technologies systematically examine data sets, storing references to unique data items rather than storing physical copies.

Unlike other reduction technologies, true deduplication is not limited to examining past versions of a particular data set. Instead, redundancy checks are performed against all ingested data to maximize potential reduction ratios. Depending on the type of data and the storage policies in place, deduplicating systems can reduce the physical storage needed by a factor of 20 or more compared with conventional storage systems.
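To make the mechanics concrete, here is a minimal Python sketch of a content-addressed chunk store, not any particular vendor's implementation: every incoming chunk is fingerprinted, only previously unseen chunks are stored physically, and each data set is kept as a list of fingerprint references. The fixed 4KB chunk size, SHA-256 fingerprint and sample backups are illustrative assumptions (variable-length chunking is covered in point 4).

```python
import hashlib

CHUNK_SIZE = 4096  # illustrative fixed-size chunking; see point 4 for variable-length methods


class ChunkStore:
    """Toy content-addressed store: one physical copy per unique chunk."""

    def __init__(self):
        self.chunks = {}  # fingerprint -> chunk bytes, stored exactly once

    def ingest(self, data: bytes):
        """Store only unseen chunks; return a 'recipe' of fingerprint references."""
        recipe = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            fp = hashlib.sha256(chunk).hexdigest()  # fingerprint acts as a proxy for the chunk
            if fp not in self.chunks:               # redundancy check against all ingested data
                self.chunks[fp] = chunk
            recipe.append(fp)
        return recipe

    def restore(self, recipe):
        """Rebuild a data set from its fingerprint references."""
        return b"".join(self.chunks[fp] for fp in recipe)


store = ChunkStore()
monday = b"A" * 8192 + b"unique to Monday's backup"
tuesday = b"A" * 8192 + b"unique to Tuesday's backup"
r1, r2 = store.ingest(monday), store.ingest(tuesday)
assert store.restore(r1) == monday and store.restore(r2) == tuesday
print(len(store.chunks), "unique chunks held for", len(r1) + len(r2), "chunk references")
```

Running the sketch stores three unique chunks for six chunk references: the repeated data shared by both backups is held once and merely referenced thereafter.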

2. Data deduplication offers disruptive reduction in the cost of data transfer.

As data is moved between deduplicating storage systems, transfers of redundant data sets can be largely eliminated. Deduplication systems maintain block-level "fingerprints" that can be efficiently negotiated between endpoints to filter out data already shared, so that only unique items need be transferred.

The transfer reductions attainable with deduplication-optimized transfers are generally of the same order as the physical storage reductions: 20x or more. Deduplication-optimized transfers will become increasingly critical to all data protection and management tasks moving forward.
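As a rough illustration, and only a sketch rather than any product's actual replication protocol, the negotiation can be modeled as a two-step exchange: the endpoints compare fingerprint lists first, and only the chunks missing on the target are actually transferred. The helper names and sample chunks below are hypothetical.

```python
import hashlib


def fingerprint(chunk: bytes) -> str:
    """Fingerprint used as a cheap stand-in for the chunk during negotiation."""
    return hashlib.sha256(chunk).hexdigest()


def replicate(source: dict, target: dict, recipe: list) -> int:
    """Send a data set (a list of chunk fingerprints) to the target store,
    transferring only the chunks the target does not already hold."""
    missing = [fp for fp in recipe if fp not in target]  # step 1: negotiate fingerprints, not data
    for fp in missing:
        target[fp] = source[fp]                          # step 2: only unique chunks cross the wire
    return len(missing)


# Hypothetical example: two backups at the source site share one large chunk.
chunks = [b"shared block" * 1000, b"Monday's changes", b"Tuesday's changes"]
source = {fingerprint(c): c for c in chunks}
monday = [fingerprint(chunks[0]), fingerprint(chunks[1])]
tuesday = [fingerprint(chunks[0]), fingerprint(chunks[2])]

target = {}
print(replicate(source, target, monday))   # 2 chunks sent (target started empty)
print(replicate(source, target, tuesday))  # 1 chunk sent (the shared chunk is filtered out)
```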

3. Data deduplication differs from older data reduction technologies.

Because deduplication has become the industry buzzword, clever marketing has recast compression, byte-level differencing and even file-level single instancing as deduplication technology. However, true deduplication technologies exhibit three key differentiating traits. First, the scope of deduplication is not limited to a single data set or versions of that data set. Rather, each data set is deduplicated against all other stored data, regardless of type or origin.

Second, the granularity of comparison is sub-file and typically small (a few kilobytes or less). Third, as data is examined, a small, globally unique fingerprint is computed for each data chunk and serves as a proxy for the actual data. These fingerprints can be quickly examined to determine redundancy and can be used locally or remotely across systems. All of this enables true deduplication technologies to deliver far greater reduction rates, and a more granular, distributed foundation, than other reduction technologies.
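A small, hypothetical illustration of the granularity trait: file-level single instancing treats two nearly identical files as entirely different objects, while sub-file fingerprints expose the overlap. Fixed 4KB chunks are used here purely for brevity; point 4 explains why variable-length chunking does even better.

```python
import hashlib


def chunk_fingerprints(data: bytes, size: int = 4096) -> set:
    """One SHA-256 fingerprint per fixed-size chunk (sub-file granularity)."""
    return {hashlib.sha256(data[i:i + size]).hexdigest()
            for i in range(0, len(data), size)}


# Two versions of a 64 KB file in which only the final 4 KB differ.
file_v1 = b"".join(bytes([i]) * 4096 for i in range(16))
file_v2 = file_v1[:-4096] + b"\xff" * 4096

# File-level single instancing: the whole-file hashes differ, so both copies are kept in full.
print(hashlib.sha256(file_v1).hexdigest() == hashlib.sha256(file_v2).hexdigest())  # False

# Sub-file deduplication: 15 of the 16 chunks are identical and are stored only once.
shared = chunk_fingerprints(file_v1) & chunk_fingerprints(file_v2)
print(len(shared), "of", len(chunk_fingerprints(file_v2)), "chunks shared")
```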

4. Variable-length methods are better than fixed-length methods.

Deduplication divides large data sets into numerous small chunks that are checked against a global chunk repository to detect and eliminate duplicates. Fixed-length methods support only one predetermined chunk size, while variable-length schemes allow the chunk size to vary based on observations of the overall data set's structure.

Variable schemes typically offer much better reduction ratios for two reasons. First, because the boundaries of chunks are not fixed, changes (such as the insertion of a small number of bytes) affect only the targeted chunk(s) but not adjacent ones, thereby avoiding the ripple effect inherent in fixed-length methods. Second, the size of all chunks is allowed to vary based on observed redundancies in the data set, allowing more granular comparisons for better matches.
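The ripple effect is easy to demonstrate. The sketch below is a simplified stand-in for real content-defined chunking algorithms: it declares a chunk boundary wherever a rolling hash of the trailing window of bytes hits a fixed bit pattern, so boundaries follow the content rather than absolute offsets. After a small insertion near the front of a data set, fixed-length chunking loses almost every match, while the variable-length scheme resynchronizes and keeps nearly all of them. The window size, mask and size limits are illustrative values, not recommendations.

```python
import hashlib
import os


def chunk_fixed(data: bytes, size: int = 4096):
    """Fixed-length chunking: boundaries sit at absolute offsets."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def chunk_variable(data: bytes, window=48, mask=0x0FFF, min_size=1024, max_size=16384):
    """Content-defined chunking: cut wherever a rolling hash of the last
    `window` bytes matches `mask`, so boundaries depend on content alone."""
    pow_win = pow(257, window, 1 << 32)        # factor used to drop the oldest byte
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = (h * 257 + b) % (1 << 32)          # roll the newest byte in
        if i >= window:
            h = (h - data[i - window] * pow_win) % (1 << 32)  # roll the oldest byte out
        size = i - start + 1
        if (size >= min_size and (h & mask) == mask) or size >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def fingerprints(chunks):
    return {hashlib.sha256(c).hexdigest() for c in chunks}


original = os.urandom(256 * 1024)
modified = original[:100] + b"a few inserted bytes" + original[100:]  # small edit near the front

for name, chunker in (("fixed-length", chunk_fixed), ("variable-length", chunk_variable)):
    before, after = fingerprints(chunker(original)), fingerprints(chunker(modified))
    print(f"{name}: {len(before & after)} of {len(after)} chunks unchanged after the insertion")
```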

5. Data integrity and hash collision worries are ill-founded.

Because deduplication internally replaces redundant copies of data with references to a single copy, some worry that any data integrity breach (including rare but possible hash collisions) could erroneously affect all referencing data sets. However, today's deduplication systems typically support multiple features to prevent collisions and assure excellent integrity properties. Hash collision concerns are often addressed through the use of multiple, different hashes for fingerprints, or even store-time byte-level comparisons, which detect collisions and assure that the proper data is always stored and retrieved.

Similarly, deduplication systems put hash-based fingerprints to work for data integrity, using them to provide additional end-to-end integrity checks and to drive recovery schemes that work alongside the underlying block-level RAID subsystems.
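As one concrete illustration of the "verify before trusting a match" safeguard described above, and only a sketch rather than any specific product's mechanism, a store can refuse to deduplicate a chunk whose fingerprint matches but whose bytes do not:

```python
import hashlib


class VerifyingChunkStore:
    """Toy store showing one collision safeguard: a fingerprint match alone is
    never trusted; the candidate chunk is compared byte-for-byte (other designs
    use a second, independent hash) before the new copy is discarded."""

    def __init__(self):
        self.chunks = {}  # fingerprint -> stored chunk

    def put(self, chunk: bytes) -> str:
        fp = hashlib.sha256(chunk).hexdigest()
        existing = self.chunks.get(fp)
        if existing is None:
            self.chunks[fp] = chunk          # first appearance of this fingerprint
        elif existing != chunk:
            # Genuine hash collision: two different chunks produced the same fingerprint.
            # A real system would fall back to a secondary hash or a qualified key;
            # this sketch simply refuses to deduplicate rather than risk corruption.
            raise ValueError("fingerprint collision detected; chunk not deduplicated")
        return fp                            # reference handed back to the caller


store = VerifyingChunkStore()
ref1 = store.put(b"backup block")
ref2 = store.put(b"backup block")            # verified match, so it is deduplicated
assert ref1 == ref2 and len(store.chunks) == 1
```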



 
 
 
 
Jeffrey Tofano is a recognized industry veteran, having worked in storage and data protection for more than 25 years. As Chief Technology Officer at Quantum, he oversees the company's technological vision and portfolio roadmaps, with an emphasis on integrating data deduplication and file system technologies into the company's broader solutions strategy, as well as protecting critical business data in a virtual world. Previously, he served as Technical Director at NetApp and Chief Architect at OnStor. Since joining Quantum in 2007, Tofano has spoken on emerging data protection concepts and technologies at data protection and storage networking conferences, including Storage Networking Industry Association (SNIA) events, TechTarget Backup School end-user tutorials, and national conferences such as Storage Networking World (SNW) and Symantec Vision. He can be reached at jeffrey.tofano@quantum.com.
 
 
 
 
 
 
 
