How Data Deduplication Works

By Chris Poelker  |  Posted 2009-01-20


1. Divide the input data into blocks or chunks

2. Calculate a hash value for each block

3. Use the hash value to determine whether another block with the same data has already been stored

4. If it has, replace the duplicate block with a reference to the object already in the database
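The four steps above can be sketched in a few lines of Python. This is only an illustration, not how any particular product works: the fixed 4 KB block size, the choice of SHA-256, the in-memory dictionary standing in for the database, and the names `deduplicate` and `restore` are all assumptions made for the example.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size; real products may use variable-size chunks


def deduplicate(data, store, block_size=BLOCK_SIZE):
    """Split data into blocks and keep one copy of each unique block.

    `store` is a plain dict standing in for the database (hash -> block).
    Returns the list of hashes (references) needed to rebuild the data.
    """
    references = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]       # step 1: divide into blocks
        digest = hashlib.sha256(block).hexdigest()     # step 2: hash the block
        if digest not in store:                        # step 3: stored already?
            store[digest] = block                      # new data: keep one copy
        references.append(digest)                      # step 4: keep only a reference
    return references


def restore(references, store):
    """Rebuild the original data by following the references."""
    return b"".join(store[digest] for digest in references)
```

For input made of four blocks in which three are identical, the store ends up holding two blocks while the reference list still has four entries, and `restore` returns the original bytes.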

You can implement the actual process of data deduplication in several ways. For example, you can eliminate duplicate data simply by comparing two files and deleting the one that's older or no longer needed, or you can use a commercial deduplication product. Commercial solutions use sophisticated methods, and the actual math involved can make your head spin. If you want to understand all the nuances of the mathematical techniques used to find duplicate data, you should take college courses in statistical analysis, data security and cryptography (and hey, who knows: if your current line of work doesn't pan out for you, maybe you could get a job at the CIA).
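The simple file-comparison approach can be sketched as follows. The function name `older_duplicate` is hypothetical, and the sketch assumes that "older" means the file with the earlier modification time. It only reports which file is the candidate for deletion; it does not delete anything.

```python
import filecmp
import os
from typing import Optional


def older_duplicate(path_a: str, path_b: str) -> Optional[str]:
    """Return the path of the older file if both files hold identical bytes.

    Returns None when the contents differ, so nothing should be removed.
    """
    # shallow=False forces a byte-by-byte comparison instead of trusting
    # file size and timestamps alone.
    if not filecmp.cmp(path_a, path_b, shallow=False):
        return None
    # Assumption: the file with the earlier modification time is the "older" one.
    if os.path.getmtime(path_a) <= os.path.getmtime(path_b):
        return path_a
    return path_b
```

A caller could pass the returned path to `os.remove` after confirming that the older copy really is no longer needed.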

Most of the data deduplication solutions on the market today use standard cryptographic hash functions to create a compact mathematical representation of the data in question (a hash) that is, for practical purposes, unique, so that the hash can be compared with the hashes of any new data to determine whether that data has already been stored. The hash also serves as the metadata (that is, the data about other data) for the chunk of data in question. A hash used as metadata serves as an efficient index in a lookup table, allowing you to quickly determine whether any new data being stored is already present and can be eliminated.
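As a rough illustration of a hash serving as the index into a lookup table, the sketch below keys a dictionary on the SHA-256 digest of each chunk and keeps a small metadata record per entry. The class names `ChunkIndex` and `ChunkEntry`, the fields they hold, and the choice of SHA-256 are assumptions for the example rather than a description of any vendor's design.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class ChunkEntry:
    location: int       # hypothetical position of the chunk in backing storage
    length: int         # size of the chunk in bytes
    ref_count: int = 1  # how many times this chunk has been seen


class ChunkIndex:
    """Lookup table keyed by chunk hash (an in-memory stand-in for a database)."""

    def __init__(self):
        self._entries = {}
        self._next_location = 0
        self.bytes_seen = 0
        self.bytes_stored = 0

    def add(self, chunk: bytes) -> bool:
        """Record a chunk. Return True if it was new, False if it was a duplicate."""
        key = hashlib.sha256(chunk).digest()
        self.bytes_seen += len(chunk)
        entry = self._entries.get(key)  # one dictionary lookup per chunk
        if entry is not None:
            entry.ref_count += 1        # duplicate: count it, store nothing
            return False
        self._entries[key] = ChunkEntry(self._next_location, len(chunk))
        self._next_location += len(chunk)
        self.bytes_stored += len(chunk)
        return True
```

Comparing `bytes_seen` with `bytes_stored` gives a simple measure of how much space deduplication saved. With a cryptographic hash, two different chunks producing the same digest is extremely unlikely but not impossible, so a cautious design could also compare the bytes before treating a chunk as a duplicate.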

Chris Poelker is Vice President of Enterprise Solutions at FalconStor Software. Prior to working at FalconStor, Chris was a Storage Architect at Hitachi Data Systems. Before that, Chris was a Lead Storage Architect/Senior Systems Architect for Compaq Computer, Inc. While at Compaq, Chris built the sales/service engagement model for Compaq StorageWorks and trained VARs and Compaq ES/PS contacts on StorageWorks. His certifications include MCSE, MCT (Microsoft Trainer), MASE (Compaq Master ASE Storage Architect) and A+ certified (PC Technician). Chris is also the co-author of "Storage Area Networks for Dummies." He can be reached at
