How data deduplication works
1. Divide the input data into blocks or chunks
3. Calculate a hash value for each block
3. Use the hash value to determine whether another block of data has already been stored
4. Replace the original data with a reference to an object in the database
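The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the fixed 4 KB chunk size and the choice of SHA-256 are assumptions, and real products use far more sophisticated chunking.

```python
import hashlib

def deduplicate(data: bytes, chunk_size: int = 4096):
    """Fixed-size chunk deduplication sketch.

    Returns a chunk store (hash -> unique chunk, each stored once)
    and an ordered list of hashes that stands in for the original data.
    """
    store = {}        # hash -> chunk data (each unique chunk stored once)
    references = []   # ordered hashes replacing the original data
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]                 # 1. divide into chunks
        digest = hashlib.sha256(chunk).hexdigest()     # 2. calculate a hash
        if digest not in store:                        # 3. already stored?
            store[digest] = chunk                      # keep new chunk once
        references.append(digest)                      # 4. store a reference
    return store, references

def reassemble(store, references):
    """Rebuild the original data from the store and the reference list."""
    return b"".join(store[d] for d in references)
```

Run it against data with repeated blocks and the store holds only the unique chunks, while the reference list preserves enough information to reassemble the original byte-for-byte.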
You can implement the actual process of data deduplication in several ways. For example, you can eliminate duplicate data simply by comparing two files and deleting the one that's older or no longer needed, or you can use a commercial deduplication product. Commercial solutions use sophisticated methods, and the math involved can make your head spin. If you want to understand all the nuances of the mathematical techniques used to find duplicate data, take college courses in statistical analysis, data security, and cryptography (and hey, who knows, if your current line of work doesn't pan out, maybe you could get a job at the CIA).
Most of the data deduplication solutions on the market today use standard cryptographic hashing techniques to create a unique mathematical representation of the dataset in question, called a hash, so that the hash can be compared with any new hashes to determine whether the data is unique. The hash also serves as the metadata (that is, the data about other data) for the chunk of data in question. A hash used as metadata serves as an efficient index in a lookup table, allowing you to quickly determine whether any new data being stored is already present and can be eliminated.
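Here's a small sketch of that hash-as-index idea. The `DedupIndex` class and its method names are hypothetical, invented for illustration; the point is that a fixed-size hash lets you test for duplicates in a lookup table without ever comparing chunk contents directly.

```python
import hashlib

class DedupIndex:
    """Hypothetical lookup table keyed by chunk hash.

    The hash doubles as metadata: a short, fixed-size fingerprint
    of the chunk that makes duplicate checks a fast dictionary lookup.
    """
    def __init__(self):
        self._table = {}  # hash -> location of the stored chunk

    def add(self, chunk: bytes, location: str) -> bool:
        """Record a chunk; return True if new, False if a duplicate."""
        digest = hashlib.sha256(chunk).hexdigest()
        if digest in self._table:
            return False          # duplicate: store only a reference
        self._table[digest] = location
        return True               # unique: store the chunk itself
```

The first time a chunk is offered, `add` returns `True` and the caller writes the data; every later copy returns `False`, and the caller stores only a pointer to the existing location.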