What is Data Deduplication?

The eWeek Knowledge Center features IT experts answering questions about the most pertinent enterprise technology issues of the day. This installment features Robert Stevenson, Managing Director for Storage at TheInfoPro.

Q: How does data deduplication work?
A: Data deduplication is based on the fact that in any enterprise where you are storing and backing up data, there is a tremendous amount of content that occurs more than once. It's more efficient to eliminate, or deduplicate, those occurrences than to store them in multiple places. Deduplication vendors use a variety of algorithms. Some use hash algorithms like SHA-1; others do a bit-by-bit comparison. But it boils down to examining the blocks of data in a backup stream and replacing duplicated instances with pointers to a single unique instance.
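
To make the idea concrete, here is a minimal Python sketch of hash-based, block-level deduplication. It is an illustration under stated assumptions, not any vendor's implementation: the fixed 4 KB block size and the names deduplicate and restore are hypothetical, and real products may use variable-size blocks or confirm matches with a bit-by-bit comparison.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size; real products vary


def deduplicate(stream: bytes, block_size: int = BLOCK_SIZE):
    """Split a backup stream into blocks and keep each unique block once."""
    store = {}     # SHA-1 digest -> block bytes (the unique instances)
    pointers = []  # one digest per original block, in stream order
    for offset in range(0, len(stream), block_size):
        block = stream[offset:offset + block_size]
        digest = hashlib.sha1(block).hexdigest()
        if digest not in store:
            store[digest] = block  # first occurrence: store the data
        pointers.append(digest)    # every occurrence: store only a pointer
    return store, pointers


def restore(store, pointers) -> bytes:
    """Rebuild the original stream by following the pointers."""
    return b"".join(store[digest] for digest in pointers)
```

For a stream made of three identical blocks followed by one different block, deduplicate keeps two blocks in the store and four pointers, and restore returns the original bytes.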

Q: What do data deduplication products look like?
A: Typically it's an appliance that can sit either in-band or out-of-band. If it's in-band, it analyzes and deduplicates the backup stream while it's being sent to backup storage (for example, to a virtual tape library, or VTL). If it's out-of-band, it analyzes and rewrites the data after it has been written to the backup device. In either case, the goal is to remove duplicate data while changing as little as possible in your existing infrastructure. All you do is deploy the appliance.

Q: What kind of applications does deduplication work best with?
A: It can work with either file-oriented or block-oriented applications. It really depends on which applications a particular vendor's product is targeting. But you need to keep in mind that it isn't suited for data that's already been compressed or encrypted, because compression and encryption reduce the number of pattern matches the deduplication algorithm can detect. Typically you would do encryption after deduplication, not before.
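
The effect of encrypting first can be shown with a small Python sketch. Everything here is an assumption made for illustration: toy_encrypt is a hypothetical, insecure XOR stream cipher standing in for real encryption, and the 4 KB block size is arbitrary. The point is only that identical plaintext blocks stop looking identical once they are encrypted.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size for this illustration


def unique_block_count(stream: bytes, block_size: int = BLOCK_SIZE) -> int:
    """Count the distinct blocks a hash-based deduplicator would store."""
    return len({
        hashlib.sha1(stream[i:i + block_size]).digest()
        for i in range(0, len(stream), block_size)
    })


def toy_encrypt(stream: bytes, key: bytes) -> bytes:
    """Toy XOR stream cipher (NOT secure): keystream from SHA-256 of key+counter."""
    out = bytearray()
    for counter, offset in enumerate(range(0, len(stream), 32)):
        pad = hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        chunk = stream[offset:offset + 32]
        out.extend(b ^ p for b, p in zip(chunk, pad))
    return bytes(out)


# Eight copies of the same block: ideal input for deduplication.
plain = b"same block of backup data".ljust(BLOCK_SIZE, b".") * 8
cipher = toy_encrypt(plain, b"hypothetical-key")

blocks_before = unique_block_count(plain)   # duplicates are visible
blocks_after = unique_block_count(cipher)   # duplicates are hidden
```

In this sketch the plaintext collapses to a single unique block, while the encrypted copy yields eight distinct blocks, so there is nothing left to deduplicate.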

Q: What are the main benefits of deduplication?
A: Well, contrary to what you might think, the most important benefit isn't really saving storage space. It's the fact that you need to send less data to backup in the first place. That can save you a lot of time and bandwidth.

Q: Just how much data redundancy can be eliminated with deduplication?
A: It varies tremendously, of course. In the best case, you can get a compression ratio of 20-to-1. In other words, a 20-terabyte backup would be reduced to just one terabyte. About 10% of the data deduplication users we talk to get this kind of ratio. But this is definitely something you need to test for yourself, with your own data, before you buy a deduplication appliance.
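
The arithmetic behind a ratio like 20-to-1 is simple enough to sketch in Python. The function names stored_size and space_savings are hypothetical helpers introduced for this example, and sizes are in terabytes.

```python
def stored_size(logical_tb: float, ratio: float) -> float:
    """Physical capacity needed for a backup of logical_tb at a given ratio."""
    return logical_tb / ratio


def space_savings(ratio: float) -> float:
    """Fraction of the original capacity that deduplication eliminates."""
    return 1 - 1 / ratio


# The best case quoted above: a 20 TB backup at 20-to-1.
best_case_tb = stored_size(20, 20)
best_case_savings = space_savings(20)
```

At 20-to-1, the 20 TB backup needs 1 TB of physical storage, which is a saving of 95%. The same helpers show why modest ratios still matter: at 2-to-1, half the capacity is already saved.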

Q: What are some of the vendors of data deduplication gear?
A: Data Domain and Diligent Technologies are two of the leading private independent vendors. EMC acquired a well-known company called Avamar. Network Appliance, Symantec and FalconStor also have solutions.