Q: What are some of the applications that generate a lot of redundant data that can be eliminated by deduplication?
A: There are two parts to the answer. First, certain applications by their very design create redundant data on primary storage. Then backup makes that data even more redundant.
Q: Let's start with the backup part first.
A: Suppose you're using typical backup software like Symantec NetBackup [formerly Veritas] or EMC NetWorker [formerly Legato] to do your backups. Let's say you keep 12 weeks of full backups and do daily incremental backups. This alone is going to create a lot of redundant data by definition. A good rule of thumb is that one gigabyte of data on primary storage yields about 10 GB on backup storage.
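To make that rule of thumb concrete, here is a minimal back-of-the-envelope sketch; the 5 percent daily change rate and the six incrementals per week are illustrative assumptions of mine, not figures from the interview.

```python
# Back-of-the-envelope sketch of how a typical retention policy multiplies data.
# Assumptions (illustrative, not from the interview): 1 GB of primary data,
# 12 retained weekly fulls, 6 daily incrementals per week, 5% daily change rate.

primary_gb = 1.0
weeks_retained = 12
incrementals_per_week = 6
daily_change_rate = 0.05  # fraction of primary data that changes each day

full_backup_gb = weeks_retained * primary_gb
incremental_gb = weeks_retained * incrementals_per_week * primary_gb * daily_change_rate
total_gb = full_backup_gb + incremental_gb

print(f"Full backups retained:  {full_backup_gb:.1f} GB")
print(f"Incrementals retained:  {incremental_gb:.1f} GB")
print(f"Total backup footprint: {total_gb:.1f} GB per GB of primary data")
# With these assumptions, 1 GB of primary data turns into roughly 15-16 GB of
# backup data -- the same order of magnitude as the 10-to-1 rule of thumb.
```

The exact multiple depends on the change rate and the retention policy, but the point is that every retained full backup carries the same baseline data all over again.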
Q: What about the redundancy created by applications?
A: Certain applications create tremendous amounts of redundant data even before the backup software goes to work. One major example is Microsoft Exchange. If you are sending lots of file attachments with your messages, then Exchange is most likely storing multiple copies of these files. A common rule of thumb is that 90 percent of e-mail volume is in the attachments. If several people are mailing a spreadsheet or a PowerPoint file back and forth, each time making a few changes, you can easily end up with 20 or 30 nearly identical copies of the file in Exchange. In extreme cases you can have hundreds of copies. And these extreme cases aren't that uncommon. I've seen customers where we installed deduplication get a 100-to-1 reduction in the volume of data stored by Exchange. Another example comes from the way many organizations provision disk space for their databases. When the DBAs ask the database owners how much data they expect to generate, the answer sometimes represents a dream rather than reality. I recently saw a database provisioned for 4 terabytes that held only 400 megabytes of actual data. But without deduplication the entire 4 terabytes was being regularly backed up.
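As a rough illustration of how a deduplicating store collapses those near-identical attachments, here is a minimal sketch of file-level, content-addressed single-instance storage. The class, the store layout, and the use of SHA-256 are my own illustrative choices, not how Exchange or any particular product implements it.

```python
import hashlib

# Minimal sketch of file-level single-instance storage: identical attachments
# are detected by their content hash and stored only once.

class SingleInstanceStore:
    def __init__(self):
        self.blobs = {}       # content hash -> attachment bytes (stored once)
        self.references = {}  # message id -> list of content hashes

    def add_attachment(self, message_id, data):
        digest = hashlib.sha256(data).hexdigest()
        if digest not in self.blobs:          # only new content consumes space
            self.blobs[digest] = data
        self.references.setdefault(message_id, []).append(digest)

    def stored_bytes(self):
        return sum(len(d) for d in self.blobs.values())

# Twenty users mail around the same 1 MB spreadsheet: the logical volume is
# 20 MB, but the store keeps a single 1 MB copy.
store = SingleInstanceStore()
spreadsheet = b"x" * 1_000_000
for i in range(20):
    store.add_attachment(f"msg-{i}", spreadsheet)
print(store.stored_bytes())  # 1000000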
Q: So part of the problem is the way certain applications work, and part of it is bad policy?
A: There is also a behavioral element to this. If you think about how people use their file systems, you can often observe that they don't trust their backup systems. So what they do is save multiple versions of the same document in different places, or perhaps lots of similar versions that are only slightly different. A deduplication system will notice this and keep only the new blocks.
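The "keep only the new blocks" behavior can be sketched as fixed-size chunking plus hashing; real products typically use variable-size (content-defined) chunking and considerably more machinery, so treat this as an illustrative assumption rather than any vendor's design.

```python
import hashlib, os

# Illustrative sketch of block-level deduplication with fixed-size chunks,
# showing why two nearly identical files share almost all their storage.

BLOCK_SIZE = 4096
block_store = {}  # block hash -> block bytes; each unique block stored once

def write_file(data):
    """Split data into blocks and store only blocks not already present."""
    new_bytes = 0
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        if digest not in block_store:
            block_store[digest] = block
            new_bytes += len(block)
    return new_bytes

original = os.urandom(1_000_000)            # a 1 MB document
edited = original[:-10] + b"minor edit."    # a copy with a small change at the end

print(write_file(original))  # ~1,000,000 bytes of new, unique blocks
print(write_file(edited))    # only the final, changed block is new
```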
Q: Given the amount of redundant data that commonly gets created both before and during backup, how much overall reduction can a typical organization expect to get from deduplication?
A: Overall a 20-to-1 reduction in the amount of data backed up is very common. It obviously depends on the exact mix of your applications and your data. It also depends on your policies and on the backup software you are using. For example, Symantec NetBackup and EMC NetWorker do full as well as incremental backups, so if you deduplicate a full weekly backup to an intelligent disk target you will save a lot of space. But IBM's TSM [Tivoli Storage Manager] uses an "incremental forever" approach and doesn't do the "full and incremental" routine that most backup products do. So with TSM and an intelligent disk target you won't see the same 20-to-1 reduction; it's more likely to be 5-to-1 or 10-to-1, which of course is still significant.
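A rough way to see why the two backup styles dedupe so differently is to compare the logical volume sent to the target against the unique data in it. The figures below are illustrative assumptions, not measurements, and the sketch ignores redundancy inside the primary data itself and compression, both of which raise real-world ratios.

```python
# Illustrative arithmetic showing why "weekly full + daily incremental" backups
# leave far more redundancy for a dedup engine to remove than "incremental
# forever" backups. All figures are assumed for illustration.

primary_tb = 10.0
weeks_retained = 12
daily_change = 0.01   # assumed fraction of primary data changing per day

changed_tb = weeks_retained * 7 * primary_tb * daily_change
unique_tb = primary_tb + changed_tb   # one baseline plus everything that changed

# Weekly fulls resend the whole baseline every week; incrementals add the changes.
logical_full_incr = weeks_retained * primary_tb + weeks_retained * 6 * primary_tb * daily_change

# Incremental forever sends the baseline once, then only the changes.
logical_incr_forever = primary_tb + changed_tb

print(f"Full + incremental:  {logical_full_incr / unique_tb:.1f}x redundancy in the backup stream")
print(f"Incremental forever: {logical_incr_forever / unique_tb:.1f}x redundancy in the backup stream")
```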
Q: Aside from "incremental forever" backups, what kinds of applications get the least benefit from data deduplication?
A: Certain specialized types of data have inherently small amounts of redundancy. Interestingly, these are very often data types that describe natural phenomena rather than the result of human activities or business processes. One example is medical imaging, where there may not be much redundancy to begin with, and where the file formats already use specialized compression techniques. Another example is seismic data from the oil and gas exploration industry.