What is data deduplication, and what are its benefits? In simplified terms, data deduplication means comparing objects (usually files or blocks) and removing all non-unique objects (that is, copies). The basic benefits of data deduplication can be summarized as follows: reduced hardware costs, reduced data center footprint, reduced backup costs, reduced disaster recovery costs, and more efficient use of storage.
If you look at the left side of the figure below, you will see several blocks being stored that are not unique. The data deduplication process removes any blocks that are not unique, resulting in the smaller group of blocks to the right.
You can apply data deduplication in multiple places. Wherever you apply it, data deduplication can affect costs not only for your Storage Area Network (SAN), but also for your entire IT infrastructure.
Based on an enterprise environment running typical applications, you probably could squeeze out between 10 and 20 percent more storage space just by getting rid of duplicate and unnecessary files. Files are commonly known as “unstructured data,” while the data residing in databases is commonly known as “structured data.” Simple unstructured data in files can therefore be deduplicated at the file system level, but the structured data residing in large databases is typically deduplicated beneath the operating system’s file system, at the block level.
Interestingly, though, since block-level deduplication does not need to understand the file system, it is sometimes even more efficient to deduplicate files at the block level. Whether you choose a solution that works at the block level, file level or both, you will find that it can pay for itself extremely fast in the amount of savings you get from storage, media, power, cooling and floor space costs.
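To make file-level deduplication concrete, here is a minimal Python sketch that walks a directory tree, hashes each file’s contents with SHA-256, and reports groups of files whose contents are identical. The function name and the 64 KB read size are my own illustrative choices, not any product’s implementation:

```python
import hashlib
import os

def find_duplicate_files(root):
    """Group files under `root` by SHA-256 content hash; return the duplicate groups."""
    by_hash = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            digest = hashlib.sha256()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(65536), b""):  # hash in 64 KB pieces
                    digest.update(chunk)
            by_hash.setdefault(digest.hexdigest(), []).append(path)
    # Hashes seen more than once mark files whose contents are identical
    return [paths for paths in by_hash.values() if len(paths) > 1]
```

Each group it returns could be collapsed to a single stored copy, which is exactly the “get rid of duplicate files” savings described above.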
How Data Deduplication Works
1. Divide the input data into blocks or chunks
2. Calculate a hash value for each block
3. Use the hash value to determine whether another block of data has already been stored
4. If the block has already been stored, replace it with a reference to the existing copy
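The four steps above can be sketched in a few lines of Python. This toy block store uses fixed-size 4 KB blocks and SHA-256 hashes; both are illustrative assumptions (commercial products often use variable-size chunking and their own hash choices):

```python
import hashlib

BLOCK_SIZE = 4096  # fixed-size chunking for simplicity

class DedupStore:
    """Toy block store: each unique block is kept once, keyed by its SHA-256 hash."""

    def __init__(self):
        self.blocks = {}  # hash -> block bytes, stored only once

    def write(self, data):
        """Split `data` into blocks, store unique ones, return the reference list."""
        refs = []
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]               # step 1: divide into blocks
            digest = hashlib.sha256(block).hexdigest()   # step 2: hash each block
            if digest not in self.blocks:                # step 3: lookup by hash
                self.blocks[digest] = block
            refs.append(digest)                          # step 4: keep a reference
        return refs

    def read(self, refs):
        """Reassemble the original data from its block references."""
        return b"".join(self.blocks[d] for d in refs)
```

Writing two identical blocks stores the block bytes only once; the second write just adds another reference to the same hash.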
You can implement the actual process of data deduplication in several ways. For example, you can eliminate duplicate data simply by comparing two files and deleting the one that’s older or no longer needed, or you can use a commercial deduplication product. Commercial solutions use sophisticated methods, and the actual math involved can make your head spin. If you want to understand all the nuances of the mathematical techniques used to find duplicate data, you should take college courses in statistical analysis, data security and cryptography (and hey, who knows: if your current line of work doesn’t pan out for you, maybe you could get a job at the CIA).
Most of the data deduplication solutions on the market today use standard cryptographic hash functions to create a unique mathematical representation of the chunk of data in question, called a hash, so that the hash can be compared with the hashes of new data to determine whether that data is unique. The hash also serves as the metadata (that is, the data about other data) for the chunk in question. A hash used as metadata makes an efficient index into a lookup table, allowing you to quickly determine whether new data being stored is already present and can therefore be eliminated.
Why Data Deduplication is Important
Data deduplication goes a long way toward reducing data storage costs by making storage much more efficient, which in turn can reduce the overall footprint inside the data center. Just think: if by deduplicating your data you can store the exact same amount of information in less than one-tenth the footprint, imagine how much money and energy you could save in power and cooling costs.
The machine on top is a tape library with 16 tape drives and 6,000 tapes. The bottom machine is a Virtual Tape Library (VTL) with deduplication, which can emulate over 512 of the tape libraries pictured above it. Even if the cost of the equipment is not an issue, the floor space required sure is! So, let’s see: 10 floor tiles in the data center dedicated to housing 6,000 tapes’ worth of data, or one floor tile dedicated to housing over 65,000 tapes’ worth of data. Hmm, which to choose?
Why tape is not so green
Some of the folks who sell tape will tell you that, since tape does not require power after it’s written, it’s greener to use tape than disk, even if the data is deduplicated. They have a point: tape draws no power at rest. But some of those older, massive tape libraries need a nuclear power plant to operate, while disks draw a lot of power when spinning up but much less during normal operation.
The other not-so-green fact about tape is that you end up with a lot of it over time. If your Disaster Recovery (DR) strategy is to ship tapes offsite for recovery or storage, trucking those tapes around burns a heck of a lot of gasoline that disk drives don’t need. In fact, a VTL that implements deduplication can electronically replicate the data to another VTL at a different location, which makes the remote data center greener, too. Also, the most prevalent VTL solution can encrypt the replicated virtual tapes, so there is no risk of losing or misplacing sensitive data.
How Backup Environments Benefit
Let’s look at a typical backup environment as an example, since backup is an area that benefits greatly from data deduplication. Deduplication solutions can be implemented in many places, but data backup and data archiving are the areas where the benefits are immediately apparent. The more data you have, and the longer you need to retain it for business or regulatory reasons, the better the results you will see from your data deduplication solution.
The figure below shows a sample dataset of 20 TB being retained over five weeks, with typical data growth and change rates. If you use a traditional backup solution (such as Veritas NetBackup, CommVault, IBM Tivoli Storage Manager (TSM), EMC Legato or HP Data Protector) to back up the data to media (disk or tape) with no deduplication, you’ll need to store more than 101 TB of data in only five weeks. [Okay, for you IBMers out there: TSM is a progressive backup solution, so you will probably store less on tape, but don’t get me started on all the disk-based file systems being used for the D2D (disk-to-disk) part of the backup!]
In the figure below, you can see that after five weeks with no deduplication going on, you will have stored about 110 TB of data.
Now let’s take the same metrics and apply data deduplication. Instead of storing 110 TB, we now only need to store a little more than 24 TB for the exact same amount of information, an effective deduplication ratio of roughly 4.5-to-1.
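The five-week arithmetic is easy to reproduce. The short sketch below assumes weekly full backups and a 2 percent weekly growth rate (both are my own assumptions; the exact growth and change rates behind the figure aren’t stated), which lands in the same ballpark as the totals quoted above:

```python
def cumulative_backup_tb(initial_tb, weeks, weekly_growth, dedup_ratio=1.0):
    """Total backup media consumed by repeated full backups of a growing dataset."""
    total = 0.0
    size = float(initial_tb)
    for _ in range(weeks):
        total += size              # another full copy lands on backup media
        size *= 1 + weekly_growth  # the dataset grows before the next backup
    return total / dedup_ratio     # deduplication shrinks what is actually stored

raw_tb = cumulative_backup_tb(20, 5, 0.02)           # no deduplication: ~104 TB
deduped_tb = cumulative_backup_tb(20, 5, 0.02, 4.5)  # ~4.5-to-1 ratio: ~23 TB
```

Plug in your own retention window, growth rate and measured deduplication ratio to estimate the savings for your environment.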
All things being equal, we can see that data deduplication can offer dramatic savings in data center floor space, tape media costs, and tape storage and shipping costs. And, when used in conjunction with disk as the backup medium, it offers much faster recovery if something goes wrong.
The green aspects of data deduplication even extend outside the data center, to the trucks that are no longer required to ship bulky tapes offsite. I haven’t even mentioned yet how data deduplication can improve disaster recovery: needing less WAN bandwidth to replicate data is a major benefit. Another is that if you send less, you store less on the other side, which lowers the cost of storage, power and cooling at the DR location. So you can see, the value and the benefits can add up fast, and that adds up to a greener world for you in more ways than one.