Regulatory compliance and business continuity demands are putting increasing pressure on enterprises to find ways to store the rapidly growing amount of data that is being generated. One technology businesses are turning to is data deduplication.
Data deduplication removes duplicate chunks of information before it is stored, but vendors vary widely in what “before it is stored” actually means. Deduplication can take place in-line or post-process, at the source or at the target, in software or in hardware, and on a SAN, NAS or VTL (virtual tape library).
Have no fear: all of this will be explained, and the confusion you’re now feeling is part of my point. Data deduplication means different things to different vendors and practitioners; some of it is real and some of it is product positioning. It’s our job at eWEEK Labs to separate the wheat from the chaff and to help you make informed decisions.
Different vendors and products take different approaches, so compare apples with apples when researching and, above all, focus on solutions that apply to your needs.
Industry research companies say the data deduplication market could exceed $1 billion in 2009, and Data Domain, NetApp, IBM, EMC, FalconStor Software, ExaGrid Systems, Sepaton, NEC and Quantum all have products competing in some way in this space.
On the most general level, all data deduplication uses a similar process regardless of where and when it operates. Data is broken down into segments, a fingerprint of each segment is computed and that fingerprint is compared with all of the other fingerprints already in the system. If the fingerprint is unique, then the segment is written to storage. If the fingerprint is not unique, meaning that the incoming data segment is equal to a segment already stored, an additional reference to the previously stored unique segment is all that’s needed.
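To make that flow concrete, here is a minimal sketch of the generic segment-and-fingerprint loop in Python. The fixed 4KB segment size, the SHA-256 hash and the in-memory index are illustrative assumptions on my part, not a description of any particular vendor’s implementation.

```python
# A minimal sketch of the generic dedupe loop: segment the data, fingerprint
# each segment and store only segments whose fingerprints are new. Segment
# size, hash choice and the in-memory index are illustrative assumptions.
import hashlib

SEGMENT_SIZE = 4096  # bytes; an arbitrary choice for the example

class DedupeStore:
    def __init__(self):
        self.segments = {}   # fingerprint -> unique segment data
        self.refs = []       # ordered fingerprints needed to rebuild the stream

    def write(self, data: bytes) -> None:
        for i in range(0, len(data), SEGMENT_SIZE):
            segment = data[i:i + SEGMENT_SIZE]
            fp = hashlib.sha256(segment).hexdigest()
            if fp not in self.segments:   # unique segment: write it to storage
                self.segments[fp] = segment
            self.refs.append(fp)          # duplicate: keep only a reference

    def read(self) -> bytes:
        # Rebuild the original stream from references to unique segments.
        return b"".join(self.segments[fp] for fp in self.refs)
```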
Data deduplication in any form acts as a filter through which data flows between the operating system and storage, transforming and translating information so that it can be stored more efficiently.
It’s like adding a layer to the various ways we access physical drives, a.k.a. the disk operating system, or DOS. This is not to be confused with, say, MS-DOS, which is an actual operating system; I mean the subsystem of any operating system that handles high-level disk input and output. Software, especially these days, is all about abstraction (this being what makes virtualization possible). Hardware doesn’t exist in and of itself; it exists as it is defined in the operating system and applications.
Typically, when an operating system builds a file system on physical disk storage, several layers of abstraction already exist: internal to the disk itself; between the disk and the controller; and between the controller and the OS. I could probably spend another 1,000 words on the details, but my point remains the same: it helps to conceptualize data deduplication as one more layer of abstraction on top of a stack that is already built from layers of abstraction. Only this time, the new layer adds efficiency to the storage process.
An enterprise wrestling with any number of storage problems should consider the potential impact of data deduplication technology. Many enterprises find they can back up more data in a shorter time frame. When using deduplication for backup and archiving, some enterprises have seen five- to 25-times reductions in the size of their stored data; at a 20-to-1 ratio, for example, 10TB of backup data would occupy only about 500GB of physical storage. For widely used but rarely changed files, the reduction can sometimes be as high as 100 times.
As a caveat, if you implement a deduplication system, it will take several months to truly understand how the technology functions in your environment. This process is very much like baselining network performance: a single reading is of little value, and what really matters is performance over time.
There are other benefits of data deduplication. It can significantly reduce the amount of data transferred over the network, and decreasing the amount of data backed up and archived also reduces the need for storage capacity. Eliminating unnecessary storage lowers power, cooling and space requirements in the data center and can slow spending on storage equipment. Data deduplication can also dramatically shorten backup and restore windows because less data is physically written to or read from storage media.
One result of implementing data deduplication is a significant reduction in the amount of disk or tape required to store a given period of backup data. This allows IT departments to shift from retaining backups for days or weeks to retaining them for months or quarters. In situations where backing up to disk would otherwise have been cost-prohibitive, the reduction in the amount of physical storage required can enable organizations to move to disk-to-disk backup, or at least disk-to-disk-to-tape.
In-line or Post-Process?
One of the arguments raging within the data deduplication community is where deduplication operations should take place. The choice pretty much boils down to whether deduplication should occur in-line (only unique data is ever written to storage) or post-process (data is first written in full, then deduped and written elsewhere).
This argument revolves around the way various deduplication products work, even though the overall process is always the same. Data Domain and Permabit would have us believe that in-line deduplication is the wave of the future, while NetApp and Sepaton say post-process is the way to go.
One of the key advantages of in-line processing is that it requires much less storage than post-processing because data is deduped on the fly. Post-processing must write the data to disk before it can be deduped, which means it needs more disk space (to hold both the original and the deduped copy). However, more can go wrong in-line than post-process; in the event of a catastrophic failure during the dedupe process, at least with post-process you’d still have the original, non-deduped data.
It would not be too far-fetched to imagine a scenario in which a crash on an in-line dedupe system used for primary storage corrupts or drops mission-critical data.
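As a rough illustration of the difference, the sketch below contrasts the two write paths, reusing the DedupeStore idea from the earlier example. The staging list and the function names are my own illustrative stand-ins, not any product’s architecture.

```python
# Hedged sketch of the two approaches; "staging" and "dedupe_store" are
# illustrative names, not any vendor's terminology.

def inline_write(data: bytes, dedupe_store) -> None:
    # In-line: segments are fingerprinted and deduped on the write path,
    # so only unique segments ever reach back-end storage.
    dedupe_store.write(data)

def post_process_write(data: bytes, staging: list) -> None:
    # Post-process: the full, untouched data stream lands on disk first...
    staging.append(data)

def post_process_job(staging: list, dedupe_store) -> None:
    # ...and a scheduled job dedupes the staged data later. Until it runs you
    # hold both copies; if dedupe fails, the original data is still intact.
    while staging:
        dedupe_store.write(staging.pop(0))
```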
Source or Target?
Another way to classify dedupe products is by whether they operate at the storage source or the storage target. When dedupe takes place at the source, only unique segments are transferred to the target. Target deduplication identifies duplicate data once it reaches the target. The choice doesn’t affect deduplication rates, but it does have implications for your environment.
Source dedupe decreases the amount of data transferred over the network but costs performance at the source, which is a broad statement because the “source” can be anything from a workstation to a mission-critical server, NAS or SAN device. There are also many variations in the definition of “target”: a target can be a NAS, a SAN, a VTL or a combination of the above.
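To show what source dedupe saves on the wire, here is a hedged sketch of a source-side transfer built on the same fingerprinting idea as before. The target object and its filter_unknown and store methods are hypothetical names standing in for whatever protocol a given product actually uses.

```python
# Hypothetical source-side dedupe: fingerprint locally, ask the target which
# fingerprints it already holds, then send only the missing segments.
import hashlib

SEGMENT_SIZE = 4096  # illustrative, as in the earlier sketch

def source_side_backup(data: bytes, target) -> None:
    # Fingerprint the data at the source.
    segments = {}
    for i in range(0, len(data), SEGMENT_SIZE):
        chunk = data[i:i + SEGMENT_SIZE]
        segments[hashlib.sha256(chunk).hexdigest()] = chunk
    # One exchange of (small) fingerprints tells us what the target lacks.
    missing = target.filter_unknown(list(segments))
    # Only the unique segments cross the network.
    target.store({fp: segments[fp] for fp in missing})

# Target dedupe, by contrast, ships every segment and lets the target discard
# duplicates on arrival: lighter on the source, heavier on the network.
```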
Software vs. Hardware
Another battle raging is whether you’re better off with deduplication in software or in hardware, and there are arguments in favor of both. Much data deduplication software is hardware-agnostic, which makes implementation and migration easier and less disruptive: once you build your storage policies in the management software, you can scale the hardware up or down without recreating them. On the other hand, deduplication running as an application-layer process on generic hardware can’t match the performance of an appliance-based solution built on purpose-built hardware and a purpose-built operating system.
How Dedupe Can Be Used
It’s important to understand that only the right combination of in-line or post-process, source or target, and hardware or software will provide the benefits your organization needs. Let’s also throw into the mix that dedupe can be run on primary or secondary storage. Setting aside the philosophy behind the different implementation strategies, your task is to decide whether the benefits outweigh the costs and risks in your environment and usage scenario.
Certain solutions will perform better but be more disruptive; others will be less disruptive but may not solve the root cause of your storage problems. And don’t forget about your users: whichever solution you select, any negative impact on them needs to be minimized.
One thing is for certain: the reduction inherent in the data deduplication process comes from writing small changes to files over time, a pattern that is more likely to occur on secondary than on primary storage. My expectation for primary storage deduplication is that the percentage reduction in physical storage will inherently be smaller, because most users don’t work by making a single small change to a file. Databases may work like this, but I certainly don’t.
Writing this article is a perfect example of both when deduplication would be beneficial and when it wouldn’t. I wrote the article in two sittings, saving the same file with significant changes on top of itself, so there would be very little duplicate data to remove.
However, assuming my editor is in a good mood when I submit this, he will open the file, make very few changes (please) and then save it again. Deduplication may be beneficial in this case.
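To put a toy number on that second case, here is what the DedupeStore sketch from earlier reports when a nearly identical second version follows the first. The file contents are made up, and the edit is appended so that the fixed-size segment boundaries don’t shift; an insertion in the middle would shift them, which is one reason many products use variable-size chunking instead.

```python
# Toy illustration using the DedupeStore sketch defined earlier. The contents
# are random stand-ins for my article; the second save appends a small edit.
import os

store = DedupeStore()

first_save = os.urandom(1_000_000)                # the article after my first sitting
second_save = first_save + b" a few light edits"  # the editor's nearly identical save

store.write(first_save)
unique_before = sum(len(s) for s in store.segments.values())
store.write(second_save)
unique_after = sum(len(s) for s in store.segments.values())

print(f"the second version added only {unique_after - unique_before} new unique "
      f"bytes out of the {len(second_save)} written")
```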
One final caveat to consider is vendor lock-in. What happens if your data dedupe vendor goes out of business or withdraws support for your solution? If data is the lifeblood of your business and one of the above situations takes place, the real bloody mess will be when your head is on the chopping block.
Conclusion
There are a few issues I haven’t touched on, such as the underlying physical storage within various deduplication solutions (RAID level, SAS vs. SATA disks, the role of SSDs). There may also be regulatory concerns: enterprises may worry about whether deduplicating data will expose them to regulatory or legal problems, especially where the law requires that a full and unaltered copy of original data be retained for auditors.
No matter how you slice it, decisions about data deduplication are on the table for many enterprises because the technology can provide great value to those trying to cope with rapidly growing storage requirements. If it’s important to you, then it’s important to me. Using the methodology established in my Data Domain review and the principles outlined in this technical analysis, I will be testing and reviewing data deduplication solutions as a major component of our enterprise storage coverage.
Matthew D. Sarrel is executive director of Sarrel Group, an IT test lab, editorial services and consulting company in New York.