Data deduplication continues to be one of the hot topics across IT for reducing operational costs associated with managing and protecting data. Gartner predicts that by 2012, deduplication technology will be applied to 75 percent of all backups. While this technology will continue to demonstrate tremendous benefit in backup, it is evolving into a disruptive technology across all tiers of storage. Deduplication alone is not a panacea for all storage challenges. Rather, it will increasingly become a critical feature driving the evolution of all data management and protection tasks.
Here are 10 “must-knows” that IT professionals should understand when considering deployment of a data deduplication solution.
1. Data deduplication offers disruptive reduction in the cost of physical storage.
As the amount of data organizations must manage grows exponentially, companies are spending more resources managing and protecting multiple copies of data. To reduce the storage footprint, deduplication technologies systematically examine data sets, storing references to unique data items rather than storing physical copies.
Unlike other reduction technologies, true deduplication technologies are not limited to examining past versions of particular data sets. Instead, redundancy checks are performed against ALL ingested data to maximize potential reduction ratios. Depending on the type of data and storage policies, deduplicating systems can reduce the physical storage needed by a factor of 20x or more when compared to conventional storage systems.
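To make the mechanics concrete, the following Python sketch (an illustration only, not any vendor's implementation) shows a toy chunk store that keeps a single physical copy of each unique chunk plus a per-data-set "recipe" of references; ingesting 20 unchanged full backups consumes roughly one-twentieth of the logical capacity.

    import hashlib
    import os

    CHUNK_SIZE = 4096  # sub-file granularity: a few kilobytes per chunk

    class ChunkStore:
        def __init__(self):
            self.chunks = {}    # fingerprint -> chunk bytes, stored once
            self.recipes = {}   # data set name -> ordered list of fingerprints

        def ingest(self, name, data):
            recipe = []
            for i in range(0, len(data), CHUNK_SIZE):
                chunk = data[i:i + CHUNK_SIZE]
                fp = hashlib.sha256(chunk).hexdigest()
                if fp not in self.chunks:      # checked against all stored data
                    self.chunks[fp] = chunk    # physical copy stored only once
                recipe.append(fp)              # reference, not another copy
            self.recipes[name] = recipe

        def physical_bytes(self):
            return sum(len(c) for c in self.chunks.values())

    store = ChunkStore()
    weekly_full = os.urandom(CHUNK_SIZE * 256)   # ~1 MB of source data
    for week in range(20):                       # 20 unchanged full backups
        store.ingest(f"backup-week-{week}", weekly_full)

    logical = len(weekly_full) * 20
    print(f"logical: {logical} B, physical: {store.physical_bytes()} B, "
          f"ratio: {logical / store.physical_bytes():.0f}x")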
2. Data deduplication offers disruptive reduction in the cost of data transfer.
As data is moved between deduplicating storage systems, transfers of redundant data sets can be largely eliminated. Deduplication systems maintain block-level “fingerprints” that can be efficiently negotiated between endpoints to filter out data that is already shared, so that only unique items need be transferred.
The transfer reductions attainable from the use of deduplication-optimized transfers generally are of the same order as physical storage reductions: 20x or more. Deduplication-optimized transfers will become increasingly critical to all data protection and management tasks moving forward.
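As a rough sketch of how such a negotiation might work (the function names and protocol here are illustrative, not any specific product's replication interface), the source ships fingerprints first and transfers only the chunks the target reports as missing:

    import hashlib
    import os

    CHUNK_SIZE = 4096

    def fingerprints(data):
        return [hashlib.sha256(data[i:i + CHUNK_SIZE]).hexdigest()
                for i in range(0, len(data), CHUNK_SIZE)]

    def replicate(data, target_index):
        """Send only the chunks the target does not already hold."""
        fps = fingerprints(data)
        wanted = {fp for fp in fps if fp not in target_index}   # negotiation step
        sent = 0
        for i, fp in enumerate(fps):
            if fp in wanted:
                chunk = data[i * CHUNK_SIZE:(i + 1) * CHUNK_SIZE]
                target_index[fp] = chunk
                wanted.discard(fp)      # never re-send a duplicate within the set
                sent += len(chunk)
        return sent

    target_index = {}                   # chunks already held by the remote system
    payload = os.urandom(CHUNK_SIZE * 100)
    print("first transfer: ", replicate(payload, target_index), "bytes")   # full data
    print("second transfer:", replicate(payload, target_index), "bytes")   # 0 bytes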
3. Data deduplication differs from older data reduction technologies.
Because deduplication has become the industry buzzword, clever marketing has re-cast compression, byte differential and even file-level single instancing as deduplication technology. However, true deduplication technologies exhibit three key differentiating traits. First, the scope of deduplication is not limited to a single data set or versions of that data set. Rather, each data set is deduplicated against all other stored data, regardless of type or origin.
Second, the granularity of comparison is sub-file and typically small (a few kilobytes or less). Third, as data is examined, a small, globally unique fingerprint is computed for each data chunk and serves as a proxy for the actual data. These fingerprints can be quickly examined to determine redundancy and can be used locally or across remote systems. All of this enables true deduplication technologies to deliver far greater reduction rates, and a more granular, distributed foundation, than other reduction technologies.
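The sketch below (illustrative only) shows the global-scope and fingerprint-as-proxy traits together: two data sets of entirely different origin are deduplicated against one shared index, so the content they have in common is stored only once.

    import hashlib
    import os

    CHUNK_SIZE = 4096
    global_index = {}                        # one index for all data sets

    def ingest(data):
        """Return the number of new physical bytes this data set adds."""
        new_bytes = 0
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            fp = hashlib.sha256(chunk).hexdigest()   # small proxy for the chunk
            if fp not in global_index:               # compare proxies, not bytes
                global_index[fp] = chunk
                new_bytes += len(chunk)
        return new_bytes

    shared = os.urandom(CHUNK_SIZE * 50)             # content common to both sets
    mail_set = b"mail-archive".ljust(CHUNK_SIZE, b"\0") + shared
    file_set = b"file-share".ljust(CHUNK_SIZE, b"\0") + shared

    print(ingest(mail_set))   # stores its header chunk plus all 50 shared chunks
    print(ingest(file_set))   # stores only its own header chunk: 4096 bytes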
4. Variable-length methods are better than fixed-length methods.
Deduplication divides large data sets into numerous small chunks that are checked against a global chunk repository to detect and eliminate duplicates. Fixed-length methods support only one predetermined chunk size, while variable-length schemes allow the size of each chunk to vary based on observations of the overall data set’s structure.
Variable schemes typically offer much better reduction ratios for two reasons. First, because the boundaries of chunks are not fixed, changes (such as the insertion of a small number of bytes) affect only the targeted chunk(s) but not adjacent ones, thereby avoiding the ripple effect inherent in fixed-length methods. Second, the size of all chunks is allowed to vary based on observed redundancies in the data set, allowing more granular comparisons for better matches.
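The difference is easy to see in a toy comparison. The content-defined chunker below uses a simple rolling hash as an illustrative stand-in for production schemes such as Rabin fingerprinting; after a small insertion, only a handful of variable-length chunks change, while every fixed-length chunk after the edit point shifts and must be stored again.

    import hashlib
    import os

    def fixed_chunks(data, size=2048):
        return [data[i:i + size] for i in range(0, len(data), size)]

    def variable_chunks(data, window=48, mask=0x7FF, min_len=512, max_len=8192):
        # Boundaries depend only on the last `window` bytes of content, so an
        # edit disturbs nearby boundaries but later ones fall in the same places.
        B, M = 257, (1 << 61) - 1
        pow_w = pow(B, window, M)
        chunks, start, h = [], 0, 0
        for i, byte in enumerate(data):
            h = (h * B + byte) % M
            if i >= window:
                h = (h - data[i - window] * pow_w) % M
            length = i - start + 1
            if (length >= min_len and (h & mask) == 0) or length >= max_len:
                chunks.append(data[start:i + 1])
                start = i + 1
        if start < len(data):
            chunks.append(data[start:])
        return chunks

    def unique(chunks):
        return {hashlib.sha256(c).hexdigest() for c in chunks}

    original = os.urandom(1 << 18)                                   # 256 KB data set
    edited = original[:100] + b"A FEW INSERTED BYTES" + original[100:]

    for name, chunker in (("fixed   ", fixed_chunks), ("variable", variable_chunks)):
        before, after = unique(chunker(original)), unique(chunker(edited))
        print(name, "new chunks after the edit:", len(after - before))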
5. Data integrity and hash collision worries are ill-founded.
Because deduplication internally replaces redundant copies of data with references to a single copy, some worry that any data integrity breach, including rare but possible hash collisions, could erroneously affect all referencing data sets. However, today’s deduplication systems typically support multiple features to prevent such problems and assure excellent integrity properties. Hash collisions are commonly addressed through the use of multiple, independent hashes for fingerprints or even store-time byte-level comparisons, which detect collisions and assure that the proper data is always stored and retrieved.
Similarly, deduplication systems leverage the nature of hash-based fingerprints for data integrity purposes, using them to provide additional end-to-end integrity checks and to drive recovery schemes that interact with the underlying block-level RAID subsystems.
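As one illustration of the store-time verification approach (a sketch, not any particular vendor's design), a store can confirm a fingerprint match with a byte comparison before treating data as a duplicate, and give a genuinely colliding chunk its own slot:

    import hashlib

    class VerifyingStore:
        def __init__(self):
            self.chunks = {}                      # key -> chunk bytes

        def put(self, chunk):
            fp = hashlib.sha256(chunk).hexdigest()
            suffix = 0
            while True:
                key = f"{fp}:{suffix}"
                existing = self.chunks.get(key)
                if existing is None:
                    self.chunks[key] = chunk      # first chunk with this key
                    return key
                if existing == chunk:             # store-time byte comparison
                    return key                    # true duplicate: reference it
                suffix += 1                       # genuine collision: new slot

        def get(self, key):
            return self.chunks[key]

    store = VerifyingStore()
    a = store.put(b"hello world")
    b = store.put(b"hello world")                 # deduplicated: same key returned
    assert a == b and store.get(a) == b"hello world"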
6. The way data deduplication occurs matters.
Some deduplication vendors argue about the merits of in-line versus post-process deduplication. Both have benefits and drawbacks. In-line methods examine data as it is ingested, attempting to remove redundancies before writing to physical storage. This can reduce the number of I/Os and the space consumed, but at a cost: the deduplication process must keep up with the ingest rate or throttle it back. Post-process methods are designed to allow full-speed ingests, avoiding throttling by deferring deduplication or performing it in parallel. This also comes with a cost: maintaining top ingest performance can require temporary surges in physical capacity.
More recently, “adaptive” methods have emerged that maintain the benefits of in-line processing up to some ingest threshold and then convert to post-process, avoiding throttling by trading physical capacity as ingest rates increase. All of these approaches attempt to balance ingest rate and capacity based on perceptions of customer need. Why this balancing act? The simple reason is that deduplication isn’t free. It requires compute resources and imposes a performance tax that often scales with the amount of data managed. Customers need to be aware of the inherent tradeoffs in any method.
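A minimal sketch of the adaptive idea (thresholds, names and the single-threaded structure are illustrative only, not how any shipping system is built): deduplicate in-line while the engine keeps up, and land raw data for post-process deduplication once a backlog threshold is crossed.

    import hashlib
    from collections import deque

    CHUNK_SIZE = 4096
    INLINE_BUDGET = 4        # chunks per burst the in-line engine can keep up with

    dedup_index = {}         # fingerprint -> chunk (the reduced pool)
    landing_zone = deque()   # raw chunks parked for post-process deduplication

    def _dedup(chunk):
        fp = hashlib.sha256(chunk).hexdigest()
        dedup_index.setdefault(fp, chunk)

    def ingest_burst(chunks):
        """In-line while the engine keeps up; land raw (no throttling) beyond that."""
        inline_left = INLINE_BUDGET
        for chunk in chunks:
            if inline_left > 0:
                _dedup(chunk)                 # reduced before it consumes capacity
                inline_left -= 1
            else:
                landing_zone.append(chunk)    # temporary surge in physical capacity

    def post_process(budget=8):
        """Background pass that drains the landing zone when the system is idle."""
        for _ in range(min(budget, len(landing_zone))):
            _dedup(landing_zone.popleft())

    burst = [bytes([i % 4]) * CHUNK_SIZE for i in range(32)]   # 32 chunks, 4 unique
    ingest_burst(burst)
    print("parked for post-process:", len(landing_zone))       # 28
    while landing_zone:
        post_process()
    print("unique chunks stored:", len(dedup_index))           # 4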
7. Where data deduplication occurs matters.
Some deduplication schemes are distributed in nature, examining and reducing data on the client systems and then pushing the results to a target storage device. Other schemes are designed to be transparent to clients, functioning like traditional storage systems but performing the full deduplication processing on the target side. Client-side schemes often reduce network bandwidth but can impose significant penalties: integration isn’t transparent, since software must be installed on every attached client, and the processing “steals” significant compute cycles (and often local storage) from other applications running on the client.
Target-side systems allow transparent integration and easier performance provisioning, but with a related penalty: they are usually more costly because they include hardware beefed up to support the required deduplication processing, and they do little to reduce the cost of the initial data transfer.
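The sketch below (illustrative functions, not any product's client agent or appliance interface) contrasts what crosses the wire in each placement once the target already holds the data: the client-side path ships only fingerprints and unknown chunks, while the target-side path ships everything and reduces after receipt.

    import hashlib
    import os

    CHUNK_SIZE = 4096
    target_index = {}                                  # chunks held by the target

    def client_side_write(data):
        """Client hashes locally and ships only unknown chunks (plus fingerprints)."""
        sent = 0
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            fp = hashlib.sha256(chunk).hexdigest()     # client CPU spent here
            sent += len(fp)                            # fingerprint crosses the wire
            if fp not in target_index:
                target_index[fp] = chunk
                sent += len(chunk)
        return sent

    def target_side_write(data):
        """Client ships everything; the target deduplicates after receipt."""
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            target_index.setdefault(hashlib.sha256(chunk).hexdigest(), chunk)
        return len(data)                               # full data crosses the wire

    payload = os.urandom(CHUNK_SIZE * 64)
    print("target-side, first copy :", target_side_write(payload))   # 262144 bytes
    print("client-side, second copy:", client_side_write(payload))   # fingerprints only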
8. Data deduplication makes disaster recovery better.
Probably the most overlooked benefit of deduplication today is dramatically improved disaster recovery support. Deduplication transfer optimizations typically have a profound impact on storage system replication, allowing significant reductions in the time required to do initial synchronization and large reductions in the volume of data that is continuously or periodically transferred.
More importantly, deduplication-enabled transfer reductions offer substantial relief when used with slow wire technologies, thereby providing much more flexibility for geographic protection.
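A rough back-of-the-envelope calculation shows the effect on a slow wire; the 5 TB data set, 100 Mbit/s link and 20x reduction below are assumed figures, not benchmarks.

    data_tb = 5
    link_mbps = 100
    reduction = 20

    data_bits = data_tb * 1e12 * 8
    seconds_raw = data_bits / (link_mbps * 1e6)
    seconds_dedup = seconds_raw / reduction

    print(f"raw replication:  {seconds_raw / 3600:6.1f} hours")    # ~111 hours
    print(f"dedup-optimized:  {seconds_dedup / 3600:6.1f} hours")  # ~5.6 hours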
9. Different data sets deduplicate differently.
Deduplication technologies can’t and don’t achieve the same reduction ratios for all types of data. Deduplication is most effective when data sets have high in-file redundancy or are copied and/or stored again after minor edits. In general, unstructured data types like Office files, virtualized disk files, backups, e-mail and archive data sets exhibit very good deduplication ratios, often in the 20-30x range. Structured data sets, such as databases, can also deduplicate well but exhibit lower reduction ratios (more in the 5-8x range).
Why? Because structured data sets tend to exhibit low intrinsic redundancy (applications prune duplicates), contain unique headers that describe the data items, and are copied far less often, typically only under the mediation of the application. To achieve more balance in the ratios customers can expect, most deduplication implementations are evolving to recognize specific data types and perform type-specific pre-processing that enhances deduplication. Variances exist, so customers need to understand the role of pre-processing: some hash-based schemes use pre-processing to improve deduplication rates but don’t require it, while other schemes require it, limiting the overall benefits of deduplication.
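A simple blended-ratio estimate illustrates the point; the data set sizes and per-type ratios below are assumptions consistent with the ranges quoted above.

    unstructured_tb, unstructured_ratio = 40, 25    # Office files, backups, VM images
    structured_tb, structured_ratio = 10, 6         # databases

    physical = unstructured_tb / unstructured_ratio + structured_tb / structured_ratio
    logical = unstructured_tb + structured_tb
    print(f"physical: {physical:.1f} TB, blended ratio: {logical / physical:.1f}x")
    # ~3.3 TB physical for 50 TB logical, a blended ratio of roughly 15x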
10. Getting the full benefits of data deduplication depends on a solid understanding of retention and protection requirements.
Disk and tape storage systems today support various recovery time and recovery point objectives. A good understanding of the overall retention and protection requirements will make it easier to determine when and where deduplication can provide the biggest operational and financial benefits.
Although data deduplication technology has evolved to be easily and transparently deployed, the advantages are greatest when it is applied to the appropriate data sets in the environment. Deduplication-savvy vendors provide good sizing tools and support services that offer tremendous value by setting proper reduction and provisioning expectations, and providing overall performance guarantees.
Jeffrey Tofano previously served as Technical Director at NetApp and Chief Architect at OnStor. Since joining Quantum in 2007, Tofano has spoken on various topics covering emerging data protection concepts and technologies at data protection and storage networking conferences, including Storage Networking Industry Association (SNIA) events, TechTarget Backup School end-user tutorials, and national conferences such as Storage Networking World (SNW) and Symantec Vision. He can be reached at jeffrey.tofano@quantum.com.