How to Deploy Data Deduplication: Top 10 Things to Know

6. The way data deduplication occurs matters.

Some deduplication vendors argue about the merits of in-line versus post-process deduplication. Both have benefits and drawbacks. In-line methods examine data as it is ingested and attempt to remove redundancies before writing to physical storage. This can reduce the number of I/Os and the space consumed, but at a cost: the deduplication process must either keep up with the ingest rate or throttle ingest back. Post-process methods are designed to allow full-speed ingest, avoiding throttling by deferring deduplication or performing it in parallel. This also comes at a cost: to maintain top performance, the system may need temporary surges in physical capacity.
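The capacity side of this tradeoff can be sketched in a few lines. The example below is a minimal illustration, not any vendor's implementation: it assumes fixed-size 4 KB chunks identified by SHA-256 digests, and the names (split_chunks, dedupe_into, ingest_inline, ingest_post_process) are hypothetical. It models only peak physical capacity, not ingest speed.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed chunk size, chosen only for illustration


def split_chunks(data: bytes) -> list:
    """Cut the incoming stream into fixed-size chunks."""
    return [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]


def dedupe_into(chunks: list, store: dict) -> list:
    """Store each unique chunk once, keyed by its SHA-256 digest.

    Returns the ordered list of digests needed to rebuild the stream.
    """
    recipe = []
    for chunk in chunks:
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)  # written only if not already present
        recipe.append(digest)
    return recipe


def stored_bytes(store: dict) -> int:
    return sum(len(chunk) for chunk in store.values())


def ingest_inline(data: bytes, store: dict) -> tuple:
    """Deduplicate before writing, so peak usage equals the unique data."""
    recipe = dedupe_into(split_chunks(data), store)
    return recipe, stored_bytes(store)


def ingest_post_process(data: bytes, store: dict) -> tuple:
    """Land the raw data first, then deduplicate it afterwards."""
    staged = split_chunks(data)
    peak = stored_bytes(store) + len(data)  # raw copy held until dedupe completes
    recipe = dedupe_into(staged, store)
    return recipe, peak


def restore(recipe: list, store: dict) -> bytes:
    return b"".join(store[digest] for digest in recipe)


# Four chunks, of which only two are unique.
data = b"A" * 8192 + b"B" * 4096 + b"A" * 4096

inline_store, post_store = {}, {}
inline_recipe, inline_peak = ingest_inline(data, inline_store)
post_recipe, post_peak = ingest_post_process(data, post_store)

print(inline_peak, post_peak, stored_bytes(post_store))
```

In this toy run both paths end with the same 8,192 bytes stored, but the post-process path briefly needs room for the full 16,384-byte raw copy. That is the temporary capacity surge described above.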

More recently, "adaptive" methods have emerged that maintain the benefits of in-line deduplication up to some ingest threshold and then switch to post-process, avoiding throttling by trading physical capacity for speed as ingest rates increase. All these approaches attempt to balance ingest rate and capacity, based on perceptions of customer need. Why this balancing act? The simple reason is that deduplication isn't free. It requires compute resources and imposes a performance tax that often scales with the amount of data managed. Deduplication customers need to be aware of the tradeoffs inherent in any method.
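An adaptive policy of this kind can be pictured as a simple threshold check. The sketch below is a hypothetical model, not a description of any product: plan_ingest, the 500 MB/s threshold, and the sample rates are all assumptions made for illustration. Each sample represents one second of ingest, so the backlog is in megabytes.

```python
def plan_ingest(rates_mb_per_s: list, inline_limit_mb_per_s: int) -> tuple:
    """Pick a mode for each one-second interval and total the deferred data.

    Intervals at or below the limit are deduplicated in-line. Faster
    intervals land raw and are counted as backlog for later processing.
    """
    modes = []
    backlog_mb = 0
    for rate in rates_mb_per_s:
        if rate <= inline_limit_mb_per_s:
            modes.append("inline")
        else:
            modes.append("post-process")
            backlog_mb += rate  # raw data that temporarily consumes capacity
    return modes, backlog_mb


# Hypothetical ingest rates over five seconds, with an assumed 500 MB/s limit.
modes, backlog_mb = plan_ingest([200, 400, 900, 1200, 300], 500)
print(modes, backlog_mb)
```

Here the two bursts above the threshold are accepted at full speed, at the price of 2,100 MB of raw data waiting to be deduplicated.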

7. Where data deduplication occurs matters.

Some deduplication schemes are distributed in nature, examining and reducing data at the client systems and then pushing the results to a target storage device. Other deduplication schemes are designed to be transparent to clients, functioning like traditional storage systems but performing the full deduplication processing on the target side. Client-side deduplication schemes often reduce network bandwidth but can impose significant penalties: the integration of deduplication isn't transparent, because software must be installed on all attached clients. Client-side deduplication also "steals" significant compute cycles (and often local storage) from other applications running on the client.

Target systems allow transparent integration and offer easier performance provisioning, but with a related penalty: they are usually more costly because they include "beefed-up" hardware to support the required deduplication processing, and they do little to reduce the initial cost of transferring the data.
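The bandwidth difference between the two placements can be sketched as follows. This is a simplified model under stated assumptions: fixed 4 KB chunks, SHA-256 digests, and hypothetical function names (client_side_backup, target_side_backup). It counts only chunk payload bytes and ignores the cost of exchanging digests between client and target.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed chunk size, for illustration only


def chunk_digests(data: bytes) -> list:
    """Return (digest, chunk) pairs for each fixed-size chunk."""
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    return [(hashlib.sha256(c).hexdigest(), c) for c in chunks]


def client_side_backup(data: bytes, target_store: dict) -> int:
    """Client hashes locally and sends only chunks the target lacks.

    Returns the payload bytes that cross the network.
    """
    sent = 0
    for digest, chunk in chunk_digests(data):  # hashing cost is paid on the client
        if digest not in target_store:
            target_store[digest] = chunk
            sent += len(chunk)
    return sent


def target_side_backup(data: bytes, target_store: dict) -> int:
    """Client sends everything; the target deduplicates on arrival."""
    for digest, chunk in chunk_digests(data):  # hashing cost is paid on the target
        target_store.setdefault(digest, chunk)
    return len(data)


# Two hypothetical backups: the second adds one new chunk to the first.
first = b"x" * (3 * CHUNK_SIZE) + b"y" * CHUNK_SIZE
second = first + b"z" * CHUNK_SIZE

client_store, target_store = {}, {}
client_sent = [client_side_backup(first, client_store),
               client_side_backup(second, client_store)]
target_sent = [target_side_backup(first, target_store),
               target_side_backup(second, target_store)]

print(client_sent, target_sent)
```

Both approaches end up storing the same three unique chunks. The difference is where the hashing work happens and how many bytes cross the network to get there.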

8. Data deduplication makes disaster recovery better.

Probably the most overlooked benefit of deduplication today is dramatically improved disaster recovery support. Deduplication transfer optimizations typically have a profound impact on storage system replication, allowing significant reductions in the time required to do initial synchronization and large reductions in the volume of data that is continuously or periodically transferred.

More importantly, deduplication-enabled transfer reductions offer substantial relief when used with slow wire technologies, thereby providing much more flexibility for geographic protection.
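A rough calculation shows why transfer reduction matters on slow links. The figures below are purely illustrative assumptions (1,000 GB of data, a 100 Mbps link, a 10x transfer reduction), and transfer_hours is a hypothetical helper that ignores protocol overhead and latency.

```python
def transfer_hours(logical_gb: float, link_mbps: float,
                   reduction_ratio: float = 1.0) -> float:
    """Estimate hours needed to replicate data over a link.

    logical_gb      -- size of the data before deduplication, in gigabytes
    link_mbps       -- usable link speed in megabits per second
    reduction_ratio -- assumed transfer reduction (1.0 means none)
    """
    bits_on_wire = logical_gb / reduction_ratio * 8 * 1000**3
    seconds = bits_on_wire / (link_mbps * 1000**2)
    return seconds / 3600


full = transfer_hours(1000, 100)           # no reduction
reduced = transfer_hours(1000, 100, 10.0)  # assumed 10x reduction
print(round(full, 1), round(reduced, 1))
```

Under these assumptions the initial synchronization drops from roughly 22 hours to a little over 2 hours on the same link.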

9. Different data sets deduplicate differently.

Deduplication technologies can't and don't achieve the same reduction ratios for all types of data. Deduplication is most effective when data sets have high in-file redundancy or are copied or stored again after minor edits. In general, unstructured data types like Office files, virtualized disk files, backups, e-mail and archive data sets exhibit very good deduplication ratios, often in the 20-30x range. Structured data sets, such as databases, can also deduplicate well but exhibit lower reduction ratios (more in the 5-8x range).

Why? Because structured data sets tend to exhibit low intrinsic redundancy (applications prune duplicates), contain unique headers that describe the data items, and are copied far less often, typically only under the mediation of the application. To bring the ratios customers can expect closer together, most deduplication implementations are evolving to recognize specific data types and apply type-specific pre-processing to enhance deduplication. Implementations vary, so customers need to understand the role of pre-processing. Some hash-based schemes use pre-processing to improve deduplication rates but don't require it. Other schemes require pre-processing, limiting the overall benefits of deduplication.
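It helps to translate the reduction ratios quoted above into space saved, since the relationship is not linear: a ratio of r means the data occupies 1/r of its original size, so the fraction saved is 1 - 1/r. The short sketch below applies that arithmetic to the ranges mentioned in this section. The helper name space_saved_fraction is hypothetical.

```python
def space_saved_fraction(ratio: float) -> float:
    """Fraction of logical capacity saved at a given reduction ratio."""
    return 1 - 1 / ratio


# Ratios taken from the ranges quoted in the text.
for ratio in (5, 8, 20, 30):
    saved_pct = round(space_saved_fraction(ratio) * 100, 1)
    print(f"{ratio}x -> {saved_pct}% saved")
```

A 5x ratio already saves 80 percent of the space, and a 20x ratio saves 95 percent. The gap between structured and unstructured data is therefore smaller in capacity terms than the raw ratios suggest.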

10. Getting the full benefits of data deduplication depends on a solid understanding of retention and protection requirements.

Disk and tape storage systems today support various recovery time and recovery point objectives. A good understanding of the overall retention and protection requirements will make it easier to determine when and where deduplication can provide the biggest operational and financial benefits.

Although data deduplication technology has evolved to be easily and transparently deployed, the advantages are greatest when it is applied to the appropriate data sets in the environment. Deduplication-savvy vendors provide good sizing tools and support services that offer tremendous value by setting proper reduction and provisioning expectations, and providing overall performance guarantees.

Jeffrey Tofano is a recognized industry veteran, having worked in storage and data protection for more than 25 years. As Chief Technology Officer at Quantum, he oversees the company's technological vision and portfolio roadmaps, with an emphasis on integrating data deduplication and file system technologies into the company's broader solutions strategy, as well as protecting critical business data in a virtual world.

Previously, he served as Technical Director at NetApp and Chief Architect at OnStor. Since joining Quantum in 2007, Tofano has spoken on emerging data protection concepts and technologies at data protection and storage networking conferences. These include Storage Networking Industry Association (SNIA) events, TechTarget Backup School end-user tutorials, and national conferences such as Storage Network World (SNW) and Symantec Vision. He can be reached at