How to Deploy Data Deduplication: Top 10 Things to Know - Improved Disaster Recovery Support
6. The way data deduplication occurs matters.
Some deduplication vendors argue about the merits of in-line versus
post-process deduplication. Both have benefits and drawbacks. In-line
methods examine data as it is ingested, attempting to remove
redundancies before writing to physical storage. This can reduce the
number of I/Os and the space consumed, but at a cost: the deduplication
process must keep up with the ingest rate, or ingest must be throttled
back to match it. Post-process methods are designed to allow full-speed
ingests, avoiding throttling by deferring deduplication or running it
in parallel. This also comes at a cost: to maintain top performance,
physical capacity consumption can surge temporarily.
More recently, “adaptive” methods have emerged that maintain the
benefits of in-line up to some ingest threshold and then convert to
post-process, avoiding throttling and trading physical capacity as
ingest rates increase. All these approaches attempt to balance ingest
rate and capacity, based on perceptions of customer need. Why this
balancing act? The simple reason is that deduplication isn’t free. It
requires compute resources and imposes a performance tax that can often
scale with the amount of data managed. Deduplication customers need to
be aware of inherent tradeoffs in any method.
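The capacity side of this tradeoff can be illustrated with a minimal sketch. It assumes a simple hash-based scheme with tiny fixed-size chunks; the function names, the chunk size and the staging model are hypothetical and do not describe any vendor's implementation. The in-line path fingerprints each chunk before writing it, while the post-process path lands everything first and deduplicates afterwards, so its peak footprint is the full raw size.

```python
import hashlib

CHUNK_SIZE = 4  # tiny fixed-size chunks, for illustration only


def split_chunks(data, size=CHUNK_SIZE):
    """Split data into fixed-size chunks (an assumption; real systems vary)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def fingerprint(chunk):
    """Identify a chunk by its SHA-256 digest."""
    return hashlib.sha256(chunk).digest()


def inline_ingest(data):
    """Deduplicate on the ingest path: only unseen chunks are written."""
    store = {}
    stored = 0
    peak = 0
    for chunk in split_chunks(data):
        fp = fingerprint(chunk)  # this work sits on the ingest path
        if fp not in store:
            store[fp] = chunk
            stored += len(chunk)
        peak = max(peak, stored)
    return {"final_bytes": stored, "peak_bytes": peak}


def post_process_ingest(data):
    """Land all data first, then deduplicate in a later pass."""
    staged = split_chunks(data)  # written at full speed, no fingerprinting
    peak = sum(len(c) for c in staged)  # temporary surge: the raw size
    store = {}
    for chunk in staged:  # deferred pass; staged space is reclaimed after
        store.setdefault(fingerprint(chunk), chunk)
    final = sum(len(c) for c in store.values())
    return {"final_bytes": final, "peak_bytes": max(peak, final)}


if __name__ == "__main__":
    sample = b"ABCDABCDABCDWXYZ"
    print("in-line:     ", inline_ingest(sample))
    print("post-process:", post_process_ingest(sample))
```

Both paths end with the same stored bytes. They differ in where the fingerprinting work happens and in how much physical capacity is held in the meantime.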
7. Where data deduplication occurs matters.
Some deduplication schemes are distributed in nature, examining and
reducing data at the client systems and then pushing the results to a
target storage device. Other deduplication schemes are designed to be
transparent to clients, functioning like traditional storage systems
but performing the full deduplication processing on the target side.
Client-side deduplication schemes often reduce network bandwidth
consumption but can impose significant penalties: the integration of
deduplication isn’t transparent, requiring software to be installed on
all attached clients, and it “steals” significant compute cycles (and
often local storage) from other applications running on the client.
Target systems allow transparent integration and offer easier
performance provisioning, but with a related penalty: they are usually
more costly because their hardware is “beefed up” to support the
required deduplication processing, and they do little to reduce the
initial transfer cost of data.
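The bandwidth difference between the two placements can be sketched as follows. This is an illustrative model only: the Target class, its missing/put methods and the fingerprint exchange are hypothetical and are not taken from any real product or protocol. The client-side path sends fingerprints first and then only the chunks the target lacks, while the target-side path sends everything and lets the target deduplicate.

```python
import hashlib

CHUNK_SIZE = 4096
DIGEST_BYTES = 32  # size of a raw SHA-256 digest


def split_chunks(data, size=CHUNK_SIZE):
    return [data[i:i + size] for i in range(0, len(data), size)]


def fingerprint(chunk):
    return hashlib.sha256(chunk).digest()


class Target:
    """Hypothetical deduplicating storage target."""

    def __init__(self):
        self.store = {}  # fingerprint -> chunk

    def missing(self, fingerprints):
        """Return the fingerprints the target does not hold yet."""
        return [fp for fp in fingerprints if fp not in self.store]

    def put(self, fp, chunk):
        self.store[fp] = chunk


def client_side_backup(data, target):
    """Client fingerprints locally; returns bytes sent over the network."""
    by_fp = {}
    order = []
    for chunk in split_chunks(data):  # compute cycles spent on the client
        fp = fingerprint(chunk)
        order.append(fp)
        by_fp[fp] = chunk
    sent = len(order) * DIGEST_BYTES  # the fingerprint list itself
    for fp in target.missing(set(order)):
        target.put(fp, by_fp[fp])
        sent += len(by_fp[fp])
    return sent


def target_side_backup(data, target):
    """Client sends everything; the target does all deduplication work."""
    for chunk in split_chunks(data):
        target.store.setdefault(fingerprint(chunk), chunk)
    return len(data)


if __name__ == "__main__":
    data = b"A" * 4096 + b"B" * 4096 + b"A" * 4096 + b"A" * 4096
    t_client, t_target = Target(), Target()
    print("client-side, first backup: ", client_side_backup(data, t_client))
    print("client-side, second backup:", client_side_backup(data, t_client))
    print("target-side, any backup:   ", target_side_backup(data, t_target))
```

In this toy model the stored result is identical either way. The difference is who pays: the client pays in compute, the network pays in bytes.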
8. Data deduplication makes disaster recovery better.
Probably the most overlooked benefit of deduplication today is
dramatically improved disaster recovery support.
Deduplication transfer optimizations typically have a profound impact
on storage system replication, allowing significant reductions in the
time required to do initial synchronization and large reductions in the
volume of data that is continuously or periodically transferred.
More importantly, deduplication-enabled transfer reductions offer
substantial relief when used over slow network links, thereby
providing much more flexibility for geographic protection.
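A rough back-of-the-envelope calculation shows why this matters on slow links. The data size, link speed and 10:1 reduction ratio below are hypothetical values chosen for illustration, and the estimate ignores protocol overhead, latency and compression.

```python
def transfer_hours(payload_bytes, link_mbps):
    """Hours needed to move payload_bytes over a link of link_mbps Mbit/s."""
    bits = payload_bytes * 8
    seconds = bits / (link_mbps * 1_000_000)
    return seconds / 3600


def replicated_bytes(logical_bytes, reduction_ratio):
    """Bytes actually sent if replication transfers only deduplicated data."""
    return logical_bytes / reduction_ratio


if __name__ == "__main__":
    logical = 10e12  # 10 TB of logical data (hypothetical)
    link = 100       # 100 Mbit/s WAN link (hypothetical)
    ratio = 10.0     # assumed reduction ratio, not a measured figure

    full = transfer_hours(logical, link)
    reduced = transfer_hours(replicated_bytes(logical, ratio), link)
    print(f"without deduplication: {full:.1f} hours")
    print(f"with {ratio:.0f}:1 reduction:    {reduced:.1f} hours")
```

Under these assumptions the initial synchronization drops from roughly nine days to under one day, which is the kind of difference that makes a remote site practical.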
9. Different data sets deduplicate differently.
Deduplication technologies can’t and don’t achieve the same
reduction ratios for all types of data. Deduplication is most effective
when data sets have high in-file redundancy or are copied and/or
stored again after minor edits. In general, unstructured data types
like Office files, virtualized disk files, backups, e-mail and archive
data sets exhibit very good deduplication ratios, often in the 20-30x
range. Structured data sets, such as databases, can also deduplicate
well but exhibit lower reduction ratios (more in the 5-8x range).
Why? Because structured data sets tend to exhibit low intrinsic
redundancy (applications prune duplicates), contain unique headers that
describe the data items, and are copied far less often, and
typically only under the mediation of the application. To achieve more
balance in the ratios customers can expect, most deduplication
implementations are evolving to recognize specific data types,
performing specific pre-processing to enhance deduplication. Variances
exist, so customers need to understand the role of pre-processing. Some
hash-based schemes use pre-processing to improve deduplication rates
but don’t require it. Other schemes require pre-processing, limiting
the overall benefits of deduplication.
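One way to see how much the data set itself drives the result is to measure a reduction ratio directly. The sketch below assumes fixed-size chunks and SHA-256 fingerprints, which is a simplification. The two sample inputs are synthetic extremes, not representative of the ratios quoted above.

```python
import hashlib


def dedup_ratio(data, chunk_size=4096):
    """Logical bytes divided by the bytes left after removing repeat chunks."""
    if not data:
        raise ValueError("data must not be empty")
    unique = {}  # fingerprint -> chunk length
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        unique.setdefault(hashlib.sha256(chunk).digest(), len(chunk))
    return len(data) / sum(unique.values())


if __name__ == "__main__":
    # Eight identical 4 KiB chunks: the same content stored again and again.
    redundant = b"A" * 4096 * 8
    # Eight chunks that each differ: little intrinsic redundancy.
    distinct = b"".join(i.to_bytes(4, "big") * 1024 for i in range(8))

    print("redundant data:", dedup_ratio(redundant))
    print("distinct data: ", dedup_ratio(distinct))
```

Running a measurement like this against a sample of real data, with the chunking the chosen product actually uses, gives a more grounded expectation than a generic ratio.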
10. Getting the full benefits of data deduplication depends on a solid understanding of retention and protection requirements.
Disk and tape storage systems today support various recovery time
and recovery point objectives. A good understanding of the overall
retention and protection requirements will make it easier to determine
when and where deduplication can provide the biggest operational and
financial benefits.
Although data deduplication technology has evolved to be easily and
transparently deployed, the advantages are greatest when it is applied
to the appropriate data sets in the environment. Deduplication-savvy
vendors provide good sizing tools and support services that offer
tremendous value by setting proper reduction and provisioning
expectations, and providing overall performance guarantees.
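As a rough illustration of how retention feeds into sizing, the sketch below estimates physical capacity from a full backup size, the number of retained copies and an expected reduction ratio. The formula and the input values are simplifying assumptions and are no substitute for a vendor's sizing tool. It ignores incrementals, change rate, metadata overhead and any staging space.

```python
def physical_capacity_tb(full_backup_tb, retained_copies, expected_ratio):
    """Estimate physical TB needed to retain a set of full backups."""
    if expected_ratio <= 0:
        raise ValueError("expected_ratio must be positive")
    logical_tb = full_backup_tb * retained_copies
    return logical_tb / expected_ratio


if __name__ == "__main__":
    # Hypothetical policy: 5 TB weekly fulls, 12 copies retained.
    for label, ratio in (("unstructured, ~20x", 20), ("structured, ~5x", 5)):
        needed = physical_capacity_tb(5, 12, ratio)
        print(f"{label}: {needed:.1f} TB physical for 60 TB logical")
```

The same retention policy yields very different capacity needs depending on the data type, which is why reduction expectations should be set per data set.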
Jeffrey Tofano is a recognized industry veteran, having worked in
storage and data protection for more than 25 years. As Chief Technology
Officer at Quantum,
he oversees the company’s technological vision and portfolio roadmaps,
with an emphasis on integrating data deduplication and file system
technologies into the company’s broader solutions strategy, as well as
protecting critical business data in a virtual world.
Previously, he served as Technical Director at NetApp and Chief
Architect at OnStor. Since joining Quantum in 2007, Tofano has spoken
on various topics covering emerging data protection concepts and
technologies for audiences at data protection and storage network
conferences. These include Storage Networking Industry Association
(SNIA) events, TechTarget Backup School end-user tutorials, and
national conferences such as Storage Networking World (SNW) and Symantec
Vision. He can be reached at jeffrey.tofano@quantum.com.