How to Deploy Data Deduplication: Top 10 Things to Know
Data deduplication technologies identify and eliminate redundant data in the enterprise, decreasing the amount of storage capacity required. Data deduplication vendors offer data deduplication solutions that can improve the performance, reliability and efficiency of an IT organization's data backup and recovery efforts. Here, Knowledge Center contributor Jeffrey Tofano explains the top 10 things IT professionals must know before deploying a data deduplication solution in their company.
Data
deduplication continues to be one of the hot topics across IT for
reducing operational costs associated with managing and protecting
data. Gartner predicts that by 2012, deduplication technology will be
applied to 75 percent of all backups. While this technology will
continue to demonstrate tremendous benefit in backup, it is evolving
into a disruptive technology across all tiers of storage. Deduplication
alone is not a panacea for all storage challenges. Rather, it will
increasingly become a critical feature driving the evolution of all
data management and protection tasks.
Here are 10 “must-knows” that IT professionals should understand when considering deployment of a data deduplication solution.
1. Data deduplication offers disruptive reduction in the cost of physical storage.
As the amount of data organizations must manage grows exponentially,
companies are spending more resources managing and protecting multiple
copies of data. To reduce the storage footprint, deduplication
technologies systematically examine data sets, storing references to
unique data items rather than storing physical copies.
Unlike other reduction technologies, true deduplication technologies
are not limited to examining past versions of particular data sets.
Instead, redundancy checks are performed against ALL ingested data to
maximize potential reduction ratios. Depending on the type of data and
storage policies, deduplicating systems can reduce the physical storage
needed by a factor of 20 or more when compared to conventional storage
systems.
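The idea of storing references instead of physical copies can be sketched in a few lines of Python. This is a toy illustration only, not any vendor's implementation: the class name `DedupStore`, the fixed 4 KB chunk size and the use of SHA-256 as the fingerprint are all assumptions made for the example.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed chunk size, chosen only to keep the sketch short


class DedupStore:
    """Hypothetical in-memory store that keeps one physical copy per unique chunk."""

    def __init__(self):
        self.chunks = {}        # fingerprint -> chunk bytes (stored once)
        self.files = {}         # data set name -> ordered list of fingerprints
        self.logical_bytes = 0  # total bytes ingested, before deduplication

    def put(self, name, data):
        refs = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            fp = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(fp, chunk)  # keep only the first copy
            refs.append(fp)
        self.files[name] = refs
        self.logical_bytes += len(data)

    def get(self, name):
        # Rebuild the data set by following its references.
        return b"".join(self.chunks[fp] for fp in self.files[name])

    def physical_bytes(self):
        return sum(len(c) for c in self.chunks.values())

    def reduction_ratio(self):
        return self.logical_bytes / self.physical_bytes()


# Twenty identical "nightly backups" of the same eight distinct chunks.
base = b"".join(hashlib.sha256(str(i).encode()).digest() * 128 for i in range(8))
store = DedupStore()
for day in range(20):
    store.put(f"backup-{day}", base)

print(store.reduction_ratio())  # 20.0
```

The 20:1 result here is an artifact of feeding the store 20 identical copies. Real ratios depend on how much redundancy the data and retention policies actually produce.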
2. Data deduplication offers disruptive reduction in the cost of data transfer.
As data is moved between deduplicating storage systems, transfers of
redundant data sets can be largely eliminated. Deduplication systems
maintain block-level “fingerprints” that can be efficiently negotiated
between endpoints to filter out data already shared so that only unique
items need be transferred.
The transfer reductions attainable from the use of
deduplication-optimized transfers generally are of the same order as
physical storage reductions: 20x or more. Deduplication-optimized
transfers will become increasingly critical to all data protection and
management tasks moving forward.
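The fingerprint negotiation described above might look like the following sketch. It assumes a simple two-step exchange (the source offers fingerprints, the target answers with the ones it lacks) and uses SHA-256. The names `Endpoint`, `missing` and `replicate` are hypothetical and do not correspond to any specific product's protocol.

```python
import hashlib


def fingerprint(chunk):
    # Assumed fingerprint function for the sketch.
    return hashlib.sha256(chunk).hexdigest()


class Endpoint:
    """Hypothetical receiving system that already holds some chunks."""

    def __init__(self):
        self.chunks = {}  # fingerprint -> chunk bytes

    def missing(self, fingerprints):
        # Step 1: the target reports which fingerprints it has never seen.
        return [fp for fp in fingerprints if fp not in self.chunks]

    def receive(self, chunks_by_fp):
        # Step 2: the target stores only the chunks it asked for.
        self.chunks.update(chunks_by_fp)


def replicate(source_chunks, target):
    """Send chunks to the target, transferring only unseen ones.

    Returns the number of chunk bytes actually sent (the small cost of
    exchanging the fingerprints themselves is not counted here).
    """
    by_fp = {fingerprint(c): c for c in source_chunks}
    wanted = target.missing(list(by_fp))
    payload = {fp: by_fp[fp] for fp in wanted}
    target.receive(payload)
    return sum(len(c) for c in payload.values())


a, b, c, d = (bytes([n]) * 4096 for n in range(4))
remote = Endpoint()
first = replicate([a, b, c], remote)       # everything is new to the target
second = replicate([a, b, c, d], remote)   # only chunk d is new
print(first, second)  # 12288 4096
```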
3. Data deduplication differs from older data reduction technologies.
Because deduplication has become the industry buzz, clever marketing
has recast compression, byte differential and even file-level single
instancing as deduplication technology. However, true deduplication
technologies exhibit three key differentiating traits. First, the
scope of deduplication reduction is not limited to a single data set or
versions of that data set. Rather, each data set is deduplicated
against all other stored data, regardless of type or origin.
Second, the granularity of comparison is sub-file level and
typically small (a few kilobytes or less). Third, as data is examined,
a small, globally unique fingerprint for each data chunk is computed
that serves as the proxy for the actual data. These fingerprints can be
quickly examined to determine redundancy and can be used locally or
remotely across systems. All of this enables true deduplication
technologies to deliver far greater reduction rates and a more granular
distributed foundation, compared to other reduction technologies.
4. Variable-length methods are better than fixed-length methods.
Deduplication divides large data sets into numerous small chunks
that are checked against a global chunk repository to detect and
eliminate duplicates. Fixed-length methods only support one
predetermined chunk size, while variable-length schemes allow the size
of each chunk to vary based on observations of the overall data set’s
structure.
Variable schemes typically offer much better reduction ratios for
two reasons. First, because the boundaries of chunks are not fixed,
changes (such as the insertion of a small number of bytes) affect only
the targeted chunk(s) but not adjacent ones, thereby avoiding the
ripple effect inherent in fixed-length methods. Second, the size of all
chunks is allowed to vary based on observed redundancies in the data
set, allowing more granular comparisons for better matches.
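The ripple effect can be seen in a small experiment. The sketch below uses content-defined chunking, one common way to get variable-length chunks: a boundary is declared wherever a hash of the last few bytes meets a condition, so boundaries follow the content rather than byte offsets. The window size, divisor and the use of BLAKE2b are assumptions for illustration. Production systems typically use a true rolling hash for speed and also enforce minimum and maximum chunk sizes, both omitted here.

```python
import hashlib
import random

WINDOW = 16       # bytes examined when deciding on a boundary (assumed)
DIVISOR = 256     # yields a boundary roughly every 256 bytes on average (assumed)
FIXED_SIZE = 256  # chunk size for the fixed-length comparison (assumed)


def fixed_chunks(data, size=FIXED_SIZE):
    return [data[i:i + size] for i in range(0, len(data), size)]


def variable_chunks(data):
    chunks, start = [], 0
    for i in range(len(data)):
        # Hash only the trailing window, so the decision depends on local content.
        window = data[max(0, i - WINDOW + 1):i + 1]
        h = int.from_bytes(hashlib.blake2b(window, digest_size=4).digest(), "big")
        if h % DIVISOR == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


def shared(a, b):
    # Number of distinct chunks that appear in both chunk lists.
    return len(set(a) & set(b))


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(20000))
edited = original[:100] + b"NEW" + original[100:]  # insert three bytes near the start

fixed_shared = shared(fixed_chunks(original), fixed_chunks(edited))
variable_shared = shared(variable_chunks(original), variable_chunks(edited))
print("fixed-length chunks still shared:   ", fixed_shared)
print("variable-length chunks still shared:", variable_shared)
```

With fixed-length chunks, the three inserted bytes shift every later boundary, so essentially nothing downstream matches the stored chunks. With content-defined boundaries, only the chunks around the insertion point change and the rest still deduplicate.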
5. Data integrity and hash collision worries are ill-founded.
Because deduplication internally replaces redundant copies of data
with references to a single copy, some worry that any data integrity
breach (including rare but possible hash collisions) could erroneously
affect all referencing data sets. However, today’s deduplication
systems typically support multiple features to guard against collisions and
assure excellent integrity properties. Hash collision issues are often
dealt with through the use of multiple, different hashes for
fingerprints or even store-time byte-level comparisons. These are used
to detect collisions and assure that the proper data is always
stored and retrieved.
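A store-time byte-level comparison can be sketched as follows. The fingerprint here is deliberately weak (an 8-bit checksum) so that a collision is easy to provoke in a demo. The `VerifyingStore` class and its (fingerprint, index) reference scheme are hypothetical, meant only to show the principle of never trusting a fingerprint match without verifying the bytes.

```python
class VerifyingStore:
    """Hypothetical chunk store that verifies bytes before treating data as a duplicate."""

    def __init__(self, fingerprint):
        self.fingerprint = fingerprint
        self.buckets = {}    # fingerprint -> list of distinct chunks sharing it
        self.collisions = 0  # times two different chunks had the same fingerprint

    def put(self, chunk):
        """Store a chunk and return a reference of the form (fingerprint, index)."""
        fp = self.fingerprint(chunk)
        bucket = self.buckets.setdefault(fp, [])
        for idx, existing in enumerate(bucket):
            if existing == chunk:  # byte-level comparison confirms a true duplicate
                return (fp, idx)
        if bucket:
            self.collisions += 1   # same fingerprint, different bytes
        bucket.append(chunk)
        return (fp, len(bucket) - 1)

    def get(self, ref):
        fp, idx = ref
        return self.buckets[fp][idx]


def weak_fingerprint(chunk):
    # Deliberately poor 8-bit fingerprint, used only to force a collision.
    return sum(chunk) % 256


store = VerifyingStore(weak_fingerprint)
ref_ab = store.put(b"ab")
ref_ba = store.put(b"ba")      # same checksum as b"ab", different content
ref_ab_again = store.put(b"ab")

print(store.get(ref_ab), store.get(ref_ba), store.collisions)  # b'ab' b'ba' 1
```

With a strong cryptographic hash the collision branch would be expected to run rarely, if ever. The comparison trades extra read I/O at store time for the guarantee that a reference always resolves to the bytes that were written.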
Similarly, deduplication systems leverage the complex nature of
hash-based fingerprints for data integrity purposes, using them to
provide additional end-to-end integrity checks and drive complex
recovery schemes that often interact with required block-level RAID
subsystems.