Deduplication is a hot technology. Vendors have responded with a proliferation of approaches and terminologies that seem designed more to confuse than to explain. Global deduplication. Content-aware. Target-based. Source-based. ISV-integrated. So, what does it all mean? And how can businesses know when and how to deploy this new offering?
When it comes to deduplication, it helps to focus on the basics. For example, just what is deduplication and what benefits come from using it?
First, deduplication is a data discovery and indexing technology which decreases the volume of data in a storage or communication system while maintaining complete data access.
By reducing data volume, deduplication decreases the hardware, software, communications and administration costs associated with maintaining and managing the data. Unlike tools such as data classification, which require human analysis and intervention, data deduplication happens automatically.
A deduplication system finds strings of data that are exactly the same, saves the first instance of each unique string, and stores a pointer (index) for every successive copy. Generally, this process operates at the sub-file level. The defining measure of any deduplication product's return on investment is its deduplication ratio, that is, the degree to which it extracts common data and so reduces volume. A 100TB data set with a 2:1 deduplication ratio will result in approximately 50TB of data needing to be stored. That same data set at a 20:1 ratio will result in storing 5TB, while still maintaining application access to all the same information.
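The process described above can be illustrated with a minimal Python sketch. It is not any vendor's implementation: the `DedupStore` class, its fixed-size chunking and its use of SHA-256 digests as pointers are all assumptions made for illustration. It stores each unique chunk once, keeps a list of pointers per file, and reports the resulting deduplication ratio.

```python
import hashlib


class DedupStore:
    """Toy deduplicating store (hypothetical, for illustration only)."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size
        self.chunks = {}        # digest -> chunk bytes, each unique chunk stored once
        self.files = {}         # file name -> ordered list of digests (the "pointers")
        self.logical_bytes = 0  # total bytes written by clients

    def write(self, name, data):
        pointers = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            # Keep only the first instance of each unique chunk.
            self.chunks.setdefault(digest, chunk)
            pointers.append(digest)
        self.files[name] = pointers
        self.logical_bytes += len(data)

    def read(self, name):
        # Follow the pointers to rebuild the full file.
        return b"".join(self.chunks[d] for d in self.files[name])

    def stored_bytes(self):
        return sum(len(c) for c in self.chunks.values())

    def ratio(self):
        return self.logical_bytes / self.stored_bytes()


def stored_size(logical_size, ratio):
    """Capacity needed after deduplication at a given ratio."""
    return logical_size / ratio


store = DedupStore(chunk_size=4)
store.write("monday", b"AAAABBBBCCCC")
store.write("tuesday", b"AAAABBBBDDDD")  # two of three chunks already stored
```

In this toy run, 24 logical bytes occupy 16 stored bytes, and both files can still be read back in full. The `stored_size` helper simply restates the arithmetic in the text: 100TB at 20:1 needs 5TB.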
Different deduplication products accomplish the string discovery and indexing process in different ways. Despite this, there are four basic rules driving which approach is the best fit for an organization.
Rule No. 1: Higher deduplication ratios are good
Higher deduplication ratios are delivered by data intelligence and system scalability. Different deduplication approaches and products deliver different ratios. The success of an approach rests on how effectively the solution finds common strings. Products that operate at the sub-file level and account for variable-length strings tend to discover and extract more duplicate data. Results vary by product and by application usage.
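Why variable-length strings help can be seen in a small experiment. The Python sketch below is an assumption-laden illustration, not a description of any product: the function names, the 16-byte window and the simple rolling-sum boundary test are all hypothetical choices (real systems typically use stronger rolling hashes). It compares fixed-size chunking with content-defined, variable-length chunking after a single byte is inserted at the front of a data set.

```python
import random


def fixed_chunks(data, size=64):
    """Cut the data every `size` bytes, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def content_defined_chunks(data, window=16, divisor=64):
    """Cut wherever the sum of the last `window` bytes hits a target value.

    Boundaries depend on nearby content rather than on absolute position,
    so they realign shortly after an insertion.
    """
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= window:
            rolling -= data[i - window]  # drop the byte leaving the window
        if rolling % divisor == divisor - 1:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


def shared_fraction(old, new):
    """Fraction of the old data's unique chunks that also appear in the new data."""
    old_set = set(old)
    return len(old_set & set(new)) / len(old_set)


rng = random.Random(0)  # seeded so the run is repeatable
original = bytes(rng.randrange(256) for _ in range(20000))
edited = b"X" + original  # one byte inserted at the front

fixed = shared_fraction(fixed_chunks(original), fixed_chunks(edited))
variable = shared_fraction(content_defined_chunks(original),
                           content_defined_chunks(edited))
```

With fixed-size chunks, the one-byte insertion shifts every boundary, so essentially none of the old chunks are found again. With content-defined chunks, only the first few chunks change and the rest are recognized as duplicates.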
For example, backup natively creates many copies of data both across and within systems over time, but the resulting deduplication ratios can vary widely depending on data type, data change rate and even the customer's backup model. An average deduplication ratio of 20:1 or higher is not unusual, but underneath this average may be virtual machine file backups at 40:1, e-mail backups at 15:1 and transactional database backups at 3:1. Solutions claiming "content awareness" often promise higher deduplication ratios. Organizations should ignore the lingo and assess the results. Most vendors offer a tool or consulting approach to help businesses estimate the results their product will deliver in a given environment.
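How per-workload ratios combine into one average can be worked through in a few lines of Python. The ratios below come from the example above; the data volumes are hypothetical, chosen only to illustrate the arithmetic, and `blended_ratio` is an illustrative helper rather than any vendor's sizing tool.

```python
def blended_ratio(workloads):
    """Overall ratio = total logical data / total stored data.

    `workloads` maps a name to (logical_tb, dedup_ratio).
    """
    logical = sum(tb for tb, _ in workloads.values())
    stored = sum(tb / ratio for tb, ratio in workloads.values())
    return logical / stored


# Volumes are assumptions for illustration; ratios are from the text.
environment = {
    "vm_backups": (40, 40),      # 40TB at 40:1
    "email_backups": (15, 15),   # 15TB at 15:1
    "database_backups": (3, 3),  # 3TB at 3:1
}

overall = blended_ratio(environment)
```

Under these assumed volumes, each workload ends up storing about 1TB, so 58TB of backups occupy about 3TB, an overall ratio a little under 20:1. A different mix of data types would shift that average, which is why sizing against the actual environment matters.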