Deduplication is a hot technology. Because of this, many vendors have responded with a proliferation of approaches and terminologies that seem more designed to confuse than to explain. Global deduplication. Content-aware. Target-based. Source-based. ISV-integrated. So, what does it all mean? And how can businesses know when and how to deploy this new offering?
When it comes to deduplication, it helps to focus on the basics. For example, just what is deduplication and what benefits come from using it?
Deduplication explained
First, deduplication is a data discovery and indexing technology which decreases the volume of data in a storage or communication system while maintaining complete data access.
By reducing data volume, deduplication decreases the hardware, software, communications and administration costs associated with maintaining and managing the data. Unlike tools such as data classification, which require human analysis and intervention, data deduplication happens automatically.
A deduplication system finds strings of data that are exactly the same, saves the first instance of each unique string, and stores a pointer (index) for every successive copy. Generally, this process operates at the sub-file level. The definitive measure of any deduplication product’s return on investment is its deduplication ratio, that is, the degree to which it extracts common data and reduces volume. A 100TB data set with a 2:1 deduplication ratio will result in approximately 50TB of data needing to be stored. That same data set at a 20:1 ratio will result in storing 5TB, while still maintaining application access to all the same information.
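The mechanism described above can be sketched in a few lines of Python. This is a minimal illustration, not any vendor’s implementation: the function names, the tiny fixed chunk size and the use of SHA-256 digests as pointers are all assumptions made for the example, and the ratio ignores the space taken by the pointers themselves.

```python
import hashlib

def deduplicate(data: bytes, chunk_size: int = 4):
    """Split data into fixed-size chunks, keep the first instance of each
    unique chunk, and record a pointer (its digest) for every chunk in order."""
    store = {}      # digest -> chunk bytes (first instance only)
    pointers = []   # one digest per chunk, in original order
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)   # later copies are not stored again
        pointers.append(digest)
    return store, pointers

def restore(store, pointers) -> bytes:
    """Rebuild the original data by following the pointers."""
    return b"".join(store[d] for d in pointers)

def dedup_ratio(logical_bytes: int, stored_bytes: int) -> float:
    """Deduplication ratio: data presented divided by data actually stored."""
    return logical_bytes / stored_bytes

data = b"ABCD" * 9 + b"WXYZ"            # 40 bytes, only 2 distinct chunks
store, pointers = deduplicate(data)
stored = sum(len(c) for c in store.values())
print(restore(store, pointers) == data)  # True: full access is preserved
print(dedup_ratio(len(data), stored))    # 5.0, i.e. a 5:1 ratio
print(100 / 20)                          # 5.0 TB stored for 100TB at 20:1
```

Real products typically work on much larger chunks and keep the index on disk, but the relationship between unique data, pointers and the resulting ratio is the same.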
Different deduplication products accomplish the string discovery and indexing process in different ways. Despite this, there are four basic rules driving which approach is the best fit for an organization.
Rule No. 1: Higher deduplication ratios are good
Higher deduplication ratios come from two things: data intelligence and system scalability. Different deduplication approaches and products deliver different deduplication ratios. The success of an approach rests on the solution’s effectiveness in finding common strings. Products that operate at the sub-file level and account for variable-length strings tend to discover and extract more duplicate data. Results vary by product and by application usage.
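One way to see why variable-length strings matter is to compare fixed-size chunking with content-defined chunking when a single byte is inserted at the front of a data set. The sketch below is a toy under stated assumptions: the boundary rule (end a chunk after any byte whose low six bits are all set) is a simplified stand-in for the rolling-hash schemes real products are generally understood to use, and the function names and test data are hypothetical.

```python
def fixed_chunks(data: bytes, size: int = 64):
    """Cut the data every `size` bytes, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def content_defined_chunks(data: bytes, mask: int = 63):
    """Toy rule: end a chunk after any byte whose low six bits are all set,
    so boundaries depend on the content rather than on byte offsets."""
    chunks, start = [], 0
    for i, byte in enumerate(data):
        if byte & mask == mask:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])   # trailing bytes with no boundary
    return chunks

original = bytes(range(256)) * 4
edited = b"\x01" + original           # one byte inserted at the front

def shared(chunker) -> int:
    """Number of distinct chunks found in both versions."""
    return len(set(chunker(original)) & set(chunker(edited)))

print(shared(fixed_chunks))            # 0: every fixed chunk is shifted
print(shared(content_defined_chunks))  # 4: boundaries realign after the edit
```

With fixed offsets, the one-byte shift changes every chunk, so nothing deduplicates against the earlier copy. With content-defined boundaries, only the first chunk differs.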
For example, backup natively creates many copies of data both across and within systems over time, but the resulting deduplication ratios can vary widely depending on data type, data change rate and even the customer’s backup model. An average deduplication ratio of 20:1 or higher is not unusual, but underneath this average may be virtual machine file backups at 40:1, e-mail backups at 15:1 and transactional database backups at 3:1. Solutions claiming “content awareness” often promise higher deduplication ratios. Organizations should ignore the lingo and assess the results. Most vendors offer a tool or consulting approach to help businesses size what results their product will deliver for an environment.
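An overall ratio is a volume-weighted blend of the per-workload ratios, which a short calculation makes concrete. The per-workload ratios below come from the example above; the data volumes are hypothetical figures chosen only for illustration, as is the function name.

```python
def blended_ratio(workloads):
    """workloads: list of (logical_tb, ratio) pairs.
    Returns total logical data divided by total stored data."""
    logical = sum(tb for tb, _ in workloads)
    stored = sum(tb / ratio for tb, ratio in workloads)
    return logical / stored

# (logical TB, deduplication ratio); the volumes are assumptions
workloads = [
    (120, 40),  # virtual machine file backups at 40:1 -> 3TB stored
    (30, 15),   # e-mail backups at 15:1               -> 2TB stored
    (3, 3),     # transactional database backups at 3:1 -> 1TB stored
]
print(blended_ratio(workloads))  # 25.5, i.e. 153TB presented, 6TB stored
```

Shifting the mix toward the database workload pulls the blended figure down quickly, which is why sizing against an organization’s own data matters more than a headline average.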
System scalability
Beyond each approach’s discovery technology, organizations must also consider the system’s ability to scale. Different products support different levels of index scalability. This is important not only for system robustness (absolutely critical in any system storing one copy of data to serve numerous applications and users), but also because index scalability impacts the deduplication ratio.
A system supporting a “block pool” of unique data only up to 5TB will need to store a duplicate string every time it crosses the 5TB boundary, while a system with a 140TB index won’t store duplicate data until it reaches 140TB. By this measure, if these two systems had exactly the same deduplication effectiveness, the more scalable system would still have a deduplication ratio 28 times higher and would store 1/28th the volume of data. This is direct savings to the bottom line.
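The effect of index capacity can be illustrated with a small simulation. This is a rough sketch under loud assumptions: the model in which a full block pool is closed and a new, empty one is started is a simplification invented for the example, chunk counts stand in for terabytes, and the workload (ten identical full backups) is hypothetical. The size of the gap depends on the workload; in this toy it equals the number of repeated backups.

```python
def stored_chunks(stream, index_capacity: int) -> int:
    """Count chunks physically stored when the index can only track
    `index_capacity` unique chunks before a new block pool is started."""
    index, stored = set(), 0
    for chunk_id in stream:
        if chunk_id in index:
            continue              # duplicate found: only a pointer is written
        if len(index) >= index_capacity:
            index = set()         # pool is full: start a new, empty pool
        index.add(chunk_id)
        stored += 1
    return stored

# Hypothetical workload: 10 full backups of the same 140 unique chunks
stream = list(range(140)) * 10

small = stored_chunks(stream, index_capacity=5)
large = stored_chunks(stream, index_capacity=140)
print(small, len(stream) / small)  # 1400 1.0  -> no savings at all
print(large, len(stream) / large)  # 140 10.0  -> a 10:1 ratio
```

The small index never sees a repeat because each duplicate arrives long after its pool has been closed, while the large index recognizes every one.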
For a deduplication product to extract duplicate data, the duplicate data must be there to find. Primary application or even archive data rarely has the same level of native data duplication as backup. Hence, one deduplication approach does not fit all. A deduplication approach which is more “weighty” in resource usage may be valuable in backup, but it may make no sense to use it on primary or archive data sets where the duplicate data simply doesn’t exist.
Rule No. 2: Price performance is important
As with any data management technology, data transfer and compute speeds are important. This is particularly true when the deduplication technology is “in-line.” In this case, the deduplication product must be fast enough not to throttle the backup process. Even with deduplication offerings that run “deferred,” be sure the system delivers enough performance to ensure that yesterday’s backup data is stored, replicated off-site (if desired), extracted to tape (if desired) and deleted (by policy) before the next day’s backup window. The system should be able to provide sufficient performance without the need for unique, high-cost proprietary hardware.
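A quick way to sanity-check the performance requirement is to work out the sustained throughput a backup window implies. The sketch below uses decimal units (1TB = 1,000,000MB); the function names and the 18TB, 10-hour and MB/s figures are hypothetical values chosen for illustration, not numbers from any product.

```python
def required_throughput_mb_s(data_tb: float, window_hours: float) -> float:
    """Sustained MB/s needed to process data_tb within window_hours."""
    return data_tb * 1_000_000 / (window_hours * 3600)

def fits_in_window(data_tb: float, rate_mb_s: float, window_hours: float) -> bool:
    """True if a system sustaining rate_mb_s finishes inside the window."""
    return required_throughput_mb_s(data_tb, window_hours) <= rate_mb_s

# Assumed nightly workload: 18TB to ingest in a 10-hour window
print(required_throughput_mb_s(18, 10))  # 500.0 MB/s
print(fits_in_window(18, 400, 10))       # False: the backup would be throttled
print(fits_in_window(18, 600, 10))       # True
```

For a deferred system, the same check applies to the full cycle of storing, replicating, copying to tape and deleting, measured against the time until the next backup window opens.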
Rule No. 3: Deduplication decreases data volume where it runs
Deduplication decreases data volume where it runs, but it also causes work where it operates. Deduplication products offer the ability to extract data volumes at different locations in the architecture: on the production server, on a backup media or index server, or on a specialized appliance. The selection of location depends on the value organizations want to extract, as well as the resources they are willing to use to pay for it.
If, for example, there are 500 remote branches with limited bandwidth, and an organization wants to centralize backup, a product optimized to run at the source (“source-based”) on the production servers will reduce data over the wire, creating large communication savings.
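The saving over the wire comes from the source checking which data the central site already holds before sending anything. The sketch below shows that exchange in miniature. It is an illustration only: the `BackupServer` class, its `missing` and `upload` methods, and the 4-byte chunks are hypothetical, and the byte count ignores the small cost of sending the digests themselves.

```python
import hashlib

class BackupServer:
    """Hypothetical central target holding the chunks already stored."""
    def __init__(self):
        self.chunks = {}                       # digest -> chunk bytes

    def missing(self, digests):
        """Return the digests the server does not yet hold."""
        return [d for d in digests if d not in self.chunks]

    def upload(self, chunks_by_digest):
        self.chunks.update(chunks_by_digest)

def source_side_backup(data: bytes, server: BackupServer, chunk_size: int = 4) -> int:
    """Send only chunks the server lacks; return payload bytes sent."""
    chunks = {}
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        chunks[hashlib.sha256(chunk).hexdigest()] = chunk
    needed = server.missing(list(chunks))      # ask before sending any data
    server.upload({d: chunks[d] for d in needed})
    return sum(len(chunks[d]) for d in needed)

server = BackupServer()
print(source_side_backup(b"AAAABBBBCCCC", server))  # 12: first backup sends all
print(source_side_backup(b"AAAABBBBDDDD", server))  # 4: only the new chunk
```

The hashing runs on the production server, which is the resource cost of choosing this location.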
By contrast, a 12TB data center needing to reduce data storage for rapid backup and off-site vaulting using a specialized appliance (“target-based”) will reap savings in storage as well as communications, as the data is electronically vaulted to the disaster recovery site.
Sadly, there is still no such thing as a free lunch: finding and indexing takes system resources wherever it runs. “Target-based” systems accommodate this reality by controlling their own resources in an appliance, such as network-attached storage (NAS) or a virtual tape library (VTL), that plugs into the backup environment. This may initially appear expensive, but it allows organizations to transparently add deduplication value without the need to redesign their current backup architecture.
By contrast, if production servers are already heavily used, a source-based deduplication process will impact production. In another common case, operating “ISV-embedded” deduplication as a feature on a traditional backup media or index server may appear transparent and inexpensive but can require a complete redesign of the backup system. The new workload caused by the deduplication process will cause any already burdened backup systems to blow out existing resources, driving the need for a new media server and rebalancing of the backup environment.
Rule No. 4: Integration with existing tools is valuable
There is high value in the ability to integrate with existing backup processes, management interfaces and tools. The ease of integration derives more from the approach and sophistication of a deduplication product than from where the process operates. For example, the importance of strong tape integration for vaulting should not be overlooked. Integration with backup software also varies widely, particularly if organizations want to operate a disk-based (versus virtual tape) backup model.
ISV-embedded deduplication clearly has strong value here. For Symantec NetBackup customers wanting to do disk-based backup, the availability of OpenStorage (OST) also offers strong management and integration possibilities across multiple complementary target appliances. Organizations can use OST with a certified target appliance to manage deduplication, replication and copy-to-tape functions, all through the NetBackup administration console. More information about management and integration options can be found at the SNIA website.
Janae Lee is Senior Vice President of Marketing at Quantum. Janae has over 30 years of experience in the storage market, including nine years of focus on deduplication. Janae can be reached at janae.lee@quantum.com.