Data deduplication promises to use enterprise storage more efficiently, reducing the need to buy as much media (tape or disk) and, as a result, saving space, power and cooling in the data center. Unfortunately, the term can have almost as many meanings as there are technologies used to achieve it.
Broadly, the term applies to technologies that analyze data files, find and remove redundant blocks of information, and then apply a compression algorithm, usually gzip or LZ. In general, files that are edited frequently but with few changes between versions are excellent candidates for deduplication. For this reason, many businesses are turning to deduplication solutions to reduce the storage required to back up and archive corporate databases, e-mail server message stores and virtual machine images. If your WAN pipes are saturated with such traffic, you will definitely want to keep reading.
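To make the block-level idea concrete, here is a minimal sketch of deduplication followed by compression. It is an illustration only, not Data Domain's implementation: it uses fixed-size chunks and SHA-256 fingerprints, whereas commercial appliances typically use variable-size chunking and their own fingerprinting schemes.

```python
import gzip
import hashlib

CHUNK_SIZE = 4096  # fixed-size chunks; real appliances typically chunk on content boundaries

def dedupe_and_compress(data: bytes, store: dict) -> list:
    """Split data into chunks, store each unique chunk once (gzip-compressed),
    and return the 'recipe' of fingerprints needed to rebuild the data."""
    recipe = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()  # fingerprint the block
        if digest not in store:                     # redundant blocks are skipped;
            store[digest] = gzip.compress(chunk)    # new blocks are compressed and kept
        recipe.append(digest)
    return recipe

def rebuild(recipe: list, store: dict) -> bytes:
    """Reassemble the original data from its chunk recipe."""
    return b"".join(gzip.decompress(store[d]) for d in recipe)

store = {}
original = b"A" * 8192 + b"B" * 8192  # highly redundant 16KB input
recipe = dedupe_and_compress(original, store)
# Four chunks are referenced in the recipe, but only two unique ones are stored.
```

The same principle explains why a second full backup of a mostly unchanged data set consumes very little additional space: nearly every fingerprint is already in the store.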
The data deduplication market is dominated by Data Domain, so we're starting our series of reviews of products in this space with that company. Other prominent players include NetApp, IBM, EMC and Quantum. Traditionally, reviews have focused almost exclusively on the degree of deduplication, that is, the percentage of raw disk space saved. Not only are other factors, such as throughput performance and ease of installation, just as important (if not more so), but space savings are extremely difficult to measure accurately in a laboratory setting, without live data receiving frequent small changes from many clients at once over a period of months or years.
We wanted to approach reviews of data deduplication gear from a different angle. We chose to focus on ease and potential disruptiveness of implementation, throughput performance, manageability and features while testing in our New York City storage lab, and then interview several Data Domain customers about their real-world experience in order to gain insight into actual deduplication rates. Our primary goal was to evaluate the suitability of the Data Domain solution with respect to multi-site business continuity.
Our testing was designed to simulate a three-location company with a data center, a regional headquarters and a branch office. The branch office backed up locally to a DD120 with 350 GB of internal storage, the regional headquarters to a DD510 with 1.2 TB of internal storage, and both of those units replicated to a DD690 at the data center with two external drive enclosures housing 10 TB of storage. Each unit was configured for maximum redundancy, with redundant power supplies, NICs and Fibre Channel controllers, and drive arrays set up for RAID 6 plus hot spares. We tested two separate methodologies: first, using Symantec Veritas NetBackup to back up locally and then replicating between the Data Domain units with Data Domain's own replication technology; second, using Data Domain's OpenStorage (OST) interface to control the entire backup and replication process from NetBackup. Notably, if your organization already uses NetBackup, you can keep all of your existing jobs and policies and merely redirect them from tape drives to Data Domain drives.
Deployment could not have been easier, although some aspects assume an enterprise storage skill set rather than an IT generalist's. Initial installation is done from the CLI, either over telnet or via an attached KVM. I was pleased to see that at first login I was forced to change the default password. We applied licenses for storage, replication and OST; configured network, file system, system and administrative settings; confirmed our settings; rebooted; and then began setting up our CIFS and NFS shares.