How Deduplication Has Evolved to Handle the Deluge of Data
Deduplication comes in several distinct forms, meaning that a variety of solutions exist to aid small and midsize organizations with their backup needs.
Inline Deduplication
This is an "always on" solution that works in real time as data is being written to the system. By indiscriminately deduplicating all incoming data, this process ensures comprehensive capture, but it isn't always efficient: spending time deduplicating data sets with few duplicates wastes processing time and resources such as random access memory (RAM).
Post-Process Deduplication
This method analyzes and eliminates redundant data following a full backup, which yields space savings but also requires storage space on the disk to hold data until it is deduplicated. Since it needs room to store the full backup in the first place, it is counterintuitive for organizations seeking to reduce their storage footprint through deduplication.
Agent-Based Deduplication
This method involves a separate deduplication agent for each system that needs protecting. It can be effective, but buyer beware: it is also expensive, complex and time-consuming. Some vendors position it as an effective solution, but the multiplication of costly systems, software licenses and bandwidth requirements can diminish its overall value.
Target-Side Deduplication
In this method, which can run in real time or post-process, backup data is deduplicated and stored to disk at the target appliance. The backup software acts as the data mover, so users don't need to change backup configurations or policies; the only change required is the destination of the backup streams. This can be an attractive feature, but the data is not deduplicated until it reaches the backup appliance, which requires an extra layer of software rendered unnecessary by more recent advancements in deduplication. Post-processing is also often combined with this technology, making these systems less storage-efficient.
Source-Side Deduplication
Considered the next generation of deduplication technology, this method backs up only new and unique data at the source. After an initial full snapshot backup is taken and saved to a recovery point server, future backups capture only new, incremental changes to the data, which yields dramatic efficiencies in required bandwidth, storage, and data protection and recovery across multiple sites. The advantage of source-side deduplication is the reduction of data sent across the network and the resulting performance gain.
Global, Source-Side Deduplication
Global deduplication is optimized source-side deduplication. With this method, every computer, virtual machine or server across local, remote and virtual sites communicates with a recovery point server (RPS) that manages a global database index of all associated files while intuitively determining what needs to be backed up. Then, the RPS pulls only new data as required while eliminating duplicate copies. It then shares the deduplicated intelligence across all source systems. Since backup data is globally deduplicated before it is transferred to the target RPS, only changes are sent over the network, which improves performance and reduces bandwidth usage.
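As a rough sketch of this workflow, the following illustrates how only globally new blocks travel to the RPS. The `RecoveryPointServer` class, the SHA-256 block fingerprints and the toy four-byte block size are all illustrative assumptions, not any vendor's actual implementation:

```python
import hashlib

BLOCK_SIZE = 4  # toy block size for illustration; real systems use far larger blocks


class RecoveryPointServer:
    """Hypothetical RPS keeping a global index of block fingerprints across all sources."""

    def __init__(self):
        self.index = {}  # fingerprint -> block bytes

    def missing(self, fingerprints):
        """Return only the fingerprints this server has never seen, from any source."""
        return [f for f in fingerprints if f not in self.index]

    def store(self, blocks):
        for block in blocks:
            self.index[hashlib.sha256(block).hexdigest()] = block


def backup(source_data, rps):
    """Client side: fingerprint blocks, ask the RPS which are new, send only those."""
    blocks = [source_data[i:i + BLOCK_SIZE] for i in range(0, len(source_data), BLOCK_SIZE)]
    new = {}  # deduplicate within this job before asking the server
    for block in blocks:
        new.setdefault(hashlib.sha256(block).hexdigest(), block)
    needed = rps.missing(list(new))
    rps.store([new[f] for f in needed])
    return len(needed)  # number of blocks actually transferred


rps = RecoveryPointServer()
sent_a = backup(b"AAAABBBBCCCC", rps)  # machine A: all three blocks are new
sent_b = backup(b"BBBBCCCCDDDD", rps)  # machine B: only the DDDD block is globally new
```

Because machine B consults the same global index that machine A populated, its shared blocks never cross the network, which is the bandwidth saving described above.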
Common Misconceptions About Deduplication
Three misconceptions are the most common: 1) all deduplication is the same and comes standard in every backup and recovery solution; 2) inline deduplication will slow down performance; 3) source-side deduplication consumes too much processing power on the client. All three are wrong. These variations may not appear to make a big difference on the surface, but they can significantly affect how much data you can back up, how much usable capacity is required, how quickly you can recover from unplanned system disruptions and your budget.
Misconception No. 1: All Deduplication Is the Same
Deduplication can mean very different things, and the efficiency of this technology varies greatly from product to product. Some products perform target-side deduplication, while others perform source-side; some deduplicate per backup job, while others deduplicate across all storage systems. Further, many vendors offer stand-alone deduplication software, which is important to account for when developing your backup and recovery requirements.
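The per-job versus global distinction can be shown with a minimal sketch. The whole-file "blocks" and the shared-index helper are hypothetical simplifications, assuming hash fingerprints stand in for real block-level comparison:

```python
import hashlib


def blocks_to_send(job, index):
    """Return the blocks a client must transfer, given the index it can consult."""
    out = []
    for block in job:
        fingerprint = hashlib.sha256(block).hexdigest()
        if fingerprint not in index:
            index.add(fingerprint)
            out.append(block)
    return out


# Two machines backing up largely identical content (e.g., the same OS image).
jobs = [[b"OS-IMAGE", b"APP", b"DOC1"], [b"OS-IMAGE", b"APP", b"DOC2"]]

# Per-job dedup: each job starts with an empty index, so shared blocks are resent.
per_job = sum(len(blocks_to_send(job, set())) for job in jobs)

# Global dedup: one index is shared across every source; common blocks are sent once.
shared = set()
global_dedup = sum(len(blocks_to_send(job, shared)) for job in jobs)
```

Here per-job deduplication transfers six blocks while the global index transfers four, and the gap widens as more machines share common data.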
Misconception No. 2: Inline Deduplication Slows Performance
The larger the block size used for deduplication (e.g., 250KB, 512KB, 1,024KB), the less efficient deduplication becomes. Likewise, the more data you process, the more computational resources are required. To achieve inline deduplication that doesn't slow down for lack of compute resources, vendors must design their own highly sophisticated data management structures. Unbeknownst to many, this technology is not simply available off the shelf. However, you can quickly gauge a vendor's level of data management sophistication by looking at how it supports large data sets. If its inline deduplication supports only large block sizes (e.g., 512KB or 1,024KB), that's a good indication it is limited to a single backup job or storage volume.
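The block-size tradeoff can be demonstrated with a small sketch, assuming SHA-256 fingerprints over fixed-size blocks and a deliberately tiny, repetitive data stream (real block sizes are orders of magnitude larger):

```python
import hashlib


def dedup_ratio(data, block_size):
    """Fraction of blocks eliminated as duplicates at a given block size."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    unique = {hashlib.sha256(b).hexdigest() for b in blocks}
    return 1 - len(unique) / len(blocks)


# Highly repetitive data: the same 4-byte pattern repeated, with one byte changed.
data = bytearray(b"WXYZ" * 64)
data[100] = 0  # a single modified byte mid-stream
data = bytes(data)

small = dedup_ratio(data, 4)   # fine-grained blocks isolate the change
large = dedup_ratio(data, 64)  # coarse blocks: one changed byte dirties a whole block
```

With 4-byte blocks only the one changed block is unique, so nearly 97% of blocks deduplicate; with 64-byte blocks that same single-byte change forces a whole block to be stored, and the ratio falls to 50%. That is the efficiency loss large block sizes cause.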
Misconception No. 3: Global, Source-Side Deduplication Is Only for VMware
Global deduplication refers to the process of multiple backup devices federating the data management structure for maximum deduplication efficiency. This means every computer, virtual machine or server that is backed up communicates with a backup server that manages a global database index of files on all machines, everywhere. This type requires a sophisticated workflow to optimize replication between the source client and the backup device. It is hard technology to develop, and one that not every vendor has. Knowing this, it makes sense that many people think global, source-side deduplication is meant only for VMware, not for physical machines or other virtual systems. However, this technology does exist beyond VMware and can yield tremendous operational efficiencies.
Key Trend No. 1: Inline Deduplication
How well deduplication performs depends largely on whether it is post-process or inline. As the name suggests, post-process deduplication means incoming data is first stored to disk and processed for deduplication at a later time. Alternatively, when data is processed for deduplication before being written to disk, this is called inline deduplication. Inline deduplication has the advantage of writing data to disk only once and is the preferred method compared with post-process deduplication, which requires extra storage space and more disk writes.
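The write-amplification difference can be sketched as follows, assuming hash fingerprints over toy blocks and in-memory lists standing in for disk (both functions and the sample stream are illustrative only):

```python
import hashlib


def inline_backup(stream):
    """Inline: fingerprint each block before writing; duplicates never touch disk."""
    disk, index, writes = [], set(), 0
    for block in stream:
        fingerprint = hashlib.sha256(block).hexdigest()
        if fingerprint not in index:
            index.add(fingerprint)
            disk.append(block)
            writes += 1  # only unique blocks are ever written
    return disk, writes


def post_process_backup(stream):
    """Post-process: land everything on disk first, deduplicate afterwards."""
    staged = list(stream)  # every incoming block is written once
    writes = len(staged)
    seen, disk = set(), []
    for block in staged:  # second pass re-reads the staged data and keeps uniques
        fingerprint = hashlib.sha256(block).hexdigest()
        if fingerprint not in seen:
            seen.add(fingerprint)
            disk.append(block)
    return disk, writes


stream = [b"AAAA", b"BBBB", b"AAAA", b"CCCC", b"BBBB"]
inline_disk, inline_writes = inline_backup(stream)          # 3 writes, no staging
post_disk, post_writes = post_process_backup(stream)        # 5 writes before dedup
```

Both methods end with the same three unique blocks on disk, but the post-process path performed five writes and needed staging space for the full stream, which is exactly the extra storage and disk I/O described above.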
Key Trend No. 2: Global, Source-Side Deduplication
The process of source-side deduplication entails backup servers working in conjunction with agents installed on the clients (the "data source"). The client software communicates with the backup servers to compare new blocks of data and removes redundancies before the data is transferred over the network. Because duplicate data never crosses the network, this form of deduplication yields dramatic savings in bandwidth, required storage and corresponding costs. Global, source-side deduplication takes this a step further by sharing all of an organization's deduplicated data intelligence across all source systems. It is quickly replacing target deduplication as the preferred method because it backs up only new and unique data at the source against a global database index of files.