
    How to Leverage Data Deduplication to Green Your Data Center

    Written by

    Chris Poelker
    Published January 20, 2009


      What is data deduplication, and what are its benefits? In simplified terms, data deduplication means comparing objects (usually files or blocks) and removing all non-unique objects (that is, copies). Its basic benefits can be summarized as follows: reduced hardware costs, reduced data center footprint, reduced backup costs, reduced disaster recovery costs, and more efficient use of storage.

      If you look at the left side of the figure below, you will see several blocks being stored that are not unique. The data deduplication process removes any blocks that are not unique, resulting in the smaller group of blocks to the right.

      You can apply data deduplication in multiple places. Wherever you apply it, data deduplication can affect costs not only for your Storage Area Network (SAN), but also for your entire IT infrastructure.

      Based on an enterprise environment running typical applications, you could probably squeeze out between 10 and 20 percent more storage space just by getting rid of duplicate and unnecessary files. Files are commonly known as “unstructured data,” while the data residing in databases is commonly known as “structured data.” Unstructured data in files can therefore be deduplicated at the file system level, but the structured data residing in large databases is typically deduplicated beneath the operating system’s file system, at the block level.

      Interestingly, though, since block-level deduplication does not need to understand the file system, it is sometimes even more efficient to deduplicate files at the block level. Whether you choose a solution that works at the block level, the file level or both, you will find that it can pay for itself extremely fast through savings on storage, media, power, cooling and floor space.
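      To make file-level deduplication concrete, here is a minimal sketch (an illustration, not any vendor's implementation) that walks a directory tree, hashes each file's contents, and groups files by hash; any group with more than one path is a set of removable copies:

```python
import hashlib
import os
from collections import defaultdict

def find_duplicate_files(root):
    """Group files under `root` by content hash (file-level deduplication).

    Returns a dict mapping hash digest -> list of paths, keeping only
    the groups that contain more than one path (i.e., actual duplicates).
    """
    by_hash = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            h = hashlib.sha256()
            # Hash in 1 MB chunks so large files don't have to fit in memory.
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            by_hash[h.hexdigest()].append(path)
    return {digest: paths for digest, paths in by_hash.items() if len(paths) > 1}
```

      Once the duplicate groups are known, a cleanup policy (keep the newest copy, for instance) can decide which paths to delete or replace with links.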

      How Data Deduplication Works


      1. Divide the input data into blocks or chunks

      2. Calculate a hash value for the data

      3. Use the hash value to determine whether another block of data has already been stored

      4. Replace the original data with a reference to an object in the database
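      The four steps above can be sketched in a few lines. This is a toy illustration with fixed-size blocks and an in-memory table, not how any commercial product is built:

```python
import hashlib

BLOCK_SIZE = 4096  # fixed-size chunking; real products often use variable-size chunks

class DedupStore:
    """Minimal block-level deduplication store following steps 1-4."""

    def __init__(self):
        self.blocks = {}  # hash digest -> unique block data

    def write(self, data):
        """Store `data`, returning the list of block references that reconstructs it."""
        refs = []
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]              # step 1: divide into blocks
            digest = hashlib.sha256(block).hexdigest()  # step 2: calculate a hash
            if digest not in self.blocks:               # step 3: already stored?
                self.blocks[digest] = block
            refs.append(digest)                         # step 4: keep a reference
        return refs

    def read(self, refs):
        """Reassemble the original data from its block references."""
        return b"".join(self.blocks[d] for d in refs)
```

      Writing the same data twice adds no new blocks to the store; only the small list of references grows, which is where the space savings come from.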

      You can implement the actual process of data deduplication in several ways. For example, you can eliminate duplicate data simply by comparing two files and deleting the one that’s older or no longer needed, or you can use a commercial deduplication product. Commercial solutions use sophisticated methods, and the actual math involved can make your head spin. If you want to understand all the nuances of the mathematical techniques used to find duplicate data, you should take college courses in statistical analysis, data security and cryptography (and hey, who knows: if your current line of work doesn’t pan out, maybe you could get a job at the CIA).

      Most of the data deduplication solutions on the market today use standard cryptographic hash functions to create a practically unique mathematical representation of the data in question, called a hash, so that it can be compared with the hashes of new data to determine whether that data is unique. The hash also serves as the metadata (that is, data about other data) for the chunk in question. A hash used as metadata makes an efficient index into a lookup table, allowing you to quickly determine whether any new data being stored is already present and can be eliminated.

      Why Data Deduplication is Important


      Data deduplication goes a long way toward reducing data storage costs by making storage much more efficient, which in turn can reduce the overall footprint inside the data center. Just think: if by deduplicating your data you can store the exact same amount of information in less than one-tenth the footprint, imagine how much money and energy you could save in power and cooling costs.

      The machine on top is a tape library with 16 tape drives and 6,000 tapes. The bottom machine is a Virtual Tape Library (VTL) with deduplication, which can emulate over 512 of the tape libraries pictured above it. Even if the cost of the equipment is not an issue, the floor space required sure is! So, let’s see: 10 floor tiles in the data center dedicated to housing 6,000 tapes’ worth of data, or one floor tile dedicated to housing over 65,000 tapes’ worth of data. Hmm, which to choose?

      Why tape is not so green

      Some of the folks who sell tape will tell you that, since tape does not require power after it’s written, it’s greener to use tape than disk, even if the data is deduplicated. They would be right: tape consumes no power at rest. But some of those older, massive tape libraries need a nuclear power plant to operate, and while disks draw a lot of power when they spin up, they draw much less during normal operation.

      The other not-so-green fact about tape is that you end up with a lot of it over time. If your Disaster Recovery (DR) strategy is to ship tapes offsite for recovery or storage, the trucks hauling those tapes burn a heck of a lot of gasoline that disk drives don’t need. In fact, a VTL that implements deduplication can electronically replicate the data to another VTL at a different location, which greens the other data center as well. Also, the most prevalent VTL solution can encrypt the replicated virtual tapes, so there is no risk of losing or misplacing sensitive data.

      How Backup Environments Benefit


      Let’s look at a typical backup environment as an example, since that is an area that benefits greatly from data deduplication. Data deduplication solutions can be implemented in many places, but data backup and data archiving are the areas where the benefits are immediately apparent. The more data you have, and the longer you need to retain it for business reasons or regulatory purposes, the better the results you will see from your data deduplication solution.

      The figure below shows a sample dataset of 20 TB being retained over five weeks, with typical data growth and change rates. If you use a traditional backup solution (such as Veritas NetBackup, CommVault, IBM Tivoli Storage Manager (TSM), EMC Legato or HP Data Protector) to back up the data to media (disk or tape) with no deduplication, you’ll need to store more than 101 TB of data in only five weeks. [Okay, for you IBMers out there, TSM is a progressive backup solution, so you will probably store less on tape but don’t get me started on all the disk-based file systems being used for the D2D (disk to disk) part of the backup!]
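      The growth math behind that total is easy to reproduce. In the sketch below, the weekly full-backup schedule and the 2 percent weekly growth rate are assumptions chosen to land near the article's figure, not numbers taken from it:

```python
def cumulative_backup_tb(initial_tb, weeks, weekly_growth=0.02):
    """Total media consumed by weekly full backups of a growing dataset.

    Each week the full dataset is written to media again, and the dataset
    grows by `weekly_growth` before the next full. The 2% default growth
    rate is an assumption, not a figure from the article.
    """
    total = 0.0
    size = initial_tb
    for _ in range(weeks):
        total += size
        size *= 1 + weekly_growth
    return total

# 20 TB retained over five weeks of weekly fulls:
print(f"{cumulative_backup_tb(20, 5):.1f} TB")  # -> 104.1 TB
```

      Five copies of a 20 TB dataset, each slightly larger than the last, is how retention quietly turns 20 TB into more than 100 TB of consumed media.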

      In the figure below, you can see that after five weeks with no deduplication going on, you will have stored about 110 TB of data.

      [Figure: storage consumed over five weeks with no deduplication]

      Now let’s take the same metrics and apply a deduplication ratio of a little over 6-to-1. Instead of storing 110 TB, we now only need to store a little more than 24 TB for the exact same amount of information.
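      The underlying arithmetic is simply logical capacity divided by the deduplication ratio. The ratios below are illustrative:

```python
def physical_capacity(logical_tb, dedup_ratio):
    """Physical capacity actually consumed after deduplication."""
    return logical_tb / dedup_ratio

# How 110 TB of logical backup data shrinks at various ratios:
for ratio in (2, 6, 10):
    print(f"{ratio}:1 -> {physical_capacity(110, ratio):.1f} TB")
    # 2:1 -> 55.0 TB, 6:1 -> 18.3 TB, 10:1 -> 11.0 TB
```

      Real-world ratios vary widely with data type and retention policy; backup workloads with long retention tend toward the high end because each weekly full repeats mostly unchanged blocks.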

      [Figure: the same dataset deduplicated at a ratio of just over 6-to-1]

      All things being equal, we can see that data deduplication can offer dramatic savings in data center floor space, tape media costs, and tape storage and shipping costs. And, used in conjunction with disk as a backup medium, it delivers much faster recovery if something goes wrong.

      The green aspects of data deduplication even extend outside the data center, to the trucks that are no longer required to ship bulky tapes offsite. I haven’t even mentioned yet how data deduplication can improve disaster recovery. Needing less WAN bandwidth to replicate data is a major benefit, and if you send less, you store less on the other side, which reduces the cost of storage, power and cooling at the DR location. So you can see, the value and the benefits can add up very quickly, and that translates into a greener world for you in more ways than one.
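      As a rough illustration of the replication benefit, compare transfer times for raw versus deduplicated data. The 1 Gbps link speed and 80 percent link efficiency below are assumptions, not figures from the article:

```python
def transfer_hours(terabytes, link_gbps, efficiency=0.8):
    """Hours needed to replicate `terabytes` over a WAN link.

    `efficiency` discounts protocol overhead and link contention
    (an assumed figure, not one from the article).
    """
    bits = terabytes * 8 * 10**12              # decimal TB -> bits
    seconds = bits / (link_gbps * 10**9 * efficiency)
    return seconds / 3600

# Replicating the backup example over an assumed 1 Gbps WAN link:
print(f"raw:     {transfer_hours(110, 1):.1f} h")  # -> 305.6 h
print(f"deduped: {transfer_hours(24, 1):.1f} h")   # -> 66.7 h
```

      The gap is why deduplicating before replication often makes electronic DR feasible on links where shipping the raw data would never fit in the replication window.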

      Chris Poelker is Vice President of Enterprise Solutions at FalconStor Software. Prior to working at FalconStor, Chris was a Storage Architect at Hitachi Data Systems. Before that, Chris was a Lead Storage Architect/Senior Systems Architect for Compaq Computer, Inc. While at Compaq, Chris built the sales/service engagement model for Compaq StorageWorks and trained VARs and Compaq ES/PS contacts on StorageWorks. His certifications include MCSE, MCT (Microsoft Trainer), MASE (Compaq Master ASE Storage Architect) and A+ (PC Technician). Chris is also the co-author of “Storage Area Networks for Dummies.” He can be reached at [email protected].

