    Internet Archive Tames Data Costs

    Written by

    Peter Coffee
    Published November 8, 2006

      Archiving the entire Internet—not just as it is but as it has been—is a task that pushes the limits of maximum storage volume while demanding a creative search for minimum cost. Brewster Kahle, digital librarian and founder of the nonprofit Internet Archive, spoke with eWEEK Technology Editor Peter Coffee about the magnitude of the challenge and the surprisingly simple solutions his group has devised.

      As enterprise data volumes swiftly rise into the petabyte realm, and as even the profit sector finds the cost of data center operations to be quickly outpacing the cost of the IT hardware that those centers support, Kahle's team finds itself offering pointers to the future as well as the past.

      More of Coffee's conversation with Kahle can be found in an eWEEK InfraSpectrum podcast.

      What kind of total storage volume does the Internet Archive represent now?

      If you take the Web collection, it's about 55 billion pages, and if it were uncompressed, it's well over a petabyte. We get about a 2-to-1 compression, so I guess it's about a 1.6-petabyte [primary data] collection.

      Are you using conventional magnetic storage technology?

      Yes, we started with tape, and we're now on spinning disks. We use, basically, Linux boxes stacked up.

      Have you built RAID facilities?

      We tried conventional RAID, and we found that it doesn't work very well for us. Our underlying storage system is what we call the PetaBox—it's a cluster that's specifically designed for storing and processing petabytes of information. The hardware design leveraged commodity components and low-power components to make it high density, high reliability, easy to repair and very low in capital cost.

      We've been able to figure out how to deal with petabytes in a cost-effective way. I mean that in every aspect—the capital cost, the maintenance cost, the people cost to keep these things repaired, the power and air conditioning costs.

      We're finding that it's data center space that's one of the killers.

      You mean the cost of owning and operating the facility?

      I mean the time it takes to outfit one. We've started to develop … putting a petabyte in shipping containers, so you can store them, running, in parking lots. People just don't have the machine room space and the air conditioning systems to be able to deal with this [amount of data]. Air conditioning systems' power use is woefully unoptimized for the regular types of machines that are going in. You can do much better if you control air flow.

      Something like the enclosure on an IBM Blue Gene that's optimized for air flow?

      The bigger issue is outside the box: The general idea is that you dump warm air into the environment, and you pull cold air—well, actually, warm air—from the environment. It's just dumb—thermodynamically inefficient. If you want to get your air conditioning cost down or eliminate it because you can use outside air, that's where we're going next with our machine design.

      On the next level up, we just use Linux. … We use very simple systems for replicating data from one system to another, and also for serving it to the outside.

      When you say “simple systems,” do you mean on the software side? Because when I look at multidecade curves for storage capacity versus storage subsystem bandwidth, it seems as if that gap is widening, and our ability to pack more petabytes into a box is vastly outstripping our capability for getting bytes in and out of that box.

      Well, we're seeing a nice progression from '96—our tape robots and our first cluster and our second cluster. Now we're on our third cluster.

      With tape, were you horribly input-output bound?

      You can just stop at “horrible.” There's almost nothing nice to say about tape.

      Except that it's cheap?

      It's not even cheap. … Disks seem like the way to go, this decade.

      And the input/output?

      We use Ethernet. We have four disks on a computer. Up until last year, the computers were 100 megabit, but now they're gigabit.
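To put the jump from 100-megabit to gigabit Ethernet in perspective, here is a rough sketch of how long it would take to copy everything on one four-disk machine over its network link. The per-disk capacity (500 GB) is an assumption chosen for illustration; the interview does not state disk sizes. The function name `transfer_hours` is hypothetical, and the calculation ignores protocol overhead and disk read speed.

```python
def transfer_hours(capacity_bytes: float, link_bits_per_second: float) -> float:
    """Hours needed to move capacity_bytes over a link at full line rate.

    Ignores protocol overhead and disk throughput, so real times would be longer.
    """
    seconds = capacity_bytes * 8 / link_bits_per_second
    return seconds / 3600


# Hypothetical node: four disks (from the interview) of 500 GB each (assumed).
NODE_BYTES = 4 * 500e9

fast_ethernet_hours = transfer_hours(NODE_BYTES, 100e6)  # 100 megabit per second
gigabit_hours = transfer_hours(NODE_BYTES, 1e9)          # 1 gigabit per second

print(f"100 Mbit/s: {fast_ethernet_hours:.1f} hours")
print(f"1 Gbit/s:   {gigabit_hours:.1f} hours")
```

Under these assumptions the faster link cuts the copy time by a factor of ten, which matters when a machine's contents must be replicated to another machine.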

      Is that storage on an IP network?

      It's just Linux—we use the processors that are next to the disk. It's a straight cluster.

      Is that essentially what I'd find at Google?

      It's what you'd find at Google; you'd find it at Hotmail; you'd find it at Yahoo.

      All Linux boxes with cheap disks?

      They vary a bit with how much CPU, RAM, disk and network they have, but, other than that, there's probably not a lot of difference. Most of us tend to track the same processors. We're using dual-core [Advanced Micro Devices] Athlon [processors], mostly. But it's not that easy to make it over the hump to know how to manage a cluster.

      There is a technology change that happened from the big-iron Sun Microsystems-EMC-Oracle lineup of the late '90s. [It's] these clusters, which always seem much easier to do than it turns out.

      For a while, the big word in clusters was “Beowulf”—you're not using that specific Linux cluster model, are you?

      We never tried that, no. I could be wrong, but I believe those were really based for scientific applications, where low-latency communication between machines was important—basically RAM applications, where we're fundamentally disk-based.

      If you're disk-based, all sorts of things become easier. You don't have to deal with microseconds, in terms of response time. If you're [working with] milliseconds, you can just use Ethernet networks the way they're normally designed, and you can use operating systems as they're normally designed.

      You just have to be artful about how you put all of these pieces together. When you have a couple thousand computers, which we do—and you then have 8,000 or 9,000 disks, which we do—you have to start getting good at making sure that everything's healthy or [at] managing failure.
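A back-of-the-envelope sketch shows why failure management becomes routine work at this scale. The disk count (9,000) comes from the interview; the 3 percent annual failure rate is an assumed figure for illustration only, not something Kahle states. Both function names are hypothetical, and the model assumes disks fail independently at a constant rate.

```python
def expected_failures_per_day(disk_count: int, annual_failure_rate: float) -> float:
    """Average number of disk failures per day across the whole cluster."""
    return disk_count * annual_failure_rate / 365


def prob_at_least_one_failure(
    disk_count: int, annual_failure_rate: float, days: float = 1.0
) -> float:
    """Chance that at least one disk fails within the given number of days.

    Assumes independent failures at a constant rate (a simplification).
    """
    per_disk_survival = (1 - annual_failure_rate) ** (days / 365)
    return 1 - per_disk_survival ** disk_count


DISKS = 9000          # upper figure quoted in the interview
ASSUMED_AFR = 0.03    # assumed annual failure rate, not from the interview

daily = expected_failures_per_day(DISKS, ASSUMED_AFR)
any_today = prob_at_least_one_failure(DISKS, ASSUMED_AFR)

print(f"Expected failures per day: {daily:.2f}")
print(f"Chance of at least one failure today: {any_today:.0%}")
```

With numbers in this range, a dead disk is something to expect most days, so the repair process has to be cheap and routine.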

      Technology Editor Peter Coffee can be reached at [email protected].

      Peter Coffee
      Peter Coffee is Director of Platform Research at salesforce.com, where he serves as a liaison with the developer community to define the opportunity and clarify developers' technical requirements on the company's evolving Apex Platform. Peter previously spent 18 years with eWEEK (formerly PC Week), the national news magazine of enterprise technology practice, where he reviewed software development tools and methods and wrote regular columns on emerging technologies and professional community issues. Before he began writing full-time in 1989, Peter spent eleven years in technical and management positions at Exxon and The Aerospace Corporation, including management of the latter company's first desktop computing planning team and applied research in applications of artificial intelligence techniques. He holds an engineering degree from MIT and an MBA from Pepperdine University; he has held teaching appointments in computer science, business analytics and information systems management at Pepperdine, UCLA, and Chapman College.
