Internet Archive Tames Data Costs

Q&A: The founder and digital librarian of Internet Archive reveals storage challenges and solutions.

Archiving the entire Internet—not just as it is but as it has been—is a task that pushes the limits of maximum storage volume while demanding a creative search for minimum cost. Brewster Kahle, digital librarian and founder of the nonprofit Internet Archive, spoke with eWEEK Technology Editor Peter Coffee about the magnitude of the challenge and the surprisingly simple solutions his group has devised.

As enterprise data volumes swiftly rise into the petabyte realm, and as even the profit sector finds the cost of data center operations to be quickly outpacing the cost of the IT hardware that those centers support, Kahle's team finds itself offering pointers to the future as well as the past.

More of Coffee's conversation with Kahle can be found in an eWEEK InfraSpectrum podcast.

What kind of total storage volume does the Internet Archive represent now?

If you take the Web collection, it's about 55 billion pages, and if it were uncompressed, it's well over a petabyte. We get about a 2-1 compression, so I guess it's about a 1.6-petabyte [primary data] collection.

Are you using conventional magnetic storage technology?

Yes, we started with tape, and we're now on spinning disks. We use, basically, Linux boxes stacked up.

Have you built RAID facilities?

We tried conventional RAID, and we found that it doesn't work very well for us. Our underlying storage system is what we call the PetaBox—it's a cluster that's specifically designed for storing and processing petabytes of information. The hardware design leveraged commodity components and low-power components to make it high density, high reliability, easy to repair and very low in capital cost.

We've been able to figure out how to deal with petabytes in a cost-effective way. I mean that in every aspect—the capital cost, the maintenance cost, the people cost to keep these things repaired, the power and air conditioning costs.

We're finding that it's data center space that's one of the killers.

You mean the cost of owning and operating the facility?


I mean the time it takes to outfit one. We've started to develop … putting a petabyte in shipping containers, so you can store them, running, in parking lots. People just don't have the machine room space and the air conditioning systems to be able to deal with this [amount of data]. Air conditioning systems' power use is woefully unoptimized for the regular types of machines that are going in. You can do much better if you control air flow.

Something like the enclosure on an IBM Blue Gene that's optimized for air flow?

The bigger issue is outside the box: The general idea is that you dump warm air into the environment, and you pull cold air—well, actually, warm air—from the environment. It's just dumb—thermodynamically inefficient. If you want to get your air conditioning cost down or eliminate it because you can use outside air, that's where we're going next with our machine design.


On the next level up, we just use Linux. … We use very simple systems for replicating data from one system to another, and also for serving it to the outside.

When you say "simple systems," do you mean on the software side? Because when I look at multidecade curves for storage capacity versus storage subsystem bandwidth, it seems as if that gap is widening, and our ability to pack more petabytes into a box is vastly outstripping our capability for getting bytes in and out of that box.

Well, we're seeing a nice progression from '96—our tape robots and our first cluster and our second cluster. Now we're on our third cluster.

With tape, were you horribly input-output bound?

You can just stop at "horrible." There's almost nothing nice to say about tape.

Except that it's cheap?

It's not even cheap. … Disks seem like the way to go, this decade.

And the input/output?

We use Ethernet. We have four disks on a computer. Up until last year, the computers were 100 megabit, but now they're gigabit.
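
As a rough illustration of what the move from 100-megabit to gigabit Ethernet means when copying data between nodes, the sketch below estimates the best-case time to move a given volume over a single link. The 1 TB volume and the function name are assumptions made for illustration, not figures from the interview, and real throughput would fall below line rate.

```python
def transfer_hours(data_bytes: float, link_bits_per_second: float) -> float:
    """Best-case hours to move data_bytes over a link running at full line rate."""
    seconds = data_bytes * 8 / link_bits_per_second
    return seconds / 3600


# Hypothetical per-node data volume; the interview does not state disk sizes.
ONE_TERABYTE = 1e12

fast_ethernet_hours = transfer_hours(ONE_TERABYTE, 100e6)  # 100 Mbit/s link
gigabit_hours = transfer_hours(ONE_TERABYTE, 1e9)          # 1 Gbit/s link

print(f"100 Mbit/s: {fast_ethernet_hours:.1f} h, 1 Gbit/s: {gigabit_hours:.1f} h")
```

Under these assumptions the same copy drops from roughly a day to a couple of hours, which is the kind of gap that matters when replicating data from one machine to another.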

Is that storage on an IP network?

It's just Linux—we use the processors that are next to the disk. It's a straight cluster.

Is that essentially what I'd find at Google?

It's what you'd find at Google; you'd find it at Hotmail; you'd find it at Yahoo.

All Linux boxes with cheap disks?

They vary a bit with how much CPU, RAM, disk and network they have, but, other than that, there's probably not a lot of difference. Most of us tend to track the same processors. We're using dual-core [Advanced Micro Devices] Athlon [processors], mostly. But it's not that easy to make it over the hump to know how to manage a cluster.

There is a technology change that happened from the big-iron Sun Microsystems-EMC-Oracle lineup of the late '90s. [It's] these clusters, which always seem much easier to do than it turns out.

For a while, the big word in clusters was "Beowulf"—you're not using that specific Linux cluster model, are you?

We never tried that, no. I could be wrong, but I believe those were really based for scientific applications, where low-latency communication between machines was important—basically RAM applications, where we're fundamentally disk-based.

If you're disk-based, all sorts of things become easier. You don't have to deal with microseconds, in terms of response time. If you're [working with] milliseconds, you can just use Ethernet networks the way they're normally designed, and you can use operating systems as they're normally designed.

You just have to be artful about how you put all of these pieces together. When you have a couple thousand computers, which we do—and you then have 8,000 or 9,000 disks, which we do—you have to start getting good at making sure that everything's healthy or [at] managing failure.
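
To give a sense of why managing failure becomes routine at this scale, the sketch below estimates how many disks would fail per day across a fleet of that size. The 3 percent annual failure rate is an assumed figure chosen only for illustration; the interview does not give one, and the function name is hypothetical.

```python
def expected_failures_per_day(disk_count: int, annual_failure_rate: float) -> float:
    """Average number of disk failures per day, assuming failures are spread evenly over a year."""
    return disk_count * annual_failure_rate / 365


DISK_COUNT = 8500    # midpoint of the 8,000 to 9,000 disks mentioned above
ASSUMED_AFR = 0.03   # assumption: 3% of disks fail per year (not from the interview)

per_day = expected_failures_per_day(DISK_COUNT, ASSUMED_AFR)
print(f"{per_day:.2f} expected disk failures per day")
```

Even at this modest assumed rate, a fleet of this size would see a failed disk on most days, so detection and repair have to be a routine part of operations.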

