Internet Archive Tames Data Costs
Q&A: The founder and digital librarian of Internet Archive reveals storage challenges and solutions.
Archiving the entire Internet, not just as it is but as it has been, is a task that pushes the limits of maximum storage volume while demanding a creative search for minimum cost. Brewster Kahle, digital librarian and founder of the nonprofit Internet Archive, spoke with eWEEK Technology Editor Peter Coffee about the magnitude of the challenge and the surprisingly simple solutions his group has devised. As enterprise data volumes swiftly rise into the petabyte realm, and as even the profit sector finds the cost of data center operations to be quickly outpacing the cost of the IT hardware that those centers support, Kahle's team finds itself offering pointers to the future as well as the past. More of Coffee's conversation with Kahle can be found in an eWEEK InfraSpectrum podcast.

On the next level up, we just use Linux.
We use very simple systems for replicating data from one system to another, and also for serving it to the outside.
When you say "simple systems," do you mean on the software side? Because when I look at multidecade curves for storage capacity versus storage subsystem bandwidth, it seems as if that gap is widening, and our ability to pack more petabytes into a box is vastly outstripping our capability for getting bytes in and out of that box.
Well, we're seeing a nice progression from '96: our tape robots and our first cluster and our second cluster. Now we're on our third cluster.
With tape, were you horribly input-output bound?
You can just stop at "horrible." There's almost nothing nice to say about tape.
Except that it's cheap?
It's not even cheap.
Disks seem like the way to go, this decade.
And the input/output?
We use Ethernet. We have four disks on a computer. Up until last year, the computers were 100 megabit, but now they're gigabit.
Is that storage on an IP network?
It's just Linux; we use the processors that are next to the disk. It's a straight cluster.
Is that essentially what I'd find at Google?
It's what you'd find at Google; you'd find it at Hotmail; you'd find it at Yahoo.
All Linux boxes with cheap disks?
They vary a bit with how much CPU, RAM, disk and network they have, but, other than that, there's probably not a lot of difference. Most of us tend to track the same processors. We're using dual-core [Advanced Micro Devices] Athlon [processors], mostly. But it's not that easy to make it over the hump to know how to manage a cluster.
There is a technology change that happened from the big-iron Sun Microsystems-EMC-Oracle lineup of the late '90s. [It's] these clusters, which always seem much easier to do than it turns out.
For a while, the big word in clusters was "Beowulf." You're not using that specific Linux cluster model, are you?
We never tried that, no. I could be wrong, but I believe those were really based for scientific applications, where low-latency communication between machines was important (basically RAM applications), where we're fundamentally disk-based.
If you're disk-based, all sorts of things become easier. You don't have to deal with microseconds, in terms of response time. If you're [working with] milliseconds, you can just use Ethernet networks the way they're normally designed, and you can use operating systems as they're normally designed.
You just have to be artful about how you put all of these pieces together. When you have a couple thousand computers, which we do, and you then have 8,000 or 9,000 disks, which we do, you have to start getting good at making sure that everything's healthy or [at] managing failure.
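To put rough numbers on that scale, the figures Kahle quotes can be run through some back-of-the-envelope arithmetic. The Python sketch below is illustrative only. The computer count, disks per computer and gigabit link speed come from the interview; the annual disk failure rate is a placeholder assumption, not a figure Kahle gave, and the function names are hypothetical.

```python
# Figures quoted in the interview.
COMPUTERS = 2000           # "a couple thousand computers"
DISKS_PER_COMPUTER = 4     # "four disks on a computer"
LINK_MBIT_PER_S = 1000     # gigabit Ethernet per computer

# ASSUMPTION, not from the interview: a placeholder annual failure rate per disk.
ASSUMED_ANNUAL_FAILURE_RATE = 0.03


def total_disks(computers, disks_per_computer):
    """Number of disks in the cluster."""
    return computers * disks_per_computer


def expected_failures_per_day(disks, annual_failure_rate):
    """Average disk failures per day, assuming failures are spread evenly over a year."""
    return disks * annual_failure_rate / 365


def per_disk_network_share_mbyte_per_s(link_mbit_per_s, disks_per_computer):
    """Network bandwidth per disk in megabytes per second if all disks share one link equally."""
    return link_mbit_per_s / 8 / disks_per_computer


disks = total_disks(COMPUTERS, DISKS_PER_COMPUTER)
print(disks)  # 8000
print(round(expected_failures_per_day(disks, ASSUMED_ANNUAL_FAILURE_RATE), 2))  # 0.66
print(per_disk_network_share_mbyte_per_s(LINK_MBIT_PER_S, DISKS_PER_COMPUTER))  # 31.25
```

Four disks on each of 2,000 computers gives 8,000 disks, in line with the "8,000 or 9,000" Kahle cites. Under the assumed failure rate, a disk would fail about two days out of three, which suggests why managing failure becomes a routine task at this size.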
Technology Editor Peter Coffee can be reached at peter_coffee@ziffdavis.com.








