Ten years in the making, the Internet Archive—an ambitious project to store and archive all the Web pages on the Internet along with other forms of digital content—houses more than 4 petabytes of data (1.6 petabytes of primary data) using standards-based modular hardware and open-source software.
The organization’s strategies for storing and managing that data can serve as best practices for any company trying to get its arms around an ever-expanding data load.
Multiterabyte data centers are quite common these days, but petabyte-size data stores remain somewhat novel. To see firsthand how the Internet Archive is handling the storage of all its data, eWEEK Labs went on-site at the digital library’s San Francisco data center.
The Internet Archive had recently relocated its data center from offices in the Presidio of San Francisco. In fact, IT managers had just finished moving the last racks of servers into the new location two weeks prior to our visit in October.
Much of the Internet Archive’s success has to do with the way its IT managers approach the storage of large amounts of data, said Brewster Kahle, digital librarian and founder of the Internet Archive.
“We are a petabyte-oriented facility, and the question is, How do we work and store petabytes of information that are constantly accessible to the outside world?” said Kahle during eWEEK Labs’ visit. “The answer is to have two practical considerations: how to store this massive amount of data and how to preserve it. Preservation and access are part of our mandate.”
The Internet Archive is a nonprofit organization founded in 1996 with the purpose of building an online library made up of saved Web sites. The Internet Archive today includes all manner of digital formats, including text, audio and video, as well as archived Web pages. The collection—which can be accessed at www.archive.org—is continually growing.
Funding for the Internet Archive came originally from Kahle as a result of the sale of his company, WAIS (Wide Area Information Servers), to America Online. The Internet Archive is now funded by private foundations, government grants and in-kind donations from corporations.
In the beginning, the Internet Archive used Storage Technology’s StorageTek TimberWolf 9710 tape library with Quantum’s DLT7000 drives, the combination of which could store as much as 70GB of data. (Storage Technology was acquired by Sun Microsystems in 2005.) However, while the tape library was cost-efficient, the disadvantage was its relatively slow access speed.
In 2000, Internet Archive IT managers decided to switch from the StorageTek tape library to desktop machines from Hewlett-Packard. The desktops, each of which had four 160GB disk drives, sat on standard baker’s racks purchased from Costco Wholesale.
As the digital library grew, Internet Archive IT staffers began looking for cheaper ways to store data. In 2004, they developed a storage system called the PetaBox, which uses a combination of affordable standards-based parts and open-source software. The PetaBox also boasts low power consumption. The Internet Archive eventually spun off a company, Capricorn Technologies, to manufacture and sell the PetaBox technology.
Today, the Internet Archive has about 2,000 PetaBox systems in its data center. The PetaBoxes are used to crawl the Internet and to store Web pages and other digital content. Each of 50 racks houses 40 1U (1.75-inch) PetaBox servers, most of which are armed with dual-core Opteron processors from Advanced Micro Devices. (Older PetaBoxes use ultra-low-voltage processors from Via Technologies.)
Kahle said this approach helps keep costs down for the nonprofit organization. “We are built out of boxes just stacked up and used for different purposes,” Kahle said. “As a nonprofit, one of the biggest [cost] issues for us is in the building of the data center: the administration and the power. We’re trying to keep all of these factors under control.”
PetaBox systems currently being installed each have four 750GB perpendicular hard drives from Seagate Technology, providing up to 120TB of storage per rack. The Internet Archive adds about one new rack of PetaBoxes per month, according to John Berry, vice president of operations at the Internet Archive. Berry said he expects this trend to continue indefinitely.
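The per-rack figure follows directly from the numbers quoted above. The short Python sketch below reproduces that arithmetic and, as an assumption of ours rather than a figure from the Internet Archive, projects a year of growth at one new rack per month with every new rack fully populated with 750GB drives. It uses decimal units (1TB = 1,000GB), and the function names are illustrative only.

```python
# Figures quoted in the article.
DRIVES_PER_SERVER = 4
DRIVE_CAPACITY_GB = 750
SERVERS_PER_RACK = 40
RACKS_ADDED_PER_MONTH = 1


def rack_capacity_tb() -> float:
    """Raw capacity of one fully populated rack, in decimal terabytes."""
    total_gb = DRIVES_PER_SERVER * DRIVE_CAPACITY_GB * SERVERS_PER_RACK
    return total_gb / 1000


def yearly_growth_pb() -> float:
    """Raw capacity added in 12 months, in decimal petabytes.

    Assumes one new rack per month, all with 750GB drives (our assumption).
    """
    return rack_capacity_tb() * RACKS_ADDED_PER_MONTH * 12 / 1000


print(f"Per rack: {rack_capacity_tb():.0f}TB")
print(f"Added per year: {yearly_growth_pb():.2f}PB")
```

These are raw capacities. Because the archive keeps two copies of everything on separate machines, the usable primary capacity would be lower.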
Potential for Failure
With somewhere between 8,000 and 9,000 disks currently spinning in all these systems, disk failure is common, with 2 to 3 percent of disks failing every year. There is no way to hot-swap the drives in the PetaBoxes, so servers with failed disks need to be pulled out of their respective racks. Kahle said this practice is tolerable at the Internet Archive because data isn’t updated as quickly as it would need to be when dealing with mission-critical enterprise data.
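To put that failure rate in operational terms, the following Python sketch turns the quoted ranges into an expected number of dead disks per year and per week. It is back-of-the-envelope arithmetic on the article's figures, not a reliability model, and the function name is ours.

```python
def expected_failures_per_year(disk_count: int, failure_percent: int) -> float:
    """Expected number of failed disks in a year at a flat annual failure rate."""
    return disk_count * failure_percent / 100


# Low end: 8,000 disks at 2 percent. High end: 9,000 disks at 3 percent.
low = expected_failures_per_year(8000, 2)
high = expected_failures_per_year(9000, 3)

print(f"Per year: {low:.0f} to {high:.0f} failed disks")
print(f"Per week: {low / 52:.1f} to {high / 52:.1f} failed disks")
```

At those rates, staff would be pulling a server from a rack roughly three to five times a week.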
The Internet Archive, which has the equivalent of three full-time system administrators, uses Nagios, an enterprise-class open-source network monitoring application. Nagios monitors the status of more than 16,000 checks that run on the 800 machines that make up the Internet Archive’s primary cluster.
Nagios isn’t the only open-source application used at the Internet Archive. The PetaBoxes run Canonical’s Ubuntu distribution of Linux.
The Internet Archive also makes use of two applications for the PetaBoxes: PetaBox Catalog manages thousands of tasks running across the cluster, balancing workloads and tracking job progress, and PetaBox Control Panel provides a Web interface for configuration and modification at the cluster, rack, node and partition levels.
To Protect and Serve
To protect data, the Internet Archive’s IT managers tried RAID 5. However, they found it unable to scale and opted instead to use a JBOD (just a bunch of disks) configuration. For its archive, the organization uses pairs of machines and has two copies of everything on separate machines. The Internet Archive also has copies of all its data stored in other locations, including a data center in Amsterdam, the Netherlands, and the new Library of Alexandria, in Egypt.
“If there’s one lesson we can take from the [destruction of the original] Library of Alexandria, it’s don’t have just one copy,” Kahle said. “We wanted to build the Internet Archive to ensure that we don’t lose the great works of today. The only way we could do that is to have multiple copies and have multiple places in the world that we synchronize over the Internet.”
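The paired-machine idea is simple enough to sketch in a few lines of Python. The example below is our illustration, not the Internet Archive's software: two local directories stand in for the two machines of a pair, and the function names and the SHA-256 check are assumptions made for the sake of the example.

```python
import hashlib
from pathlib import Path


def store_on_pair(name: str, data: bytes, primary: Path, mirror: Path) -> str:
    """Write the same bytes to both nodes of a pair and return their checksum."""
    for node in (primary, mirror):
        node.mkdir(parents=True, exist_ok=True)
        (node / name).write_bytes(data)
    return hashlib.sha256(data).hexdigest()


def read_from_pair(name: str, digest: str, primary: Path, mirror: Path) -> bytes:
    """Return the first copy that exists and still matches the checksum."""
    for node in (primary, mirror):
        candidate = node / name
        if candidate.exists():
            data = candidate.read_bytes()
            if hashlib.sha256(data).hexdigest() == digest:
                return data
    raise FileNotFoundError(f"no intact copy of {name} on either node")
```

If the primary copy is lost or corrupted, `read_from_pair` falls through to the mirror. That is the property a pair of plain JBOD machines provides without any RAID parity to rebuild.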
The Internet Archive uses the Internet to keep its computing clusters in sync with one another. A protocol called OAI (Open Archives Initiative) is used for metadata harvesting. HTTP and FTP are also used to move batches of files.
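For readers unfamiliar with OAI metadata harvesting, a request under the OAI Protocol for Metadata Harvesting (OAI-PMH) is an ordinary HTTP GET with a `verb` parameter. The Python sketch below builds a `ListRecords` request for Dublin Core records changed since a given date. The base URL is a placeholder and the helper name is ours. The article does not say which endpoints or verbs the Internet Archive actually uses.

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_url(base_url: str,
                     metadata_prefix: str = "oai_dc",
                     from_date: Optional[str] = None,
                     until_date: Optional[str] = None) -> str:
    """Build an OAI-PMH ListRecords request URL (harvest by date range)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date    # e.g. "2006-10-01"
    if until_date:
        params["until"] = until_date
    return f"{base_url}?{urlencode(params)}"


# Hypothetical repository address, used only for illustration.
print(list_records_url("https://example.org/oai", from_date="2006-10-01"))
```

A harvester would fetch that URL, parse the XML response and, if the response includes a resumption token, repeat the request with that token until the full set of records has been retrieved.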
Despite the massive amounts of data that the Internet Archive is storing, managing and preserving for posterity, Kahle said the secret to the organization’s success is keeping it simple.
“We don’t do anything that isn’t immediately obvious to college students with Linux on their dorm-room desktop,” Kahle said. “We are allergic to secret sauce. Everything we do is standardized and simple.”
Senior Writer Anne Chen can be reached at [email protected]