    Making Web Memories with the PetaBox

    Written by

    Anne Chen
    Published November 6, 2006

      Ten years in the making, the Internet Archive—an ambitious project to store and archive all the Web pages on the Internet along with other forms of digital content—houses more than 4 petabytes of data (1.6 petabytes of primary data) using standards-based modular hardware and open-source software.

      The organization's strategies for storing and managing that data can serve as best practices for any company trying to get its arms around an ever-expanding data load.

      Multiterabyte data centers are quite common these days, but petabyte-size data stores remain somewhat novel. To see firsthand how the Internet Archive is handling the storage of all its data, eWEEK Labs went on-site at the digital library's San Francisco data center.

      The Internet Archive had recently relocated its data center from offices in the Presidio of San Francisco. In fact, IT managers had just finished moving the last racks of servers into the new location two weeks prior to our visit in October.

      Much of the Internet Archive's success has to do with the way its IT managers approach the storage of large amounts of data, said Brewster Kahle, digital librarian and founder of the Internet Archive.

      “We are a petabyte-oriented facility, and the question is, How do we work and store petabytes of information that are constantly accessible to the outside world?” said Kahle during eWEEK Labs' visit. “The answer is to have two practical considerations: how to store this massive amount of data and how to preserve it. Preservation and access are part of our mandate.”

      The Internet Archive is a nonprofit organization founded in 1996 with the purpose of building an online library made up of saved Web sites. The Internet Archive today includes all manner of digital formats, including text, audio and video, as well as archived Web pages. The collection—which can be accessed at www.archive.org—is continually growing.

      Funding for the Internet Archive came originally from Kahle as a result of the sale of his company, WAIS (Wide Area Information Servers), to America Online. The Internet Archive is now funded by private foundations, government grants and in-kind donations from corporations.

      In the beginning, the Internet Archive used Storage Technology's StorageTek TimberWolf 9710 tape library with Quantum's DLT7000 drives, the combination of which could store as much as 70GB of data. (Storage Technology was acquired by Sun Microsystems in 2005.) However, while the tape library was cost-efficient, its disadvantage was a relatively slow access speed.

      In 2000, Internet Archive IT managers decided to switch from the StorageTek tape library to desktop machines from Hewlett-Packard. The desktops, each of which had four 160GB disk drives, sat on standard baker's racks purchased from Costco Wholesale.

      As the digital library grew, Internet Archive IT staffers began looking for cheaper ways to store data. In 2004, they developed a storage system called the PetaBox, which uses a combination of affordable standards-based parts and open-source software. The PetaBox also boasts low power consumption. The Internet Archive eventually spun off a company, Capricorn Technologies, to manufacture and sell the PetaBox technology.

      Today, the Internet Archive has about 2,000 PetaBox systems in its data center. The PetaBoxes are used to crawl the Internet and to store Web pages and other digital content. Each of 50 racks houses 40 1U (1.75-inch) PetaBox servers, most of which are armed with dual-core Opteron processors from Advanced Micro Devices. (Older PetaBoxes use ultra-low-voltage processors from Via Technologies.)

      Kahle said this approach helps keep costs down for the nonprofit organization. “We are built out of boxes just stacked up and used for different purposes,” Kahle said. “As a nonprofit, one of the biggest [cost] issues for us is in the building of the data center: the administration and the power. We're trying to keep all of these factors under control.”

      PetaBox systems currently being installed each have four 750GB perpendicular hard drives from Seagate Technology, providing up to 120TB of storage per rack. The Internet Archive adds about one new rack of PetaBoxes per month, according to John Berry, vice president of operations at the Internet Archive. Berry said he expects this trend to continue indefinitely.
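
The per-rack figure follows from the numbers quoted above. The short Python sketch below redoes that arithmetic, assuming decimal units (1TB = 1,000GB) and a rack filled with 40 four-drive systems. The function names are illustrative and do not come from the Internet Archive.

```python
# Back-of-the-envelope check of the rack capacity quoted in the article.
# Assumption: decimal units (1 TB = 1,000 GB), as drive vendors use.

DRIVES_PER_NODE = 4      # four drives per PetaBox (from the article)
DRIVE_GB = 750           # 750GB Seagate drives (from the article)
NODES_PER_RACK = 40      # 40 1U servers per rack (from the article)


def rack_capacity_tb(nodes=NODES_PER_RACK, drives=DRIVES_PER_NODE, drive_gb=DRIVE_GB):
    """Return the raw (unreplicated) capacity of one rack in terabytes."""
    return nodes * drives * drive_gb / 1000


def months_to_add(raw_tb, racks_per_month=1.0):
    """Months needed to add raw_tb of raw capacity at a given rack rate."""
    return raw_tb / (rack_capacity_tb() * racks_per_month)


if __name__ == "__main__":
    print(rack_capacity_tb())      # 120.0
    print(months_to_add(1000))     # months to add one raw petabyte
```

At one rack per month, this suggests a little over eight months to add a raw petabyte, before any allowance for the second copy the Archive keeps of its data.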

      Potential for Failure

      With somewhere between 8,000 and 9,000 disks currently spinning in all these systems, disk failure is common, with 2 to 3 percent of disks failing every year. There is no way to hot-swap the drives in the PetaBoxes, so servers with failed disks need to be pulled out of their respective racks. Kahle said this practice is tolerable at the Internet Archive because data isn't updated as quickly as it would need to be when dealing with mission-critical enterprise data.
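
To put those failure rates in operational terms, the sketch below multiplies out the quoted ranges. It assumes failures are spread evenly over the year and that each failed disk means one server pulled from its rack; both are simplifications for illustration, not figures from the Internet Archive.

```python
# Rough estimate of yearly drive replacements from the figures in the article.
# Assumptions: failures are spread evenly across the year, and every failed
# disk leads to exactly one server being pulled from its rack.

def expected_failures_per_year(disks, annual_failure_rate):
    """Expected number of failed disks per year."""
    return disks * annual_failure_rate


def failure_range(disk_range=(8000, 9000), rate_range=(0.02, 0.03)):
    """Return (low, high) expected yearly failures for the quoted ranges."""
    low = expected_failures_per_year(disk_range[0], rate_range[0])
    high = expected_failures_per_year(disk_range[1], rate_range[1])
    return low, high


if __name__ == "__main__":
    low, high = failure_range()
    print(f"{low:.0f} to {high:.0f} failed disks per year")
    print(f"about {low / 52:.1f} to {high / 52:.1f} server pulls per week")
```

The result is roughly 160 to 270 failed disks a year, or a handful of server pulls every week.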

      The Internet Archive, which has the equivalent of three full-time system administrators, uses Nagios, an enterprise-class open-source network monitoring application. Nagios monitors the status of more than 16,000 checks that run on the 800 machines that make up the Internet Archive's primary cluster.

      Nagios isn't the only open-source application used at the Internet Archive. The PetaBoxes run Canonical's Ubuntu distribution of Linux.

      The Internet Archive also makes use of two applications for the PetaBoxes: PetaBox Catalog manages thousands of tasks running across the cluster, balancing workloads and tracking job progress, and PetaBox Control Panel provides a Web interface for configuration and modification at the cluster, rack, node and partition levels.

      To Protect and Serve

      To protect data, the Internet Archive's IT managers tried RAID 5. However, they found it unable to scale and opted instead to use a JBOD (just a bunch of disks) configuration. For its archive, the organization uses pairs of machines and has two copies of everything on separate machines. The Internet Archive also has copies of all its data stored in other locations, including a data center in Amsterdam, the Netherlands, and the new Library of Alexandria, in Egypt.
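
The article does not say how the Internet Archive checks that the two machines in a pair stay identical. As one possible illustration, the Python sketch below compares two directory trees by SHA-256 digest. The function names and the idea of treating each machine as a local directory are assumptions made for the example.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def list_files(root):
    """Map each file's path relative to root onto its full path."""
    root = Path(root)
    return {p.relative_to(root): p for p in root.rglob("*") if p.is_file()}


def compare_copies(primary_dir, mirror_dir):
    """Compare two directory trees that are supposed to hold the same files.

    Hypothetical helper: each directory stands in for one machine of a pair.
    Returns relative paths that are missing on either side or that exist on
    both sides with different contents.
    """
    primary = list_files(primary_dir)
    mirror = list_files(mirror_dir)
    shared = primary.keys() & mirror.keys()
    return {
        "missing_on_mirror": sorted(str(r) for r in primary.keys() - mirror.keys()),
        "missing_on_primary": sorted(str(r) for r in mirror.keys() - primary.keys()),
        "mismatched": sorted(
            str(r) for r in shared if file_digest(primary[r]) != file_digest(mirror[r])
        ),
    }
```

A report with three empty lists means the two copies agree. Anything else identifies the files to re-copy from the healthy side.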

      “If there's one lesson we can take from the [destruction of the original] Library of Alexandria, it's don't have just one copy,” Kahle said. “We wanted to build the Internet Archive to ensure that we don't lose the great works of today. The only way we could do that is to have multiple copies and have multiple places in the world that we synchronize over the Internet.”

      The Internet Archive uses the Internet to keep its computing clusters in sync with one another. A protocol called OAI (Open Archives Initiative) is used for metadata harvesting. HTTP and FTP are also used to move batches of files.
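
Assuming the protocol meant here is OAI-PMH (the Open Archives Initiative Protocol for Metadata Harvesting), a harvester issues plain HTTP requests carrying a `verb` parameter and reads back XML. The Python sketch below builds a `ListIdentifiers` request and parses a response. The base URL and the sample response are made up for illustration, and nothing here reflects the Internet Archive's own harvesting code.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespace defined by OAI-PMH 2.0.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_identifiers_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build an OAI-PMH ListIdentifiers request URL."""
    if resumption_token:
        # resumptionToken is an exclusive argument: send it without metadataPrefix.
        params = {"verb": "ListIdentifiers", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix}
    return f"{base_url}?{urlencode(params)}"


def parse_list_identifiers(xml_text):
    """Return (identifiers, resumption_token) from a ListIdentifiers response."""
    root = ET.fromstring(xml_text)
    identifiers = [el.text for el in root.findall(".//oai:header/oai:identifier", OAI_NS)]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    token = token_el.text if token_el is not None and token_el.text else None
    return identifiers, token


# Made-up response body, used so the sketch runs without a network connection.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header><identifier>oai:example.org:item-1</identifier></header>
    <header><identifier>oai:example.org:item-2</identifier></header>
    <resumptionToken>page-2</resumptionToken>
  </ListIdentifiers>
</OAI-PMH>"""

if __name__ == "__main__":
    print(list_identifiers_url("http://example.org/oai"))
    print(parse_list_identifiers(SAMPLE_RESPONSE))
```

A harvester would repeat the request with each returned resumption token until the response no longer carries one, then fetch the files themselves over HTTP or FTP.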

      Despite the massive amounts of data that the Internet Archive is storing, managing and preserving for posterity, Kahle said the secret to the organization's success is keeping it simple.

      “We don't do anything that isn't immediately obvious to college students with Linux on their dorm-room desktop,” Kahle said. “We are allergic to secret sauce. Everything we do is standardized and simple.”

      Senior Writer Anne Chen can be reached at [email protected].

      Anne Chen
      As a senior writer for eWEEK Labs, Anne writes articles pertaining to IT professionals and the best practices for technology implementation. Anne covers the deployment issues and the business drivers related to technologies including databases, wireless, security and network operating systems. Anne joined eWeek in 1999 as a writer for eWeek's eBiz Strategies section before moving over to Labs in 2001. Prior to eWeek, she covered business and technology at the San Jose Mercury News and at the Contra Costa Times.
