How Google Stores All Its Data

By Chris Preimesberger  |  Posted 2009-08-17

Google stores virtually all of its data in two forms: RecordIO ("a sequential series of records, typically representing some sort of log," Quinlan said) and SSTables.
"SSTables are immutable, key/value pair, sorted tables with indexes on them," Quinlan said. "Those two data structures are fairly simple; there's no update in place. All the records are either sequential through the RecordIO or streaming through the SSTable. This helps us a lot when building these [new] reliable systems."
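To make the two structures concrete, here is a minimal in-memory sketch in Python. It is not Google's code or file format; the class names RecordLog and SortedTable and all of their methods are hypothetical, chosen only to illustrate an append-only record sequence and an immutable, sorted, indexed key/value table with no update in place.

```python
import bisect
from typing import Dict, Iterator, Optional, Tuple


class RecordLog:
    """Append-only sequence of records (hypothetical stand-in for the RecordIO idea)."""

    def __init__(self) -> None:
        self._records = []

    def append(self, record: bytes) -> None:
        # Records are only ever added at the end; nothing is rewritten.
        self._records.append(record)

    def __iter__(self) -> Iterator[bytes]:
        # Reads are sequential, in the order the records were written.
        return iter(self._records)


class SortedTable:
    """Immutable sorted key/value table (hypothetical stand-in for the SSTable idea)."""

    def __init__(self, items: Dict[str, bytes]) -> None:
        # Sort once at build time; the table is never modified afterwards.
        pairs = sorted(items.items())
        self._keys = tuple(k for k, _ in pairs)
        self._values = tuple(v for _, v in pairs)

    def get(self, key: str) -> Optional[bytes]:
        # The sorted key tuple serves as the index: binary search finds the slot.
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._values[i]
        return None

    def scan(self) -> Iterator[Tuple[str, bytes]]:
        # Streaming read of every pair in key order.
        return iter(zip(self._keys, self._values))
```

Neither class offers a way to change a record once it is written, which mirrors the "no update in place" property Quinlan describes.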

As for the semi-structured data, which is stored in BigTable's row/column/timestamp subsystem, the data sets that are constantly being updated are the URLs, the per-user data and the geographic locations.
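A row/column/timestamp layout can be pictured as a map from (row, column) to a list of timestamped values. The Python sketch below is only an illustration of that idea, not BigTable's API: the VersionedTable class, its put and get methods, and the assumption that an update adds a new version rather than overwriting the old one are all hypothetical.

```python
from collections import defaultdict
from typing import Optional


class VersionedTable:
    """Toy row/column/timestamp store; every name here is illustrative."""

    def __init__(self) -> None:
        # (row, column) -> list of (timestamp, value), kept sorted by timestamp.
        self._cells = defaultdict(list)

    def put(self, row: str, column: str, timestamp: int, value: bytes) -> None:
        # Assumption: an update adds a new version instead of replacing the old one.
        versions = self._cells[(row, column)]
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0])

    def get(self, row: str, column: str, at: Optional[int] = None) -> Optional[bytes]:
        # Return the newest version, or the newest one no later than `at`.
        versions = self._cells.get((row, column), [])
        for ts, value in reversed(versions):
            if at is None or ts <= at:
                return value
        return None
```

With a layout like this, a frequently updated URL row would simply accumulate page versions under increasing timestamps, and a reader could ask for either the latest version or the one that was current at an earlier time.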

"And the scale of these things is large, with the size of the Internet and the number of people using Google," Quinlan said, in an understatement. Google is storing billions of URLs, hundreds of millions of page versions (with an average size of 20KB per data file version), and hundreds of terabytes of satellite image data. Hundreds of millions of users use Google daily.

Once the data is stored in tables, Google breaks the tables up into chunks called tablets. "These are the basics that are distributed around our system," Quinlan said. "This is a simple model, and it's worked fairly effectively."
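The idea of cutting a table into tablets can be sketched as splitting sorted rows into contiguous ranges. The Python below is a simplified illustration under stated assumptions: the function names split_into_tablets and find_tablet are hypothetical, and splitting by a fixed row count is an assumption made for brevity, not a description of how Google decides where a tablet ends.

```python
import bisect
from typing import List, Tuple

Row = Tuple[str, bytes]


def split_into_tablets(rows: List[Row], max_rows: int) -> List[List[Row]]:
    """Sort rows by key and cut them into contiguous chunks of at most max_rows."""
    if max_rows < 1:
        raise ValueError("max_rows must be at least 1")
    ordered = sorted(rows)
    return [ordered[i:i + max_rows] for i in range(0, len(ordered), max_rows)]


def find_tablet(tablets: List[List[Row]], row_key: str) -> int:
    """Return the index of the tablet whose key range would hold row_key."""
    if not tablets:
        raise ValueError("no tablets to search")
    # Each tablet is identified by its last key; binary search over those keys.
    last_keys = [tablet[-1][0] for tablet in tablets]
    i = bisect.bisect_left(last_keys, row_key)
    # Keys beyond the final range fall into the last tablet.
    return min(i, len(tablets) - 1)
```

Because each chunk covers a contiguous key range, locating a row only requires knowing the boundary keys, and each chunk can be placed on a different machine.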

How the basic Google search system works: "A request comes in. We log it in GFS; it updates the storage. We then buffer it in memory in a sorted table. When that memory buffer fills up, we write that out as an SSTable; it's immutable data, it's locked down, we don't modify it.

"The request then reads through SSTables [to find the query answer]."
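The sequence Quinlan describes (log the request, buffer it in memory, write a full buffer out as an immutable sorted table, then read through the tables) can be sketched in a few lines of Python. This is a toy model, not Google's implementation: the TinyStore class, its buffer_limit parameter, the plain list standing in for the log in GFS, and the newest-first read order are all assumptions made for illustration.

```python
from typing import Optional


class TinyStore:
    """Toy log + memory buffer + immutable sorted tables; all names are illustrative."""

    def __init__(self, buffer_limit: int = 2) -> None:
        self.log = []        # stand-in for the log kept in GFS
        self.buffer = {}     # in-memory buffer (sorted only when written out)
        self.tables = []     # immutable sorted tables, oldest first
        self.buffer_limit = buffer_limit

    def write(self, key: str, value: bytes) -> None:
        self.log.append((key, value))   # 1. log the request
        self.buffer[key] = value        # 2. buffer it in memory
        if len(self.buffer) >= self.buffer_limit:
            self._flush()               # 3. buffer is full: write it out

    def _flush(self) -> None:
        # A tuple of sorted pairs cannot be modified after it is created.
        self.tables.append(tuple(sorted(self.buffer.items())))
        self.buffer = {}

    def read(self, key: str) -> Optional[bytes]:
        # Assumption: check the buffer first, then tables from newest to oldest.
        if key in self.buffer:
            return self.buffer[key]
        for table in reversed(self.tables):
            for k, v in table:          # linear scan keeps the sketch short
                if k == key:
                    return v
        return None
```

Note that a later write to an existing key never touches the older table; the newer value simply shadows it at read time, which is one way to get the effect of updates without modifying data in place.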

This is a fairly straightforward and simple process, Quinlan said. At the rate the Google search engine is used on a day-to-day basis, it has to be simple.

Future Directions for Google Storage

Scale remains the biggest issue. "Everything's getting bigger; we're growing exponentially. We're not quite an exabyte system today, but this is definitely in the not-too-distant future," Quinlan said. "I get blasé about petabytes now."

More automated operation is in the cards. "Our ability to hire people to run these systems is not growing exponentially, so we have to automate more and more. We want to bring what used to be done manually into the systems. We want to bring more and more history of information about what's going on in the system, to allow the system itself to diagnose slow machines, diagnose various problems and rectify them itself," Quinlan said.

Building these systems on a much more global basis is another of Quinlan's goals.

"We have many data centers across the world," he said. "On an application point of view, they all need to know exactly where the data is. They'll often have to do replication across data centers for availability, and they have to partition their users across these data centers. We're trying to bring that logic into the storage systems themselves."
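As a rough picture of the application-level logic Quinlan wants to move into the storage layer, the Python sketch below assigns each user a home data center and one or more replica sites. It is purely illustrative: the placement function, the hash-based scheme and the data center names are hypothetical, and nothing here is drawn from how Google actually partitions users.

```python
import hashlib
from typing import List


def placement(user_id: str, data_centers: List[str], replicas: int = 2) -> List[str]:
    """Pick `replicas` distinct data centers for a user; the first is the home site."""
    if not 1 <= replicas <= len(data_centers):
        raise ValueError("replicas must be between 1 and the number of data centers")
    # A stable hash keeps the assignment the same across processes and restarts.
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(data_centers)
    # Assumption: replicas go to the next sites in list order, wrapping around.
    return [data_centers[(start + i) % len(data_centers)] for i in range(replicas)]
```

Today every application would need its own version of a function like this; folding it into the storage system would let applications read and write without knowing where the data lives.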

So, as Caffeine search is being tested now, so is the new storage file system. Google hopes this will be one that is flexible and self-healing enough to be around for a while.


