Google Testing New Storage System for 'Caffeine'

Google stores virtually all of its data in two forms: RecordIO, "a sequential series of records, typically representing some sort of log," Quinlan said, and SSTables.
"SSTables are immutable, key/value pair, sorted tables with indexes on them," Quinlan said. "Those two data structures are fairly simple; there's no update in place. All the records are either sequential through the RecordIO or streaming through the SSTable. This helps us a lot when building these [new] reliable systems."
As for semi-structured data, stored in BigTable's row/column/timestamp model, the data sets that are constantly being updated include URLs, per-user data and geographic locations.
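A rough way to picture the row/column/timestamp model is a map from (row, column) to a list of timestamped versions, so an update adds a new version rather than modifying data in place. This Python sketch is an illustration under that assumption, not the actual system:

```python
from collections import defaultdict

class Table:
    """Sketch of BigTable's data model: each cell is addressed by
    (row key, column, timestamp), so writes add new timestamped
    versions instead of overwriting old ones."""

    def __init__(self):
        # row -> column -> list of (timestamp, value), newest first
        self._rows = defaultdict(lambda: defaultdict(list))

    def put(self, row, column, timestamp, value):
        cells = self._rows[row][column]
        cells.append((timestamp, value))
        cells.sort(reverse=True)               # newest version first

    def get(self, row, column):
        cells = self._rows[row][column]
        return cells[0][1] if cells else None  # latest version

t = Table()
t.put("com.example/index", "contents", 1, "<html>v1</html>")
t.put("com.example/index", "contents", 2, "<html>v2</html>")
print(t.get("com.example/index", "contents"))  # -> "<html>v2</html>"
```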
"And the scale of these things is large, with the size of the Internet and the number of people using Google," Quinlan said, in an understatement. Google is storing billions of URLs, hundreds of millions of page versions (with an average size of 20KB per data file version), and hundreds of terabytes of satellite image data. Hundreds of millions of users use Google daily.
Once data is stored in tables, Google breaks the tables up into chunks called tablets. "These are the basics that are distributed around our system," Quinlan said. "This is a simple model, and it's worked fairly effectively."
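The tablet idea can be shown with a short sketch: split a table's sorted row keys into contiguous ranges that can be handed out to different machines. The chunking policy and sizes below are invented for illustration; the talk does not describe Google's actual splitting logic.

```python
def split_into_tablets(sorted_row_keys, rows_per_tablet):
    """Split sorted row keys into contiguous (start, end) ranges,
    each of which could be served by a different machine."""
    tablets = []
    for i in range(0, len(sorted_row_keys), rows_per_tablet):
        chunk = sorted_row_keys[i:i + rows_per_tablet]
        tablets.append((chunk[0], chunk[-1]))  # (start row, end row)
    return tablets

rows = sorted(["com.a/", "com.b/", "com.c/", "com.d/", "com.e/"])
print(split_into_tablets(rows, 2))
# -> [('com.a/', 'com.b/'), ('com.c/', 'com.d/'), ('com.e/', 'com.e/')]
```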
How the basic Google search system works: "A request comes in. We log it in GFS; it updates the storage. We then buffer it in memory in a sorted table. When that memory buffer fills up, we write that out as an SSTable; it's immutable data, it's locked down, we don't modify it.
"The request then reads through SSTables [to find the query answer]."
This is a fairly straightforward and simple process, Quinlan said. At the rate the Google search engine is used on a day-to-day basis, it has to be simple.
Future Directions for Google Storage
Scale remains the biggest issue. "Everything's getting bigger; we're growing exponentially. We're not quite an exabyte system today, but this is definitely in the not-too-distant future," Quinlan said. "I get blasé about petabytes now."
More automated operation is in the cards. "Our ability to hire people to run these systems is not growing exponentially, so we have to automate more and more. We want to bring what used to be done manually into the systems. We want to bring more and more history of information about what's going on in the system, to allow the system itself to diagnose slow machines, diagnose various problems and rectify them itself," Quinlan said.
How to build these systems on a much more global basis is another Quinlan goal.
"We have many data centers across the world," he said. "On an application point of view, they all need to know exactly where the data is. They'll often have to do replication across data centers for availability, and they have to partition their users across these data centers. We're trying to bring that logic into the storage systems themselves."
As the Caffeine search system is being tested, so is the new storage file system. Google hopes it will be flexible and self-healing enough to be around for a while.
