Google Gives Lessons in Scale

At the recent Surge 2011 show, Google engineers share some of their experiences in scaling systems for large numbers of users.

BALTIMORE - Google engineers shared some of their knowledge on scaling systems, based on the successes experienced by the search giant.

At the Surge 2011 conference here, a trio of Google engineers laid out some of the reasons why Google has been successful in scaling the systems that support several hundred million queries per day. And who better to learn from? If you want to be cool with the ladies you want to have the moves like Jagger - at least the Jagger of old. If you want to learn to scale, you need to take a lesson from Google.

And in a session at Surge, the audience got just that. Maxwell Luebbe, a site reliability engineer at Google, discussed the process of running a frontend web service at Google. Luebbe said Google is a service-oriented system, so you can separate out all the various parts of Google. They are broken down into manageable components for fine-grained scaling and capacity management, he said.

Moreover, Google tries to put machines and data where users are, which nowadays happens to be everywhere, "so capacity planning is critical," Luebbe said. "Service availability is more important than server availability. So you need to be able to plan as if someone is going around smashing things with a hammer in your data center," he added, meaning that enterprises need to be able to reroute traffic quickly both inside and outside the data center.

Meanwhile, Google software engineer Jia Guo talked about her work on the design and optimization of the crawl engine, Google's largest-scale data processing system. Her interests include parallel programming, large-scale infrastructure and strategies for detecting duplicate content on the web. Building and operating a large-scale system that handles billions of documents around the web provides a number of interesting challenges, Guo said.

Guo said the Google File System (GFS), also known as Colossus, is a cluster software environment that runs on several thousand machines to manage thousands of active jobs, some of which feature one task and others that feature thousands of tasks. Bottom line: there is a massive amount of data in play.

That is where MapReduce comes in. MapReduce is a software framework introduced by Google in 2004 to support distributed computing on large data sets on clusters of computers.

"MapReduce is a simple programming model that applies to many large-scale compute problems," Guo said. It helps developers deal with "a lot of the messy stuff" associated with parallel programming, she said.
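
To give a feel for the programming model Guo described, here is a minimal single-process sketch of a MapReduce-style word count in Python. It is an illustration only, not Google's implementation: the function names (map_phase, shuffle, reduce_phase, word_count) are made up for this example, and a real framework would spread the map and reduce steps across many machines and handle failures, which is the "messy stuff" this toy version leaves out.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values seen for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence on a list of strings."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["the web is big", "the crawl is bigger"]))
```

The developer writes only the map and reduce functions; everything in between is the part a framework would take over.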

Moreover, Guo said enterprises looking to scale should design for scalability by doing such things as sending "canary" requests to test environments and so avoid crashing thousands of machines. Another strategy to consider is to implement multiple smaller units per machine so as to enable a faster recovery should a fault occur, she said.
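
The canary idea can be sketched in a few lines of Python. This is a hedged illustration under stated assumptions, not a description of Google's tooling: send_with_canary and its send callback are hypothetical names, and send is assumed to return True when a machine handles the request safely and False when it does not.

```python
def send_with_canary(request, machines, send, canary_count=1):
    """Try a request on a few canary machines before fanning out to the rest.

    `send(machine, request)` is a hypothetical callback assumed to return
    True on success and False on failure. Returns (ok, contacted), where
    `contacted` lists every machine that actually received the request.
    """
    canaries = machines[:canary_count]
    rest = machines[canary_count:]
    contacted = []

    # Phase 1: only the canaries see the request.
    for machine in canaries:
        contacted.append(machine)
        if not send(machine, request):
            # A canary failed, so stop before the request reaches the fleet.
            return False, contacted

    # Phase 2: the canaries survived, so fan out to everyone else.
    for machine in rest:
        contacted.append(machine)
        send(machine, request)
    return True, contacted


if __name__ == "__main__":
    fleet = ["m1", "m2", "m3", "m4"]
    poison = lambda machine, request: False  # simulates a request that crashes servers
    print(send_with_canary("bad query", fleet, poison))
```

With a request that fails everywhere, only the first machine is ever contacted; the other three never see it.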

Enterprise organizations also need to design for growth by trying to anticipate how system requirements might change over time, Guo said. However, enterprises should not design to scale infinitely, she said. Scaling from 10 times to 50 times the initial system is acceptable, but attempting to scale by 100 times typically requires a redesign, she said.

Following Guo, in a talk entitled "Solidifying the Cloud: How to back up the internet," Google site reliability engineer Raymond Blum said ensuring durability and integrity of user data is job one. Blum also laid out several classic lessons learned, including: redundancy does not bring recoverability, the backup process has to scale with data value, and if you haven't restored you haven't backed up.

Indeed, redundancy is not a backup, Blum said. The primary purpose of redundancy of data location is to support scalable processing, he said. It is not a guarantee of integrity or recoverability. And local copies do not protect against site outages.

"Diversity of storage technologies further guards against bugs taking out data," Blum said. Blum then discussed a case study from February 2011 where an outage caused a data loss for some Gmail customers and Google was able to recover the data within two days through tape backups the company had made. MapReduce also played a part in that restoration, he said.

Blum added that a backup is a known, stable operation, whereas a restore is "usually a surprise and is a non-maskable interrupt."
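
One way to act on the "if you haven't restored you haven't backed up" lesson is to make a trial restore part of every backup run. The Python sketch below is a small-scale illustration under simple assumptions (a single local file, a copy standing in for the backup medium); the function names are hypothetical and it does not represent Google's tape-based process.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def backup_and_verify(source, backup_dir):
    """Copy `source` into `backup_dir`, then prove the copy can be restored.

    The restore goes to a scratch directory and its checksum is compared
    with the original. Returns True only if the restored bytes match.
    """
    source = Path(source)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)

    # Backup step: here just a file copy standing in for the real medium.
    backup_path = backup_dir / source.name
    shutil.copy2(source, backup_path)

    # Restore step: read the backup back out to a separate location.
    with tempfile.TemporaryDirectory() as scratch:
        restored = Path(scratch) / source.name
        shutil.copy2(backup_path, restored)
        return sha256_of(restored) == sha256_of(source)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        original = Path(workdir) / "mailbox.txt"
        original.write_text("important user data")
        print(backup_and_verify(original, Path(workdir) / "backups"))
```

Running the restore routinely turns it from a surprise into a rehearsed operation.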

Blum also said organizations that plan to scale their operations need to plan to scale everything in parallel, including their software, infrastructure, hardware and processes.

"Design your systems to be parallel from the beginning," he said.