Google's Site Reliability Engineering practices allow a few dozen engineers to manage an enormous cloud infrastructure that serves 100 billion requests per day.
About three years ago, in the summer of 2013, Google moved its entire App Engine footprint for the U.S. from one region of the country to another.
The effort involved an initial transfer of several petabytes' worth of stored data between data centers, followed by the online replication of massive data volumes in near real time until the move was complete.
It marked the first time that Google had relocated its entire App Engine platform-as-a-service infrastructure for the U.S. on such a massive scale. Yet Google says the whole migration was accomplished without any service downtime for its customers and without requiring any scheduled maintenance period.
Enabling that massive transition was a set of formal engineering practices for running production systems that Google calls Site Reliability Engineering (SRE). It is a set of technology best practices that Google has developed and fine-tuned over the years.
Now the company is sharing how it manages its cloud infrastructure with the release of a new book titled Site Reliability Engineering: How Google Runs Production Systems.
On Google’s Cloud Platform blog, Chris Jones, a reliability engineer and one of the editors of the book, described SRE as essentially applying engineering and computer science principles to designing and developing large, distributed computing systems.
At its core is an approach that emphasizes technology standardization and automation to improve the performance and reliability of highly distributed systems. The approach, according to Jones, has allowed Google to operate an infrastructure serving millions of applications and more than 100 billion requests per day with just ‘dozens’ of reliability engineers.
Extensive standardization has allowed Google to deploy systems that all work in similar ways and therefore require fewer people to operate, because there are fewer complexities to overcome.
“Because there are SRE teams working with many of Google’s services, we’re able to extend the principle of standardization across products,” Jones said in his comments on the Cloud Platform blog. For instance, an SRE-built tool for deploying a new Gmail version might be generalized for other uses. “This means that each team doesn’t need to build its own way to deploy updates,” Jones said.
Automation, meanwhile, has allowed Google to minimize human involvement in tedious and potentially error-prone tasks such as building new system capacity or load balancing its networks at scale.
Combining systems engineering practices with software engineering skills has yielded technologies within Google that represent the best of both worlds, Jones said, pointing to the company’s Maglev software network load balancer as one example.
It was by applying SRE practices to the App Engine relocation that Google managed to accomplish the task without serious disruptions, Jones said.
The process began with Google shutting down one App Engine cluster and having it automatically fail over to the remaining backup clusters. The company had already created a copy of the U.S. region’s App Engine data store in the destination data center ahead of time. It then kept replicating data in near real time to keep the data store updated throughout the migration.
When the time came for Google to turn on App Engine at the new location, apps associated with specific clusters were migrated one cluster at a time from the backup systems to the new ones with all the required data already in place.
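The sequence Jones describes (fail one cluster over to backups, keep the destination data store in sync, then move apps to the new location one cluster at a time) can be sketched as a toy simulation. This is only an illustration under stated assumptions, not Google's tooling: the `Cluster` class, the `fail_over` and `migrate_region` functions, the `replica_in_sync` check, and all cluster and app names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical serving cluster; fields are illustrative only."""
    name: str
    apps: set = field(default_factory=set)
    serving: bool = True


def fail_over(cluster, backups):
    """Stop serving from `cluster` and spread its apps across the backups."""
    cluster.serving = False
    moved = set(cluster.apps)
    for i, app in enumerate(sorted(moved)):
        backups[i % len(backups)].apps.add(app)
    cluster.apps.clear()
    return moved


def migrate_region(old, backups, new, replica_in_sync):
    """Move apps from old clusters to new ones, one cluster at a time.

    `replica_in_sync` is an assumed callable that reports whether the
    destination data store has caught up with near-real-time replication.
    """
    log = []
    for src, dst in zip(old, new):
        moved = fail_over(src, backups)
        log.append(f"{src.name}: {len(moved)} apps now on backups")
        if not replica_in_sync():
            raise RuntimeError("destination data store is behind; stopping")
        # Pull this cluster's apps off the backups and onto the new cluster.
        for backup in backups:
            taken = backup.apps & moved
            backup.apps -= taken
            dst.apps |= taken
        dst.serving = True
        log.append(f"{dst.name}: serving {len(dst.apps)} apps")
    return log


# Example run with made-up names.
old = [Cluster("old-1", {"app-a", "app-b"}), Cluster("old-2", {"app-c"})]
backups = [Cluster("backup-1"), Cluster("backup-2")]
new = [Cluster("new-1", serving=False), Cluster("new-2", serving=False)]
steps = migrate_region(old, backups, new, replica_in_sync=lambda: True)
```

Because the apps always have somewhere to run (the old cluster, the backups, or the new cluster), no step in this sketch requires a maintenance window, which is the property the article attributes to the real migration.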
The migration was not entirely incident-free. The company, for instance, encountered problems with the storage infrastructure at the new location that degraded the performance of some Google App Engine services. Similarly, some application servers became overloaded, causing performance issues. But because of how SRE works, Google was able to minimize the impact on customers, Jones claimed.