Google said Nov. 14 that it has figured out why its DoubleClick advertising service failed on the morning of Nov. 12, disrupting the display of content on more than 55,000 Websites.
The huge Web services provider, in an email to eWEEK and other users, said it has settled on a new strategy to prevent a similar outage from happening again.
The ad-serving mechanism, which delivers millions of ads and targets them to specific sites, sustained a major outage that slowed ad delivery on, or blocked access to, 55,185 Websites, according to Dynatrace, an application performance management (APM) service provider. Those sites included eWEEK.com and other QuinStreet sites, such as CIO Insight, Baseline, IT Business Edge, Datamation and Enterprise Networking Planet.
Others affected were USA Today, the Wall Street Journal, Forbes, BBC.com and YouTube. The Guardian and the Enquirer news sites in the U.K. also were hit.
Google reported that the outage lasted just under two hours, from 5:45 a.m. to 7:31 a.m. PST.
In some cases, such as eWEEK's, the failure to deliver ads blocked sections of content on the site's home and article pages from rendering correctly in browsers. In other cases, no ads were served, and sites loaded faster than normal, to the delight of some readers.
"This was the first DFP (DoubleClick for Publishers) outage of this significance for many years and we take it very seriously," Google executives Neal Mohan (vice president, Display and Video Advertising) and Scott Silver (vice president, Engineering) wrote in a memo to users. "Our team's first priority was restoring service and making sure the immediate issue would not recur. We are writing now to explain what happened and what we are doing to protect against future incidents."
The details, according to Mohan and Silver, were:
- "The DFP ad server relies on an internal service that began degrading in performance. This caused a cascading failure on DFP ad servers, leading to the outage."
- "We designed our systems to gracefully handle performance degradation from dependent services. However, due to a misconfiguration, we were unable to prevent the outage."
- "To restore ad serving and prevent cascading failures, we restarted the services by provisioning additional resources."
- "We reproduced the failure in a test by degrading the availability of the internal service, proving the misconfiguration caused the cascading failures. We have since rolled out a fix to the configuration globally."
- "We are conducting a complete review of all our processes and production configurations to prevent this from happening again."
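Google did not disclose how its ad servers are meant to handle a degraded dependency. One common way to keep a slow internal service from dragging down its callers is the circuit-breaker pattern, sketched below in Python. This is a generic illustration, not Google's implementation; the class, its parameters and the `degraded_service` and `house_ad` helpers are all hypothetical names chosen for the example.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Stops calling a failing dependency so its failures do not pile up in callers."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before retrying
        self._clock = clock                         # injectable so tests can control time
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def is_open(self) -> bool:
        """True while calls to the dependency are being short-circuited."""
        if self._opened_at is None:
            return False
        # Once the cool-down has elapsed, calls are let through again as a trial.
        return self._clock() - self._opened_at < self.reset_timeout

    def call(self, operation: Callable[[], T], fallback: Callable[[], T]) -> T:
        """Run operation, or return fallback() if the dependency is unhealthy."""
        if self.is_open:
            return fallback()
        try:
            result = operation()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the breaker
            return fallback()
        # A success closes the breaker and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result


def degraded_service() -> str:
    """Hypothetical stand-in for an internal service that has become too slow."""
    raise TimeoutError("internal service did not answer in time")


def house_ad() -> str:
    """Hypothetical fallback: serve a default ad rather than block the page."""
    return "house-ad"


if __name__ == "__main__":
    breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10.0)
    for _ in range(3):
        print(breaker.call(degraded_service, house_ad), breaker.is_open)
```

In this sketch the thresholds are configuration values, which echoes the point in the memo: a safeguard of this kind only helps if it is configured correctly, and the way to check that is to degrade the dependency in a test, as Google says it did.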
Data from Dynatrace indicated that sites in almost every industry were affected by the outage, including travel, news, entertainment and retail.
"It's rare for an organization of this maturity to suffer a major outage. These people are really good," David Jones, director of sales and APM evangelist at Dynatrace, told eWEEK. "We've seen nothing of this scope in recent memory."