Microsoft Azure Leap Year Glitch: Key Lessons Learned

By Chris Preimesberger  |  Posted 2012-03-01

Analysis: The bottom line here is very simple: Each enterprise needs to manage its own system as if it were all on-premises -- including all VPN networks, remote offices and devices, clouds and/or cloud services within it.

Nothing besmirches the reputation of cloud services more than a major outage like the one Amazon EC2 suffered last year and the one a red-faced Microsoft endured on Leap Year Day, Feb. 29.

Bad guys hacking into a system can happen to anybody, cloud or no cloud. You secure as best you can for something like that. But a total outage as the fault of a cloud application provider is another thing entirely.

Microsoft confirmed late Feb. 29 that a service outage that affected its Azure cloud computing service was caused by a Leap Year bug. The outage apparently was triggered by a key server in Ireland housing a certificate that expired at midnight on Feb. 28.

That electronic control document hadn't taken into account the extra day in the month of February the Western calendar adds every four years. It was simple human error, the single most common cause of computer errors.

Cloud-System Domino Effect

When the clocks struck midnight, things quickly got janky, and a cloud-system domino effect took charge. A large number of Western Hemisphere sites and the U.K. government's G-Cloud CloudStore were among the many stopped cold by the outage. Microsoft has been retracing its steps in finding out what exactly happened and hasn't said very much yet, although it did report in an Azure team blog that the problem has "mostly" been fixed.

"The issue was quickly triaged, and it was determined to be caused by a software bug," Bill Laing, corporate vice president of Microsoft's Server and Cloud, wrote in a Feb. 29 posting on the Windows Azure Team Blog. "While final root-cause analysis is in progress, this issue appears to be due to a time calculation that was incorrect for the leap year."
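Microsoft had not published its root-cause analysis at the time of writing, so the following is only a sketch of one common way a leap-year time calculation can go wrong, not a reconstruction of the Azure code. It assumes a validity date computed by adding one to the year; the function names are hypothetical.

```python
from datetime import date


def naive_one_year_later(d: date) -> date:
    # Hypothetical buggy helper: bumps only the year.
    # Raises ValueError when d is Feb. 29, because the
    # following year has no such day.
    return d.replace(year=d.year + 1)


def safe_one_year_later(d: date) -> date:
    # Hypothetical fix: fall back to Feb. 28 when the
    # target year is not a leap year.
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        return d.replace(year=d.year + 1, day=28)


if __name__ == "__main__":
    leap_day = date(2012, 2, 29)
    try:
        naive_one_year_later(leap_day)
    except ValueError as err:
        print("naive calculation failed:", err)
    print("safe calculation:", safe_one_year_later(leap_day))
```

The naive version works on every other day of the four-year cycle, which is why a bug of this kind can sit unnoticed until Feb. 29 arrives.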

Microsoft engineers created a workaround, while still dealing with issues affecting some subregions and customers. According to the Windows Azure Service Dashboard, virtually all regions were back up and running by March 1, with the exception of an alert for the Windows Azure Compute in the South-Central U.S. region; that alert, posted the morning of Feb. 29, suggested some issue with incoming traffic. 

"This is a classic computer science problem," Andres Rodriguez, CEO and founder of cloud gateway provider Nasuni, told eWEEK. Nasuni, a cloud storage front end, uses Azure, Amazon S3, Rackspace and other cloud storage providers as targets for its clients.

"It was a Leap Year problem. The dates were misadjusted. They did not factor in the Leap Year day (Feb. 29). When things start in Ireland, they're starting at GMT zero, and for the 29th of February, they were pointing at it like crazy. There was probably smoke coming out of that hall, like crazy."

Rodriguez reminded eWEEK readers that only the compute layer of the Azure cloud crashed, and that the storage service portion -- of which Nasuni itself is a customer -- was not affected. Nasuni's storage service is redundant across multiple cloud systems, so if one goes down, data is not affected.

In fact, Rodriguez said, IT managers might be remiss if they don't consider replicating their critical business data on stacks in at least two cloud service providers -- for the very reason Azure illustrated on Feb. 29.
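As a rough illustration of that two-provider idea, the sketch below writes each object to every provider and reads from the first one that can answer. It is an assumption-laden toy, not Nasuni's design: `InMemoryStore` and `ReplicatedStore` are hypothetical classes standing in for real cloud storage clients.

```python
from typing import Dict, List


class InMemoryStore:
    """Hypothetical stand-in for one cloud provider's storage service."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.available = True  # flip to False to simulate an outage
        self._blobs: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        if not self.available:
            raise ConnectionError(f"{self.name} is unavailable")
        self._blobs[key] = data

    def get(self, key: str) -> bytes:
        if not self.available:
            raise ConnectionError(f"{self.name} is unavailable")
        return self._blobs[key]


class ReplicatedStore:
    """Writes to all providers; reads from the first that can answer."""

    def __init__(self, providers: List[InMemoryStore]) -> None:
        if len(providers) < 2:
            raise ValueError("need at least two providers")
        self.providers = providers

    def put(self, key: str, data: bytes) -> int:
        # Returns how many providers accepted the write.
        written = 0
        for provider in self.providers:
            try:
                provider.put(key, data)
                written += 1
            except ConnectionError:
                continue
        if written == 0:
            raise ConnectionError("no provider accepted the write")
        return written

    def get(self, key: str) -> bytes:
        # Skip providers that are down or missed the write.
        for provider in self.providers:
            try:
                return provider.get(key)
            except (ConnectionError, KeyError):
                continue
        raise LookupError(f"no provider could return {key!r}")


if __name__ == "__main__":
    a, b = InMemoryStore("provider-a"), InMemoryStore("provider-b")
    store = ReplicatedStore([a, b])
    store.put("report", b"quarterly numbers")
    a.available = False  # simulate one provider going dark
    print(store.get("report"))
```

A production version would also need to repair replicas that missed writes during an outage; the sketch leaves that out.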

A Reason to Revisit the Big Picture

Soon, Microsoft will be fully back up and running, and the world that runs on Azure will get back to work. But there is cause to stop and consider the bigger picture.

We enjoy innumerable benefits of IT in this digital device-crazy world. But we also need to remember that there are many Achilles heels in data systems that can be directly affected by hackers, environmental events, power outages, sunspots, human error -- the list is a long one.

As time moves on, we're getting better at finding those holes and plugging them. But the fact is, we probably will never completely solve even one-quarter of all the security risks inherent in IT systems because there are simply too many variables -- and humans -- involved.

The bottom line here is very simple, but it's taking a while for many people to learn it: Each enterprise needs to manage its own system as if it were all on-premises -- including all VPN networks, remote offices and devices, clouds and/or cloud services within it.

"The first thing to understand [about events like this] is that this changes nothing," Andi Mann, longtime storage industry analyst who's currently serving as chief cloud strategy guru at CA Technologies, told eWEEK after the Amazon outage in April 2011. The same applies to Microsoft's boo-boo of Feb. 29.

"Cloud will have downtime -- it's a fundamental issue. But you need to be ready for downtime, whether it's your own infrastructure or cloud infrastructure. You need to understand what the risk is. It's all just about risk management."

Rodriguez said that "these cloud providers have humongous data centers, but your own application in that tremendous data center still has to be written to handle a collapse of the compute layer in that data center. You cannot hope that the cloud provider is going to do that for you."

eWEEK Senior Writer Nick Kolakowski contributed to this article. Chris Preimesberger is eWEEK's Editor of Features and Analysis. Twitter: editingwhiz

Chris Preimesberger was named Editor-in-Chief of Features & Analysis at eWEEK in November 2011. Previously he served eWEEK as Senior Writer, covering a range of IT sectors that include data center systems, cloud computing, storage, virtualization, green IT, e-discovery and IT governance. His blog, Storage Station, is considered a go-to information source. Chris won a national Folio Award for magazine writing in November 2011 for a cover story on and CEO-founder Marc Benioff, and he has served as a judge for the SIIA Codie Awards since 2005. In previous IT journalism, Chris was a founding editor of both IT Manager's Journal and and was managing editor of Software Development magazine. His diverse resume also includes: sportswriter for the Los Angeles Daily News, covering NCAA and NBA basketball, television critic for the Palo Alto Times Tribune, and Sports Information Director at Stanford University. He has served as a correspondent for The Associated Press, covering Stanford and NCAA tournament basketball, since 1983. He has covered a number of major events, including the 1984 Democratic National Convention, a Presidential press conference at the White House in 1993, the Emmy Awards (three times), two Rose Bowls, the Fiesta Bowl, several NCAA men's and women's basketball tournaments, a Formula One Grand Prix auto race, a heavyweight boxing championship bout (Ali vs. Spinks, 1978), and the 1985 Super Bowl. A 1975 graduate of Pepperdine University in Malibu, Calif., Chris has won more than a dozen regional and national awards for his work. He and his wife, Rebecca, have four children and reside in Redwood City, Calif.
