More details have emerged about the troubles with Exchange Online, Microsoft’s cloud-based business email, calendar and contact management platform. The service suffered an extended outage on Tuesday, June 24 that left some customers without one of their most fundamental means of communication and collaboration.
IT managers flooded the Office 365 support forums to seek answers and vent their frustration as work ground to a halt at their offices for several hours. The Service Health Dashboard, which administrators use to keep tabs on their cloud subscriptions, failed to properly report any issues.
Microsoft restored the service after eight hours. However, the experience, combined with a Lync Online outage the day before, raised concerns about migrating critical business services to the cloud.
In the aftermath of the major service disruptions, Rajesh Jha, corporate vice president of Office 365 Engineering, turned to the company’s support forums to offer a mea culpa. “First, I want to apologize on behalf of the Office 365 team for the impact and inconvenience this has caused,” he wrote. “Email and real-time communications are critical to your business, and my team and I fully recognize our accountability and responsibility as your partner and service provider,” he added.
Jha acknowledged that the tools customers use to monitor their Microsoft cloud services were not up to par. He explained that his company “also experienced a problem with our Service Health Dashboard (SHD) publishing process, meaning not all impacted customers were notified in a timely way, which we realize was frustrating and this has since been addressed.”
Exchange Online’s troubles were triggered by “an intermittent failure in a directory role that caused a directory partition to stop responding to authentication requests,” explained Jha. “This caused a small set of customers to lose email access.”
Claiming that the damage was “contained to a small set of customers,” he said that the “unique nature” of the flaw caused a prolonged recovery time. Exacerbating the problem, the failure sparked “an unexpected issue in the broader mail delivery system due to a previously unknown code flaw leading to mail flow delays for a larger set of customers.”
Jha indicated that his team has not only fixed the underlying issue but is also updating its systems to prevent a repeat of the problems. “In addition to fixing the root cause trigger, we are working on further layers of hardening for this pattern,” he wrote.
Lync Online’s issues started small and quickly escalated. The service was hit with “a brief loss of client connectivity in our North America data centers due to external network failures,” said Jha. While the problem was solved in mere minutes, an “ensuing traffic spike caused several network elements to get overloaded,” cutting off some customers’ access to Lync for hours.
Microsoft says it will learn from the experience and is working to restore confidence in its cloud services. “While we have fixed the root causes of the issues, we will learn from this experience and continue improving our proactive monitoring, prevention, recovery and defense-in-depth systems,” Jha said.