Router Crashes Trigger Major Southwest IT System Failure

The system outage knocked the carrier's Website offline, causing major flight delays and long lines at dozens of airports around the country.

Southwest Airlines and its IT staff had a pretty grim 24 hours July 20 and 21 after the airline suffered a disastrous IT trifecta: routers and Web servers crashing, backup systems not operating and, finally, even the disaster recovery system failing to work.

It doesn't get more nightmarish than that for an IT manager in a real-time production scenario.

The Dallas-based airline canceled about 700 of its 3,900 daily flights July 20 and delayed at least 1,300 others due to the technical mishaps.

It then canceled another 450 flights July 21 as it struggled to recover from the outage that knocked the carrier's Website offline and continued to delay flights and cause long lines at dozens of airports around the country.

Southwest social media staff members first acknowledged the outage in a Twitter post to customers just before noon July 20, saying, "We are aware and investigating current issues with our systems."

Shortly thereafter, thousands of travelers from across the country took to social media to complain of delayed departures and arrivals and of being unable to check in for flights using Southwest's Website.

Customers also reported crowded airports because they were not able to check in for flights or print out boarding passes at airport kiosks. Customers griped that without passes, they were unable to go through security.

Southwest Chief Operating Officer Mike Van de Ven said that when a router failed July 20, it triggered other crashes, slowing the airline's systems so much that other functions became overloaded and froze up. Router failures aren't uncommon, Van de Ven said, but this outage was unusually severe. It took IT staff about 12 hours to restore most systems to working order, Southwest said.

The problems were compounded when backup systems and the disaster recovery deployments also failed to work, the airline said. Hackers were not believed to be at fault, at least according to early reports on July 21.

In total, the carrier has canceled about 1,100 flights and delayed hundreds of others since the technical problem started midday July 20.

Southwest said in a media statement that most systems had recovered and were functioning by midday July 21. After a day of recovery, CEO Gary Kelly said in the statement, the airline hoped to be operating normally by July 22.

At the height of the problem late on July 20, airline employees were forced to check in passengers with paper records and could not take new reservations. Kelly estimated that the 24-hour problem cycle probably cost Southwest between $5 million and $10 million in ticket sales.

"We have significant redundancies built into our mission-critical systems, and those redundancies did not work," Kelly said in a conference call to reporters. "We need to understand why and make sure that that doesn't happen again."

Robert Jordan, Southwest's chief commercial officer, said every customer affected on Wednesday or Thursday would be contacted. The company extended by a week a fare sale that had been scheduled to end July 21.

Chris Preimesberger

Chris J. Preimesberger is Editor-in-Chief of eWEEK and responsible for all the publication's coverage. In his 15 years and more than 4,000 articles at eWEEK, he has distinguished himself in reporting...