At some point, the system-wide computer outage that took all of British Airways out of action starting on May 27 will provide a valuable lesson in maintaining critical systems. But for now, British Airways' IT staff is investigating why the systems failed so it can decide how to prevent it from happening again.
“It was not an IT issue, it was a power issue,” a British Airways spokesperson told eWEEK in an email. “There was a total loss of power. The power then returned in an uncontrolled way causing physical damage to the IT servers.” “We know what happened,” the spokesperson added, “we are investigating why it happened.”
The shutdown stranded as many as 300,000 passengers and forced the cancellation of hundreds of British Airways flights during a long holiday weekend in both the U.S. and the United Kingdom.
Previous statements from British Airways said the power event damaged both the main data system and the backup system, which suggests that the two were likely co-located.
Statements from the power companies serving Heathrow airport and the offices surrounding it, which included the BA data centers, indicate that there was no power surge from their end. That suggests that power conditioning equipment inside BA’s Heathrow data centers was probably involved.
Electric power is typically sent to a major data center from the power distribution grid to an automatic transfer switch, then to some switchgear and then to uninterruptible power supplies (UPS). The automatic transfer switch is designed to switch automatically from commercial power to local power sources, which are usually diesel generators in an N+1 configuration, meaning one more generator than is needed to carry the load.
The job of the UPS is to condition the power and to provide backup power while the load is being transferred between commercial power and locally generated power. The UPS delivers conditioned power to the IT center power distribution unit (PDU), which then sends that current to individual servers and other related equipment.
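For readers who think in code, the power path described above can be sketched in a few lines of Python. This is only an illustration of the general arrangement, not a model of BA's facility: the function names, the stage labels and the timing figures are all assumptions made for the example, including the simplification that the UPS battery must carry the full load for as long as the generators take to start.

```python
# Stages in the order described above; labels are illustrative,
# not a description of BA's actual layout.
POWER_PATH = (
    "power source (grid or generators)",
    "automatic transfer switch",
    "switchgear",
    "UPS",
    "PDU",
    "servers",
)


def select_source(utility_ok: bool, generators_ok: bool) -> str:
    """Mimic the transfer switch: prefer commercial power, else generators."""
    if utility_ok:
        return "utility"
    if generators_ok:
        return "generator"
    return "none"


def load_stays_up(utility_ok: bool, generators_ok: bool,
                  ups_runtime_s: float, generator_start_s: float) -> bool:
    """Return True if the servers keep running through a source change.

    Assumption: the UPS battery has to carry the load for the whole time
    the generators take to start and accept it.
    """
    source = select_source(utility_ok, generators_ok)
    if source == "utility":
        return True
    if source == "generator":
        return ups_runtime_s >= generator_start_s
    return False


if __name__ == "__main__":
    # Hypothetical numbers: 10 minutes of battery, 30 seconds to start generators.
    print(load_stays_up(False, True, ups_runtime_s=600, generator_start_s=30))
```

The sketch covers only the orderly case in which one source hands off to another. It says nothing about what happens when power returns "in an uncontrolled way," which is the part of the BA incident that remains unexplained.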
While BA is still investigating how a power surge managed to take out its entire data system, the most likely culprit appears to be the PDU, which has the job of reducing the voltage to the level used by the servers.
This process involves some big transformers, and when a transformer fails, it can be pretty destructive. But at this point we don’t know for sure that transformers were involved. It could also be anything from a squirrel in the UPS (which has happened to me) to a terrorist attack.
What was unusual about the BA outage is that it affected all of BA’s data systems, including its websites, the reservation system and all of the flight planning and internal communications systems. This meant that BA was unable to communicate with passengers, with its other offices and with the public by any means other than the telephone.