According to a passenger who was traveling on BA when the data system crash happened, the aircraft crew and the staff at the airport had no idea what was happening until considerably later. This passenger's plane was actually taxiing for takeoff at Germany's Frankfurt Airport when the flight was brought to a halt. “We were told there was a delay,” explained Esther Schindler of Phoenix, AZ. “They were waiting to space out planes.”
Schindler, a veteran freelance technology writer, said there was confusion among the crew because no information was available to them or at the gate. The plane eventually returned to its original boarding gate. She noticed that the information board inside the boarding area still listed the flight as boarding, indicating that it hadn’t been updated by the central system.
“They didn’t know what the IT crash was about,” Schindler said, “all the systems that would be communicating with them were down.”
“It took a little while but they confirmed that the entire IT system had crashed, and they told us it was worldwide,” she said.
The passenger experiences and the information from the airline make a few things clear. First, BA was consolidating every function into a single computer system, regardless of whether that made sense, which ensured that if anything went wrong with that system, everything would shut down.
Second, the BA backup system that should have provided redundancy wasn’t truly redundant; otherwise, the event that took out the main system would not have been able to take out the backup system at the same time. Normal practice for backup data systems requires, at the very least, that the backup be physically separated from the main system by enough distance that the same catastrophe can’t affect them both.
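As a rough illustration of that separation rule, a capacity-planning check might verify that the primary and backup sites sit far enough apart to survive a shared disaster. The coordinates and the 50 km threshold below are hypothetical examples, not BA's actual sites or policy:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Hypothetical site coordinates -- not BA's real data centers.
primary = (51.470, -0.454)   # near Heathrow
backup = (53.480, -2.242)    # Manchester area

MIN_SEPARATION_KM = 50  # assumed policy threshold

distance = haversine_km(*primary, *backup)
print(f"Site separation: {distance:.0f} km")
if distance < MIN_SEPARATION_KM:
    raise RuntimeError("Backup site is too close to survive a shared disaster")
```

A pair of sites a couple of hundred kilometers apart would pass this check; a "backup" in the building next door, which a single fire or power event could take down along with the primary, would not.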
But statements by the airline say that the event that damaged the data system also damaged the backup systems, so the redundancy was clearly compromised. Normally, the power handling and conditioning equipment is also built with redundancy, so that even if something major, such as a power distribution unit (PDU), fails, the computers continue to receive power.
Normally, best practices for such a data center call for power to enter the building from two separate commercial grids, usually at opposite ends of the building, to prevent damage such as what befell BA. Likewise, the standby generators can feed power through either service entrance, and an N+1 generator configuration means that the full load can still be supplied even if one generator fails.
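The N+1 idea is simple to express in code: the plant must carry the full load even with the single largest generator out of service. The generator ratings and load figures below are hypothetical, chosen only to show the arithmetic:

```python
def can_carry_load(generator_kw, load_kw):
    """N+1 check: the site must carry the full load even with the
    single largest generator out of service."""
    if not generator_kw:
        return False
    surviving = sum(generator_kw) - max(generator_kw)
    return surviving >= load_kw

# Hypothetical plant: four 500 kW generators feeding the data center.
gensets = [500, 500, 500, 500]
print(can_carry_load(gensets, 1400))  # True: 1,500 kW remains with one unit down
print(can_carry_load(gensets, 1600))  # False: the N+1 margin is exhausted
```

The point of the check is that redundancy is measured against the worst single failure, not against total installed capacity; a site whose load creeps up past the N+1 line is one breaker trip away from going dark.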
What will we actually find? We may eventually learn that the data center was poorly designed and had no effective redundancy. Furthermore, there was apparently over-reliance on a single data system, which suggests an organization that was being penny-wise and pound-foolish.
BA needs to answer to its customers and shareholders for how a probably avoidable IT outage brought the company to a standstill over the Memorial Day weekend.