The Federal Aviation Administration’s flight plan IT network, which went down for about 2.5 hours Aug. 26 and fouled up the takeoff plans of thousands of travelers at more than 40 airports across the country, was back up and running Aug. 27.
IT staff were still troubleshooting it today in Hampton, Ga., where the agency’s primary data center is located.
But how much longer will it keep running? The FAA’s antiquated system consists of two 20-year-old redundant mainframe configurations (the primary one in Georgia, the backup in Utah) that apparently are hanging on for dear life until reinforcements arrive in the form of a new, state-of-the-art system this winter.
It is intriguing to note that the company that custom-built the mainframes for the FAA has been out of business for 20 years. More on that shortly.
The Crash
“What happened yesterday at 1:25 p.m. [EDT] was that during a normal daily software load something was corrupted in a file, and that brought [the] system down in Atlanta,” FAA spokesperson Paul Takemoto told me.
“Basically, all the flight plans that were in the system were kicked out. For aircraft already in the air, or [that] had just been pushed back from the gate, they had no problems. But for all other aircraft, it meant delays.”
Matters got worse when operations were shifted to the backup facility in Salt Lake City, which is designed to handle 125 percent of the overall load, Takemoto said.
“It was far more than that [125 percent], because airlines were refiling their flight plans manually. They just kept hitting the ‘Enter’ button. So the queues immediately became huge,” Takemoto said. “On top of that, it happened right during a peak time as traffic was building. Salt Lake City just couldn’t keep up.”
It was a “perfect storm” combination of all these flight plans being refiled plus a congested time of day and a creaky old IT system that caused the airport backups, Takemoto said.
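The pile-up Takemoto describes, in which the same flight plans were submitted again and again, is the kind of load that a queue keyed on a unique identifier can absorb. The following is a minimal Python sketch of that general technique, not a description of how NADIN or its replacement works. The class name DedupQueue and the flight identifiers are hypothetical, invented for illustration.

```python
from collections import OrderedDict


class DedupQueue:
    """FIFO queue that holds at most one pending entry per key.

    Hypothetical illustration only; not based on any FAA system.
    """

    def __init__(self):
        # OrderedDict remembers the order in which keys were first inserted.
        self._pending = OrderedDict()

    def submit(self, key, payload):
        """Queue a payload under key.

        A resubmission overwrites the pending payload but keeps the
        original place in line. Returns True only for a new key.
        """
        is_new = key not in self._pending
        self._pending[key] = payload
        return is_new

    def pop(self):
        """Remove and return the oldest (key, payload) pair."""
        return self._pending.popitem(last=False)

    def __len__(self):
        return len(self._pending)


q = DedupQueue()
# The same (made-up) flight plan is filed five times in a row.
for _ in range(5):
    q.submit("FLIGHT-123", {"route": "A-B"})
q.submit("FLIGHT-456", {"route": "C-D"})
print(len(q))  # 2
```

With this approach the queue grows with the number of distinct flights rather than the number of times someone presses Enter, at the cost of requiring every submission to carry a stable identifier.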
The FAA then instructed the airlines not to file any flight plans for a specified length of time, and that left many passengers sitting and waiting in terminals. By around 4 p.m., Takemoto said, things started clearing up and the system came back to life.
An Antiquated System
The system itself is called NADIN (National Airspace Data Interchange Network). It was designed by North American Philips for the FAA in the early 1980s. The two Philips DS714/81 mainframes became operational in January 1988. The company went out of business later that year, and the FAA bought out the entire parts inventory.
To its credit, the system has been running 24/7 for a long, long time: since the tail end of the Reagan administration, in fact. But the time has come for it to be replaced, as underscored by the shutdown this week.
By the end of 2008, Takemoto said, the entire system will be replaced by a new, state-of-the-art system: new hardware, software, everything. “It’ll have a memory that will be exponentially larger than this one,” Takemoto said. “It’ll be able to handle spikes like the one we had yesterday.”
Kenny Van Zant, chief product strategist at SolarWinds, a network management software maker, told me that most network outages are not caused by corrupted files.
“If you look at the root causes of most network outages, north of 70 percent of them are caused by configuration errors by humans,” Van Zant told me. “Computers fail a whole lot less often than the humans punching things into computers fail. Network engineers, as smart as they are, are not immune from that.”
Details about the FAA’s proprietary network configuration software setup were not made available.
Detection and Monitoring on the Way
SolarWinds has a new configuration management product called Orion NCM (Network Configuration Manager) v5, which integrates new features into the previous Cirrus Configuration Manager product. Orion alerts network managers, via a Web-based user interface for handheld devices, cell phones and laptops, when any change in the network structure occurs, so that outages can be handled quickly.
Jim Battenberg, director of product marketing for Neverfail, a disaster recovery software vendor, told me that his software asynchronously replicates all the data between the two environments and monitors the network 24/7.
“So we would detect if the network goes down, if the server goes down, if there’s problems with the hardware, if the processor is being hit too hard, or what have you,” Battenberg said. “We detect everything within the ecosystem. And if there’s a problem, we can a) fix some things ourselves, or b) fail over to the secondary system. And we do that automatically.”
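The automatic failover Battenberg describes generally rests on repeated health checks: the monitor tolerates an occasional missed check, but switches to the secondary system after several consecutive failures. Below is a minimal Python sketch of that decision logic. It is not Neverfail's implementation; the class name FailoverMonitor and the threshold of three missed checks are assumptions made for illustration.

```python
class FailoverMonitor:
    """Tracks health checks and decides which system should be active.

    Hypothetical sketch; the threshold and names are assumptions.
    """

    def __init__(self, max_missed=3):
        self.max_missed = max_missed  # consecutive failures tolerated
        self.missed = 0
        self.active = "primary"

    def record_check(self, primary_healthy):
        """Record one health-check result and return the active system."""
        if self.active != "primary":
            # Already failed over; this sketch never fails back on its own.
            return self.active
        if primary_healthy:
            self.missed = 0  # any success resets the failure count
        else:
            self.missed += 1
            if self.missed >= self.max_missed:
                self.active = "secondary"
        return self.active


monitor = FailoverMonitor(max_missed=3)
checks = [True, False, False, True, False, False, False, True]
results = [monitor.record_check(ok) for ok in checks]
print(results[-1])  # secondary
```

Requiring several consecutive failures keeps a single dropped check from triggering a switch, while staying on the secondary after failover avoids flapping between the two systems if the primary recovers only briefly.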