Corrupt File Brought Down FAA's Antiquated IT System
The Federal Aviation Administration's flight plan IT network, which went
down for about 2.5 hours Aug. 26 and fouled up the takeoff plans of thousands
of travelers in more than 40 airports across the country, was back up and
running Aug. 27.
IT staff were still troubleshooting it today in Hampton, Ga., where the
agency's primary data center is located.
But how much longer will it keep running? The FAA's antiquated system
consists of two 20-year-old redundant mainframe configurations (the primary one
in Georgia, the backup in Utah) that apparently are hanging on for dear life
until reinforcements arrive in the form of a new, state-of-the-art system this
winter.
It is intriguing to note that the company that custom-built the mainframes for
the FAA has been out of business for 20 years. More on that shortly.
The Crash
"What happened yesterday at 1:25 p.m. [EDT] was that during a normal daily
software load something was corrupted in a file, and that brought [the] system
down in Atlanta," FAA spokesperson Paul Takemoto told me.
"Basically, all the flight plans that were in the system were kicked out.
For aircraft already in the air, or [that] had just been pushed back from the
gate, they had no problems. But for all other aircraft, it meant delays."
Matters got worse when operations were shifted to the backup facility in
Salt Lake City, which is designed to handle 125 percent of the overall load,
Takemoto said.
"It was far more than that [125 percent], because airlines were refiling
their flight plans manually. They just kept hitting the 'Enter' button. So the
queues immediately became huge," Takemoto said. "On top of that, it
happened right during a peak time as traffic was building. Salt
Lake City just couldn't keep up."
It was a "perfect storm" combination of all these flight plans being
refiled plus a congested time of day and a creaky old IT system that caused the
airport backups, Takemoto said.
The FAA then instructed the airlines not to file any flight plans for a
specified length of time, and that left many passengers sitting and waiting in
terminals. By around 4 p.m., Takemoto
said, things started clearing up and the system came back to life.
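The dynamics Takemoto describes, a backlog that grows while filings arrive
faster than the backup can process them and drains once filing is paused, can
be illustrated with a toy queue model. This is only a sketch: the 1.25
capacity figure mirrors the "125 percent" quoted above, while the surge size,
the tick lengths and the function name simulate_backlog are assumptions made
up for illustration, not FAA data.

```python
def simulate_backlog(arrivals, capacity):
    """Return the backlog left over after each tick of a single queue.

    arrivals: work units arriving per tick (hypothetical values).
    capacity: work units the system can process per tick.
    """
    backlog = 0.0
    history = []
    for arriving in arrivals:
        # New work joins the queue, then one tick of capacity is consumed.
        backlog = max(0.0, backlog + arriving - capacity)
        history.append(backlog)
    return history


# Normal load is defined as 1.0 unit per tick; the backup is sized for 1.25.
CAPACITY = 1.25

# Assumed surge: 4 ticks at 3x normal load (repeated refiling),
# followed by an 8-tick pause in which no new plans are filed.
surge = [3.0] * 4
pause = [0.0] * 8

history = simulate_backlog(surge + pause, CAPACITY)
print(history)
```

Under these assumed numbers the backlog climbs for as long as arrivals exceed
capacity and only starts to shrink once filing stops, which matches the
sequence of events described above.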
An Antiquated System
The system itself is called NADIN (National Airspace Data Interchange
Network). It was designed by North American Philips for the FAA in the early
1980s. The two Philips DS714/81 mainframes became operational in January 1988.
The company went out of business later that year, and the FAA bought out the
entire parts inventory.
To its credit, the system has been running 24/7 for a long, long time: since
the tail end of the Reagan administration, in fact. But the time has come for
it to be replaced, as underscored by the shutdown this week.
By the end of 2008, Takemoto said, the entire system will be replaced by a new,
state-of-the-art system: new hardware, software, everything. "It'll have a
memory that will be exponentially larger than this one," Takemoto said.
"It'll be able to handle spikes like the one we had yesterday."
Kenny Van Zant, chief product strategist at SolarWinds,
a network management software maker, told me that most network outages are
not caused by corrupted files.
"If you look at the root causes of most network outages, north of 70
percent of them are caused by configuration errors by humans," Van Zant
told me. "Computers fail a whole lot less often than the humans punching
things into computers fail. Network engineers, as smart as they are, are not
immune from that."
Details about the FAA's proprietary network configuration software setup were
not made available.
Detection and Monitoring on the Way
SolarWinds has a new configuration management product called Orion NCM
(Network Configuration Manager) v5, which integrates new features into the
previous Cirrus Configuration Manager product. Orion alerts network managers
when any change in the network structure occurs, so that outages can be handled
quickly. The alerts are delivered through a Web-based user interface for
handheld devices, cell phones and laptops.
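The general idea behind configuration change detection can be sketched in a
few lines: keep the last known configuration, compare a fingerprint of the
current one against it, and report what changed. The example below is a
generic illustration using Python's standard library. It is not SolarWinds
code and says nothing about how Orion NCM is implemented; the function names
and sample configuration lines are hypothetical.

```python
import difflib
import hashlib


def fingerprint(config_text):
    """Return a SHA-256 digest that changes whenever the text changes."""
    return hashlib.sha256(config_text.encode("utf-8")).hexdigest()


def describe_change(previous, current):
    """Return the added and removed lines, or [] if nothing changed."""
    if fingerprint(previous) == fingerprint(current):
        return []
    diff = difflib.unified_diff(
        previous.splitlines(),
        current.splitlines(),
        fromfile="previous",
        tofile="current",
        lineterm="",
    )
    # Skip the two file-header lines, then keep only added/removed lines.
    body = list(diff)[2:]
    return [line for line in body if line.startswith(("+", "-"))]


# Hypothetical device configuration before and after a manual edit.
old_config = "hostname core-1\nmtu 1500"
new_config = "hostname core-1\nmtu 9000"

changes = describe_change(old_config, new_config)
print(changes)
```

A real product would also record who made the change and when. This sketch
covers only the detection step.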
Jim Battenberg, director of product marketing for Neverfail, a
disaster recovery software vendor, told me that his company's software
asynchronously replicates all the data between a primary and a secondary
environment and monitors the network 24/7.
"So we would detect if the network goes down, if the server goes down, if
there's problems with the hardware, if the processor is being hit too hard, or
what have you," Battenberg said. "We detect everything within the
ecosystem. And if there's a problem, we can a) fix some things ourselves, or b)
fail over to the secondary system. And we do that automatically."
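The decision Battenberg outlines, either repair in place or fail over to the
secondary, can be pictured as a simple rule applied to each health check. The
sketch below is an illustration only: the HealthSample fields, the 90 percent
CPU threshold and the choose_action function are assumptions made for this
example and do not describe Neverfail's actual product or API.

```python
from dataclasses import dataclass


@dataclass
class HealthSample:
    """One hypothetical health reading taken from the primary system."""
    network_up: bool
    server_up: bool
    cpu_percent: float


def choose_action(sample, cpu_limit=90.0):
    """Return "ok", "repair" or "failover" for a single health reading."""
    # A dead network link or server cannot be fixed in place: switch over.
    if not sample.network_up or not sample.server_up:
        return "failover"
    # An overloaded but reachable server is a candidate for local repair.
    if sample.cpu_percent > cpu_limit:
        return "repair"
    return "ok"


# Three hypothetical readings: healthy, overloaded, and unreachable.
readings = [
    HealthSample(network_up=True, server_up=True, cpu_percent=35.0),
    HealthSample(network_up=True, server_up=True, cpu_percent=97.5),
    HealthSample(network_up=False, server_up=True, cpu_percent=20.0),
]

actions = [choose_action(reading) for reading in readings]
print(actions)
```

In practice a monitor would likely require several consecutive bad readings
before failing over, to avoid switching systems on a momentary glitch.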
