The Story Behind the FAA Flight-Plan System Crash
When the Federal Aviation Administration's national flight-plan filing system
went down for four hours on the morning of Nov. 19, disrupting the takeoffs of
hundreds of commercial flights and throwing hundreds of thousands of travelers
off schedule, it wasn't yet known that the culprit was one faulty part inside a
telecom router in Salt Lake City.
The faulty router, which for reasons not yet established was unable to fail
over to a backup, also shut down a second major system node in Hampton, Ga.,
effectively halting the entry of flight plans filed by U.S. commercial pilots.
Commercial aircraft cannot take off from a U.S. airport without filing a flight
plan.
The glitch forced hundreds of pilots flying that day to submit their plans
manually, by e-mail or by fax, causing widespread flight cancellations and
delays.
Most flight plans are routine and pre-entered as a template in the system.
Pilots normally make only a few changes in their altitude, speed and directional
plans, depending on weather conditions and the weight of the aircraft. When the
templates are not available, the pilot is forced to reconstruct the entire
flight plan, which is a tedious and time-consuming exercise.
When the router went offline, only the system maintainer, government
telecommunications contractor Harris, knew that the backup card was not
immediately available, and that one technician, who hadn't come to work yet
that day, had the key to the storage closet where the part was kept.
So the FAA had to wait until this technician was able to come to the site in
Salt Lake City to replace the faulty card inside the router, reconfigure the
software, and get the communications backbone back up and running so that the
nation's air traffic could get back to normal.
This information was supplied to eWEEK by the Professional Aviation Safety
Specialists union, an affiliate of the AFL-CIO
that represents about 11,000 FAA technicians.
Does the failure of a single router that crimps a national telecommunications
system sound ridiculous in this day and age of virtual links, automated
processes and autonomic computing? It does. But that's what happened, and
that's why the Department of Transportation is going to launch an investigation
into this incident to see that this doesn't happen again.
Harris, the government contractor that installed and runs the FTI
(FAA Telecommunications Infrastructure) system, is the entity responsible
for the infrastructure connecting the nodes of the FAA's flight-plan system.
"If the FAA owned and maintained this system, the problem could have been
corrected within minutes," PASS National President Tom Brantley wrote
on the union's Website. "This could have reduced delays tremendously
and allowed a much quicker resolution to the problem. Meanwhile, because it
took so long for Harris to address the problem, delays continue to plague the
system."
Before 2002, when the FAA contracted out the FTI
system to Harris, the system was maintained by FAA telecom technicians on duty
24 hours per day.
"[Before 2002] the only thing that the FAA used to contract out was the
line services, belonging to MCI, Verizon or whichever company was the local
supplier," Chuck Siragusa, a PASS spokesperson, told eWEEK. "The
on-site FAA technicians are well-trained in mission-critical systems, routers,
modems, all of it. If this [FTI] system had
been maintained by the FAA, the impact [of the Nov. 19 outage] would have been
minimal, because a fix could have been made much quicker."
Not the New IT System's Fault
The FAA recently spent millions of dollars replacing its antiquated Philips
mainframe system with a new one that uses Stratus Technologies high-performance
servers and other elements from Sun Microsystems, Cisco Systems and other
first-tier IT suppliers. The old system, which went online in 1988 and served
the FAA well for two decades, was approaching its end of life and had suffered
a series of breakdowns in the last few years. However, the new IT system was
not the issue Nov. 19.
The FAA uses the NADIN (National Airspace Data Interchange Network)
communications link for the flight-plan system. The two NADIN sites in Salt
Lake City and Hampton, Ga., along with the 21 other FAA IT stations, no longer
use a multipath communications backbone composed of many different redundant
links.
As mandated by the Bush administration in 2001, all the communications links
that previously were government-owned and maintained by FAA employees were
contracted to Harris, under the $2.4 billion FTI
contract.
Rep. Jerry Costello issued the following statement Nov. 19 regarding the
outage:
"While today's incident could have been much worse, anytime you have a system-wide outage it needs to be thoroughly reviewed and it brings up several questions that the FAA needs to address. Why did it take four hours to locate a seemingly small technical problem, and why did it have a system-wide effect? Is the FAA's oversight of its contract with the Harris Corporation sufficient? The relationship between the FAA and its vendors is a critical one, given that the transition to the Next Generation Air Transportation System will require more such partnerships. Our staff is discussing these questions with the FAA and we will continue to explore these issues. In addition, Chairman Oberstar and I have asked the Department of Transportation Inspector General to conduct a 60-day study of the outage and FAA's corrective action plan."
Harris spokesperson Marc Raimondi told eWEEK that people should keep in mind
that weather conditions cause most flight delays, and that the FTI
system used by the FAA has a very good performance record. "Five nines, maybe
even nine nines of efficiency," Raimondi said.
Raimondi issued the following statement from Harris: "We're working with
the FAA to evaluate the interruption in order to prevent similar outages in the
future. FTI has proven to be one of the most
reliable and secure communications networks operating within the civilian
government. Safety and security are the highest priorities."
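For a sense of scale on the "five nines" figure quoted above, the arithmetic can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it treats Raimondi's "efficiency" as availability (uptime as a fraction of total time), measures it over a 365-day year, and applies it to a single service rather than to the FTI network as a whole. The function names are hypothetical and do not come from Harris or the FAA.

```python
# Illustrative arithmetic only. Assumptions: "N nines" means availability
# (e.g. five nines = 99.999% uptime), measured over a 365-day year.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_budget_minutes(nines: int) -> float:
    """Downtime allowed per year, in minutes, at an availability of N nines."""
    unavailability = 10.0 ** (-nines)  # five nines -> 0.00001
    return MINUTES_PER_YEAR * unavailability


def years_of_budget_used(outage_minutes: float, nines: int) -> float:
    """How many years of downtime allowance a single outage consumes."""
    return outage_minutes / downtime_budget_minutes(nines)


if __name__ == "__main__":
    outage = 4 * 60  # the four-hour Nov. 19 outage, in minutes
    for n in (5, 9):
        budget = downtime_budget_minutes(n)
        used = years_of_budget_used(outage, n)
        print(f"{n} nines: {budget:.6f} min/year allowed; "
              f"a 4-hour outage uses {used:,.1f} years of that allowance")
```

Under these assumptions, five nines allows a little over five minutes of downtime per year, so a four-hour outage at one site would consume roughly 45 years of that allowance. A network-wide availability figure averaged over many links could still remain high after such an event, which is one reason the two sides can describe the same system so differently.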
