When the Federal Aviation Administration’s national flight-plan filing system went down for four hours on the morning of Nov. 19, disrupting the takeoffs of hundreds of commercial flights and throwing hundreds of thousands of travelers off schedule, it wasn’t yet known that the culprit was one faulty part inside a telecom router link in Salt Lake City.
The faulty router, which for reasons not yet established was not able to default to a backup, also shut down a second major system node in Hampton, Ga., effectively bringing to a halt the inputting of flight plans filed by U.S. commercial pilots. Commercial aircraft cannot take off from a U.S. airport without filing a flight plan.
The glitch forced hundreds of pilots flying that day to enter their plans manually via e-mail or by faxing them into the system, causing widespread flight cancellations and delays.
Most flight plans are routine and pre-entered as a template in the system. Pilots normally make only a few changes in their altitude, speed and directional plans, depending on weather conditions and the weight of the aircraft. When the templates are not available, the pilot is forced to reconstruct the entire flight plan, which is a tedious and time-consuming exercise.
When the router went offline, only the system maintainer–government telecommunications contractor Harris–knew that the backup card was not immediately available, and that one technician, who hadn’t come to work yet that day, had the key to the storage closet where the part was kept.
So the FAA had to wait until this technician was able to come to the site in Salt Lake City to replace the faulty card inside the router, reconfigure the software, and get the communications backbone back up and running so that the nation’s air traffic could get back to normal.
This information was supplied to eWEEK by the Professional Aviation Safety Specialists union, an affiliate of the AFL-CIO that represents about 11,000 FAA technicians.
Does the failure of a single router that crimps a national telecommunications system sound ridiculous in this day and age of virtual links, automated processes and autonomic computing? It does. But that’s what happened, and that’s why the Department of Transportation is going to launch an investigation into this incident to see that this doesn’t happen again.
Harris, the government contractor that installed and runs the FTI (FAA Telecommunications Infrastructure) system, is the entity responsible for the infrastructure connecting the nodes for the FAA’s flight-plan system.
“If the FAA owned and maintained this system, the problem could have been corrected within minutes,” PASS National President Tom Brantley wrote on the union’s Website. “This could have reduced delays tremendously and allowed a much quicker resolution to the problem. Meanwhile, because it took so long for Harris to address the problem, delays continue to plague the system.”
Before 2002, when the FAA contracted out the FTI system to Harris, the system was maintained by FAA telecom technicians on duty 24 hours per day.
“[Before 2002] the only thing that the FAA used to contract out was the line services, belonging to MCI, Verizon or whichever company was the local supplier,” Chuck Siragusa, a PASS spokesperson, told eWEEK. “The on-site FAA technicians are well-trained in mission-critical systems, routers, modems, all of it. If this [FTI] system had been maintained by the FAA, the impact [of the Nov. 19 outage] would have been minimal, because a fix could have been made much quicker.”
Not the New IT System’s Fault
The FAA recently spent millions of dollars updating its antiquated Philips mainframe system with a new one that uses Stratus Technologies high-performance servers and other elements from Sun Microsystems, Cisco Systems and other first-tier IT suppliers. The old system, which went online in 1988 and served the FAA well for two decades, was approaching its end of life and had suffered a series of breakdowns in the last few years. However, the new IT system was not the issue Nov. 19.
The FAA uses the NADIN (National Airspace Data Interchange Network) communications link for the flight-plan system. The two NADIN sites in Salt Lake City and Hampton, Ga.–along with the 21 other FAA IT stations–no longer use a multipath communications backbone composed of many different redundant links.
As mandated by the Bush administration in 2001, all the communications links that previously were government-owned and maintained by FAA employees were contracted to Harris, under the $2.4 billion FTI contract.
Rep. Jerry Costello issued the following statement Nov. 19 regarding the outage:
“While today’s incident could have been much worse, anytime you have a system-wide outage it needs to be thoroughly reviewed and it brings up several questions that the FAA needs to address. Why did it take four hours to locate a seemingly small technical problem, and why did it have a system-wide effect? Is the FAA’s oversight of its contract with the Harris Corporation sufficient? The relationship between the FAA and its vendors is a critical one, given that the transition to the Next Generation Air Transportation System will require more such partnerships. Our staff is discussing these questions with the FAA and we will continue to explore these issues. In addition, Chairman Oberstar and I have asked the Department of Transportation Inspector General to conduct a 60-day study of the outage and FAA’s corrective action plan.”
Harris spokesperson Marc Raimondi told eWEEK that people should keep in mind that weather conditions cause most flight delays, and that the FTI system used by the FAA has a very good performance record. “Five nines–maybe even nine nines of efficiency,” Raimondi said.
Raimondi issued the following statement from Harris: “We’re working with the FAA to evaluate the interruption in order to prevent similar outages in the future. FTI has proven to be one of the most reliable and secure communications networks operating within the civilian government. Safety and security are the highest priorities.”
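For readers unfamiliar with the “five nines” shorthand quoted above, the arithmetic behind it can be sketched in a few lines of Python. This is an illustration only: it assumes the figure refers to availability measured over a single 365-day year, which the article does not state, and the function names are hypothetical rather than drawn from any FAA or Harris tooling.

```python
# Rough availability arithmetic, assuming a one-year measurement window.
# The window and the function names are illustrative assumptions.

HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def allowed_downtime_minutes(nines: int) -> float:
    """Minutes of downtime per year permitted at an availability of N nines.

    Five nines means 99.999 percent, i.e. an unavailability of 10**-5.
    """
    unavailability = 10 ** -nines
    return HOURS_PER_YEAR * 60 * unavailability


def availability_after_outage(outage_hours: float) -> float:
    """Fraction of the year a system was up if its only downtime was one outage."""
    return 1 - outage_hours / HOURS_PER_YEAR


if __name__ == "__main__":
    print(f"Five nines allows about {allowed_downtime_minutes(5):.2f} minutes of downtime per year")
    print(f"A single four-hour outage leaves availability at {availability_after_outage(4):.4%}")
```

Under those assumptions, five nines leaves room for only a few minutes of downtime in a year, so a single four-hour outage on a given service would by itself put that service’s availability for the year closer to three nines. Whether the Nov. 19 outage counts against the FTI network’s overall figure depends on how Harris and the FAA measure it, which the article does not say.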