A major portion of the decades-old national air traffic control system used to manage thousands of commercial and general aviation takeoffs and landings every day in the United States has crashed multiple times under the 20-year strain of its 24/7 operations.
As a result, industry analysts and a number of former Federal Aviation Administration staff members believe there is a heightened likelihood of a major air traffic stoppage, as was demonstrated twice in the last two weeks by the crash of the system head in Atlanta. They also believe there is a new, increasing vulnerability to terrorist cyber-attacks.
The Aug. 26 event in which a corrupt file of some sort entered the system and rendered it useless for about 90 minutes during a high-traffic period was not an isolated incident, contrary to what the FAA’s chief administrator originally had told the media.
Hundreds of flights were delayed and thousands of passengers were thrown off schedule by the system crash, which lasted only 90 minutes but caused widespread havoc.
The NADIN (National Airspace Data Interchange Network) system, which processes an average of 1.5 million messages per day, has a history of technical issues, and resulting travel disruptions are not out of the ordinary, according to knowledgeable air industry sources.
The FAA originally had reported Aug. 27 that the breakdown of the automated system was the first of its kind. The crash apparently had baffled FAA officials, who then conducted a technical investigation to determine the cause.
The 90-minute system crash, which affected pretty much all the major airports in the nation, later was blamed on a single corrupt file (most likely a virus) that had entered the system and somehow torpedoed it into uselessness.
The second NADIN system in Salt Lake City, to its credit, continued to handle all the West Coast flight plans normally. But when Atlanta crashed, all the East Coast data switched over immediately to Salt Lake City, which could not handle the extra data traffic, even though it was designed to handle 125 percent of normal load in the event of an emergency.
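A back-of-the-envelope calculation suggests why a 125 percent design margin may not be enough to absorb a full failover. The sketch below assumes, purely for illustration, that the 1.5 million daily messages are split evenly between Atlanta and Salt Lake City; the real split is not stated here, and the function and variable names are hypothetical.

```python
# Illustrative arithmetic only. The even Atlanta/Salt Lake City split is an
# assumption; the 1.5 million messages/day and 125 percent figures come from
# the article.

DAILY_MESSAGES = 1_500_000   # average NADIN messages per day (from the article)
DESIGN_MARGIN = 1.25         # survivor is designed for 125% of its normal load
ASSUMED_SHARE = 0.5          # hypothetical: each site normally carries half


def failover_load(total, share, margin):
    """Return (normal load, design capacity, load after failover) for the surviving site."""
    normal = total * share
    capacity = normal * margin
    after_failover = total   # the surviving site must carry all the traffic
    return normal, capacity, after_failover


normal, capacity, demand = failover_load(DAILY_MESSAGES, ASSUMED_SHARE, DESIGN_MARGIN)
overload_ratio = demand / normal
shortfall = demand - capacity

print(f"Normal load:         {normal:,.0f} messages/day")
print(f"Design capacity:     {capacity:,.0f} messages/day")
print(f"Load after failover: {demand:,.0f} messages/day ({overload_ratio:.0%} of normal)")
print(f"Shortfall:           {shortfall:,.0f} messages/day")
```

Under that assumed split, a full failover would hand the surviving switch roughly double its normal traffic, well beyond a 125 percent margin. A different split would change the numbers but not the basic problem.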
Commercial aircraft of any type cannot take off without having filed a valid flight plan, one that includes destination, estimated flight speed, description of cargo, estimated altitude, weather conditions and a number of other data points.
So, for a part of the afternoon of Aug. 26, pilots at about 40 U.S. airports were forced to manually type their flight plan information into the system, causing long delays in takeoffs. Chicago’s O’Hare International, one of the two or three busiest airports in the world, and nearby Midway Airport were among the most directly affected.
“We’ve just never seen it fail in this manner,” Hank Krakowski, the chief operating officer for the FAA’s Air Traffic Organization, said in his media remarks.
However, a look at the record shows it had indeed failed several times before, including only five days prior to the Aug. 26 crash.
Several System Failures During Last Nine Years
This excerpt comes from the FAA’s own Web site (PDF format), dated Aug. 22:
“The aforementioned NADIN outage last evening [Aug. 21] caused more than 100 delays after flight plans were rejected. The outage is currently being blamed for 134 departure delays but this figure could climb. The legacy NADIN in Atlanta crashed. Salt Lake City took over but had problems with the high queue level …”
International intelligence analytical firm Stratfor, in an analysis published on Aug. 27, reported a similar system outage back in 2000. Another was reported in June 2007 in addition to the Aug. 21 and Aug. 26 crashes. Those are the ones we know about; we don’t know how many others were never made public information.
“Interruptions to a master flight plan system is not just inconvenience, [they are] a major safety risk,” server administrator, high-mileage air traveler and eWEEK reader Jeff Milne wrote in a comment on one of our earlier stories in this series.
“Airport congestion caused by an extremely high number of flights stuck on the Tarmac can easily cause breaches of air space or worse as the congestion grows due to flights coming in from other locations not affected by the Eastern outage.”
In March 2005, a new contract was awarded for a “NextGen” NADIN replacement. No doubt that process took several months, or perhaps even a year or more, to formulate. So the FAA has been well aware for at least four years that the old system has served its purpose and is ready to be replaced.
In fact, the agency had been given warning as far back as 2000 (and perhaps even earlier than that) that the system was beginning to fail.
The FAA has begun replacing the old Philips DS714 mainframes with new heavy-duty Stratus ftServer 6400s, which run on Intel Xeon processors. The system was designed by Lockheed Martin engineers. But the new servers are not in main production at this time, or else they would have been able to help share the load during the major outage of Aug. 26.
Thus, the legacy (and that’s putting it nicely) NADIN system appears to be hanging on for dear life. The new replacement system isn’t expected to go online until the end of 2008, FAA spokesman Paul Takemoto told me.
What About the Rest of the System?
If the flight-plan system is suspect and is taking a long time to replace, what about the rest of the air traffic system? What is its condition? The flight-plan and air traffic systems work hand in hand and are essential tools for the coordination of the nation’s air traffic.
Most localized air traffic control systems in use today were designed in the 1960s and ’70s and installed throughout those years and into the ’90s. Radar has been used since World War II.
Many technologies are used in air traffic control systems. Primary and secondary radar are used to enhance a controller’s “situational awareness” within his assigned air space; all types of aircraft send back primary echoes of varying sizes to controllers’ screens as radar energy is bounced off their skins. Transponder-equipped aircraft reply to secondary radar interrogations by giving an ID (Mode A), an altitude (Mode C) and/or a unique call sign (Mode S). Certain types of weather also may register on a radar screen.
The traffic-handling systems used at most international airports are highly proprietary. Systems engineers are tight-lipped about them in general. They work hand in hand with the flight-plan system and have many redundancies built into them.
Stratfor, along with many other industry watchers, is very concerned about the flight-plan system and evidence that the system is wearing out.
“Regardless of what caused the Aug. 26 NADIN crash, [there] is a monumental challenge the event underscores. Here an archaic system that had survived nearly seven years of 9/11-inspired overhauls went down, dumping its entire workload on one other switch. The NADIN system had already been partially upgraded with systems from Lockheed Martin and is slated to be replaced altogether with the FAA’s much-hyped NextGen Air Traffic Control system. But the lack of redundancy and dynamism demonstrated again by the latest NADIN crash makes a cyberattack against critical U.S. infrastructure all the more feasible. And the cost of comprehensively upgrading these systems would be an enormous financial investment, far more than we have seen so far in the years following 9/11.”
A Web site blogged by a number of former FAA staff members, FAAFollies.com, details many of the foibles the agency has suffered in recent years, including these last two system crashes.
Andy Isaksen, a computer scientist for the FAA in Atlanta, was the designer of the flight-plan system. He did not return a call from eWEEK about this story.
In a 2005 NetworkWorld article, Isaksen told writer Deni Connor that the NADIN system’s two Philips DS714/81 mainframe computers were originally manufactured in 1968 and upgraded with new processors in 1981. Since then, they have become increasingly hard to maintain, support and write code for, Isaksen said.
That much is obvious. Philips’ Netherlands-based computer-making arm ceased to exist shortly after delivering the FAA’s two custom-made, proprietary mainframes in 1988; the FAA then bought the entire parts inventory as an insurance policy.
The Isaksen flight-plan network is the centerpiece of the FAA’s air traffic system. Any aircraft that enters or leaves U.S. air space has to file a plan into the system. The network also serves as the sole data interchange between the United States and other nations to distribute flight plans for commercial and general aviation, as well as weather and advisory notices to pilots.
To its credit, the air traffic system probably has been running around the clock 99.9 percent of the time since the tail end of the Reagan administration. But the time has come for it to be replaced, and everybody knows it.
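To put that rough 99.9 percent figure in perspective, the short calculation below converts it into hours of downtime. It is only a sketch: the 99.9 percent number is the estimate given above rather than a measured value, the 20-year span is approximate, and the variable and function names are made up for illustration.

```python
# Rough availability arithmetic; all inputs are approximations from the article.

HOURS_PER_YEAR = 365 * 24    # ignores leap days for simplicity
AVAILABILITY = 0.999         # the article's rough "99.9 percent" estimate
YEARS_IN_SERVICE = 20        # approximate service life cited in the article
OUTAGE_HOURS = 1.5           # length of the 90-minute Aug. 26 outage


def downtime_hours(availability, hours):
    """Hours of downtime implied by a given availability over a period."""
    return (1.0 - availability) * hours


per_year = downtime_hours(AVAILABILITY, HOURS_PER_YEAR)
lifetime = per_year * YEARS_IN_SERVICE
outages_per_year = per_year / OUTAGE_HOURS

print(f"Downtime per year:      {per_year:.2f} hours")
print(f"Downtime over 20 years: {lifetime:.1f} hours")
print(f"Equivalent 90-minute outages per year: {outages_per_year:.1f}")
```

At 99.9 percent, the implied downtime is a little under nine hours a year, the equivalent of several 90-minute outages annually.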
Modernization Is Mandatory
The way the FAA has its flight-plan system set up, there simply is no bandwidth for testing a newer system.
“The FAA, with our lives and livelihood in their hands, should be a LOT more proactive in addressing modernization needs before it becomes a crisis with a scope and complexity that defies resolution,” server admin Milne wrote to eWEEK.
“This kind of thing [system crashes] happens when changes are pushed into production without adequate testing. It’s ironic that the system being criticized is (even after 20 years) the state of the art for functionality worldwide, because the infrastructure that was once redundant now has both platforms fully tasked, much less having resources for a test system,” Milne wrote.
The way to port a legacy application is to build an exact replica on a current platform, forgoing the temptation to implement upgrades of any kind. But the age of the system and other limitations do not allow for adequate testing.
“Anything else is unable to function demonstrably in a parallel operations validation scenario that’s necessary to establish sufficient confidence to warrant cut-over,” Milne wrote.
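The parallel-operations approach Milne describes can be sketched in a few lines: feed identical traffic to the legacy system and its replica, treat the legacy output as the reference, and cut over only when the two agree. Everything in this sketch is hypothetical, including the handler functions, the made-up message format and the 1,000-sample threshold; it illustrates the idea and does not represent how NADIN or its replacement actually works.

```python
from typing import Callable, Iterable, List, Tuple

Handler = Callable[[str], str]


def legacy_handler(message: str) -> str:
    """Hypothetical stand-in for the legacy system's message processing."""
    return message.strip().upper()


def replica_handler(message: str) -> str:
    """Hypothetical stand-in for the replica; meant to behave identically."""
    return message.upper().strip()


def parallel_run(messages: Iterable[str], old: Handler, new: Handler) -> List[Tuple[str, str, str]]:
    """Send every message to both systems and collect any outputs that differ."""
    mismatches = []
    for msg in messages:
        expected = old(msg)   # the legacy output is treated as the reference
        actual = new(msg)
        if expected != actual:
            mismatches.append((msg, expected, actual))
    return mismatches


def ready_for_cutover(mismatches, total, min_samples=1000):
    """Cut over only after enough traffic has been compared with zero mismatches."""
    return total >= min_samples and not mismatches


# Made-up messages standing in for real flight-plan traffic.
sample = [f" fpl-{n:04d} atl-ord " for n in range(1500)]
diffs = parallel_run(sample, legacy_handler, replica_handler)
print(len(diffs), ready_for_cutover(diffs, len(sample)))
```

The sketch also shows why upgrades are a temptation to resist during a port: once the new system is allowed to behave differently, its output can no longer be checked line for line against the legacy system, and the comparison stops providing the confidence needed for cut-over.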
“Until the FAA bites the bullet and accepts this restriction, we are subject to similar outages while they get their NextGen architecture stood up and functioning. But considering the cost and the limited disruption up to now, it may be worth accepting the pain for a time. If that’s what they want to do, the right thing is to establish a deadline, after which they are obliged to stop and update the legacy application as the price of gaining the time to perfect the new system,” Milne wrote.
“My interest is based on being a consumer of the FAA’s services: a little over 2 million miles flown so far. I am shocked, no, I’m way beyond shocked at how antiquated the equipment is (where does someone go to get replacement vacuum tubes?). Sure, it works well most of the time (thank God). But we have better technology on golf carts, it seems.”