The Federal Aviation Administration's national flight-plan
filing system went down for 4 hours on the morning of Nov. 19, disrupting
the takeoffs of hundreds of commercial flights and throwing hundreds of
thousands of travelers off schedule.
The culprit was eventually determined to be a routing error in the software
configuration of a telecom router at the FAA's Salt Lake City data
distribution hub, which pushed the router offline.
The faulty router, which for reasons not yet established was not able to
fail over to a backup, also shut down a second major system node in Hampton,
Ga., effectively halting the input of flight plans filed by U.S. commercial
pilots. Commercial aircraft cannot take off from a U.S. airport without
filing a flight plan.
The glitch forced hundreds of pilots flying that day to enter their plans
manually via e-mail or by faxing them into the system, causing widespread
flight cancellations and delays.
Here is a detailed point-by-point account of the outage, supplied to eWEEK by
the FAA, telecom system maintainer Harris and the Professional Aviation Safety
Specialists union, an affiliate of the AFL-CIO that represents about 11,000
FAA technicians.
- Harris, maintainer of the FAA telecom network, installed a replacement router
during a planned maintenance release.
- Replacement router contained a route error in its software configuration.
- This router error caused route translation errors in the logical router
"bridge" between the IP ATM network and the OP IP backbone at the
FAA's Salt Lake City hub.
- The problem effectively blocked three-quarters of the IP routes over both
networks.
- Problem was detected immediately (the maintenance release, or MR, for the
router replacement started at 5 a.m. EST; the problem was detected at
5:08 a.m. EST).
- Isolation of the problem was complex, driven by a number of failures:
  - Another MR at Herndon, Va.
  - CPU utilization alarms did not trigger alerts.
  - Problem looked like a drop of routes in all of the backbone routers;
    suspected software problem in the routers.
  - CPU utilization on sample routers looked normal, so routing problem was not
    suspected at first.
Below is the timeline of events:
- 10:00 GMT – Router install begun and "autonomous" route changes begun
- 10:08 GMT – Start of outage, multiple calls to PNOCC from FAA
- 10:08 GMT to 13:13 GMT – Isolation underway to determine root cause
- 13:13 GMT – Router CPU utilization and IP engineering reset
- 13:13 GMT to 13:59 GMT – FTI field tech en route and IP engineering
constant reset of router
  - Services are up and data is flowing between each reset
- 13:59 GMT – FTI field tech removes router card. Services restored and
returned to service
- 14:17 GMT – Tier 3 sites (13 sites) report services down
- 14:38 GMT – All URET services restored
- 15:08 GMT – ZHN CERAP all services restored
- 15:30 GMT – ZLC Tier 3 sites (13 sites) services restored
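For readers who want to check the elapsed times implied by the timeline above,
the short Python sketch below does the arithmetic. The times are taken
directly from the timeline; the event labels and function names are
illustrative shorthand, not FAA or Harris terminology, and all events are
assumed to fall on the same day.

```python
from datetime import datetime, timedelta

# Times (GMT) copied from the timeline; the labels are our own shorthand.
EVENTS = {
    "install_begun": "10:00",
    "outage_start": "10:08",
    "router_card_removed": "13:59",
    "last_sites_restored": "15:30",
}


def elapsed(start: str, end: str) -> timedelta:
    """Return the time between two same-day HH:MM stamps."""
    fmt = "%H:%M"
    return datetime.strptime(end, fmt) - datetime.strptime(start, fmt)


def fmt_delta(delta: timedelta) -> str:
    """Format a timedelta as hours and minutes."""
    minutes = int(delta.total_seconds() // 60)
    return f"{minutes // 60}h {minutes % 60:02d}m"


to_card_removal = elapsed(EVENTS["outage_start"], EVENTS["router_card_removed"])
to_full_restore = elapsed(EVENTS["outage_start"], EVENTS["last_sites_restored"])

print("Outage start to router card removal:", fmt_delta(to_card_removal))
print("Outage start to last sites restored:", fmt_delta(to_full_restore))
```

On these figures, the span from the start of the outage to the removal of the
router card is 3 hours 51 minutes, and the span to the last reported
restoration is 5 hours 22 minutes.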
"This problem could have been mitigated many hours sooner had FAA
specialists maintained the system," PASS spokesman Church Siragusa told
eWEEK.
"At large facilities, FAA specialists are on duty 24/7. No one understands
the intricacies and inter-relationship of NAS systems better than FAA
specialists. We are trained to understand this, and we have an intimate
knowledge due to our maintenance efforts daily."
Harris Issues a Statement
Harris spokesperson Marc Raimondi told eWEEK that people should keep in mind
that weather conditions cause most flight delays, and that the FTI
system used by the FAA has a very good performance record. "Five
nines—maybe even nine nines of efficiency," Raimondi said.
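To put the "nines" shorthand in perspective: the term is conventionally used
for availability, where five nines means 99.999 percent uptime. The Python
sketch below converts a count of nines into the downtime it would allow per
year. It assumes a 365-day year and uses a hypothetical helper name; it
illustrates the arithmetic only and is not a measurement of FTI's performance.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a non-leap year


def allowed_downtime_seconds(nines: int) -> float:
    """Yearly downtime permitted at an availability of `nines` nines."""
    unavailability = 10 ** -nines  # e.g. 5 nines -> 0.00001
    return SECONDS_PER_YEAR * unavailability


five = allowed_downtime_seconds(5)
nine = allowed_downtime_seconds(9)

print(f"Five nines: about {five / 60:.2f} minutes per year")
print(f"Nine nines: about {nine * 1000:.1f} milliseconds per year")
```

Under these assumptions, five nines allows a little over five minutes of
downtime per year, and nine nines allows roughly 32 milliseconds.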
Raimondi issued the following statement from Harris: "We're working with
the FAA to evaluate the interruption in order to prevent similar outages in the
future. FTI has proven to be one of the most
reliable and secure communications networks operating within the civilian
government. Safety and security are the highest priorities."