FAA Shutdown – The Importance of Solid Root Cause Analysis
What Happened
Last week, the FAA ordered a ground stop for all domestic departures in the U.S. This occurred because of a fault in the NOTAM (Notice to Air Missions) system. This system informs aircraft operators of potentially hazardous or closed areas along a flight. For example, if you were flying from Charleston to Miami, you would want to know if there were any restrictions due to rocket launches out of Cape Canaveral. The FAA NOTAM system makes sure pilots and air traffic schedulers can see all of the restricted areas during flight planning. Without this system active, aircraft are not allowed to fly.
On Tuesday night, the system went down. The FAA is still investigating, but several reports indicate that an incorrect or corrupted file was on the server, possibly due to an error by a software engineer. Apparently, the backup file was also corrupted.
Analysis & Corrective Actions
Of course, some people are already calling for corrective actions (yell at politicians, fire someone at the FAA, etc.). And yet, we haven’t even collected the evidence, let alone analyzed it. We’ll let that go for now, but it is frustrating to see the same poor answers being circulated for every failure.
Instead, let’s take a look at the types of IT issues that can be analyzed. Most people think about doing a root cause analysis on accidents (injuries, environmental releases, fires, explosions, etc.). But you can do a TapRooT® investigation on any problem that involves humans. If there is a possibility of a human error, you can investigate to better understand what allowed that mistake to occur.
For IT issues, this is just as relevant. Just because there was no one hurt or there wasn’t an environmental release does not mean that humans weren’t involved. For this particular problem, we’d want to better understand:
- What data backup procedures are in place
- How do we verify the backup is actually working
- Is there a way to make sure we aren’t backing up bad data
- What test programs are in place to check that the correct data is available
- How did the wrong or corrupted data end up on a live server
- What checks are in place to make sure that this bad data is caught before going live
- When a problem does occur, what systems do we have in place to mitigate or recover from the failure
- How do we test our end-to-end system, and how often is the system tested
All of these items are possible problems when humans are involved. Our investigation into computer issues needs to be thorough enough to look at the hardware, the software, the interfaces, past problems, previous corrective actions, and so on. And all of these issues must be investigated rigorously, using a non-judgmental RCA system that assumes errors will occur and finds the reasons these errors were allowed to proceed to an actual incident. Just blaming people, or being happy the system is now up and running, is not a very robust answer.
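To make a few of the questions above more concrete (verifying that a backup actually works, avoiding backups of bad data, and catching bad data before it goes live), here is a minimal Python sketch. It is only an illustration and says nothing about how the NOTAM system is really built. The function names (sha256_of, is_valid_data, back_up, promote) are hypothetical, and the validity check assumes, purely for the example, that a good data file is a non-empty JSON list.

```python
import hashlib
import json
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_valid_data(path: Path) -> bool:
    """Hypothetical validity check: the file must parse as a non-empty JSON list.

    A real system would apply its own schema and business rules here.
    """
    try:
        records = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        # Unreadable, undecodable, or malformed files are treated as bad data.
        return False
    return isinstance(records, list) and len(records) > 0


def back_up(source: Path, backup: Path) -> bool:
    """Back up source only if it is valid, then confirm the copy matches."""
    if not is_valid_data(source):
        return False  # refuse to overwrite a good backup with bad data
    shutil.copyfile(source, backup)
    return sha256_of(source) == sha256_of(backup)


def promote(candidate: Path, live: Path) -> bool:
    """Copy a candidate file to the live location only if it passes validation."""
    if not is_valid_data(candidate):
        return False  # bad data is caught before going live
    shutil.copyfile(candidate, live)
    return True
```

The design choice worth noticing is that the check runs before the copy in both directions. A corrupted file never replaces the last good backup, and it never reaches the live location. That is one possible safeguard among many, and an investigation would still need to ask why such a safeguard was missing or failed.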
Take a look at how TapRooT® is being used in the Transportation industry.
Did you hear about the other similar failures?
The Philippines system went down just after the first of the year.
The Canadian system went down the day after the US system failed.
Supposedly, there are no commonalities between the three systems.
Do you know of any other systems failing around the world?