The Boeing 737 Max 8 Tragedies: Why RCA is Critical to Successful Solution Generation

David W. Conley

The Boeing 737 Max 8 Tragedies

On October 29th, 2018, an Indonesian airline's Boeing 737 Max 8 aircraft crashed minutes after take-off, killing all 189 on board. The tragic events seemed to play out again on March 10th, 2019, when an Ethiopian Airlines Boeing 737 Max 8 went down in a similar manner, killing all 157 on board. Ethiopian authorities attributed their disaster to a faulty angle-of-attack (AOA) sensor, a device designed to indicate whether an aircraft is in danger of stalling. It measures the angle between the wing and the oncoming airflow, which roughly corresponds to the aircraft’s pitch, i.e., nose up, nose down, etc. (See Figure One: AOA Sensor Operation).

 

Comparisons of the two disasters uncovered related elements. It is tempting to conclude that, to respond effectively to a system failure, it is only necessary to identify the failure and eliminate its cause. Investigation closed, right? Maybe not.

 

System Failure Modes

Why do engineering systems fail? There are, of course, many reasons why a system may fail, but they fall into a few general categories:

  • Poor workmanship
  • Poor initial design
  • Operation outside of design parameters
  • Component failure (inferior materials or wear and tear)
  • Modifications for new usage modes that were not fully thought through
  • Unnecessary complexity that introduces additional failure modes

The root cause(s) of a failure must be determined accurately. Otherwise, steps cannot be taken to prevent a repeat of the failure. However, what is visible to us during a failure investigation is often only a symptom of a deeper-rooted problem.

 

Symptoms Versus Root Causes

Often, what we think is the problem is merely a symptom of the problem, not the root cause. Here’s a simple example: During the winter heating months, parts of my house are too warm, but the remainder of the space is just right. The temperature difference within the house is the perceived problem. If treated as the root problem (rather than a symptom), it leads to solutions that complicate the system, such as opening windows or adding fans or cooling devices in the afflicted rooms. In reality, it is more probable that my thermostat is simply in a poor location (the root cause) and that moving it to a better one might completely eliminate my heat distribution problem.

One real-life example of symptoms versus root causes occurred while I was working at Intel. A closed-loop cooling system, attached to one of our processing tools, was assigned to a newly hired recent engineering graduate. Semiconductor manufacturing is an expensive and complex process, and the individual processing tools tend to be equally expensive and complex. This cooling loop sub-system was no exception: it was tasked with moving large volumes of water to cool the high-energy ion implant processing tools it served. The new engineer noticed some particulate buildup in the cooling loop and, to be cautious, installed a high-efficiency in-line filtration system to protect the very expensive pump and motor package. Shortly after the filtration was installed, the pump started failing. The filtration unit was installed specifically to avoid pump and motor package failure, but the failure happened anyway.

To make a long story short, it turned out that a gasket used within the cooling loop was not rated for the loop’s operating temperature and was thus giving off particulate. The high-efficiency filter was indeed catching all of the resulting particulate, but as the filter did its job and loaded up over time, the system pressure would increase, requiring the pump to work against higher pressures and putting additional strain on the pump and the motor. The additional strain slightly raised the overall temperature of the system, and the gasket degraded at an even higher rate. The accelerated gasket degradation released more particulate, which loaded the filter faster, raised the system back-pressure further, and eventually led to occasional pump and motor package failure.

In summary, filtering out the particulate, while failing to determine its root cause, resulted in higher system back-pressure, leading to higher pump strain and, ultimately, motor burnout. Addressing a symptom of a problem (such as the particulate in the previous example) does not eliminate the source of the symptom; it only works around it. And even if the “fix” works, the system becomes more complex, creating additional future failure points.
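The feedback loop in the cooling example can be sketched as a toy simulation. This is only an illustration of the reasoning, not a model of the actual Intel system: every constant, name, and unit below is an assumption chosen to make the loop visible. The sketch compares leaving the out-of-spec gasket in place behind a filter (the symptom patch) with replacing the gasket (the root-cause fix).

```python
def simulate_cooling_loop(steps, fix_root_cause=False):
    """Toy model of the filter/gasket feedback loop.

    All constants are illustrative assumptions, not measured values.
    Returns (failure_step or None, list of back-pressure values per step).
    """
    base_shed = 1.0          # particulate shed per step by the out-of-spec gasket
    temp_gain = 0.05         # extra shedding per unit of excess temperature
    pressure_per_load = 0.1  # back-pressure per unit of particulate held in the filter
    temp_per_pressure = 0.5  # temperature rise per unit of back-pressure
    pump_limit = 10.0        # back-pressure at which the pump is treated as failed

    filter_load = 0.0
    history = []
    for step in range(steps):
        # A loaded filter raises back-pressure; back-pressure raises temperature.
        pressure = pressure_per_load * filter_load
        excess_temp = temp_per_pressure * pressure
        history.append(pressure)
        if pressure >= pump_limit:
            return step, history
        # A hotter loop makes the wrong gasket shed faster (the feedback).
        # A correctly rated gasket (root-cause fix) sheds nothing in this sketch.
        shed = 0.0 if fix_root_cause else base_shed * (1.0 + temp_gain * excess_temp)
        filter_load += shed
    return None, history


if __name__ == "__main__":
    patched, _ = simulate_cooling_loop(200)
    fixed, _ = simulate_cooling_loop(200, fix_root_cause=True)
    print("symptom patch only: pump fails at step", patched)
    print("root cause fixed:   pump fails at step", fixed)
```

In this sketch the filter-only configuration always reaches the pump limit eventually, and the feedback term makes it arrive sooner than steady shedding alone would. With the gasket replaced, the filter never loads and the pump limit is never reached.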

Symptom patching such as in the examples above is commonplace in electro-mechanical and thermal systems, but it is not confined to them. It is a common occurrence in software systems as well. Here’s another Intel example, where the troubleshooting and repair involved a piece of process control software that had been modified to allow either manual or automated movement of wafer lots between the tool sets. This modification was completed back when processing wafers were only 200 mm in diameter and an entire lot of 25 wafers could easily be handled by a human, so manual lot handling was fine. However, when processing wafers grew to 300 mm in diameter, a 25-wafer lot weighed more than a human could safely handle, so the 300 mm lots could only be handled by automation. The code that allowed the processing equipment in question to interface with the automated material handling system had been written for the previous process (which used 200 mm wafers). As a result, the old 200 mm wafer handling software had to be maintained for many years after it was no longer needed for anything other than the material handling handshake. The software was written in a somewhat obsolete language, ran on older computing platforms, and grew into a lumbering digital monster that eventually started causing automation failures. Milking legacy software along to avoid a rewrite almost always keeps obsolete lines of code running, and that code will eventually conflict with newer code modules or I/O equipment that must be attached to the legacy systems.

Working around lines of code that are required for some operations, but in conflict with newer requirements, is exactly like solving visible symptoms rather than addressing root causes. Generally speaking, software systems become so complex from the continual patching of symptoms that any software older than five years is a good candidate for a bottom-up redesign and rewrite.

 

737 Max 8 Disaster Root Cause(s)

There were many news articles focusing on the 737 Max 8 crashes. The articles reporting the reasons for the crashes tend to point to the failure of one of the two angle-of-attack (AOA) sensors as the root cause. In my experience, claiming victory upon identifying a single failure mode associated with a problem is often shortsighted. In fact, determining root causes is rarely as simple as identifying a single failure point. Engineering systems are becoming more and more complex, and assuming that a single failure mode fully explains a system problem is a risky venture. However, there are tools that allow excellent insight into complex multi-factor issues.

I suspect that in the long run the 737 Max 8 crashes will be attributed to much more than the failure of a single sensor. In a future article, I will introduce the Root Cause Analysis (RCA) tools I most recommend and show how to apply them to the 737 Max 8 AOA sensor failure. When we do that, we will see that the AOA sensor failure was likely just a contributing cause and not the root cause of these air travel tragedies.

Editor’s Note: As the head of the Eogogics Systematic Innovation team, Dave Conley is focused on the techniques that enable breakthrough innovations in engineering products and systems, as well as on the integration of those tools into product/process improvement methodologies such as Root Cause Analysis. In a prior life, he was the head of Intel’s worldwide innovation effort. He has also worked at or with Johnson & Johnson, Philips Semiconductor, NASA, Los Alamos and Brookhaven National Labs, and the US Air Force.