Doing Root Cause Analysis

An incident is a terrible thing to waste

Failure is a wonderful teacher. We should make better use of it to learn how we fail.
The root cause of an incident is always process, culture or architecture.
A good RCA will wipe out a class of errors, not put a band-aid on the symptom. That is the value of going deep.
A good RCA goes deep. It must use a cause-and-effect (fishbone) diagram and the 5 Whys.
Be vocally self-critical. RCAs are not the place to be defensive; defensiveness defeats the purpose of an RCA. This is not a witch-hunt. It is about preventing a class of errors from happening again.
Leaders must be present at an RCA. Decisions need to be made about deep changes. Architecture, process and culture changes should not be delegated to frontline engineers. Give constructive feedback during the RCA.
A good RCA requires you to be boundary-less. Often, we stop at the boundary of what we think we can ‘control’ and make a sub-optimal fix. Fix the problem at the root. If that means fixing it in a different organization, go do that.
Writing a good RCA is hard, just as writing a good 6-pager is hard. Seek out the experts and understand what separates a good RCA from a great RCA.
Put yourself in the shoes of the customer. When we say ‘the network was unavailable’ or ‘the application crashed’, we are abstracting ourselves from the inability of our customers to run mission-critical software and the pain they feel.
Take an end-to-end perspective. Do not stop at your org boundary. The customer does not know or care about it.
Follow up and ensure that the actions addressing the analyzed root causes are completed. Verify their effectiveness by observing the implemented solutions in operation.
Ensure that senior leadership is looking at RCAs. They can help make the big changes that the org requires to fix the root cause.
Share your learning. Don’t let others in the company rediscover the problem (e.g. cookie management).
Recognize a good RCA. Use it to teach others how to do it. Invite its authors to teach others how to do it well.
Give teams the time to do a proper RCA. It will pay back 10x in the maturity of the organization.
Conduct premortems and FMEAs as appropriate. (when to do what?)
Collaborate across teams on RCAs. Just because you are in one business unit does not mean that your RCA stops there.

  1. Wikipedia