Wednesday, February 22, 2012

Five Whys - Getting to the root cause

Anything that happens, happens. Anything that, in happening, causes something else to happen, causes something else to happen. Anything that, in happening, causes itself to happen again, happens again. It doesn't necessarily do it in chronological order, though.
- Douglas Adams

Basic causality, cause and effect, is one of the cornerstones of addressing a problem. When something is broken, it makes sense to find out why it was broken so that you can fix the cause and not just this single instance of the problem.

However, if we stop at just one 'why', we're only fixing an instance of the real problem. Somewhere, probably under layers and layers of policy, process, excuses, finger pointing, and other bureaucratic red tape, is the real reason for the failure. It just takes a few more 'whys' to find it.

Five 'whys' is not a hard-and-fast rule. It is more of a general idea: you usually need to dig a little deeper to find the real cause of an issue. Sometimes it is three layers deep, sometimes it is six, but five is a good number to start with. Beginning with your original failure or problem, ask why it happened and then come up with a solid, honest answer. Take that answer and then ask why it happened. Continue this chain of 'whys' until you feel that you've really found the root of the problem. Discovering this root cause helps you determine what really needs to be addressed to prevent not only the issue at hand, but related issues as well.

Original problem: Our company's web services are intermittently unavailable and we're losing business!
Why? One of the web servers in the load balancer is encountering an error.
Why? The application wasn't deployed to that server correctly from our test server.
Why? The production deployment team made a mistake when deploying the application and post-deployment testing was not done.
Why? The production deployment team was in a hurry because they have too much work on their plate right now. Post-deployment testing wasn't done because the test team wasn't notified that the deployment was complete.
Why? The production deployment team is understaffed and overworked. This caused both their original mistake in deploying the service and their failure to send out the post-deployment notification email.
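
If you want to keep a record of a chain like the one above, it can be captured in a few lines of code. This is only a minimal sketch in Python: the `FiveWhys` class and its `ask_why`, `root_cause`, and `report` methods are names made up for this illustration, not part of any standard tool, and the answers are shortened versions of the ones in the example.

```python
from dataclasses import dataclass, field


@dataclass
class FiveWhys:
    """Hypothetical helper: a problem plus the answers to successive 'why?' questions."""
    problem: str
    answers: list[str] = field(default_factory=list)

    def ask_why(self, answer: str) -> None:
        # Each answer becomes the statement the next 'why?' is asked about.
        self.answers.append(answer)

    def root_cause(self) -> str:
        # The deepest answer reached so far, or the problem itself if no 'why' was asked.
        return self.answers[-1] if self.answers else self.problem

    def report(self) -> str:
        # Render the chain in the same layout as the worked example.
        lines = [f"Original problem: {self.problem}"]
        lines += [f"Why? {answer}" for answer in self.answers]
        return "\n".join(lines)


analysis = FiveWhys("Our web services are intermittently unavailable.")
analysis.ask_why("One of the web servers in the load balancer is encountering an error.")
analysis.ask_why("The application wasn't deployed to that server correctly.")
analysis.ask_why("The deployment team made a mistake and post-deployment testing was not done.")
analysis.ask_why("The team was in a hurry and the test team was never notified.")
analysis.ask_why("The deployment team is understaffed and overworked.")

print(analysis.report())
print("Root cause:", analysis.root_cause())
```

The code does not find the root cause for you. Each answer still has to be a solid, honest one; the structure just makes it easy to see how far down you have dug.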

There. We've found out the real why. We could dig down a couple more layers, but at this point we've gotten pretty close to the real cause of the production failure and know what changes we need to make to help prevent these kinds of failures in the future. By fixing the root cause of this issue, we will be correcting an organizational deficiency that would have caused other issues in the future. With a little extra effort and research, now we're not just fixing one problem, we're fixing an entire class of problems.

The next time you find yourself with an issue, don't just stop with the first why. Ask a few more whys and get to the heart of the problem. You might learn more about yourself and your organization than you expect.
