Practice 4. Lightweight Root-Cause Analysis | Sustainable Software Development: An Agile Perspective

Practice 4. Lightweight Root-Cause Analysis

Use regular lightweight root-cause analysis sessions to understand what caused a defect and how to prevent it in the future. Root-cause analysis is an attempt to understand the real cause of a problem and to prevent similar problems in the future. It's important to stress lightweight; the goal is to learn and move on so that you can minimize the number of defects in your product and hence minimize the amount of time spent fixing defects overall.

Make sure your team and your users know the difference between a defect that must be fixed and one that does not need to be, or probably never will be, fixed. You can't afford to let defects hang around because any form of defect backlog, even a short-term one, is a drain on a team's focus and time.

You should also consider having a special "won't fix" state for defects in the following categories:

The cost to fix the defect far outweighs the benefit to the user. If the user benefit is high, then you should seriously consider rebranding the bug fix as a feature and treat it as such in planning.
The benefit to the user is very low or questionable.
The defect only occurs through some bizarre sequence of events that is virtually impossible.

Mark these defects "won't fix" and review the list periodically (perhaps once a year). Make sure that the database entry records your reasoning, and if that reasoning no longer applies, you might want to consider fixing the defect after all.
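As a minimal sketch of what such an entry and its periodic review might look like, the following assumes a hypothetical WontFixDefect record and a one-year review interval. Neither the field names nor the helper come from any particular defect-tracking system; they are illustrative only.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Assumption: "perhaps once a year" is modelled as a fixed 365-day interval.
REVIEW_INTERVAL = timedelta(days=365)


@dataclass
class WontFixDefect:
    """Hypothetical database entry for a defect marked won't fix."""
    defect_id: str
    summary: str
    reasoning: str       # why the team decided not to fix it
    last_reviewed: date  # when the reasoning was last checked


def due_for_review(defects, today):
    """Return the entries whose reasoning has not been checked for a year or more."""
    return [d for d in defects if today - d.last_reviewed >= REVIEW_INTERVAL]
```

Recording the reasoning next to the date is the point of the sketch: when an entry comes up for review, the team can see at a glance whether the original argument still holds.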

Well-run root-cause analysis sessions are an extremely efficient mechanism to get a team to collectively understand typical defects in a product and also to build collective ownership for preventing these defects in the future. The goal is to be super efficient time-wise and to understand the answers to three questions:

1. What caused this defect? If possible, record the cause of the defect in your defect-tracking database. Hopefully, you can do this update live during the root-cause analysis session itself. The goal is not to assign blame; the only reason to answer this question is so you can answer the following questions.
2. What mechanism, had it been in place, would have caught this defect? This question helps categorize what is required to catch defects before they get to customers. For example, would a test have caught the defect? More user testing? More communication between users and developers? A simple assertion failure in the code?
3. How can we prevent this defect or this class of defect in the future? The best possible method to prevent defects from recurring is to write a test or series of tests and add them to your test suite. Fixing defects is not an optimal use of time, and the greater the amount of time your team spends preventing defects, the more time you'll have to develop the features your customers need.

Of these questions, the third is the most important. The only reason you answer the first two is so you can answer the third. This is counter to what some consider standard doctrine for root-cause analysis, namely that the answers to the first two questions are where the effort should be spent. I strongly disagree. The value in root-cause analysis is in preventing other defects and in promoting a shared understanding in the team about what can be done to prevent defects. Quite often there are also useful discussions about what constitutes good code and bad code. By not apportioning blame and focusing on prevention, you are putting the emphasis on the positive aspects of the exercise, in particular the aspects that emphasize team learning and collaboration.
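To make the prevention step concrete, here is a small sketch of a regression test added after a fix. The defect is imagined for illustration: a hypothetical average function that raised ZeroDivisionError on an empty list. The function, its fix, and the test names are all assumptions, not taken from any real project.

```python
import unittest


def average(values):
    """Hypothetical function under test.

    The imagined defect was a ZeroDivisionError when values was empty;
    the guard below is the assumed fix.
    """
    if not values:
        return 0.0
    return sum(values) / len(values)


class AverageRegressionTest(unittest.TestCase):
    def test_empty_list_returns_zero(self):
        # Pins down the original defect so it cannot silently return.
        self.assertEqual(average([]), 0.0)

    def test_typical_values(self):
        # Guards the normal path against an over-eager fix.
        self.assertEqual(average([2, 4, 6]), 4.0)
```

Saved in a test module, these cases can be run with python -m unittest. Once they are part of the suite, this class of defect is caught before it reaches customers.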

It is important to have a simple and efficient process for the root-cause sessions. This ensures the process is repeatable, easily understood, easily changed, and therefore lightweight. A simple process might be something like:

Have a computer in the session with your bug tracking system running so changes can be made live during the session.
The root-cause session is not just for developers. Users and other team members should be present. Even if they get bored during some of the more technical conversations, their presence helps keep everyone motivated to keep those conversations short. Also, most software projects will find that a surprising number of defects would have been prevented through better communication. Having everyone present will help ensure that there is shared recognition of the communication problems so that everyone can work toward communicating more effectively.
Start by reviewing the priority of all the newly reported defects. Users should have the final say on the priority. Determine which defects are going to be fixed before the next session and assign them. Some defects may not be defects at all but feature requests. Other defects may require major effort to fix and so are effectively features. Convert these to feature cards so they can be prioritized during upcoming iteration planning meetings. Some defects will be fixed later by work that is already underway. These can be kept as a backlog; hopefully, these will be the only defects in your backlog.
Answer the questions. Remember that the benefit of this exercise is the collaboration and the process, not the documentation of the results. Therefore, you could keep the documentation simple by categorizing the answers so that you can review them later in a chart.
Spend a bit of time on your answers to question 3. Are there changes you can make right now to prevent similar problems in the future?
Have a quick discussion about the key lessons learned from the defects discussed in the session.
Whenever you do a retrospective, spend some time discussing the data you've collected through root-cause analysis.
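One way to keep the documentation simple, sketched below, is to store a short categorized answer to each of the three questions and tally the categories for a retrospective. The RootCauseRecord fields and the category strings are hypothetical; a real team would pick categories that fit its own defects and tools.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class RootCauseRecord:
    """Hypothetical categorized answers for one defect."""
    defect_id: str
    cause: str              # question 1: what caused the defect
    would_have_caught: str  # question 2: mechanism that would have caught it
    prevention: str         # question 3: how to prevent this class of defect


def tally(records, field):
    """Count how often each category appears in the given field."""
    return Counter(getattr(r, field) for r in records)


def text_chart(counts):
    """Render the counts as a crude text bar chart, most frequent first."""
    return "\n".join(f"{name}: {'#' * n} ({n})" for name, n in counts.most_common())
```

For example, text_chart(tally(records, "would_have_caught")) gives a quick picture of whether missing tests or missing communication accounts for most of the defects discussed.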

Tip: Try to Avoid Setting up Separate Root-Cause Analysis Sessions

If you keep the number of defects low in your product, try including defects in your iteration planning sessions. Write up an index card for each defect that is obvious to the users. Then you can record the root-cause analysis on each defect's index card or in your bug tracking system. This has the advantage that the defects are prioritized against all the other work that must be done in the iteration, and it avoids YAM (yet another meeting).

You should constantly evaluate how often to hold root-cause sessions and the process used in your retrospectives. I have worked on projects where we did it every day, once a week, once a month, and not at all. It all depends on how much time you want to spend at any one time and possibly also on where you are in your project. The key thing is to be agile and to anticipate a need to change the session frequency.

Root-Cause Analysis

Not all the teams I've worked on or with did root-cause analysis. However, as I mentioned in Chapter 4, there was one team where this was one of our cornerstone practices. Every week, my team sat down and went through every single defect in our software. These meetings lasted anywhere from 5 to 30 minutes, depending on the number of new defects we had to discuss. We briefly discussed every defect found in the previous week, and we focused our discussion on how we could prevent this and similar defects in the future, almost always through enhancing our automated or manual test plans.

Our weekly reviews kept us focused on eliminating and preventing defects. Essentially, we were motivated to keep our defect count low so we could maximize the amount of time working on new features and minimize the amount of time required for the weekly review. The weekly reviews also ensured that we fixed new defects in the week following the meeting. This practice allowed us to ship our product over a number of years, adding large numbers of new features while the number of defects stayed the same year after year.

Of all the practices in this book, root-cause analysis is probably the most prone to being ineffective and a time drain. Hence, it is vital to discuss your root-cause sessions in your retrospectives. Are they as effective as they could be? Is everyone who needs to be there? Are too many people attending? What changes to the process can be made to make them shorter and maximally effective? Talk about them as a team and aim to constantly improve their effectiveness. As with any practice, it is important to see some benefits early or the practice will not stick.

When starting out, I would advise being conservative. For example, if you have any kind of defect backlog, don't even think about doing root-cause analysis on the backlog until you have an efficient process that has been refined through applying it to some or all of your newly reported defects. I learned this the hard way on a project with a backlog of hundreds of defects. I was the new person on the team, so in a fit of enthusiasm, I organized a review session. After an hour, I realized this practice was a non-starter for this team because they had too many defects to cope with. Many of the defects had been reported months or even years earlier, and it became clear to me that all the team could do was concentrate on the defects that customers complained about the most. I was only on the project for a short time, and unfortunately I know that the problem with the backlog was never addressed. The project was cancelled shortly after I left, largely because of its poor return on investment: it took far more developers to keep the project going than the market size justified.