23.2. Finding a Defect | Code Complete: A Practical Handbook of Software Construction, Second Edition

Debugging consists of finding the defect and fixing it. Finding the defect and understanding it is usually 90 percent of the work.

Fortunately, you don't have to make a pact with Satan to find an approach to debugging that's better than random guessing. Debugging by thinking about the problem is much more effective and interesting than debugging with an eye of a newt and the dust of a frog's ear.

Suppose you were asked to solve a murder mystery. Which would be more interesting: going door to door throughout the county, checking every person's alibi for the night of October 17, or finding a few clues and deducing the murderer's identity? Most people would rather deduce the person's identity, and most programmers find the intellectual approach to debugging more satisfying. Even better, the effective programmers who debug in one-twentieth the time used by the ineffective programmers aren't randomly guessing about how to fix the program. They're using the scientific method, that is, the process of discovery and demonstration necessary for scientific investigation.

The Scientific Method of Debugging

Here are the steps you go through when you use the classic scientific method:

1. Gather data through repeatable experiments.
2. Form a hypothesis that accounts for the relevant data.
3. Design an experiment to prove or disprove the hypothesis.
4. Prove or disprove the hypothesis.
5. Repeat as needed.

The scientific method has many parallels in debugging. Here's an effective approach for finding a defect:

1. Stabilize the error.
2. Locate the source of the error (the "fault").
   a. Gather the data that produces the defect.
   b. Analyze the data that has been gathered, and form a hypothesis about the defect.
   c. Determine how to prove or disprove the hypothesis, either by testing the program or by examining the code.
   d. Prove or disprove the hypothesis by using the procedure identified in 2(c).
3. Fix the defect.
4. Test the fix.
5. Look for similar errors.

The first step is similar to the scientific method's first step in that it relies on repeatability. The defect is easier to diagnose if you can stabilize it, that is, make it occur reliably. The second step uses the steps of the scientific method. You gather the test data that divulged the defect, analyze the data that has been produced, and form a hypothesis about the source of the error. You then design a test case or an inspection to evaluate the hypothesis, and you either declare success (regarding proving your hypothesis) or renew your efforts, as appropriate. When you have proven your hypothesis, you fix the defect, test the fix, and search your code for similar errors.

Let's look at each of the steps in conjunction with an example. Assume that you have an employee database program that has an intermittent error. The program is supposed to print a list of employees and their income-tax withholdings in alphabetical order. Here's part of the output:

    Formatting, Fred Freeform     $5,877
    Global, Gary                  $1,666
    Modula, Mildred              $10,788
    Many-Loop, Mavis              $8,889
    Statement, Sue Switch         $4,000
    Whileloop, Wendy              $7,860

The error is that Many-Loop, Mavis and Modula, Mildred are out of order.

Stabilize the Error

If a defect doesn't occur reliably, it's almost impossible to diagnose. Making an intermittent defect occur predictably is one of the most challenging tasks in debugging.

An error that doesn't occur predictably is usually an initialization error, a timing issue, or a dangling-pointer problem. If the calculation of a sum is right sometimes and wrong sometimes, a variable involved in the calculation probably isn't being initialized properly; most of the time it just happens to start at 0. If the problem is a strange and unpredictable phenomenon and you're using pointers, you almost certainly have an uninitialized pointer or are using a pointer after the memory that it points to has been deallocated.

Cross-Reference

For details on using pointers safely, see Section 13.2, "Pointers."

Stabilizing an error usually requires more than finding a test case that produces the error. It includes narrowing the test case to the simplest one that still produces the error. The goal of simplifying the test case is to make it so simple that changing any aspect of it changes the behavior of the error. Then, by changing the test case carefully and watching the program's behavior under controlled conditions, you can diagnose the problem. If you work in an organization that has an independent test team, sometimes it's the team's job to make the test cases simple. Most of the time, it's your job.

To simplify the test case, you bring the scientific method into play again. Suppose you have 10 factors that, used in combination, produce the error. Form a hypothesis about which factors were irrelevant to producing the error. Change the supposedly irrelevant factors, and rerun the test case. If you still get the error, you can eliminate those factors and you've simplified the test. Then you can try to simplify the test further. If you don't get the error, you've disproved that specific hypothesis and you know more than you did before. It might be that some subtly different change would still produce the error, but you know at least one specific change that does not.
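The elimination loop just described can be sketched in a few lines of Python. Treat it as an illustration under stated assumptions rather than a finished tool: `simplify_test_case` and `reproduces_error` are hypothetical names, the factor labels are invented for the example, and the sketch assumes you can rerun the failing case on demand with any subset of factors.

```python
def simplify_test_case(factors, reproduces_error):
    """Drop factors one at a time, keeping each removal that still fails.

    factors: list of labels describing the failing test case.
    reproduces_error: callable that reruns the test with the given
        factors and returns True if the error still occurs.
    """
    kept = list(factors)
    for factor in factors:
        trial = [f for f in kept if f != factor]
        if reproduces_error(trial):
            kept = trial  # hypothesis confirmed: this factor is irrelevant
        # Otherwise the hypothesis is disproved and the factor stays.
    return kept


# Hypothetical stand-in for rerunning the program: in this example the
# error needs two of the four factors to be present.
def reproduces_error(factors):
    return "hyphenated name" in factors and "printed before save" in factors


original_case = [
    "hyphenated name",
    "entered singly",
    "printed before save",
    "large withholding",
]
print(simplify_test_case(original_case, reproduces_error))
# ['hyphenated name', 'printed before save']
```

A single pass like this is enough only when the factors don't interact in surprising ways. If removing one factor can change whether another one matters, repeat the pass until nothing more can be removed.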

In the employee withholdings example, when the program is run initially, Many-Loop, Mavis is listed after Modula, Mildred. When the program is run a second time, however, the list is fine:

    Formatting, Fred Freeform     $5,877
    Global, Gary                  $1,666
    Many-Loop, Mavis              $8,889
    Modula, Mildred              $10,788
    Statement, Sue Switch         $4,000
    Whileloop, Wendy              $7,860

It isn't until Fruit-Loop, Frita is entered and shows up in an incorrect position that you remember that Modula, Mildred had been entered just prior to showing up in the wrong spot too. What's odd about both cases is that they were entered singly. Usually, employees are entered in groups.

You hypothesize: the problem has something to do with entering a single new employee. If this is true, running the program again should put Fruit-Loop, Frita in the right position. Here's the result of a second run:

    Formatting, Fred Freeform     $5,877
    Fruit-Loop, Frita             $5,771
    Global, Gary                  $1,666
    Many-Loop, Mavis              $8,889
    Modula, Mildred              $10,788
    Statement, Sue Switch         $4,000
    Whileloop, Wendy              $7,860

This successful run supports the hypothesis. To confirm it, you want to try adding a few new employees, one at a time, to see whether they show up in the wrong order and whether the order changes on the second run.
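When you repeat an experiment like this many times, it helps to let the computer spot the misplaced names rather than reading each listing by eye. Here is a minimal sketch, assuming that "alphabetical order" means a plain case-insensitive string comparison (the real program's collation rules aren't given). `out_of_order_pairs` is a hypothetical helper, and the names are taken from the first listing in the example.

```python
def out_of_order_pairs(names):
    """Return adjacent (earlier, later) pairs that violate alphabetical order."""
    return [
        (first, second)
        for first, second in zip(names, names[1:])
        if first.casefold() > second.casefold()
    ]


first_run = [
    "Formatting, Fred Freeform",
    "Global, Gary",
    "Modula, Mildred",
    "Many-Loop, Mavis",
    "Statement, Sue Switch",
    "Whileloop, Wendy",
]
print(out_of_order_pairs(first_run))
# [('Modula, Mildred', 'Many-Loop, Mavis')]
```

Note that the check reports a pair, not a culprit. It can't tell you which of the two names is in the wrong place.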

Locate the Source of the Error

Locating the source of the error also calls for using the scientific method. You might suspect that the defect is a result of a specific problem, say an off-by-one error. You could then vary the parameter you suspect is causing the problem (one below the boundary, on the boundary, and one above the boundary) and determine whether your hypothesis is correct.
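A boundary probe of that kind might look like the following sketch. Everything in it is an assumption made for illustration: `probe_boundary`, `fits_on_page`, and the page size of 50 are hypothetical, and the off-by-one defect is planted deliberately so that the probe has something to find.

```python
def probe_boundary(routine, expected, boundary):
    """Compare routine with expected just below, on, and just above boundary.

    Returns {input: (actual, wanted)} for every input where they disagree.
    """
    mismatches = {}
    for value in (boundary - 1, boundary, boundary + 1):
        actual, wanted = routine(value), expected(value)
        if actual != wanted:
            mismatches[value] = (actual, wanted)
    return mismatches


PAGE_SIZE = 50  # hypothetical limit, chosen only for this example


def fits_on_page(line_count):
    # Planted off-by-one: should be <=, so exactly 50 lines is rejected.
    return line_count < PAGE_SIZE


print(probe_boundary(fits_on_page, lambda n: n <= PAGE_SIZE, PAGE_SIZE))
# {50: (False, True)}
```

An empty result would disprove the off-by-one hypothesis for that boundary, which is also useful information.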

In the running example, the source of the problem could be an off-by-one defect that occurs when you add one new employee but not when you add two or more. Examining the code, you don't find an obvious off-by-one defect. Resorting to Plan B, you run a test case with a single new employee to see whether that's the problem. You add Hardcase, Henry as a single employee and hypothesize that his record will be out of order. Here's what you find:

    Formatting, Fred Freeform     $5,877
    Fruit-Loop, Frita             $5,771
    Global, Gary                  $1,666
    Hardcase, Henry                 $493
    Many-Loop, Mavis              $8,889
    Modula, Mildred              $10,788
    Statement, Sue Switch         $4,000
    Whileloop, Wendy              $7,860

The line for Hardcase, Henry is exactly where it should be, which means that your first hypothesis is false. The problem isn't caused simply by adding one employee at a time. It's either a more complicated problem or something completely different.

Examining the test-run output again, you notice that Fruit-Loop, Frita and Many-Loop, Mavis are the only names containing hyphens. Fruit-Loop was out of order when she was first entered, but Many-Loop wasn't, was she? Although you don't have a printout from the original entry, in the original error Modula, Mildred appeared to be out of order, but she was next to Many-Loop. Maybe Many-Loop was out of order and Modula was all right.

You hypothesize again: the problem arises from names with hyphens, not names that are entered singly.

But how does that account for the fact that the problem shows up only the first time an employee is entered? You look at the code and find that two different sorting routines are used. One is used when an employee is entered, and another is used when the data is saved. A closer look at the routine used when an employee is first entered shows that it isn't supposed to sort the data completely. It only puts the data in approximate order to speed up the save routine's sorting. Thus, the problem is that the data is printed before it's sorted. The problem with hyphenated names arises because the rough-sort routine doesn't handle niceties such as punctuation characters. Now, you can refine the hypothesis even further.

You hypothesize one last time: names with punctuation characters aren't sorted correctly until they're saved.

You later confirm this hypothesis with additional test cases.

Tips for Finding Defects

Once you've stabilized an error and refined the test case that produces it, finding its source can be either trivial or challenging, depending on how well you've written your code. If you're having a hard time finding a defect, it could be because the code isn't well written. You might not want to hear that, but it's true. If you're having trouble, consider these tips:

Use all the data available to make your hypothesis. When creating a hypothesis about the source of a defect, account for as much of the data as you can in your hypothesis. In the example, you might have noticed that Fruit-Loop, Frita was out of order and created a hypothesis that names beginning with an "F" are sorted incorrectly. That's a poor hypothesis because it doesn't account for the fact that Modula, Mildred was out of order or that names are sorted correctly the second time around. If the data doesn't fit the hypothesis, don't discard the data; ask why it doesn't fit, and create a new hypothesis.

The second hypothesis in the example (that the problem arises from names with hyphens, not names that are entered singly) didn't seem initially to account for the fact that names were sorted correctly the second time around either. In this case, however, the second hypothesis led to a more refined hypothesis that proved to be correct. It's all right that the hypothesis doesn't account for all of the data at first as long as you keep refining the hypothesis so that it does eventually.

Refine the test cases that produce the error. If you can't find the source of an error, try to refine the test cases further than you already have. You might be able to vary one parameter more than you had assumed, and focusing on one of the parameters might provide the crucial breakthrough.

Exercise the code in your unit test suite. Defects tend to be easier to find in small fragments of code than in large integrated programs. Use your unit tests to test the code in isolation.

Cross-Reference

For more on unit test frameworks, see "Plug unit tests into a test framework" in Section 22.4.
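To make the idea of testing in isolation concrete, here is a small sketch. `sort_key` is a hypothetical comparison key, not the routine from the employee example (whose code isn't shown). It assumes that names should be compared case-insensitively with punctuation ignored. The test is written with plain `assert` statements so that it runs on its own or under a test framework.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCTUATION = str.maketrans("", "", string.punctuation)


def sort_key(name):
    """Hypothetical key: ignore punctuation and letter case when comparing."""
    return name.translate(_PUNCTUATION).casefold()


def test_punctuated_names_sort_in_order():
    names = [
        "Modula, Mildred",
        "Many-Loop, Mavis",
        "Fruit-Loop, Frita",
        "Formatting, Fred Freeform",
    ]
    assert sort_key("Many-Loop, Mavis") == "manyloop mavis"
    assert sorted(names, key=sort_key) == [
        "Formatting, Fred Freeform",
        "Fruit-Loop, Frita",
        "Many-Loop, Mavis",
        "Modula, Mildred",
    ]


test_punctuated_names_sort_in_order()
```

Because the key is exercised by itself, a failure here points at a few lines of code rather than at the whole enter, sort, save, and print sequence.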

Use available tools. Numerous tools are available to support debugging sessions: interactive debuggers, picky compilers, memory checkers, syntax-directed editors, and so on. The right tool can make a difficult job easy. With one tough-to-find error, for example, one part of the program was overwriting another part's memory. This error was difficult to diagnose using conventional debugging practices because the programmer couldn't determine the specific point at which the program was incorrectly overwriting memory. The programmer used a memory breakpoint to set a watch on a specific memory address. When the program wrote to that memory location, the debugger stopped the code and the guilty code was exposed.

This is an example of a problem that's difficult to diagnose analytically but that becomes quite simple when the right tool is applied.
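In a language without raw memory addresses, you can approximate the same idea by watching writes to a single field. The sketch below is only a rough analog of a debugger's memory breakpoint, not a replacement for one. `WatchedRecord` and the two routines are hypothetical, and `sys._getframe` is a CPython-specific way to learn the caller's name.

```python
import sys


class WatchedRecord:
    """Record the calling function for every write to one watched field."""

    def __init__(self, watched_field):
        # Bypass our own __setattr__ while setting up the bookkeeping.
        object.__setattr__(self, "_watched_field", watched_field)
        object.__setattr__(self, "writes", [])

    def __setattr__(self, name, value):
        if name == self._watched_field:
            caller = sys._getframe(1).f_code.co_name  # CPython-specific
            self.writes.append((caller, value))
        object.__setattr__(self, name, value)


def record_withholding(record):
    record.withholding = 5877


def reset_totals(record):
    record.withholding = 0  # the unexpected overwrite we want to catch


employee = WatchedRecord("withholding")
record_withholding(employee)
reset_totals(employee)
print(employee.writes)
# [('record_withholding', 5877), ('reset_totals', 0)]
```

The log names the routine responsible for each write, which is the same question the memory breakpoint answered in the story above.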

Reproduce the error several different ways. Sometimes trying cases that are similar to the error-producing case but not exactly the same is instructive. Think of this approach as triangulating the defect. If you can get a fix on it from one point and a fix on it from another, you can better determine exactly where it is.

As illustrated by Figure 23-1, reproducing an error several different ways helps diagnose the cause of the error. Once you think you've identified the defect, run a case that's close to the cases that produce errors but that should not produce an error itself. If it does produce an error, you don't completely understand the problem yet. Errors often arise from combinations of factors, and trying to diagnose the problem with only one test case often doesn't diagnose the root problem.

Figure 23-1. Try to reproduce an error several different ways to determine its exact cause

Generate more data to generate more hypotheses. Choose test cases that are different from the test cases you already know to be erroneous or correct. Run them to generate more data, and use the new data to add to your list of possible hypotheses.

Use the results of negative tests. Suppose you create a hypothesis and run a test case to prove it. Suppose further that the test case disproves the hypothesis, so you still don't know the source of the error. You do know something you didn't before: namely, that the defect is not in the area you thought it was. That narrows your search field and the set of remaining possible hypotheses.

Brainstorm for possible hypotheses. Rather than limiting yourself to the first hypothesis you think of, try to come up with several. Don't analyze them at first; just come up with as many as you can in a few minutes. Then look at each hypothesis and think about test cases that would prove or disprove it. This mental exercise is helpful in breaking the debugging logjam that results from concentrating too hard on a single line of reasoning.

Keep a notepad by your desk, and make a list of things to try. One reason programmers get stuck during debugging sessions is that they go too far down dead-end paths. Make a list of things to try, and if one approach isn't working, move on to the next approach.

Narrow the suspicious region of the code. If you've been testing the whole program or a whole class or routine, test a smaller part instead. Use print statements, logging, or tracing to identify which section of code is producing the error.

If you need a more powerful technique to narrow the suspicious region of the code, systematically remove parts of the program and see whether the error still occurs. If it doesn't, you know it's in the part you took away. If it does, you know it's in the part you've kept.

Rather than removing regions haphazardly, divide and conquer. Use a binary search algorithm to focus your search. Try to remove about half the code the first time. Determine the half the defect is in, and then divide that section. Again, determine which half contains the defect, and again, chop that section in half. Continue until you find the defect.

If you use many small routines, you'll be able to chop out sections of code simply by commenting out calls to the routines. Otherwise, you can use comments or preprocessor commands to remove code.

If you're using a debugger, you don't necessarily have to remove pieces of code. You can set a breakpoint partway through the program and check for the defect that way instead. If your debugger allows you to skip calls to routines, eliminate suspects by skipping the execution of certain routines and seeing whether the error still occurs. The process with a debugger is otherwise similar to the one in which pieces of a program are physically removed.
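The divide-and-conquer search described above can be automated when the program is a sequence of steps and you have a check that tells good state from bad. The following is a sketch under several assumptions, all spelled out in the docstring. The pipeline routines are hypothetical, and the third one carries a planted defect so that the search has something to find.

```python
def find_first_bad_step(steps, initial_state, is_ok):
    """Binary-search for the step that first makes is_ok() fail.

    Assumes each step is a pure function (state in, new state out), that
    the state is fine before any step runs, that it is bad after all
    steps have run, and that once it goes bad it stays bad.
    """
    def state_after(count):
        state = initial_state
        for step in steps[:count]:
            state = step(state)
        return state

    good, bad = 0, len(steps)  # fine after `good` steps, bad after `bad`
    while bad - good > 1:
        middle = (good + bad) // 2
        if is_ok(state_after(middle)):
            good = middle
        else:
            bad = middle
    return steps[bad - 1]


# Hypothetical pipeline; the third step carries a planted defect.
def strip_blanks(names):
    return [name.strip() for name in names]


def drop_duplicates(names):
    return list(dict.fromkeys(names))


def drop_hyphenated(names):
    return [name for name in names if "-" not in name]


def sort_names(names):
    return sorted(names)


pipeline = [strip_blanks, drop_duplicates, drop_hyphenated, sort_names]
names = ["Modula, Mildred", "Many-Loop, Mavis", "Global, Gary", "Global, Gary"]
culprit = find_first_bad_step(
    pipeline, names, lambda current: "Many-Loop, Mavis" in current
)
print(culprit.__name__)
# drop_hyphenated
```

With four steps the saving is small, but the number of reruns grows only with the logarithm of the number of steps, which is the point of halving the suspect region each time.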

Be suspicious of classes and routines that have had defects before. Classes that have had defects before are likely to continue to have defects. A class that has been troublesome in the past is more likely to contain a new defect than a class that has been defect-free. Reexamine error-prone classes and routines.

Cross-Reference

For more details on error-prone code, see "Target error-prone modules" in Section 24.5.

Check code that's changed recently. If you have a new error that's hard to diagnose, it's usually related to code that's changed recently. It could be in completely new code or in changes to old code. If you can't find a defect, run an old version of the program to see whether the error occurs. If it doesn't, you know the error's in the new version or is caused by an interaction with the new version. Scrutinize the differences between the old and new versions. Check the version control log to see what code has changed recently. If that's not possible, use a diff tool to compare changes in the old, working source code to the new, broken source code.
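If no diff tool is at hand, most languages can produce a serviceable comparison themselves. As one possibility, the sketch below uses `difflib.unified_diff` from Python's standard library. The two source snippets and the labels "old version" and "new version" are invented for the example.

```python
import difflib


def source_changes(old_source, new_source):
    """Return unified-diff lines between two versions of a source file."""
    return list(
        difflib.unified_diff(
            old_source.splitlines(),
            new_source.splitlines(),
            fromfile="old version",
            tofile="new version",
            lineterm="",
        )
    )


# Hypothetical before-and-after versions of one small routine.
old_version = "def sort_key(name):\n    return name.lower()\n"
new_version = "def sort_key(name):\n    return name\n"

for line in source_changes(old_version, new_version):
    print(line)
```

Lines beginning with "-" exist only in the old, working version, and lines beginning with "+" exist only in the new one. Those are the lines to scrutinize first.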

Expand the suspicious region of the code. It's easy to focus on a small section of code, sure that "the defect must be in this section." If you don't find it in the section, consider the possibility that the defect isn't in the section. Expand the area of code you suspect, and then focus on pieces of it by using the binary search technique described earlier.

Integrate incrementally. Debugging is easy if you add pieces to a system one at a time. If you add a piece to a system and encounter a new error, remove the piece and test it separately.

Cross-Reference

For a full discussion of integration, see Chapter 29, "Integration."

Check for common defects. Use code-quality checklists to stimulate your thinking about possible defects. If you're following the inspection practices described in Section 21.3, "Formal Inspections," you'll have your own fine-tuned checklist of the common problems in your environment. You can also use the checklists that appear throughout this book. See the "List of Checklists" following the book's table of contents.

Cross-Reference

For details on how involving other developers can put a beneficial distance between you and the problem, see Section 21.1, "Overview of Collaborative Development Practices."

Talk to someone else about the problem. Some people call this "confessional debugging." You often discover your own defect in the act of explaining it to another person. For example, if you were explaining the problem in the employee withholdings example, you might sound like this:

Hey, Jennifer, have you got a minute? I'm having a problem. I've got this list of employee withholdings that's supposed to be sorted, but some names are out of order. They're sorted all right the second time I print them out but not the first. I checked to see if it was new names, but I tried some that worked. I know they should be sorted the first time I print them because the program sorts all the names as they're entered and again when they're saved... wait a minute... no, it doesn't sort them when they're entered. That's right. It only orders them roughly. Thanks, Jennifer. You've been a big help.

Jennifer didn't say a word, and you solved your problem. This result is typical, and this approach is a potent tool for solving difficult defects.

Take a break from the problem. Sometimes you concentrate so hard you can't think. How many times have you paused for a cup of coffee and figured out the problem on your way to the coffee machine? Or in the middle of lunch? Or on the way home? Or in the shower the next morning? If you're debugging and making no progress, once you've tried all the options, let it rest. Go for a walk. Work on something else. Go home for the day. Let your subconscious mind tease a solution out of the problem.

The auxiliary benefit of giving up temporarily is that it reduces the anxiety associated with debugging. The onset of anxiety is a clear sign that it's time to take a break.

Brute-Force Debugging

Brute force is an often-overlooked approach to debugging software problems. By "brute force," I'm referring to a technique that might be tedious, arduous, and time-consuming but that is guaranteed to solve the problem. Which specific techniques are guaranteed to solve a problem depends on the context, but here are some general candidates:

Perform a full design and/or code review on the broken code.
Throw away the section of code and redesign/recode it from scratch.
Throw away the whole program and redesign/recode it from scratch.
Compile code with full debugging information.
Compile code at pickiest warning level and fix all the picky compiler warnings.
Strap on a unit test harness and test the new code in isolation.
Create an automated test suite and run it all night.
Step through a big loop in the debugger manually until you get to the error condition.
Instrument the code with print, display, or other logging statements.
Compile the code with a different compiler.
Compile and run the program in a different environment.
Link or run the code against special libraries or execution environments that produce warnings when code is used incorrectly.
Replicate the end-user's full machine configuration.
Integrate new code in small pieces, fully testing each piece as it's integrated.
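Of the techniques in this list, instrumenting the code with logging statements is one of the easiest to apply wholesale. Here is a minimal sketch using Python's standard `logging` and `functools` modules. The decorator name `traced`, the logger name "debug_trace", and the routine `withholding_for` with its 10 percent rate are all hypothetical.

```python
import functools
import logging

logger = logging.getLogger("debug_trace")  # hypothetical logger name


def traced(func):
    """Log the arguments and the result of every call to func."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        logger.debug("enter %s args=%r kwargs=%r", func.__name__, args, kwargs)
        result = func(*args, **kwargs)
        logger.debug("leave %s -> %r", func.__name__, result)
        return result
    return wrapper


@traced
def withholding_for(salary):
    # Hypothetical routine and rate, used only to show the decorator at work.
    return salary // 10


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG)
    withholding_for(58770)
```

Applying the decorator to every routine in a suspect module is tedious, but it leaves the defect few places to hide, and removing the instrumentation later is a matter of deleting one line per routine.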

Set a maximum time for quick and dirty debugging. For each brute-force technique, your reaction might well be, "I can't do that; it's too much work!" The point is that it's only too much work if it takes more time than what I call "quick and dirty debugging." It's always tempting to try for a quick guess rather than systematically instrumenting the code and giving the defect no place to hide. The gambler in each of us would rather use a risky approach that might find the defect in five minutes than the sure-fire approach that will find the defect in half an hour. The risk is that if the five-minute approach doesn't work, you get stubborn. Finding the defect the "easy" way becomes a matter of principle, and hours pass unproductively, as do days, weeks, months…. How often have you spent two hours debugging code that took only 30 minutes to write? That's a bad distribution of labor, and you would have been better off to rewrite the code than to debug bad code.

When you decide to go for the quick victory, set a maximum time limit for trying the quick way. If you go past the time limit, resign yourself to the idea that the defect is going to be harder to diagnose than you originally thought, and flush it out the hard way. This approach allows you to get the easy defects right away and the hard defects after a bit longer.

Make a list of brute-force techniques. Before you begin debugging a difficult error, ask yourself, "If I get stuck debugging this problem, is there some way that I am guaranteed to be able to fix the problem?" If you can identify at least one brute-force technique that will fix the problem (including rewriting the code in question), it's less likely that you'll waste hours or days when there's a quicker alternative.

Syntax Errors

Syntax-error problems are going the way of the woolly mammoth and the saber-toothed tiger. Compilers are getting better at diagnostic messages, and the days when you had to spend two hours finding a misplaced semicolon in a Pascal listing are almost gone. Here's a list of guidelines you can use to hasten the extinction of this endangered species:

Don't trust line numbers in compiler messages. When your compiler reports a mysterious syntax error, look immediately before and immediately after the error; the compiler could have misunderstood the problem or could simply have poor diagnostics. Once you find the real defect, try to determine the reason the compiler put the message on the wrong statement. Understanding your compiler better can help you find future defects.

Don't trust compiler messages. Compilers try to tell you exactly what's wrong, but compilers are dissembling little rascals, and you often have to read between the lines to know what one really means. For example, in UNIX C, you can get a message that says "floating exception" for an integer divide-by-0. With C++'s Standard Template Library, you can get a pair of error messages: the first message is the real error in the use of the STL; the second message is a message from the compiler saying, "Error message too long for printer to print; message truncated." You can probably come up with many examples of your own.

Don't trust the compiler's second message. Some compilers are better than others at detecting multiple errors. Some compilers get so excited after detecting the first error that they become giddy and overconfident; they prattle on with dozens of error messages that don't mean anything. Other compilers are more levelheaded, and although they must feel a sense of accomplishment when they detect an error, they refrain from spewing out inaccurate messages. When your compiler generates a series of cascading error messages, don't worry if you can't quickly find the source of the second or third error message. Fix the first one and recompile.

Divide and conquer. The idea of dividing the program into sections to help detect defects works especially well for syntax errors. If you have a troublesome syntax error, remove part of the code and compile again. You'll either get no error (because the error's in the part you removed), get the same error (meaning you need to remove a different part), or get a different error (because you'll have tricked the compiler into producing a message that makes more sense).

Cross-Reference

The availability of syntax-directed editors is one characteristic of early-wave vs. mature-wave programming environments. For details, see Section 4.3, "Your Location on the Technology Wave."

Find misplaced comments and quotation marks. Many programming text editors automatically format comments, string literals, and other syntactical elements. In more primitive environments, a misplaced comment or quotation mark can trip up the compiler. To find the extra comment or quotation mark, insert the following sequence into your code in C, C++, and Java:

/*"/**/

This code phrase will terminate either a comment or string, which is useful in narrowing the space in which the unterminated comment or string is hiding.