12.6 Research review

In the previous sections, we have considered how skill-based and rule-based errors result in program bugs and how we can avoid making knowledge-based errors during debugging. In this section, we present a brief review of experimental research on how programmers think when they’re debugging.

12.6.1 Youngs, 1974

Citation: “Human Errors in Programming” [Yo74]

Subjects: Thirty novices and twelve professional programmers

Programming language: Algol, Basic, COBOL, Fortran, and PL/1

Program size: Between twenty and eighty statements

Defect type: Introduced by the subjects during the experiment

Experimental procedure: The subjects implemented various simple numerical algorithms. During their work, compilation logs and execution output were saved automatically. The researcher analyzed questionnaires, compile logs, and execution logs after completion. The analysis of the programs was done by manually working backward from the final correct version to an initial faulty version. At each step, the errors corrected were grouped according to language construct and were classified as syntactic, semantic, logical, or clerical.

Conclusions: The decline in the number of bugs fits an exponential decay curve, with different constants for novice and experienced programmers. Experts eliminated syntactic and semantic errors quickly, while beginners were slower in correcting semantic errors.
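The exponential decay model named in these conclusions can be illustrated with a short sketch. The unit of progress (successive attempts), the initial bug count, and both decay constants are hypothetical values chosen for illustration; none of them is a figure reported in [Yo74].

```python
import math


def remaining_bugs(initial_bugs: float, decay_constant: float, attempt: int) -> float:
    """Expected bugs left after `attempt` attempts, assuming the model
    bugs(attempt) = initial_bugs * exp(-decay_constant * attempt)."""
    return initial_bugs * math.exp(-decay_constant * attempt)


# Hypothetical constants, for illustration only (not taken from the study).
NOVICE_DECAY = 0.3
EXPERT_DECAY = 0.6

if __name__ == "__main__":
    for attempt in range(6):
        novice = remaining_bugs(10, NOVICE_DECAY, attempt)
        expert = remaining_bugs(10, EXPERT_DECAY, attempt)
        print(f"attempt {attempt}: novice {novice:.2f}, expert {expert:.2f}")
```

Under this model, a larger decay constant means that the bug count falls faster, which is one way to read the difference between the two groups.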

12.6.2 Gould, 1975

Citation: “Some Psychological Evidence on How People Debug Computer Programs” [Go75]

Subjects: Ten expert programmers

Programming language: Fortran

Program size: Four statistical programs from a commercial subroutine library

Defect type: Introduced by the researcher beforehand

Experimental procedure: The researcher modified existing Fortran programs to contain bugs. A single line in each program was changed to introduce an error in an arithmetic expression, an invalid array subscript, or an incorrect loop limit. This produced a total of twelve erroneous programs.

The subjects were told to debug each of the programs. They were given labeled, formatted output, and were allowed to use an interactive debugger. The subjects were told to work as fast as possible, and they had a maximum of forty-five minutes on any given program. Their work was assessed based on the number of correct diagnoses, false diagnoses, and the time required to find the problem.

Conclusions: The subjects improved their time in debugging different versions of the same program. There was significant variation in the quality, quantity, and efficiency of the results produced by the programmers.

12.6.3 Brooke and Duncan, 1980

Citation: “Experimental Studies of Flowchart Use at Different Stages of Program Debugging” [BD80]

Subjects: Twenty high-school and college students

Programming language: “Warehouse” language for simple two-dimensional movement

Program size: Twenty to thirty statements in listings or flowcharts

Defect type: Inserted by researcher beforehand

Experimental procedure: The subjects were instructed to locate the procedure in a program that contained the error. Programs simulated the movement of a truck in a warehouse, which could pick up items and deliver them to one of four loading bays. Some subjects were given program listings; others received detailed flowcharts with equivalent content.

Conclusions: Flowcharts were found to be more useful than simple program listings when the debugging task primarily required following execution paths. Errors in identifying defect causes were equally likely with flowcharts and source listings.

12.6.4 Gilmore and Smith, 1984

Citation: “An Investigation of the Utility of Flowcharts during Computer Program Debugging” [GS84]

Subjects: Twenty-four psychology students, twenty-one of whom had taken a five-week course in POP11

Programming language: POP11

Program size: Twenty to thirty statements

Defect type: Introduced by the researcher beforehand

Experimental procedure: The programs to be debugged were sets of instructions for moving an object through a maze. The subjects were divided into three groups, each of which was given one representation of the program: a listing, a hierarchical diagram (Bowles diagram), or a flowchart. The subjects in each group attempted to debug six different versions of an algorithm. Each program contained one error, located in a conditional statement. The activities of the subjects were measured in terms of the time required to find the bug and the number of traces they needed to solve it.

Conclusions: The availability of flowcharts did not significantly improve the performance of the subjects. The authors suggest a number of areas needing further research, including choice of debugging strategies by programmers, cognitive style demonstrated by subjects, and the way differences in problem context (reading or writing programs) affect the best way to represent programs.

12.6.5 Anderson and Jeffries, 1985

Citation: “Novice LISP Errors: Undetected Losses of Information from Working Memory” [AJ85]

Experiments 1 and 2

Subjects: Thirty out of seventy-five undergraduates in a class on artificial intelligence

Programming language: LISP

Program size: Single LISP expression

Defect type: Introduced by the subjects during the experiment

Experimental procedure: Subjects evaluated the result of complete expressions; completed expressions in which a result was given, but an input was missing; and completed expressions in which a result was given, but a function name was missing. Expressions were presented in three forms: (1) basic, (2) with extraneous parentheses, and (3) with arguments provided through invoking the list reversal function.

Conclusions: Subjects who showed a general weakness in the subject matter experienced greater difficulty processing the more complex, but equivalent, expressions. Proper use of parentheses caused more problems than the semantics of the functions.

Experiment 3

Subjects: Twenty students

Programming language: LISP

Program size: Single LISP expression

Defect type: Introduced by the subjects during the experiment

Experimental procedure: Subjects were taught four basic LISP functions and given 336 LISP expressions to evaluate. Subjects worked at a CRT with keyboard and were told to work as quickly as possible.

Conclusions: Errors were primarily due to the misuse of parentheses, rather than the semantics of the functions used. Extra or missing pairs of parentheses were more frequent than unbalanced pairs of parentheses.
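To make the distinction in these conclusions concrete, the sketch below separates unbalanced parentheses, which a simple counter can detect, from an extra pair of parentheses, which leaves the expression balanced but changes its nesting. The function names and the sample expressions are illustrative assumptions, not material from [AJ85].

```python
def is_balanced(expr: str) -> bool:
    """Return True if every '(' in expr has a matching ')'."""
    depth = 0
    for ch in expr:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ')' appeared before its '('
                return False
    return depth == 0


def max_depth(expr: str) -> int:
    """Return the deepest nesting level reached in expr."""
    depth = deepest = 0
    for ch in expr:
        if ch == "(":
            depth += 1
            deepest = max(deepest, depth)
        elif ch == ")":
            depth -= 1
    return deepest


# Hypothetical sample expressions.
intended = "(car (cdr x))"      # the expression the subject meant to write
extra_pair = "(car ((cdr x)))"  # balanced, but with one pair too many
unbalanced = "(car (cdr x)"     # a closing parenthesis is missing

if __name__ == "__main__":
    for expr in (intended, extra_pair, unbalanced):
        print(expr, is_balanced(expr), max_depth(expr))
```

A balance check alone accepts the expression with the extra pair, so errors of this kind can only be caught by comparing the structure against what was intended.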

Experiment 4

Subjects: Twenty-six undergraduate students, half with no knowledge of LISP, the other half with minimal training

Programming language: LISP

Program size: Single LISP expression

Defect type: Introduced by the researchers beforehand

Experimental procedure: Subjects were given 180 expressions to assess for correctness, 120 of which contained errors.

Conclusions: The subjects who had received fifteen minutes of training in list structures did moderately better at detecting errors than the untrained group, but not significantly so.

12.6.6 Spohrer, Soloway, and Pope, 1985

Citation: “A Goal/Plan Analysis of Buggy Pascal Programs” [SSP85a]; “Where The Bugs Are” [SSP85b]

Subjects: About two hundred college students in an introductory programming class

Programming language: Pascal

Program size: Seventy to eighty lines

Defect type: Introduced by the subjects during the experiment

Experimental procedure: The researchers automatically collected each syntactically valid program that students compiled when solving a tax calculation programming assignment. The programs were analyzed manually to map them to a goal-and-plan tree representation of possible solutions to the assigned programming problem.

Conclusions: Student programmers often write programs in which code to achieve multiple logical goals is merged into a single physical source code unit. Those merged code sections include merge-related bugs in addition to bugs found in nonmerged versions. The merged versions contain more bugs than the nonmerged versions. Bugs peculiar to merged versions include loss of a goal and loss of constraints on achieving a goal.

12.6.7 Vessey, 1985, 1986

Citation: “Expertise in Debugging Computer Programs: A Process Analysis” [Ve85]; “Expertise in Debugging Computer Programs: An Analysis of the Content of Verbal Protocol” [Ve86]

Subjects: Sixteen professional programmers

Programming language: COBOL

Program size: A sales reporting program of about one hundred lines

Defect type: Introduced by the researcher beforehand

Experimental procedure: Prior to the experiment, half of the subjects were rated as novice and half as expert by their manager. The subjects were told to find the bug in a COBOL program by looking at a listing. The bug was a logic error inserted into one of two different locations in the program control structure.

They were told to verbalize their thoughts as they worked. These protocols were recorded and transcribed. The subjects were rated on the time they took to debug the program and the number of mistakes they made. After the experiment, the researcher rated the subjects as novice or expert, based on their demonstrated ability to absorb the meaning of a section of code and not return to it. The researcher’s rating system correlated more closely with the subjects’ performance than did the manager’s rating.

Conclusions: Expert programmers followed a breadth-first approach to program comprehension, while novices tended to follow a depth-first approach. Experts weren’t as committed to their hypotheses as novices and were more open to using new information. Expert programmers developed a mental model of the program structure and function, while novices were less likely to do so. The researcher believes that the main difference between novices and experts is that experts spend their time understanding the program while novices focus on finding a solution.

12.6.8 Gugerty and Olson, 1987

Citation: “Comprehension Differences in Debugging by Skilled and Novice Programmers” [GO87]

Experiment 1

Subjects: Eighteen novices taking a first Pascal course and six computer science graduate students

Programming language: Logo

Program size: Fifteen to fifty lines

Defect type: Introduced by the researchers beforehand

Experimental procedure: Subjects were trained in Logo programming. They were given three defective programs, as well as a drawing of what each program should generate. They were given a maximum of thirty minutes to debug each program. Subjects were told to think out loud, and their verbal protocols were recorded.

Experiment 2

Subjects: Ten novices completing a first Pascal course and ten computer science graduate students

Programming language: Pascal

Program size: Forty-six lines

Defect type: Introduced by the researchers beforehand

Experimental procedure: All subjects were familiar with Pascal. They were given a program listing, a listing of the input data file, a listing of the expected output, and a description of the program purpose, all of which were also available online. The subjects were given forty minutes to correct the program. An observer monitored what the subjects looked at and what they were keying.

Conclusions: Novices and experts use the same techniques for exploring a new program. Experts generate better hypotheses about the cause of defects, thus having fewer hypotheses to validate and correcting the defect in less time. Novices were more likely to add bugs in the process of trying to diagnose the original problem.

12.6.9 Kessler and Anderson, 1987

Citation: “A Model of Novice Debugging in LISP” [KA87]

Subjects: Eight undergraduate students with no LISP experience, whose programming background was at most a Pascal course

Programming language: LISP

Program size: Single-line LISP expressions

Defect type: Introduced by the researchers beforehand

Experimental procedure: Subjects were given a first lesson in LISP using an online tutorial. Subjects were presented with eighteen functions, twelve of which contained defects. The subjects would say whether they thought the function had a defect and then would invoke it. Then they would correct the function if necessary. The subjects were encouraged to talk aloud about their analysis as they worked. These protocols were recorded.

Conclusions: Most subjects began by trying to understand the code, and then they proceeded to execute it. The next phase of localization was difficult and time-consuming. The final repair of the problem was also challenging. Educators should teach students the skills of evaluating expressions, localizing problems, and correcting them, in addition to teaching them how to write programs.

12.6.10 Spohrer and Soloway, 1987

Citation: “Analyzing the High Frequency Bugs in Novice Programs” [SS87]

Subjects: Sixty-one students enrolled in an introductory Pascal programming class

Programming language: Pascal

Program size: Sixty to one hundred statements

Defect type: Introduced by the subjects during the experiment

Experimental procedure: The researchers automatically collected each syntactically valid program students compiled when solving three programming assignments. The programs were analyzed manually to map them to a goal-and-plan tree representation of possible solutions to the assigned programming problem. A total of 183 programs were collected; 25 were excluded because the program was significantly incomplete. Of the 101 defect types identified, the researchers developed explanations for the causes of the 11 types that together accounted for over one-third of all the defects.

Conclusions: Some defect types occur much more frequently than others in the work of novice programmers. Many of these defects aren’t related to the semantics of a particular programming language construct, but rather to more general programming issues. Instructors should teach students to recognize these types of errors.

12.6.11 Katz and Anderson, 1988

Citation: “Debugging: An Analysis of Bug-Location Strategies” [KA88]

Experiment 1

Subjects: Groups of thirteen, twenty, and eighteen undergraduates taking a LISP course

Programming language: LISP

Program size: Fewer than twenty lines

Defect type: Introduced by the subjects during the experiment

Experimental procedure: The first two groups of subjects solved exercises from an introductory programming class with the assistance of an automatic tutor program, which provided feedback when they made errors. This tutor also collected data on the subjects’ efforts as they worked. The third group wrote functions in an open environment, with a human tutor available. Both verbal protocols and a complete record of the keyboard activities were collected.

Conclusions: Errors were classified as goal errors, misunderstandings of the problem, intrusions of previous solutions, misconceptions about LISP features, and syntax errors. No defects spanned more than one line. Subjects did not generally repeat the same bug within or between programs.

Experiment 2

Subjects: Eight undergraduate students taking a LISP course, who had previously taken a Pascal course

Programming language: LISP

Program size: Three programs of fewer than twenty lines

Defect type: Introduced by the subjects during the experiment

Experimental procedure: The subjects wrote three LISP functions with the aid of a tutor. Subjects were encouraged to talk aloud as they worked, and these verbal protocols were recorded.

Conclusions: Locating the problem was the most difficult part of the debugging process. Subjects had little problem correcting identified problems. Subjects used three strategies to locate bugs: mapping from error messages to the location of the defect in the program, manually executing the program with sample values, and reasoning from the (incorrect) output back to the program.

Experiment 3

Subjects: Thirty-six undergraduate students taking introductory LISP

Programming language: LISP

Program size: Fewer than twenty lines

Defect type: Introduced by the subjects during the experiment

Experimental procedure: Subjects debugged four LISP functions that they had written themselves and four that had been written by other subjects. One-third of the students completed the assignments in an open environment; the others used the LISP tutor. Both verbal protocols and a complete record of the keyboard activities were collected.

Conclusions: At the start of a debugging session, subjects spent less time looking at programs they had written than those written by others. They also were significantly less successful in locating the bugs in programs written by others than those they wrote themselves. Subjects debugging their own functions tended to use backward-reasoning strategies (mapping errors to program lines, reasoning from incorrect output to program location), whereas those debugging functions written by others tended to use forward-reasoning strategies (building a mental model of the program, manual execution).

Experiment 4

Subjects: Twenty-seven undergraduates taking a LISP class; all had taken one or two other programming classes

Programming language: LISP

Program size: Fewer than twenty lines

Defect type: Introduced by the researchers beforehand

Experimental procedure: Every subject worked with two sets of six programs. Eight of the programs had one bug, two programs had two bugs, and two programs had no bugs. The subjects were divided into three groups. With the first set of programs, the first group used forward-reasoning strategies, the second group used backward-reasoning strategies, and the third group could use any strategy. The use of strategies was forced by using an interactive tool. With the second set of programs, all subjects could use any strategy.

Conclusions: When the subjects were allowed to use any debugging strategy, they still used the one they had trained on with the first set of programs. The subjects allowed to use any strategy worked the fastest but made the most mistakes. The subjects forced to work backward were the slowest but most accurate.

12.6.12 Vessey, 1989

Citation: “Toward a Theory of Computer Program Bugs: An Empirical Test” [Ve89]

Subjects: Seventy-eight students and thirty-eight professionals

Programming language: COBOL

Program size: A sales reporting program of about one hundred lines

Defect type: Introduced by the researcher beforehand

Experimental procedure: The same logic error was introduced in each of four levels of the program’s control-flow hierarchy. Subjects were told to verbalize their thoughts as they worked. These protocols were recorded and transcribed. The subjects were rated on the time they took to debug the program and the number of mistakes they made.

Conclusions: The time to locate and correct a bug wasn’t related to the location of the bug in the control-flow hierarchy. Expert programmers debugged more quickly and with fewer mistakes than the novices.

12.6.13 Carver, 1989

Citation: “Programmer Variations in Software Debugging Approaches” [Ca89]

Subjects: Three experienced programmers

Programming language: Not specified

Program size: A billing system of thirteen modules

Defect type: Introduced by the subjects during the experiment

Experimental procedure: Each programmer worked on a different part of the same system. The purpose of the study was to analyze the volume of changes to a program that a programmer makes before testing those changes.

Conclusions: The researcher observed consistent patterns of behavior in each programmer, but no generalizations could be drawn because of the limited sample size and the fact that each programmer worked on different problems.

12.6.14 Stone, Jordan, and Wright, 1990

Citation: “The Impact of Pascal Education on Debugging Skill” [SJW90]

Experiment 1

Subjects: 124 students in five COBOL courses

Programming language: COBOL

Program size: 319-line unstructured, 362-line poorly structured, 417-line well-structured versions

Defect type: Introduced by the researcher beforehand

Experimental procedure: The three versions of the program were randomly distributed to the subjects. All subjects received a program listing and an output listing with the erroneous output clearly marked. The subjects were given twenty minutes to locate and correct the error.

Experiment 2

Subjects: Forty-two students in an introductory COBOL course

Programming language: COBOL

Program size: 319-line unstructured, 417-line well-structured versions

Defect type: Introduced by the researcher beforehand

Experimental procedure: Two versions of the program were randomly distributed to the subjects. All subjects received a program listing and an output listing with the erroneous output clearly marked. The subjects were given thirty minutes to locate and correct the error.

Conclusions: Pascal education was strongly correlated with the ability to diagnose and correct errors. Pascal education was also strongly correlated with being a computer science major, having taken a greater number of previous programming courses, and having previous professional programming experience. Pascal education wasn’t correlated with the ability to maintain structured programs.

12.6.15 Allwood and Björhag, 1991

Citation: “Novices’ Debugging When Programming in Pascal” [AB91]

Subjects: Eight novices taking a Pascal programming class

Programming language: Pascal

Program size: Not given

Defect type: Introduced by the subjects during the experiment

Experimental procedure: The subjects were given a written specification of a numerical programming problem. They were told to think out loud, and their verbal protocols were written down. All activity was recorded on video. The researchers manually analyzed the recorded protocols. Errors were classified as syntactic, semantic, or logical. The researchers analyzed the triggers, durations, and actions of so-called evaluative episodes.

Conclusions: Subjects spent most of the time debugging. They would have spent an even greater amount of time debugging if they hadn’t been stopped by the observers when time expired. Even though they were working on programs they wrote, the subjects spent much of their time interpreting and understanding the program.

12.6.16 Ebrahimi, 1994

Citation: “Novice Programmer Errors: Language Constructs and Plan Composition” [Eb94]

Experiment 1

Subjects: Eighty undergraduate students, with four groups of twenty each taking one of the four languages

Programming language: Pascal, C, Fortran, and LISP

Program size: Fewer than ten statements

Defect type: Introduced by the researcher beforehand

Experimental procedure: The subjects were asked to evaluate program segments manually with the specified input data. After processing all of the segments, the subjects were interviewed and asked to think through the problems aloud. The researcher classified the errors made by the subjects according to the programming constructs used.

Experiment 2

Subjects: Eighty undergraduate students, with four groups of twenty each taking one of the four languages

Programming language: Pascal, C, Fortran, and LISP

Program size: Fewer than one hundred statements

Defect type: Introduced by the subjects during the experiment

Experimental procedure: The subjects were given a written problem definition. They wrote programs to implement the specification. The researcher evaluated the programs submitted and recorded the errors. The researcher classified the errors in the programs, comparing the correct plan and the actual plan used by the student to solve the problem.

Conclusions: The language constructs most frequently misused were loop termination conditions, logical operators in conditional statements, and language-specific features. The plan element most frequently missing or in error was a conditional statement serving as a guard.
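As an illustration of two of the problem areas named in these conclusions, the sketch below shows a loop termination condition and a guarding conditional in a small averaging routine. The routine, its sentinel value, and its names are hypothetical and are not taken from [Eb94]; Python is used only for brevity, since the study itself covered Pascal, C, Fortran, and LISP.

```python
from typing import Optional, Sequence

SENTINEL = -1  # hypothetical end-of-input marker


def average_until_sentinel(values: Sequence[float]) -> Optional[float]:
    """Average the values that precede SENTINEL; return None if there are none."""
    total = 0.0
    count = 0
    index = 0
    # Loop termination: stop at the sentinel or at the end of the data.
    # Testing only for the sentinel would run past the end of the input
    # whenever the sentinel is missing.
    while index < len(values) and values[index] != SENTINEL:
        total += values[index]
        count += 1
        index += 1
    # Guard: without this check, empty input would cause a division by zero.
    if count == 0:
        return None
    return total / count
```

Removing either the bounds test in the loop condition or the guard before the division yields a program that works on typical input and fails only on boundary cases, which is the kind of defect the conclusions describe.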

12.6.17 Summary

Performing experimental research on programmers doing debugging is difficult for several reasons. The subjects available to university researchers are usually novice programmers, and novice and expert programmers don’t exhibit the same behaviors. Recruiting enough professional programmers for a psychological experiment to yield statistically significant results is expensive. Debugging also involves a number of skills and behaviors, which are difficult to separate in a controlled experiment.

Previous experimental research in debugging has limited value to the professional programmer. Researchers tested programming languages and development techniques that are rarely used in modern computing environments. Most experiments were performed on tiny programs that aren’t typical of the complex software that professional programmers develop.

Some of the experimental work is also difficult to assess because of the limited sample size, the design of the experiment, or the subjects used. Most of the subjects in the experiments were complete novices taking a first or second programming course. The related topic of program comprehension has interested researchers more recently, but a survey of that literature is beyond the scope of this section.