Debugging Techniques | Game Coding Complete

I think I could write an entire book about debugging. Certainly many people have, and for good reason. You can't be a good programmer unless you have at least passable debugging skills. Imagine for a moment that you are a programmer who never writes buggy code. Hey, stop laughing. I also want you to close your eyes and imagine that you have absolutely no skill at debugging. Why would you? Your code is always perfect! The moment you are assigned to a team of programmers, your days are numbered. If you can't solve logic problems caused by other programmers' code, you are useless to a team.

If you have good debugging skills, you'll have much more fun programming. I've always looked at really tough bugs as a puzzle. Computers are deterministic and they execute instructions without interpretation. That truth paves your way to solve every bug if you devote enough patience and skill.

Debugging Is an Experiment

When you begin a bug hunt, one implication is that you know how to recognize a properly running program. For any piece of code you should be able to predict its behavior just by carefully reading each line. As an aggregate of modules, a large program should, in theory, accept user input and game data and act in a deterministic way.

Debugging a program requires that you figure out why the behavior of the program is different than what you expect. Certainly the computer's CPU isn't surprised. It executes exactly what you instructed. This delta is the cornerstone of debugging. As each instruction executes, the programmer tests the new state of the process against the predicted state by looking at memory and the contents of variables. The moment the prediction is different than the observed, the programmer has found the bug.

Clearly you have to be able to predict the behavior of the system, given certain stimulus such as user input or a set of game data files. You should be able to repeat the steps to watch the behavior on your machine or anyone else's machine. When the bug manifests itself as a divergence from nominal operation you should be able to use what you observed to locate the problem or at least narrow the source of the problem. Repeat these steps enough times and you'll find the bug. What I've just described is the method any scientist uses to perform experiments.

Best Practice

It might seem odd to perform experiments on software, certainly odd when you wrote the software in question. Scientists perform experiments on complicated phenomena that they don't understand in the hopes that they will achieve knowledge. Why then must programmers perform experiments on systems that spawned from their own minds? The problem is that even the simplest, most deterministic systems can behave unpredictably given particular initial states. If you've never read Stephen Wolfram's book, A New Kind of Science, take a few months off and try to get through it. This book makes some surprising observations about complex behavior of simple systems. I'll warn you that once you read it you may begin to wonder about the determinism of any system, no matter how simple!

Debugging is a serious scientific endeavor. If you approach each debugging task as an experiment, just like you were taught in high school, you'll find that debugging is more fun and less frustrating.

Complex and unpredicted behavior in computer programs requires setting up good debugging experiments. If you fell asleep during the lecture in high school on the scientific method, now's a good time to refresh your memory. The examples listed in Table 12.1 show you how to run a successful experiment, but there's a lot more to good debugging than blindly running through the experimental method.

Table 12.1: How to Run a Successful Debugging Experiment.
Scientific Method as it Applies to Software Systems	Example #1	Example #2
Step 1: Observe the behavior of a computer game.	Observation: A call to OpenFile() always fails.	Observation: The game crashes on the low-end machine when it tries to initialize.
Step 2: Attempt to explain the behavior that is consistent with your observations and your knowledge of the system.	Hypothesis: The input parameters to OpenFile() are incorrect, specifically the filename.	Hypothesis: The game is crashing because it is running out of video memory.
Step 3: Use your explanation to make predictions.	Predictions: If the proper filename is used, OpenFile() will execute successfully.	Predictions: If the amount of video memory were increased, the game will initialize properly. The game will crash when the original amount of video memory is restored.
Step 4: Test your predictions by performing experiments or making additional observations. Modify the hypothesis and predictions based on the results.	Experiment: Send the fully qualified path name of the file and try OpenFile() again.	Experiment: Switch the current video card with others that have more memory.
Step 5: Repeat steps three and four until there is no discrepancy between your explanations and the observations.	Results: OpenFile() executed successfully with a fully qualified path name.	Results: The game properly initializes with an 8MB video card installed.
Step 6: Explain the results.	Explanation: The current working directory is different than the location of the file in question. The path name must be fully qualified.	Explanation: Video memory requirements have grown beyond expectations.
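Example #1 from Table 12.1 is small enough to run as a literal experiment. The sketch below is only an illustration under stated assumptions: TryOpenFile() is a hypothetical stand-in for OpenFile() built on std::ifstream, and RunOpenFileExperiment() and ExperimentResult are invented names. It needs C++17 for std::filesystem.

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <string>

namespace fs = std::filesystem;

// Hypothetical stand-in for the OpenFile() call in Table 12.1.
bool TryOpenFile(const fs::path& name)
{
    std::ifstream in(name);
    return in.is_open();
}

// The two observations the experiment produces.
struct ExperimentResult
{
    bool relativeOpened;   // bare file name, resolved against the working directory
    bool qualifiedOpened;  // fully qualified path name
};

// Prediction under test: the open fails only because the working directory
// (workDir) is not the directory that holds the file (dataDir).
ExperimentResult RunOpenFileExperiment(const fs::path& dataDir,
                                       const fs::path& workDir,
                                       const std::string& fileName)
{
    assert(fs::is_directory(workDir));

    // Qualify the path before the working directory changes.
    const fs::path qualified = fs::absolute(dataDir / fileName);
    const fs::path previous = fs::current_path();

    fs::current_path(workDir);             // reproduce the failing setup
    ExperimentResult result{};
    result.relativeOpened = TryOpenFile(fileName);
    result.qualifiedOpened = TryOpenFile(qualified);
    fs::current_path(previous);            // change one thing, then put it back

    return result;
}
```

If the bare name fails and the qualified name succeeds, the observations match the prediction and the working-directory hypothesis stands. Any other combination sends you back to step two.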

The first step seems easy: Observe the behavior of the system. Unfortunately, this is not so easy. The most experienced software testers I know do their very best to accurately observe the behavior of a game as it breaks. They record what keys they pressed, what options they turned off, and at best exactly what they did. In many cases, they leave out something innocuous. One of the first things I do when I don't observe the same problem a tester observed is I go down to the test lab myself and watch them reproduce the bug. Sometimes I'll notice a little wiggle of the mouse or the fact that they're running in full-screen mode and have a "Eureka" moment.

Gotcha

Unlike most software systems, games rely not only on random numbers but they change vast amounts of data extremely quickly in seemingly unpredictable ways. The difficulty in finding game bugs lies in the simple fact that games run so much code so quickly that it's easy for a bug to come from any of the many subsystems that manipulate the game state.

The second step, attempt to explain the behavior, can be pretty hard if you don't know the software like the back of your hand. It's probably safe to say that you should know the software, the operating system, the CPU, video hardware, and audio hardware pretty well too. Sound tough? It is. It also helps to have a few years of game programming under your belt, so that you've been exposed to the wacky behavior of broken games. This is probably one of the most frustrating aspects of programming in general—a lack of understanding and experience can leave you shaking your head in dismay when you see your game blow up in your face. Everybody gets through it, though, usually with the help of, dare I say, more experienced programmers.

Steps three through five represent the classic experiment-phase of debugging. Your explanation will usually inspire some sort of test, input modification, or code change that should have predictable results. There's an important trick to this rinse and repeat cycle: Take detailed notes of everything you do. Inevitably your notes will come in handy as you realize that you're chasing a dead-end hypothesis. They should send you back to the point where your predictions were accurate. This will put you back on track.

Best Practice

Another critical aspect to the experiment-driven debugging process is that you should try to limit your changes to one small thing at a time. If you change too much during one experiment cycle you won't be able to point to the exact change that fixed the problem. Change for change's sake is a horrible motivation to modify buggy code. Resist that temptation. Sometimes there is a desire to rip a subsystem out altogether and replace it without truly understanding the nature of the problem. This impulse is especially strong when the subsystem in question was written by a programmer that has less than, shall we say, stellar design and coding skills. The effect of this midnight remodeling is usually negative since it isn't guaranteed to fix the bug and you'll demoralize your teammate at the same time.

Assuming you follow Table 12.1 you'll eventually arrive at the source of the problem. If you're lucky the bug can be fixed with a simple tweak of the code—perhaps a loop exited too soon or a special case wasn't handled properly. You make your mod, rebuild the game, and perform your experiments one last time. Congratulations, your bug is fixed. Not every programmer is so lucky, certainly I haven't been. Some bugs, once exposed in their full glory, tell you things about your game that you don't want to hear. I've seen bugs that told us we had to completely redesign the graphics library we were using. Other bugs enjoy conveying the message that some version of Windows can't be supported without sweeping modifications. Others make you wonder how the game ever worked in the first place. If this ever happens to you, and I'm sure it will, I feel your pain. Grab some caffeine and your sleeping bag; it's going to be a long night.

Reproducing the Bug

A prerequisite of observing the behavior of a broken game is reproducing the bug. I've seen bug reports that say things like, "I was doing so-and-so and the game crashed. I couldn't get it to happen again." In light of an overwhelming number of reports of this kind, you might be able to figure out what's going on. Alone, these reports are nearly useless. You cannot fix what you cannot observe. After all, if you can't observe the broken system with certainty, how can you be sure you fixed the problem? You can't. Most bugs can be reproduced easily by following a specific set of steps, usually observed and recorded by a tester. It's important that each step, however minor, is recorded from the moment the game is initialized. Anything missing might be important. Also, the state of the machine, including installed hardware and software, might be crucial to reproducing the bug's behavior.

Gotcha

Bugs are sometimes tough to nail down. They can be intermittent or disappear altogether as you attempt to create a specific series of steps that will always result in a manifestation of the problem. This can be explained in two ways: Either an important step or initial state has been left out, or the bug cannot be reproduced because the system being tested is too complex to be deterministic. Even if the bug can be reproduced exactly, it might be difficult to create an explanation of the problem. In both of these cases, you must find a way to reduce the complexity of the system. Only then can the problem domain become small enough to understand.

Eliminating Complexity

A bug can only manifest itself if the code that contains it is executed. Eliminate the buggy code, and the bug will disappear. By the process of elimination you can narrow your search over a series of steps to the exact line of code that is causing the problem. You can disable subsystems in your game, one by one. One of the first things to try is to disable the entire main loop and have your game initialize and exit without doing anything else. This is a good trick if the bug you're hunting is a memory leak. If the bug goes away you can be sure that it only exists in the main loop somewhere.

Best Practice

Sound systems are usually multi-threaded and can be a source of heinous problems. If you believe a bug is somewhere in the sound system, disable your sound system and rerun the game. If the bug disappears, turn the sound system back on but eliminate only sound effects. Leave the music system on. Divide and conquer as necessary to find the problem. If the bug is in the sound system somewhere you'll find it.

You should be able to creatively disable other systems as well, such as animation or AI. Once these systems are stubbed out, your game will probably act pretty strangely, and you don't want this strangeness to be interpreted as the bug you are looking for. You should have a pretty complete understanding of your game before you embark on excising large pieces of it from execution.

Best Practice

If you can't simply stub out the AI of your game, replace the AI routines with the most trivial AI code you can write, and make triply sure it is bug free and will have limited, predictable, side effects. You can then slowly add the complex AI systems back in, one at a time, and rerun your tests to see when the bug pops back in.

If your game has options for sound, animation, and other subsystems you can use these as debugging tools without having to resort to changing code. Turn everything off via your game options and try to reproduce the bug. Whether the bug continues to exist or disappears, the information you'll gain from the experiment is always valuable. As always, keep good records of what you try and try to change only one option at a time. You can take this tactic to extremes and perform a binary search of sorts to locate a bug. Stub out half of your subsystems and see if the bug manifests itself. If it does, stub out half of what remains and repeat the experiment. Even in a large code base, you'll quickly locate the bug.
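The binary search described above can be sketched in a few lines. This is a hedged illustration, not a general tool: FindBuggySubsystem() and the bugReproduces callback are hypothetical names, and the sketch assumes exactly one subsystem is responsible and that the bug reproduces reliably whenever that subsystem is enabled.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Returns the index of the one subsystem that must be enabled for the bug
// to appear. bugReproduces receives one flag per subsystem (true = enabled,
// false = stubbed out) and reports whether the bug showed up in that run.
std::size_t FindBuggySubsystem(
    std::size_t subsystemCount,
    const std::function<bool(const std::vector<bool>&)>& bugReproduces)
{
    assert(subsystemCount > 0);

    std::size_t lo = 0;
    std::size_t hi = subsystemCount;       // suspects are [lo, hi)
    while (hi - lo > 1)
    {
        const std::size_t mid = lo + (hi - lo) / 2;

        // Enable only the lower half of the suspects; stub out the rest.
        std::vector<bool> enabled(subsystemCount, false);
        for (std::size_t i = lo; i < mid; ++i)
            enabled[i] = true;

        if (bugReproduces(enabled))
            hi = mid;                      // culprit is in the enabled half
        else
            lo = mid;                      // culprit is in the stubbed half
    }
    return lo;
}
```

With eight subsystems this takes three runs of the game instead of up to eight. In practice the callback is you, running the repro steps and writing the result in your notes.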

If the bug eludes this process, it might depend on the memory map of the process. Change the memory contents of your process and the bug will change too. Because this might be true, it's a good idea to stub out subsystems via a simple boolean value, but leave their code and global data in place as much as possible. This is another example of making small changes rather than large ones.
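One way to stub out subsystems with simple boolean values, while leaving their code and data linked into the process, is sketched below. All of the names (DebugToggles, UpdateFrame, and the three update functions) are hypothetical stand-ins; the trace vector exists only so the effect of each switch can be observed.

```cpp
#include <cassert>
#include <string>
#include <vector>

// One switch per subsystem. Everything defaults to on, so a normal run
// behaves exactly as it did before the switches were added.
struct DebugToggles
{
    bool sound = true;
    bool animation = true;
    bool ai = true;
};

// Hypothetical stand-ins for real subsystem updates.
void UpdateAI(std::vector<std::string>& trace)        { trace.push_back("ai"); }
void UpdateAnimation(std::vector<std::string>& trace) { trace.push_back("animation"); }
void UpdateSound(std::vector<std::string>& trace)     { trace.push_back("sound"); }

// The subsystem code stays compiled and linked; only the call is skipped,
// which keeps the memory layout of the process close to the original.
void UpdateFrame(const DebugToggles& toggles, std::vector<std::string>& trace)
{
    if (toggles.ai)        UpdateAI(trace);
    if (toggles.animation) UpdateAnimation(trace);
    if (toggles.sound)     UpdateSound(trace);
}
```

Because the toggles are plain data, they can also be flipped from a watch window while the game is stopped at a breakpoint, without a rebuild.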

Setting the Next Statement

Most debuggers give you the power to set the next statement to be executed, which is equivalent to setting the instruction pointer directly. This can be useful if you know what you are doing, but it can be a source of mayhem applied indiscriminately. You might want to do this for a few reasons. You may want to skip some statements or rerun a section of code again with different parameters as a part of a debugging experiment. You might also be debugging through some assembler and you want to avoid calling into other pieces of code.

You can set the next statement in Visual C++ by right clicking on the target statement and selecting "Set Next Statement" from the popup menu. In other debuggers, you can bring up a register window and set the EIP register, also known as the instruction pointer, to the address of the target statement, which you can usually find by showing the disassembly window. You must be mindful of the code that you are skipping and the current state of your process. When you set the instruction pointer, it is equivalent to executing an assembly level JMP statement, which simply moves the execution path to a different statement.

In C++, objects can be declared inside local scopes such as for loops. In normal execution, these objects are destroyed when execution passes out of that scope. The C++ compiler inserts the appropriate code to do this, and you can't see it unless you look at a disassembly window. What do you suppose happens to C++ objects that go out of scope if you skip important lines of code? Let's look at an example:

 #include <cstdio>
 #include <cstring>

 class MyClass
 {
 public:
    int num;
    char *string;

    MyClass(int const n)
    {
       num = n;
       string = new char[128];
       sprintf(string, "%d ", n);
    }

    // The buffer was allocated with new[], so it must be freed with delete[].
    ~MyClass() { delete[] string; }
 };

 void SetTheIP()
 {
    char buffer[2048];
    buffer[0] = 0;
    for (int a = 0; a < 128; ++a)
    {
       MyClass m(a);
       strcat(buffer, m.string);     // START HERE...
    }
 }                                   // JUMP TO HERE...

Normally the MyClass object is created and destroyed once for each run of the for loop. If you jump out of the loop using "Set Next Statement" the destructor for MyClass never runs, leaking memory. The same thing would happen if you jumped backwards, to the line that initializes the buffer variable. The MyClass object in scope won't be destroyed properly.

Luckily, you don't have to worry about the stack pointer as long as you do all your jumping around within one function. Local scopes are creations of the compiler; they don't actually have stack frames. That's a good thing, because setting the next statement to a completely different function is sure to cause havoc with the stack. If you want to skip the rest of the current function and keep it from executing, just right click on the last closing brace of the function and set the next statement to that point. The stack frame will be kept intact.

Assembly Level Debugging

Inevitably you'll get to debug through some assembly code. You won't have source code or even symbols for every component of your application, so you should understand a little about the assembly window. Here's the assembly for the SetTheIP function we just talked about. Let's look at the Debug version of this code:

 void SetTheIP()
 {
 00411A10 55                        push        ebp
 00411A11 8B EC                     mov         ebp,esp
 00411A13 81 EC E8 08 00 00         sub         esp,8E8h
 00411A19 53                        push        ebx
 00411A1A 56                        push        esi
 00411A1B 57                        push        edi
 00411A1C 8D BD 18 F7 FF FF         lea         edi,[ebp-8E8h]
 00411A22 B9 3A 02 00 00            mov         ecx,23Ah
 00411A27 B8 CC CC CC CC            mov         eax,0CCCCCCCCh
 00411A2C F3 AB                     rep stos    dword ptr [edi]
    char buffer[2048];
    buffer[0] = 0;
 00411A2E C6 85 F8 F7 FF FF 00      mov         byte ptr [buffer],0
    for (int a=0; a<128; ++a)
 00411A35 C7 85 EC F7 FF FF 00 00 00 00 mov     dword ptr [a],0
 00411A3F EB 0F                     jmp         SetTheIP+40h (411A50h)
 00411A41 8B 85 EC F7 FF FF         mov         eax,dword ptr [a]
 00411A47 83 C0 01                  add         eax,1
 00411A4A 89 85 EC F7 FF FF         mov         dword ptr [a],eax
 00411A50 81 BD EC F7 FF FF 80 00 00 00 cmp     dword ptr [a],80h
 00411A5A 7D 35                     jge         SetTheIP+81h (411A91h)
    {
      MyClass m(a);
 00411A5C 8B 85 EC F7 FF FF         mov         eax,dword ptr [a]
 00411A62 50                        push        eax
 00411A63 8D 8D DC F7 FF FF         lea         ecx,[m]
 00411A69 E8 9C FA FF FF            call        MyClass::MyClass (41150Ah)
      strcat(buffer, m.string);
 00411A6E 8B 85 E0 F7 FF FF         mov         eax,dword ptr [ebp-820h]
 00411A74 50                        push        eax
 00411A75 8D 8D F8 F7 FF FF         lea         ecx,[buffer]
 00411A7B 51                        push        ecx
 00411A7C E8 46 F7 FF FF            call        @ILT+450(_strcat) (4111C7h)
 00411A81 83 C4 08                  add         esp,8
    }
 00411A84 8D 8D DC F7 FF FF         lea         ecx,[m]
 00411A8A E8 76 FA FF FF            call        MyClass::~MyClass (411505h)
 00411A8F EB B0                     jmp         SetTheIP+31h (411A41h)
 }
 00411A91 52                        push        edx
 00411A92 8B CD                     mov         ecx,ebp
 00411A94 50                        push        eax
 00411A95 8D 15 B6 1A 41 00         lea         edx,[ (411AB6h)]
 00411A9B E8 FA F6 FF FF            call        @ILT+405(@_RTC_CheckStackVars@8) (41119Ah)
 00411AA0 58                        pop         eax
 00411AA1 5A                        pop         edx
 00411AA2 5F                        pop         edi
 00411AA3 5E                        pop         esi
 00411AA4 5B                        pop         ebx
 00411AA5 81 C4 E8 08 00 00         add         esp,8E8h
 00411AAB 3B EC                     cmp         ebp,esp
 00411AAD E8 F0 F8 FF FF            call        @ILT+925(__RTC_CheckEsp) (4113A2h)
 00411AB2 8B E5                     mov         esp,ebp
 00411AB4 5D                        pop         ebp
 00411AB5 C3                        ret

One thing you realize right off is that the disassembly window can be a big help in beginning to understand what assembly language is all about. I wish I had more time to go over each statement, addressing modes, and whatnot but there are better resources for that anyway.

Notice first the structure of the disassembly window. The column of numbers on the left-hand side of the window is the memory address of each instruction. The list of one to ten hexadecimal codes that follows each address is the machine code bytes. Notice that the address of each line advances by the number of machine code bytes in the line before it. The more readable instruction on the far right is the assembler statement. Each group of assembler statements is preceded by the C++ statement that they execute, if the source is available. You can see that even a close brace can have assembly instructions, usually to return to the calling function or to destroy a C++ object.

The first lines of assembly, pushing various things onto the stack and messing with EBP and ESP, establish a local stack frame. The value 8E8h is the size of the stack frame, which is 2280 bytes.
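The numbers in the listing can be checked with a little arithmetic. The sketch below is an illustration only; NextInstruction() and ShortJumpTarget() are hypothetical helper names. It relies on one fact about x86: a short jump is two bytes long, and its second byte is a signed displacement measured from the address of the following instruction.

```cpp
#include <cassert>
#include <cstdint>

// Address of the instruction that follows one that is byteCount bytes long.
constexpr std::uint32_t NextInstruction(std::uint32_t address, std::uint32_t byteCount)
{
    return address + byteCount;
}

// Target of a two-byte short jump whose second byte is rel8.
constexpr std::uint32_t ShortJumpTarget(std::uint32_t address, std::uint8_t rel8)
{
    // Sign-extend the displacement: 80h..FFh mean -128..-1.
    const std::int32_t displacement = (rel8 >= 0x80) ? rel8 - 0x100 : rel8;
    return NextInstruction(address, 2) + static_cast<std::uint32_t>(displacement);
}

// The stack frame: 8E8h bytes, filled by "rep stos" as 23Ah dwords of 4 bytes.
static_assert(0x8E8 == 2280, "frame size in decimal");
static_assert(0x23A * 4 == 0x8E8, "fill count covers the whole frame");

// Each address is the previous address plus the previous byte count.
static_assert(NextInstruction(0x00411A10, 1) == 0x00411A11, "push ebp is 1 byte");
static_assert(NextInstruction(0x00411A13, 6) == 0x00411A19, "sub esp,8E8h is 6 bytes");

// The three jumps that form the for loop.
static_assert(ShortJumpTarget(0x00411A3F, 0x0F) == 0x00411A50, "skip the increment");
static_assert(ShortJumpTarget(0x00411A5A, 0x35) == 0x00411A91, "leave the loop");
static_assert(ShortJumpTarget(0x00411A8F, 0xB0) == 0x00411A41, "back to the increment");
```

The last check is the interesting one: B0h is a negative displacement of 50h, which is why the jump at the bottom of the loop lands back on the increment near the top.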

Check out the assembly code for the for loop. The beginning of the loop has seven lines of assembly code. The first two initialize the loop variable and jump over the lines that increment the loop variable. Skip over the guts of the loop for now and check out the last three assembly lines. They call the destructor for the MyClass object and skip back to the beginning part of the loop that increments the loop variable and performs the exit comparison. If you've ever wondered why the debugger always skips back to the beginning of for loops when the exit condition is met, there's your answer. The exit comparison happens at the beginning.

The inside of the loop has two C++ statements: one to construct the MyClass object and another to call strcat. Notice the assembly code that makes these calls work. In both cases values are pushed onto the stack by the calling routine. The values are pushed from right to left, that is to say that the last variable in a function call is pushed first. What this means for you is that you should be mindful of setting the next statement. If you want to skip a call, make sure you skip any assembly statements that push values onto the stack, or your program will lose its mind.

One last thing: Look at all the code that follows the closing brace of SetTheIP(). There are two calls here, shown in the listing as _RTC_CheckStackVars and _RTC_CheckEsp. What the heck are those things? These are two functions inserted into the exit code of every function in Debug builds that perform sanity checks on the integrity of the stack. You can perform a little experiment to see how these things work. Put a breakpoint on the very first line of SetTheIP(), skip over all the stack frame homework and set the next statement to the one where the buffer gets initialized. The program will run fine until the sanity check code runs. You'll get a dialog box telling you your stack has been corrupted.

It's nice to know that this check will keep you from chasing ghosts. If you mess up a debug experiment where you set the next statement across statements important to maintaining a good stack frame, these sanity checks will catch the problem.

Peppering the Code

If you have an elusive bug that corrupts a data structure or even the memory system, you can hunt it down with a check routine. This assumes that the corruption is somewhat deterministic, and you can write a bit of code to see if it exists. Write this function and begin placing this code in strategic points throughout your game.

A good place to start this check is in your main loop, and at the top and bottom of major components like your resource cache, draw code, AI, or sound manager. Place the check at the top and bottom to ensure that you can pinpoint a body of code that caused the corruption. If a check succeeds before a body of code and fails after it, you can begin to drill down into the system, placing more checks, until you nail the exact source of the problem. Here's an example:

 void BigBuggySubsystem()
 {
    BuggyObject crasher;
    CheckForTheBug("Enter BigBuggySubSystem.");
    DoSomething();
    CheckForTheBug("Calling DoSomethingElse");
    DoSomethingElse();
    CheckForTheBug("Calling CorruptEverything");
    CorruptEverything();
    CheckForTheBug("Leave BigBuggySubSystem");
 }

In this example, CheckForTheBug() is a bit of code that will detect the corruption, and the other function calls are subsystems of the BigBuggySubsystem. It's a good idea to put a text string in your checking code, so that it's quick and easy to identify the corruption location, even if the stack is trashed.

Since there's plenty of C++ code that runs as a result of exiting a local scope, don't fret if your checking function finds a corruption on entry. You can target your search inside the destructors of any C++ objects used inside the previous scope. If the destructor for the BuggyObject code was wreaking some havoc it won't be caught by your last call to your checking function. You wouldn't notice it until some other function called your checking code.
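What CheckForTheBug() looks like depends entirely on the corruption you are hunting. As one hedged possibility, the sketch below guards a structure with sentinel values and remembers the label of the first check that finds them damaged. GuardedBuffer, kSentinel, g_buffer, and g_firstFailure are all hypothetical names invented for this illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Arbitrary marker value; any unlikely bit pattern will do.
constexpr std::uint32_t kSentinel = 0xFDFDFDFDu;

// Hypothetical structure under suspicion, fenced by a sentinel on each side.
struct GuardedBuffer
{
    std::uint32_t front = kSentinel;
    char data[16] = {};
    std::uint32_t back = kSentinel;
};

GuardedBuffer g_buffer;
std::string g_firstFailure;   // label of the first check that saw corruption

// Returns true while both sentinels are intact. The label identifies the
// call site, so the report is useful even if the stack is trashed.
bool CheckForTheBug(const char* label)
{
    assert(label != nullptr);
    const bool intact = g_buffer.front == kSentinel && g_buffer.back == kSentinel;
    if (!intact && g_firstFailure.empty())
        g_firstFailure = label;
    return intact;
}
```

The corruption happened somewhere between the last check that returned true and the one named in g_firstFailure. In a real game you would probably break into the debugger or write to the log at that point rather than just recording the label.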

Draw Debug Information

This might seem incredibly obvious, but since I forget it all the time myself I figure it deserves mentioning. If you are having trouble with graphics or physics related bugs it can help if you draw additional information on your screen such as wireframes, direction vectors, or coordinate axes. This is especially true for 3D games but any game can find draw helpers useful. Here are a few ideas:

Hot Areas: If you are having trouble with user interface code, you can draw rectangles around your controls and change their color when they go active. You'll be able to see why one control is getting activation when you didn't expect it.
Memory/Framerate: In debug versions of your game it can be very useful to draw current memory and framerate information every few seconds. Don't do it every frame because it will slow down your game too much.
Coordinate Axes: A classic problem with 3D games is that the artist will create 3D models in the wrong coordinate system. Draw some additional debug geometry that shows the positive X axis in red, the positive Y axis in green, and the positive Z axis in blue. You'll always know which way is up!
Wireframe: You can apply wireframe drawing to collision geometry to see if they match up properly. A classic problem in 3D games is when these geometries are out of sync, and drawing the collision geometry in wireframe can help you figure out what's going on.
Targets: If you have AI routines that select targets or destinations, it can be useful to draw them explicitly by using lines. Whether you are a 3D game or a 2D game, line drawing can give you information about where the targets are. Use color information to convey additional information such as friend or foe.
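The coordinate axes helper from the list above might be built like this. It is a renderer-neutral sketch: Vec3, Color, DebugLine, and MakeAxisLines() are hypothetical types and names, and handing the resulting lines to your own line-drawing routine is left out.

```cpp
#include <array>
#include <cassert>

struct Vec3  { float x, y, z; };
struct Color { unsigned char r, g, b; };

// One colored line segment for whatever line renderer the game has.
struct DebugLine
{
    Vec3 from;
    Vec3 to;
    Color color;
};

// Builds three lines from origin along +X (red), +Y (green), and +Z (blue).
std::array<DebugLine, 3> MakeAxisLines(const Vec3& origin, float length)
{
    assert(length > 0.0f);

    std::array<DebugLine, 3> lines{};
    lines[0] = DebugLine{ origin, Vec3{ origin.x + length, origin.y, origin.z }, Color{ 255, 0, 0 } };
    lines[1] = DebugLine{ origin, Vec3{ origin.x, origin.y + length, origin.z }, Color{ 0, 255, 0 } };
    lines[2] = DebugLine{ origin, Vec3{ origin.x, origin.y, origin.z + length }, Color{ 0, 0, 255 } };
    return lines;
}
```

Calling MakeAxisLines() with a model's world position and drawing the result every frame makes a wrong coordinate system visible at a glance.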

Best Practice

In 3D games, it's a good idea to construct a special test object that is asymmetrical on all three coordinate axes. Your game renderer and physics system can easily display things like cubes in a completely wrong way, but they will look right because a cube looks the same from many different angles. A good example of an asymmetrical object is a shoe, since there's no way you can slice it and get a mirror image from one side to another. In your 3D game, build something with similar properties, but make sure the shape is so asymmetrical that it will be obvious if any errors pop up.

Lint and Other Code Analyzers

These tools can be incredibly useful. Their best application is one where code is being checked often, perhaps each night. Dangerous bits of code are fixed as they are found, and don't get the chance to exist in the system for any length of time. If you don't have Lint, make sure you ramp up the warning level of the compiler as high as you can stand it. It will be able to make quite a few checks for you and catch problems as they happen.

A less useful approach involves using code analysis late in your project in the hopes it will pinpoint a bug. You'll probably be inundated with warnings and errors, any of which could be perfectly benign for your game. The reason this isn't as useful at the end of your project is that you may have to make sweeping changes to your code to address every issue. This is not wise. It is much more likely that sweeping changes will create a vast set of additional issues, the aggregate of which could be worse than the original problem. It's best to perform these checks often and throughout the life of your project.

BoundsChecker and Run Time Analyzers

BoundsChecker is a great program, and every team should have at least one copy. In some configurations it can run so slowly that your game will take three hours to display a single screen. Rather than turning on every check, use a targeted approach: filter out as many checks as you can and leave only the checks that will trap your problem.

Disappearing Bugs

The really nasty bugs seem to actually possess intelligence, an awareness of themselves and of your attempts to destroy them. Just as you get close, the bug changes and it can't be reproduced using your previously observed steps. It's likely that recent changes such as adding checking code have altered the memory map of your process. The bug might be corrupting memory that is simply unused. This is where your notes will really come in handy. It's time to backtrack, remove your recent changes one at a time, and repeat until the bug reappears. Begin again, but try a different approach in the hopes you can get closer.

Best Practice

Another version of the disappearing bug is one where a known failure simply disappears without any programmer actually addressing it. The bug might have been related to another issue that someone fixed—you hope. The safest thing to do is to analyze recent changes and attempt to perform an autopsy of sorts. Given the recent fixes, you might even be able to recreate the original conditions and code that made the bug happen, apply the fix again, and prove beyond a shadow of a doubt that a particular fix addressed more than one bug.

What's more likely is that the number of changes to the code will preclude the possibility of this examination, especially on a large team. Then you have a decision to make: Is the bug severe enough to justify a targeted search through all the changes to prove the bug is truly fixed? It depends on the seriousness of the bug.

Tweaking Values

A classic problem in programming is getting a constant value "just right." This is usually the case for things like the placement of a user interface object like a button or perhaps the velocity value of a particle stream. While you are experimenting with the value, put it in a static variable in your code:

 void MyWierdFountain::Update()
 {
    static float __dbgVelocity = 2.74f;
    SetParticleVelocity(__dbgVelocity);
    // More code would follow…
 }

It then becomes a trivial thing to set a breakpoint on the call to SetParticleVelocity() to let you play with the exact velocity value in real time. This is much faster than recompiling, and even faster than making the value data driven, since you won't even have to reload the game data. Once you find the values you're looking for, you can take the time to put them in a data file.

Caveman Debugging

If you can't use a debugger, or don't even know they exist as I did in college, you get to do something I call caveman debugging. You might be curious as to why you wouldn't be able to use a debugger, and it's not because you work for someone so cheap that they won't buy one. Sometimes you'll see problems only in the release build of the application. These problems usually result from uninitialized variables or from unexpected or even incorrect code generation. The problem simply goes away in the debug version. You might also be debugging a server application that fails intermittently, perhaps after hours of running nominally. It's useless to sit in a debugger waiting for a failure like that.

Best Practice

Make good use of stderr if you program in Unix or OutputDebugString if you program under Windows. These are your first and best tools for caveman debugging.

In both cases, you should resort to the caveman method: You'll write extra code to display variables or other important information on the screen, in the output window, or in a permanent log file. As the code runs, you'll watch the output for signs of misbehavior, or you'll pore over the log file in the hope that you can discern the nature of the bug. This is a slow process and takes a great deal of patience, but if you can't use a debugger this method will work.
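Here is a minimal sketch of what that extra code might look like. It is only an illustration, not code from any shipped game: the names CavemanLog, CAVEMAN_LOG, and caveman.log are hypothetical, and the 512-character message limit is an arbitrary assumption. The helper formats a message once and sends it to stderr, to OutputDebugStringA when built for Windows, and to a log file that is reopened on every call so the last line has a good chance of surviving a crash.

```cpp
#include <cassert>
#include <cstdarg>
#include <cstdio>
#include <string>

#ifdef _WIN32
#include <windows.h>   // OutputDebugStringA
#endif

// Hypothetical helper (an assumption for illustration): formats a message and
// sends it to every caveman-debugging channel available on this platform.
// Returns the formatted line so the caller can inspect or reuse it.
std::string CavemanLog(const char* file, int line, const char* format, ...)
{
    assert(format != nullptr);

    // Build the user's message from the printf-style arguments.
    char message[512];
    va_list args;
    va_start(args, format);
    std::vsnprintf(message, sizeof(message), format, args);
    va_end(args);

    // Prefix with the source location so each line can be traced to its origin.
    char fullLine[640];
    std::snprintf(fullLine, sizeof(fullLine), "%s(%d): %s\n", file, line, message);

    // Channel 1: stderr, visible in a console or redirected by the shell.
    std::fputs(fullLine, stderr);

#ifdef _WIN32
    // Channel 2: the debugger output window on Windows.
    OutputDebugStringA(fullLine);
#endif

    // Channel 3: a permanent log file. Opening and closing it on every call is
    // slow, but it avoids losing buffered text if the program crashes.
    if (std::FILE* log = std::fopen("caveman.log", "a")) {
        std::fputs(fullLine, log);
        std::fclose(log);
    }

    return fullLine;
}

// Convenience macro that fills in the source file and line automatically.
#define CAVEMAN_LOG(...) CavemanLog(__FILE__, __LINE__, __VA_ARGS__)
```

You would sprinkle calls such as CAVEMAN_LOG("velocity=%.2f", velocity) around the suspect code and then read the output after the failure. The open-and-close on every call is a deliberate trade of speed for safety; in a hot loop you would probably want to log less often or keep the file open and flush it yourself.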

A Tale from the Pixel Mines

When I was on Ultima Online, one of my tasks was to write the UO login servers. These servers were the main point of connection for the Linux game servers and the SQL server, so login was only a small portion of what the software actually did. An array of statistical information flowed from the game servers, was collated in the login server, and was written to the SQL database. The EA executives liked pretty charts and graphs, and we gave them what they wanted. Anyway, the login process was a Win32 console application, and to help me understand what was going on I printed debug messages for logins, statistics data, and anything else that looked reasonable. When the login servers were running, these messages scrolled by so fast I certainly couldn't read them, but I could feel them. Imagine me sitting in the UO server room, staring blankly at three login server screens. I could tell just by the shape of the text flowing by whether or not a large number of logins were failing or a UO server was disconnected. It was very weird.

The best caveman debugging solution I ever saw was one that used the PC speaker. Herman was a programmer who worked on Ultima V through Ultima IX, and one of his talents was perfect pitch. He could tell you the difference between a B and a B flat and get it right every time. He used this to his advantage when he was searching for the nastiest crasher bugs of them all, the ones that didn't even allow the debugger window to pop up. He wrote a special checker program that output specific tones through the PC speaker, and peppered the code with these checks. If you walked into his office while his spiced-up version of the game was running, it sounded a little like raw modem noise, until the game crashed. Because the PC speaker wasn't dependent on the CPU, it would keep emitting the tone of his last check. "Hrm… that's a D," he would say, and zero in on the line of code that caused the crash.

When All Else Fails

So you've tried everything, and hours later you are no closer to solving the problem than when you started. Your boss is probably making excuses to pass by your office and ask you cheerily, "How's it going?" You suppress the urge to jump up and make an example of his annoying behavior, but you still have no idea what to do. Here are a few last-resort ideas.

First, go find another programmer and explain your problem. It doesn't really matter if you can find John Carmack or the greenest guy in your group, just find someone. Walk them through each step, explaining the behavior of the bug and each hypothesis you had, including the ones that failed. Talk about your debugging experiments and step through the last one with them watching over your shoulder. For some odd reason, you'll often find the solution to your problem without them ever speaking a single word. It will just come as if it were handed to you by the Universe itself. I've never been able to explain that phenomenon, but it's real. This will solve half of the unsolvable bugs.

Another solution is static code analysis. You should have enough observations to guess at what is going on; you just can't figure out how the pieces of the puzzle fit together. Print out a suspect section of code on paper (the flat stuff you find in copy machines) and take it away from your desk. Study it and ask yourself how the code could fail. Getting away from your computer and the debugger helps to open your mind a bit, and removes your dependency on them.

If you get to this point and you still haven't solved the problem, you've probably been at it for a few solid hours if not all night. It's time to walk away—not from the problem, but from your computer. Just leave. Do something to get your mind off the problem. Drive home. Eat dinner. Introduce yourself to your family. Take a shower.

The last one is particularly useful for me, not that I need any of you to visualize me in the shower. The combination of being away from the office and in a relaxing environment frees a portion of my mind to continue working on the problem without adding to my stress level. Sometimes a new approach to the problem, or even better, a solution, will simply deposit itself in my consciousness. That odd event has never happened to me when I'm under pressure sitting at the computer. It's scary when you're at dinner and it dawns on you that you've solved a bug just by getting away from it.