Different Kinds of Bugs | Game Coding Complete

Tactics and technique are great, but they only describe debugging in the most generic sense. Everyone should build a taxonomy of bugs, a dictionary of bugs as it were, so that you can instantly recognize a type of bug and associate it with the beginning steps of a solution. One way to do this is to constantly trade "bug" stories with other programmers, a conversation that will bore non-programmers completely to death.

Memory Leaks and Heap Corruption

A memory leak is caused when a dynamically allocated memory block is "lost." The pointer that holds the address of the block is reassigned without freeing the block, and it will remain allocated until the application exits. This kind of bug is especially problematic if this happens frequently. The program will chew up physical and virtual memory over time, and eventually fail. Here's a classic example of a memory leak. This class allocates a block of memory in a constructor, but fails to declare a virtual destructor:

 class LeakyMemory : public SomeBaseClass
 {
 protected:
    int *leaked;

 public:
    LeakyMemory() { leaked = new int[128]; }
    ~LeakyMemory() { delete [] leaked; }   // array form matches new int[128]
 };

This code might look fine but there's a potential memory leak in there. If this class is instantiated, and is referenced by a pointer to SomeBaseClass, the destructor will never get called:

 int main()
 {
    LeakyMemory *ok = new LeakyMemory;
    SomeBaseClass *bad = new LeakyMemory;

    delete ok;
    delete bad;                      // MEMORY LEAK RIGHT HERE!
    return 0;
 }

You fix this problem by declaring the destructor in the base class as virtual. Memory leaks are easy to fix if the leaky code is staring you in the face. This isn't always the case. A few bytes leaked here and there as game objects are created and destroyed can go unnoticed for a long time until it is obvious that your game is chewing up memory without any valid reason.
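Here's a minimal sketch of the fix. The class names and the destructor counter are hypothetical, added only so you can see that deleting through a base class pointer now reaches the derived destructor:

```cpp
#include <cassert>

// Hypothetical names for illustration; they are not from the example above.
static int g_derivedDestructorCalls = 0;

class FixedBase {
public:
    // virtual: a delete through a FixedBase pointer reaches the derived class
    virtual ~FixedBase() {}
};

class NoLongerLeaky : public FixedBase {
public:
    NoLongerLeaky() : m_block(new int[128]) {}
    ~NoLongerLeaky() override {
        delete [] m_block;              // array form matches new int[128]
        ++g_derivedDestructorCalls;     // bookkeeping for the demonstration only
    }

private:
    int *m_block;
};

// Deletes a derived object through a base pointer and returns how many
// derived destructors ran as a result.
int DeleteThroughBasePointer() {
    g_derivedDestructorCalls = 0;
    FixedBase *p = new NoLongerLeaky;
    delete p;
    return g_derivedDestructorCalls;
}
```

If you remove the virtual keyword from the base class, the derived destructor is no longer guaranteed to run, which is exactly the leak described above.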

Memory bugs and leaks are amazingly easy to fix, but tricky to find, if you use a memory allocator that doesn't have special code to give you a hand. Under Win32, the C runtime library lends a hand under the debug builds with the debug heap. The debug heap sets the value of uninitialized memory and freed memory.

Uninitialized memory allocated on the heap is set to 0xCDCDCDCD.
Uninitialized memory allocated on the stack is set to 0xCCCCCCCC. This is dependent on the /GZ compiler option in Microsoft Visual C++.
Freed heap memory is set to 0xFEEEFEEE, before it has been reallocated. Sometimes this freed memory is set to 0xDDDDDDDD, depending on how the memory was freed.
The guard bytes immediately before and after any memory allocated on the heap are set to 0xFDFDFDFD.

Win32 programmers commit these values to memory. They'll come in handy when you are viewing memory windows in the debugger. You can tell what has happened to a block of memory.

The C-Runtime debug heap also provides many functions to help you examine the heap for problems. I'll tell you about three of them, and you can hunt for the rest in the Visual Studio help files or MSDN:

_CrtSetDbgFlag( int newFlag ): Sets the behavior of the debug heap.
_CrtCheckMemory( void ): Runs a check on the debug heap.
_CrtDumpMemoryLeaks( void ): Reports any leaks to the debug output, which by default is the debugger's output window.

Here's an example of how to put these functions into practice:

 #include <crtdbg.h>
 #include <string.h>

 #if defined _DEBUG
    #define new new(_NORMAL_BLOCK,__FILE__, __LINE__)
 #endif

 int main()
 {
    // get the current flags
    int tmpDbgFlag = _CrtSetDbgFlag(_CRTDBG_REPORT_FLAG);

    // don't actually free the blocks
    tmpDbgFlag |= _CRTDBG_DELAY_FREE_MEM_DF;

    // perform memory check for each alloc/dealloc
    tmpDbgFlag |= _CRTDBG_CHECK_ALWAYS_DF;

    _CrtSetDbgFlag(tmpDbgFlag);

    char *gonnaTrash = new char[15];
    _CrtCheckMemory();                      // everything is fine....

    strcpy(gonnaTrash, "Trash my memory!"); // overwrite the buffer
    _CrtCheckMemory();                      // everything is NOT fine!

    delete [] gonnaTrash;                   // This brings up a dialog box too...

    char *gonnaLeak = new char[100];        // Prepare to leak!

    _CrtDumpMemoryLeaks();                  // Reports leaks to the debug output
    return 0;
 }

Notice that the new operator is redefined. A debug version of new is included in the debug heap that records the file and line number of each allocation. This can go a long way toward helping you detect the cause of a leak.

The first few lines set the behavior of the debug heap. The first flag tells the debug heap to keep deallocated blocks around in a special list instead of recycling them back into the useable memory pool. You might use this flag to help you track a memory corruption or simply alter your process's memory space in the hopes a tricky bug will be easier to catch. The second flag tells the debug heap that you want to run a complete check on the debug heap's integrity each time memory is allocated or freed. This can be incredibly slow, so turn it on and off only when you are sure it will do you some good. The output of the memory leak dump looks like this:

 Detected memory leaks!
 Dumping objects ->
 c:\tricks\tricks.cpp(78) : {42} normal block at 0x00321100, 100 bytes long.
  Data: <                > CD CD CD CD CD CD CD CD CD CD CD CD CD CD CD CD
 Object dump complete.
 The program '[2940] Tricks.exe: Native' has exited with code 0 (0x0).

As you can see the leak dump pinpoints the exact file and line of the leaked bits. What happens if you have a core system that allocates memory like crazy, such as a custom string class? Every leaked block of memory will look like it's coming from the same line of code, because it is. It doesn't tell you anything about who called it, which is the real perpetrator of the leak. If this is happening to you, tweak the redeclaration of new, and store a self-incrementing counter instead of __LINE__:

 #include <crtdbg.h>

 #if defined _DEBUG
    static int counter = 0;
    #define new new(_NORMAL_BLOCK,__FILE__, counter++)
 #endif

The memory dump report will tell you exactly when the leaky bits were allocated, and you can track it down easily.

Gotcha

You can't look at the Task Manager under Windows to determine if your game is leaking memory. The Task Manager is the process window you can show if you hit Ctrl-Alt-Del, and click the Task Manager button. This window lies. For one thing, memory might be reported wrong if you have set the _CRTDBG_DELAY_FREE_MEM_DF flag. Even if you are running a release build, freed memory isn't reflected in the process window until the window is minimized and restored. This one stymied even the Microsoft test lab. They wrote a bug telling us that our game was leaking memory like crazy, and we couldn't find it. It turned out that if you minimize the application window and restore it, the Task Manager will report the memory correctly, at least for a little while.

If you happen to write your own memory manager, make sure you take the time to write some analogs to the C runtime debug heap functions. If you don't, you'll find chasing memory leaks and corruptions a full time job.
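As a rough sketch of what such analogs might look like, here's a tiny tracking allocator. Everything in it is hypothetical: the DebugHeap class, its method names, and the choice to borrow the 0xFD and 0xCD fill values from the C runtime are my assumptions for illustration, not a real library. It pads each block with guard bytes, checks them on demand, and counts blocks that were never freed:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <map>

// Assumed fill values and guard size, borrowed from the CRT debug heap.
const unsigned char kGuardByte = 0xFD;
const unsigned char kFreshByte = 0xCD;
const std::size_t kGuardSize = 4;

// Hypothetical debug allocator: a sketch, not production code.
class DebugHeap {
public:
    ~DebugHeap() {
        // Release anything the caller leaked so the sketch itself stays clean.
        for (BlockMap::iterator it = m_blocks.begin(); it != m_blocks.end(); ++it)
            std::free(static_cast<unsigned char *>(it->first) - kGuardSize);
    }

    void *Alloc(std::size_t size) {
        unsigned char *raw =
            static_cast<unsigned char *>(std::malloc(size + 2 * kGuardSize));
        if (!raw) return nullptr;
        std::memset(raw, kGuardByte, kGuardSize);                      // leading guard
        std::memset(raw + kGuardSize, kFreshByte, size);               // uninitialized marker
        std::memset(raw + kGuardSize + size, kGuardByte, kGuardSize);  // trailing guard
        void *user = raw + kGuardSize;
        m_blocks[user] = size;
        return user;
    }

    void Free(void *p) {
        BlockMap::iterator it = m_blocks.find(p);
        if (it == m_blocks.end()) return;   // unknown pointer: ignored in this sketch
        std::free(static_cast<unsigned char *>(p) - kGuardSize);
        m_blocks.erase(it);
    }

    // Analog of _CrtCheckMemory(): true if every guard byte is intact.
    bool CheckMemory() const {
        for (BlockMap::const_iterator it = m_blocks.begin(); it != m_blocks.end(); ++it) {
            const unsigned char *user = static_cast<const unsigned char *>(it->first);
            const unsigned char *front = user - kGuardSize;
            const unsigned char *back = user + it->second;
            for (std::size_t i = 0; i < kGuardSize; ++i)
                if (front[i] != kGuardByte || back[i] != kGuardByte) return false;
        }
        return true;
    }

    // Analog of _CrtDumpMemoryLeaks(): number of blocks still allocated.
    std::size_t LeakCount() const { return m_blocks.size(); }

private:
    typedef std::map<void *, std::size_t> BlockMap;
    BlockMap m_blocks;
};

// Overruns a 15-byte block by two bytes and reports whether the check noticed.
bool DemoDetectsOverrun() {
    DebugHeap heap;
    char *buf = static_cast<char *>(heap.Alloc(15));
    const bool cleanBefore = heap.CheckMemory();
    std::memcpy(buf, "Trash my memory!", 17);   // 16 chars + NUL spills into the guard
    const bool cleanAfter = heap.CheckMemory();
    heap.Free(buf);
    return cleanBefore && !cleanAfter;
}
```

A real memory manager would also want to record the file and line of each allocation, the way the redefined new operator does above, so the leak report can point at the offending code.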

Best Practice

Make sure your debug build detects and reports memory leaks, and convince every programmer that they should fix all memory leaks before they check in their code. It's a lot harder to fix someone else's memory leak than your own.

COM objects can leak memory too, and those leaks are also painful to find. If you fail to call Release() on a COM object when you're done with it, the object will remain allocated because its reference count will never drop to zero. Here's a neat trick: first put the following function somewhere in your code:

 int Refs(IUnknown *pUnk)
 {
    pUnk->AddRef();
    return pUnk->Release();
 }

You can then put Refs(myLeakingResourcePtr) in the watch window in your debugger. This will usually return the current reference count for a COM object. Be warned, however, that COM doesn't require that Release() return the current reference count, but it usually does.

Game Data Corruption

Most memory corruptions are easy to diagnose. Your game crashes and you find funky trash values where you were used to seeing valid data. The frustrating thing about memory corrupter bugs is that they can happen anywhere, anytime. Since the memory corruption is not trashing the heap, you can't use the debug heap functions, but you can use your own homegrown version of them. You need to write your own version of _CrtCheckMemory(), built especially for the data structures being vandalized. Hopefully, you'll have a reasonable set of steps you can use to reproduce the bug. Given those two things, the bug has only moments to live. If the trasher is intermittent, leave the data structure check code in the game. Perhaps someone will begin to notice a pattern of steps that can cause the corruption to occur.
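What a homegrown check looks like depends entirely on your data. As a sketch, assume a doubly linked list of game objects; the GameObject layout, the valid type range, and the object limit below are all hypothetical. The check walks the list and verifies every invariant it can:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical game object record; your real structure will differ.
struct GameObject {
    int type;
    GameObject *prev;
    GameObject *next;
};

const int kMaxObjectType = 255;          // assumed valid range for type
const std::size_t kMaxObjects = 10000;   // assumed upper bound; also catches cycles

// Returns true if the list starting at head passes every sanity check.
bool CheckObjectList(const GameObject *head) {
    std::size_t count = 0;
    const GameObject *expectedPrev = nullptr;
    for (const GameObject *obj = head; obj != nullptr; obj = obj->next) {
        if (++count > kMaxObjects) return false;         // runaway or circular list
        if (obj->prev != expectedPrev) return false;     // broken back link
        if (obj->type < 0 || obj->type > kMaxObjectType)
            return false;                                // trash value in the data
        expectedPrev = obj;
    }
    return true;
}
```

Call the check before and after each suspect system runs. The first call that fails tells you which system ran in between.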

A Tale from the Pixel Mines

I recall a truly excellent hack we encountered on Savage Empire, an Ultima VI spin-off that Origin shipped in late 1990. Origin was using Borland's 3.1 C Compiler, and the runtime module's exit code always checked memory location zero to see if a wayward piece of code accidentally overwrote that piece of memory, which was actually unused. If it detected the memory location was altered, it would print out "Error: (null) pointer assignment" at the top of the screen. Null pointer assignments were tough to find in those days since the CPU just happily assumed you knew what you were doing. Savage Empire programmers tried in vain to hunt down the null pointer assignment until the very last day of development. Origin's QA had signed off on the build, and Origin Execs wanted to ship the product, since Christmas was right around the corner. Steve, one of the programmers, "fixed" the problem with an amazing hack. He hex edited the executable, savage.exe, and changed the text string "Error: (null) pointer assignment." to another string exactly the same length: "Thanks for playing Savage Empire."

If the memory corruption seems random—writing to memory locations here and there without any pattern—here's a useful but brute force trick: Declare an enormous block of memory and initialize it with an unusual pattern of bytes. Write a check routine that runs through the memory block and finds any bytes that don't match the original pattern, and you've got something that can detect your bug. I've been using this trick since Ultima VII.
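Here's one way that trick might be sketched. The CanaryBlock class and its position-dependent byte pattern are my own illustrative choices; any unusual pattern will do:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: a big block filled with a known pattern that a
// stray write will disturb.
class CanaryBlock {
public:
    explicit CanaryBlock(std::size_t size) : m_bytes(size) { Reset(); }

    // Refill the block with the expected pattern.
    void Reset() {
        for (std::size_t i = 0; i < m_bytes.size(); ++i)
            m_bytes[i] = PatternAt(i);
    }

    // Returns the index of the first byte that no longer matches the
    // pattern, or Size() if the block is intact.
    std::size_t FirstCorruptByte() const {
        for (std::size_t i = 0; i < m_bytes.size(); ++i)
            if (m_bytes[i] != PatternAt(i)) return i;
        return m_bytes.size();
    }

    unsigned char *Data() { return m_bytes.data(); }
    std::size_t Size() const { return m_bytes.size(); }

private:
    // Position-dependent pattern, so a stray fill with one constant value
    // is still likely to be noticed.
    static unsigned char PatternAt(std::size_t i) {
        return static_cast<unsigned char>(0xA5 ^ (i * 31));
    }

    std::vector<unsigned char> m_bytes;
};
```

Run FirstCorruptByte() once per frame, or more often if you can afford it. The offset of the damaged byte and the value written there are both clues about who did the writing.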

A Tale from the Pixel Mines

Ultima games classically stored their game data in large blocks of memory, and the data was organized as a linked list. If the object lists became corrupted, all manner of mayhem would result. If you ever played Savage Empire, you might have been one of the lucky people to see a triceratops walking across the opening screen, in two pieces.

Another example of this object corruption was a bug I saw in Martian Dreams. As I was walking my character across the alien landscape, all the plants turned into pocket watches and my character turned into a pair of boots. If I hadn't seen it with my own eyes I wouldn't have believed it. The worst of these bugs became something of a legend at Origin Systems: "The Barge Bug." The Ultima VI team found that the linked object lists could be used to create barges, a generic term for a bunch of linked objects that could move about the map as a group. This led to neat stuff like flying carpets, boats, and the barges of Martian Dreams that navigated the canals.

QA was observing a bug that made barges explode. The objects and their passengers would suddenly shatter into pieces, and if you attempted to move them one step in any direction the game would crash. I was assigned the task of fixing this bug. I tried again and again, and each time I was completely sure that the barge bug was dead. QA didn't share my optimism, and for four versions of the game I would see the bug report come back: "Not fixed." The fourth time I saw the bug report my exhausted mind simply snapped. I don't need to tell you what happened, because an artist friend of mine, Denis, drew the picture of me shown in Figure 12.3.

Figure 12.3: Artist's Rendering of Earwax Blowing out of Mr. Mike's Ears.

Stack Corruption

Stack corruption is evil because it wipes evidence from the scene of the crime. Take a look at this lovely code:

 void StackTrasher()
 {
    char hello[10];
    memset(hello, 0, 1000);
 }

The call to memset never returns, since it wipes the stack clean, including the return address. The most likely thing your computer will do is break into some crazy, codeless, area—the debugger equivalent of shrugging its shoulders and leaving you to figure it out for yourself. Stack corruptions almost always happen as a result of sending bad data into an otherwise trusted function, like memset. Again, you must have a reasonable set of steps you can follow to reproduce the error.

Begin your search by eliminating subsections of code, if you can. Set a breakpoint at the highest level of code in your main loop, and step over each function call. Eventually you should be able to find a case where stepping over a function call will cause the crash. Begin your experiment again, only this time step into the function and narrow the list of perpetrators. Repeat these steps until you've found the call that causes the crash.

Watch the call stack window carefully with each step. The moment the stack is trashed the debugger will be unable to display the call stack. It is unlikely that you'll be able to continue or even set the next statement to a previous line for retesting, so if you missed the cause of the problem you'll have to begin again. If the call that causes the stack to go south is something trusted like memset, study each input parameter carefully. Your answer is there: one of those parameters is bogus.
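One way to catch a bogus parameter before it wipes the stack is to wrap the trusted function in a checked helper. The CheckedFill template below is a hypothetical sketch: it only works when the destination is a true array, so the compiler can deduce the array's size and compare it against the requested count:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: refuses to fill more bytes than the destination
// array actually holds. N is deduced from the array type.
template <std::size_t N>
bool CheckedFill(char (&dest)[N], int value, std::size_t count) {
    if (count > N) {
        return false;   // bogus parameter caught before the stack is trashed
    }
    std::memset(dest, value, count);
    return true;
}
```

With this helper, the StackTrasher example becomes CheckedFill(hello, 0, 1000), which fails harmlessly instead of destroying the return address. In a debug build you would probably assert on the failure so the call stack is still intact when the debugger breaks.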

Cut and Paste Bugs

This kind of bug doesn't have a specific morphology, an SAT way of saying "pattern of behavior." It does have a common source, which is cutting and pasting code from one place to another. I know how it is; sometimes it's easier to cut and paste a little section of code rather than factor it out into a member of a class or utility function. I've done this myself many times to avoid a heinous recompile. I tell myself that I'll go back and factor the code later. Of course I never get around to it. The danger of cutting and pasting code is pretty severe.

First, the original code segment could have a bug that doesn't show up until much later. The programmer who finds the bug will likely perform a debugging experiment where a tentative fix is applied to the first block of code, but he misses the second one. The bug may still occur exactly as it did before, convincing our hero that he has failed to find the problem and sending him off on a completely different approach. Second, the cut and pasted code might be perfectly fine in the original location, but cause a subtle bug in the destination. You might have local variables stomping on each other or some such thing.

If you're like me at all, you feel a pang of guilt every time you hit Ctrl-V and you see more than two or three lines pop out of the clipboard. That guilt is there for a reason: Heed it and at least create a local free function while you get the logic straightened out. When you're done, you can refactor your work, make your change to game.h, and compile through the night.

Running out of Space

You should never run out of space. By space, I mean any consumable resource: memory, hard drive space, Windows handles, or memory blocks on a console's memory card. If you do run out of space, your game is either leaking these resources, or never had them to begin with.

We've already talked about the leaking problem, so let's talk about the other case. If your game needs certain resources to run properly, like a certain amount of hard drive space or memory blocks for save game files, then by all means check for the appropriate headroom when your game initializes. If any consumable is in short supply, you should bail right there or at least warn the player that they won't be able to save games.
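For hard drive space, a startup check might be sketched like this. It assumes a C++17 compiler with std::filesystem, which is much newer than the tools discussed in this chapter, and the function name HasDiskHeadroom is hypothetical:

```cpp
#include <cassert>
#include <cstdint>
#include <filesystem>
#include <system_error>

// Hypothetical startup check: true if the volume holding dir has at least
// requiredBytes available to this process.
bool HasDiskHeadroom(const std::filesystem::path &dir,
                     std::uintmax_t requiredBytes) {
    std::error_code ec;
    const std::filesystem::space_info info = std::filesystem::space(dir, ec);
    if (ec) {
        return false;   // can't query the volume: treat it as no headroom
    }
    return info.available >= requiredBytes;
}
```

Call it once during initialization with your save game directory and the worst-case size of your save files. If it fails, bail out or warn the player before they've invested two hours in a game they can't save.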

A Tale from the Pixel Mines

In the final days of Ultima VIII, it took nine floppy disks to hold all of the install files. Origin execs had a hard limit on eight floppy disks and we had to find some way of compressing what we had into one less disk. It made sense to concentrate on the largest file, SHAPES.FLX, which held all of the graphics for the game.

Zack, one of Origin's best programmers, came up with a great idea. The SHAPES.FLX file essentially held filmstrip animations for all the characters in Ultima VIII, and each frame was only slightly different from the previous frame. Before the install program compressed SHAPES.FLX, Zack wrote a program to delta compress all of the animations. Each frame stored only the pixels that changed from the previous frame, and the blank space left over was run length encoded. The whole shebang was compressed with a general compression algorithm for the install program. It didn't make installation any faster, that's for sure, but Zack saved Origin a few tens of thousands of dollars with a little less than a single night of hard core programming.
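I don't have Zack's code, so the following is only a sketch of the general idea, with assumptions of my own: frames are flat byte arrays of equal size, the delta is a per-pixel XOR so that unchanged pixels become zero, and each run of zeros is stored as a zero byte followed by a count of up to 255. All of the function names are hypothetical:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

typedef std::vector<std::uint8_t> Bytes;

// XOR each pixel with the previous frame, so unchanged pixels become zero.
// Precondition: both frames are the same size.
Bytes DeltaFrame(const Bytes &prev, const Bytes &cur) {
    assert(prev.size() == cur.size());
    Bytes delta(cur.size());
    for (std::size_t i = 0; i < cur.size(); ++i)
        delta[i] = static_cast<std::uint8_t>(prev[i] ^ cur[i]);
    return delta;
}

// XOR is its own inverse, so applying a delta is the same operation.
Bytes ApplyDelta(const Bytes &prev, const Bytes &delta) {
    return DeltaFrame(prev, delta);
}

// Run length encode zeros only: nonzero bytes are copied through, and each
// run of zeros becomes the pair (0, runLength) with runLength in 1..255.
Bytes RleZeros(const Bytes &in) {
    Bytes out;
    for (std::size_t i = 0; i < in.size();) {
        if (in[i] != 0) {
            out.push_back(in[i++]);
            continue;
        }
        std::size_t run = 0;
        while (i + run < in.size() && in[i + run] == 0 && run < 255) ++run;
        out.push_back(0);
        out.push_back(static_cast<std::uint8_t>(run));
        i += run;
    }
    return out;
}

// Inverse of RleZeros.
Bytes UnRleZeros(const Bytes &in) {
    Bytes out;
    for (std::size_t i = 0; i < in.size();) {
        if (in[i] != 0) {
            out.push_back(in[i++]);
        } else {
            if (i + 1 >= in.size()) break;   // truncated input: stop
            out.insert(out.end(), static_cast<std::size_t>(in[i + 1]),
                       std::uint8_t(0));
            i += 2;
        }
    }
    return out;
}
```

For a 64-pixel frame where a single pixel changes, the encoded delta shrinks to five bytes: two for the leading run of zeros, one for the changed pixel, and two for the trailing run.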

Release Mode Only Bugs

If you ever have a bug in the release build that doesn't happen in the debug build, most likely you have an uninitialized variable somewhere. The best way to find this type of bug is to use a run time analyzer like BoundsChecker.

Another source of this problem can be a compiler problem, in that certain optimization settings or other project settings are causing bugs. If you suspect this, one possibility is to start changing the project settings one by one to look more like the debug build, until the bug disappears. Once you have the exact setting that causes the bug, you may get some intuition about where to look next.

Multithreading Gone Bad

Multithreaded bugs are really nasty because they can be nigh impossible to reproduce accurately. Your first clue that you may have a multithreaded issue is a bug's unpredictable behavior. If you think you have a multithreaded bug on your hands, the first thing you should do is disable multithreading and try to reproduce the bug.

A good example of a classic multithreaded bug is a sound system crash. The sound system in most games runs in a separate thread, grabbing sound bits from the game every now and again as it needs them. It's at these communication points, where two threads need to synch up and communicate, that most multithreading bugs occur.

Sound systems like Miles from RAD Game Tools are extremely well tested. It's much more likely that a sound system crash is due to your game deallocating some sound memory before its time or perhaps simply trashing the sound buffer. In fact, this is so likely that my first course of action when I see a really strange, irreproducible bug is to turn off the sound system and see if I can get the problem to happen again.

The same is true for other multithreaded subsystems, such as AI or resource preloading. If your game uses multiple threads for these kinds of systems, make sure you can turn them off easily for testing. Sure, the game will run in a jerky fashion since all the processing has to be performed in a linear fashion, but the added benefit is that you can eliminate the logic of those systems and focus on the communication and thread synchronization as the source of the problem.
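A switch like that can be as simple as one flag. The BackgroundWork class below is a hypothetical sketch using C++11 std::thread, which postdates the tools discussed here: when the flag is off, each job runs inline on the calling thread, so the same game logic executes with no concurrency at all:

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical job runner with a single switch for turning threading off.
class BackgroundWork {
public:
    explicit BackgroundWork(bool useThreads) : m_useThreads(useThreads) {}
    ~BackgroundWork() { Wait(); }

    void Run(std::function<void()> job) {
        if (m_useThreads) {
            m_threads.emplace_back(std::move(job));   // one thread per job
        } else {
            job();                                    // inline, fully linear
        }
    }

    // Blocks until every job started so far has finished.
    void Wait() {
        for (std::thread &t : m_threads) {
            if (t.joinable()) t.join();
        }
        m_threads.clear();
    }

private:
    bool m_useThreads;
    std::vector<std::thread> m_threads;
};

// Runs the same four jobs with or without threads and returns their sum.
int SumJobs(bool useThreads) {
    std::atomic<int> total(0);
    BackgroundWork work(useThreads);
    for (int i = 1; i <= 4; ++i) {
        work.Run([&total, i] { total += i; });
    }
    work.Wait();
    return total.load();
}
```

If a bug vanishes when the flag is off, you've learned that the logic inside the jobs is probably fine and the trouble lies in how the threads share data.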

A Tale from the Pixel Mines

Ultima VIII had an interrupt driven multi-tasking system, which was something of a feat in DOS 5. A random crash was occurring in QA, and no one could figure out how to reproduce it, which meant there was little hope of it getting fixed. It was finally occurring once every thirty minutes or so—way too often to be ignored. We set four or five programmers on the problem with each one attempting to reproduce the bug. Finally, the bug was reproduced by a convoluted path. We would walk the Avatar character around the map in a specific sequence, teleporting to one side of the map, then the other, and the crash would happen. We were getting close.

Herman, the guy with perfect pitch, turned on his pitch debugger. We followed the steps exactly, and when the crash happened Herman called it: a B-flat meant that the bug was somewhere in the memory manager. We eventually tracked it down to a lack of protection in the memory system. Two threads were accessing the memory management system at the same time, and the result was a trashed section of memory. Since the bug was related to multi-threading, it never corrupted the same piece of memory twice in a row.

Had we turned the multi-threading off, the bug would have disappeared, causing us to focus our effort on any shared data structure that could be corrupted by multiple thread access. In other words, we were extremely lucky to find this bug, and the only thing that saved us was a set of steps we could follow that made the bug happen.

Weird Ones

There are some bugs that are very strange, either by their behavior, intermittence, or the source of the problem. Driver related issues are pretty common, not necessarily because there's a bug in the driver. It's more likely that you are assuming the hardware or driver can do something that it cannot. Your first clue that an issue is driver related is that it only occurs on specific hardware, such as a particular brand of video card. Video cards are sources of constant headaches in Windows games because each manufacturer wants to have some feature stand out from the pack, and do so in a manner that keeps costs down. More often than not this will result in some odd limitations and behavior.

Gotcha

A great example of an unruly video card is found on an old video card that was once made by the now defunct 3Dfx company. This card had a limitation that no video memory surface could have a width to height ratio greater than 8:1. A 256x32 surface would work just fine, but a 512x32 surface would fail in a very strange way: It would create a graphic effect not unlike a scrambled TV channel. If you weren't aware of this limitation you would debug relentlessly through every line of code in your whole game and you'd never find the problem. It turns out that problems like this are usually found through a targeted search of the Internet. Google groups (groups.google.com) is my personal favorite.

Weird bugs can also crop up in specific operating system versions, for exactly the same reasons. Windows 9x based operating systems are very different from Windows 2000 and Windows XP, which are based on the much beefier NT kernel. These different operating systems make different assumptions about parameters, return values, and even logic for the same API calls. If you don't believe me just look at the bottom of the help files for any Win32 API like GetPrivateProfileSection. That one royally screwed me.

Again, you diagnose the problem by attempting to reproduce the bug on a different operating system. Save yourself some time and try a system that is vastly different. If the bug appears in Windows 98 try it again in Windows XP. If the bug appears in both operating systems it's extremely unlikely that your bug is OS specific.

A Tale from the Pixel Mines

Be especially aware of new things. One of the latest changes to MFC7 was a complete restructuring of how it handled strings. The old code was thrown out in favor of an ATL-based system. MFC7 was distributed with Visual Studio.NET, and we noticed immediately that our game was failing under Windows 98. After a painful remote debugging session it seemed that the tried and true CFileFind class was corrupting memory. Go figure! One of the reasons it took me so long to find it was that I wasn't looking inside CFileFind even though the source code was there right in front of me. I guess I'm just too trusting.

A much rarer form of the weird bug is a specific hardware bug, one that seems to manifest as a result of a combination of hardware and operating systems, or even a specific piece of defective or incompatible hardware. These problems can manifest themselves most often in portable computers, oddly enough. If you've isolated the bug to something this specific, the first thing you should try is to update all the relevant drivers. This is a good thing to do in any case, since most driver-related bugs will disappear when the fresh drivers are installed.

Finally, the duckbilled platypus of weird bugs is the one generated by the compiler. It happens, more often than anyone would admit. The bug will manifest itself most often in a release build with full optimizations. This is the most fragile section of the compiler. You'll be able to reproduce the bug on any platform, but it may disappear when release mode settings are tweaked. The only way to find this problem is to stare at the assembly code and realize that the compiler generated code that is not semantically equal to the original source code. This is not that easy, especially in fully optimized assembly.

By the way, if you are wondering what you do if you don't know assembly, here's a clue: Go find a programmer that knows assembly. Watch them work, and learn something. Then convince yourself that maybe learning a little assembly is a good idea.

Best Practice

If you happen to be lucky (or unlucky) enough to find a weird compiler problem (especially one that could impact other game developers), do everyone a favor and write a tiny program that isolates the compiler bug, and post it to the internet so everyone can watch out for the problem. You'll be held in high regard if you find a workaround and post that too. Be really sure that you are right about what you see. The Internet lasts forever and it would be unfortunate if you blamed the compiler programmers for something they didn't do. In your posts, be gentle. Rather than say something like, "Those idiots who developed the xyz compiler really screwed up and put in this nasty bug…," try, "I think I found a tricky bug in the xyz compiler…"