Almost every Windows NT or Windows 2000 user has heard of, if not experienced, the infamous "blue screen of death." This ominous term refers to the blue screen that is displayed when Windows 2000 crashes, or stops executing, because of a catastrophic fault or an internal condition that prevents the system from continuing to run.
In this section, we'll cover the basic problems that cause Windows 2000 to crash, describe the information presented on the blue screen, and explain the various configuration options available to create a crash dump, a record of system memory at the time of a crash that can help you figure out which component caused the crash. This section is not intended to provide detailed troubleshooting information on how to analyze a Windows 2000 system crash.
Windows 2000 crashes (stops execution and displays the blue screen) for the following reasons:
When a kernel-mode device driver or subsystem causes an illegal exception, Windows 2000 faces a difficult dilemma. It has detected that a part of the operating system with the ability to access any hardware device and any valid memory has done something it wasn't supposed to do.
But why does that mean Windows 2000 has to crash? Couldn't it just ignore the exception and let the device driver or subsystem continue as if nothing had happened? The possibility exists that the error was isolated and that the component will somehow recover. But what's more likely is that the detected exception resulted from deeper problems—for example, from a general corruption of memory or from a hardware device that's not functioning properly. Permitting the system to continue operating would probably result in more exceptions, and data stored on disk or other peripherals could become corrupt—a risk that's too high to take.
Regardless of the reason for a system crash, the function that actually performs the crash is KeBugCheckEx (documented in the Windows 2000 DDK). This function takes a stop code (sometimes called a bug check code), and four parameters that are interpreted on a per-stop code basis. After KeBugCheckEx masks out all interrupts on all processors of the system, it switches the display into blue-screen mode (80-columns-by-50-lines text mode), paints a blue background, and then displays the stop code, followed by some text suggesting what the user can do. (It's possible that system data structures have been so seriously corrupted that the blue screen isn't displayed.) Figure 4-4 shows a sample blue screen.
Figure 4-4 Example blue screen
The first line lists the stop code and the four additional parameters passed to KeBugCheckEx. The text line below the stop code provides the text equivalent of the stop code's numeric identifier. According to the example in Figure 4-4, the stop code 0x0000000A is an IRQL_NOT_LESS_OR_EQUAL crash. When a parameter contains an address of a piece of operating system or device driver code (as in Figure 4-4), Windows 2000 displays the base address of the module the address falls in, the date stamp, and the file name of the device driver. This information alone might help you pinpoint the faulty component.
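The layout just described (a stop code, four parameters, and a symbolic name) can be sketched as a small parser. This is only an illustration: the exact text of the stop line, the sample parameter values, and the short lookup table are assumptions made for this example, not a reproduction of Figure 4-4 or of the full stop code list.

```python
import re
from typing import NamedTuple, Tuple

# A few well-known stop codes; the real list has more than a hundred entries.
KNOWN_STOP_CODES = {
    0x0000000A: "IRQL_NOT_LESS_OR_EQUAL",
    0x0000001E: "KMODE_EXCEPTION_NOT_HANDLED",
    0x00000050: "PAGE_FAULT_IN_NONPAGED_AREA",
    0x0000007B: "INACCESSIBLE_BOOT_DEVICE",
}


class StopInfo(NamedTuple):
    code: int
    params: Tuple[int, ...]
    name: str


# Assumed line shape: "STOP: 0xCODE (0xP1,0xP2,0xP3,0xP4)".
_STOP_RE = re.compile(
    r"STOP:\s*(0x[0-9A-Fa-f]+)\s*"
    r"\(\s*((?:0x[0-9A-Fa-f]+\s*,\s*){3}0x[0-9A-Fa-f]+)\s*\)"
)


def parse_stop_line(line: str) -> StopInfo:
    """Split a stop line into its code, four parameters, and symbolic name."""
    match = _STOP_RE.search(line)
    if match is None:
        raise ValueError("not a recognizable stop line: %r" % line)
    code = int(match.group(1), 16)
    params = tuple(int(p.strip(), 16) for p in match.group(2).split(","))
    return StopInfo(code, params, KNOWN_STOP_CODES.get(code, "UNKNOWN"))


# Hypothetical sample line; the parameter values are made up.
sample = "*** STOP: 0x0000000A (0x00000000,0x00000002,0x00000000,0x8046A4F2)"
info = parse_stop_line(sample)
print(info.name, [hex(p) for p in info.params])
```

Because the meaning of the four parameters depends on the stop code, the parser deliberately returns them as raw integers and leaves interpretation to the reader.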
Although there are more than a hundred unique stop codes, most are rarely, if ever, seen on production systems. Instead, just a few common stop codes represent the majority of Windows 2000 system crashes. Also, the meaning of the four additional parameters depends on the stop code (and not all stop codes have extended parameter information). Nevertheless, looking up the stop code and the meaning of the parameters (if applicable) might at least assist you in diagnosing the component that is failing (or the hardware device that is causing the crash). You can find stop code information in the following places:
You often begin seeing blue screens after you install a new software product or piece of hardware. If you've just added a driver, rebooted, and gotten a blue screen early in system initialization, you can reset the machine, press the F8 key when instructed, and then select Last Known Good Configuration. Enabling last known good causes Windows 2000 to revert to a copy of the registry's device driver registration key (HKLM\SYSTEM\CurrentControlSet\Services) from the last successful boot (before you installed the driver). From the perspective of last known good, a successful boot is one in which all services and drivers have finished loading and at least one logon has succeeded.
If you keep getting blue screens, an obvious approach is to uninstall the components you added just before the first blue screen appeared. If some time has passed since you added something new or you added several things at about the same time, you need to note the names of the device drivers referenced in any of the parameters. If you recognize any of the names as being related to something you just added (such as Scsiport.sys if you put on a new SCSI drive), you've possibly found your culprit.
Many device drivers have cryptic names. To figure out which application or hardware device a name belongs to, search for the name of the device driver under the HKLM\SYSTEM\CurrentControlSet\Services key to find the service associated with it. This branch of the registry is where Windows 2000 stores registration information for every device driver in the system. If you find a match, look for values named DisplayName and Description. Some drivers fill in these values to describe the device driver's purpose. For example, you might find the string "Virus Scanner" in the DisplayName value, which can implicate the antivirus software you have running. You can also display the list of drivers in the Computer Management tool (from the Start menu, select Programs/Administrative Tools/Computer Management). In Computer Management, expand System Tools, System Information, and Software Environment, and then select Drivers.
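The registry search described above could be scripted along the following lines. This is a sketch under stated assumptions: it matches a driver file name against each service's ImagePath value (falling back to the subkey name when ImagePath is absent), which is one plausible way to do the search rather than a method given in the text. The helper names are hypothetical, and the registry loader runs only on Windows.

```python
import ntpath
from typing import Dict, List, Tuple

SERVICES_KEY = r"SYSTEM\CurrentControlSet\Services"


def load_services() -> Dict[str, Dict[str, str]]:
    """Read selected values of every subkey under Services (Windows only)."""
    import winreg  # imported lazily so the matching logic works anywhere

    services: Dict[str, Dict[str, str]] = {}
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, SERVICES_KEY) as root:
        subkey_count = winreg.QueryInfoKey(root)[0]
        for index in range(subkey_count):
            name = winreg.EnumKey(root, index)
            values: Dict[str, str] = {}
            try:
                with winreg.OpenKey(root, name) as sub:
                    for value_name in ("ImagePath", "DisplayName", "Description"):
                        try:
                            data, _type = winreg.QueryValueEx(sub, value_name)
                            values[value_name] = str(data)
                        except FileNotFoundError:
                            pass  # this service does not define the value
            except OSError:
                continue  # skip keys we are not allowed to open
            services[name] = values
    return services


def find_services_for_driver(
    services: Dict[str, Dict[str, str]], driver_file: str
) -> List[Tuple[str, str]]:
    """Return (service name, description) pairs whose driver file matches."""
    wanted = driver_file.lower()
    stem = ntpath.splitext(wanted)[0]
    hits = []
    for name, values in services.items():
        image = ntpath.basename(values.get("ImagePath", "")).lower()
        # Assumption: with no ImagePath, the subkey name stands in for the file.
        if image == wanted or (not image and name.lower() == stem):
            label = (values.get("DisplayName")
                     or values.get("Description")
                     or "(no description)")
            hits.append((name, label))
    return hits


# Hypothetical data standing in for the real registry contents.
example = {
    "VScan": {"ImagePath": r"System32\DRIVERS\vscan.sys",
              "DisplayName": "Virus Scanner"},
    "Scsiport": {},
}
print(find_services_for_driver(example, "vscan.sys"))
```

On a live system, you would pass the result of load_services() instead of the example dictionary.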
More often than not, however, the stop code and the four associated parameters aren't enough information to troubleshoot a system crash. For example, you might need to examine the kernel-mode call stack to pinpoint the driver or system component that triggered the crash. Also, because the default behavior on Windows 2000 systems is to automatically reboot after a system crash, it's unlikely that you would have time to record the information displayed on the blue screen. That is why, by default, Windows 2000 attempts to record information about the system crash to the disk for later analysis, which takes us to our final topic, crash dump files.
By default, all Windows 2000 systems are configured to attempt to record information about the state of the system when the system crashes. To see these settings, open the System tool in Control Panel, click the Advanced tab in the System Properties dialog box, and then click the Startup And Recovery button. The default settings for a Windows 2000 Professional system are shown in Figure 4-5.
Figure 4-5 Crash dump settings
Three levels of information can be recorded on a system crash:
When Windows 2000 is configured to write crash dump information, it writes the information to the paging file because trying to create a new file on the disk would depend on more of the system data structures being intact. (If there is more than one paging file, the first or primary page file is used.) After the system reboots, the logon process (Winlogon.exe) creates a child process (Savedump.exe) to copy the crash dump information out of the page file and into a new file. Small memory dumps are by default created in the directory \Winnt\Minidump and are given unique file names consisting of the string "Mini" plus the date plus a sequence number (for example, Mini031000-01.dmp). Kernel memory and complete memory dumps are copied to a file named \Winnt\Memory.dmp, which means that only the latest dump file is retained on the disk.
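The small memory dump naming scheme can be illustrated with a short sketch. The text gives only one example name (Mini031000-01.dmp), so the MMDDYY reading of the date field and the two-digit sequence number are inferences from that example. The helper that picks the next unused sequence number is hypothetical and is not a description of how Savedump.exe actually works.

```python
from datetime import date
from typing import Iterable


def minidump_name(crash_date: date, sequence: int) -> str:
    """Build a name of the form Mini<MMDDYY>-<NN>.dmp (format is inferred)."""
    return f"Mini{crash_date:%m%d%y}-{sequence:02d}.dmp"


def next_minidump_name(existing: Iterable[str], crash_date: date) -> str:
    """Pick the first sequence number not already used on this date."""
    taken = {name.lower() for name in existing}
    sequence = 1
    while minidump_name(crash_date, sequence).lower() in taken:
        sequence += 1
    return minidump_name(crash_date, sequence)


print(minidump_name(date(2000, 3, 10), 1))
print(next_minidump_name(["Mini031000-01.dmp"], date(2000, 3, 10)))
```

The sequence number is what lets several small memory dumps from the same day coexist, in contrast to \Winnt\Memory.dmp, which is overwritten each time.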
As mentioned earlier, there's no guarantee that the crash dump information will be recorded, since the data structures used to access the paging file might themselves be corrupted, preventing the system from writing anything to disk. If the system isn't able to record the crash dump, you can try booting the crashing system with the kernel debugger so that you can gain control from a host debugger when the system crashes. In that way, you can use the interactive kernel debugger to look at the kernel stack or examine other operating system structures to try to determine the reason for the crash. For more information on how to set up the kernel debugger, see the Windows 2000 Debugging help file (Ddkdbg.chm) mentioned earlier.
Once you have a crash dump file (whether it's a small memory dump, a kernel memory dump, or a complete memory dump), how can you retrieve the stop code or perform further analysis? The simplest tool to use is Dumpchk (available in the Windows 2000 Support Tools, the Platform SDK, the Windows 2000 DDK, and the debugging tools). By default, Dumpchk opens a dump file and displays the basic information about a crash, such as the operating system version, stop code, and parameters. If you call it with the "-e" option, it displays more details, such as the list of loaded device drivers, the current process and thread, and the kernel stack. (This option requires the symbol file for Ntoskrnl.exe to match the version of Windows 2000 that crashed. See the section "Symbols for Kernel Debugging" in Chapter 1 for more information on symbol files.)
Finally, an advanced tool called the Kernel Memory Space Analyzer (Kanalyze.exe) might also be useful in debugging a crash dump. This tool is part of the debugging tools package, the Windows 2000 DDK, and the Platform SDK and is documented in a separate Microsoft Word document called OEM Tool Help. (You can find this file at \Program Files\Debuggers\bin\kanalyze\userdocs.doc if you have the Windows 2000 debugging tools installed. You can also find it in the Platform SDK and the DDK directory trees.)
Unfortunately, you can't run a magical program to identify the exact cause of blue screens or to make them go away. Even with extensive knowledge of Windows 2000 internals and device drivers, analyzing a blue screen or a crash dump can be very difficult. However, being able to retrieve the stop code and parameters can at least point you in the right direction.
EXPERIMENT
Forcing a Crash and Retrieving the Stop Code
To generate a crash dump for the purposes of experimenting with the dump analysis tools referred to in this section, you can force Windows 2000 to crash in either of two ways. You can run the \Sysint\Bsod.exe tool on this book's companion CD (this program loads a device driver that calls KeBugCheckEx), or you can enable the support added in Windows 2000 to force the system to crash by holding the right Ctrl key down and pressing Scroll Lock twice. To enable this feature, add a DWORD value named CrashOnCtrlScroll with a value of 1 to the registry key HKLM\SYSTEM\CurrentControlSet\Services\i8042prt\Parameters. You must reboot the system for this change to take effect.
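One way to apply the CrashOnCtrlScroll setting is to import a registry file rather than editing the value by hand. The sketch below only generates the text of such a file. The helper name is hypothetical, and the "Windows Registry Editor Version 5.00" header is assumed to be the export format used by the Windows 2000 Registry Editor; check it against a file exported from your own system before importing anything.

```python
def reg_dword_fragment(key: str, value_name: str, value: int) -> str:
    """Render a registry file that sets one DWORD value under one key."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("a DWORD must fit in 32 bits")
    lines = [
        "Windows Registry Editor Version 5.00",  # assumed header format
        "",
        f"[{key}]",
        f'"{value_name}"=dword:{value:08x}',
        "",
    ]
    return "\r\n".join(lines)


# Key and value name are taken from the text; HKLM is spelled out in full
# because registry files use the long root key name.
CRASH_KEY = (r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet"
             r"\Services\i8042prt\Parameters")
reg_text = reg_dword_fragment(CRASH_KEY, "CrashOnCtrlScroll", 1)
print(reg_text)
```

As with the manual edit, the system must be rebooted before the keystroke sequence takes effect.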
Once you've created a crash dump file, try using Dumpchk to display the basic crash dump information. Then try the Dumpchk -e option to display the extended information. Finally, try opening the crash dump file with the interactive kernel debugger (Kd, I386kd.exe, or Windbg.exe) and using some of the built-in kernel debugger extensions, such as !process, !thread, and !drivers.
For instructions on how to use these tools, see the Debugging help file (Ddkdbg.chm).