Advanced Crash Dump Analysis | Microsoft Windows Internals (4th Edition): Microsoft Windows Server 2003, Windows XP, and Windows 2000


The preceding section uses the Driver Verifier to create crashes that the debugger's automated analysis engine can resolve. You might still encounter cases where you cannot get a system to produce easily analyzable crashes. If so, you will need to perform manual analysis to determine what the problem is.

Use the !process 0 0 debugger command to look at the processes running and make sure that you understand the purpose of each one. Try disabling or uninstalling unnecessary applications and services.
Use the lm command with the kv option to list the loaded kernel-mode drivers. Make sure that you understand the purpose of any third-party drivers and that you have the most recent versions.
Use the !vm command to see whether the system has exhausted virtual memory, paged pool, or nonpaged pool. If virtual memory is exhausted, the committed pages will be close to the commit limit, so try to identify a potential memory leak by examining the list of processes to see which one reports high commit usage. If nonpaged pool or paged pool is exhausted (that is, the usage is close to the maximum), see the "Troubleshooting a Pool Leak" experiment in Chapter 7.
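The commit check in the last step can be sketched in a few lines of Python. This is only an illustration: the function names, the 90 percent threshold, and the idea of copying the numbers out of the debugger by hand are assumptions, not part of the debugger.

```python
def commit_pressure(committed_pages: int, commit_limit_pages: int,
                    threshold: float = 0.9) -> bool:
    """Return True when committed pages reach `threshold` of the commit limit."""
    if commit_limit_pages <= 0:
        raise ValueError("commit limit must be positive")
    return committed_pages / commit_limit_pages >= threshold


def top_commit_users(commit_by_process: dict[str, int], count: int = 3) -> list[str]:
    """Return the process names with the highest commit charge, largest first."""
    ranked = sorted(commit_by_process.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:count]]
```

If commit_pressure returns True for the values you read from the debugger, the processes returned by top_commit_users are the first candidates to examine for a leak.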

There are other debugging commands that can prove useful, but more advanced knowledge is required to apply them. The !irp command is one of them. The next section shows the use of this command to identify a suspect driver.

Stack Trashes

Stack overrun or stack trashing results from buffer overrun or underrun bugs. However, instead of residing in pool, as you saw with Notmyfault's buffer overrun bug, the target buffer is on the stack of the thread that executes the bug. This type of bug is another one that's difficult to debug because the stack is the foundation for any crash dump analysis.

When you run Notmyfault and select Stack Trash, the Myfault driver overruns a buffer it allocates on the kernel stack of the thread that executes it. When Myfault tries to return control to the Ntoskrnl function that invoked it, it reads the return address, which is the address at which it should continue executing, from the stack. The address was corrupted by the stack-buffer overrun, so the thread continues execution at some different address in memory, an address that might not even contain code. An exception and crash occur when the thread executes an illegal CPU instruction or references invalid memory.
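The following Python sketch models the effect in miniature. It is a toy simulation, not the Myfault driver's code: the frame layout, the 16-byte buffer size, and the return address value are all made up for illustration.

```python
import struct

BUFFER_SIZE = 16              # bytes reserved for the local buffer (hypothetical)
RETURN_ADDRESS = 0x804EBB04   # made-up caller address stored above the buffer


def build_frame() -> bytearray:
    """Lay out a toy frame: [local buffer][saved return address]."""
    return bytearray(BUFFER_SIZE) + bytearray(struct.pack("<I", RETURN_ADDRESS))


def fill_buffer(frame: bytearray, length: int, value: int = 0) -> None:
    """Write `length` bytes from the start of the buffer without checking
    BUFFER_SIZE, which is the bug. Writes stop at the end of the toy frame."""
    for offset in range(min(length, len(frame))):
        frame[offset] = value


def saved_return_address(frame: bytearray) -> int:
    """Read back the address the function would return to."""
    return struct.unpack_from("<I", frame, BUFFER_SIZE)[0]
```

Filling 16 bytes leaves the saved return address intact. Filling 20 bytes zeroes it, so the simulated function would "return" to address 0.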

The driver that crash dump analysis blames for a stack overrun will vary from crash to crash, but the stop code will almost always be KMODE_EXCEPTION_NOT_HANDLED. If you execute a verbose analysis, the stack trace looks like this:

STACK_TEXT:
b7b0ebd4 00000000 00000000 00000000 00000000 0x0

This is consistent with the stack having been overwritten with zeros. Unfortunately, mechanisms like special pool and system code write protection can't catch this type of bug. Instead, you must take some manual analysis steps to determine indirectly which driver was operating at the time of the corruption. One way is to examine the IRPs that are in progress for the thread that was executing at the time of the stack trash. When a thread issues an I/O request, the I/O manager stores a pointer to the outstanding IRP on the IRP list of the ETHREAD structure for the thread. The !thread debugger command dumps the IRP list of the target thread. (If you don't specify a thread object address, !thread dumps the processor's current thread.) Then you can look at the IRP with the !irp command:

kd> !thread
THREAD ff740020 Cid 8f8.420 Teb: 7ffde000 Win32Thread: a20cdbe8 RUNNING
IRP List:
    bc5a7f68: (0006, 0094) Flags: 00000000  Mdl: 00000000
Not impersonating
Owning Process ff75f120
...
kd> !irp bc5a7f68
Irp is active with 1 stacks 1 is current ( = 0xbc5a7fd8)
 No Mdl Thread ff740020:  Irp stack trace.
     cmd flg  cl Device   File     Completion-Context
>[  e, 0] 0   0  ff79e4c0 ff7ac028 00000000-00000000
        \Driver\MYFAULT
            Args: 00000000 00000000 83360010 00000000

The output shows that the IRP's current and only stack location (designated with the ">" prefix) is owned by the Myfault driver. If this were a real crash, the next steps would be to ensure that the driver version installed is the most recent available, install the new version if it isn't, and if it is, to enable the Driver Verifier on the driver (with all settings except low memory simulation).
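If you save debugger output to a text file, a short script can pull out the driver that owns the current stack location. The Python sketch below assumes the !irp layout shown in this section, with the driver name following the line marked with ">". The function name current_stack_driver is hypothetical and not part of any debugger tooling.

```python
import re
from typing import Optional

IRP_OUTPUT = """\
Irp is active with 1 stacks 1 is current ( = 0xbc5a7fd8)
 No Mdl Thread ff740020:  Irp stack trace.
     cmd flg  cl Device   File     Completion-Context
>[  e, 0] 0   0  ff79e4c0 ff7ac028 00000000-00000000
        \\Driver\\MYFAULT
            Args: 00000000 00000000 83360010 00000000
"""


def current_stack_driver(irp_text: str) -> Optional[str]:
    """Return the first \\Driver\\... name at or after the '>'-marked line."""
    lines = irp_text.splitlines()
    for index, line in enumerate(lines):
        if line.lstrip().startswith(">"):
            # Search from the marked stack location to the end of the text.
            tail = "\n".join(lines[index:])
            match = re.search(r"\\Driver\\\S+", tail)
            return match.group(0) if match else None
    return None
```

For the output in this section, the function returns \Driver\MYFAULT. It returns None when no stack location is marked as current.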

Hung or Unresponsive Systems

If a system stops responding to keyboard or mouse input, the mouse freezes, or you can move the mouse but the system doesn't respond to clicks, the system is said to have hung. A number of things can cause the system to hang:

A device driver does not return from its interrupt service routine (ISR) or deferred procedure call (DPC) routine
A high-priority real-time thread preempts the windowing system driver's input threads
A deadlock (when two threads or processors hold resources each other wants and neither will yield what they have) occurs in kernel mode

If you have Windows XP or Windows Server 2003, you can check for deadlocks by using the Driver Verifier option called deadlock detection. Deadlock detection monitors the use of spin locks, fast mutexes, and mutexes, looking for patterns that could result in a deadlock. (For more information on these and other synchronization primitives, see Chapter 3.) If one is found, Driver Verifier crashes the system with an indication of which driver causes the deadlock. The simplest form of deadlock occurs when two threads each hold a resource the other wants and neither will yield what it has or give up waiting for the one it wants. If you are running Windows XP or Windows Server 2003, the first step to troubleshooting hung systems is therefore to enable deadlock detection on suspect drivers, then unsigned drivers, and then all drivers, until you get a crash that pinpoints the driver causing the deadlock.
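To make the idea of "patterns that could result in a deadlock" concrete, the following Python sketch flags lock-order inversions. It is a generic lock-order graph check, offered only as an illustration. It is not Driver Verifier's actual algorithm, and the class and method names are hypothetical.

```python
from collections import defaultdict


class LockOrderGraph:
    """Records 'held A, then acquired B' edges and reports order inversions."""

    def __init__(self) -> None:
        self._edges: dict[str, set[str]] = defaultdict(set)

    def _reachable(self, start: str, goal: str) -> bool:
        """Depth-first search: can `goal` be reached from `start`?"""
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if node == goal:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(self._edges[node])
        return False

    def acquire(self, held: list[str], wanted: str) -> bool:
        """Record the acquisition and return True if it could deadlock."""
        # A cycle exists if some held lock is already ordered after `wanted`.
        risky = any(self._reachable(wanted, lock) for lock in held)
        for lock in held:
            self._edges[lock].add(wanted)
        return risky
```

If one thread acquires B while holding A and another later acquires A while holding B, the second acquire call returns True even if the two threads never actually collide, which is the kind of latent pattern a detector looks for.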

If you are running Windows 2000 or have verified all drivers and still get hangs, you must either manually crash the hung system and analyze the resulting dump or break in with the kernel debugger to take an alternate approach to investigating the hang.

There are two ways to approach a hanging system so that you can apply the manual crash troubleshooting techniques described in this chapter to determine what driver or component is causing the hang: The first is to crash the hung system and hope that you get a dump that you can analyze; the second is to break into the system with a kernel debugger and analyze the system's activity. Both approaches require prior setup and a reboot. You use the same exploration of system state with both approaches to try and determine the cause of the hang.

To manually crash a hung system, you must first add the DWORD registry value HKLM\System\CurrentControlSet\Services\i8042prt\Parameters\CrashOnCtrlScroll and set it to 1. After rebooting, the i8042 port driver, which is the port driver for PS/2 keyboard input, monitors keystrokes in its interrupt service routine (ISR, which is discussed further in Chapter 3), looking for two presses of the Scroll Lock key while the right Ctrl key is held down. When the driver sees that sequence, it calls KeBugCheckEx with the MANUALLY_INITIATED_CRASH (0xE2) stop code, which indicates a crash initiated manually by the user. When the system reboots, open the crash dump file and apply the techniques mentioned earlier to try to determine why the system was hung (for example, determining what thread was running when the system hung, what the kernel stack indicates was happening, and so on). Note that this works for most hung system scenarios, but it won't work if the i8042 port driver's ISR doesn't execute. (The i8042 port driver's ISR won't execute if all processors are hung as a result of their IRQL being higher than the ISR's IRQL, or if corruption of system data structures extends to interrupt-related code or data.)

Note

Manually crashing a hung system by using the support provided in the i8042 port driver does not work with USB keyboards. It works with PS/2 keyboards only.
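The key sequence that the port driver watches for can be modeled as a small state machine. The Python sketch below is an illustration only, not the i8042 port driver's logic: the key names, the class name, and the rule that any other key press resets the count are assumptions.

```python
class CrashKeyDetector:
    """Counts Scroll Lock presses made while the right Ctrl key is held."""

    def __init__(self, required_presses: int = 2) -> None:
        self.required_presses = required_presses
        self.right_ctrl_down = False
        self.scroll_presses = 0

    def on_key(self, key: str, pressed: bool) -> bool:
        """Feed one key event; return True once the sequence is complete."""
        if key == "RIGHT_CTRL":
            self.right_ctrl_down = pressed
            self.scroll_presses = 0      # pressing or releasing Ctrl restarts the count
        elif key == "SCROLL_LOCK":
            if pressed and self.right_ctrl_down:
                self.scroll_presses += 1
        elif pressed:
            self.scroll_presses = 0      # assumed: another key press breaks the sequence
        return self.scroll_presses >= self.required_presses
```

In this model, the point at which on_key first returns True corresponds to the moment the real driver would call KeBugCheckEx.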

Another way to trigger a crash is to use a built-in "crash" button, if your hardware has one. (Some high-end servers do.) In this case, the crash is initiated by signaling the nonmaskable interrupt (NMI) pin of the system's motherboard. To enable this, set the registry DWORD value HKLM\System\CurrentControlSet\Control\CrashControl\NMICrashDump to 1. Then when you press the dump switch, an NMI is delivered to the system and the kernel's NMI interrupt handler calls KeBugCheckEx. This works in more cases than the i8042 port driver mechanism because the NMI IRQL is always higher than that of the i8042 port driver interrupt. See http://www.microsoft.com/whdc/system/CEC/dmpsw.mpsx for more information.
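Both registry values can be set with the reg add command-line tool as well as with Regedit. The Python sketch below only builds the command lines and does not run them. The helper names are hypothetical, and it assumes Reg.exe is available on the target system.

```python
def reg_add_dword(key: str, value_name: str, data: int) -> list[str]:
    """Build the argument list for a `reg add` call that sets a DWORD value."""
    return ["reg", "add", key, "/v", value_name,
            "/t", "REG_DWORD", "/d", str(data), "/f"]


# Key paths and value names are the ones given in the text.
CRASH_SETTINGS = [
    (r"HKLM\System\CurrentControlSet\Services\i8042prt\Parameters",
     "CrashOnCtrlScroll", 1),
    (r"HKLM\System\CurrentControlSet\Control\CrashControl",
     "NMICrashDump", 1),
]


def crash_setup_commands() -> list[list[str]]:
    """Return one `reg add` argument list per crash-trigger setting."""
    return [reg_add_dword(key, name, data) for key, name, data in CRASH_SETTINGS]
```

Each returned list can be passed to subprocess.run from an account with administrative rights. As described above, the keyboard setting takes effect only after a reboot.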

If you are unable to manually generate a crash dump, you can attempt to break into the hung system by first making the system boot into debugging mode. You do this in one of two ways. You can press the F8 key during the boot and select Debugging Mode, or you can create a debugging-mode boot option in Boot.ini by copying an existing boot entry from the system's Boot.ini and adding the /DEBUG switch. When using the F8 approach, the system will use the default connection (Serial Port COM2 and 19200 Baud). With the /DEBUG option, you must also configure the connection mechanism to be used between the host system running the kernel debugger and the target system booting in debugging mode and then configure the /Debugport and /Baudrate switches appropriately for the connection type. The two connection types are a null modem cable using a serial port or, for Windows XP and Windows Server 2003 systems, an IEEE 1394 (Firewire) cable using 1394 ports on each system. For details on configuring the host and target system for kernel debugging, see the Windows Debugging Tools help file.
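Creating the debugging-mode boot option amounts to copying a line and appending switches. The Python sketch below shows that for a serial connection only. The function name, the default port and baud rate, and the sample entry mentioned afterward are assumptions for illustration, and an IEEE 1394 connection needs different settings that are not shown.

```python
def debug_boot_entry(entry: str, port: str = "COM1", baud: int = 115200) -> str:
    """Return a copy of a Boot.ini entry with serial kernel-debugging switches."""
    if "=" not in entry:
        raise ValueError('expected an entry of the form path="description" switches')
    if not port.upper().startswith("COM"):
        raise ValueError("this sketch handles serial (COMn) connections only")
    switches = ["/DEBUG", f"/DEBUGPORT={port.upper()}", f"/BAUDRATE={baud}"]
    return entry.rstrip() + " " + " ".join(switches)
```

Given an existing entry such as multi(0)disk(0)rdisk(0)partition(1)\WINDOWS="Microsoft Windows XP Professional" /fastdetect, the function returns the same line with the three switches appended, ready to be added as a second entry under [operating systems].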

When booting in debugging mode, the system loads the kernel debugger at boot time and makes it ready for a connection from a kernel debugger running on a different computer connected through a serial cable or IEEE 1394 cable. Note that the kernel debugger's presence does not affect performance. When the system hangs, run the Windbg or Kd debugger on the connected system, establish a kernel debugging connection, and break into the hung system. This approach will not work if interrupts are disabled or the kernel debugger has become corrupted.

Note

Booting a system in debugging mode does not affect performance if it's not connected to another system; however, a system that's configured to automatically reboot after a crash will not do so if it's booted with kernel debugging enabled, because the kernel debugger waits for a connection from another system after a crash.

Instead of leaving the system in its halted state while you perform analysis, you can also use the debugger ".dump" command to create a crash dump file on the host debugger machine. Then you can reboot the hung system and analyze the crash dump offline (or submit it to Microsoft). Note that this can take a long time if you are connected using a serial null modem cable (vs. a higher speed 1394 connection), so you might want to just capture a minidump using the ".dump /m" command. Alternatively, if the target machine is capable of writing a crash dump, you can force it to do so by issuing the ".crash" command from the debugger. This will cause the target machine to create a dump onto its local hard drive that you can examine after the system reboots.
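A quick estimate shows why a full dump over a serial cable is slow. The Python sketch below assumes 10 bits on the wire per byte (8 data bits plus start and stop bits) and ignores debugger protocol overhead, so real transfers take longer. The dump sizes used in the note that follows are hypothetical.

```python
def serial_transfer_seconds(size_bytes: int, baud: int, bits_per_byte: int = 10) -> float:
    """Estimate the time to send `size_bytes` over a serial link at `baud`."""
    if baud <= 0:
        raise ValueError("baud rate must be positive")
    return size_bytes * bits_per_byte / baud


def as_hours(seconds: float) -> float:
    """Convert seconds to hours for easier reading."""
    return seconds / 3600
```

Under these assumptions, a 512-MB dump at 115200 baud takes roughly 13 hours, whereas a 64-KB minidump takes a few seconds.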

You can cause a hang by running Notmyfault and selecting the Hang option. This causes the Myfault driver to queue a DPC on each processor of the system that executes an infinite loop. Because the processor's IRQL while executing DPC functions is DPC/dispatch level, which is lower than the keyboard's device IRQL, the keyboard ISR still runs and will respond to the special keyboard crashing sequence.
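The rule at work here is that an interrupt is serviced only when its IRQL is higher than the processor's current IRQL. The Python sketch below states that rule using the x86 IRQL numbering. The keyboard device IRQL shown is a made-up value, because the real one depends on the system.

```python
# x86 IRQL values; device interrupts fall between DISPATCH_LEVEL and HIGH_LEVEL.
PASSIVE_LEVEL = 0
APC_LEVEL = 1
DISPATCH_LEVEL = 2
HIGH_LEVEL = 31

KEYBOARD_DIRQL = 8  # hypothetical device IRQL for the keyboard interrupt


def interrupt_is_serviced(cpu_irql: int, interrupt_irql: int) -> bool:
    """An interrupt preempts the processor only if its IRQL is strictly higher."""
    return interrupt_irql > cpu_irql
```

A processor spinning in a DPC at DISPATCH_LEVEL still takes the keyboard interrupt, but a processor stuck at HIGH_LEVEL does not, which is why the keyboard sequence cannot rescue every hang.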

Once you've broken into a hung system or loaded a manually generated dump from a hung system into a debugger, you should execute the !analyze command with the -hang option. This causes the debugger to examine the locks on the system and try to determine whether there's a deadlock, and if so, what driver or drivers are involved. However, for a hang like the one that Notmyfault's hang generates, the !analyze analysis command will report nothing useful.

If the !analyze command doesn't pinpoint the problem, execute !thread and !process in each of the dump's CPU contexts to see what each processor is doing. (Switch CPU contexts with the ~ command; for example, use ~1s to switch to processor 1's context.) If a thread has hung the system by executing in an infinite loop at an IRQL of DPC/dispatch level or higher, you'll see the driver module in which it has become stuck in the stack trace of the !thread command. The stack trace of the crash dump you get when you crash a system experiencing the Notmyfault hang bug looks like this:

STACK_TEXT:
f9e66ed8 f9b0d681  000000e2 00000000  00000000 nt!KeBugCheckEx+0x19
f9e66ef4 f9b0cefb  0069b0d8 010000c6  00000000 i8042prt!I8xProcessCrashDump+0x235
f9e66f3c 804ebb04  81797d98 8169b020  00010009 i8042prt!I8042KeyboardInterruptService+0x21c
f9e66f3c fa12e34a  81797d98 8169b020  00010009 nt!KiInterruptDispatch+0x3d
WARNING: Stack unwind information not available. Following frames may be wrong.
ffdff980 8169b288  f9e67000 0000210f  00000004 myfault+0x34a
8054ace4 ffdff980  804ebf58 00000000  0000319c 0x8169b288
8054ace4 ffdff980  804ebf58 00000000  0000319c 0xffdff980
8169ae9c 8054ace4  f9b12b0f 8169ac88  00000000 0xffdff980
...

The top few lines of the stack trace reference the routines that execute when you type the i8042 port driver's crash key sequence. The presence of the Myfault driver indicates that it might be responsible for the hang.
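When you have many such traces to review, a script can list the modules that appear in a stack apart from the ones you expect. The Python sketch below assumes each frame ends with a module!symbol+offset or module+offset token, as in the trace above. The function name and the set of expected modules are hypothetical choices for this example.

```python
import re

STACK_TEXT = """\
f9e66ed8 f9b0d681  000000e2 00000000  00000000 nt!KeBugCheckEx+0x19
f9e66ef4 f9b0cefb  0069b0d8 010000c6  00000000 i8042prt!I8xProcessCrashDump+0x235
f9e66f3c 804ebb04  81797d98 8169b020  00010009 i8042prt!I8042KeyboardInterruptService+0x21c
f9e66f3c fa12e34a  81797d98 8169b020  00010009 nt!KiInterruptDispatch+0x3d
WARNING: Stack unwind information not available. Following frames may be wrong.
ffdff980 8169b288  f9e67000 0000210f  00000004 myfault+0x34a
8054ace4 ffdff980  804ebf58 00000000  0000319c 0x8169b288
"""

# Matches "module!symbol+0x..." or "module+0x..."; bare addresses do not match.
FRAME = re.compile(r"([A-Za-z_]\w*)(?:![^+\s]+)?\+0x[0-9a-fA-F]+")


def suspect_modules(stack_text: str, expected=frozenset({"nt", "i8042prt"})) -> list[str]:
    """Return modules seen in the trace that are not in `expected`, in order."""
    suspects: list[str] = []
    for line in stack_text.splitlines():
        fields = line.split()
        match = FRAME.fullmatch(fields[-1]) if fields else None
        if match and match.group(1) not in expected and match.group(1) not in suspects:
            suspects.append(match.group(1))
    return suspects
```

For the trace above, the only module reported is myfault. A module appearing in this list is a lead to investigate, not proof that it caused the hang.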

Another command that might be revealing is !locks, which dumps the status of all executive resource locks. By default, the command lists only resources that are under contention, which means that they are both owned and have at least one thread waiting to acquire them. Examine the thread stacks of the owners with the !thread command to see what driver they might be executing in.
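The default filter that !locks applies can be expressed as a simple predicate. The Python sketch below uses a hypothetical Resource record to illustrate it. The field names are assumptions and do not reflect the layout of the kernel's executive resource structure.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    """Minimal stand-in for an executive resource as summarized by a debugger."""
    address: str
    owner_threads: list[str] = field(default_factory=list)
    waiter_count: int = 0


def contended(resources: list[Resource]) -> list[Resource]:
    """Keep resources that are owned and have at least one waiting thread."""
    return [r for r in resources if r.owner_threads and r.waiter_count > 0]
```

The owner_threads of each resource returned by contended are the threads whose stacks are worth examining with !thread.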

When There Is No Crash Dump

In this section, we'll address how to troubleshoot systems that for some reason are not recording a crash dump. One reason why a crash dump might not be recorded is if the paging file on the boot volume is too small to hold the dump or if there is not enough free disk space to extract the dump after the reboot. These two cases can easily be remedied by either increasing the size of the paging file or configuring the dump to be saved to a volume with enough disk space to hold the extracted dump.

A third reason why a crash dump might not be recorded is that the kernel code and data structures needed to write the crash dump have been corrupted at the time of the crash.

As described earlier, this data is checksummed when the system boots, and if the checksum made at the time of the crash does not match, the system does not even attempt to save the crash dump (so as not to risk corrupting data on the disk). So in this case, you need to catch the system as it crashes and then try to determine the reason for the crash.
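The check described here follows a familiar pattern: record a checksum while the data is known to be good, and recompute it before relying on the data. The Python sketch below illustrates the pattern with CRC-32 as a stand-in. The text does not say which checksum algorithm the system uses, and the function names and component dictionary are hypothetical.

```python
import zlib


def checksum(components: dict[str, bytes]) -> int:
    """Compute one running CRC-32 over all components, in a fixed name order."""
    total = 0
    for name in sorted(components):
        total = zlib.crc32(components[name], total)
    return total


def safe_to_write_dump(boot_checksum: int, components: dict[str, bytes]) -> bool:
    """Return True only if the components still match the boot-time checksum."""
    return checksum(components) == boot_checksum
```

If even a single byte of a component changes after the boot-time checksum is taken, safe_to_write_dump returns False, which corresponds to the system declining to write the dump.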

A final reason occurs when the disk subsystem for the system disk is not able to process disk write requests (a condition that might have triggered the system failure itself). One such condition would be a hardware failure in the disk controller or maybe a cabling issue near the hard disk.

One simple option is to turn off the Automatically Restart option in the Startup And Recovery settings so that if the system crashes, you can examine the blue screen on the console. However, only the most straightforward crashes can be solved from just the blue-screen text.

To perform more in-depth analysis, you need to use the kernel debugger to look at the system at the time of the crash. This can be done by booting the system in debugging mode, which is described in the previous section. When a system booted in debugging mode crashes, instead of painting the blue screen and attempting to record the dump, it will wait forever until a host kernel debugger is connected. In this way, you can see the reason for the crash and perhaps perform some basic analysis using the kernel debugger commands described earlier. As mentioned in the previous section, you can use the .dump command in the debugger to save a copy of the crashed system's memory space for later debugging, thus allowing you to reboot the crashed system and debug the problem offline.

EXPERIMENT: The Blue Screen Screen Saver

A great way to remind yourself of what a blue screen looks like or to fool your coworkers and friends is to run the Sysinternals Blue Screen screen saver from http://www.sysinternals.com. The screen saver simulates authentic-looking blue screens that reflect the version of Windows on which you run it, generating all blue screen text using actual system information such as the list of loaded drivers. It also mimics an automatic reboot, complete with the Windows startup splash screen. Note that unlike other screen savers, which a mouse movement dismisses, the Blue Screen screen saver requires a key press.

Using the following syntax for the Psexec tool from Sysinternals, you can even run the screen saver on another system:

psexec \\computername -i -d "c:\sysinternals\bluescreen.scr" -s

The command requires that you have administrative privilege on the remote system. (You can use the -u and -p Psexec switches to specify alternate credentials.) Make sure that your coworker has a sense of humor!
