Advanced Crash Dump Analysis
There are other debugging commands that can
Stack overrun or stack trashing results from buffer
When you run Notmyfault and select Stack Trash, the Myfault driver overruns a buffer it
The driver that crash dump analysis blames for a stack overrun will vary from crash to crash, but the stop code will almost always be KMODE_EXCEPTION_NOT_HANDLED. If you execute a verbose analysis, the stack trace looks like this:
STACK_TEXT:
b7b0ebd4 00000000 00000000 00000000 00000000 0x0
This is consistent with the stack having been overwritten with zeros. Unfortunately, mechanisms like special pool and system code write protection can't catch this type of bug. Instead, you must take some manual analysis steps to determine indirectly which driver was operating at the time of the corruption. One way is to examine the IRPs that are in progress for the thread that was executing when the stack was trashed. When a thread issues an I/O request, the I/O manager stores a pointer to the outstanding IRP on the IRP list of the thread's ETHREAD structure. The !thread debugger command dumps the IRP list of the target thread. (If you don't specify a thread object address, !thread dumps the processor's current thread.) Then you can look at the IRP with the !irp command:
kd> !thread
THREAD ff740020  Cid 8f8.420  Teb: 7ffde000  Win32Thread: a20cdbe8 RUNNING
IRP List:
    bc5a7f68: (0006, 0094) Flags: 00000000  Mdl: 00000000
Not impersonating
Owning Process ff75f120
...
kd> !irp bc5a7f68
Irp is active with 1 stacks 1 is current ( = 0xbc5a7fd8)
 No Mdl Thread ff740020:  Irp stack trace.
     cmd  flg cl Device   File     Completion-Context
>[ e, 0]   0  0 ff79e4c0 ff7ac028 00000000-00000000
               \Driver\MYFAULT
                        Args: 00000000 00000000 83360010 00000000
The output shows that the IRP's current and only stack location (designated with the ">" prefix) is owned by the Myfault driver. If this were a real crash, the next steps would be to make sure the installed version of the driver is the most recent one available, install the newer version if it isn't, and, if it is, enable Driver Verifier on the driver (with all settings except low memory simulation).
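Spotting the ">" location by eye is easy for one IRP, but the same check can be scripted if you are sifting through many saved !irp outputs. The following is a minimal Python sketch, assuming the !irp text layout shown above: a current stack location that begins with ">[" and is followed by a \Driver\ or \FileSystem\ name. The helper name current_stack_owner is hypothetical and is not part of the Debugging Tools.

```python
import re
from typing import Optional

# Matches a driver or file system name as !irp prints it, e.g. \Driver\MYFAULT.
DRIVER_NAME = re.compile(r"\\(?:Driver|FileSystem)\\\S+")

# Assumption: the current stack location starts with ">[ ... ]" and runs until
# the next stack location line (optionally ">"-prefixed) or the end of the text.
CURRENT_LOCATION = re.compile(r">\[[^\]]*\](.*?)(?=\n\s*>?\[|\Z)", re.S)


def current_stack_owner(irp_output: str) -> Optional[str]:
    """Return the driver named in the IRP's current (">") stack location."""
    location = CURRENT_LOCATION.search(irp_output)
    if location is None:
        return None  # no current stack location in the text
    name = DRIVER_NAME.search(location.group(1))
    return name.group(0) if name else None


irp_text = r"""
     cmd  flg cl Device   File     Completion-Context
>[ e, 0]   0  0 ff79e4c0 ff7ac028 00000000-00000000
               \Driver\MYFAULT
                        Args: 00000000 00000000 83360010 00000000
"""
print(current_stack_owner(irp_text))  # \Driver\MYFAULT
```

Because the search is anchored on the ">" marker rather than on line positions, it also works when the current location is not the first one listed, as with an IRP that has passed through several layered drivers.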
Manually crashing a hung system by using the support provided in the i8042 port driver does not work with USB keyboards; it works only with PS/2 keyboards.
Another way to trigger a crash is to use a built-in "crash" button, if your hardware has one. (Some high-end servers have this.) In that case, the crash is initiated by signaling the nonmaskable interrupt (NMI) pin of the system's motherboard. To enable this, set the registry DWORD value HKLM\System\CurrentControlSet\Control\CrashControl\NMICrashDump to 1. Then when you press the dump switch, an NMI is delivered to the system and the kernel's NMI interrupt handler calls KeBugCheckEx. This works in more cases than the i8042 port driver mechanism because the NMI IRQL is always higher than that of the i8042 port driver interrupt. See http://www.microsoft.com/whdc/system/CEC/dmpsw.mpsx for more information.
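One way to set the NMICrashDump value is to import a .reg file. The Python sketch below only builds the file text; the helper names are hypothetical. The header line and the dword: syntax follow the format of .reg files that Regedit exports, and the file is written as UTF-16 with a byte-order mark, which is how Regedit saves version 5.00 exports. Importing the file into HKEY_LOCAL_MACHINE still requires administrator rights.

```python
# Header line of a version 5.00 .reg file.
REG_HEADER = "Windows Registry Editor Version 5.00"
CRASH_CONTROL_KEY = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\CrashControl"


def nmi_crash_dump_reg(enabled: bool = True) -> str:
    """Build .reg text that sets the NMICrashDump DWORD to 1 (or 0 to disable)."""
    value = 1 if enabled else 0
    return (
        f"{REG_HEADER}\r\n"
        "\r\n"
        f"[{CRASH_CONTROL_KEY}]\r\n"
        f'"NMICrashDump"=dword:{value:08x}\r\n'
    )


def write_reg_file(path: str, enabled: bool = True) -> None:
    """Write the .reg text as UTF-16 with a byte-order mark; newline="" keeps CRLF."""
    with open(path, "w", encoding="utf-16", newline="") as f:
        f.write(nmi_crash_dump_reg(enabled))


print(nmi_crash_dump_reg(), end="")
```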
If you are unable to manually generate a crash dump, you can attempt to break into the hung system by first making the system boot into debugging mode. You can do this in one of two ways. You can press the F8 key during the boot and select Debugging Mode, or you can create a debugging-mode boot option in Boot.ini by copying an existing boot entry from the system's Boot.ini and adding the /DEBUG switch. When you use the F8 approach, the system uses the default connection (serial port COM2 at 19200 baud). With the /DEBUG option, you must also configure the connection between the host system running the kernel debugger and the target system booting in debugging mode, and then set the /Debugport and /Baudrate switches appropriately for the connection type. The two connection types are a null modem cable between serial ports or, for Windows XP and Windows Server 2003 systems, an IEEE 1394 (FireWire) cable between 1394 ports on each system. For details on configuring the host and target system for kernel debugging, see the Windows Debugging Tools help file.
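Copying a Boot.ini entry and adding the switches is a small text transformation, sketched below in Python. This assumes the usual entry form ARC_PATH="Description" /switches from the [operating systems] section, and it leaves the original entry in place so you add the result as a second line. The helper name make_debug_entry, the " [debug]" label, and the default port and baud values are illustrative assumptions, not recommended settings. For 1394 connections the sketch uses /DEBUGPORT=1394 with a /CHANNEL switch; confirm the exact switches for your setup in the Debugging Tools help.

```python
from typing import Optional


def make_debug_entry(entry: str, port: str = "COM1", baud: Optional[int] = 115200,
                     channel: Optional[int] = None) -> str:
    """Copy a Boot.ini OS entry (ARC path="Description" /switches) and add debug switches.

    port, baud, and channel are illustrative defaults, not recommended values.
    """
    arc_path, sep, rest = entry.partition("=")
    if not sep or not rest.startswith('"'):
        raise ValueError(f"not a Boot.ini OS entry: {entry!r}")
    close = rest.index('"', 1)              # closing quote of the description
    description = rest[1:close]
    switches = rest[close + 1:].split()     # existing switches, e.g. /fastdetect

    switches += ["/DEBUG", f"/DEBUGPORT={port}"]
    if port.upper() == "1394":
        if channel is None:
            raise ValueError("a 1394 connection needs a channel number")
        switches.append(f"/CHANNEL={channel}")
    elif baud is not None:
        switches.append(f"/BAUDRATE={baud}")  # serial connections only

    # Label the copy so it is easy to pick from the boot menu.
    return f'{arc_path}="{description} [debug]" ' + " ".join(switches)


original = r'multi(0)disk(0)rdisk(0)partition(1)\WINDOWS="Microsoft Windows XP Professional" /fastdetect'
print(make_debug_entry(original))
```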
When booting in debugging mode, the system loads the kernel debugger at boot time and makes it ready for a connection from a kernel debugger running on a different computer connected through a serial cable or IEEE 1394 cable. Note that the kernel debugger's presence does not affect performance. When the system hangs, run the Windbg or Kd debugger on the connected system, establish a kernel debugging connection, and break into the hung system. This approach will not work if interrupts are disabled or the kernel debugger has become corrupted.
Booting a system in debugging mode does not affect performance if it's not connected to another system; however, a system that's configured to automatically reboot after a crash will not do so if it's
Instead of leaving the system in its halted state while you perform analysis, you can also use the debugger ".dump" command to create a crash dump file on the host debugger machine. Then you can reboot the hung system and analyze the crash dump offline (or submit it to Microsoft). Note that this can take a long time if you are connected through a serial null modem cable (as opposed to a higher-speed 1394 connection), so you might want to capture just a minidump using the ".dump /m" command. Alternatively, if the target machine is capable of writing a crash dump, you can force it to do so by issuing the ".crash" command from the debugger. This causes the target machine to write a dump to its local hard drive that you can examine after the system reboots.
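A quick calculation shows why the serial case is slow. The sketch below assumes 8-N-1 framing, so 10 bits cross the wire for every byte. It ignores the debugger protocol's own packet overhead, so the result is a lower bound. The 512-MB and 64-KB dump sizes and the 115,200 baud rate are hypothetical figures chosen for illustration.

```python
MIB = 1024 * 1024


def serial_transfer_seconds(size_bytes: int, baud: int, bits_per_byte: int = 10) -> float:
    """Lower bound on serial transfer time: 8-N-1 framing puts 10 bits on the wire per byte.

    Debugger protocol overhead (packet headers, acknowledgments) is ignored.
    """
    return size_bytes * bits_per_byte / baud


# Hypothetical sizes: a 512-MB complete dump versus a 64-KB minidump at 115,200 baud.
full = serial_transfer_seconds(512 * MIB, 115200)
mini = serial_transfer_seconds(64 * 1024, 115200)
print(f"complete dump: {full / 3600:.1f} hours; minidump: {mini:.1f} seconds")
```

Under these assumptions the complete dump needs roughly 13 hours of raw transfer time, while the minidump needs under 6 seconds.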
You can cause a hang by running Notmyfault and selecting the Hang option. This causes the Myfault driver to queue a DPC on each of the system's processors, and each DPC executes an infinite loop. A processor executing a DPC function runs at DPC/dispatch level, and the keyboard interrupt's IRQL is higher than that, so the keyboard ISR can still preempt the looping DPC and respond to the special keyboard crash sequence.
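The reasoning above, together with the earlier points about NMIs and disabled interrupts, reduces to a simple rule, modeled below in Python. This is a deliberately simplified sketch. It uses the x86 IRQL numbering, treats an NMI as always deliverable, and ignores details such as interrupt routing and APIC priorities. The keyboard DIRQL value is a hypothetical placeholder.

```python
from enum import IntEnum


class Irql(IntEnum):
    """x86 IRQL values; device interrupt levels (DIRQLs) fall in the range 3 through 26."""
    PASSIVE_LEVEL = 0
    APC_LEVEL = 1
    DISPATCH_LEVEL = 2
    PROFILE_LEVEL = 27
    CLOCK2_LEVEL = 28
    IPI_LEVEL = 29
    POWER_LEVEL = 30
    HIGH_LEVEL = 31


# Hypothetical DIRQL for the i8042 keyboard interrupt; the real value depends on
# the HAL and on which interrupt line the controller is routed to.
KEYBOARD_DIRQL = 26


def interrupt_delivered(current_irql: int, interrupt_irql: int,
                        interrupts_enabled: bool = True, nmi: bool = False) -> bool:
    """Simplified model of whether an interrupt can preempt the running code."""
    if nmi:
        return True                       # nonmaskable: ignores IRQL and the interrupt flag
    if not interrupts_enabled:
        return False                      # maskable interrupts are held off entirely
    return interrupt_irql > current_irql  # must be strictly higher than the current IRQL


# Notmyfault's hang: each processor spins in a DPC at DISPATCH_LEVEL.
print(interrupt_delivered(Irql.DISPATCH_LEVEL, KEYBOARD_DIRQL))  # True
```

The same model shows the limits of the keyboard approach. Code spinning at or above the keyboard's IRQL, or with interrupts disabled, blocks the crash sequence, while an NMI still gets through.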
Once you've broken into a hung system or loaded a manually generated dump from a hung system into a debugger, you should execute the !analyze command with the -hang option. This causes the debugger to examine the locks on the system and try to determine whether there's a deadlock and, if so, which driver or drivers are involved. However, for a hang like the one Notmyfault generates, !analyze will report nothing useful.
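When you repeat this triage often, you can generate the commands once and run them as a debugger script. The sketch below builds a command list: !analyze -hang first, then the current thread and process on each processor. In the kernel debugger, ~Ns switches the context to processor N, counting from 0. The helper name is hypothetical. The idea of saving the output to a file and running it with the debugger's $$< script-file command is a suggestion; check the Debugging Tools help for the script options in your version.

```python
def hang_triage_script(num_processors: int) -> str:
    """Return debugger commands, one per line, for first-pass hang triage.

    "~Ns" switches the kernel debugger to processor N (numbered from 0); the
    !thread and !process commands then report that processor's current thread
    and process.
    """
    if num_processors < 1:
        raise ValueError("need at least one processor")
    commands = ["!analyze -hang"]
    for cpu in range(num_processors):
        commands += [f"~{cpu}s", "!thread", "!process"]
    return "\n".join(commands) + "\n"


print(hang_triage_script(2), end="")
```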
If the !analyze command doesn't pinpoint the problem, execute !thread and !process in each of the dump's CPU contexts to see what each processor is doing. (Switch CPU contexts with the ~s command; for example, use ~1s to switch to the second processor's context.) The stack trace of the crash dump you get when you crash a system experiencing the Notmyfault hang bug looks like this:
STACK_TEXT:
f9e66ed8 f9b0d681 000000e2 00000000 00000000 nt!KeBugCheckEx+0x19
f9e66ef4 f9b0cefb 0069b0d8 010000c6 00000000 i8042prt!I8xProcessCrashDump+0x235
f9e66f3c 804ebb04 81797d98 8169b020 00010009 i8042prt!I8042KeyboardInterruptService+0x21c
f9e66f3c fa12e34a 81797d98 8169b020 00010009 nt!KiInterruptDispatch+0x3d
WARNING: Stack unwind information not available. Following frames may be wrong.
ffdff980 8169b288 f9e67000 0000210f 00000004 myfault+0x34a
8054ace4 ffdff980 804ebf58 00000000 0000319c 0x8169b288
8054ace4 ffdff980 804ebf58 00000000 0000319c 0xffdff980
8169ae9c 8054ace4 f9b12b0f 8169ac88 00000000 0xffdff980
...
The top few lines of the stack trace reference the routines that execute when you type the i8042 port driver's crash key sequence. The presence of the Myfault driver indicates that it might be responsible for the hang.
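The reasoning in the previous paragraph is to skip the frames that belong to the crash path itself and look at the first module that remains. That reasoning can be written as a small heuristic. The Python sketch below assumes STACK_TEXT has one frame per line, made up of five 8-digit hex columns followed by a symbol or raw address. It also assumes that nt and i8042prt are the only expected crash-path modules. This is an illustration of the manual reasoning, not how !analyze assigns blame. As the text notes, the module it returns is only a suspect.

```python
import re
from typing import List, Optional

# Assumption: frames from these modules are the expected crash path for a
# keyboard-initiated crash (the bugcheck plus the i8042 crash-key handling).
CRASH_PATH_MODULES = {"nt", "i8042prt"}
HEX_FIELD = re.compile(r"[0-9a-fA-F]{8}")


def frame_symbols(stack_text: str) -> List[str]:
    """Return the symbol column of each STACK_TEXT frame (one frame per line)."""
    symbols = []
    for line in stack_text.splitlines():
        fields = line.split()
        # A frame line has five 32-bit hex columns followed by a symbol or address.
        if len(fields) >= 6 and all(HEX_FIELD.fullmatch(f) for f in fields[:5]):
            symbols.append(fields[5])
    return symbols


def first_suspect_module(stack_text: str) -> Optional[str]:
    """Return the first module on the stack outside the expected crash path."""
    for symbol in frame_symbols(stack_text):
        if symbol.startswith("0x"):
            continue  # bare address: no module to blame
        module = re.split(r"[!+]", symbol, maxsplit=1)[0]
        if module.lower() not in CRASH_PATH_MODULES:
            return module
    return None


stack = """STACK_TEXT:
f9e66ed8 f9b0d681 000000e2 00000000 00000000 nt!KeBugCheckEx+0x19
f9e66ef4 f9b0cefb 0069b0d8 010000c6 00000000 i8042prt!I8xProcessCrashDump+0x235
ffdff980 8169b288 f9e67000 0000210f 00000004 myfault+0x34a
"""
print(first_suspect_module(stack))  # myfault
```

For the zeroed stack from the stack-trash example, which contains only a 0x0 frame, the heuristic returns None. That matches the text's point that such a stack gives you nothing to blame directly.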
Another command that might be
In this section, we'll address how to troubleshoot systems that for some reason are not recording a crash dump. One reason a crash dump might not be recorded is that the paging file on the boot volume is too small to hold the dump, or that there is not enough free disk space to extract the dump after the reboot. These two cases can easily be remedied by either increasing the
A third reason a crash dump might not be recorded is that the kernel code and data structures needed to write the crash dump were corrupted at the time of the crash.
As described earlier, this data is checksummed when the system boots, and if the checksum computed at the time of the crash does not match, the system does not even attempt to save the crash dump (so as not to risk corrupting data on the disk). In this case, you need to catch the system as it crashes and then try to determine the reason for the crash.
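To make the boot-time versus crash-time comparison concrete, here is a toy Python model. It is not how the kernel implements the check. CRC-32 (zlib.crc32) stands in for whatever checksum the kernel actually uses, the byte string stands in for the real code and data, and the class and method names are hypothetical.

```python
import zlib


class CrashDumpPath:
    """Toy model of the crash-dump write path's self-check."""

    def __init__(self, code_and_data: bytes):
        # Taken once, at "boot", over the code and data needed to write a dump.
        self._image = bytearray(code_and_data)
        self._boot_checksum = zlib.crc32(bytes(self._image))

    def corrupt(self, offset: int, value: int) -> None:
        """Simulate a stray write into the dump path (for example, by a buggy driver)."""
        self._image[offset] = value

    def try_write_dump(self) -> bool:
        """Recompute the checksum at "crash" time; refuse to write on a mismatch."""
        if zlib.crc32(bytes(self._image)) != self._boot_checksum:
            return False  # don't risk corrupting data on the disk
        return True       # the real dump write would happen here


path = CrashDumpPath(b"disk driver and dump code")
print(path.try_write_dump())  # True
path.corrupt(0, 0x00)
print(path.try_write_dump())  # False
```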
A final reason occurs when the disk subsystem for the system disk is not able to process disk write
One simple option is to
To perform more