An OS panic is caused by an unexpected condition or kernel state that results in a voluntary kernel shutdown. We are not talking about the OS shutdown command here, but rather a condition in which kernel code finds itself calling panic(). Because a panic is a voluntary kernel shutdown, a reboot is necessary before troubleshooting can begin. By default, Linux does not reboot after a panic(): automatic reboot is controlled by /proc/sys/kernel/panic, which holds the number of seconds to wait before rebooting. The default on most Linux distributions is 0, meaning that the system does not reboot and remains in a hung state; in that case, a hardware-forced reset must be used.

Troubleshooting OS Panics

To troubleshoot an OS panic, first try to obtain a dump. Consult the console, which contains the panic string. The panic string points to the source of the panic, and from there we can determine the function calls that were in progress at the time of the panic.

Sometimes the console data is not enough. If more data is required, a dump utility must be enabled. When the kernel calls panic(), the crash dump facility takes control and writes kernel memory to a dump device. To date, this feature is not in the kernel.org source tree; instead, several competing technologies are available for obtaining a dump. For example, the SGI, SUSE, and HP Telco Linux distributions use LKCD (Linux Kernel Crash Dump), while Red Hat offers netdump and its disk-based alternative diskdump (similar to LKCD). Again, these mechanisms are triggered by a kernel panic() and depend on the dump driver supporting the underlying hardware. That said, if the system's state is too unstable (for example, held spinlocks, a compromised bus state, or a bad CPU interrupt state), these utilities might not be able to save a kernel dump. Unlike other flavors of UNIX, the Linux kernel does not have a native dump mechanism; in Linux, a kernel dump results from a panic only when one of the aforementioned facilities is enabled.
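The panic timeout can also be made persistent across reboots with a sysctl setting; the following is a minimal sketch (the 30-second value is only an illustrative choice):

```
# /etc/sysctl.conf -- reboot 30 seconds after a kernel panic (example value)
kernel.panic = 30
```

The same value can be applied at runtime with sysctl -w kernel.panic=30 or by writing it directly to /proc/sys/kernel/panic; setting it back to 0 restores the default hang-on-panic behavior.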
Scenario 2-3: Users Experience Multiple OS Panics

In this case, the system administrator has a machine that has been in service for some time. This is her primary production machine, and she needs to add a new PCI host bus adapter (HBA). Instead of confirming that the kernel is at the supported patch level or, for that matter, confirming that the machine boots properly after having installed 250 package updates two weeks earlier, she decides simply to shut down the system and install the new card. Because this will affect production, management has allotted 30 minutes to perform the hardware addition, after which the machine must be back online.

After shutting down the system and adding the hardware, the system administrator gets the machine to boot with no errors. After a few minutes pass, however, she notices that the machine is no longer responding, and the console shows that it has panicked. Because production has been impacted, managers become involved, and the system administrator is under pressure to get the machine stabilized. Because the machine had not been rebooted since the last package updates were installed, it is very difficult to determine whether the PCI card is causing the problem.

The first step is to review the stack trace in an attempt to isolate the code section that triggered the panic. Stack traces appear like this:

Bad slab found on cache free list
slab 0xf53d8580: next 0xf7f7a0b0, prev 0xf7f7a0b0, mem 0xf43ef000
     colouroff 0x0000, inuse 0xffffffe3, free 0x0000
cache 0xf7f7a0a0 ("names_cache"):
     full 0xf7f788a0 <-> 0xf53d85a0
     partial 0xf7f7a0a8 <-> 0xf7f7a0a8
     free 0xf53d8580 <-> 0xf53d8580
     next 0xf7f7a200 <-> 0xf7f7bf38
     objsize 0x1000, flags 0x12000, num 0x0001, batchcount 0x001e
     order 0, gfp 0x0000, colour 0x0000:0x0020:0x0000
     slabcache 0xf7f7c060, growing 0, dflags 0x0001, failures 0
kernel BUG at slab.c:2010!
invalid operand: 0000
Kernel 2.4.9-e.49smp
CPU:    1
EIP:    0010:[<c0138d95>]    Tainted: P
EFLAGS: 00010082
EIP is at proc_getdata [kernel] 0x145
eax: 0000001e   ebx: f53d8580   ecx: c02f8b24   edx: 000054df
esi: f7f7a0a0   edi: 0000085e   ebp: 00000013   esp: f42ffef8
ds: 0018   es: 0018   ss: 0018
Process bgscollect (pid: 2933, stackpage=f42ff000)
Stack: c0267dfb 000007da 00000000 00000013 f42fff68 f8982000 00000c00 00000000
       c0138eec f8982000 f42fff68 00000000 00000c00 f8982000 00000c00 00000000
       c0169e8a f8982000 f42fff68 00000000 00000c00 f42fff64 00000000 f42fe000
Call Trace: [<c0267dfb>] .rodata.str1.1 [kernel] 0x2c16 (0xf42ffef8)
[<c0138eec>] slabinfo_read_proc [kernel] 0x1c (0xf42fff18)
[<c0169e8a>] proc_file_read [kernel] 0xda (0xf42fff38)
[<c0146296>] sys_read [kernel] 0x96 (0xf42fff7c)
[<c01073e3>] system_call [kernel] 0x33 (0xf42fffc0)

Immediately, we can see that this is a tainted kernel and that the module that tainted it is proprietary in nature. This module might be the culprit; however, because the machine has been in production for a while, it would be premature to blame the panic on the driver module. What we do know is that the panic occurred because of memory corruption, which could be hardware or software related.
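The letters in the "Tainted" field are rendered from the kernel's taint bitmask, visible in /proc/sys/kernel/tainted. As a rough sketch of how the low-order bits map to letters (the mapping shown follows the 2.4-era kernel and should be treated as illustrative), the mask can be decoded like this:

```python
# Decode the low-order kernel taint bits (2.4-era flags; illustrative mapping).
TAINT_FLAGS = [
    (1, "P", "proprietary (non-GPL) module loaded"),
    (2, "F", "module was forcibly loaded"),
    (4, "S", "SMP kernel running on unsafe CPUs"),
]

def decode_taint(mask):
    """Return the (letter, description) pairs set in a taint mask."""
    return [(letter, desc) for bit, letter, desc in TAINT_FLAGS if mask & bit]

# "Tainted: P" corresponds to mask 1; "Tainted: PF" (seen later) to mask 3.
print(decode_taint(1))
print(decode_taint(3))
```

Here a mask of 1 yields only the "P" (proprietary module) flag, matching the trace above.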
Continuing with troubleshooting, we note that additional stack traces from the console appear like this:

12:40:50: ds: 0018   es: 0018   ss: 0018
12:40:50: Process kswapd (pid: 10, stackpage=f7f29000)
12:40:50: Stack: c0267dfb 00000722 00000000 f7f7a0b0 f7f7a0a8 c0137c13 c5121760 00000005
12:40:50:        00000000 00000000 00000000 00000018 000000c0 00000000 0008e000 c013ca6f
12:40:51:        000000c0 00000000 00000001 00000000 c013cb83 000000c0 00000000 c0105000
12:40:51: Call Trace: [<c0267dfb>] .rodata.str1.1 [kernel] 0x2c16 (0xf7f29f78)
12:40:51: [<c0137c13>] kmem_cache_shrink_nr [kernel] 0x53 (0xf7f29f8c)
12:40:51: [<c013ca6f>] do_try_to_free_pages [kernel] 0x7f (0xf7f29fb4)
12:40:51: [<c013cb83>] kswapd [kernel] 0x103 (0xf7f29fc8)
12:40:51: [<c0105000>] stext [kernel] 0x0 (0xf7f29fd4)
12:40:51: [<c0105000>] stext [kernel] 0x0 (0xf7f29fec)
12:40:51: [<c0105856>] arch_kernel_thread [kernel] 0x26 (0xf7f29ff0)
12:40:51: [<c013ca80>] kswapd [kernel] 0x0 (0xf7f29ff8)
12:40:51:
12:40:51: Code: 0f 0b 58 5a 8b 03 45 39 f8 75 dd 8b 4e 2c 89 ea 8b 7e 4c d3
12:40:51: <0>Kernel panic: not continuing
12:40:51: Uhhuh. NMI received for unknown reason 30.
12:51:30: Dazed and confused, but trying to continue.
12:51:30: Do you have a strange power saving mode enabled?

It is difficult to identify exactly what is causing the problem here; however, because an NMI accompanied the panic, the problem is probably hardware related. The kernel error message "NMI received for unknown reason" informs us that the system administrator has set up NMI in case of a hardware hang. Looking through the source, we find this message in linux/arch/i386/kernel/traps.c. The following is a snapshot of the source:

...
static void unknown_nmi_error(unsigned char reason, struct pt_regs *regs)
{
#ifdef CONFIG_MCA
	/* Might actually be able to figure out what the guilty party
	 * is. */
	if (MCA_bus) {
		mca_handle_nmi();
		return;
	}
#endif
	printk("Uhhuh. NMI received for unknown reason %02x.\n", reason);
	printk("Dazed and confused, but trying to continue\n");
	printk("Do you have a strange power saving mode enabled?\n");
}
...

Solution 2-3: Replace the PCI Card

Because the HBA was new to the environment, and because replacing it was easier and faster than digging through the stacks and debugging each crash, we suggested that the administrator simply replace the card with a new HBA. After the PCI card was replaced, the kernel no longer experienced panics.

Troubleshooting Panics Resulting from an Oops

It is possible for an oops to cause an OS panic. Sometimes applications attempt to use invalid pointers, causing a fault in the kernel code servicing them. The kernel identifies and kills the process that called into the kernel and lists its stack, memory addresses, and kernel register values. This event is known as a kernel oops. Usually the result of bad code, an oops is traditionally decoded with the ksymoops command; in today's Linux distributions, klogd uses the kernel's symbol table to decode the oops and pass it to the syslog daemon, which in turn writes it to the message file (normally /var/log/messages).

If the oops does not occur inside an interrupt handler, the kernel kills only the offending process and the OS does not panic. However, this does not mean that the kernel is safe to use: the program may simply have made a bad code reference, but it is also possible for the application to leave the kernel in such a state that more oopses follow. If this occurs, focus on the first oops rather than the subsequent ones. To avoid running the machine in this relatively unstable state, enable the "panic on oops" option, controlled by the file /proc/sys/kernel/panic_on_oops. Of course, the next time the kernel encounters any kind of oops (whether or not it occurs in an interrupt handler), it panics, and if the dump utilities are enabled, a dump that can be analyzed is produced.
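Assuming the kernel exposes this knob through sysctl (as the /proc path above indicates), panic-on-oops can be enabled persistently with a minimal config fragment like this:

```
# /etc/sysctl.conf -- turn any oops into a panic so a crash dump can be taken
kernel.panic_on_oops = 1
```

At runtime, the equivalent is sysctl -w kernel.panic_on_oops=1, or echo 1 > /proc/sys/kernel/panic_on_oops.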
Scenario 2-4: An Oops Causes Frequent System Panics

In this scenario, an application has been performing many NULL pointer dereferences (oopses), and because the system administrator has configured the kernel to panic on oops, each oops results in a panic. The oops seems to occur only when the system is under heavy load. The heavy load is caused by an application called VMware, a product that creates a virtual machine running another OS. In this case, the system administrator is running several virtual machines under the Linux kernel. We gather the VMware version from the customer along with the kernel version (also noted in the oops). The next step is to review the logs and screen dumps. The dump details are as follows:

Unable to handle kernel NULL pointer dereference at virtual address 00000084
*pde = 20ffd001
Oops: 0000
Kernel 2.4.9-e.38enterprise
CPU:    4
EIP:    0010:[<c0138692>]    Tainted: PF
EFLAGS: 00013002
EIP is at do_ccupdate_local [kernel] 0x22
eax: 00000000   ebx: 00000004   ecx: f7f15efc   edx: c9cc8000
esi: 00000080   edi: c9cc8000   ebp: c0105420   esp: c9cc9f60
ds: 0018   es: 0018   ss: 0018
Process swapper (pid: 0, stackpage=c9cc9000)
Stack: c9cc8000 c9cc8000 c9cc8000 c0113bef f7f15ef8 c0105420 c02476da c0105420
       c9cc8000 00000004 c9cc8000 c9cc8000 c0105420 00000000 c9cc0018 c9cc0018
       fffffffa c010544e 00000010 00003246 c01054b2 0402080c 00000000 00000000
Call Trace: [<c0113bef>] smp_call_function_interrupt [kernel] 0x2f (0xc9cc9f6c)
[<c0105420>] default_idle [kernel] 0x0 (0xc9cc9f74)
[<c02476da>] call_call_function_interrupt [kernel] 0x5 (0xc9cc9f78)
[<c0105420>] default_idle [kernel] 0x0 (0xc9cc9f7c)
[<c0105420>] default_idle [kernel] 0x0 (0xc9cc9f90)
[<c010544e>] default_idle [kernel] 0x2e (0xc9cc9fa4)
[<c01054b2>] cpu_idle [kernel] 0x32 (0xc9cc9fb0)
[<c011ceb8>] printk [kernel] 0xd8 (0xc9cc9fd0)
[<c0265e4a>] .rodata.str1.1 [kernel] 0xd25 (0xc9cc9fe4)
Code: 8b 3c 1e 89 04 1e 8b 42 20 89 3c 81 5b 5e 5f c3 8d b4 26 00
LLT:10035: timer not called for 122 ticks
Kernel panic: not continuing
In idle task - not syncing

Interesting: an "oops" in the swapper code. This does not sound right, because virtually no Linux machine panics because of swapper. So, in this case, we assume that some other software bundle or hardware exception is causing the anomaly. Notice that this kernel is tainted ("PF") because a proprietary module has been forcibly loaded. While we are researching the traces and studying the source, the machine panics again. The details of the subsequent kernel panic follow:

Unable to handle kernel NULL pointer dereference at virtual address 00000074
*pde = 24c6b001
Oops: 0000
Kernel 2.4.9-e.38enterprise
CPU:    0
EIP:    0010:[<c0138692>]    Tainted: PF
EFLAGS: 00013002
EIP is at do_ccupdate_local [kernel] 0x22
eax: 00000000   ebx: 00000004   ecx: c9cedefc   edx: e7998000
esi: 00000070   edi: 0000013b   ebp: e7999f90   esp: e7999df8
ds: 0018   es: 0018   ss: 0018
Process vmware (pid: 9131, stackpage=e7999000)
Stack: e7998000 e7998000 0000013b c0113bef c9cedef8 00000000 c02476da 00000000
       efa1fca0 c0335000 e7998000 0000013b e7999f90 00000000 e7990018 f90c0018
       fffffffa f90cc2a2 00000010 00003286 00003002 ca212014 00000001 e6643180
Call Trace: [<c0113bef>] smp_call_function_interrupt [kernel] 0x2f (0xe7999e04)
[<c02476da>] call_call_function_interrupt [kernel] 0x5 (0xe7999e10)
[<f90cc2a2>] .text.lock [vmmon] 0x86 (0xe7999e3c)
[<c0119af2>] __wake_up [kernel] 0x42 (0xe7999e5c)
[<c01da9a9>] sock_def_readable [kernel] 0x39 (0xe7999e84)
[<c021e0ad>] unix_stream_sendmsg [kernel] 0x27d (0xe7999ea0)
[<c01d7a21>] sock_recvmsg [kernel] 0x31 (0xe7999ed0)
[<c01d79cc>] sock_sendmsg [kernel] 0x6c (0xe7999ee4)
[<c01d7bf7>] sock_write [kernel] 0xa7 (0xe7999f38)
[<c0146d36>] sys_write [kernel] 0x96 (0xe7999f7c)
[<c0156877>] sys_ioctl [kernel] 0x257 (0xe7999f94)
[<c01073e3>] system_call [kernel] 0x33 (0xe7999fc0)
Code: 8b 3c 1e 89 04 1e 8b 42 20 89 3c 81 5b 5e 5f c3 8d b4 26 00
<4>rtc: lost some interrupts at 256Hz.
Kernel panic: not continuing
[Tue Jul 27 LLT:10035: timer not called for 150 ticks

The second panic reveals that the machine was in VMware code and on a different CPU.

Solution 2-4: Install a Patch

It took the combined efforts of many engineers to isolate and fix this scenario. The problem was found to reside in smp_call_function() in the Red Hat Advanced Server 2.1 release, which used the 2.4.9 kernel. It turns out that the Linux kernel available at http://www.kernel.org did not contain the bug, so no other distributions experienced the issue. The Red Hat team, with the assistance of the VMware software team, provided a fix for the condition and resolved the oops panics.