An OS panic is caused by an unexpected condition or kernel state that results in a voluntary kernel shutdown. We are not talking about the OS shutdown command here, but rather a condition in which kernel code finds itself calling panic(). Because a panic is a voluntary kernel shutdown, a reboot is necessary before troubleshooting can begin. By default, Linux does not reboot after a panic(): automatic reboot is controlled by /proc/sys/kernel/panic, which holds the number of seconds to wait before rebooting. The default on most Linux distributions is 0, meaning that the system does not reboot and remains in a hung state; in that case, a hardware-forced reset must be used.

Troubleshooting OS Panics

To troubleshoot an OS panic, first try to obtain a dump. Consult the console, which contains the panic string. The panic string points to the source of the panic, and from there we can determine the function calls that were in progress at the time of the panic.

Sometimes the console data is not enough. If more data is required, a dump utility must be enabled. When the kernel calls panic(), the crash dump facility takes control and writes kernel memory to a dump device. To date, this feature is not in the kernel.org source tree; instead, several competing technologies are available for obtaining a dump. For example, the SGI, SUSE, and HP Telco Linux distributions use LKCD (Linux Kernel Crash Dump), while Red Hat offers netdump and its disk-based alternative diskdump (similar to LKCD). Again, these mechanisms are triggered by a kernel panic() and depend on the dump driver supporting the underlying hardware. That said, if the system's state is too unstable (for example, held spinlocks, a compromised bus state, or a bad CPU interrupt state), these utilities might not be able to save a kernel dump. Unlike other flavors of UNIX, the Linux kernel does not have a native dump mechanism; in Linux, a kernel dump results from a panic only when one of the aforementioned facilities is enabled.
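The panic timeout can also be made persistent across reboots with a sysctl setting; the following is a minimal sketch (the 30-second value is only an illustrative choice):

```
# /etc/sysctl.conf -- reboot 30 seconds after a kernel panic (example value)
kernel.panic = 30
```

The same value can be applied at runtime with sysctl -w kernel.panic=30 or by writing it directly to /proc/sys/kernel/panic; setting it back to 0 restores the default hang-on-panic behavior.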
Scenario 2-3: Users Experience Multiple OS Panics

In this case, the system administrator has a machine that has been in service for some time. This is her primary production machine, and she needs to add a new PCI host bus adapter (HBA). Instead of confirming that the kernel is at the supported patch level or, for that matter, confirming that the machine boots properly after having installed 250 package updates two weeks earlier, she decides simply to shut down the system and install the new card. Because this will affect production, management has allotted 30 minutes to perform the hardware addition, after which the machine must be back online.

After shutting down the system and adding the hardware, the system administrator gets the machine to boot with no errors. After a few minutes pass, however, she notices that the machine is no longer responding, and the console shows that it has panicked. Because production has been impacted, managers become involved, and the system administrator is under pressure to get the machine stabilized. Because the machine had not been rebooted since the last package updates were installed, it is very difficult to determine whether the PCI card is causing the problem.

The first step is to review the stack trace in an attempt to isolate the code section that triggered the panic. Stack traces appear like this:

Bad slab found on cache free list
slab 0xf53d8580: next 0xf7f7a0b0, prev 0xf7f7a0b0, mem 0xf43ef000
     colouroff 0x0000, inuse 0xffffffe3, free 0x0000
cache 0xf7f7a0a0 ("names_cache"):
     full 0xf7f788a0 <-> 0xf53d85a0
     partial 0xf7f7a0a8 <-> 0xf7f7a0a8
     free 0xf53d8580 <-> 0xf53d8580
     next 0xf7f7a200 <-> 0xf7f7bf38
     objsize 0x1000, flags 0x12000, num 0x0001, batchcount 0x001e
     order 0, gfp 0x0000, colour 0x0000:0x0020:0x0000
     slabcache 0xf7f7c060, growing 0, dflags 0x0001, failures 0
kernel BUG at slab.c:2010!
invalid operand: 0000
Kernel 2.4.9-e.49smp
CPU:    1
EIP:    0010:[<c0138d95>]    Tainted: P
EFLAGS: 00010082
EIP is at proc_getdata [kernel] 0x145
eax: 0000001e   ebx: f53d8580   ecx: c02f8b24   edx: 000054df
esi: f7f7a0a0   edi: 0000085e   ebp: 00000013   esp: f42ffef8
ds: 0018   es: 0018   ss: 0018
Process bgscollect (pid: 2933, stackpage=f42ff000)
Stack: c0267dfb 000007da 00000000 00000013 f42fff68 f8982000 00000c00 00000000
       c0138eec f8982000 f42fff68 00000000 00000c00 f8982000 00000c00 00000000
       c0169e8a f8982000 f42fff68 00000000 00000c00 f42fff64 00000000 f42fe000
Call Trace: [<c0267dfb>] .rodata.str1.1 [kernel] 0x2c16 (0xf42ffef8)
[<c0138eec>] slabinfo_read_proc [kernel] 0x1c (0xf42fff18)
[<c0169e8a>] proc_file_read [kernel] 0xda (0xf42fff38)
[<c0146296>] sys_read [kernel] 0x96 (0xf42fff7c)
[<c01073e3>] system_call [kernel] 0x33 (0xf42fffc0)

Immediately, we can see that this is a tainted kernel and that the module that tainted it is proprietary in nature. This module might be the culprit; however, because the machine has been in production for a while, it would be premature to blame the panic on the driver module. What we do know is that the panic occurred because of memory corruption, which could be hardware or software related.
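The letters in the "Tainted" field are rendered from the kernel's taint bitmask, visible in /proc/sys/kernel/tainted. As a rough sketch of how the low-order bits map to letters (the mapping shown follows the 2.4-era kernel and should be treated as illustrative), the mask can be decoded like this:

```python
# Decode the low-order kernel taint bits (2.4-era flags; illustrative mapping).
TAINT_FLAGS = [
    (1, "P", "proprietary (non-GPL) module loaded"),
    (2, "F", "module was forcibly loaded"),
    (4, "S", "SMP kernel running on unsafe CPUs"),
]

def decode_taint(mask):
    """Return the (letter, description) pairs set in a taint mask."""
    return [(letter, desc) for bit, letter, desc in TAINT_FLAGS if mask & bit]

# "Tainted: P" corresponds to mask 1; "Tainted: PF" (seen later) to mask 3.
print(decode_taint(1))
print(decode_taint(3))
```

Here a mask of 1 yields only the "P" (proprietary module) flag, matching the trace above.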
Continuing with troubleshooting, we note that additional stack traces from the console appear like this:

12:40:50: ds: 0018   es: 0018   ss: 0018
12:40:50: Process kswapd (pid: 10, stackpage=f7f29000)
12:40:50: Stack: c0267dfb 00000722 00000000 f7f7a0b0 f7f7a0a8 c0137c13 c5121760 00000005
12:40:50:        00000000 00000000 00000000 00000018 000000c0 00000000 0008e000 c013ca6f
12:40:51:        000000c0 00000000 00000001 00000000 c013cb83 000000c0 00000000 c0105000
12:40:51: Call Trace: [<c0267dfb>] .rodata.str1.1 [kernel] 0x2c16 (0xf7f29f78)
12:40:51: [<c0137c13>] kmem_cache_shrink_nr [kernel] 0x53 (0xf7f29f8c)
12:40:51: [<c013ca6f>] do_try_to_free_pages [kernel] 0x7f (0xf7f29fb4)
12:40:51: [<c013cb83>] kswapd [kernel] 0x103 (0xf7f29fc8)
12:40:51: [<c0105000>] stext [kernel] 0x0 (0xf7f29fd4)
12:40:51: [<c0105000>] stext [kernel] 0x0 (0xf7f29fec)
12:40:51: [<c0105856>] arch_kernel_thread [kernel] 0x26 (0xf7f29ff0)
12:40:51: [<c013ca80>] kswapd [kernel] 0x0 (0xf7f29ff8)
12:40:51:
12:40:51: Code: 0f 0b 58 5a 8b 03 45 39 f8 75 dd 8b 4e 2c 89 ea 8b 7e 4c d3
12:40:51: <0>Kernel panic: not continuing
12:40:51: Uhhuh. NMI received for unknown reason 30.
12:51:30: Dazed and confused, but trying to continue.
12:51:30: Do you have a strange power saving mode enabled?

It is difficult to identify exactly what is causing the problem here; however, because an NMI accompanied the panic, the problem is probably hardware related. The kernel error message "NMI received for unknown reason" informs us that the system administrator has set up NMI in case of a hardware hang. Looking through the source, we find this message in linux/arch/i386/kernel/traps.c. The following is a snapshot of the source:

...
static void unknown_nmi_error(unsigned char reason, struct pt_regs *regs)
{
#ifdef CONFIG_MCA
	/* Might actually be able to figure out what the guilty party
	 * is. */
	if (MCA_bus) {
		mca_handle_nmi();
		return;
	}
#endif
	printk("Uhhuh. NMI received for unknown reason %02x.\n", reason);
	printk("Dazed and confused, but trying to continue\n");
	printk("Do you have a strange power saving mode enabled?\n");
}
...

Solution 2-3: Replace the PCI Card

Because the HBA was new to the environment, and because replacing it was easier and faster than digging through the stacks and debugging each crash, we suggested that the administrator simply replace the card with a new HBA. After the PCI card was replaced, the kernel no longer experienced panics.

Troubleshooting Panics Resulting from an Oops

It is possible for an oops to cause an OS panic. Sometimes applications attempt to use invalid pointers, causing a fault in the kernel code servicing them. The kernel identifies and kills the process that called into the kernel and lists its stack, memory addresses, and kernel register values. This event is known as a kernel oops. Usually the result of bad code, an oops is traditionally decoded with the ksymoops command; in today's Linux distributions, klogd uses the kernel's symbol table to decode the oops and pass it to the syslog daemon, which in turn writes it to the message file (normally /var/log/messages).

If the oops does not occur inside an interrupt handler, the kernel kills only the offending process and the OS does not panic. However, this does not mean that the kernel is safe to use: the program may simply have made a bad code reference, but it is also possible for the application to leave the kernel in such a state that more oopses follow. If this occurs, focus on the first oops rather than the subsequent ones. To avoid running the machine in this relatively unstable state, enable the "panic on oops" option, controlled by the file /proc/sys/kernel/panic_on_oops. Of course, the next time the kernel encounters any kind of oops (whether or not it occurs in an interrupt handler), it panics, and if the dump utilities are enabled, a dump that can be analyzed is produced.
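Assuming the kernel exposes this knob through sysctl (as the /proc path above indicates), panic-on-oops can be enabled persistently with a minimal config fragment like this:

```
# /etc/sysctl.conf -- turn any oops into a panic so a crash dump can be taken
kernel.panic_on_oops = 1
```

At runtime, the equivalent is sysctl -w kernel.panic_on_oops=1, or echo 1 > /proc/sys/kernel/panic_on_oops.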
Scenario 2-4: An Oops Causes Frequent System Panics

In this scenario, an application has been performing many NULL pointer dereferences (oopses), and because the system administrator has configured the kernel to panic on oops, each oops results in a panic. The oops seems to occur only when the system is under heavy load. The heavy load is caused by an application called VMware, a product that creates a virtual machine running another OS. In this case, the system administrator is running several virtual machines under the Linux kernel. We gather the VMware version from the customer along with the kernel version (also noted in the oops). The next step is to review the logs and screen dumps. The dump details are as follows:

Unable to handle kernel NULL pointer dereference at virtual address 00000084
*pde = 20ffd001
Oops: 0000
Kernel 2.4.9-e.38enterprise
CPU:    4
EIP:    0010:[<c0138692>]    Tainted: PF
EFLAGS: 00013002
EIP is at do_ccupdate_local [kernel] 0x22
eax: 00000000   ebx: 00000004   ecx: f7f15efc   edx: c9cc8000
esi: 00000080   edi: c9cc8000   ebp: c0105420   esp: c9cc9f60
ds: 0018   es: 0018   ss: 0018
Process swapper (pid: 0, stackpage=c9cc9000)
Stack: c9cc8000 c9cc8000 c9cc8000 c0113bef f7f15ef8 c0105420 c02476da c0105420
       c9cc8000 00000004 c9cc8000 c9cc8000 c0105420 00000000 c9cc0018 c9cc0018
       fffffffa c010544e 00000010 00003246 c01054b2 0402080c 00000000 00000000
Call Trace: [<c0113bef>] smp_call_function_interrupt [kernel] 0x2f (0xc9cc9f6c)
[<c0105420>] default_idle [kernel] 0x0 (0xc9cc9f74)
[<c02476da>] call_call_function_interrupt [kernel] 0x5 (0xc9cc9f78)
[<c0105420>] default_idle [kernel] 0x0 (0xc9cc9f7c)
[<c0105420>] default_idle [kernel] 0x0 (0xc9cc9f90)
[<c010544e>] default_idle [kernel] 0x2e (0xc9cc9fa4)
[<c01054b2>] cpu_idle [kernel] 0x32 (0xc9cc9fb0)
[<c011ceb8>] printk [kernel] 0xd8 (0xc9cc9fd0)
[<c0265e4a>] .rodata.str1.1 [kernel] 0xd25 (0xc9cc9fe4)
Code: 8b 3c 1e 89 04 1e 8b 42 20 89 3c 81 5b 5e 5f c3 8d b4 26 00
LLT:10035: timer not called for 122 ticks
Kernel panic: not continuing
In idle task - not syncing

Interesting: an "oops" in the swapper code. This does not sound right, because virtually no Linux machine panics because of swapper. So, in this case, we assume that some other software bundle or hardware exception is causing the anomaly. Notice that this kernel is tainted ("PF") because a proprietary module has been forcibly loaded. While we are researching the traces and studying the source, the machine panics again. The details of the subsequent kernel panic follow:

Unable to handle kernel NULL pointer dereference at virtual address 00000074
*pde = 24c6b001
Oops: 0000
Kernel 2.4.9-e.38enterprise
CPU:    0
EIP:    0010:[<c0138692>]    Tainted: PF
EFLAGS: 00013002
EIP is at do_ccupdate_local [kernel] 0x22
eax: 00000000   ebx: 00000004   ecx: c9cedefc   edx: e7998000
esi: 00000070   edi: 0000013b   ebp: e7999f90   esp: e7999df8
ds: 0018   es: 0018   ss: 0018
Process vmware (pid: 9131, stackpage=e7999000)
Stack: e7998000 e7998000 0000013b c0113bef c9cedef8 00000000 c02476da 00000000
       efa1fca0 c0335000 e7998000 0000013b e7999f90 00000000 e7990018 f90c0018
       fffffffa f90cc2a2 00000010 00003286 00003002 ca212014 00000001 e6643180
Call Trace: [<c0113bef>] smp_call_function_interrupt [kernel] 0x2f (0xe7999e04)
[<c02476da>] call_call_function_interrupt [kernel] 0x5 (0xe7999e10)
[<f90cc2a2>] .text.lock [vmmon] 0x86 (0xe7999e3c)
[<c0119af2>] __wake_up [kernel] 0x42 (0xe7999e5c)
[<c01da9a9>] sock_def_readable [kernel] 0x39 (0xe7999e84)
[<c021e0ad>] unix_stream_sendmsg [kernel] 0x27d (0xe7999ea0)
[<c01d7a21>] sock_recvmsg [kernel] 0x31 (0xe7999ed0)
[<c01d79cc>] sock_sendmsg [kernel] 0x6c (0xe7999ee4)
[<c01d7bf7>] sock_write [kernel] 0xa7 (0xe7999f38)
[<c0146d36>] sys_write [kernel] 0x96 (0xe7999f7c)
[<c0156877>] sys_ioctl [kernel] 0x257 (0xe7999f94)
[<c01073e3>] system_call [kernel] 0x33 (0xe7999fc0)
Code: 8b 3c 1e 89 04 1e 8b 42 20 89 3c 81 5b 5e 5f c3 8d b4 26 00
<4>rtc: lost some interrupts at 256Hz.
Kernel panic: not continuing
[Tue Jul 27 LLT:10035: timer not called for 150 ticks

The second panic reveals that the machine was in VMware code and on a different CPU.

Solution 2-4: Install a Patch

It took the combined efforts of many engineers to isolate and fix this scenario. The problem was found to reside in smp_call_function() in the Red Hat Advanced Server 2.1 release, which used the 2.4.9 kernel. It turns out that the Linux kernel available at http://www.kernel.org did not contain the bug, so no other distributions experienced the issue. The Red Hat team, with the assistance of the VMware software team, provided a fix for the condition and resolved the oops panics.