How do you know if your system has panic'ed? | PANIC! UNIX System Crash Dump Analysis Handbook (Bk/CD-ROM)

How do you know if your system has panic'ed?

If your system panic'ed and rebooted while no one was there to witness it, for example at 4 a.m. on a Sunday, you may notice that the uptime of the system is not what you expect it to be. Also, on some UNIX systems, the last command might show entries such as:

 kbrown    console        Thu Jan 20 20:03 - crash
 root      /dev/ttya      Wed Jan 19 16:40 - crash

These entries are a fairly reliable indication that the system crashed while folks were logged in.
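As a rough illustration, the search for crash-terminated sessions could be automated along the following lines. This is a minimal sketch in Python, not part of any UNIX toolset: the function name crash_sessions is hypothetical, and the assumed column layout (user, terminal, start time, then "- crash") is taken only from the two sample entries above. Real last output varies between systems and may carry extra columns.

```python
import re

# Assumed layout: user, terminal, start time, then "- crash".
# Derived only from the two sample entries; adjust for your system.
CRASH_RE = re.compile(
    r"^(?P<user>\S+)\s+(?P<tty>\S+)\s+(?P<start>.+?)\s+-\s+crash\b"
)

def crash_sessions(last_output):
    """Return (user, tty, start) for each login session that ended in a crash."""
    sessions = []
    for line in last_output.splitlines():
        m = CRASH_RE.match(line.strip())
        if m:
            sessions.append((m.group("user"), m.group("tty"), m.group("start")))
    return sessions

# The two sample entries shown above, one per line.
SAMPLE = """\
kbrown    console        Thu Jan 20 20:03 - crash
root      /dev/ttya      Wed Jan 19 16:40 - crash
"""

for user, tty, start in crash_sessions(SAMPLE):
    print(user, tty, start)
```

Sessions that ended normally do not contain the word crash after the dash, so they are skipped.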

If you were logged in when the crash occurred, you will find that you are no longer logged in. If you suspect that the system panic'ed and you have set up your system to capture system crash dumps, you'll find new entries in the savecore directory, assuming everything went well. If the disk was full, however, you will not find new dump files. Again, we will talk about savecore in greater detail in Chapter 3.

During a panic, the system no longer functions as expected. Those logged in will get no response from the system. Those utilizing data via NFS and other network-based data retrieval systems will no longer have access to that data. Only the person sitting at the system console will see actual evidence as it happens that the system is panic'ing.

During the panic, the system console displays some information about why the system is panic'ing. This information alone, however, is only partially useful. The contents of memory, now being safely stored onto the dump device by the panic() routine, will later be a critical piece of the overall puzzle as to why the panic occurred.

While sitting at the system console during a system panic caused by a bad trap condition, you will see something like the following.

Figure 1-1 Example of console messages seen during a panic triggered by a bad trap

 BAD TRAP
 sh: Data fault
 kernel read fault at addr=0x0, pme=0x0
 Sync Error Reg 80<INVALID>
 pid=556, pc=0xf000aaa8, sp=0xf0331670, psr=0x4000c4, context=3
 g1-g7: 0, 0, ffffff80, 0, f03319e0, 1, ff467800
 Begin traceback... sp = f0331670
 Called from f0050668, fp=f03317e0, args=f0331844 0 f033184c 0 0 ff35be08
 Called from f0093b68, fp=f0331850, args=0 0 1 0 f03318b4 f00c5b70
 Called from f00245e4, fp=f03318b8, args=f0331e94 f0331920 0 0 4f074 f00b5218
 Called from f0005acc,fp=f0331938, args=f00bc334 f0331eb4 0 f0331e90 fffffffc ffffffff
 Called from 13c24, fp=effff678, args=4f074 effff6d8 3a 2f 1 4dc00
 End traceback...
 panic: Data fault
 syncing file systems... done
       static and sysmap kernel pages
    56 dynamic kernel data pages
   168 kernel-pageable pages
     0 segkmap kernel pages
     0 segvn kernel pages
    51 current user process pages
       total pages (1892 chunks)
 dumping to vp ff1e9d84, offset 116888
 rebooting...

The panic sequence consists of:

The actual panic message
A stack traceback if a bad trap occurred
Dump messages
Reboot or reboot attempt

Let's talk about each of these.

Panic messages

Again, depending on the system programmer and the current operation, some panic messages are quite brief, whereas others provide great detail. Sometimes you will see messages that include the name of the calling program, the variables in use, as well as the line number of the source! Others might simply be a cryptic word that only the programmer will easily recognize.

The example above shows that the program sh, the Bourne shell, which was running as process ID 556, generated a bad trap. Specifically, the trap was a data fault, in this case an illegal attempt by the kernel to read memory address 0x0. This illegal action triggered the bad trap and panic.

This is an easy panic to force by altering a critical value in the kernel, rootdir, while the system is running. Later on, we will cause a similar panic and use it as a practice system crash dump for analysis.
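If the console output has been captured, for example from a serial console log, the fields discussed above can be pulled out mechanically. The following Python sketch is only an illustration: the function name parse_bad_trap is hypothetical, the patterns are inferred from the single example in Figure 1-1, and it assumes the console prints one message per line. Other trap types or kernel releases may word these lines differently.

```python
import re

# Header lines from the bad trap example in Figure 1-1, one message per line.
HEADER = """\
BAD TRAP
sh: Data fault
kernel read fault at addr=0x0, pme=0x0
Sync Error Reg 80<INVALID>
pid=556, pc=0xf000aaa8, sp=0xf0331670, psr=0x4000c4, context=3
"""

def parse_bad_trap(text):
    """Pull the program name, trap type, fault address, pid and pc from the header."""
    info = {}
    # The line after "BAD TRAP" names the running program and the trap type.
    m = re.search(r"BAD TRAP[ \t]*\n[ \t]*(\S+):[ \t]*(.+)", text)
    if m:
        info["program"] = m.group(1)
        info["trap"] = m.group(2).strip()
    # Kind of access and the address the kernel tried to touch.
    m = re.search(r"kernel (\w+) fault at addr=0x([0-9a-fA-F]+)", text)
    if m:
        info["access"] = m.group(1)
        info["addr"] = int(m.group(2), 16)
    m = re.search(r"\bpid=(\d+)", text)
    if m:
        info["pid"] = int(m.group(1))
    m = re.search(r"\bpc=0x([0-9a-fA-F]+)", text)
    if m:
        info["pc"] = int(m.group(1), 16)
    return info

info = parse_bad_trap(HEADER)
print(info["program"], info["trap"], info["access"], hex(info["addr"]), info["pid"])
```

Any field whose pattern does not match is simply left out of the result, so a caller should check for missing keys instead of assuming all five are present.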

Stack traceback

panic() shows the current stack traceback if a bad trap occurred. This is a history of sorts, showing the hexadecimal addresses of the routines that were called by other routines, working from the most recent kernel routine down to the least recently called, usually a system call or an interrupt handler. Shown along with the addresses of the routines will be the calling parameters used, again in hexadecimal. It won't be until we look at the crash's savecore files that we will know which routines were at those addresses and thus in use at the time of the crash.

The stack traceback only goes back to the point where the kernel was most recently entered. A stack traceback will not show the routines in use by the application that made the system call. To find out what application was actually in use, we will examine the user area, executing threads, and process structures.
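To make the structure of the traceback concrete, here is a minimal Python sketch that splits each "Called from" line into its address, frame pointer, and hexadecimal arguments. The function name parse_traceback is hypothetical and the line format is assumed from Figure 1-1 alone. The sketch yields raw addresses only; mapping them to routine names still requires the savecore files.

```python
import re

# One traceback entry: calling address, frame pointer, then the arguments.
# Whitespace after the commas is optional, since the figure shows both forms.
FRAME_RE = re.compile(
    r"Called from\s+([0-9a-fA-F]+),\s*fp=([0-9a-fA-F]+),\s*args=(.*)"
)

def parse_traceback(text):
    """Return a list of (caller_address, frame_pointer, args), most recent first."""
    frames = []
    for m in FRAME_RE.finditer(text):
        caller = int(m.group(1), 16)
        fp = int(m.group(2), 16)
        args = [int(a, 16) for a in m.group(3).split()]
        frames.append((caller, fp, args))
    return frames

# Three of the five entries from Figure 1-1.
TRACEBACK = """\
Begin traceback... sp = f0331670
Called from f0050668, fp=f03317e0, args=f0331844 0 f033184c 0 0 ff35be08
Called from f0005acc,fp=f0331938, args=f00bc334 f0331eb4 0 f0331e90 fffffffc ffffffff
Called from 13c24, fp=effff678, args=4f074 effff6d8 3a 2f 1 4dc00
End traceback...
"""

for caller, fp, args in parse_traceback(TRACEBACK):
    print(hex(caller), hex(fp), [hex(a) for a in args])
```

The list keeps the order of the console output, so the first element is the most recently called routine.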

Dumping messages

When panic() writes the contents of memory to the dump device, you will see several messages that describe how the pages of memory were in use, followed by the total number of pages.

This will be followed by a message telling us where the image of memory is being dumped, giving us the pointer to the vnode structure, which in turn points us to the device. Later on, we will look at the vnode structure in greater detail.
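The dump messages are regular enough to parse if you want to record them. The Python sketch below is an illustration under stated assumptions: the function name parse_dump_messages is hypothetical, each page count is assumed to precede its description on its own line, and the vp value is treated as a hexadecimal pointer. The base of the offset is not stated in the example, so it is kept as the literal text.

```python
import re

# Dump messages from Figure 1-1; only lines that carry a count are included.
DUMP = """\
   56 dynamic kernel data pages
  168 kernel-pageable pages
    0 segkmap kernel pages
    0 segvn kernel pages
   51 current user process pages
dumping to vp ff1e9d84, offset 116888
"""

def parse_dump_messages(text):
    """Return (pages_by_category, vnode_pointer, offset_text)."""
    pages = {}
    # Lines of the form "<count> <description> pages".
    for m in re.finditer(r"^[ \t]*(\d+)[ \t]+(.+?)[ \t]+pages[ \t]*$",
                         text, re.MULTILINE):
        pages[m.group(2)] = int(m.group(1))
    # Where the memory image is being written.
    loc = re.search(r"dumping to vp\s+([0-9a-fA-F]+),\s*offset\s+(\S+)", text)
    vnode = int(loc.group(1), 16) if loc else None
    offset = loc.group(2) if loc else None
    return pages, vnode, offset

pages, vnode, offset = parse_dump_messages(DUMP)
for category, count in pages.items():
    print(count, category)
print("vnode pointer:", hex(vnode), "offset:", offset)
```

The "total pages" line does not match the per-category pattern because of its trailing chunk count, so it is not mistaken for a category.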

Reboot

Once an image of memory is saved to the dump device, the system will attempt to reboot. Depending on the nature of the panic, the system may reboot without incident and not panic again for hours, months, or years. However, again depending on the problem that initiated the first panic, the system may get into a loop of panic'ing and rebooting until the system administrator intervenes.