The first captured crash | PANIC! UNIX System Crash Dump Analysis Handbook (Bk/CD-ROM)

The first crash was not captured, because the savecore program had not been enabled. After the system rebooted, the customer called Sun's United Kingdom Answer Centre for assistance. He was advised how to use savecore and was told what to do should the system crash again.

Later that day, the system panic'ed. The system crash dump files were immediately forwarded to the UKAC engineer.

Always get initial information

Let's look at the first crash dump, collecting initial information.

 Hiya...  strings vmcore.0 more  SunOS Release  4.1.3_U1  (cityboy) #1: Thu May 19 11:49:57 BST 1994  Copyright (c) 1983-1993, Sun Microsystems, Inc.  Data fault  /dev/mouse  /dev/mouse  BAD TRAP  PROTOCOM_Z_VIP_Spid 218, `clock':  Data fault  kernel read fault at addr=0x160, pme=0x0  SunOS Release  4.1.3_U1  (cityboy) #1: Thu May 19 11:49:57 BST 1994  Copyright (c) 1983-1993, Sun Microsystems, Inc.  mem = 24576K (0x1800000)  avail mem = 22339584  Ethernet address = 8:0:20:1f:d9:aa  cpu = SUNW,Sun 4/40  zs0 at obio 0xf1000000 pri 12  zs1 at obio 0xf0000000 pri 12  ... and so on ...

Based on what we found in the strings output, we will use a sun4c system running SunOS 4.1.3_U1 to do the analysis. We also now know that we're going to see a "bad trap: data fault" in this crash.

 Hiya on p4c-50a...  adb -k vmunix.0 vmcore.0  physmem 17a9  hostname/s  _hostname:  _hostname:      cityboy  *boottime=Y  1994 Oct 25 09:53:49  *time=Y  1994 Oct 25 10:26:46  $c  _panic(0xf81443e3,0xf82b3bf4,0x160,0x80,0xf82b4,0xf7fff7e8) + 6c  _trap(0x9,  0xf82b3bf4  ,0x160,0x80,0x1,0x0) + 184  callhatfault(0x0,0x1,0xff00a3e8,0x4,0xf76dfa10,0xfce2f658) + 34  _vno_select(?)  _swtch(0x800ae5,0xf81f5948,0x4000e5,0x0,0x3,0x2) + 30  _sleep(0xf814c810,0x1a,0x29,0xa00,0x1a,0xf81f5ac0) + 1a0  _select(0xffbfffff,0x20008001,0xf82b4380,0xf76dfa10,0xf82b3d70,0xf82b4000) + 4cc  _indir(0xf82b4000,0x5d,0xf8131010,0x14,0xf82b4394,0xf82b3ff8) + 1d4  _syscall(0xf82b4000) + 3b4  $<msgbuf  0xf8002000:     magic           size            bufx            bufr                  63062           1ff0            1200            d8d  0xf8002d9d:     BAD TRAP                  pid 218, `clock': Data fault  kernel read fault at addr=0x160  , pme=0x0                  Sync Error Reg 80<INVALID>  pc=0xf80e5d64  , sp=  0xf82b3c40  , psr=0x4000c3, context=0x5                  g1-g7: 0, 0, ffffffff, 0, f82b4000, f8148000, f8148000                  Begin traceback... sp = f82b3c40  (Trimmed output)

A trap occurred

The system didn't panic due to a software condition. Instead, it panic'ed due to a "bad trap: data fault." The hardware detected a condition that should not have occurred while trying to access memory.

Using the second argument to the trap() call, we dump out the trap registers and look further into the trap.

  0xf82b3bf4$<regs  0xf82b3bf4:     psr             pc              npc                  4000c3  f80e5d64       f80e5d68  0xf82b3c00:     y               g1              g2              g3                  c800000         0               0               ffffffff  0xf82b3c10:     g4              g5              g6              g7                  0               f82b4000        f8148000        f8148000  0xf82b3c20:     o0              o1              o2              o3                  f814c708        2               1               0  0xf82b3c30:     o4              o5              o6              o7                  182             f82b3c60  f82b3c40  a  f80e5d64/i        <---   Instruction being executed  _idle:  _idle:          sethi   %hi(0xf8148000), %g1  f80e5d68/i        <---   Instruction next in the pipeline  _idle+4:        ld      [%g1 + 0x160], %g1       ! -0x7eb7ea0  f82b3c40$c        <---   Get stack traceback starting at the stack pointer  ?(?)  _swtch(0x800ae5,0xf81f5948,0x4000e5,0x0,0x3,0x2) + 4034  _sleep(0xf814c810,0x1a,0x29,0xa00,0x1a,0xf81f5ac0) + 1a0  _select(0xffbfffff,0x20008001,0xf82b4380,0xf76dfa10,0xf82b3d70,0xf82b4000) + 4cc  _indir(0xf82b4000,0x5d,0xf8131010,0x14,0xf82b4394,0xf82b3ff8) + 1d4  _syscall(0xf82b4000) + 3b4

Did you notice that the regs macro output and the msgbuf ring buffer both contained the stack pointer and the program counter?

The sethi instruction

The failed instruction, sethi , is the first instruction in the idle() routine, a routine called thousands, maybe millions, of times per day. To have it crash there seems unlikely. But what is more unlikely is for the sethi instruction to cause a trap, especially a data fault, which is caused by illegal attempts to access memory.

The sethi instruction does not access memory. Instead, it loads a 22-bit value into a register. So, the trap type and the instruction (according to the PC ) do not match.

Examining nPC

If we look again at the trap error messages, we see that the failure is reported as being due to a "kernel read fault at addr=0x160." The sethi instruction had nothing to do with address 0x160.

At this point, it is a good idea to look at the next instruction in the pipeline. Maybe, somehow, the trap() routine just didn't report the correct PC . Let's look at nPC now.

  f80e5d68/i        <---   Instruction next in the pipeline  _idle+4:        ld      [%g1 + 0x160], %g1       ! -0x7eb7ea0

That's a bit better. However, in the previous instruction, we had set %g1 to contain the high-order 22 bits of the value 0xf8148000. Therefore, this load instruction should have been trying to read address f8148160, not address 160.

Something went wrong!

The sethi instruction never completed its job. Had it done so, %g1 would contain a valid address. At this point, we can only assume that the system had suffered some sort of hiccup.