The second captured crash | PANIC! UNIX System Crash Dump Analysis Handbook (Bk/CD-ROM)

When looking at a collection of related crashes, you hope to find a pattern. In the case of "cityboy," we found a very definite pattern.

Initial information

Even though this is a crash from the same customer, it's a good idea to make sure that it's definitely from the same system. More than once we've puzzled over a crash that didn't match the others, only to find that it was from a different system, and the customer was hoping we might look at it anyway.

Also, it sometimes helps to know how long the system was up before it went down. What is the crash frequency? Is there a recognizable pattern in the crash times? Is the crash frequency increasing? As you can probably imagine, there's a lot that you can deduce from the uptime and crash times.

Here's what we found.

    Hiya on p4c-75a...  adb -k vmunix.1 vmcore.1
    physmem 17a9
    hostname/s
    _hostname:
    _hostname:      cityboy
    *boottime=Y
                    1994 Oct 25 10:43:51
    *time=Y
                    1994 Oct 26 04:20:12
    $c
    _panic(0xf81443e3,0xf82f4bf4,0x338,0x80,0xf82f5,0xf7fff9a0) + 6c
    _trap(0x9,0xf82f4bf4,0x338,0x80,0x1,0x0) + 184
    callhatfault(?)
    _splx(0x100,0x1001e3,0xf8127c00,0xf81f6ae8,0xb80b0,0x0) + 34
    _soo_select(?)
    _swtch(0x800ae4,0xf81f644c,0x4000e4,0x0,0x3,0xa7) + 15c
    _sleep(0xf814c810,0x1a,0x29,0xa00,0x1a,0xf81f6ae8) + 1a0
    _select(0xffbfffff,0x20088001,0xf82f5380,0xb80b0,0xf82f4d70,0xf82f5000) + 4cc
    _indir(0xf82f5000,0x5d,0xf8131010,0x14,0xf82f5394,0xf82f4ff8) + 1d4
    _syscall(0xf82f5000) + 3b4
    $<msgbuf
    0xf8002000:     magic           size            bufx            bufr
                    63062           1ff0            1d7e            1917
    0xf8003927:     BAD TRAP
                    pid 752, `event_manager': Data fault
                    kernel read fault at addr=0x338, pme=0x0
                    Sync Error Reg 80<INVALID>
                    pc=0xf80e5dc0, sp=0xf82f4c40, psr=0x4000c2, context=0x2
                    g1-g7: 0, 0, ffffffff, 0, f82f5000, f8148000, f8148000
                    Begin traceback... sp = f82f4c40
    (Trimmed output)

The system suffered another data fault, this time while trying to read the contents of location 0x338.
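The boot and crash timestamps printed by adb are enough to work out how long the system stayed up. The following is a minimal sketch in Python, not part of the original adb session; the function name `uptime` and the format string are our own choices, and the parsing assumes English month abbreviations.

```python
from datetime import datetime, timedelta

# Layout of the dates adb prints for "=Y", e.g. "1994 Oct 25 10:43:51".
# Assumption: the month abbreviation is English (C locale).
ADB_DATE_FORMAT = "%Y %b %d %H:%M:%S"


def uptime(boottime: str, crashtime: str) -> timedelta:
    """Return the time between boot and crash as a timedelta."""
    boot = datetime.strptime(boottime, ADB_DATE_FORMAT)
    crash = datetime.strptime(crashtime, ADB_DATE_FORMAT)
    return crash - boot


# Values copied from the *boottime=Y and *time=Y output above.
up = uptime("1994 Oct 25 10:43:51", "1994 Oct 26 04:20:12")
print(up)  # 17:36:21
```

Running the same calculation over each crash in a collection gives a list of uptimes that can be compared for a trend.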

Collecting trap information

So, again, we need to look at the trap registers to find the failed instruction.

    0xf82f4bf4$<regs
    0xf82f4bf4:     psr             pc              npc
                    4000c2          f80e5dc0        f80e5dc4
    0xf82f4c00:     y               g1              g2              g3
                    19000000        0               0               ffffffff
    0xf82f4c10:     g4              g5              g6              g7
                    0               f82f5000        f8148000        f8148000
    0xf82f4c20:     o0              o1              o2              o3
                    8000            0               0               fce31880
    0xf82f4c30:     o4              o5              o6              o7
                    0               f8047e68        f82f4c40        f8089c00
    f80e5dc0/i
    _idle+0x5c:     orcc    %g0, %g1, %g0

The trap is reported as having occurred during the ORing of two registers. And, again, we are in the idle() routine. At 4 a.m., that's not surprising!

The orcc instruction

Like the sethi instruction we saw in the first crash, the orcc instruction does not attempt to access memory. Therefore, a data fault is not a trap we would expect to see generated by this instruction.

Once again, the fault that occurred and the instruction that was referenced by the PC do not match.
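One way to check whether an instruction can touch memory at all is to look at the top two bits of its 32-bit encoding: in the SPARC V8 instruction formats, an op field of 3 marks the load/store group, while 2 marks the arithmetic and logical group. The sketch below is illustrative only. The two instruction words are hand-assembled from the disassembly in this chapter and do not appear in the crash dump, and the helper names are hypothetical.

```python
def op_field(word: int) -> int:
    """Return bits 31:30 of a SPARC instruction word (the 'op' field)."""
    return (word >> 30) & 0x3


def accesses_memory(word: int) -> bool:
    """True if the word belongs to the load/store group (op == 3)."""
    return op_field(word) == 3


# Assumption: encodings assembled by hand for illustration.
ORCC_G0_G1_G0 = 0x80900001  # orcc %g0, %g1, %g0
LDUB_G1_338 = 0xC2086338    # ldub [%g1 + 0x338], %g1

print(accesses_memory(ORCC_G0_G1_G0))  # False
print(accesses_memory(LDUB_G1_338))    # True
```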

Examining nPC

Let's take a look at the next instruction in the pipeline. It's possible that we have another case of the nPC referencing the failing instruction. Let's look forward a few instructions.

    f80e5dc4/3i
    _idle+0x60:     be      _idle
                    nop
                    call    _runqueues

Well, that's not what we expected, is it? Maybe the problem occurred before the reported PC. Let's go back a few instructions.

    idle+50,4/ia
    _idle+0x50:     stba    %g0, [%g2] 0x2
    _idle+0x54:     sethi   %hi(0xf8147800), %g1
    _idle+0x58:     ldub    [%g1 + 0x338], %g1       ! -0x7eb84c8
    _idle+0x5c:     orcc    %g0, %g1, %g0
    _idle+0x60:

Well, this is very interesting! The instruction executed just prior to the reported PC was a "load unsigned byte" instruction, ldub. We were trying to read a byte from the memory location whose address was %g1 + 0x338. According to the code shown above, %g1 should have contained 0xf8147800, making the target address 0xf8147b38 (0xf8147800 + 0x338). However, according to the trap registers and the trap error messages, we actually tried to read memory location 0x338.
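The address arithmetic can be checked with a few lines of Python. This is a sketch for verification only; the helper names are our own, and the input values are copied from the disassembly and trap registers shown earlier. It also shows that the `! -0x7eb84c8` comment printed by adb is the expected address read as a signed 32-bit number.

```python
MASK32 = 0xFFFFFFFF


def effective_address(base: int, offset: int) -> int:
    """Base register plus immediate offset, wrapped to 32 bits."""
    return (base + offset) & MASK32


def as_signed32(value: int) -> int:
    """Interpret a 32-bit value as a signed integer."""
    return value - (1 << 32) if value & 0x80000000 else value


# Base that sethi %hi(0xf8147800), %g1 should have loaded.
expected = effective_address(0xF8147800, 0x338)
# For comparison: a base of 0, the g1 value in the saved trap registers.
observed = effective_address(0x0, 0x338)

print(hex(expected))               # 0xf8147b38
print(hex(as_signed32(expected)))  # -0x7eb84c8
print(hex(observed))               # 0x338

# Instructions are 4 bytes, so the one before the reported PC is at pc - 4.
pc = 0xF80E5DC0                    # reported PC, _idle+0x5c
idle_base = pc - 0x5C
print(hex(pc - 4 - idle_base))     # 0x58
```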