The first crash was not captured because the savecore program had not been enabled. After the system rebooted, the customer called Sun's United Kingdom Answer Centre (UKAC) for assistance. He was advised how to use savecore and told what to do should the system crash again. Later that day, the system panic'ed again. The system crash dump files were immediately forwarded to the UKAC engineer.

Always get initial information

Let's look at the first crash dump, collecting initial information.

    Hiya... strings vmcore.0 | more
    SunOS Release 4.1.3_U1 (cityboy) #1: Thu May 19 11:49:57 BST 1994
    Copyright (c) 1983-1993, Sun Microsystems, Inc.
    Data fault
    /dev/mouse
    /dev/mouse
    BAD TRAP
    pid 218, `clock': Data fault
    kernel read fault at addr=0x160, pme=0x0
    SunOS Release 4.1.3_U1 (cityboy) #1: Thu May 19 11:49:57 BST 1994
    Copyright (c) 1983-1993, Sun Microsystems, Inc.
    mem = 24576K (0x1800000)
    avail mem = 22339584
    Ethernet address = 8:0:20:1f:d9:aa
    cpu = SUNW,Sun 4/40
    zs0 at obio 0xf1000000 pri 12
    zs1 at obio 0xf0000000 pri 12
    ... and so on ...

Based on what we found in the strings output (a Sun 4/40 running SunOS 4.1.3_U1), we will use a sun4c system running SunOS 4.1.3_U1 to do the analysis. We also now know that we're going to see a "bad trap: data fault" in this crash.

    Hiya on p4c-50a... adb -k vmunix.0 vmcore.0
    physmem 17a9
    hostname/s
    _hostname:
    _hostname:      cityboy
    *boottime=Y
                    1994 Oct 25 09:53:49
    *time=Y
                    1994 Oct 25 10:26:46
    $c
    _panic(0xf81443e3,0xf82b3bf4,0x160,0x80,0xf82b4,0xf7fff7e8) + 6c
    _trap(0x9,0xf82b3bf4,0x160,0x80,0x1,0x0) + 184
    callhatfault(0x0,0x1,0xff00a3e8,0x4,0xf76dfa10,0xfce2f658) + 34
    _vno_select(?)
    _swtch(0x800ae5,0xf81f5948,0x4000e5,0x0,0x3,0x2) + 30
    _sleep(0xf814c810,0x1a,0x29,0xa00,0x1a,0xf81f5ac0) + 1a0
    _select(0xffbfffff,0x20008001,0xf82b4380,0xf76dfa10,0xf82b3d70,0xf82b4000) + 4cc
    _indir(0xf82b4000,0x5d,0xf8131010,0x14,0xf82b4394,0xf82b3ff8) + 1d4
    _syscall(0xf82b4000) + 3b4
    $<msgbuf
    0xf8002000:     magic           size            bufx            bufr
                    63062           1ff0            1200            d8d
    0xf8002d9d:     BAD TRAP
                    pid 218, `clock': Data fault
                    kernel read fault at addr=0x160, pme=0x0
                    Sync Error Reg 80<INVALID>
                    pc=0xf80e5d64, sp=0xf82b3c40, psr=0x4000c3, context=0x5
                    g1-g7: 0, 0, ffffffff, 0, f82b4000, f8148000, f8148000
                    Begin traceback... sp = f82b3c40
    (Trimmed output)

A trap occurred

The system didn't panic due to a software condition. Instead, it panic'ed due to a "bad trap: data fault." The hardware detected a condition that should not have occurred while trying to access memory. Using the second argument to the trap() call, 0xf82b3bf4, we dump out the trap registers and look further into the trap.

    0xf82b3bf4$<regs
    0xf82b3bf4:     psr             pc              npc
                    4000c3          f80e5d64        f80e5d68
    0xf82b3c00:     y               g1              g2              g3
                    c800000         0               0               ffffffff
    0xf82b3c10:     g4              g5              g6              g7
                    0               f82b4000        f8148000        f8148000
    0xf82b3c20:     o0              o1              o2              o3
                    f814c708        2               1               0
    0xf82b3c30:     o4              o5              o6              o7
                    182             f82b3c60        f82b3c40        a
    f80e5d64/i              <--- Instruction being executed
    _idle:
    _idle:          sethi   %hi(0xf8148000), %g1
    f80e5d68/i              <--- Instruction next in the pipeline
    _idle+4:        ld      [%g1 + 0x160], %g1      ! -0x7eb7ea0
    f82b3c40$c              <--- Get stack traceback starting at the stack pointer
    ?(?)
    _swtch(0x800ae5,0xf81f5948,0x4000e5,0x0,0x3,0x2) + 4034
    _sleep(0xf814c810,0x1a,0x29,0xa00,0x1a,0xf81f5ac0) + 1a0
    _select(0xffbfffff,0x20008001,0xf82b4380,0xf76dfa10,0xf82b3d70,0xf82b4000) + 4cc
    _indir(0xf82b4000,0x5d,0xf8131010,0x14,0xf82b4394,0xf82b3ff8) + 1d4
    _syscall(0xf82b4000) + 3b4

Did you notice that the regs macro output and the msgbuf ring buffer both contained the stack pointer and the program counter?
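
That overlap between the two sources can be checked mechanically. The following is only an illustrative sketch, not part of the original analysis: the register values are copied by hand from the msgbuf and regs output above (with spacing normalized), and the names `parse_msgbuf_registers` and `compare` are hypothetical helpers invented for this example.

```python
import re

# Register line copied from the msgbuf output above (spacing normalized).
MSGBUF_LINE = "pc=0xf80e5d64, sp=0xf82b3c40, psr=0x4000c3, context=0x5"

# Values read by hand from the $<regs output above; %o6 is the stack pointer.
REGS = {"psr": 0x4000C3, "pc": 0xF80E5D64, "npc": 0xF80E5D68, "sp": 0xF82B3C40}


def parse_msgbuf_registers(line):
    """Return {name: value} for every name=0x... pair found in the line."""
    pairs = re.findall(r"(\w+)\s*=\s*0x([0-9a-fA-F]+)", line)
    return {name: int(value, 16) for name, value in pairs}


def compare(msgbuf, regs):
    """Return the names present in both sources whose values differ."""
    return sorted(n for n in msgbuf if n in regs and msgbuf[n] != regs[n])


if __name__ == "__main__":
    parsed = parse_msgbuf_registers(MSGBUF_LINE)
    for name in ("pc", "sp", "psr"):
        print(f"{name}: msgbuf={parsed[name]:#x} regs={REGS[name]:#x}")
    print("mismatches:", compare(parsed, REGS))
    # nPC should sit one 4-byte instruction past PC.
    print("npc - pc =", REGS["npc"] - REGS["pc"])
```

An empty mismatch list says the panic message and the saved trap registers describe the same instant, so either source can be used to start the traceback.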

The sethi instruction

The failed instruction, sethi, is the first instruction in the idle() routine, a routine called thousands, maybe millions, of times per day. To have it crash there seems unlikely. Even more unlikely is for the sethi instruction to cause a trap, especially a data fault, which is caused by illegal attempts to access memory. The sethi instruction does not access memory. Instead, it loads a 22-bit value into the high-order bits of a register. So, the trap type and the instruction (according to the PC) do not match.

Examining nPC

If we look again at the trap error messages, we see that the failure is reported as being due to a "kernel read fault at addr=0x160." The sethi instruction had nothing to do with address 0x160. At this point, it is a good idea to look at the next instruction in the pipeline. Maybe, somehow, the trap() routine just didn't report the correct PC. Let's look at nPC now.

    f80e5d68/i              <--- Instruction next in the pipeline
    _idle+4:        ld      [%g1 + 0x160], %g1      ! -0x7eb7ea0

That's a bit better. However, in the previous instruction, we had set %g1 to contain the high-order 22 bits of the value 0xf8148000. Therefore, this load instruction should have been trying to read address 0xf8148160, not address 0x160.

Something went wrong!

The sethi instruction never completed its job. Had it done so, %g1 would contain a valid address. At this point, we can only assume that the system suffered some sort of hiccup.
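
The address arithmetic behind that conclusion can be worked through in a few lines. This is a minimal sketch that models only the two instructions involved, assuming %g1 still held the 0 shown in the register dump when the load executed. The function names are hypothetical and exist only for illustration.

```python
MASK32 = 0xFFFFFFFF


def hi(value):
    """Assembler %hi() operator: the high-order 22 bits of a 32-bit value."""
    return (value & MASK32) >> 10


def sethi(imm22):
    """sethi: place a 22-bit immediate in bits 31..10 and clear bits 9..0."""
    return (imm22 & 0x3FFFFF) << 10


def load_address(base, offset):
    """Effective address of ld [base + offset], wrapped to 32 bits."""
    return (base + offset) & MASK32


def to_signed32(value):
    """Reinterpret a 32-bit value as signed, the way adb prints its comment."""
    return value - (1 << 32) if value & 0x80000000 else value


# What %g1 should have held if sethi had completed.
expected_g1 = sethi(hi(0xF8148000))
# The address the ld at _idle+4 should then have read.
expected = load_address(expected_g1, 0x160)
# The address it reads if %g1 is still 0 (assumption: value from the regs dump).
observed = load_address(0x0, 0x160)

if __name__ == "__main__":
    print(f"expected %g1      = {expected_g1:#x}")
    print(f"expected address  = {expected:#x}")
    print(f"as signed (adb !) = {to_signed32(expected):#x}")
    print(f"observed address  = {observed:#x}")
```

With %g1 left at 0, the load address collapses to the bare offset 0x160, which is exactly the fault address in the panic message. The signed form of the expected address is also the value adb shows in its `! -0x7eb7ea0` comment.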