When looking at a collection of related crashes, you hope to find a pattern. In the case of "cityboy," we found a very definite pattern.

Initial information

Even though this is a crash from the same customer, it's a good idea to make sure that it's definitely from the same system. More than once we've puzzled over a crash that didn't match the others, only to find that it was from a different system, but the customer was hoping you might look at it anyway.

Also, it sometimes helps to know how long the system was up before it went down. What is the crash frequency? Is there a recognizable pattern in the crash times? Is the crash frequency increasing? As you can probably imagine, there's a lot that you can deduce from the uptime and crash times.

Here's what we found.

    Hiya on p4c-75a... adb -k vmunix.1 vmcore.1
    physmem 17a9
    hostname/s
    _hostname:
    _hostname:      cityboy
    *boottime=Y
                    1994 Oct 25 10:43:51
    *time=Y
                    1994 Oct 26 04:20:12
    $c
    _panic(0xf81443e3,0xf82f4bf4,0x338,0x80,0xf82f5,0xf7fff9a0) + 6c
    _trap(0x9,0xf82f4bf4,0x338,0x80,0x1,0x0) + 184
    callhatfault(?)
    _splx(0x100,0x1001e3,0xf8127c00,0xf81f6ae8,0xb80b0,0x0) + 34
    _soo_select(?)
    _swtch(0x800ae4,0xf81f644c,0x4000e4,0x0,0x3,0xa7) + 15c
    _sleep(0xf814c810,0x1a,0x29,0xa00,0x1a,0xf81f6ae8) + 1a0
    _select(0xffbfffff,0x20088001,0xf82f5380,0xb80b0,0xf82f4d70,0xf82f5000) + 4cc
    _indir(0xf82f5000,0x5d,0xf8131010,0x14,0xf82f5394,0xf82f4ff8) + 1d4
    _syscall(0xf82f5000) + 3b4
    $<msgbuf
    0xf8002000:     magic   size    bufx    bufr
                    63062   1ff0    1d7e    1917
    0xf8003927:     BAD TRAP
    pid 752, `event_manager': Data fault
    kernel read fault at addr=0x338, pme=0x0
    Sync Error Reg 80<INVALID>
    pc=0xf80e5dc0, sp=0xf82f4c40, psr=0x4000c2, context=0x2
    g1-g7: 0, 0, ffffffff, 0, f82f5000, f8148000, f8148000
    Begin traceback... sp = f82f4c40
    (Trimmed output)

The system suffered another data fault, this time while trying to read the contents of location 0x338.

Collecting trap information

So, again, we need to look at the trap registers to find the failed instruction.
    0xf82f4bf4$<regs
    0xf82f4bf4:     psr         pc          npc
                    4000c2      f80e5dc0    f80e5dc4
    0xf82f4c00:     y           g1          g2          g3
                    19000000    0           0           ffffffff
    0xf82f4c10:     g4          g5          g6          g7
                    0           f82f5000    f8148000    f8148000
    0xf82f4c20:     o0          o1          o2          o3
                    8000        0           0           fce31880
    0xf82f4c30:     o4          o5          o6          o7
                    0           f8047e68    f82f4c40    f8089c00
    f80e5dc0/i
    _idle+0x5c:     orcc    %g0, %g1, %g0

The trap is reported as having occurred during the ORing of two registers. And, again, we are in the idle() routine. At 4 a.m., that's not surprising!

The orcc instruction

Like the sethi instruction we saw in the first crash, the orcc instruction does not attempt to access memory. Therefore, a data fault is not a trap we would expect to see generated by this instruction. Once again, the fault that occurred and the instruction that was referenced by the PC do not match.

Examining nPC

Let's take a look at the next instruction in the pipeline. It's possible that we have another case of the nPC referencing the failing instruction. Let's look forward a few instructions.

    f80e5dc4/3i
    _idle+0x60:     be      _idle
                    nop
                    call    _runqueues

Well, that's not what we expected, is it? Maybe the problem occurred before the reported PC. Let's go back a few instructions.

    idle+50,4/ia
    _idle+0x50:     stba    %g0, [%g2] 0x2
    _idle+0x54:     sethi   %hi(0xf8147800), %g1
    _idle+0x58:     ldub    [%g1 + 0x338], %g1      ! -0x7eb84c8
    _idle+0x5c:     orcc    %g0, %g1, %g0
    _idle+0x60:

Well, this is very interesting! The instruction executed just prior to the reported PC was a "load unsigned byte" instruction, ldub. We were trying to read a byte from the memory location whose address is the sum of %g1 and 0x338. According to the code shown above, the sethi should have left 0xf8147800 in %g1, so the load should have read address 0xf8147b38 (0xf8147800 + 0x338). However, according to the trap registers and the trap error messages, we actually tried to read memory location 0x338.
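
The address arithmetic in the last paragraph can be checked with a short Python sketch. It is only an illustration, not part of the adb session: the constants are copied from the disassembly and the register dump, while the function names are hypothetical. It also rests on an assumption the text does not state outright, namely that the %g1 value saved in the trap registers (0) is the value the ldub actually used.

```python
# Values copied from the adb output in this section.
SETHI_VALUE = 0xF8147800   # sethi %hi(0xf8147800), %g1
OFFSET = 0x338             # ldub [%g1 + 0x338], %g1
G1_AT_TRAP = 0x0           # %g1 as saved in the trap registers (assumed to be
                           # the value the ldub used; see the lead-in)
MASK32 = 0xFFFFFFFF        # addresses on this system are 32 bits wide


def effective_address(base, offset):
    """Return base + offset, wrapped to 32 bits."""
    return (base + offset) & MASK32


def from_signed32(value):
    """Reinterpret a signed 32-bit value as an unsigned address."""
    return value & MASK32


expected = effective_address(SETHI_VALUE, OFFSET)
observed = effective_address(G1_AT_TRAP, OFFSET)
# adb annotates the ldub with "! -0x7eb84c8", a signed rendering of the target.
comment_addr = from_signed32(-0x7EB84C8)

print(f"expected address: {expected:#x}")      # 0xf8147b38
print(f"observed address: {observed:#x}")      # 0x338
print(f"adb comment:      {comment_addr:#x}")  # 0xf8147b38
```

Under that assumption, the observed fault address is just the 0x338 offset added to a base of zero, which matches the addr=0x338 reported in the message buffer. The disassembler's "-0x7eb84c8" comment turns out to be the expected address 0xf8147b38 printed as a signed number.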