Is it a hardware failure?


Two possible scenarios come to mind. The first involves the hardware. It is possible that the instruction was set to zero by a hardware error while the module was being loaded into memory. However, if the system were suffering from hardware problems, even rare flukes, we would expect to see a rash of problems develop with the system. We would also expect a less "clean" failure. Seeing a zero where the instruction should be is not comforting. Had we lost a single bit of the instruction, it would look more like a hardware failure. Had a larger section of code been corrupted, it would be easier to declare it a hardware problem. There is one other possibility, though.
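To make the "single bit versus clean zero" reasoning concrete, here is a minimal sketch that compares an expected instruction word with the word found in the dump and counts the differing bits. The function name and the sample words are our own illustrations, not values taken from this crash.

```python
def classify_corruption(expected: int, observed: int) -> str:
    """Describe how a 32-bit instruction word differs from its expected value."""
    # XOR leaves a 1 in every bit position where the two words disagree.
    diff = (expected ^ observed) & 0xFFFFFFFF
    flipped = bin(diff).count("1")
    if flipped == 0:
        return "intact"
    if flipped == 1:
        # One lost bit is the pattern we would associate with a hardware glitch.
        return "single-bit flip"
    if observed & 0xFFFFFFFF == 0:
        # The whole word is zero: the "clean" failure seen in this dump.
        return "zeroed word"
    return f"{flipped} bits differ"


# Hypothetical sample word, chosen only for illustration.
SAMPLE = 0x9DE3BFA0
print(classify_corruption(SAMPLE, SAMPLE ^ 0x1))
print(classify_corruption(SAMPLE, 0))
```

A word that comes back as "zeroed word" rather than "single-bit flip" is what makes the hardware explanation less convincing here.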

We noted in our analysis that the CPU module was a 50 MHz model. On rare occasions, we saw a timing problem with these chips, one that was readily fixed by a simple patch to the kernel. The patch set kernel variable enable_sm_wa to 1. Let's see if that patch had been made to the customer's system.

    Hiya... adb -k unix.0 vmcore.0
    physmem 3e15
    enable_sm_wa/X
    enable_sm_wa:
    enable_sm_wa:   1
    $q
    Hiya...

Obviously, this customer is well informed and had already installed the fix. The hardware failure theory is getting weaker.
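If we had to make this check on many saved dumps, we could capture each adb session to a file and pull the value out with a short script. The sketch below is only an illustration: the function name is hypothetical, and it assumes the transcript looks exactly like the session shown above, with the value printed in hexadecimal by the X format.

```python
import re


def parse_adb_hex(transcript: str, symbol: str):
    """Return the hex value adb printed for `symbol`, or None if it is absent."""
    # Match lines such as "enable_sm_wa:   1"; the bare "enable_sm_wa:" label
    # line is skipped because it carries no value on the same line.
    pattern = re.compile(
        rf"^{re.escape(symbol)}:[ \t]+([0-9a-fA-F]+)[ \t]*$", re.MULTILINE
    )
    match = pattern.search(transcript)
    return int(match.group(1), 16) if match else None


# Transcript copied from the session above.
session = """physmem 3e15
enable_sm_wa/X
enable_sm_wa:
enable_sm_wa:   1
"""

value = parse_adb_hex(session, "enable_sm_wa")
print("patch applied" if value == 1 else "patch missing or unknown")
```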

Another interesting piece of data that we have picked up from the analysis is that the system has been up and running for two days. From talking to the customer, it sounds as if they use the semaphore code all of the time, so it is probably safe to assume that the semsys module had been loaded and in use for quite some time, in terms of computer time. Is there a way to confirm this?

In the /usr/include/sys/modctl.h file, we find that a count is maintained, showing how many times a module has been loaded. However, if the module has been loaded since boot time and was never unloaded, a count of 1 would be perfectly acceptable. Therefore, the load count won't help us.

Digging further through modctl.h, we also find that the time a module is loaded is not recorded anywhere. But even if it were, that wouldn't prove that the module had been used successfully before the crash.

So, we have no way of proving or disproving the hardware theory at this point, other than to wait for another crash and see if it's similar.



PANIC! UNIX System Crash Dump Analysis Handbook (Bk/CD-ROM)
ISBN: 0131493868
Year: 1994
Pages: 289
Authors: Chris Drake
