Two possible scenarios come to mind. The first involves the hardware. It is possible that the instruction was set to zero by a hardware error while the module was being loaded into memory. However, if the system were suffering from hardware problems, even rare flukes, we would expect to see a rash of problems develop with the system. We would also hope to see a less "clean" failure. Seeing a zero where the instruction should be is not comforting. Had we lost a single bit of the instruction, it would look more like a hardware failure. Had a larger section of code been corrupted, it would be easier to declare it a hardware problem.

There is one other possibility, though. We noted in our analysis that the CPU module was a 50 MHz model. On rare occasions, we saw a timing problem with these chips, one that was readily fixed by a simple patch to the kernel. The patch set the kernel variable enable_sm_wa to 1. Let's see if that patch had been made to the customer's system.

    Hiya... adb -k unix.0 vmcore.0
    physmem 3e15
    enable_sm_wa/X
    enable_sm_wa:
    enable_sm_wa:   1
    $q
    Hiya...

Obviously, this customer is well informed and had already installed the fix. The hardware failure theory is getting weaker.

Another interesting piece of data that we picked up from the analysis is that the system had been up and running for two days. From talking to the customer, it sounds as if they use the semaphore code all of the time, so it is probably safe to assume that the semsys module had been loaded and in use for quite some time, in terms of computer time. Is there a way to confirm this?

In the /usr/include/sys/modctl.h file, we find that a count is maintained, showing how many times a module has been loaded. However, if the module was loaded at boot time and never unloaded, a count of 1 would be perfectly acceptable. Therefore, the load count won't be of help to us. Also, digging through modctl.h, we find that the time a module is loaded is not recorded anywhere.
But even if it were, that wouldn't prove that the module had been used successfully prior to the time of the crash. So, we have no way of proving or disproving the hardware theory at this point, other than to wait for another crash and see if it's similar.
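
To make the earlier "single bit versus clean zero" reasoning concrete, here is a minimal Python sketch of how we might compare the instruction word we expected with the one found in the dump. The function name classify_corruption and the sample instruction value are our own inventions for illustration; they do not come from the customer's system, and the labels are only a rough guide, not a diagnosis.

```python
def classify_corruption(expected: int, observed: int, width: int = 32) -> str:
    """Roughly describe how an observed instruction word differs from the expected one."""
    mask = (1 << width) - 1
    expected &= mask
    observed &= mask

    # XOR leaves a 1 in every bit position where the two words disagree.
    flipped = bin(expected ^ observed).count("1")

    if flipped == 0:
        return "intact"
    if flipped == 1:
        return "single-bit flip"      # looks more like a hardware glitch
    if observed == 0:
        return "word zeroed"          # the suspiciously "clean" case
    return f"{flipped} bits differ"


# Hypothetical 32-bit instruction word, chosen only for illustration.
EXPECTED = 0x9DE3BFA0

print(classify_corruption(EXPECTED, EXPECTED ^ 0x00000400))
print(classify_corruption(EXPECTED, 0x00000000))
```

In our case the observed word was zero, which falls into the "word zeroed" category rather than the single-bit category we would more readily blame on hardware.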