What is a watchdog?


Watchdog resets usually indicate a software problem, although the root cause may be hardware related. The immediate cause is a trap, such as a page fault, that occurs in the middle of handling another trap. The kernel begins processing a trap with the Enable Traps (ET) bit in the Processor Status Register cleared (turned off), which prevents the CPU from accepting another trap until the initial processing of the first one is complete. In other words, no further trap is supposed to be generated until the system has done enough work to handle the first one successfully. If for some reason a trap does occur during this period, the system has to take the trap, but it can't because the bit is off, so it quits right there. This is a watchdog reset: an unrecoverable situation that essentially forces a reset of the CPU.
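The sequence can be illustrated with a toy model. This is only a sketch of the idea, not kernel or hardware code: the names ToyCpu, WatchdogReset, take_trap, and nested_fault are invented for illustration, and the model assumes, as a simplification, that traps are re-enabled only at the end of the handler.

```python
from dataclasses import dataclass, field


class WatchdogReset(Exception):
    """Hypothetical: raised when a trap arrives while traps are disabled."""


@dataclass
class ToyCpu:
    # Models the Enable Traps bit in the Processor Status Register.
    enable_traps: bool = True
    log: list = field(default_factory=list)

    def take_trap(self, name, nested_fault=None):
        if not self.enable_traps:
            # The CPU must take the trap but cannot: unrecoverable.
            raise WatchdogReset(f"{name} raised while traps are disabled")

        # Trap entry turns the bit off until initial processing is done.
        self.enable_traps = False
        self.log.append(f"enter {name}")

        if nested_fault is not None:
            # A second trap during the critical window (assumed scenario).
            self.take_trap(nested_fault)

        # Enough state has been saved; traps may be accepted again.
        self.enable_traps = True
        self.log.append(f"leave {name}")
```

Calling ToyCpu().take_trap("page fault") completes normally, whereas ToyCpu().take_trap("window overflow", nested_fault="page fault") raises WatchdogReset, mirroring the trap-within-a-trap case described above.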

The only thing you can do after a watchdog reset is reboot the machine. Because of the nature of watchdog resets, not even the kernel absolute debugger, kadb, will let you capture one as it happens. There are, however, as you'll see shortly, a few OpenBoot PROM commands you can use to get some status information before you reboot.

sun4d systems

On the Sun SPARCserver 1000 and SPARCcenter 2000 systems (sun4d architecture), there are actually two different types of watchdog resets.

The first is the CPU watchdog reset described above: a single CPU finds itself in trouble and causes the system to drop into the PROM.

The second is a more drastic problem caused by a major hardware failure. In this case, called a system watchdog, the entire system is rebooted automatically: no PROM ok prompt appears, and you have no opportunity to debug the failure. During this process, some information is saved into the NVRAM (nonvolatile random access memory).

/usr/kvm/prtdiag: A special sun4d command

In the case of a system watchdog on a SPARCserver 1000 or SPARCcenter 2000 system, there is one command you can run after the system reboots and comes back up again. The prtdiag utility is only available on sun4d systems and only prints information about the last system watchdog, so it's not useful for a normal CPU watchdog reset. However, it will assist in identifying the bad hardware that caused the system reset.



PANIC! UNIX System Crash Dump Analysis Handbook (Bk/CD-ROM)
ISBN: 0131493868
Year: 1994
Pages: 289
Authors: Chris Drake
