Now that we've seen that the core dump appears to match the circumstances, we can continue and look at things with some confidence. Since this is a hang, one of the first steps should be to look at the processes on the machine and see if any are stuck. With Solaris 1 systems, we can use the ps command to look at a core file, then check for processes in a "D" state ("disk-wait"). If the system is locked up due to a lack of resources or some sort of deadlock, the D-state processes are likely candidates.

Some of the processes in the following ps and adb traceall output have been deleted for space reasons, so this listing is a little short. Yours may well look a lot longer, especially if the system was busy at the time.

zatch 13: ps -axk vmunix.2 vmcore.2
  PID TT STAT   TIME COMMAND
    0 ?  D      8:03 swapper
    1 ?  I      4:59 /sbin/init -
    2 ?  D      0:20 pagedaemon
   55 ?  RW     2:46 portmap
   60 ?  IW     0:00 keyserv
   69 ?  I      1:13 in.routed
   72 ?  I      0:03 (biod)
   73 ?  I      0:03 (biod)
   74 ?  I      0:03 (biod)
   75 ?  I      0:03 (biod)
   86 ?  RW     0:01 syslogd
   98 ?  RW     0:08 /usr/lib/sendmail -bd -q1h
  104 ?  IW     0:01 rpc.mountd -n
  106 ?  I      0:00 (nfsd)
  107 ?  I      0:00 (nfsd)
  108 ?  I      0:00 (nfsd)
  109 ?  IW     0:01 rarpd -a
  110 ?  IW     0:00 rarpd -a
  111 ?  I      0:00 (nfsd)
  113 ?  I      0:00 (nfsd)
  114 ?  I      0:00 (nfsd)
  115 ?  I      0:00 (nfsd)
  116 ?  I      0:00 (nfsd)
  117 ?  IW     0:00 rpc.bootparamd
  120 ?  IW     0:00 rpc.statd
  122 ?  IW     0:00 rpc.lockd
  128 ?  RW    18:25 automount -m -f /etc/auto.master
  132 ?  D    532:35 update
  135 ?  RW     0:02 cron
  141 ?  RW     0:54 inetd
  145 ?  IW     0:00 /usr/lib/lpd
 5346 co IW     0:00 -csh (csh)
 5357 co IW     0:00 sh /home/sat/.openwin
 5358 co IW     0:00 /bin/sh /usr/openwin/bin/openwin -dev /dev/fb
 5362 co IW     0:00 /usr/openwin/bin/xinit -- /usr/openwin/bin/xnews :0
 5363 co D     67:05 /usr/openwin/bin/xnews :0 -dev /dev/fb
 5370 co IW     0:00 sh /home/sat/.xinitrc
 5382 co IW     0:04 mwm -multiscreen -display :0.0 :0.1 :0.2
 5385 co IW     0:00 calctool -display :0.1 -title CALC -scale medium
 5387 co IW     0:00 sv_xv_sel_svc
 5388 co IW     0:00 vkbd -nopopup
 5389 co IW     0:00 dsdm
 5391 co IW     0:00 calctool -display :0.2 -title CALC -scale medium
 5393 co IW     0:00 calctool -display :0.0 -title CALC -scale medium
 5394 co IW     0:00 xterm -C -name SystemConsole -title System Console -sb
 5408 co IW     1:36 xterm -sb -font -misc-*-medium-r-normal-*-15-120-*-*-*-*-*
  148 ?  IW     0:00 - std.9600 ttya (getty)
 5398 ?  IW     0:00 -sh (csh)
 5409 ?  IW     0:00 -sh (csh)
      ?  RW     0:00 sleep 30
 4434 ?  R      0:25 gendis second
 4435 ?  I      0:30 recdis second
 4436 ?  R      0:53 gendis first
 4437 ?  I      1:04 recdis first
 5436 ?  IW     0:00 csh go
 5438 ?  IW     0:00 /bin/csh go
 5440 ?  I      3:19 ranger/trcque go.out
 5487 ?  IW     0:30 /bin/sh ps_repeat -p mwm -d 30
 5489 ?  R      0:02 wsini
 5497 ?  IW     0:00 wprntfhost .0
 5498 ?  IW     0:00 wprntfhost .1
 5499 ?  IW     0:00 wprntfhost .2
 5500 ?  R      0:04 csupv primary FICC_PRIMARY_MMF
 5502 ?  RW     0:01 resync
 5503 ?  Z      0:00 <defunct>
 5504 ?  R      0:01 icon32
 5505 ?  R      0:05 colorhost .0
 5507 ?  R      0:05 colorhost .1
 5508 ?  R      0:05 colorhost .2
 5513 ?  IW     0:00 wprntfhost .1
 5514 ?  IW     0:00 wprntfhost .2
 5515 ?  IW     0:00 wprntfhost .0
 5516 ?  IW     0:00 colorhost .0
 5517 ?  IW     0:00 colorhost .2
 5518 ?  I      0:30 colorhost .0
 5519 ?  I      0:33 colorhost .2
 5520 ?  IW     0:00 colorhost .1
 5521 ?  I      0:33 colorhost .1
 5522 ?  I     27:11 keys .0
 5523 ?  R      0:06 navwin .0
 5524 ?  D     23:48 crtout .0 -sync
 5525 ?  R      0:05 navwin .1
 5526 ?  R      0:14 crtout .1 -sync
 5527 ?  R      0:05 navwin .2
 5528 ?  R      0:14 crtout .2 -sync
 5529 ?  D      0:14 taskmon
      ?  RW     0:00 sleep 30
      ?  D      0:00 play -v 25 dialtone.au
      ?  D      0:00 <exiting>

Let's note a few things about this ps output. First of all, there are a few processes in D state, but not a lot. However, one of them is the OpenWindows server (xnews), which might account for the console freezing up. We also have the swapper and pagedaemon in D state. Although this is fairly normal, it may be worth looking at the stack traceback to see why they are stopped.

There is nothing at all in DW state, which could mean that the system froze up, and before it had a chance to swap everything out, the alert user at the console decided to kill the system and get a crash. It might also mean that the freeze was related to swapping, and nothing could get swapped in or out.

There are also quite a few processes in R or RW state. These are runnable programs. The relatively high number makes it look like the system was fairly active at the time of the freeze. However, we have no way of knowing how long those programs were in this state. If a program is waiting, say, for input from the network and the machine crashes, the network does not stop. Even while the panic function is writing out the contents of memory onto disk, a chunk of data may arrive from the Ethernet and cause a process to be "awakened," or put back on the run queue. Interrupts are not disabled, so some of these runnable processes may have become runnable during the time the system was trying to shut itself down.

We should also note that this is obviously a networked system, as we have four biod processes running (so this is an NFS client) and eight nfsds running (so this is an NFS server). None of these are in IW state or D state, which suggests that the network has been fairly busy: all of these processes were active recently, and none of them seem to be stuck. To be sure, though, we should check the stack traces for these.
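When the listing runs to hundreds of lines, the triage described above (count the processes in each state, then pull out the D-state ones) can be done mechanically. The following is only a minimal sketch in Python, not part of the original toolset. It assumes the ps output has been saved as text in the five-column layout shown above, and the names ROW, parse_ps, summarize, in_state, and SAMPLE are all hypothetical. The PID is treated as optional because a few rows in the listing have none.

```python
import re
from collections import Counter

# One row of the ps listing: optional PID, then TT, STAT, TIME, COMMAND.
# The header line does not match because its fourth field is not a time.
ROW = re.compile(r"^\s*(\d+)?\s*(\S+)\s+([A-Z]+)\s+(\d+:\d+)\s+(.*)$")


def parse_ps(text):
    """Return one dict per process row found in saved ps output."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            pid, tt, stat, time, command = m.groups()
            rows.append({"pid": pid, "tt": tt, "stat": stat,
                         "time": time, "command": command})
    return rows


def summarize(rows):
    """Count how many processes are in each STAT value."""
    return Counter(r["stat"] for r in rows)


def in_state(rows, stat):
    """Return the rows whose STAT field is exactly `stat`."""
    return [r for r in rows if r["stat"] == stat]


# A few rows copied from the listing above, used only as sample input.
SAMPLE = """\
  PID TT STAT   TIME COMMAND
    0 ?  D      8:03 swapper
    1 ?  I      4:59 /sbin/init -
    2 ?  D      0:20 pagedaemon
   55 ?  RW     2:46 portmap
  132 ?  D    532:35 update
 5363 co D     67:05 /usr/openwin/bin/xnews :0 -dev /dev/fb
      ?  D      0:00 <exiting>
"""

if __name__ == "__main__":
    rows = parse_ps(SAMPLE)
    print(dict(summarize(rows)))
    for r in in_state(rows, "D"):
        print(r["pid"] or "-", r["time"], r["command"])
```

Run against the full listing, the same functions would let us ask for "DW" to confirm that nothing was both in disk-wait and swapped out, or for "R" and "RW" to see how many processes were runnable. The script only sorts the candidates; deciding why a process is stuck still requires the stack traces.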