10.3 Was It a PANIC, a TOC, or an HPMC?

     

After a system has crashed, one of the main things you want to do is establish why it crashed. To do this, we need to employ the services of our local HP Response Center. They have engineers trained in crashdump analysis who will endeavor to get to the root cause of why your system crashed. When we place a support call, we will be asked whether we want to place a Software support call or a Hardware support call. This is where we can do a little investigation of our own to streamline the process of getting to the root cause of the system crash.

There are essentially three types of system crashes:

  • High Priority Machine Check (HPMC): This is normally the result of a piece of hardware causing a Group 1 interrupt, an HPMC. A Group 1 interrupt is the highest priority interrupt the system can generate and signifies that THE MOST serious event has just occurred. The interrupt will be handled by a processor and passed to the operating system for further processing. When the operating system receives an HPMC, the only thing it can do is cause the system to crash. This will produce a system crashdump. As an example, a double-bit memory error will cause an HPMC. Many other hardware-related events will cause an HPMC. There is a small chance that an HPMC could be caused by a software error, but the vast majority of HPMCs are caused by hardware problems.

    There is also a Low Priority Machine Check (LPMC). An LPMC does not necessarily cause the system to crash. An LPMC may be related to a hardware error that is recoverable, e.g., a single-bit memory error.

  • Transfer of Control (TOC): If a system hangs, i.e., you can't get any response from a ping or from the system console and the system has frozen, you may decide to initiate a TOC from the system console by using the TC command from the Command Menu (press ctrl-b on the console, or go via the GSP). If you are using Serviceguard, the cmcld daemon may cause the system to TOC in the event of a cluster reformation. All of these situations are normally associated with some form of software problem (the Serviceguard issue may be related to a hardware problem in our networking, but it was software that initiated the TOC).

  • PANIC: A PANIC occurs when the kernel detects a situation that makes no logical sense, e.g., kernel data structures becoming corrupted or logical corruption in a software subsystem, such as a filesystem trying to delete a file twice (freeing free frag). In such situations, the kernel decides that the safest thing to do is to cause the system to crash. A PANIC is normally associated with a software problem, although it could be an underlying hardware problem (the filesystem problem mentioned above may have been caused by a faulty disk).

In summary, an HPMC is probably a hardware problem, and a TOC or PANIC is probably some form of software problem.

If we can distinguish between these three types of crashes, we can assist the analysis process by placing the appropriate call with our local Response Center. When we speak to a Response Center engineer, he may require us to send in the crashdump files on tape, as well as something called a tombstone. A tombstone details the last actions of the processor(s) when an HPMC occurred. We see this later.
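A quick way to make this distinction ourselves is to look at the panic string that savecrash records, exactly as the later examples in this section do. A minimal sketch (paths are the standard HP-UX defaults):

# Classify recent crashes from the panic strings recorded by savecrash.
# "HPMC" or "trap type 1" suggests hardware; "TOC" suggests a hang, a
# Serviceguard-initiated TOC, or an operator TOC; anything else is
# usually a PANIC.
cd /var/adm/crash
grep panic crash*/INDEX

# The same panic string is appended to the shutdown log after the reboot.
tail -5 /var/adm/shutdownlog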

In some instances, the engineer may log in remotely and perform the crashdump analysis on our systems. If you don't want the engineer to log in to your live production systems, you will need to relocate the files from the savecrash directory (/var/adm/crash) onto another system to which the engineer does have access.
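As a sketch of how that relocation might look (the hostname otherhost and the target directory are only examples, and remsh access between the two systems is assumed), we could stream the crash directory across the network with tar:

# Copy crash.0 to another host the engineer can reach, preserving
# relative pathnames. Assumes remsh access to "otherhost" and that
# /var/adm/crash already exists there with enough free space.
cd /var/adm/crash
tar cf - crash.0 | remsh otherhost "cd /var/adm/crash && tar xf -"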

Let's look at a number of crashdumps in order to distinguish which are an HPMC, a TOC, or a PANIC. To look inside a crashdump, we need a debugging tool. HP-UX comes with a kernel debugger called q4, which is installed by default. We could spend an entire book talking about q4. You'll find some documentation on q4 in the file /usr/contrib/docs/Q4Docs.tar.Z if you want to have a look. In reality, you need to know kernel internals to be able to exploit q4 to its fullest. This is why we need to employ the help of our local Response Center to analyze the crashdump in full. I will give you some idea of how to use it by going through some examples. It is an interactive command, and once you get used to it, it is quite easy to use.
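If you do want to read that documentation, it first needs to be uncompressed and unpacked; a minimal sketch, assuming you have some scratch space under /tmp:

# Unpack the contributed q4 documentation into a scratch directory.
mkdir /tmp/q4docs
cp /usr/contrib/docs/Q4Docs.tar.Z /tmp/q4docs
cd /tmp/q4docs
uncompress Q4Docs.tar.Z      # leaves Q4Docs.tar
tar xf Q4Docs.tar
ls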

10.3.1 An HPMC

An HPMC is a catastrophic event for a system. This is the highest priority interrupt that an HP system can generate and is regarded as an unrecoverable error. The operating system must deal with it before it does anything else. For HP-UX, this means it will perform a crashdump, so we will have files to analyze in /var/adm/crash. Our task is to realize that this crash was an HPMC, locate the tombstone (if there is one), and place a hardware call with our local HP Response Center. Here's a system that recently had an HPMC:

 

root@hpeos002[] # more /var/adm/shutdownlog
12:21  Thu Aug 22, 2002.  Reboot:  (by hpeos002!root)
01:01  Tue Aug 27, 2002.  Reboot:  (by hpeos002!root)
04:38  Sun Sep  1, 2002.  Reboot:
22:40  Wed Sep 25, 2002.  Reboot:  (by hpeos002!root)
09:33  Sun Sep 29, 2002.  Reboot:
10:19  Sun Sep 29, 2002.  Reboot:  (by hpeos002!root)
...
17:00  Sun Nov 16 2003.  Reboot after panic: trap type 1 (HPMC), pcsq.pcoq = 0.aa880, isr.ior = 0.7dc8
root@hpeos002[] #
root@hpeos002[] # cd /var/adm/crash
root@hpeos002[crash] # ll
total 4
-rwxr-xr-x   1 root       root             1 Nov 16 16:59 bounds
drwxr-xr-x   2 root       root          1024 Nov 16 17:00 crash.0
root@hpeos002[crash] # cd crash.0/
root@hpeos002[crash.0] # cat INDEX
comment   savecrash crash dump INDEX file
version   2
hostname  hpeos002
modelname 9000/715
panic     trap type 1 (HPMC), pcsq.pcoq = 0.aa880, isr.ior = 0.7dc8
dumptime  1069001748 Sun Nov 16 16:55:48 GMT 2003
savetime  1069001959 Sun Nov 16 16:59:19 GMT 2003
release   @(#) $Revision: vmunix:  vw: -proj  selectors: CUPI80_BL2000_1108 -c 'Vw for CUPI80_BL2000_1108 build' -- cupi80_bl2000_1108 'CUPI80_BL2000_1108'  Wed Nov 8 19:05:38 PST 2000 $
memsize   268435456
chunksize 8388608
module    /stand/vmunix vmunix 20418928 3531348543
module    /stand/dlkm/mod.d/rng rng 55428 3411709208
image     image.1.1 0x0000000000000000 0x00000000007fe000 0x0000000000000000 0x000000000000113f 2736590966
image     image.1.2 0x0000000000000000 0x00000000007fa000 0x0000000000001140 0x0000000000001a07 3970038878
image     image.1.3 0x0000000000000000 0x00000000007fc000 0x0000000000001a08 0x00000000000030d7 3687677982
image     image.1.4 0x0000000000000000 0x0000000000800000 0x00000000000030d8 0x00000000000064ef 2646676018
image     image.1.5 0x0000000000000000 0x00000000007fe000 0x00000000000064f0 0x0000000000009c57 3361770983
image     image.1.6 0x0000000000000000 0x0000000000464000 0x0000000000009c58 0x000000000000ffff 569812247
root@hpeos002[crash.0] #

The first thing to note is that this appears to be an HPMC. I can confirm this by looking at the dump itself.

root@hpeos002[crash.0] # q4 .
@(#) q4 $Revision: B.11.20f $ $Fri Aug 17 18:05:11 PDT 2001 0
Reading kernel symbols ...
This kernel does not look like it has been prepared for debugging.
If this is so, you will need to run pxdb or q4pxdb on it before you
can use q4. You can verify that this is the problem by asking pxdb:

        $ pxdb -s status ./vmunix

If pxdb says the kernel has not been preprocessed, you will need to
run it on the kernel before using q4:

        $ pxdb ./vmunix

Be aware that pxdb will overwrite your kernel with the fixed-up version,
so you might want to save a copy of the file before you do this.
(If the "-s status" command complained about an internal error, you
will need to get a different version of pxdb before proceeding.)
If you were not able to find pxdb, be advised that it moved from its
traditional location in /usr/bin to /opt/langtools/bin when the change
was made to the System V.4 file system layout.
If you do not have pxdb, it is probably because the debugging tools are
now an optional product (associated with the compilers and debuggers)
and are no longer installed on every system by default. In this case
you should use q4pxdb in exactly the same manner as you would use pxdb.
quit
root@hpeos002[crash.0] #

This error is not uncommon; it tells me that the kernel needs some preprocessing before it can be debugged:

 

root@hpeos002[crash.0] # q4pxdb vmunix
.
Procedures: 13
Files: 6
root@hpeos002[crash.0] #
root@hpeos002[crash.0] # q4 .
@(#) q4 $Revision: B.11.20f $ $Fri Aug 17 18:05:11 PDT 2001 0
Reading kernel symbols ...
Reading data types ...
Initialized PA-RISC 1.1 (no buddies) address translator ...
Initializing stack tracer ...
script /usr/contrib/Q4/lib/q4lib/sample.q4rc.pl
executable /usr/contrib/Q4/bin/perl
version 5.00502
SCRIPT_LIBRARY = /usr/contrib/Q4/lib/q4lib
perl will try to access scripts from directory /usr/contrib/Q4/lib/q4lib
q4: (warning) PXDB:Some debug info sections were missing in the module.
q4: (warning) PXDB:Some debug info sections were missing in the module.
Processing module rng for debug info
q4: (warning) Debug info not found in the module
q4: (warning) Debug info not found in the module
q4> ex panicstr using s
trap type 1 (HPMC), pcsq.pcoq = 0.aa880, isr.ior = 0.7dc8
q4>

Here I can see the same panic string that we saw in the shutdownlog file. For every processor in the system, an event is stored in a structure known as the crash event table. These events are numbered from 0. We can trace each of these events individually:

 

q4> trace event 0
stack trace for event 0
crash event was an HPMC
vx_event_post+0x14
invoke_callouts_for_self+0x8c
sw_service+0xcc
up_ext_interrupt+0x108
ivti_patch_to_nop2+0x0
idle+0x57c
swidle_exit+0x0
q4>

This can become tedious if you have more than one running processor (runningprocs). Alternatively, you can load the entire crash event table and trace every event that occurred in one go (a pile):

 

q4> load crash_event_t from &crash_event_table until crash_event_ptr max 100
loaded 1 crash_event_t as an array (stopped by "until" clause)
q4> trace pile
stack trace for event 0
crash event was an HPMC
vx_event_post+0x14
invoke_callouts_for_self+0x8c
sw_service+0xcc
up_ext_interrupt+0x108
ivti_patch_to_nop2+0x0
idle+0x57c
swidle_exit+0x0
q4>

In this case, every (okay, there's only one) processor has indicated that an HPMC was called. At this point, I would be looking to place a Hardware support call. If we can find the associated tombstone for this system, it might speed up the process of root cause analysis quite a bit. We need contributed diagnostic software loaded in order to automatically save a tombstone. The program I am looking for is called pdcinfo; it normally resides under the /usr/sbin/diag/contrib directory and is supported on most machines (it's not supported on some workstations). If we don't have the program, we can still extract the tombstone using the Online Diagnostic tools, specifically the Support Tool Manager. I can run an info diagnostic on the processors, which will extract the PIM (Processor Information Module) information from the processor. The PIM information is the tombstone.

 

root@hpeos002[crash.0] # pdcinfo
HP-UX hpeos002 B.11.11 U 9000/715 2007116332
pdcinfo: The host machine is not supported by this program.
root@hpeos002[crash.0] #

If this is the error message you receive, then your only option is to place a Hardware support call with the Response Center and let them take it from there. Let's look at extracting the PIM from a different system using the STM diagnostics. First, if pdcinfo is available, it is run at boot time and creates the most recent tombstone in a file called /var/tombstones/ts99:

 

root@hpeos003[] cd /var/tombstones/
root@hpeos003[tombstones]
root@hpeos003[tombstones] ll | more
...
-rw-r--r--   1 root       root          3683 Nov 14 10:35 ts97
-rw-r--r--   1 root       root          3683 Nov 15 09:10 ts98
-rw-r--r--   1 root       root          3683 Nov 16 10:54 ts99
root@hpeos003[tombstones]
root@hpeos003[tombstones] more ts99
HP-UX hpeos003 B.11.11 U 9000/8000 894960601

CPU-ID(Model) = 0x13
PROCESSOR PIM INFORMATION

-----------------  Processor 0 HPMC Information - PDC Version: 42.19  ------

Timestamp =    Fri Nov 14 23:41:28 GMT 2003    (20:03:11:14:23:41:28)

HPMC Chassis Codes = 0xcbf0  0x20b3  0x5008  0x5408  0x5508  0xcbfb

General Registers 0 - 31
 0 -  3  0x00000000  0x00004a4f  0x0004dc33  0x00000001
 4 -  7  0x40001840  0x00000001  0x7b36bf04  0x00000001
 8 - 11  0x41936338  0x40001844  0x40001844  0x41934338
12 - 15  0x00000000  0x41932168  0x00000001  0x00000020
16 - 19  0x419322ab  0x7b36bf60  0x00000001  0x00000003
...
root@hpeos003[tombstones]

The tombstone has valid data in it: a series of seemingly inexplicable hex codes. The hex codes relate to the state of various hardware components at the time of the crash. This needs careful analysis by a hardware engineer, who can decipher it and establish what caused the HPMC in the first place. We should place a Hardware support call and inform the Response Center that we have a tombstone for the engineer to analyze. If we don't have a tombstone in the form of a ts99 file, we can attempt to extract the PIM information from the processors themselves.

root@hpeos003[] cstm
Running Command File (/usr/sbin/stm/ui/config/.stmrc).
-- Information --
Support Tools Manager
Version A.34.00
Product Number B4708AA
(C) Copyright Hewlett Packard Co. 1995-2002
All Rights Reserved
Use of this program is subject to the licensing restrictions described
in "Help-->On Version". HP shall not be liable for any damages resulting
from misuse or unauthorized use of this program.
cstm>
cstm> map
hpeos003.hq.maabof.com

  Dev                                                 Last        Last Op
  Num  Path                 Product                   Active Tool Status
  ===  ==================== ========================= =========== ===========
    1  system               system ()                 Information Successful
    2  0                    Bus Adapter (582)
    3  0/0                  PCI Bus Adapter (782)
    4  0/0/0/0              Core PCI 100BT Interface
    5  0/0/1/0              PCI SCSI Interface (10000
    6  0/0/1/1              PCI SCSI Interface (10000
    7  0/0/1/1.15.0         SCSI Disk (HP36.4GST33675 Information Successful
    8  0/0/2/0              PCI SCSI Interface (10000
    9  0/0/2/1              PCI SCSI Interface (10000
   10  0/0/2/1.15.0         SCSI Disk (HP36.4GST33675 Information Successful
   11  0/0/4/1              RS-232 Interface (103c104
   12  0/2                  PCI Bus Adapter (782)
   13  0/2/0/0              PCI Bus Adapter (8086b154
   14  0/2/0/0/4/0          Unknown (10110019)
   15  0/2/0/0/5/0          Unknown (10110019)
   16  0/2/0/0/6/0          Unknown (10110019)
   17  0/2/0/0/7/0          Unknown (10110019)
   18  0/4                  PCI Bus Adapter (782)
   19  0/4/0/0              Fibre Channel Interface (
   20  0/6                  PCI Bus Adapter (782)
   21  0/6/0/0              PCI SCSI Interface (10000
   22  0/6/0/1              PCI SCSI Interface (10000
   23  0/6/2/0              Fibre Channel Interface (
   24  8                    MEMORY (9b)               Information Successful
   25  160                  CPU (5e3)                 Information Successful
cstm>
cstm> sel dev 25
cstm> info
-- Updating Map --
Updating Map...
cstm>
cstm> infolog
-- Converting a (5324) byte raw log file to text. --
-- Preparing the Information Tool Log for CPU on path 160 File ...
....
hpeos003  :  192.168.0.65
....

-- Information Tool Log for CPU on path 160 --

Log creation time: Sun Nov 16 17:31:35 2003

Hardware path: 160

Product ID:                CPU          Module Type:              0
Hardware Model:            0x5e3        Software Model:           0x4
Hardware Revision:         0            Software Revision:        0
Hardware ID:               0            Software ID:              894960601
Boot ID:                   0x1          Software Option:          0x91
Processor Number:          0            Path:                     160
Hard Physical Address:     0xfffffffffffa0000     Soft Physical Address:    0
Slot Number:               8            Software Capability:      0x100000f0
PDC Firmware Revision:     42.19        IODC Revision:            0
Instruction Cache [Kbyte]: 768          Processor Speed:          N/A
Processor State:           N/A
Monarch:                   Yes          Active:                   Yes
Data Cache        [Kbyte]: 1536
Instruction TLB   [entry]: 240          Processor Chip Revisions: 2.3
Data TLB Size     [entry]: 240          2nd Level Cache Size:[KB] N/A
Serial Number:             N/A

-----------------  Processor 0 HPMC Information - PDC Version: 42.19  ------

CPU-ID(Model) = 0x13
PROCESSOR PIM INFORMATION

Timestamp =    Fri Nov 14 23:41:28 GMT 2003    (20:03:11:14:23:41:28)

HPMC Chassis Codes = 0xcbf0  0x20b3  0x5008  0x5408  0x5508  0xcbfb

General Registers 0 - 31
 0 -  3  0x00000000  0x00004a4f  0x0004dc33  0x00000001
 4 -  7  0x40001840  0x00000001  0x7b36bf04  0x00000001
 8 - 11  0x41936338  0x40001844  0x40001844  0x41934338
12 - 15  0x00000000  0x41932168  0x00000001  0x00000020
16 - 19  0x419322ab  0x7b36bf60  0x00000001  0x00000003

There are numerous pages of output that I have left out for reasons of brevity. When you have finished looking at the log, you should save it to a disk file to pass on to the Response Center engineer.

 

Module              Revision
------              --------
System Board        A24245
PA 8700 CPU Module  2.3

-- Information Tool Log for CPU on path 160 --

View   - To View the file.
Print  - To Print the file.
SaveAs - To Save the file.

Enter Done, Help, Print, SaveAs, or View: [Done] SA

-- Save Information Tool Log for CPU on path 160 --

Information Tool Log for CPU on path 160
File Path: /
File Name: /tmp/pim.HPMC.16Nov03

Enter Done, Help, Print, SaveAs, or View: [Done]
cstm> quit
-- Exit the Support Tool Manager --
Are you sure you want to exit the Support Tool Manager?
Enter Cancel, Help, or OK: [OK]
root@hpeos003[]
root@hpeos003[tombstones] ll /tmp/pim.HPMC.16Nov03
-rw-rw-r--   1 root       sys         4791 Nov 16 17:33 /tmp/pim.HPMC.16Nov03
root@hpeos003[tombstones]

On a V-Class system, the tools to extract a tombstone are located on the test-station. The command is pim_dumper and needs to be run as the sppuser user. The PIM is usually stored in a file /spp/data/<node>/pimlog (or /spp/data/pimlog on a V2200).

We should make both the tombstone and the crashdump files available to the Response Center engineers. In most cases, an HPMC is related to some form of hardware fault. However, there are situations where an HPMC is software related. Ensure that you keep the crashdump files until the Response Center engineers are finished with them.
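Crashdump files can be large, so while you hold on to them it is worth knowing how much space they consume. A quick check might look like this (paths are the standard defaults):

# How much space are the saved dumps using under /var/adm/crash?
du -sk /var/adm/crash/crash.*    # size of each crash.N directory in KB
bdf /var/adm/crash               # free space left in the filesystem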

10.3.2 A TOC

This system has been experiencing a number of problems and has several crashdumps in /var/adm/crash:

 

root@hpeos001[crash] # pwd
/var/adm/crash
root@hpeos001[crash] # ll
total 12
-rwxr-xr-x   1 root       root             1 Aug  2  2002 bounds
drwxr-xr-x   2 root       root          1024 Feb  5  2003 crash.0
drwxr-xr-x   2 root       root          1024 Feb  5  2003 crash.1
drwxr-xr-x   2 root       root          1024 Feb  5  2003 crash.2
drwxr-xr-x   2 root       root          1024 Apr  5  2003 crash.3
drwxr-xr-x   2 root       root          1024 Aug  2  2002 crash.4
root@hpeos001[crash] #

We start with the latest one, crash.4:

 

root@hpeos001[crash] # cd crash.4
root@hpeos001[crash.4] # ll
total 65660
-rw-r--r--   1 root       root          1184 Aug  2  2002 INDEX
-rw-r--r--   1 root       root       3649642 Aug  2  2002 image.1.1.gz
-rw-r--r--   1 root       root       5366814 Aug  2  2002 image.1.2.gz
-rw-r--r--   1 root       root       5132853 Aug  2  2002 image.1.3.gz
-rw-r--r--   1 root       root       5389805 Aug  2  2002 image.1.4.gz
-rw-r--r--   1 root       root       4722164 Aug  2  2002 image.1.5.gz
-rw-r--r--   1 root       root       1341565 Aug  2  2002 image.1.6.gz
-rw-r--r--   1 root       root       7999699 Aug  2  2002 vmunix.gz
root@hpeos001[crash.4] #

As you can see, the savecrash command has compressed these files. Let's have a look in the INDEX file to see if we can pick up any information in there:

 

root@hpeos001[crash.4] # cat INDEX
comment   savecrash crash dump INDEX file
version   2
hostname  hpeos001
modelname 9000/777/C110
panic     TOC, pcsq.pcoq = 0.15f4b4, isr.ior = 0.95c4b8
dumptime  1028316356 Fri Aug  2 20:25:56 BST 2002
savetime  1028316646 Fri Aug  2 20:30:46 BST 2002
release   @(#) $Revision: vmunix:  vw: -proj  selectors: CUPI80_BL2000_1108 -c 'Vw for CUPI80_BL2000_1108 build' -- cupi80_bl2000_1108 'CUPI80_BL2000_1108'  Wed Nov  8 19:05:38 PST 2000 $
memsize   134217728
chunksize 16777216
module    /stand/vmunix vmunix 19959792 3822072703
image     image.1.1 0x0000000000000000 0x0000000000ffb000 0x0000000000000000 0x0000000000001ad7 2506990029
image     image.1.2 0x0000000000000000 0x0000000000ffb000 0x0000000000001ad8 0x0000000000003547 2619725050
image     image.1.3 0x0000000000000000 0x0000000000ffa000 0x0000000000003548 0x0000000000004c4f 3285117231
image     image.1.4 0x0000000000000000 0x0000000000ffd000 0x0000000000004c50 0x0000000000006227 1045138142
image     image.1.5 0x0000000000000000 0x0000000001000000 0x0000000000006228 0x0000000000007957 3167489837
image     image.1.6 0x0000000000000000 0x00000000004d5000 0x0000000000007958 0x0000000000007fff 2277772794
root@hpeos001[crash.4] #

All I can tell from the panic string is that this was a TOC. Sometimes there is a more descriptive panic string, which I could feed into the ITRC knowledge database to see if it had been seen before. For most people, the fact that this was a TOC is enough information. You should now place a Software call with your local Response Center and get an engineer to take a detailed look at the crashdump.

If the file /var/adm/shutdownlog exists, we should see the panic string in that file as well.

 

root@hpeos001[crash] # more /var/adm/shutdownlog
19:52  Wed Feb 27, 2002.  Reboot:
08:50  Mon Mar  4, 2002.  Halt:
13:03  Mon Jun 10, 2002.  Halt:
...
20:56  Sat Apr 05 2003.  Reboot after panic: TOC, pcsq.pcoq = 0.15f4b4, isr.ior = 0.95c4b8
root@hpeos001[crash] #

In this instance, I will look a little further to see what else I can find out. In order to look at the crashdump itself, I will gunzip at least the kernel file:

 

root@hpeos001[crash.4] # gunzip vmunix.gz
root@hpeos001[crash.4] #

Before I run q4, I will preprocess the kernel with the q4pxdb command:

 

root@hpeos001[crash.4] # q4pxdb vmunix
.
Procedures: 13
Files: 6
root@hpeos001[crash.4] #

Now I can run q4:

 

root@hpeos001[crash.4] # q4 .
@(#) q4 $Revision: B.11.20f $ $Fri Aug 17 18:05:11 PDT 2001 0
Reading kernel symbols ...
Reading data types ...
Initialized PA-RISC 1.1 (no buddies) address translator ...
Initializing stack tracer ...
script /usr/contrib/Q4/lib/q4lib/sample.q4rc.pl
executable /usr/contrib/Q4/bin/perl
version 5.00502
SCRIPT_LIBRARY = /usr/contrib/Q4/lib/q4lib
perl will try to access scripts from directory /usr/contrib/Q4/lib/q4lib
q4: (warning) No loadable modules were found
q4: (warning) No loadable modules were found
System memory: 128 MB
Total Dump space configured: 256.00 MB
Total Dump space actually used: 84.74 MB
Dump space appears to be sufficient: 171.26 MB extra
q4>
q4> examine panicstr using s
TOC, pcsq.pcoq = 0.15f4b4, isr.ior = 0.95c4b8
q4>

This is the panic string that we see in the INDEX file. I want to find out what each processor was doing at the time of the crash. First, I want to know how many processors were configured on this system:

 

q4> runningprocs
01      1       0x1
q4>

I can look at a structure known as the multi-processor information table. This structure (one per processor) lists the instructions that each processor was executing at the time of the crash.

 

q4> load mpinfo_t from mpproc_info max nmpinfo
loaded 1 mpinfo_t as an array (stopped by max count)
q4> trace pile
processor 0 claims to be idle
stack trace for event 0
crash event was a TOC
Send_Monarch_TOC+0x2c
safety_time_check+0x110
per_spu_hardclock+0x308
clock_int+0x7c
inttr_emulate_save_fpu+0x100
idle+0x56c
swidle_exit+0x0
q4>

In this particular instance, the safety_time_check entry in the stack trace tells me that Serviceguard was running on this machine (the safety timer is an integral part of a Serviceguard node's regular checking of the status of the cluster). If Serviceguard TOCs a server, there are normally messages in the kernel message buffer (the buffer read by the dmesg command). The message buffer has an 8-byte header, which I am not interested in, so I can skip the header and read the data in the buffer itself:

 

q4> ex &msgbuf+8 using s
NOTICE: nfs3_link(): File system was registered at index 3.
NOTICE: autofs_link(): File system was registered at index 6.
NOTICE: cachefs_link(): File system was registered at index 7.
8 ccio
8/12 c720
8/12.5 tgt
8/12.5.0 sdisk
8/12.6 tgt
8/12.6.0 sdisk
8/12.7 tgt
8/12.7.0 sctl
8/16 bus_adapter
8/16/4 asio0
8/16/5 c720
8/16/5.0 tgt
8/16/5.0.0 sdisk
8/16/5.2 tgt
8/16/5.2.0 sdisk
8/16/5.3 tgt
8/16/5.3.0 stape
8/16/5.7 tgt
8/16/5.7.0 sctl
8/16/6 lan2
8/16/0 CentIf
8/16/10 fdc
8/16/1 audio
ps2_readbyte_timeout: no byte after 500 uSec
ps2_readbyte_timeout: no byte after 500 uSec
8/16/7 ps2
8/20 bus_adapter
8/20/5 eisa
8/20/5/2 lan2
8/20/2 asio0
8/20/1 hil
10 ccio
10/12 c720
10/12.6 tgt
10/12.6.0 sctl
10/16 graph3
32 processor
49 memory
    System Console is on the Built-In Serial Interface
Entering cifs_init...
Initialization finished successfully... slot is 9
Logical volume 64, 0x3 configured as ROOT
Logical volume 64, 0x2 configured as SWAP
Logical volume 64, 0x2 configured as DUMP
    Swap device table:  (start & size given in 512-byte blocks)
        entry 0 - major is 64, minor is 0x2; start = 0, size = 524288
    Dump device table:  (start & size given in 1-Kbyte blocks)
        entry 00000000 - major is 31, minor is 0x6000; start = 88928, size = 262144
Warning: file system time later than time-of-day register
Getting time from file system
Starting the STREAMS daemons-phase 1
Create STCP device files
          $Revision: vmunix:    vw: -proj    selectors: CUPI80_BL2000_1108 -c 'Vw for CUPI80_BL2000_1108 build' -- cupi80_bl2000_1108 'CUPI80_BL2000_1108'  Wed Nov  8 19:05:38 PST 2000 $
Memory Information:
    physical page size = 4096 bytes, logical page size = 4096 bytes
    Physical: 131072 Kbytes, lockable: 82636 Kbytes, available: 96004 Kbytes
SCSI: Reset requested from above -- lbolt: 547387, bus: 1
SCSI: Resetting SCSI -- lbolt: 547687, bus: 1
SCSI: Reset detected -- lbolt: 547687, bus: 1
SCSI: Reset requested from above -- lbolt: 670315, bus: 1
SCSI: Resetting SCSI -- lbolt: 670615, bus: 1
SCSI: Reset detected -- lbolt: 670615, bus: 1
MC/ServiceGuard: Unable to maintain contact with cmcld daemon.
Performing TOC to ensure data integrity.
q4>

This is definitely a Serviceguard issue. The SCSI lbolt messages are normal during a Serviceguard cluster reformation. Analyzing the dump may reveal more, but my immediate task is to log a software call with my local Response Center to take this analysis further. In the meantime, I would be investigating my Serviceguard logfiles for any more clues as to why this Serviceguard node went through a cluster reformation and ended up TOC'ing.

 

q4> exit
root@hpeos001[crash.4] #
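While investigating the Serviceguard logfiles mentioned above, remember that the cluster daemons log via syslog, so a first pass might look something like the following (the syslog path is the common HP-UX default; package control-script logs usually live with the package under /etc/cmcluster, and the cluster commands assume Serviceguard is installed):

# Look for cmcld and cluster reformation messages around the time of the TOC.
grep -i cmcld /var/adm/syslog/syslog.log | tail -20
grep -i reformation /var/adm/syslog/syslog.log | tail -20

# Current cluster state once the node has rebooted and rejoined.
cmviewcl -v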

10.3.3 A PANIC

In this instance, we don't have an HPMC or a TOC to deal with. This one is a PANIC. This type of crash is normally associated with a problem in a kernel device driver or software subsystem, but it is not inconceivable that it could be associated with an underlying hardware problem. We are back on the system hpeos001 that we saw earlier:

 

root@hpeos001[] # cd /var/adm/crash
root@hpeos001[crash] # grep panic crash*/INDEX
crash.0/INDEX:panic     TOC, pcsq.pcoq = 0.afb04, isr.ior = 0.0
crash.1/INDEX:panic     TOC, pcsq.pcoq = 0.15f4b4, isr.ior = 0.9561f8
crash.2/INDEX:panic     free: freeing free frag
crash.3/INDEX:panic     TOC, pcsq.pcoq = 0.15f4b4, isr.ior = 0.95c4b8
crash.4/INDEX:panic     TOC, pcsq.pcoq = 0.15f4b4, isr.ior = 0.95c4b8
root@hpeos001[crash] #

I am interested in crash.2 because there is no mention of an HPMC or a TOC, an early indication that this is a PANIC:

 

root@hpeos001[crash] # cd crash.2
root@hpeos001[crash.2] # ll
total 213178
-rw-r--r--   1 root       root          1218 Feb  5  2003 INDEX
-rw-r--r--   1 root       root       16744448 Feb  5  2003 image.1.1
-rw-r--r--   1 root       root       16777216 Feb  5  2003 image.1.2
-rw-r--r--   1 root       root       16764928 Feb  5  2003 image.1.3
-rw-r--r--   1 root       root       16777216 Feb  5  2003 image.1.4
-rw-r--r--   1 root       root       16773120 Feb  5  2003 image.1.5
-rw-r--r--   1 root       root       10465280 Feb  5  2003 image.1.6
-rw-r--r--   1 root       root       14842104 Feb  5  2003 vmunix
root@hpeos001[crash.2] #
root@hpeos001[crash.2] # cat INDEX
comment   savecrash crash dump INDEX file
version   2
hostname  hpeos001
modelname 9000/777/C110
panic     free: freeing free frag
dumptime  1044424474 Wed Feb  5 05:54:34 GMT 2003
savetime  1044424740 Wed Feb  5 05:59:00 GMT 2003
release   @(#) $Revision: vmunix:  vw: -proj  selectors: CUPI80_BL2000_1108 -c 'Vw for CUPI80_BL2000_1108 build' -- cupi80_bl2000_1108 'CUPI80_BL2000_1108'  Wed Nov  8 19:05:38 PST 2000 $
memsize   134217728
chunksize 16777216
module    /stand/vmunix vmunix 19931120 1462037576
warning   savecrash: savecrash running in the background
image     image.1.1 0x0000000000000000 0x0000000000ff8000 0x0000000000000000 0x00000000000019f7 3186480777
image     image.1.2 0x0000000000000000 0x0000000001000000 0x00000000000019f8 0x0000000000003017 3525696154
image     image.1.3 0x0000000000000000 0x0000000000ffd000 0x0000000000003018 0x0000000000004a57 3554239297
image     image.1.4 0x0000000000000000 0x0000000001000000 0x0000000000004a58 0x0000000000005eff 811243188
image     image.1.5 0x0000000000000000 0x0000000000fff000 0x0000000000005f00 0x000000000000724f 2125486394
image     image.1.6 0x0000000000000000 0x00000000009fb000 0x0000000000007250 0x0000000000007fff 4051446221
root@hpeos001[crash.2] #

Let's run q4 and see what happens:

 

root@hpeos001[crash.2] # q4 .
@(#) q4 $Revision: B.11.20f $ $Fri Aug 17 18:05:11 PDT 2001 0
q4: (warning) Here are the savecore warning messages -
q4: (warning) savecrash: savecrash running in the background
Reading kernel symbols ...
Reading data types ...
Initialized PA-RISC 1.1 (no buddies) address translator ...
Initializing stack tracer ...
script /usr/contrib/Q4/lib/q4lib/sample.q4rc.pl
executable /usr/contrib/Q4/bin/perl
version 5.00502
SCRIPT_LIBRARY = /usr/contrib/Q4/lib/q4lib
perl will try to access scripts from directory /usr/contrib/Q4/lib/q4lib
q4: (warning) No loadable modules were found
q4: (warning) No loadable modules were found
System memory: 128 MB
Total Dump space configured: 356.00 MB
Total Dump space actually used: 89.91 MB
Dump space appears to be sufficient: 266.09 MB extra
q4>
q4> ex &msgbuf+8 using s
NOTICE: nfs3_link(): File system was registered at index 3.
NOTICE: autofs_link(): File system was registered at index 6.
NOTICE: cachefs_link(): File system was registered at index 7.
8 ccio
8/12 c720
8/12.5 tgt
8/12.5.0 sdisk
8/12.6 tgt
8/12.6.0 sdisk
8/12.7 tgt
8/12.7.0 sctl
8/16 bus_adapter
8/16/4 asio0
8/16/5 c720
8/16/5.0 tgt
8/16/5.0.0 sdisk
8/16/5.2 tgt
8/16/5.2.0 sdisk
8/16/5.3 tgt
8/16/5.3.0 stape
8/16/5.7 tgt
8/16/5.7.0 sctl
8/16/6 lan2
8/16/0 CentIf
8/16/10 fdc
8/16/1 audio
8/16/7 ps2
8/20 bus_adapter
8/20/5 eisa
8/20/5/2 lan2
8/20/2 asio0
8/20/1 hil
10 ccio
10/12 c720
10/12.6 tgt
10/12.6.0 sctl
10/16 graph3
32 processor
49 memory
    System Console is on the Built-In Serial Interface
Entering cifs_init...
Initialization finished successfully... slot is 9
Logical volume 64, 0x3 configured as ROOT
Logical volume 64, 0x2 configured as SWAP
Logical volume 64, 0x2 configured as DUMP
    Swap device table:  (start & size given in 512-byte blocks)
        entry 0 - major is 64, minor is 0x2; start = 0, size = 524288
    Dump device table:  (start & size given in 1-Kbyte blocks)
        entry 00000000 - major is 31, minor is 0x6000; start = 88928, size = 262144
Starting the STREAMS daemons-phase 1
Create STCP device files
         $Revision: vmunix:    vw: -proj    selectors: CUPI80_BL2000_1108 -c 'Vw for CUPI80_BL2000_1108 build' -- cupi80_bl2000_1108 'CUPI80_BL2000_1108'  Wed Nov  8 19:05:38 PST 2000 $
Memory Information:
    physical page size = 4096 bytes, logical page size = 4096 bytes
    Physical: 131072 Kbytes, lockable: 82676 Kbytes, available: 94672 Kbytes
dev = 0x4000000d, block = 144, fs = /data, cgp = 0xbac50000, ip = 0x7fff0ca0
linkstamp:          Thu Jan 9 13:40:49 GMT 2003
_release_version:   @(#)     $Revision: vmunix:    vw: -proj    selectors: CUPI80_BL2000_1108 -c 'Vw for CUPI80_BL2000_1108 build' -- cupi80_bl2000_1108 'CUPI80_BL2000_1108'  Wed Nov  8 19:05:38 PST 2000 $
panic: free: freeing free frag
PC-Offset Stack Trace (read across, top of stack is 1st):
  0x0015e58c  0x0036a708  0x0035f310  0x0035df00  0x0005d09c  0x0005d1e8
  0x00066d34  0x001360d0  0x00069d60  0x000e0814  0x00034578
End Of Stack
sync'ing disks (0 buffers to flush):
(0 buffers to flush):
0 buffers not flushed
0 buffers still dirty
q4>

First, in this specific instance, we can see output relating to the affected filesystem. Don't necessarily expect this type of information for every PANIC. The PC-Offset Stack Trace is the list of instructions leading up to the crash. These may give us some clues. We can use a structure known as the crash event table to analyze what was happening at the time of the crash. This is an alternative structure to the multi-processor information table:

 

q4> load crash_event_t from &crash_event_table until crash_event_ptr max 100
loaded 1 crash_event_t as an array (stopped by "until" clause)
q4> trace pile
stack trace for event 0
crash event was a panic
panic+0x60
free+0x7b8
itrunc+0xd84
post_inactive_one+0x7c
post_inactive+0xdc
flush_all_inactive+0x10
ufs_sync+0x44
update+0x4c
tsync+0x124
syscall+0x1bc
$syscallrtn+0x0
q4>

There was something happening to a UFS (HFS) filesystem. I would immediately be logging a Software call with my local HP Response Center. While it looks like something strange was happening with the UFS code, it is not inconceivable that a disk problem introduced some form of unique corruption in the filesystem. It would be up to an engineer to diagnose this and possibly run a diagnostic check on the disk in question.

While waiting for contact from the Response Center, we could take the entire stack trace, along with our panic string, and feed them into the ITRC knowledge database to see if this problem has been seen before. It may suggest possible reasons for the problem and possible solutions. We can pass any information we get from the ITRC to the Response Center engineer to help him get to the root cause of the problem.

10.3.4 Storing a crashdump to tape

If we are asked to store a crashdump to tape, we should store all the files under the /var/adm/crash/crash.X directory. To avoid any issues with pathnames, it's a good idea to change into the /var/adm/crash directory and use relative pathnames when storing your crashdump files to tape; absolute pathnames could overwrite the crashdump files on the Response Center's own server when the tape is restored. Relative pathnames just make the whole process come to a conclusion much quicker. It's best to use a common backup command such as tar. Make sure that you put a label on the tape with your Response Center case number and the command you used to create the tape. Some people put their company name on the label; HP realizes that there are potential confidentiality issues with that, so your name is optional, but make sure the Support Call Case Number is on the label. If, for whatever reason, the files in the /var/adm/crash/crash.X directory are accidentally deleted or corrupted, we can always attempt to resave a crashdump. If the swapping system has overwritten the dump, then it is lost forever, but we can try by using the -r option to the savecrash utility:

 

root@hpeos002[crash] # pwd
/var/adm/crash
root@hpeos002[crash] # ll
total 4
-rwxr-xr-x   1 root       root             1 Nov 16 16:59 bounds
drwxr-xr-x   2 root       root          1024 Nov 16 17:58 crash.0
root@hpeos002[crash] # savecrash -r /var/adm/crash
root@hpeos002[crash] # ll
total 6
-rwxr-xr-x   1 root       root             1 Nov 16 19:14 bounds
drwxr-xr-x   2 root       root          1024 Nov 16 17:58 crash.0
drwxrwxrwx   2 root       sys           1024 Nov 16 19:14 crash.1
root@hpeos002[crash] #

As you can see, we specify the directory where the crashdump will be stored. Alternatively, we could have used the -t <tape device> option to store the crashdump directly to tape.
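As a sketch of the tar-to-tape step itself (the tape device file /dev/rmt/0m is an assumption; check which device file your drive actually uses), relative pathnames keep the eventual restore safe on the Response Center's system:

# Write the crashdump directory to tape using relative pathnames.
cd /var/adm/crash
tar cvf /dev/rmt/0m ./crash.1

# Verify what went onto the tape before labelling and sending it off.
tar tvf /dev/rmt/0m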


