Identifying Failed Devices

After an error is located, the failed device can be identified. The goal is to determine the root cause. As previously mentioned, dmesg and /var/log/messages are commonly used with lspci and the /proc filesystem. These combined tools are used to troubleshoot and locate hardware faults.

The lspci command shows the hardware layout of a machine. In the following example, we use a small Linux IA32 server with dual processors and Fibre Channel attached storage to demonstrate what lspci output looks like. First, we display the kernel we are using with the uname command.

[root@cyclops lpfc]# uname -a
Linux cyclops 2.4.9-e.10custom-gt #4 SMP Mon Nov 1 14:17:36 EST 2004 i686 unknown

We continue with dmesg to list the PCI bus. Note that because we use the Emulex HBAs often in this chapter's examples, their lines are annotated.

[root@cyclops lpfc]# dmesg | grep PCI
PCI: PCI BIOS revision 2.10 entry at 0xfda11, last bus=1
PCI: Using configuration type 1
PCI: Probing PCI hardware
PCI: Discovered primary peer bus 01 [IRQ]
PCI->APIC IRQ transform: (B0,I2,P0) -> 22
PCI->APIC IRQ transform: (B0,I8,P0) -> 23
PCI->APIC IRQ transform: (B0,I15,P0) -> 33
PCI->APIC IRQ transform: (B1,I3,P0) -> 17  <-Bus 1, interface/slot 3, function/port 0 (Emulex HBA), lspci will help identify this HBA in a future example.
PCI->APIC IRQ transform: (B1,I4,P0) -> 27  <-Bus 1, interface/slot 4, function/port 0 (Emulex HBA), lspci will help identify this HBA in a future example.
PCI->APIC IRQ transform: (B1,I5,P0) -> 24
PCI->APIC IRQ transform: (B1,I5,P1) -> 25
Serial driver version 5.05c (2001-07-08) with MANY_PORTS MULTIPORT SHARE_IRQ SERIAL_PCI ISAPNP enabled
ide: Assuming 33MHz PCI bus speed for PIO modes; override with idebus=xx
ServerWorks OSB4: IDE controller on PCI bus 00 dev 79
pci_hotplug: PCI Hot Plug PCI Core version: 0.3
sym53c8xx: at PCI bus 1, device 5, function 0
sym53c8xx: at PCI bus 1, device 5, function 1

We conclude by using lspci to depict the PCI bus devices.

[root@cyclops lpfc]# lspci
00:00.0 Host bridge: ServerWorks CNB20LE Host Bridge (rev 06)
00:00.1 Host bridge: ServerWorks CNB20LE Host Bridge (rev 06)
00:02.0 Ethernet controller: Intel Corporation 82557 [Ethernet Pro 100] (rev 08)
00:07.0 VGA compatible controller: ATI Technologies Inc Rage XL (rev 27)
00:08.0 Ethernet controller: Intel Corporation 82557 [Ethernet Pro 100] (rev 08)
00:0f.0 ISA bridge: ServerWorks OSB4 South Bridge (rev 50)
00:0f.1 IDE interface: ServerWorks OSB4 IDE Controller
00:0f.2 USB Controller: ServerWorks OSB4/CSB5 OHCI USB Controller (rev 04)
01:03.0 Fibre Channel: Emulex Corporation: Unknown device f800 (rev 02)
01:04.0 Fibre Channel: Emulex Corporation: Unknown device f800 (rev 02)
01:05.0 SCSI storage controller: LSI Logic / Symbios Logic (formerly NCR) 53c1010 Ultra3 SCSI Adapter (rev 01)
01:05.1 SCSI storage controller: LSI Logic / Symbios Logic (formerly NCR) 53c1010 Ultra3 SCSI Adapter (rev 01)
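The IRQ transform lines in dmesg and the addresses in lspci describe the same slots, but dmesg prints bus and slot in decimal while lspci prints them in hexadecimal (I15 in dmesg is 0f in lspci). The following Python sketch shows one way to line the two up. The function names and the abbreviated sample strings are our own illustration, not part of any tool, and the sketch matches only on bus and slot, leaving the P value uninterpreted.

```python
import re

# Matches lines such as "PCI->APIC IRQ transform: (B1,I3,P0) -> 17".
IRQ_LINE = re.compile(r"\(B(\d+),I(\d+),P(\d+)\)\s*->\s*(\d+)")


def irq_transforms(dmesg_text):
    """Return a list of (bus:slot prefix, P value, IRQ) tuples."""
    results = []
    for bus, slot, pin, irq in IRQ_LINE.findall(dmesg_text):
        # dmesg prints decimal; lspci prints two-digit hex (15 -> 0f).
        prefix = f"{int(bus):02x}:{int(slot):02x}"
        results.append((prefix, int(pin), int(irq)))
    return results


def devices_at(prefix, lspci_text):
    """Return the lspci lines whose address starts with bus:slot."""
    return [line for line in lspci_text.splitlines()
            if line.startswith(prefix + ".")]


# Abbreviated samples copied from the output above.
dmesg_sample = (
    "PCI->APIC IRQ transform: (B0,I15,P0) -> 33\n"
    "PCI->APIC IRQ transform: (B1,I3,P0) -> 17\n"
    "PCI->APIC IRQ transform: (B1,I4,P0) -> 27\n"
)
lspci_sample = (
    "00:0f.0 ISA bridge: ServerWorks OSB4 South Bridge (rev 50)\n"
    "01:03.0 Fibre Channel: Emulex Corporation: Unknown device f800 (rev 02)\n"
    "01:04.0 Fibre Channel: Emulex Corporation: Unknown device f800 (rev 02)\n"
)

for prefix, pin, irq in irq_transforms(dmesg_sample):
    for line in devices_at(prefix, lspci_sample):
        print(f"IRQ {irq:>2}  {line}")
```

With the samples above, IRQ 17 and IRQ 27 pair up with the two Emulex entries at 01:03.0 and 01:04.0.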

Note that lspci uses /proc/bus/pci to build its device tree. All devices have a descriptor commonly referred to as the Vital Product Data (VPD), and lspci uses the source decode list at /usr/share/pci.ids to determine the devices' characteristics (note that the latest PCI IDs are maintained at http://pciids.sf.net/). By using lspci with the -v flag, a user can obtain complete details of PCI devices including subsystem, flags, memory, I/O ports, and expansion ROM locations. However, using the -t flag with the -v flag yields only the basic description. Although much of the -v detail is missing when it is combined with -t, the -t flag remains a very useful option because it delivers a tree view of the devices seen from the master bus.

For example, the following lspci -t -v output is from the same machine as the dmesg and lspci output discussed previously. Notice how lspci reports the Emulex HBAs as unknown, with device ID f800. This is because we are using a 2001 pci.ids update on our test server. Our latest production lab machine has a mid-August 2004 version loaded. With the pci.ids file dated 2004-08-24, f800 is decoded as "LP8000 Fibre Channel Host Adapter." Note there is no penalty for having an older pci.ids file. However, be prepared for lots of "unknown" devices to appear in lspci's output. In addition, the pci.ids file is updated with each distribution, but entries can also be researched manually at http://pciids.sourceforge.net/.

[root@cyclops root]# lspci -t -v
-+-[01]-+-03.0  Emulex Corporation: Unknown device f800
 |      +-04.0  Emulex Corporation: Unknown device f800
 |      +-05.0  LSI Logic / Symbios Logic (formerly NCR) 53c1010 Ultra3 SCSI Adapter
 |      \-05.1  LSI Logic / Symbios Logic (formerly NCR) 53c1010 Ultra3 SCSI Adapter
 \-[00]-+-00.0  ServerWorks CNB20LE Host Bridge
        +-00.1  ServerWorks CNB20LE Host Bridge
        +-02.0  Intel Corporation 82557 [Ethernet Pro 100]
        +-07.0  ATI Technologies Inc Rage XL
        +-08.0  Intel Corporation 82557 [Ethernet Pro 100]
        +-0f.0  ServerWorks OSB4 South Bridge
        +-0f.1  ServerWorks OSB4 IDE Controller
        \-0f.2  ServerWorks OSB4/CSB5 OHCI USB Controller
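To see why an older pci.ids file produces "Unknown device" entries, it helps to look at how the file is laid out: vendor lines start in the first column with a four-digit hexadecimal ID, and the devices belonging to that vendor follow on lines indented by one tab. The Python sketch below is a simplified reader for that layout, not lspci's actual implementation. The function names are hypothetical, subsystem lines are skipped, and the two sample strings are hand-written stand-ins for an old and a new pci.ids (the vendor ID 10df for Emulex is supplied for illustration and should be checked against your own file).

```python
def parse_pci_ids(text):
    """Map (vendor_id, device_id) -> (vendor_name, device_name)."""
    table = {}
    vendor_id = vendor_name = None
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # blank line or comment
        if line.startswith("C "):
            break  # device-class section; not needed for this lookup
        if line.startswith("\t\t"):
            continue  # subsystem entries are ignored in this sketch
        if line.startswith("\t"):
            if vendor_id is None:
                continue  # device line with no vendor above it
            device_id, device_name = line.strip().split(None, 1)
            table[(vendor_id, device_id.lower())] = (vendor_name, device_name)
        else:
            vendor_id, vendor_name = line.split(None, 1)
            vendor_id = vendor_id.lower()
    return table


def describe(table, vendor_id, device_id):
    """Return a description, or 'Unknown device xxxx' if the ID is absent."""
    entry = table.get((vendor_id, device_id))
    if entry is None:
        return f"Unknown device {device_id}"
    return f"{entry[0]} {entry[1]}"


# Hypothetical excerpts: the older file lacks the f800 entry.
OLD_IDS = "# sample\n10df  Emulex Corporation\n"
NEW_IDS = ("# sample\n10df  Emulex Corporation\n"
           "\tf800  LP8000 Fibre Channel Host Adapter\n")

print(describe(parse_pci_ids(OLD_IDS), "10df", "f800"))
print(describe(parse_pci_ids(NEW_IDS), "10df", "f800"))
```

The first lookup falls back to the "Unknown device f800" wording, while the second finds the LP8000 entry, which mirrors the difference between the 2001 and 2004 files described above.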

Having a good understanding of the bus structure is critical to finding the device that is failing. In the next example, SCSI disk errors are reported by syslogd, which by default writes to /var/log/messages (we can confirm where syslogd writes by viewing /etc/syslog.conf or by sending a test message with the logger command); the same errors can be viewed through the dmesg buffer. The errors look like the following:

SCSI disk error : host 2 channel 0 id 0 lun 2 return code = 70022
 I/O error: dev 08:80, sector 5439488
SCSI disk error : host 2 channel 0 id 0 lun 2 return code = 70022
 I/O error: dev 08:80, sector 5439552

Viewing the same data in /var/log/messages, note that a timestamp is included.

Feb  4 14:04:37 cyclops kernel: SCSI disk error : host 2 channel 0 id 0 lun 2 return code = 70022
Feb  4 14:04:37 cyclops kernel:  I/O error: dev 08:80, sector 5439488
Feb  4 14:04:37 cyclops kernel: SCSI disk error : host 2 channel 0 id 0 lun 2 return code = 70022
Feb  4 14:04:37 cyclops kernel:  I/O error: dev 08:80, sector 5439552

The previous error log reports a disk I/O error on host 2, channel 0, id 0, lun 2, with a return code of 70022. The next task is to decipher the return code. To show how, we first break down an I/O error from a different host in our lab, which received a return code of 2603007f. Note that the return code is always 4 bytes (8 hexadecimal digits) long, so in our previous example, 70022 is actually 00070022.
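Pulling the address and return code out of a log line by hand is error-prone, especially when the leading zeros are dropped. The following Python sketch extracts the host, channel, id, and lun from a "SCSI disk error" line and pads the return code to eight hexadecimal digits. The regular expression and function name are our own and assume the message wording shown in the log excerpt above.

```python
import re

# Matches the kernel message format shown in the log excerpt above.
SCSI_ERROR = re.compile(
    r"SCSI disk error\s*:\s*host (\d+) channel (\d+) id (\d+) lun (\d+)"
    r"\s+return code = ([0-9a-fA-F]+)"
)


def parse_scsi_error(line):
    """Return the SCSI address and zero-padded return code, or None."""
    match = SCSI_ERROR.search(line)
    if match is None:
        return None
    host, channel, scsi_id, lun = (int(g) for g in match.groups()[:4])
    code = int(match.group(5), 16)  # the kernel prints the code in hex
    return {
        "host": host,
        "channel": channel,
        "id": scsi_id,
        "lun": lun,
        "return_code": f"{code:08x}",  # always show all four bytes
    }


sample = ("Feb  4 14:04:37 cyclops kernel: SCSI disk error : "
          "host 2 channel 0 id 0 lun 2 return code = 70022")
print(parse_scsi_error(sample))
```

For the sample line, the return code field comes back as 00070022, which makes the four bytes easy to read off.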

The following SCSI I/O error comes from a syslog entry on that host. The error code is 2603007f, and we break down every component of the entry.

Month Day Hour:Min:Sec localhost kernel: SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 2603007f
Month Day Hour:Min:Sec localhost kernel: scsidisk I/O error: dev 08:01, sector 10
Month Day Hour:Min:Sec localhost kernel: raid1: Disk failure on sda1, disabling device.

As shown here, the first entry in the syslog with respect to the SCSI error contains the device that is failing. The next challenge is breaking down the return code. In this case, we have a SCSI disk error on host 0, channel 0, id 0, lun 0, which breaks down as follows:

Host	=	Host Bus Adapter
Channel	=	SCSI Adapter Channel
ID	=	SCSI ID
LUN	=	Logical Unit
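The dev 08:01 field in the second syslog line can also be tied to the sda1 named in the third line. The Python sketch below shows the arithmetic under three assumptions that hold for the classic Linux SCSI disk driver but should be verified on your system: the two numbers are the major and minor device numbers printed in hexadecimal, major 8 is the first block of SCSI disks, and each disk owns 16 consecutive minors (0 for the whole disk, 1 through 15 for partitions). The function name is hypothetical.

```python
def sd_name(dev):
    """Translate a 'major:minor' pair (hex, as logged) to an sd device name."""
    major, minor = (int(part, 16) for part in dev.split(":"))
    if major != 8:
        raise ValueError("this sketch only handles SCSI disk major 8")
    # 16 minors per disk: the quotient picks the disk, the remainder the partition.
    disk, partition = divmod(minor, 16)
    name = "sd" + chr(ord("a") + disk)
    return name + (str(partition) if partition else "")


print(sd_name("08:01"))  # device from the raid1 example
print(sd_name("08:80"))  # device from the earlier lab error
```

Under these assumptions, 08:01 maps to sda1, which agrees with the raid1 message, and 08:80 (minor 128) maps to the whole disk sdi.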

It is important to break down the return code to determine root cause. The first entry in the previous example not only determines the location of the error but also provides the return code. Breaking down the return code of 2603007f is not very difficult, as long as bit order is maintained. Bit order is explained in greater detail in Chapter 6.

To break down the return code, we must look at scsi_lib.c from the SCSI source included in a current Linux kernel release. Reviewing the source, we see that the breakdown of the SCSI hardware address is as follows:

printk("SCSI error : <%d %d %d %d> return code = 0x%x\n",
       cmd->device->host->host_no,
       cmd->device->channel,
       cmd->device->id,
       cmd->device->lun, result);

Upon reviewing the scsi_ioctl.c code, we find the following:

 *       If the SCSI command succeeds then 0 is returned.
 *       Positive numbers returned are the compacted SCSI error codes (4
 *       bytes in one int) where the lowest byte is the SCSI status. See
 *       the drivers/scsi/scsi.h file for more information on this.

While reviewing drivers/scsi/scsi.h, we determine that the driver design for SCSI is changing and that the file provides a reference point for many SCSI subsets. However, the scope of this chapter excludes building drivers and focuses on device failure and status return codes.

Now that we have a general understanding of where the device sits on the PCI bus, we need to understand the order of the return code bytes. The return code is made up of four bytes, appearing in the order of 3, 2, 1, 0 and breaking down as follows:

   lsb |    ...    |    ...    | msb
 ======|===========|===========|============
 status | sense key | host code | driver byte

(far left)   Byte 3:    SCSI driver status byte
             Byte 2:    Host adapter driver status byte
             Byte 1:    Message following the status byte returned by the drive
(far right)  Byte 0:    Status byte returned by the drive (bits 5-1)

Now that we have defined the byte locations, we need to define the possible values to be able to decode the return code. Again, upon reviewing include/scsi/scsi.h, the user finds that the previous bytes are defined as follows:

...
/*
 *  SCSI Architecture Model (SAM) Status codes. Taken from SAM-3 draft
 *  T10/1561-D Revision 4 Draft dated 7th November 2002.
 */
#define SAM_STAT_GOOD            0x00
#define SAM_STAT_CHECK_CONDITION 0x02
#define SAM_STAT_CONDITION_MET   0x04
#define SAM_STAT_BUSY            0x08
#define SAM_STAT_INTERMEDIATE    0x10
#define SAM_STAT_INTERMEDIATE_CONDITION_MET 0x14
#define SAM_STAT_RESERVATION_CONFLICT 0x18
#define SAM_STAT_COMMAND_TERMINATED 0x22 /* obsolete in SAM-3 */
#define SAM_STAT_TASK_SET_FULL   0x28
#define SAM_STAT_ACA_ACTIVE      0x30
#define SAM_STAT_TASK_ABORTED    0x40

/** scsi_status_is_good - check the status return.
 *
 * @status: the status passed up from the driver (including host and
 *          driver components)
 *
 * This returns true for known good conditions that may be treated as
 * command completed normally
 */
static inline int scsi_status_is_good(int status)
{
        /*
         * FIXME: bit0 is listed as reserved in SCSI-2, but is
         * significant in SCSI-3. For now, we follow the SCSI-2
         * behaviour and ignore reserved bits.
         */
        status &= 0xfe;
        return ((status == SAM_STAT_GOOD) ||
                (status == SAM_STAT_INTERMEDIATE) ||
                (status == SAM_STAT_INTERMEDIATE_CONDITION_MET) ||
                /* FIXME: this is obsolete in SAM-3 */
                (status == SAM_STAT_COMMAND_TERMINATED));
}

/*
 *  Status codes. These are deprecated as they are shifted 1 bit right
 *  from those found in the SCSI standards. This causes confusion for
 *  applications that are ported to several OSes. Prefer SAM Status codes
 *  above.
 */
#define GOOD                 0x00
#define CHECK_CONDITION      0x01
#define CONDITION_GOOD       0x02
#define BUSY                 0x04
#define INTERMEDIATE_GOOD    0x08
#define INTERMEDIATE_C_GOOD  0x0a
#define RESERVATION_CONFLICT 0x0c
#define COMMAND_TERMINATED   0x11
#define QUEUE_FULL           0x14

#define STATUS_MASK          0x3e

/*
 *  SENSE KEYS
 */
#define NO_SENSE             0x00
#define RECOVERED_ERROR      0x01
#define NOT_READY            0x02
#define MEDIUM_ERROR         0x03
#define HARDWARE_ERROR       0x04
#define ILLEGAL_REQUEST      0x05
#define UNIT_ATTENTION       0x06
#define DATA_PROTECT         0x07
#define BLANK_CHECK          0x08
#define COPY_ABORTED         0x0a
#define ABORTED_COMMAND      0x0b
#define VOLUME_OVERFLOW      0x0d
#define MISCOMPARE           0x0e

/*
 * Host byte codes
 */
#define DID_OK           0x00   /* NO error                                */
#define DID_NO_CONNECT   0x01   /* Couldn't connect before timeout period  */
#define DID_BUS_BUSY     0x02   /* BUS stayed busy through time out period */
#define DID_TIME_OUT     0x03   /* TIMED OUT for other reason              */
#define DID_BAD_TARGET   0x04   /* BAD target.                             */
#define DID_ABORT        0x05   /* Told to abort for some other reason     */
#define DID_PARITY       0x06   /* Parity error                            */
#define DID_ERROR        0x07   /* Internal error                          */
#define DID_RESET        0x08   /* Reset by somebody                       */
#define DID_BAD_INTR     0x09   /* Got an interrupt we weren't expecting   */
#define DID_PASSTHROUGH  0x0a   /* Force command past mid-layer            */
#define DID_SOFT_ERROR   0x0b   /* The low level driver just wish a retry  */
#define DID_IMM_RETRY    0x0c   /* Retry without decrementing retry count  */

#define DRIVER_OK        0x00   /* Driver status                           */

/*
 *  These indicate the error that occurred, and what is available.
 */
#define DRIVER_BUSY         0x01
#define DRIVER_SOFT         0x02
#define DRIVER_MEDIA        0x03
#define DRIVER_ERROR        0x04
#define DRIVER_INVALID      0x05
#define DRIVER_TIMEOUT      0x06
#define DRIVER_HARD         0x07
#define DRIVER_SENSE        0x08

#define SUGGEST_RETRY       0x10
#define SUGGEST_ABORT       0x20
#define SUGGEST_REMAP       0x30
#define SUGGEST_DIE         0x40
#define SUGGEST_SENSE       0x80
#define SUGGEST_IS_OK       0xff

#define DRIVER_MASK         0x0f
#define SUGGEST_MASK        0xf0
...

The return code from our previously mentioned example had a value of 2603007f, and it breaks down as follows:

26 = SCSI driver byte         = 0010 0110 = RR10011R = Reserved code.
03 = Host adapter driver byte (DID_TIME_OUT - TIMED OUT for other reason)
00 = Message byte             (COMMAND_COMPLETE)
7f = Status byte              (bogus value), bits 5-1 = 1f
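The same split can be done mechanically. The Python sketch below is our own illustration, not kernel code: it takes the four bytes in the order listed earlier (byte 3 driver, byte 2 host, byte 1 message, byte 0 status), looks up the host byte in a table copied from the DID_ definitions, and applies STATUS_MASK with a one-bit right shift to obtain bits 5-1 of the status byte. The function and dictionary names are hypothetical, and the driver byte is reported raw rather than interpreted.

```python
# Host byte codes, copied from the DID_* definitions in scsi.h.
HOST_CODES = {
    0x00: "DID_OK", 0x01: "DID_NO_CONNECT", 0x02: "DID_BUS_BUSY",
    0x03: "DID_TIME_OUT", 0x04: "DID_BAD_TARGET", 0x05: "DID_ABORT",
    0x06: "DID_PARITY", 0x07: "DID_ERROR", 0x08: "DID_RESET",
    0x09: "DID_BAD_INTR", 0x0a: "DID_PASSTHROUGH", 0x0b: "DID_SOFT_ERROR",
    0x0c: "DID_IMM_RETRY",
}

# Deprecated (shifted) status codes, also copied from scsi.h.
STATUS_CODES = {
    0x00: "GOOD", 0x01: "CHECK_CONDITION", 0x02: "CONDITION_GOOD",
    0x04: "BUSY", 0x08: "INTERMEDIATE_GOOD", 0x0a: "INTERMEDIATE_C_GOOD",
    0x0c: "RESERVATION_CONFLICT", 0x11: "COMMAND_TERMINATED",
    0x14: "QUEUE_FULL",
}

STATUS_MASK = 0x3e  # selects bits 5-1 of the status byte


def decode_return_code(code):
    """Split a 32-bit SCSI return code into its four bytes."""
    driver = (code >> 24) & 0xff   # byte 3
    host = (code >> 16) & 0xff     # byte 2
    message = (code >> 8) & 0xff   # byte 1
    status = code & 0xff           # byte 0
    status_bits = (status & STATUS_MASK) >> 1
    return {
        "driver": driver,
        "host": host,
        "host_name": HOST_CODES.get(host, "unknown"),
        "message": message,
        "status": status,
        "status_bits": status_bits,
        "status_name": STATUS_CODES.get(status_bits, "unknown"),
    }


decoded = decode_return_code(0x2603007F)
for key, value in decoded.items():
    shown = f"0x{value:02x}" if isinstance(value, int) else value
    print(f"{key:12} {shown}")
```

For 2603007f this reports a host byte of 0x03 (DID_TIME_OUT) and status bits of 0x1f, which has no entry in the status table and so matches the "bogus value" note above.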

Additional information to help break down SCSI error detection can be found for the following topics at http://tldp.org/HOWTO/SCSI-Generic-HOWTO/index.html (search for "scsi"; the documents will change over time):

SCSI programming HOWTO
Decoding SCSI error status
Sense codes and sense code qualifiers

Now that we have discussed where to find status codes and how to decode return codes for SCSI errors, we can decipher our lab error code of 00070022.

00 = define GOOD: 0x00, Driver status code is Good. We can now tell that it is not a driver issue.
07 = define DATA_PROTECT: 0x07, Sense key data points us in the right direction. It has informed us that the suspect drive has set a read/write exclusive lock; however, we need to know why this has occurred.
00 = define DID_OK: 0x00 NO error. No error with respect to Host.
22 = Breakdown of two nibbles:
- 0x02 = define DRIVER_SOFT.
- 0x20 = define SUGGEST_ABORT.

Note that in our previous lab case, the return code indicated a driver problem with the final byte; however, the indication of a driver issue is a little misleading. The actual cause is a drive issue. The sense key was the critical piece of data that helped to determine the source of the problem. Although the drive appeared to be visible, based upon the sense key data, we can conclude that the drive was locked. In fact, we did lock all the read and write I/Os to a drive while the LUN remained visible to the host, thus causing the error return codes to be misleading.