After we determine the applicable error and decoding information, we are ready to determine whether replacement is necessary and possible. Replacement might not be necessary if a good workaround is applicable. Furthermore, it might be impossible to replace a device for a number of reasons, most importantly the need to keep the system online.
To determine whether a device can be replaced online or offline, we must first define these terms. Within the computer industry, the terms online and offline simply refer to the application's status. In short, the primary goal should always be to keep the application online. When a disk failure occurs, unless you are using a large storage array, such as an HP XP, an EMC array, or an IBM Shark, you must be able to handle the failure through other means, such as a hardware RAID controller or software RAID (for example, logical volume manager mirroring).
If errors are being logged for disk I/O failures, as shown previously, the easiest thing to do is find the device driver that controls the device and determine the impact of removing it.
In the previous example, we saw an I/O error on a LUN within a storage array connected to an Emulex HBA. Installing the lpfcdd.o driver in a 2.4.9-e.10 kernel allowed access to many LUNs for the HP storage array through the Emulex HBA. It is critical to understand how to map the LUNs back to the driver that allows access. As with any SCSI LUN, the SCSI I/O driver allows read and write I/O; however, the lpfcdd driver allows the path to be available for all the LUNs down the HBA path.
By running dmesg after the command insmod lpfcdd, we can see the newly found disks, as depicted in the following example.
[root@cyclops root]# insmod lpfcdd
Using /lib/modules/2.4.9-e.10custom-gt/kernel/drivers/scsi/lpfcdd.o
Warning: loading /lib/modules/2.4.9-e.10custom-gt/kernel/drivers/scsi/lpfcdd.o will taint the kernel: no license
(Note: Tainting of kernel is discussed in Chapter 2)
[root@cyclops log]# dmesg
Emulex LightPulse FC SCSI/IP 4.20p
PCI: Enabling device 01:03.0 (0156 -> 0157)
!lpfc0:045:Vital Product Data Data: 82 23 0 36
!lpfc0:031:Link Up Event received Data: 1 1 0 0
PCI: Enabling device 01:04.0 (0156 -> 0157)
scsi2 : Emulex LPFC (LP8000) SCSI on PCI bus 01 device 18 irq 17
scsi3 : Emulex LPFC (LP8000) SCSI on PCI bus 01 device 20 irq 27
  Vendor: HP Model: OPEN-9-CVS-CM Rev: 2110   <--- Scsi_scan.c code scanning PCI bus to find devices...
  Type: Direct-Access ANSI SCSI revision: 02
  Vendor: HP Model: OPEN-9-CVS-CM Rev: 2110
  Type: Direct-Access ANSI SCSI revision: 02
  Vendor: HP Model: OPEN-8*13 Rev: 2110
  Type: Direct-Access ANSI SCSI revision: 02
Attached scsi disk sdg at scsi2, channel 0, id 0, lun 0
Attached scsi disk sdh at scsi2, channel 0, id 0, lun 1
Attached scsi disk sdi at scsi2, channel 0, id 0, lun 2
SCSI device sdg: 1638720 512-byte hdwr sectors (839 MB)
 sdg: sdg1
SCSI device sdh: 1638720 512-byte hdwr sectors (839 MB)
 sdh: sdh1
SCSI device sdi: 186563520 512-byte hdwr sectors (95521 MB)
 sdi: unknown partition table   <--- VERY Important... MBR is discussed in great detail in Chapter 6.
We can force I/O on a drive by using the dd command. For example, dd if=/dev/sdi of=/dev/null bs=1024k easily generates 55+ MBps of read I/O on a quality array device. Before looking at that I/O load, some general background on the environment must first be established.
The HBA used on our Linux server connects to a Brocade switch on port 15, with port 1 going to an upstream ISL for its storage allocation. By using the switchshow command, we can see the WWN of the HBA connected to port 15, and by using the portperfshow command, we can determine the exact performance of our previous dd command. The following output captures the heavy I/O load that is in place just before we induce a total I/O failure. First, switchshow illustrates the HBA, followed by portperfshow, which illustrates the performance.
roadrunner:admin> switchshow
switchName:    roadrunner
switchType:    5.4
switchState:   Online
switchMode:    Interop
switchRole:    Subordinate
switchDomain:  123
switchId:      fffc7b
switchWwn:     10:00:00:60:69:10:6b:0e
switchBeacon:  OFF
Zoning:        ON (STC-zoneset-1)
port  0: sw Online    E-Port 10:00:00:60:69:10:64:7e "coyote" (downstream)
port  1: sw Online    E-Port 10:00:00:60:69:10:2b:37 "pepe" (upstream)   <--- Upstream ISL to Storage Switch.
port  2: -- No_Module
port  3: sw Online    F-Port 20:00:00:05:9b:a6:65:40
port  4: sw Online    F-Port 50:06:0b:00:00:0a:b8:9e
port  5: sw Online    F-Port 50:00:0e:10:00:00:96:e4
port  6: sw Online    F-Port 50:00:0e:10:00:00:96:ff
port  7: sw Online    F-Port 50:00:0e:10:00:00:96:e5
port  8: sw Online    F-Port 50:00:0e:10:00:00:96:fe
port  9: sw Online    F-Port 20:00:00:e0:69:c0:81:b3
port 10: sw Online    L-Port 1 private, 1 phantom
port 11: sw No_Sync
port 12: sw Online    L-Port 1 private, 1 phantom
port 13: sw No_Sync
port 14: sw No_Light
port 15: sw Online    F-Port 10:00:00:00:c9:24:13:27   <--- HBA on Linux host
roadrunner:admin> portperfshow
  0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15
----------------------------------------------------------------------------------
  0  57m    0    0    0    0    0    0    0    0    0    0    0    0    0  57m
  0  55m    0    0    0    0    0    0    0    0    0    0    0    0    0  55m
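When many samples scroll past, it can help to pick the busy ports out of a portperfshow row programmatically. The following Python sketch is illustrative only: the function names are hypothetical, and it assumes that a trailing k, m, or g on a cell denotes decimal kilo-, mega-, or gigabytes per second, which the output above does not state explicitly.

```python
def parse_rate(token):
    """Convert a portperfshow cell such as '57m' or '0' to bytes per second."""
    # Assumption: suffixes are decimal multipliers (k, m, g).
    multipliers = {"k": 10**3, "m": 10**6, "g": 10**9}
    suffix = token[-1].lower()
    if suffix in multipliers:
        return float(token[:-1]) * multipliers[suffix]
    return float(token)


def busy_ports(row, threshold=1.0):
    """Return the port numbers whose rate in one sample row exceeds threshold."""
    rates = [parse_rate(cell) for cell in row.split()]
    return [port for port, rate in enumerate(rates) if rate > threshold]


# First sample row from the portperfshow output above.
sample_row = "0 57m 0 0 0 0 0 0 0 0 0 0 0 0 0 57m"
print(busy_ports(sample_row))
```

For the sample row, the only ports carrying traffic are the upstream ISL and the HBA port, which is what we expect when dd is the sole source of load.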
To create a complete hardware failure, we disable port 15, thus halting all I/O. By using dmesg, we can capture what the kernel is logging about the failure.
After disabling port 15, dmesg logs the following:
!lpfc0:031:Link Down Event received Data: 2 2 0 20
After waiting 60 seconds for the I/O acknowledgement, the Emulex driver abandons the bus, forcing the SCSI layer to recognize the I/O failure. The dmesg command shows the following:
[root@cyclops log]# dmesg
!lpfc0:120:Device disappeared, nodev timeout: Data: 780500 0 0 1e
 I/O error: dev 08:80, sector 5140160
 I/O error: dev 08:80, sector 5140224
 I/O error: dev 08:80, sector 5140160
 I/O error: dev 08:80, sector 5140224
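The dev 08:80 field in these messages is the device's major and minor number in hexadecimal. As a minimal sketch of how to map it back to a disk name, the following Python function assumes the conventional SCSI disk numbering (major 8, with 16 minor numbers reserved per disk); the function and constant names are ours, not part of any kernel interface.

```python
import string

SCSI_DISK_MAJOR = 8      # first SCSI disk major number
MINORS_PER_DISK = 16     # minor 0 of each group is the whole disk, 1-15 are partitions


def decode_sd_dev(dev):
    """Translate a 2.4-style 'major:minor' hex pair into an sd device name."""
    major_hex, minor_hex = dev.split(":")
    major, minor = int(major_hex, 16), int(minor_hex, 16)
    if major != SCSI_DISK_MAJOR:
        raise ValueError("not the first SCSI disk major: %d" % major)
    disk_index, partition = divmod(minor, MINORS_PER_DISK)
    name = "sd" + string.ascii_lowercase[disk_index]
    return name if partition == 0 else name + str(partition)


print(decode_sd_dev("08:80"))
```

Minor 0x80 is 128, and 128 divided by 16 is 8 with no remainder, so the errors refer to the whole of the ninth SCSI disk, sdi, which matches the device used in the dd test.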
The user shell prompt looks similar to the following:
[root@cyclops proc]# dd if=/dev/sdi of=/dev/null bs=1024k
dd: reading '/dev/sdi': Input/output error
12067+1 records in
12067+1 records out
[root@cyclops proc]#
In the previous case, the Emulex driver detected a Fibre Channel Protocol (FCP) failure and deallocated the storage. After the path is restored, the link and I/O access return to normal. To demonstrate I/O returning, we simply issue the portenable command for port 15 on the Brocade switch, thus enabling the HBA FCP connection. The following dd command demonstrates the I/O returning without the host going offline. Note that in this case, the application would have lost access to the device, resulting in an offline condition for the application.
[root@cyclops proc]# dd if=/dev/sdi of=/dev/null bs=1024k
57+0 records in
57+0 records out
[root@cyclops proc]#
In the meantime, while the portenable and dd commands are being issued, dmesg reports the following:
[root@cyclops proc]# dmesg
!lpfc0:031:Link Up Event received Data: 3 3 0 20
!lpfc0:031:Link Up Event received Data: 5 5 0 74
!lpfc1:031:Link Up Event received Data: 1 1 0 0
!lpfc1:031:Link Up Event received Data: 3 3 0 74
[root@cyclops proc]#
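On a busy host, the link events are easy to lose among other kernel messages. The following Python sketch shows one way to reduce a captured log to the last reported link state per adapter. It matches only the message text seen in this chapter's examples; the function name is hypothetical, and other driver versions may word these events differently.

```python
import re

# Matches lines such as "!lpfc0:031:Link Up Event received Data: 3 3 0 20".
LINK_EVENT = re.compile(r"!(lpfc\d+):\d+:Link (Up|Down) Event received")


def link_states(log_lines):
    """Return the last reported link state for each lpfc adapter."""
    states = {}
    for line in log_lines:
        match = LINK_EVENT.search(line)
        if match:
            adapter, state = match.groups()
            states[adapter] = state   # later events overwrite earlier ones
    return states


# Lines taken from the dmesg examples in this section.
sample = [
    "!lpfc0:031:Link Down Event received Data: 2 2 0 20",
    "!lpfc0:120:Device disappeared, nodev timeout: Data: 780500 0 0 1e",
    "!lpfc0:031:Link Up Event received Data: 3 3 0 20",
    "!lpfc1:031:Link Up Event received Data: 1 1 0 0",
]
print(link_states(sample))
```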
In the previous condition, the entire device was offline. Although recovery was simple, applications would have failed. Depending on application parameters, such as buffer cache, remaining I/O operations could cause what looks like a system hang. Hangs of this nature are discussed in Chapter 2, "System Hangs and Panics." A complete hardware device failure on an entire bus, in a SAN, or on some other type of storage network is usually quick to isolate and recover from with the tactics described in this chapter; however, data integrity is another matter. Data integrity is beyond the scope of this chapter.
As mentioned earlier, logical path failures are easier to manage than the failure of a given device on a logical path. In the following example, we block the I/O to a given LUN on a bus, as done previously in this chapter to illustrate return code 70022. However, now the goal is to determine the best course of corrective action.
Repeating the same LUN I/O block as before, the dd read test results in the following errors:
[root@cyclops root]# dd if=/dev/sdi of=/dev/null bs=1024k
dd: reading '/dev/sdi': Input/output error
6584+1 records in
6584+1 records out
Notice that the prompt does not return; instead, the process is hung in kernel space (discussed in Chapter 8, "Linux Processes: Structure, Hangs, and Core Dumps"). This behavior results because the kernel knows the size of the disk and because we set the block size so large; the remaining I/Os should fail, and the process will then die.
[root@cyclops root]# dmesg
SCSI disk error : host 2 channel 0 id 0 lun 2 return code = 70022
 I/O error: dev 08:80, sector 13485760
SCSI disk error : host 2 channel 0 id 0 lun 2 return code = 70022
 I/O error: dev 08:80, sector 13485824
SCSI disk error : host 2 channel 0 id 0 lun 2 return code = 70022
 I/O error: dev 08:80, sector 13485762
SCSI disk error : host 2 channel 0 id 0 lun 2 return code = 70022
 I/O error: dev 08:80, sector 13485826
SCSI disk error : host 2 channel 0 id 0 lun 2 return code = 70022
 I/O error: dev 08:80, sector 13485760
~~~Errors continue
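As a reminder of how the return code breaks down, the following Python sketch splits the logged value into its four one-byte fields. It assumes that the value is printed in hexadecimal and that the fields are laid out, from high byte to low byte, as driver, host, message, and status, which is the layout used by the Linux SCSI layer's result word. The function name is ours, and the sketch only performs the split; interpreting each byte is the decoding step covered earlier in the chapter.

```python
def split_scsi_result(code_hex):
    """Split a hex SCSI result word into its four one-byte fields."""
    value = int(code_hex, 16)
    return {
        "driver": (value >> 24) & 0xFF,   # bits 31-24
        "host": (value >> 16) & 0xFF,     # bits 23-16
        "msg": (value >> 8) & 0xFF,       # bits 15-8
        "status": value & 0xFF,           # bits 7-0
    }


fields = split_scsi_result("70022")
print({name: hex(byte) for name, byte in fields.items()})
```

For 70022, the driver and message bytes are zero, the host byte is 0x07, and the status byte is 0x22.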
While the read I/O errors continue, we have already decoded the return code. This informed us that the disk is read/write protected; thus, we must find a way to restore I/O or move the data to a new location. To get all the information about the process accessing the device, we run the command ps -ef | grep dd. This command enables us to confirm the PID of 8070. After the PID is established, we go to the PID directory found under the /proc filesystem and check the status of the process using the following method:
[root@cyclops /]# cd /proc/8070
[root@cyclops 8070]# cat status
Name:   dd
State:  D (disk sleep)   <--- Note the state. The state should be in R (running) condition.
Pid:    8070
PPid:   8023
TracerPid:      0
Uid:    0 0 0 0
Gid:    0 0 0 0
TGid:   8070
FDSize: 256
Groups: 0 1 2 3 4 6 10
VmSize: 1660 kB
VmLck:  0 kB
VmRSS:  580 kB
VmData: 28 kB
VmStk:  24 kB
VmExe:  28 kB
VmLib:  1316 kB
SigPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000000000
SigCgt: 0000000000001206
CapInh: 0000000000000000
CapPrm: 00000000fffffeff
CapEff: 00000000fffffeff
[root@cyclops 8070]# cat cpu
cpu 8 8534
cpu0 8 8527
cpu1 0 7
[root@cyclops 8070]# cat cmdline
ddif/dev/sdiof/dev/nullbs1024k
The status of the process on the device is disk sleep, meaning that the process is waiting on the device to return the outstanding request before processing the next one. In this condition, I/O errors will not continue forever; they will stop after the read I/O block has expired or timed out at the SCSI layer. However, if an application continues to queue multiple I/O threads to the device, removing any device driver from the kernel will be impossible.
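The same check can be scripted. The following Python sketch pulls the State field out of the text of /proc/PID/status and splits the contents of /proc/PID/cmdline into arguments. The arguments in cmdline are separated by NUL bytes, which is why cat prints them run together. The function names are hypothetical, and the sample strings stand in for the real files so that the sketch can be run on any machine.

```python
def process_state(status_text):
    """Extract (code, description) from the State line of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("State:"):
            value = line.split(":", 1)[1].strip()        # e.g. "D (disk sleep)"
            code, _, description = value.partition(" ")
            return code, description.strip("()")
    return None


def split_cmdline(raw):
    """Split /proc/<pid>/cmdline content, whose arguments are NUL-separated."""
    return [arg for arg in raw.split("\0") if arg]


# Sample text standing in for the contents of the real /proc files.
sample_status = "Name:\tdd\nState:\tD (disk sleep)\nPid:\t8070\n"
print(process_state(sample_status))
print(split_cmdline("dd\0if=/dev/sdi\0of=/dev/null\0bs=1024k\0"))
```

On a live system, the inputs would come from reading /proc/8070/status and /proc/8070/cmdline. A state code of D that persists across repeated checks points to a process blocked on I/O rather than one that is merely busy.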
~~~~Errors continue~~~~~
SCSI disk error : host 2 channel 0 id 0 lun 2 return code = 70022
 I/O error: dev 08:80, sector 13485822
SCSI disk error : host 2 channel 0 id 0 lun 2 return code = 70022
 I/O error: dev 08:80, sector 13485886
Notice the last sector that failed; the difference between the last failed sector (13485886) and the first failed sector reported earlier (13485760) is 126 sectors, or 64,512 bytes at 512 bytes each. No matter the drive size, if a user issues a dd command on a failed drive, the duration of the I/O hang depends on the size of the outstanding block request. Setting the block size to 64K or less hangs the I/O at the SCSI layer on a buffer wait on each outstanding 2048-byte read request. To see the wait channel, issue ps -ef | grep dd, and then using the PID, issue ps -eo comm,pid,wchan | grep PID to find something such as dd ##### wait_on_buffer. Refer to Chapter 8 to acquire a better understanding of process structure because a discussion of system calls is beyond the scope of this chapter. The point of this general discussion is to demonstrate that the I/O will eventually time out or abort.
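The sector arithmetic can be checked with a few lines of Python. The two sector numbers come from the dmesg listings in this section: 13485760 is the first failing sector reported, and 13485886 is the last one logged before the errors stop. The variable names are ours.

```python
SECTOR_BYTES = 512

first_failed = 13485760   # first failing sector in the earlier dmesg listing
last_failed = 13485886    # last failing sector logged before the errors stop

span_sectors = last_failed - first_failed
span_bytes = span_sectors * SECTOR_BYTES
print(span_sectors, span_bytes)
```

The result is 126 sectors, or 64,512 bytes, which is the range the failing read requests covered before the errors stopped.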
Again, the most important thing to understand with a failed device on a given bus is that we cannot remove the driver that controls the bus access path, such as lpfcdd.o, and of course we cannot remove the protocol driver, such as SCSI in this case. The reason is that all remaining devices on the bus are online and in production, transmitting I/Os. Issuing a command such as rmmod lpfcdd simply yields the result "device busy." The only recovery method for this particular example is to restore access to the given device. If this involves replacing the device, such as installing a new LUN, the running kernel will have the incorrect device tree characteristics for the data construct. This is discussed in great detail in Chapter 8. In this particular case, device access must be restored. Otherwise, a new device will have to be put online and the server rebooted with the data restored from backup.
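One way to anticipate the "device busy" result before running rmmod is to look at the module's use count, which is the third field of each line in /proc/modules. The following Python sketch parses that field. The function name is hypothetical, and the size and count values in the sample text are made up for illustration; they are not taken from the system in this chapter.

```python
def module_use_count(proc_modules_text, module):
    """Return the use count of a module from /proc/modules content, or None."""
    for line in proc_modules_text.splitlines():
        fields = line.split()
        # Fields: name, size, use count, then optional dependency information.
        if len(fields) >= 3 and fields[0] == module:
            return int(fields[2])
    return None


# Hypothetical sample; the sizes and counts are invented for illustration.
sample = "lpfcdd 250000 3\nsd_mod 12000 6\nscsi_mod 95000 2 [lpfcdd sd_mod]\n"
count = module_use_count(sample, "lpfcdd")
if count is None:
    print("module not loaded")
elif count > 0:
    print("in use, rmmod would be refused")
else:
    print("not in use")
```

A nonzero count means that something still holds the driver, such as open devices on the remaining LUNs, so removing it is not an option while those devices are in production.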