22.6 Replacing a Failed Quorum Disk

If the quorum disk fails while the cluster is up, you can replace it online as long as the loss of the disk does not cause the cluster to lose quorum.
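Whether losing the disk costs the cluster its quorum comes down to vote arithmetic. The sketch below assumes the formula given in the TruCluster Server documentation, quorum votes = round_down((expected votes + 2) / 2), and assumes each member and the quorum disk contribute one vote; neither assumption is stated in this section, and the function names are hypothetical.

```shell
#!/bin/sh
# Sketch of cluster quorum arithmetic (assumed formula, see lead-in):
#   quorum votes = round_down((expected votes + 2) / 2)

# quorum_votes EXPECTED: print the number of votes required for quorum.
quorum_votes() {
    echo $(( ($1 + 2) / 2 ))
}

# survives_loss EXPECTED CURRENT LOST: succeed if the cluster still has
# quorum after LOST votes are removed from the CURRENT vote count.
survives_loss() {
    needed=$(quorum_votes "$1")
    remaining=$(( $2 - $3 ))
    [ "$remaining" -ge "$needed" ]
}

# Two members with one vote each plus a one-vote quorum disk:
# expected votes = 3, current votes = 3, and the disk takes 1 vote with it.
if survives_loss 3 3 1; then
    echo "quorum kept after losing the quorum disk"
else
    echo "quorum lost"
fi
```

Under these assumptions a two-member cluster with both members up keeps quorum when the disk fails (2 of the required 2 votes remain), whereas `survives_loss 3 2 1` fails: with one member already down, losing the disk leaves only 1 vote.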

Since we do not have a quorum disk that is about to fail, we will physically remove the quorum disk to illustrate how to replace it. To show the errors the cluster reports when it detects a problem with the quorum disk, we monitor the action with the evmwatch (8) command. To save space, the output is filtered to one member; both members log the events, so you would otherwise see the same event for each member. Some duplicated events have also been trimmed.

 # export EVM_SHOW_TEMPLATE=" (@host) @name\n@@\n"
 # evmwatch -f "[host molari]" | evmshow -T "[%T]"
 [08:44:53] (molari) sys.unix.binlog.hw.scsi._hwid.106
 SCSI event
 ...
 [08:44:59] (molari) sys.unix.hw.state_change.unavailable.disk._hwcomponent.SCSIWWID0c00000800000e1100189f1f._hwid.106
 Component State Change: Component "SCSI-WWID:0c000008:0000-0e11-0018-9f1f" is in the unavailable state (HWID=106)
 [08:44:59] (molari) sys.unix.hw.no_connections.disk._hwid.106
 Connectivity has been lost for device (HWID=106 lid=5 btl=3/3/0)
 ...
 [08:45:09] (molari) sys.unix.binlog.hw.scsi._hwid.106
 SCSI event
 ...
 [08:45:14] (molari) sys.unix.clu.cnx.qdisk.loss
 CNX MGR: Cluster quorum disk 19,240 has become unavailable due to a read error (status 5)
 [08:45:14] (molari) sys.unix.clu.drd.server_leave._hwid.106
 DRD: Removed (unmapped) DRD server molari
 ...
 [08:45:14] (molari) sys.unix.syslog.kern
 vmunix: CNX QDISK: Cluster quorum disk 19,240 has become unavailable due to a read error (status 5).
 [08:45:15] (molari) sys.unix.clu.drd.server_add._hwid.106
 DRD: Added (mapped) DRD server sheridan
 [08:45:15] (molari) sys.unix.clu.drd.new_accessnode._hwid.106
 DRD: Server sheridan selected for device 106
 [08:45:15] (molari) sys.unix.clu.drd.new_accessnode._hwid.106
 DRD: Server sheridan selected for device 106
 [08:45:15] (molari) sys.unix.binlog.hw.scsi._hwid.106
 SCSI event
 [08:45:15] (molari) sys.unix.syslog.kern
 vmunix: CNX QDISK: Quorum disk lost, removing 1 vote.
 ...
 [08:45:16] (molari) sys.unix.clu.drd.no_accessnode._hwid.106
 DRD: No server found for device 106
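The event names above end in the hardware ID of the affected component (`_hwid.106`), so the failing device can be picked out of a saved event log. The following is a minimal sketch that assumes the events were captured as text in the format shown; the function name `failed_hwids` and the sample variable are hypothetical.

```shell
#!/bin/sh
# failed_hwids: read evmshow-style text on stdin and print each HWID that
# appears in a "no_connections" disk event, one per line, without duplicates.
failed_hwids() {
    sed -n 's/.*sys\.unix\.hw\.no_connections\.disk\._hwid\.\([0-9][0-9]*\).*/\1/p' | sort -u
}

# Two events copied from the listing above.
sample_events='[08:44:59] (molari) sys.unix.hw.no_connections.disk._hwid.106
Connectivity has been lost for device (HWID=106 lid=5 btl=3/3/0)
[08:45:14] (molari) sys.unix.clu.cnx.qdisk.loss
CNX MGR: Cluster quorum disk 19,240 has become unavailable due to a read error (status 5)'

printf '%s\n' "$sample_events" | failed_hwids
```

For the sample events this prints `106`, the HWID of the disk that lost connectivity.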

Remove the quorum disk.

 # clu_quorum -d remove
 Collecting quorum data for Member(s): 1 2
 Quorum disk successfully removed.

EVM should log the removal.

 [08:46:10] (molari) sys.unix.syslog.kern
 vmunix: CNX QDISK: Bad close status (6) on device 19, 240.
 [08:46:10] (molari) sys.unix.syslog.kern
 vmunix: CNX MGR: Delete quorum disk operation completed with quorum.

Verify that the quorum disk has been removed.

 # clu_quorum | grep "Quorum disk"
 Quorum disk: Not Configured
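If this check is scripted, the state can be read from the same line. This is a small sketch that assumes the `Quorum disk:` line has the form shown above; the function name `quorum_disk_state` is hypothetical, and the sample text stands in for real clu_quorum output.

```shell
#!/bin/sh
# quorum_disk_state: read clu_quorum output on stdin and print whatever
# follows the "Quorum disk:" label.
quorum_disk_state() {
    sed -n 's/^[[:space:]]*Quorum disk:[[:space:]]*//p'
}

# Sample line copied from the output above.
sample_quorum='Quorum disk: Not Configured'

state=$(printf '%s\n' "$sample_quorum" | quorum_disk_state)
if [ "$state" = "Not Configured" ]; then
    echo "quorum disk has been removed"
else
    echo "quorum disk still configured: $state"
fi
```

On a cluster member the function would be fed by piping the clu_quorum command into it instead of the sample text.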

Locate the HWID of the failed device.

The EVM events will have noted the failing component, but it can also be found using the hwmgr (8) command.

 # hwmgr view device -dsf dsk4
  HWID:     Device Name       Mfg       Model            Location
 -----------------------------------------------------------------------------
   106:     /dev/disk/dsk4c   COMPAQ    BD009635C3       bus-3-targ-3-lun-0
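To feed the HWID into a later command without retyping it, the table can be parsed. The sketch below assumes the column layout shown above (HWID first, device special file second); `hwid_for_dsf` and the sample variable are hypothetical names.

```shell
#!/bin/sh
# hwid_for_dsf NAME: read `hwmgr view device` output on stdin and print the
# HWID of the row whose device name is /dev/disk/NAME plus a partition letter.
hwid_for_dsf() {
    awk -v dsf="$1" '
        $2 ~ ("^/dev/disk/" dsf "[a-h]$") { sub(/:$/, "", $1); print $1 }
    '
}

# Table copied from the output above.
sample_view=' HWID:     Device Name       Mfg       Model            Location
-----------------------------------------------------------------------------
  106:     /dev/disk/dsk4c   COMPAQ    BD009635C3       bus-3-targ-3-lun-0'

printf '%s\n' "$sample_view" | hwid_for_dsf dsk4
```

The match is anchored on the partition letter, so a lookup for dsk4 does not also pick up a device such as dsk40.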

Remove the failed component.

 # hwmgr del -id 106
 hwmgr: Delete operation was successful

The EVM will see this as:

 [08:46:59] (molari) sys.unix.hw.deregistered.undefined._hwid.106
 A hardware component has been de-registered (HWID=106)
 [08:46:59] (sheridan) sys.unix.hw.deregistered.undefined._hwid.106
 A hardware component has been de-registered (HWID=106)
 [08:46:59] (molari) sys.unix.syslog.kern
 vmunix: drd_delete_device: received status 16 when attempting to delete hwid 106.
 [08:47:01] (molari) sys.unix.syslog.daemon
 syslog: hotswapd: subsystem hwc may have successfully run the command '/sbin/dsfmgr -Z rm_cluster_hwid 106 4611977471623797587'
 [08:47:02] (molari) sys.unix.syslog.daemon
 syslog: hotswapd: subsystem hwc may have successfully run the command '/sbin/dsfmgr -Z rm_local_hwid 106 4612241719486683987'

Replace the disk.

You can either physically replace the disk at this point or identify another disk to use as the quorum disk. In this example, we plug our original disk back in.

Scan for the new component.

Use the hwmgr command to identify the new device.

 # hwmgr scan component -category disk -cluster
 hwmgr: Scan request successfully initiated
 hwmgr: Scan request successfully initiated

The EVM will see events similar to the list below.

 [08:48:29] (sheridan) sys.unix.hw.scan_completed.platform._hwid.56
 A hardware scan has just completed
 [08:48:30] (molari) sys.unix.hw.scan_completed.platform._hwid.1
 A hardware scan has just completed
 [08:48:46] (molari) sys.unix.binlog.hw.scsi
 SCSI event
 [08:48:46] (molari) sys.unix.hw.registered.disk._hwid.107
 A hardware component has been registered (HWID=107)
 [08:48:46] (molari) sys.unix.sysman.station.update.MEMBER
 SysMan Station: daemon on host molari.fafrak.com has written new serialization files. These files allow the daemons to communicate state and topology changes.
 [08:48:47] (sheridan) sys.unix.hw.registered.disk._hwid.107
 A hardware component has been registered (HWID=107)
 [08:48:49] (molari) sys.unix.hw.cluster_attribute_change.disk._hwid.107
 A change has occurred in a cluster attribute for device (HWID=107 lid=5)
 [08:48:50] (sheridan) sys.unix.hw.cluster_attribute_change.disk._hwid.107
 A change has occurred in a cluster attribute for device (HWID=107 lid=6)
 [08:48:51] (molari) sys.unix.hw.dev_base_name_changed.disk._hwid.107
 Device base name changed from "unknown" to "dsk12" (HWID=107)
 [08:48:51] (sheridan) sys.unix.hw.dev_base_name_changed.disk._hwid.107
 Device base name changed from "unknown" to "dsk12" (HWID=107)
 [08:48:54] (molari) sys.unix.syslog.daemon
 syslog: hotswapd: subsystem hwc may have successfully run the command '/sbin/dsfmgr -Z cr_new 4612491815432330179'

An alternative method to scan for the new hardware:

 # for i in molari sheridan
 > do
 >   hwmgr scan scsi -member $i
 > done
 hwmgr: Scan request successfully initiated
 hwmgr: Scan request successfully initiated

Alternatively, use the clu_scan_scsi script we wrote, which does essentially the same thing.
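The per-member loop can be wrapped in a small function. The following is only a sketch of the idea, not the clu_scan_scsi script itself; the function name `scan_members` and the `HWMGR` override variable are hypothetical, and the member names are passed as arguments rather than discovered automatically.

```shell
#!/bin/sh
# Command used to request the scan. Override it (for example HWMGR=echo)
# to preview the commands without touching any hardware.
HWMGR=${HWMGR:-hwmgr}

# scan_members MEMBER...: request a SCSI scan on each named cluster member.
# Returns non-zero if any of the scan requests failed.
scan_members() {
    status=0
    for member in "$@"; do
        "$HWMGR" scan scsi -member "$member" || status=1
    done
    return "$status"
}
```

With `HWMGR=echo` set, `scan_members molari sheridan` prints the two `scan scsi -member` command lines instead of running them; without the override it issues the same requests as the loop above.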

If the new disk is an RZ26, RZ28, RZ29, or RZ1CB-CA that is not located behind a RAID controller (e.g., an HSZ or HSG controller), run the clu_disk_install command.

The clu_disk_install script runs the scu (8) program to enable bad block replacement for the disk, which allows the device to function as a Direct-Access I/O (DAIO) device. This command can take a while to complete, so be patient.
```
 # clu_disk_install 
```
Note
If the disk is present at boot time, clu_disk_install automatically runs and finds it.

Locate the new device.

 # hwmgr -show scsi
           SCSI                    DEVICE DEVICE    DRIVER  NUM    DEVICE FIRST
   HWID:   DEVICEID     HOSTNAME   TYPE   SUBTYPE   OWNER   PATH   FILE   VALID PATH
 ------------------------------------------------------------------------------------
     50:   3            sheridan   disk   none      2       1      dsk1   [3/0/0]
     51:   4            sheridan   disk   none      2       1      dsk2   [3/1/0]
     52:   5            sheridan   disk   none      2       1      dsk3   [3/2/0]
     54:   7            sheridan   disk   none      2       1      dsk5   [3/4/0]
     55:   8            sheridan   disk   none      0       1      dsk6   [3/5/0]
    102:   0            sheridan   cdrom  none      0       1      cdrom1 [0/0/0]
    103:   1            sheridan   disk   none      2       1      dsk8   [2/0/0]
    104:   2            sheridan   disk   none      2       1      dsk9   [2/1/0]
    107:   6            sheridan   disk   none      0       1      dsk10  [3/3/0]

The new device is dsk10. Note that on V5.1, the device name is not always assigned. You may see something like:

  107:     6           sheridan    disk    none      0      1      (null)    [3/3/0]

If this happens, use the dsfmgr (8) command with the "-K" option to assign a device special file name. Output is not shown.

 # dsfmgr -K
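Because the replacement disk sits at the same bus/target/LUN as the failed one (btl=3/3/0 in the earlier connectivity event), the new device name can be looked up by path, and the "(null)" case can be detected at the same time. This is a minimal sketch that assumes the `hwmgr -show scsi` layout shown above, with the device file in the second-to-last column and the first valid path in the last; `device_at_path` and the sample variable are hypothetical names.

```shell
#!/bin/sh
# device_at_path B/T/L: read `hwmgr -show scsi` output on stdin and print the
# DEVICE FILE column of the row whose FIRST VALID PATH is [B/T/L].
device_at_path() {
    awk -v path="[$1]" '$NF == path { print $(NF - 1) }'
}

# Rows copied from the listing above.
sample_scsi='   55:   8            sheridan   disk   none      0       1      dsk6   [3/5/0]
  102:   0            sheridan   cdrom  none      0       1      cdrom1 [0/0/0]
  107:   6            sheridan   disk   none      0       1      dsk10  [3/3/0]'

name=$(printf '%s\n' "$sample_scsi" | device_at_path 3/3/0)
case $name in
    ''|'(null)') echo "no device special file name yet; run dsfmgr -K" ;;
    *)           echo "new device: $name" ;;
esac
```

For the sample rows the lookup returns dsk10. A row that shows `(null)` in the device file column falls into the first branch of the `case`.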

Optionally, rename the new device special file back to the old device special file name.

 # dsfmgr -m dsk10 dsk4
   dsk10a=>dsk4a dsk10b=>dsk4b dsk10c=>dsk4c dsk10d=>dsk4d
   dsk10e=>dsk4e dsk10f=>dsk4f dsk10g=>dsk4g dsk10h=>dsk4h
   dsk10a=>dsk4a dsk10b=>dsk4b dsk10c=>dsk4c dsk10d=>dsk4d
   dsk10e=>dsk4e dsk10f=>dsk4f dsk10g=>dsk4g dsk10h=>dsk4h

The EVM will see the following events.

 [08:49:44] (molari) sys.unix.hw.dev_base_name_changed.disk._hwid.107
 Device base name changed from "dsk10" to "dsk4" (HWID=107)
 [08:49:44] (sheridan) sys.unix.hw.dev_base_name_changed.disk._hwid.107
 Device base name changed from "dsk10" to "dsk4" (HWID=107)
 [08:49:58] (molari) sys.unix.hw.disk_label_memory_written.disk._hwid.107
 Base hardware event
 [08:49:58] (molari) sys.unix.hw.disk_label_disk_written.disk._hwid.107
 Base hardware event
 [08:49:58] (molari) sys.unix.hw.disk_label_disk_written.disk._hwid.107
 Base hardware event

If you receive the following message, see section 7.5.1.4.

 dsfmgr: ERROR: second device status is active: dsk4a

Add the new disk as the quorum disk.

 # clu_quorum -d add dsk4 1
 Collecting quorum data for Member(s): 1 2
    Initializing cnx partition on quorum disk : dsk4h
 Quorum disk successfully created.

The EVM will see the following events.

 [08:50:02] (molari) sys.unix.syslog.kern
 vmunix: CNX MGR: Add quorum disk operation completed with quorum.
 [08:50:22] (molari) sys.unix.syslog.kern
 vmunix: CNX QDISK: Successfully claimed quorum disk, adding 1 vote.