22.2 Replacing HBA and/or HSx Controllers

During the life of your cluster, you may experience certain hardware failures. In and of themselves these are no big deal, because your cluster is configured with No Single Point of Failure (NSPOF). Handling these maintenance items usually requires a small dose of thought, and then the cluster continues to hum merrily along.

22.2.1 Replacing an HSG80 Controller

In most cases, if a storage controller such as an HSG80 fails, only one of the dual pair fails at a time. In that case, the prospect of the WWID changing is not a problem: the remaining controller retains the WWID, and it survives the failed controller's replacement. If you have a dual failure, or some other failure in which the WWID is destroyed, you must be able to reset the WWID on the HSG80. You do this with a single command at the HSG80 console.

 HSG80> set this node=5000-1FE1-0000-9630 YO 

In this case, "5000-1FE1-0000-9630" is the WWID and "YO" is the checksum. You can get both of these values from the HSG80 itself. They are recorded on the more permanent part of the HSG80 enclosure, not on the replaceable controllers.
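
Because a single dropped digit in the node ID is easy to miss, it can help to check the shape of the value before typing it at the console. The following is a minimal shell sketch and not part of the original procedure. The function name `is_valid_node_id` is hypothetical, and it checks only for four groups of four hexadecimal digits, as in the example above. It does not verify the checksum, which must still be copied from the enclosure.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the argument has the shape
# NNNN-NNNN-NNNN-NNNN (four groups of four hex digits).
# It does NOT validate the checksum.
is_valid_node_id() {
    printf '%s\n' "$1" | grep -Eq '^[0-9A-Fa-f]{4}(-[0-9A-Fa-f]{4}){3}$'
}

# Example: print the console command only when the ID is well formed.
node_id="5000-1FE1-0000-9630"
checksum="YO"   # copied from the enclosure, as described above
if is_valid_node_id "$node_id"; then
    echo "set this node=$node_id $checksum"
else
    echo "malformed node ID: $node_id" >&2
fi
```

A value with a short group, such as "5000-1FE1-000-9630", fails the check.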

22.2.2 Replacing an HSZ70 Controller

If your HSZ70 controllers are a dual pair (in either transparent failover or multibus failover mode) and you have a single failure, you'll be okay as long as you replace the failed controller before there is a problem with the other one. If you happen to have a dual failure (which is extremely rare), you will have to do some work with dsfmgr(8) to return your disk names to their previous state. This is one reason we recommend keeping on hand the files and command output mentioned in Chapter 21. In addition, it is very important that you have the latest, patched HSOF software. For this procedure to work properly, your HSZx0 should be at least at these patch levels:

  • HSZ40: V37Z-1.

  • HSZ50: V57Z-1.

  • HSZ70: V77Z-1.

Become familiar with these HSOF commands before proceeding. They can result in data loss if used incorrectly.

  • set nofailover

  • set failover copy=this

  • set multibus_failover copy=this

22.2.2.1 HSZ70 in Transparent Failover Mode

This procedure works for the HSZ70 but also works for the HSZ40 and HSZ50 controller when configured in transparent failover mode. The operating system must be booted to handle the controller change. If not already booted, you must at least boot to single-user mode.

  1. Begin with the running operating system (the devices on the impacted HSZx0 should be present and seen).

  2. Issue the "set nofailover" command at the HSZx0 while connected to the functioning controller.

  3. When the green LED status light is no longer blinking, remove the failed controller.

  4. Scan the bus with "hwmgr -scan scsi -bus N" (where "N" is the bus number associated with that HSZx0) on each member or use our clu_scan_scsi script (this will allow us to notice the change from dual controller to single controller).

    You may see several "was dual, now single" messages on the HSZx0 console.

    Even though the prompt returns rather quickly after the "hwmgr" command, the scan was only initiated. Give the scan about a minute to complete before proceeding to the next step.

  5. Replace the failed controller.

  6. Scan the bus with "hwmgr -scan scsi -bus N" again.

  7. Issue the "set failover copy=this" command (while connected to the original good controller).

    You should see several "upgraded to dual" messages on the HSZx0 console.

  8. Scan the bus a final time with "hwmgr -scan scsi -bus N" or clu_scan_scsi (this will allow us to notice the change from single controller back to the new dual redundant pair).
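
The scan in steps 4, 6, and 8 returns its prompt before the scan has finished, so it is easy to move on too early. The following is a minimal shell sketch of a wrapper that issues the scan and then waits. The function name `scan_bus` and the `DRY_RUN` and `SCAN_WAIT` variables are assumptions introduced for this example, not part of hwmgr or of our clu_scan_scsi script. With `DRY_RUN` left at its default of 1, the command is only printed.

```shell
#!/bin/sh
# Sketch only: wraps the bus scan used in steps 4, 6, and 8.
# DRY_RUN and SCAN_WAIT are assumptions introduced for this example.
DRY_RUN="${DRY_RUN:-1}"        # 1 = print the command instead of running it
SCAN_WAIT="${SCAN_WAIT:-60}"   # seconds to allow the scan to complete

scan_bus() {
    bus="$1"
    # Accept only a non-empty, all-digit bus number.
    case "$bus" in
        ''|*[!0-9]*) echo "usage: scan_bus BUS_NUMBER" >&2; return 1 ;;
    esac
    if [ "$DRY_RUN" = "1" ]; then
        echo "hwmgr -scan scsi -bus $bus"
        return 0
    fi
    hwmgr -scan scsi -bus "$bus" || return 1
    # The prompt returns before the scan completes, so pause here.
    sleep "$SCAN_WAIT"
}

scan_bus 3   # bus 3 is an arbitrary example value
```

The wrapper scans only from the member on which it runs, so it would need to be run on each member, as step 4 describes.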

22.2.2.2 HSZ70 in Multibus Failover Mode

This procedure requires either a patch to TruCluster Server version 5.1A or version 5.1B. As of this writing, the patch is not yet part of an aggregate patch kit.

This procedure works for the HSZ70 when configured in multibus failover mode. The operating system must be booted to handle the controller change. If not already booted, you must at least boot to single-user mode.

Note

Avoid applying this procedure during heavy I/O periods as you may get "phantom" devices. Eliminate I/O to the controller and/or refer to the HSZ70 documentation concerning port quiescence.

  1. Begin with the running operating system (the devices on the impacted HSZ70 should be present and seen).

  2. From the Tru64 UNIX host, issue the "sysconfig -r cam_disk rec_use_alt_params=1" command to switch to alternate recovery timing (required for this type of procedure).

  3. Issue the "set nofailover" command at the HSZ70 while connected to the functioning controller.

    You may see several "was dual, now single" messages on the HSZx0 console.

  4. When the green LED status light is no longer blinking, remove the failed controller.

  5. Scan the bus with "hwmgr -scan scsi -bus N" (where "N" is the bus number associated with that HSZx0) on each member or use our clu_scan_scsi script (this will allow us to notice the change from dual controller to single controller).

    Even though the prompt returns rather quickly after the "hwmgr" command, the scan was only initiated. Give the scan about a minute to complete before proceeding to the next step.

  6. Replace the failed controller.

  7. Scan the bus with "hwmgr -scan scsi -bus N" again.

  8. Issue the "set multibus_failover copy=this" command (while connected to the original good controller).

    You should see several "upgraded to dual" messages on the HSZ70 console.

  9. Issue the "sysconfig -r cam_disk rec_use_alt_params=0" command to return to the default recovery timing.

  10. Scan the bus a final time with "hwmgr -scan scsi -bus N" or clu_scan_scsi (this will allow us to notice the change from single controller back to the new multibus failover pair).
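
Steps 2 and 9 bracket the controller swap, and forgetting step 9 would leave the host on the alternate recovery timing. The following is a minimal shell sketch that keeps the two sysconfig calls together. The function names `run` and `set_alt_timing` and the `DRY_RUN` variable are assumptions introduced for this example. With `DRY_RUN` left at its default of 1, the commands are only printed.

```shell
#!/bin/sh
# Sketch only: pairs the recovery-timing changes from steps 2 and 9.
# DRY_RUN and the function names are assumptions introduced for this example.
DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands instead of running them

# Print the command in dry-run mode; otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Accept only 0 (default timing) or 1 (alternate timing).
set_alt_timing() {
    case "$1" in
        0|1) run sysconfig -r cam_disk "rec_use_alt_params=$1" ;;
        *)   echo "usage: set_alt_timing 0|1" >&2; return 1 ;;
    esac
}

set_alt_timing 1   # step 2: switch to alternate recovery timing
# Steps 3 through 8 are carried out here, at the HSZ70 console and the host.
set_alt_timing 0   # step 9: return to the default recovery timing
```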

22.2.3 Replacing a Fibre Channel HBA

When you replace a failed KGPSA in Tru64 UNIX version 5, you don't need to do anything special to preserve the existing (disk) device names. However, the new KGPSA will have a new WWID and will therefore appear as new connections at the HSG80 (or other FC storage box). These connections need to be addressed, especially if you are implementing Selective Storage Presentation to allow only certain systems to access certain storage. The SAN zones may also need to be adjusted to account for the new connections. For more on SAN storage, refer to Chapter 4.

22.2.4 Replacing a Shared SCSI HBA

To replace a KZPBA, KZPSA, or other shared storage HBA, make sure that you keep the same SCSI ID and ensure proper bus termination. For more details about storage setup, see Chapter 4.




TruCluster Server Handbook
TruCluster Server Handbook (HP Technologies)
ISBN: 1555582591
Year: 2005
Pages: 273
