22.1 Backup and Restore of Critical Cluster File Systems

First of all, after your cluster is built you must put together a regimen of cluster backups. Backing up your cluster includes saving each member's boot disk, cluster_root, cluster_usr, and cluster_var. While you may use an enterprise-wide backup package to back up your data, we recommend something simple (and something that's on the Operating System CD-ROM) like vdump(8)/vrestore(8) to handle these critical operating system-specific file systems. Otherwise you have to do a full operating system installation, then install the enterprise package, and probably index a few tapes before you finally get to the point where you can recover anything. With a simple solution like vdump/vrestore, you just boot the OS CD or a base OS standalone disk and begin recovering right away.
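
As a minimal sketch (the non-rewinding tape device /dev/ntape/tape0 and the member number are our assumptions, not taken from the text), one pass of level 0 backups over the cluster-common file systems and one member's boot partition could look like the following; using the non-rewinding device lets the successive dumps land on the same tape.

 # vdump 0f /dev/ntape/tape0 /
 # vdump 0f /dev/ntape/tape0 /usr
 # vdump 0f /dev/ntape/tape0 /var
 # vdump 0f /dev/ntape/tape0 /cluster/members/member1/boot_partition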

22.1.1 Backup of Member Boot Disk and Cluster-Common File Systems

Let's remember what the member boot disk contains: partition "a" holds the file system containing /vmunix, sysconfigtab, and a few other important files; partition "b" is swap; and partition "h" is the CNX partition. Regarding the CNX partition, each time the system boots, it automatically saves, as /.local../boot_partition/etc/clu_bdmgr.conf, a copy of the configuration information that can be used to rebuild the CNX partition. See sections 17.2.5 and 17.8.5 for more information about the CNX partition.

Given a current copy of this file, the CNX partition of any of the cluster boot disks can be restored using clu_bdmgr(8). For example, given member gilligan's clu_bdmgr.conf file, you can repair member skipper's CNX partition (which is on dsk3) with this command:

 [gilligan] # clu_bdmgr -h dsk3 /.local../boot_partition/etc/clu_bdmgr.conf 

Now that we've taken care of the CNX partition, it's a simple vdump command to produce backup tapes of the AdvFS file systems on each key partition.

For example, to create a vdump tape of member3's boot partition, you can use the following command:

 # vdump 0f /dev/ntape/tape2 /cluster/members/member3/boot_partition 

Repeat this or your favorite variation of the 0-level vdump for each of the file systems we mentioned above. In most environments there will be both 0-level vdumps as well as incremental vdumps. Just be sure to apply all of the incremental vdumps if you have to perform a restoration of any particular file system.
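
As a sketch (the schedule and tape devices here are assumptions, not recommendations from this book): you might take a level 0 vdump of cluster_root once a week and a level 1 each night. The u option records the dump date in /etc/vdumpdates so that the level 1 picks up only files changed since the last lower-level dump; at restore time you apply the level 0 vrestore first and then the most recent level 1.

 # vdump 0uf /dev/ntape/tape0 /
 # vdump 1uf /dev/ntape/tape1 /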

Once you have a good backup scheme in place, you will be prepared to restore each member boot disk, cluster_root, cluster_usr, and cluster_var.

As a final note on member boot disk backups, because of their limited size, you can fairly easily fit them in one of the cluster-common OS file systems (especially if they are gzip'd) so that you could rebuild an entire boot disk in mere moments (as opposed to restoring from tapes or resorting to clu_delete_member and clu_add_member).

 [/var/bd_backups] root@molari # ls -l
 total 42312
 -rw-r--r--   1 root     system   21627236 May 28 00:28 molari_bd_vdump.gz
 -rw-r--r--   1 root     system   21688542 May 28 00:27 sheridan_bd_vdump.gz
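
A sketch of how compressed boot disk vdumps like the two above could be produced and later restored. Writing the dump to standard output with "f -" and reading it back with vrestore's "f -" are standard vdump/vrestore usage, but the assumption that sheridan is member2 is ours.

 # vdump 0f - /cluster/members/member1/boot_partition | gzip > /var/bd_backups/molari_bd_vdump.gz
 # vdump 0f - /cluster/members/member2/boot_partition | gzip > /var/bd_backups/sheridan_bd_vdump.gz

To rebuild, for example, molari's boot partition from the compressed dump (assuming the new boot partition's root fileset is mounted at /mnt):

 # gzip -dc /var/bd_backups/molari_bd_vdump.gz | vrestore -xf - -D /mnt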

22.1.2 Restoring the Cluster Root

Let's say that critical files on your cluster_root file system have been deleted, and after some investigation you determine that the problem is widespread (and to make things more interesting, cluster_usr and cluster_var are also affected and are on the same disk) and you decide that you need to restore the file systems. There is no need to destroy the file system that is currently known as the cluster_root, since keeping that file system around will allow you to troubleshoot the original problem when the heat is off. Don't panic! Since you have the backups (we will assume vdump tapes), you can restore the cluster either onto a disk already known to the cluster (the preferred option) or to a disk new to the cluster.

22.1.2.1 Restoring Cluster Root to an Existing Disk

The least complicated scenario for restoring the cluster_root file system is to restore it to a disk that is already known to the cluster.

To perform the restore, we need a disk on a shared bus that is accessible to all members of the cluster, that was known to the cluster before the problem with cluster_root, and that was known to the cluster prior to the cluster_root backup. The restoring member must have access to the Emergency Repair (ER) disk (or equivalent), and this OS disk must be at the same operating system version and patch kit as the cluster. See section 22.1.3 for more discussion about the ER disk.

In this scenario:

  • Original cluster_root was dsk5.

  • Bootable standalone (ER) disk is dsk13.

  • Disk to which cluster_root is restored is dsk11.

  1. Ensure all members are halted.

  2. Boot the ER disk or equivalent.

     >>> boot DKB100 

    DKB100 corresponds to dsk13 in this cluster.

    Warning

    Once booted from the ER disk you may have a different view of the storage, namely the device special file names.

  3. Determine if the device name for the new cluster_root is the same while booted from the ER disk as it is while booted from the cluster.

     # hwmgr -view device

    Verify by checking the Bus/Target/LUN and the disk name. If the target device doesn't have the same device special file name, rename it so that it does. This will avoid confusion later.

     # dsfmgr -m dsk55 dsk11
  4. Partition the new cluster_root disk (using saved disklabel information per Chapter 21, probably the hardcopy version of the saved disklabels).

     # disklabel -e dsk11 
  5. Create the new cluster_root, cluster_usr, and cluster_var domains and filesets (they all reside on the same disk in our example, but this may not always be true). If you are using the ER disk, the /etc/fdmns/cluster_* directories will already exist, so remove those before attempting to create the new domains.

     # cd /etc/fdmns
     # rm -r cluster_root cluster_usr cluster_var

    Now create the domains.

     # mkfdmn /dev/disk/dsk11a cluster_root
     # mkfset cluster_root root
     # mkfdmn /dev/disk/dsk11g cluster_usr
     # mkfset cluster_usr usr
     # mkfdmn /dev/disk/dsk11h cluster_var
     # mkfset cluster_var var
    Note

    If you need the extra file system space, you can add up to two additional volumes to cluster_root at this point. Just remember to account for them in steps 10, 11, and 12. See Chapter 11 of the TruCluster Server Cluster Administration manual for more details.

     # addvol /dev/disk/dsk70c cluster_root 
  6. Mount cluster_root.

     # mount cluster_root#root /mnt 
  7. Restore cluster_root from backup.

     # vrestore -xf /dev/tape/tape2 -D /mnt
     # umount /mnt
  8. Restore cluster_usr and cluster_var by repeating steps 6 and 7 but for the cluster_usr#usr and cluster_var#var filesets. (Or you could leave cluster_root mounted and mount these filesets at /mnt/usr and /mnt/var respectively and restore them.)

  9. Re-mount cluster_root if you unmounted it in step 7.

     # mount cluster_root#root /mnt 
  10. Fix the restored /etc/fdmns/cluster_root device linkage to reflect the new device name. (Also update cluster_usr and cluster_var if those file systems were also restored.)

     # cd /mnt/etc/fdmns/cluster_root
     # rm *
     # ln -s /dev/disk/dsk11a
  11. Take note of the major and minor numbers of the new cluster_root device (a small helper that extracts them is sketched after this procedure).

     # file /dev/disk/dsk11a
     /dev/disk/dsk11a:       block special (19/97)
  12. Boot the member boot disk interactively specifying the major and minor numbers (above) for the newly restored cluster_root.

     >>> boot -flags "ia"
     ...
     Enter: <kernel_name> [option_1 ... option_n]
       or: ls [name]['help']
       or: 'quit' to return to console
     Press Return to boot 'vmunix'
     # vmunix cfs:cluster_root_dev1_maj=19 cfs:cluster_root_dev1_min=97

    Note

    In cluster_root_dev#_maj and cluster_root_dev#_min, the "#" can range from 1 to 3, meaning that you can have from 1 to 3 AdvFS volumes in cluster_root.

  13. Boot each cluster member one at a time.
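
As referenced in step 11, here is a minimal sketch that extracts the major and minor numbers from the file(1) output and prints the line to type at the interactive boot prompt in step 12. It assumes the new cluster_root volume is dsk11a and that file reports it in the "block special (major/minor)" form shown above.

 # DEV=/dev/disk/dsk11a
 # MAJMIN=`file $DEV | sed 's/.*(\(.*\))/\1/'`
 # MAJ=`echo $MAJMIN | awk -F/ '{print $1}'`
 # MIN=`echo $MAJMIN | awk -F/ '{print $2}'`
 # echo "vmunix cfs:cluster_root_dev1_maj=$MAJ cfs:cluster_root_dev1_min=$MIN"

Only the first member's interactive boot (step 12) needs these values; the remaining members are then booted normally, as in step 13.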

22.1.2.2 Restoring Cluster Root to a New Disk

This procedure is more involved, as you'll see, so only choose this option if there is no disk already available on a shared bus. In fact, you might even configure a "hot spare" disk for such an emergency.

In this scenario:

  • Original cluster_root was dsk5.

  • Member boot disk is dsk6.

  • Bootable standalone (ER) disk is dsk13.

  • Disk to which cluster_root is restored is dsk21.

  1. Ensure all members are halted.

  2. Boot the ER disk or equivalent.

     >>> boot DKB100 

    DKB100 corresponds to dsk13 in this cluster.

    Warning

    Once booted from the ER disk you may have a different view of the storage, namely the device special file names.

  3. Note the device name and Bus/Target/LUN of the disk that will be the new cluster_root.

     # hwmgr -view device 
  4. Partition the new cluster_root disk (using saved disklabel information per Chapter 21, probably the hardcopy version of the saved disklabels).

     # disklabel -e dsk21 
  5. Create the new cluster_root, cluster_usr, and cluster_var domains and filesets (they all reside on the same disk in our example, but this may not always be true).

    If you are using the ER disk, the /etc/fdmns/cluster_* directories will already exist, so remove those before attempting to create the new domains.

     # cd /etc/fdmns
     # rm -r cluster_root cluster_usr cluster_var

    Now create the domains and filesets.

     # mkfdmn /dev/disk/dsk21a cluster_root
     # mkfset cluster_root root
     # mkfdmn /dev/disk/dsk21g cluster_usr
     # mkfset cluster_usr usr
     # mkfdmn /dev/disk/dsk21h cluster_var
     # mkfset cluster_var var
    Note

    If you need the extra file system space, you can add up to two additional volumes to cluster_root at this point. Just remember to account for them in steps 21, 27, and 28. See Chapter 11 of the TruCluster Server Cluster Administration manual for more details.

     # addvol /dev/disk/dsk70c cluster_root 
  6. Mount cluster_root.

     # mount cluster_root#root /mnt 
  7. Restore cluster_root from backup.

     # vrestore -xf /dev/tape/tape2 -D /mnt
     # umount /mnt
  8. Restore cluster_usr and cluster_var by repeating steps 6 and 7 but for the cluster_usr#usr and cluster_var#var filesets. (Or you could leave cluster_root mounted and mount these filesets at /mnt/usr and /mnt/var respectively and restore them.)

  9. Re-mount cluster_root if you unmounted it in step 7.

     # mount cluster_root#root /mnt 
  10. Copy the restored cluster databases and member-specific databases to /etc on the ER disk so that when you reboot from the ER disk you will have the proper (cluster) view of storage.

     # cd /mnt/etc
     # cp dec_unid_db dec_hwc_cdb dfsc.dat /etc
     # cd /mnt/cluster/members/member1/etc
     # cp dfsl.dat /etc
  11. Create the links in /etc/fdmns for the member boot disk if you don't already have them.

     # cd /etc/fdmns
     # mkdir root1_domain
     # cd root1_domain
     # ln -s /dev/disk/dsk6a
  12. Mount the member boot partition.

     # cd /
     # umount /mnt
     # mount root1_domain#root /mnt
  13. Copy the databases from the member boot partition to /etc on the ER disk so that when you reboot from the ER disk you will have the proper (cluster) view of storage.

     # cd /mnt/etc
     # cp dec_devsw_db dec_hw_db dec_hwc_ldb dec_scsi_db /etc
  14. Unmount the member boot partition.

     # cd /
     # umount /mnt
  15. Create the .bak (backup) database files.

     # cd /etc
     # for i in dec_*db
     > do
     >   cp $i $i.bak
     > done
  16. Reboot to single-user mode using the ER disk as the boot device. The system will now boot with the same hardware configuration that the cluster previously had.

     >>> boot -fl s DKB100 
  17. Scan the SCSI buses.

     # hwmgr -scan scsi
  18. Mount the root file system as read/write.

     # mount -u / 
  19. Verify and update the device database.

     # dsfmgr -v -F 
  20. Display and note the current device layout.

     # hwmgr -view devices 
  21. Fix the local file domains, if necessary, by removing and remaking the correct links in the /etc/fdmns directory. Examine usr_domain, root1_domain, and cluster_root to see if any changes are necessary based on the output from step 20 (a small loop that displays every domain's current links is sketched after this procedure). If you need to make changes, follow this example:

     # cd /etc/fdmns/cluster_root
     # rm *
     # ln -s /dev/disk/dsk21a
  22. Mount the local file systems.

     # bcheckrc 
  23. Copy the updated cluster database files to cluster_root so that when you reboot with the new cluster disk, you will have the proper (cluster) view of storage.

     # mount cluster_root#root /mnt
     # cd /etc
     # cp dec_unid_db* dec_hwc_cdb* dfsc.dat /mnt/etc
     # cp dfsl.dat /mnt/cluster/members/member1/etc
  24. Fix /etc/fdmns/cluster_root on the cluster_root.

     # rm /mnt/etc/fdmns/cluster_root/*
     # cd /etc/fdmns/cluster_root
     # tar cf - * | (cd /mnt/etc/fdmns/cluster_root && tar xf -)
  25. Fix /etc/fdmns/cluster_usr and /etc/fdmns/cluster_var on the cluster_root as in the previous step. (This step is only necessary if cluster_usr and cluster_var are on the same disk as cluster_root.)

     # rm /mnt/etc/fdmns/cluster_usr/*
     # cd /etc/fdmns/cluster_usr
     # tar cf - * | (cd /mnt/etc/fdmns/cluster_usr && tar xf -)
     # rm /mnt/etc/fdmns/cluster_var/*
     # cd /etc/fdmns/cluster_var
     # tar cf - * | (cd /mnt/etc/fdmns/cluster_var && tar xf -)
  26. Copy the updated cluster database files to the member boot disk.

     # cd /
     # umount /mnt
     # mount root1_domain#root /mnt
     # cd /etc
     # cp dec_devsw_db* dec_hw_db* dec_hwc_ldb* dec_scsi_db* /mnt/etc

  27. Take note of the major and minor numbers of the new cluster_root device.

     # file /dev/disk/dsk21a
     /dev/disk/dsk21a:       block special (19/467)
  28. Boot the member boot disk interactively specifying the major and minor numbers (above) for the newly restored cluster_root.

     >>> boot -flags "ia"
     ...
     Enter: <kernel_name> [option_1 ... option_n]
       or: ls [name]['help']
       or: 'quit' to return to console
     Press Return to boot 'vmunix'

     # vmunix cfs:cluster_root_dev1_maj=19 cfs:cluster_root_dev1_min=467 

    Note

    In cluster_root_dev#_maj and cluster_root_dev#_min, the "#" can range from 1 to 3, meaning that you can have from 1 to 3 AdvFS volumes in cluster_root.

  29. Boot each cluster member one at a time.
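
As mentioned in step 21, a quick way to see where every AdvFS domain currently points is a small loop over /etc/fdmns. This is just a sketch that relies on the standard layout in which each domain directory contains one symbolic link per volume.

 # for dom in /etc/fdmns/*
 > do
 >   echo "$dom:"
 >   ls -l $dom
 > done

Running the same loop against /mnt/etc/fdmns after steps 24 and 25 is an easy way to confirm that the copies on the restored cluster_root also point at the new disk.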

22.1.3 Emergency Repair Disk

As mentioned in the previous section, to restore cluster_root you must have a bootable non-clustered operating system disk at the same operating system version and patch level as the current cluster. Our recommendation is that when you install a patch kit on your cluster, you always copy the patch kit tar(1) file to a file system on the Emergency Repair (ER) disk; that way you'll have it if you need it. What is the Emergency Repair disk, you ask? It's the disk that you originally used to build the operating system and from which you ran clu_create(8). We recommend keeping this disk and the operating system file systems intact after installation.
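
A minimal sketch of stashing a patch kit on the ER disk from a running cluster member. Everything here is an assumption for illustration: dsk13g as the usr partition of the ER disk (dsk13 is the ER disk in our scenarios), er_usr_domain as a temporary domain label, a fileset named usr in that domain (the default for a standalone installation), and patch_kit.tar as the kit's file name.

 # mkdir /etc/fdmns/er_usr_domain
 # cd /etc/fdmns/er_usr_domain
 # ln -s /dev/disk/dsk13g
 # mount er_usr_domain#usr /mnt
 # mkdir -p /mnt/patch_kits
 # cp /var/tmp/patch_kit.tar /mnt/patch_kits
 # umount /mnt
 # rm -r /etc/fdmns/er_usr_domain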

22.1.4 Replacing a Member Disk

If you need to replace only a member disk and you have a vdump of the "a" partition, you can restore that disk or restore the data to a new disk. In the procedure that follows, you can skip any steps that "correct" for the fact that we are restoring to a different disk already known to the cluster (those steps are marked with a "†"). Please see section 7.4.4 for more on adding disks to a system. In our original configuration, molari's (member1) boot disk was dsk2; its replacement disk is dsk10. Unless otherwise stated, all steps take place on a booted member of the cluster.

  1. Make sure that the member with the failed boot disk is down.

     # clu_get_info -m 1
  2. Set up dsk10 as the boot disk for member 1.

     # /usr/sbin/clu_bdmgr -c dsk10 1

    If dsk10 is not the original boot disk for molari, you will get a warning message and will see instructions to change the /etc/sysconfigtab file to reflect this change†. Write down these changes. They will look something like this:

     The new member's disk, dsk10, is not the same name as the original disk
     configured for domain root1_domain. If you continue the following changes
     will be required in member1's /etc/sysconfigtab file:

         vm:
         swapdevice=/dev/disk/dsk10b

         clubase:
         cluster_seqdisk_major=19
         cluster_seqdisk_minor=195

  3. Mount molari's new boot disk root domain. (The clu_bdmgr command created the domain and fileset.)

     # mount root1_domain#root /mnt 
  4. Restore the member's boot_partition.

     # vrestore -xf /dev/tape/tape0 -D /mnt
  5. Modify the /etc/sysconfigtab to reflect the new disk (dsk10) per the earlier instructions in step 2. Extract the existing member's clubase and vm sections into a stanza file:

     # sysconfigdb -t /mnt/etc/sysconfigtab -l clubase vm > /tmp/bd_mod.stanza

    Edit the stanza file so that the vm and clubase entries reflect dsk10 and the values reported in step 2 (a sample edited stanza is sketched after this procedure):

     # vi /tmp/bd_mod.stanza 

    Make it so:

     # sysconfigdb -t /mnt/etc/sysconfigtab -m -f /tmp/bd_mod.stanza vm
     # sysconfigdb -t /mnt/etc/sysconfigtab -m -f /tmp/bd_mod.stanza clubase

  6. Restore the "h" partition (cnx).

     # /usr/sbin/clu_bdmgr -h dsk10
  7. Unmount the member's boot_partition.

     # umount /mnt 
  8. Adjust the bootdef_dev console variable on molari's console to reflect the change.

     >>> set bootdef_dev DKB202 

    The device DKB202 is the appropriate console device name that corresponds to dsk10 in UNIX.

  9. Boot molari.

     >>> boot 
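
As referenced in step 5, the edited /tmp/bd_mod.stanza ends up looking something like the following for our dsk10 example. Only the attributes that change are shown here; the file you extracted in step 5 will also contain the other existing vm and clubase attributes, which you leave untouched. The major/minor values are the ones clu_bdmgr reported in step 2 and will differ on your system.

 vm:
         swapdevice=/dev/disk/dsk10b

 clubase:
         cluster_seqdisk_major=19
         cluster_seqdisk_minor=195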

22.1.5 Restoring a Critical Data File System

Just as losing a file system critical to the operating system can be a problem for your cluster, so can losing a critical data file system. Restoring a data or non-OS file system is really no different in a cluster than in a standalone Tru64 UNIX system. For example, if your backups are made with vdump, you can restore them quite easily with this vrestore command:

 # vrestore -xf /dev/tape/tape0 -D /data

If restoring to a new disk, you will, of course, have to disklabel it and create a new file system (either AdvFS or UFS) before restoring, and either rename the device special file or adjust any references to the device name (such as those in /etc/fdmns) so that they point to the new device. The point here is that nothing about this is special just because it's a cluster.
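
A minimal sketch of the new-disk AdvFS case: the replacement disk name dsk30, the domain name data_domain, its fileset data mounted at /data, and the tape device are all illustrative assumptions. The stale data_domain entry left over from the failed disk is removed before the domain is recreated.

 # disklabel -e dsk30
 # cd /etc/fdmns
 # rm -r data_domain
 # mkfdmn /dev/disk/dsk30c data_domain
 # mkfset data_domain data
 # mount data_domain#data /data
 # vrestore -xf /dev/tape/tape0 -D /data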



