14.3 Recovering from a Missing Critical Boot File: /stand/rootconf

This last scenario doesn't happen very often, but when it does, it can be difficult to recover from. The /stand/rootconf file is used by the kernel in a maintenance mode boot to locate the root filesystem when it is a separate filesystem from the boot filesystem; this is the usual layout in every version of HP-UX from 11.0 onward. If the file doesn't exist during a normal boot, it is recreated by the sysinit process /sbin/ioinitrc. During a normal boot, the kernel can locate the root filesystem via the LABEL file and the BDRA. During a maintenance mode boot, these structures are not necessarily there or may be corrupt. This means that, just when you need maintenance mode, it is going to fail if /stand/rootconf is not there.

Some people have commented to me that this is a bit of a corner case scenario. I agree, because it doesn't happen often. The reason I am including it here is to demonstrate that we can get lots of functionality from the Recovery Shell by simply mounting the boot volume and performing some manual configuration changes (otherwise impossible) in order to recover an un-bootable situation.

The premise here is that we are trying to boot the system in maintenance mode and even that is failing; it may be due to a corrupt or missing /stand/rootconf file. The file doesn't contain much information; it's usually around 12 bytes in size. Essentially, the file contains three 32-bit words: a magic label, the start block address of the root LV, and the size of the root LV. As such, this shouldn't be too difficult to recreate, or is it?

14.3.1 A magic label of 0xdeadbeef

This is simply a hexadecimal magic number at the beginning of the file, which the kernel uses to identify that the file is not corrupt.

14.3.2 Start block address of the root LV

This is a little more difficult. To calculate this, we need some additional information relating to vg00: -
The block address where the user data starts -
The extent size of vg00 -
The size of lvol1 in extents -
The size of lvol2 in extents

In my configuration, the values are as follows: -
The block address where the user data starts = 2912. Unless HP makes a massive change to the structure of an LVM disk (which is unlikely), this will always remain the same at 2912KB. A block is regarded as being 1KB in size. -
The extent size of vg00 = 8MB = 8192KB (8192 blocks) -
The size of lvol1 = 112MB/8 = 14 extents -
The size of lvol2 = 2048MB/8 = 256 extents

The sizes of lvol1 and lvol2 in extents allow us to work out the first physical extent of lvol3: -
lvol1 = 14 extents = extents 0-13 -
lvol2 = 256 extents = extents 14-269 -
lvol3 starts at extent 270

You can see this if you run the command pvdisplay -v <boot/root disk>. We can calculate the starting block address of lvol3:

(extent size) * (starting extent of lvol3) + (start of user data)
= 8192 * 270 + 2912
= 2214752
= 0x21CB60

14.3.3 Size of the root LV

The size of my lvol3 = 1304MB = 163 extents.

Size in blocks of 163 extents = 163 * 8192 = 1335296 = 0x146000

14.3.4 Creating the /stand/rootconf file by hand

We have now established the three 32-bit words of information that we need to construct the /stand/rootconf file by hand: -
A magic label of 0xdeadbeef -
Starting block address of the root LV = 0x21CB60 -
Size of the root LV = 0x146000

The three words will look like this:

dead beef 0021 cb60 0014 6000

We must remember that in the Recovery Shell we don't have many utilities, and in particular no tool that will let us edit a file in hexadecimal. The easiest way I have found to recreate the rootconf file is to take the three hexadecimal words and convert them, byte by byte, to octal. We can then use the echo command to echo the octal values (prefixed with \0) and redirect the output to create the rootconf file; the trailing \c stops echo from appending a newline. Here's how it works:

HEX   | de  | ad  | be  | ef  | 00 | 21 | cb  | 60  | 00 | 14 | 60  | 00 |
OCTAL | 336 | 255 | 276 | 357 | 00 | 41 | 313 | 140 | 00 | 24 | 140 | 00 |

This goes to make a command line that looks like:

echo "\0336\0255\0276\0357\000\041\0313\0140\000\024\0140\000\c" > rootconf

If it makes you feel better, here is the above example run on my system that currently does have a valid rootconf file:

root@hpeos003[stand] echo "\0336\0255\0276\0357\000\041\0313\0140\000\024\0140\000\c" > rootconf.test
root@hpeos003[stand] xd -c rootconf
0000000 de ad be ef \0 ! cb ` \0 14 ` \0
000000c
root@hpeos003[stand] xd -c rootconf.test
0000000 de ad be ef \0 ! cb ` \0 14 ` \0
000000c
root@hpeos003[stand]

Now let's get to the actual problem itself. I will recreate this by removing the rootconf file and attempting a maintenance mode boot. You will typically receive one of two PANIC messages: -
Up to HP-UX 11.0 panic: (display==0xb800, flags==0x0) all VFS_MOUNTROOTs failed: NEED DRIVERS ??? -
HP-UX 11i panic: Could not create /dev/ip

I particularly like the message panic: Could not create /dev/ip. It gives no clue as to what the problem is, but now you know. The PANIC messages are not unique to this problem. They could be caused by a number of factors, including the following: -
Bad contents of the AUTO file

If someone has been trying to be clever with your AUTO file (putting in hardware addresses of root/boot disks that subsequently get moved to a new hardware location), you can rectify it by replacing the AUTO file with a default command from the ISL prompt:

ISL> hpux set autofile "hpux"

-
Corrupt LVM Boot Data Reserved Area (BDRA) We have looked at rebuilding the BDRA from a maintenance mode boot. -
Missing /dev/console on 10.20 with JFS for root filesystem

This sounds bizarre, but we are talking about HP-UX 10.20! We can rectify this by exiting to the shell from the Recovery Shell, performing a chroot_lvmdisk (more on that in a minute), and ensuring that the following device files exist:

# mknod /dev/systty c 0 0x000000
# mknod /dev/console c 0 0x000000
# mknod /dev/tty c 207 0x000000
# ln /dev/systty /dev/syscon

-
Root filesystem is corrupted beyond repair

This is a nasty one. The cause could be anything from a suspect disk (get an HP Hardware Customer Engineer to diagnose a potential hardware problem) to someone destroying the root disk (accidentally, of course). We can try to repair this by exiting to the shell from the Recovery Shell and running fsck using an alternate super-block for the filesystem. The list of alternate super-blocks for an HFS filesystem can be found in the file /etc/sbtab. We can use any of them; if the corruption was severe, we may want to try a super-block deep inside the filesystem (a high number).

# /sbin/fs/hfs/fsck -b <no. from sbtab> /dev/rdsk/cXtYdZs1lvm
(i.e., if the root and boot logical volumes are the same)

OR

# /sbin/fs/hfs/fsck -b <no. from sbtab> /dev/rdsk/cXtYdZs2lvm
(i.e., if the root and boot logical volumes are separate)

Once we have repaired the primary super-block, we need to run fsck a number of times until we achieve a clean filesystem. It is always advisable to have HP look at the disk in these circumstances to rule out a hardware problem. -
Missing driver (or Bad/Corrupted kernel) This is rare, but if we have a backup kernel, we can try booting from it. -
Other problems

I can't think of many other problems that could cause such a PANIC message, but you never know. It's always worthwhile to check the current patch level of your system. Keeping abreast of current patches can mean that you don't experience known problems that have already been resolved.

Now back to our original problem. We are in a position where we need to boot in maintenance mode but are prevented from doing so because of a missing or corrupt /stand/rootconf file. First, we need to boot our system from our Core OS Install and Recovery media and get to the Recovery Shell. I am assuming that you have already read section 14.1, so I will continue our discussion from that point.

HP-UX NETWORK SYSTEM RECOVERY
MAIN MENU
s. Search for a file
b. Reboot
l. Load a file
r. Recover an unbootable HP-UX system
x. Exit to shell
c. Instructions on chrooting to an LVM /(root).
This menu is for listing and loading the tools contained on the core media. Once a tool is loaded, it may be run from the shell. Some tools require other files to be present in order to successfully execute.
Select one of the above:

It is worth mentioning option c. Instructions on chrooting to an LVM /(root) at this point. If you have never seen the chroot command before, it allows you to run a command relative to a new root directory. FTP uses it with the anonymous ftp user. The ftpd daemon recognizes a user (usually called ftp) with a shell of /usr/bin/false and a non-null password, and limits access to the system by making the home directory of the ftp user that user's root directory. See man ftpd for more details. In a similar way, we can run a shell from the Recovery Shell but with a root directory other than the one created via the mini-kernel on the Recovery media. Our new root directory will be the mount-point where the Recovery Shell mounts our boot/root disk.
This makes traversing our disk-based boot/root filesystem easier; after the chroot, we can run a command like cd /etc, and it will take us to the /etc directory on disk because our new root directory is the disk-based root filesystem, not the RAM-based one of the Recovery Shell. It also means that our PATH statement will reference directories on disk when trying to locate commands; the Recovery media has only a limited number of commands available. I will use this to effect the changes necessary for this problem.

HP-UX NETWORK SYSTEM RECOVERY
MAIN MENU
s. Search for a file
b. Reboot
l. Load a file
r. Recover an unbootable HP-UX system
x. Exit to shell
c. Instructions on chrooting to an LVM /(root).
This menu is for listing and loading the tools contained on the core media. Once a tool is loaded, it may be run from the shell. Some tools require other files to be present in order to successfully execute.
Select one of the above: c
Exit to the shell and run 'chroot_lvmdisk'.
Type <return> to return to the MAIN MENU.

Let's try the chroot_lvmdisk command. It will attempt to run the fsck command against the / and /stand filesystems via two special device files, /dev/dsk/cXtYdZs[12]lvm, and then give us further instructions:

Select one of the above: x
Type menu to return to the menu environment.
#
# chroot_lvmdisk
Loading commands needed for recovery!
Enter the hardware path associated with the '/'(ROOT) file system (example: 0/0/1/1.15.0)
Is 0/0/1/1.15.0 the hardware path of the root/boot disk?[ynq]- y
/sbin/fs/hfs/fsck -c 0 -y /dev/rdsk/c1t15d0s1lvm
** /dev/rdsk/c1t15d0s1lvm
** Last Mounted on /ROOT
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups
64 files, 0 icont, 65777 used, 45860 free (108 frags, 5719 blocks)
Mounting c1t15d0s1lvm to the Core Tape's /ROOT directory...
/sbin/fs/vxfs/fsck -y /dev/rdsk/c1t15d0s2lvm
file system is clean - log replay is not required
/sbin/fs/vxfs/mount /dev/dsk/c1t15d0s2lvm /ROOT
/sbin/fs/hfs/mount /dev/dsk/c1t15d0s1lvm /ROOT/stand
loading /usr/sbin/chroot
x ./usr/sbin/chroot, 12288 bytes, 24 tape blocks
Enter 'cd /ROOT; chroot /ROOT /sbin/sh' at the shell prompt to chroot to the customer's /(root) disk.
#

Our root filesystem has been mounted on /ROOT, with /stand mounted under /ROOT/stand. We just need to follow the instructions on screen to run the chroot command. After that, all our commands will be relative to the disk-based filesystem and not relative to the RAM-based mini-root filesystem created by the Recovery media. Here goes:

# cd /ROOT
# chroot /ROOT /sbin/sh
# cd /stand
# ls -al
total 125580
drwxrwxrwx 10 bin bin 1024 Sep 24 05:59 .
drwxr-xr-x 24 root root 1024 Sep 24 06:25 ..
-rw------- 1 root root 0 Mar 10 2003 .kminstall_lock
-rw-r--r-- 1 root sys 20 Sep 23 07:44 bootconf
drwxr-xr-x 4 root sys 2048 Sep 24 02:59 build
drwxr-xr-x 5 root root 1024 Sep 24 03:03 dlkm
drwxr-xr-x 5 root sys 1024 Sep 24 01:22 dlkm.vmunix.prev
-rw-r--r-- 1 root sys 3440 Sep 24 05:57 ioconfig
-r--r--r-- 1 root sys 82 Feb 18 2003 kernrel
drwxr-xr-x 2 root sys 1024 Sep 24 05:59 krs
drwxr-xr-x 2 root root 1024 Sep 24 05:57 krs_lkg
drwxr-xr-x 2 root root 1024 Sep 24 05:59 krs_tmp
drwxr-xr-x 2 root root 8192 Feb 18 2003 lost+found
-rw------- 1 root root 12 Sep 24 05:57 rootconf.test
-rw-r--r-- 1 root root 1104 Sep 24 02:57 system
drwxr-xr-x 2 root sys 1024 Sep 23 07:49 system.d
-r--r--r-- 1 root sys 1104 Sep 24 02:56 system.prev
-rwxr-xr-x 1 root root 25931352 Sep 24 02:57 vmunix
-rw-r--r-- 1 root sys 12342712 Sep 24 02:17 vmunix.prev
-rwxr-xr-x 1 root sys 25927256 Sep 24 01:21 vmunixBK
#

This certainly looks like my /stand filesystem; you can see my rootconf.test file from before. Now I can effect the changes necessary; I suppose I could just rename the rootconf.test file and reboot.
I will run the echo command just for completeness:

# echo "\0336\0255\0276\0357\000\041\0313\0140\000\024\0140\000\c" > rootconf
# ls -al rootconf*
-rw------- 1 root root 12 Sep 24 07:04 rootconf
-rw-rw-rw- 1 root sys 12 Sep 24 05:39 rootconf.test
#

It would be helpful to run my xd command to compare the content of the files and ensure that they are the same, but the xd command is located in the /usr filesystem. I suppose I could use the cat command and some shell tests:

# x=$(cat -v rootconf)
# y=$(cat -v rootconf.test)
# [[ "$x" = "$y" ]]
# echo $?
0
#

Using shell tests, we can say that the two files look the same. This is as good as we can expect under the circumstances. Now I can try my maintenance mode boot. I could exit from this shell and return to the Recovery Shell, but I am just going to reboot from here, because this is sufficient for what I need to do right now.

# reboot
Shutdown at 07:13 (in 0 minutes)
System shutdown time has arrived

I will interact with the boot process to ensure that I can boot into maintenance mode:

Processor is booting from first available device.
To discontinue, press any key within 10 seconds.
Boot terminated.
---- Main Menu --------------------------------------------------------------
Command                          Description
-------                          -----------
BOot [PRI|ALT|<path>]            Boot from specified path
PAth [PRI|ALT] [<path>]          Display or modify a path
SEArch [DIsplay|IPL] [<path>]    Search for boot devices
COnfiguration menu               Displays or sets boot values
INformation menu                 Displays hardware information
SERvice menu                     Displays service commands
DIsplay                          Redisplay the current menu
HElp [<menu>|<command>]          Display help for menu or command
RESET                            Restart the system
----
Main Menu: Enter command or menu > bo pri
Interact with IPL (Y, N, or Cancel)?> y
Booting...
Boot IO Dependent Code (IODC) revision 1
HARD Booted.
ISL Revision A.00.43 Apr 12, 2000
ISL> hpux -lm
Boot
: disk(0/0/1/1.15.0.0.0.0.0;0)/stand/vmunix
10018816 + 1753088 + 1499968 start 0x1f41e8
alloc_pdc_pages: Relocating PDC from 0xf0f0000000 to 0x3fb01000.
...
/sbin/ioinitrc:
fsck: /dev/vg00/lvol1: possible swap device (cannot determine)
fsck SUSPENDED BY USER.
/dev/vg00/lvol1: No such device or address
Unable to mount /stand - please check entries in /etc/fstab
Skipping KRS database initialization - /stand can't be mounted
INITSH: /sbin/init.d/vxvm-startup2: not found
INIT: Overriding default level with level 's'
INIT: SINGLE-USER MODE
INIT: Running /sbin/sh
#

As you can see, I am now in maintenance mode, as shown by fsck being unable to find /dev/vg00/lvol1 (No such device or address). I can now continue with repairing whatever it is I need to repair while in maintenance mode. At this time, I don't know whether the LVM structures are consistent. As in the previous demonstration, I would proceed with a full recovery of this system on a step-by-step basis.
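The arithmetic in sections 14.3.2 and 14.3.3 and the hex-to-octal conversion in section 14.3.4 are easy to get wrong by hand. The following is a minimal sketch, not part of the original procedure, that derives the three words and prints the matching echo command. The variable names and the word_to_octal function are illustrative assumptions, and the values are the ones from the example configuration in this section. It assumes a shell with a printf command and arithmetic wider than 32 bits, so it is meant to be run ahead of time on a working system rather than inside the Recovery Shell.

```shell
#!/bin/sh
# Sketch: derive the three rootconf words and build the echo string.
# All names here are illustrative; they are not HP-UX tools.

# Values from this section's example system (substitute your own).
user_data_start=2912   # start of user data, in 1KB blocks
extent_size=8192       # 8MB extent, in 1KB blocks
lvol1_extents=14
lvol2_extents=256
lvol3_extents=163

# lvol3 begins on the first extent after lvol1 and lvol2.
first_extent=$((lvol1_extents + lvol2_extents))
start_block=$((extent_size * first_extent + user_data_start))
size_blocks=$((extent_size * lvol3_extents))

# Print one 32-bit word as four \0nn escapes, most significant byte first.
word_to_octal() {
    printf '\\0%02o\\0%02o\\0%02o\\0%02o' \
        $(( ($1 >> 24) & 255 )) $(( ($1 >> 16) & 255 )) \
        $(( ($1 >> 8) & 255 ))  $(( $1 & 255 ))
}

magic=$((0xdeadbeef))
escapes="$(word_to_octal "$magic")$(word_to_octal "$start_block")$(word_to_octal "$size_blocks")\\c"

printf 'start block = %d (0x%X)\n' "$start_block" "$start_block"
printf 'size        = %d (0x%X)\n' "$size_blocks" "$size_blocks"
printf 'echo "%s" > rootconf\n' "$escapes"
```

With the values shown, the start block comes out as 2214752 (0x21CB60), the size as 1335296 (0x146000), and the last line printed is the same echo command used in section 14.3.4. Keeping that output with your system documentation means the command only has to be typed, not recalculated, when you are sitting at the Recovery Shell.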