Section 26.2. Objective 2: System Recovery | LPI Linux Certification in a Nutshell (In a Nutshell (O'Reilly))

26.2. Objective 2: System Recovery

Unfortunately, things don't always go well when working on a system. The most frequent kind of system recovery happens automatically: filesystems are checked and repaired during boot, and although this takes some time, it is fully automatic most of the time. Other problems are comparatively rare.

One occasional problem is a runlevel script that cannot complete and causes the whole boot sequence to halt. This is easy to fix: boot in single-user mode as described in Chapter 14 and edit the script. A harder problem arises when the system initialization script or init itself fails and foils the boot. Coping with this is described later in the section "init or the System Initialization Fails."

26.2.1. Filesystem Damage

Sometimes a filesystem is damaged in a way that the boot process will not fix automatically. In this case, the boot drops you to a root password prompt with a display like the following (this is from Red Hat):

 *** An error occurred during the file system check.
 *** Dropping you to a shell; the system will reboot
 *** when you leave the shell.
 Give root password for maintenance
 (or type Control-D for normal startup): password
 (Repair filesystem) 1 #

After entering the password, you can repair the filesystem manually by running fsck device and answering y to all the questions. Answering anything other than y requires considerable knowledge of how your filesystem works. Few people have such knowledge, so the questions are deliberately phrased so that answering y to all of them is the safest option. It may seem a bit silly to make you answer y so many times, but at least it forces you to notice a serious error condition and lets you follow the boot process once the repair is done.
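fsck reports what it did through its exit status, and it is worth checking that status before you leave the maintenance shell. The following minimal sketch decodes the status using the values documented in fsck(8), where the status is the bitwise OR of the conditions listed. The function name describe_fsck_status and the device /dev/hda2 in the usage comment are hypothetical.

```shell
#!/bin/sh
# Decode an fsck exit status into words. Per fsck(8), the status is the
# bitwise OR of the values below, so more than one line may be printed.
describe_fsck_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "no errors"
        return 0
    fi
    for bit in 1 2 4 8 16 32 128; do
        if [ $((status & bit)) -ne 0 ]; then
            case $bit in
                1)   echo "filesystem errors corrected" ;;
                2)   echo "system should be rebooted" ;;
                4)   echo "filesystem errors left uncorrected" ;;
                8)   echo "operational error" ;;
                16)  echo "usage or syntax error" ;;
                32)  echo "checking canceled by user request" ;;
                128) echo "shared-library error" ;;
            esac
        fi
    done
}

# Usage from the maintenance shell (/dev/hda2 is a hypothetical device):
#   fsck /dev/hda2; describe_fsck_status $?
```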

26.2.2. init or the System Initialization Fails

If init, the system initialization script (/etc/rc.sysinit or /etc/init.d/rcS, as the case may be), or a script executed as part of the single-user runlevel fails, the machine will not reach even single-user mode. You must take unusual measures to put your system in a state where you can fix it.

26.2.2.1. Bypassing init

The first step is to try to bypass init. At the prompt of your boot loader, GRUB or LILO, add the kernel parameter init=/bin/bash (or the path of another shell). Instead of executing /sbin/init to perform system initialization, the kernel will start bash, and it is then your job to initialize the system. This lets you correct error conditions, debug scripts, or even bring up the network to fetch files or perform backups.
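Once you are at the shell, you can confirm that the bypass took effect by looking at the kernel command line. The sketch below prints the value of init= from a command-line string you pass in, or from /proc/cmdline when you give no argument. That file exists only after /proc has been mounted. The helper name init_override is hypothetical. In a bash started this way, echo $$ should also print 1, because the shell itself is running as process 1.

```shell
#!/bin/sh
# Print the value of the init= parameter on a kernel command line.
# With no argument, read the running kernel's command line from
# /proc/cmdline, which requires /proc to be mounted first.
init_override() {
    cmdline=${1:-$(cat /proc/cmdline)}
    for word in $cmdline; do
        case $word in
            init=*)
                echo "${word#init=}"
                return 0
                ;;
        esac
    done
    return 1    # no init= parameter: the kernel's default init was used
}

# Usage:
#   init_override            # prints /bin/bash after booting with init=/bin/bash
```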

After the kernel finishes initializing, you will quickly be dropped into a root shell with a regular root prompt (#). You do not have any virtual consoles now, nor any mouse support or cut-and-paste. Furthermore, you do not have a real tty or job control (and bash may complain about both). Because you don't have a tty, some software will refuse to work. The lack of job control matters a great deal, because Ctrl-C and Ctrl-Z do not work. If you start a command that for some reason does not terminate, you will not be able to stop it yourself; all you can do is reboot the system and make sure not to repeat the mistake (which could be as simple as pinging another system, because ping by default does not stop). All in all, the environment is quite hard to use.

Commands that may not complete should be started in the background by ending the command in & (ampersand). But this does not work well for interactive programs, because they will not be able to read input.
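Here is a sketch of the background-job pattern. The command sleep 1000 merely stands in for a command that might never finish. The shell records the process ID of the most recent background command in $!, so you can stop that command later with kill even without Ctrl-C. For ping in particular, giving a count avoids the problem altogether: ping -c 3 host, where host stands for your target, sends three packets and exits.

```shell
#!/bin/sh
# Without job control, Ctrl-C cannot interrupt a runaway command, so start
# anything that might not finish in the background and note its PID.
# "sleep 1000" stands in for such a command.
sleep 1000 &
pid=$!    # $! is the process ID of the most recent background command
echo "background command running as PID $pid"

# ... later, when the command is no longer needed:
kill "$pid"                      # SIGTERM by default (Ctrl-C would send SIGINT)
wait "$pid" 2>/dev/null || true  # reap it; a nonzero status is expected here
```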

26.2.2.2. Working in the shell environment

Once you have the prompt, the first thing to do, if you are sure the root filesystem is sound, is to mount it read/write. If you are uncertain of its soundness, run fsck -f / on it first. The command to remount it read/write is mount -o remount,rw /. If your /etc/fstab file is damaged, you may have to give the device name instead: fsck -f device and mount -o remount,rw device.
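The check-then-remount step might be scripted as shown below. This is a sketch, not a standard tool: the function name check_and_remount_root and the RUN dry-run variable are hypothetical conventions. The sketch refuses to remount if fsck reports uncorrected errors (status 4 or higher) or asks for a reboot (bit 2), following the meanings fsck(8) gives those values. When you pass a device, the sketch names the mount point as well, so mount does not have to look the device up in a damaged fstab.

```shell
#!/bin/sh
# Check the root filesystem, then remount it read/write.
# Argument: the root device (default "/", looked up via /etc/fstab).
# Set RUN=echo to print the commands instead of running them.
check_and_remount_root() {
    device=${1:-/}
    $RUN fsck -f "$device"
    status=$?
    if [ "$status" -ge 4 ]; then
        echo "fsck left errors uncorrected (status $status); not remounting" >&2
        return "$status"
    fi
    if [ $((status & 2)) -ne 0 ]; then
        echo "fsck asks for a reboot (status $status); reboot first" >&2
        return "$status"
    fi
    if [ "$device" = / ]; then
        $RUN mount -o remount,rw /
    else
        # Naming both device and mount point means mount need not
        # consult the damaged /etc/fstab.
        $RUN mount -o remount,rw "$device" /
    fi
}

# Usage: check_and_remount_root            (fstab intact)
#        check_and_remount_root /dev/hda1  (hypothetical root device)
```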

Once your root filesystem is mounted read/write, you can do almost whatever you want to troubleshoot and fix your system. Most tasks, however, will involve software from /usr and files in /var and /tmp, and /proc usually comes into the picture fairly quickly as well. If your fstab is intact, all of these can be mounted with a simple mount -av. You may want to check them all first with fsck -A -a.
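Before you run mount -av, it can help to see which fstab entries are still unmounted. The following sketch compares an fstab-format file with a mounts-format file such as /proc/mounts. It uses only shell builtins, so it works while /usr is unavailable. It assumes /proc is already mounted and ignores comments, swap entries, and noauto entries. The name unmounted_fstab_entries is hypothetical.

```shell
#!/bin/sh
# List mount points from an fstab-format file that are not yet mounted
# according to a mounts-format file (default /proc/mounts), roughly what
# "mount -av" still has to do. Only shell builtins are used, so this works
# while /usr is unavailable. Comments, swap and noauto entries are skipped.
unmounted_fstab_entries() {
    fstab=${1:-/etc/fstab}
    mounts=${2:-/proc/mounts}
    while read -r dev dir type opts rest; do
        case $dev in
            ''|'#'*) continue ;;            # blank line or comment
        esac
        if [ "$type" = swap ]; then continue; fi
        case ",$opts," in
            *,noauto,*) continue ;;         # not mounted by mount -a
        esac
        found=no
        while read -r mdev mdir mrest; do
            if [ "$mdir" = "$dir" ]; then found=yes; break; fi
        done < "$mounts"
        if [ "$found" = no ]; then echo "$dir"; fi
    done < "$fstab"
}

# Usage (after mount /proc):  unmounted_fstab_entries
```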

A lot of the work will involve looking at and editing files. The software to do this usually resides in /usr, so if /usr is unavailable you need to get creative. Both cat and dd are in /bin. cat can display entire files, and the Linux console can be scrolled up and down with Shift-Page Up and Shift-Page Down. dd with the parameters skip=n and count=m skips the first n blocks of a file and then reads the next m blocks. When all is lost, cat is your hardcore editor: cat >/etc/fstab lets you rewrite your fstab file by typing the new contents and ending the input with Ctrl-D. But if your system is this damaged, it may be easier to use a rescue CD, as discussed later in this chapter.
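Paging through a file with dd can be wrapped in a small function. This sketch sets the block size to 512 bytes explicitly (the dd default) and discards dd's transfer statistics. The name show_blocks is hypothetical, and the log file in the usage comment is only an example.

```shell
#!/bin/sh
# Page through a file using only dd from /bin.
# show_blocks FILE N M: skip the first N blocks of FILE, print the next M.
# bs=512 is dd's default block size, spelled out for clarity.
show_blocks() {
    dd if="$1" bs=512 skip="$2" count="$3" 2>/dev/null
}

# Example: the second 512-byte block of the system log
#   show_blocks /var/log/messages 1 1
#
# The last-resort "editor": keep a copy, then retype the file with cat,
# ending the input with Ctrl-D on an empty line:
#   cat /etc/fstab > /etc/fstab.bak
#   cat > /etc/fstab
```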

Your general tactic should be to find the (hopefully) single problem that stops your system from booting smoothly. You can run each of the startup scripts by hand, watch it work, and note any errors without risking them scrolling off the top of your screen. Once the system initialization script and the scripts in /etc/rc1.d work, you can do a single-user boot. In single-user mode you regain virtual consoles and job control, which makes it a much more comfortable environment to work in.
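Running a runlevel's start scripts by hand, one at a time, might look like the sketch below. It assumes SysV-style script names such as S20network that accept a start argument. It relies on the shell's sorted glob expansion to approximate the order in which the rc script runs them. The function name run_start_scripts and the NOPAUSE variable are hypothetical. The sketch pauses after each script so that errors can be read before they scroll away.

```shell
#!/bin/sh
# Run a runlevel's start scripts by hand, in glob (sorted) order, and
# pause after each so its messages can be read before they scroll away.
# Assumes SysV-style names such as S20network that take a "start" argument.
# Set NOPAUSE=1 to run them without pausing.
run_start_scripts() {
    dir=${1:-/etc/rc1.d}
    for script in "$dir"/S*; do
        [ -x "$script" ] || continue    # also skips the literal S* if no match
        echo "=== $script start"
        "$script" start
        echo "=== $script exited with status $?"
        if [ -z "$NOPAUSE" ]; then
            printf 'Press Enter to continue... '
            read -r reply
        fi
    done
}

# Usage: run_start_scripts /etc/rc1.d
```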

Once in single-user mode, if there are more problems, you can again run each script in the default runlevel directory, /etc/rcn.d (where n is the default runlevel), one at a time in the correct sequence. If the scripts appear to work one by one, you can stop them again and use init n to have init run the runlevel scripts in the ordinary manner. The init command is discussed extensively in Chapter 14.
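To find out which /etc/rcn.d directory to work through and what to pass to init n, you can read the default runlevel from /etc/inittab. That file records it on a line such as id:2:initdefault:. The sketch below extracts the value with the shell's read builtin and skips commented-out lines. The name default_runlevel is hypothetical.

```shell
#!/bin/sh
# Print the default runlevel from a SysV inittab, i.e. the second field
# of the line whose action field is "initdefault" (e.g. "id:2:initdefault:").
default_runlevel() {
    inittab=${1:-/etc/inittab}
    while IFS=: read -r id level action rest; do
        case $id in
            '#'*) continue ;;    # skip comments, including "#id:5:initdefault:"
        esac
        if [ "$action" = initdefault ]; then
            echo "$level"
            return 0
        fi
    done < "$inittab"
    return 1    # no initdefault entry found
}

# Usage, once the scripts in /etc/rc$(default_runlevel).d work one by one:
#   init "$(default_runlevel)"
```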

26.2.3. Booting from a Rescue CD

It is particularly hard to restore your system to working order if your disk bootstrap is damaged or if your system has serious damage, in particular a badly damaged root filesystem in which files in /etc are corrupted or lost (/etc/fstab and /etc/inittab are particularly important to booting). The best solution then is a rescue disk or your distribution's boot disks.

Using your distribution's installation or rescue CDs to boot can be immensely helpful. Red Hat's installation CD has a distinct rescue boot mode. Both the Debian and Red Hat CD installation procedures fork a shell on a virtual console, so that when the first dialog appears on the main screen you can switch virtual consoles and find a root shell on one of them. Both CD environments are complete enough that you can do a good deal of troubleshooting. If you need a more complete environment, there are many dedicated rescue disks on the Internet that provide everything but the kitchen sink.

26.2.3.1. Restoring the bootstrap

There is a fairly simple workaround for damage to your disk bootstrap program, GRUB or LILO. Mount the root filesystem and then reinstall the boot loader. You can easily reinstall your bootstrap as described in Chapter 4 without any exotic options.

The following command sequence assumes that /dev/hda is your boot disk and that /mnt is where you mounted your root filesystem. Some boot and rescue disks use /mnt for their own purposes; in that case, mount your root filesystem elsewhere and adjust the commands accordingly. By the way, if your bootstrap program is damaged, did you put something bad in its configuration file? Perhaps you should review it.

 # chroot /mnt /bin/bash
 # mount /boot
 # lilo -v
     or
 # grub-install /dev/hda
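You can also do the same repair without an interactive chroot shell by handing the commands to chroot directly. In the sketch below, the function name reinstall_bootloader and the RUN dry-run variable are hypothetical. /mnt and /dev/hda are the same assumptions as above. The sketch mounts /boot only if the damaged system's fstab lists it as a separate filesystem, and that check assumes grep is available inside the chroot.

```shell
#!/bin/sh
# Reinstall the boot loader of a system whose root filesystem is mounted
# at ROOT (default /mnt), without an interactive chroot shell.
# Arguments: lilo|grub [ROOT] [DISK]; DISK (default /dev/hda) is only
# used by grub-install. Set RUN=echo to print the command instead.
reinstall_bootloader() {
    loader=$1
    root=${2:-/mnt}
    disk=${3:-/dev/hda}
    case $loader in
        lilo) cmd='lilo -v' ;;
        grub) cmd="grub-install $disk" ;;
        *) echo "unknown boot loader: $loader" >&2; return 2 ;;
    esac
    # Inside the chroot, mount /boot first only if the damaged system's
    # fstab lists it as a separate filesystem (needs grep in the chroot).
    $RUN chroot "$root" /bin/sh -c \
        "if grep -q '^[^#]*[[:space:]]/boot[[:space:]]' /etc/fstab; then mount /boot || exit 1; fi; $cmd"
}

# Usage: reinstall_bootloader grub /mnt /dev/hda
```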

26.2.3.2. Exploring the damaged system

The chroot command in the previous example is also useful for exploring your damaged disks in other ways. Such exploration is easier if you have a correct mtab file. In many rescue environments, /etc/mtab is linked to /proc/mounts, and in any case the mtab file is usually correct in the rescue environment. Not so in the chroot environment, where /etc/mtab is the damaged system's own copy. If you run the following commands, your mtab will be correct and important commands such as df will work properly. mount -a can then mount all your disks into the chroot environment. This is useful because sometimes the problem that stops you from booting normally turns out to be a disk chock full of stuff that needs a good cleaning before you can boot.

 # chroot /mnt /bin/bash
 # mount /proc
 # cat /proc/mounts > /etc/mtab
 # fsck -A -a
 # mount -av
 ...
 # df -k
 ...
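When df shows a full filesystem, the next question is what is filling it. This sketch lists the biggest entries directly below a directory, in kilobytes, with the largest last. du's -x option keeps it from counting other mounted filesystems. The name largest_dirs is hypothetical. The sketch needs du, sort, and tail, which should be available inside the chroot once /usr is mounted.

```shell
#!/bin/sh
# List the N largest entries (default 10) directly below DIR (default /),
# sizes in kilobytes, largest last. -x stays on DIR's own filesystem.
largest_dirs() {
    dir=${1:-/}
    du -skx "${dir%/}"/* 2>/dev/null | sort -n | tail -n "${2:-10}"
}

# Usage inside the chroot, after mount -av:
#   largest_dirs /var 5
```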

26.2.3.3. Loss of key files

There are some things you can't recover from without some sort of reinstallation, among them missing binaries and shared libraries. Many of the files in /etc, on the other hand, can easily be copied from other machines or written by hand. Booting a rescue CD, entering the chroot environment, and bringing up the network by running the network initialization script are all quite easy. You can then use a file copy command such as ftp or scp, or even a package manager such as apt-get, to fetch the needed files from another machine or an FTP site. If the system is too badly damaged, it will be hard or impossible to execute programs on it. For such a situation, both dpkg and rpm support options (--root=/mnt and --root /mnt, respectively) that let you install packages without doing the chroot first. But perhaps the easiest recovery solution is to make a tarfile of the damaged files on a similar healthy machine, get it onto the patient by FTP, floppy, USB memory stick, or some other mechanism, and then untar it on top of the damaged filesystem.
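The tarfile approach can be sketched with two helpers: pack_files runs on the healthy machine and unpack_files runs on the patient. Paths are given relative to the root directory, so the archive unpacks into whatever directory you name, such as /mnt under a rescue CD. The names pack_files and unpack_files and the example paths are hypothetical. Ideally, the healthy machine runs the same distribution release, so that the copied binaries and libraries match the rest of the patient.

```shell
#!/bin/sh
# On the healthy machine: pack FILES (paths relative to SRCROOT) into ARCHIVE.
#   pack_files ARCHIVE SRCROOT FILE...
# Symbolic links are archived as links, so include their targets as well.
pack_files() {
    archive=$1
    src=$2
    shift 2
    tar -cf "$archive" -C "$src" "$@"
}

# On the patient: unpack ARCHIVE on top of the filesystem mounted at DEST,
# keeping the recorded permissions (-p).
#   unpack_files ARCHIVE DEST
unpack_files() {
    tar -xpf "$1" -C "$2"
}

# Example (hypothetical files):
#   healthy# pack_files /tmp/fix.tar / etc/inittab bin/mount
#   rescue#  unpack_files /tmp/fix.tar /mnt
```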

Rescue initrd

Considering how useful rescue disks are in some system recovery scenarios, and how complete Debian's initrd environment is, it's not far-fetched to imagine a rescue initrd. In fact, in Debian it's as simple as setting DELAY in /etc/mkinitrd/mkinitrd.conf to a nonzero value, regenerating the initrd, and pressing Enter at boot time at the "Waiting for $DELAY seconds, press ENTER to obtain a shell" prompt. The boot process will then drop you into ash with the full Debian environment described earlier at your disposal. This is a bit harder to do in Red Hat.

The exam won't ask about this.
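The DELAY change described in the sidebar can be made with sed from the shell. The sketch below assumes the setting appears as a line of the form DELAY=0, as in Debian's mkinitrd.conf. The function name set_initrd_delay is hypothetical. The sketch writes the result back with cat so the file keeps its owner and mode, and it appends a DELAY line if none exists.

```shell
#!/bin/sh
# Set DELAY in a mkinitrd.conf-style file, assuming lines like "DELAY=0".
#   set_initrd_delay SECONDS [CONF]
set_initrd_delay() {
    seconds=$1
    conf=${2:-/etc/mkinitrd/mkinitrd.conf}
    case $seconds in
        ''|*[!0-9]*) echo "not a number of seconds: $seconds" >&2; return 2 ;;
    esac
    newconf="$conf.new.$$"
    if grep -q '^DELAY=' "$conf"; then
        sed "s/^DELAY=.*/DELAY=$seconds/" "$conf" > "$newconf" || return 1
    else
        { cat "$conf"; echo "DELAY=$seconds"; } > "$newconf" || return 1
    fi
    # Copy back with cat so the original file keeps its owner and mode.
    cat "$newconf" > "$conf" && rm -f "$newconf"
}

# Usage: set_initrd_delay 5
# The initrd must then be rebuilt with mkinitrd before the change applies.
```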