Certification Objective 16.01-Troubleshooting Strategies | Linux Patch Management: Keeping Linux Systems Up To Date

When you encounter problems, proceed calmly. If you've read this book thoroughly, have the requisite experience, and do not panic, you'll usually be able to identify the cause of a problem fairly quickly.

If you can't identify the cause right away, try the simplest solutions first. They take less time and are less likely to sabotage your system.

If you have to go into more detail, remember the seven basic steps of the scientific method (as defined in Wikipedia). They can be applied to the Troubleshooting and System Maintenance portion of your Red Hat exam. If you have experience, you may be able to jump to a solution at any of these steps.

Inside the Exam

Troubleshooting and System Maintenance

As described in the Red Hat Exam Prep guide (www.redhat.com/training/rhce/examprep.html), there are Troubleshooting and System Maintenance requirements for both the RHCT and RHCE exams. To qualify as an RHCE, you need to complete all RHCT requirements during the first hour of the exam. These requirements can fall into the following categories:

Boot systems into different runlevels for troubleshooting and system maintenance.
Diagnose and correct misconfigured networking.
Diagnose and correct hostname resolution problems.
Configure the X Window System and a desktop environment.
Add new partitions, filesystems, and swap to existing systems.
Use standard command line tools to analyze problems and configure system.

To qualify as an RHCE, you also need to complete enough of the RHCE requirements for an overall score of 80%, which can fall in the following categories:

Use the rescue environment provided by first installation CD.
Diagnose and correct boot failures arising from boot loader, module, and filesystem errors.
Diagnose and correct problems with network services (see Installation and Configuration below for a list of these services). (The reference is to the Installation and Configuration section of the RHCE exam.)
Add, remove, and resize logical volumes.
Diagnose and correct networking services problems where SELinux contexts are interfering with proper operation.

For example, if there are five RHCT problems and five RHCE problems, you'll have to answer all five RHCT problems and three RHCE problems correctly to qualify as an RHCE on this part of the exam.

The network service issues you may encounter may include one or more of the services described throughout this book, including Apache, Samba, NFS, FTP, Squid, sendmail, Postfix, Dovecot, SSH, DNS, and NTP.

1. Define the question.

Understand what happened. Take note of the error messages you see. If possible, analyze log files for other messages. If you've read this book and run the labs, you may recognize the problem and cause immediately.

2. Gather information and resources.

Analyze your system. This may require that you check the relevant configuration files to make sure that appropriate services are running and that security or other characteristics of your system are working as they should. If you have experience, you'll often recognize the problem and cause when you see something wrong in these areas.

3. Form a hypothesis.

If you're still not sure what's wrong, make your best guess. Remember that time is severely limited during the Red Hat exams, so if you can afford it, consider skipping a problem. (To qualify for either the RHCT or RHCE, you're required to solve all RHCT-level Troubleshooting and System Maintenance issues.)

4. Perform experiments and collect data.

Before performing any experiments, back up anything you might change. For example, if you think the problem is with your Samba configuration file, back up your /etc/samba/smb.conf file, in case acting on your hypothesis makes things worse.

5. Analyze data.

This is essentially identical to step 1. If what you do doesn't solve the problem, you'll need to analyze what went wrong, using error messages and log files as appropriate.

6. Interpret data and draw conclusions that serve as a starting point for new hypotheses.

In many cases, you'll want to restore the original files from the backup you made in step 4, repeat steps 2 through 4, and try again.

7. Publish the results.

Once you've solved the problem, you'll want to make sure the problem remains solved after rebooting your system. For example, if you've addressed a Samba problem, you'll want to "publish" by making sure the Samba daemon starts the next time your Linux system boots.
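The backup-and-restore habit from steps 4 and 6 can be sketched as a pair of small shell functions. This is a minimal illustration, not a prescribed method: the names backup_config and restore_config are hypothetical, and the .bak suffix is an arbitrary choice.

```shell
# Sketch only: backup_config and restore_config are hypothetical helper names.

# Copy a configuration file to <file>.bak, preserving mode and timestamps.
backup_config() {
    cp -p "$1" "$1.bak"
}

# Put the saved copy back if an experiment makes things worse.
restore_config() {
    cp -p "$1.bak" "$1"
}

# Example on a real system (as root):
#   backup_config /etc/samba/smb.conf
#   ... edit the file, then test with: service smb restart
#   restore_config /etc/samba/smb.conf    # only if the change did not help
#   chkconfig smb on                      # "publish": start Samba at boot
```

Keeping the backup next to the original makes it easy to find from the rescue environment, where the same file appears under /mnt/sysimage.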

Two places where you are likely to make errors that result in an unbootable system are in the boot loader and init configuration files, /boot/grub/grub.conf and /etc/inittab. For example, identifying the wrong partition as the root partition (/) can lead to a kernel panic. Other configuration errors in /boot/grub/grub.conf can also cause a kernel panic when you boot Linux. Whenever you make changes to these files, the only way to fully test them out is to reboot Linux.

Exam Watch

As a Red Hat Enterprise Linux administrator, you will be expected to know how to fix improperly configured files related to the boot process. For this reason, a substantial portion of the exam is devoted to testing your troubleshooting and analysis skills.

The following Scenario & Solution table lists some problems that you may encounter during the boot process, along with possible solutions. It is far from comprehensive. The solutions that I've listed work on my computer, as I've configured it. There may be (and often is) more than one possible cause. These solutions may not work for you on your computer or on the Red Hat exams. To know what else to try, use your experience.

To get the equivalent of more experience, try additional scenarios (remember: never do these things on a production computer). Once you're familiar with the linux rescue environment, test these scenarios. These scenarios worked as shown when I tested them on RHEL 5. However, they lead to different errors on RHEL 4 and RHEL 3.

For the first scenario shown, change the name of the grub.conf file so it can't be loaded. Reboot and see what it does on your system. Use the linux rescue environment to boot into RHEL and use the noted solution to fix your system.

For the second scenario shown, overwrite the MBR; on a SATA/SCSI drive, you can do so with the following command (substitute hda for sda if your system uses an IDE/PATA drive):

 # dd if=/dev/zero of=/dev/sda bs=446 count=1
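Before trying this on a test system, it may help to save the first sector and to see exactly what the command touches. The sketch below is an illustration under stated assumptions: the function names and backup path are hypothetical. It relies on the standard MBR layout, in which the first 446 bytes of the 512-byte sector hold boot loader code, followed by the 64-byte partition table and a 2-byte signature.

```shell
# Sketch only: the function names and the backup path are hypothetical.

# Save the first 512-byte sector (boot code, partition table, signature)
# of a disk or disk image to a backup file.
save_mbr() {
    dd if="$1" of="$2" bs=512 count=1 2>/dev/null
}

# Zero only the first 446 bytes (the boot loader code). conv=notrunc keeps
# dd from truncating a regular file; it is harmless on a block device.
wipe_boot_code() {
    dd if=/dev/zero of="$1" bs=446 count=1 conv=notrunc 2>/dev/null
}

# Example on a test system (as root), never on a production computer:
#   save_mbr /dev/sda /root/sda-mbr.bak
#   wipe_boot_code /dev/sda
```

Because only the boot code is zeroed, the partition table survives, which is why reinstalling GRUB is enough to recover in the second scenario.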

The third scenario is misleading; it's what happened when I overwrote my /bin/mount command with /sbin/mount.nfs and rebooted.

The fourth scenario is what happened when I overwrote my /sbin/init command.

The fifth scenario is based on a missing /etc/inittab; I suspect it's much more likely that you'll see some major error (such as a key command commented out) in that file.

The sixth scenario results in the messages shown in Figure 16-1, which happened when I set the default runlevel to 3 and commented out the commands with the mingetty directives in /etc/inittab.

Figure 16-1: One possible error message

The seventh scenario is based on a typo in the root directive in /boot/grub/grub.conf, which results in the messages shown in Figure 16-2.

Figure 16-2: A second possible error message

Sometimes, you may run into a problem with the default runlevel. But you're not stuck. There are two ways to boot into different runlevels. You can boot directly from the GRUB configuration menu, or you can boot into the linux rescue environment from the first RHEL installation CD.

SCENARIO & SOLUTION
When you boot, you see a grub> prompt.	You may have a problem that prevents the boot loader from reading the GRUB configuration file, grub.conf. The file may be missing or corrupt. For hints on creating a new grub.conf, see menu.lst in the /usr/share/doc/grub-versionnum directory.
When you boot your computer, you see a message such as "Missing operating system" or "Operating System Not Found."	Your master boot record (MBR) has been erased, and you'll need to reload GRUB on the MBR using grub-install. (It's possible that everything has been erased, which I believe is beyond the scope of this part of the exam.)
During the boot process, you see the "Could not start the X server (graphical environment) due to some internal error" message.	You could have problems with a full or unmounted /tmp or /home directory. If these directories are not mounted, the mount command may be corrupt. In that case, you'll need to reload it from the mount RPM.
You see an "exec of init (/sbin/init) failed!!!" error.	Your init command may be corrupt. Try reloading it from the SysVinit RPM.
You see the "INIT: No inittab file found" message.	This is straightforward: there is something wrong with your /etc/inittab file. RHEL 5 prompts you to "Enter runlevel"; as of this writing, if /etc/inittab is missing, enter s to see a bash prompt.
You see a message such as what's shown in Figure 16-1.	You may not have anything starting a text or GUI console in the active runlevel; trace it starting with /etc/inittab.
You see a message such as what's shown in Figure 16-2. Take careful note of the last file cited in the message.	RHEL has encountered some problems when reading the grub.conf configuration file. Start the linux rescue environment and check this file as well as the referenced files in the /boot directory.
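For the last scenario, checking the "referenced files in the /boot directory" can be sketched as a short function. This is an illustration only: check_grub_files is a hypothetical name, and the sketch assumes that the kernel and initrd paths in grub.conf contain no spaces. With a separate /boot partition, the paths in grub.conf are relative to that partition, so from the rescue environment the second argument would be /mnt/sysimage/boot; without one, the paths start with /boot and the second argument would be /mnt/sysimage.

```shell
# Sketch only: check_grub_files is a hypothetical helper name.
# Usage: check_grub_files <grub.conf> <directory that holds the kernel files>
# Prints each kernel or initrd path from grub.conf that is missing and
# returns nonzero if any are missing.
check_grub_files() {
    conf=$1
    bootdir=$2
    missing=0
    # Field 1 is the directive, field 2 the path as GRUB sees it.
    for path in $(awk '$1 == "kernel" || $1 == "initrd" { print $2 }' "$conf"); do
        if [ ! -e "$bootdir$path" ]; then
            echo "missing: $path"
            missing=1
        fi
    done
    return $missing
}

# Example from the rescue environment (separate /boot partition assumed):
#   check_grub_files /mnt/sysimage/boot/grub/grub.conf /mnt/sysimage/boot
```

A clean result does not prove that grub.conf is correct; a typo in the root directive, for instance, would not be caught by this check.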

Booting Into Different Runlevels

In brief, you can boot into the runlevel of your choice from the GRUB configuration menu. This is one of the RHCT Troubleshooting and System Maintenance skills and also an essential skill for all Linux administrators. Specifically, you can boot into the runlevels described in Table 16-1.

Table 16-1: Linux Runlevels
Runlevel	Description
0	Halts the system
1	Activates SELinux; runs /etc/rc.sysinit, which checks and mounts filesystems; executes all scripts in the /etc/rc1.d directory
s or single	Single-user mode; activates SELinux; runs /etc/rc.sysinit, which checks and mounts filesystems
emergency	Emergency boot mode; activates SELinux; mounts only the root (/) filesystem
init=/bin/sh	Emergency boot mode; mounts only the root (/) filesystem
2	Multiuser mode with some networking; does not include some NFS functions, the automounter, or CUPS
3	Multiuser mode with networking; boots into a text login console
4	Generally unused; however, the defaults support near-identical settings to runlevel 3
5	Multiuser mode with the X Window System; boots into an X-based login screen
6	Reboots the system

The Red Hat Exam Prep guide states that "RHCTs should be able to boot systems into different run levels for troubleshooting and system maintenance." This is straightforward: at the boot loader prompt, you can start Linux at a different runlevel. For example, if your default runlevel in /etc/inittab is 5, your system normally boots into the GUI. If you're having problems booting into the GUI, you can start RHEL in the standard text mode, runlevel 3.
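To confirm which runlevel a system boots into by default, you can read the initdefault line in /etc/inittab (for example, id:5:initdefault:). The following is a minimal sketch; default_runlevel is a hypothetical helper name.

```shell
# Sketch only: default_runlevel is a hypothetical helper name.
# Print the default runlevel from an inittab-style file, that is, the second
# colon-separated field of the active (uncommented) initdefault line.
default_runlevel() {
    awk -F: '$1 !~ /^#/ && $3 == "initdefault" { print $2; exit }' "$1"
}

# Example on a running system:
#   default_runlevel /etc/inittab
# Example from the rescue environment:
#   default_runlevel /mnt/sysimage/etc/inittab
```

If the function prints nothing, the initdefault line is missing or commented out, which is worth checking before you look anywhere else.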

One other option to help rescue a damaged Linux system is single-user mode. This is appropriate if your system can find at least the root filesystem (/). Your system may not have problems finding its root partition and starting the boot process, but it may encounter problems such as damaged configuration files or an inability to boot into one of the higher runlevels. When you boot into single-user mode, options are similar to those of the standard linux rescue environment described later in this chapter. Other runlevels shown in Table 16-1 may be useful in specialized circumstances.

These instructions for booting into a different runlevel assume that you're using the default RHEL boot loader, GRUB. At the GRUB menu, press (lowercase) p to enter the GRUB password if required, and then type (lowercase) a to modify the kernel arguments. When you see a line similar to

 grub append> ro root=LABEL=/ rhgb quiet

add one of the following options (single, init=/bin/sh, emergency, or 1) to the end of that line:

 grub append> ro root=LABEL=/ single
 grub append> ro root=LABEL=/ init=/bin/sh
 grub append> ro root=LABEL=/ emergency
 grub append> ro root=LABEL=/ 1

You can use the same technique to boot into another runlevel. For example, to boot from the GRUB boot loader into runlevel 3, navigate to where you can modify the kernel arguments, and add 3 to the end of the line:

 grub append> ro root=LABEL=/ 3

On the Job

The terms boot loader and bootloader are used interchangeably. In this book, I've normally used the term boot loader, as that seems to be the direction of the Red Hat documentation. However, the term bootloader is still common even in Red Hat documentation.

When you boot into runlevel 1, no password is required to access the system. As you'll see later in this chapter, running your system in this runlevel is somewhat similar to running a system booted in rescue mode. Many of the commands and utilities you normally use are unavailable. You may have to mount additional drives or partitions and specify the full pathname when running some commands.

When you have corrected the problem, you can reboot the system. Alternatively, you can type the exit command to boot into the default runlevel as defined in /etc/inittab, probably runlevel 3 or 5.

On the Job

In runlevel 1, any user can change the root password. You do not want people rebooting your computer to go into this runlevel to change your root password. Therefore, it's important to keep your server in a secure location. You can also password-protect GRUB (see Chapter 3) or even the BIOS menu to keep anyone with physical access to your computer from booting it in single-user mode.

The linux rescue Environment

In brief, you can boot even an unbootable system using the linux rescue environment. Boot from the first RHEL installation CD, and type linux rescue at the boot: prompt. The first couple of steps are the same as those for installing RHEL 5. If the linux rescue environment detects your system, it can mount your standard directories under /mnt/sysimage in read-write or read-only mode. If your filesystems are not mountable, you can open a command prompt and continue with your troubleshooting.

When you type linux rescue at the installation boot prompt and go through the steps, the installation discs install a compact version of a root filesystem. To boot into linux rescue mode, first boot your system using the first installation CD in a bootable CD-ROM drive, as shown in Figure 16-3.

Figure 16-3: Booting into linux rescue mode

Exam Watch

The RHCE portion of the Red Hat Exam Prep guide explicitly states that you need to know how to use the rescue environment provided by the first RHEL installation CD.

Now take the following steps:

1. Boot your system from the first RHEL 5 installation CD.
2. Type linux rescue at the boot: prompt as shown in Figure 16-3. Your system boots a basic Linux system from the first installation CD.
3. Select an appropriate language when prompted.
4. Select an appropriate keyboard type when prompted.
5. You'll see the following message, briefly:

 Running anaconda, the Red Hat Enterprise Linux rescue mode - please wait...

6. You'll be asked whether you want to set up the network interfaces on the local system, as shown in Figure 16-4. Select Yes if you need to connect to a network installation source to install other packages; otherwise, select No and skip to step 8.
7. You'll see a network configuration window for the local network card, similar to what's shown in Figure 16-5. If directed by your instructor or exam proctor to set up a static IP address, follow the instructions carefully; otherwise, try to configure this interface using a local network DHCP server. If you set up a static IP address, you'll see another screen where you're prompted to enter a gateway, a primary DNS, and a secondary DNS IP address.
8. Select one of the three options for the rescue environment, as shown in Figure 16-6. Generally, you should try the Continue option first, followed by Read-Only. Continue mounts your RHEL filesystems in read-write mode. Read-Only mounts RHEL filesystems in read-only mode. Skip does not mount any of your RHEL filesystems. I address each of these three options in detail in the following sections.
9. When successful, you'll see a message to the effect that your system has been mounted under /mnt/sysimage. When you select OK (the only option), you'll see the following prompt, where you have root privileges.
```
 sh-3.1# 
```

Figure 16-4: Networking interface options in linux rescue mode

Figure 16-5: Networking interface configuration in linux rescue mode

Figure 16-6: The linux rescue environment options

Standard linux rescue Environment

When you select Continue from the screen shown in Figure 16-6, you're taken through the standard linux rescue environment. The rescue files search for your root directory (/) filesystem. If found, your standard root directory (/) is mounted on /mnt/sysimage. All of your other regular filesystems are subdirectories of root; for example, your /boot directory will be found on /mnt/sysimage/boot.

Not all of your filesystems may mount properly. You may see error messages such as:

 An error occurred trying to mount some or all of your filesystem

This suggests that at least one of the filesystems listed in /etc/fstab isn't mounting properly for some reason. If the linux rescue environment has no problems, you'll see a message noting that your system has been mounted, as shown in Figure 16-7.

Figure 16-7: The linux rescue environment has found your root directory (/).

Select OK. You should see the following prompt messages:

 Your system is mounted under the /mnt/sysimage directory.
 When finished please exit from the shell and your system will reboot.
 sh-3.1#

You'll use the chroot /mnt/sysimage command shortly. Now you can work on repairing any files or filesystems that might be damaged. First, check for unmounted filesystems. Run a df command. The output should look similar to Figure 16-8.

Figure 16-8: Labels, filesystems, and partitions

Compare the result to the /mnt/sysimage/etc/fstab configuration file. If some filesystem is not mounted, it may be configured incorrectly in the fstab file. Alternatively, the label associated with a partition may not match the filesystem shown in your fstab file. For example, to find the label associated with /dev/sda1, run the following command:

 # e2label /dev/sda1

This should return the name of a filesystem to be mounted on that partition such as /boot.
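To see which labels the fstab file expects before running e2label on each partition, you could list its LABEL= entries. The sketch below is only an illustration; fstab_labels is a hypothetical helper name, and it assumes the usual whitespace-separated fstab columns.

```shell
# Sketch only: fstab_labels is a hypothetical helper name.
# Print "label mountpoint" for every fstab entry that identifies its
# filesystem by LABEL=, skipping comment lines.
fstab_labels() {
    awk '$1 ~ /^LABEL=/ { sub(/^LABEL=/, "", $1); print $1, $2 }' "$1"
}

# Example from the rescue environment:
#   fstab_labels /mnt/sysimage/etc/fstab
#   e2label /dev/sda1      # should match one of the labels listed
```

A label that appears in fstab but on no partition, or on more than one, is a likely reason for a filesystem that fails to mount.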

Sometimes an unmounted filesystem just needs a little cleaning; remember, a command such as the following checks and repairs the filesystem on the /dev/sdb1 partition:

 # fsck /dev/sdb1

The fsck command should be applied only to an unmounted filesystem. For example, if you get a message such as:

 WARNING!!! Running e2fsck on a mounted filesystem may cause SEVERE filesystem damage.

unmount the subject filesystem with a command such as umount /mnt/sysimage/boot. If that doesn't work, restart the rescue process. When you get to the screen shown in Figure 16-6, select Skip and read the "No Mount linux rescue Environment" section later in this chapter.

Alternatively, you may see a message like:

 fsck.ext2: Device or resource busy while trying to open /dev/hda2
 filesystem mounted or opened exclusively by another program?

This is presented when the partition, in this case /dev/hda2, is part of a Logical Volume Manager (LVM) array as described in Chapter 8. In that case, you'll need to review your /mnt/sysimage/etc/fstab file for the appropriate logical volume (and unmount it), before trying to apply the fsck command to it.
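One way to avoid both messages is to check whether a device is mounted before applying fsck to it. The following sketch reads /proc/mounts; is_mounted is a hypothetical helper name. It compares device names literally, so it is only an approximation: a logical volume or the root device may be listed in /proc/mounts under a different name than the one you pass in.

```shell
# Sketch only: is_mounted is a hypothetical helper name.
# Usage: is_mounted <device> [mounts-file]
# Succeeds if the device appears in the first column of the mounts file
# (default: /proc/mounts, which the kernel maintains).
is_mounted() {
    awk -v dev="$1" '$1 == dev { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}

# Example: run fsck only when the filesystem is not mounted.
#   is_mounted /dev/sdb1 || fsck /dev/sdb1
```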

On the Job

Don't let messages associated with the ext2 filesystem bother you; they'll appear even if your filesystems are formatted to ext3.

Remember the message in Figure 16-7? It includes an important clue. All you need to do to restore the original filesystem structure is to run the following command:

 # chroot /mnt/sysimage

When you use the rescue disc, your standard root directory (/) is actually mounted on the /mnt/sysimage directory. This command resets your standard root directory (/), so you don't have to go to the /mnt/sysimage subdirectory.

When you've made your changes, run the sync command to make sure any changes you've made are written to disk. Type the exit command, twice. Linux should automatically run the sync command again when you exit, making sure any changes are written to disk. Then it stops, allowing you to reboot or restart your computer.

On the Job

Normally it should not be necessary to run the sync command. However, running it does make sure that any pending data is actually written to your hard disks.

Read-Only linux rescue Environment

When you select the Read-Only option shown in Figure 16-6, you'll get the same basic prompt. There is little difference between regular and read-only rescue mode. The rescue system attempts to do everything that it would under regular mode, except all partitions associated with your standard system are mounted read-only. (Some of the rescue system filesystems are still mounted as read-write.)

This is appropriate if you have a large number of mounted filesystems; it can help you cull through what is and isn't working with less risk of overwriting key configuration files.

No Mount linux rescue Environment

When you select the Skip option shown in Figure 16-6, the rescue environment doesn't even search for a Linux installation. A minimal root image is loaded into a RAM disk created by the kernel, and you're taken to a root shell prompt (#), as shown:

 When finished, please exit from the shell and your system will reboot.
 sh-3.1#

At this point, you have access to a basic set of commands. You can mount filesystems, create directories, move files, and use editors such as vi. As nothing from your physical drives is mounted, you can apply the fdisk and fsck commands to various hard disks and partitions. A few other basic commands are also available.

The great difficulty in operating from the rescue environment is that you are working with a minimal version of the Linux operating system. Many of the commands you are accustomed to having at your disposal are not available at this level. If your root partition has not been completely destroyed, you may be able to mount this partition to your temporary root directory in memory and access commands from there.

But you may need a little help identifying the partitions on your system. As I'll show you shortly, the fdisk -l /dev/hda command lists the configured partitions on the first IDE hard drive. You can create a new directory such as /mnt/sysimage, mount a partition such as /dev/hda2 on that directory, and check the result with the following commands:

 # mkdir /mnt/sysimage
 # mount /dev/hda2 /mnt/sysimage
 # ls /mnt/sysimage

If you can verify that you've mounted the standard root directory (/) filesystem on the /mnt/sysimage directory, you can run the chroot /mnt/sysimage command. You can then have full access to the commands and configuration files available under that mounted partition.
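The verification step can be sketched as a quick test that runs before chroot. This is a rough heuristic rather than a definitive check: looks_like_root is a hypothetical helper name, and the two files it looks for are an assumption about what a usable root filesystem contains at minimum.

```shell
# Sketch only: looks_like_root is a hypothetical helper name.
# Succeeds if the directory holds an etc/fstab file and a bin/sh entry
# (a regular file or a symbolic link), which chroot needs to start a shell.
looks_like_root() {
    [ -f "$1/etc/fstab" ] && { [ -e "$1/bin/sh" ] || [ -L "$1/bin/sh" ]; }
}

# Example from the rescue shell:
#   mkdir /mnt/sysimage
#   mount /dev/hda2 /mnt/sysimage
#   looks_like_root /mnt/sysimage && chroot /mnt/sysimage
```

If the test fails, you have probably mounted a different partition, such as /boot or /home; unmount it and try the next candidate from the fdisk -l output.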

On the Job

If you mount partitions from your hard drive in rescue mode and then make changes to files on those partitions, remember to use the sync command. This writes your files to disk so the information isn't lost if you hit the power button on your computer. Alternatively, a umount command applied to any partition also writes data to disk.