Certification Objective 16.03: Required RHCE Troubleshooting Skills

At some point in your career as a Red Hat Enterprise Linux administrator, maybe even on the Red Hat exams, you're going to be faced with a system that will not boot. It will be up to you to determine the cause of the problem and implement a fix. Sometimes, the problem may be due to hardware failure: the system in question has a bad power supply or has experienced a hard disk crash.

Quite often, however, the failure of a system to boot can be traced back to the actions of a user: you, the system administrator! When you are editing certain system configuration files, typographical errors can render your system unbootable.

Any time you plan to make any substantial modifications to your system or change key configuration files, back them up first. Then, after making changes, you should actually reboot your system rather than assume that it will boot up the next time you need a reboot. It's much better to encounter problems while you can still remember exactly which changes you made. It is even better if you can go back to a working configuration file.
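Here is a minimal sketch of that habit. The following Bash function copies a file to a backup directory under a timestamped name before you edit it. The function name `backup_config` and the naming scheme are hypothetical choices for illustration; this is not a Red Hat tool.

```shell
#!/bin/bash
# backup_config FILE DIR: copy FILE into DIR under a timestamped name
# and print the path of the copy. Hypothetical helper for illustration.
backup_config() {
    local file=$1 dest_dir=$2 copy
    [ -f "$file" ] || { echo "no such file: $file" >&2; return 1; }
    mkdir -p "$dest_dir" || return 1
    copy="$dest_dir/$(basename "$file").$(date +%Y%m%d-%H%M%S)"
    # -p keeps the mode and timestamps, so the copy can be restored as-is
    cp -p "$file" "$copy" || return 1
    echo "$copy"
}
```

For example, `backup_config /boot/grub/grub.conf /root/saved-configs` (the directory is a hypothetical example) prints the name of the copy. If the edited file breaks the boot process, copy that backup over the original. In the rescue environment, the paths are under /mnt/sysimage.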

As described earlier in this chapter, the main tool is the linux rescue environment provided by the first installation CD. Per the Exam Prep guide, you also need to know how to diagnose and correct boot failures arising from boot loader, module, and filesystem errors. Those topics are covered in the next three sections. This section also summarizes some of the key tools for diagnosing and correcting problems with network services, as described in previous chapters, and reviews the key tools that allow you to add, remove, and resize logical volumes. Finally, to diagnose and correct networking services problems where SELinux contexts are interfering with proper operation, use the Setroubleshoot browser described in Chapter 15.

Troubleshooting the Boot Loader

The boot loader associated with Red Hat Enterprise Linux 5 is GRUB. For an extensive discussion, see Chapter 3. It can help you to know how to:

Associate the GRUB root directive with the partition that contains the /boot directory.
Boot into the desired, non-default runlevel.
Access the GRUB command line.
Test different GRUB commands.
Use command completion to find and use the exact names of your kernel and initial RAM disk.

While it isn't necessary to know all these skills, they can help you diagnose problems more quickly during your exam.

Exercise 16-6: Troubleshooting the Boot Loader

For this exercise, you'll need a partner. One of you creates a boot loader problem by following the steps below. The other looks away until the computer is rebooting, and then diagnoses and fixes the problem.

It's most helpful if you have a VMware snapshot of your RHEL system. Problems like those created in this exercise have caused administrators to mess up their systems in other ways. You'll also need the first RHEL installation CD.

Back up the configuration file associated with the boot loader, /boot/grub/grub.conf. Make sure to back up this file to a non-standard location, in case your partner also backs up any files before changing them.
Open the /boot/grub/grub.conf configuration file in a text editor. Focus on the kernel command line, which might look like one of the following:
```
 kernel /vmlinuz-2.6.18-8.el5 ro root=/dev/VolGroup00/LogVol00 
```
or
```
 kernel /vmlinuz-2.6.18-8.el5 ro root=LABEL=/ 
```
Introduce a typographical error in the root directive in the kernel command line.
Reboot the system, and let your partner sit back down at the computer. Tell him or her to address the error message shown. Give your partner the first RHEL 5 installation CD.
Make sure to tell your partner to back up any files that he or she might change to the appropriate home directory.
Whatever happens, restore the original /boot/grub/grub.conf configuration file when your partner is finished with this exercise. (Alternatively, you can restore the configuration from a VMware snapshot.)
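When you're on the solving side of this exercise, a quick check of the kernel lines narrows down the search. The following Bash sketch reads the kernel lines of a GRUB configuration file and reports whether each root= device can be found. The function name is hypothetical, and the sketch assumes the findfs command is available to resolve LABEL= entries. In the rescue environment, you might run it after the chroot /mnt/sysimage command described later in this section, so that /boot/grub/grub.conf refers to the installed system.

```shell
#!/bin/bash
# check_grub_root CONF: for each kernel line in a GRUB legacy configuration
# file, print the root= value and whether that device can be found.
# Hypothetical helper; LABEL= values are resolved with findfs.
check_grub_root() {
    local conf=$1 root dev status
    grep -E '^[[:space:]]*kernel[[:space:]]' "$conf" |
        grep -oE 'root=[^[:space:]]+' |
        while read -r root; do
            dev=${root#root=}          # e.g. /dev/VolGroup00/LogVol00 or LABEL=/
            case $dev in
                LABEL=*) findfs "$dev" >/dev/null 2>&1 ;;   # label -> device
                *)       [ -b "$dev" ] ;;                   # device file must exist
            esac
            status=$?
            if [ "$status" -eq 0 ]; then
                printf '%-8s%s\n' OK "$root"
            else
                printf '%-8s%s\n' MISSING "$root"
            fi
        done
}
```

A line such as `MISSING root=/dev/VolGruop00/LogVol00` points straight at the typo.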

Module Errors

Most kernels are compiled with loadable modules. Current Linux distributions, including RHEL, load the modules needed early in the boot process, such as disk controller drivers, from an initial RAM disk. The initial RAM disk is stored in an initrd-* file in the /boot directory. As you can see in the GRUB configuration file, it is normally specified on the last line of a GRUB configuration stanza. As described in Chapter 8, you can create a new initial RAM disk image file with the mkinitrd command. But errors are also possible, as you'll see in the following exercise.
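If a kernel's initial RAM disk is damaged or missing, one fix is to rebuild it with mkinitrd. The following Bash sketch doesn't run mkinitrd itself. Instead, it prints one mkinitrd command per kernel image found in a boot directory, so you can review the commands first. The function name is hypothetical. The kernel version comes from the vmlinuz-* filename rather than from uname -r, because in the rescue environment uname -r reports the running rescue kernel, which may not match the installed kernels.

```shell
#!/bin/bash
# initrd_commands [BOOTDIR]: for each vmlinuz-* image in BOOTDIR (default
# /boot), print the mkinitrd command that would rebuild its initial RAM disk.
# Hypothetical helper; it only prints commands.
initrd_commands() {
    local bootdir=${1:-/boot} kernel ver
    for kernel in "$bootdir"/vmlinuz-*; do
        [ -e "$kernel" ] || continue        # no match: the glob stays literal
        ver=${kernel##*/vmlinuz-}           # e.g. 2.6.18-8.el5
        # -f overwrites an existing image; the path assumes you run the
        # command inside the installed system (for example, after chroot)
        echo "mkinitrd -f /boot/initrd-$ver.img $ver"
    done
}
```

In the rescue environment, `initrd_commands /mnt/sysimage/boot` lists the commands. You could then run them after chroot /mnt/sysimage, so that mkinitrd writes to the installed system's /boot directory.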

Exercise 16-7: Troubleshooting Boot Loader Modules

Back up the configuration file associated with the boot loader, /boot/grub/grub.conf. Make sure to back up this file to a non-standard location, in case your partner also backs up any files before changing them.
Open the /boot/grub/grub.conf configuration file in a text editor. Focus on the initrd command line, which might look like the following:
```
 initrd /initrd-2.6.18-8.el5.img 
```
Misspell both instances of initrd in this line: the directive and the filename.
Reboot the system, and let your partner sit back down at the computer. Tell him or her to address the error message shown. Give your partner the first RHEL 5 installation CD.
Make sure to tell your partner to back up any files that he or she might change to the appropriate home directory.
Whatever happens, restore the original /boot/grub/grub.conf configuration file when your partner is finished with this exercise. (Alternatively, you can restore the configuration from a VMware snapshot.)
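The errors from this exercise can also be caught with a quick sanity check. The following Bash sketch verifies two things: every kernel and initrd path in a GRUB configuration file names an existing file, and every title stanza has an initrd line. The function name is hypothetical. The sketch assumes a separate /boot partition, so that a GRUB path such as /initrd-2.6.18-8.el5.img corresponds to a file in the /boot directory. If /boot isn't a separate partition, the GRUB paths start with /boot/, so pass the root directory instead.

```shell
#!/bin/bash
# check_grub_files CONF BOOTDIR: report kernel/initrd paths in CONF that don't
# exist under BOOTDIR, and title stanzas without an initrd line.
# Hypothetical helper; returns 1 if any problem is found.
check_grub_files() {
    local conf=$1 bootdir=$2
    awk -v boot="$bootdir" '
        $1 == "title" {
            # Finish the previous stanza before starting a new one
            if (title != "" && !has_initrd) { print "NO INITRD LINE: " title; bad = 1 }
            title = $0; has_initrd = 0
        }
        $1 == "kernel" || $1 == "initrd" {
            if ($1 == "initrd") has_initrd = 1
            # The second field is the path, relative to the GRUB root partition
            if (system("test -f \"" boot $2 "\"") != 0) { print "MISSING: " $2; bad = 1 }
        }
        END {
            if (title != "" && !has_initrd) { print "NO INITRD LINE: " title; bad = 1 }
            exit bad + 0
        }
    ' "$conf"
}
```

In the rescue environment, for example, `check_grub_files /mnt/sysimage/boot/grub/grub.conf /mnt/sysimage/boot` reports each problem it finds. A misspelled initrd directive shows up as a stanza with no initrd line.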

Filesystem Corruption and Checking

Although there are potentially many things that will prevent a system from booting, these problems can be generally categorized as either hardware problems or software and configuration problems. The most common hardware-related problem you will probably encounter is a bad hard drive; like all mechanical devices with moving parts, these have a finite lifetime and will eventually fail. Fortunately, the Red Hat exams do not require you to address hardware failures.

Software and configuration problems, however, can be a little more difficult. At first glance, they can look just like regular hardware problems.

In addition to knowing how to mount disk partitions, edit files, and manipulate files, you will need to know how to use several other commands to fix problems from rescue mode or single-user mode. The most useful of these are the df, fdisk, and fsck commands. To diagnose a problem, you need to know how these commands work, at least at a rudimentary level.

df

The Linux df command was covered briefly in Chapter 4. When you use df, you can find mounted directories, the capacity of each partition, and the percentage of each partition that's filled with files. The output shown back in Figure 16-8 lists sizes in kilobytes. There are a couple of simple variations; the following commands provide output in megabytes and inodes:

 # df -m
 # df -i
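df output is also easy to filter. The following Bash sketch reads the output of df -P, which prints one line per filesystem in a fixed column order. It lists the mount points at or above a usage threshold. The function name and the default threshold of 90 percent are arbitrary choices for illustration.

```shell
#!/bin/bash
# full_filesystems [LIMIT]: read `df -P` (or `df -Pi`) output on standard input
# and print the mount point and use percentage of every filesystem at or
# above LIMIT percent (default 90). Hypothetical helper.
full_filesystems() {
    local limit=${1:-90}
    # Skip the header; column 5 is the use percentage, column 6 the mount point
    awk -v limit="$limit" 'NR > 1 { use = $5; sub(/%/, "", use); if (use + 0 >= limit) print $6, $5 }'
}
```

For example, `df -P | full_filesystems 95` checks block usage, and `df -Pi | full_filesystems` checks inode usage. A filesystem that runs out of inodes can't hold new files, even when df -m still shows free space.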

fdisk and parted

The Linux fdisk and parted utilities were covered briefly in Chapter 4. When you use fdisk or parted, you can find the partitions you have available for mounting. For example, the fdisk -l /dev/hda (or parted /dev/hda print) command lists available partitions on the first IDE hard disk:

 # fdisk -l /dev/hda
 Disk /dev/hda: 15.0GB, 15020457984 bytes
 240 heads, 63 sectors/track, 1940 cylinders
 Units = cylinders of 15120 * 512 = 7741440 bytes
    Device Boot    Start       End     Blocks   Id  System
 /dev/hda1   *         1       949    7174408+   b  Win95 FAT32
 /dev/hda2           950       963     105840   83  Linux
 /dev/hda3           964      1871    6864480   83  Linux
 /dev/hda4          1872      1940     521640    f  Win95 Ext'd (LBA)
 /dev/hda5          1872      1940     521608+  82  Linux swap

Looking at the output from fdisk, it's easy to identify the partitions configured for Linux: /dev/hda2 and /dev/hda3 (type 83), plus the Linux swap partition, /dev/hda5 (type 82). Given the size of each partition, it is reasonable to conclude that /dev/hda2 is associated with /boot, and /dev/hda3 is associated with root (/). Here's a fairly complex output from parted:

 # parted /dev/sda print
 Model: ATA HDS728080PLA380 (scsi)
 Disk /dev/sda: 82.3GB
 Sector size (logical/physical): 512B/512B
 Partition Table: msdos
 Number  Start   End     Size    Type      File system  Flags
  1      32.3kB  197MB   197MB   primary   ext3         boot
  2      197MB   15.2GB  15.0GB  primary   ext3
  3      15.2GB  16.2GB  1003MB  primary   linux-swap
  4      16.2GB  82.3GB  66.1GB  extended
  5      16.2GB  16.3GB  107MB   logical   ext3
  6      16.3GB  26.8GB  10.5GB  logical   ext3
  7      26.8GB  41.8GB  15.0GB  logical   fat32        lba
  8      41.8GB  51.8GB  10.0GB  logical   ext3
  9      51.8GB  82.3GB  30.5GB  logical   ext3
 Information: Don't forget to update /etc/fstab, if necessary.

In this example, it's easy to identify the Linux swap partition. Since /boot partitions are small and normally configured toward the front of a drive (with a boot flag), it's reasonable to associate it with /dev/sda1.

For simple partitioning schemes, this is easy. It gets far more complicated when you have lots of partitions. That's why you should always have documentation that clearly identifies which filesystem is on each partition. Consider the following system:

 # fdisk -l /dev/hda
 Disk /dev/hda: 26.8 GB, 26843545600
 255 heads, 63 sectors/track, 3263 cylinders
 Units = cylinders of 16065 * 512 = 8225280 bytes
    Device Boot    Start       End     Blocks   Id  System
 /dev/hda1   *         1        13     104391   83  Linux
 /dev/hda2            14       268    2048287+   b  Win95 FAT32
 /dev/hda3           269       396    1028160   83  Linux
 /dev/hda4           397      3263   23029177+   f  Win95 Ext'd (LBA)
 /dev/hda5           397      1097    5630751   83  Linux
 /dev/hda6          1098      1734    5116671   83  Linux
 /dev/hda7          1735      1989    2048256   83  Linux
 /dev/hda8          1990      2244    2048256   83  Linux
 /dev/hda9          2245      2372    1028218+  83  Linux
 /dev/hda10         2373      2499    1020096   82  Linux swap
 /dev/hda11         2500      2626    1020096   83  Linux
 /dev/hda12         2627      2753    1020096   83  Linux
 /dev/hda13         2754      2880    1020096   83  Linux
 /dev/hda14         2881      3007    1020096   83  Linux
 /dev/hda15         3008      3134    1020096   83  Linux
 /dev/hda16         3135      3236    1020096   83  Linux

In this example, it's easy to identify the Linux swap partition. Since /boot partitions are small and normally configured toward the front of a drive, it's reasonable to associate it with /dev/hda1.

However, that is just a guess; some trial and error may be required. For example, after mounting /dev/hda1 on an empty directory, you would check that directory for the typical contents of /boot, such as vmlinuz-* kernel files and the grub/ subdirectory.
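That check can be scripted. The following Bash sketch tests whether a directory holds at least one vmlinuz-* kernel image and a grub/grub.conf file. That is what you'd expect at the top of a separate /boot filesystem on RHEL 5. The function name is hypothetical.

```shell
#!/bin/bash
# looks_like_boot DIR: succeed if DIR contains a vmlinuz-* kernel image and
# grub/grub.conf, as the top of a separate /boot filesystem normally does.
# Hypothetical helper for illustration.
looks_like_boot() {
    local dir=$1 kernel
    for kernel in "$dir"/vmlinuz-*; do
        # The glob stays literal when nothing matches, hence the -e test
        if [ -e "$kernel" ] && [ -f "$dir/grub/grub.conf" ]; then
            return 0
        fi
    done
    return 1
}

# Example (mounting read-only so nothing on the partition changes):
#   mkdir -p /mnt/test
#   mount -o ro /dev/hda1 /mnt/test
#   looks_like_boot /mnt/test && echo "/dev/hda1 looks like /boot"
#   umount /mnt/test
```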

e2label

Based on the previous output from fdisk -l, you could probably use a little help to identify the filesystems associated with the other partitions. The e2label command can help. When you set up a new filesystem, the associated partition is normally marked with a label. For example, the following command shows that /dev/hda5 is labeled /usr, which tells you that it normally holds the /usr filesystem.

 # e2label
 Usage: e2label device [newlabel]
 # e2label /dev/hda5
 /usr
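To see all the labels at once, you can loop over the partitions. The following Bash sketch runs e2label on each device you name. It notes the devices that don't have an ext2/ext3 label, such as swap, FAT, or extended partitions. The function name is hypothetical.

```shell
#!/bin/bash
# show_labels DEVICE...: print each device with its ext2/ext3 label.
# Hypothetical helper for illustration.
show_labels() {
    local dev label
    for dev in "$@"; do
        # e2label fails on swap, FAT, and extended partitions; say so and move on
        if label=$(e2label "$dev" 2>/dev/null); then
            echo "$dev ${label:-(empty label)}"
        else
            echo "$dev (no ext2/ext3 label)"
        fi
    done
}
```

For example, `show_labels /dev/hda[0-9]*` lists every partition on the first IDE disk, which makes a good start on the partition documentation recommended earlier.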

dumpe2fs

You can get a lot more information on each partition with the dumpe2fs command, as shown in Figure 16-9.

Figure 16-9: The dumpe2fs command provides a lot of information.

The dumpe2fs command not only does the job of e2label but also tells you about the format, whether it has a journal, and the block size. Proceed further through the output, and you'll find the locations for backup superblocks, which can help you use the fsck or e2fsck command to select the appropriate superblock for your Linux partition.
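Because the dumpe2fs output is long, it helps to filter out just the backup superblock locations. The following Bash sketch reads dumpe2fs output on standard input and prints one block number per line. The function name is hypothetical.

```shell
#!/bin/bash
# backup_superblocks: read dumpe2fs output on standard input and print the
# block number of each backup superblock. Hypothetical helper.
backup_superblocks() {
    # dumpe2fs prints group lines such as:
    #   Backup superblock at 8193, Group descriptors at 8194-8194
    grep -o 'Backup superblock at [0-9]*' | awk '{ print $4 }'
}
```

For example, `dumpe2fs /dev/hda5 2>/dev/null | backup_superblocks` lists the candidates. The redirect hides the version banner that dumpe2fs writes to standard error.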

On the Job

fsck is a "front end" for e2fsck, which is used to check partitions formatted to the ext2 and ext3 filesystems.

Filesystem Check-fsck

You should also know how to use the fsck command. This command is a front end for most of the filesystem formats available in Linux, such as ext2, ext3, reiserfs, vfat, and more. This command is used to check the filesystem on a partition for consistency. In order to use the fsck command effectively, you need to understand something about how filesystems are laid out on disk partitions.

When you format a disk partition under Linux using the mkfs command, it sets aside a certain portion of the disk to use for storing inodes, which are data structures that contain the actual disk block addresses that point to file data on a disk. The mkfs command also stores information about the size of the filesystem, the filesystem label, and the number of inodes in a special location at the start of the partition called the superblock. If the superblock is corrupted or destroyed, the remaining information on the disk is unreadable. Because the superblock is so vital to the integrity of the data on a partition, the mkfs command makes duplicate copies of the superblock at fixed intervals on the partition, which you can find with the dumpe2fs command described earlier.

The fsck command checks for and corrects problems with filesystem consistency by looking for things such as disk blocks that are marked as free but are actually in use (and vice versa), inodes that don't have a corresponding directory entry, inodes with incorrect link counts, and a number of other problems. The fsck command will also fix a corrupted superblock. If fsck fails due to a corrupt superblock, you can use the fsck command with the -b option to specify an alternative superblock. For example, the following command performs a consistency check on the filesystem on disk partition /dev/hda5, using the superblock located at disk block 8193:

 # fsck -b 8193 /dev/hda5
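Which backup to try depends on the block size. With 1KB blocks, the first backup superblock is at block 8193, as in this example. With 4KB blocks, it's at block 32768. The following Bash sketch calls e2fsck, the checker that fsck runs for ext2 and ext3, with each candidate superblock in turn until one run succeeds. The function name is hypothetical. Run it only on an unmounted filesystem.

```shell
#!/bin/bash
# try_backup_superblocks DEVICE: read candidate superblock numbers, one per
# line, on standard input and run e2fsck with each until one succeeds.
# Hypothetical helper; DEVICE must not be mounted.
try_backup_superblocks() {
    local dev=$1 sb rc
    while read -r sb; do
        [ -n "$sb" ] || continue
        echo "Trying backup superblock $sb on $dev"
        # -y answers yes to repair prompts; </dev/null keeps e2fsck
        # from consuming the list of superblocks on standard input
        e2fsck -y -b "$sb" "$dev" </dev/null
        rc=$?
        # Exit codes 0-3 mean no errors remain uncorrected
        # (1 = errors corrected, 2 = corrected but a reboot is needed)
        if [ "$rc" -lt 4 ]; then
            return 0
        fi
    done
    return 1
}
```

If the primary superblock is too damaged for dumpe2fs to read, you can try the two common locations directly: `printf '%s\n' 8193 32768 | try_backup_superblocks /dev/hda5`.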

Exam Watch

Get to know the key commands and the associated options for checking disks and partitions: fdisk, e2label, dumpe2fs, and fsck. Practice using these commands to check your partitions, but only on a test computer! (Some of these commands can destroy data.)

File Corruption

You may find corruption in some key files or commands such as bash or init. If you do, one option is to reload the files from the original RPMs. For example, if the init command were corrupted or erased, you could reload it from the SysVinit RPM.

When you boot your system into the linux rescue environment, you'll still need to connect to the network source associated with the installation RPMs. So you should set up networking as described earlier when you boot into the linux rescue environment. Assume the /bin/bash command is corrupt or missing and has to be replaced. With that in mind, take the following steps:

Run the df command. You should see how the linux rescue environment mounted your partitions.
If possible, mount the network source. If it's an NFS share in the /inst directory on the server.example.com system, you can mount it on the /mnt/source directory (create /mnt/source if required) with the following command:
```
 # mount -t nfs server.example.com:/inst /mnt/source 
```
Copy the bash RPM from the /mnt/source directory to the current directory. It isn't necessary, but it can help if your network connection drops. Use the following command:
```
 # cp /mnt/source/Server/bash-*.rpm . 
```
Install the bash RPM, forcing installation over current files:
```
 # rpm -Uvh --force --root=/mnt/sysimage bash-*.rpm 
```
Run the following chroot command to move into the standard directory tree:
```
 # chroot /mnt/sysimage 
```
Check the status of the bash command.
```
 # rpm -Vf /bin/bash 
```

If you see no output, you'll know that there is no longer a problem with the bash command. (You can also use this command at the start of the process to see if there is a problem.) You should now be able to run the exit command twice to reboot your computer: once to leave the chroot environment, and once to leave the rescue shell. At least this problem should then be solved.
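If you need to repeat this procedure for several damaged files, you can wrap the RPM steps in a short Bash sketch. The function and variable names here are hypothetical. The sketch assumes the installed system is mounted on /mnt/sysimage and the installation tree on /mnt/source, as above, and that you already know the package name. Set DRY_RUN=1 to print the commands without running them.

```shell
#!/bin/bash
# reinstall_pkg PKG FILE: force-reinstall PKG into the installed system from
# the installation tree, then verify FILE. Hypothetical helper.
SYSIMAGE=/mnt/sysimage      # where linux rescue mounts the installed system
SOURCE=/mnt/source          # where you mounted the installation tree

# run: print a command, then execute it unless DRY_RUN is set
run() {
    echo "+ $*"
    [ -n "$DRY_RUN" ] || "$@"
}

reinstall_pkg() {
    local pkg=$1 file=$2
    # "$pkg"-[0-9]* avoids matching other packages whose names merely
    # start with the same prefix
    run rpm -Uvh --force --root="$SYSIMAGE" "$SOURCE"/Server/"$pkg"-[0-9]*.rpm || return 1
    # rpm -V prints nothing when every file matches the package
    run chroot "$SYSIMAGE" rpm -Vf "$file"
}
```

For example, `DRY_RUN=1 reinstall_pkg bash /bin/bash` shows the two commands; run it again without DRY_RUN=1 to execute them. Because chroot runs rpm directly, the verification step doesn't depend on a working /bin/bash.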

Network Service Issues

The Red Hat Exam Prep guide says that you may have to diagnose and correct problems with network services during the Troubleshooting and System Maintenance portion of the RHCE exam. These are the same network services that you may need to configure during the Installation and Configuration portion of the same exam. Example scenarios and solutions are shown as follows. Needless to say, the solutions are far from complete: for example, firewalls and network configuration issues can prevent communication to any network service.

SCENARIO & SOLUTION
A connection to a local Web server results in a 404 file not found error.	Make sure there is an appropriate index.html in the DocumentRoot directory, and that the SELinux contexts for index.html match what you see from an ls -Z /var/www/html command.
You're unable to connect to your Samba home directory on a remote system.	Check the [homes] share in /etc/samba/smb.conf. Use testparm to check that configuration file for errors.
You have trouble connecting to a remote NFS share.	Make sure the directory is shared properly in /etc/exports, that it has been exported and is visible from a showmount -e hostname command, and that appropriate services are running.
You're unable to log into the remote FTP server anonymously.	Make sure anonymous access is enabled in /etc/vsftpd/vsftpd.conf.
You can't connect through the Web proxy server.	Make sure appropriate access is enabled to the right network in /etc/squid/squid.conf.
You're unable to send outgoing e-mail through a local network server.	Make sure one outgoing mail server (sendmail or Postfix) is active and properly configured.
You're unable to receive secure POP3 e-mail through a local network server.	Make sure the Dovecot service is properly configured in /etc/dovecot.conf. If you use an alternative service, edit its configuration file instead.
You want to limit SSH access to specific users.	Use the AllowUsers directive in /etc/ssh/sshd_config.

Add, Remove, and Resize Logical Volumes

You may need to add, remove, and/or resize a logical volume (LV) during the Troubleshooting and System Maintenance portion of the RHCE exam. As you'll recall from Chapter 8, LVs involve commands that manage physical volumes (PVs), LVs, and volume groups (VGs). You can manage this process from the command line or the LVM Configuration Tool. For detailed information on each process, see Chapter 8. In this section, we'll review the basics. (If you're not already familiar with logical volumes, reread Chapter 8. Otherwise, this section will probably confuse you to no end.)

First, to check the current status of your LVs, PVs, and VGs, the lvdisplay, pvdisplay, and vgdisplay commands can help. Use them together with the mount command and the contents of /etc/fstab to confirm your configuration and the currently mounted LVs.

Adding a Logical Volume

If you need to add an LV and you don't have sufficient free space in your PVs/VGs, you'll need to create a new partition. Set it to the Linux LVM partition type in a tool such as fdisk (type 8e) or parted (the lvm flag). This section examines the process of creating a new LV from scratch. If you have room in existing PVs or VGs, you can skip ahead and add an LV more quickly.

To create a new PV from a properly configured partition, you can use the pvcreate command. For example, if you've just created /dev/hda9 and /dev/hda10 for this purpose, use the following command:

 # pvcreate /dev/hda9 /dev/hda10

You can reverse the process with the pvremove command; in this case:

 # pvremove /dev/hda9 /dev/hda10

If you've already assigned space from the noted partitions to an LV, the pvremove command will give you an error message.

To create a new VG from one or more unused PVs, use the vgcreate command. For example, if you wanted to create a VG named myvg from the aforementioned PVs, use the following command:

 # vgcreate myvg /dev/hda9 /dev/hda10

If this command works, you'll find a VG device in /dev/myvg.

To create a new LV from an available VG, the lvcreate command can help. For example, if you wanted to create a new LV of 1000MB from the VG named myvg, run the following command:

 # lvcreate -L 1000 myvg

You'll see the name of the new LV in the output; normally it'll be something like lvol0. In this case, the LV device is /dev/myvg/lvol0.
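Taken together, the steps above can be sketched as one Bash function. The function name, the -n option that names the LV (instead of accepting the default lvol0), and the final mkfs step are assumptions for illustration. The mkfs step formats the new LV as ext3 so it can be mounted. The partitions must already be set to the Linux LVM type. Set DRY_RUN=1 to print the commands without running them.

```shell
#!/bin/bash
# create_lv VG LV SIZE_MB PARTITION...: build a new LV from scratch on unused
# Linux LVM partitions. Hypothetical helper for illustration.

# run: print a command, then execute it unless DRY_RUN is set
run() {
    echo "+ $*"
    [ -n "$DRY_RUN" ] || "$@"
}

create_lv() {
    local vg=$1 lv=$2 size_mb=$3
    shift 3                          # the remaining arguments are partitions
    run pvcreate "$@" || return 1
    run vgcreate "$vg" "$@" || return 1
    # -L without a unit is taken as megabytes; -n names the LV
    run lvcreate -L "$size_mb" -n "$lv" "$vg" || return 1
    # Format the new LV so that it can be mounted
    run mkfs -t ext3 "/dev/$vg/$lv"
}
```

For example, `DRY_RUN=1 create_lv myvg mylv 1000 /dev/hda9 /dev/hda10` prints the pvcreate, vgcreate, lvcreate, and mkfs commands for the example in this section.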

Expanding a Logical Volume

Expanding the space available from an LV is a two-step process. First, you'll extend the LV in available free space, perhaps from a PV that you've just created. Then you'll expand the filesystem to fill the LV with the resize2fs command.

To expand an existing LV, use the lvextend command. You'll need free space in the LV's VG, for example on a PV partition such as /dev/sda2. If that PV isn't already part of the VG, add it with the vgextend command first. For example, if you're increasing the size of the aforementioned LV by 500MB, use the following command:

 # lvextend -L +500 /dev/myvg/lvol0 /dev/sda2

You can then resize ext2 or ext3 formatted filesystems with the resize2fs command. For the noted device, the command would be

 # resize2fs /dev/myvg/lvol0
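The following Bash sketch combines those steps. It can optionally add a new partition to the LV's VG first, with pvcreate and vgextend. The function name is hypothetical. The sketch assumes the LV path has the /dev/VG/LV form used above, and that any new partition is unused and set to the Linux LVM type. Set DRY_RUN=1 to print the commands without running them.

```shell
#!/bin/bash
# extend_lv LVDEV ADD_MB [NEW_PARTITION]: grow an LV and its ext2/ext3
# filesystem. Hypothetical helper for illustration.

# run: print a command, then execute it unless DRY_RUN is set
run() {
    echo "+ $*"
    [ -n "$DRY_RUN" ] || "$@"
}

extend_lv() {
    local lvdev=$1 add_mb=$2 newpv=$3 vg
    if [ -n "$newpv" ]; then
        vg=$(basename "$(dirname "$lvdev")")     # /dev/myvg/lvol0 -> myvg
        # Turn the new partition into a PV and add it to the LV's volume group
        run pvcreate "$newpv" || return 1
        run vgextend "$vg" "$newpv" || return 1
    fi
    run lvextend -L "+$add_mb" "$lvdev" || return 1
    # Without a size argument, resize2fs grows the filesystem to fill the LV
    run resize2fs "$lvdev"
}
```

For example, `DRY_RUN=1 extend_lv /dev/myvg/lvol0 500 /dev/sda2` shows all four commands. Leave off the last argument when the VG already has enough free space.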

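If you're decreasing the size of an LV, the process is more complex and riskier. The resize2fs command can also reduce the size of an ext2 or ext3 filesystem, but only while that filesystem is unmounted. You'll need to unmount the LV and check it with e2fsck -f, since resize2fs normally insists on a freshly checked filesystem. Then apply the resize2fs command to the LV device file, with the desired smaller size at the end of the command. Only then should you reduce the LV itself with the lvreduce command. Reducing the LV before the filesystem destroys the data at the end of the filesystem.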
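Because the order matters so much, here is the complete shrinking sequence as a Bash sketch. It includes the e2fsck -f check that resize2fs expects first, and the lvreduce step that shrinks the LV only after the filesystem has been shrunk. The function name and the example mount point are hypothetical. Back up the data first, since a mistake here can destroy the filesystem. Set DRY_RUN=1 to print the commands without running them.

```shell
#!/bin/bash
# shrink_lv LVDEV MOUNTPOINT NEW_MB: shrink an ext2/ext3 filesystem and its LV
# to NEW_MB megabytes. Hypothetical helper for illustration.

# run: print a command, then execute it unless DRY_RUN is set
run() {
    echo "+ $*"
    [ -n "$DRY_RUN" ] || "$@"
}

shrink_lv() {
    local lvdev=$1 mnt=$2 new_mb=$3
    run umount "$mnt" || return 1
    # Stop if e2fsck reports anything, so you can look before resizing
    run e2fsck -f "$lvdev" || return 1
    # Shrink the filesystem first...
    run resize2fs "$lvdev" "${new_mb}M" || return 1
    # ...and only then the LV; lvreduce asks for confirmation
    run lvreduce -L "$new_mb" "$lvdev" || return 1
    run mount "$lvdev" "$mnt"
}
```

For example, `DRY_RUN=1 shrink_lv /dev/myvg/lvol0 /data 700` lists the five commands in the order in which they must run.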

Removing a Logical Volume

Removing an LV is easy. As long as the data in the LV has been saved and the LV is unmounted, the lvremove command works well. For example, if you've created a second LV in the same VG, you might use the following command:

 # lvremove /dev/myvg/lvol1
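A slightly safer removal unmounts the LV first. Afterward, it warns about any /etc/fstab entry that still refers to the LV, since a stale entry can interrupt the next boot. The function name is hypothetical. The fstab check matches only entries that use the same /dev/VG/LV path, not LABEL= or /dev/mapper names. Set DRY_RUN=1 to print the unmount and removal commands without running them.

```shell
#!/bin/bash
# remove_lv LVDEV MOUNTPOINT [FSTAB]: unmount and remove an LV, then report
# leftover fstab entries for it. Hypothetical helper for illustration.

# run: print a command, then execute it unless DRY_RUN is set
run() {
    echo "+ $*"
    [ -n "$DRY_RUN" ] || "$@"
}

remove_lv() {
    local lvdev=$1 mnt=$2 fstab=${3:-/etc/fstab}
    run umount "$mnt" || return 1
    # lvremove asks for confirmation before removing an active LV
    run lvremove "$lvdev" || return 1
    # Point out entries that still name the removed LV in the first field
    awk -v dev="$lvdev" '$1 == dev { print "still in " FILENAME ", line " FNR ": " $0 }' "$fstab"
}
```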

Diagnosing SELinux-related Network Service Issues

The final Troubleshooting and System Maintenance item in the RHCE part of the Exam Prep guide is the ability to diagnose and correct networking services problems where SELinux contexts are interfering with proper operation.

In most cases, this is simpler than it looks. SELinux denials are logged with an avc label. They go to /var/log/audit/audit.log when the audit daemon is running, and to /var/log/messages otherwise. But even better, the Setroubleshoot browser can identify SELinux issues, describe causes, and even suggest solutions. Watch it for suggested commands such as chcon to change SELinux contexts and setsebool to set SELinux booleans. All you need to do is open the browser in a GUI with the sealert -b command, and browse the most recent errors. For more information, see Chapter 15.
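If the GUI isn't available, you can look for the raw AVC denials yourself. The following Bash sketch prints the most recent lines that record an SELinux denial in a given log file, which is /var/log/audit/audit.log by default. The function name and the default count of ten are hypothetical choices. The sealert -a command can also analyze such a log file from the command line.

```shell
#!/bin/bash
# recent_avc [LOG] [COUNT]: print the last COUNT SELinux AVC denials in LOG.
# Hypothetical helper for illustration.
recent_avc() {
    local log=${1:-/var/log/audit/audit.log} count=${2:-10}
    # Denial records contain "avc:" followed later by "denied"
    grep 'avc:.*denied' "$log" | tail -n "$count"
}
```

Once you've identified the denial, typical fixes include restoring default contexts with restorecon -R on the affected directory, changing a context with chcon, or turning on a boolean persistently with setsebool -P.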