Section 32.4. Objective 4: General Troubleshooting

Troubleshooting is one of the most difficult but satisfying administration tasks. Few things feel better for a professional administrator than being presented with a problem and finding a solution for it. Successful troubleshooting requires an in-depth knowledge of the system being examined and a few rules of conduct. In this section we'll go over a few tips on how to troubleshoot and suggest places for you to look for information.

The first rule of troubleshooting is: don't jump to conclusions! Just because someone says "foo stopped working," don't start adjusting the parameters for whatever "foo" is without gathering more information. Initial problem descriptions (especially from nontechnical users) are notoriously misleading.

This leads to rule number two: get a complete and accurate description of the problem. Foo may very well have stopped working, but that could be a side effect of bar being misconfigured.

Rule number three: reproduce the problem. It's very difficult to fix a problem you can't see. Intermittent problems are the hardest to reproduce, but luckily most problems aren't intermittent, and most of those that are show a pattern over time.

So, you've followed the Three Rules of Troubleshooting. Where do you look to gather more information?

If you suspect hardware problems, dmesg and its associated log file /var/log/dmesg are a good place to start. dmesg shows you the kernel ring buffer, the buffer that the kernel writes messages to. (It's called a "ring" buffer because old messages disappear over time to make room for new ones.) We'll use these messages to troubleshoot a specific problem in the following example.

A partial listing of dmesg on my machine looks like this:

 Linux version 2.6.9-1.667 (bhcompile@tweety.build.redhat.com) (gcc version 3.4.2 20041017 (Red Hat 3.4.2-6.fc3)) #1 Tue Nov 2 14:41:25 EST 2004
 BIOS-provided physical RAM map:
  BIOS-e820: 0000000000000000 - 00000000000a0000 (usable)
  BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved)
  BIOS-e820: 0000000000100000 - 000000005fe88c00 (usable)
  BIOS-e820: 000000005fe88c00 - 000000005fe8ac00 (ACPI NVS)
  BIOS-e820: 000000005fe8ac00 - 000000005fe8cc00 (ACPI data)
  BIOS-e820: 000000005fe8cc00 - 0000000060000000 (reserved)
  BIOS-e820: 00000000e0000000 - 00000000f0000000 (reserved)
  BIOS-e820: 00000000fec00000 - 00000000fed00400 (reserved)
  BIOS-e820: 00000000fed20000 - 00000000feda0000 (reserved)
  BIOS-e820: 00000000fee00000 - 00000000fef00000 (reserved)
  BIOS-e820: 00000000ffb00000 - 0000000100000000 (reserved)
 0MB HIGHMEM available.
 1534MB LOWMEM available.
 zapping low mappings.
 On node 0 totalpages: 39284

Not much there that's useful to an everyday system administrator. But if we keep scrolling down, we'll see:

 Probing IDE interface ide0...
 hda: GCR-8483B, ATAPI CD/DVD-ROM drive
 hdb: PHILIPS DVD+/-RW DVD8631, ATAPI CD/DVD-ROM drive
 Using cfq io scheduler
 ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
 Probing IDE interface ide1...
 ide1: Wait for ready failed before probe !
 Probing IDE interface ide2...

Hmmm, I might have problems with my IDE bus. Looking even further . . .

 hda: command error: status=0x51 { DriveReady SeekComplete Error }
 hda: command error: error=0x54
 ide: failed opcode was 100
 end_request: I/O error, dev hda, sector 1309952
 hda: command error: status=0x51 { DriveReady SeekComplete Error }
 hda: command error: error=0x54
 ide: failed opcode was 100

I seem to have problems with my /dev/hda device. Maybe that explains why I haven't been able to read CDs in my CD-ROM drive!

If too many kernel messages are sent, the kernel ring buffer will overflow and you won't see the initial boot messages. In that case, take a look at /var/log/dmesg, which is overwritten at each boot.
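When the listing is long, it can help to filter it rather than scroll. The following is only a rough sketch: the function name kernel_errors is made up for this example, and the search pattern is simply a guess at words that tend to appear in trouble reports, so expect both false hits and misses.

```shell
# Filter kernel messages for lines that look like errors or failures.
# It reads standard input, so it works on live dmesg output as well as
# on the saved boot log. (kernel_errors is a hypothetical helper name.)
kernel_errors() {
    grep -iE 'error|fail'
}

# Typical use on a real system:
#   dmesg | kernel_errors
#   kernel_errors < /var/log/dmesg
```

Run against the messages shown earlier, this would keep the "command error" and "Wait for ready failed" lines and drop the routine probe messages.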

If you attach a new USB device to your system and you're not sure which device it was assigned, dmesg shows you the device:

 # dmesg | tail
 scsi4 : SCSI emulation for USB Mass Storage devices
   Vendor: SanDisk   Model: Cruzer Mini       Rev: 0.1
   Type:   Direct-Access                      ANSI SCSI revision: 02
 SCSI device sdd: 501759 512-byte hdwr sectors (257 MB)
 sdd: Write Protect is off
 sdd: Mode Sense: 03 00 00 00
 sdd: assuming drive cache: write through
  sdd: sdd1
 Attached scsi removable disk sdd at scsi4, channel 0, id 0, lun 0
 USB Mass Storage device found at 5

The USB stick I just plugged into my computer is assigned to /dev/sdd and has one partition.
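If you do this often, the device name can be pulled out of the messages automatically. This sketch assumes the kernel prints an "Attached scsi removable disk" line worded exactly like the one above; other kernel versions word the message differently, so treat the pattern as an assumption to adjust. The function name attached_disk is made up for the example.

```shell
# Print /dev/NAME for every "Attached scsi removable disk NAME at ..."
# line found on standard input. (attached_disk is a hypothetical helper;
# the message wording is taken from the sample output above.)
attached_disk() {
    sed -n 's|.*Attached scsi removable disk \([a-z]*\) at .*|/dev/\1|p'
}

# Typical use on a real system:
#   dmesg | tail | attached_disk
```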

Other hardware information can be found in the /proc directory. Unlike other directories, /proc does not hold files stored on disk; it is a virtual filesystem that serves as an interface to the running kernel. Most of what is in /proc is read-only, but some entries can be written to, which lets you change some parameters of the running kernel.
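As a small illustration of reading and writing kernel parameters, here is a sketch built around /proc/sys/net/ipv4/ip_forward, one of the writable entries. The helper names show_param and set_param and the PROC_ROOT variable are inventions for this example, not standard tools; PROC_ROOT exists only so the functions can be tried against a scratch directory instead of the live kernel.

```shell
# PROC_ROOT is a hypothetical override for experimenting safely;
# when it is unset, the helpers use the real /proc.

# Print the current value of an entry under /proc/sys,
# for example: show_param net/ipv4/ip_forward
show_param() {
    cat "${PROC_ROOT:-/proc}/sys/$1"
}

# Write a new value to a writable entry (needs root on a real system),
# for example: set_param net/ipv4/ip_forward 1
set_param() {
    echo "$2" > "${PROC_ROOT:-/proc}/sys/$1"
}
```

Changes made this way affect only the running kernel and are lost at the next boot.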

If you don't want to go searching through /proc for hardware information, the lsdev command gathers information from /proc/dma, /proc/interrupts, and /proc/ioports and presents the data in a combined format. Similarly, lspci lists the PCI buses on your system and the devices connected to them.

If the hardware seems correct (or you can't find anything wrong with the hardware), the issue may be related to the kernel modules. You can list the currently loaded modules with lsmod, insert a new module with insmod, or remove a running module with rmmod. If somebody says a feature is not working, and you know that feature requires support in the kernel, one fruitful approach is to find out which modules that feature depends on and to run lsmod to see whether those modules have been loaded. (Of course, the same feature could be compiled into the kernel, in which case lsmod would not show the modules even though the functionality is there.)

Here is a partial listing of the output of lsmod:

 Module                  Size  Used by
 vfat                   17217  0
 fat                    55005  1 vfat
 parport_pc             30981  1
 lp                     16713  0
 parport                38793  2 parport_pc,lp

There is some danger in using the insmod and rmmod commands. Many modules rely on other modules, so inserting a module whose dependencies aren't loaded, or removing a module that others still depend on, can cause instabilities or even outright crashes. In the previous sample output, we see that the fat module (a popular module supporting Windows filesystems) is used by the vfat module and that the parport module is used by two other modules: parport_pc and lp. If we were to remove the fat module without first removing the vfat module, unpredictable things would most likely happen. Depending on the module you insert or remove, the result may be nothing more serious than losing your audio capabilities, but in other cases it could crash the system.
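Before removing a module by hand, it is worth checking the "Used by" column for it. A minimal sketch, assuming the three-column lsmod layout shown above (the function name module_users is made up for this example):

```shell
# Print the "Used by" list for one module, given lsmod output on
# standard input. The header line is skipped. An empty result means
# no other module is listed as using it.
# (module_users is a hypothetical helper name.)
module_users() {
    awk -v mod="$1" 'NR > 1 && $1 == mod { print $4 }'
}

# Typical use on a real system:
#   lsmod | module_users fat
```

With the sample listing above, asking about fat reports vfat, and asking about parport reports parport_pc,lp.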

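This is why the modprobe command was written; modprobe is designed to stand in for lsmod, insmod, and rmmod and to act more intelligently. The following table shows the equivalent modprobe commands. The equivalents are approximate: modprobe -l lists the modules available to be loaded rather than those currently loaded, and modprobe -a loads every module named on the command line.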

Table 32-1. Equivalent modprobe commands
Command	Is equivalent to
lsmod	modprobe -l
insmod	modprobe -a
rmmod	modprobe -r

As an example of how modprobe is more intelligent than the three *mod commands, rmmod acts only on the single module it is given, whereas modprobe works on the whole stack of related modules. In the example previously mentioned, the command modprobe -r vfat would have removed the vfat module and then the fat module it depends on (provided nothing else was using fat), reducing the possibility of crashes (assuming that removing the fat module is a good thing in the first place!). Conversely, the command modprobe -a vfat inserts the fat module first, then inserts the vfat module.

The modprobe command relies on an up-to-date modules.dep file located in the modules directory. modules.dep is updated by the depmod command. The modules directory can be found at /lib/modules/`uname -r`.
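To see what modprobe would pull in for a given module, you can look the module up in modules.dep yourself. The sketch below assumes each line of the file has the form "path/to/module.ko: dependency paths", possibly with a compression suffix after .ko; the exact paths vary between distributions and kernel versions. The function name module_deps is made up for this example.

```shell
# Print the dependency list recorded in modules.dep for one module.
# Usage: module_deps NAME [DEPFILE]
# DEPFILE defaults to the modules.dep of the running kernel.
# (module_deps is a hypothetical helper; the file layout is an assumption.)
module_deps() {
    depfile="${2:-/lib/modules/$(uname -r)/modules.dep}"
    # Match the line whose left-hand side ends in /NAME.ko, optionally
    # followed by a compression suffix, and print what follows the colon.
    awk -F': *' -v mod="$1" '
        $1 ~ ("(^|/)" mod "\\.ko(\\.[a-z0-9]+)?$") { print $2 }
    ' "$depfile"
}

# Typical use on a real system:
#   module_deps vfat
```

An empty result means the module is listed with no dependencies, or is not listed at all.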

If a problem lies with system applications such as the web server or email server, the first place to look is in the /var/log directory. Most applications that use the syslog logging facility send their logging information to /var/log unless configured differently in /etc/syslog.conf. A good idea when you start troubleshooting is to look at /var/log/messages. From there, the software in question may have specific log files (such as /var/log/maillog for Postfix) or a subdirectory (such as /var/log/httpd for Apache) to search for further log information.
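A quick way to start is to pull the most recent lines that mention the service out of the log. This is only a sketch: the function name recent_mentions is made up, the default of 20 lines is arbitrary, and reading /var/log/messages usually requires root.

```shell
# Show the last few lines of a log file that mention a given word.
# Usage: recent_mentions WORD [LOGFILE] [COUNT]
# LOGFILE defaults to /var/log/messages and COUNT to 20.
# (recent_mentions is a hypothetical helper name.)
recent_mentions() {
    grep -i -- "$1" "${2:-/var/log/messages}" | tail -n "${3:-20}"
}

# Typical use on a real system:
#   recent_mentions postfix
#   recent_mentions postfix /var/log/maillog 50
```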

Sometimes executing the program itself is necessary to debug the problem. This is often the case when the program segfaults or crashes with no error messages or log information. The programs strace and ltrace execute a given command and display the system calls or library calls, respectively, that the command makes. These traces show us precisely what the command is doing and where it fails. While there are many useful options for strace and ltrace, two of the most useful are -f to follow forked children and -o trace.out to write the voluminous output to the file trace.out.
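Because the trace file is so large, a good first pass is to look only at the calls that failed. In strace output a failed system call is shown returning -1 followed by an error name such as ENOENT. The following sketch relies on that convention; the function name failed_calls and the command name some_command are placeholders for this example.

```shell
# Capturing a trace on a real system (some_command is a placeholder):
#   strace -f -o trace.out some_command

# List the system calls in a saved trace that returned an error.
# Usage: failed_calls [TRACEFILE]   (defaults to trace.out)
# (failed_calls is a hypothetical helper name.)
failed_calls() {
    grep -E '= -1 E[A-Z]+' "${1:-trace.out}"
}
```

Not every failed call is a problem, since programs routinely probe for files that may not exist, so read the failures nearest the end of the trace first.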
