4.4 Troubleshooting: Handling Crashes and Boot Failures | Essential System Administration, Third Edition

Even the best-maintained systems crash from time to time. A crash occurs when the system suddenly stops functioning. The extent of system failure can vary quite a bit, from a failure affecting every subsystem to one limited to a particular device or to the kernel itself. System hang-ups are a related phenomenon in which the system stops responding to input from any user or device or stops producing output, but the operating system nominally remains loaded. Such a system also may be described as frozen.

There are many causes of system crashes and hang-ups. These are among the most common:

Hardware failures: failing disk controllers, CPU boards, memory boards, power supplies, disk head crashes, and so on.
Unrecoverable hardware errors, such as double-bit memory errors. These sorts of problems may indicate hardware that is about to fail, but they also just happen from time to time.
Power failures or surges due to internal power supply problems, external power outages, electrical storms, and other causes.
Other environmental problems: roof leaks, air conditioning failure, etc.
I/O problems involving a fatal error condition rather than a device malfunction.
Software problems, ranging from fatal kernel errors caused by operating system bugs to (much less frequently) problems caused by users or third-party programs.
Resource overcommitment (for example, running out of swap space). These situations can interact with bugs in the operating system to cause a crash or hang-up.

Some of these causes are easier to identify than others. Rebooting the system may seem like the most pressing concern when the system crashes, but it's just as important to gather the available information about why the system crashed while the data is still accessible.

Sometimes it's obvious why the system crashed, as when the power goes out. If the cause isn't immediately clear, the first source of information is any messages appearing on the system console. They are usually still visible if you check immediately, even if the system is set to reboot automatically. After they are no longer on the screen, you may still be able to find them by checking the system error log file, usually stored in /var/log/messages (see Chapter 3 for more details), as well as any additional, vendor-supplied error facilities.
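As a rough illustration of searching the error log after the console messages have scrolled away, the sketch below pulls the most recent suspicious lines from a log file. The function name crash_clues, the keyword list, and the default line count are assumptions made for this example; real messages vary by system and vendor.

```shell
# Sketch: show the last lines of a log file that mention likely trouble.
# The default path follows the text; the keyword list is only a guess
# and should be adjusted for your system.
crash_clues() {
    logfile=${1:-/var/log/messages}
    count=${2:-20}
    grep -i -E 'panic|fatal|error|fail' "$logfile" | tail -n "$count"
}
```

For example, crash_clues /var/log/messages 50 would print up to the last 50 matching lines.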

Beyond console messages lie crash dumps. Most systems automatically write a dump of kernel memory when the system crashes (if possible). These memory images can be examined using a debugging tool to see what the kernel was doing when it crashed. Obviously, these dumps are of use only for certain types of crashes in which the system state at the time of the crash is relevant. Analyzing crash dumps is beyond the scope of this book, but you should know where crash dumps go on your system and how to access them, if only to be able to save them for your field service engineers or vendor technical support personnel.

Crash dumps are usually written to the system disk swap partition. Since this area may be overwritten when the system is booted, some provisions need to be made to save its contents. The savecore command solves this problem, as we have seen (the command is called savecrash under HP-UX).

NOTE

If you want to be able to save crash dumps, you need to ensure that the primary swap partition is large enough. Unless your system has the ability to compress crash dumps as they are created (e.g., Tru64) or selectively dump only the relevant parts of memory, the swap partition needs to be at least as large as physical memory.
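On Linux, one way to check this size rule is to compare the totals reported in /proc/meminfo. The sketch below is Linux-specific and compares total swap space, not just the primary swap partition, so treat a passing result as a first approximation only. The function name swap_covers_memory is made up for this example.

```shell
# Sketch, Linux only: compare total swap with physical memory using
# /proc/meminfo (both values are reported in kB). This looks at total
# swap, not the primary swap partition alone.
swap_covers_memory() {
    meminfo=${1:-/proc/meminfo}
    awk '/^MemTotal:/  { mem = $2 }
         /^SwapTotal:/ { swap = $2 }
         END {
             if (swap >= mem) {
                 print "ok: swap " swap " kB >= memory " mem " kB"
                 exit 0
             }
             print "too small: swap " swap " kB < memory " mem " kB"
             exit 1
         }' "$meminfo"
}
```

The function returns a nonzero status when swap is smaller than memory, so it can be used in an if statement in a periodic check.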

If your system crashes and you are not collecting crash dumps by default, but you want to get one, boot the system to single-user mode and execute savecore by hand. Don't let the system boot to multiuser mode before saving the crash dump; once the system reaches multiuser mode, it's too late.

AIX also provides the snap command for collecting crash dump and other system data for later analysis.

4.4.1 Power-Failure Scripts

There are two other action keywords available for inittab that we've not yet considered: powerfail and powerwait. They define entries that are invoked if a SIGPWR signal is sent to the init process, which indicates an imminent power failure. This signal is generated only for detectable power failures: those caused by faulty power supplies, fans, and the like, or via a signal from an uninterruptible power supply (UPS). powerwait differs from powerfail in that it requires init to wait for its process to complete before going on to the next applicable inittab entry.

The scripts invoked by these entries are often given the name rc.powerfail. Their purpose is to do whatever can be done to protect the system in the limited time available. Accordingly, they focus on syncing the disks to prevent data loss that might occur if disk operations are still pending when the power does go off.

Linux provides a third action, powerokwait, that is invoked when power is restored and tells init to wait for the corresponding process to complete before going on to any additional entries.
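To make the pieces concrete, here is a minimal sketch of what such a script might contain, with Linux-style inittab entries shown in the comments. The entry ids, the script path, the log location, and the function name power_event are all assumptions for illustration; vendor-supplied scripts typically do more.

```shell
#!/bin/sh
# Hypothetical /etc/rc.powerfail sketch. Linux-style inittab entries that
# might invoke it (the ids "pf" and "po" and the path are assumptions):
#   pf::powerwait:/etc/rc.powerfail start
#   po::powerokwait:/etc/rc.powerfail stop
POWER_LOG=${POWER_LOG:-/var/log/powerfail.log}   # assumed log location

power_event() {
    case "$1" in
        start)
            echo "$(date): power failure signaled; syncing disks" >> "$POWER_LOG"
            sync    # flush pending writes to disk
            ;;
        stop)
            echo "$(date): power restored" >> "$POWER_LOG"
            ;;
        *)
            echo "usage: power_event start|stop" >&2
            return 1
            ;;
    esac
}
```

A complete script would end by calling power_event "$1". Keeping the work this small reflects the limited time available before the power actually goes off.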

Keeping the Trains on Time

If you can keep your head when all about you
Are losing theirs and blaming it on you . . .
Kipling

System administration is often metaphorically described as keeping the trains on time, referring to the pervasive attitude that its effects should basically be invisible: no one ever pays any attention to the trains except when they're late. To an even greater extent, no one notices computer systems except when they're down. And a few days of moderate system instability (in English, frequent crashes) can make even the most good-natured users frustrated and hostile.

The system administrator is the natural target when that happens. People like to believe that there was always something that could have been done to prevent whatever problem has surfaced. Sometimes, that's true, but not always or even usually. Systems sometimes develop problems despite your best preventative maintenance.

The best way to handle such situations involves two strategies. First, during the period of panic and hysteria, do your job as well as you can, and leave sorting out who did or didn't do what, and when, until things are stable again. The second part gets carried out in periods of calm between crises. It involves keeping fairly detailed records of system performance and status over a significant period of time; they are invaluable for figuring out, after the fact, just how much significance to attach to any particular period of trouble. When the system has been down for two days, no one will care that it has been up 98% of the time it was supposed to be over the last six months, but it will matter once things have stabilized again.

It's also a good idea to document how you spend your time caring for the system, dividing the time into broad categories (system maintenance, user support, routine activities, system enhancement), as well as how much time you spend doing so, especially during crises. You'll be amazed by the bottom line.

4.4.2 When the System Won't Boot

As with system crashes, there can be many reasons why a system won't boot. To solve such problems, you first must figure out what the specific problem is. You'll need to have a detailed understanding of what a normal boot process looks like so that you can pinpoint exactly where the failure is occurring. Having a hard copy of normal boot messages is often very helpful. One thing to keep in mind is that boot problems always result from some sort of change to the system; systems don't just stop working. You need to figure out what has changed. Of course, if you've just made modifications to the system, they will be the prime suspects.
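A saved copy of normal boot messages can also be compared mechanically with a later capture. The sketch below prints the lines in a current capture that do not appear in the baseline. The function name and both file names are hypothetical, and it assumes the messages carry no per-boot timestamps (if they do, strip them first, or nearly every line will look new).

```shell
# Sketch: print boot-message lines in the current capture that are
# absent from a saved baseline of a normal boot.
# Matching is on whole lines and fixed strings, not patterns.
new_boot_lines() {
    baseline=$1
    current=$2
    grep -F -x -v -f "$baseline" "$current"
}
```

On a healthy system, something like dmesg > boot.baseline could create the baseline; after a troubled boot, new_boot_lines boot.baseline boot.now shows only what changed.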

This section lists some of the most common causes of booting problems, along with suggestions for what to do in each case.

4.4.2.1 Bad or flaky hardware

Check the obvious first. The first thing to do when there is a device failure is to see if there is a simple problem that is easily fixed. Is the device plugged in and turned on? Have any cables connecting it to the system come loose? Does it have the correct SCSI ID (if applicable)? Is the SCSI chain terminated? You get the idea.

Try humoring the device. Sometimes devices are just cranky and can be coaxed back to life. For example, if a disk won't come on line, try power-cycling it. If that doesn't work, try shutting off the power to the entire system. Then power up the devices one by one, beginning with peripherals and ending with the CPU if possible, waiting for each one to settle down before going on to the next device. Sometimes this approach works on the second or third try even after failing on the first. When you use this approach, once you've turned the power off, leave it off for a minute or so to allow the device's internal capacitors to discharge fully. When you decide you've had enough, call field service.

Device failures. If a critical hardware device fails, there is not much you can do except call field service. Failures can occur suddenly, and the first reboot after the system power has been off often stresses marginal devices to the point that they finally fail.

4.4.2.2 Unreadable filesystems on working disks

You can distinguish this case from the previous one by the kind of error you get. Bad hardware usually generates error messages about the hardware device itself, as a whole. A bad filesystem tends to generate error messages later in the boot process, when the operating system tries to access it.

Bad root filesystem. How you handle this problem depends on which filesystem is damaged. If it is the root filesystem, then you may be able to recreate it from a bootable backup/recovery tape (or image on the network) or by booting from alternate media (such as the distribution tape, CD-ROM, or diskette from which the operating system was installed), remaking the filesystem and restoring its files from backup. In the worst case, you'll have to reinstall the operating system and then restore files that you have changed from backup.

Restoring other filesystems. On the other hand, if the system can still boot to single-user mode, things are not nearly so dire. Then you will definitely be able to remake the filesystem and restore its files from backup.

4.4.2.3 Damage to non-filesystem areas of a disk

Damaged boot areas. Sometimes, it is the boot partition or even the boot blocks of the root disk that are damaged. Some Unix versions provide utilities for restoring these areas without having to reinitialize the entire disk. You'll probably have to boot from a bootable backup tape or other distribution media to use them if you discover the problem only at boot time. Again, the worst-case scenario is having to reinstall the operating system.

Corrupted partition tables. On PCs, it is possible to wipe out a disk's partition tables if a problem occurs while you are editing them with the fdisk disk partitioning utility. If the power goes off or fdisk hangs, the disk's partition information can be incorrect or wiped out entirely. This problem can also happen on larger systems, although it's far less common to edit the partition information there except at installation (and often not even then).

The most important thing to do in this case is not to panic. This happened to me on a disk where I had three operating systems installed, and I really didn't want to have to reinstall all of them. The fix is actually quite easy: simply rerun fdisk and recreate the partitions as they were before, and all will be well again. However, this does mean that you need to have complete, detailed, and accessible (e.g., hardcopy) records of how the partitions were set up.
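One way to build such a record is to save the partition listing to a dated text file and then print it. The sketch below assumes a Linux-style fdisk -l for the listing; the LISTER variable, the function name, and the output directory are inventions for this example, and other systems will need a different listing command.

```shell
# Sketch: keep a dated text copy of a disk's partition table.
# LISTER defaults to "fdisk -l" (Linux); the output directory is an
# assumption. Print the resulting file and store the paper copy too.
save_partition_record() {
    disk=$1
    outdir=${2:-/var/adm/partitions}
    mkdir -p "$outdir" || return 1
    out="$outdir/$(basename "$disk").$(date +%Y%m%d)"
    ${LISTER:-fdisk -l} "$disk" > "$out" && echo "$out"
}
```

A record kept only on the disk it describes is of little use when that disk is the one in trouble, so copy it elsewhere as well.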

4.4.2.4 Incompatible hardware

Problems with a new device. Sometimes, a system hangs when you try to reboot it after adding new hardware. This can happen when the system does not support the type of device that you've just added, either because the system needs to be reconfigured to do so or because it simply does not support the device.

In the first case, you can reconfigure the system to accept the new hardware by building a new kernel or doing whatever else is appropriate on your system. However, if you find out that the device is not supported by your operating system, you will probably have to remove it to get the system to boot, after which you can contact the relevant vendors for instructions and assistance. It usually saves time in the long run to check compatibility before purchasing or installing new hardware.

Problems after an upgrade. Hardware incompatibility problems also crop up occasionally after operating system upgrades on systems whose hardware has not changed, due to withdrawn support for previously supported hardware or because of undetected bugs in the new release. You can confirm that the new operating system is the problem if the system still boots correctly from bootable backup tapes or installation media from the previous release. If you encounter sudden device-related problems after an OS upgrade, contacting the operating system vendor is usually the best recourse.

Device conflicts. On PCs, devices communicate with the CPU using a variety of methods: interrupt signals, DMA channels, I/O addresses/ports, and memory addresses (listed in decreasing order of conflict likelihood). All devices that operate at the same time must have unique values for the items relevant to them (values are set via jumpers or other mechanisms on the device or its controller, or via a software utility provided by the manufacturer for this purpose). Keeping detailed and accurate records of the settings used by all of the devices on the system will make it easy to select appropriate ones when adding a new device and to track down conflicts should they occur.
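If the settings records are kept in a simple text file, a duplicate can be found mechanically. The sketch below assumes a hypothetical record format of one device name and one IRQ number per line, with # marking comment lines; the function name and the format are illustrative only. The same idea applies to DMA channels or I/O addresses.

```shell
# Sketch: report IRQ values claimed by more than one device.
# Input format (an assumption): one "device irq" pair per line,
# with lines starting in "#" treated as comments.
irq_conflicts() {
    awk 'NF >= 2 && $1 !~ /^#/ {
             count[$2]++
             names[$2] = (names[$2] == "" ? $1 : names[$2] "," $1)
         }
         END {
             for (irq in count)
                 if (count[irq] > 1)
                     print "IRQ " irq ": " names[irq]
         }' "$1"
}
```

On Linux systems, /proc/interrupts shows the interrupts currently in use and can help bring such a record up to date.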

4.4.2.5 System configuration errors

Errors in configuration files. This type of problem is usually easy to recognize. More than likely, you've just recently changed something, and the boot process dies at a clearly identifiable point in the process. The solution is to boot to single-user mode and then correct the erroneous configuration file or reinstall a saved, working version of it.
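Having a saved, working version depends on making the copy before you edit. A minimal sketch of that habit follows; the function names and the .save suffix are arbitrary choices for this example, not a convention of any particular system.

```shell
# Sketch: save a copy of a configuration file before editing it.
# The ".save" suffix is an arbitrary choice; cp -p keeps the
# original modification time and permissions.
save_config() {
    file=$1
    cp -p "$file" "$file.save" && echo "saved $file.save"
}

# Put the saved version back, for example from single-user mode.
restore_config() {
    file=$1
    cp -p "$file.save" "$file"
}
```

Since both functions rely on cp, they help only when the filesystem holding cp and whatever libraries it needs is mounted.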

Unbootable kernels. Sometimes, when you build a new kernel, it won't boot. There are at least two ways that this can occur: you may have made a mistake building or configuring the kernel (on Linux systems, forgetting to run lilo after building a new kernel is a common one), or there may be bugs in the kernel that manifest themselves on your system. The latter happens occasionally when updating the kernel to the latest release level on Linux systems.

In either case, the first thing to do is to reboot the system using a working, saved kernel that you've kept for just this contingency. Once the system is up, you can track down the problem with the new kernel. In the case of Linux kernels, if you're convinced that you haven't made any mistakes, you can check the relevant newsgroups to see if anyone else has seen the same problem. If no information is available, the best thing to do is wait for the next patch level to become available (it doesn't take very long) and then try rebuilding the kernel again. Frequently, the problem will disappear.

Errors in initialization files are a very common cause of boot problems. Usually, once an error is encountered, the boot stops and leaves the system in single-user mode. The incident described in Chapter 3 about the workstation that wouldn't boot ended up being a problem of this type. The user had been editing the initialization files on his workstation, and he had an error in the first line of /etc/rc (I found out later). So only the root disk got mounted. On this system, /usr was on a separate disk partition, and the commands stored in /bin used shared libraries stored under /usr. There was no ls, no cat, not even ed.

As I told you before, I remembered that echo could list filenames using the shell's internal wildcard expansion mechanism (and it didn't need the shared library). I typed:

# echo /etc/rc*

and found out there was an rc.dist file there. Although it was probably out of date, it could get things going. I executed it manually:

# . /etc/rc.dist
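The same idea extends to reading a file when cat is unavailable. The sketch below gives stand-ins for ls and cat that use only commands built into the shell. It assumes a POSIX-style shell in which echo, read, and printf are built in; that was not true of every shell of the era, and the function names are made up for this example.

```shell
# Sketch: stand-ins for ls and cat using only shell built-ins.
# Assumes a POSIX-style shell where echo, read, and printf are
# built in (where printf is not, echo can replace it).
ls_builtin() {
    # The shell expands the wildcard itself; no external program runs.
    for f in "${1:-.}"/*; do
        echo "$f"
    done
}

cat_builtin() {
    # Read the file line by line, including a final unterminated line.
    while IFS= read -r line || [ -n "$line" ]; do
        printf '%s\n' "$line"
    done < "$1"
}
```

With these defined, cat_builtin /etc/rc.dist would have let me look at the file before running it.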

The moral of this story is, of course, test, test, test. Note once more that obsessive prudence is your best hope every time.