Anyone with system administration experience has been there. You are in the middle of a production cycle, or just working on the desktop, when the computer mysteriously hangs or displays an elaborate screen message full of hexadecimal addresses and perhaps the stack trace of an offending NULL pointer dereference.
What to do? In this chapter, we hope to provide an answer as we discuss kernel panics, oopses, hangs, and hardware faults. We examine what the system does in these situations and discuss the tools required for initial analysis. We begin with OS hangs, move on to kernel panics and oopses, and conclude with hardware machine checks.
To know how to remedy the problem, you must first identify whether you are encountering a panic, a hang, or a hardware fault. Panics are easy to detect because the kernel shuts down voluntarily. Hangs can be more difficult to detect because the kernel has entered some unknown state and a driver has ceased to respond, preventing processes from being scheduled. Hardware faults occur at a lower level, beneath and independent of the OS, and are observed through firmware logs.
When you encounter a hang, panic, or hardware fault, determine whether it is easily reproducible. This information helps to identify whether the underlying problem lies in hardware or software. If the problem is easily reproducible on different machines, chances are that it is software-related. If it is reproducible on only one machine, focus first on ruling out a hardware problem on that machine.
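The reproducibility rule of thumb above can be written down as a small decision helper. This is only a minimal sketch of the heuristic, not a diagnostic tool: the names `LikelyCause` and `likely_cause` are hypothetical, and the thresholds (at least two machines tested) are an assumption made for illustration.

```python
from enum import Enum


class LikelyCause(Enum):
    SOFTWARE = "software"
    HARDWARE = "hardware"
    UNDETERMINED = "undetermined"


def likely_cause(machines_tested: int, machines_reproduced: int) -> LikelyCause:
    """Suggest where to look first, based on how many machines show the problem.

    Hypothetical helper: it encodes only the rule of thumb from the text and
    assumes at least two machines must be tested before drawing a conclusion.
    """
    if machines_reproduced < 0 or machines_reproduced > machines_tested:
        raise ValueError("machines_reproduced must be between 0 and machines_tested")

    if machines_tested >= 2 and machines_reproduced >= 2:
        # Reproducible on different machines: probably software-related.
        return LikelyCause.SOFTWARE
    if machines_tested >= 2 and machines_reproduced == 1:
        # Reproducible on only one machine: rule out that machine's hardware first.
        return LikelyCause.HARDWARE
    # Too few machines tested, or the problem never reproduced.
    return LikelyCause.UNDETERMINED


if __name__ == "__main__":
    print(likely_cause(3, 3).value)
    print(likely_cause(3, 1).value)
```

The result is a starting point for the investigation, not a verdict. A software bug that is triggered only by one machine's configuration would still be reported as a hardware suspect here.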
One final important point before we begin discussing hangs: Whether you are dealing with an OS hang or a panic, you must confirm that the hardware involved is supported by the Linux distribution before proceeding. Make sure the manufacturer supports the Linux kernel and hardware configuration used. Contact the manufacturer or consult its documentation or official Web site. This step is important because when the hardware is supported, the manufacturer has already contributed vast resources to ensure compatibility and operability with the Linux kernel. Conversely, if the hardware is not supported, you will not have the benefit of this expertise. Even if you can find the bug, either the manufacturer would have to implement your fix or you would have to modify the open source driver yourself. However, even if the hardware is not supported, you may find this chapter to be a helpful learning tool because we highlight why the driver, kernel module, application, and hardware are behaving as they are.