My good friend, James Kirkland, sent me an instant message one day asking if I wanted to write a Linux troubleshooting book with him. James has been heavily involved in Linux at the HP Response Center for several years. While troubleshooting Linux issues for customers, he realized there was not a good troubleshooting reference available. I remember a meeting where we discussed Linux troubleshooting. Someone asked what the most valuable Linux troubleshooting tool was. The answer was immediate: Google. If you have ever spent time trying to find a solution for a Linux problem, you know what that engineer was talking about. A wealth of great Linux information can be found on the Internet, but you can't always rely on this strategy. Some of the Linux information is outdated. A lot of it can't be understood without a good foundation of subject knowledge, and some of it is incorrect. We wanted to write this book so the Linux administrator will know how Linux works and how to approach and resolve common issues. This book contains the information we wish we had when we started troubleshooting Linux.
Greg and Chris Tinker are identical twins and serious Linux hobbyists. They have been Linux advocates within HP for years. Yes, they both run Linux on their laptops. Chris is a member of the Superdome Server team (http://www.hp.com/products1/servers/scalableservers/superdome/index.html). Greg works for the XP storage team (http://h18006.www1.hp.com/storage/xparrays.html). Their Linux knowledge is wide and deep. They have worked through SAN storage issues and troubleshot process hangs, Linux crashes, performance issues, and everything else for our customers, and they have put their experience into the book.
I am a member of the HP escalations team. I've primarily spent my time resolving HP-UX issues. I've been a Linux hobbyist for a few years, and I've started working Linux escalations, but I'm definitely newer to Linux than the rest of the team. I tried to give the book the perspective of someone who is fairly new to Linux. I tried to remember the questions I had when I first started troubleshooting Linux issues and included them in the book. We sincerely hope our effort is helpful to you.
These chapter summaries will give you an idea of how the book is organized and a bit of an overview of the content of each chapter.
Chapter 1: System Boot, Startup, and Shutdown Issues
Chapter 1 discusses the different subsystems that comprise Linux startup. These include the bootloaders GRUB and LILO, the init process, and the rc startup and shutdown scripts. We explain how GRUB and LILO work along with the important features of each. The reader will learn how to boot when there are problems with the bootloader. There are numerous examples. We explain how init works and what part it plays in starting Linux. The rc scripts are explained in detail as well. The reader will learn how to boot to single-user mode, emergency mode, and confirm mode. Examples show how to use a recovery CD when Linux won't boot from disk.
Chapter 2: System Hangs and Panics
This chapter explains interruptible and non-interruptible OS hangs, kernel panics, and IA64 hardware machine checks. A Linux hang takes one of two forms. In an interruptible hang, Linux seems frozen but still responds to some events, such as a ping request. In a non-interruptible hang, the system does not respond to any actions. We show how to use the Magic SysRq keystroke to generate a stack trace to troubleshoot an interruptible hang. We explain how to force a panic when Linux is in a non-interruptible hang. An OS panic is a voluntary shutdown of the kernel in response to something unexpected. We discuss how to obtain a panic dump from Linux. The IA64 architecture dump mechanism is also explained.
Chapter 3: Performance Tools
In Chapter 3, we explain how to use some of the most popular Linux performance tools, including top, sar, vmstat, iostat, and free. The examples show common syntax and options. Every system administrator should be familiar with these commands.
Chapter 4: Performance
Chapter 4 discusses different approaches to isolating a performance problem. In most performance investigations, storage draws significant attention. The goal of this chapter is to provide a quick understanding of how a storage device should perform and easy ways to get a performance measurement without expensive software. In addition to troubleshooting storage performance, we touch on CPU bottlenecks and ways to find them.
Chapter 5: Adding New Storage via SAN with Reference to PCMCIA and USB
Linux is moving out from under the desk and into the data center. An essential feature of an enterprise computing platform is being able to access storage on the SAN. This chapter provides a detailed walkthrough and examples of installing and configuring Fibre Channel cards. We discuss driver issues, how the device files work, and how to add LUNs.
Chapter 6: Disk Partitions and Filesystems
Master Boot Record (MBR) basics are explained, and examples are shown detailing how bootloader programs such as LILO and GRUB manipulate the MBR. We explain the partition table, and many examples are given so that the reader will understand how the disk is carved up into extended and logical partitions. Many scenarios are provided explaining common disk and filesystem problems and their solutions. After reading this chapter, the reader will understand not only what MBR, LBA, extended partitions, and all the other buzzwords mean, but also how they look on the disk and how to fix problems related to them.
Chapter 7: Device Failure and Replacement
This chapter explains how to identify problems with hardware devices and how to fix them. We begin with a discussion of supported devices. Whether a device is supported by the Linux distribution is a good thing to know before spending a lot of time trying to get it working. Next we show where to look for indications of hardware problems. The reader will learn how to decipher the hexadecimal error messages from dmesg and syslog. We explain how to use the lspci tool for troubleshooting. When the error is understood, the next goal is to resolve the device problem. We demonstrate techniques for determining what needs to be done to fix device issues, including those involving SAN devices.
Chapter 8: Linux Processes: Structure, Hangs, and Core Dumps
Process management is the heart of the Linux kernel. To troubleshoot process issues, a system administrator should know what happens when a process is created. This chapter explains process creation and provides a foundation for troubleshooting. Linux is a multithreading kernel. The reader will learn how multithreading works and what heavyweight and lightweight processes are. The reader also will learn how to troubleshoot a process that seems to be hanging and not doing any work. Core dumps are also covered. We show you how to learn which process dumped core and why. This chapter details how cores are created and how to best use them to understand the problem.
Chapter 9: Backup/Recovery
Creating good backups is one of the most important tasks a system administrator must perform, if not the most important. This chapter explains the most commonly used backup/recovery commands: tar, cpio, dump/restore, and so on. Tape libraries (autoloaders) are explained along with the commands needed to manipulate them. The reader will learn the uses of different tape device files. There are examples showing how to troubleshoot common issues.
Chapter 10: cron and at
The cron and at commands are familiar to most Linux users. These commands are used to schedule jobs to run at a later time. This chapter explains how the cron/at subsystem works and where to look when jobs don't run. The cron, at, batch, and anacron facilities are explained in detail. The kcron graphical cron interface is discussed. Numerous examples are provided to demonstrate how to resolve the most common problems. The troubleshooting techniques help build good general troubleshooting skills that can be applied to many other Linux problems.
Chapter 11: Printing and Printers
This chapter explains the different print spoolers used in Linux systems. The reader will learn how the spooler works. The examples show how to use spooler commands such as lpadmin, lpoptions, lprm, and others to identify problems. The different page description languages such as PCL and PostScript are explained. Examples demonstrate how to fix remote printing and network printing problems.
Chapter 12: System Security
Security is a concern of every system administrator. Is the box safe because it is behind a firewall? What steps should be taken to secure the system? These questions are answered. Host-based and network-based security are explained. The Secure Shell (SSH) protocol is covered in detail: why SSH is secure, encryption with SSH, SSH tunnels, and troubleshooting typical SSH problems, all with examples. The reader will learn system hardening using netfilter and iptables. netfilter and iptables together make up the standard firewall software for the Linux 2.4 and 2.6 kernels.
Chapter 13: Network Problems
Network issues are a common problem for any system administrator. What should be done when Linux boots and users can't connect? Is the problem with the Linux box or something on the LAN? Has the network interface card failed? We need a systematic way to verify the network hardware and Linux configuration. Chapter 13 provides the information a Linux system administrator needs to troubleshoot network problems. Learn where to look for configuration problems and how to use the commands ethtool, modinfo, mii-tool, and others to diagnose networking problems.
Chapter 14: Login Problems
Chapter 14 explains how the login process works and how to troubleshoot login failures. Password aging is explained. Several examples show the reader how to fix common login problems. The Pluggable Authentication Modules (PAM) subsystem is explained in detail. The examples reinforce the concepts explained and demonstrate how to fix problems encountered with PAM.
Chapter 15: X Windows Problems
GNOME and KDE are client/server applications just like many others that run on Linux, but they can be frustrating to troubleshoot because they are display managers. After reading this chapter, the reader will understand the components of Linux graphical display managers and how to troubleshoot problems. Practical examples are provided to reinforce the concepts, and they can be applied to real-world problems.
We would like to extend our sincere gratitude to everyone who made this book possible. In particular, we thank Hewlett-Packard and our HP management as well as the Prentice Hall editorial and production teams.
We also wish to express gratitude to our families for their understanding and support throughout the long road from the initial drafting to the final publication.
About the Authors
JAMES KIRKLAND is a Senior Consultant for Racemi. He was previously a Senior Systems Administrator at Hewlett-Packard. He has been working with UNIX variants for more than ten years. James is a Red Hat Certified Engineer, Linux LPIC level one certified, and an HP-UX certified System Administrator. He has been working with Linux for seven years and HP-UX for eight years. He has been a participant at HP World, Linux World, and numerous internal HP forums.
DAVID CARMICHAEL works for Hewlett-Packard as a Technical Problem Manager in Alpharetta, Georgia. He earned a bachelor's degree in computer science from West Virginia University in 1987 and has been helping customers resolve their IT problems ever since. David has written articles for HP's IT Resource Center (http://itrc.hp.com) and presented at HP World 2003.
CHRIS and GREG TINKER are twin brothers originally from LaFayette, Georgia. Chris began his career in computers while working as a UNIX System Administrator for Lockheed Martin in Marietta, Georgia. Greg began his career while at BellSouth in Atlanta, Georgia. Both Chris and Greg joined Hewlett-Packard in 1999. Chris's primary role at HP is as a Senior Software Business Recovery Specialist, and Greg's primary role is as a Storage Business Recovery Specialist. Both Chris and Greg have participated in HP World, taught several classes in UNIX/Linux and Disk Array technology, and obtained various certifications, including certifications in Advanced Clusters, SAN, and Linux. Chris resides with his wife, Bonnie, and Greg resides with his wife, Kristen, in Alpharetta, Georgia.