Solving Problems

As the system administrator, it is your responsibility to keep the system secure and running smoothly. When a user is having a problem, it usually falls to the administrator to help the user get back on track. This section suggests ways to keep users happy and the system functioning at peak performance.

Helping When a User Cannot Log In

When a user has trouble logging in on the system, the source may be a user error or a problem with the system software or hardware. The following steps can help determine where the problem is:

Determine whether only that one user or only that one user's terminal/workstation has a problem or whether the problem is more widespread.
Check that the user's Caps Lock key is not on.
Make sure the user's home directory exists and corresponds to that user's entry in the /etc/passwd file. Verify that the user owns his or her home directory and startup files and that they are readable (and, in the case of the home directory, executable). Confirm that the entry for the user's login shell in the /etc/passwd file is valid (that is, the entry is accurate and the shell exists as specified).
Change the user's password if there is a chance that he or she has forgotten the correct password.
Check the user's startup files (.profile, .login, .bashrc, and so on). The user may have edited one of these files and introduced a syntax error that prevents login.
Check the terminal or monitor data cable from where it plugs into the terminal to where it plugs into the computer (or as far as you can follow it). Try turning the terminal or monitor off and then turning it back on.
When the problem appears to be widespread, check whether you can log in from the system console. If you can, make sure that the system is in multiuser mode. If you cannot log in, the system may have crashed; reboot it and perform any necessary recovery steps (the system usually does quite a bit automatically).
Check that the /etc/inittab file is set up to start mingetty at runlevels 2 through 5.
Check the /var/log/messages file. This file accumulates system errors, messages from daemon processes, and other important information. It may indicate the cause or more symptoms of a problem. Also, check the system console. Occasionally messages about system problems that are not written to /var/log/messages (for instance, a full disk) are displayed on the system console.
If the user is logging in over a network connection, run system-config-services (page 406) to make sure that the service the user is trying to use (such as telnet or ssh) is enabled.
Use df to check for full filesystems. If the /tmp filesystem or the user's home directory is full, login sometimes fails in unexpected ways. In some cases you may be able to log in to a textual environment but not a graphical one. When applications that start when the user logs in cannot create temporary files or cannot update files in the user's home directory, the login process itself may terminate.
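The checks on the user's /etc/passwd entry, home directory, and login shell can be sketched as a small shell function. This is a minimal sketch, not a standard tool: the function name check_passwd_entry is hypothetical, and the readable/searchable tests reflect the permissions of whoever runs the function (root passes them regardless), so treat the result as a first hint only.

```shell
# check_passwd_entry LINE
# LINE is one /etc/passwd record: name:passwd:uid:gid:gecos:home:shell
# (hypothetical helper; prints a message for each problem it finds)
check_passwd_entry() {
    IFS=: read -r name _pw _uid _gid _gecos home shell <<EOF
$1
EOF
    rc=0
    if [ ! -d "$home" ]; then
        echo "$name: home directory $home is missing"
        rc=1
    elif [ ! -r "$home" ] || [ ! -x "$home" ]; then
        echo "$name: home directory $home is not readable and searchable"
        rc=1
    fi
    # An empty shell field means the system default, assumed here to be /bin/sh
    if [ ! -x "${shell:-/bin/sh}" ]; then
        echo "$name: login shell $shell does not exist or is not executable"
        rc=1
    fi
    return $rc
}

# Example (assumes getent is available and a user named alex exists):
#   check_passwd_entry "$(getent passwd alex)"
```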

Speeding Up the System

When the system is running slowly for no apparent reason, perhaps a process did not exit when a user logged out. Symptoms of this problem include poor response time and a system load, as shown by w or uptime, that is greater than 1.0. Running top (page 550) is an excellent way to quickly find rogue processes. Use ps -ef to list all processes. One thing to look for in ps -ef output is a large number in the TIME column. For example, if a Firefox process has a TIME field over 100.0, this process has likely run amok. However, if the user is doing a lot of Java work and has not logged out for a long time, this value may be normal. Look at the STIME field to see when the process was started. If the process has been running for longer than the user has been logged in, it is a good candidate to be killed.
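Scanning the TIME column by eye gets tedious on a busy system. The following sketch converts the TIME format that ps prints ([[dd-]hh:]mm:ss) to seconds and lists processes above a threshold. The function names and the threshold are illustrative assumptions, and the stime output keyword assumes the procps version of ps found on Linux.

```shell
# cputime_to_seconds TIME
# Convert a ps TIME value such as 01:40:00 or 2-00:00:10 to seconds.
cputime_to_seconds() {
    echo "$1" | awk -F'[-:]' '{
        sec  = $NF
        min  = (NF >= 2) ? $(NF - 1) : 0
        hour = (NF >= 3) ? $(NF - 2) : 0
        day  = (NF >= 4) ? $(NF - 3) : 0
        print day * 86400 + hour * 3600 + min * 60 + sec
    }'
}

# list_cpu_hogs SECONDS
# Print PID, user, CPU time, start time, and command for every process
# that has accumulated at least SECONDS of CPU time.
list_cpu_hogs() {
    limit=$1
    ps -e -o pid= -o user= -o time= -o stime= -o comm= |
    while read -r pid user time stime comm; do
        if [ "$(cputime_to_seconds "$time")" -ge "$limit" ]; then
            echo "$pid $user $time $stime $comm"
        fi
    done
}

# Example: list processes that have used an hour or more of CPU time
#   list_cpu_hogs 3600
```

A process that shows up here is only a candidate; compare its start time with the user's login time before deciding to kill it.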

When a user gets stuck and leaves her terminal unattended without notifying anyone, it is convenient to kill (page 395) all processes owned by that user. If the user is running a window system, such as GNOME or KDE on the console, kill the window manager process. Manager processes to look for include startkde, gnome-session, or another process name that ends in wm. Usually the window manager is either the first or the last thing to be run, and exiting from the window manager logs the user out. If killing the window manager does not work, try killing the X server process itself. This process is typically listed as /usr/bin/Xorg. If that fails, you can kill all processes owned by a user by giving the command kill -15 -1, or equivalently kill -TERM -1, while you are logged in as that user. Using -1 (minus one) in place of the process ID tells kill that it should send the signal to all processes that are owned by that user. For example, as root you could give the following command:

# su jenny -c 'kill -TERM -1'

If this does not kill all processes (sometimes TERM does not kill a process), you can use the KILL signal. The following line will definitely kill all processes owned by Jenny and will not be friendly about it:

# su jenny -c 'kill -KILL -1'

(If you do not use su jenny -c, the same command brings the system down, because root would then send the signal to every process on the system.)
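When only one process is misbehaving, the same escalation from TERM to KILL can be applied to a single process ID. The sketch below is one way to do that; the function name terminate_pid and the default grace period of five seconds are assumptions chosen for illustration.

```shell
# terminate_pid PID [GRACE_SECONDS]
# Send TERM, wait up to GRACE_SECONDS for the process to exit,
# and send KILL only if it is still running after that.
terminate_pid() {
    pid=$1
    grace=${2:-5}
    # If TERM cannot be delivered, the process is already gone
    # (or belongs to someone else); nothing more to do here.
    kill -TERM "$pid" 2>/dev/null || return 0
    i=0
    while [ "$i" -lt "$grace" ]; do
        # kill -0 sends no signal; it only tests whether the process exists
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
        i=$((i + 1))
    done
    kill -KILL "$pid" 2>/dev/null
    return 0
}

# Example: give process 12345 ten seconds to clean up before forcing it
#   terminate_pid 12345 10
```

Giving a process time to respond to TERM lets it remove temporary files and release locks, which KILL never allows.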

lsof: Finds Open Files

The lsof (ls open files) utility locates open files. Its options display only certain processes, only certain file descriptors of a process, or only certain network connections (network connections use file descriptors just as normal files do and lsof can show these as well). Once you have identified a suspect process using ps -ef, give the following command:

# lsof -sp pid

Replace pid with the process ID of the suspect process; lsof displays a list of file descriptors that process pid has open. The -s option displays the sizes of all open files. This size information is helpful in determining whether the process has a very large file open. If it does, contact the owner of the process or, if necessary, kill the process. The -r n option redisplays the output of lsof every n seconds.

Keeping a Machine Log

A machine log that includes the information shown in Table 16-3 (next page) can help you find and fix system problems. Note the time and date for each entry in the log. Avoid the temptation to keep the log only on the computer; it will be most useful to you when the system is down. Another good idea is to keep a record of all email about user problems. One strategy is to save this mail to a separate file or folder as you read it. Another approach is to set up a mail alias that users can send mail to when they have problems. This alias can then forward mail to you and also store a copy in an archive file. Following is an example of an entry in the /etc/aliases file (page 633) that sets up this type of alias:

trouble: admin,/var/spool/mail/admin.archive

Email sent to the trouble alias will be forwarded to the admin user and also stored in the file /var/spool/mail/admin.archive.

Table 16-3. Machine log
Entry	Function
Hardware modifications	Keep track of the system hardware configuration: which devices hold which partitions, the model of the new NIC you added, and so on.
System software modifications	Keep track of the options used when building Linux. Print such files as /usr/src/linux/.config (Linux kernel configuration) and the X11 configuration file /etc/X11/xorg.conf. The file hierarchy under /etc/sysconfig contains valuable information about network configuration, among other things.
Hardware malfunctions	Keep as accurate a list as possible of any problems with the system. Make note of any error messages or numbers that the system displays on the system console and identify what users were doing when the problem occurred.
User complaints	Make a list of all reasonable complaints made by knowledgeable users (for example, "machine is abnormally slow").

Keeping the System Secure

No system with dial-in lines or public access to terminals is absolutely secure. You can make a system as secure as possible by changing the Superuser password frequently and choosing passwords that are difficult to guess. Do not tell anyone who does not absolutely need to know the Superuser password. You can also encourage system users to choose difficult passwords and to change them periodically.

By default, passwords on Red Hat Linux use MD5 (page 1042) hashing, which makes them more difficult to break than passwords encrypted with DES (page 990). However, it makes little difference how well encrypted your password is if you make it easy for someone to find out or guess what it is.

A password that is difficult to guess is one that someone else would not be likely to think you would have chosen. Do not use words from the dictionary (spelled forward or backward); names of relatives, pets, or friends; or words from a foreign language. A good strategy is to choose a couple of short words, include some punctuation (for example, put a ^ between them), mix the case, and replace some of the letters in the words with numbers. If it were not printed in this book, an example of a good password would be C&yGram5 (candygrams). Ideally you would use a random combination of ASCII characters, but that would be difficult to remember.
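For accounts where memorability matters less (for example, an initial password that the user must change), a random string can be generated from /dev/urandom. The following is a minimal sketch; the function name random_password, the default length of 12, and the restriction to letters and digits are all assumptions made for the example.

```shell
# random_password [LENGTH]
# Print LENGTH (default 12) random letters and digits.
random_password() {
    length=${1:-12}
    # Read a fixed amount of random data, keep only letters and digits,
    # then cut the result down to the requested length.
    head -c 512 /dev/urandom | LC_ALL=C tr -dc 'A-Za-z0-9' | cut -c1-"$length"
}

# Example:
#   random_password 16
```

Because only about a quarter of random bytes are letters or digits, 512 bytes of input comfortably covers lengths up to several dozen characters; longer passwords would need a larger read.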

You can use one of several excellent password-cracking programs to find users who have chosen poor passwords. These programs work by repeatedly encrypting words from dictionaries, phrases, names, and other sources. If the encrypted password matches the output of the program, then the program has found the password of the user. A program that cracks passwords is crack. It and many other programs and security tips are available from CERT (www.cert.org), which was originally called the Computer Emergency Response Team. Specifically look at www.cert.org/tech_tips.

Make sure that no one except Superuser can write to files containing programs that are owned by root and run in setuid mode (for example, mail and su). Also make sure that users do not transfer programs that run in setuid mode and are owned by root onto the system by means of mounting tapes or disks. These programs can be used to circumvent system security. One technique that prevents users from having setuid files is to use the nosuid flag to mount, which you can set in the flags section in the fstab file. Refer to "fstab: Keeps Track of Filesystems" on page 469.
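A periodic audit for setuid files can be done with find. The sketch below lists every ordinary file under a directory that has the setuid bit set, without crossing into other filesystems; the function name list_setuid is hypothetical.

```shell
# list_setuid DIRECTORY
# Print the pathname of each ordinary file under DIRECTORY that has the
# setuid bit set. -xdev keeps find on the filesystem that holds DIRECTORY.
list_setuid() {
    find "$1" -xdev -type f -perm -4000 -print 2>/dev/null
}

# Examples:
#   list_setuid /                       # audit the root filesystem
#   list_setuid /home | xargs ls -l     # review owners and permissions
```

Saving the output and comparing it with the previous run (for example, with diff) is one way to notice a setuid program that appeared unexpectedly.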

The BIOS in many machines gives you some degree of protection from an unauthorized person modifying the BIOS or rebooting the system. When you set up the BIOS, look for a section named Security. You can probably add a BIOS password. If you depend on the BIOS password, lock the computer case. It is usually a simple matter to reset the BIOS password by using a jumper on the motherboard.

Log Files and Mail for root

Users frequently email root and postmaster to communicate with the system administrator. If you do not forward root's mail to yourself (/etc/aliases on page 633), remember to check root's mail periodically. You will not receive reminders about mail that arrives for root when you use su to perform system administration tasks. However, after you use su to become root, you can give the command mail -u root to look at root's mail.

Review the system log files regularly for evidence of problems. Two important files are /var/log/messages, where the operating system and some applications record errors, and /var/log/maillog, which contains errors from the mail system.

The logwatch utility (/usr/sbin/logwatch points to the Perl script named /usr/share/logwatch/scripts/logwatch.pl) is a report writer that sends email reports on log files. By default, this script is run daily (/etc/cron.daily/0logwatch points to the same Perl script) and emails its output to root. Refer to the logwatch man page and to the script itself for more information.

Monitoring Disk Usage

Sooner or later you will probably start to run out of disk space. Do not fill up a disk; Linux can write to files significantly faster if at least 5 to 30 percent of the disk space in a given filesystem remains free. Using more than the maximum optimal disk space in a filesystem can degrade system performance.

Fragmentation

As a filesystem becomes full, it can become fragmented. This is similar to the DOS concept of fragmentation but is not nearly as pronounced and is typically rare on modern Linux filesystems; by design Linux filesystems are resistant to fragmentation. Keep filesystems from running near full capacity, and you may never need to worry about fragmentation. If there is no space on a filesystem, you cannot write to it at all.

To check for filesystem fragmentation, unmount the filesystem and run fsck on it. The output of fsck includes a percent fragmentation figure for the filesystem. You can defragment a filesystem by backing it up, using mkfs (page 419) to make a clean, empty image, and then restoring the filesystem. Which utility you use to do the backup and restore (dump/restore, tar, cpio, or a third-party backup program) is irrelevant.

Reports

Linux provides several programs that report on who is using how much disk space on which filesystems. Refer to the du, quota, and df man pages and the -size option in the find utility man page. In addition to these utilities, you can use the disk quota system to manage disk space.
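The output of df is easy to filter for filesystems that are getting full. The sketch below reads df -P output (the portable, one-line-per-filesystem format) and prints the mount point and usage of each filesystem at or above a limit; the function name full_filesystems and the 90 percent figure in the example are assumptions.

```shell
# full_filesystems PERCENT
# Read "df -P" output on standard input and print the mount point and
# usage of every filesystem whose capacity is at or above PERCENT.
full_filesystems() {
    awk -v limit="$1" 'NR > 1 {
        use = $5
        sub(/%/, "", use)          # strip the percent sign from Capacity
        if (use + 0 >= limit)
            print $6, use "%"
    }'
}

# Example: report filesystems that are 90 percent full or more
#   df -P | full_filesystems 90
```

Run from cron, a filter like this can warn you before a filesystem reaches the point where performance suffers. Mount points that contain spaces would need more careful parsing.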

Four strategies to increase the amount of free space on a filesystem are to compress files, delete files, grow LVM-based filesystems, and condense directories. This section contains some ideas on ways to maintain a filesystem so that it does not become overloaded.

Files that grow quickly

Some files, such as log files and temporary files, grow over time. Core dump files, for example, take up substantial space and are rarely needed. Also, users occasionally run programs that accidentally generate huge files. As the system administrator, you must review these files periodically so that they do not get out of hand.

If a filesystem is running out of space quickly (that is, over a period of an hour rather than weeks or months), first figure out why it is running out of space. Use a ps -ef command to determine whether a user has created a runaway process that is creating a huge file. When evaluating the output of ps, look for a process that has consumed a large amount of CPU time. If such a process is running and creating a large file, the file will continue to grow as you free up space. If you remove the huge file, the space it occupied will not be freed until the process terminates, so you need to kill the process. Try to contact the user running the process, and ask the user to kill it. If you cannot contact the user, log in as root and kill the process yourself. Refer to kill on page 395 for more information.

You can also truncate a large log file rather than removing it, although you can better deal with this recurring situation with logrotate (discussed in the next section). For example, if the /var/log/messages file has become very large because a system daemon is misconfigured, you can use /dev/null to truncate it:

# cp /dev/null /var/log/messages

or

# cat /dev/null > /var/log/messages

or, without spawning a new process,

# : > /var/log/messages

If you remove /var/log/messages, you have to restart the syslogd daemon. If you do not restart syslogd, the space on the filesystem is not released.
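The difference between truncating and removing comes down to the inode: truncation empties the file that the daemon already has open, whereas removal only deletes the name. The following sketch shows that truncation leaves the inode number unchanged. The helper names inode_of and truncate_in_place are hypothetical, and the example path is a scratch file, not a real log.

```shell
# inode_of FILE
# Print the inode number of FILE.
inode_of() {
    ls -i "$1" | awk '{ print $1 }'
}

# truncate_in_place FILE
# Empty FILE without replacing it, so a process that has it open
# keeps writing to the same (now empty) file.
truncate_in_place() {
    : > "$1"
}

# Example with a scratch file:
#   echo "some log data" > /tmp/demo.log
#   inode_of /tmp/demo.log          # note the number
#   truncate_in_place /tmp/demo.log
#   inode_of /tmp/demo.log          # same number, file is now empty
```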

When no single process is consuming the disk space but capacity has instead been used up gradually, locate unneeded files and delete them. You can archive these files by using cpio, dump, or tar before you delete them. You can safely remove most files named core that have not been accessed for several days. The following command line performs this function without removing necessary files named core (such as /dev/core):

# find / -type f -name core | xargs file | grep 'B core file' | sed 's/:ELF.*//g' | xargs  rm -f

The find command lists all ordinary files named core and sends its output to xargs, which runs file on each of the files in the list. The file utility displays a string that includes B core file for files created as the result of a core dump. These files need to be removed. The grep command filters out from file lines that do not contain this string. Finally sed removes everything following the colon so that all that is left on the line is the pathname of the core file; xargs removes the file.
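Before deleting anything, it can be reassuring to list the files the pipeline would remove. The sketch below separates the filtering step into a function so that the list can be reviewed first. The name core_paths and the seven-day access-time cutoff in the example are assumptions; the cutoff stands in for "several days".

```shell
# core_paths
# Read the output of the file utility on standard input and print only
# the pathnames of files that it identified as core dumps.
core_paths() {
    grep 'B core file' | sed 's/:[[:space:]]*ELF.*//'
}

# Example: list, but do not remove, core dumps on the root filesystem
# that have not been accessed for more than seven days
#   find / -xdev -type f -name core -atime +7 -exec file {} + | core_paths
```

Once the list looks right, its output can be passed to xargs rm -f as in the original command line. Pathnames that contain spaces or newlines would need extra care.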

To free up more disk space, look through the /tmp and /var/tmp directories for old temporary files and remove them. Keep track of disk usage in /var/mail, /var/spool, and /var/log.

logrotate: Manages Log Files

Rather than deleting or truncating log files, you may want to keep these files for a while in case you need to refer to them. The logrotate utility helps you manage system log (and other) files automatically by rotating (page 1053), compressing, mailing, and removing each as you specify. The logrotate utility is controlled by the /etc/logrotate.conf file, which sets default values and can optionally specify files to be rotated. Typically, logrotate.conf has an include statement that points to utility-specific specification files in /etc/logrotate.d. Following is the default logrotate.conf file:

$ cat /etc/logrotate.conf
# see "man logrotate" for details
# rotate log files weekly
weekly
# keep 4 weeks worth of backlogs
rotate 4
# create new (empty) log files after rotating old ones
create
# uncomment this if you want your log files compressed
#compress
# RPM packages drop log rotation information into this directory
include /etc/logrotate.d
# no packages own wtmp -- we'll rotate them here
/var/log/wtmp {
    monthly
    create 0664 root utmp
    rotate 1
}
# system-specific logs may be also be configured here.

The logrotate.conf file sets default values for common parameters. Whenever logrotate reads another value for one of these parameters, it resets the default value. You have a choice of rotating files daily, weekly, or monthly. The number following the rotate keyword specifies the number of rotated log files that you want to keep. The create keyword causes logrotate to create a new log file with the same name and attributes as the newly rotated log file. The compress keyword (commented out in the default file) causes log files to be compressed using gzip. The include keyword specifies the standard /etc/logrotate.d directory for program-specific logrotate specification files. When you install a program using rpm (page 487) or an rpm-based utility such as yum (page 476), rpm puts the logrotate specification file in this directory.

The last set of instructions in logrotate.conf takes care of the /var/log/wtmp log file (wtmp holds login records; you can view this file with the command who /var/log/wtmp). The keyword monthly overrides the default value of weekly for this file only (because the value is within braces). The create keyword is followed by the arguments establishing the permissions, owner, and group for the new file. Finally rotate establishes that one rotated log file should be kept.

The /etc/logrotate.d/cups file is an example of a utility-specific logrotate specification file:

$ cat /etc/logrotate.d/cups
/var/log/cups/*_log {
    missingok
    notifempty
    sharedscripts
    postrotate
        /etc/init.d/cups condrestart >/dev/null 2>&1 || true
    endscript
}

This file, which is incorporated in /etc/logrotate.d because of the include statement in logrotate.conf, works with each of the files in /var/log/cups that has a filename that ends in _log (*_log). The missingok keyword means that no error will be issued when the file is missing. The notifempty keyword causes logrotate not to rotate the log file if it is empty, overriding the default action of rotating empty log files. The sharedscripts keyword causes logrotate to execute the command(s) in the prerotate and postrotate sections one time only, not one time for each log that is rotated. Although it does not appear in this example, the copytruncate keyword causes logrotate to truncate the original log file immediately after it copies it. This keyword is useful for programs that cannot be instructed to close and reopen their log files because they might continue writing to the original file even after it has been moved. The logrotate utility executes the commands between prerotate and endscript before the rotation begins. Similarly, commands between postrotate and endscript are executed after the rotation is complete.

The logrotate utility has many keywords, many of which take arguments and have side effects. Refer to the logrotate man page for details.
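As an illustration of how these keywords combine, the following specification file is a sketch for a hypothetical application that cannot reopen its log files. The filename /etc/logrotate.d/myapp, the log directory /var/log/myapp, and the choice of eight weekly rotations are all assumptions; only the keywords themselves come from logrotate.

```
# /etc/logrotate.d/myapp (hypothetical file for a hypothetical application)
/var/log/myapp/*.log {
    weekly
    rotate 8
    compress
    missingok
    notifempty
    copytruncate
}
```

Running logrotate with the -d option (for example, logrotate -d /etc/logrotate.conf) reports what would be done without changing any files, which is a safe way to check a new specification file.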

Removing Unused Space from Directories

A directory that contains too many filenames is inefficient. The point at which a directory on an ext2 or ext3 filesystem becomes inefficient varies, depending partly on the length of the filenames it contains. Keep directories relatively small. Having fewer than several hundred files (or directories) in a directory is generally a good idea, and having more than several thousand is generally a bad idea. Additionally, Linux uses a caching mechanism for frequently accessed files to speed the process of locating an inode from a filename. This caching mechanism works only on filenames of up to 30 characters in length, so avoid giving extremely long filenames to frequently accessed files.

When a directory becomes too large, you can usually break it into several smaller directories by moving its contents to those new directories. Make sure that you remove the original directory once you have moved all of its contents.

Because Linux directories do not shrink automatically, removing a file from a directory does not shrink the directory, even though it frees up space on the disk. To remove unused space and make a directory smaller, you must copy or move all the files to a new directory and remove the original directory.

The following procedure removes unused directory space. First remove all unneeded files from the large directory. Then create a new, empty directory. Next move or copy all remaining files from the old large directory to the new empty directory. Remember to copy hidden files. Finally, delete the old directory and rename the new directory.

# mkdir /home/alex/new
# mv /home/alex/large/* /home/alex/large/.[A-z]* /home/alex/new
# rmdir /home/alex/large
# mv /home/alex/new /home/alex/large
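The same procedure can be wrapped in a function that also picks up hidden files whose names begin with a digit or other character that the .[A-z]* pattern does not match. This is a sketch under stated assumptions: the function name rebuild_dir and the temporary directory name are hypothetical, and the new directory gets default ownership and permissions, which you may need to adjust afterward.

```shell
# rebuild_dir DIRECTORY
# Move everything in DIRECTORY (including hidden files) to a fresh
# directory, remove the old one, and rename the new one in its place.
rebuild_dir() {
    old=$1
    new="$old.new.$$"
    mkdir "$new" || return 1
    # The three patterns match ordinary names, names starting with a
    # single dot, and names starting with two dots (but not . or ..).
    for f in "$old"/* "$old"/.[!.]* "$old"/..?*; do
        # Skip a pattern that matched nothing
        [ -e "$f" ] || [ -L "$f" ] || continue
        mv "$f" "$new"/ || return 1
    done
    rmdir "$old" && mv "$new" "$old"
}

# Example:
#   rebuild_dir /home/alex/large
```

If rmdir fails, something was left behind in the old directory; the files already moved remain safely in the new one.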

Optional: Disk Quota System

The disk quota system limits the disk space and number of files owned by individual users. You can choose to limit each user's disk space, the number of files each user can own, or both. Each resource that is limited has two limits. The lower limit, or quota, can be exceeded by the user, although a warning is given each time the user logs in when he is above the quota. After a certain number of warnings (set by the system administrator), the system will behave as if the user had reached the upper limit. Once the upper limit is reached or the user has received the specified number of warnings, the user will not be allowed to create any more files or use any more disk space. The user's only recourse at that point is to remove some files.

Users can review their usage and limits with the quota utility. Superuser can use quota to obtain information about any user.

First you must decide which filesystems to limit and how to allocate space among users. Typically only filesystems that contain users' home directories, such as /home, are limited. Use the edquota utility to set the quotas, and then use quotaon to start the quota system. You will probably want to put a quotaon command into the appropriate init script so that the quota system will be enabled when you bring up the system (page 404). Unmounting a filesystem automatically disables the quota system for that filesystem.

syslogd: Logs System Messages

Traditionally UNIX programs sent log messages to standard error. If a more permanent log was required, the output was redirected to a file. Because of the limitations of this approach, 4.3BSD introduced the system log daemon (syslogd) now used by Linux. This daemon listens for log messages and stores them in the /var/log hierarchy. In addition to providing logging facilities, syslogd allows a single machine to serve as a log repository for a network and allows arbitrary programs to process specific log messages.

syslog.conf

The /etc/syslog.conf file stores configuration information for syslogd. Each line in this file contains one or more selectors and an action, separated by whitespace. The selectors define the origin and type of the messages; the action specifies how syslogd is to process the message. Sample lines from syslog.conf follow (a # indicates a comment):

# Log all kernel messages to the console.
kern.*                                          /dev/console
# Log all the mail messages in one place.
mail.*                                          /var/log/maillog
# Log cron stuff
cron.*                                          /var/log/cron
# Everybody gets emergency messages
*.emerg                                         *
# Save boot messages also to boot.log
local7.*                                        /var/log/boot.log

Selectors

A selector is split into two parts, a facility and a priority, which are separated by a period. The facility indicates the origin of the message. For example, kern messages come from the kernel and mail messages come from the mail subsystem. Following is a list of facility names used by syslogd and the systems that generate these messages:

auth	Authorization and security systems including login
authpriv	Same as auth, but should be logged to a secure location
cron	cron
daemon	System and network daemons without their own categories
kern	Kernel
lpr	Printing subsystem
mail	Mail subsystem
news	Network news subsystem
user	Default facility; all user programs use this facility
uucp	The UNIX-to-UNIX copy protocol subsystem
local0 to local7	Reserved for local use

The priority indicates the severity of the message. The following list of the priority names and the conditions they represent is in priority order:

debug	Debugging information
info	Information that does not require intervention
notice	Conditions that may require intervention
warning	Warnings
err	Errors
crit	Critical conditions such as hardware failures
alert	Conditions that require immediate attention
emerg	Emergency conditions

A selector consisting of a single facility and priority, such as kern.info, causes the corresponding action to be applied to every message from that facility with that priority or higher (more urgent). Use .= to specify a single priority; for example, kern.=info applies the action to kernel messages of info priority. An exclamation point specifies that a priority is not matched, so kern.!info matches kernel messages with a priority lower than info and kern.!=info matches kernel messages with a priority other than info.

A line with multiple selectors, separated by semicolons, applies the action if any of the selectors is matched. Each of the selectors on a line with multiple selectors constrains the match, with subsequent selectors frequently tightening the constraints. For example, the selectors mail.info;mail.!err match mail subsystem messages with info, notice, or warning priorities.

You can replace either part of the selector with an asterisk to match anything. The keyword none in either part of the selector indicates no match is possible. The selector *.crit;kern.none matches all critical or higher-priority messages, except those from the kernel.
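The matching rules for the priority part of a selector can be expressed as a short shell function. This is only an illustration of the rules described in this section, not code from syslogd; the function names prio_rank and priority_matches are hypothetical, and the facility part of the selector is left out for brevity.

```shell
# prio_rank NAME
# Print the rank of a priority name (debug is lowest, emerg is highest).
prio_rank() {
    case $1 in
        debug)   echo 0 ;;
        info)    echo 1 ;;
        notice)  echo 2 ;;
        warning) echo 3 ;;
        err)     echo 4 ;;
        crit)    echo 5 ;;
        alert)   echo 6 ;;
        emerg)   echo 7 ;;
        *)       return 1 ;;
    esac
}

# priority_matches SPEC PRIORITY
# SPEC is the part of a selector after the period: info, =info, !info,
# !=info, * or none. Returns 0 when a message of PRIORITY matches SPEC.
priority_matches() {
    spec=$1
    msg=$(prio_rank "$2") || return 2
    case $spec in
        '*')   return 0 ;;
        none)  return 1 ;;
        '!='*) want=$(prio_rank "${spec#!=}") || return 2
               [ "$msg" -ne "$want" ] ;;
        '!'*)  want=$(prio_rank "${spec#!}") || return 2
               [ "$msg" -lt "$want" ] ;;
        '='*)  want=$(prio_rank "${spec#=}") || return 2
               [ "$msg" -eq "$want" ] ;;
        *)     want=$(prio_rank "$spec") || return 2
               [ "$msg" -ge "$want" ] ;;
    esac
}

# Example: mail.info;mail.!err accepts a warning but rejects an err
#   priority_matches info warning && priority_matches '!err' warning
```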

Actions

The action specifies how syslogd processes a message that matches the selector. The simplest actions are ordinary files, which are specified by their absolute pathnames; syslogd appends messages to these files. Specify /dev/console if you want messages sent to the system console. If you want a hardcopy record of messages, you can specify a device file that represents a dedicated printer.

You can write important messages to a specific user's terminal by specifying a username, such as root, or a comma-separated list of usernames. Very important messages can be written to every logged-in terminal by using an asterisk.

To forward messages to syslogd on a remote system, specify the name of the system preceded by @. It is a good idea to forward critical messages from the kernel to another system because these messages often precede a system crash and may not be saved to the local disk. The following line from syslog.conf sends critical kernel messages to grape:

kern.crit      @grape

Because syslogd is not configured by default to enable logging over the network, you must edit the /etc/sysconfig/syslog file on the remote system (grape in this case) so that syslogd is started with the -r option. After you modify the syslog file, restart syslogd using the syslog init script.