Health Checks and System Monitoring


In previous chapters, we described how to install and configure the system and its software components. We also covered the steps required to add users to the system. In the following sections, we monitor how the two interact.

Daily cycle checks of machines and equipment are an important facet of offering robust services. Only by knowing what is going on can you get a feel for what may need attention. The frequency and depth of these checks depend on your comfort level with the system and its importance to the organization. As a rule of thumb, daily checks are a good starting point. Here is a suggested list of what should be checked:

  • Machine uptime

  • Log review

  • Top consumers

  • Application checks

  • User login activity

  • System resource checks

The examples given in this chapter are based on our own personal preferences. They are intended as a quick guide for extracting system information on the fly. The examples are in no way an exhaustive list of the tools available; they are just our favorites. As you become more knowledgeable about the system, you will almost certainly progress beyond simple commands and move into scripts or third-party layered applications. However, we believe these tools will suffice as a starting point.

Machine Uptime

Though current versions of Linux are very robust, it is important to check how long the system has been running. System reboots can be caused by hardware failures, software failures, or human intervention. If an unscheduled reboot occurred, it is important to find out why.

The uptime command provides a quick snapshot of the state of the system, including the current time, the time the system has been running, and user load:

 Athena:~ # uptime
   8:47am  up   3:18,  3 users,  load average: 0.30, 0.29, 0.21
 Athena:~ #

As this example illustrates, the machine has been running for just over three hours. Because we require our system to be very robust, such a short uptime should lead to some sort of investigation. Typically, this will be due to a scheduled shutdown for hardware maintenance. If this is not the case, a review of the logs might indicate what is going on.

Load averages are supplied to provide a quick verification of the status of the system. Dramatic shifts in load between the 1-, 5-, and 15-minute figures could alert the system administrator to potential problems. Increases can indicate a runaway process, whereas drops in activity can indicate that clients can no longer reach the server.
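The comparison of the three load figures can be automated. The following is a minimal sketch rather than a definitive tool: the load_trend function name and the factor-of-two threshold are our own assumptions, chosen only for illustration. It takes the 1-, 5-, and 15-minute averages as arguments and reports whether the load is rising, falling, or steady.

```shell
# Hypothetical helper: classify the load trend from the three load averages.
# Arguments: the 1-, 5-, and 15-minute figures, as printed by uptime.
# The factor of 2 and the floor of 1 are arbitrary illustration thresholds.
# The 5-minute figure ($2) is accepted for convenience but not used here.
load_trend() {
    awk -v one="$1" -v fifteen="$3" 'BEGIN {
        if (one > 2 * fifteen && one > 1)
            print "rising"      # possible runaway process
        else if (fifteen > 2 * one && fifteen > 1)
            print "falling"     # clients may no longer reach the server
        else
            print "steady"
    }'
}
```

With the figures from the uptime example, load_trend 0.30 0.29 0.21 reports a steady load. On a Linux system the same three figures are the first fields of /proc/loadavg, so they can be read from there instead of being typed in.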

Log Review

When you are reviewing the health of a system, it is important to look through the logs. Each system component places information in the appropriate log file. This information can be as simple as a time stamp or process milestone marker, or the log may contain information on the pending failure of a hardware component.

You can find system boot information in two ways. First, you can use the dmesg command to examine the content of the kernel ring buffer. This buffer contains the most current messages generated by the kernel. At boot time, many of the important hardware discovery phases are logged here. A review of this information yields detailed hardware information on BIOS, memory, disk, and IRQs. Second, the boot information is available in the /var/log/boot.msg file, which provides even more detail: it contains all the messages from various sources generated during a system startup. It is a good idea to review these sources periodically to ensure that the detected hardware matches what you expect and that no warning messages are present that might affect system performance.

Linux maintains application-specific log files. The location of the log files and their content can be controlled through the /etc/syslog.conf file. In a default install of SLES, you can find all application-specific log information under the /var/log directory structure.

Within this structure, many individual services such as YaST, Samba, and Apache maintain distinct directories of application-specific messages. It is important to review many of these service log files periodically. They can not only reveal local configuration errors within the system, but they also log connection information from all sources.

To identify attack vectors such as infected machines and system probes, you can review connection failure logs. Additionally, you should check Internet-facing machine logs frequently for signs of valid connections requesting nonexistent resources such as login attempts on default account names or requests for invalid web pages.

NOTE

To get an appreciation for the information contained in log files, simply review the Apache failure logs for an Internet-facing machine. The number of errors generated by requests for nonexistent pages can be staggering. A large number of these requests come from script kiddies who launch IIS-specific attacks against Apache servers. Once these individuals discover your site, it is a good idea to keep a vigilant eye on these logs.


It is also very important to check the firewall logs. By default, they are merged into the standard /var/log/messages file. We have stated many times how important it is to offer only services that are required. A periodic review of the firewall logs will expose connection failures. The source of the failure and the nature of the access requested could bring to light misconfigured client machines and possibly exploit attempts. In many instances, infected desktop computers are capable of generating volumes of traffic and probes as the worm or virus tries to replicate. A proper review of server-side firewall logs for internal (trusted network) events is just as important as dissecting the external firewall logs. This topic is explored further in Chapter 12, "Intrusion Detection."

Two log files that you will reference the most often are warn and messages in /var/log. These files contain most of the regular error messages generated by applications and users. Some of the important messages you will want to scan for could be requests for elevated privileges such as

 Mar 13 11:26:23 Athena sudo: hart : 3 incorrect password attempts ; TTY=ptst 

It will take a second to verify that hart is allowed access to sudo and that this user is just having a hard day. On the other hand, if a different username were to show up, it may be an indication of someone trying to gain unsanctioned access.
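Scanning for such messages by eye becomes tedious on a busy server. As a sketch only, the following function tallies the incorrect password attempts per user. It assumes the log lines have exactly the layout of the sample above (timestamp, host, "sudo:", then the username); the function name sudo_failures is our own, and other syslog configurations may format these lines differently.

```shell
# Hypothetical helper: total the sudo password failures per user.
# Reads syslog-style lines on standard input and assumes this field layout:
#   Mar 13 11:26:23 Athena sudo: hart : 3 incorrect password attempts ; ...
#   $1  $2 $3       $4     $5    $6   $7 $8
sudo_failures() {
    awk '$5 == "sudo:" && /incorrect password attempts/ {
             count[$6] += $8          # add the reported number of attempts
         }
         END { for (user in count) print user, count[user] }' | sort
}
```

A typical invocation would feed it one of the log files discussed above, for example sudo_failures < /var/log/messages. Any username in the result that is not expected to use sudo deserves a closer look.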

This section emphasized the importance of log files in tracking system activities. It is important to note that as these log files grow, they will also need to be maintained. Log file retention is essential in guaranteeing that you have a continuous log of system activity.

In the event of an incident that seems to indicate a system compromise, archived logs are essential in tracking down the history of the intrusion. Often, daily reviews of the log files are not possible. Using a utility such as logrotate can ensure that logs are stored in a central location in a manageable format.

Top Consumers

Log files are important for tracing historical events. Unless a process generates a fault or an informational message, it may exist for quite some time unnoticed. As part of your daily exercise, it is a good idea to get a feel for the load on a server. This type of monitoring will help you understand the resource consumption on a typical day, forecast when a resource upgrade may be required, and quickly identify abnormal loads.

You can use the top command to generate a dynamic listing of the current process load on a system. As applications require more resources, they percolate up the list. Listing 7.1 shows an example of the output generated by top. The information presented is dynamic; this is, of course, just a snapshot in time.

Listing 7.1. Typical Output from the top Command
 top - 16:57:05 up 11:27,  4 users,  load average: 0.15, 0.03, 0.01
 Tasks:  77 total,   1 running,  76 sleeping,   0 stopped,   0 zombie
 Cpu(s):  0.3% us,  0.7% sy,  0.0% ni, 99.0% id,  0.0% wa,  0.0% hi,  0.0% si
 Mem:    190012k total,   185668k used,     4344k free,    31608k buffers
 Swap:   570268k total,        8k used,   570260k free,    30012k cached
   PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEM   TIME+  COMMAND
  4514 root  15   0 33796  11m  24m S  0.3  6.5  0:34.42 X
  5329 root  15   0 22696  13m  16m S  0.3  7.5  0:14.43 gnome-terminal
     1 root  16   0   588  244  444 S  0.0  0.1  0:04.21 init
     2 root  RT   0     0    0    0 S  0.0  0.0  0:00.00 migration/0
     3 root  34  19     0    0    0 S  0.0  0.0  0:00.00 ksoftirqd/0
     4 root   5 -10     0    0    0 S  0.0  0.0  0:00.07 events/0
     5 root  15 -10     0    0    0 S  0.0  0.0  0:00.00 kacpid
     6 root   5 -10     0    0    0 S  0.0  0.0  0:00.09 kblockd/0
     8 root   5 -10     0    0    0 S  0.0  0.0  0:00.01 khelper
     9 root  15   0     0    0    0 S  0.0  0.0  0:01.04 pdflush
    10 root  15   0     0    0    0 S  0.0  0.0  0:00.74 pdflush
    12 root  15 -10     0    0    0 S  0.0  0.0  0:00.00 aio/0
    11 root  15   0     0    0    0 S  0.0  0.0  0:02.14 kswapd0
   160 root  25   0     0    0    0 S  0.0  0.0  0:00.00 kseriod
   203 root  25   0     0    0    0 S  0.0  0.0  0:00.00 scsi_eh_0
   382 root   5 -10     0    0    0 S  0.0  0.0  0:00.00 reiserfs/0
   561 root   6 -10     0    0    0 S  0.0  0.0  0:00.00 kcopyd

The top command is very useful when the machine appears to be sluggish. A quick glance at its output can reveal whether the situation is due to CPU consumption, excessive swapping, or particular processes running at odd priorities.

The top command also lists the name of the running executable. As you become familiar with the most common names, unusual user-written applications will stand out in the list. By matching the user and command columns, you can quickly identify the rogue application.
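Because top is interactive, it is awkward to use from a daily check script. One possible alternative, sketched here under our own assumptions, is to rank a ps listing by CPU usage. The function name rank_by_cpu is hypothetical, and the function expects a header line followed by rows whose third column is the CPU percentage, such as the output of ps -eo pid,user,pcpu,pmem,comm.

```shell
# Hypothetical helper: keep the header of a process listing and print the
# N rows with the highest value in column 3 (assumed to be %CPU).
# Usage: ps -eo pid,user,pcpu,pmem,comm | rank_by_cpu 5
rank_by_cpu() {
    limit="${1:-5}"                       # default to the top five rows
    {
        # Pass the header line through unchanged.
        IFS= read -r header && printf '%s\n' "$header"
        # Sort the remaining rows numerically on column 3, largest first.
        sort -k3,3nr | head -n "$limit"
    }
}
```

Running this at the same time each day gives a simple record of which users and commands normally lead the list, which makes an unfamiliar entry easier to spot.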

Another tool that is very useful is the w command. This command provides a quick summary of who is currently logged on to the system and what programs they are running. On the current system, w returns the following output:

 Athena:~ # w
  13:39:33 up  9:09,  6 users,  load average: 0.00, 0.05, 0.02
 USER     TTY   LOGIN@ IDLE   JCPU  PCPU  WHAT
 root     :0    04:31  ?xdm?  4:21  0.11s -:0
 root     pts/0 04:32  3:08   4.83s 0.41s bash
 belandja pts/1 04:43  0.00s  0.84s 0.10s login -- belandja
 belandja pts/2 05:24  1.00s  0.39s 0.07s wget http://cnn.com
 eric     pts/3 13:36  3:10   0.38s 0.15s vi index.html
 hart     pts/4 13:38  1:06   0.53s 0.34s top
 Athena:~ #

From this, you can get a good idea as to who is currently working on the system, what they are doing, and what type of resource impact they are having on the system.

Application Check

When applications become unresponsive to the end user, it typically does not take a long time for the operations crew to be made aware of the situation. Before you forage through the appropriate log file, though, a quick glance at the running processes might be in order.

One method for checking on the presence of an application is to use the ps command. The Process Status command (ps) generates a list of all the running processes on a system. You can use modifiers to control what is reported and then filter the output through grep to extract information for the target service:

 Athena:~ # ps -ef | grep -e telnet
 root  5408 5387 0 04:43 ?     00:00:01 in.telnetd: 192.168.1.100
 root  5766 5387 0 05:23 ?     00:00:00 in.telnetd: 192.168.1.100
 root  5957 5920 0 06:09 pts/1 00:00:00 grep -e telnet

In this example, the -e (every process) and -f (full listing format) modifiers were used to qualify the ps command. The output from ps was then filtered through grep to extract only the records containing the Telnet process. The last line of the output is the grep command itself, which matches its own search string. It is important to remember that, in this case, Telnet does not have its own daemon process. It is part of the xinetd server offerings. As such, the preceding command returns existing Telnet connections, not the presence of the actual server process.
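The final line of the ps listing above is the grep process matching its own command line. A common shell idiom for avoiding this is to wrap the first character of the pattern in brackets, so that the pattern still matches the service name but no longer matches the text of the grep command. The sketch below assumes a hypothetical function name, match_process, and a pattern whose first character is an ordinary letter or digit.

```shell
# Hypothetical helper: filter a process listing for a name without
# matching the grep process itself.
# Usage: ps -ef | match_process telnet
match_process() {
    rest=${1#?}               # the pattern without its first character
    first=${1%"$rest"}        # the first character on its own
    # "telnet" becomes "[t]elnet": it matches "in.telnetd", but not the
    # literal text "[t]elnet" shown on grep's own command line.
    grep -e "[$first]$rest"
}
```

Because grep exits with a nonzero status when nothing matches, the function can also be used in a test such as: if ps -ef | match_process telnet > /dev/null; then ... fi.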

A different method for checking the existence of a service is to verify whether its characteristic port is being advertised by the server. Because each service, be it httpd, sshd, or telnetd, offers a specific port to the network, you can quickly check for an open listener process on the port. The netstat command can list various attributes of a server's networking environment. You can specify the -l parameter to list all listeners and the -p parameter to identify the program offering the service. The result would look like the following:

 Athena:~ # netstat -lp | grep -e http
 tcp      0    0 *:https   *:*    LISTEN  4601/httpd2-prefork
 Athena:~ #

By using the process ID, you can find all related processes and identify the routines offering the service:

 Athena:~ # ps -ef | grep -e 4601
 root      4601     1  0 04:31 ?      00:00:00 /usr/sbin/httpd2-prefork -f /etc/apache2/httpd.conf
 wwwrun    4602  4601  0 04:31 ?      00:00:00 /usr/sbin/httpd2-prefork -f /etc/apache2/httpd.conf
 wwwrun    4603  4601  0 04:31 ?      00:00:00 /usr/sbin/httpd2-prefork -f /etc/apache2/httpd.conf
 wwwrun    4604  4601  0 04:31 ?      00:00:00 /usr/sbin/httpd2-prefork -f /etc/apache2/httpd.conf
 wwwrun    4605  4601  0 04:31 ?      00:00:00 /usr/sbin/httpd2-prefork -f /etc/apache2/httpd.conf
 wwwrun    4606  4601  0 04:31 ?      00:00:00 /usr/sbin/httpd2-prefork -f /etc/apache2/httpd.conf
 root      6807  5920  0 13:28 pts/1  00:00:00 grep -e 4601
 Athena:~ #

Using this approach, you can quickly verify the presence of a service through its process name or port offerings. You can then check the application-specific log files for possible errors. If none are present, the application unresponsiveness may be due to resource constraints from other processes.
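The port check lends itself to scripting. The following sketch assumes a hypothetical function name, is_listening, and expects the numeric listing produced by netstat -ltn on standard input (the -t and -n parameters restrict the list to TCP and print port numbers rather than service names). It succeeds only if some TCP socket is in the LISTEN state on the requested port.

```shell
# Hypothetical helper: succeed if a TCP listener exists on the given port.
# Expects "netstat -ltn" style input, where column 4 is the local address
# (for example 0.0.0.0:443 or :::443) and column 6 is the socket state.
# Usage: netstat -ltn | is_listening 443 && echo "https is up"
is_listening() {
    awk -v port="$1" '
        $1 ~ /^tcp/ && $6 == "LISTEN" {
            n = split($4, parts, ":")     # the port follows the last colon
            if (parts[n] == port) found = 1
        }
        END { exit !found }               # exit status 0 only when found
    '
}
```

A daily check could loop over the ports the server is supposed to offer and report any for which the function fails.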

System Resource Check

A server can be thought of as a container for resources. Application programs, in turn, consume these resources. Well-behaved applications start up, allocate a portion of available memory, utilize a moderate amount of CPU, write a controlled amount of information back to the appropriate files, and then terminate.

Of the resources used by an application, only the CPU resource can be thought of as unlimited. Though only so much processing power is available at any moment in time, a well-behaved application consumes only a little bit at a time. The faster the CPU, the faster the job completes. After the job has terminated, the portion of the CPU's time spent processing the application is now available for other processes.

The same cannot be said for memory and disk resources. When their total complement is consumed, no further processing can take place until some of these resources are freed. In most cases, a well-behaved program releases all its allocated memory upon exit. When this does not happen properly, the application is said to have a memory leak. If the application is run a sufficient number of times, eventually the memory leak will consume all the available process memory on the machine.

Similarly, applications that create large output data files and log files often consume vast amounts of disk space. Though disk quotas keep user applications from filling the disk, typical services are not constrained with regard to disk space consumption. In the event of a misbehaving client application or an attack, it is quite possible that the service log files consume all available disk space in a partition.

The preceding describes what could be considered the worst-case scenario. Diligence will ensure that these conditions are less likely to happen. The following commands are suggested additions to your daily server health checks.

You can address memory consumption using the top command discussed earlier in the chapter. You also can use a number of other commands to determine the current memory demands:

free lists the current amount of free memory on the system. The -m parameter lists the amounts in megabytes.

 Athena:~ # free -m
              total   used   free     shared    buffers     cached
 Mem:           185    173     11          0          4         40
 -/+ buffers/cache:    128     57
 Swap:          556      0    556
 Athena:~ #

procinfo displays the system status information contained within the /proc filesystem. The following is a truncated version of the information returned:

 Athena:~ # procinfo
 Linux 2.6.5-7.97-smp(geeko@buildhost)(gcc 3.3.3 ) #1 1CPU[Athena.]
 Memory:      Total        Used        Free      Shared     Buffers
 Mem:        190012      181224        8788           0        6000
 Swap:       570268           8      570260
 Bootup:Sun Mar 13 04:29:44 2005 Load average:0.06 0.09 0.04 1/99 8152
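For a scripted check, the same memory figures can be read directly from the /proc filesystem. The sketch below is an illustration under our own assumptions: the function name mem_free_percent is hypothetical, and it expects lines in the style of /proc/meminfo (a label, a value in kilobytes). It counts buffers and cache as available, in the same spirit as the "-/+ buffers/cache" row printed by free.

```shell
# Hypothetical helper: percentage of memory that is free or reclaimable.
# Reads /proc/meminfo style lines ("MemTotal:  190012 kB") on standard input.
# Usage: mem_free_percent < /proc/meminfo
mem_free_percent() {
    awk '$1 == "MemTotal:" { total  = $2 }
         $1 == "MemFree:"  { avail += $2 }
         $1 == "Buffers:"  { avail += $2 }   # buffers can be reclaimed
         $1 == "Cached:"   { avail += $2 }   # exact match skips SwapCached
         END { if (total > 0) printf "%d\n", avail * 100 / total }'
}
```

A value that shrinks steadily from day to day while the workload stays the same is one possible sign of the memory leak described earlier.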

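Disk space consumption can be tracked using the following commands.

du summarizes the space used by a directory tree. The -h parameter prints human-readable sizes, -s prints one summary line per argument, and -c adds a grand total: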

 Athena:~ # du -hsc /var
 215M    /var
 215M    total

df displays disk usage by mounted filesystem:

 Athena:~ # df -h
 Filesystem            Size  Used Avail Use% Mounted on
 /dev/sda2             6.5G  2.9G  3.6G  45% /
 tmpfs                  93M  8.0K   93M   1% /dev/shm
 /dev/sdb1              14G   45M   14G   1% /home
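To turn this listing into a daily check, you can flag any filesystem whose usage exceeds a threshold. The following is a sketch only: the function name full_filesystems and the default limit of 90 percent are our own assumptions. It expects df output on standard input, with the usage percentage in the fifth column and the mount point in the sixth; the -P parameter of df keeps each filesystem on a single line, which makes the output safer to parse.

```shell
# Hypothetical helper: list mount points at or above a usage threshold.
# Expects "df -P" style input: column 5 is Use% and column 6 the mount point.
# Usage: df -P | full_filesystems 90
full_filesystems() {
    awk -v limit="${1:-90}" '
        NR > 1 {                          # skip the header line
            use = $5
            sub(/%/, "", use)             # "45%" becomes "45"
            if (use + 0 >= limit) print $6, $5
        }'
}
```

No output means that every filesystem is below the limit. On the system shown above, a limit of 40 would report only the root filesystem.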

mount, though typically used for adding resources, can be typed without parameters to display a list of what is currently mounted on the system. A systematic review of this information, coupled with the content of /etc/fstab, may identify missing partitions or mount points:

 Athena:~ # mount
 /dev/sda2 on / type reiserfs (rw,acl,user_xattr)
 proc on /proc type proc (rw)
 tmpfs on /dev/shm type tmpfs (rw)
 devpts on /dev/pts type devpts (rw,mode=0620,gid=5)
 /dev/hdc on /media/dvd type subfs (ro,nosuid,nodev,fs=cdfss,procuid,iocharset=utf8)
 /dev/fd0 on /media/floppy type subfs (rw,nosuid,nodev,sync,fs=floppyfss,procuid)
 /dev/sdb1 on /home type ext3 (rw,acl,user_xattr,usrquota)
 usbfs on /proc/bus/usb type usbfs (rw)
 Athena:~ #
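The comparison between the mount listing and /etc/fstab can be sketched as a small function. Everything about it beyond the two file formats is our own assumption: the name missing_mounts is hypothetical, swap entries and entries marked noauto are skipped because they are not expected in the mount listing, and the mount output on standard input is assumed to be non-empty.

```shell
# Hypothetical helper: print fstab mount points that are not mounted.
# Standard input: output of mount ("/dev/sda2 on / type reiserfs (...)").
# Argument 1:     an fstab-format file (device, mount point, type, options).
# Usage: mount | missing_mounts /etc/fstab
missing_mounts() {
    awk 'NR == FNR { mounted[$3] = 1; next }    # first input: mount output
         /^[ \t]*(#|$)/ { next }                # skip comments, blank lines
         $3 == "swap" || $4 ~ /noauto/ { next } # not expected to be mounted
         !($2 in mounted) { print $2 }' - "$1"
}
```

Any mount point printed by the function appears in the fstab file but not in the current mount listing, which is exactly the kind of missing partition the review is meant to catch.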

User Login Activity

Several situations require auditing system access. Auditing is not done to invade anyone's privacy but is a requirement for tracking both system performance and application access. Knowing system load, response times, and processes generating the load is important in order to provide satisfactory service. Another reason for tracking users is to facilitate the investigation of spurious and inappropriate activity that might indicate a compromise of the system. This topic will be explored further in Chapter 12. In this section, we examine a number of commands that can provide user access information.

The w or who commands display which accounts are currently logged in to the system. Here's an example:

 Athena:~ # w
  14:33:32 up  2:10,  7 users,  load average: 0.07, 0.11, 0.06
 USER     TTY   LOGIN@ IDLE   JCPU   PCPU  WHAT
 ted      tty5  14:29  3:14   0.20s  0.04s perl extract_P2_pulse.pl
 root     :0    12:28  ?xdm?  1:53   0.11s -:0
 root     pts/0 12:28  3:06   2.33s  0.25s bash
 belandja pts/1 12:29  0.00s  1.68s  1.36s ssh -l root 192.168.1.242
 root     pts/2 12:30  0.00s  0.73s  0.04s w
 pol      pts/3 14:25  6:34   0.40s  0.16s vi start_femtopulse.csh
 mark     pts/4 14:31  14.00s 0.62s  0.38s -bash

If more information is required for the currently signed-on users, you need to access the information more indirectly. Each user on the running system creates a login session. This login session is the parent for additional child processes. Additional user information can be extracted by accessing the currently active process information. You can do this in a number of ways. Each of the following examples reports information with varying amounts of detail.

pstree displays a tree of processes:

 Athena:~ # pstree -U pol
 sshd───bash───vi

ps reports process status:

 Athena:~ # ps -ef | grep -e pol
 root  5786  3244  0 14:25 ?     00:00:00 sshd: pol [priv]
 pol   5789  5786  0 14:25 ?     00:00:00 sshd: pol@pts/3
 pol   5790  5789  0 14:25 pts/3 00:00:00 -bash
 pol   5814  5790  0 14:26 pts/3 00:00:00 vi start_femtopulse.csh
 root  6122  5383  0 14:33 pts/2 00:00:00 grep -e pol

lsof lists open files:

 Athena:~ # lsof
 COMMAND  PID USER  FD  TYPE     DEVICE   SIZE    NODE NAME
 sshd    5789  pol cwd   DIR        8,2    592       2 /
 sshd    5789  pol rtd   DIR        8,2    592       2 /
 sshd    5789  pol txt   REG        8,2 288184   83822 /usr/sbin/sshd
 sshd    5789  pol mem   REG        8,2 107122   19361 /lib/ld-2.3.3.so
 sshd    5789  pol mem   REG        8,2  36895   29038 /lib/libwrap.so.0.7.6
 sshd    5789  pol mem   REG        8,2  33263   29436 /lib/libpam.so.0.77
 sshd    5789  pol mem   REG        8,2  13647   19370 /lib/libdl.so.2
 . . .
 sshd    5789  pol  2u   CHR        1,3          42156 /dev/null
 sshd    5789  pol  3u  unix 0xc31a8700          11642 socket
 sshd    5789  pol  4u  IPv6      11623            TCP Athena.UniversalExport.ca:ssh->192.168.1.240:1037 (ESTABLISHED)
 sshd    5789  pol  5r  FIFO        0,7          11644 pipe
 . . .
 vi      5814  pol  1u   CHR      136,3              5 /dev/pts/3
 vi      5814  pol  2u   CHR      136,3              5 /dev/pts/3
 vi      5814  pol  3u   REG       8,33  12288 1048626 /research/home/pol/.start_femtopulse.csh.swp

The lsof command returns a great deal of information about the selected user. A majority of the files listed are standard system libraries and modules. Depending on the details required, this command may yield too much information to be useful.

NOTE

The lsof command will prove itself quite useful in Chapter 13, "System Security." In that chapter, we will be required to enumerate all files used by a specific application. lsof can perform this task quite easily.


Other commands can be used to extract information indicating when users accessed the server in the past. This information may be important if you are trying to find out who was on the system at a specific point in time. The last command provides information on all logins, their source, as well as the period of time the session was active. A couple of examples of the last command are shown here:

 Athena:~ # last hart
 hart pts/1 192.168.1.100    Sat Feb 26 10:08 - down   (13:48)
 hart pts/1 192.168.1.100    Sat Feb 26 09:53 - 09:54  (00:01)
 hart pts/1 192.168.1.100    Sat Feb 26 08:44 - 09:53  (01:08)
 hart pts/1 192.168.1.100    Wed Feb 16 04:38 - 04:39  (00:00)
 hart pts/2 Athena.Universal Tue Feb 15 19:49 - 20:20  (00:31)
 hart pts/4 192.168.1.77     Tue Feb 15 06:57 - 07:34  (00:37)
 wtmp begins Mon Feb 14 19:00:40 2005
 Athena:~ #

Without specifying a username, it is possible to generate a login log for all users on the system:

 Athena:~ # last
 hart     pts/1     192.168.1.100    Sat Feb 26 10:08 - down   (13:48)
 eric     pts/1     192.168.1.100    Sat Feb 26 10:04 - 10:08  (00:04)
 eric     pts/1     192.168.1.100    Sat Feb 26 09:55 - 10:00  (00:04)
 eric     pts/1     192.168.1.100    Sat Feb 26 09:55 - 09:55  (00:00)
 belandja pts/4     192.168.1.100    Sat Feb 26 09:54 - down   (14:02)
 hart     pts/1     192.168.1.100    Sat Feb 26 09:53 - 09:54  (00:01)
 root     pts/3     :0.0             Sat Feb 26 09:46 - down   (14:11)
 belandja pts/2     192.168.1.100    Sat Feb 26 09:12 - 10:14  (01:01)
 hart     pts/1     192.168.1.100    Sat Feb 26 08:44 - 09:53  (01:08)
 root     pts/0     :0.0             Sat Feb 26 08:44 - down   (15:12)
 wtmp begins Mon Feb 14 19:00:40 2005
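When the login history is long, a summary by source is often easier to review than the raw list. As a sketch under our own assumptions, the following function counts logins for each user and source pair. The name login_sources is hypothetical, and the function assumes the column layout shown above, where the first column is the username and the third is the source.

```shell
# Hypothetical helper: count logins per user and source address.
# Reads the output of last on standard input; column 1 is the user and
# column 3 the source. The "wtmp begins" trailer, blank lines, and reboot
# pseudo-entries are ignored.
# Usage: last | login_sources
login_sources() {
    awk 'NF > 0 && $1 != "wtmp" && $1 != "reboot" {
             count[$1 " " $3]++
         }
         END { for (pair in count) print pair, count[pair] }' | sort
}
```

A source that appears only once for an account, such as 192.168.1.77 for hart in the earlier listing, is a natural starting point if that account's activity is ever in question.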

You can use a separate command to quickly extract the last login times of certain accounts. This can be a quick way to check for unused accounts or accounts that are being used but shouldn't be. Here's an example of the lastlog command:

 Athena:~ # lastlog
 Username Port  From             Latest
 root     pts/2 athena.universal Sun Feb 27 12:30:43 -0500 2005
 bin                             **Never logged in**
 daemon                          **Never logged in**
 lp                              **Never logged in**
 games                           **Never logged in**
 man                             **Never logged in**
 . . .
 ldap                            **Never logged in**
 dhcpd                           **Never logged in**
 belandja 1     192.168.1.100    Sun Feb 27 12:29:40 -0500 2005
 eric     2     192.168.1.100    Sun Feb 27 02:23:43 -0500 2005
 hart     1     192.168.1.100    Sat Feb 26 10:08:46 -0500 2005
 peter                           **Never logged in**
 pol      pts/3 192.168.1.240    Sun Feb 27 14:25:20 -0500 2005
 mark     pts/4 hermes.universal Sun Feb 27 14:31:03 -0500 2005
 ted      tty5                   Sun Feb 27 14:29:25 -0500 2005

This command provides a listing of all the accounts on the system as well as times they were last logged in to the system and the source of the login. Because of the large number of system process accounts, the list can grow quite large. You might be tempted to parse the output of this command using grep to remove all accounts that have never been used:

 Athena:~ # lastlog | grep -v -e "*Never"
 Username  Port     From             Latest
 root      pts/2    athena.universal Sun Feb 27 12:30:43 -0500 2005
 belandja  1        192.168.1.100    Sun Feb 27 12:29:40 -0500 2005
 eric      2        192.168.1.100    Sun Feb 27 02:23:43 -0500 2005
 hart      1        192.168.1.100    Sat Feb 26 10:08:46 -0500 2005
 pol       pts/3    192.168.1.240    Sun Feb 27 14:25:20 -0500 2005
 mark      pts/4    hermes.universal Sun Feb 27 14:31:03 -0500 2005
 ted       tty5                      Sun Feb 27 14:29:25 -0500 2005

Parsing the output this way is an acceptable practice if you are looking for individual accounts that you know have been active. The raw lastlog command does, however, provide additional information about the system process accounts. Verifying that they have not been used for an interactive login could prove quite useful.

Knowing that a specific account was used in an inappropriate manner is one thing. Being able to identify the individual who used the account is much more difficult. Verifying the number of login failures for an account is an important step. Excessive login failures might be an indication of a password-guessing attack, possibly a successful one. The faillog command allows you to check the number of times a login attempt was unsuccessful against each account:

 Athena:~ # faillog
 Username   Failures  Maximum  Latest
 root              3        0  Wed Feb 16 03:49:05 -0500 2005 on 0
 belandja          0        0  Mon Feb 14 19:02:11 -0500 2005 on 1
 eric              0        0  Tue Feb 15 18:06:28 -0500 2005 on 2
 hart              0        0  Tue Feb 15 06:57:10 -0500 2005 on 4

A large number of login failures on an account might indicate that the password on the account was hacked. It may also be a ruse. Having a sound account password policy and including the PAM module pam_tally to lock out accounts after a set number of failed attempts are important steps you can take. This way, you can mitigate some of the exposure to password-harvesting attacks. This also helps you ensure that the person who last logged in to a specific account knew the password.
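As part of a daily check, the faillog listing can be reduced to the accounts that actually have failures recorded. The sketch below is illustrative only: the function name failed_accounts and the default threshold of one failure are our own assumptions, and the function expects the column layout shown in the faillog example (username first, failure count second).

```shell
# Hypothetical helper: list accounts with at least N recorded login failures.
# Reads faillog output on standard input; column 1 is the username and
# column 2 the failure count. The header line is skipped.
# Usage: faillog | failed_accounts 3
failed_accounts() {
    awk -v min="${1:-1}" 'NR > 1 && $2 + 0 >= min { print $1, $2 }'
}
```

An empty result means that no account has reached the threshold. Any account that does appear can then be examined with the last command for logins that followed the failures.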

In this section, you saw numerous methods for extracting user login information. You can use this information to help investigate resource issues. This information can also be used as a starting point in investigating suspect behavior on the system.



    SUSE LINUX Enterprise Server 9 Administrator's Handbook
    ISBN: 067232735X
    Year: 2003
    Pages: 134
