Identifying Important System Monitoring Categories | UNIX Fault Management: A Guide for System Administrators


This chapter identifies important system resources to monitor, so that you can detect faults, avoid problems, and ultimately ensure availability. Many system resources can be monitored for events and faults, and many system management tools are available with which to monitor them. Instead of categorizing tools by specific hardware components, this chapter relates its descriptions of tools to the different ways of monitoring your system. For example, your focus as an operator may be on watching for system faults and failures, software or hardware configuration changes, system resource usage, performance management, or security. This chapter shows each tool's role, if any, in each of these monitoring categories.

Monitoring System Configuration Changes

This category includes monitoring for changes in hardware and software configurations. Such changes can be caused by, for example, an operating system upgrade, patches applied to the system, changes to kernel parameters, or the installation of a new software application. The root cause of system problems can often be traced back to an inappropriate hardware or software configuration change. Because the problem that a change causes may remain latent for a long period before it surfaces, it is important to keep accurate records of these changes.

Adding or removing hardware devices typically requires the system to be restarted, so hardware configuration changes can be tracked indirectly: remote monitoring tools notice the resulting change in system status. Software configuration changes, such as the installation of a new application, are not tracked in this way, so reporting tools are needed. Also, more systems are becoming capable of adding hardware components online, without a restart, so direct hardware configuration tracking is becoming increasingly important.
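As a rough illustration of recording software configuration changes, the following Python sketch takes a snapshot of file content digests and compares two snapshots. It is a minimal sketch, not one of the tools described in this chapter; the function names `snapshot` and `diff_snapshots` are hypothetical, and the list of files to watch is left to the administrator.

```python
import hashlib
from pathlib import Path

def snapshot(paths):
    """Map each existing regular file to the SHA-256 digest of its contents."""
    digests = {}
    for p in paths:
        path = Path(p)
        if path.is_file():
            digests[str(path)] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests

def diff_snapshots(old, new):
    """Return sorted lists of paths that were added, removed, or changed."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    changed = sorted(p for p in set(old) & set(new) if old[p] != new[p])
    return added, removed, changed
```

Saving each snapshot with a timestamp gives a record that can be consulted later, when a latent problem finally surfaces.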

Monitoring System Faults

After ensuring that the configuration is correct, the first thing to monitor is the overall condition of the system. Is the system up? Can you talk to it, ping it, run a command? If not, a fault may have occurred. Detecting system problems ranges from determining whether the system is up to determining whether it is behaving properly. If the system either isn't up or is up but not behaving properly, then you must determine which system component is having a problem.
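The "can you talk to it?" question can be approximated with a simple reachability probe. The Python sketch below attempts a TCP connection to a service port; it is an assumption-laden stand-in for ping (which uses ICMP and normally needs elevated privileges), and the function name `is_reachable` is hypothetical.

```python
import socket

def is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

A failed probe says only that the service could not be reached from the monitoring host. The cause may be the system, the service, or the network in between, so it is a prompt for further diagnosis rather than proof of a system fault.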

This chapter addresses monitoring various components of the system for faults or events. The fault category generally covers system hardware components, including the Central Processing Unit (CPU), memory, and system buses, as well as peripherals, such as tape drives and printers. (Disks are covered in Chapter 5.) CPU and memory faults may cause system failures or degraded performance. Tape faults may result in a failed backup, a bad backup, or a backup that does not complete on time. With proactive monitoring, you can find out that a tape drive is having problems before a backup is scheduled to begin.

Monitoring System Resource Utilization

For an application to run correctly, it may need a fixed amount of system resources. Some resources are renewable, such as the amount of CPU or I/O bandwidth an application is entitled to use during a time interval. The resource category, by contrast, refers to those system resources that an application acquires and then releases at its own discretion. For example, an application can allocate a segment of shared memory or launch a group of processes. Other examples included in the resource category are the number of open files or sockets, message segments, and system semaphores that an application holds. The system has fixed limits for each of these resources, which are tracked in system tables, so monitoring their use is important. If these system tables are exhausted, the system may no longer function properly. You may want to set up alarms to notify you when the available resources in a given system table fall below a certain threshold, which will give you time to react before the problem becomes critical.
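The threshold alarm described above can be sketched in a few lines of Python. This is an illustration only: the function name `check_table`, the 10 percent default threshold, and the sample readings are all hypothetical, and how the usage figures are obtained depends on the tools available on your system.

```python
def check_table(name, used, limit, min_free_fraction=0.10):
    """Return an alarm message if the free share of a table is below the threshold."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    free = limit - used
    if free / limit < min_free_fraction:
        return f"ALARM: {name} has {free} of {limit} entries free"
    return None

# Hypothetical readings: (table name, entries in use, configured limit).
readings = [
    ("open files", 950, 1000),
    ("semaphores", 40, 128),
]
alarms = [msg for msg in (check_table(*r) for r in readings) if msg]
```

With these sample numbers, only the open files table raises an alarm, because just 5 percent of its entries remain free.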

Another aspect of resource utilization is studying the amount of resources that an application has used. You may not want a given workload to use more than a certain amount of CPU time or a fixed amount of disk space. Some resource management tools, such as quota, can help to enforce these limits.
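To illustrate checking a workload's disk consumption against a budget, the following Python sketch totals the sizes of the files under a directory. It is not how quota works (quota enforces limits per user or group inside the file system); it is only an after-the-fact report, and the names `directory_usage_bytes` and `over_budget` are hypothetical.

```python
import os

def directory_usage_bytes(root):
    """Sum the sizes of all regular files under root, skipping symbolic links."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue
            try:
                total += os.path.getsize(path)
            except OSError:
                pass  # File vanished or is unreadable; leave it out of the total.
    return total

def over_budget(root, budget_bytes):
    """Return True if the files under root occupy more than budget_bytes."""
    return directory_usage_bytes(root) > budget_bytes
```

The total counts file contents only, so it can differ from the number of disk blocks actually allocated.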

Monitoring System Security

One way that a system's availability can be affected is through unauthorized use. Performance and resource controls are of little value if the system is used for the wrong purposes. You need to prevent unauthorized use of system resources by using password files, network firewalls, and so forth. In addition to setting up access rights and policies, you need to monitor the system, so that you know when security has been compromised. This chapter briefly mentions some of the security tools that are available.

Monitoring System Performance

Knowing both that system resources are available and that your application is performing well is important. Eliminating bottlenecks or, even better, preventing them, allows the system to provide its intended services. Monitoring the performance of system resources can help to indicate problems with the operation of the system. Bottlenecks in one area usually impact system performance in another area. CPU, memory, and disk I/O bandwidth are the important resources to watch for performance bottlenecks. You should monitor during typical usage periods to establish baselines. Understanding what is "normal" operation helps you to identify when system resources are not behaving well. Resource management tools are available that can help you to allocate system resources among applications and users.
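One simple way to turn "normal" operation into something a monitor can test is to summarize samples from typical usage periods as a mean and standard deviation, and then flag readings that stray too far from the mean. The Python sketch below assumes that approach; the function names, the factor of three standard deviations, and the sample values are hypothetical choices, not recommendations from this chapter.

```python
import statistics

def build_baseline(samples):
    """Summarize samples from typical periods as (mean, standard deviation)."""
    return statistics.mean(samples), statistics.stdev(samples)

def is_abnormal(value, baseline, k=3.0):
    """Return True if value lies more than k standard deviations from the mean."""
    mean, stdev = baseline
    return abs(value - mean) > k * stdev

# Hypothetical CPU utilization percentages sampled during normal operation.
baseline = build_baseline([40, 42, 38, 41, 39])
```

With this baseline, a reading of 41 percent is treated as normal, whereas a reading of 90 percent is flagged. Workloads with strong daily or weekly cycles generally need a separate baseline for each period.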

In this and each of the next four chapters, performance issues are contained in a separate section and described after the other tools are discussed.

One way to check for system problems is to watch the lights on the system's front panel. Any change from normal (for example, a light changes color from green to red or starts flashing) could indicate a hardware or firmware problem. Of course, to monitor the system in this way, you need an operator to watch the front panel manually. If an operator isn't always available to do so, a problem may go undetected for some time. Many more sophisticated tools are available to help you detect problems in your system. These tools, which are covered in the following sections, range from standard UNIX commands to add-on monitoring software suites.
