Important Application Components to Monitor | UNIX Fault Management: A Guide for System Administrators


Applications can be monitored from several different perspectives. This chapter divides application monitoring into configuration management, fault management, and resource and performance management.

An application can fail for many reasons. Software logic problems, hardware faults, limit conditions (such as running out of memory or overflowing a counter), and race conditions (such as forgetting to request a semaphore) can all cause a failure. In some of these cases, an operator can work out what happened from other system events, such as available memory dropping below a threshold. Even so, monitors that watch specific applications are often useful. Customized application monitors may be able to identify hundreds of different causes of application problems, saving you some of the effort involved in troubleshooting a failure. Continual monitoring during business hours is needed to ensure that the service is available. "By far, application software failures account for the greatest portion of downtime," according to a 1996 Gartner Group study.

Configuration management includes both ensuring that the application was configured initially without errors and monitoring any changes that are made to the configuration files. This change history can be used to detect and recover from any inappropriate configuration changes.
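One way to keep such a change history is to record a digest of each configuration file and compare successive snapshots. The following is a minimal sketch, not a method prescribed by this chapter; the function names `snapshot` and `changed_files` are hypothetical, and a real monitor would also store the snapshots with timestamps.

```python
import hashlib
from pathlib import Path


def snapshot(paths):
    """Map each path to the SHA-256 hex digest of its contents.

    A path that is not a regular file (for example, a deleted
    configuration file) maps to None.
    """
    result = {}
    for p in paths:
        path = Path(p)
        if path.is_file():
            result[str(path)] = hashlib.sha256(path.read_bytes()).hexdigest()
        else:
            result[str(path)] = None
    return result


def changed_files(old, new):
    """Return the sorted paths whose digest differs between two snapshots.

    Files that were added, removed, or modified are all reported.
    """
    return sorted(p for p in set(old) | set(new) if old.get(p) != new.get(p))
```

Comparing the latest snapshot with the previous one yields the list of files to review or roll back.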

In addition to monitoring whether or not the application is running, monitoring its performance is also important. If an application is hung or performing poorly, the end-user considers the service to be unavailable. Because a Unix environment often has only one important application per system, overall system performance can be a useful gauge of how well the application is performing. However, more specific tools are needed to diagnose and recover from application problems.
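The distinction between an application that is down, hung, or merely slow can be illustrated with a timed health check. This is a sketch under stated assumptions: `probe` is a hypothetical helper, the health-check command is whatever the application provides for that purpose, and the two thresholds are arbitrary example values.

```python
import subprocess
import time


def probe(command, timeout_s=5.0, slow_s=1.0):
    """Run a health-check command and classify the result.

    Returns "down" if the command cannot be started or exits non-zero,
    "hung" if it does not finish within timeout_s seconds,
    "slow" if it succeeds but takes longer than slow_s seconds,
    and "ok" otherwise.
    """
    start = time.monotonic()
    try:
        completed = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return "hung"
    except OSError:
        # The command is missing or not executable.
        return "down"
    elapsed = time.monotonic() - start
    if completed.returncode != 0:
        return "down"
    return "slow" if elapsed > slow_s else "ok"
```

Treating "hung" and "slow" as separate states reflects the point above: from the end-user's perspective, a running but unresponsive application is as unavailable as one that has exited.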

For important applications, it may be useful to monitor the resources that the application needs, even if the application is not running. For example, tapes should be checked periodically so that tape hardware failures are detected before the backup application starts running. This helps to ensure that the backup completes within the desired time window.
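A periodic pre-check of this kind can be as simple as confirming that each resource the application depends on is present and accessible. The sketch below assumes the resources are represented as file system paths (such as a tape device node) and uses the hypothetical name `missing_resources`. It only tests access permissions; detecting a real tape hardware fault would require exercising the drive itself.

```python
import os


def missing_resources(required):
    """Return the paths in `required` that fail their access check.

    `required` maps a path to an os.access mode such as os.R_OK,
    os.W_OK, or os.R_OK | os.W_OK.
    """
    return [path for path, mode in required.items() if not os.access(path, mode)]
```

Run from a scheduler well before the backup window, a non-empty result gives operators time to fix the problem before the backup application starts.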

Another reason to monitor the application is to determine its resource usage over time. Usage records can be used to bill clients for their use of a service. They can also show that the application is nearing its configuration limits or taking an inappropriate share of system resources.
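Checking usage against configured limits can be sketched as a simple threshold comparison. The resource names, the `near_limit` function, and the 90 percent warning level are all illustrative assumptions rather than values taken from this chapter.

```python
def near_limit(usage, limits, warn_fraction=0.9):
    """Return the sorted names of resources at or above the warning level.

    `usage` and `limits` map a resource name to a number in the same
    unit. Resources without a configured limit are ignored.
    """
    return sorted(
        name
        for name, used in usage.items()
        if name in limits and used >= warn_fraction * limits[name]
    )


# Example: 950 of 1024 open files is above the 90 percent warning level.
flagged = near_limit(
    {"open_files": 950, "connections": 40},
    {"open_files": 1024, "connections": 100},
)
```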

A trend in companies' IT/O departments is to implement "service-level agreements." Instead of merely maintaining the availability of key systems and the networking backbone, IT/O departments are now measured on their ability to provide specific levels of availability and response times for their end-users. The focus is on the service provided by the application. Performance and availability in a computing environment are therefore best measured in terms of the end-user service or application.
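As an illustration of how an availability target in a service-level agreement might be checked, the sketch below computes availability as the fraction of a measurement period during which the service was up. The function names are hypothetical, the outage intervals are assumed not to overlap, and the 99.9 percent target is only an example.

```python
def availability_percent(period_s, outages):
    """Return service availability over a period, as a percentage.

    `period_s` is the length of the measurement period in seconds.
    `outages` is a list of (start, end) times in seconds; the
    intervals are assumed not to overlap.
    """
    if period_s <= 0:
        raise ValueError("period_s must be positive")
    downtime = sum(end - start for start, end in outages)
    return 100.0 * (period_s - downtime) / period_s


def meets_target(period_s, outages, target_percent):
    """Return True if measured availability is at least target_percent."""
    return availability_percent(period_s, outages) >= target_percent


# Example: two outages totalling 2,592 seconds in a 30-day month
# (2,592,000 seconds) give an availability of about 99.9 percent.
month_s = 30 * 24 * 3600
measured = availability_percent(month_s, [(1000, 2000), (50000, 51592)])
```

Measuring outages at the level of the end-user service, rather than per host, keeps the result aligned with what the agreement actually promises.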
