Defining Fault Management | UNIX Fault Management: A Guide for System Administrators


Detecting and reporting unusual or unacceptable behavior is generally referred to as fault management (or event management). A fault is any behavior that differs from specified or expected behavior; in practice, the term usually refers to the complete failure of a hardware component or software product.

Fault conditions can be characterized in many different ways. Faults can be caused by the failure of hardware components, by the failure of software running on systems, or by problems in the surrounding environment. A computer depends on more than its CPU and memory; power supplies and fans, for example, can also fail. Loss of power in the data center, natural disasters, and the failure of air conditioning units are just a few examples of how environmental problems can cause systems to fail.

Fault isolation is a goal that hardware and software vendors try to achieve: a fault within the software or hardware of a system should not affect the correct operation of other components in that system, or of other systems that it interacts with on the network. When fault isolation is not achieved, operators have a much harder time finding the root cause of problems.

Event is a more general term that covers faults as well as system anomalies, such as performance problems. An event is anything that happens in the computing environment that may be of interest to someone. Events may be sent automatically, or a user (operator) may ask to be informed of events. The start of a backup, for example, may be an interesting event worth noting.

A variety of events can occur in a computer environment that are of interest to a system administrator or computer operator. Many involve failure conditions in the environment, but other important types of events also occur. In a mission-critical environment, a configuration change of any kind to the hardware or software may need to be recorded. For example, an administrator may want to record when a software patch was installed on a system. This information can be used later when troubleshooting a problem on the system. In addition, the CPU utilization reaching a certain predefined threshold on a critical server may also be considered a significant event. Automated monitoring tools can detect other activities, such as an attempt to breach system security. Multiple failed "root" login attempts could be the sign of a security problem.
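The two checks mentioned above, a CPU utilization threshold and repeated failed "root" logins, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Event` record, the function names, the default thresholds, and the `FAILED LOGIN` log line format are all hypothetical and do not correspond to any particular monitoring tool.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Event:
    """Hypothetical event record: where it came from, how serious, and why."""
    source: str
    severity: str
    message: str


def check_cpu(utilization_pct: float, threshold_pct: float = 90.0) -> Optional[Event]:
    """Return an Event if CPU utilization reaches the threshold, else None."""
    if utilization_pct >= threshold_pct:
        return Event(
            "cpu",
            "warning",
            f"CPU utilization {utilization_pct:.0f}% at or above "
            f"{threshold_pct:.0f}% threshold",
        )
    return None


def check_root_logins(log_lines: Iterable[str], max_failures: int = 3) -> Optional[Event]:
    """Return an Event if failed root logins reach max_failures, else None.

    Assumes (hypothetically) that each failed attempt appears as a log line
    containing both "FAILED LOGIN" and "root".
    """
    failures = sum(
        1 for line in log_lines if "FAILED LOGIN" in line and "root" in line
    )
    if failures >= max_failures:
        return Event("security", "critical", f"{failures} failed root login attempts")
    return None
```

In practice, the thresholds would be chosen per system; a level that signals trouble on a critical server may be routine on a batch machine.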

Events can be received in a variety of ways. Hewlett-Packard OpenView, a suite of enterprise management products, provides sophisticated event management capabilities. An event message can be stored in a log and can optionally cause a graphical icon associated with the event to change color to reflect a status change (see Chapter 4). Recovery or troubleshooting actions can be triggered automatically, or trouble tickets can be filed so that the appropriate person is notified of the problem and a problem history is maintained.
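The general pattern described here (log the event, update a status indicator, and optionally trigger an action) can be sketched generically. The following Python example is not OpenView's interface; the `EventConsole` class, the severity names, and the color mapping are all hypothetical and serve only to show how the three steps fit together.

```python
from typing import Callable, Dict, List

# Hypothetical mapping from event severity to icon color.
STATUS_COLORS = {"normal": "green", "warning": "yellow", "critical": "red"}


class EventConsole:
    """Illustrative event receiver: logs, recolors an icon, runs an action."""

    def __init__(self) -> None:
        self.log: List[str] = []               # stored event messages
        self.icon_colors: Dict[str, str] = {}  # node name -> current color
        self.actions: Dict[str, Callable[[str], None]] = {}

    def register_action(self, severity: str, action: Callable[[str], None]) -> None:
        """Register a callback to run for every event of this severity."""
        self.actions[severity] = action

    def receive(self, node: str, severity: str, message: str) -> None:
        """Handle one incoming event for the named node."""
        self.log.append(f"{node} [{severity}] {message}")
        # Unknown severities fall back to gray rather than raising an error.
        self.icon_colors[node] = STATUS_COLORS.get(severity, "gray")
        action = self.actions.get(severity)
        if action is not None:
            action(node)


# Usage: file a (simulated) trouble ticket whenever a critical event arrives.
tickets: List[str] = []
console = EventConsole()
console.register_action("critical", lambda node: tickets.append(f"ticket opened for {node}"))
console.receive("db01", "critical", "disk failure")
```

Keeping the action as a registered callback separates detection from reaction, so the same event can open a ticket at one site and run a recovery script at another.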

Fault management includes the process of detecting, reporting, and reacting to the faults or events taking place in the computing environment. Chapter 3 describes both the types of notification methods available for reporting events and the enterprise management tools that can receive events. The rest of this chapter describes the different types of events that can be detected. Given the list of possible events, you can choose the level of monitoring that is appropriate for your environment. Subsequent chapters introduce the tools that detect the various events.
