Chapter 3. Using Monitoring Frameworks | UNIX Fault Management: A Guide for System Administrators


Chapter 2 focused on the set of possible events that can occur on UNIX systems. The remaining chapters of this book describe the tools available to monitor these events. Tools for different components, such as systems, disks, or networks, are each discussed in a separate chapter. This chapter focuses on monitoring frameworks, which can be used to monitor many different components and systems through a common interface.

Some companies monitor systems simply by checking the size of log files, such as the system log file, every day. If a file is bigger than a certain size, then something unusual has happened and the operator can investigate further by looking at the specific file contents. The problem is that this can be very labor-intensive. A lot of duplicate messages may exist, making it difficult to analyze and fix each individual problem. Also, this technique may not help to determine the root cause of the problem. Lastly, although each logged event may have a severity associated with it, prioritizing the investigation of numerous logged messages may be difficult.
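The size check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a tool from this book: the threshold, the function names, and the log path are all hypothetical, and the location of the system log file varies from one UNIX system to another. The second function counts identical messages to show how duplicates might be collapsed. It assumes that timestamps have already been removed from each line, since otherwise repeated messages would not compare as equal.

```python
import os
from collections import Counter

# Hypothetical threshold; choose a value suited to your own systems.
SIZE_LIMIT_BYTES = 1_000_000


def oversized_logs(paths, limit_bytes=SIZE_LIMIT_BYTES, getsize=os.path.getsize):
    """Return (path, size) pairs for log files larger than limit_bytes."""
    flagged = []
    for path in paths:
        try:
            size = getsize(path)
        except OSError:
            # Missing or unreadable file; this sketch simply skips it.
            continue
        if size > limit_bytes:
            flagged.append((path, size))
    return flagged


def summarize_duplicates(lines):
    """Count identical non-empty messages, most frequent first."""
    counts = Counter(line.strip() for line in lines if line.strip())
    return counts.most_common()


if __name__ == "__main__":
    # Example path only; substitute the system log file on your platform.
    for path, size in oversized_logs(["/var/adm/syslog/syslog.log"]):
        print(f"{path}: {size} bytes exceeds the limit")
```

Even with a script like this run daily, the operator still has to read the flagged file to find out what happened, which is the labor cost the paragraph above describes.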

Operators spend a great deal of time watching the system consoles. Small data centers might manage from a Microsoft Windows computer, with scripts executed periodically (such as daily) to check resource usage and the availability of the computer systems. System failures are detected by the loss of a window on the console, or by complaints from users. Larger data centers might manage from a workstation or server running monitoring software such as HP IT/Operations, Computer Associates' Unicenter The Next Generation (TNG), or IBM's Tivoli Management Environment. These products can show the status of many systems graphically on a single screen.

Some comprehensive monitoring products can be used to monitor multiple components, and often provide developer's kits so that you can add additional monitors. These products are referred to as monitoring frameworks. If a product has additional capabilities to help deploy it over many systems, it is referred to as an enterprise management framework. A data center is unlikely to be using more than one enterprise management framework, due to the cost, complexity, and time required to learn a second framework.

Customers are recognizing the need for management tools, and enterprise management framework providers such as Hewlett-Packard (with IT/Operations, or IT/O) and Computer Associates (with Unicenter TNG) are feeling pressured to provide low-cost alternatives. Both HP and CA have announced intentions to bundle low-cost versions of their software with HP-UX servers. Enterprise SyMON is a low-cost management station available on Sun systems, but because it currently lacks a developer's kit, it is not a true monitoring framework.

According to a recent Network Computing Review of enterprise frameworks, the products "are as complex as the problems they are meant to solve." Monitoring frameworks can be difficult to set up and are very expensive, but the benefit is that you can monitor heterogeneous components and systems through a common interface.

This chapter describes some of the frameworks available for UNIX fault management. It identifies some of the monitoring capabilities provided or available with each framework. Hopefully, these descriptions will help you to determine which framework, if any, is best for your environment.

You may need to use multiple monitoring frameworks in your environment. Chapter 9 briefly describes how to integrate different combinations of these tools.
