Chapter 1. Analyzing the Role of System Operators | UNIX Fault Management: A Guide for System Administrators


System operators are responsible for maintaining the integrity and availability of the computer systems in a company's data center. An operator's responsibilities can span a wide range of tasks. Most operators spend the majority of their time troubleshooting and resolving problems reported by users. Other primary tasks are system health monitoring, performance monitoring, and backing up the system. Periodically, the operator is asked to perform additional tasks such as upgrading the operating system and applications, restoring files, installing patches, or performing system maintenance. It is not unusual for an operator to have more than 100 systems to maintain in the environment.

Larger organizations have a number of system operators. On average, there is approximately one operator per ten UNIX servers. System management responsibilities may be segmented by region, by application, by problem area (performance, backup, and so forth), or by time zone. In some situations, an operator may need to pass responsibility for a problem to an operator in one of the company's data centers in another country. System, application, database, and network management may involve different people in different departments. As part of troubleshooting, an operator may ask the advice of a more experienced system administrator. Large companies may have many experts, including database administrators and network administrators, who specialize in particular areas.

This book should be a useful tool for simplifying the operator's primary task of troubleshooting user problems. Troubleshooting involves checking the current health of the system and researching recent events and faults that may be related. The goal is to find the root cause of a problem. Many system outages are caused by operator errors made when trying to fix problems. In fact, studies by Hewlett-Packard and industry consultants such as the Gartner Group have indicated that operator error is the most common cause of unplanned downtime. The following chapters are intended to simplify the monitoring and recovery tasks.

The monitoring chapters are organized by system component and are self-contained, so you should be able to resolve a problem without leafing back and forth between chapters. However, because of the huge number of unique problem situations, it is impossible to list solutions to every problem that you may encounter. Instead, this book explains a representative set of available monitoring tools and how each can be used to solve problems. You should read the recovery sections at the end of each chapter to get an idea of general techniques for problem resolution and to get a feeling for when to apply each tool.
