5.5 Monitoring Tools | Performance by Design: Computer Capacity Planning By Example

Monitors are used to measure the level of activity (e.g., workload intensities, device utilizations) of a computer system [2]. Ideally, a monitor should interfere as little as possible with the operation of the system being measured, so that it degrades performance only minimally. Monitors are characterized by their type and mode. There are three types of monitors, depending on their implementation: hardware, software, and hybrid. There are two data collection modes: event trace and sampling.

5.5.1 Hardware Monitors

A hardware monitor is a specialized measurement tool that detects certain events (e.g., the setting of a register) within a computer system by sensing predefined signals (e.g., a high voltage at the register's control point). A hardware monitor captures the state of the computer system under study via electronic probes that are attached to its circuitry and records the measurements. The electronic probes sense the state of hardware components of the systems, such as registers, memory locations, and I/O channels. For example, a hardware monitor may detect a memory-read operation by sensing that the read probe to the memory module changes from the inactive to the active state [3, 4, 6].

The main advantage of hardware monitors is that they do not consume resources of the monitored system. As a result, they place no overhead on the system and do not affect its operation or performance. One of the major problems of hardware monitors is that software features (e.g., the completion of a specific job) are difficult to detect, since these monitors do not have access to software-related information such as the identification of the process that triggered a given event. Thus, workload-specific data and the number of transactions executed by a given class are difficult to obtain using a hardware monitor.

5.5.2 Software Monitors

A software monitor consists of routines inserted into the software of a computer system, either at the user level or (more often) at the kernel level, with the aim of recording status information and events of the system [1, 4]. These routines gather performance data about the execution of programs and/or about the components of the hardware. Software monitoring is activated either by the occurrence of specific events (e.g., an interrupt signaling an I/O completion) or by timer interrupts (e.g., to check every 5 msec whether a particular disk is active), depending on the monitoring mode.

Software monitors can record essentially any information that is available to programs and operating systems. This feature, together with the flexibility to select and reduce performance data, makes software monitors a powerful tool for analyzing computer systems. The IBM Resource Management Facility (RMF) and Windows XP's Performance Monitor are examples of software monitors that provide performance information such as throughput, device utilizations, I/O counts, and network activity. A drawback of software monitors is that they use the very resources that they measure. Therefore, software monitors may interfere, sometimes significantly, with the system being measured. If the overhead they introduce is high enough, the results they yield may be of minimal value.

Two special classes of software monitors are accounting systems and program analyzers. Each provides useful information that helps to parameterize QN models.

Accounting Systems. Accounting systems are tools primarily intended to apportion financial charges to users of a system [7, 8]. They are an integral part of most multiuser operating systems. IBM's SMF (System Management Facilities) is a standard feature of IBM's MVS operating system that collects and records data related to job executions. UNIX's sar (System Activity Reporter) is another example of an accounting system.

Although their main purpose is billing, accounting tools can be used as a source for necessary model parameters in capacity planning studies. In general, accounting data include three groups of information.

Identification. Specifies user, program, project, accounting number, and class of the monitored event.
Resource usage. Indicates the resources (e.g., CPU times, I/O operations) consumed by programs.
Execution time. Records the start and completion times of program execution.
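
As an illustration of how these three groups of information might feed a model, the following Python sketch groups accounting records by class and derives per-class throughput and mean resource usage. The record layout, the field names, and the sample values are hypothetical; they do not correspond to the format of SMF, sar, or any other real accounting system.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class AccountingRecord:
    # Identification (hypothetical fields)
    user: str
    job_class: str
    # Resource usage
    cpu_seconds: float
    io_operations: int
    # Execution time, in seconds since the start of the measurement interval
    start_time: float
    end_time: float


def summarize_by_class(records, interval_seconds):
    """Return per-class completions, throughput, and mean resource usage."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec.job_class].append(rec)

    summary = {}
    for job_class, recs in groups.items():
        n = len(recs)
        summary[job_class] = {
            "completions": n,
            "throughput": n / interval_seconds,
            "mean_cpu_seconds": sum(r.cpu_seconds for r in recs) / n,
            "mean_io_operations": sum(r.io_operations for r in recs) / n,
            "mean_elapsed_seconds": sum(r.end_time - r.start_time for r in recs) / n,
        }
    return summary


# Made-up records for a 60-second measurement interval.
records = [
    AccountingRecord("alice", "batch", 2.0, 100, 0.0, 10.0),
    AccountingRecord("bob", "batch", 4.0, 300, 5.0, 25.0),
    AccountingRecord("carol", "query", 0.5, 20, 12.0, 13.0),
]
summary = summarize_by_class(records, interval_seconds=60.0)
for job_class, stats in summary.items():
    print(job_class, stats)
```

The per-class means computed here include only the resource usage that the accounting records attribute to user programs.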

Although accounting monitors provide useful data, there are often problems with their use in performance modeling. Accounting monitors typically do not capture the use of resources by the operating system. That is, they do not measure any unaccountable (i.e., non-user billable) system overhead. Another problem is the way that accounting systems view some special programs, such as database management systems (DBMS) and transaction monitors. These programs have transactions and processes that execute within them. Since accounting systems treat such special programs as single entities, they normally do not collect any information about what is executed inside these programs. However, in order to model transaction response time, information about individual transactions, such as arrival rates and service demands, is required. Thus, special monitors are required to examine the performance of some programs.

Program Analyzers. Program analyzers are software tools that collect information about the execution of individual programs. These analyzers can be used to identify the parts of a program that consume significant computing resources [8]. They are capable of observing and recording events internal to the execution of specific programs. In the case of transaction-oriented systems, program analyzers provide information such as transaction counts, average transaction response time, mean CPU time per transaction, mean number of I/O operations per transaction, and transaction mix. Examples of program analyzers include monitors for special programs such as IBM's database products (e.g., DB2 and IMS) and transaction processing products (e.g., CICS).

5.5.3 Hybrid Monitors

The combination of hardware and software monitors results in a hybrid monitor, which shares the best features of both types. In a hybrid monitor, software routines are responsible for sensing events and storing this information in special "monitoring registers". The hardware component of the monitor records the data stored in these registers, avoiding interference with the normal I/O activities of the system. Thus, hybrid monitors make it possible to capture specific job-related events (i.e., the primary benefit of software monitors) without placing significant overhead on the system or altering its performance (i.e., the primary benefits of hardware monitors). The primary disadvantages associated with hybrid monitors are the requirements of special hardware (e.g., monitoring registers) and more specialized software routines (i.e., to record a more limited set of program events). Unless hybrid monitors are designed as an integral part of the system architecture, their practical use is limited.

5.5.4 Event-trace Monitoring

Any system interrupt, such as an I/O interrupt indicating the completion of a disk read/write operation, can be viewed as an event that changes the state of a computer system. At the operating system level, the state of the system is usually defined as the number of processes that are "at" each system device, either in the device's ready queue, blocked in the device's waiting queue, or executing in the device. Examples of events at this level are OS system calls that change a process's status (e.g., an I/O request that moves a process from executing in the CPU to the waiting queue at a disk). At a higher level, where the number of transactions in memory represents the state of the system, the completion of a transaction (e.g., an interrupt to swap out a job) is an event. An event trace monitor collects information and chronicles the occurrence of specific events.

Usually, an event trace software monitor consists of special pieces of code inserted at specific points in the operating system, typically within interrupt service routines. Upon detection of an event, the special code generates a record containing information such as the date, time, and type of event. In addition, the record contains any relevant event-specific data. For instance, a record corresponding to the completion of a process might contain the CPU time used by the process, the number of page faults initiated, the number of I/O operations executed, and the amount of memory required. In general, event traces record changes in the active set of PCBs (process control blocks). These "change events" are later transferred to secondary storage.
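
A minimal sketch of this record-generation logic is given below in Python. It is not kernel code: the class name EventTraceMonitor, the on_event hook, the buffer size, and the JSON record layout are all assumptions made for illustration, and an in-memory text stream stands in for secondary storage.

```python
import io
import json
import time
from dataclasses import asdict, dataclass, field


@dataclass
class EventRecord:
    timestamp: float          # date and time of the event
    event_type: str           # e.g., "io_completion"
    data: dict = field(default_factory=dict)  # event-specific data


class EventTraceMonitor:
    """Hypothetical event trace monitor that buffers records, then flushes them."""

    def __init__(self, sink, buffer_size=2, clock=time.time):
        self.sink = sink              # any object with a write(str) method
        self.buffer_size = buffer_size
        self.clock = clock            # injectable so runs can be reproduced
        self.buffer = []

    def on_event(self, event_type, **data):
        # Code of this kind would sit at the instrumented point in the system.
        self.buffer.append(EventRecord(self.clock(), event_type, data))
        if len(self.buffer) >= self.buffer_size:
            self.flush()

    def flush(self):
        # Transfer buffered records to "secondary storage", one JSON line each.
        for rec in self.buffer:
            self.sink.write(json.dumps(asdict(rec)) + "\n")
        self.buffer.clear()


sink = io.StringIO()
monitor = EventTraceMonitor(sink, buffer_size=2, clock=lambda: 100.0)
monitor.on_event("io_completion", device="disk1")
monitor.on_event("process_completion", pid=42, cpu_seconds=1.5,
                 page_faults=7, io_operations=12)
lines = sink.getvalue().splitlines()
print(len(lines), "records written")
```

Every call to on_event does work on the monitored system's own CPU, which is why the cost of such a monitor grows with the event rate.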

When the event rate becomes very high, the monitor routines are executed frequently. This may introduce significant overhead in the measurement process. Depending on the events selected and the event rate, the overhead may reach levels as high as 30% or more. Overheads up to 5% are regarded as acceptable for measurement activities [8]. Since the event rate cannot be controlled or predicted by the monitor, the measurement overhead, likewise, becomes unpredictable. This is one of the major shortcomings of event trace monitors.
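
A back-of-the-envelope model makes the dependence on event rate concrete. Assuming, purely for illustration, that every traced event costs a fixed amount of CPU time, the overhead fraction is the product of the event rate and the per-event cost. The 100-microsecond cost and the event rates below are hypothetical values, not measurements.

```python
def trace_overhead(events_per_second, cpu_seconds_per_event):
    """Fraction of one CPU consumed by the monitor routines."""
    return events_per_second * cpu_seconds_per_event


def max_event_rate(overhead_budget, cpu_seconds_per_event):
    """Highest event rate that keeps the overhead within the given budget."""
    return overhead_budget / cpu_seconds_per_event


COST_PER_EVENT = 100e-6  # assumed: 100 microseconds of CPU time per traced event

for rate in (100, 500, 3000):
    overhead = trace_overhead(rate, COST_PER_EVENT)
    print(f"{rate:>5} events/s -> {overhead:.1%} overhead")

print("rate at a 5% budget:", max_event_rate(0.05, COST_PER_EVENT), "events/s")
```

Under these assumptions, 500 events per second corresponds to the 5% level and 3,000 events per second to the 30% level. Because the event rate is set by the workload rather than by the monitor, the monitor cannot choose where on this line it operates.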

5.5.5 Sampling Monitoring

A sampling monitor collects information about a system (i.e., recorded state information) at specified time instants. Instead of being triggered by the occurrence of an internal event such as an I/O interrupt, the data collection routines of a sampling software monitor are triggered by an external timer event. Such events are activated at predetermined times, which are specified prior to the monitoring session. The sampling is driven by timer interrupts, based on a hardware clock.
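
The timer-driven collection can be sketched in a few lines of Python. Here a simulated clock replaces the hardware timer, and the probe disk_busy is a hypothetical stand-in for reading a device's state; the 30 ms busy period in every 100 ms is a made-up activity pattern chosen so that the true utilization is known to be 30%.

```python
def sample_states(probe, session_ms, interval_ms):
    """Call probe(t) at t = 0, interval_ms, 2 * interval_ms, ... below session_ms."""
    return [probe(t) for t in range(0, session_ms, interval_ms)]


def disk_busy(t_ms):
    # Hypothetical device: busy during the first 30 ms of every 100 ms.
    return (t_ms % 100) < 30


samples = sample_states(disk_busy, session_ms=1000, interval_ms=5)
utilization = sum(samples) / len(samples)
print(f"{len(samples)} samples, estimated utilization {utilization:.2f}")
```

The monitor never observes the events that make the disk busy or idle. It sees only the state at each sampling instant and estimates the utilization as the fraction of samples in which the device was found busy.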

The overhead introduced by a sampling monitor depends on two factors: the number of variables measured and the length of the sampling interval. With the ability to limit both factors, a sampling monitor is able to strictly control its overhead. Long intervals result in low overhead. On the other hand, if the intervals are too long, the number of samples decreases and, with it, the confidence in the data collected. Thus, there exists a clear trade-off between overhead and quality of the measurements: the higher the sampling rate, the higher the accuracy, but also the higher the overhead. When compared to event trace monitoring, sampling provides a less detailed observation of a computer system but at a controllable overhead level. Errors may also be introduced because a certain percentage of potentially important interrupts are masked. For example, if some routines within the operating system cannot be interrupted by the timer, their contribution to the CPU utilization will not be accounted for by a sampling monitor [3].
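
The trade-off can be made tangible with a rough calculation. The Python sketch below assumes a fixed CPU cost per variable read and treats the samples as independent, so that the half-width of a 95% confidence interval for a utilization estimate follows the usual normal approximation for a proportion. Both assumptions, as well as the session length, the number of variables, and the cost figures, are illustrative choices rather than properties of any particular monitor.

```python
import math


def sampling_tradeoff(session_ms, interval_ms, n_variables,
                      cost_us_per_variable, utilization=0.5, z=1.96):
    """Return (number of samples, overhead fraction, confidence half-width)."""
    n_samples = session_ms // interval_ms
    # Overhead: CPU time spent per sample divided by the sampling interval.
    overhead = n_variables * cost_us_per_variable / (interval_ms * 1000)
    # Normal approximation for a proportion estimated from independent samples.
    half_width = z * math.sqrt(utilization * (1 - utilization) / n_samples)
    return n_samples, overhead, half_width


# Assumed: 60-second session, 10 variables, 5 microseconds per variable read.
for interval_ms in (5, 50, 500):
    n, overhead, half_width = sampling_tradeoff(60_000, interval_ms, 10, 5)
    print(f"{interval_ms:>4} ms interval: {n:>6} samples, "
          f"overhead {overhead:.4%}, utilization +/- {half_width:.3f}")
```

Lengthening the interval by a factor of 100 cuts the overhead by the same factor but widens the confidence interval by a factor of 10, since the half-width shrinks only with the square root of the number of samples.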

Sampling monitors typically provide information that can be classified as system-level statistics: for example, the number of processes in execution and resource use, such as CPU and disk utilization. Process-level statistics are better captured by event trace monitors, because it is easier to associate events with the start and completion of processes.