Collecting System Performance Data

Users call their IT department when they have delays in accessing data or applications. Good tools are needed to help an operator pinpoint the source of the problem. This section covers some of the interesting performance and resource-utilization metrics, and the tools available to collect data about these metrics.

A wide range of conditions may result in resource and performance problems. Running out of available memory may be caused by a failure of a memory component or by a memory leak in an application. A sudden rise in CPU utilization could indicate a processor failure or the introduction of a CPU-intensive application on the system. Analysis is needed to determine whether resource problems can be fixed with a configuration change, hardware repair, or other techniques.

Many important system resources have configured limits. The following system resource metrics are important to monitor:

  • Number of named pipes

  • Number of messages and message queues

  • Number of system semaphores

  • Amount of shared memory

  • Number of open files

  • Number of processes

Earlier, this chapter discussed some of the tools that can be used to check system resource usage. The sar and sysdef commands can compare current usage to configured limits. An EMS monitor is available to detect thresholds being exceeded for the following resources:

  • Callout table

  • Process table

  • File descriptor table

  • File lock table

  • Shared memory

  • System semaphores

  • Message queues and message segments

The performance tools discussed in this section can also detect resource usage problems.

Some system performance monitoring is available from the SAM Performance Monitors, with which an administrator can obtain information on system, disk, and virtual memory activity, among other areas. When one of the desired metrics is selected, text-based information is displayed in a Motif window.

Historical information is important for understanding how system performance has varied over time. Knowing how your system behaves under normal conditions helps when you troubleshoot system performance problems. Note that the performance tools themselves affect the performance of the system, so you need to find a tool with low overhead.

This section describes some common tools for measuring and monitoring system performance. Here are some of the key metrics discussed in this section:

  • Buffer cache queue length: Refers to the number of blocked processes that are waiting for updates to the buffer cache. If this value is high, it could be an indication of a memory bottleneck.

  • Context switches: How often the CPU stops running one process and starts running another.

  • CPU utilization: Expressed as a percentage of time spent in various execution states. Low utilization indicates that the CPU spent the majority of its time in the idle state.

  • CPU run queue length: The average number of processes in the run state waiting to be scheduled.

  • Memory utilization: Usually expressed as a ratio of the amount of memory in use versus the total memory available.

  • Paging: Refers to the transfer of data between virtual memory (disks) and physical memory.

  • Swapping: Refers to the transfer of data between physical memory and a special virtual memory area reserved for swapping.
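The two utilization metrics in the list above are simple ratios, and a minimal sketch can make the arithmetic concrete. The state names, tick counts, and memory sizes below are made-up illustration values, not output from any of the tools in this section.

```python
def cpu_utilization(ticks):
    """Return the percentage of time spent in each CPU state.

    `ticks` maps a state name (for example "user", "system", "idle")
    to the number of clock ticks counted in that state during one
    sampling interval. The state names are illustrative only.
    """
    total = sum(ticks.values())
    if total == 0:
        return {state: 0.0 for state in ticks}
    return {state: 100.0 * count / total for state, count in ticks.items()}


def memory_utilization(used_kb, total_kb):
    """Return memory in use as a percentage of total memory."""
    if total_kb <= 0:
        raise ValueError("total_kb must be positive")
    return 100.0 * used_kb / total_kb


# Hypothetical counters for one sampling interval.
sample = {"user": 300, "system": 100, "idle": 600}
percentages = cpu_utilization(sample)
busy = 100.0 - percentages["idle"]
print(f"CPU busy: {busy:.1f}%")
print(f"Memory in use: {memory_utilization(768_000, 1_024_000):.1f}%")
```

A low busy figure corresponds to the CPU spending most of the interval in the idle state, as described in the CPU utilization entry.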

Performance tools, such as BMC PATROL and MeasureWare, don't always provide the same set of metrics on all platforms. For simplicity, this section focuses on the Sun Solaris and HP-UX platforms only. Also, these products are continually being enhanced, so the actual metrics available for use in your environment may not precisely match the information presented in this section.

MeasureWare

HP MeasureWare Agent is a Hewlett-Packard product that collects and logs resource and performance metrics. MeasureWare agents run and collect data on the individual server systems being monitored. Agents exist for many platforms and operating systems, including HP-UX, Solaris, and AIX.

The MeasureWare agents collect data, summarize it, timestamp it, log it, and send alarms when appropriate. The agents collect and report on a wide variety of system resources, performance metrics, and user-defined data. The information can then be exported to spreadsheets or to performance analysis programs, such as PerfView. The data can be used by these programs to generate alarms to warn of potential performance problems. By using historical data, trends can be discovered. This can help address resource issues before they affect system performance.

MeasureWare agents collect data at three different levels: global system, application, and process. Global and application data is summarized at five-minute intervals, whereas process data is summarized at one-minute intervals. An administrator can define important applications by listing the processes that make up each application in a configuration file.
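To illustrate what summarizing at fixed intervals means, here is a minimal sketch that averages raw samples into five-minute and one-minute records. It is not MeasureWare's implementation; the function name, the sample data, and the choice of a plain mean are assumptions made for the example.

```python
from collections import defaultdict


def summarize(samples, interval_seconds):
    """Average (timestamp, value) samples into fixed-width intervals.

    Returns a dict mapping the start time of each interval to the
    mean of the values that fall inside it.
    """
    buckets = defaultdict(list)
    for timestamp, value in samples:
        start = timestamp - (timestamp % interval_seconds)
        buckets[start].append(value)
    return {start: sum(vals) / len(vals)
            for start, vals in sorted(buckets.items())}


# Hypothetical CPU-use samples (seconds since start, percent busy).
cpu_samples = [(0, 20.0), (60, 40.0), (120, 60.0), (180, 40.0), (240, 40.0),
               (300, 90.0), (360, 70.0)]

global_summary = summarize(cpu_samples, 300)   # five-minute records
process_summary = summarize(cpu_samples, 60)   # one-minute records
print(global_summary)
```

The longer interval smooths out short bursts, which is one reason finer-grained process data is kept separately from the global and application records.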

Table 4-4. Categories of MeasureWare Agent Information

Category      Metric Type
System        CPU, disk, networking, memory, process queue depths, user/process information, and summary information
Application   CPU, disk, memory, process count, average process wait states, and summary information
Process       CPU, disk, memory, average process wait states, overall process lifetime, and summary information
Transaction   Transaction count, average response time, distribution of response time metrics, and aborted transactions

The basic categories of MeasureWare data are listed in Table 4-4. Also included are optional modules for database and networking support. MeasureWare agents also collect data provided through the DSI interface.

The following list shows the global system metrics that are available from MeasureWare on HP-UX and Sun Solaris. Additional metrics provided by MeasureWare are covered in other chapters.

  • CPU use during interval

  • Number and rate of physical disk inputs/outputs

  • Maximum percent full of all disk file sets

  • System CPU use during interval

  • User CPU use during interval

  • CPU use at nice priorities

  • CPU idle time during interval

  • Rate of system procedure calls during interval

  • Main memory use

  • Swap space use on disk

  • Number and rate of memory page faults during interval

  • Number of process swaps during interval

  • Percentage of virtual memory currently in active use

  • Number of processes in run queue during interval

  • Number of processes waiting for a disk during interval

  • Number of processes waiting for memory during interval

  • Number of processes currently in sleep state during interval

  • Number of processes waiting for other reasons during interval

  • Number of user sessions during interval

  • Number of processes alive during interval

  • Number of processes active during interval

  • Number of processes started during interval

  • Number of processes completed during interval

  • Average runtime of completing process during interval

  • Operating system version

  • Number of processors in the system

  • Number of disk devices and their device IDs

  • Main memory size

  • Swapping space allocated

  • Disk I/O information (see Chapter 5)

  • Networking statistics (see Chapter 6)

Note that, in addition to performance metrics, MeasureWare provides useful configuration information, such as number of processors and the number of disk devices.

The following additional global system metrics are available on HP-UX:

  • CPU use at real-time priorities

  • CPU use for context switching during interval

  • CPU use for interrupt handling during interval

  • Number of processes waiting for interprocess communications during interval

  • Number of processes waiting on network transfers during interval

  • Number and rate of terminal transactions during interval

  • Average terminal transaction "think" time

  • Average terminal transaction first response time

  • Average terminal response to prompt time

  • Distribution of transaction first response times

  • Distribution of transaction response to prompt times

You can have alarms sent based on conditions that involve a combination of metrics. For example, a CPU bottleneck alarm can be based on the CPU use and CPU run queue length.
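A condition that combines metrics can be sketched as a simple predicate. The following is an illustration only: the function name and the threshold values are assumptions, not defaults shipped with MeasureWare.

```python
def cpu_bottleneck(cpu_use_pct, run_queue_len,
                   cpu_limit=90.0, queue_limit=3.0):
    """Return True when both metrics exceed their thresholds.

    The default limits are illustrative assumptions, not values
    taken from any product.
    """
    return cpu_use_pct > cpu_limit and run_queue_len > queue_limit


# A busy CPU alone is not treated as a bottleneck; work must also be queued.
print(cpu_bottleneck(95.0, 1.0))   # busy CPU, short run queue
print(cpu_bottleneck(95.0, 5.0))   # busy CPU, long run queue
```

Requiring both conditions reduces false alarms from a single process that keeps the CPU busy while nothing else is waiting to run.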

MeasureWare agents provide these alarms to PerfView for analysis, and to the IT/O management console. SNMP traps can also be sent at the time a threshold condition is met. Automated actions can be taken, or the operator can choose to take a suggested action.

MeasureWare's extract command can be used to export data to other tools, such as spreadsheet programs. Additionally, Application Resource Measurement (ARM) APIs (described in detail in Chapter 7) can be used to instrument applications so that response times can be measured. The application response time information can be passed along to MeasureWare agents for analysis.

Although MeasureWare provides extensive performance and resource information, it provides limited configuration information and no data about system faults. For further information, visit the HP Resource and Performance Management Web site at http://www.openview.hp.com/solutions/application/.

GlancePlus

GlancePlus is a real-time, graphical performance monitoring tool from Hewlett-Packard. It is used to monitor the performance and system resource utilization of a single system. Both Motif-based and character-based interfaces are available. The product can be used on HP-UX, Sun Solaris, and many other operating systems.

GlancePlus collects information similar to the information collected by MeasureWare, and samples data more frequently than MeasureWare. GlancePlus can be used to graphically view the following:

  • Current CPU, memory, swap, and disk activity and utilization (see Figure 4-9)

    Figure 4-9. The GlancePlus main screen showing system utilization.

  • Application and process information

  • Transaction information, if the MeasureWare Agent is installed and active

  • Alarm information, color-coded to reflect severity

  • CPU utilization, with per-processor information available for multiprocessor systems

  • Memory utilization, split among cache, user, and system memory

  • Disk utilization, with the I/O paths of the top disk users indicated

  • I/O activity, by filesystem or logical volume

GlancePlus is also capable of setting and receiving performance-related alarms. Customizable rules determine when a system performance problem should be sent as an alarm. The rules are managed by the GlancePlus Adviser. The Adviser menu gives you the option to Edit Adviser Syntax. When you select this option, all the alarm conditions are shown, and you can then modify them.

Listing 4-13. Defining alarms in GlancePlus.

 alarm CPU_Bottleneck > 50 for 2 minutes
   start
     if CPU_Bottleneck > 90 then
       red alert "CPU Bottleneck probability= ", CPU_Bottleneck, "%"
     else
       yellow alert "CPU Bottleneck probability= ", CPU_Bottleneck, "%"
   repeat every 10 minutes
     if CPU_Bottleneck > 90 then
       red alert "CPU Bottleneck probability= ", CPU_Bottleneck, "%"
     else
       yellow alert "CPU Bottleneck probability= ", CPU_Bottleneck, "%"
   end
     reset alert "End of CPU Bottleneck Alert"

Alarms result in onscreen notification, with the color representing the criticality of the alarm. An alarm can also trigger a command or script to be executed automatically. Instead of sending an alarm, GlancePlus can print messages or notify you by executing a UNIX command, such as mailx, using its EXEC feature.

To configure events, you need to edit a configuration file. The GlancePlus Adviser syntax file (/var/opt/perf/adviser.syntax) contains symptom and alarm configuration. Additional syntax files can also be used. A condition for an alarm to be sent can be based on rules involving different symptoms. Listing 4-13 shows an example of how you can set up an alarm for CPU bottlenecks that is based on CPU utilization and the size of the run queue.

You can also execute scripts in command mode. To execute a script, type:

 glance -adviser_only -syntax <script file name>

In the example in Listing 4-13, a yellow alert is sent to the GlancePlus Alarm screen if a CPU bottleneck is suspected. As a bottleneck becomes more likely, the alarm changes to red. You can define the threshold for when the alarm should be sent. The symptoms are re-evaluated at every time interval.
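The start, repeat, and end behavior of the alarm in Listing 4-13 can be easier to follow as a small simulation. The sketch below is one reading of that listing, not the GlancePlus Adviser itself: the function name, the event tuples, and the sample series are assumptions, and the thresholds mirror the listing (above 50 for 2 minutes, red above 90, repeat every 10 minutes).

```python
def run_alarm(samples, threshold=50, hold=120, repeat=600, red_above=90):
    """Replay (seconds, probability) samples and return alert events.

    An alarm starts once the value has stayed above `threshold` for
    `hold` seconds, repeats every `repeat` seconds while it stays
    above, and ends when the value drops to the threshold or below.
    """
    events = []
    above_since = None   # time the value first went above the threshold
    last_alert = None    # time of the most recent start/repeat alert
    for now, value in samples:
        if value > threshold:
            if above_since is None:
                above_since = now
            color = "red" if value > red_above else "yellow"
            if last_alert is None and now - above_since >= hold:
                events.append((now, "start", color))
                last_alert = now
            elif last_alert is not None and now - last_alert >= repeat:
                events.append((now, "repeat", color))
                last_alert = now
        else:
            if last_alert is not None:
                events.append((now, "end", "reset"))
            above_since = None
            last_alert = None
    return events


# Hypothetical bottleneck probabilities sampled once a minute.
series = ([(0, 40), (60, 60), (120, 70), (180, 95)]
          + [(t, 80) for t in range(240, 840, 60)]
          + [(840, 30)])
events = run_alarm(series)
for event in events:
    print(event)
```

In this replay the value first exceeds 50 at 60 seconds, so the start alert fires at 180 seconds, the repeat alert fires 10 minutes later, and the reset is issued when the value falls back below the threshold.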

Here is a sampling of some of the useful system metrics that can be monitored with GlancePlus:

  • CPU utilization

  • CPU run queue length

  • Number of processors

  • Filesystem buffer cache queue length

  • Disk utilization and queue length

  • Physical memory capacity

  • Amount of physical memory available

  • Memory page fault rate

  • Total swap space

  • Amount of swap space available

  • Filesystem I/O rates

  • Amount of buffer cache available

  • Available shared memory

  • Available file table entries

  • Available process table entries

  • Most active processes

  • Wait states

  • System table resources

  • Open file information

More than 600 metrics are accessible from GlancePlus. Some of these metrics are discussed in other chapters. The complete list of metrics can be found by using the online help facility. This information can also be found in the directory /opt/perf/paperdocs/gp/C.

GlancePlus allows filters to be used to reduce the amount of information shown. For example, you can set up a filter in the Process view to show only the more active system processes.
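The effect of such a filter can be sketched in a few lines. This is only an illustration of the idea; the record layout, field names, and 5 percent cutoff are assumptions and do not reflect how GlancePlus stores or filters process data.

```python
def active_processes(processes, min_cpu_pct=5.0):
    """Return processes at or above `min_cpu_pct`, busiest first."""
    selected = [p for p in processes if p["cpu_pct"] >= min_cpu_pct]
    return sorted(selected, key=lambda p: p["cpu_pct"], reverse=True)


# Hypothetical process records.
process_table = [
    {"pid": 101, "name": "dbwriter", "cpu_pct": 42.0},
    {"pid": 102, "name": "cron", "cpu_pct": 0.1},
    {"pid": 103, "name": "httpd", "cpu_pct": 12.5},
]
for proc in active_processes(process_table):
    print(proc["pid"], proc["name"], proc["cpu_pct"])
```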

GlancePlus can also show short-term historical information. When selected, the alarm buttons, visible on the main GlancePlus screen, show a history of alarms that have occurred.

GlancePlus also shows Process Resource Manager behavior, if PRM is installed, and allows the PRM process group entitlements to be changed.

For further information, visit the HP Resource and Performance Management Web site at http://www.openview.hp.com/solutions/application/.

PerfView

PerfView is a graphical performance analysis tool from Hewlett-Packard. It is used to graphically display performance and system resource utilization for one system or multiple systems simultaneously, so that comparisons can be made. A variety of performance graphs can be displayed. The graphs are based on data collected over a period of time, unlike the real-time graphs of GlancePlus. This tool runs on HP-UX or NT systems and works with data collected by MeasureWare agents.

PerfView has the following three main components:

  • PerfView Monitor: Provides the ability to receive alarms. A textual description of an alarm can be displayed. Alarms can be filtered by severity, type, or source system. Also, after an alarm is received, the alarm can be selected to display a graph of related metrics. An operator can monitor trends leading to failures and then take proactive actions to avoid problems. Graphs can be used for comparison between systems and to show a history of resource consumption. An internal database is maintained that keeps a history of alarm notification messages.

  • PerfView Analyzer: Provides resource and performance analyses for disks and other resources. System metrics can be shown at three different levels: process, application (configured by the user as a set of processes), and global system information. It relies on data received from MeasureWare agents on managed nodes. Data can be analyzed from up to eight systems concurrently. All MeasureWare data sources are supported. PerfView Analyzer is required by both PerfView Monitor and PerfView Planner.

  • PerfView Planner: Provides forecasting capability. Graphs can be extrapolated into the future. A variety of graphs (such as linear, exponential, s-curve, and smoothed) can be shown for forecasted data.

PerfView can be used to monitor critical system resources. Figure 4-10 shows the PerfView Analyzer graphing memory utilization and paging rates. Other predefined graphs exist for history, CPU, memory, and queue information. For example, the history graph shows CPU, active processes, disk utilization, memory pageout rates, and swapout rates.
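As an illustration of the simplest forecasting style mentioned for the Planner, the sketch below fits a straight line to past observations by ordinary least squares and extrapolates it. It is not PerfView's algorithm; the function name and the utilization figures are assumptions made for the example.

```python
def linear_forecast(values, steps_ahead):
    """Fit y = a + b*x by least squares and extrapolate.

    `values` are observations at x = 0, 1, 2, ...; the result is the
    fitted value `steps_ahead` periods after the last observation.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept + slope * (n - 1 + steps_ahead)


# Hypothetical weekly disk utilization percentages.
history = [50.0, 52.0, 54.0, 56.0]
print(f"Projected utilization in 3 weeks: {linear_forecast(history, 3):.1f}%")
```

A straight line is only appropriate when growth is roughly constant; the exponential and s-curve options exist for resources that do not grow that way.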

Figure 4-10. PerfView graph showing memory utilization and paging rates.

The PerfView Analyzer graph shown in Figure 4-11 compares the performance of two systems simultaneously. Up to eight systems can be compared in one graph. Comparing system utilization can be useful when determining where to deploy new applications, or when adding new users.

Figure 4-11. PerfView graph comparing two systems.

PerfView's ability to show history and trend information can be helpful in diagnosing system problems. Graphing performance information can help you to understand whether a persistent problem exists or if an anomaly is simply a momentary spike of activity.
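The distinction between a persistent problem and a momentary spike can also be expressed numerically. The following sketch labels a series by the fraction of samples above a threshold; the function name, the 50 percent cutoff, and the sample data are assumptions chosen for illustration, and a graph remains the better tool for judging real data.

```python
def classify(samples, threshold, persistent_fraction=0.5):
    """Label a series 'normal', 'spike', or 'persistent'.

    A series is 'persistent' when at least `persistent_fraction` of
    its samples exceed `threshold`, and a 'spike' when only a few do.
    """
    over = sum(1 for value in samples if value > threshold)
    if over == 0:
        return "normal"
    if over / len(samples) >= persistent_fraction:
        return "persistent"
    return "spike"


# Hypothetical CPU utilization samples from two systems.
print(classify([20, 25, 95, 22, 21, 24], threshold=80))
print(classify([85, 90, 88, 92, 40, 91], threshold=80))
```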

To diagnose a problem further, PerfView Monitor lets users change the time interval to find the specific time at which a problem occurred. The graph is redrawn to show the new time period.

PerfView is integrated with several other monitoring tools. You can launch GlancePlus from within PerfView by accessing the Tools menu. PerfView can be launched from the IT/O Applications Bank as well. When troubleshooting an event in the IT/O Message Browser window, you can launch PerfView to see a related performance graph.

PerfView Monitor is not used with IT/O. Instead, the IT/O Message Browser is used. When an alarm is received in IT/O, the operator can click the alarm and a related PerfView graph can be shown.

PerfView can show information collected from multiple systems in a single performance graph. The PerfView and ClusterView products have also been integrated to enable the operator to select a cluster symbol on an HP OpenView submap and launch the PerfView application. This quickly shows a performance comparison between all systems in the cluster.

For further information, visit the HP Resource and Performance Management Web site at http://www.openview.hp.com/solutions/application/.

BMC PATROL for UNIX

BMC Software provides monitoring capabilities through its PATROL software suite. PATROL is a system, application, and event management suite for system and database administrators. PATROL provides the basic framework for defining thresholds, sending and translating events, and so forth. Optional products, called Knowledge Modules (KMs), are capable of monitoring specific components. For example, BMC PATROL includes KMs for UNIX, SAP R/3, Oracle, Informix, and other applications. In fact, more than 40 KMs are available from BMC for use with PATROL.

With the PATROL KM for UNIX, managed components include the CPU, memory, users, kernel, processes, printers, security, and filesystems. These components are discovered automatically and represented on the PATROL console with status icons. System utilization can be shown as graphs, to capture trends, and data can either be displayed in real time or saved in log files.

Like other graphical monitoring tools, PATROL provides an Event Manager window, which can show received events. Figure 4-12 highlights disk and NFS events received at the console.

Figure 4-12. PATROL Event Manager showing disk and NFS events.

For memory and swap resources, PATROL can show total real memory available, total virtual memory available, a list of swap devices, the number of processes swapped, and swap space utilization.

For the CPU, PATROL can show bottlenecks and utilization information, along with a variety of statistics, such as CPU idle time, run queue length, and swap queue length. Information about the operating system itself is also maintained, such as the name, version, and creation date.

PATROL can display the total number of processes, the number of zombie processes, and heavy CPU users. Through the PATROL console, you can perform administrative tasks, such as reprioritizing processes.

PATROL also can display the total number of users and sessions, and can check security by monitoring the number of failed user and privileged logins. You can check the printer queue to see how many jobs are in the queue and to determine the state of the printer.

PATROL can monitor the filesystem and can automatically determine the effectiveness of the buffer cache. Regular reports can be generated to check disk usage per user, to create a list of the largest files, or to list files that have not been accessed in a long time. Corrective actions, such as removing core files, can also be configured.
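Reports of this kind can be approximated with a short script when no monitoring product is available. The sketch below is not how PATROL produces its reports; the function names and the 180-day default are assumptions, and the access times it reads are only meaningful on filesystems that record them.

```python
import os
import stat
import time


def scan(root):
    """Yield (path, size_bytes, last_access_epoch) for regular files."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.lstat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if stat.S_ISREG(info.st_mode):
                yield path, info.st_size, info.st_atime


def largest_files(root, count=10):
    """Return the `count` largest files under `root`, biggest first."""
    return sorted(scan(root), key=lambda f: f[1], reverse=True)[:count]


def stale_files(root, max_idle_days=180, now=None):
    """Return files not accessed within the last `max_idle_days` days."""
    now = time.time() if now is None else now
    cutoff = now - max_idle_days * 86400
    return [f for f in scan(root) if f[2] < cutoff]
```

For example, largest_files("/var", 20) would list the twenty largest files under /var, and stale_files("/home", 365) would list files under /home that have not been read in a year. Both only report; any corrective action, such as removing core files, is left to the administrator.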

In addition to the system metrics monitored by PATROL, the KM for UNIX includes a set of tools that provide additional system monitoring: tools to monitor CPU usage, paging activity, I/O caching, swap activity, and system log files; tools to check filesystem and kernel file resources; and tools to monitor printer queues.

The following list shows some of the parameters available for monitoring from the PATROL KM for UNIX:

  • CPUCpuUtil

  • CPUIdleTime

  • CPUInt

  • CPULoad

  • CPUProcsWaiting

  • CPUProcSwch

  • CPURunQSize

  • CPUSysTime

  • CPUUserTime

  • KERSysCall

  • MEMActiveVirPage

  • MEMFreeMem

  • MEMPageAnticipated

  • MEMPageFreed

  • MEMPageIn

  • MEMPageOut

  • MEMPageScanned

  • PRNQlength

  • PROCAvgUsrProc

  • PROCCpuHogs

  • PROCNoZombies

  • PROCNumProcs

  • PROCProcWait

  • PROCUserProcs

  • SWPSwapFreeSpace

  • SWPSwapIn

  • SWPSwapOut

  • SWPSwapSize

  • SWPSwapUsedPercent

  • USRNoSession

  • USRNoUser

The BMC PATROL KM for UNIX is supported on Bull, DG AViiON, DEC Alpha, DEC Ultrix, Hewlett-Packard, NCR, Olivetti, OSF/1, Pyramid, RS/6000, SCO, Sequent, SGI, Sun Solaris, SunOS, Unisys, and UNIXWare systems.

Candle

The Candle Corporation provides software for mainframes and distributed systems. The Availability Command Center is a suite of integrated performance monitors and availability management solutions. The Candle Command Center for Distributed Systems is used to manage the performance and availability of computer systems and applications. Command Center solutions are available for UNIX, NT, IBM AIX, and MVS platforms. The Command Center for Distributed Systems can monitor many systems from a single console.

Candle's management agents provide detailed performance and availability metrics. The OMEGAMON Monitoring Agent for UNIX provides system information standardized across multiple UNIX platforms (IBM AIX, HP-UX, Sun Solaris, and SunOS). Available metrics include OS and CPU performance, process status, and disk performance. Disk performance is expressed as kilobytes per second, percent busy, and transfers per second. Disk performance and other tools can be launched from the Command Center console.

The Command Center provides some predefined threshold conditions for sending alerts. You also can change these conditions. If you decide to change the threshold conditions, they are automatically redistributed to the appropriate systems. Different alarm severity levels can be used.

The Command Center's event correlation engine and Visual Policy Editor can be used to create rules that automatically recognize the symptoms of problems and develop automated responses.

Candle has performed additional testing of the Command Center with MC/ServiceGuard to ensure that its Command Center for Distributed Systems product runs in that environment. More information about Candle Corporation's products can be found on the Web at http://www.candle.com.

UNIX Fault Management: A Guide for System Administrators
ISBN: 013026525X
Year: 1999
Pages: 90
