Using System Performance Data


This section provides a brief introduction to using performance monitoring tools to avoid, identify, and address system performance problems. An extensive tutorial on system performance is beyond the scope of this book.

A bottleneck in one system resource can render other system resources unusable. You need to ensure that all system components have sufficient capacity to operate at their optimal level. You can use performance data to avoid bottlenecks, to detect trends and establish appropriate resource entitlements for each application, and to help eliminate problems when they do occur.

This section does not discuss how to troubleshoot network performance issues; that topic is covered in Chapter 6.

Note that performance monitoring itself can create problems in your environment. Sending regular performance data from each system to a central location could result in hundreds of megabytes per day of network traffic and data storage for a medium-sized company. You should make sure that all the data you are collecting is going to be used. Instead of sending all data, send only the unusual or exceptional information, but send enough to identify trends for capacity planning. You should also store a fixed amount of detailed performance data locally on each system so that you can troubleshoot problems when they appear.

Avoiding Performance Issues

The first step that you can take toward avoiding performance problems is to establish baselines for your environment. Collect performance data when your system is performing well, for long enough to get a valid representation of your system's workload, so that you have something to contrast with a poorly performing system.
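
For example, on systems that provide sar, you can capture a binary baseline during a known-good period and replay it later for comparison; the interval, count, and file location below are only illustrative.

    # Sample every 5 minutes for 24 hours and save the data in binary form
    sar -o /var/adm/perf/baseline.sa 300 288 > /dev/null
    # Later, replay the CPU utilization portion of the baseline
    sar -u -f /var/adm/perf/baseline.sa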

Next, you should see whether the CPU, memory, and I/O resources are well-balanced. You should also do capacity planning, to ensure that your system has sufficient headroom to support any additional users and applications that you may be expecting. If excess capacity is not available, you should develop a plan for addressing future growth.

Another area to check is the allocation of system resources. Use the sar and sysdef commands, for example, to see whether any resources are at their configured limits. Check the available swap space and entries in the file and process tables to see whether these are sized appropriately. Use EMS to set up early warnings as the usage of other system resources increases. Because changing these limits often requires that you restart the system, early detection can allow you to plan for the time when the system will be unavailable.
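
As a brief illustration, the following commands report table utilization, configured kernel limits, and swap usage; the swapinfo options shown are typical of HP-UX, but verify them against your release's manual pages.

    sar -v 5 3        # process, inode, and file table usage versus configured limits
    sysdef            # list kernel tunable parameters and their current values
    swapinfo -tam     # swap configuration and usage, totaled and reported in MB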

Another way to protect system resources is to use the Process Resource Manager (PRM), a resource management tool used to balance system resources among PRM groups. PRM groups are configured by the administrator and consist of a set of HP-UX users or applications. PRM is then used to give each PRM group a certain percentage of the CPU, real memory, or disk I/O bandwidth available on the system. PRM ensures that each PRM group gets a minimum percentage of the system's resources, even during heavy loads.
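
The commands below are a quick way to review how the groups and entitlements are set up and whether each group is staying within them; they are shown without options because the exact options vary by PRM release.

    prmlist       # display the configured PRM groups and their entitlements
    prmmonitor    # display current resource usage by PRM group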

PRM can be used in conjunction with HP GlancePlus to adjust system configuration. For example, if an administrator detects unwanted system load for a PRM group, GlancePlus can be used to lower that group's entitlement dynamically.

Normally, if one PRM group doesn't need its system resources, PRM allocates them to other groups that may need them. However, PRM can also help with capacity planning, by allowing resource maximums to be specified. Thus, if an administrator knows that a system will soon have 25 percent more users, the administrator can allocate a maximum of 80 percent of system resources to simulate the upcoming load.

Although PRM can ensure that users get a certain percentage of CPU resources, it can't prevent all system performance problems. For example, an application sending large network packets but using very little CPU resources can starve a more critical application, because network bandwidth is not controlled by PRM.

PRM can also be used to adjust workload dynamically in a high availability environment. For example, if three MC/ServiceGuard packages are each running with similar PRM entitlements, and one package fails over to another system, this can be automatically detected, and a new PRM configuration can be applied, giving the two remaining packages higher entitlements.
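
A sketch of the idea follows, assuming the failover logic copies an alternate configuration file into place and that prmconfig -i reinitializes PRM; the file names are hypothetical.

    # Switch to a two-package PRM configuration after a failover
    cp /etc/prmconf.2pkgs /etc/prmconf    # hypothetical alternate configuration file
    prmconfig -i                          # reinitialize PRM so the new entitlements take effect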

Despite these efforts, you still are likely to have some performance problems to investigate. The next sections describe how to use the data collected by the various performance monitoring tools to address performance issues.

Detecting CPU Contention

UNIX commands, such as top and uptime, and performance monitoring tools, such as GlancePlus, provide CPU utilization information. CPU utilization and run queue length can be used together to determine whether a CPU bottleneck exists. High CPU utilization alone may not be indicative of a problem; batch jobs may simply be consuming the CPU cycles left over by interactive users. However, if interactive users are getting poor response times, that points to a problem such as a CPU bottleneck.

If the run queue is greater than one, the likelihood that a CPU bottleneck exists increases as the CPU utilization gets closer to 100 percent. Make sure that the high utilization and large run queue are sustained for a period of time.
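
For example, the following commands show the load average, run queue, and CPU utilization over a sustained interval; the sampling values are only illustrative.

    uptime           # load averages give a quick view of the run queue
    sar -q 60 10     # runq-sz and %runocc sampled once a minute for ten minutes
    sar -u 60 10     # %usr, %sys, %wio, and %idle over the same period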

If a CPU bottleneck is identified, recovery may depend on the applications and processes consuming large amounts of CPU. This can be determined by using performance monitoring tools such as GlancePlus. Applications spending the majority of their time in system code may need to be changed. In some cases, an application can be recompiled, optimized, or restructured to improve its performance. If batch processing is causing a problem, a job scheduler can be used to route jobs to less utilized systems. Less important applications, such as batch processes, can also be reconfigured to run at a lower priority by using nice. An application may need to be aborted or moved to another system if it continually thrashes with other applications. Tools such as PRM can be enabled or reconfigured to handle resource allocation among applications or users. PRM can keep applications within configured CPU limits.
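
For example, a batch job can be started at reduced priority with nice, or an already-running process can be demoted with renice; the job name and process ID below are hypothetical.

    nice -n 10 nightly_report    # start a hypothetical batch job at lower priority
    renice -n 10 -p 4321         # lower the priority of an existing process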

Checking System Resource Usage

This chapter has described a variety of tools to monitor system resource usage. System table utilization can be checked by using tools such as GlancePlus. Using EMS monitors to set up thresholds is another useful approach.

The number of processes allowed and the number of concurrent open files allowed are two parameters that should be checked and that can be reconfigured using SAM.
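
The sar -v report is a simple way to compare current table usage against these configured limits; exact column names vary slightly between UNIX versions.

    sar -v 5 3
    # proc-sz and file-sz show current versus maximum entries in the process
    # and open-file tables; the ov columns count table overflows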

Many actions to correct this type of problem require restarting the system, but if the problem is due to a runaway application, you may be able to detect the problem before other applications are affected. You can abort the application before system resources are depleted.
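
For example, once GlancePlus or top identifies the runaway process, it can be stopped from the command line; the process ID here is hypothetical.

    kill 4321        # request a normal termination first
    kill -9 4321     # force termination only if the process ignores the default signal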

Detecting Memory and Swap Contention

To check for a real memory bottleneck on the network server, you can first check the amount of free memory. It should not drop below 5 percent of the total available. If the system cannot keep up with the demands for memory, it will start paging and swapping. Excessive paging and swapping, viewed from GlancePlus, may be a sign of a memory bottleneck. Two other signs that may indicate a memory bottleneck are a high percentage of processes blocked on virtual memory and large disk queues on swap devices.
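
The following commands give a quick view of paging and swapping activity; sustained page-out and swap-out rates, rather than a single sample, are what suggest memory pressure.

    vmstat 5 5       # watch the free and po (page-out) columns for sustained paging
    sar -w 60 5      # swpot/s reports processes swapped out per second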

To lower the swap rate, you may want to configure a higher percentage of available disk space for swapping. Increasing the capacity of the system by adding more memory or disk space may also eliminate the bottleneck. PRM can be used to ensure that the most important applications get a sufficient percentage of the memory.
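
For instance, additional device swap can be enabled with swapon on HP-UX; the logical volume name is hypothetical, and a matching /etc/fstab entry is needed for the space to be enabled at boot.

    swapon /dev/vg00/lvswap2    # enable a hypothetical logical volume as device swap
    swapinfo -tam               # confirm the additional swap space is available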

If the amount of memory being used seems unusually high, you can use performance tools to determine which processes are using the most memory. A program may need to be redesigned to use memory more efficiently, or examined for memory leaks.
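
For example, the XPG4 options to ps can list processes by virtual size; on HP-UX, these options are enabled by setting the UNIX95 variable for the command.

    UNIX95= ps -e -o vsz,pid,args | sort -rn | head    # largest virtual sizes first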

Detecting Disk and File System Bottlenecks

System, application, and disk information should be studied together to resolve disk performance issues. MeasureWare provides a lot of information about an application's disk utilization, which may need to be correlated with system data.

To avoid disk bottlenecks, you need to balance I/O across filesystems, disk spindles, and disk controllers to reduce uneven queuing and delays. Performance monitoring tools such as GlancePlus can be used to find the process with the highest I/O rate, and also the busiest physical disk. Checking the I/O rate only is insufficient, because a slower device has a higher utilization than a faster disk with the same I/O rate. If a single disk has greater than 50-percent utilization for an extended period of time, it may be an indication of an I/O bottleneck. The percentage should be compared with that of other disks, to see whether a severe load imbalance exists. However, a high utilization is not sufficient to identify a problem. The disk may still be capable of handling more I/O. A continually long disk queue length is also needed to indicate a problem. Heavily used disks are likely to have large disk queue lengths as well.
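
The sar -d report provides both of these indicators per device; the sampling values below are only illustrative.

    sar -d 60 5      # %busy and avque per device; sustained %busy above 50 percent
                     # together with a growing avque marks the disks worth investigating
    iostat 60 5      # per-disk throughput, for comparing the load across spindles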

Both BMC PATROL and MeasureWare collect read cache hit ratio information. The percentage of logical reads satisfied by the system's buffer cache indicates whether the cache size is configured correctly. Because increasing the cache size reduces the system memory available for other purposes, the appropriate cache hit ratio depends on the type of workload being run on the system. For I/O-intensive applications, you may want to configure your system such that this ratio is as high as 90 or 95 percent. Similarly, you may want to ensure that your write cache hit ratio is at least 75 percent. If your hit rates are too low, the system buffer cache may be too small.
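
On systems with sar, the -b report gives these ratios directly, so they can be compared against the targets above; the sampling values are only illustrative.

    sar -b 60 5
    # %rcache and %wcache are the read and write cache hit ratios; compare them
    # with the 90-95 percent read and 75 percent write guidelines discussed above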

After you determine that the system buffer cache is too small, you can increase its size on HP-UX by using SAM. Select the Configurable Parameters option from the Kernel Configuration functional area. The appropriate parameter to modify depends on whether a static or dynamic buffer cache is being used, which can also be checked on this screen. Fixed-size buffer caches are most effective if the environment and workload are static. Dynamic buffer caches fluctuate in size based on the demands for I/O or virtual memory, and are useful when workloads vary. If the nbuf and bufpages system parameters are set to 0, a dynamic buffer cache is in use. When using a dynamic buffer cache on systems with greater than 1GB of real memory, you should lower the maximum size below 50 percent, because caches greater than 500MB actually cause performance degradations.
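
On HP-UX 11, the same parameters can also be inspected from the command line with kmtune (earlier releases expose them through SAM); the grep pattern is only a convenience.

    kmtune | grep -E 'nbuf|bufpages|dbc_'
    # nbuf and bufpages both 0 indicate a dynamic buffer cache;
    # dbc_max_pct and dbc_min_pct bound its size as a percentage of real memory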

Detecting disk contention is discussed in Chapter 5. If no problem seems to exist with the CPU, memory, or disk, other possibilities include networking or system table utilization. Checking network utilization is discussed in Chapter 6.
