Using Disk Performance Data | UNIX Fault Management: A Guide for System Administrators


To address disk performance problems, you first need to collect the appropriate data. MeasureWare and BMC PATROL are two examples of tools that collect the necessary performance metrics. This section provides a brief overview of how to interpret the data to determine whether a disk performance problem exists.

This chapter shows only those metrics related directly to disks. However, you should also study system and application-related performance information, because disk information alone may not be sufficient to solve a problem. MeasureWare provides detailed information about an application's disk utilization, which may need to be correlated with system data. For example, the root cause of a disk performance problem may be an I/O-intensive application that is not supposed to be running during production hours. Looking at application and system data in conjunction with disk data can help to find the culprit.

All performance monitoring tools can provide system CPU utilization information. If an application's response time is poor, and the system's CPU utilization is less than 95 percent, this may be an indication that a disk bottleneck exists. Other important metrics are disk utilization, disk queue lengths, and the time the system spends waiting for I/O to complete.

To avoid disk bottlenecks, you need to balance I/O across filesystems, disk spindles, and disk controllers to reduce uneven queuing and delays. Performance monitoring tools, such as GlancePlus, can be used to find the process with the highest I/O rate, and also the busiest physical disk. Checking only the I/O rate is insufficient, because a slower disk will have a higher utilization than a faster disk with the same I/O rate. If a single disk has greater than 50-percent utilization for an extended period of time, this may be an indication of an I/O bottleneck. The percentage should be compared with that of other disks to see whether a severe load imbalance exists. However, high utilization alone is not sufficient to identify a problem, because the disk may still be capable of handling more I/O. A persistently long disk queue must also be present to indicate a problem. Heavily used disks are also likely to have long disk queue lengths.
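The rule of thumb above (sustained utilization over 50 percent together with a persistently long queue) can be sketched as a small check over collected samples. This is only an illustration: the `DiskSample` type, the queue-length threshold of 3, and the 80-percent "sustained" fraction are assumptions chosen for the example, not values from MeasureWare, GlancePlus, or BMC PATROL.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class DiskSample:
    """One measurement interval for a single disk (hypothetical structure)."""
    utilization_pct: float  # percentage of the interval the disk was busy
    queue_length: float     # average number of requests waiting


def is_suspected_bottleneck(
    samples: Sequence[DiskSample],
    util_threshold: float = 50.0,   # from the text: more than 50 percent busy
    queue_threshold: float = 3.0,   # assumed value; tune for your hardware
    min_fraction: float = 0.8,      # assumed meaning of "extended period"
) -> bool:
    """Flag a disk only when BOTH utilization and queue length stay high."""
    if not samples:
        return False
    hot = sum(
        1
        for s in samples
        if s.utilization_pct > util_threshold and s.queue_length > queue_threshold
    )
    return hot / len(samples) >= min_fraction
```

A disk that is 80 percent busy but has almost no queue is not flagged, which matches the point that high utilization by itself does not prove a problem.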

Both BMC PATROL and MeasureWare collect read cache hit ratio information. The proportion of logical reads that are satisfied by the system's buffer cache indicates whether the cache size was configured correctly. Because increasing the cache size reduces the system memory available for other purposes, the appropriate cache hit ratio depends on the type of workload being run on the system. For I/O-intensive applications, you may want to configure your system such that this ratio is as high as about 90 or 95 percent. Similarly, you may want to ensure that your write cache hit ratio is at least 75 percent. If your hit rates are too low, the system buffer cache may be too small.
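As a rough illustration of the metric, the read cache hit ratio can be computed from counts of logical and physical reads over the same interval. The function name and its inputs are assumptions for this sketch; the monitoring tools mentioned above report the ratio directly.

```python
def read_cache_hit_ratio(logical_reads: int, physical_reads: int) -> float:
    """Return the percentage of logical reads served from the buffer cache.

    A logical read that does not require a physical disk read counts as a hit.
    """
    if logical_reads <= 0:
        raise ValueError("logical_reads must be positive")
    # Clamp at zero in case counters were sampled at slightly different times.
    hits = max(0, logical_reads - physical_reads)
    return 100.0 * hits / logical_reads


# Example: 1000 logical reads, 80 of which had to go to disk.
ratio = read_cache_hit_ratio(1000, 80)
print(f"read cache hit ratio: {ratio:.1f}%")
```

For an I/O-intensive workload, a result well below the 90 to 95 percent range discussed above would suggest looking at the buffer cache size.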

After you determine that the system buffer cache is too small, you can increase its size on HP-UX by using SAM. Select Configurable Parameters from the Kernel Configuration functional area. The appropriate parameter to modify depends on whether a static or dynamic buffer cache is being used, which you can also check on this screen. Fixed-size buffer caches are most effective if the environment and workload are static. Dynamic buffer caches fluctuate in size based on demands for I/O or virtual memory, and are useful when workloads vary. If the nbuf and bufpages system parameters are both set to zero, then a dynamic buffer cache is in use. When using a dynamic buffer cache on systems with more than 1GB of real memory, you should lower the maximum cache size below 50 percent of memory, because caches larger than 500MB can actually degrade performance.
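The arithmetic behind that guideline can be sketched as follows. The helper below is hypothetical and simply computes the largest percentage of real memory that keeps a dynamic cache at or below 500MB, assuming (as the text implies) that 50 percent is the starting maximum. It does not read or change any kernel parameter.

```python
def max_cache_pct(
    real_memory_mb: float,
    cache_limit_mb: float = 500.0,  # from the text: larger caches may hurt
    default_pct: float = 50.0,      # assumed starting maximum
) -> float:
    """Largest percentage of real memory that keeps the cache within the limit."""
    if real_memory_mb <= 0:
        raise ValueError("real_memory_mb must be positive")
    return min(default_pct, 100.0 * cache_limit_mb / real_memory_mb)


for memory_mb in (512, 2048, 4096):
    print(f"{memory_mb} MB real memory -> cap at {max_cache_pct(memory_mb):.1f}%")
```

On a system with 2GB of real memory, for instance, the cap works out to roughly 24 percent rather than 50.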

The Process Resource Manager (PRM) can be used to control system CPU, real memory, and I/O bandwidth allocations between users and applications. Users or applications are assigned to PRM groups, and each group is configured with the desired system resource entitlement. This can help to provide application isolation, by preventing one application from affecting the performance of another application. CPU controls alone are often sufficient, but sometimes, an I/O-intensive application may be using very little CPU. Other applications that need to use the same I/O interface may be starved. If multiple applications are sharing access to a volume group, PRM can be used to allocate an appropriate minimum entitlement of disk bandwidth to each. PRM can throttle disk throughput for PRM groups by ordering logical I/O requests, so that requests for lower-priority processes are delayed, allowing requests from higher-priority processes to get through. Resource capping isn't available for I/O bandwidth, so an application can use all of the available bandwidth if no other applications need it.
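The effect of minimum entitlements without capping can be illustrated with a simple model. The sketch below is not PRM's actual scheduling algorithm; it is a generic proportional-share calculation, with hypothetical group names, that shows how unused entitlement flows to groups that still have demand.

```python
from typing import Dict


def allocate_bandwidth(
    entitlements: Dict[str, float],  # group -> share (for example, percent)
    demands: Dict[str, float],       # group -> throughput the group wants
    capacity: float,                 # total throughput available
) -> Dict[str, float]:
    """Share capacity in proportion to entitlement, with no upper cap."""
    if any(share <= 0 for share in entitlements.values()):
        raise ValueError("entitlements must be positive")
    alloc = {g: 0.0 for g in entitlements}
    active = {g for g in entitlements if demands.get(g, 0.0) > 0}
    remaining = capacity
    while active and remaining > 1e-9:
        total_share = sum(entitlements[g] for g in active)
        offers = {g: remaining * entitlements[g] / total_share for g in active}
        satisfied = set()
        for g in active:
            need = demands[g] - alloc[g]
            if need <= offers[g]:
                alloc[g] += need       # fully served; leftover is redistributed
                satisfied.add(g)
        if not satisfied:
            for g in active:
                alloc[g] += offers[g]  # everyone is limited by their share
            break
        remaining = capacity - sum(alloc.values())
        active -= satisfied
    return alloc


shares = {"oltp": 70.0, "batch": 30.0}
print(allocate_bandwidth(shares, {"oltp": 0.0, "batch": 100.0}, 100.0))
print(allocate_bandwidth(shares, {"oltp": 100.0, "batch": 100.0}, 100.0))
```

In the first call the idle "oltp" group leaves all bandwidth to "batch"; in the second, contention limits each group to its entitlement.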

PRM controls disk bandwidth only for disks under LVM. Determining the best entitlements to configure may be difficult and may require some experimentation. Before enabling a disk bandwidth configuration in PRM, you can collect data on the current usage of different PRM groups. This can be done by first configuring the CPU entitlements in PRM with no disk records specified. The prmconfig and prmmonitor commands can then be used to collect current usage statistics for a specified volume group. PRM is available only on HP-UX.

If GlancePlus is available, PRM behavior can be shown on the GlancePlus screens. For example, the actual I/O bandwidth used by each PRM group can be shown. Also, GlancePlus can be used to change PRM resource group entitlements.

These are just a few examples of how disk information can be used to address performance problems.
