I/O Utilization

Although processor speeds, memory sizes, and I/O speeds continue to increase, I/O throughput and latency remain orders of magnitude worse than equivalent memory accesses. Because many workloads have a substantial I/O component, I/O can easily become a significant bottleneck to overall throughput and application response times. For I/O-intensive applications, the performance analyst needs tools that provide insight into the operation of the I/O subsystem.

This section looks first at disk I/O; a later section covers network I/O, because the tools used to measure throughput and latency differ for each.

For disk I/O, performance is often evaluated in terms of throughput and latency. Disk drives tend to handle large sequential transfers much better than small random transfers. Large sequential transfers permit optimizations such as read-ahead and write-behind, and allow the storage system to reduce head movement and perform full-track writes when possible. However, many applications rely on the capability to access data in disparate, often unpredictable locations on the media. As a result, the I/O patterns of various workloads are often a mix of sequential and random I/O, with varying block transfer sizes.

When evaluating I/O performance on a system, the performance analyst needs to keep several things in mind. The first, and perhaps most obvious (although it is often forgotten), is that I/O performance cannot exceed the performance of the underlying hardware. Although we will not go into much detail on this aspect, it is very helpful when analyzing I/O throughput and latency to understand the system's underlying limitations. For instance, the performance characteristics of the storage devices, the I/O buses connecting the storage devices (for example, SCSI and Fibre Channel), any limitations imposed by the storage fabric (such as a Fibre Channel switch), the host bus adapters, the system's interconnect bus (for example, PCI, PCI-X, and InfiniBand), and in some cases the system's architecture (such as how host bus adapters perform DMA, how memory is interleaved, and NUMA connectivity) all factor into the overall performance of the I/O subsystem. In more complex system configurations, it may be helpful to maintain a list of "speeds and feeds" for the system to help distinguish between hardware bottlenecks and software bottlenecks.

To understand software bottlenecks and their performance impact, the two major considerations are overall I/O throughput and the latency of individual I/O requests. Ideally, a system wants to optimize the data transfer rate to and from the media. However, because the latency of an individual request can be extremely long compared to the speed of the processor, an application can effectively stall waiting for I/O. Take, for example, an application that reads a block of data that tells it how to access the next block of data, and so on. If the system or application is unable to optimize this pattern, performance is limited by the combined latencies of the I/O subsystem. One common solution to this problem is to perform many such operations simultaneously. Multitasking, the capability of many tasks to run in parallel, allows an application or operating system to keep many long-latency I/O requests in flight at once, even when each individual task spends much of its time blocked on a single request. As a result, the overall efficiency of the I/O subsystem, or the total I/O throughput, can approach the capacity of the underlying I/O subsystem. And although it is always a goal of the operating system and the individual applications to optimize for overall system throughput, doing so at the expense of end-user response times is not usually an acceptable trade-off.
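As a rough illustration (the numbers here are hypothetical), suppose each disk request takes about 5 ms to complete. An application that issues its requests strictly one after another can complete at most 1/0.005 = 200 I/O operations per second, no matter how fast the device can stream data. If ten such request streams run concurrently, the ceiling rises to roughly 2,000 operations per second, provided the device, controller, and buses can sustain that rate.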

The underlying I/O subsystem latencies can be severely affected by the incoming pattern of data transfer requests. For instance, if disk I/O requests alternately ask for a block at the "beginning" and at the "end" of the disk media, the physical disk arm must make relatively slow adjustments to position the disk head over the selected block. This type of access pattern slows down all accesses to the device, reducing the number of I/O operations that can be completed over a period of time and, with it, the overall I/O transfer rate.

A technique complementary to multitasking is to ensure that the data requests from the applications and operating system are well distributed among the disks connected to the system. Distributing the I/O requests across multiple disks introduces a level of parallelism that further reduces the performance impact of the latencies of individual disk drives. Redistributing the application data among a number of disk devices often requires a solid understanding of the workload as well as of its data access patterns.

Although system monitoring tools cannot track each I/O that a particular application issues, there are tools that let a performance analyst monitor the total number of I/Os processed by the system, the number of I/O operations per logical disk drive, and the overall I/O transfer rate. The two primary tools discussed in the next sections are iostat(1) and sar(1). You can use these tools to understand where the I/O bottlenecks are, which disks or interconnects are underutilized, and what the latencies are from a system perspective (as opposed to an application perspective).

Before exploring the specific tools, keep in mind that there are many techniques for increasing I/O performance. These techniques include purely hardware-related solutions, such as disk drives with higher rotational speeds (which provide lower I/O latencies and improved data transfer rates for both reads and writes) and larger disk or I/O controller caches. They also include increases in I/O bus or I/O fabric speeds, which both increase data transfer rates and reduce I/O latencies. In addition, some disk drives and disk storage subsystems provide multiported logical or physical disks, allowing parallel I/O to a single disk, which again increases potential I/O throughput. Finally, hardware and software RAID (Redundant Array of Independent Disks) is designed to increase access parallelism by striping data across multiple disk drives.
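As one illustration of the software striping approach, a RAID-0 array can be built on Linux with the mdadm tool (not covered further in this chapter). The following is a minimal sketch; the device names /dev/sdb and /dev/sdc, the chunk size, and the mount point are hypothetical placeholders:

    # Stripe two disks into a single RAID-0 device (no redundancy, more parallelism)
    mdadm --create /dev/md0 --level=0 --raid-devices=2 --chunk=64 /dev/sdb /dev/sdc

    # Build a filesystem on the striped device and mount it
    mkfs.ext3 /dev/md0
    mount /dev/md0 /data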

The tools discussed in the following sections provide data that can be useful in considering hardware and software solutions as well as improvements in data layout.

iostat

The iostat command monitors system I/O activity by observing how long the physical disks are active in relation to their average transfer rates. The iostat command generates reports that can be used to change the system configuration to better balance the I/O load among physical disks. iostat(1) also reports CPU utilization, which can sometimes be useful to compare directly against the I/O activity. If no display interval is given, iostat reports I/O statistics accumulated since the system was last booted. If a display interval is given, the first set of output represents total activity since boot time, and subsequent displays show only the delta activity. The following display corresponds to copying files from /dev/sdo7 to /dev/sds7, /dev/sdp7 to /dev/sdt7, and /dev/sdr7 to /dev/sdu7:

     avg-cpu:  %user   %nice    %sys %iowait   %idle
                0.21    0.00    0.80    2.07   96.92

     Device:    tps  Blk_read/s  Blk_wrtn/s   Blk_read   Blk_wrtn
     sdx       0.00        0.00        0.00         32          0
     sdw       0.00        0.00        0.00         32          0
     sdv       0.00        0.00        0.00         32          0
     sdu       2.49        0.05     1443.46       2778   79552392
     sdt       4.94        0.10     2871.73       5322  158268008
     sds       4.95        0.10     2860.91       5330  157671720
     sdr      30.20     1518.55        0.42   83690898      23288
     sdq      60.25     2902.76        0.92  159978258      50896
     sdp       0.00        0.01        0.00        378         24
     sdo      59.49     2883.87        0.90  158937034      49520

iostat reports CPU utilization in a form similar to that of the top tool, splitting CPU time into user, nice, system, I/O wait, and idle. This is followed by the disk utilization report. The disk header is followed by lines of disk statistics, where each line reports the activity of one configured logical disk. The tps column gives the number of I/O requests per second issued to the logical disk; the size of those requests is not given. Blk_read/s and Blk_wrtn/s give the amount of data read from and written to the logical drive in blocks per second; again, the block size is not given. Finally, Blk_read and Blk_wrtn give the cumulative number of blocks read from and written to the logical drive, again without specifying the block size.

The -k option displays statistics in kilobytes instead of blocks, the -p option reports per-partition statistics, and the -x option reports extended information such as average wait time and average service time. In the data above, the Blk_read count of sdo is very close to the Blk_wrtn count of sds, as data is copied from sdo7 to sds7. In addition, the rate of reads is slightly higher than the rate of writes. A report like this can highlight disk I/O bottlenecks, if any, and helps database designers lay out data to achieve higher access parallelism.
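For reference, the following invocations illustrate these options. The interval and count values and the device name sda are arbitrary examples, not values taken from the report above:

    iostat 10 5          # 5 reports at 10-second intervals; the first covers time since boot
    iostat -k 10 5       # same, with throughput reported in kilobytes per second
    iostat -x -k 10 5    # extended statistics, including average wait and service times
    iostat -p sda 10 5   # per-partition statistics for device sda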

sar

sar is included in the sysstat package. sar collects and reports a wide range of operating system activities, including I/O operations, CPU utilization, the rate of context switches and interrupts, the rate of paging in and out, and the use of shared memory, buffers, and the network. Based on the values of the interval and count parameters, sar writes information the specified number of times, spaced at the specified interval in seconds. For example, the command sar -b 3 12 reports disk usage every 3 seconds, 12 times. In addition, at the end of data collection, average statistics are given. sar is a very option-rich tool. The remainder of this section discusses a few features of the tool:

  • sar displays I/O statistics similar to iostat. sar reports the rate of I/O transfers (tps), which is further split into read operations (rtps) and write operations (wtps). sar also reports the rates of reads and writes, in blocks per second, under bread/s and bwrtn/s. The following data is collected every 2 seconds for a total of 18 seconds. At the end of the data collection, the averages of the five fields are given. However, statistics for individual logical drives are not given.

     12:59:15       tps     rtps     wtps   bread/s   bwrtn/s
     12:59:17     37.50    37.50     0.00    396.00      0.00
     12:59:19     66.50    66.50     0.00  16140.00      0.00
     12:59:21    268.50   268.50     0.00  66560.00      0.00
     12:59:23    333.50   261.50    72.00  64548.00   9620.00
     12:59:25    153.50    40.50   113.00   9728.00  27984.00
     12:59:27    133.00     5.00   128.00   1024.00  31744.00
     12:59:29    119.50     7.50   112.00   1536.00  27776.00
     12:59:31    133.00     5.00   128.00   1024.00  31744.00
     Average:    155.63    86.50    69.13  20119.50  16108.50

  • sar provides data on CPU utilization for individual processors as well as for the whole system. This feature is especially useful in multiprocessor environments. If some processors do more work than others, the display shows it clearly. You can then check whether the imbalanced use of processors comes from, for example, the applications or the kernel scheduler. The following data is collected from a four-way SMP system every 5 seconds:

     11:09:13   CPU  %user  %nice  %system  %iowait  %idle
     11:09:18   all   0.00   0.00     4.70    52.45  42.85
     11:09:18     0   0.00   0.00     5.80    57.00  37.20
     11:09:18     1   0.00   0.00     4.80    49.40  45.80
     11:09:18     2   0.00   0.00     6.00    62.20  31.80
     11:09:18     3   0.00   0.00     2.40    41.12  56.49
     11:09:23   all   0.00   0.00     3.75    47.30  48.95
     11:09:23     0   0.00   0.00     5.39    37.33  57.29
     11:09:23     1   0.00   0.00     2.80    41.80  55.40
     11:09:23     2   0.00   0.00     5.40    41.60  53.00
     11:09:23     3   0.00   0.00     1.40    68.60  30.00
     . . .
     Average:   all   0.00   0.00     4.22    16.40  79.38
     Average:     0   0.00   0.00     8.32    24.33  67.35
     Average:     1   0.00   0.00     2.12    14.35  83.53
     Average:     2   0.01   0.00     4.16    12.07  83.76
     Average:     3   0.00   0.00     2.29    14.85  82.86

  • sar also reports how interrupts are distributed among the processors:

     10:53:53  CPU  i000/s  i001/s  i002/s  i003/s  i004/s  i005/s  i006/s  i007/s
     10:53:58    0 1000.20    0.00    0.00    0.40    0.00    0.00    3.00    0.00
     10:53:58    1    0.00    0.00    0.00    0.00    0.00    0.00    0.00 2320.00
     10:53:58    2    0.00    0.00    0.00    0.00 1156.00    0.00    0.00    0.00
     10:53:58    3    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
     Average:    0  999.94    0.00    0.00    1.20  590.99    0.00    3.73    0.00
     Average:    1    0.00    0.00    0.00    0.00    0.00    0.00    0.00  926.61
     Average:    2    0.00    0.00    0.00    0.00  466.51    0.00    0.00 1427.48
     Average:    3    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00

The study of interrupt distribution can reveal an imbalance in interrupt processing. One method of tackling this imbalance is to affinitize IRQ processing to a specific processor or set of processors by writing a CPU mask to /proc/irq/<ID>/smp_affinity, where ID is the IRQ number assigned to the device. For example, if 0x0001 is echoed to this file, only CPU 0 will process interrupts for the device; if 0x000f is echoed, CPUs 0 through 3 will be used. For some workloads, this technique can reduce contention on heavily used processors, allowing I/O interrupts to be processed more efficiently and I/O performance to increase accordingly.
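The following is a minimal sketch of this technique. The IRQ number 24 is a hypothetical placeholder; the actual number assigned to a device can be found in /proc/interrupts:

    # Identify the IRQ number assigned to the device of interest
    cat /proc/interrupts

    # Allow only CPU 0 to service that IRQ (mask 0x0001)
    echo 1 > /proc/irq/24/smp_affinity

    # Allow CPUs 0 through 3 to service it (mask 0x000f)
    echo f > /proc/irq/24/smp_affinity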
