15.5 Disk IO Performance Issues | Essential System Administration, Third Edition

15.5 Disk I/O Performance Issues

Disk I/O is the third major performance bottleneck that can affect a system or individual job. This section will look first at the tools for monitoring disk I/O and then consider some of the factors that can affect disk I/O performance.

15.5.1 Monitoring Disk I/O Performance

Unfortunately, Unix tools for monitoring disk I/O data are few and rather poor. BSD-like systems provide the iostat command (all but Linux have some version of it). Here is an example of its output from a FreeBSD system experiencing moderate usage on one of its two disks:

$ iostat 6
      tty             ad0              ad1              cd0             cpu
 tin tout  KB/t tps  MB/s   KB/t tps  MB/s   KB/t tps  MB/s  us ni sy in id
   0   13 31.10  71  2.16   0.00   0  0.00   0.00   0  0.00   0  0 11  2 87
   0   13 62.67  46  2.80   0.00   0  0.00   0.00   0  0.00   0  0 10  2 88
   0   13  9.03  64  0.56   0.00   0  0.00   0.00   0  0.00   1  0  7  1 91
   0   13  1.91  63  0.12   0.00   0  0.00   0.00   0  0.00   2  0  4  2 92
   0   13  2.29  64  0.14   0.00   0  0.00   0.00   0  0.00   2  0  5  1 92

The command parameter specifies the interval between reports (and we've omitted the first, summary one, as usual). The columns headed by disk names are the most useful for our present purposes. They show current disk usage as the number of transfers/sec (tps) and MB/sec.
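
As a rough illustration of how the per-disk columns relate to one another, the following sketch splits one report line into its fields. The function name parse_iostat_row and the hard-coded device list are assumptions made for this example and are not part of iostat; the column layout is the one in the FreeBSD output above.

```python
def parse_iostat_row(line, devices):
    """Split one FreeBSD-style iostat report line into per-device fields.

    Assumes the layout shown in the text: tin, tout, then KB/t, tps, MB/s
    for each device, then five CPU percentages (us ni sy in id).
    """
    fields = line.split()
    expected = 2 + 3 * len(devices) + 5
    if len(fields) != expected:
        raise ValueError(f"expected {expected} fields, got {len(fields)}")
    stats = {}
    for i, dev in enumerate(devices):
        kb_per_xfer, tps, mb_per_sec = fields[2 + 3 * i : 5 + 3 * i]
        stats[dev] = {
            "kb_per_transfer": float(kb_per_xfer),
            "tps": float(tps),
            "mb_per_sec": float(mb_per_sec),
        }
    return stats


# First report line from the example output.
row = "0   13 31.10  71  2.16   0.00   0  0.00   0.00   0  0.00   0  0 11  2 87"
stats = parse_iostat_row(row, ["ad0", "ad1", "cd0"])
ad0 = stats["ad0"]

# KB per transfer times transfers per second approximates the MB/s column.
approx_mb_per_sec = ad0["kb_per_transfer"] * ad0["tps"] / 1024
print(f"ad0: {ad0['tps']:.0f} tps, reported {ad0['mb_per_sec']} MB/s, "
      f"computed {approx_mb_per_sec:.2f} MB/s")
```

The three columns are averages over the interval, so the computed and reported throughput will agree only approximately.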

System V-based systems offer the sar command, and it can be used to monitor disk I/O. Its syntax in this mode is:

$ sar -d interval [count]

interval is the number of seconds between reports, and count is the total number of reports to produce (the default is one). In general, sar's options specify what data to include in its report. sar is available for AIX, HP-UX, Linux, and Solaris. However, it requires that process accounting be set up before it will return any data.

This report shows the current disk usage on a Linux system:

$ sar -d 5 10
Linux 2.4.7-10 (dalton)         05/29/2002
07:59:34 PM       DEV       tps    blks/s
07:59:39 PM    dev3-0      9.00     70.80
07:59:39 PM   dev22-0      0.40      1.60
07:59:39 PM       DEV       tps    blks/s
07:59:44 PM    dev3-0     61.80    494.40
07:59:44 PM   dev22-0     10.80     43.20
07:59:44 PM       DEV       tps    blks/s
07:59:49 PM    dev3-0     96.60    772.80
07:59:49 PM   dev22-0      0.00      0.00
Average:          DEV       tps    blks/s
Average:       dev3-0     78.90    671.80
Average:      dev22-0      1.12      4.48

The first column of every sar report is a time-stamp. The other columns give the transfer operations per second and blocks transferred per second for each disk. Note that devices are specified by their major and minor device numbers; in this case, we are examining two hard disks.
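
The device labels and the two data columns can be unpacked with a few lines of code. The sketch below is only an illustration: the helper names are invented for this example, and the label pattern (dev, major number, hyphen, minor number) is inferred from the report shown above rather than taken from sar's documentation.

```python
import re


def parse_sar_device(name):
    """Return (major, minor) from a sar device label such as 'dev3-0'."""
    match = re.fullmatch(r"dev(\d+)-(\d+)", name)
    if match is None:
        raise ValueError(f"unrecognized device label: {name!r}")
    return int(match.group(1)), int(match.group(2))


def blocks_per_transfer(tps, blks_per_sec):
    """Average number of blocks moved per transfer; 0.0 for an idle disk."""
    return blks_per_sec / tps if tps else 0.0


major, minor = parse_sar_device("dev22-0")
print(f"major={major} minor={minor}")

# Second sample for dev3-0 in the report above.
print(round(blocks_per_transfer(61.80, 494.40), 2))
```

Dividing blks/s by tps gives a feel for the typical request size, which helps distinguish many small transfers from a few large ones.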

15.5.2 Getting the Most From the Disk Subsystem

Good disk performance results more from installation-time planning and configuration than from after-the-fact tuning. Different techniques are most effective for optimizing different kinds of I/O, so you'll need to understand the I/O performed by the applications that make up the system's typical workload.

There are two sorts of disk I/O:

Sequential access: Data from disk is read in disk block order, one block after another. After the initial seek (head movement) to the starting point, the speed of this sort of I/O is limited by disk transfer rates.
Random access: Data is read in no particular order. This means that the disk head will have to move frequently to reach the proper data. In this case, seek time is an important factor in overall I/O performance, and you will want to minimize it to the extent possible.

Three major factors affect disk I/O performance in general:

Disk hardware
Data distribution across the system's disks
Data placement on the physical disk

15.5.2.1 Disk hardware

In general, the best advice is to choose the best hardware you can afford when disk I/O performance is an important consideration. Remember that the best SCSI disks are many times faster than the fastest EIDE ones, and also many times more expensive.

These are some other points to keep in mind:

When evaluating the performance of individual disks, consider factors such as their local cache in addition to quoted peak transfer rates.
Be aware that actual disk throughput will seldom if ever achieve the advertised peak transfer rates. Consider the latter merely as relative numbers useful in comparing different disks.
Musumeci and Loukides suggest using the following formula to estimate actual disk speeds: (sectors-per-track * RPM * 512)/60,000,000. This yields an estimate of the disk's internal transfer rate in MB/sec (512 is the sector size in bytes). However, even this rate will only be achievable via sequential access (and rarely even then).
When random access performance is important, you can estimate the number of I/O operations per second as 1000/(average-seek-time + 30000/rpm), where average-seek-time is in milliseconds and 30000/rpm is the average rotational latency (half a revolution), also in milliseconds.
Don't neglect to consider the disk controller speed and other characteristics when choosing hardware. Fast disks won't perform as well on a mediocre controller.
Don't overload disk controllers. Placing disks on multiple disk controllers is one way to improve I/O throughput rates. In configuring a system, be sure to compare the maximum transfer rate for each disk adapter with the sum of the maximum transfer rates for all the disks it will control; obviously, placing too large a load on a disk controller will do nothing but degrade performance. A more conservative view states that you should limit total maximum disk transfer rates to 85%-90% of the top controller speed.
Similarly, don't overload system busses. For example, a 32-bit/33MHz PCI bus has a peak transfer rate of 132 MB/sec, less than what an Ultra3 SCSI controller is capable of.
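
The estimates in this list are easy to work through in code. The following is a minimal sketch; the function names and the sample disk (400 sectors per track, 10,000 RPM, 5 ms average seek) are hypothetical values chosen for illustration, not figures for any real drive.

```python
def internal_transfer_rate_mb_s(sectors_per_track, rpm, sector_bytes=512):
    """Estimated internal transfer rate in MB/sec:
    (sectors-per-track * RPM * 512) / 60,000,000."""
    return sectors_per_track * rpm * sector_bytes / 60_000_000


def random_iops(avg_seek_ms, rpm):
    """Estimated random I/O operations per second:
    1000 / (average-seek-time + 30000/rpm), with times in milliseconds."""
    rotational_latency_ms = 30_000 / rpm  # half a revolution, on average
    return 1000 / (avg_seek_ms + rotational_latency_ms)


def controller_load(disk_rates_mb_s, controller_mb_s):
    """Fraction of a controller's peak rate consumed if every attached
    disk transferred at its maximum rate at the same time."""
    return sum(disk_rates_mb_s) / controller_mb_s


# Hypothetical disk used only for illustration.
rate = internal_transfer_rate_mb_s(sectors_per_track=400, rpm=10_000)
iops = random_iops(avg_seek_ms=5, rpm=10_000)
print(f"internal transfer rate: {rate:.1f} MB/sec")
print(f"random I/O estimate:    {iops:.0f} operations/sec")

# Four such disks on a controller with a 160 MB/sec peak rate.
load = controller_load([rate] * 4, controller_mb_s=160)
verdict = "within" if load <= 0.90 else "above"
print(f"controller load: {load:.0%} ({verdict} the conservative 90% limit)")
```

Treat the results as relative numbers for comparing configurations, in the same spirit as the advertised peak rates.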

15.5.2.2 Distributing the data among the available disks

The next issue to consider after a system's hardware configuration is planning data distribution among the available disks: in other words, what files will go on which disk. The basic principle to take into account in such planning is to distribute the anticipated disk I/O across controllers and disks as evenly as possible (in an attempt to prevent any one resource from becoming a performance bottleneck). In its simplest form, this means spreading the files with the highest activity across two or more disks.

Here are some example scenarios that illustrate this principle:

If you expect most of a system's I/O to come from user processes, distributing the files they are likely to use across multiple disks usually works better than putting everything on a single disk.
A system intended to support multiple processes with large I/O requirements will benefit from placing the data for different programs or jobs on different disks (and ideally on separate controllers). This minimizes the extent to which the jobs interfere with one another.
For a system running a large transaction-oriented database, ideally you will want to place each of the following item pairs on different disks:
- Tables and their indexes.
- Database data and transaction logs.
- Large, heavily used tables accessed simultaneously.
Given the constraints of an actual system, you may have to decide which of these separations is the most important.
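
One simple way to approximate an even spread is to place the busiest files first, each on whichever disk currently carries the least load. The sketch below uses that greedy heuristic purely as an illustration; it is an assumption of this example, not a method prescribed here, and the file names and I/O rates are invented.

```python
import heapq


def spread_by_activity(file_rates, disks):
    """Assign files to disks, busiest first, always onto the least-loaded disk.

    file_rates maps a file name to its expected I/O rate (any consistent unit).
    Returns (placement, loads): files per disk and total expected rate per disk.
    """
    heap = [(0.0, disk) for disk in disks]
    heapq.heapify(heap)
    placement = {disk: [] for disk in disks}
    busiest_first = sorted(file_rates.items(), key=lambda kv: kv[1], reverse=True)
    for name, rate in busiest_first:
        load, disk = heapq.heappop(heap)  # least-loaded disk so far
        placement[disk].append(name)
        heapq.heappush(heap, (load + rate, disk))
    loads = {disk: load for load, disk in heap}
    return placement, loads


# Hypothetical files and expected transfers per second.
rates = {"orders.tbl": 120, "orders.idx": 90, "txn.log": 80, "archive.tbl": 10}
placement, loads = spread_by_activity(rates, ["disk0", "disk1"])
for disk in sorted(placement):
    print(disk, placement[disk], loads[disk])
```

A heuristic like this balances load only. It knows nothing about which pairs of files should be kept apart, so separations such as tables versus indexes still have to be checked by hand.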

Of course, placing heavily accessed files on network rather than local drives is almost always a guarantee of poor performance. Finally, it is also almost always a good idea to use a separate disk for the operating system filesystem(s) (provided you can afford to do so) to isolate the effects of the operating system's own I/O operations from user processes.

15.5.2.3 Data placement on disk

The final disk I/O performance factor that we will consider is the physical placement of files on disk. The following general considerations apply to the relationship between file access patterns, physical disk location, and disk I/O performance:

Sequential access of large files (i.e., reading or writing, starting at the beginning and moving steadily toward the end) is most efficient when the files are contiguous: made up of a single, continuous chunk of space on disk. It may be necessary to rebuild a filesystem to create a large amount of contiguous disk space.^[27] Sequential access performance is highest at the outer edge of the disk (i.e., beginning at cylinder 0) because the platter is the widest at that point (head movement is minimized).

^[27] Unfortunately, some disks are too smart for their own good. Disks are free to do all kinds of remapping to improve their concept of disk organization and to mask bad blocks. Thus, there is no guarantee that what look like sequential blocks to the operating system are actually sequential on the disk.
Disk I/O to large sequential files also benefits from software disk striping, provided an appropriate stripe size is selected (see Section 10.3). Ideally, each read should result in one I/O operation (or less) to the striped disk.
Placing large, randomly accessed files in the center portions of disk drives (rather than out at the edges) will yield the best performance. Random data access is dominated by seek times (the time taken to move the disk heads to the correct location), and seek times are minimized when the data is in the middle of the disk and increase at the inner and outer edges. AIX allows you to specify the preferred on-disk location when you create a logical volume (see Section 10.3). With other Unix versions, you accomplish this by defining physical disk partitions appropriately.
Disk striping is also effective for processes performing a large number of I/O operations.
Filesystem fragmentation degrades I/O performance. Fragmentation results when the free space within a filesystem is made of many small chunks of space (rather than fewer large ones of the same aggregate size). This means that files themselves become fragmented (noncontiguous), and access times to reach them become correspondingly longer. If you observe degrading I/O performance on a very full filesystem, fragmentation may be the cause.

Filesystem fragmentation tends to increase over time. Eventually, it may be necessary or desirable to use a defragmenting utility. If none is available, you will need to rebuild the filesystem to reduce fragmentation; the procedure for doing so is discussed in Section 10.3.

15.5.3 Tuning Disk I/O Performance

Some systems offer a few hooks for tuning disk I/O performance. We'll look at the most useful of them in this subsection.

15.5.3.1 Sequential read-ahead

Some operating systems attempt to determine when a process is accessing data files sequentially. When the operating system decides that this is the access pattern in use, it attempts to aid the process by performing read-ahead operations: reading more pages from the file than the process has actually requested. For example, it might begin by retrieving two pages instead of one. As long as sequential access of the file continues, the operating system might double the number of pages read with each operation before settling at some maximum value.

The advantage of this heuristic is that data has often already been read in from disk at the time the process asks for it, and so much of the process's I/O wait time is eliminated because no physical disk operation need take place.
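
The doubling behavior described above can be sketched in a few lines. This is a simplified model for illustration only, not any particular kernel's algorithm; the function name and the starting and maximum values (2 and 8 pages) are assumptions of the example.

```python
def readahead_sizes(start_pages, max_pages, sequential_reads):
    """Pages fetched on each successive sequential read: start small,
    double while the sequential pattern continues, and stop at a cap."""
    sizes = []
    pages = start_pages
    for _ in range(sequential_reads):
        sizes.append(pages)
        pages = min(pages * 2, max_pages)
    return sizes


print(readahead_sizes(start_pages=2, max_pages=8, sequential_reads=5))
# prints [2, 4, 8, 8, 8]
```

A real implementation would also fall back to the starting size as soon as the process stops reading sequentially.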

15.5.3.1.1 AIX

AIX provides this functionality. You can alter the default values of 2 and 8 pages (the starting and maximum read-ahead sizes, respectively) using these vmtune options:

-r minpgahead: Starting number of pages for sequential read aheads.
-R maxpgahead: Maximum number of pages to read ahead. You will want to increase this parameter for striped filesystems. Good values to try are 8-16 times the number of component drives.

Both parameters must be powers of 2.
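
Combining the two rules above (8-16 times the number of component drives, and a power of 2) narrows the choice of maxpgahead considerably. The helper below is a hypothetical illustration of that arithmetic and is not part of vmtune.

```python
def maxpgahead_candidates(component_drives):
    """Powers of 2 that fall between 8 and 16 times the number of
    component drives in a striped filesystem."""
    low, high = 8 * component_drives, 16 * component_drives
    candidates = []
    value = 1
    while value <= high:
        if value >= low:
            candidates.append(value)
        value *= 2
    return candidates


for drives in (1, 3, 4):
    print(drives, maxpgahead_candidates(drives))
```

For a four-drive stripe this yields 32 and 64, either of which could then be passed to vmtune's -R option and evaluated under a realistic workload.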

15.5.3.1.2 Linux

Linux provides some kernel parameters related to read-ahead behavior. They may be accessed via these files in /proc/sys/vm:

page-cluster: Determines the number of pages read in by a single read operation. The actual number is computed as 2 raised to this power. The default setting is 4, resulting in a page cluster size of 16. Large sequential I/O operations may benefit from increasing this value.
min-readahead and max-readahead: Specify the minimum and maximum pages used for read-ahead. They default to 3 and 31, respectively.
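
The relationship between the page-cluster setting and the resulting cluster size can be checked with a short script. This is a sketch under stated assumptions: the helper names are invented, the file path is the one named above, and the script falls back to the default of 4 quoted above when the file cannot be read.

```python
from pathlib import Path


def pages_per_cluster(page_cluster):
    """Number of pages in a cluster: 2 raised to the page-cluster setting."""
    if page_cluster < 0:
        raise ValueError("page-cluster must not be negative")
    return 2 ** page_cluster


def read_page_cluster(path="/proc/sys/vm/page-cluster"):
    """Return the current setting, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


setting = read_page_cluster()
if setting is None:
    setting = 4  # default quoted in the text; used only as a fallback
print(f"page-cluster={setting} -> {pages_per_cluster(setting)} pages")
```

Because the value is an exponent, raising the setting by one doubles the cluster size.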

Finally, the Linux Logical Volume Manager allows you to specify the read-ahead size when you create a logical volume with lvcreate, via its -r option. For example, this command specifies a read-ahead size of 8 sectors and also creates a contiguous logical volume:

# lvcreate -L 800M -n bio_lv -r 8 -C y vg1

The valid range for -r is 2 to 120.

15.5.3.2 Disk I/O pacing

AIX also provides a facility designed to prevent general system interactive performance from being adversely affected by large I/O operations. By default, write requests are serviced by the operating system in the order in which they are made (queued). A very large I/O operation can generate many pending I/O requests, and users needing disk access can be forced to wait for them to complete. This occurs most frequently when an application computes a large amount of new data to be written to disk (rather than processing a data set by reading it in and then writing it back out).

You can experience this effect by copying a large file (32 MB or more) in the background and then running an ls command on any random directory you have not accessed recently on the same physical disk. You'll notice an appreciable wait time before the ls output appears.

Disk I/O pacing is designed to prevent large I/O operations from degrading interactive performance. It is disabled by default. Consider enabling it only under circumstances like those described.

This feature may be activated by changing the values of the minpout and maxpout system parameters using the chdev command. When these parameters are nonzero, if a process tries to write to a file for which there are already maxpout or more pending write operations, the process is suspended until the number of pending requests falls below minpout.

maxpout must be one more than a multiple of 4: 5, 9, 13, and so on (i.e., of the form 4x+1). minpout must be a multiple of 4 and at least 4 less than maxpout. The AIX documentation suggests starting with values of 33 and 16, respectively, and observing the effects. The following command will set them to these values:

# chdev -l sys0 -a maxpout=33 -a minpout=16
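
Because the constraints on the two parameters are easy to get wrong, it can help to check candidate values before applying them. The function below is a hypothetical helper that encodes only the rules stated above; it does not call chdev or query the system.

```python
def check_pacing(maxpout, minpout):
    """Return a list of rule violations for a maxpout/minpout pair.

    An empty list means the pair satisfies the rules given in the text.
    Both values set to 0 is treated as pacing disabled.
    """
    if maxpout == 0 and minpout == 0:
        return []
    problems = []
    if maxpout % 4 != 1:
        problems.append("maxpout must be one more than a multiple of 4")
    if minpout % 4 != 0:
        problems.append("minpout must be a multiple of 4")
    if maxpout - minpout < 4:
        problems.append("minpout must be at least 4 less than maxpout")
    return problems


for pair in [(33, 16), (32, 16), (33, 32)]:
    issues = check_pacing(*pair)
    print(pair, "ok" if not issues else issues)
```

The suggested starting pair of 33 and 16 passes all three checks, while 32 and 16 fails the first and 33 and 32 fails the last.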

If interactive performance is still not as rapid as you want it to be, try decreasing these parameters; on the other hand, if the performance of the job doing the large write operation suffers more than you want it to, increase them. Note that their values persist across boots because they are stored in the ODM.