Now that we've discussed in some detail how disks operate, how they're connected to the system, and the filesystems that serve as an intermediate layer between those low-level components and us as users, we can move on to tools for analyzing and improving the performance of disk subsystems. In this section, I discuss how to enable disk caches; some tools to benchmark disk performance; a common pitfall in disk performance analysis, and how to avoid it; tools like sar and iostat that let us monitor the performance of our devices; and a low-level disk I/O tracing tool.
We've discussed the nature of disk write caching; before you use this procedure, please be sure to read Section 5.1.1.1 earlier in this chapter. Blindly enabling write caches can have a profound effect on the reliability of your data. With that caveat in mind, Example 5-1 shows how to invoke format in a mode that enables changing the cache policy on a per-disk basis.
# format -e
Searching for disks...done

AVAILABLE DISK SELECTIONS:
       0. c0t0d0 <SUN2.1G cyl 2733 alt 2 hd 19 sec 80>
          /sbus@1f,0/espdma@e,8400000/esp@e,8800000/sd@0,0
Specify disk (enter its number): ...
format> cache

CACHE MENU:
        write_cache - display or modify write cache settings
        read_cache  - display or modify read cache settings
        !<cmd>      - execute <cmd>, then return
        quit
cache> write_cache

WRITE_CACHE MENU:
        display - display current setting of write cache
        enable  - enable write cache
        disable - disable write cache
        !<cmd>  - execute <cmd>, then return
        quit
write_cache> display
Write Cache is disabled
write_cache> enable
write_cache> display
Write Cache is enabled
The best tool to use under Linux for changing a disk's cache policy is hdparm, which is available at ftp://metalab.unc.edu/pub/Linux/system/hardware/hdparm-4.1.tar.gz. Here's how to enable write caching for /dev/hda:
# hdparm -W1 /dev/hda

/dev/hda:
 setting drive write-caching to 1 (on)
If you'd rather not play so fast and loose with your data, here's how to turn it off:
# hdparm -W0 /dev/hda

/dev/hda:
 setting drive write-caching to 0 (off)
After all the techniques that we have discussed for sizing disk subsystems to workloads and improving disk performance, tools to characterize the performance of an I/O subsystem would be most useful. Thankfully, there are several good ones available. Here, I cover hdparm, tiobench, and iozone.
One nice, compact device for testing raw device performance in Linux is the hdparm command, which includes a few simple tests of sequential data transfer performance. By specifying the -Tt switches, it's possible to get a fairly good estimate of single-disk performance. This is very useful given the large number of widely varying disk drives available in the PC-compatible hardware arena.
# hdparm -Tt /dev/hda

/dev/hda:
 Timing buffer-cache reads:   128 MB in  3.94 seconds = 32.49 MB/sec
 Timing buffered disk reads:   64 MB in 15.31 seconds =  4.18 MB/sec
tiobench is an open source, threaded disk performance analysis tool that has eclipsed the old bonnie benchmark. It presents a more complete range of data than hdparm, by including sequential and random-access patterns and reporting latency measurements.
Running the benchmark, especially under Linux, is very straightforward. This example was generated on a 300 MHz Celeron system running a 2.4 Linux kernel; the disk device is a 7,200 rpm IDE disk:
% tiotest
Tiotest latency results:
,-------------------------------------------------------------------------.
| Item         | Average latency | Maximum latency | % >2 sec | % >10 sec |
+--------------+-----------------+-----------------+----------+-----------+
| Write        |        1.827 ms |     1421.003 ms |  0.00000 |   0.00000 |
| Random Write |        0.453 ms |      409.878 ms |  0.00000 |   0.00000 |
| Read         |        0.241 ms |      262.779 ms |  0.00000 |   0.00000 |
| Random Read  |        0.087 ms |      107.313 ms |  0.00000 |   0.00000 |
|--------------+-----------------+-----------------+----------+-----------|
| Total        |        0.819 ms |     1421.003 ms |  0.00000 |   0.00000 |
`--------------+-----------------+-----------------+----------+-----------'

Tiotest results for 4 concurrent io threads:
,----------------------------------------------------------------------.
| Item                  | Time     | Rate         | Usr CPU  | Sys CPU |
+-----------------------+----------+--------------+----------+---------+
| Write         40 MBs  |  14.1 s  |  2.834 MB/s  |   1.1 %  | 19.5 %  |
| Random Write  16 MBs  |  10.2 s  |  1.533 MB/s  |   0.2 %  | 15.3 %  |
| Read          40 MBs  |   1.1 s  | 37.924 MB/s  |   3.8 %  | 19.0 %  |
| Random Read   16 MBs  |   0.5 s  | 32.338 MB/s  |   4.1 %  | 14.5 %  |
`----------------------------------------------------------------------'
Further options are available by perusing the output of tiotest -h . Of particular interest are the -S and -W flags. -S forces all writes to be done synchronously, and -W causes the writes to be done sequentially. tiobench also comes with a Perl script that allows you to automate the running of test suites.
As of this writing, tiobench is available at http://sourceforge.net/projects/tiobench/.
The iozone benchmark is quite possibly one of the most complete disk and filesystem benchmarks ever developed. Unfortunately, it is extremely complex (a side effect of its completeness), and beyond the scope of discussion here. Extensive documentation is available online, however. iozone is available as of this writing at http://www.iozone.org.
Let's look at a set of simple throughput benchmarks:
% mkfile -v 10m test-file
test-file 10485760 bytes
% ptime cat test-file > /tmp/glop

real        1.728
user        0.004
sys         0.273
% rm /tmp/glop
% ptime cat test-file > /tmp/glop

real        0.504
user        0.004
sys         0.211
The astute administrator might ask why the second run is so much faster than the first. The simple reason is that this is a completely bogus measurement. The first run legitimately copied the data from disk to temporary space, which on this system is a tmpfs filesystem. [26] The second run did not: it copied the file from the in-memory filesystem cache to temporary space. When you run a benchmark, be careful that you are actually measuring disk performance, rather than in-memory filesystem cache performance. You can ensure the validity of your measurements by unmounting and remounting the filesystem between trials.
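The pitfall is easy to reproduce on any Unix system. This sketch (file names and sizes are illustrative) times two back-to-back reads of the same file; unless the cache is flushed between runs, the second read completes far faster because it never touches the disk:

```shell
# Illustrative sketch of the double-read pitfall; paths and sizes are
# arbitrary. The second read is typically served from the filesystem cache.
dir=$(mktemp -d)
dd if=/dev/zero of="$dir/test-file" bs=1024 count=10240 2>/dev/null  # 10 MB file
time cat "$dir/test-file" > /dev/null   # cold: data must come from disk
time cat "$dir/test-file" > /dev/null   # warm: data comes from the cache
rm -r "$dir"
```

On most systems the warm read finishes several times faster; to obtain a valid disk measurement, unmount and remount the filesystem between the two reads, as noted above.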
[26] As such, the data was never written to disk. This benchmark measured the time required to read the file from disk and to store it into memory.
One of the most useful tools for monitoring ongoing disk performance is iostat . Unfortunately, iostat can be confusing; it presents output in many different forms, and in some cases it computes numbers strangely for historical reasons.
The most useful flags for the Solaris version of iostat are probably -x (for extended statistics), -n (to print names in the much more descriptive cXtXdXsX format), [27] and -P (for per-partition statistics). Other interesting flags are -C , which causes aggregation by controller ID; -m , which reports the filesystem mount points; and -z , which suppresses the output of any lines that are all zeroes (e.g., idle disks).
[27] The cXtXdXsX format lets you easily relate the pathname of a partition to how the partition is attached to the system. The number following c is the controller instance; the number following t is the device target number; the number following d is the device instance (LUN); and the number following s is the slice (partition) number. For example, c1t5d0s3 would be partition three ( s3 ) on the disk with target ID 5 ( t5 ) attached to controller 1 ( c1 ). IDE disks do not have a target value, and are represented as cXdXsX .
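The naming scheme is regular enough to take apart mechanically. Here is a small sketch (using the footnote's own example device name) that does so with sed:

```shell
# Split a Solaris SCSI device name into its components; the name below
# is the footnote's example, not a device on any particular system.
name=c1t5d0s3
echo "$name" | sed -n \
    's/^c\([0-9]*\)t\([0-9]*\)d\([0-9]*\)s\([0-9]*\)$/controller \1, target \2, lun \3, slice \4/p'
# prints: controller 1, target 5, lun 0, slice 3
```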
% iostat -mnPxz 30
...
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    0.2    0.0    0.8  0.0  0.0    0.0    9.5   0   0 c0t0d0s3 (/)
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    1.4    0.0   10.6  0.0  0.0    0.0   13.0   0   1 c0t0d0s3 (/)
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    2.6    0.0   22.4  0.0  0.0    0.0   12.1   0   1 c0t0d0s3 (/)
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    3.2    0.8   22.2    6.4  0.0  0.0    0.0    3.9   0   2 c0t0d0s3 (/)
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.2    0.2    1.4    1.6  0.0  0.0    0.0    9.1   0   0 c0t0d0s3 (/)
    0.0    2.2    0.0  206.4  0.0  0.0    0.0   21.7   0   1 c0t1d0s0 (/export/home)
The fields are as follows:
r/s and kr/s report the number of read operations per second and the number of kilobytes read per second. w/s and kw/s report the same information, but for writes.
wait is the average number of transactions waiting to be serviced.
actv is the number of requests currently being actively serviced.
wsvc_t is the average service time in the wait queue, in milliseconds.
asvc_t is the average service time for active transactions, in milliseconds.
%w represents the percent of time that there are transactions waiting to be serviced.
%b represents the percent of time that the disk was actively servicing transactions.
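As a quick sketch of putting these fields to work, the following flags any device whose %b (field 10 in iostat -xn extended output) exceeds a threshold. The sample lines are illustrative, not output from a live system:

```shell
# Capture a sample of extended iostat output (illustrative data) and
# report any device busier than 20%. Field 10 is %b, field 11 the device.
cat > /tmp/iostat-sample.txt <<'EOF'
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.2    0.2    1.4    1.6  0.0  0.0    0.0    9.1   0   0 c0t0d0s3
    0.0    2.2    0.0  206.4  0.0  0.0    0.0   21.7   0  61 c0t1d0s0
EOF
awk 'NR > 1 && $10 > 20 { print $11 " is " $10 "% busy" }' /tmp/iostat-sample.txt
# prints: c0t1d0s0 is 61% busy
rm /tmp/iostat-sample.txt
```

In practice you would feed a live `iostat -xn interval` stream into the same awk filter rather than a captured file.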
If you are using the Linux version of iostat , the flags to use are -d -x interval . This enables extended output and only displays disk information (rather than adding CPU statistics):
$ iostat -d -x 30
Linux 2.4.2-2 (aiua)    07/29/2001

Device:  rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s avgrq-sz avgqu-sz   await  svctm  %util
hde        2.88   1.32 52.90  2.23   99.06   28.50     2.31     0.65   11.82   8.34   4.60
hde1       0.00   0.00 49.60  0.00   49.61    0.00     1.00     0.28    5.64   5.64   2.80
hde2       2.88   1.32  3.30  2.23   49.44   28.50    14.10     0.37   67.26  33.29   1.84
hde5       0.00   0.00  0.00  0.00    0.01    0.00     8.00     0.00  100.00 100.00   0.00
hdg        5.89   1.61 34.27  3.27  233.97   39.41     7.28     5.17  137.62  31.69  11.90
hdg1       0.00   0.00 12.47  0.00   12.47    0.00     1.00     0.05    4.21   4.21   0.53
hdg2       5.89   1.61 21.80  3.27  221.50   39.41    10.40     5.11  203.98  45.35  11.37
You can optionally provide a specific device before the time interval to only see statistics for that device.
In the old days (when disk interfaces such as IPI dominated the marketplace), disk controllers really did directly control the disks. In such an environment, terms like utilization, service time, and wait time have clearly defined meanings, and some simple queuing theory mathematics can let you derive them from a handful of underlying measurements. iostat was originally developed in this era. Nowadays, the disk controller is not nearly as omnipresent in the disk subsystem: modern disks are extremely smart. The rather simple model implemented by iostat, and particularly its use of terminology, can be very confusing in modern environments.
In the old controller-centric environments, once the disk driver sent a request to the disk, it knew that the disk would single-mindedly service that request, and the length of time it required was the service time. With modern systems, the service time varies with the disk load and queue length, and there is no direct comparison to the older service time measurement. We can work around this by explicitly creating two queues: a wait queue, which refers to commands queued in the disk driver, and an active queue, which refers to commands queued in the disk itself. Unfortunately, iostat tries to report the new measurements with the old terminology; the "wait service time" is actually the time spent in the "wait" queue, which really doesn't make sense.
The number that iostat calls "service time" is defined as the queue length divided by the throughput. This is not really service time, which is more properly the amount of time it takes for the first command in line to be processed; what iostat calls "service time" would be better labeled as "response time."
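Using the hde line from the Linux iostat output earlier as a sanity check, this queue-length-over-throughput relationship (essentially Little's law) reproduces the reported await figure almost exactly:

```shell
# Reproduce the response-time calculation from the hde line above:
# avgqu-sz = 0.65, r/s = 52.90, w/s = 2.23, reported await = 11.82 ms.
awk 'BEGIN {
    avgqu = 0.65            # avgqu-sz: average number of queued requests
    iops  = 52.90 + 2.23    # r/s + w/s: requests completed per second
    printf "response time = %.1f ms\n", avgqu / iops * 1000
}'
# prints: response time = 11.8 ms
```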
The -d flag to sar causes it to report various per-partition disk activity statistics:
% sar -d 5 100

SunOS islington 5.8 Generic_108528-03 sun4u    02/16/01

14:58:51   device    %busy   avque   r+w/s   blks/s  avwait  avserv
...
14:59:06   fd0           0     0.0       0        0     0.0     0.0
           nfs1          0     0.0       0        0     0.0     0.0
           sd0          93     2.6      81    17934     0.0    31.9
           sd0,a         0     0.0       0        0     0.0     0.0
           sd0,c         0     0.0       0        0     0.0     0.0
           sd0,d        93     2.6      81    17934     0.0    31.9
           sd1           0     0.0       0        0     0.0     0.0
           sd1,a         0     0.0       0        0     0.0     0.0
           sd1,c         0     0.0       0        0     0.0     0.0
           sd6           0     0.0       0        0     0.0     0.0
...
The first column, %busy, refers to the portion of the time that the drive was busy servicing a transfer request during the sampling interval; avque gives the average number of outstanding requests during the same period. The combined number of read and write operations per second is given by r+w/s, and the number of 512-byte blocks transferred per second by blks/s. avwait specifies the average wait time, in milliseconds; avserv refers to the average service time, also in milliseconds.
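Since blks/s is reported in 512-byte blocks, converting sd0's 17,934 blks/s into more familiar units is a one-line calculation:

```shell
# Convert sd0's 17934 blocks/second (512-byte blocks) to megabytes/second.
awk 'BEGIN { printf "%.2f MB/s\n", 17934 * 512 / (1024 * 1024) }'
# prints: 8.76 MB/s
```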
For more information on sar in general, please see Section 4.5.2.2.
It would be wonderful if we could break down iostat output even further -- for example, if we observe a high utilization rate on one partition, it would be very useful to know what process was responsible for this behavior. But to accomplish this, we must use a more complicated tool: the kernel trace facility.
Unix systems have had kernel tracing for a long time, but it was originally for development use only; enabling the facility in production kernels wasn't generally done because it exacted a performance penalty. IBM decided to leave the kernel tracing facility enabled in their AIX kernel, and Sun has since followed suit.
The trace capability is rather robust and has some nice features; for example, the trace file is self-describing, so an external definitions file isn't required. It consists of three applications: prex(1) , which controls probe execution; tnfxtract(1) , which reads out the kernel trace buffer; and tnfdump(1) , which displays the data in a human-readable format. However, it does have some problems. For instance, user-level probes can write directly to trace files, but kernel probes write to a ring buffer, which is implemented as a separate buffer per kernel thread to avoid locking and contention issues on multiprocessor systems. The problem is that it's hard to size the buffer, and one highly active probe might refill its buffer repeatedly while others are just getting started. Thankfully, tnfxtract takes a snapshot of the buffer, so a second snapshot will include leftover data; tnfdump is also very good about sorting things into chronological order. While this facility is extremely powerful, we are only going to discuss its utility with regard to tracing I/O.
When the prex command is invoked, it goes into an interactive command mode, where you can specify the details of the trace you wish to perform. Here, we invoke it with the -k switch, so that it attaches to the kernel's internal probes:
# prex -k
Type "help" for help ...
prex> buffer alloc 2m
Buffer of size 2097152 bytes allocated
prex> enable io
prex> trace io
prex> ktrace on
These commands allocate a 2 MB buffer for data storage [28] and enable the io probe set for tracing. [29] The ktrace on command flips the kernel probe "master switch," which controls when data is gathered.
[28] You do not need to explicitly declare the size of the buffer -- a simple buffer alloc will suffice.
[29] If you feel daring, you can enable all the kernel probes. This can be accomplished by issuing the command enable $all followed by trace $all . Be advised that this is very verbose, and probably not for the faint of heart.
At this point, wait for a little while until the behavior you're interested in occurs, and shut off the master switch:
prex> ktrace off
Then you can dump the trace records into a file and review them:
# mkdir /tmp/trace
# cd /tmp/trace
# tnfxtract iotrace.tnf
# tnfdump iotrace.tnf | more
The resulting output can be divided into two sections: a header and contents. Both are very wide, and have been edited unmercifully to fit on the page. The header is illustrated in Example 5-2.
probe tnf_name: "strategy" tnf_string: "... blockio;file .../driver.c;line 305;"
probe tnf_name: "pageout" tnf_string: "... pageio io;file .../vm_pvn.c;line 552;"
probe tnf_name: "biodone" tnf_string: "... blockio;file .../bio.c;line 1189;"
The output header details the probes that were enabled for the trace. The strategy probe records an I/O being issued, and biodone records an I/O being completed. The other mechanism that may cause I/O is the paging mechanism, recorded via the pageout probe, which is why it is included in the io probe set. The header also describes what probe sets a probe resides in (such as io and blockio ), as well as the source code file and line where the probe actually resides.
The contents of the trace follow the format indicated in Example 5-3, again highly truncated.
Elapsed    Delta     PID  LWPID   TID  CPU  Probe      Data / Description
. . .
  0.000    0.000    2040      1   ...    1  strategy   ... buf: 0x707b45a0 flags: 4194313
 16.911   16.911       0      0   ...    0  biodone    ... buf: 0x707b45a0
 17.027    0.116    2040      1   ...    1  strategy   ... buf: 0x706d8410 flags: 524569
 26.101    9.074       0      0   ...    0  biodone    ... buf: 0x706d8410
 26.106    0.005       0      0   ...    0  biodone    ... buf: 0x706d8410
 26.158    0.051    2040      1   ...    1  pageout    ...
The next section, the trace data, shows exactly what I/O requests were handled, in chronological order. It does not show I/O requests that were serviced from the in-memory filesystem cache. The first column shows the time elapsed (in ms) since the first probe fired; the second shows the time (in ms) since the previous probe fired. The next three columns give the process ID, lightweight process ID, and thread ID of the generator of the I/O request. The CPU column describes which processor handled the request; this trace was taken on a dual-processor Sun Ultra 2, which explains why you see a mixture of CPU 0 and CPU 1. The final column describes the data generated by the probe. As you can imagine, tracking down exactly what a probe represents can be nontrivial. It helps to use tracing on a process that is being run through truss(1) ; I have found truss -lDdaf command to be most useful.
Note that we have translated the flags field from decimal to hexadecimal. The flags field is composed of individual flag bits ORed together. You can get the canonical list from /usr/include/sys/buf.h ; a reference list of some of the more useful ones is presented in Table 5-11.
Flag | Name | Description |
---|---|---|
0x000001 | BUSY | Request busy (see WANTED). |
0x000002 | DONE | Transaction finished. |
0x000004 | ERROR | Transaction aborted. |
0x000008 | KERNBUF | Accessing a kernel buffer. |
0x000010 | PAGEIO | Do I/O to pages. |
0x000040 | READ | Request is a read. |
0x000080 | WANTED | Wake up when the BUSY flag goes off. |
0x000100 | WRITE | Request is a write. |
0x000400 | ASYNC | Asynchronous; don't wait for I/O completion. |
0x002000 | DONTNEED | Don't bother caching after a write. |
0x008000 | FREE | Free the page when done with it. |
0x010000 | INVAL | Request contains bogus data. |
0x080000 | NOCACHE | Don't cache the block when it's released. |
0x400000 | RETRYWRI | Retry the write until it succeeds. |
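The flags field can also be decoded mechanically. This is a small sketch (not a tool from the book) that checks a value against the bits in Table 5-11; the two test values are the ones reported by the strategy probes in Example 5-3:

```shell
# Decode a buf flags value against the names in Table 5-11.
decode_flags() {
    flags=$(( $1 ))
    names=""
    for entry in 0x000001:BUSY 0x000002:DONE 0x000004:ERROR \
                 0x000008:KERNBUF 0x000010:PAGEIO 0x000040:READ \
                 0x000080:WANTED 0x000100:WRITE 0x000400:ASYNC \
                 0x002000:DONTNEED 0x008000:FREE 0x010000:INVAL \
                 0x080000:NOCACHE 0x400000:RETRYWRI
    do
        bit=${entry%%:*}
        name=${entry#*:}
        [ $(( flags & bit )) -ne 0 ] && names="$names${names:+|}$name"
    done
    echo "$names"
}

decode_flags 4194313   # 0x400009; prints: BUSY|KERNBUF|RETRYWRI
decode_flags 524569    # 0x080119; prints: BUSY|KERNBUF|PAGEIO|WRITE|NOCACHE
```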
In this case, we have observed all the I/O requests for a short-lived process. I'll rewrite the raw data shown in Example 5-3 into a little bit clearer format, as illustrated in Table 5-12.
Buffer | Time (ms) | Operation | Flags field contents |
---|---|---|---|
0x707b45a0 | 0.000 | I/O request issued. | 0x400009 |
 | 16.911 | I/O request complete. | |
0x706d8410 | 17.027 | I/O request issued. | 0x080119 |
 | 26.102 | I/O request complete. | |
Examining the flags, we see that the first I/O is an 8 KB request to a kernel buffer, which took 16.911 ms to complete, and the second is a noncached 1 KB write to a kernel-buffered I/O page, which took 9.074 ms to complete. The first request is an update to the inode, and the second is a data write.
The method we've just described tracks all the I/O activity for the system. This can give you some interesting results, but it can also be a real pain: if you are interested in the I/O done by a single process on a busy system, you need a way to trace that process alone rather than the entire system. Fortunately, there is a way to do exactly that.
prex implements a means of focusing tracing called process filtering . When process filtering is enabled, tracing in the kernel is limited to events initiated on behalf of the processes in a specific list. Process 0 is used to represent all the threads not associated with a process.
When it is run without arguments, the prex command pfilter displays the current process filtering mode (on or off) and the process ID list. The process ID list can be modified regardless of the state of process filtering.
A process must exist to be added to the process ID list; similarly, the list is automatically updated to delete any processes that no longer exist. Should the list become empty while process filtering is turned on, prex issues a warning. Process filtering mode is persistent between invocations of prex , so be sure to turn it off manually when you're done.
Let's work through an example, where we trace all the I/O generated by a vi session (process ID 19786):
# prex -k
Type "help" for help ...
prex> buffer alloc
Buffer of size 393216 bytes allocated
prex> pfilter add 19786
prex> pfilter on
prex> pfilter
Process filtering is on
Process filter set is {19786}
prex> trace io
prex> enable io
prex> ktrace on

...in a separate window, exit the vi session ( :wq )...

prex> ktrace off
Warning: Process filtering on, but pid filter list is empty
prex> pfilter off
prex> quit
# tnfxtract /tmp/vi.trc
# tnfdump /tmp/vi.trc
...
Note that we received a warning from prex when the vi process exited and could no longer be traced.
If you restart a kernel trace and it is only of short duration, you will see old information before you get to what you actually care about. This problem is easily solved by deallocating and reallocating the buffer:
prex> buffer dealloc
prex> buffer alloc
These examples just scratch the surface of prex . It is a complex tool that serves many purposes. If you're interested in exploring further, there are other meaningful probe sets. One of these is vm , which contains all the virtual memory-related probes.