

4.15. DTrace for I/O

DTrace was added to the Solaris 10 release; see Chapter 10 for a reference. DTrace can trace I/O events with ease by using the io provider; indeed, tracing I/O with the io provider is often used as a demonstration of DTrace itself.

4.15.1. io Probes

The io provider supplies io:::start and io:::done probes, which for disk events represent the initiation and completion of physical I/O.

# dtrace -lP io
   ID   PROVIDER            MODULE                          FUNCTION NAME
   60         io           genunix                           biodone done
   61         io           genunix                           biowait wait-done
   62         io           genunix                           biowait wait-start
   71         io           genunix                    default_physio start
   72         io           genunix                     bdev_strategy start
   73         io           genunix                           aphysio start
  862         io               nfs                          nfs4_bio done
  863         io               nfs                          nfs3_bio done
  864         io               nfs                           nfs_bio done
  865         io               nfs                          nfs4_bio start
  866         io               nfs                          nfs3_bio start
  867         io               nfs                           nfs_bio start


In this example, we list the probes from the io provider. This provider also tracks NFS events, raw disk I/O events, and asynchronous disk I/O events.

The names for the io:::start and io:::done probes include the kernel function names. Disk events are likely to use the functions bdev_strategy and biodone, the same functions that TNF tracing probed. If you are writing DTrace scripts to match only one type of disk activity, then specify the function name. For example, io::bdev_strategy:start matches physical disk events.
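
For example, the following one-liner (a sketch; any aggregation would do) counts physical disk events by process name, leaving the module field blank so that it matches bdev_strategy wherever it resides:

# dtrace -n 'io::bdev_strategy:start { @[execname] = count(); }'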

The probes io:::wait-start and io:::wait-done trace the time when a thread blocks and begins to wait for an I/O and the time when that wait completes.
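
As a sketch of how these probes pair up (the thread-local variable name self->ts is our own), the following one-liner sums the nanoseconds each process spent blocked waiting for I/O:

# dtrace -n 'io:::wait-start { self->ts = timestamp; }
    io:::wait-done /self->ts/ { @wait[execname] = sum(timestamp - self->ts); self->ts = 0; }'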

Details about each I/O event are provided by three arguments to these io probes. Their DTrace variable names and contents are as follows (a short example follows the list):

  • args[0]: struct bufinfo. Useful details from the buf struct

  • args[1]: struct devinfo. Details about the device: major and minor numbers, instance name, etc.

  • args[2]: struct fileinfo. Details about the file name, path name, file system, offset, etc.
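
As a minimal sketch touching all three arguments (the members used here are standard io provider translator fields), the following clause prints the device name and numbers, the path name and byte offset, and the request size:

io:::start
{
        printf("%s (%d,%d) %s offset %d size %d\n",
            args[1]->dev_statname, args[1]->dev_major, args[1]->dev_minor,
            args[2]->fi_pathname, args[2]->fi_offset, args[0]->b_bcount);
}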

Note that the io probes fire for all I/O requests to peripheral devices and for all file read and file write requests to an NFS server. However, requests for metadata from an NFS server, for example, readdir(3C), do not trigger io probes.

The io probes are documented in detail in Section 10.6.1.

4.15.2. I/O Size One-Liners

You can easily fetch I/O event details with DTrace. The following one-liner command tracks PID, process name, and I/O event size.

# dtrace -n 'io:::start { printf("%d %s %d",pid,execname,args[0]->b_bcount); }'
dtrace: description 'io:::start ' matched 6 probes
CPU     ID                    FUNCTION:NAME
  0     72              bdev_strategy:start 418 nfsd 36864
  0     72              bdev_strategy:start 418 nfsd 36864
  0     72              bdev_strategy:start 418 nfsd 36864
  0     72              bdev_strategy:start 0 sched 512
  0     72              bdev_strategy:start 0 sched 1024
  0     72              bdev_strategy:start 418 nfsd 1536
  0     72              bdev_strategy:start 418 nfsd 1536
...


This command assumes that the correct PID is on the CPU for the start of an I/O event, which in this case is fine. When you use DTrace to trace PIDs, be sure to consider whether the process is synchronous with the event.
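
A process is not usually on-CPU when its I/O completes: io:::done often fires in interrupt context, where pid and execname describe whatever thread happens to be running. A common workaround, sketched below with array names of our own choosing, is to record the requester at io:::start, keyed by device and block number, and look it up at completion:

#!/usr/sbin/dtrace -s

io:::start
{
        /* execname is only trustworthy here, at request time */
        req_ts[args[0]->b_edev, args[0]->b_blkno] = timestamp;
        req_name[args[0]->b_edev, args[0]->b_blkno] = execname;
}

io:::done
/req_ts[args[0]->b_edev, args[0]->b_blkno]/
{
        /* attribute completion latency to the original requester */
        @lat[req_name[args[0]->b_edev, args[0]->b_blkno]] =
            avg(timestamp - req_ts[args[0]->b_edev, args[0]->b_blkno]);
        req_ts[args[0]->b_edev, args[0]->b_blkno] = 0;
        req_name[args[0]->b_edev, args[0]->b_blkno] = 0;
}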

Tracing I/O activity as it occurs can generate many screenfuls of output. The following one-liner produces a simple summary instead, printing a report of PID, process name, and I/O count. We match on io:genunix::start so that the one-liner matches disk events and not NFS events.

# dtrace -n 'io:genunix::start { @[pid, execname] = count(); }'
dtrace: description 'io:genunix::start ' matched 3 probes
^C
    16585  find                                                             420
    16586  tar                                                             2812
    16584  dd                                                             22443


From the output, we can see that the dd command requested 22,443 disk events, and find requested 420.

4.15.3. A More Elaborate Example

While one-liners can be handy, it is often more useful to write DTrace scripts. The following DTrace script uses the device, buffer, and file name information from the io probes.

#!/usr/sbin/dtrace -s

#pragma D option quiet

dtrace:::BEGIN
{
        printf("%10s %58s %2s %8s\n", "DEVICE", "FILE", "RW", "Size");
}

io:::start
{
        printf("%10s %58s %2s %8d\n", args[1]->dev_statname,
            args[2]->fi_pathname, args[0]->b_flags & B_READ ? "R" : "W",
            args[0]->b_bcount);
}


When run, it provides a simple tracelike output showing the device, file name, read/write flag, and I/O size.

# ./iotrace.d
    DEVICE                                                       FILE RW      SIZE
     cmdk0                               /export/home/rmc/.sh_history  W      4096
     cmdk0                                 /opt/Acrobat4/bin/acroread  R      8192
     cmdk0                                 /opt/Acrobat4/bin/acroread  R      1024
     cmdk0                                 /var/tmp/wscon-:0.0-gLaW9a  W      3072
     cmdk0                           /opt/Acrobat4/Reader/AcroVersion  R      1024
     cmdk0             /opt/Acrobat4/Reader/intelsolaris/bin/acroread  R      8192
     cmdk0             /opt/Acrobat4/Reader/intelsolaris/bin/acroread  R      8192
     cmdk0             /opt/Acrobat4/Reader/intelsolaris/bin/acroread  R      4096
     cmdk0             /opt/Acrobat4/Reader/intelsolaris/bin/acroread  R      8192
     cmdk0             /opt/Acrobat4/Reader/intelsolaris/bin/acroread  R      8192


The way this script traces I/O events as they occur is similar to the way the Solaris snoop command traces network packets. An enhanced version of this script, called iosnoop, is discussed later in this chapter.

Since I/O events are generally "slow" (a few hundred per second, depending on activity), the CPU cost of tracing them with DTrace is minimal (often less than 0.1% CPU).

4.15.4. I/O Size Aggregation

The following short DTrace script makes for an incredibly useful tool; it is available in the DTraceToolkit as bitesize.d. It traces the requested I/O size by process and prints a report that uses the DTrace quantize aggregating function.

#!/usr/sbin/dtrace -s

#pragma D option quiet

dtrace:::BEGIN
{
        printf("Tracing... Hit Ctrl-C to end.\n");
}

io:::start
{
        @size[pid, curpsinfo->pr_psargs] = quantize(args[0]->b_bcount);
}

dtrace:::END
{
        printf("%8s  %s\n", "PID", "CMD");
        printa("%8d  %s\n%@d\n", @size);
}


The script was run while a find / command executed.

# ./bites.d
Tracing... Hit Ctrl-C to end.
^C
     PID  CMD
   14818  find /

           value  ------------- Distribution ------------- count
             512 |                                         0
            1024 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@    2009
            2048 |                                         0
            4096 |                                         0
            8192 |@@@                                      180
           16384 |                                         0


The find command churned through directory files and inodes on disk, causing many small disk events. The distribution plot that DTrace has printed nicely conveys the disk behavior that find caused and is read as follows: 2009 disk events were between 1024 and 2047 bytes in size, and 180 disk events were between 8 Kbytes and 15.9 Kbytes. In summary, we measured find causing a storm of small disk events.

Such a large number of small events usually indicates random disk activity, a characteristic that DTrace can also accurately measure.

Finding the size of disk events alone can be quite valuable. To demonstrate this further, we ran the same script for a different workload. This time we used a tar command to archive the disk.

# ./bites.d
Tracing... Hit Ctrl-C to end.
^C
    8122  tar cf /dev/null /

           value  ------------- Distribution ------------- count
             512 |                                         0
            1024 |@@@@@@@@@@@@@@@@@@@@@@                   226
            2048 |@@                                       19
            4096 |@@                                       23
            8192 |@@@@@@@                                  71
           16384 |                                         3
           32768 |                                         1
           65536 |@                                        8
          131072 |@@@@@                                    52
          262144 |                                         0


While tar must work through many of the same directory files as find, it now also reads through file contents. There are now many events in the 128 to 255 Kbytes bucket because tar has encountered some large files.

And finally, we ran the script with a deliberately large sequential workload: a dd command with specific options.

# ./bites.d
Tracing... Hit Ctrl-C to end.
^C
     PID  CMD
    8112  dd if=/dev/rdsk/c0t0d0s0 of=/dev/null bs=128k

           value  ------------- Distribution ------------- count
           65536 |                                         0
          131072 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 246
          262144 |                                         0


We used the dd command to read 128-Kbyte blocks from the raw device, and that's exactly what happened.

It is interesting to compare raw device behavior with that of the block device. In the following demonstration, we changed the rdsk to dsk and ran dd on a slice that contained a freshly mounted file system.

# ./bites.d
Tracing... Hit Ctrl-C to end.
^C
    8169  dd if=/dev/dsk/c0t0d0s3 of=/dev/null bs=128k

           value  ------------- Distribution -------------  count
           32768 |                                          0
           65536 |                                          1
          131072 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  1027
          262144 |                                          0


No difference there, except that when the end of the slice was reached, a smaller I/O event was issued.

This demonstration becomes interesting after the dd command has been run several times on the same slice. The distribution plot then looks like this.

# ./bites.d
Tracing... Hit Ctrl-C to end.
^C
    8176  dd if=/dev/dsk/c0t0d0s3 of=/dev/null bs=128k

           value  ------------- Distribution ------------- count
            4096 |                                          0
            8192 |@@@@@@@@@@@@@                             400
           16384 |@@@                                       83
           32768 |@                                         29
           65536 |@@                                        46
          131072 |@@@@@@@@@@@@@@@@@@@@@@                    667
          262144 |                                          0


The distribution plot has become quite different, with fewer 128-Kbyte events and many 8-Kbyte events. What is happening is that the block device is reclaiming pages from the page cache and is at times going to disk only to fill in the gaps.

We next used a different DTrace one-liner to examine this further, summing the bytes read by two different invocations of dd: the first (PID 8186) on the dsk device and the second (PID 8187) on the rdsk device.

# dtrace -n 'io:::start { @[pid, args[1]->dev_statname] = sum(args[0]->b_bcount); }'
dtrace: description 'io:::start ' matched 6 probes
^C
     8186  dad1                                                       89710592
     8187  dad1                                                      134874112


The rdsk version read the full slice, 134,874,112 bytes. The dsk version read 89,710,592 bytes, or 66.5% of that.

4.15.5. I/O Seek Aggregation

The following script can help identify random or sequential activity by measuring the seek distance for disk events and generating a distribution plot. The script is available in the DTraceToolkit as seeksize.d.

#!/usr/sbin/dtrace -s

#pragma D option quiet

self int last[dev_t];

dtrace:::BEGIN
{
        printf("Tracing... Hit Ctrl-C to end.\n");
}

io:genunix::start
/self->last[args[0]->b_edev] != 0/
{
        this->last = self->last[args[0]->b_edev];
        this->dist = (int)(args[0]->b_blkno - this->last) > 0 ?
            args[0]->b_blkno - this->last : this->last - args[0]->b_blkno;
        @size[args[1]->dev_statname] = quantize(this->dist);
}

io:genunix::start
{
        self->last[args[0]->b_edev] = args[0]->b_blkno +
            args[0]->b_bcount / 512;
}


Since the buffer struct is available to the io probes, we can examine the block address for each I/O event, provided as args[0]->b_blkno. This address is relative to the slice, so we must be careful to compare addresses only when the events are on the same slice, achieved in the script by matching on args[0]->b_edev.

We are assuming that we can trust the block address and that the disk device did not map it to something strange (or if it did, it was mapped proportionally). We are also assuming that the disk device isn't using a frontend cache to initially avoid seeks altogether, as with storage arrays.

The following example uses this script to examine random activity that was generated with filebench.

# ./seeks.d
Tracing... Hit Ctrl-C to end.
^C

  cmdk0
           value  ------------- Distribution -------------  count
              -1 |                                          0
               0 |@@@@                                      174
               1 |                                          0
               2 |                                          0
               4 |                                          0
               8 |                                          2
              16 |@@                                        104
              32 |@@@                                       156
              64 |@@                                        98
             128 |@                                         36
             256 |@                                         39
             512 |@@                                        70
            1024 |@@                                        71
            2048 |@@                                        71
            4096 |@                                         55
            8192 |@                                         43
           16384 |@                                         63
           32768 |@@                                        91
           65536 |@@@                                       135
          131072 |@@@                                       159
          262144 |@@                                        107
          524288 |@@@@                                      183
         1048576 |@@@@                                      174
         2097152 |                                          0


And the following is for sequential activity from filebench.

# ./seeks.d
Tracing... Hit Ctrl-C to end.
^C

  cmdk0
           value  ------------- Distribution -------------  count
              -1 |                                          0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@   27248
               1 |                                          0
               2 |                                          0
               4 |                                          0
               8 |                                          12
              16 |                                          141
              32 |                                          218
              64 |                                          118
             128 |                                          7
             256 |                                          81
             512 |                                          0


The difference is dramatic. For the sequential test, most of the events incurred a zero-length seek, whereas with the random test, the seeks were distributed up to the 1,048,576 to 2,097,151 bucket. The units are disk blocks (not file system blocks), which are disk sectors (512 bytes); a seek of 1,048,576 blocks therefore spans 512 Mbytes of the disk.

4.15.6. I/O File Names

Sometimes knowing the file name that was accessed is of value. This is another detail that DTrace makes easily available through args[2]->fi_pathname, as demonstrated by the following script.

#!/usr/sbin/dtrace -s

#pragma D option quiet

dtrace:::BEGIN
{
        printf("Tracing... Hit Ctrl-C to end.\n");
}

io:::start
{
        @files[pid, execname, args[2]->fi_pathname] = sum(args[0]->b_bcount);
}

dtrace:::END
{
        normalize(@files, 1024);
        printf("%6s %-12s %6s %s\n", "PID", "CMD", "KB", "FILE");
        printa("%6d %-12.12s %@6d %s\n", @files);
}


Running this script with several files of a known size on a newly mounted file system produces the following.

# ./files.d
Tracing... Hit Ctrl-C to end.
^C
   PID CMD              KB FILE
  5797 bash              1 /extra1
  8376 grep              8 /extra1/lost+found
  8376 grep             10 /extra1/testfile_size10k
  8376 grep             20 /extra1/testfile_size20k
  8376 grep             30 /extra1/testfile_size30k
  8376 grep             64 <none>
  8376 grep          10240 /extra1/testfile_size10m
  8376 grep          20480 /extra1/testfile_size20m
  8376 grep          30720 /extra1/testfile_size30m


Not only can we see that the sizes match the files (see the file names), we can also see that the bash shell has read one kilobyte from the /extra1 directory, no doubt reading the directory contents. The "<none>" file name occurs when file system blocks not related to a file are accessed.
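
If such unnamed blocks are noise for a particular investigation, a predicate can exclude them, as in this sketch:

# dtrace -n 'io:::start /args[2]->fi_pathname != "<none>"/ { @[args[2]->fi_pathname] = sum(args[0]->b_bcount); }'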



