14.7. `readv` and `writev` Functions

The readv and writev functions let us read into and write from multiple noncontiguous buffers in a single function call. These operations are called scatter read and gather write.

    #include <sys/uio.h>

    ssize_t readv(int filedes, const struct iovec *iov, int iovcnt);

    ssize_t writev(int filedes, const struct iovec *iov, int iovcnt);

Both return: number of bytes read or written, -1 on error

The second argument to both functions is a pointer to an array of iovec structures:

    struct iovec {
        void   *iov_base;   /* starting address of buffer */
        size_t  iov_len;    /* size of buffer */
    };

The number of elements in the iov array is specified by iovcnt, which is limited to IOV_MAX (recall Figure 2.10). Figure 14.27 shows a picture relating the arguments to these two functions and the iovec structure.
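As a minimal sketch of how a caller might respect these limits, the following helpers query the run-time value of IOV_MAX and add up the lengths in an iovec array. The names iov_limit and iov_total are hypothetical, and the fallback of 16 assumes the XSI minimum for IOV_MAX when sysconf cannot report a limit.

```c
#include <stddef.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: largest iovcnt this system accepts.
 * Assumption: fall back to 16, the minimum that XSI guarantees,
 * if sysconf cannot report a limit. */
static long iov_limit(void)
{
    long n = sysconf(_SC_IOV_MAX);
    return (n > 0) ? n : 16;
}

/* Hypothetical helper: sum of iov_len over the array, that is, the
 * most bytes a single readv or writev on it could transfer. */
static size_t iov_total(const struct iovec *iov, int iovcnt)
{
    size_t total = 0;

    for (int i = 0; i < iovcnt; i++)
        total += iov[i].iov_len;
    return total;
}
```

A caller could compare its iovcnt against iov_limit() before calling readv or writev, and compare the return value of writev against iov_total() to detect a short write.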

Figure 14.27. The `iovec` structure for `readv` and `writev`

The writev function gathers the output data from the buffers in order: iov[0], iov[1], through iov[iovcnt-1]. It returns the total number of bytes output, which should normally equal the sum of all the buffer lengths.
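The gather write can be sketched as follows. The function name write_record and its string arguments are hypothetical and exist only for illustration; the buffers are output in array order with a single call.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: write a header and a body with one writev call.
 * Returns the byte count from writev, or -1 on error. */
static ssize_t write_record(int fd, char *hdr, char *body)
{
    struct iovec iov[2];

    iov[0].iov_base = hdr;            /* output first */
    iov[0].iov_len  = strlen(hdr);
    iov[1].iov_base = body;           /* output second */
    iov[1].iov_len  = strlen(body);

    return writev(fd, iov, 2);
}
```

Calling write_record(fd, hdr, body) with a 6-byte header and a 5-byte body should normally return 11, with the header bytes appearing before the body bytes in the output.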

The readv function scatters the data into the buffers in order, always filling one buffer before proceeding to the next. readv returns the total number of bytes that were read. A count of 0 is returned if there is no more data and the end of file is encountered.
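A scatter read might look like the sketch below. The fixed sizes HDR_LEN and BODY_LEN and the name read_record are assumptions made for the example. The first buffer is filled completely before any byte is stored in the second.

```c
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#define HDR_LEN   4   /* hypothetical header size */
#define BODY_LEN  8   /* hypothetical body size */

/* Hypothetical helper: read up to HDR_LEN bytes into hdr, then up to
 * BODY_LEN bytes into body, with one readv call.
 * Returns the total bytes read, 0 at end of file, or -1 on error. */
static ssize_t read_record(int fd, char *hdr, char *body)
{
    struct iovec iov[2];

    iov[0].iov_base = hdr;            /* filled first */
    iov[0].iov_len  = HDR_LEN;
    iov[1].iov_base = body;           /* filled only after hdr is full */
    iov[1].iov_len  = BODY_LEN;

    return readv(fd, iov, 2);
}
```

If only 10 bytes are available, the call returns 10: the first 4 bytes land in hdr and the remaining 6 in body, leaving the last 2 bytes of body untouched.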

These two functions originated in 4.2BSD and were later added to SVR4. They are included in the XSI extension of the Single UNIX Specification.

Although the Single UNIX Specification defines the buffer address to be a void *, many implementations that predate the standard still use a char * instead.

Example

In Section 20.8, in the function _db_writeidx, we need to write two buffers consecutively to a file. The second buffer to output is an argument passed by the caller, and the first buffer is one we create, containing the length of the second buffer and a file offset of other information in the file. There are three ways we can do this.

1. Call write twice, once for each buffer.
2. Allocate a buffer of our own that is large enough to contain both buffers, and copy both into the new buffer. We then call write once for this new buffer.
3. Call writev to output both buffers.
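The three approaches can be sketched side by side. This is not the code from Section 20.8 or the test program measured here. The function names and the (header, data) parameter layout are hypothetical, and error handling is kept minimal.

```c
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Method 1: two write calls, one per buffer. */
static ssize_t out_two_writes(int fd, const void *h, size_t hlen,
                              const void *d, size_t dlen)
{
    ssize_t n1, n2;

    if ((n1 = write(fd, h, hlen)) < 0)
        return -1;
    if ((n2 = write(fd, d, dlen)) < 0)
        return -1;
    return n1 + n2;
}

/* Method 2: copy both buffers into a staging buffer, then one write. */
static ssize_t out_copy_write(int fd, const void *h, size_t hlen,
                              const void *d, size_t dlen)
{
    char *buf = malloc(hlen + dlen);
    ssize_t n;

    if (buf == NULL)
        return -1;
    memcpy(buf, h, hlen);
    memcpy(buf + hlen, d, dlen);
    n = write(fd, buf, hlen + dlen);
    free(buf);
    return n;
}

/* Method 3: one writev call, no user-level copy. */
static ssize_t out_writev(int fd, const void *h, size_t hlen,
                          const void *d, size_t dlen)
{
    struct iovec iov[2];

    /* iov_base is not const-qualified; writev only reads the data. */
    iov[0].iov_base = (void *)h;
    iov[0].iov_len  = hlen;
    iov[1].iov_base = (void *)d;
    iov[1].iov_len  = dlen;
    return writev(fd, iov, 2);
}
```

All three produce the same bytes in the output. They differ in the number of system calls and in where the data is copied.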

The solution we use in Section 20.8 is to use writev, but it's instructive to compare it to the other two solutions.

Figure 14.28 shows the results from the three methods just described.

The test program that we measured output a 100-byte header followed by 200 bytes of data. This was done 1,048,576 times, generating a 300-megabyte file. The test program has three separate cases, one for each of the techniques measured in Figure 14.28. We used times (Section 8.16) to obtain the user CPU time, system CPU time, and wall clock time before and after the writes. All three times are shown in seconds.
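The actual test program is not shown here. As a rough sketch of the measurement idea only, the following function times the writev case with times. The struct elapsed type and the function time_writev_to are hypothetical, the output path is assumed to exist already, and a failed times call or clock wraparound is not handled.

```c
#include <fcntl.h>
#include <sys/times.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical result type: all values in seconds. */
struct elapsed {
    double user;    /* user CPU time */
    double sys;     /* system CPU time */
    double clock;   /* wall clock time */
};

/* Write `count` records (100-byte header + 200 bytes of data) to an
 * existing file with writev and report the time spent.
 * All fields are -1.0 if the file cannot be opened. */
static struct elapsed time_writev_to(const char *path, long count)
{
    struct elapsed e = { -1.0, -1.0, -1.0 };
    char hdr[100] = { 0 }, data[200] = { 0 };
    struct iovec iov[2] = { { hdr, sizeof(hdr) }, { data, sizeof(data) } };
    struct tms t0, t1;
    clock_t c0, c1;
    long tck = sysconf(_SC_CLK_TCK);   /* clock ticks per second */
    int fd = open(path, O_WRONLY);

    if (fd < 0 || tck <= 0) {
        if (fd >= 0)
            close(fd);
        return e;
    }

    c0 = times(&t0);                   /* snapshot before the writes */
    for (long i = 0; i < count; i++)
        if (writev(fd, iov, 2) != 300)
            break;                     /* stop on error or short write */
    c1 = times(&t1);                   /* snapshot after the writes */
    close(fd);

    e.user  = (double)(t1.tms_utime - t0.tms_utime) / tck;
    e.sys   = (double)(t1.tms_stime - t0.tms_stime) / tck;
    e.clock = (double)(c1 - c0) / tck;
    return e;
}
```

Calling time_writev_to with a count of 1,048,576 would generate the same amount of output as the test described above. The other two techniques could be timed the same way by replacing the loop body.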

As we expect, the system time increases when we call write twice, compared to calling either write or writev once. This correlates with the results in Figure 3.5.

Next, note that the sum of the CPU times (user plus system) is less when we do a buffer copy followed by a single write than when we make a single call to writev. With the single write, we copy the buffers to a staging buffer at user level, and then the kernel copies the data to its internal buffers when we call write. With writev, we should do less copying, because the kernel only needs to copy the data directly into its staging buffers. The fixed cost of using writev for such small amounts of data, however, is greater than the benefit. As the amount of data we need to copy increases, copying the buffers in our program becomes more expensive, and the writev alternative becomes more attractive.

Be careful not to infer too much about the relative performance of Linux to Mac OS X from the numbers shown in Figure 14.28. The two computers were very different: they had different processor architectures, different amounts of RAM, and disks with different speeds. To do an apples-to-apples comparison of one operating system to another, we need to use the same hardware for each operating system.

Figure 14.28. Timing results comparing `writev` and other techniques
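| Operation | Linux (Intel x86) User | Linux (Intel x86) System | Linux (Intel x86) Clock | Mac OS X (PowerPC) User | Mac OS X (PowerPC) System | Mac OS X (PowerPC) Clock |
|---|---|---|---|---|---|---|
| two `write`s | 1.29 | 3.15 | 7.39 | 1.60 | 17.40 | 19.84 |
| buffer copy, then one `write` | 1.03 | 1.98 | 6.47 | 1.10 | 11.09 | 12.54 |
| one `writev` | 0.70 | 2.72 | 6.41 | 0.86 | 13.58 | 14.72 |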

In summary, we should always try to use the fewest system calls necessary to get the job done. If we are writing small amounts of data, we will find it less expensive to copy the data ourselves and use a single write instead of using writev. We might find, however, that the performance benefits aren't worth the extra complexity of managing our own staging buffers.
