13.1. Input and Output Multiplexing

Many client/server applications need to read input from or write output to multiple file descriptors at a time. For example, modern Web browsers open many simultaneous network connections to reduce the loading time for a Web page. This allows them to download the multiple images that appear on most Web pages more quickly than consecutive connections would allow. Along with the interprocess communication (IPC) channel that graphical browsers use to contact the X server on which they are displayed, browsers have many file descriptors to keep track of.

The easiest way to handle all these files is for the browser to read from each file in turn and process whatever data that file delivers (a read() system call on a network connection, as on a pipe, returns whatever data is currently available and blocks only if no bytes are ready). This approach works fine, as long as all the connections are delivering data fairly regularly.

If one of the network connections gets behind, problems start. When the browser next reads from that file, the browser stops running while the read() blocks, waiting for data to arrive. Needless to say, this is not the behavior the browser's user would prefer.

To help illustrate these problems, here is a short program that reads from two files: p1 and p2. To try it, open three X terminal sessions (or use three virtual consoles). Create named pipes called p1 and p2 (with the mknod or mkfifo command), then run cat > p1 and cat > p2 in two of the terminals while running mpx-blocks in the third. Once everything is running, type some text in each of the cat windows and watch how it appears. Remember that the two cat commands will not write any data into the pipes until the end of a line.

    /* mpx-blocks.c */

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        int fds[2];
        char buf[4096];
        int i;
        int fd;

        if ((fds[0] = open("p1", O_RDONLY)) < 0) {
            perror("open p1");
            return 1;
        }

        if ((fds[1] = open("p2", O_RDONLY)) < 0) {
            perror("open p2");
            return 1;
        }

        fd = 0;
        while (1) {
            /* if data is available read it and display it */
            i = read(fds[fd], buf, sizeof(buf) - 1);
            if (i < 0) {
                perror("read");
                return 1;
            } else if (!i) {
                printf("pipe closed\n");
                return 0;
            }

            buf[i] = '\0';
            printf("read: %s", buf);

            /* read from the other file descriptor */
            fd = (fd + 1) % 2;
        }
    }

Although mpx-blocks does read from both pipes, it does not do a very nice job of it. It reads from only one pipe at a time. When it starts, it reads from the first file until data becomes available on it; the second file is ignored until the read() from the first file returns. Once that read does return, the first file is ignored until data is read from the second file. This method does not perform anything like smooth data multiplexing. Figure 13.1 shows what mpx-blocks looks like when it is run.

Figure 13.1. Running Multiplex Examples

13.1.1. Nonblocking I/O

Recall from Chapter 11 that we may specify that a file is nonblocking through the fcntl() system call. When a slow file is nonblocking, read() always returns immediately. If no data is available, it returns -1 and sets errno to EAGAIN rather than waiting. Nonblocking I/O provides a simple solution to multiplexing by preventing operations on files from ever blocking.

Here is a modified version of mpx-blocks that takes advantage of nonblocking I/O to alternate between p1 and p2 more smoothly:

    /* mpx-nonblock.c */

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        int fds[2];
        char buf[4096];
        int i;
        int fd;

        /* open both pipes in nonblocking mode */
        if ((fds[0] = open("p1", O_RDONLY | O_NONBLOCK)) < 0) {
            perror("open p1");
            return 1;
        }

        if ((fds[1] = open("p2", O_RDONLY | O_NONBLOCK)) < 0) {
            perror("open p2");
            return 1;
        }

        fd = 0;
        while (1) {
            /* if data is available read it and display it */
            i = read(fds[fd], buf, sizeof(buf) - 1);
            if ((i < 0) && (errno != EAGAIN)) {
                perror("read");
                return 1;
            } else if (i > 0) {
                buf[i] = '\0';
                printf("read: %s", buf);
            }

            /* read from the other file descriptor */
            fd = (fd + 1) % 2;
        }
    }

One important difference between mpx-nonblock and mpx-blocks is that mpx-nonblock does not exit when one of the pipes it is reading from is closed. A nonblocking read() from a pipe with no writers returns 0 bytes; from a pipe with writers but no data, read() returns -1 and sets errno to EAGAIN.

Although nonblocking I/O allows us to switch easily between file descriptors, it has a high price. The program is always polling the two file descriptors for input; it never blocks. As the program is constantly running, it inflicts a heavy performance penalty on the system because the operating system can never put the process to sleep (try running 10 copies of mpx-nonblock on your system and see how it affects system performance).

13.1.2. Multiplexing with `poll()`

To allow efficient multiplexing, Linux provides the poll() system call, which allows a process to block on multiple file descriptors simultaneously. Rather than constantly check each file descriptor it is interested in, a process makes a single system call that specifies which file descriptors the process would like to read from or write to. When one or more of those files have data available for reading or can accept data written to them, the poll() returns and the application can read and write from those file descriptors without worrying about blocking. Once those files have been handled, the process makes another poll() call, which blocks until a file is ready for more attention. Here is the definition of poll():

    #include <sys/poll.h>

    int poll(struct pollfd * fds, int numfds, int timeout);

The last two parameters are straightforward; numfds specifies the number of items in the array pointed to by the first parameter, and timeout specifies how long, in milliseconds, poll() should wait for an event to occur. If a negative value such as -1 is used as the timeout, poll() never times out; a timeout of 0 makes poll() return immediately.

The first parameter, fds, describes which file descriptors should be monitored and what types of I/O they should be monitored for. It is a pointer to an array of struct pollfd structures.

    struct pollfd {
        int fd;         /* file descriptor */
        short events;   /* I/O events to wait for */
        short revents;  /* I/O events that occurred */
    };

The first element, fd, is a file descriptor being monitored, and the events element describes what types of events are of interest. It is one or more of the following flags logically OR'ed together:

`POLLIN`	Normal data is available for reading from the file descriptor.
`POLLPRI`	Priority (out-of-band^[1]) data is available for reading.
`POLLOUT`	The file descriptor is able to accept some data being written to it.

^[1] This is almost the only place in this book we ever mention out-of-band data. For more information, consult [Stevens, 2004].

The revents element of struct pollfd is filled in by the poll() system call, and reflects the status of file descriptor fd. It is similar to the events member, but instead of specifying what types of I/O events are of interest to the application, it specifies what types of I/O events are available. For example, if the application is monitoring a pipe both for reading and writing (events is set to POLLIN | POLLOUT), then after the poll() succeeds revents has the POLLIN bit set if the pipe has data ready to be read, and the POLLOUT bit set if there is room in the pipe for more data to be written. If both are true, both bits are set.

There are a few bits that the kernel can set in revents that do not make sense for events:

`POLLERR`	There is an error pending on the file descriptor; performing a system call on the file descriptor will cause `errno` to be set to the appropriate code.
`POLLHUP`	The file has been disconnected; no more writing to it is possible (although there may be data left to be read). This occurs if a terminal has been disconnected or the remote end of a pipe or socket has been closed.
`POLLNVAL`	The file descriptor is invalid (it does not refer to an open file).

The return value of poll() is zero if the call times out, -1 if an error occurs (such as fds being an invalid pointer; errors on the files themselves cause POLLERR to get set), or a positive number describing the number of files with nonzero revents members.
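One way to fold the return value and the revents bits into a single answer is sketched below. The function name wait_readable() and its return codes are assumptions made for this example, not part of the poll() interface.

```c
#include <poll.h>

/* Hypothetical helper: wait up to timeout_ms for fd to become readable.
 * Returns  1 if data can be read,
 *          0 if the call timed out,
 *         -1 if poll() failed (errno is set),
 *         -2 if the descriptor hung up, is in error, or is invalid. */
int wait_readable(int fd, int timeout_ms) {
    struct pollfd pfd;
    int rc;

    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    rc = poll(&pfd, 1, timeout_ms);
    if (rc <= 0)
        return rc;                 /* timeout (0) or poll() error (-1) */

    if (pfd.revents & POLLIN)
        return 1;                  /* data first, even if POLLHUP is set */

    /* only POLLERR, POLLHUP, or POLLNVAL remain */
    return -2;
}
```

Checking POLLIN before the other bits lets a program drain any data left in a pipe before it reacts to the hangup.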

Rather than the inefficient method we used earlier to multiplex input and output from pipes, poll() lets us solve the same problem quite elegantly. By using poll() on the file descriptors for both pipes simultaneously, we know when poll() returns, one of the pipes has data ready to be read or has been closed. We check the revents member for both file descriptors to see what actions need to be taken and go back to the poll() call once we are done. Most of the time is now spent blocking on the poll() call rather than continuously checking the file descriptors using nonblocking I/O, significantly reducing load on the system. Here is mpx-poll:

    /* mpx-poll.c */

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/poll.h>
    #include <unistd.h>

    int main(void) {
        struct pollfd fds[2];
        char buf[4096];
        int i, rc;

        /* open both pipes */
        if ((fds[0].fd = open("p1", O_RDONLY | O_NONBLOCK)) < 0) {
            perror("open p1");
            return 1;
        }

        if ((fds[1].fd = open("p2", O_RDONLY | O_NONBLOCK)) < 0) {
            perror("open p2");
            return 1;
        }

        /* start off reading from both file descriptors */
        fds[0].events = POLLIN;
        fds[1].events = POLLIN;

        /* while we're watching one of fds[0] or fds[1] */
        while (fds[0].events || fds[1].events) {
            /* a timeout of -1 blocks until a descriptor is ready */
            if (poll(fds, 2, -1) < 0) {
                perror("poll");
                return 1;
            }

            /* check to see which file descriptors are ready to be
               read from */
            for (i = 0; i < 2; i++) {
                if (fds[i].revents) {
                    /* fds[i] is ready for reading, go ahead... */
                    rc = read(fds[i].fd, buf, sizeof(buf) - 1);
                    if (rc < 0) {
                        perror("read");
                        return 1;
                    } else if (!rc) {
                        /* this pipe has been closed, don't try
                           to read from it again; poll() ignores
                           entries whose fd is negative */
                        close(fds[i].fd);
                        fds[i].fd = -1;
                        fds[i].events = 0;
                    } else {
                        buf[rc] = '\0';
                        printf("read: %s", buf);
                    }
                }
            }
        }

        return 0;
    }

13.1.3. Multiplexing with `select()`

The poll() system call was originally introduced as part of the System V Unix tree. The BSD development efforts solved the same basic problem in a similar way by introducing the select() system call.

    #include <sys/select.h>

    int select(int numfds, fd_set * readfds, fd_set * writefds,
               fd_set * exceptfds, struct timeval * timeout);

The middle three parameters, readfds, writefds, and exceptfds, specify which file descriptors should be watched. Each parameter is a pointer to an fd_set, a data structure that allows a process to specify an arbitrary number of file descriptors.^[2] It is manipulated through the following macros:

^[2] This is similar to sigset_t used for signal masks.

 FD_ZERO(fd_set * fds);

Clears fds; no file descriptors are contained in the set. This macro is used to initialize fd_set structures.

 FD_SET(int fd, fd_set * fds);

Adds fd to the fd_set.

 FD_CLR(int fd, fd_set * fds);

Removes fd from the fd_set.

 FD_ISSET(int fd, fd_set * fds);

Returns true if fd is contained in set fds.
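The macros are typically combined as in this short sketch. The function names build_set() and count_members() are hypothetical and exist only to illustrate the macros.

```c
#include <sys/select.h>

/* Initialize *set so that it contains exactly the n descriptors
 * listed in fds[]. */
void build_set(const int *fds, int n, fd_set *set) {
    int i;

    FD_ZERO(set);                 /* always start from an empty set */
    for (i = 0; i < n; i++)
        FD_SET(fds[i], set);
}

/* Count how many of the n descriptors in fds[] are members of *set. */
int count_members(const int *fds, int n, fd_set *set) {
    int i, count = 0;

    for (i = 0; i < n; i++)
        if (FD_ISSET(fds[i], set))
            count++;

    return count;
}
```

After building a set from descriptors 0, 3, and 7, a call to `FD_CLR(3, &set)` leaves two members, and `FD_ISSET(3, &set)` is then false.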

The first of select()'s file descriptor sets, readfds, contains the set of file descriptors that will cause the select() call to return when they are ready to be read from^[3] or (for pipes and sockets) when the process on the other end of the file has closed the file. When any of the file descriptors in writefds are ready to be written, select() returns. exceptfds contains the file descriptors to watch for exceptional conditions. Under Linux (as well as Unix), this occurs only when out-of-band data has been received on a network connection. Any of these may be NULL if you are not interested in that type of event.

^[3] When a network socket being listen() ed to is ready to be accept() ed, it is considered ready to be read from for select()'s purposes; information on sockets is in Chapter 17.

The final parameter, timeout, specifies how long the select() call should wait for something to happen. It is a pointer to a struct timeval, which looks like this:

    #include <sys/time.h>

    struct timeval {
        time_t tv_sec;        /* seconds */
        suseconds_t tv_usec;  /* microseconds */
    };

The first element, tv_sec, is the number of seconds to wait, and tv_usec is the number of microseconds to wait. If the timeout value is NULL, select() blocks until something happens. If it points to a struct timeval that contains zero in both its elements, select() does not block. It updates the file descriptor sets to indicate which file descriptors are currently ready for reading or writing, then returns immediately.

The first parameter, numfds, causes the most difficulty. It specifies how many of the file descriptors (starting from file descriptor 0) may be specified by the fd_sets. Another (and perhaps easier) way of thinking of numfds is as one greater than the maximum file descriptor select() is meant to consider.^[4] As Linux normally allows each process to have up to 1,024 file descriptors, numfds prevents the kernel from having to look through all 1,024 file descriptors each fd_set could contain, providing a performance increase.

^[4] If you compare this to the numfds parameter for poll() you'll see where the confusion comes from.
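The calculation of numfds can be sketched as follows. The helper name select_readable() is an assumption for this example; the point is that the first argument to select() is derived from the largest descriptor value, not from the number of descriptors.

```c
#include <stddef.h>
#include <sys/select.h>

/* Hypothetical helper: block until at least one of the n descriptors in
 * fds[] is readable. On success *ready holds the readable descriptors
 * and the return value is greater than 0; -1 means select() failed. */
int select_readable(const int *fds, int n, fd_set *ready) {
    int i, maxfd = -1;

    FD_ZERO(ready);
    for (i = 0; i < n; i++) {
        FD_SET(fds[i], ready);
        if (fds[i] > maxfd)
            maxfd = fds[i];
    }

    /* numfds is the highest descriptor plus one, not the count n */
    return select(maxfd + 1, ready, NULL, NULL, NULL);
}
```

Watching only descriptors 5 and 9, for instance, requires a numfds of 10, whereas the equivalent poll() call would pass 2.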

On return, the three fd_set structures contain the file descriptors that have input pending, may be written to, or are in an exceptional condition. Linux's select() call returns the total number of items set in the three fd_set structures, 0 if the call timed out, or -1 if an error occurred. However, many Unix systems count a particular file descriptor in the return value only once, even if it occurs in both readfds and writefds, so for portability, it is a good idea to check only whether the return value is greater than 0. If the return value is -1, do not assume the fd_set structures remain pristine. Linux leaves them unmodified when select() fails, but some Unix systems behave differently.

Another portability concern is the timeout parameter. Linux kernels^[5] update it to reflect the amount of time left before the select() call would have timed out, but most other Unix systems leave it unchanged.^[6] For portability, do not depend on either behavior; explicitly set the timeout structure before every call to select().

^[5] Except for some experimental kernels in the 2.1 series.

^[6] When Linus Torvalds first implemented select(), the BSD man page for select() listed the BSD kernel's failure to update the timeout as a bug. Rather than write buggy code, Linus decided to "fix" this bug. Unfortunately, the standards committees decided they liked BSD's behavior.
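A portable way to follow this advice is to rebuild both the descriptor set and the timeout immediately before each call, as in this sketch. The helper name readable_within() and its millisecond parameter are assumptions made for the example.

```c
#include <stddef.h>
#include <sys/select.h>

/* Hypothetical helper: wait up to msecs milliseconds for fd to become
 * readable. Returns select()'s result: greater than 0 if readable,
 * 0 on timeout, -1 on error. */
int readable_within(int fd, long msecs) {
    fd_set readfds;
    struct timeval tv;

    /* rebuild both the set and the timeout before every call, since
       select() may overwrite either one */
    FD_ZERO(&readfds);
    FD_SET(fd, &readfds);
    tv.tv_sec = msecs / 1000;
    tv.tv_usec = (msecs % 1000) * 1000;

    return select(fd + 1, &readfds, NULL, NULL, &tv);
}
```

Because the struct timeval is local and filled in on every call, the function behaves the same whether or not the kernel modifies it.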

Now let's look at a couple of examples of using select(). First of all, we use select() for something unrelated to files: constructing a subsecond sleep() call.

    #include <stdlib.h>
    #include <sys/select.h>

    int usecsleep(int usecs) {
        struct timeval tv;

        tv.tv_sec = 0;
        tv.tv_usec = usecs;

        /* no descriptors are watched, so only the timeout matters */
        return select(0, NULL, NULL, NULL, &tv);
    }

This code allows highly portable pauses of less than one second (which BSD's usleep() library function allows, as well, but select() is much more portable). For example, usecsleep(500000) causes a minimum of a half-second pause.

The select() call can also be used to solve the pipe multiplexing example we have been working with. The solution is very similar to the one using poll().

    /* mpx-select.c */

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/select.h>
    #include <unistd.h>

    int main(void) {
        int fds[2];
        char buf[4096];
        int i, rc, maxfd;
        fd_set watchset;       /* fds to read from */
        fd_set inset;          /* updated by select() */

        /* open both pipes */
        if ((fds[0] = open("p1", O_RDONLY | O_NONBLOCK)) < 0) {
            perror("open p1");
            return 1;
        }

        if ((fds[1] = open("p2", O_RDONLY | O_NONBLOCK)) < 0) {
            perror("open p2");
            return 1;
        }

        /* start off reading from both file descriptors */
        FD_ZERO(&watchset);
        FD_SET(fds[0], &watchset);
        FD_SET(fds[1], &watchset);

        /* find the maximum file descriptor */
        maxfd = fds[0] > fds[1] ? fds[0] : fds[1];

        /* while we're watching one of fds[0] or fds[1] */
        while (FD_ISSET(fds[0], &watchset) ||
               FD_ISSET(fds[1], &watchset)) {
            /* we copy watchset here because select() updates it */
            inset = watchset;
            if (select(maxfd + 1, &inset, NULL, NULL, NULL) < 0) {
                perror("select");
                return 1;
            }

            /* check to see which file descriptors are ready to be
               read from */
            for (i = 0; i < 2; i++) {
                if (FD_ISSET(fds[i], &inset)) {
                    /* fds[i] is ready for reading, go ahead... */
                    rc = read(fds[i], buf, sizeof(buf) - 1);
                    if (rc < 0) {
                        perror("read");
                        return 1;
                    } else if (!rc) {
                        /* this pipe has been closed, don't try
                           to read from it again */
                        close(fds[i]);
                        FD_CLR(fds[i], &watchset);
                    } else {
                        buf[rc] = '\0';
                        printf("read: %s", buf);
                    }
                }
            }
        }

        return 0;
    }

13.1.4. Comparing `poll()` and `select()`

Although poll() and select() perform the same basic function, there are real differences between the two. The most obvious is probably the timeout, which has millisecond precision for poll() and microsecond precision for select(). In reality, this difference is almost meaningless, as neither is going to be accurate down to the microsecond.

The more important difference is performance. The poll() interface differs from select() in a few ways that make it much more efficient.

1.	When `select()` is used, the kernel must check all of the file descriptors between 0 and `numfds - 1` to see if the application is interested in I/O events for that file descriptor. For applications with large numbers of open files, this can cause substantial waste as the kernel checks which file descriptors are of interest.
2.	The set of file descriptors is passed to the kernel as a bitmap for `select()` and as a list for `poll()`. The somewhat complicated bit operations required to check and set the `fd_set` data structures are less efficient than the simple checks needed for `struct pollfd`.
3.	As the kernel overwrites the data structures passed to `select()`, the application is forced to reset those structures every time it needs to call `select()`. With `poll()` the kernel's results are limited to the `revents` member, removing the need for data structures to be rebuilt before every call.
4.	Using a set-based structure like `fd_set` does not scale as the number of file descriptors available to a process increases. Since it is a static size (rather than dynamically allocated; note the lack of a corresponding macro like `FD_FREE`), it cannot grow or shrink with the needs of the application (or the abilities of the kernel). Under Linux, the maximum file descriptor that can be set in an `fd_set` is 1023. If a larger file descriptor may be needed, `select()` will not work.
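The size limit in the last item can be guarded against explicitly. The sketch below uses the standard FD_SETSIZE constant; the helper name checked_fd_set() is hypothetical.

```c
#include <sys/select.h>

/* Hypothetical helper: add fd to *set only if it fits in an fd_set.
 * Returns 0 on success, -1 if fd is out of range for select(). */
int checked_fd_set(int fd, fd_set *set) {
    if (fd < 0 || fd >= FD_SETSIZE)
        return -1;                /* FD_SET() here would be out of bounds */

    FD_SET(fd, set);
    return 0;
}
```

A program that receives -1 from such a check has to fall back to poll() or another interface for that descriptor, since FD_SET() itself performs no range checking.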

The only advantage select() offers over poll() is better portability to old systems. As very few of those implementations are still in use, you should consider select() of interest primarily for understanding and maintaining existing code bases.

To illustrate how much less efficient select() is than poll(), here is a short program that measures the number of poll() and select() system calls that can be performed in a second:

    /* select-vs-poll.c */

    #include <fcntl.h>
    #include <signal.h>
    #include <stdio.h>
    #include <sys/poll.h>
    #include <sys/select.h>
    #include <unistd.h>

    /* set from the signal handler, so it is volatile sig_atomic_t */
    volatile sig_atomic_t gotAlarm;

    void catch(int sig) {
        gotAlarm = 1;
    }

    #define HIGH_FD 1000

    int main(int argc, const char ** argv) {
        int devZero;
        int count;
        fd_set selectFds;
        struct pollfd pollFds;

        devZero = open("/dev/zero", O_RDONLY);
        if (devZero < 0 || dup2(devZero, HIGH_FD) < 0) {
            perror("/dev/zero");
            return 1;
        }

        /* use a signal to know when time's up */
        signal(SIGALRM, catch);

        gotAlarm = 0;
        count = 0;
        alarm(1);
        while (!gotAlarm) {
            FD_ZERO(&selectFds);
            FD_SET(HIGH_FD, &selectFds);

            select(HIGH_FD + 1, &selectFds, NULL, NULL, NULL);
            count++;
        }

        printf("select() calls per second: %d\n", count);

        pollFds.fd = HIGH_FD;
        pollFds.events = POLLIN;
        count = 0;
        gotAlarm = 0;
        alarm(1);
        while (!gotAlarm) {
            /* numfds is 1: the array holds a single struct pollfd */
            poll(&pollFds, 1, 0);
            count++;
        }

        printf("poll() calls per second: %d\n", count);

        return 0;
    }

It uses /dev/zero, which provides an infinite number of zeros so that the system calls return immediately. The HIGH_FD value can be changed to see how select() degrades as the file descriptor values increase.

On one particular system, with a HIGH_FD value of 2 (which is not very high), this program showed that the kernel could handle four times as many poll() calls per second as select() calls. When HIGH_FD was increased to 1,000, poll() became forty times more efficient than select().

13.1.5. Multiplexing with `epoll`

The 2.6 version of the Linux kernel introduced a third method for multiplexed I/O, called epoll. While epoll is more complicated than either poll() or select(), it removes a performance bottleneck common to both of those methods.

Both the poll() and select() system calls pass a full list of file descriptors to monitor each time they are called. Every one of those file descriptors must be processed by the system call, even if only one of them is ready for reading or writing. When tens, or hundreds, or thousands of file descriptors are being monitored, those system calls become bottlenecks; the kernel spends a large amount of time checking to see which file descriptors need to be checked by the application.

When epoll is used, applications provide the kernel with a list of file descriptors to monitor through one system call, and then monitor those file descriptors using a different system call. Once the list has been created, the kernel continually monitors those file descriptors for the events the application is interested in,^[7] and when an event occurs, it makes a note that something interesting just happened. As soon as the application asks the kernel which file descriptors are ready for further processing, the kernel provides the list it has been maintaining without having to check every file descriptor.

^[7] The kernel actually sets a callback on each file, and when those events occur the callback is invoked. This mechanism eliminates the scaling problems with very large numbers of file descriptors as polling is not used at any point.

The performance advantages of epoll require a system call interface that is more complicated than those of poll() and select(). While poll() uses an array of struct pollfd to represent a set of file descriptors and select() uses three different fd_set structures for the same purpose, epoll moves these file descriptor sets into the kernel rather than keeping them in the program's address space. Each of these sets is referenced through an epoll descriptor, which is a file descriptor that can be used only for epoll system calls. New epoll descriptors are allocated by the epoll_create() system call.

    #include <sys/epoll.h>

    int epoll_create(int numDescriptors);

The sole parameter, numDescriptors, is the program's best guess at how many file descriptors the newly created epoll descriptor will reference. This is not a hard limit; it is just a hint to the kernel to help it initialize its internal structures more accurately. epoll_create() returns an epoll descriptor, and when the program has finished with that descriptor it should be passed to close() to allow the kernel to free any memory used by that descriptor.

Although the epoll descriptor is a file descriptor, there are only two system calls it should be used with.

    #include <sys/epoll.h>

    int epoll_ctl(int epfd, int op, int fd, struct epoll_event * event);
    int epoll_wait(int epfd, struct epoll_event * events, int maxevents,
                   int timeout);

Both of these use parameters of the type struct epoll_event, which is defined as follows:

    #include <sys/epoll.h>

    struct epoll_event {
        int events;
        union {
            void * ptr;
            int fd;
            unsigned int u32;
            unsigned long long u64;
        } data;
    };

This structure serves three purposes: It specifies what types of events should be monitored, specifies what types of events occurred, and allows a single data element to be associated with the file descriptor. The events field is for the first two functions, and is one or more of the following values logically OR'ed together:^[8]

^[8] EPOLLET is one more value events can have, which switches epoll from being level-triggered to edge-triggered. This topic is beyond the scope of this book, and edge-triggered epoll should be used only under very special circumstances.

`EPOLLIN`	Indicates that a `read()` operation will not block; either data is ready or there is no more data to be read.
`EPOLLOUT`	The associated file is ready to be written to.
`EPOLLPRI`	The file has out-of-band data ready for reading.

The second member of struct epoll_event, data, is a union that contains an integer (for holding a file descriptor), a pointer, and 32-bit and 64-bit integers.^[9] This data element is kept by epoll and returned to the program whenever an event of the appropriate type occurs. The data element is the only way the program has to know which file descriptor needs to be serviced; the epoll interface does not pass the file descriptor to the program, unlike poll() and select() (unless data contains the file descriptor). This method gives extra flexibility to applications that track files as something more complicated than simple file descriptors.

^[9] The structure shown in the text gives the right member sizes on most platforms, but it is not correct for machines that define an int as 64 bits.
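To show the flexibility the data element offers, the following sketch attaches a pointer to an application record rather than a bare file descriptor. The struct source type and the functions watch_source() and next_ready() are hypothetical names; they rely on the epoll_ctl() and epoll_wait() calls whose prototypes appear above.

```c
#include <stddef.h>
#include <sys/epoll.h>

/* Hypothetical per-file record an application might keep. */
struct source {
    int fd;
    const char *name;
};

/* Register src->fd for reading and attach src itself as the data
 * element. Returns 0 on success, -1 on error. */
int watch_source(int epfd, struct source *src) {
    struct epoll_event event;

    event.events = EPOLLIN;
    event.data.ptr = src;         /* handed back unchanged by epoll_wait() */

    return epoll_ctl(epfd, EPOLL_CTL_ADD, src->fd, &event);
}

/* Wait for one event and return the record that was attached to the
 * ready descriptor, or NULL on timeout or error. */
struct source *next_ready(int epfd, int timeout_ms) {
    struct epoll_event event;

    if (epoll_wait(epfd, &event, 1, timeout_ms) != 1)
        return NULL;

    return event.data.ptr;
}
```

Because data is a union, a program must pick one member per registration; storing a pointer means the file descriptor has to be kept inside the record, as it is here.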

The epoll_ctl() system call adds and removes file descriptors from the set the epfd epoll descriptor refers to.

The second parameter, op, describes how the file descriptor set should be modified, and is one of the following:

`EPOLL_CTL_ADD`	The file descriptor `fd` is added to the file descriptor set, monitored for the events given in the structure `event` points to. If the file descriptor is already present, `epoll_ctl()` fails with `EEXIST`. (It is possible that multiple threads will try to add the same file descriptor to an `epoll` set more than once, and doing so does not change anything.)
`EPOLL_CTL_DEL`	The file descriptor `fd` is removed from the set of file descriptors that is being monitored. The `event` parameter must point to a `struct epoll_event`, but the contents of that structure are ignored. (This is another way of saying `event` needs to be a valid pointer; it cannot be `NULL`.)
`EPOLL_CTL_MOD`	The `struct epoll_event` for `fd` is updated from the information pointed to by `event`. This allows the set of events being monitored and the data element associated with the file descriptor to be updated without introducing any race conditions.
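The three operations fit together as in this sketch. The helper names watch_fd() and unwatch_fd() are assumptions for the example; the fallback from EPOLL_CTL_ADD to EPOLL_CTL_MOD is one possible policy, not something epoll requires.

```c
#include <errno.h>
#include <sys/epoll.h>

/* Hypothetical helper: watch fd for the given events, whether or not
 * it is already in the set. Returns 0 on success, -1 on error. */
int watch_fd(int epfd, int fd, int events) {
    struct epoll_event event;

    event.events = events;
    event.data.fd = fd;

    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &event) == 0)
        return 0;

    if (errno != EEXIST)
        return -1;

    /* already present: replace the monitored events instead */
    return epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &event);
}

/* Hypothetical helper: stop watching fd. The structure's contents are
 * ignored, but a valid pointer is passed as the text advises. */
int unwatch_fd(int epfd, int fd) {
    struct epoll_event ignored = { 0 };

    return epoll_ctl(epfd, EPOLL_CTL_DEL, fd, &ignored);
}
```

Removing a descriptor that is not in the set fails with ENOENT, so unwatch_fd() can be used to detect double removal.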

The final system call used by epoll is epoll_wait(), which blocks until one or more of the file descriptors being monitored has data to read or is ready to be written to. The first argument is the epoll descriptor, and the last provides a timeout in milliseconds (a timeout of -1 blocks indefinitely). If no file descriptors are ready for processing before the timeout expires, epoll_wait() returns 0.

The middle two parameters specify a buffer for the kernel to copy a set of struct epoll_event structures into. The events parameter points to the buffer, maxevents specifies how many struct epoll_event structures fit in that buffer, and the return value tells the program how many structures were placed in that buffer (unless the call times out or an error occurs).

Each struct epoll_event tells the program the full status of a file descriptor that is being monitored. The events member can have any of the EPOLLIN, EPOLLOUT, or EPOLLPRI flags set, as well as two new flags.

`EPOLLERR`	An error condition is pending on the file; this can occur if an error occurs on a socket when the application is not reading from or writing to it.
`EPOLLHUP`	A hangup occurred on the file descriptor; see the description of `POLLHUP` in the discussion of `poll()` earlier in this section for information on when this can occur.
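A program can separate readable descriptors from those that only report an error or hangup, as sketched here. The function name wait_event() and its return codes are hypothetical, and the sketch assumes each descriptor was registered with its own number stored in data.fd.

```c
#include <sys/epoll.h>

/* Hypothetical helper: wait up to timeout_ms milliseconds for one event
 * on epfd. Returns  1 if the descriptor can be read,
 *                   0 if the call timed out,
 *                  -1 if epoll_wait() failed,
 *                  -2 if only an error or hangup was reported.
 * On 1 or -2, *fd is set from the event's data element. */
int wait_event(int epfd, int timeout_ms, int *fd) {
    struct epoll_event event;
    int num = epoll_wait(epfd, &event, 1, timeout_ms);

    if (num <= 0)
        return num;

    *fd = event.data.fd;

    if (event.events & EPOLLIN)
        return 1;

    /* EPOLLERR and EPOLLHUP are reported even though they were never
       requested in events */
    return -2;
}
```

For a pipe, the -2 case arises once every writer has closed its end and all buffered data has been read.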

While all of this seems complicated, it is actually very similar to how poll() works. Calling epoll_create() is the same as allocating the struct pollfd array, and epoll_ctl() is the same step as initializing the members of that array. The main loop that is processing file descriptors uses epoll_wait() instead of the poll() system call, and close() is analogous to freeing the struct pollfd array. These parallels make switching programs that were originally written around poll() or select() to epoll quite straightforward.

The epoll interface allows one more trick that cannot really be compared to poll() or select(). Since the epoll descriptor is really a file descriptor (which is why it can be passed to close()), you can monitor that epoll descriptor as part of another epoll descriptor, or via poll() or select(). The epoll descriptor will appear as ready to be read from whenever calling epoll_wait() would return events.

Our final solution to the pipe multiplexing problem we have used throughout this section uses epoll. It is very similar to the other examples, but some of the initialization code has been moved into a new addEvent() function to keep the program from getting longer than necessary.

    /* mpx-epoll.c */

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/epoll.h>
    #include <unistd.h>

    #include <sys/poll.h>

    void addEvent(int epfd, char * filename) {
        int fd;
        struct epoll_event event;

        if ((fd = open(filename, O_RDONLY | O_NONBLOCK)) < 0) {
            perror("open");
            exit(1);
        }

        event.events = EPOLLIN;
        event.data.fd = fd;

        if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &event)) {
            perror("epoll_ctl(ADD)");
            exit(1);
        }
    }

    int main(void) {
        char buf[4096];
        int i, rc;
        int epfd;
        struct epoll_event events[2];
        int num;
        int numFds;

        epfd = epoll_create(2);
        if (epfd < 0) {
            perror("epoll_create");
            return 1;
        }

        /* open both pipes and add them to the epoll set */
        addEvent(epfd, "p1");
        addEvent(epfd, "p2");

        /* continue while we have one or more file descriptors to
           watch */
        numFds = 2;
        while (numFds) {
            if ((num = epoll_wait(epfd, events,
                                  sizeof(events) / sizeof(*events),
                                  -1)) <= 0) {
                perror("epoll_wait");
                return 1;
            }

            for (i = 0; i < num; i++) {
                /* events[i].data.fd is ready for reading */

                rc = read(events[i].data.fd, buf, sizeof(buf) - 1);
                if (rc < 0) {
                    perror("read");
                    return 1;
                } else if (!rc) {
                    /* this pipe has been closed, don't try
                       to read from it again */
                    if (epoll_ctl(epfd, EPOLL_CTL_DEL,
                                  events[i].data.fd, &events[i])) {
                        perror("epoll_ctl(DEL)");
                        return 1;
                    }

                    close(events[i].data.fd);

                    numFds--;
                } else {
                    buf[rc] = '\0';
                    printf("read: %s", buf);
                }
            }
        }

        close(epfd);

        return 0;
    }

13.1.6. Comparing `poll()` and `epoll`

The differences between poll() and epoll are straightforward: poll() is well standardized but does not scale well, while epoll exists only on Linux but scales very well. Applications that watch a small number of file descriptors and value portability should use poll(). An application that needs to monitor a large number of descriptors is better off with epoll, even if it must also support poll() on other platforms.
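
One way to get epoll on Linux while keeping a poll() fallback elsewhere is to hide both behind a small interface chosen at compile time. The following is only a minimal sketch under stated assumptions: the `struct watcher` type, the `watcher_*` function names, and the `WATCH_MAX` cap are hypothetical and do not come from the example programs in this chapter.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>
#ifdef __linux__
#include <sys/epoll.h>
#endif

#define WATCH_MAX 64   /* arbitrary cap for the poll() fallback */

/* Hypothetical wrapper type; not part of any library. */
struct watcher {
#ifdef __linux__
    int epfd;                      /* epoll set, registered once */
#else
    struct pollfd fds[WATCH_MAX];  /* handed to the kernel on every wait */
    int count;
#endif
};

int watcher_init(struct watcher *w) {
    assert(w != NULL);
#ifdef __linux__
    w->epfd = epoll_create(1);     /* the size hint only has to be > 0 */
    return w->epfd < 0 ? -1 : 0;
#else
    w->count = 0;
    return 0;
#endif
}

/* Start watching fd for input; returns 0 on success. */
int watcher_add(struct watcher *w, int fd) {
#ifdef __linux__
    struct epoll_event ev;
    ev.events = EPOLLIN;
    ev.data.fd = fd;
    return epoll_ctl(w->epfd, EPOLL_CTL_ADD, fd, &ev);
#else
    if (w->count == WATCH_MAX)
        return -1;
    w->fds[w->count].fd = fd;
    w->fds[w->count].events = POLLIN;
    w->count++;
    return 0;
#endif
}

/* Returns one ready descriptor, or -1 on timeout or error. */
int watcher_wait(struct watcher *w, int timeout_ms) {
#ifdef __linux__
    struct epoll_event ev;
    if (epoll_wait(w->epfd, &ev, 1, timeout_ms) <= 0)
        return -1;
    return ev.data.fd;
#else
    int i;
    if (poll(w->fds, w->count, timeout_ms) <= 0)
        return -1;
    for (i = 0; i < w->count; i++)
        if (w->fds[i].revents & (POLLIN | POLLHUP | POLLERR))
            return w->fds[i].fd;
    return -1;
#endif
}

void watcher_close(struct watcher *w) {
#ifdef __linux__
    close(w->epfd);
#else
    w->count = 0;
#endif
}
```

A caller would use watcher_init(), add each descriptor once with watcher_add(), and then loop on watcher_wait(). The fixed-size array keeps the fallback short; a real program that expects many descriptors would probably grow it dynamically.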

The performance difference between the two methods can be dramatic. To illustrate how much better epoll scales, poll-vs-epoll.c measures how many poll() and epoll_wait() system calls can be made in one second for file descriptor sets of various sizes (the number of file descriptors to put in the set is specified on the command line). Every descriptor in the set refers to the read end of a single pipe; the duplicates are created with dup2().

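Table 13.1 summarizes the results of running poll-vs-epoll.c for set sizes ranging from a single file descriptor to 100,000 file descriptors.^[10] While the number of system calls per second drops off rapidly for poll(), it stays nearly constant for epoll.^[11] This follows from the interfaces: every poll() call hands the kernel the entire descriptor array to examine, whereas an epoll set is registered once with epoll_ctl() and epoll_wait() only collects descriptors already known to be ready. As the table makes clear, epoll places far less load on the system than poll() does, and scales far better as a result.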

^[10] The program needs to be run as root for sets larger than about 1,000 descriptors.

^[11] This testing was not done particularly scientifically. Only a single test run was done, so the results show a little bit of jitter that would disappear over a large number of repetitions.

Table 13.1. Comparing `poll()` and `epoll`
File Descriptors	`poll()` calls per second	`epoll_wait()` calls per second
1	310,063	714,848
10	140,842	726,108
100	25,866	726,659
1,000	3,343	729,072
5,000	612	718,424
10,000	300	730,483
25,000	108	717,097
50,000	38	729,746
100,000	18	712,301
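
The root requirement mentioned in footnote 10 comes from the per-process limit on open files: raising the hard limit takes privilege, while staying under it does not. As a rough sketch, a check like the one below could tell whether a given set size fits without privilege. The helper name `fitsUnderHardLimit` is hypothetical and is not part of poll-vs-epoll.c, and the actual threshold is whatever hard limit the process inherited, which varies between systems.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/resource.h>

/* Hypothetical helper (not from poll-vs-epoll.c): reports whether
   `wanted` open descriptors fit under the current hard limit, so that
   no privilege is needed. Returns 1 if they fit, 0 if they do not,
   and -1 if the limit could not be read. */
int fitsUnderHardLimit(rlim_t wanted) {
    struct rlimit lim;

    assert(wanted > 0);

    if (getrlimit(RLIMIT_NOFILE, &lim)) {
        perror("getrlimit");
        return -1;
    }

    printf("soft limit: %llu, hard limit: %llu\n",
           (unsigned long long) lim.rlim_cur,
           (unsigned long long) lim.rlim_max);

    return lim.rlim_max == RLIM_INFINITY || wanted <= lim.rlim_max;
}
```

Since the test program needs a few descriptors beyond the set itself, a caller would pass the set size plus that margin, for example fitsUnderHardLimit(numFds + 10).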

    /* poll-vs-epoll.c */

    #include <errno.h>
    #include <fcntl.h>
    #include <poll.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/epoll.h>
    #include <sys/resource.h>
    #include <unistd.h>

    /* set by the SIGALRM handler when the one-second interval ends */
    volatile sig_atomic_t gotAlarm;

    void catch(int sig) {
        gotAlarm = 1;
    }

    /* number of the first duplicated descriptor */
    #define OFFSET 10

    int main(int argc, char ** argv) {
        int pipeFds[2];
        int count;
        int numFds;
        struct pollfd * pollFds;
        struct epoll_event event;
        int epfd;
        int i;
        struct rlimit lim;
        char * end;

        if (!argv[1]) {
            fprintf(stderr, "number expected\n");
            return 1;
        }

        numFds = strtol(argv[1], &end, 0);
        if (*end || numFds < 1) {
            fprintf(stderr, "number expected\n");
            return 1;
        }

        printf("Running test on %d file descriptors.\n", numFds);

        /* make room for the duplicates; raising the hard limit
           requires root */
        lim.rlim_cur = numFds + OFFSET;
        lim.rlim_max = numFds + OFFSET;
        if (setrlimit(RLIMIT_NOFILE, &lim)) {
            perror("setrlimit");
            exit(1);
        }

        if (pipe(pipeFds)) {
            perror("pipe");
            exit(1);
        }

        pollFds = malloc(sizeof(*pollFds) * numFds);
        if (!pollFds) {
            perror("malloc");
            exit(1);
        }

        epfd = epoll_create(numFds);
        if (epfd < 0) {
            perror("epoll_create");
            exit(1);
        }
        event.events = EPOLLIN;

        /* duplicate the read end of the pipe numFds times and add
           each duplicate to both the poll() array and the epoll set */
        for (i = OFFSET; i < OFFSET + numFds; i++) {
            if (dup2(pipeFds[0], i) != i) {
                printf("failed at %d: %s\n", i, strerror(errno));
                exit(1);
            }

            pollFds[i - OFFSET].fd = i;
            pollFds[i - OFFSET].events = POLLIN;

            event.data.fd = i;
            if (epoll_ctl(epfd, EPOLL_CTL_ADD, i, &event)) {
                perror("epoll_ctl(ADD)");
                exit(1);
            }
        }

        /* use a signal to know when time's up */
        signal(SIGALRM, catch);

        count = 0;
        gotAlarm = 0;
        alarm(1);
        while (!gotAlarm) {
            poll(pollFds, numFds, 0);
            count++;
        }

        printf("poll() calls per second: %d\n", count);

        count = 0;
        gotAlarm = 0;
        alarm(1);
        while (!gotAlarm) {
            epoll_wait(epfd, &event, 1, 0);
            count++;
        }

        printf("epoll_wait() calls per second: %d\n", count);

        return 0;
    }
