6.4. Descriptor Management and Services

For user processes, all I/O is done through descriptors. The user interface to descriptors was described in Section 2.6. This section describes how the kernel manages descriptors and how it provides descriptor services, such as locking and polling.

System calls that refer to open files take a file descriptor as an argument to specify the file. The file descriptor is used by the kernel to index into the descriptor table for the current process (kept in the filedesc structure, a substructure of the process structure for the process) to locate a file entry, or file structure. The relations of these data structures are shown in Figure 6.4.

Figure 6.4. File-descriptor reference to a file entry.


The file entry provides a file type and a pointer to an underlying object for the descriptor. There are six object types supported in FreeBSD:

  • For data files, the file entry points to a vnode structure that references a substructure containing the filesystem-specific information described in Chapters 8 and 9. The vnode layer is described in Section 6.5. Special files do not have data blocks allocated on the disk; they are handled by the special-device filesystem that calls appropriate drivers to handle I/O for them.

  • For access to interprocess communication including networking, the FreeBSD file entry may reference a socket.

  • For unnamed high-speed local communication, the file entry will reference a pipe. Earlier FreeBSD systems used sockets for local communication, but optimized support was added for pipes to improve their performance.

  • For named high-speed local communication, the file entry will reference a fifo. As with pipes, optimized support was added for fifos to improve performance.

  • For systems that have cryptographic support in hardware, the descriptor may provide direct access to that hardware.

  • To provide notification of kernel events, the descriptor will reference a kqueue.

The virtual-memory system supports the mapping of files into a process's address space. Here, the file descriptor must reference a vnode that will be partially or completely mapped into the user's address space.

Open File Entries

The set of file entries is the focus of activity for file descriptors. They contain the information necessary to access the underlying objects and to maintain common information.

The file entry is an object-oriented data structure. Each entry contains a type and an array of function pointers that translate the generic operations on file descriptors into the specific actions associated with their type. The operations that must be implemented for each type are as follows:

  • Read from the descriptor

  • Write to the descriptor

  • Poll the descriptor

  • Do ioctl operations on the descriptor

  • Collect stat information for the descriptor

  • Check to see if there are any kqueue events pending for the descriptor

  • Close and possibly deallocate the object associated with the descriptor
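The dispatch through the per-type function table can be sketched outside the kernel. The following toy Python model (hypothetical names, nothing like the kernel's actual C structures) shows the shape of the mechanism: a generic entry that knows only its operations table and an opaque per-instance object.

```python
# Toy sketch of the file-entry dispatch idea: each descriptor type supplies
# its own implementations of the generic operations, and the entry forwards
# calls through the table without knowing the underlying type.
class FileOps:
    def __init__(self, read, write, close):
        self.read, self.write, self.close = read, write, close

class FileEntry:
    def __init__(self, ops, obj):
        self.ops = ops        # type-specific function table
        self.obj = obj        # opaque per-instance data
    def read(self, n):
        return self.ops.read(self.obj, n)

# A "pipe-like" type backed by an in-memory buffer.
def pipe_read(obj, n):
    data, obj["buf"] = obj["buf"][:n], obj["buf"][n:]
    return data

pipe_ops = FileOps(pipe_read, None, None)
entry = FileEntry(pipe_ops, {"buf": b"hello"})
```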

Note that there is no open routine defined in the object table. FreeBSD treats descriptors in an object-oriented fashion only after they are created. This approach was taken because the various descriptor types have different characteristics. Generalizing the interface to handle all types of descriptors at open time would have complicated an otherwise simple interface. Vnode descriptors are created by the open system call; socket descriptors are created by the socket system call; pipe descriptors are created by the pipe system call.

Each file entry has a pointer to a data structure that contains information specific to the instance of the underlying object. The data structure is opaque to the routines that manipulate the file entry. A reference to the data structure is passed on each call to a function that implements a file operation. All state associated with an instance of an object must be stored in that instance's data structure; the underlying objects are not permitted to manipulate the file entry themselves.

The read and write system calls do not take an offset in the file as an argument. Instead, each read or write updates the current file offset in the file according to the number of bytes transferred. The offset determines the position in the file for the next read or write. The offset can be set directly by the lseek system call. Since more than one process may open the same file, and each such process needs its own offset for the file, the offset cannot be stored in the per-object data structure. Thus, each open system call allocates a new file entry, and the open file entry contains the offset.

Some semantics associated with all file descriptors are enforced at the descriptor level, before the underlying system call is invoked. These semantics are maintained in a set of flags associated with the descriptor. For example, the flags record whether the descriptor is open for reading, writing, or both reading and writing. If a descriptor is marked as open for reading only, an attempt to write it will be caught by the descriptor code. Thus, the functions defined for doing reading and writing do not need to check the validity of the request; we can implement them knowing that they will never receive an invalid request.

The application-visible flags are described in the next subsection. In addition to the application-visible flags, the flags field also has information on whether the descriptor holds a shared or exclusive lock on the underlying file. The locking primitives could be extended to work on sockets, as well as on files. However, the descriptors for a socket rarely refer to the same file entry. The only way for two processes to share the same socket descriptor is for a parent to share the descriptor with its child by forking or for one process to pass the descriptor to another in a message.

Each file entry has a reference count. A single process may have multiple references to the entry because of calls to the dup or fcntl system calls. Also, file structures are inherited by the child process after a fork, so several different processes may reference the same file entry. Thus, a read or write by either process on the twin descriptors will advance the file offset. This semantic allows two processes to read the same file or to interleave output to the same file. Another process that has independently opened the file will refer to that file through a different file structure with a different file offset. This functionality was the original reason for the existence of the file structure; the file structure provides a place for the file offset between the descriptor and the underlying object.
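This offset-sharing behavior is easy to observe from user level. In the sketch below, a dup'd descriptor shares its file entry, and hence its offset, with the original, while an independently opened descriptor gets a fresh file entry with its own offset:

```python
import os, tempfile

fd0, path = tempfile.mkstemp()
os.close(fd0)
with open(path, "wb") as f:
    f.write(b"abcdef")

fd1 = os.open(path, os.O_RDONLY)
fd2 = os.dup(fd1)                   # same file entry: offset is shared
other = os.open(path, os.O_RDONLY)  # new file entry: independent offset

os.read(fd1, 3)                 # advances the shared offset past "abc"
shared = os.read(fd2, 3)        # continues where fd1 left off
independent = os.read(other, 3) # starts at the beginning of the file
```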

Each time that a new reference is created, the reference count is incremented. When a descriptor is closed (any one of (1) explicitly with a close, (2) implicitly after an exec because the descriptor has been marked as close-on-exec, or (3) on process exit), the reference count is decremented. When the reference count drops to zero, the file entry is freed.

The close-on-exec flag is kept in the descriptor table rather than in the file entry. This flag is not shared among all the references to the file entry because it is an attribute of the file descriptor itself. The close-on-exec flag is the only piece of information that is kept in the descriptor table rather than being shared in the file entry.

Management of Descriptors

The fcntl system call manipulates the file structure. It can be used to make the following changes to a descriptor:

  • Duplicate a descriptor as though by a dup system call.

  • Get or set the close-on-exec flag. When a process forks, all the parent's descriptors are duplicated in the child. The child process then execs a new process.

    Any of the child's descriptors that were marked close-on-exec are closed. The remaining descriptors are available to the newly executed process.

  • Set the no-delay (O_NONBLOCK) flag to put the descriptor into nonblocking mode. In nonblocking mode, if any data are available for a read operation, or if any space is available for a write operation, an immediate partial read or write is done. If no data are available for a read operation, or if a write operation would block, the system call returns an error (EAGAIN) showing that the operation would block, instead of putting the process to sleep. This facility is not implemented for local filesystems in FreeBSD, because local-filesystem I/O is always expected to complete within a few milliseconds.

  • Set the synchronous (O_FSYNC) flag to force all writes to the file to be written synchronously to the disk.

  • Set the direct (O_DIRECT) flag to request that the kernel attempt to write the data directly from the user application to the disk rather than copying it via kernel buffers.

  • Set the append (O_APPEND) flag to force all writes to append data to the end of the file, instead of at the descriptor's current location in the file. This feature is useful when, for example, multiple processes are writing to the same log file.

  • Set the asynchronous (O_ASYNC) flag to request that the kernel watch for a change in the status of the descriptor, and arrange to send a signal (SIGIO) when a read or write becomes possible.

  • Send a signal to a process when an exception condition arises, such as when urgent data arrive on an interprocess-communication channel.

  • Set or get the process identifier or process-group identifier to which the two I/O-related signals in the previous steps should be sent.

  • Test or change the status of a lock on a range of bytes within an underlying file. Locking operations are described later in this section.
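The distinction between per-descriptor and per-file-entry state can be observed from user level through the fcntl interface; a sketch in Python, whose fcntl module wraps the system call directly:

```python
import fcntl, os

r, w = os.pipe()

# FD_CLOEXEC is a property of the descriptor-table entry, not the file entry.
fdflags = fcntl.fcntl(r, fcntl.F_GETFD)
fcntl.fcntl(r, fcntl.F_SETFD, fdflags | fcntl.FD_CLOEXEC)

# O_NONBLOCK (like O_APPEND and the other file flags) lives in the file entry.
flflags = fcntl.fcntl(w, fcntl.F_GETFL)
fcntl.fcntl(w, fcntl.F_SETFL, flflags | os.O_NONBLOCK)

# A duplicated descriptor references the same file entry, so it sees the
# nonblocking flag that was set through the original descriptor.
d = os.dup(w)
```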

The implementation of the dup system call is easy. If the process has reached its limit on open files, the kernel returns an error. Otherwise, the kernel scans the current process's descriptor table, starting at descriptor zero, until it finds an unused entry. The kernel allocates the entry to point to the same file entry as does the descriptor being duplicated. The kernel then increments the reference count on the file entry and returns the index of the allocated descriptor-table entry. The fcntl system call provides a similar function, except that it specifies a descriptor from which to start the scan.

Sometimes, a process wants to allocate a specific descriptor-table entry. Such a request is made with the dup2 system call. The process specifies the descriptor-table index into which the duplicated reference should be placed. The kernel implementation is the same as for dup, except that the scan to find a free entry is changed to close the requested entry if that entry is open and then to allocate the entry as before. No action is taken if the new and old descriptors are the same.
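The three flavors of duplication are all reachable from Python's os and fcntl modules, which wrap the corresponding system calls:

```python
import fcntl, os

fd = os.open("/dev/null", os.O_RDONLY)

# dup: the kernel scans from descriptor 0 for the lowest free slot.
d1 = os.dup(fd)

# fcntl F_DUPFD: the same scan, but starting at a caller-chosen index.
d2 = fcntl.fcntl(fd, fcntl.F_DUPFD, 100)

# dup2: the caller names the exact slot, which is closed first if open.
d3 = os.dup2(fd, 150)
```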

The system implements getting or setting the close-on-exec flag via the fcntl system call by making the appropriate change to the flags field of the associated descriptor-table entry. Other attributes that fcntl manipulates operate on the flags in the file entry. However, the implementation of the various flags cannot be handled by the generic code that manages the file entry. Instead, the file flags must be passed through the object interface to the type-specific routines to do the appropriate operation on the underlying object. For example, manipulation of the non-blocking flag for a socket must be done by the socket layer, since only that layer knows whether an operation can block.

The implementation of the ioctl system call is broken into two major levels. The upper level handles the system call itself. The ioctl call includes a descriptor, a command, and a pointer to a data area. The command argument encodes the size of the data area for the parameters and whether the parameters are input, output, or both input and output. The upper level is responsible for decoding the command argument, allocating a buffer, and copying in any input data. If a return value is to be generated and there is no input, the buffer is zeroed. Finally, the ioctl is dispatched through the file-entry ioctl function, along with the I/O buffer, to the lower-level routine that implements the requested operation.

The lower level does the requested operation. Along with the command argument, it receives a pointer to the I/O buffer. The upper level has already checked for valid memory references, but the lower level may do more precise argument validation because it knows more about the expected nature of the arguments. However, it does not need to copy the arguments in or out of the user process. If the command is successful and produces output, the lower level places the results in the buffer provided by the top level. When the lower level returns, the upper level copies the results to the process.
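This two-level structure is hidden from applications, which simply pass a command and a buffer. For example, the FIONREAD command asks how many bytes are waiting to be read on a descriptor; the upper level copies the int-sized argument in, and the lower level fills it in before it is copied back out:

```python
import fcntl, os, struct, termios

r, w = os.pipe()
os.write(w, b"abc")

# FIONREAD: "how many bytes are waiting to be read?"  The command encodes
# that the argument is an int-sized output parameter.
buf = fcntl.ioctl(r, termios.FIONREAD, struct.pack("i", 0))
pending = struct.unpack("i", buf)[0]
```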

Asynchronous I/O

Historically, UNIX systems did not have the ability to do asynchronous I/O beyond the ability to do background writes to the filesystem. An asynchronous I/O interface was defined by the POSIX.1b-1993 realtime group. Shortly after its ratification, an implementation was added to FreeBSD.

An asynchronous read is started with aio_read; an asynchronous write is started with aio_write. The kernel builds an asynchronous I/O request structure that contains all the information needed to do the requested operation. If the request cannot be immediately satisfied from kernel buffers, the request structure is queued for processing by an asynchronous kernel-based I/O daemon and the system call returns. The next available asynchronous I/O daemon handles the request using the usual kernel synchronous I/O path.

When the daemon finishes the I/O, the asynchronous I/O structure is marked as finished along with a return value or error code. The application uses the aio_error system call to poll to find if the I/O is complete. This call is implemented by checking the status of the asynchronous I/O request structure created by the kernel. If an application gets to the point where it cannot proceed until an I/O completes, it can use the aio_suspend system call to wait until an I/O is done. Here, the application is put to sleep on its asynchronous I/O request structure and is awakened by the asynchronous I/O daemon when the I/O completes. Alternatively, the application can request that a specified signal be sent when the I/O is done.

The aio_return system call gets the return value from the asynchronous request once aio_error, aio_suspend, or the arrival of a completion signal has indicated that the I/O is done. FreeBSD has also added the nonstandard aio_waitcomplete system call that combines the functions of aio_suspend and aio_return into a single operation. For either aio_return or aio_waitcomplete, the return information is copied out to the application from the asynchronous I/O request structure and the asynchronous I/O request structure is then freed.
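Python has no bindings for the POSIX aio_* calls, but the submit/poll/collect pattern they define can be mimicked with a thread pool standing in for the kernel's asynchronous I/O daemons. This is an analogy only, not the kernel mechanism:

```python
import os, tempfile
from concurrent.futures import ThreadPoolExecutor

fd, path = tempfile.mkstemp()
os.write(fd, b"payload")

pool = ThreadPoolExecutor(max_workers=2)  # stands in for the aio daemons

# "aio_read": queue the request and return immediately.
future = pool.submit(os.pread, fd, 7, 0)

# "aio_error": poll for completion without blocking.
busy = not future.done()                  # may be True or False

# "aio_return"/"aio_waitcomplete": wait for completion, collect the result.
data = future.result()
pool.shutdown(wait=True)
```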

File-Descriptor Locking

Early UNIX systems had no provision for locking files. Processes that needed to synchronize access to a file had to use a separate lock file. A process would try to create a lock file. If the creation succeeded, then the process could proceed with its update; if the creation failed, the process would wait and then try again. This mechanism had three drawbacks:

  1. Processes consumed CPU time by looping over attempts to create locks.

  2. Locks left lying around because of system crashes had to be removed (normally in a system-startup command script).

  3. Processes running as the special system-administrator user, the superuser, are always permitted to create files, so they had to use a different mechanism.

Although it is possible to work around all these problems, the solutions are not straightforward, so a mechanism for locking files was added in 4.2BSD.

The most general locking schemes allow multiple processes to update a file concurrently. Several of these techniques are discussed in Peterson [1983]. A simpler technique is to serialize access to a file with locks. For standard system applications, a mechanism that locks at the granularity of a file is sufficient. So, 4.2BSD and 4.3BSD provided only a fast, whole-file locking mechanism. The semantics of these locks include allowing locks to be inherited by child processes and releasing locks only on the last close of a file.

Certain applications require the ability to lock pieces of a file. Locking facilities that support a byte-level granularity are well understood. Unfortunately, they are not powerful enough to be used by database systems that require nested hierarchical locks, but are complex enough to require a large and cumbersome implementation compared to the simpler whole-file locks. Because byte-range locks are mandated by the POSIX standard, the developers added them to BSD reluctantly. The semantics of byte-range locks come from the initial implementation of locks in System V, which included releasing all locks held by a process on a file every time a close system call was done on a descriptor referencing that file. The 4.2BSD whole-file locks are removed only on the last close. A problem with the POSIX semantics is that an application can lock a file, then call a library routine that opens, reads, and closes the locked file. Calling the library routine will have the unexpected effect of releasing the locks held by the application. Another problem is that a file must be open for writing to be allowed to get an exclusive lock. A process that does not have permission to open a file for writing cannot get an exclusive lock on that file. To avoid these problems, yet remain POSIX compliant, FreeBSD provides separate interfaces for byte-range locks and whole-file locks. The byte-range locks follow the POSIX semantics; the whole-file locks follow the traditional 4.2BSD semantics. The two types of locks can be used concurrently; they will serialize against each other properly.

Both whole-file locks and byte-range locks use the same implementation; the whole-file locks are implemented as a range lock over an entire file. The kernel handles the other differing semantics between the two implementations by having the byte-range locks be applied to processes, whereas the whole-file locks are applied to descriptors. Because descriptors are shared with child processes, the whole-file locks are inherited. Because the child process gets its own process structure, the byte-range locks are not inherited. The last-close versus every-close semantics are handled by a small bit of special-case code in the close routine that checks whether the lock is associated with a process or a descriptor: it releases locks on every close if the lock is associated with a process, and only when the reference count drops to zero if the lock is associated with a descriptor.
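The per-process ownership of byte-range locks can be demonstrated from user level: after a fork, the child is a distinct process, so it does not inherit the parent's byte-range lock, and a lock it requests on its own open of the file conflicts with the parent's. A sketch using Python's fcntl.lockf, which issues POSIX byte-range lock requests:

```python
import fcntl, os, tempfile

fd, path = tempfile.mkstemp()
os.close(fd)

f = open(path, "w")
fcntl.lockf(f, fcntl.LOCK_EX)          # exclusive POSIX byte-range lock

pid = os.fork()
if pid == 0:
    # Child: a separate process, so the parent's lock conflicts with it.
    g = open(path, "w")
    try:
        fcntl.lockf(g, fcntl.LOCK_EX | fcntl.LOCK_NB)
        os._exit(1)                    # unexpectedly acquired the lock
    except OSError:
        os._exit(0)                    # denied, as POSIX semantics require
else:
    _, status = os.waitpid(pid, 0)
    os.remove(path)
```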

Locking schemes can be classified according to the extent that they are enforced. A scheme in which locks are enforced for every process without choice is said to use mandatory locks, whereas a scheme in which locks are enforced for only those processes that request them is said to use advisory locks. Clearly, advisory locks are effective only when all programs accessing a file use the locking scheme. With mandatory locks, there must be some override policy implemented in the kernel. With advisory locks, the policy is left to the user programs. In the FreeBSD system, programs with superuser privilege are allowed to override any protection scheme. Because many of the programs that need to use locks must also run as the superuser, 4.2BSD implemented advisory locks rather than creating an additional protection scheme that was inconsistent with the UNIX philosophy or that could not be used by privileged programs. The use of advisory locks carried over to the POSIX specification of byte-range locks and is retained in FreeBSD.

The FreeBSD file-locking facilities allow cooperating programs to apply advisory shared or exclusive locks on ranges of bytes within a file. Only one process may have an exclusive lock on a byte range, whereas multiple shared locks may be present. A shared and an exclusive lock cannot be present on a byte range at the same time. If any lock is requested when another process holds an exclusive lock, or an exclusive lock is requested when another process holds any lock, the lock request will block until the lock can be obtained. Because shared and exclusive locks are only advisory, even if a process has obtained a lock on a file, another process may access the file if it ignores the locking mechanism.

So that there are no races between creating and locking a file, a lock can be requested as part of opening a file. Once a process has opened a file, it can manipulate locks without needing to close and reopen the file. This feature is useful, for example, when a process wishes to apply a shared lock, to read information, to determine whether an update is required, and then to apply an exclusive lock and update the file.

A request for a lock will cause a process to block if the lock cannot be obtained immediately. In certain instances, this blocking is unsatisfactory. For example, a process that wants only to check whether a lock is present would require a separate mechanism to find out this information. Consequently, a process can specify that its locking request should return with an error if a lock cannot be obtained immediately. Being able to request a lock conditionally is useful to daemon processes that wish to service a spooling area. If the first instance of the daemon locks the directory where spooling takes place, later daemon processes can easily check to see whether an active daemon exists. Since locks exist only while the locking processes exist, locks can never be left active after the processes exit or if the system crashes.

The implementation of locks is done on a per-filesystem basis. The implementation for the local filesystems is described in Section 7.5. A network-based filesystem has to coordinate locks with a central lock manager that is usually located on the server exporting the filesystem. Client lock requests must be sent to the lock manager. The lock manager arbitrates among lock requests from processes running on its server and from the various clients to which it is exporting the filesystem. The most complex operation for the lock manager is recovering lock state when a client or server is rebooted or becomes partitioned from the rest of the network. The FreeBSD network-based lock manager is described in Section 9.3.

Multiplexing I/O on Descriptors

A process sometimes wants to handle I/O on more than one descriptor. For example, consider a remote login program that wants to read data from the keyboard and to send them through a socket to a remote machine. This program also wants to read data from the socket connected to the remote end and to write them to the screen. If a process makes a read request when there are no data available, it is normally blocked in the kernel until the data become available. In our example, blocking is unacceptable. If the process reads from the keyboard and blocks, it will be unable to read data from the remote end that are destined for the screen. The user does not know what to type until more data arrive from the remote end, so the session deadlocks. Conversely, if the process reads from the remote end when there are no data for the screen, it will block and will be unable to read from the terminal. Again, deadlock would occur if the remote end were waiting for output before sending any data. There is an analogous set of problems to blocking on the writes to the screen or to the remote end. If a user has stopped output to his screen by typing the stop character, the write will block until the user types the start character. In the meantime, the process cannot read from the keyboard to find out that the user wants to flush the output.

FreeBSD provides three mechanisms that permit multiplexing I/O on descriptors: polling I/O, nonblocking I/O, and signal-driven I/O. Polling is done with the select or poll system call, described in the next subsection. Operations on non-blocking descriptors finish immediately, partially complete an input or output operation and return a partial count, or return an error that shows that the operation could not be completed at all. Descriptors that have signaling enabled cause the associated process or process group to be notified when the I/O state of the descriptor changes.

There are four possible alternatives that avoid the blocking problem:

  1. Set all the descriptors into nonblocking mode. The process can then try operations on each descriptor in turn to find out which descriptors are ready to do I/O. The problem with this busy-waiting approach is that the process must run continuously to discover whether there is any I/O to be done, wasting CPU cycles.

  2. Enable all descriptors of interest to signal when I/O can be done. The process can then wait for a signal to discover when it is possible to do I/O. The drawback to this approach is that signals are expensive to catch. Hence, signal-driven I/O is impractical for applications that do moderate to large amounts of I/O.

  3. Have the system provide a method for asking which descriptors are capable of doing I/O. If none of the requested descriptors are ready, the system can put the process to sleep until a descriptor becomes ready. This approach avoids the problem of deadlock because the process will be awakened whenever it is possible to do I/O and will be told which descriptor is ready. The drawback is that the process must do two system calls per operation: one to poll for the descriptor that is ready to do I/O and another to do the operation itself.

  4. Have the process notify the system of all the descriptors that it is interested in reading and then do a blocking read on that set of descriptors. When the read returns, the process is notified on which descriptor the read completed. The benefit of this approach is that the process does a single system call to specify the set of descriptors, then loops doing only reads [Accetta et al., 1986; Lemon, 2001].

The first approach is available in FreeBSD as nonblocking I/O. It typically is used for output descriptors because the operation typically will not block. Rather than doing a select or poll, which nearly always succeeds, followed immediately by a write, it is more efficient to try the write and revert to using select or poll only during periods when the write returns a blocking error. The second approach is available in FreeBSD as signal-driven I/O. It typically is used for rare events, such as the arrival of out-of-band data on a socket. For such rare events, the cost of handling an occasional signal is lower than that of checking constantly with select or poll to find out whether there are any pending data.
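The nonblocking behavior described above is visible with a pipe whose read end has O_NONBLOCK set: with no data available, the read returns the EAGAIN error (surfaced in Python as BlockingIOError) instead of putting the process to sleep.

```python
import fcntl, os

r, w = os.pipe()
fl = fcntl.fcntl(r, fcntl.F_GETFL)
fcntl.fcntl(r, fcntl.F_SETFL, fl | os.O_NONBLOCK)

try:
    os.read(r, 1)            # no data: would normally block
    outcome = "read"
except BlockingIOError:      # EAGAIN: the operation would block
    outcome = "would-block"
```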

The third approach is available in FreeBSD via the select or poll system call. Although less efficient than the fourth approach, it is a more general interface. In addition to handling reading from multiple descriptors, it handles writes to multiple descriptors, notification of exceptional conditions, and timeout when no I/O is possible.

The select and poll interfaces provide the same information. They differ only in their programming interface. The select interface was first developed in 4.2BSD with the introduction of socket-based interprocess communication. The poll interface was introduced in System V several years later with their competing STREAMS-based interprocess communication. Although STREAMS has fallen into disuse, the poll interface has proven popular enough to be retained. The FreeBSD kernel supports both interfaces.

The select system call is of the form

int error = select(
    int numfds,
    fd_set *readfds,
    fd_set *writefds,
    fd_set *exceptfds,
    struct timeval *timeout);

It takes three masks of descriptors to be monitored, corresponding to interest in reading, writing, and exceptional conditions. In addition, it takes a timeout value for returning from select if none of the requested descriptors becomes ready before a specified amount of time has elapsed. The select call returns the same three masks of descriptors after modifying them to show the descriptors that are able to do reading, to do writing, or that have an exceptional condition. If none of the descriptors has become ready in the timeout interval, select returns showing that no descriptors are ready for I/O. If a timeout value is given and a descriptor becomes ready before the specified timeout period, the time that select spends waiting for I/O to become ready is subtracted from the time given.
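The same interface is available through Python's select module, which wraps the system call; a minimal sketch using a pipe shows the timeout and readiness behavior:

```python
import os, select

r, w = os.pipe()

# Nothing to read yet: after the timeout, all three result lists are empty.
readable, writable, excs = select.select([r], [], [], 0.1)
before = list(readable)

# Once data arrive, the read descriptor is reported as ready.
os.write(w, b"x")
readable, _, _ = select.select([r], [], [], 0.1)
after = list(readable)
```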

The poll interface copies in an array of pollfd structures, one array entry for each descriptor of interest. The pollfd structure contains three elements:

  • The file descriptor to poll

  • A set of flags describing the information being sought

  • A set of flags set by the kernel showing the information that was found

The flags specify availability of normal or out-of-band data for reading and the availability of buffer space for normal or out-of-band data writing. The return flags can also specify that an error has occurred on the descriptor, that the descriptor has been disconnected, or that the descriptor is not open. These error conditions are raised by the select call by indicating that the descriptor with the error is ready to do I/O. When the application attempts to do the I/O, the error is returned by the read or write system call. Like the select call, the poll call takes a timeout value to specify the maximum time to wait. If none of the requested descriptors becomes ready before the specified amount of time has elapsed, the poll call returns. If a timeout value is given and a descriptor becomes ready before the specified timeout period, the time that poll spends waiting for I/O to become ready is subtracted from the time given.
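A corresponding sketch of the poll interface, again via Python's wrapper: each registered descriptor carries the flags being sought, and the kernel returns (descriptor, returned-flags) pairs for those that are ready.

```python
import os, select

r, w = os.pipe()
p = select.poll()
p.register(r, select.POLLIN)     # flags describing the information sought

events_before = p.poll(0)        # 0 ms timeout: returns immediately, empty

os.write(w, b"x")
events_after = p.poll(100)       # (fd, revents) pairs set by the kernel
```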

Implementation of Select

The implementation of select is divided into a generic top layer and many device- or socket-specific bottom pieces. At the top level, select or poll decodes the user's request and then calls the appropriate lower-level poll functions. The select and poll system calls have different top layers to determine the sets of descriptors to be polled, but they use the same device- and socket-specific bottom pieces. Only the select top layer will be described here. The poll top layer is implemented in a completely analogous way.

The select top level takes the following steps:

1. Copy and validate the descriptor masks for read, write, and exceptional conditions. Doing validation requires checking that each requested descriptor is currently open by the process.

2. Set the selecting flag for the thread.

3. For each descriptor in each mask, call the device's poll routine. If the descriptor is not able to do the requested I/O operation, the device poll routine is responsible for recording that the thread wants to do I/O. When I/O becomes possible for the descriptor, usually as a result of an interrupt from the underlying device, a notification must be issued for the selecting thread.

4. Because the selection process may take a long time, the kernel does not want to block out I/O during the time it takes to poll all the requested descriptors. Instead, the kernel arranges to detect the occurrence of I/O that may affect the status of the descriptors being polled. When such I/O occurs, the select-notification routine, selwakeup(), clears the selecting flag. If the top-level select code finds that the selecting flag for the thread has been cleared while it has been doing the polling, and it has not found any descriptors that are ready to do an operation, then the top level knows that the polling results are incomplete and must be repeated starting at step 2. The other condition that requires the polling to be repeated is a collision. Collisions arise when multiple threads attempt to poll the same descriptor at the same time. Because the poll routines have only enough space to record a single thread identifier, they cannot track multiple threads that need to be awakened when I/O is possible. In such rare instances, all threads that are selecting must be awakened.

5. If no descriptors are ready and the select specified a timeout, the kernel posts a timeout for the requested amount of time. The thread goes to sleep, giving the address of the kernel global variable selwait. Normally, a descriptor will become ready and the thread will be notified by selwakeup(). When the thread is awakened, it repeats the polling process and returns the available descriptors. If none of the descriptors become ready before the timer expires, the thread returns with a timed-out error and an empty list of available descriptors. If a timeout value is given and a descriptor becomes ready before the timeout period expires, the time that select spent waiting for I/O to become ready is subtracted from the time given.

Each of the low-level polling routines in the terminal drivers and the network protocols follows roughly the same set of steps. A piece of the poll routine for a terminal driver is shown in Figure 6.5 (on page 238). The steps involved in a device poll routine are as follows:

1. The socket or device poll entry is called with one or more of the flags POLLIN, POLLOUT, or POLLRDBAND (exceptional condition). The example in Figure 6.5 shows the POLLIN case; the other cases are similar.

2. The poll routine sets flags showing the operations that are possible. In Figure 6.5, it is possible to read a character if the number of unread characters is greater than zero. In addition, if the carrier has dropped, it is possible to get a read error. A return from select does not necessarily mean that there are data to read; rather, it means that a read will not block.

3. If the requested operation is not possible, the thread identifier is recorded with the socket or device for later notification. In Figure 6.5, the recording is done by the selrecord() routine. That routine first checks to see whether any thread is recorded for this socket or device. If none is recorded, then the current thread is recorded and the selinfo structure is linked onto the list of events associated with this thread. The second if statement checks for a collision. If the thread recorded with the structure is not ourselves, then a collision has occurred.

4. If multiple threads are selecting on the same socket or device, a collision is recorded for the socket or device, because the structure has only enough space for a single thread identifier. In Figure 6.5, a collision occurs when the second if statement in the selrecord() function is true. There is a tty structure containing a selinfo structure for each terminal line (or pseudo-terminal) on the machine. Normally, only one thread at a time is selecting to read from the terminal, so collisions are rare.

Selecting threads must be notified when I/O becomes possible. The steps involved in a status change awakening a thread are as follows:

1. The device or socket detects a change in status. Status changes normally occur because of an interrupt (e.g., a character coming in from a keyboard or a packet arriving from the network).

Figure 6.5. Select code from the pseudo-terminal driver to check for data to read.
struct selinfo {
    TAILQ_ENTRY(selinfo) si_thrlist; /* list hung off thread */
    struct  thread *si_thread;       /* thread that is waiting */
    short   si_flags;                /* SI_COLL - collision occurred */
};

struct tty *tp;

    if (events & POLLIN) {
        if (nread > 0 || (tp->t_state & TS_CARR_ON) == 0)
            revents |= POLLIN;
        else
            selrecord(curthread, &tp->t_rsel);
    }

void
selrecord(
    struct thread *selector,
    struct selinfo *sip)
{
    if (sip->si_thread == NULL) {
        sip->si_thread = selector;
        TAILQ_INSERT_TAIL(&selector->td_selq, sip, si_thrlist);
    } else if (sip->si_thread != selector) {
        sip->si_flags |= SI_COLL;
    }
}

2. Selwakeup() is called with a pointer to the selinfo structure used by selrecord() to record the thread identifier and with a flag showing whether a collision occurred.

3. If the thread is sleeping on selwait, it is made runnable (or is marked ready, if it is stopped). If the thread is sleeping on some event other than selwait, it is not made runnable. A spurious call to selwakeup() can occur when the thread returns from select to begin processing one descriptor and then another descriptor on which it had been selecting also becomes ready.

4. If the thread has its selecting flag set, the flag is cleared so that the kernel will know that its polling results are invalid and must be recomputed.

5. If a collision has occurred, all sleepers on selwait are awakened to rescan to see whether one of their descriptors became ready. Awakening all selecting threads is necessary because the selrecord() routine could not record all the threads that needed to be awakened. Hence, it has to wake up all threads that could possibly have been interested. Empirically, collisions occur infrequently. If they were a frequent occurrence, it would be worthwhile to store multiple thread identifiers in the selinfo structure.

Movement of Data Inside the Kernel

Within the kernel, I/O data are described by an array of vectors. Each I/O vector or iovec has a base address and a length. The I/O vectors are identical to the I/O vectors used by the readv and writev system calls.

The kernel maintains another structure, called a uio structure, that holds additional information about the I/O operation. A sample uio structure is shown in Figure 6.6; it contains the following:

  • A pointer to the iovec array

  • The number of elements in the iovec array

  • The file offset at which the operation should start

  • The sum of the lengths of the I/O vectors

  • A flag showing whether the source and destination are both within the kernel or whether the source and destination are split between the user and the kernel

  • A flag showing whether the data are being copied from the uio structure to the kernel (UIO_WRITE) or from the kernel to the uio structure (UIO_READ)

  • A pointer to the thread whose data area is described by the uio structure (the pointer is NULL if the uio structure describes an area within the kernel)

Figure 6.6. A uio structure.


All I/O within the kernel is described with iovec and uio structures. System calls such as read and write that are not passed an iovec create a uio to describe their arguments; this uio structure is passed to the lower levels of the kernel to specify the parameters of an I/O operation. Eventually, the uio structure reaches the part of the kernel responsible for moving the data to or from the process address space: the filesystem, the network, or a device driver. In general, these parts of the kernel do not interpret uio structures directly. Instead, they arrange a kernel buffer to hold the data and then use uiomove() to copy the data to or from the buffer or buffers described by the uio structure. The uiomove() routine is called with a pointer to a kernel data area, a data count, and a uio structure. As it moves data, it updates the counters and pointers of the iovec and uio structures by a corresponding amount. If the kernel buffer is not as large as the areas described by the uio structure, the uio structure will point to the part of the process address space just beyond the location completed most recently. Thus, while servicing a request, the kernel may call uiomove() multiple times, each time giving a pointer to a new kernel buffer for the next block of data.

Character device drivers that do not copy data from the process generally do not interpret the uio structure. Instead, there is one low-level kernel routine that arranges a direct transfer to or from the address space of the process. Here, a separate I/O operation is done for each iovec element, calling back to the driver with one piece at a time.


The Design and Implementation of the FreeBSD Operating System
ISBN: 0201702452
Year: 2003