Section 11.6. Data Transfer | The Design and Implementation of the FreeBSD Operating System

11.6. Data Transfer

Most of the work done by the socket layer lies in sending and receiving data. Note that the socket layer itself explicitly refrains from imposing any structure on data transmitted or received via sockets other than optional record boundaries.

Within the overall interprocess-communication model, any data interpretation or structuring is logically isolated in the implementation of the communication domain. An example of this logical isolation is the ability to pass file descriptors between processes using local-domain sockets.

Sending and receiving data can be done with any one of several system calls. The system calls vary according to the amount of information to be transmitted and received, and according to the state of the socket doing the operation. For example, the write system call may be used with a socket that is in a connected state, since the destination of the data is implicitly specified by the connection. The sendto or sendmsg system calls, however, allow the process to specify the destination for a message explicitly. Likewise, when data are received, the read system call allows a process to receive data on a connected socket without receiving the sender's address; the recvfrom and recvmsg system calls allow the process to retrieve the incoming message and the sender's address. The differences between these calls were summarized in Section 11.1. The recvmsg and sendmsg system calls allow scatter-gather I/O with multiple user-provided buffers. In addition, recvmsg reports additional information about a received message, such as whether it was expedited (out of band), whether it completes a record, or whether it was truncated because a buffer was too small.

The decision to provide many different system calls rather than only a single general interface is debatable. It would have been possible to implement a single system-call interface and to provide simplified interfaces to applications via user-level library routines. However, the single system call would have to be the most general call, which has somewhat higher overhead.

Internally, all transmission and reception requests are converted to a uniform format and are passed to the socket-layer sendit() and recvit() routines, respectively.
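The scatter-gather interface mentioned above can be illustrated from the application side. The following is a minimal user-level sketch, not kernel code; the helper names send_gathered() and recv_scattered() are hypothetical and exist only for this illustration. It assumes a connected socket, so no destination address is supplied.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>

/*
 * Gather two separate strings into a single message with sendmsg().
 * No destination address is given, so the socket must be connected.
 * Returns the number of bytes sent, or -1 on error.
 */
ssize_t
send_gathered(int fd, const char *first, const char *second)
{
    struct iovec iov[2];
    struct msghdr msg;

    assert(first != NULL && second != NULL);
    iov[0].iov_base = (void *)first;    /* sendmsg() only reads the buffers */
    iov[0].iov_len = strlen(first);
    iov[1].iov_base = (void *)second;
    iov[1].iov_len = strlen(second);

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = iov;
    msg.msg_iovlen = 2;
    return sendmsg(fd, &msg, 0);
}

/*
 * Scatter one incoming message across two caller-supplied buffers with
 * recvmsg(). The first buffer is filled before the second is used.
 * Returns the number of bytes received, or -1 on error.
 */
ssize_t
recv_scattered(int fd, void *a, size_t alen, void *b, size_t blen)
{
    struct iovec iov[2];
    struct msghdr msg;

    iov[0].iov_base = a;
    iov[0].iov_len = alen;
    iov[1].iov_base = b;
    iov[1].iov_len = blen;

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = iov;
    msg.msg_iovlen = 2;
    return recvmsg(fd, &msg, 0);
}
```

With a pair of connected local-domain datagram sockets, sending "hello, " and "world" as one message and receiving it into a 5-byte buffer followed by a larger one leaves "hello" in the first buffer and ", world" in the second.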

Transmitting Data

The sendit() routine is responsible for gathering all system-call parameters that the application has specified into the kernel's address space (except the actual data) and then for invoking the sosend() routine to do the transmission. The parameters may include the following components, illustrated in Figure 11.1:

An address to which data will be sent, if the socket has not been connected
Optional ancillary data (control data) associated with the message; ancillary data can include protocol-specific data associated with a message, protocol option information, or access rights
Normal data, specified as an array of buffers (see Section 6.4)
Optional flags, including out-of-band and end-of-record flags
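As an illustration of how these components appear to an application, the sketch below fills in a msghdr structure that carries one byte of normal data plus access rights (a file descriptor) as ancillary data over a connected local-domain socket. It is a user-level example under stated assumptions, not the kernel's sendit() code; the names send_fd(), recv_fd(), and union fd_control are hypothetical.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Control buffer large enough for one descriptor, suitably aligned. */
union fd_control {
    struct cmsghdr hdr;
    char buf[CMSG_SPACE(sizeof(int))];
};

/* Send descriptor fd over the connected socket sock; 0 on success. */
int
send_fd(int sock, int fd)
{
    char byte = 'F';                    /* one byte of normal data */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_control ctl;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    assert(fd >= 0);
    memset(&ctl, 0, sizeof(ctl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_name = NULL;                /* no address: socket is connected */
    msg.msg_iov = &iov;                 /* normal data */
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;          /* ancillary data */
    msg.msg_controllen = sizeof(ctl.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;       /* access rights */
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;   /* flags argument: none */
}

/* Receive a descriptor from sock; returns it, or -1 on failure. */
int
recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_control ctl;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd;

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;
    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The descriptor returned by recv_fd() is a new descriptor in the receiving process that refers to the same underlying object as the one that was sent.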

The sosend() routine handles most of the socket-level data-transmission options, including requests for transmission of out-of-band data and for transmission without network routing. This routine is also responsible for checking socket state: for example, seeing whether a required connection has been made, whether transmission is still possible on the socket, and whether a pending error should be reported rather than transmission attempted. In addition, sosend() is responsible for putting processes to sleep when their data transmissions exceed the buffering available in the socket's send buffer. The actual transmission of data is done by the supporting communication protocol; sosend() copies data from the user's address space into mbufs in the kernel's address space and then makes calls to the protocol to transfer the data.

Most of the work done by sosend() lies in checking the socket state, handling flow control, checking for termination conditions, and breaking up an application's transmission request into one or more protocol transmission requests. The request must be broken up only when the size of the user's request plus the amount of data queued in the socket's send data buffer exceeds the socket's high watermark. It is not permissible to break up a request if the protocol is atomic, because each request made by the socket layer to the protocol modules implicitly indicates a boundary in the data stream. Most datagram protocols are of this type. Honoring each socket's high watermark ensures that a protocol will always have space in the socket's send buffer to enqueue unacknowledged data. It also ensures that no process, or group of processes, can monopolize system resources.

For sockets that guarantee reliable data delivery, a protocol will normally maintain a copy of all transmitted data in the socket's send queue until receipt is acknowledged by the receiver. Protocols that provide no assurance of delivery normally accept data from sosend() and directly transmit the data to the destination without keeping a copy. But sosend() itself does not distinguish between reliable and unreliable delivery.

Sosend() always ensures that a socket's send buffer has enough space available to store the next section of data to be transmitted. If a socket has insufficient space in its send buffer to hold all the data to be transmitted, sosend() uses the following strategy. If the protocol is atomic, sosend() verifies that the message is no larger than the send buffer size; if the message is larger, it returns an EMSGSIZE error. If the available space in the send queue is less than the send low watermark, the transmission is deferred. If the process is not using nonblocking I/O, the process is put to sleep until more space is available in the send buffer. Otherwise, an error is returned. When space is available, a protocol transmit request is formulated according to the available space in the send buffer. Sosend() copies data from the user's address space into mbuf clusters whenever the data are larger than the minimum cluster size (specified by MINCLSIZE). If a transmission request for a nonatomic protocol is large, each protocol transmit request will normally contain a full mbuf cluster. Although additional data could be appended to the mbuf chain before delivery to the protocol, it is preferable to pass the data to lower levels immediately. This strategy allows better pipelining, because data reach the bottom of the protocol stack earlier and can begin physical transmission sooner. This procedure is repeated until insufficient space remains; it resumes each time additional space becomes available.
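The decision sequence just described can be summarized as a small user-space model. This is a simplified sketch of the strategy as the text presents it, not the kernel's sosend() code; the structure sendbuf_model, the enumeration send_action, and the function send_decide() are hypothetical names, and the model ignores ancillary data and mbuf-level details. It also assumes, following the rule that atomic requests may not be broken up, that an atomic request waits until the whole message fits.

```c
#include <assert.h>
#include <stddef.h>

enum send_action {
    SEND_NOW,           /* pass *chunk bytes to the protocol */
    SEND_WAIT,          /* sleep until space is freed in the send buffer */
    SEND_EWOULDBLOCK,   /* nonblocking I/O: return an error instead */
    SEND_EMSGSIZE       /* atomic message can never fit in the buffer */
};

/* Hypothetical stand-in for the relevant send-buffer fields. */
struct sendbuf_model {
    size_t hiwat;       /* send buffer size (high watermark) */
    size_t lowat;       /* send low watermark */
    size_t cc;          /* bytes currently queued */
};

/*
 * Decide what to do with a request of resid bytes.
 * On SEND_NOW, *chunk is the size of the next protocol transmit request.
 */
enum send_action
send_decide(const struct sendbuf_model *sb, size_t resid, int atomic,
    int nonblocking, size_t *chunk)
{
    size_t space;

    assert(sb != NULL && chunk != NULL);
    *chunk = 0;

    /* An atomic message larger than the buffer can never be sent. */
    if (atomic && resid > sb->hiwat)
        return SEND_EMSGSIZE;

    space = sb->cc < sb->hiwat ? sb->hiwat - sb->cc : 0;

    /*
     * Not everything fits. Defer if the request may not be split
     * (atomic) or if less than the low watermark is available.
     */
    if (space < resid && (atomic || space < sb->lowat))
        return nonblocking ? SEND_EWOULDBLOCK : SEND_WAIT;

    /* Formulate a request according to the available space. */
    *chunk = resid < space ? resid : space;
    return SEND_NOW;
}
```

For example, with an 8192-byte buffer, a 2048-byte low watermark, and 4096 bytes already queued, a 6000-byte request on a nonatomic socket yields a 4096-byte protocol request, whereas the same request on an atomic socket is deferred until the whole message fits.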

This strategy tends to preserve the application-specified message size and helps to avoid fragmentation at the network level. The latter benefit is important because system performance is significantly improved when data-transmission units are large (for example, the size of an mbuf cluster).

When the receiver or network is slower than the transmitter, the underlying connection-based transmission protocols usually apply some form of flow control to delay the sender's transmission. In this case, the amount of data that the receiver will allow the sender to transmit can decrease to the point that the sender's natural transmission size drops below its optimal value. To retard this effect, sosend() delays transmission rather than breaking up the data to be transmitted, in the hope that the receiver will reopen its flow control window and allow the sender to perform optimally. The effect of this scheme is fairly subtle and is also related to the networking subsystem's optimized handling of incoming data packets that are a multiple of the machine's page size (described in Section 12.8).

The sosend() routine, in manipulating a socket's send data buffer, takes care to ensure that access to the buffer is synchronized among multiple sending processes. It does so by bracketing accesses to the data structure with calls to sblock() and sbunlock().

Receiving Data

The soreceive() routine receives data queued at a socket. As the counterpart to sosend(), soreceive() appears at the same level in the internal software structure and does similar tasks. Three types of data may be queued for reception at a socket: in-band data, out-of-band data, and ancillary data, such as access rights. In-band data may also be tagged with the sender's address. Handling of out-of-band data varies by protocol. They may be placed at the beginning of the receive buffer or at the end of the buffer to appear in order with other data, or they may be managed in the protocol layer separate from the socket's receive buffer. In the first two cases, they are returned by normal receive operations. In the final case, they are retrieved through a special interface when requested by the user. These options allow varying styles of urgent data transmission.

Soreceive() checks the socket's state, including the received data buffer, for incoming data, errors, or state transitions, and processes queued data according to their type and the actions specified by the caller. A system-call request may specify that only out-of-band data should be retrieved (MSG_OOB) or that data should be returned but not removed from the data buffer (by specifying the MSG_PEEK flag). Receive calls normally return as soon as the low watermark is reached. Because the default is one octet, the call returns when any data are present. The MSG_WAITALL flag specifies that the call should block until it can return all the requested data, if possible. Alternatively, the MSG_DONTWAIT flag causes the call to act as though the socket was in nonblocking mode, returning EAGAIN rather than blocking.
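The effect of the MSG_PEEK and MSG_DONTWAIT flags can be observed from user level. The following sketch is an illustration only; the function name peek_then_drain() is hypothetical, and it assumes that wr and rd are the two ends of a connected local-domain stream socket pair, on which data are available to the reader as soon as the send returns.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Exercise MSG_PEEK and MSG_DONTWAIT on a connected pair of sockets.
 * Returns 0 if every step behaves as described, or the number of the
 * first step that does not.
 */
int
peek_then_drain(int wr, int rd)
{
    char peeked[8], taken[8], extra[8];

    assert(wr >= 0 && rd >= 0);
    if (send(wr, "data", 4, 0) != 4)
        return 1;

    /* MSG_PEEK returns the data but leaves them in the receive buffer. */
    if (recv(rd, peeked, sizeof(peeked), MSG_PEEK) != 4)
        return 2;

    /* A normal receive sees the same bytes and removes them. */
    if (recv(rd, taken, sizeof(taken), 0) != 4 ||
        memcmp(peeked, taken, 4) != 0)
        return 3;

    /* The buffer is now empty: MSG_DONTWAIT fails rather than sleeping. */
    if (recv(rd, extra, sizeof(extra), MSG_DONTWAIT) != -1 ||
        (errno != EAGAIN && errno != EWOULDBLOCK))
        return 4;

    return 0;
}
```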

Data present in the receive data buffer are organized in one of several ways, depending on whether message boundaries are preserved. There are three common cases for stream, datagram, and sequenced-packet sockets. In the general case, the receive data buffer is organized as a list of messages (see Figure 11.12). Each message can include a sender's address (for datagram protocols), ancillary data, and normal data. Depending on the protocol, it is also possible for expedited or out-of-band data to be placed into the normal receive buffer. Each mbuf chain on a list represents a single message or, for the final chain, a possibly incomplete record. Protocols that supply the sender's address with each message place a single mbuf containing the address at the front of the message. Immediately following any address is an optional mbuf containing any ancillary data. Regular data mbufs follow the ancillary data. Names and ancillary data are distinguished by the type field in an mbuf; addresses are marked as MT_SONAME, whereas ancillary data are tagged as MT_CONTROL. Each message other than the final one is considered to be terminated. The final message is terminated implicitly when an atomic protocol is used, such as most datagram protocols. Sequenced packet protocols could treat each message as an atomic record, or they could support records that could be arbitrarily long (as is done in OSI). In the latter case, the final record in the buffer might or might not be complete, and a flag on the final mbuf, M_EOR, marks the termination of a record. Record boundaries (if any) are generally ignored by a stream protocol. However, transition from out-of-band data to normal data in the buffer, or presence of ancillary data, causes logical boundaries. A single receive operation never returns data that cross a logical boundary. Note that the storage scheme used by sockets allows them to compact data of the same type into the minimal number of mbufs required to hold those data.

Figure 11.12. Data queueing for datagram socket.
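The layout described above (a list of messages, each an mbuf chain with an optional address, optional ancillary data, and then normal data) can be modeled in user space. The sketch below is an illustration only and does not use the kernel's mbuf definitions; struct model_mbuf, the MODEL_* type values, and the helper functions are hypothetical stand-ins for the real structure, the MT_SONAME, MT_CONTROL, and data mbuf types, and the M_EOR flag.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the mbuf type field values named in the text. */
enum model_type { MODEL_SONAME, MODEL_CONTROL, MODEL_DATA };

struct model_mbuf {
    enum model_type type;
    int eor;                        /* stands in for the M_EOR flag */
    size_t len;                     /* bytes held by this mbuf */
    struct model_mbuf *next;        /* next mbuf in the same message */
    struct model_mbuf *nextpkt;     /* next message; first mbuf only */
};

/* Check the ordering rule: [address] [ancillary data] data... */
int
message_wellformed(const struct model_mbuf *m)
{
    if (m != NULL && m->type == MODEL_SONAME)
        m = m->next;
    if (m != NULL && m->type == MODEL_CONTROL)
        m = m->next;
    for (; m != NULL; m = m->next)
        if (m->type != MODEL_DATA)
            return 0;
    return 1;
}

/* Total number of normal data bytes in one message. */
size_t
message_datalen(const struct model_mbuf *m)
{
    size_t total = 0;

    for (; m != NULL; m = m->next)
        if (m->type == MODEL_DATA)
            total += m->len;
    return total;
}

/* A record is complete when its final mbuf carries the end-of-record mark. */
int
message_ends_record(const struct model_mbuf *m)
{
    assert(m != NULL);
    while (m->next != NULL)
        m = m->next;
    return m->eor;
}

/* Number of messages queued in the modeled receive buffer. */
size_t
count_messages(const struct model_mbuf *head)
{
    size_t n = 0;

    for (; head != NULL; head = head->nextpkt)
        n++;
    return n;
}
```

A datagram message in this model is a chain of one address mbuf, an optional ancillary-data mbuf, and one or more data mbufs; the first mbuf of each chain links to the next message.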

On entry to soreceive(), a check is made to see whether out-of-band data are being requested. If they are, the protocol layer is queried to see whether any such data are available; if the data are available, they are returned to the caller. Since regular data cannot be retrieved simultaneously with out-of-band data, soreceive() then returns. Otherwise, data from the normal queue have been requested. The soreceive() function first checks whether the socket is in confirming state, with the peer awaiting confirmation of a connection request. If it is, no data can arrive until the connection is confirmed, and the protocol layer is notified that the connection should be completed. Soreceive() then checks the receive-data-buffer character count to see whether data are available. If they are, the call returns with at least the data currently available. If no data are present, soreceive() consults the socket's state to find out whether data might be forthcoming. Data may no longer be received because the socket is disconnected (and a connection is required to receive data) or because the reception of data has been terminated with a shutdown by the socket's peer. In addition, if an error from a previous operation was detected asynchronously, the error needs to be returned to the user; soreceive() checks the so_error field after checking for data. If no data or error exists, data might still arrive, and if the socket is not marked for nonblocking I/O, soreceive() puts the process to sleep to await the arrival of new data.
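One consequence of this ordering is visible to applications: data already queued are returned first, and only when the buffer is empty does a shutdown by the peer cause the receive call to return zero instead of sleeping. The following user-level sketch illustrates that behavior; the function name read_until_eof() is hypothetical, and a connected stream socket is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Receive from fd until the peer's shutdown is observed (recv()
 * returns 0) or the buffer is full. Returns the number of bytes
 * stored in buf, or -1 on error.
 */
ssize_t
read_until_eof(int fd, char *buf, size_t cap)
{
    size_t total = 0;

    assert(buf != NULL);
    while (total < cap) {
        ssize_t n = recv(fd, buf + total, cap - total, 0);

        if (n < 0)
            return -1;      /* pending error reported by the socket */
        if (n == 0)
            break;          /* no data, and none can arrive */
        total += (size_t)n;
    }
    return (ssize_t)total;
}
```

If the peer sends three bytes and then calls shutdown() with SHUT_WR, the function returns 3 without blocking.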

When data arrive for a socket, the supporting protocol notifies the socket layer by calling sorwakeup(). Soreceive() can then process the contents of the receive buffer, observing the data-structuring rules described previously. Soreceive() first removes any address that must be present, then optional ancillary data, and finally normal data. If the application has provided a buffer for the receipt of ancillary data, they are passed to the application in that buffer; otherwise, they are discarded. The removal of data is slightly complicated by the interaction between in-band and out-of-band data managed by the protocol. The location of the next out-of-band datum can be marked in the in-band data stream and used as a record boundary during in-band data processing. That is, when an indication of out-of-band data is received by a protocol that holds out-of-band data separately from the normal buffer, the corresponding point in the in-band data stream is marked. Then, when a request is made to receive in-band data, only data up to the mark will be returned. This mark allows applications to synchronize the in-band and out-of-band data streams so that, for example, received data can be flushed up to the point at which out-of-band data are received. Each socket has a field, so_oobmark, that contains the character offset from the front of the receive data buffer to the point in the data stream at which the last out-of-band message was received. When in-band data are removed from the receive buffer, the offset is updated so that data past the mark will not be mixed with data preceding the mark. The SS_RCVATMARK bit in a socket's state field is set when so_oobmark reaches zero to show that the out-of-band data mark is at the beginning of the socket receive buffer. An application can test the state of this bit with the SIOCATMARK ioctl call to find out whether all in-band data have been read up to the point of the mark.
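The bookkeeping around the out-of-band mark can be sketched as a small user-space model. This is an illustration of the rules stated above, not the kernel's soreceive() code; struct rcv_model and model_receive() are hypothetical names, with the oobmark field standing in for so_oobmark and the atmark field standing in for the SS_RCVATMARK bit. The model assumes that an oobmark of zero means that no mark is pending and that the at-mark indication is cleared by the next read.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the relevant socket fields. */
struct rcv_model {
    size_t cc;          /* bytes of in-band data in the receive buffer */
    size_t oobmark;     /* offset of the mark from the buffer front */
    int atmark;         /* set when the mark is at the buffer front */
};

/*
 * Remove up to want bytes of in-band data, never crossing the mark.
 * Returns the number of bytes a single receive operation would return.
 */
size_t
model_receive(struct rcv_model *so, size_t want)
{
    size_t n;

    assert(so != NULL);
    n = want < so->cc ? want : so->cc;
    so->atmark = 0;                 /* this read moves past any old mark */

    if (so->oobmark != 0) {
        if (n > so->oobmark)
            n = so->oobmark;        /* return only data up to the mark */
        so->oobmark -= n;           /* offset shrinks as data are removed */
        if (so->oobmark == 0)
            so->atmark = 1;         /* mark is now at the buffer front */
    }
    so->cc -= n;
    return n;
}
```

With 100 bytes queued and the mark at offset 40, a request for 64 bytes returns only 40 and sets the at-mark indication; the next request returns the remaining 60 bytes.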

Once data have been removed from a socket's receive buffer, soreceive() updates the state of the socket and notifies the protocol layer that data have been received by the user. The protocol layer can use this information to release internal resources, to trigger end-to-end acknowledgment of data reception, to update flow-control information, or to start a new data transfer. Finally, if any access rights were received as ancillary data, soreceive() passes them to a communication-domain-specific routine to convert them from their internal representation to the external representation.

The soreceive() function returns a set of flags that are supplied to the caller of the recvmsg system call via the msg_flags field of the msghdr structure (see Figure 11.1). The possible flags include MSG_EOR to specify that the received data complete a record for a nonatomic sequenced packet protocol, MSG_OOB to specify that expedited (out-of-band) data were received from the normal socket receive buffer, MSG_TRUNC to specify that an atomic record was truncated because the supplied buffer was too small, and MSG_CTRUNC to specify that ancillary data were truncated because the control buffer was too small.
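An application sees these flags in msg_flags after recvmsg returns. The sketch below is a minimal user-level illustration; the helper name recv_with_flags() is hypothetical. It receives one message into a single buffer and hands the returned flags back to the caller, so that a truncated datagram can be detected by testing for MSG_TRUNC.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>

/*
 * Receive one message into buf and report the flags returned in
 * msg_flags. Returns the number of bytes stored, or -1 on error.
 */
ssize_t
recv_with_flags(int fd, void *buf, size_t len, int *flags)
{
    struct iovec iov;
    struct msghdr msg;
    ssize_t n;

    assert(flags != NULL);
    iov.iov_base = buf;
    iov.iov_len = len;

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;

    n = recvmsg(fd, &msg, 0);
    if (n >= 0)
        *flags = msg.msg_flags;     /* MSG_EOR, MSG_OOB, MSG_TRUNC, ... */
    return n;
}
```

Receiving a 10-byte datagram into a 4-byte buffer stores 4 bytes and sets MSG_TRUNC in the returned flags; the remainder of the datagram is discarded.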