Section 11.4. Data Structures

Sockets are the basic objects used by processes communicating over a network. A socket's type defines the basic set of communication semantics, whereas the communication domain defines auxiliary properties important to the use of the socket and may refine the set of available communication semantics. Table 11.3 shows the four types of sockets currently supported by the system. To create a new socket, applications must specify its type and the communication domain. The request may also indicate a specific network protocol to be used by the socket. If no protocol is specified, the system selects an appropriate protocol from the set of protocols supported by the communication domain. If the communication domain is unable to support the type of socket requested (i.e., no suitable protocol is available), the request will fail.

Table 11.3. Socket types supported by the system.
Name	Type	Properties
SOCK_STREAM	stream	reliable, sequenced data transfer; may support out-of-band data
SOCK_DGRAM	datagram	unreliable, unsequenced data transfer, with message boundaries preserved
SOCK_SEQPACKET	sequenced packet	reliable, sequenced data transfer, with message boundaries preserved
SOCK_RAW	raw	direct access to the underlying communication protocols
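As a brief illustration of the creation request described above, the following sketch creates a socket from a communication domain and a type, passing 0 as the protocol so that the system selects one. The helper name probe_socket is hypothetical and exists only for this example; socket, getsockopt, and close are the standard interfaces.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Create a socket of the given type in the given communication domain,
 * letting the system choose the protocol (protocol argument of 0).
 * Returns the socket's type as reported by the system, or -1 if the
 * request fails (for example, when the domain has no suitable protocol).
 * probe_socket is a hypothetical helper, not a system interface.
 */
int
probe_socket(int domain, int type)
{
	int fd, actual;
	socklen_t len = sizeof(actual);

	fd = socket(domain, type, 0);
	if (fd == -1)
		return (-1);
	if (getsockopt(fd, SOL_SOCKET, SO_TYPE, &actual, &len) == -1)
		actual = -1;
	close(fd);
	return (actual);
}
```

Calling probe_socket(AF_LOCAL, SOCK_STREAM) would be expected to return SOCK_STREAM on a system that configures the local domain; a combination that the domain cannot support returns -1, with errno describing the failure.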

Sockets are described by a socket data structure that is dynamically created at the time of a socket system call. Communication domains are described by a domain data structure that is statically defined within the system based on the system's configuration (see Section 14.5). Communication protocols within a domain are described by a protosw structure that is also statically defined within the system for each protocol implementation configured. Having these structures defined statically means that protocol modules cannot be loaded or unloaded at run time.

When a request is made to create a socket, the system uses the value of the communication domain to search the list of configured domains linearly. If the domain is found, the domain's table of supported protocols is consulted for a protocol appropriate for the type of socket being created or for a specific protocol requested. (A wildcard entry may exist for a raw socket.) Should multiple protocol entries satisfy the request, the first is selected. We shall begin discussion of the data structures by examining the domain structure. The protosw structure is discussed in Section 12.1.
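The lookup just described can be sketched in user-level C. This is a simplified model, not the kernel's code: the ex_ structures, their fields, and the EX_ constants are hypothetical stand-ins for the domain and protosw structures, reduced to the fields that the search needs.

```c
#include <stddef.h>

/* Illustrative socket types; the values are assumptions for this sketch. */
enum { EX_STREAM = 1, EX_DGRAM = 2, EX_RAW = 3 };

/* Hypothetical, reduced stand-in for a protocol-switch entry. */
struct ex_protosw {
	int pr_type;		/* socket type that this protocol supports */
	int pr_protocol;	/* protocol number; 0 marks a wildcard entry */
};

/* Hypothetical, reduced stand-in for a communication domain. */
struct ex_domain {
	int dom_family;				/* address family */
	const struct ex_protosw *dom_protosw;	/* first table entry */
	const struct ex_protosw *dom_protosw_end; /* one past the last entry */
	const struct ex_domain *dom_next;	/* next configured domain */
};

/* Search the list of configured domains linearly. */
const struct ex_domain *
ex_find_domain(const struct ex_domain *list, int family)
{
	for (; list != NULL; list = list->dom_next)
		if (list->dom_family == family)
			return (list);
	return (NULL);
}

/*
 * Find a protocol for a socket of the given type.  A protocol of 0
 * means "any protocol of this type".  The first satisfying entry is
 * selected; a raw socket may fall back to a wildcard entry.
 */
const struct ex_protosw *
ex_find_proto(const struct ex_domain *list, int family, int type,
    int protocol)
{
	const struct ex_domain *dp = ex_find_domain(list, family);
	const struct ex_protosw *pr, *wild = NULL;

	if (dp == NULL)
		return (NULL);
	for (pr = dp->dom_protosw; pr < dp->dom_protosw_end; pr++) {
		if (pr->pr_type != type)
			continue;
		if (protocol == 0 || pr->pr_protocol == protocol)
			return (pr);	/* first match wins */
		if (type == EX_RAW && pr->pr_protocol == 0 && wild == NULL)
			wild = pr;	/* remember the raw wildcard */
	}
	return (wild);	/* NULL unless a raw wildcard was found */
}
```

In this model, a request that names no protocol returns the first table entry of the right type, a request for a specific protocol must match exactly, and only raw sockets can be satisfied by a wildcard entry.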

Communication Domains

The domain structure is shown in Figure 11.6. The dom_name field is the ASCII name of the communication domain. (In the original design, communication domains were to be specified with ASCII strings; they are now specified with manifest constants.) The dom_family field identifies the address family used by the domain; some possible values are shown in Table 11.4. Address families refer to the addressing structure of a domain. An address family generally has an associated protocol family. Protocol families refer to the suite of communication protocols of a domain used to support the communication semantics of a socket. The dom_protosw field points to the table of functions that implement the protocols supported by the communication domain, and the dom_NPROTOSW pointer marks the end of the table. The remaining entries contain pointers to domain-specific routines used in the management and transfer of access rights and fields relating to routing initialization for the domain.

Table 11.4. Address families.
Name	Description
AF_LOCAL	(AF_UNIX) local communication
AF_INET	Internet (TCP/IP)
AF_INET6	Internet Version 6 (TCP/IPv6)
AF_NS	Xerox Network System (XNS) architecture
AF_ISO	OSI network protocols
AF_CCITT	CCITT protocols, e.g., X.25
AF_SNA	IBM System Network Architecture (SNA)
AF_DLI	direct link interface
AF_LAT	local-area-network terminal interface
AF_APPLETALK	AppleTalk network
AF_ROUTE	communication with kernel routing layer
AF_LINK	raw link-layer access
AF_IPX	Novell Internet protocol

Figure 11.6. Communication-domain data structure.

Sockets

The socket data structure is shown in Figure 11.7. Storage for the socket structure is allocated by the UMA zone allocator (described in Section 5.3). Sockets contain information about their type, the supporting protocol in use, and their state (Table 11.5). Data being transmitted or received are queued at the socket as a list of mbuf chains. Various fields are present for managing queues of sockets created during connection establishment. Each socket structure also holds a process-group identifier. The process-group identifier is used in delivering the SIGURG and SIGIO signals. SIGURG is sent when an urgent condition exists for a socket, and SIGIO is used by the asynchronous I/O facility (see Section 6.4). The socket contains an error field, which is needed for storing asynchronous errors to be reported to the owner of the socket.

Table 11.5. Socket states.
State	Description
SS_NOFDREF	no file-table reference
SS_ISCONNECTED	connected to a peer
SS_ISCONNECTING	in process of connecting to peer
SS_ISDISCONNECTING	in process of disconnecting from peer
SS_CANTSENDMORE	cannot send more data to peer
SS_CANTRCVMORE	cannot receive more data from peer
SS_RCVATMARK	at out-of-band mark on input
SS_ISCONFIRMING	peer awaiting connection confirmation
SS_NBIO	nonblocking I/O
SS_ASYNC	asynchronous I/O notification
SS_INCOMP	connection is incomplete and not yet accepted
SS_COMP	connection is complete but not yet accepted
SS_ISDISCONNECTED	socket is disconnected from peer
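The states in Table 11.5 can be pictured as bit flags combined in a single state field, so that several may hold at once. The sketch below models a few transitions under that assumption. The EX_SS_ values and the ex_ names are illustrative only and should not be taken as the kernel's actual definitions.

```c
/* Illustrative flag values; the kernel's bit assignments may differ. */
#define EX_SS_ISCONNECTED	0x0002
#define EX_SS_ISCONNECTING	0x0004
#define EX_SS_ISDISCONNECTING	0x0008
#define EX_SS_CANTSENDMORE	0x0010
#define EX_SS_ISCONFIRMING	0x0400

/* Hypothetical socket reduced to its state field. */
struct ex_socket {
	int so_state;
};

/* A connection attempt has started. */
void
ex_soisconnecting(struct ex_socket *so)
{
	so->so_state &= ~(EX_SS_ISCONNECTED | EX_SS_ISDISCONNECTING);
	so->so_state |= EX_SS_ISCONNECTING;
}

/* The connection is established: clear the transitional states. */
void
ex_soisconnected(struct ex_socket *so)
{
	so->so_state &= ~(EX_SS_ISCONNECTING | EX_SS_ISDISCONNECTING |
	    EX_SS_ISCONFIRMING);
	so->so_state |= EX_SS_ISCONNECTED;
}

/* No more data can be sent, although the connection may remain. */
void
ex_socantsendmore(struct ex_socket *so)
{
	so->so_state |= EX_SS_CANTSENDMORE;
}

/* Test a single state bit. */
int
ex_isconnected(const struct ex_socket *so)
{
	return ((so->so_state & EX_SS_ISCONNECTED) != 0);
}
```

Because each state is a separate bit, a socket in this model can be both connected and unable to send more data, which a single enumerated state could not express.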

Figure 11.7. Socket data structure.

Sockets are located through a process's file descriptor via the file table. When a socket is created, the f_data field of the file structure is set to point at the socket structure, and the f_ops field is set to point to the set of routines defining socket-specific file operations. In this sense, the socket structure is a direct parallel of the vnode structure used by the filesystems.
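The path from a descriptor to a socket can be sketched as follows. This is a reduced user-level model: the ex_ structures are hypothetical, the descriptor table is a fixed-size array, and the f_type tag used to distinguish sockets from other file types is an assumption added for the example; only f_data and f_ops are taken from the description above.

```c
#include <stddef.h>

#define EX_NFILES 8	/* illustrative size of the descriptor table */

enum { EX_DTYPE_VNODE = 1, EX_DTYPE_SOCKET = 2 };

struct ex_socket {
	int so_type;
};

struct ex_file;

/* Per-type file operations; a single entry is enough for the sketch. */
struct ex_fileops {
	int (*fo_close)(struct ex_file *);
};

struct ex_file {
	int f_type;			/* assumed tag: kind of object */
	void *f_data;			/* socket or vnode */
	const struct ex_fileops *f_ops;	/* type-specific operations */
};

struct ex_filedesc {
	struct ex_file *fd_ofiles[EX_NFILES];
};

/*
 * Map a descriptor to its socket, or return NULL if the descriptor
 * is out of range, unused, or does not refer to a socket.
 */
struct ex_socket *
ex_getsock(struct ex_filedesc *fdp, int fd)
{
	struct ex_file *fp;

	if (fd < 0 || fd >= EX_NFILES)
		return (NULL);
	fp = fdp->fd_ofiles[fd];
	if (fp == NULL || fp->f_type != EX_DTYPE_SOCKET)
		return (NULL);
	return (fp->f_data);
}
```

The same file structure serves both sockets and vnodes in this model; only the object that f_data points to and the operations reached through f_ops differ.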

The socket structure acts as a queueing point for data being transmitted and received. As data enter the system as a result of system calls, such as write or send, the socket layer passes the data to the networking subsystem as a chain of mbufs for immediate transmission. If the supporting protocol module decides to postpone transmission of the data, or if a copy of the data is to be maintained until an acknowledgment is received, the data are queued in the socket's send queue. When the network has consumed the data, it discards them from the outgoing queue. On reception, the network passes data up to the socket layer, also in mbuf chains, where they are then queued until the application makes a system call to request them. The socket layer can also make a callback to an internal kernel client of the network when data arrive, allowing the data to be processed without a context switch. Callbacks are used by the NFS server (see Chapter 9).

To avoid resource exhaustion, sockets impose upper bounds on the number of bytes of data that can be queued in a socket data buffer and also on the amount of storage space that can be used for data. This high watermark is initially set by the protocol, although an application can change the value up to a system maximum, normally 256 Kbyte. The network protocols can examine the high watermark and use the value in flow-control policies. A low watermark also is present in each socket data buffer. The low watermark allows applications to control data flow by specifying a minimum number of bytes required to satisfy a reception request, with a default of 1 byte and a maximum of the high watermark. For output, the low watermark sets the minimum amount of space available before transmission can be attempted; the default is the size of an mbuf cluster. These values also control the operation of the select system call when it is used to test for ability to read or write the socket.
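The interaction of the two upper bounds and the low watermark can be modeled in a few lines. The sketch below is a simplification under stated assumptions: the ex_sockbuf structure and its field names are hypothetical, and the readable and writable tests ignore errors, end-of-file, and connection state, all of which a real implementation would also consider.

```c
/* Hypothetical socket data buffer reduced to its accounting fields. */
struct ex_sockbuf {
	unsigned int sb_cc;	/* bytes of data currently queued */
	unsigned int sb_hiwat;	/* high watermark: limit on queued bytes */
	unsigned int sb_mbcnt;	/* storage currently used to hold the data */
	unsigned int sb_mbmax;	/* limit on storage used */
	unsigned int sb_lowat;	/* low watermark */
};

/*
 * Space remaining in the buffer: the smaller of the room left under
 * the byte limit and the room left under the storage limit.
 */
long
ex_sbspace(const struct ex_sockbuf *sb)
{
	long bydata = (long)sb->sb_hiwat - (long)sb->sb_cc;
	long bystore = (long)sb->sb_mbmax - (long)sb->sb_mbcnt;

	return (bydata < bystore ? bydata : bystore);
}

/* A receive buffer is readable once it holds the low watermark. */
int
ex_readable(const struct ex_sockbuf *rcv)
{
	return (rcv->sb_cc >= rcv->sb_lowat);
}

/* A send buffer is writable once free space reaches the low watermark. */
int
ex_writable(const struct ex_sockbuf *snd)
{
	return (ex_sbspace(snd) >= (long)snd->sb_lowat);
}
```

With a receive low watermark of 1, a single queued byte makes the buffer readable. With a send low watermark of 2048 and a high watermark of 8192, a send buffer holding 7000 bytes is not writable, because only 1192 bytes of space remain.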

When connection indications are received at the communication-protocol level, the connection may require further processing to complete. Depending on the protocol, that processing may be done before the connection is returned to the listening process, or the listening process may be allowed to confirm or reject the connection request. Sockets used to accept incoming connection requests maintain two queues of sockets associated with connection requests. The list of sockets headed by the so_incomp field represents a queue of connections that must be completed at the protocol level before being returned. The so_comp field heads a list of sockets that are ready to be returned to the listening process. Like the data queues, the queues of connections also have an application-controllable limit. The limit applies to both queues. Because the limit may include sockets that cannot yet be accepted, the system enforces a limit 50 percent larger than the nominal limit.
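The two connection queues and their shared limit can be sketched as a pair of counters. This model follows the description above rather than the kernel source: the ex_listen structure, its field names, and the exact form of the comparison are assumptions made for illustration.

```c
/* Hypothetical listening socket reduced to its queue accounting. */
struct ex_listen {
	int so_incqlen;	/* connections still being completed */
	int so_qlen;	/* completed connections awaiting accept */
	int so_qlimit;	/* nominal limit set by the application */
};

/* The enforced limit is 50 percent larger than the nominal limit. */
int
ex_effective_limit(const struct ex_listen *head)
{
	return (3 * head->so_qlimit / 2);
}

/* Queue a new incomplete connection; return 0 if the queues are full. */
int
ex_newconn(struct ex_listen *head)
{
	if (head->so_incqlen + head->so_qlen >= ex_effective_limit(head))
		return (0);
	head->so_incqlen++;
	return (1);
}

/* Protocol-level completion: move one connection to the ready queue. */
int
ex_complete(struct ex_listen *head)
{
	if (head->so_incqlen == 0)
		return (0);
	head->so_incqlen--;
	head->so_qlen++;
	return (1);
}

/* The listening process accepts one completed connection. */
int
ex_accept(struct ex_listen *head)
{
	if (head->so_qlen == 0)
		return (0);
	head->so_qlen--;
	return (1);
}
```

With a nominal limit of 4, this model admits 6 queued connections in total. Completing a connection moves it between queues without freeing room; only an accept makes space for another request.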

Although a connection may be established by the network protocol, the application may choose not to accept the established connection or may close down the connection immediately after discovering the identity of the client. It is also possible for a network protocol to delay completion of a connection until after the application has obtained control with the accept system call. The application might then accept or reject the connection explicitly with a protocol-specific mechanism. Otherwise, if the application does a data transfer, the connection is confirmed; if the application closes the socket immediately, the connection is rejected.

Socket Addresses

Sockets may be labeled so that peers can connect to them. The socket layer treats an address as an opaque object. Applications supply and receive addresses as tagged, variable-length byte strings. Addresses are placed in mbufs within the socket layer. A structure called a sockaddr, shown in Figure 11.8, is used as a template for referring to the identifying tag and length of each address. Most protocol layers support a single address type as identified by the tag, known as the address family.

Figure 11.8. Socket-address template structure.

It is common for addresses passed in by an application to reside in mbufs only long enough for the socket layer to pass them to the supporting protocol for transfer into a fixed-sized address structure (for example, when a protocol records an address in a protocol control block). The sockaddr structure is the common means by which the socket layer and network-support facilities exchange addresses. The size of the generic data array was chosen to be large enough to hold many types of addresses directly, although generic code cannot depend on having sufficient space in a sockaddr structure for an arbitrary address. The local communication domain (formerly known as the UNIX domain), for example, stores filesystem pathnames in mbufs and allows socket names as large as 104 bytes, as shown in Figure 11.9. Both IPv4 and IPv6 use a fixed-size structure that combines an Internet address and a port number. The difference is in the size of the address (4 bytes for IPv4 and 16 bytes for IPv6) and in the fact that IPv6 address structures carry other information (the scope and flow info). Both Internet protocols reserve space for addresses in a protocol-specific control-block data structure and free up mbufs that contain addresses after copying the addresses.

Figure 11.9. Local-domain, IPv4 and IPv6 address structures.
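The relationship between the generic template and the domain-specific address structures can be sketched with simplified declarations. The ex_ names and EX_AF_ values below are hypothetical, chosen to avoid clashing with system headers, and the layouts are an approximation based on the description above (a length, a family tag, and then domain-specific data); they are not copied from the figures.

```c
#include <stdint.h>

/* Illustrative family tags; real values are system-defined. */
enum { EX_AF_LOCAL = 1, EX_AF_INET = 2, EX_AF_INET6 = 28 };

/* Generic template: length, family tag, then opaque address bytes. */
struct ex_sockaddr {
	uint8_t sa_len;
	uint8_t sa_family;
	char sa_data[14];
};

/* Local domain: a filesystem pathname of up to 104 bytes. */
struct ex_sockaddr_un {
	uint8_t sun_len;
	uint8_t sun_family;
	char sun_path[104];
};

/* IPv4: port number plus a 4-byte address. */
struct ex_sockaddr_in {
	uint8_t sin_len;
	uint8_t sin_family;
	uint16_t sin_port;
	uint32_t sin_addr;
	char sin_zero[8];
};

/* IPv6: port number, flow info, a 16-byte address, and a scope. */
struct ex_sockaddr_in6 {
	uint8_t sin6_len;
	uint8_t sin6_family;
	uint16_t sin6_port;
	uint32_t sin6_flowinfo;
	uint8_t sin6_addr[16];
	uint32_t sin6_scope_id;
};

/*
 * Generic code needs only the first two bytes of any address.  They
 * are read bytewise here so that the same accessors work for every
 * structure above without casting between structure types.
 */
uint8_t
ex_sa_len(const void *addr)
{
	return (((const unsigned char *)addr)[0]);
}

uint8_t
ex_sa_family(const void *addr)
{
	return (((const unsigned char *)addr)[1]);
}
```

Because the length and family occupy the same positions in every structure, generic code can determine how many bytes to copy and which protocol family should interpret them without understanding the rest of the address. The local-domain structure also shows why generic code cannot assume that an address fits in the template: it is much larger than the generic structure.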

Locks

Section 4.3 discussed the need for locking structures in an SMP kernel. The networking subsystem uses these locks internally to protect its data structures.

When SMP features were first introduced, the entire networking subsystem was placed, with the rest of the kernel, under the giant lock. During the development of FreeBSD 5.2, several pieces of networking code were modified to run without the giant lock, but at this time parts of the system still use it. Whenever a user process calls into the networking subsystem (i.e., it creates a socket, makes a connection, or sends or receives data), it acquires the giant lock. Other locks are acquired and released when interacting with lower layers of the protocols and network devices. Specific instances of networking subsystem locks are discussed in the section in which they are most relevant.