Section 18.1. STREAMS and the Network Stack | Solaris Internals: Solaris 10 and OpenSolaris Kernel Architecture (2nd Edition)

18.1. STREAMS and the Network Stack

The networking stack of the Solaris 1.x release was a variant of the BSD UNIX implementation and was similar to the BSD Reno implementation. The BSD stack worked well on low-end machines, but Solaris had to meet the demands of data-center enterprise installations as well as desktops and low-end systems. Thus, the Solaris code base was migrated to the AT&T SVR4 architecture, which became the Solaris 2.x product (SunOS 5.x).

With the Solaris 2.x release, the networking stack went through a makeover and transitioned from a BSD-style stack to a STREAMS-based stack, which aligned with the SVR4 architecture. The STREAMS framework provided a simple message-passing interface, making it relatively easy to create a message flow by which STREAMS modules interact with other STREAMS modules. The kernel STREAMS framework includes a perimeters facility for managing thread concurrency in STREAMS modules, thereby guaranteeing exclusive access to STREAMS queues. Using the STREAMS inner- and outer-perimeter feature, the module writer could provide mutual exclusion without making the implementation complex.

The demand on systems providing network services changed with the expansion of the World Wide Web (WWW) and the increase in volume and processing power of client systems (for example, personal computers). A salient example is connection setup and teardown. In early implementations, the cost of setting up a STREAMS device was high, but the number of connection setups per second was not an important consideration, and connections were usually long-lived. With the Internet explosion, large numbers of short-lived connections are common, and require an implementation that can do fast connection setup and teardown. This is just one of several areas addressed in Solaris 10.

The networking stack in Solaris 10 went through further transitions by which the core pieces (that is, socket layer, TCP, UDP, IP, and device driver) use an IP classifier and serialization queue to improve the connection setup time and scalability and to reduce packet processing cost. STREAMS modules are still used to provide the flexibility that ISVs need to implement additional functionality.

18.1.1. The STREAMS Model

STREAMS^[1] allows users to create modules to provide standard data communications services and then manipulate the modules on a stream. The modules are precompiled and can be dynamically interconnected to form a stream from the application level without any explicit linking to other modules in the stream.

^[1] The capitalized word "STREAMS" refers to the STREAMS programming model and facilities. The word "stream" refers to an instance of a full-duplex path using the model and facilities between a user application and a driver.

The fundamental STREAMS unit is the stream. A stream is a full-duplex bidirectional data-transfer path between a process in user space and a STREAMS driver in kernel space. A stream has three parts: a stream head, zero or more modules, and a driver.

An application creates a stream by opening a STREAMS device driver and optionally inserting one or more STREAMS modules between the stream head and device driver. This string of STREAMS modules (see Figure 18.1) creates a bidirectional flow in which data moves between the STREAMS driver and modules in the kernel space and an application in user space. A stream head is the end of the stream nearest the user process. It is the interface between the stream and the user process. When a STREAMS device is first opened, the stream consists of only a stream head and a STREAMS driver.

Figure 18.1. STREAMS Example

A STREAMS module is a defined set of kernel-level routines and data structures. A STREAMS device driver is a character device driver that implements the STREAMS interface. A STREAMS device driver exists below the stream head and any modules. It can act on an external I/O device, or it can be an internal software driver, called a pseudo device driver. The driver transfers data between the kernel and the device.

Data on a stream is passed in the form of messages. Messages are the means by which all I/O is done under STREAMS. Each stream head, STREAMS module, and driver has a read side and a write side. When messages go from one module's read side to the next module's read side, they are said to be traveling upstream. Messages passing from one module's write side to the next module's write side are said to be traveling downstream.

Each stream head, STREAMS driver, and STREAMS module has its own pair of queues, one queue for the read side and one queue for the write side. Messages are ordered into queues, generally on a first-in, first-out basis (FIFO), according to priorities associated with them (see Figure 18.2).

Figure 18.2. STREAMS Queues

To pass a message along the stream, the stream head, a module, or the STREAMS device driver calls the putnext() routine with a pointer to its own read-side or write-side queue and a pointer to the message block. STREAMS determines the next element in that direction (driver, module, or head, as appropriate) and calls the read-side or write-side procedure that the element registered when the stream was created, passing it the correct queue and the message block.
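The queue pairs and the putnext() handoff can be pictured with a small userland model. This is a sketch only, not kernel code: every sim_ name is hypothetical, the message is a plain string rather than a message block, and each put procedure simply records its queue name and forwards the message.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Userland model only: the sim_ names are hypothetical, not the kernel's. */
struct sim_queue;
typedef void (*sim_put_fn)(struct sim_queue *q, const char *msg);

struct sim_queue {
    const char *name;        /* label used for tracing */
    sim_put_fn put;          /* put procedure registered for this queue */
    struct sim_queue *next;  /* next queue in the direction of travel */
};

/* One read queue and one write queue per element of the stream. */
struct sim_stream {
    struct sim_queue head_r, head_w;
    struct sim_queue mod_r, mod_w;
    struct sim_queue drv_r, drv_w;
};

static char sim_trace[256];  /* records the order in which queues are visited */

/* Hand the message to the put procedure of the next queue, if there is one. */
static void sim_putnext(struct sim_queue *q, const char *msg)
{
    if (q->next != NULL)
        q->next->put(q->next, msg);
}

/* A put procedure that notes its own queue name, then forwards the message. */
static void sim_forward_put(struct sim_queue *q, const char *msg)
{
    strncat(sim_trace, q->name, sizeof(sim_trace) - strlen(sim_trace) - 1);
    strncat(sim_trace, " ", sizeof(sim_trace) - strlen(sim_trace) - 1);
    sim_putnext(q, msg);
}

/* Link head -> module -> driver on the write side, the reverse on the read side. */
static void sim_build_stream(struct sim_stream *s)
{
    s->head_w = (struct sim_queue){ "head.w", sim_forward_put, &s->mod_w };
    s->mod_w  = (struct sim_queue){ "mod.w",  sim_forward_put, &s->drv_w };
    s->drv_w  = (struct sim_queue){ "drv.w",  sim_forward_put, NULL };
    s->drv_r  = (struct sim_queue){ "drv.r",  sim_forward_put, &s->mod_r };
    s->mod_r  = (struct sim_queue){ "mod.r",  sim_forward_put, &s->head_r };
    s->head_r = (struct sim_queue){ "head.r", sim_forward_put, NULL };
}
```

Calling sim_putnext() on the head's write queue visits the module's write queue and then the driver's (downstream); calling it on the driver's read queue visits the module's read queue and then the head's (upstream).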

To communicate with a STREAMS device, an application uses the read(2), write(2), getmsg(2), getpmsg(2), putmsg(2), putpmsg(2), and ioctl(2) system calls to transmit or receive data on a stream.

The data written by the application is converted into a STREAMS message, which is made up of one or more message blocks, referenced by a pointer to a msgb structure. The b_next and b_prev pointers in the msgb structure are used to link messages together on a queue. The b_cont pointer links message blocks together when a message consists of more than one block. Each msgb structure also includes a pointer to a datab structure, the data block that contains pointers to the actual data of the message, and the message type.
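The relationship between these pointers can be sketched in userland C. The structures below are simplified stand-ins (the sim_ prefix marks them as hypothetical) that keep only a few fields. The b_rptr and b_wptr members, which bound the valid data within a block, are not described in the text above but are included so that message length can be computed. SIM_M_DATA is a placeholder for the real message type value.

```c
#include <assert.h>
#include <stddef.h>

#define SIM_M_DATA 0  /* placeholder message type, not the kernel constant */

/* Simplified data block: describes the buffer and the message type. */
struct sim_datab {
    unsigned char *db_base;  /* first byte of the buffer */
    unsigned char *db_lim;   /* one past the last byte of the buffer */
    int db_type;             /* message type */
};

/* Simplified message block. */
struct sim_msgb {
    struct sim_msgb *b_next;   /* next message on the queue */
    struct sim_msgb *b_prev;   /* previous message on the queue */
    struct sim_msgb *b_cont;   /* next block of the same message */
    unsigned char *b_rptr;     /* first unread byte in the buffer */
    unsigned char *b_wptr;     /* one past the last written byte */
    struct sim_datab *b_datap; /* data block describing the buffer */
};

/* Attach a caller-supplied buffer holding `used` valid bytes to a block. */
static void sim_init_block(struct sim_msgb *mb, struct sim_datab *db,
                           unsigned char *buf, size_t cap, size_t used)
{
    db->db_base = buf;
    db->db_lim = buf + cap;
    db->db_type = SIM_M_DATA;
    mb->b_next = mb->b_prev = mb->b_cont = NULL;
    mb->b_rptr = buf;
    mb->b_wptr = buf + used;
    mb->b_datap = db;
}

/* Total data bytes in one message: follow b_cont across its blocks. */
static size_t sim_msg_bytes(const struct sim_msgb *mp)
{
    size_t total = 0;
    for (; mp != NULL; mp = mp->b_cont)
        total += (size_t)(mp->b_wptr - mp->b_rptr);
    return total;
}

/* Number of messages queued: follow b_next from the first message. */
static size_t sim_queue_length(const struct sim_msgb *first)
{
    size_t n = 0;
    for (; first != NULL; first = first->b_next)
        n++;
    return n;
}
```

The two walks are independent: b_cont stays inside one message, while b_next steps from one message to the next on a queue.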

The STREAMS modules, device driver, and stream head communicate with each other, passing the STREAMS messages with the putnext(9F) or put(9E) routine.

STREAMS also allows a module or device driver to describe the concurrency model for itself. It can choose to be fully multithreaded, in which case the burden of protecting its data structures from multiple threads is the responsibility of the module. Modules can also choose the perimeter protection model that ensures that the STREAMS framework allows only one thread inside the module at any time. In-between options are available, whereby modules can specify that processing is single-threaded during setup and teardown of the stream but multithreaded during the data exchange.

18.1.2. Network Stack as STREAMS Module

STREAMS provided an excellent framework to implement the networking stack. Each protocol layer was implemented as a STREAMS module, and the network driver was implemented as a STREAMS device driver. The incoming packets were converted into STREAMS messages by the network device drivers and sent upstream to be processed by the various protocol layers (IP, TCP, UDP, etc.). Similarly, data sent by the application (by writing to an open socket) was converted by the stream head into a message and sent downstream to be processed by various protocol layers that were put together when the stream was constructed.

When an application opens a socket for communication, the domain and type arguments determine what kind of stream is created. For example, an AF_INET domain and SOCK_STREAM type mean that the application intends to establish a TCP stream to a remote endpoint. The stream created by opening such a socket would consist of a stream head, a TCP STREAMS module, and IP as the device driver. The mapping from socket domain and type to STREAMS driver is controlled by the /etc/sock2path file.
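The lookup from socket arguments to a STREAMS device can be pictured as a small table search. The sketch below is a userland model: the sim_ names are hypothetical, and the two table entries are illustrative rather than copied from a real /etc/sock2path file.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>

/* One row of the model table: socket arguments and the device to open. */
struct sim_sock2path {
    int family;
    int type;
    int protocol;
    const char *path;
};

/* Illustrative entries only; a real system's file may differ. */
static const struct sim_sock2path sim_table[] = {
    { AF_INET, SOCK_STREAM, 0, "/dev/tcp" },
    { AF_INET, SOCK_DGRAM,  0, "/dev/udp" },
};

/* Return the device path for the given socket arguments, or NULL if none. */
static const char *sim_lookup_path(int family, int type, int protocol)
{
    size_t i;
    for (i = 0; i < sizeof(sim_table) / sizeof(sim_table[0]); i++) {
        const struct sim_sock2path *e = &sim_table[i];
        if (e->family == family && e->type == type && e->protocol == protocol)
            return e->path;
    }
    return NULL;
}
```

With these entries, a request for AF_INET and SOCK_STREAM resolves to the TCP device, and a combination with no entry returns NULL.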

IP was implemented both as a STREAMS module and as a STREAMS device driver for multiplexing reasons. To the application, IP appeared as a device driver, but it was also inserted as a module on top of each network device driver (see Figure 18.3). Each module and device driver is an instance of the same code with common data structures, which helps IP direct a packet received by a device driver to the correct application stream by means of an IP client table. Similarly, data sent by the application needs to be sent through the correct device driver instance. An Internet route entry (ire) data structure maintained by IP stores the mapping between a destination and the network interface device driver.

Figure 18.3. TCP/IP Stack as STREAMS Modules

Creating the TCP stream for outbound connections is simple enough. The application initiates the process by opening a socket that creates the stream, and all subsequent operations (for example, connect) are processed on that stream. The incoming connection case is more complex since the stream to handle this connection doesn't yet exist.

Incoming SYN packets are sent to TCP through one of the listener TCP streams. TCP creates a new TCP data structure (often referred to as "eager TCP") for this incoming connection. The three-way handshake for connection setup is completed on the listener TCP stream. Once done, TCP sends a "connection indication" to the application. Once the application "accepts" the connection, a new stream for this connection is created, the eager-TCP instance is transferred from the listener to this stream, and an appropriate entry is made in the IP client table. Since a remote endpoint can continue to send packets after the three-way handshake is completed, the packets can be sent to TCP by means of the listener stream or the correct IP client, which may be in the process of being created.
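The handoff of an eager connection from the listener to its own stream can be modelled in a few lines. This is a userland sketch under simplifying assumptions: all sim_ names are hypothetical, the eager connections hang off the listener as a singly linked list, and the IP client table update is omitted.

```c
#include <assert.h>
#include <stddef.h>

enum sim_state { SIM_SYN_RCVD, SIM_ESTABLISHED };

struct sim_stream { int id; };   /* stand-in for a real stream */

/* Per-connection state, loosely modelled on the eager TCP instance. */
struct sim_tcp {
    enum sim_state state;
    struct sim_stream *stream;   /* stream currently carrying the connection */
    struct sim_tcp *eager_next;  /* link on the listener's eager list */
};

struct sim_listener {
    struct sim_stream stream;    /* the listener's own stream */
    struct sim_tcp *eagers;      /* connections not yet accepted */
};

/* A SYN arrived on the listener: park a new eager on the listener's stream. */
static void sim_syn_received(struct sim_listener *l, struct sim_tcp *eager)
{
    eager->state = SIM_SYN_RCVD;
    eager->stream = &l->stream;
    eager->eager_next = l->eagers;
    l->eagers = eager;
}

/* The three-way handshake completed while still on the listener's stream. */
static void sim_handshake_done(struct sim_tcp *eager)
{
    eager->state = SIM_ESTABLISHED;
}

/*
 * The application accepted: unlink the eager from the listener and move it
 * to the newly created stream. Returns 0 on success, -1 if the handshake
 * has not completed or the eager is not on this listener.
 */
static int sim_accept(struct sim_listener *l, struct sim_tcp *eager,
                      struct sim_stream *new_stream)
{
    struct sim_tcp **pp;

    if (eager->state != SIM_ESTABLISHED)
        return -1;
    for (pp = &l->eagers; *pp != NULL; pp = &(*pp)->eager_next) {
        if (*pp == eager) {
            *pp = eager->eager_next;
            eager->eager_next = NULL;
            eager->stream = new_stream;
            return 0;
        }
    }
    return -1;
}
```

Until sim_accept() succeeds, the connection's stream pointer still refers to the listener, which is why packets for the new connection can continue to arrive by way of the listener stream.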

As a result, the TCP receive-side code had to deal with significant complexity in setting up an incoming connection. This also limited the ability of the Solaris STREAMS-based stack to accept a large number of incoming TCP connections, since all the incoming packets for a particular listener were serialized at the listener instance, and each accepted connection required opening a stream with IP as the device and autopushing TCP as a module.

TCP and UDP also maintain a global stream to IP. This allows IP to pass incoming packets for which it can't find the correct IP client to the correct module on the basis of protocol alone.

While this implementation of TCP/IP as STREAMS modules served us well for many years, work began in the early days of Solaris 10 development to build a new, more efficient implementation that significantly reduced the use of STREAMS.

18.1.2.1. Key Data Structures

The key data structures and the protection mechanism used to manage synchronized access include the following:

TCP structures. TCP maintains the connection state in a connection-specific tcp_t data structure. This contains local and remote IP addresses, TCP port information, send/receive sequence numbers and round-trip time among other important members. Access is restricted to one thread at a time so there is no need to hold a lock while accessing these members.
Because of this model, a unique TCP stream is created per connection, and STREAMS provides serialized access to the TCP module in the stream. TCP stores the tcp_t data structure as the STREAMS queue private instance at connection creation time; the instance is passed to TCP when STREAMS calls it to deliver messages. TCP also maintains a connection hash table keyed on the local and remote IP addresses and TCP ports. Each connection-specific tcp_t is also inserted in the TCP connection hash table, which is used to find the connection instance for messages received on the TCP global stream or a listener stream, where it's not possible to determine the tcp_t from the queue itself.
UDP structures. UDP maintains per-stream information, similar to TCP, in a udp_t structure. This structure is also stored as the read-side and write-side queue private member. Most of the work of connecting UDP is similar to that of the TCP module.
UDP also deals with applications sending datagrams to multiple remote destinations on the same local socket by sending them to IP and letting IP multiplex them to the correct Network Interface Card (NIC).
IP structures. IP has two main data structures: ipc_t and ill_t.
ipc_t is the structure for IP as a device. The ipc_t structure includes the stream-specific information to send the incoming packets to the correct TCP or UDP client stream.
ill_t is the structure, unique to each physical NIC, for IP as a module inserted on each network device driver. The ill_t structure contains the physical NIC-specific information and pointers to each ipif_t structure, which represents each logical NIC on top of the physical NIC.
IP also maintains hash tables for connected TCP clients, TCP listener clients, UDP clients, and other protocol streams. An ipc_t is inserted in these tables with a hash value computed on IP address and port information. The routing table is also maintained by IP. Routes are cached in the form of an Internet route entry (ire_t) data structure. The ire_t structure contains a pointer to an outbound NIC queue through which the destination is reachable.

18.1.2.2. IP as a Multiplexer

IP is by far the most important component of the network stack. Its job is to multiplex the incoming packets to the correct IP client stream and send the outbound packets to the correct NIC. When an NIC receives a packet, it sends the packet upstream to IP acting as a module. IP looks at the protocol type and, for anything other than TCP and UDP, passes the packet to the correct protocol stream.

For TCP and UDP, it tries to find an ipc_t in the connected hash table or the listener hash table and uses the pointer to the upstream queue stored in ipc_t to pass the packet up the correct stream.
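The inbound lookup can be sketched as two chained hash tables searched in order. This is a userland model, not the Solaris implementation: the sim_ names, the bucket count, the hash functions, and the convention that a listener address of 0 matches any local address are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SIM_BUCKETS 64  /* arbitrary table size for the sketch */

/* Model of an IP client entry; upstream_q stands in for the queue pointer. */
struct sim_ipc {
    uint32_t laddr, faddr;   /* local and remote IPv4 addresses */
    uint16_t lport, fport;   /* local and remote ports */
    void *upstream_q;
    struct sim_ipc *hash_next;
};

struct sim_classifier {
    struct sim_ipc *connected[SIM_BUCKETS];  /* keyed on the full 4-tuple */
    struct sim_ipc *listeners[SIM_BUCKETS];  /* keyed on the local port */
};

static unsigned sim_conn_hash(uint32_t laddr, uint16_t lport,
                              uint32_t faddr, uint16_t fport)
{
    uint32_t h = laddr ^ faddr ^ ((uint32_t)lport << 16) ^ fport;
    h ^= h >> 16;
    return h % SIM_BUCKETS;
}

static void sim_add_connected(struct sim_classifier *c, struct sim_ipc *ipc)
{
    unsigned b = sim_conn_hash(ipc->laddr, ipc->lport, ipc->faddr, ipc->fport);
    ipc->hash_next = c->connected[b];
    c->connected[b] = ipc;
}

static void sim_add_listener(struct sim_classifier *c, struct sim_ipc *ipc)
{
    unsigned b = ipc->lport % SIM_BUCKETS;
    ipc->hash_next = c->listeners[b];
    c->listeners[b] = ipc;
}

/* Try the connected table first, then fall back to a matching listener. */
static struct sim_ipc *sim_classify(const struct sim_classifier *c,
                                    uint32_t laddr, uint16_t lport,
                                    uint32_t faddr, uint16_t fport)
{
    struct sim_ipc *ipc;

    for (ipc = c->connected[sim_conn_hash(laddr, lport, faddr, fport)];
         ipc != NULL; ipc = ipc->hash_next) {
        if (ipc->laddr == laddr && ipc->lport == lport &&
            ipc->faddr == faddr && ipc->fport == fport)
            return ipc;
    }
    for (ipc = c->listeners[lport % SIM_BUCKETS];
         ipc != NULL; ipc = ipc->hash_next) {
        if (ipc->lport == lport && (ipc->laddr == 0 || ipc->laddr == laddr))
            return ipc;
    }
    return NULL;  /* no client; the caller falls back to the global stream */
}
```

A packet whose four-tuple matches an established connection reaches that connection's stream; any other packet addressed to a listening port reaches the listener.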

The outbound multiplexing is done by means of the ire_t. When upstream modules want to send packets out, IP uses the ire_t to determine which outbound NIC it should use.
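Outbound selection can be modelled as a search over route entries that each name an outbound interface. The sketch below is a userland illustration with hypothetical sim_ names. It assumes a linear table and a longest-prefix-match rule, neither of which is stated in the text, and it represents the outbound NIC queue by a name string.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Model of a route entry: destination prefix plus the NIC that reaches it. */
struct sim_ire {
    uint32_t dest;    /* destination network, host byte order */
    uint32_t mask;    /* contiguous netmask; 0 marks the default route */
    const char *nic;  /* stand-in for the outbound NIC queue pointer */
};

/*
 * Return the entry with the longest matching prefix for dst, or NULL if
 * nothing matches. With contiguous masks, a numerically larger mask is a
 * longer prefix.
 */
static const struct sim_ire *sim_ire_lookup(const struct sim_ire *table,
                                            size_t n, uint32_t dst)
{
    const struct sim_ire *best = NULL;
    size_t i;

    for (i = 0; i < n; i++) {
        if ((dst & table[i].mask) != table[i].dest)
            continue;
        if (best == NULL || table[i].mask > best->mask)
            best = &table[i];
    }
    return best;
}
```

Given entries for 10.0.0.0/8, 10.1.0.0/16, and a default route, a packet for 10.1.2.3 selects the /16 entry, and a packet for an address outside 10.0.0.0/8 selects the default.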

18.1.3. Issues with STREAMS-Based Stacks

The STREAMS-based stack served well until the arrival of the Web. When connections were few and long-lived (NFS, ftp, etc.), the cost of setting up a new stream was amortized over the life of the connection. With the explosion of the Web and faster machines, connections became predominantly short-lived, and a typical server had to deal with a large number of incoming connections at any given time.

During the same period, servers became larger, multiprocessor-based systems with larger memory capacity. The cost of switching processing from one CPU to another became high as the mid-to-high-end machines became more nonuniform in their memory access. Since STREAMS by design had no CPU affinity, packets for particular connections moved around to different CPUs. It was apparent that Solaris needed to move away from the STREAMS architecture.