18.3. Solaris 10 Network Stack Framework

The pre-Solaris 10 stack used the STREAMS perimeter facility and kernel adaptive mutexes for multithreading. TCP used a STREAMS QPAIR perimeter, UDP used a STREAMS QPAIR perimeter with the PUTSHARED attribute, and IP used a PERMOD perimeter with PUTSHARED. Various TCP, UDP, and IP global data structures were protected by mutexes. The stack was executed by userland threads making system calls, by the network device driver's read-side interrupt or worker threads, and by STREAMS framework worker threads. The perimeter of that stack was per module, that is, a per-protocol-stack-layer, or horizontal, perimeter. A packet could therefore be, and often was, processed on more than one CPU and by more than one thread, leading to excessive context switching and poor CPU data locality. The problem was compounded by the various places at which packets could be queued under load and by the various threads that finally processed each packet.

The FireEngine approach is to merge all protocol layers into one STREAMS module that is fully multithreaded. Inside the merged module, instead of using per-data-structure locks, FireEngine uses a per-CPU synchronization mechanism called the vertical perimeter. The vertical perimeter is implemented by a serialization queue abstraction called squeue. Each squeue is bound to a CPU, and each connection is in turn bound to an squeue, thereby providing any synchronization and mutual exclusion needed for the connection-specific data structures.

The connection (or context) lookup for inbound packets is done outside the perimeter by an IP connection classifier as soon as the packet reaches IP. The classification provides the basis by which the connection structure is identified. Since the lookup happens outside the perimeter, we can bind a connection to an instance of the vertical perimeter or squeue when the connection is initialized and process all packets for that connection on the squeue to which it is bound, maintaining better cache locality. More details about the vertical perimeter and classifier are given in later sections.

The classifier also becomes the database for storing the sequence of function calls necessary for all inbound and outbound packets. This facilitates a change in the Solaris networking stack from the earlier message-passing interface to a BSD-style function-call interface. The string of functions created on the fly (event list) for processing a packet for a connection provides the basis for an eventual new framework in which other modules, including third-party, high-performance modules, can participate.

18.3.1. Vertical Perimeter

An squeue guarantees that only a single thread can process a given connection at any given time, thus serializing access to the TCP connection structure by multiple threads (both from the read and write side) in the merged TCP/IP module. It is similar to the STREAMS QPAIR perimeter, but instead of just protecting a module instance, it protects the whole connection state from IP to sockfs (the socket file system, the implementation of sockets in Solaris introduced in the Solaris 8 release).

Vertical perimeters, or squeues, by themselves just provide packet serialization and mutual exclusion for the data structures, but by creating per-CPU perimeters and binding a connection to the instance attached to the CPU processing interrupts, we can guarantee much better data locality. We could have created either a per-connection perimeter or a per-CPU perimeter, that is, an instance for each connection or for each CPU. However, the overhead and thread contention involved with a per-connection perimeter give lower performance, so we opted for a per-CPU instance.

For the per-CPU instance, we had the choice of queuing the connection structure for processing or just queuing the packet itself and storing a pointer to the connection structure in the packet. The former approach leads to some interesting starvation scenarios when packets for a connection keep arriving at a steady rate, and managing the potential starvation came at a high performance cost. Queuing the packets preserves the ordering and is much simpler, and this is the approach we have taken for FireEngine.

As mentioned before, each connection instance is assigned to a single squeue and is thus processed only within the vertical perimeter. An squeue is processed by one thread at a time, so all data structures used to process a given connection from within the perimeter can be accessed without additional locking. This approach improves CPU and thread-context data locality for the connection metadata, the packet metadata, and the packet payload. In addition, it lets us remove per-device-driver worker thread schemes, which are problematic for solving a systemwide resource issue. With that removal, we can implement additional strategic algorithms that best handle a given network interface according to its throughput and the system throughput. For example, fanning out per-connection packet processing to a group of CPUs is now possible. The thread entering an squeue may either process the packet right away or queue it for later processing by another thread or the worker thread. The choice depends on the squeue entry point and the state of the squeue. Immediate processing is possible only when no other thread has entered the same squeue. The squeue is represented by the following abstraction.

struct squeue_s {
        /* Keep the most used members 64bytes cache aligned */
        kmutex_t        sq_lock;        /* lock before using any member */
        uint32_t        sq_state;       /* state flags and message count */
        int             sq_count;       /* # of mblocks in squeue */
        mblk_t          *sq_first;      /* first mblk chain or NULL */
        mblk_t          *sq_last;       /* last mblk chain or NULL */
        clock_t         sq_awaken;      /* time async thread was awakened */
        kthread_t       *sq_run;        /* Current thread processing sq */
        void            *sq_rx_ring;
        clock_t         sq_avg_drain_time; /* Avg time to drain a pkt */
        processorid_t   sq_bind;        /* processor to bind to */
        kcondvar_t      sq_async;       /* async thread blocks on */
        clock_t         sq_wait;        /* lbolts to wait after a fill() */
        uintptr_t       sq_private[SQPRIVATE_MAX];
        timeout_id_t    sq_tid;         /* timer id of pending timeout() */
        kthread_t       *sq_worker;     /* kernel thread id */
        char            sq_name[SQ_NAMELEN + 1];
...
};
                                        See usr/src/uts/common/sys/squeue_impl.h


It is important to note that the squeues are created on the basis of per-hardware execution pipelines, that is, cores, hyperthreads, and the like. The stack processing of the serialization queue (and the hardware execution pipeline) is limited to one thread at a time, but this actually improves performance because the new stack ensures that there are no waits for any resources such as memory or locks inside the vertical perimeter. Allowing more than one kernel thread to timeshare execution pipelines incurs more overhead than allowing only one thread to run uninterrupted.

FireEngine provides three models for flexible squeue processing:

  • Queuing model. The queue is strictly FIFO (first in, first out) for both the read and write side, which ensures that no particular connection suffers starvation. A read-side or write-side thread queues packets at the end of the chain. The thread may then process the packet or signal the worker thread, according to the processing model.

  • Processing model. After enqueuing its packet, the enqueuing thread returns if another thread is already processing the squeue, and the packet is drained later according to the drain model. If the squeue is not being processed and no packets are queued, the thread can mark the squeue as being processed (represented by sq_flag) and process the packet. Once the thread has processed the packet, it clears the "processing in progress" flag and frees the squeue for future processing.

  • Drain model. A thread that successfully processed its own packet can also drain any packets that were queued while it was processing the request. In addition, if the squeue is not being processed but packets are already queued, then instead of queuing its packet and leaving, the thread can drain the queue and then process its own packet. The worker thread is always allowed to drain the entire queue. Choosing the correct drain model is quite complicated. The choices shown below can be independently applied to the read thread and the write thread.

    Always queue.

    Process your own packet if you can.

    Time-bounded process and drain.

Typically, draining by an interrupt thread should always be time-bounded "process and drain," whereas the write thread can choose between "process your own" and time-bounded "process and drain." For Solaris 10, the write thread behavior is tunable and defaults to "process your own," whereas the read side is fixed to time-bounded "process and drain."
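The three models above can be combined into a minimal sketch of the squeue entry path. This is only an illustration, assuming the squeue_s fields shown earlier; the SQS_PROC flag, the squeue_drain_sketch() helper, the argument order of the sqproc_t callback, and the omission of the tag argument are assumptions, not the actual squeue implementation.

static void
squeue_enter_sketch(squeue_t *sqp, mblk_t *mp, sqproc_t proc, void *arg)
{
        mutex_enter(&sqp->sq_lock);
        if (!(sqp->sq_state & SQS_PROC) && sqp->sq_first == NULL) {
                /* Processing model: nobody owns the squeue, nothing is queued. */
                sqp->sq_state |= SQS_PROC;
                sqp->sq_run = curthread;
                mutex_exit(&sqp->sq_lock);

                (*proc)(arg, mp, sqp);          /* process our own packet */

                mutex_enter(&sqp->sq_lock);
                squeue_drain_sketch(sqp);       /* drain model: possibly time-bounded */
                sqp->sq_state &= ~SQS_PROC;
                sqp->sq_run = NULL;
                mutex_exit(&sqp->sq_lock);
        } else {
                /* Queuing model: strict FIFO append on the mblk chain. */
                if (sqp->sq_last != NULL)
                        sqp->sq_last->b_next = mp;
                else
                        sqp->sq_first = mp;
                sqp->sq_last = mp;
                sqp->sq_count++;
                mutex_exit(&sqp->sq_lock);
                /* The worker thread is signaled, or a delayed wakeup is scheduled. */
        }
}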

Signaling the worker thread is another option worth exploring. If the packet arrival rate is low and a thread is forced to queue its packet, then, when there is work to be done, the worker thread should be allowed to run as soon as the entering thread finishes processing the squeue. On the other hand, if the packet arrival rate is high, it may be desirable to delay waking up the worker thread and hope that an interrupt will shortly arrive to complete the drain. Waking up the worker thread immediately when the packet arrival rate is high creates unnecessary contention between the worker and interrupt threads.

The default for Solaris 10 is delayed wakeup of the worker thread. Initial experiments on available servers showed that the best results were obtained by waking up the worker thread after a 10 ms delay.
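A sketch of that delayed wakeup, assuming the delay is kept in sq_wait (in lbolt ticks, roughly 10 ms) and the pending timer is tracked in sq_tid, as the structure shown earlier suggests; squeue_fire_sketch() and squeue_delayed_wakeup_sketch() are illustrative names, not the real routines.

static void
squeue_fire_sketch(void *arg)
{
        squeue_t *sqp = arg;

        mutex_enter(&sqp->sq_lock);
        sqp->sq_tid = 0;
        if (sqp->sq_first != NULL)              /* work still queued: wake the worker */
                cv_signal(&sqp->sq_async);
        mutex_exit(&sqp->sq_lock);
}

static void
squeue_delayed_wakeup_sketch(squeue_t *sqp)
{
        ASSERT(MUTEX_HELD(&sqp->sq_lock));
        if (sqp->sq_tid == 0)                   /* no wakeup pending yet */
                sqp->sq_tid = timeout(squeue_fire_sketch, sqp, sqp->sq_wait);
}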

Placing a request on the squeue requires a per-squeue lock to protect the state of the queue; this doesn't introduce scalability problems, because the lock is distributed among CPUs and is only held for a short period of time. We also utilize optimizations that allow avoiding context switches while still preserving the single-threaded semantics of squeue processing. We create an instance of an squeue per CPU in the system and bind the worker thread to that CPU. Each connection is then bound to a specific squeue and thus to a specific CPU as well.

The binding of an squeue to a CPU can be changed, but the binding of a connection to an squeue never changes because of the squeue protection semantics. In the merged TCP/IP case, the vertical perimeter protects the TCP state for each connection. The squeue instance used by each connection is chosen either at the "open," "bind," or "connect" time for outbound connections or at "eager connection creation time" for inbound connections.

The choice of the squeue instance depends on the relative speeds of the CPUs and the NICs in the system. There are two cases:

  • The CPU is faster than the NIC. The incoming connections are assigned to the "squeue instance" of the interrupted CPU. For the outbound case, connections are assigned to the squeue instance of the CPU the application is running on.

  • The NIC is faster than the CPU. A single CPU is not capable of handling the NIC's traffic. Connections are bound randomly across all available squeues.

For Solaris 10, the system administrator indicates whether the NIC is faster than the CPUs by tuning the global variable ip_squeue_fanout. The default is no fan-out; that is, an incoming connection is assigned to the squeue attached to the interrupted CPU. To take a CPU offline, the worker thread bound to that CPU removes its binding and restores it when the CPU comes back online. This allows dynamic reconfiguration functionality to work correctly. When packets for a connection arrive on multiple NICs (and thus interrupt multiple CPUs), they are always processed on the squeue on which the connection was originally established. In Solaris 10, the vertical perimeter is provided only for TCP-based connections. Entry into the vertical perimeter is made in the TCP and IP layers after a determination that the packet belongs to a TCP connection. Solaris 10 updates will introduce the vertical perimeter for general use.
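As an example of the ip_squeue_fanout tuning mentioned above, fan-out could be enabled persistently with an /etc/system entry like the following sketch; the module-qualified form of the name is an assumption.

* Enable squeue fan-out so that incoming connections are spread
* across all squeues rather than bound to the interrupted CPU.
set ip:ip_squeue_fanout=1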

The function prototypes for the squeue interfaces are listed below.

extern void squeue_init(void);
extern squeue_t *squeue_create(char *, processorid_t, clock_t, pri_t);
extern void squeue_bind(squeue_t *, processorid_t);
extern void squeue_unbind(squeue_t *);
extern void squeue_enter_chain(squeue_t *, mblk_t *, mblk_t *,
    uint32_t, uint8_t);
extern void squeue_enter(squeue_t *, mblk_t *, sqproc_t, void *, uint8_t);
extern void squeue_enter_nodrain(squeue_t *, mblk_t *, sqproc_t, void *,
    uint8_t);
extern void squeue_fill(squeue_t *, mblk_t *, sqproc_t, void *, uint8_t);
extern uintptr_t *squeue_getprivate(squeue_t *, sqprivate_t);
extern processorid_t squeue_binding(squeue_t *);
                                        See usr/src/uts/common/sys/squeue.h


squeue_create() instantiates a new squeue; squeue_bind() and squeue_unbind() bind it to and unbind it from a particular CPU. Once created, squeues are never destroyed. The squeue_enter() function enters the squeue, and the entering thread processes and drains the squeue according to the models previously discussed. squeue_fill() just queues a packet on the squeue, to be processed later by the worker thread or by other threads.
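A minimal sketch of how these interfaces could be used to create and bind one squeue per CPU, along the lines described earlier; the name format, wait time, worker priority, and unlocked walk of the CPU list are illustrative simplifications.

#include <sys/cpuvar.h>
#include <sys/squeue.h>
#include <sys/squeue_impl.h>

static squeue_t *sketch_sqset[NCPU];    /* one squeue per possible CPU id */

static void
sketch_squeue_per_cpu(void)
{
        cpu_t *cp = cpu_list;

        do {
                processorid_t id = cp->cpu_id;
                char name[SQ_NAMELEN + 1];

                (void) snprintf(name, sizeof (name), "sq_%d", id);
                /* Wait time (in ticks) and worker priority are assumptions. */
                sketch_sqset[id] = squeue_create(name, id,
                    drv_usectohz(10000), minclsyspri);
                squeue_bind(sketch_sqset[id], id);      /* bind worker to CPU id */
                cp = cp->cpu_next;
        } while (cp != cpu_list);
}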

18.3.2. IP Classifier

The IP connection fan-out mechanism consists of three hash tables: a five-tuple hash table (protocol, remote and local IP addresses, and remote and local ports) to keep fully qualified TCP (ESTABLISHED) connections; a three-tuple lookup consisting of protocol, local address, and local port to keep the listeners; and a single-tuple lookup for protocol listeners. As part of the lookup, a connection structure (a superset of all connection information) is returned. This connection structure is called conn_t. A few of the key structure members are shown below.

struct conn_s {
        kmutex_t        conn_lock;
        uint32_t        conn_ref;               /* Reference counter */
        uint_t          conn_state_flags;       /* IP state flags */
        ire_t           *conn_ire_cache;        /* outbound ire cache */
        uint32_t        conn_flags;             /* Conn Flags */
...
        tcp_t           *conn_tcp;              /* Pointer to the tcp struct */
        squeue_t        *conn_sqp;              /* Squeue for processing */
        edesc_rpf       conn_recv;              /* Pointer to recv routine */
        void            *conn_pad1;
...
        queue_t         *conn_rq;               /* Read queue */
        queue_t         *conn_wq;               /* Write queue */
        dev_t           conn_dev;               /* Minor number */
        cred_t          *conn_cred;             /* Credentials */
...
        connf_t         *conn_fanout;           /* Hash bucket we're part of */
...
};
                                        See usr/src/uts/common/inet/ipclassifier.h


The interesting member to note is the pointer to the squeue or vertical perimeter. The lookup is done outside the perimeter and the packet is processed or queued on the squeue to which the connection is attached. Also, conn_recv and conn_send point to the read-side and write-side functions. The read-side function can be tcp_input() if the packet is meant for TCP.
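To make the inbound data path concrete, here is a minimal sketch of the dispatch step, using the conn_t members above and the squeue and classifier interfaces listed in this section; the cast of conn_recv, the tag value of 0, and the drop on a failed lookup are illustrative assumptions.

static void
sketch_ip_input_dispatch(mblk_t *mp)
{
        conn_t *connp = ipcl_classify_v4(mp);   /* lookup outside the perimeter */

        if (connp == NULL) {
                freemsg(mp);                    /* no matching connection (sketch only) */
                return;
        }
        /* The classifier holds a reference on connp until the packet is processed. */
        squeue_enter(connp->conn_sqp, mp, (sqproc_t)connp->conn_recv, connp, 0);
}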

The connection fan-out mechanism supports wildcard listeners, that is, INADDR_ANY. Currently, the connected and bind tables are used only for TCP and UDP. A listener entry is made during the listen() call; for TCP, an entry is made in the connected table once the three-way handshake is complete.
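The lookup order implied by these tables can be sketched as follows: a fully qualified five-tuple match is preferred, and a three-tuple listener match, possibly against a wildcard local address, is the fallback. The table names, hash macros, match helpers, and the conn_next link are illustrative only.

static conn_t *
sketch_classify_tcp(ipaddr_t src, uint16_t sport, ipaddr_t dst, uint16_t dport)
{
        conn_t *connp;

        /* 1. ESTABLISHED connections: five-tuple hash. */
        for (connp = sketch_conn_hash[SKETCH_HASH5(IPPROTO_TCP, src, sport, dst, dport)];
            connp != NULL; connp = connp->conn_next)
                if (sketch_match5(connp, src, sport, dst, dport))
                        return (connp);

        /* 2. Listeners: three-tuple hash; local address may be INADDR_ANY. */
        for (connp = sketch_bind_hash[SKETCH_HASH3(IPPROTO_TCP, dst, dport)];
            connp != NULL; connp = connp->conn_next)
                if (sketch_match3(connp, dst, dport))
                        return (connp);

        return (NULL);
}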

For reference, the IPClassifier interfaces are listed below.

conn_t        *ipcl_conn_create(uint32_t type, int sleep);
void          ipcl_conn_destroy(conn_t *connp);
int           ipcl_proto_insert(conn_t *connp, uint8_t protocol);
int           ipcl_proto_insert_v6(conn_t *connp, uint8_t protocol);
conn_t        *ipcl_proto_classify(uint8_t protocol);
int           *ipcl_bind_insert(conn_t *connp, uint8_t protocol, ipaddr_t src,
                  uint16_t lport);
int           *ipcl_bind_insert_v6(conn_t *connp, uint8_t protocol,
                  const in6_addr_t *src, uint16_t lport);
int           *ipcl_conn_insert(conn_t *connp, uint8_t protocol, ipaddr_t src,
                  ipaddr_t dst, uint32_t ports);
int           *ipcl_conn_insert_v6(conn_t *connp, uint8_t protocol,
                  in6_addr_t *src, in6_addr_t *dst, uint32_t ports);
void          ipcl_hash_remove(conn_t *connp);
conn_t        *ipcl_classify_v4(mblk_t *mp);
conn_t        *ipcl_classify_v6(mblk_t *mp);
conn_t        *ipcl_classify(mblk_t *mp);
                                        See usr/src/uts/common/inet/ipclassifier.h


18.3.3. Synchronization Mechanism

Since the stack is fully multithreaded (barring the per-CPU serialization enforced by the vertical perimeter), it uses a reference-based scheme to ensure that connection instances are available when needed. The reference count is implemented in the conn_t structure through the conn_ref field and is protected by conn_lock. The prime purpose of the lock is not to protect the bulk of the conn_t structure but just the reference count. Each time some entity references the data structure (stores a pointer to it for later processing), it increments the reference count by calling the CONN_INC_REF macro. This macro acquires conn_lock, increments conn_ref, and then drops conn_lock. When the entity no longer needs the connection instance, it drops its reference by means of the CONN_DEC_REF macro.

An established TCP connection is guaranteed to have three references on it. Each protocol layer has a reference on the instance (one each for TCP and IP), and the classifier itself has a reference since it is an established connection. Each time a packet arrives for the connection and the classifier looks up the connection instance, an extra reference is placed. That reference is dropped when the protocol layer finishes processing that packet. Similarly, any timers running on the connection instance have a reference to ensure that the instance is around whenever the timer fires. The memory associated with the connection instance is freed once the last reference is dropped.
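Based on this description, the two macros behave approximately as sketched below; the real macros in ipclassifier.h also carry assertions and additional bookkeeping.

#define CONN_INC_REF_SKETCH(connp) {                            \
        mutex_enter(&(connp)->conn_lock);                       \
        (connp)->conn_ref++;                                    \
        mutex_exit(&(connp)->conn_lock);                        \
}

#define CONN_DEC_REF_SKETCH(connp) {                            \
        mutex_enter(&(connp)->conn_lock);                       \
        if (--(connp)->conn_ref == 0) {                         \
                mutex_exit(&(connp)->conn_lock);                \
                ipcl_conn_destroy(connp);  /* last reference: free the instance */ \
        } else {                                                \
                mutex_exit(&(connp)->conn_lock);                \
        }                                                       \
}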



