Section 18.9. Interrupt Model and NIC Speeds | Solaris Internals: Solaris 10 and OpenSolaris Kernel Architecture (2nd Edition)

18.9. Interrupt Model and NIC Speeds

Any discussion about networking in a modern operating system is incomplete without talking about the way in which the stack deals with incoming packets (receive) and about the interrupts resulting from them.

18.9.1. Solaris 9 and Earlier Releases

When the 100-Mbit speed was common, the bulk of inbound packet processing was done in the interrupt context. In a typical server doing heavy transmit, writes from an application queue data in the TCP transmit queue and the data is actually sent out when incoming ACKs open the congestion window (known as ACK-driven transmit). The ACKs themselves are processed by the interrupt thread, which also checks that the congestion window has opened and transmits any queued data. On a 100-Mbit Ethernet, this approach had two advantages: it provided extra cache locality to the data structures (since packets for a particular connection land on the same CPU and are processed on the same CPU) and there was no thread-switching overhead, resulting in better performance.

When 1-Gbit NICs arrived in the Solaris 8 time frame, processor speeds lagged behind the NIC, and a single CPU could no longer keep up with it; the time spent processing network interrupts began to starve other system activities. This also resulted in some pathological cases in which the interrupted CPU became 100% busy while other CPUs sat mostly idle, and the system became live-locked.

To get around this problem, the 1-Gbit NIC drivers in the Solaris 8 and 9 releases adopted a worker thread model: instead of sending the packet to IP in interrupt context, they used one or more worker threads to spread the work across several CPUs. To avoid excessive D-cache misses caused by packets for the same connection landing on different CPUs, the driver did a simple hash computation on the source IP address to assign a packet to a particular worker thread. This helped scalability, but single-CPU performance degraded because of the extra context switching involved, the computation of an additional hash, and the worker threads themselves migrating across CPUs.
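The hash-based assignment described above can be sketched in a few lines. This is only an illustration in Python, not the driver's code: the function name `pick_worker`, the fold-and-modulo hash, and the IPv4-only assumption are all hypothetical. The point it shows is that every packet from one source lands on the same worker.

```python
import ipaddress

def pick_worker(src_ip: str, num_workers: int) -> int:
    """Map an IPv4 source address to a worker index (hypothetical policy)."""
    addr = int(ipaddress.ip_address(src_ip))
    # Fold the upper 16 bits into the lower 16 so both halves of the
    # address influence the result, then reduce modulo the worker count.
    folded = (addr ^ (addr >> 16)) & 0xFFFF
    return folded % num_workers

# Hypothetical inbound packets: (source address, payload tag).
packets = [
    ("10.0.0.1", "SYN"),
    ("10.0.0.2", "SYN"),
    ("10.0.0.1", "ACK"),
    ("10.0.0.2", "ACK"),
]

# One queue per worker thread; packets from the same source stay together.
queues = {i: [] for i in range(4)}
for src, kind in packets:
    queues[pick_worker(src, 4)].append((src, kind))
```

Because the hash depends only on the source address, ordering within a source is preserved, but a single busy source cannot be spread over more than one worker.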

The device driver became increasingly complex and burdened with decisions better left to the higher layers. The device driver had no good means of figuring out how many worker threads it should use to spread out the load. Sizing the worker thread pool by the number of CPUs on the system created chaos when multiple NICs became active on the same system with the same policy. The arrival of 10-Gbit NICs made matters worse because the packet size on the Internet was still 1500 bytes, resulting in potentially more than 800,000 packets per second at line rate and leaving the processor about 1.2 μsec to process each packet while still doing useful work.
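The per-packet time budget at a given line rate follows from simple arithmetic. The sketch below computes it for 1500-byte packets; it assumes back-to-back packets and ignores Ethernet framing overhead, so the real packet rate is slightly lower. The function name `per_packet_budget` is illustrative only.

```python
def per_packet_budget(link_bits_per_sec: float, packet_bytes: int = 1500):
    """Return (packets per second, microseconds available per packet)."""
    pps = link_bits_per_sec / (packet_bytes * 8)
    budget_usec = 1e6 / pps
    return pps, budget_usec

for name, rate in [("100 Mbit", 100e6), ("1 Gbit", 1e9), ("10 Gbit", 10e9)]:
    pps, usec = per_packet_budget(rate)
    print(f"{name}: {pps:,.0f} packets/sec, {usec:.1f} usec per packet")
```

At 1 Gbit this gives about 83,000 packets per second and 12 μsec per packet; at 10 Gbit it gives about 833,000 packets per second and 1.2 μsec per packet.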

18.9.2. Dynamic Switch between Interrupt and Polling Mode

The popularity of 10-Gbit NICs is increasing in the data center because, apart from throughput, 10-Gbit NICs simplify the data center wiring and offer better latencies. To handle the interrupt load from a 10-Gbit NIC, the NIC vendors use interrupt coalescing schemes, by which they interrupt the CPU according to either n number of packets received or t time elapsed. This scheme was employed by several 1-Gbit NICs as well. It suffers from the fact that under lower load, when interrupts are firing based on time t elapsing, the latencies are poor, and under higher load, the system suddenly goes through an interrupt storm or longer interrupt processing time, resulting in application threads getting pinned by interrupts.
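The weakness of a fixed coalescing policy is easy to see in a small model. The following sketch is not any vendor's algorithm; it assumes the simplest possible rule, in which an interrupt fires when `max_packets` packets are pending or when `max_wait_usec` has elapsed since the oldest pending packet arrived. All names are hypothetical.

```python
def coalesce(arrivals_usec, max_packets, max_wait_usec):
    """Return (fire_time, batch) pairs for a sorted list of arrival times."""
    interrupts = []
    pending = []
    for t in arrivals_usec:
        # The timer started by the oldest pending packet may have
        # expired before this packet arrived.
        if pending and t > pending[0] + max_wait_usec:
            interrupts.append((pending[0] + max_wait_usec, pending))
            pending = []
        pending.append(t)
        # The packet-count threshold fires immediately.
        if len(pending) == max_packets:
            interrupts.append((t, pending))
            pending = []
    if pending:
        # Whatever is left is flushed when its timer expires.
        interrupts.append((pending[0] + max_wait_usec, pending))
    return interrupts

def worst_latency(interrupts):
    """Longest time any batch's first packet waited for its interrupt."""
    return max(fire - batch[0] for fire, batch in interrupts)

light = coalesce([0, 1000, 2000], max_packets=8, max_wait_usec=100)
heavy = coalesce(list(range(16)), max_packets=8, max_wait_usec=100)
```

In the light-load case every packet waits out the full 100 μsec timer before the CPU hears about it, while in the heavy-load case the count threshold fires after only 7 μsec. The thresholds that suit one load are wrong for the other, which is the problem the text describes.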

The problem is not as acute with 1-Gbit NICs (with current processor speed of 1 GHz and more), but performance still suffers. The NIC interrupts the CPU on the basis of its local policies, preempting the current thread and causing unnecessary context switches, lock contention, thread migration, etc. In most cases, the packet delivered can't be immediately processed because the squeue is busy, resulting in packets queuing on the squeue.

The Solaris 10 stack simplified device driver writing once again by unburdening the device driver from handling interrupt frequency and load spreading. The new GLDv3 framework allows the device to be tied to an squeue of the interrupted CPU. The squeue controls the interrupt frequency or switches the device into a polling mode.

As discussed earlier, the squeue is a common FIFO for both inbound and outbound packets, and only one thread is allowed to process it at any given time. As such, the per-CPU backlog is easy to figure out. If packets are queued, the squeue can switch the NIC associated with it from interrupt to polling mode, as illustrated in Figure 18.8. The NIC stops interrupting the CPU, and the squeue moves packets from the NIC to the squeue from time to time. The move is done for the entire chain instead of the usual interrupt per packet. The NIC is essentially in polling mode at this point.

Figure 18.8. NIC Mode Switching

If the squeue finds that it has no packets to process and the NIC also has nothing queued in the ring buffer, it switches the NIC back to interrupt mode and the squeue worker thread goes back to sleep. As long as the NIC's interrupt can process its packet without queuing on the squeue, the NIC continues to be in interrupt mode. In other words, as long as the interarrival time of packets is greater than the processing time, the NIC continues to be in interrupt mode and the packets are processed in the interrupt context.
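The switching rule can be modeled with a small time-stepped simulation. This is a sketch under simplifying assumptions, not the Solaris implementation: time advances in ticks, the squeue can process a fixed number of packets per tick, and the class and attribute names (`SqueueModel`, `nic_ring`, `backlog`) are invented for illustration.

```python
from collections import deque

INTERRUPT, POLLING = "interrupt", "polling"

class SqueueModel:
    def __init__(self, capacity_per_tick):
        self.capacity = capacity_per_tick  # packets processed per tick
        self.mode = INTERRUPT
        self.nic_ring = deque()            # packets held by the NIC
        self.backlog = deque()             # packets queued on the squeue
        self.interrupts = 0

    def tick(self, arrivals):
        budget = self.capacity
        for pkt in arrivals:
            if self.mode == INTERRUPT:
                self.interrupts += 1
                if budget > 0 and not self.backlog:
                    budget -= 1            # processed in interrupt context
                else:
                    self.backlog.append(pkt)
                    self.mode = POLLING    # backlog built up: stop interrupts
            else:
                self.nic_ring.append(pkt)  # NIC holds the packet silently
        if self.mode == POLLING:
            # Pull the entire chain from the NIC in one move.
            self.backlog.extend(self.nic_ring)
            self.nic_ring.clear()
            while budget > 0 and self.backlog:
                self.backlog.popleft()
                budget -= 1
            # Nothing left anywhere: go back to interrupt mode.
            if not self.backlog and not self.nic_ring:
                self.mode = INTERRUPT
        return self.mode

model = SqueueModel(capacity_per_tick=2)
modes = [model.tick(range(n)) for n in [1, 2, 5, 5, 1, 0, 0, 0]]
```

In this run the model stays in interrupt mode while arrivals fit within the per-tick capacity, switches to polling on the first burst of five, and returns to interrupt mode only once the backlog has drained. Of the 14 packets, only 6 raise an interrupt.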

This scheme creates a powerful mechanism for processing incoming packets without putting the complexity in the device driver. The overall system performance improves significantly because interrupts no longer clash with system and application threads, which reduces lock contention and the like.

The scheme also significantly boosts performance and helps improve latency because the system can cope with incoming packets and switch to throughput mode when a backlog builds. Since the system does this in milliseconds, it can deal with bursts very effectively.

18.9.3. Interrupt Load Spreading

The ability of a GLDv3-based device driver to dynamically switch between interrupt and polling mode helps boost both 1-Gbit and 10-Gbit NIC performance. It also ensures that a system will never suffer from the interrupt live-lock problem. But this mechanism by itself doesn't spread the load to multiple CPUs when they are available. For that purpose, GLDv3 provides a soft ring facility that is controlled by IP. Basically, IP asks the GLDv3 device driver to create n soft rings, each of which has its own worker thread to deliver the packets to IP. The soft ring worker threads are bound to the same squeue (CPU) that owns the soft ring. The NIC still interrupts the CPU as needed, but very early in the receive path, the packets are sent to one of the soft rings according to a load spreading policy (a hash of the source IP address, for example). Each ring is controlled by an squeue that either pulls the packet chain from the ring (the poll mode) or lets the ring send the packet up (the interrupt mode), as illustrated in Figure 18.9.

Figure 18.9. Interrupt Load Distribution
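A toy model of the soft ring arrangement may help fix the idea. It is only a sketch: the `SoftRing` class, its `polling` flag, and the modulo fanout are assumptions made for illustration and do not correspond to GLDv3 data structures. Each ring belongs to one CPU's squeue, and a ring in poll mode holds its packets until the squeue pulls the whole chain.

```python
import ipaddress
from collections import deque

class SoftRing:
    def __init__(self, cpu):
        self.cpu = cpu          # CPU whose squeue owns this ring
        self.polling = False    # True when the squeue pulls packets itself
        self.queue = deque()    # packets waiting to be pulled
        self.delivered = []     # packets handed up to IP

    def receive(self, pkt):
        if self.polling:
            self.queue.append(pkt)      # wait for the squeue to pull
        else:
            self.delivered.append(pkt)  # ring sends the packet up itself

    def poll(self):
        """Squeue pulls the entire chain from the ring in one move."""
        chain = list(self.queue)
        self.queue.clear()
        self.delivered.extend(chain)
        return chain

def fanout(src_ip, rings):
    """Pick a ring from the source address (hypothetical spreading policy)."""
    return rings[int(ipaddress.ip_address(src_ip)) % len(rings)]

rings = [SoftRing(cpu) for cpu in range(4)]
rings[2].polling = True  # pretend the squeue on CPU 2 has a backlog

for src, tag in [("10.0.0.1", "a"), ("10.0.0.2", "b"), ("10.0.0.2", "c")]:
    fanout(src, rings).receive((src, tag))

chain = rings[2].poll()
```

Here the packet from 10.0.0.1 is sent up immediately by ring 1, while the two packets from 10.0.0.2 wait on ring 2 and are delivered together when its squeue polls.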

The stack accounts for hardware features such as multicore processors, hardware hyperthreading, and hyperthreading on multiple cores. Where possible, the soft rings are controlled by squeues of hardware strands on the same core, giving better cache affinity.