3.11. Interrupts

Understanding interrupts and what happens when an interrupt is generated are important components of the big dispatcher picture. A running thread gets pinned for a short period when the CPU on which it is running fields an interrupt. Additionally, the dispatcher code contains many conditional tests to determine whether a CPU is running an interrupt thread and takes a different code path depending on whether that condition is true.

An interrupt is the mechanism that a hardware device or a component of the kernel-at-large (through software interrupts) can use to interrupt the current execution flow and force a CPU into running an interrupt handler. The hardware device interrupt scenario is generally well known: a host bus adapter (HBA) for disk I/O generates interrupts on I/O completion, or a network interface card (NIC) generates interrupts for incoming network packets. The Solaris kernel programs an internal clock to generate an interrupt every 10 milliseconds to enter a clock interrupt handler and perform some housekeeping chores in the kernel. An interrupt can be initiated by software as well. A common use on Solaris multiprocessor systems is the cross-call mechanism, a facility whereby one CPU can send an interrupt to one or more of the other CPUs on the system (or to all of them) to force the CPU into a handler to take a specific action. The preemption mechanism uses cross-calls to force a CPU out of its current flow of execution so that the thread can be preempted.

Interrupts are directed to specific processors, and on reception, a processor stops executing the current thread (see Figure 3.14). The current thread is pinned, and the interrupt thread is allowed to execute. When the interrupt thread completes, the interrupted thread is unpinned and resumes execution. This allows interrupts to be processed quickly, since a full context switch is not required. If the interrupt thread blocks, it is given full thread state and placed on a sleep queue, and the interrupted thread is unpinned. Kernel threads handle all but high-priority interrupts. Consequently, the kernel can minimize the amount of time spent holding critical resources, thus providing better scalability of interrupt code and lower overall interrupt response time.

Figure 3.14. Process, Interrupt, and Kernel Threads


3.11.1. Interrupt Priorities

Solaris assigns priorities to interrupts to allow overlapping interrupts to be handled with the correct precedence; for example, a network interrupt can be configured to have a higher priority than a disk interrupt.

The kernel implements 15 interrupt priority levels: level 1 through level 15, where level 15 is the highest priority level. On each processor, the kernel can mask interrupts at or below a given priority level by setting the processor's interrupt level. Setting the interrupt level blocks all interrupts at the specified level and lower. That way, when the processor is executing a level 9 interrupt handler, it does not receive interrupts at level 9 or below; it handles only higher-priority interrupts.

Interrupts that occur with a priority level at or lower than the processor's interrupt level are temporarily ignored. An interrupt is not acknowledged by a processor until the processor's interrupt level drops below the level of the pending interrupt. More important interrupts are assigned a higher priority level, giving them a better chance of being serviced than lower-priority interrupts.

Figure 3.15 illustrates interrupt priority levels.

Figure 3.15. Interrupt Priority Levels


3.11.2. Interrupts as Threads

Interrupt priority levels can synchronize access to critical sections used by interrupt handlers. By raising the interrupt level, a handler can ensure exclusive access to data structures for the specific processor that has elevated its priority level. This is in fact what early, uniprocessor implementations of UNIX systems did for synchronization.

But masking out interrupts to ensure exclusive access is expensive; it blocks other interrupt handlers from running for a potentially long time, which could lead to data loss if interrupts are lost because of overrun. (An overrun condition is one in which the volume of interrupts awaiting service exceeds the system's ability to queue the interrupts.) In addition, interrupt handlers using priority levels alone cannot block, since a deadlock could occur if they are waiting on a resource held by a lower-priority interrupt.

For these reasons, the Solaris kernel implements most interrupts as asynchronously created and dispatched high-priority threads. This implementation allows the kernel to overcome the scaling limitations imposed by interrupt blocking for synchronizing data access and thus provides low-latency interrupt response times.

Interrupts at priority 10 and below are handled by Solaris threads. These interrupt handlers can then block if necessary, using regular synchronization primitives such as mutex locks. Interrupts, however, must be efficient, and it is too expensive to create a new thread each time an interrupt is received. For this reason, each processor maintains a pool of partially initialized interrupt threads, one for each of the lower 9 priority levels plus a systemwide thread for the clock interrupt. When an interrupt is taken, the interrupt uses the interrupt thread's stack, and only if it blocks on a synchronization object is the thread completely initialized. This approach allows simple, fast allocation of threads at the time of interrupt dispatch.

A typical scenario: An interrupt with priority 9 or less occurs (level 10 clock interrupts are handled slightly differently). When an interrupt occurs, the interrupt level is raised to the level of the interrupt to block subsequent interrupts at this level (and lower levels). The currently executing thread is interrupted and pinned to the processor. A thread for the priority level of the interrupt is taken from the pool of interrupt threads for the processor and is context-switched in to handle the interrupt.
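
On a live system, this machinery can be observed directly. A minimal sketch, assuming the mdb(1) kernel debugger is available (the exact output format varies by release): the ::cpuinfo dcmd in verbose mode displays, for each CPU, the thread currently running, the CPU's interrupt level, and which of its interrupt threads are active.

# echo "::cpuinfo -v" | mdb -k

A pinned-thread scenario like the one described above shows up in this output as an interrupt thread running at some IPL on a CPU while the interrupted thread remains attached to that CPU.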

The term pinned refers to a mechanism employed by the kernel that avoids context-switching out the interrupted thread. The executing thread is pinned under the interrupt thread. The interrupt thread "borrows" the LWP from the executing thread. While the interrupt handler is running, the interrupted thread is pinned to avoid the overhead of having to completely save its context; it cannot run on any processor until the interrupt handler completes or blocks on a synchronization object. Once the handler is complete, the original thread is unpinned and rescheduled.

If the interrupt handler thread blocks on a synchronization object (for example, a mutex or condition variable) while handling the interrupt, it is converted into a complete kernel thread capable of being scheduled. Control is passed back to the interrupted thread, and the interrupt thread remains blocked on the synchronization object. When the synchronization object is released, the interrupt thread becomes runnable and may preempt lower-priority threads to be rescheduled.

The processor interrupt level remains at the level of the interrupt, blocking lower-priority interrupts, even while the interrupt handler thread is blocked. This prevents lower-priority interrupt threads from interrupting the processing of higher-level interrupts. While interrupt threads are blocked, they remain bound to the processor on which they were initiated, guaranteeing that each processor will always have an interrupt thread available for incoming interrupts.

Level 10 clock interrupts are handled similarly, but since there is only one source of clock interrupt, there is a single, systemwide clock thread. Clock interrupts are discussed further in Section 19.1.

3.11.3. Interrupt Thread Priorities

Interrupts that are scheduled as threads share global dispatcher priorities with other threads. Interrupt threads use the top ten global dispatcher priorities, 160 to 169. Figure 3.8 shows the relationship of the interrupt dispatcher priorities to the other scheduling classes.

3.11.4. High-Priority Interrupts

Interrupts above priority 10 block out all lower-priority interrupts until they complete. For this reason, high-priority interrupts need to have an extremely short code path to prevent them from affecting the latency of other interrupt handlers and the performance and scalability of the system.

High-priority interrupt threads also cannot block; they can use only the spin variety of synchronization objects. This is due to the priority level the dispatcher uses for synchronization. Since the dispatcher runs at level 11, code running at higher interrupt levels cannot enter the dispatcher.

High-priority threads typically service the minimal requirements of the hardware device (the source of the interrupt), then post a lower-priority software interrupt to complete the required processing.

3.11.5. Interrupt Management

For some workloads, it may be desirable to partition the system such that application threads are isolated from CPUs handling interrupt traffic. Network-intensive applications, for example, can result in high interrupt rates to the CPU(s) handling the NIC interrupts. Application threads running on such a CPU may frequently be pinned, which can degrade performance, especially if the pinned threads are holding a critical resource such as a lock.

Using processor sets (psrset(1M)), we can partition off the interrupt load by creating a processor set for the application processes and using the psrset -f flag to disable interrupts to all the CPUs in the set. This forces the kernel to rebind the device interrupts to the remaining CPUs (those not in the user-defined set). With the application processes bound to the user-defined set, they will execute on CPUs not fielding interrupts. This method should of course be tested before being applied to a production workload.
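
A minimal sketch of that sequence, assuming a 4-CPU system where CPUs 2 and 3 are set aside for the application (the set ID and process ID shown are hypothetical):

# psrset -c 2 3
# psrset -f 1
# psrset -b 1 12345

The first command creates a processor set containing CPUs 2 and 3 (the kernel assigns the set ID, assumed here to be 1), the second disables interrupts for the CPUs in set 1, and the third binds process ID 12345 to the set. Alternatively, psrset -e 1 command runs a command directly in the set.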

3.11.6. Interrupt Monitoring

You can use the mpstat(1M) and vmstat(1M) commands to monitor interrupt activity on a Solaris system. mpstat(1M) provides interrupts-per-second for each CPU in the intr column and interrupts handled on an interrupt thread (low-level interrupts) in the ithr column.

Solaris 10 added an intrstat(1M) command, which displays interrupt-to-CPU bindings, interrupt rates, and time spent handling interrupts. For example, looking at mpstat(1M), we can observe interrupt rates to CPUs.

# mpstat 1
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0    0   557  217  621    0    3    0    0   475    0   1   0  99
  1    0   0    0 12971 12962 25457   0    3    0    0 25459    2   8   0  90
  2    8   0    0    15    0    24    0    0    0    0    57    0   0   0 100
  3    0   0    0     2    1     0    0    0    0    0     0    0   0   0 100
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0    0   442  216  394    1    6    0    0   255    0   0   0 100
  1    0   0    0 13031 13023 25807   0    3    1    0 25792    1   8   0  91
  2    0   0    0     7    0    11    0    0    0    0    28    0   2   0  98
  3    0   0    0     1    0     0    0    0    0    0     0    0   0   0 100


In this example, CPU 1 is taking the highest rate of interrupts, which means it is likely that CPU 1 has device interrupt bindings. Using intrstat(1M), we see

# intrstat
      device |      cpu0 %tim      cpu1 %tim      cpu2 %tim       cpu3 %tim
-------------+------------------------------------------------------------
       ata#1 |         0  0.0         4  0.0         0  0.0          0  0.0
       bge#0 |         1  0.0         0  0.0         0  0.0          0  0.0
       mpt#0 |         0  0.0     12661  4.8         0  0.0          0  0.0

      device |      cpu0 %tim      cpu1 %tim      cpu2 %tim       cpu3 %tim
-------------+------------------------------------------------------------
       ata#1 |         0  0.0         0  0.0         0  0.0          0  0.0
       bge#0 |         6  0.0         0  0.0         0  0.0          0  0.0
       mpt#0 |         0  0.0     12630  4.7         0  0.0          0  0.0


The intrstat(1M) data shows us that device mpt#0 (mpt is the driver name; #0 refers to instance 0 of the device) is generating interrupts to CPU 1, which is spending about 5% of its time handling mpt interrupts. If you're not sure what an mpt device is, a good place to start is finding a match in the kernel module description.

# modinfo | grep mpt
 29 fffffffffbb4a3d0  2f948 169   1  mpt (MPT HBA Driver v1.49)
#


Here we determined that the mpt device is our host bus adapter (HBA), which is a disk interface. In this example, we clearly have a respectable rate of disk I/O traffic. We would use iostat(1M) in conjunction with the DTrace io provider to determine precisely which processes are generating the I/O and which files are receiving the I/O traffic.
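
A minimal DTrace sketch for the process side, using the io provider's start probe (note that fi_pathname may read <none> for some I/O, and buffered writes flushed by the kernel may be attributed to sched rather than the originating process):

# dtrace -n 'io:::start { @[execname, args[2]->fi_pathname] = count(); }'

This counts I/O requests by the name of the process on the CPU and the file being accessed, which usually narrows the source of the disk traffic quickly.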

3.11.7. Interprocessor Interrupts and Cross-Calls

The kernel can send an interrupt or trap to another processor when it requires another processor to do some immediate work on its behalf. Interprocessor interrupts are delivered through the poke_cpu() function; they are used for the following purposes:

  • Preempting the dispatcher. A thread may need to signal a thread running on another processor to enter kernel mode when a preemption is required (initiated by a clock or timer event) or when a synchronization object is released. Preemption is discussed in detail in Section 3.9.

  • Delivering a signal. The delivery of a signal may require interrupting a thread on another processor.

  • Starting/stopping /proc threads. The /proc infrastructure uses interprocessor interrupts to start and stop threads on different processors.
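To see which kernel code paths are sending interprocessor interrupts, a hedged DTrace sketch using the fbt provider (assuming poke_cpu() is instrumentable by fbt on your platform and release):

# dtrace -n 'fbt::poke_cpu:entry { @[stack()] = count(); }'

This aggregates the kernel stacks leading into poke_cpu(), showing which of the activities above is generating the interrupts.
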

Using a similar mechanism, the kernel can also instruct a processor to execute a specific low-level function by issuing a processor-to-processor cross-call. Cross-calls are typically part of the processor-dependent implementation. UltraSPARC kernels use cross-calls for two purposes:

  • Implementing interprocessor interrupts. As discussed above.

  • Maintaining virtual memory translation consistency. Maintaining translation consistency on SMP platforms requires the translation entries to be removed from the MMU of each CPU that a thread has run on when a virtual address is unmapped. On UltraSPARC, user processes issuing an unmap operation make a cross-call to each CPU on which the thread has run, to remove the TLB entries from each processor's MMU. Address space unmap operations within the kernel address space make a cross-call to all processors for each unmap operation.

Both cross-calls and interprocessor interrupts are reported by mpstat(1M) in the xcal column as cross-calls per second.

# mpstat 3
CPU minf mjf xcal   intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0    6    607  246 1100  174   82   84    0  2907   28   5   0  66
  1    0   0    2    218    0 1037  212   83   80    0  3438   33   4   0  62


High numbers of reported cross-calls can result from either of the activities mentioned in the preceding section; most commonly, from kernel address space unmap activity caused by file system activity.

Once again, we can use DTrace to root out the source of cross-calls.

# dtrace -n 'xcalls { @[stack()]=count()}'
dtrace: description 'xcalls ' matched 3 probes
^C
. . .
              SUNW,UltraSPARC-II`send_one_mondo+0x20
              SUNW,UltraSPARC-II`send_mondo_set+0x1c
              unix`xt_some+0xc4
              unix`xt_sync+0x3c
              unix`hat_unload_callback+0x808
              unix`bp_mapout+0x74
              genunix`biowait+0xb0
              ufs`ufs_putapage+0x400
              ufs`ufs_putpages+0x2a4
              genunix`segmap_release+0x300
              ufs`ufs_diraddentry+0x2e4
              ufs`ufs_direnter_cm+0x2a8
              ufs`ufs_create+0x254
              genunix`fop_create+0x38
              genunix`vn_createat+0x550
              genunix`vn_openat+0x130
              genunix`copen+0x260
              unix`syscall_trap+0xac
            15848


In the example, we cut all but the last kernel stack; since the DTrace count() aggregating function generates output in ascending order, the last entry is the aggregation key (in this case, the kernel stack) that occurred most frequently during the sampling period. The kernel stack shown indicates a lot of cross-call traffic from the UFS I/O code path, which uses segmap for page caching. We can see segmap_release on the stack, followed by the bp_mapout and hat_unload_callback functions. Without digressing into too many details, the cross-calls are due to segmap activity and the need to push pages out. This requires unmapping the page (handled by the HAT layer) and generating cross-call activity (xt_sync, xt_some on the stack) to maintain MMU-level coherence across the processors.
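
If the kernel stacks point at a particular subsystem but not a particular workload, a simple variation on the same one-liner aggregates by process name instead (execname reflects the thread on the CPU when the probe fires):

# dtrace -n 'xcalls { @[execname] = count(); }'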



