3.9. Preemption

We mentioned preemption several times in the preceding text. First, a quick review of what preemption is.

The kernel preempts a thread running on a processor when a higher-priority thread is inserted onto a dispatch queue. The thread is effectively forced to reschedule itself and surrender the processor before having used up its time quantum. Two types of preemption conditions are implemented, a user preemption and a kernel preemption, distinguished by the priority level of the newly runnable thread, which determines how quickly the preemption will take place.

A user preemption occurs if a thread is placed on a dispatch queue and the thread has a higher priority than the thread currently running on the processor associated with the queue but has a lower priority than the minimum required for a kernel preemption. A kernel preemption occurs when a thread is placed on a dispatch queue with a priority higher than kpreemptpri, which is set to 100, representing the lowest global dispatch priority for an RT class thread. RT and interrupt threads have global priorities greater than kpreemptpri.
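To make the distinction concrete, here is a minimal sketch, not actual kernel source, of the priority comparison that selects the preemption type. The function and type names are placeholders, and the comparison against kpreemptpri reflects the RT priority range described above (global priorities 100 and up).

/*
 * Illustrative sketch only -- not the kernel's cpu_resched() code.
 * kpreemptpri (100) is the lowest RT-class global priority.
 */
#define KPREEMPTPRI     100

typedef enum { NO_PREEMPT, USER_PREEMPT, KERNEL_PREEMPT } preempt_type_t;

static preempt_type_t
preempt_type(int new_pri, int running_pri)
{
        if (new_pri >= KPREEMPTPRI)     /* RT or interrupt priority range */
                return (KERNEL_PREEMPT);
        if (new_pri > running_pri)      /* beats the on-processor thread  */
                return (USER_PREEMPT);
        return (NO_PREEMPT);
}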

User preemption enables higher-priority threads to get processor time expediently. Kernel preemption is necessary for support of real-time threads. Traditional real-time support in UNIX systems was built on a kernel with various preemption points, allowing a real-time thread to displace the kernel at a few well-defined preemptable places. The Solaris implementation goes the next step and implements a preemptable kernel with a few non-preemption points. In critical code paths, Solaris temporarily disables kernel preemption for a short period and enables it when the critical path has completed. Kernel preemption is disabled for very short periods in the thread_create() code during the pause_cpus() routine and in a few memory management (MMU) code paths, such as when a hardware address translation (HAT) is being set up.
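Those critical paths bracket their work with the kernel's kpreempt_disable() and kpreempt_enable() interfaces. The following sketch shows the usage pattern; do_hat_setup() is a hypothetical stand-in for the protected work, and the prototypes are declared only to keep the example self-contained.

/* Sketch of the disable/enable bracket around a short critical path. */
extern void kpreempt_disable(void);     /* increments curthread->t_preempt */
extern void kpreempt_enable(void);      /* decrements it; may preempt now  */
extern void do_hat_setup(void);         /* hypothetical critical work      */

void
example_critical_path(void)
{
        kpreempt_disable();     /* kernel preemption now blocked  */
        do_hat_setup();         /* e.g., setting up a HAT mapping */
        kpreempt_enable();      /* pending preemption can occur   */
}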

Preemptions are flagged through fields in the per-processor cpu structure: cpu_runrun and cpu_kprunrun. cpu_runrun flags a user preemption; it is set when a thread inserted into a dispatch queue has a higher priority than the running thread but a lower priority than kpreemptpri. cpu_kprunrun flags a kernel preemption. We saw in the cpu_resched() code one example of where these flags get set. The runrun flags can also get set in the following kernel routines.

  • cpupart_move_cpu(). When a processor set configuration is changed and a processor is moved from a processor set, the runrun flags are set to force a preemption so the threads running on the processor being moved can be moved to another processor in the set they've been bound to. Note that if only one processor is left in the set and there are bound threads, the processor set cannot be destroyed until any bound threads are first unbound.

  • cpu_surrender(). A thread is surrendering the processor it's running on. Recall from the section on thread priorities that cpu_surrender() is called following a thread's priority change and a test to determine if preemption conditions exist. Entering cpu_surrender() means a preemption condition has been detected and is the first step in a kthread giving up a processor in favor of a higher-priority thread.

    Two other areas of the kernel that potentially call cpu_surrender() are the priority inheritance code and the processor support code that handles the binding of a thread to a processor. The conditions under which the priority inheritance code calls cpu_surrender() are the same as previously described, that is, a priority test determined that a preemption is warranted. The thread binding code forces a preemption through cpu_surrender() when a thread is bound to a processor in a processor set and the processor the thread is currently executing on is not part of the processor set the thread was just bound to. This is the only case in which a preemption that is not the result of a priority test is forced.

    cpu_surrender() sets the cpu_runrun flag and sets cpu_kprunrun if the preemption priority is greater than kpreemptpri. On a multiprocessor system, if the processor executing the cpu_surrender() code is different from the processor that needs to preempt its thread, then a cross-call is sent to the processor that needs to be preempted, forcing it into a trap handler. At that point the runrun flags are tested. The other possible condition is one in which the processor executing the cpu_surrender() code is the same processor that must preempt the current thread, in which case it will test the runrun flags before returning to user mode; thus, the cross-call is not needed. In other words, the processor is already in the kernel because it is running the cpu_surrender() kernel routine, so a cross-call would be superfluous.
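The following simplified sketch pulls the cpu_surrender() behavior just described into code form. The cpu structure is pared down, and curcpu() and cross_call_preempt() are assumed helpers standing in for the real current-CPU and cross-call mechanisms; this is not the actual kernel source.

/* Illustrative sketch of the cpu_surrender() logic. */
#define KPREEMPTPRI     100

struct cpu {
        int     cpu_id;
        int     cpu_runrun;             /* user preemption pending   */
        int     cpu_kprunrun;           /* kernel preemption pending */
};

extern struct cpu *curcpu(void);                /* assumed helper */
extern void cross_call_preempt(struct cpu *);   /* assumed helper */

static void
cpu_surrender_sketch(struct cpu *cp, int preempt_pri)
{
        cp->cpu_runrun = 1;
        if (preempt_pri >= KPREEMPTPRI)
                cp->cpu_kprunrun = 1;
        /*
         * If the target CPU is not the one running this code, poke it
         * with a cross-call so it traps and tests the runrun flags.
         * Otherwise the flags are tested on the way back to user mode.
         */
        if (cp != curcpu())
                cross_call_preempt(cp);
}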

Once the preemption condition has been detected and the appropriate runrun flag has been set in the processor's CPU structure, the kernel must enter a code path that tests the runrun flags before the actual preemption occurs. This happens in different areas of the kernel for user versus kernel preemptions. cpu_runrun is tested for a user preemption when the kernel returns from a trap or interrupt handler; cpu_kprunrun is tested for a kernel preemption when a dispatcher lock is released.

The trap code that executes after the main trap or interrupt handler has completed tests cpu_runrun, and if it is set, calls the kernel preempt() function. preempt() tests two conditions initially. If the thread is not running on a processor (thread state is not ONPROC) or if the thread's dispatch queue pointer is referencing a queue for a processor other than the processor currently executing, then no preemption is necessary and the code falls through and simply clears a dispatcher lock.

Consider the two test conditions. If the thread is not running (the first test), then obviously it does not need to be preempted. If the thread's t_disp_queue pointer is referencing a dispatch queue for a different processor (different from the processor currently executing the preempt() code), then clearly the thread has already been placed on another processor's queue, so that condition also obviates the need for a preemption.

If the conditions just described are not true, preempt() increments the LWP's lrusage structure nicsw counter, which counts the number of involuntary context switches. The processor's inv_switch counter is also incremented in the cpu_sysinfo structure, which counts involuntary context switches processor-wide, and the scheduling-class-specific preempt code is called. The per-processor counters are available with mpstat(1M), reflected in the icsw column.
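Putting the last few paragraphs together, the generic preempt() flow looks roughly like the sketch below. The thread structure, counters, and helpers are simplified placeholders; the real routine also manipulates dispatcher locks, which are omitted here, and the final call into the dispatcher is described later in this section.

/* Illustrative sketch of the generic preempt() flow. */
typedef enum { TS_ONPROC, TS_RUN, TS_SLEEP } tstate_t;

struct dispq;                                   /* per-CPU dispatch queue   */

struct kthread {
        tstate_t        t_state;
        struct dispq    *t_disp_queue;          /* queue thread is tied to  */
        long            lwp_nicsw;              /* involuntary switch count */
        void            (*t_cl_preempt)(struct kthread *); /* class hook    */
};

extern struct dispq *this_cpu_queue(void);      /* assumed helper           */
extern void swtch(void);                        /* enter the dispatcher     */

static void
preempt_sketch(struct kthread *t)
{
        /* Not running here, or already queued for another CPU: nothing to do. */
        if (t->t_state != TS_ONPROC || t->t_disp_queue != this_cpu_queue())
                return;

        t->lwp_nicsw++;         /* per-LWP involuntary context switches */
        t->t_cl_preempt(t);     /* ts_preempt(), rt_preempt(), ...      */
        swtch();                /* let the dispatcher pick a new thread */
}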

The class-specific preempt code (xx_preempt(), where xx is the class prefix, such as ts or rt) prepares the thread for placement on a dispatch queue and calls either setfrontdq() or setbackdq() for the actual queue insertion. xx_preempt() checks whether the thread is in kernel mode and whether the kernel-priority-requested flag (t_kpri_req) in the thread structure is set. If it is set, the thread's priority is set to the lowest SYS class priority (typically 60). The t_trapret and t_astflag kthread flags are set, causing the xx_trapret() function to run when the thread returns to user mode (from kernel mode). At that point, the thread's priority is set back to something in the thread's priority range. xx_preempt() also tests for a scheduler activation on the thread. If an activation has been enabled, the thread has not avoided preemption beyond the threshold of two clock ticks, and the thread is not in kernel mode, then the thread's priority is set to the highest user-mode priority (59) and the thread is placed at the front of a dispatch queue with setfrontdq().

If the thread's xxBACKQ flag is set, the preemption is due to time-slice expiration and the thread should be placed at the back of a dispatch queue with setbackdq(). (Recall that xx_tick() calls cpu_surrender() in that case.) The thread's t_dispwait field is zeroed, and a new time quantum is set in xx_timeleft from the dispatch table before setbackdq() is called. If xxBACKQ is not set, a real preemption occurred (a higher-priority thread became runnable), and the thread is placed at the front of a dispatch queue.
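The back-of-queue versus front-of-queue decision can be summarized in the following sketch. Field and flag names are simplified from the description above and are not the real tsproc/rtproc members; setbackdq() and setfrontdq() are the queue-insertion routines already discussed.

/* Illustrative sketch of the class-specific preempt queue decision. */
#define XXBACKQ 0x01    /* time slice expired; go to the back of the queue */

struct clthread {
        int     flags;
        int     dispwait;
        int     timeleft;       /* remaining time quantum (ticks)       */
        int     new_quantum;    /* refill value from the dispatch table */
};

extern void setbackdq(struct clthread *);       /* insert at back  */
extern void setfrontdq(struct clthread *);      /* insert at front */

static void
class_preempt_sketch(struct clthread *t)
{
        if (t->flags & XXBACKQ) {
                /* Preempted because the time quantum expired. */
                t->dispwait = 0;
                t->timeleft = t->new_quantum;
                t->flags &= ~XXBACKQ;
                setbackdq(t);
        } else {
                /* A higher-priority thread became runnable. */
                setfrontdq(t);
        }
}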

The rt_preempt() code is less complex. If RTBACKQ is true, the preemption was due to a time quantum expiration (as was the case previously) and setbackdq() is called to place the thread at the back of a queue after setting the rt_timeleft value from rt_pquantum. Otherwise, the thread is placed at the front of a dispatch queue with setfrontdq().

The class-specific preempt code, once completed, returns to the generic preempt() routine, which then enters the dispatcher by calling swtch(). We look at the swtch() code in the next section.

Recall that kernel preemption is detected when a dispatcher lock is released. It is also tested for in kpreempt_enable(), which reenables kernel preemption after kpreempt_disable() blocked preemptions for a short time. The goal is to have kernel preemptions detected and handled more expediently (with less latency) than user preemptions.

Because the test for a kernel preemption, cpu_kprunrun, is placed in the disp_lock_exit() code, the detection happens synchronously with respect to other thread scheduling and queue activity. The dispatcher locks are acquired and freed at various points in the dispatcher code, either directly through the dispatcher lock interfaces or indirectly through macro calls. For example, each kernel thread maintains a pointer to a dispatcher lock, which serves to lock the thread and the queue during dispatcher functions. The THREAD_LOCK and THREAD_UNLOCK macros use the dispatcher lock entry and exit functions. The key point is that a kernel preemption will be detected before a processor running a thread flagged for preemption completes a pass through the dispatcher.

When disp_lock_exit() is entered, it tests whether cpu_kprunrun is set; if so, then disp_lock_exit() calls kpreempt(). A clear cpu_kprunrun flag indicates that a kernel preemption is not pending, so there is no need to call kpreempt(). Kernel preemptions are handled by the kpreempt() code, represented here in pseudocode.
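A minimal sketch of that call site, with the lock type and release primitive reduced to placeholders, looks like this; the -1 argument denotes a synchronous preemption, as described later in this section.

/* Illustrative sketch of the kernel-preemption check in disp_lock_exit(). */
struct cpu { int cpu_kprunrun; };

extern struct cpu *curcpu(void);        /* assumed helper           */
extern void lock_clear(void *lp);       /* drop the dispatcher lock */
extern void kpreempt(int asyncspl);

static void
disp_lock_exit_sketch(void *lp)
{
        lock_clear(lp);                 /* release the lock first         */
        if (curcpu()->cpu_kprunrun)     /* kernel preemption pending?     */
                kpreempt(-1);           /* -1 == synchronous preemption   */
}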

kpreempt()
        if (current_thread->t_preempt)
                do statistics
                return
        if (current_thread NOT running) OR (current_thread NOT on this CPU's queue)
                return
        if (current PIL >= LOCK_LEVEL)
                return
        block kernel preemption (increment current_thread->t_preempt)
        call preempt()
        enable kernel preemption (decrement current_thread->t_preempt)


The preceding pseudocode summarizes at a high level what happens in kpreempt(). Kernel threads have a t_preempt flag, which, if set, signifies that the thread is not to be preempted. This flag is set in some privileged threads, such as a processor's idle and interrupt threads. Kernel preemption is disabled by incrementing t_preempt in the current thread and is reenabled by decrementing t_preempt. kpreempt() tests t_preempt in the current thread; if t_preempt is set, kpreempt() increments a statistics counter and returns without performing a kernel preemption.

The second test is similar in logic to what happens in the preempt() code previously described. If the thread is not running or is not on the current processor's dispatch queue, there's no need to preempt.

The third test checks the priority level of the processor. If we're running at a high PIL, we cannot preempt the thread, since it may be holding a spin lock. Preempting a thread holding a dispatcher spin lock could result in a deadlock situation.

Any of the first three test conditions evaluating true causes kpreempt() to return without actually doing a preemption. Assuming the kernel goes ahead with the preemption, kernel preemptions are disabled (to prevent nested kernel preemptions) and the preempt() function is called. Once preempt() completes, kernel preemption is enabled and kpreempt() is done.

Kernel statistical data is maintained for kernel preemption events in the form of a kpreempt_cnts structure.

struct kpreempt_cnts {  /* kernel preemption statistics */
        int     kpc_idle;       /* executing idle thread */
        int     kpc_intr;       /* executing interrupt thread */
        int     kpc_clock;      /* executing clock thread */
        int     kpc_blocked;    /* thread has blocked preemption (t_preempt) */
        int     kpc_notonproc;  /* thread is surrendering processor */
        int     kpc_inswtch;    /* thread has ratified scheduling decision */
        int     kpc_prilevel;   /* processor interrupt level is too high */
        int     kpc_apreempt;   /* asynchronous preemption */
        int     kpc_spreempt;   /* synchronous preemption */
}       kpreempt_cnts;
                        See usr/src/uts/sun4/os/trap.c or usr/src/uts/i86pc/os/trap.c


The kpreempt_cnts data is not accessible with a currently available Solaris command, but can be read with mdb(1).

# mdb -k
> ::nm -x !grep kpreempt_cnts
0xfffffffffbc2e7d0|0x0000000000000024|OBJT |GLOB |0x0  |16      |kpreempt_cnts
> 0xfffffffffbc2e7d0::print -d struct kpreempt_cnts
{
    kpc_idle = 0
    kpc_intr = 0t1668
    kpc_clock = 0
    kpc_blocked = 0t12
    kpc_notonproc = 0
    kpc_inswtch = 0
    kpc_prilevel = 0
    kpc_apreempt = 0t1595
    kpc_spreempt = 0t128
}


Most of the kpreempt_cnts counters are described well by the comments in the source code listing. The last two counters, asynchronous preemption and synchronous preemption, count each of the two possible methods of kernel preemption. The kpreempt() function is passed one argument, asyncspl. For asynchronous preempts, asyncspl is a priority-level argument, and the kpreempt() code raises the PIL as dictated by the value passed. Synchronous preempts pass a -1 argument and do not change the processor's priority level. In the case of both user and kernel preemption, the code ultimately executes the preempt() function, which, as a last step, enters the dispatcher swtch() routine.
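As a rough illustration of how the asyncspl argument might be handled, here is a sketch assuming splr()/splx()-style PIL primitives; this is not the actual trap.c code.

/* Illustrative sketch of kpreempt()'s handling of its asyncspl argument. */
#define KPREEMPT_SYNC   (-1)    /* synchronous preempt: leave the PIL alone */

extern int  splr(int new_pil);  /* raise PIL, return previous (assumed) */
extern void splx(int pil);      /* restore a saved PIL (assumed)        */
extern void preempt(void);

static void
kpreempt_pil_sketch(int asyncspl)
{
        int old_pil = -1;

        if (asyncspl != KPREEMPT_SYNC)
                old_pil = splr(asyncspl);       /* async: raise the PIL */

        preempt();                              /* both paths end here  */

        if (old_pil != -1)
                splx(old_pil);                  /* restore the old PIL  */
}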

DTrace also provides several probes for tracking preemption activity. We saw one example in Section 3.7.3.6, where we tracked which thread was preempted and which thread initiated the preemption. The DTrace FBT provider manages probes at the entry and return points of the class-specific preempt functions, and the sched provider manages a preempt probe, which fires immediately before the current thread is preempted. Here's a simple example.

# dtrace -qn 'sched:unix:preempt:preempt { @s[execname,tid]=count() }'
^C
  thrds-sp                                              89269                 1
  thrds-sp                                                  1                 1
  dtrace                                                    1                 1
. . .
  thrds-sp                                              89246                 4
  thrds-sp                                              89260                 4
. . .
  thrds-sp                                              89256                 5
  thrds-sp                                              89264                 5


Using a simple dtrace command line with the count aggregation, we can get a snapshot of which threads in which processes are getting preempted.

A final note about preemptions and context switching. The system tracks two categories of context switches: voluntary and involuntary. A voluntary context switch occurs when a thread issues a blocking system call and goes to sleep. An involuntary context switch occurs when a thread has been preempted. The subtlety is that a thread may be preempted for one of two reasons: a higher-priority thread came along, or the running thread used up its time quantum. Time quantum expiration uses the preempt mechanism to nudge a thread off the CPU. These context switch rates can be tracked with mpstat(1M) by watching the csw and icsw columns. If icsw rates (involuntary) are high, it is interesting to know how many relate to time-quantum expiration versus the arrival of a higher-priority thread. Here's a DTrace script that decomposes involuntary context switches.

#!/usr/sbin/dtrace -Zqs

long inv_cnt;   /* all involuntary context switches */
long tqe_cnt;   /* time quantum expiration count    */
long hpp_cnt;   /* higher-priority preempt count    */
long csw_cnt;   /* total number of context switches */

dtrace:::BEGIN
{
        inv_cnt = 0; tqe_cnt = 0; hpp_cnt = 0; csw_cnt = 0;
        printf("%-16s %-16s %-16s %-16s\n","TOTAL CSW","ALL INV","TQE_INV","HPP_INV");
        printf("==========================================================\n");
}

sysinfo:unix:preempt:inv_swtch
{
        inv_cnt += arg0;
}

sysinfo:unix::pswitch
{
        csw_cnt += arg0;
}

fbt:TS:ts_preempt:entry
/ ((tsproc_t *)args[0]->t_cldata)->ts_timeleft <= 1 /
{
        tqe_cnt++;
}

fbt:TS:ts_preempt:entry
/ ((tsproc_t *)args[0]->t_cldata)->ts_timeleft > 1 /
{
        hpp_cnt++;
}

fbt:RT:rt_preempt:entry
/ ((rtproc_t *)args[0]->t_cldata)->rt_timeleft <= 1 /
{
        tqe_cnt++;
}

fbt:RT:rt_preempt:entry
/ ((rtproc_t *)args[0]->t_cldata)->rt_timeleft > 1 /
{
        hpp_cnt++;
}

tick-1sec
{
        printf("%-16d %-16d %-16d %-16d\n",csw_cnt,inv_cnt,tqe_cnt,hpp_cnt);
        inv_cnt = 0; tqe_cnt = 0; hpp_cnt = 0; csw_cnt = 0;
}


Note that the script enables probes at the class-specific preempt functions. If you have active FX and/or FSS class threads, you need to add those probes, along with the correct predicate and the appropriate structure name. The script provides per-second counters.

solaris10> ./c.d
TOTAL CSW        ALL INV          TQE_INV          HPP_INV
==========================================================
6147             1294             233              1066
9199             1193             141              1055
9886             846              186              661
4940             658              128              531
7359             702              149              553
4892             874              134              742
5504             846              152              699
6994             972              183              790
11835            1041             210              837
7507             1018             209              815
. . .


In this example, the number of involuntary switches is much smaller than the number of voluntary switches (compare the TOTAL CSW and ALL INV columns); of the involuntary switches, most are due to a higher-priority thread becoming runnable. If we saw a high rate of time quantum expirations (TQE_INV), we could consider increasing the time quanta in the dispatch table for the appropriate scheduling class.



