3.8. Dispatcher Functions

The dispatcher's primary functions are to decide which runnable thread gets executed next, to manage the context switching of threads on and off processors, and to provide a mechanism for inserting into a dispatch queue kthreads that become runnable. Other dispatcher functions handle initialization and scheduling class loading, the lock functions previously discussed, preemption, and support for user and administrative commands, such as dispadmin(1M) and priocntl(1), that monitor or change dispatcher-related behavior.

The main entry point into the dispatcher is through a call to swtch(), which finds the highest-priority runnable thread and context-switches it onto the target CPU. Several areas of the dispatcher subsystem, as well as the kernel at large, enter the dispatcher through swtch() to initiate a thread switch. Much of the work is performed by the core disp() code, which is called from swtch(). Queue insertion is handled by the setfrontdq() and setbackdq() functions for the per-processor dispatch queues, and setkpdq() for kernel preempt (kp) queues. These functions place a thread on a dispatch queue according to the thread's priority. Whether the thread is placed at the front or the back of the queue is determined before the queue insertion function is called. We first look at the queue insertion functions and then examine swtch().

3.8.1. Dispatcher Queue Management

The dispatcher queue functions insert and remove threads from the appropriate dispatch queue.

Table 3.2. Dispatch Queue Management Functions

Function        Description
setbackdq()     Insert a thread at the back of a dispatch queue
setfrontdq()    Insert a thread at the front of a dispatch queue
setkpdq()       Insert a real-time thread on the kernel preempt (kp) queue
dispdeq()       Remove a thread from its dispatch queue


The queue insertion functions are entered from various places in the dispatcher, but the majority of thread queue insertions are initiated from the following events:

  • Thread creation. The first time a thread is assigned to and inserted onto a dispatch queue is when it is created. The thread's scheduling class is inherited from the thread issuing the thread_create() call, and a class-specific setrun() function is entered after the thread's priority (which is also inherited) is set. Newly created threads inherit the current CPU. That is, the new thread's t_cpu is set to the CPU on which the thread_create() code is executing.

  • Thread sleep. A thread issuing a blocking system call is voluntarily switched off a CPU and inserted into a sleep queue.

  • Thread wake-up. On wake-up, the scheduling-class-specific wakeup functions call into the dispatcher to insert a newly awakened thread, moving it from a sleep queue to a dispatch queue.

  • Thread preemption. A preempted thread is context-switched off its CPU and inserted on a dispatch queue.

  • Thread priority change. Typically, a thread's priority changes frequently over its lifetime, and a priority change means a dispatcher queue change (since the queues are organized by CPU and priority).

Figure 3.11 illustrates the execution phases of a typical kernel thread as it is moved to and from dispatch queues and sleep queues.

Figure 3.11. Kernel Thread Queue Insertion


The function calls with the xx_ prefix are scheduling-class-specific functions. The thread yield (xx_yield) scenario occurs only when a yield call is issued programmatically in application code through thr_yield(3T). Preemption means that a thread is involuntarily context-switched off a processor in favor of a higher-priority thread, or that the executing thread has used up its time quantum; time-quantum expiration uses the preempt mechanism to force the context switch.
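For reference, here is a minimal, hedged user-level sketch of the programmatic yield just mentioned. Only thr_yield(3T) comes from the text; the do_work() function is a hypothetical placeholder.

/* Minimal user-level illustration of a programmatic yield.
 * do_work() is a hypothetical application function; thr_yield(3T)
 * voluntarily surrenders the CPU, entering the dispatcher via swtch(). */
#include <thread.h>

extern void do_work(void);      /* hypothetical application work */

void
polite_worker(void)
{
        for (;;) {
                do_work();
                thr_yield();    /* give other runnable threads a turn */
        }
}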

The queue insertion functions select a CPU for the thread to run on next. Which CPU's dispatch queue wins is driven by several factors:

  • The priority of the thread

  • The home lgroup of the thread

  • Whether or not the thread is bound (processor set or pbind)

  • Dynamic load balancing by the dispatcher code

We look at the queue insertion functions first and then summarize the algorithms they implement.

Here's a comment from the top of setbackdq(), along with several constants used in the function for load balancing.

/*
 * setbackdq() keeps runqs balanced such that the difference in length
 * between the chosen runq and the next one is no more than RUNQ_MAX_DIFF.
 * For threads with priorities below RUNQ_MATCH_PRI levels, the runq's lengths
 * must match.  When per-thread TS_RUNQMATCH flag is set, setbackdq() will
 * try to keep runqs perfectly balanced regardless of the thread priority.
 */
#define RUNQ_MATCH_PRI  16      /* pri below which queue lengths must match */
#define RUNQ_MAX_DIFF   2       /* maximum runq length difference */
#define RUNQ_LEN(cp, pri)       ((cp)->cpu_disp->disp_q[pri].dq_sruncnt)

                                        See usr/src/uts/common/disp/disp.c


setbackdq() is passed a pointer to the thread to be queued. If the thread is not bound to a CPU or processor set, the cpu_choose() function is called to select a CPU for the thread; if the thread is bound, CPU selection is easy.

/*
 * Select a CPU for this thread to run on.  Choose t->t_cpu unless:
 *      - t->t_cpu is not in this thread's assigned lgrp
 *      - the time since the thread last came off t->t_cpu exceeds the
 *        rechoose time for this cpu (ignore this if t is curthread in
 *        which case it's on CPU and t->t_disp_time is inaccurate)
 *      - t->t_cpu is presently the target of an offline or partition move
 *        request
 */
static cpu_t *
cpu_choose(kthread_t *t, pri_t tpri)
{
        ASSERT(tpri < kpqpri);

        if ((((lbolt - t->t_disp_time) > t->t_cpu->cpu_rechoose) &&
            t != curthread) || t->t_cpu == cpu_inmotion) {
                return (disp_lowpri_cpu(t->t_cpu, t->t_lpl, tpri, NULL));
        }

        /*
         * Take a trip through disp_lowpri_cpu() if the thread was
         * running outside its home lgroup
         */
        if (!klgrpset_ismember(t->t_lpl->lpl_lgrp->lgrp_set[LGRP_RSRC_CPU],
            t->t_cpu->cpu_lpl->lpl_lgrpid)) {
                return (disp_lowpri_cpu(t->t_cpu, t->t_lpl, tpri,
                    (t == curthread) ? t->t_cpu : NULL));
        }

        return (t->t_cpu);
}

                                        See usr/src/uts/common/disp/disp.c


The first conditional test in the code determines if the thread has been waiting longer than the cpu_rechoose value. Rechoose is a warm affinity mechanism that places threads back on the CPU they last ran on, thus potentially benefiting from a warm hardware cache. If too many cycles have passed since the thread last ran, the likelihood of finding a warm cache is diminished, so cpu_choose() just selects the next best CPU. The original systemwide rechoose_interval tunable still exists, but as part of the evolving CMT chip support, a per-CPU rechoose value, cpu_rechoose in the cpu_t, has been created. This value can be adjusted according to a per-chip rechoose adjustment value. Currently, cpu_rechoose is set to the rechoose_interval default value of 3.

If the thread's time on a dispatcher queue exceeds the CPU's rechoose value (3 ticks), then disp_lowpri_cpu() finds a CPU for the thread. Otherwise (moving down to the next conditional statement), if the thread is not a member of its home lgroup, disp_lowpri_cpu() likewise finds a CPU for the thread. cpu_choose() selects the thread's current t_cpu if neither conditional statement is true. Specifically, if the thread was waiting less than cpu_rechoose (warm affinity is still good) and the thread is a member of its home lgroup, the CPU the thread ran on last is the best CPU.

disp_lowpri_cpu() looks for the CPU running the lowest-priority thread. One of the arguments passed is the thread's current t_cpu pointer (the last CPU the thread ran on), which provides the starting point for the search. The thread's lgroup and priority are used to select a CPU; the code favors locality over priority for placement. disp_lowpri_cpu() walks the CPUs in the thread's partition in lgroup distance order, testing CPUs in the thread's home lgroup first, then the next furthest set of CPUs, and so on until all the CPUs in the partition have been considered. Within each lgroup, the best (lowest-priority) CPU is determined. When the code finds a CPU where the thread could run immediately (the thread's priority is higher than both the running thread and the highest-priority thread on that CPU's queue), the loop terminates and that CPU is selected.
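That search can be summarized in pseudocode, in the same style as the pseudocode fragments used later in this section. This is a simplified sketch of the hierarchical-lgroup case, not the verbatim disp_lowpri_cpu() code.

disp_lowpri_cpu(hint_cpu, lpl, tpri)
        bestcpu = hint_cpu; bestpri = highest possible priority
        for each lgroup lg in the thread's partition, nearest first
                for each CPU cp in lg
                        pri = max(cp->cpu_dispatch_pri,
                                  cp->cpu_disp->disp_maxrunpri)
                        if (pri < bestpri)
                                bestpri = pri; bestcpu = cp
                        if (tpri > bestpri)
                                return (bestcpu)   /* thread can run now */
        return (bestcpu)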

Note that the algorithm described in the previous paragraph applies to hierarchical lgroups, which were introduced in Solaris 10 1/06 (update 1) and OpenSolaris build 8. Prior to hierarchical lgroups, the disp_lowpri_cpu() loop walked the partition's CPU list, keeping track of the best local and remote CPUs. At the end of the loop, if the thread had a high enough priority to run immediately in the home lgroup, the best CPU was chosen from the home lgroup. Otherwise, the best CPU in the remote lgroup was selected.

Back in setbackdq(), a CPU has been selected, so it's time to see if some load balancing is required. The dispatcher code attempts to keep the length of the run queues closely balanced so that no one CPU has an inordinate number of threads on its queue relative to the other CPUs. Also, the dispatcher determines if it should load-balance across chips that have multiple execution cores and the system has NUMA properties (more than one lgroup). The following macro tests for the need to balance across chips.

/*
 * Balancing is possible if multiple chips exist in the lgroup
 * but only necessary if the chip has multiple online logical CPUs
 */
#define CHIP_SHOULD_BALANCE(chp)                \
        (((chp)->chip_ncpu > 1) && ((chp)->chip_next_lgrp != (chp)))

                                        See usr/src/uts/common/sys/chip.h


The rationale here is this: If a chip has one core (one CPU), the extra load balancing is not necessary; one CPU that's idle is OK to run on. Contrast this with a two-core chip, where even if only one of the two cores (CPUs) is busy, it's still better to try to find a chip where both CPUs are idle. Otherwise, we could end up having a chip with two busy cores and another chip on the same system with two idle cores. More succinctly, for systems with multicore chips, try to load-balance across physical chips, spreading the workload across the CPUs on physical chips rather than filling up the CPUs on chip 0, then the CPUs on chip 1, and so on.

In setbackdq(), if the thread's partition is not the same as the partition of the selected CPU, disp_lowpri_cpu() is called again to move the thread to a different CPU and place it on that CPU's queue. Otherwise, if the thread's CPU partition is the same as that of the selected CPU (from cpu_choose()), CHIP_SHOULD_BALANCE runs. If the chip has multiple CPUs and there's another chip in the lgroup to balance against, chip_balance() is called; it determines, based on load (number of threads, number of CPUs on the chips), whether balancing is necessary. If load balancing is necessary, a lesser-loaded CPU is found; otherwise, we use the same CPU that was selected by cpu_choose().

Next is the run queue length balance test, which uses the RUNQ constants shown earlier. If the run queue length is determined to be out of balance, the next CPU in the partition is selected instead, as sketched below.
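Here is a simplified sketch of that balance test, assembled from the RUNQ_* definitions shown earlier. The variable names (cp for the chosen CPU, newcp for its partition neighbor, tpri for the thread's priority) are illustrative; this is not the verbatim setbackdq() code.

/* Simplified sketch of the run-queue-length balance test (not verbatim).
 * cp is the CPU chosen so far; newcp is the next CPU in the partition. */
cpu_t *newcp = cp->cpu_next_part;
long   qlen  = RUNQ_LEN(cp, tpri);

if (qlen - RUNQ_LEN(newcp, tpri) > RUNQ_MAX_DIFF ||
    (tpri < RUNQ_MATCH_PRI && qlen > RUNQ_LEN(newcp, tpri)))
        cp = newcp;     /* the neighboring queue is shorter; use it */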

Once the CPU is selected, it's a matter of inserting the thread at the end of the selected CPU's queue, according to the thread's priority, and updating the appropriate disp_t structure variables (run count, queue occupancy bitmap, etc.). The last step is setting the thread's state to TS_RUN and determining if the newly inserted thread has a higher priority than the thread the CPU is running. If the new thread has a higher priority, cpu_resched() is called to initiate the preemption process. cpu_resched() checks the priority of the thread currently executing on the processor against the priority of the thread just inserted onto the processor dispatch queue and also tests for a user or kernel preemption. A user preemption means the thread has a greater priority than the currently running thread, but not greater than kpreemptpri. More succinctly, if the thread's priority is less than 100 but greater than that of the currently running thread, the code sets up a user preemption. A kernel preemption is the result of the thread having a priority greater than the currently running thread and greater than kpreemptpri, which means it's an RT class thread. For a user preemption, the cpu_runrun flag is set. For a kernel preemption, cpu_kprunrun is set. The last step is to call poke_cpu(), which executes a cross-call (CPU-to-CPU interrupt), forcing the CPU into a trap handler. The runrun flags are tested in the trap handler, and the CPU executes the preemption as needed. The cpu_resched() function is shown below.

static void
cpu_resched(cpu_t *cp, pri_t tpri)
{
        int     call_poke_cpu = 0;
        pri_t   cpupri = cp->cpu_dispatch_pri;

        if (!CPU_IDLING(cpupri) && (cpupri < tpri)) {
                TRACE_2(TR_FAC_DISP, TR_CPU_RESCHED,
                    "CPU_RESCHED:Tpri %d Cpupri %d", tpri, cpupri);
                if (tpri >= upreemptpri && cp->cpu_runrun == 0) {
                        cp->cpu_runrun = 1;
                        aston(cp->cpu_dispthread);
                        if (tpri < kpreemptpri && cp != CPU)
                                call_poke_cpu = 1;
                }
                if (tpri >= kpreemptpri && cp->cpu_kprunrun == 0) {
                        cp->cpu_kprunrun = 1;
                        if (cp != CPU)
                                call_poke_cpu = 1;
                }
        }

        /*
         * Propagate cpu_runrun, and cpu_kprunrun to global visibility.
         */
        membar_enter();

        if (call_poke_cpu)
                poke_cpu(cp->cpu_id);
}

                                        See usr/src/uts/common/disp/disp.c


The setfrontdq() code implements basically the same algorithm as setbackdq(), with the exception of the load-balancing component. Inserting at the front of the queue is analogous to stepping in front of a line; the depth of the line doesn't really matter. In this case, the depth of the queue doesn't really matter, since the newly inserted thread is going to run next.

Whether setfrontdq() or setbackdq() is called from the various points in the kernel where queue insertion occurs is driven by factors such as how long a thread has been waiting to run, whether or not the thread is in the IA class, and similar concerns. IA class threads are put at the front of a dispatch queue for an additional edge on getting scheduled. A preempted thread with a scheduler activation is always placed at the front of a queue. RT class threads are always placed at the back of the kernel preempt queue. Threads that have waited a while (relatively speaking) to run (as determined by the thread's t_disp_time value) are placed at the front of a queue.

For threads in the RT class, a partition-wide kp_queue is used. The kp_queue insertion is done with setkpdq(). The thread's dispatch queue is derived from the t_cpupart->cp_kp_queue pointer, the queue's nrunnable count is incremented, and the specific queue pointer is set according to the thread's RT priority. If the queue already has one or more runnable threads on it, the thread is placed at the back of the queue; otherwise, if this is the only thread that will be on the run queue, it is simply inserted in the queue. The remaining disp_t fields (the bitmap of occupied queues and the maxrunpri fields) are updated. After the thread is inserted on a queue, setkpdq() verifies that the thread's partition didn't change. If it did, a CPU is selected from the thread's new partition (based on the thread's t_cpupart->cp_cpulist).

The selected CPU at this point is considered a top contender for running this thread next, and as such is passed to the disp_lowpri_cpu(), which may or may not change the selected CPU, depending on the criteria used by that function (described previously). With the final selection done, cpu_resched() is called to initiate a kernel preemption, since an RT class thread is runnable.
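In the pseudocode style used elsewhere in this section, the setkpdq() flow just described looks roughly like this (a sketch, not the verbatim code):

setkpdq(thread)
        kpq = thread->t_cpupart->cp_kp_queue
        kpq->disp_nrunnable++
        insert thread on kpq->disp_q[tpri] (back of queue if other
            threads are queued; otherwise a simple insertion)
        update disp_qactmap and disp_maxrunpri
        cp = thread's last CPU
        if (cp is no longer in the thread's partition)
                cp = a CPU from thread->t_cpupart->cp_cpulist
        cp = disp_lowpri_cpu(cp, thread->t_lpl, tpri, NULL)
        cpu_resched(cp, tpri)           /* initiate kernel preemption */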

To summarize, the queue insertion functions are responsible for selecting which CPU a thread will run on next. The functions factor in thread bindings, thread priority, partitions, lgroups, and load balancing. Here's how the queue insertion functions select the CPU for non-RT class threads (a consolidated pseudocode sketch follows the list).

  • A newly created thread has its CPU set to the CPU running the thread create code, unless that CPU is not in the default partition, in which case, disp_lowpri_cpu() selects the new thread's CPU.

  • If the thread is bound (for example, to a processor set by psrset(1)), select a CPU in the processor set (partition).

  • If the thread has been waiting 3 ticks or less (warm affinity threshold through rechoose_interval), use the thread's t_cpu (last CPU it ran on).

  • If warm affinity expired, but the thread is in its home lgroup, use t_cpu.

  • Otherwise, look for a CPU running at a lower priority in the local lgroup. If one is not found, look in the remote lgroup. If all CPUs are at a higher priority, select a CPU from the local lgroup.

  • After a CPU has been selected, check chip load balancing; if load balancing is necessary, select a CPU on a chip with less load.

  • After a CPU has been selected, check run queue depth balancing. If the difference in run queue sizes is greater than 2, select a CPU with a smaller run queue length.
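The steps in the preceding list can be drawn together into a single pseudocode sketch. This is our consolidation of the list, not a single function in the kernel source.

select_cpu_for_thread(thread)                   /* non-RT threads */
        if (thread is newly created)
                cpu = creating thread's CPU (or disp_lowpri_cpu() if that
                    CPU is not in the default partition)
        else if (thread is bound to a processor set)
                cpu = a CPU in the bound partition
        else if (thread waited <= cpu_rechoose ticks)
                cpu = thread->t_cpu             /* warm affinity */
        else if (thread->t_cpu is in the thread's home lgroup)
                cpu = thread->t_cpu
        else
                cpu = disp_lowpri_cpu(...)      /* local, then remote lgroups */
        if (CHIP_SHOULD_BALANCE(cpu's chip))
                cpu = chip_balance(...)         /* may pick a less-loaded chip */
        if (runq length difference > RUNQ_MAX_DIFF)
                cpu = next CPU in partition with the shorter queue
        return (cpu)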

3.8.1.1. Monitoring Queue Activity

The DTrace sched provider includes several probes that enable us to observe dispatcher queue insertion (and removal). The enqueue probe fires immediately before a thread is inserted on a queue, and the argument arrays extract information on the thread, process, and CPU. The args[3] value is a boolean: 0 if the thread will be placed at the back of the queue, 1 if insertion is at the front. And of course, the setfrontdq(), setbackdq(), setkpdq(), and related functions can be instrumented directly with the DTrace FBT provider.

The /usr/demo/dtrace directory in all Solaris 10 distributions contains several excellent D scripts for monitoring dispatcher queues. qlen.d monitors the queue length for each CPU.

# dtrace -s ./qlen.d
dtrace: script './qlen.d' matched 5 probes
^C
        9
           value  ------------- Distribution ------------- count
             < 0 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@      14106
               1 |@@@@@                                    2070
               2 |@                                        249
               3 |                                         15
               4 |                                         1
               5 |                                         0
. . .
        0
           value  ------------- Distribution ------------- count
             < 0 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@      14811
               1 |@@@@@@                                   2427
               2 |@                                        268
               3 |                                         14
               4 |                                         1
               5 |                                         0
        12
           value  ------------- Distribution ------------- count
             < 0 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@      16127
               1 |@@@@@@                                   2647
               2 |@                                        308
               3 |                                         24
               4 |                                         2
               5 |                                         0


qlen.d was executed on an 8-CPU system, but we cut the output for all but three CPUs for space. We can see that, during the qlen.d collection, the dispatcher queue length balancing works quite well; no one CPU's queue length was significantly different in size from the others.

This is just one example. Please try out /usr/demo/dtrace/qtime.d and /usr/demo/dtrace/whoqueue.d to further observe dispatch queue activity (and check out the sched provider in the Solaris Dynamic Tracing Guide).

3.8.2. The Heart of the Dispatcher: swtch()

The kernel swtch() routine initiates the context switching of a thread off a processor, figures out which thread should run next, and context-switches the selected thread onto a processor for execution. It's called from many places within the operating system: in the class-fork-return function (a thread has just been created), from the idle thread (executed by processors if there are no runnable threads on a dispatch queue), by interrupt and trap handlers (to reenter the dispatcher), for thread sleep management, in kernel synchronization support code (mutexes, reader/writer locks, condition variables, etc.), and, of course, from the preempt() function. The various entry points to swtch() are listed in Table 3.3.

Table 3.3. Sources of Calls to swtch()

Kernel Subsystem     Kernel Function       Description
Dispatcher           idle                  Per-processor idle thread
                     preempt               Last phase of a preemption
Kthread              release_interrupt     Called from an interrupt thread
TS/IA class          ts_forkret            After kthread is created
Sleep/wakeup         cv_xxxx               Various condition variable functions
CPU                  force_migrate         Thread migration to another processor
                     cpu_pause             Processor state change to pause
Mutex                mutex_vector_enter    Mutex lock acquisition
RWlock               rw_enter_sleep        RW lock acquisition
Memory scheduler     sched                 PID 0
Semaphore            sema_p                Semaphore "p" operation
Signal               stop                  Thread stop function
Sleep/wakeup         slp_cv_wait           Thread to sleep state
Interrupt            intr_thread_exit      Exit of an interrupt handler


Entering the swtch() routine causes the cpu_sysinfo.pswtch counter to be incremented, as reported by mpstat(1M) in the csw column, which reflects the number of context switches per second for each processor. The swtch() function first checks to see if the current thread is an interrupt thread. When an interrupt occurs, execution switches onto the stack of an interrupt thread, taken from the linked list of interrupt threads in the processor's cpu structure. If swtch() was entered with an interrupt thread as the current thread, the kernel must restore the interrupted thread's state so it can be resumed. The interrupted thread is unpinned (a thread that has been preempted for an interrupt is considered pinned), and the kernel resume_from_interrupt() assembly routine is called to restore the state of the interrupted thread.

If the current thread is not an interrupt thread, swtch() calls the disp() function, which is the code segment that looks for the highest-priority thread to run, sets the thread's state to running (TS_ONPROC), and arranges for it to be switched onto the current processor. At a high level, the disp() function searches the dispatch queues for the best-priority kernel thread, starting with the kernel preempt queue and then searching the queue of the current processor, that is, the processor executing the disp() code. If those searches come up blank, then the code searches the dispatch queues of other processors on a multiprocessor system, looking for a runnable kernel thread. If no threads are found on the dispatch queues, the processor executes its idle thread, which executes a tight loop, testing for runnable threads on each pass through the loop and entering swtch() if the run count is greater than 0. The overall flow is sketched below.
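Using the pseudocode style of the fragments below, the overall swtch()/disp() flow just described can be sketched as follows (our summary, not the kernel source):

swtch()
        if (current thread is an interrupt thread)
                unpin the interrupted thread
                resume it via the low-level resume-from-interrupt path
        else
                thread = disp()
                resume(thread)

disp()
        search the kernel preempt queue (cp_kp_queue)
        if (nothing found) search this processor's dispatch queue
        if (nothing found) search other processors' queues (disp_getwork())
        if (still nothing) return this processor's cpu_idle_thread
        set the selected thread's state to TS_ONPROC
        return (thread)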

The search for the highest-priority thread begins with the kernel preempt queue, as referenced by the current processor through its cpu_part structure, where the preempt queue is linked to cp_kp_queue. In this case, on a system with multiple processor partitions, the preempt queue for the processor partition that the executing processor belongs to is searched first. The cp_kp_queue search is represented in the following pseudocode.

kpq = pointer to kernel preempt queue
dq  = pointer to processor's dispatch queue
while ( (priority = kpq->disp_maxrunpri) >= 0  AND
        (priority >= dq->disp_maxrunpri)  AND
        (the current CPU is NOT offline)  AND
        ((thread_pointer = disp_getbest(kpq)) != NULL) )
                if (disp_ratify(thread_pointer, kpq) != NULL)
                        return (thread_pointer)


The preceding queue search loop validates the priority value according to the queue's disp_maxrunpri, which reflects the highest-priority thread sitting on the queue, makes sure the current processor is not offline, and calls the dispatcher disp_getbest() code to fetch the best-priority thread from the kernel preempt queue. disp_getbest() finds the highest-priority unbound thread, calls dispdeq() to have the thread removed from the dispatch queue, and returns the thread pointer back to disp(). If nothing is found, NULL is returned.

disp_getbest()
        dpq = dispatch queue pointer (cp_kp_queue in this example)
        priority = dpq->disp_max_unbound_pri
        if (priority == -1)
                return (NULL)
        queue = dpq->disp_q[pri]
        thread_pointer = queue->dq_first
        loop through linked list of threads on queue, skipping bound threads
        if (no unbound threads)
                return (NULL)
        else
                thread_pointer = thread found
        dispdeq(thread_pointer)
        set the thread's t_disp_queue, the processor's cpu_dispthread, and the
            thread state to ONPROC
        return (thread_pointer)


If an unbound thread is found in disp_getbest(), the thread is dequeued with dispdeq(), the thread's t_disp_queue pointer is set to reference the processor's cpu structure cpu_disp queue pointer, the processor's cpu_dispthread pointer is set to the selected thread pointer, and the thread state is set to ONPROC.

dispdeq() deals with updating the dispatch queue data structures with the selected thread removed from the queue. It decrements disp_nrunnable, which is the total count for all the queues, and dq_sruncnt, which maintains the count of runnable threads at the same priority. If the per-priority queue count, dq_sruncnt, is 0, then the queue bitmap is updated to reflect an empty queue. The disp_qactmap bitmap uses a set bit to reflect the presence of runnable threads on a per-priority queue; thus, the bit that corresponds to the zeroed queue is cleared. The disp_maxrunpri and disp_max_unbound_pri fields are also updated to reflect the new highest-priority thread on the queue if it is different from the thread that has just been removed from the queue.
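As a rough illustration of that bookkeeping, the sequence resembles the following sketch, simplified from disp.c. The bitmap helpers (BT_CLEAR, bt_gethighbit) are the kernel's sys/bitmap.h facilities; locking and special cases are omitted.

/* Simplified sketch of dispdeq() queue bookkeeping (not verbatim).
 * dp is the disp_t, dq the per-priority queue, tpri the thread's priority. */
dp->disp_nrunnable--;                      /* one fewer runnable thread */
if (--dq->dq_sruncnt == 0) {
        BT_CLEAR(dp->disp_qactmap, tpri);  /* this priority queue is empty */
        if (dp->disp_nrunnable == 0) {
                dp->disp_max_unbound_pri = -1;
                dp->disp_maxrunpri = -1;
        } else if (tpri == dp->disp_maxrunpri) {
                /* scan the bitmap downward for the next occupied queue */
                dp->disp_maxrunpri = bt_gethighbit(dp->disp_qactmap,
                    dp->disp_maxrunpri >> BT_ULSHIFT);
        }
}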

Once the thread selection has been made and the thread dequeued, the code returns to disp(), which calls disp_ratify() to ensure that the selected thread was, in fact, the best candidate to run next. The fine-grained locking used within the dispatcher routines allows for simultaneous changes to be made to the queues and the queue state by potentially many processors. For this reason, a select-and-ratify algorithm was chosen for implementation.

Now that the select phase of the algorithm is completed, disp_ratify() is entered to complete the ratify phase. The ratify code simply compares the priority of the selected thread to the disp_maxrunpri values of the processor and kernel preempt queue. If the selected thread priority is greater than maxrunpri, the selection is ratified and the context switch is done. If not, the code loop is reentered to find the best runnable thread. More precisely, if a higher-priority thread appears on the queue when disp_ratify() executes, the selected thread is placed back on the dispatch queue with a call to setfrontdq(), and disp_ratify() returns NULL to disp().
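A simplified sketch of the ratify phase, again in pseudocode form (not the verbatim disp_ratify() code):

disp_ratify(thread_pointer, kpq)
        clear this CPU's cpu_runrun and cpu_kprunrun flags
        maxpri = max(this CPU's disp_maxrunpri, kpq->disp_maxrunpri)
        if (thread's priority < maxpri)
                setfrontdq(thread_pointer)      /* put it back; reselect */
                return (NULL)
        return (thread_pointer)                 /* selection ratified */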

If a thread is not found on the kernel preempt queue, then the per-processor queue disp_maxrunpri is tested. A value of -1 means that nothing is on the queue. In that case, the code searches the queues of the other processors on the system, beginning with the disp_getwork() code, which finds a processor with the highest-priority thread. Then, the code uses the disp_getbest() and disp_ratify() functions previously described.

If the current processor's disp_maxrunpri indicates runnable threads, the first thread from the highest-priority queue is removed, the queue data is updated (disp_nrunnable, dq_sruncnt, disp_qactmap, disp_max_unbound_pri, and disp_maxrunpri), the selection is ratified, and disp() returns the thread pointer to swtch().

If no work is found on any of the dispatch queues, disp_getwork() selects the processor's idle thread by setting the thread pointer to the cpu_idle_thread, referenced from the processor's cpu structure. The pointer to the idle thread is returned to the swtch() code.

Back in swtch(), with a thread pointer for the selected thread (or idle thread), the kernel resume() code is called to handle the switching of the thread on the processor. resume() is implemented in assembly language because the process of context switching requires low-level contact with processor hardware, to save the hardware context of the thread being switched off, and to set up the hardware registers and other context information so that the new thread can begin execution.



