

3.3. Dispatcher Queues, Structures, and Variables

Dispatcher queues, or run queues, are linked lists of runnable kernel threads (threads in the RUN state), waiting to be selected by the dispatcher for execution on a processor. Solaris implements per-processor dispatch queues; that is, every processor on a Solaris system is initialized with its own set of dispatcher queues. The queues are organized as an array of queues, where a separate linked list of threads is maintained for each global priority. The per-processor queue arrangement improves scalability, eliminating the potential for a highly contended mutex lock that would be required for a global, systemwide queue. Additionally, per-processor queues simplify managing affinity and binding of threads to processors, since bound threads are simply placed on the dispatch queue of the processor to which they are bound.

Processors with more than one execution core, such as Sun's UltraSPARC IV+, which has two cores per processor chip, have a dispatch queue per core; that is, the kernel configures each execution core as a processor. Sun's Sun Fire T1000 and T2000 servers, based on the UltraSPARC T1, are configured such that each hardware thread is a processor to the kernel dispatcher. For the typical configuration of a fully loaded T1000 or T2000, with eight execution cores and four hardware threads per core, Solaris configures 8 x 4 = 32 processors, each with its own set of dispatch queues.

These per-processor queues are used for threads in all scheduling classes, with the exception of real-time threads. The real-time scheduling class offers a unique set of features for applications with specific requirements. To provide optimal real-time support, real-time threads are placed on a special set of dispatcher queues, called kernel preemption queues, or kp queues. Whenever a real-time thread is placed on a kp queue, a kernel preemption is generated, forcing the processor to enter the scheduler (see Section 3.9).

3.3.1. Dispatcher Structures

The dispatcher uses several data structures and per-structure variables to perform the tasks of thread management, queue management, and scheduling. These tasks are quite complex, especially on systems with multiple processors running workloads with hundreds or thousands of active threads. Resource management facilities for pools of processors and binding, dynamic reconfiguration capabilities, and processor state changes (offline, online) all combine to make dispatcher functions a delicate dance that required some brilliant engineering to maintain correctness while providing excellent performance and scalability.

The dispatcher uses the following data structures and variables:

  • cpu_t. There is a cpu_t structure for every processor in a system. In addition to several dispatcher-specific variables within the cpu_t structure itself, the cpu_t links to a disp_t structure that maintains additional per-processor dispatcher information, as well as a link to the processor's actual dispatch queues.

    /*
     * Scheduling variables.
     */
    disp_t          *cpu_disp;        /* dispatch queue data */
    /*
     * Note that cpu_disp is set before the CPU is added to the system
     * and is never modified.  Hence, no additional locking is needed
     * beyond what's necessary to access the cpu_t structure.
     */
    char            cpu_runrun;       /* scheduling flag - set to preempt */
    char            cpu_kprunrun;     /* force kernel preemption */
    pri_t           cpu_chosen_level; /* priority at which cpu */
                                      /* was chosen for scheduling */
    kthread_t       *cpu_dispthread;  /* thread selected for dispatch */
    disp_lock_t     cpu_thread_lock;  /* dispatcher lock on current thread */
    uint8_t         cpu_disp_flags;   /* flags used by dispatcher */
    /*
     * The following field is updated whenever the cpu_dispthread
     * changes. Also in places, where the current thread(cpu_dispthread)
     * priority changes. This is used in disp_lowpri_cpu()
     */
    pri_t           cpu_dispatch_pri; /* priority of cpu_dispthread */
    clock_t         cpu_last_swtch;   /* last time switched to new thread */

                                      See usr/src/uts/common/sys/cpuvar.h

  • cpupart_t. This structure is part of the framework for processor partitions, which is how processor sets are defined internally. Processor sets are a facility for grouping one or more processors on a multiprocessor system into a user-defined processor set. Processes and threads can be explicitly bound to the processors in the set, providing a means of partitioning processor resources for applications and workloads.

    typedef struct cpupart {
            disp_t          cp_kp_queue;    /* partition-wide kpreempt queue */
            cpupartid_t     cp_id;          /* partition ID */
            int             cp_ncpus;       /* number of online processors */
            struct cpupart  *cp_next;       /* next partition in list */
            struct cpupart  *cp_prev;       /* previous partition in list */
            struct cpu      *cp_cpulist;    /* processor list */
            struct kstat    *cp_kstat;      /* per-partition statistics */
            /*
             * cp_nrunnable and cp_nrunning are used to calculate load average.
             */
            uint_t          cp_nrunnable;   /* current # of runnable threads */
            uint_t          cp_nrunning;    /* current # of running threads */
            /*
             * cp_updates, cp_nrunnable_cum, cp_nwaiting_cum, and cp_hp_avenrun
             * are used to generate kstat information on an as-needed basis.
             */
            uint64_t        cp_updates;       /* number of statistics updates */
            uint64_t        cp_nrunnable_cum; /* cum. # of runnable threads */
            uint64_t        cp_nwaiting_cum;  /* cum. # of waiting threads */
            struct loadavg_s cp_loadavg;      /* cpupart loadavg */
            klgrpset_t      cp_lgrpset;       /* set of lgroups on which this */
                                              /*    partition has cpus */
            lpl_t           cp_lgrploads[NLGRPS_MAX];
                                              /* table of load averages for this */
                                              /*    partition, indexed by lgrp ID */
            uint64_t        cp_hp_avenrun[3]; /* high-precision load average */
            uint_t          cp_attr;          /* bitmask of attributes */
            lgrp_gen_t      cp_gen;           /* generation number */
    #if defined(_MACHDEP)
            /*
             * These guarded members must reside at the end of the structure
             */
            cpuset_t        cp_haltset;       /* bitmask of halted cpus */
            chip_set_t      cp_chipset;       /* set of chips spanned by this part */
    #endif  /* _MACHDEP */
    } cpupart_t;

                                              See usr/src/uts/common/sys/cpupart.h

  • dispq_t. A dispatch queue, linking to the first and last threads on the queue and maintaining a count of the runnable threads queued at that priority. (A simplified sketch of how a thread is enqueued, tying dispq_t, disp_t, and kthread_t together, follows this list.)

    /*
     * The following is the format of a dispatcher queue entry.
     */
    typedef struct dispq {
            kthread_t       *dq_first;      /* first thread on queue or NULL */
            kthread_t       *dq_last;       /* last thread on queue or NULL */
            int             dq_sruncnt;     /* number of loaded, runnable */
                                            /*    threads on queue */
    } dispq_t;

                                            See usr/src/uts/common/sys/disp.h

  • disp_t. Per-processor variables on the state of the processor's dispatch queues, including a link to the dispatch queue.

    typedef struct _disp {
            disp_lock_t     disp_lock;      /* protects dispatching fields */
            pri_t           disp_npri;      /* # of priority levels in queue */
            dispq_t         *disp_q;        /* the dispatch queue */
            dispq_t         *disp_q_limit;  /* ptr past end of dispatch queue */
            ulong_t         *disp_qactmap;  /* bitmap of active dispatch queues */
            /*
             * Priorities:
             *      disp_maxrunpri is the maximum run priority of runnable threads
             *      on this queue.  It is -1 if nothing is runnable.
             *
             *      disp_max_unbound_pri is the maximum run priority of threads on
             *      this dispatch queue but runnable by any CPU.  This may be left
             *      artificially high, then corrected when some CPU tries to take
             *      an unbound thread.  It is -1 if nothing is runnable.
             */
            pri_t           disp_maxrunpri; /* maximum run priority */
            pri_t           disp_max_unbound_pri;  /* max pri of unbound threads */
            volatile int    disp_nrunnable; /* runnable threads in cpu dispq */
            struct cpu      *disp_cpu;      /* cpu owning this queue or NULL */
    } disp_t;

                                            See usr/src/uts/common/sys/disp.h

  • disp_queue_info. There is one queue info structure per processor. The variables in queue info provide temporary placeholders for queue data that will change during the normal flow of dispatcher events. The disp_queue_info structures are maintained in an array, referenced in the kernel through the disp_mem pointer.

    /* Dispatch queue allocation structure and functions */
    struct disp_queue_info {
            disp_t  *dp;
            dispq_t *olddispq;
            dispq_t *newdispq;
            ulong_t *olddqactmap;
            ulong_t *newdqactmap;
            int     oldnglobpris;
    };

                                            See usr/src/uts/common/disp/disp.c

  • kthread_t. This structure defines a kernel thread. The kthread_t maintains, among other things, the thread's assigned and inherited priority, a link to a scheduling-class-specific structure (xxproc_t), and timestamps for tracking execution and wait times.

    typedef struct _kthread {
    ...
            pri_t   t_pri;          /* assigned thread priority */
            pri_t   t_epri;         /* inherited thread priority */
    ...
            struct thread_ops *t_clfuncs;   /* scheduling class ops vector */
            void    *t_cldata;      /* per scheduling class specific data */
    ...

                                            See usr/src/uts/common/sys/thread.h

  • xxproc_t, where xx is one of ts, ia, rt, fx, or fss, corresponding to the scheduling class of the kernel thread (for example, a kernel thread in the timeshare (TS) class links to a tsproc_t structure, a kernel thread in the fair-share (FSS) class links to a fssproc_t structure, and so forth). These structures maintain time quantum information and other class-specific data. The example below is the timeshare class structure.

    /*
     * time-sharing class specific thread structure
     */
    typedef struct tsproc {
            int             ts_timeleft;    /* time remaining in procs quantum */
            uint_t          ts_dispwait;    /* wall clock seconds since start */
                                            /* of quantum (not reset upon preemption) */
            pri_t   ts_cpupri;      /* system controlled component of ts_umdpri */
            pri_t   ts_uprilim;     /* user priority limit */
            pri_t   ts_upri;        /* user priority */
            pri_t   ts_umdpri;      /* user mode priority within ts class */
            char    ts_nice;        /* nice value for compatibility */
            char    ts_boost;       /* interactive priority offset */
            unsigned char ts_flags; /* flags defined below */
            kthread_t *ts_tp;       /* pointer to thread */
            struct tsproc *ts_next; /* link to next tsproc on list */
            struct tsproc *ts_prev; /* link to previous tsproc on list */
    } tsproc_t;

                                            See usr/src/uts/common/sys/ts.h
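
To make the linkage between kthread_t, dispq_t, and disp_t concrete, here is a simplified sketch of what inserting a runnable thread at the tail of its priority's queue involves. It is loosely modeled on the kernel's setbackdq() function, but the function name, the use of t_link as the queue linkage, the bitmap arithmetic, and the direct indexing of disp_q by global priority are illustrative assumptions; the real code also handles locking (disp_lock), bound threads, and CPU selection.

    /*
     * Simplified enqueue sketch, loosely modeled on setbackdq().
     * Structure fields come from the listings above; the function name,
     * the use of t_link as the queue linkage, and the bitmap arithmetic
     * are illustrative assumptions, not kernel source.
     */
    static void
    enqueue_back(disp_t *dp, kthread_t *tp, pri_t tpri)
    {
            dispq_t *dq = &dp->disp_q[tpri];   /* one dispq_t per global priority */

            tp->t_link = NULL;
            if (dq->dq_last != NULL) {
                    dq->dq_last->t_link = tp;  /* append to the tail of the list */
                    dq->dq_last = tp;
            } else {
                    dq->dq_first = dq->dq_last = tp;  /* first thread at this priority */
            }
            dq->dq_sruncnt++;                  /* runnable threads at this priority */
            dp->disp_nrunnable++;              /* runnable threads on this CPU's queues */

            /* mark this priority's queue as non-empty in the active-queue bitmap */
            dp->disp_qactmap[tpri / (sizeof (ulong_t) * 8)] |=
                1UL << (tpri % (sizeof (ulong_t) * 8));

            if (tpri > dp->disp_maxrunpri)
                    dp->disp_maxrunpri = tpri; /* new best runnable priority */
    }

The kernel additionally updates disp_max_unbound_pri for unbound threads and may set a preemption flag on the target CPU, as described in Section 3.9.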

3.3.2. Dispatcher Structure Linkage

Specific details on how and where the structure variables are used by the dispatcher are discussed in the following sections. First, let's look at how the structure pieces fit together in Figure 3.4. For clarity, the class-specific data that is linked to a kernel thread's t_cldata pointer is not shown in the figure.

Figure 3.4. Dispatcher Queue Structures


The diagram in Figure 3.4 shows the organization of the per-processor dispatcher queues, used for queueing all runnable threads except those in the real-time scheduling class. There is a dispq_t for every global priority, ordered numerically in descending order. That is, the first dispq_t is the root of all threads at the highest (best) global priority (either 109 or 169; see Section 3.6 for the particulars), the next one down links threads at the next lowest priority, and so on.
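
The selection side of this arrangement can be sketched in the same spirit: the dispatcher uses disp_maxrunpri to find the highest-priority non-empty queue and takes the thread at its head. The fragment below illustrates only the data-structure walk and assumes disp_q is indexed directly by global priority; it is not the actual disp() code, which also unlinks the thread, updates the counts and the active-queue bitmap, and does all of this under disp_lock.

    /*
     * Illustrative selection sketch (not the kernel's disp() code):
     * return the thread at the head of the highest-priority non-empty
     * queue, assuming disp_q is indexed by global priority.
     */
    static kthread_t *
    select_highest(disp_t *dp)
    {
            pri_t    pri = dp->disp_maxrunpri;  /* -1 if nothing is runnable */
            dispq_t *dq;

            if (pri == -1)
                    return (NULL);              /* caller runs the idle thread */

            dq = &dp->disp_q[pri];              /* queue for the best priority */
            return (dq->dq_first);              /* FIFO order within a priority */
    }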

Dispatcher queues for real-time threads (kp queues) are managed in a slightly different way. The per-priority queue arrangement is the same (60 queues for real-time priorities 0-59), but the queues are not per-processor; rather, they are per processor partition, which is how processor sets are represented internally. By default, on a Solaris system with no user-defined processor sets, there is one systemwide kp queue for real-time threads. If processor sets are created, the number of kp queues equals the number of configured processor sets plus one: one kp queue per processor set, plus one for the default (system) set. Note also that the disp_t structure is embedded in the cpupart_t structure rather than linked through a pointer.

The dispatcher structures are created and initialized at boot time in accordance with the number of processors installed on the system. Initialization functions can also be called while the system is up and running in order to support the dynamic reconfiguration capabilities of some of Sun's server systems. For example, adding a system board to a high-end server changes the number of available processors, requiring the kernel to allocate and initialize structures for the additional processors. Also, scheduling classes are implemented as dynamically loadable kernel modules; loading a scheduling class requires calling dispatcher initialization functions.

3.3.3. Examining Dispatcher Structures

Let's quickly look at how we can examine these structures on a running system and thus determine the values of the structure variables of interest. We use the system debugger, mdb(1), as well as dtrace(1).

# mdb -k
> ::cpuinfo
 ID ADDR        FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD      PROC
  0 0000180c000  1b    0    0  17   no    no t-2    300022f7660 threads
  1 30001b50000  1b    0    0  17   no    no t-0    30002def920 threads
  4 30001b52000  1b    1    0  51   no    no t-0    30003c1e640 tar
  5 30001bf2000  1b    0    0  59   no    no t-0    300065a3320 mdb
  8 30001bf0000  1b    0    0  59   no    no t-0    300063d2cc0 sshd
  9 30001be6000  1b    1    0  32   no    no t-0    300065a29c0 tar
 12 30001be0000  1b    0    0  17   no    no t-1    300061c9c80 threads
 13 30001bdc000  1b    0    0  17   no    no t-2    3000230b960 threads


The mdb ::cpuinfo dcmd provides a tabular summary for every CPU on the system. In this example, we have eight CPUs. The cpuinfo output provides several interesting bits of information used by the dispatcher. We can see the name of the process the CPU is executing (if it is running a process thread at the time the command is executed), the address of the thread structure (THREAD), the priority of the running thread (PRI), and the number of runnable threads on the CPU's dispatch queues (NRUN). The ADDR column is the kernel virtual address of the processor's cpu_t structure. The RNRN and KRNRN fields represent the CPU's cpu_runrun and cpu_kprunrun flags, respectively. These flags trigger a preemption. A preemption is initiated by the operating system when an event requires a processor's attention, such as the insertion onto a CPU's dispatch queue of a thread with a higher priority than the thread the CPU is currently running. Solaris defines user preemptions, triggered by cpu_runrun, and kernel preemptions, triggered by cpu_kprunrun. See Section 3.9.
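
The decision behind those flags can be sketched as follows. This is a simplified illustration of the idea behind the kernel's cpu_resched() logic, not its implementation; the function name is made up for the example, and kpreemptpri is assumed here to be the threshold (the bottom of the real-time priority range) at or above which a kernel rather than a user preemption is requested.

    /*
     * Illustrative preemption-request sketch.  cpu_runrun, cpu_kprunrun,
     * and cpu_dispatch_pri are the cpu_t fields shown earlier; the
     * function name and the use of kpreemptpri as the kernel-preemption
     * threshold are assumptions for illustration.
     */
    static void
    request_preemption(cpu_t *cp, pri_t tpri)
    {
            if (tpri <= cp->cpu_dispatch_pri)
                    return;                     /* the running thread still wins */

            if (tpri >= kpreemptpri)
                    cp->cpu_kprunrun = 1;       /* kernel preemption (real-time range) */
            else
                    cp->cpu_runrun = 1;         /* user preemption */

            /* the kernel then pokes the target CPU so it notices the flag promptly */
    }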

Displaying the entire cpu_t for a CPU is a simple matter of grabbing the address (ADDR) of the desired CPU and using the mdb(1) print command.

> 30001be0000::print cpu_t
{
    cpu_id = 0xc
    cpu_seqid = 0x6
    cpu_flags = 0x1b
    cpu_self = 0x30001be0000
    cpu_thread = 0x30003b909e0
...
    cpu_disp = 0x30001b2ecd8
    cpu_runrun = '\0'
    cpu_kprunrun = '\0'
    cpu_chosen_level = 0xffff
    cpu_dispthread = 0x300061c9640
    cpu_thread_lock = 0
    cpu_disp_flags = 0
    cpu_dispatch_pri = 0x31
...


The output from the above command is quite large; we snipped most of it for this example. This gives us another view of some of the per-CPU data displayed by the cpuinfo dcmd, such as the CPU preemption flags (runrun), the address of the thread structure running on the CPU, the CPU's current priority, and the like.

The cpu_disp field provides the address of the CPU's dispatcher (disp_t) structure.

> 30001b2ecd8::print disp_t
{
    disp_lock = 0
    disp_npri = 0x6e
    disp_q = 0x30001be4a80
    disp_q_limit = 0x30001be54d0
    disp_qactmap = 0x30001b73a58
    disp_maxrunpri = 0xffff
    disp_max_unbound_pri = 0xffff
    disp_nrunnable = 0
    disp_cpu = 0x30001be0000
}


The disp_t fields include a dispatcher lock (see Section 3.4) and the number of global priorities on the system (0x6e, or 110 decimal; Section 3.7.1 explains this value). The disp_maxrunpri value of 0xffff equates to -1, which means there are currently no runnable threads waiting on this CPU's dispatch queues. The remaining variables in this structure, and in the other structures we have examined, are described in the following sections, which cover the execution of key dispatcher functions.

Be aware that examining this data on a running system is a moving target. Many of these fields change hundreds or thousands of times per second. That is why certain variables, such as thread address and priority, appear different as we move through the examples.

mdb(1) dcmds can be combined into pipelines to provide a more direct path to the data of interest. Below is a snapshot of the priority of the thread running on each CPU.

> ::walk cpu |::print cpu_t cpu_thread |::print kthread_t t_pri
t_pri = 0x11
t_pri = 0xffff
t_pri = 0x1d
t_pri = 0xffff
t_pri = 0x1d
t_pri = 0x11
t_pri = 0x11
t_pri = 0x11
>


Here we see that of the eight CPUs, two are idle (t_pri = 0xffff), four are running threads at priority 17 (0x11 = 17 decimal), and two are running threads at priority 29 (0x1d = 29 decimal).

Having a look at the kp_queue requires dumping the cpupart_t structure linked to a CPU of interest. In this example, no user-defined processor set has been created, so we have only the default set, which contains all the processors on the system. The dispatcher code maintains a kernel variable, cp_default, for quick reference to the default (system) partition data and its kp_queue. On systems with no processor sets or resource pools configured, there is one kp_queue for the entire system.

> cp_default::print cpupart_t
{
    cp_kp_queue = {
        disp_lock = 0
        disp_npri = 0xaa
        disp_q = 0xd2e84800
        disp_q_limit = 0xd2e84ff8
        disp_qactmap = 0xd2cee158
        disp_maxrunpri = 0xffff
        disp_max_unbound_pri = 0xffff
        disp_nrunnable = 0
        disp_cpu = 0
    }
    cp_id = 0
    cp_ncpus = 0x1
....


Note that the embedded disp_t is formatted as part of the cpupart_t structure output. For systems with configured processor sets or resource pools, use the cpu_part pointer in a CPU that is a member of the set of interest. You can use a simple but powerful command pipeline to format and dump the cpupart_t, as shown in the next example.

> ::cpuinfo
 ID ADDR        FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD      PROC
  0 0000180c000  1b    0    0  -1   no    no t-1    2a10001fcc0 (idle)
  1 30001bcc000  1b    0    0 120   no    no t-44   30002c37300 cpuhog
  4 30001bce000  1b    0    0  -1   no    no t-6    2a10038fcc0 (idle)
  5 30001c6e000  1b    0    0 120   no    no t-39   300029c15e0 cpuhog
  8 30001c6c000  1b    0    0  -1   no    no t-1    2a100501cc0 (idle)
  9 30001c62000  1b    0    0 120   no    no t-41   300022ca040 cpuhog
 12 30001c5c000  1b    0    0 120   no    no t-36   300028aa680 cpuhog
 13 30001c58000  1b    0    0  59   no    no t-1    30002a49920 mdb
> 30001bcc000::print cpu_t cpu_part |::print cpupart_t
{
    cp_kp_queue = {
        disp_lock = 0
        disp_npri = 0xaa
        disp_q = 0x30002901000
        disp_q_limit = 0x30002901ff0
        disp_qactmap = 0x3000622f398
        disp_maxrunpri = 0xffff
        disp_max_unbound_pri = 0xffff
        disp_nrunnable = 0
        disp_cpu = 0
    }
    cp_id = 0
    cp_ncpus = 0x8
    cp_next = cp_default
    cp_prev = cp_default
    cp_cpulist = cpu0
    cp_kstat = kstat_initial+0xb1a0
    cp_nrunnable = 0x1
    cp_nrunning = 0x8
    cp_updates = 0x31868d
    cp_nrunnable_cum = 0x1058
    cp_nwaiting_cum = 0
    cp_loadavg = {
        lg_cur = 0x7
        lg_len = 0xb
        lg_total = 0
        lg_loads = [ 0x10edc5006, 0xf84cace8, 0xd9dcafcb, 0xce1366e8, 0x1103a83ac, 0xb9e7b718,
                     0x943a762f, 0xf5eb8825, 0xa56f5481, 0x135dd9a10, 0x984b5f3c ]
    }
    cp_lgrpset = 0x1
    cp_lgrploads = cp_default_lpls
    cp_nlgrploads = 0x5
    cp_hp_avenrun = [ 0x3e292, 0x3e268, 0x3ada9 ]
    cp_attr = 0x1
    cp_gen = 0
}
>


DTrace can also be used to examine dispatcher events of interest on a running system. For example, we may wish to monitor the number of runnable threads on a per-CPU basis, as well as RT-class threads on the kp_queue. The following dtrace script uses the dtrace FBT provider and segues nicely into a description of the dispatcher queue management functions since we enable probes in the queue insertion functions to track run queue depth.

For non-RT-class threads, the kernel inserts a thread on either the front or back of the target dispatch queue, using either setfrontdq() or setbackdq(), so we instrument the entry points for these functions and grab the disp_nrunnable value when these probes fire. Insertions onto the kp_queue are done with the setkpdq() kernel function, so the D script enables a probe at the entry point of that function and grabs the nrunnable value for the queue. The aggregation keys are the CPU ID for the per-CPU queues and the partition ID for the kp_queues, so we get queue depth for all CPUs and queue depth for all kp_queues. Here is the dtrace D script.

#!/usr/sbin/dtrace -qs

long dq_enters;
long kpdq_enters;

fbt::setfrontdq:entry,
fbt::setbackdq:entry
{
        dq_enters++;
        cpu_id = args[0]->t_cpu->cpu_id;
        dqcnt  = args[0]->t_cpu->cpu_disp->disp_nrunnable;
        @nrt[cpu_id] = quantize(dqcnt);
}

fbt::setkpdq:entry
{
        kpdq_enters++;
        part_id = args[0]->t_cpu->cpu_part->cp_id;
        kpqcnt  = args[0]->t_cpu->cpu_part->cp_kp_queue.disp_nrunnable;
        @rt[part_id] = quantize(kpqcnt);
}

tick-5sec
{
        printf("Non RT Class Threads, by CPU (%ld enters)\n", dq_enters);
        printa(@nrt);
        printf("\nRT Class Threads, by Partition ID (%ld enters)\n", kpdq_enters);
        printa(@rt);
        trunc(@nrt);
        trunc(@rt);
        dq_enters = 0; kpdq_enters = 0;
        printf("\n\n");
}


Here is sample output from running the script for a few seconds on a four-processor system. A processor set has been created on the system, with CPU 1 the only CPU in the set. A multithreaded process was bound to the processor set and put in the real-time scheduling class. A second multithreaded process is running across the remaining three CPUs in the default (system) set.

# ./rqcc.d
Non RT Class Threads, by CPU (48669 enters)
        1
           value  ------------- Distribution ------------- count
               0 |                                         0
               1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 106
               2 |                                         1
               4 |                                         0

        2
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@@@@@                            1559
               1 |@@@@@@@@@@@@@@@@@@                       2153
               2 |@@@@@@@@@                                1128
               4 |                                         12
               8 |                                         0

        3
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@@@@                             6354
               1 |@@@@@@@@@@@@@@@@@@@                      9557
               2 |@@@@@@@@@                                4515
               4 |                                         69
               8 |                                         0

        0
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@                                5505
               1 |@@@@@@@@@@@@@@@@@@@                      11081
               2 |@@@@@@@@@@@@                             6949
               4 |@                                        306
               8 |                                         0

RT Class Threads, by Partition ID (1146 enters)
        0
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1062
               1 |                                         0

        1
           value  ------------- Distribution ------------- count
               4 |                                         0
               8 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 106
              16 |                                         0


Referencing the sample output above, the "RT Class Threads, by Partition ID" data shows two aggregations because there are two partitions (the default and the one we created). The default partition (ID 0) shows a runnable thread count of zero, which is expected since the RT-class threads ran on the user-defined set (partition 1), which shows a varying number of runnable threads on its kp_queue.

The "Non-RT Class Threads, by CPU" shows a queue depth of various sizes over the source of the sampling period for each CPU.

The output data is in the form of aggregations: the result of the DTrace quantize() aggregating function, which takes a scalar value as an argument and aggregates it into a power-of-two distribution. The left column, value, represents a range of values of the aggregated data (the number of runnable threads on the queue in this case). The right column, count, represents the number of times the aggregated data fell within the corresponding value range. For example, looking at the CPU 0 distribution, the nrunnable value was 0 for 5505 occurrences of the probe firing and 1 for 11081 occurrences. The value was at least 2 but less than 4 (that is, 2 or 3) for 6949 occurrences, and at least 4 but less than 8 (that is, 4 through 7) for 306 occurrences.
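
As a small illustration of the power-of-two bucketing, the following standalone C sketch computes, for non-negative values, the bucket label a value falls under: the largest power of two that does not exceed it (zero gets its own bucket). This is a simplified model of how quantize() output is read, not DTrace source.

    /* Simplified model of quantize() bucketing for non-negative values. */
    #include <stdio.h>

    static long
    quantize_bucket(long v)
    {
            long b;

            if (v <= 0)
                    return (0);     /* 0 has its own bucket; negatives omitted here */
            for (b = 1; (b << 1) <= v; b <<= 1)
                    ;               /* find the largest power of two <= v */
            return (b);
    }

    int
    main(void)
    {
            long v;

            /* queue depths 0..7 land in buckets 0, 1, 2, 2, 4, 4, 4, 4 */
            for (v = 0; v < 8; v++)
                    printf("value %ld -> bucket %ld\n", v, quantize_bucket(v));
            return (0);
    }

Mapping this back to the CPU 0 distribution above: depths of 2 or 3 are counted under the value 2 row, and depths of 4 through 7 under the value 4 row.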



