

3.7. Thread Priorities

In Solaris, two types of priorities are involved in scheduling activity: global priorities and user priorities. The latter are often referred to as user-mode priorities and are implemented in the TS/IA, FSS, and FX classes; SYS and RT do not implement user priorities. Global priorities are the systemwide range of priorities used by the dispatcher to determine which thread gets to run next on a CPU. User-mode priorities are a range of user-settable priorities that allow users to alter a thread's priority, that is, to make it better or worse. For readers familiar with the traditional UNIX nice(1) command: user-mode priorities are the modern implementation of nice(1), and the command-line interface for setting user priorities is priocntl(1).

Figure 3.8 illustrates the global priority range and per-scheduling class user-priority range.

We should note that it is not required (or even recommended) that users, administrators, and developers apply user priorities as they put Solaris to work. The implementation supports them, but the dispatcher and underlying infrastructure are designed to work optimally without user-defined priorities being explicitly set.

3.7.1. Global Priorities

Global priorities refer to the numeric priority value assigned to every kernel thread on the system (the t_pri variable in the kthread_t); they are initially derived from the scheduling class of the thread issuing the thread_create() call. The global attribute means the priority value falls within a valid range of systemwide values, providing a scheme by which the highest-priority thread on the system can be determined simply by locating the thread with the highest numeric priority value relative to all other runnable threads on the system. In Solaris, larger values are better priorities.

The per-processor dispatch queues are arranged in a priority-ordered fashion, with a separate queue for each global priority on the system. We see from Figure 3.2 that there is one dispq_t for each priority. All threads at the same priority are placed on the same queue, implemented as a linked list of kernel threads. Threads are selected from the front of the per-priority queue but can be inserted at the front or back of the queue.

There are 170 global priorities: 0-169, with 0 being the lowest priority and 169 the highest (or best) priority. Priorities 160-169 are not actually scheduling priorities, but rather priority levels reserved exclusively for interrupt threads. However, if an interrupt thread blocks, it becomes a real, schedulable thread, with a priority of 159 + PIL. If the clock thread blocks (for example), it becomes a priority 169 thread, since the clock interrupt runs at PIL 10 (159 + 10 = 169).
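The 159 + PIL rule above can be expressed as a trivial helper. This is an illustration only, not a kernel function; the function name is invented.

```c
#include <assert.h>

/* Hypothetical helper: global priority of an interrupt thread that
 * blocks and becomes schedulable, per the 159 + PIL rule. */
static int blocked_intr_thread_pri(int pil)
{
    return 159 + pil;   /* PILs 1-10 map onto global priorities 160-169 */
}
```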

Priorities 100-159 are used exclusively by the real-time (RT) scheduling class. Priorities 60-99 are used exclusively for SYS class threads; core operating system kernel threads run in the SYS class. Last, priorities 0-59 are the global priority range shared by all the threads in the Timeshare (TS), Fixed (FX), Fair Share (FSS), and Interactive (IA) scheduling classes. This is shown in Figure 3.3.
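The range assignments above can be summarized in a small classifier. This is a hypothetical lookup for illustration (the function does not exist in the kernel), and it assumes the RT class is loaded.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical classifier: which scheduling classes own a given
 * global priority when the RT class is loaded. */
static const char *class_for_globpri(int pri)
{
    if (pri >= 160 && pri <= 169)
        return "interrupt";
    if (pri >= 100)
        return "RT";
    if (pri >= 60)
        return "SYS";
    if (pri >= 0)
        return "TS/IA/FX/FSS";
    return "invalid";
}
```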

The actual number of global priorities changes according to the presence or absence of the real-time scheduling class in the running system. By default, if a process or thread is not explicitly placed in the real-time class, the real-time class will not load into the kernel at boot time. If the real-time class is not loaded, the range is 0-109, with interrupts occupying the top ten priority levels, 100-109. Since the real-time class has a range of 60 priorities, once loaded, global priorities span 0-169. Interrupts remain the highest-priority scheduling events on the system, moving to 160-169 when the real-time class is loaded.

The global priority of a thread typically changes frequently over time (with the exception of FX class threads, and FSS class threads with 0 shares allocated), and a global priority change requires a change in the thread's position on a dispatch queue. As such, the priority change functions handle both the calculation and storage of the thread's new priority (the kthread_t t_pri field) and insert the thread into a new position on the dispatch queues.

3.7.2. User Priorities

User priorities warrant coverage here because they factor into the calculation of a thread's global priority every time a thread's priority is changed. Each scheduling class that supports user priorities has a predefined priority range, viewable with the priocntl(1) command.

# priocntl -l
CONFIGURED CLASSES
==================

SYS (System Class)

TS (Time Sharing)
        Configured TS User Priority Range: -60 through 60

FX (Fixed priority)
        Configured FX User Priority Range: 0 through 60

RT (Real Time)
        Maximum Configured RT Priority: 59

FSS (Fair Share)
        Configured FSS User Priority Range: -60 through 60


The intent is to provide users some level of control over the priority of their processes and threads, without allowing a user to directly set the global priority. The setting of a user priority has the net effect of changing the global priority of the target thread (or process), making it either better or worse, depending on the user value specified. Think of user priorities as a priority control knob that allows users to turn the priority up or down (better or worse). Here's a quick example.

# ps -Lc
   PID   LWP  CLS PRI TTY        LTIME CMD
 23359     1   TS  59 pts/2       0:00 sh
 23374     1   TS  59 pts/2       0:00 ps
# priocntl -s -c TS -i pid -p 0 $$
# ps -Lc
   PID   LWP  CLS PRI TTY        LTIME CMD
 23359     1   TS  49 pts/2       0:00 sh
 23376     1   TS  59 pts/2       0:00 ps
# priocntl -s -c TS -i pid -p -60 $$
# ps -Lc
   PID   LWP  CLS PRI TTY        LTIME CMD
 23359     1   TS   0 pts/2       0:00 sh
 23378     1   TS   0 pts/2       0:00 ps
# priocntl -s -c TS -i pid -p 60 $$
# ps -Lc
   PID   LWP  CLS PRI TTY        LTIME CMD
 23359     1   TS  59 pts/2       0:00 sh
 23380     1   TS  59 pts/2       0:00 ps


The example above uses priocntl(1) to tweak the priority of the shell process. It's a TS class process, at priority 59, the best global priority for TS class threads. We set the user priority to 0, which is in the middle of the TS range of -60 to 60. This command results in the shell's global priority getting slightly worse, going to 49 (from 59). We then turn the knob all the way down, setting the user priority to -60, the lowest possible value, which has the effect of dragging the shell's global priority down to 0, the lowest possible global priority for TS class threads. Last, we turn the priority knob in the other direction, setting a user-priority value of 60. This results in a large global priority boost for the target process, bringing the global priority from 0 to 59.

The key point here is that user priorities do not map directly to global priorities; note that the changed global priority in the example was not the same absolute value specified on the priocntl(1) command line. User priorities serve as an advice/request mechanism to the dispatcher to make the priority of the target thread or process either better or worse. The actual effect will not always be as extreme as the example. Note also that nonprivileged users cannot improve a priority; they can only move it in a negative (worse) direction. The ability to improve priority requires either root or the process-level PRIV_PROC_PRIOCNTL privilege (see privileges(5)).

3.7.3. Setting Thread Priorities

Thread priorities can change as a result of asynchronous events or at regular time intervals. Event-driven changes are asynchronous in nature; they include state transitions as a result of a blocking system call, a wakeup from sleep, a preemption, or expiration of the allotted time quantum. A user can generate a priority change event by changing a thread's user priority, its scheduling class, or both. Time-driven tick and update functions execute at regular intervals and typically result in changing the priority of threads. Changing a thread's priority varies in complexity depending on the scheduling class, with some substantial differences in implementation. There's a common component in the dispatch queue insertion functions, which happens (typically) as the last operation in a priority change. Figure 3.9 illustrates the flow.

Figure 3.9. Priority Change Flow


A thread's global priority is stored in the thread's kthread_t t_pri field. Kernel support for user priorities exists within the class-specific structures (xxproc_t), which include a upri field to store the user-specified priority value and a umdpri variable to store the derived user-mode priority for a thread (ts_umdpri, fss_umdpri, and ia_umdpri for their respective scheduling classes). The implementation details differ across the scheduling classes in terms of how the user priority determines how the umdpri field is set and how umdpri determines the global priority of the thread.
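The field relationships described above can be pictured with a simplified structure. This is a sketch only; it is not the full kernel tsproc_t definition, and the type name is invented.

```c
#include <assert.h>

/* Simplified per-class structure sketch matching the fields described
 * in the text; not the real kernel tsproc_t. */
typedef short pri_t;

typedef struct tsproc_sketch {
    pri_t ts_upri;    /* user-specified priority (from priocntl) */
    pri_t ts_umdpri;  /* derived user-mode priority */
} tsproc_sketch_t;
```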

3.7.3.1. Time-Based Class Functions

Two class-specific operations get called at regular time intervals: tick processing and update processing. Tick processing is handled through the class xx_tick() function and is called from the kernel clock interrupt handler, which executes 100 times a second, based on the default hz value of 100 (100 Hz = 1/100 = .010 seconds, or 10 milliseconds). Update processing is done for TS/IA and FSS class threads and is called through the kernel callout mechanism (timeout(9F)), using the class xx_update() code (the SYS, FX, and RT classes do not implement an update function).
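The hz arithmetic above checks out directly; this trivial helper (invented for illustration, not a kernel function) converts an hz value into the clock tick interval in seconds.

```c
#include <assert.h>

/* Hypothetical helper: clock tick interval in seconds for a given hz. */
static double tick_interval_seconds(int hz)
{
    return 1.0 / hz;   /* hz = 100 -> 0.010 s, i.e., 10 milliseconds */
}
```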

The tick and update functions perform very different tasks. Tick processing operates on all threads that are executing on a CPU (TS_ONPROC state) and handles updating the tick counter in the thread's xxproc_t structure to track execution time. Update processing operates on threads that are either sitting on a dispatch queue (TS_RUN) or sitting on a sleep queue (TS_SLEEP). The intention of the update function for TS/IA class threads is to track threads that have spent an inordinate amount of time on a queue and could use a priority boost to more quickly get back on a CPU. Priority adjustments are made if needed. Essentially, TS/IA update is a starvation avoidance mechanism. The update function for FSS has a very different role. Time spent waiting for a CPU is not tracked for FSS threads. It's not necessary since fair-share scheduling, by definition, ensures that threads get the CPU resources allocated to them. Update for FSS manages the adjustment and normalization of share usage and resets thread priorities accordingly.

Tick Processing. Tick processing is done for all threads, except those in the SYS class since the rules of CPU usage and time quanta do not apply to threads running at a kernel priority. Tick processing begins in the clock interrupt handler (common/os/clock.c), which includes code that executes a loop, checking every CPU on the system. The interesting part of the loop code that may call clock_tick() is shown here.

/*
 * If we haven't done tick processing for this
 * lwp, then do it now. Since we don't hold the
 * lwp down on a CPU it can migrate and show up
 * more than once, hence the lbolt check.
 *
 * Also, make sure that it's okay to perform the
 * tick processing before calling clock_tick.
 * Setting thread_away to a TRUE value (ie. not 0)
 * results in tick processing not being performed for
 * that thread.  Or, in other words, keeps the thread
 * away from clock_tick processing.
 */
thread_away = ((cp->cpu_flags & CPU_QUIESCED) ||
    CPU_ON_INTR(cp) || intr ||
    (cp->cpu_dispthread == cp->cpu_idle_thread) || exiting);
if ((!thread_away) && (lbolt - t->t_lbolt != 0)) {
        t->t_lbolt = lbolt;
        clock_tick(t);
}
                                            See usr/src/uts/common/os/clock.c


The clock_tick() function is called if the thread is due for tick processing, the CPU is not executing the idle thread, and the CPU is online (not quiesced) and not executing an interrupt thread. The class-specific tick function is called out of clock_tick() through the CL_TICK(t) macro. The work performed by the class-specific tick handler is to charge the thread with another tick of CPU time, check to see if the thread has used its time quantum, and, if it has, reprioritize the thread and force it to surrender the CPU. The details are covered in the per-class sections that follow. The following pseudocode is a generic representation of thread tick processing.

xx_tick()
        get threadlock
        get thread's xxproc_t structure
        increment thread's tick count
        if (the thread is not at a SYS priority)
                decrement thread's timeleft variable
                if (timeleft <= 0)      /* no time left - quantum used */
                        if (thread has preemption control enabled)
                                let it run unless it's been a while
                                return
                        set a new priority for the thread
                        reestablish the thread's position on a disp queue or sleep queue
                else (if the thread has not used its time quantum)
                        if (thread's priority < disp queue's highest-priority thread)
                                set the thread's flag to be placed on the back of the disp queue
                                surrender the CPU
        release the threadlock


Once the class-specific tick processing is completed, the code returns to clock_tick(), which performs a few additional tasks:

  • Updates user or system time used at the process and task level

  • Updates user-defined interval timers, if they exist, and sends a signal to the process if a timer has expired

  • Tests resource control limits on allocated CPU time at the process and task level

  • Updates memory usage for the currently running process

Once clock_tick() completes, it returns to the main loop in clock.c. Each CPU in the system has tick processing done on the CPU's current thread unless the conditions described at the beginning of this section rule it out, in which case the CPU is passed over and the next CPU is checked.

Update Processing. While tick processing is driven directly out of clock interrupts, update functions are driven indirectly out of clock interrupts, through the kernel's callout mechanism. Each scheduling class uses the kernel timeout(9F) function to place its update function on the kernel callout queue, resulting in ts_update() and fss_update() executing once per second. Callout queue processing is done in the clock interrupt handler (the callout_schedule() function).

Only the TS/IA and FSS classes implement update functions, and the two class-specific functions differ significantly in implementation. For TS/IA class threads, the amount of time a thread waits to use its time quantum is tracked through the ts_dispwait field in tsproc_t. The ts_update() function increments the ts_dispwait field for threads that are runnable (TS_RUN) or sleeping (TS_SLEEP) and will boost priorities as needed.

The FSS update function has a very different role, which is to update share usage and change priorities accordingly. The FSS class is significantly more complex due to the inherent nature of fair-share scheduling and the administrative framework required to implement the allocation of shares.

The implementation details are discussed in the following sections for the individual scheduling classes.

3.7.3.2. Timeshare Thread Priorities

This section covers both TS and IA class threads, since the majority of IA class work is done in the TS code. We start by covering user-priority setting, then get into setting the global priority.

The TS/IA class uses the kernel TS_NEWUMDPRI macro to set a user-mode priority.

#define TS_NEWUMDPRI(tspp) \
{ \
        pri_t pri; \
        pri = (tspp)->ts_cpupri + (tspp)->ts_upri + (tspp)->ts_boost; \
        if (pri > ts_maxumdpri) \
                (tspp)->ts_umdpri = ts_maxumdpri; \
        else if (pri < 0) \
                (tspp)->ts_umdpri = 0; \
        else \
                (tspp)->ts_umdpri = pri; \
        ASSERT((tspp)->ts_umdpri >= 0 && (tspp)->ts_umdpri <= ts_maxumdpri); \
}
                                            See usr/src/uts/common/disp/ts.c


Three components are involved in setting the ts_umdpri value: ts_cpupri, ts_upri, and ts_boost. ts_cpupri is the system- (kernel-) controlled component of the user-mode priority, ts_upri stores the actual user-specified value (from the priocntl(1) command line), and ts_boost is an IA-class specific variable for priority-boosting threads attached to active windows on desktops.
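The clamping behavior of TS_NEWUMDPRI can be re-expressed as a plain function for experimentation. This is an illustration under the assumption that ts_maxumdpri is 59 (the value implied by the default 60-row dispatch table); the function name is invented and ts_maxumdpri is passed in rather than read from a kernel global.

```c
#include <assert.h>

/* The TS_NEWUMDPRI clamping logic as a standalone function (sketch). */
static int ts_new_umdpri(int cpupri, int upri, int boost, int maxumdpri)
{
    int pri = cpupri + upri + boost;   /* sum the three components */

    if (pri > maxumdpri)               /* clamp to the valid range */
        return maxumdpri;
    if (pri < 0)
        return 0;
    return pri;
}
```

A user priority of -60 drags even a high ts_cpupri to the bottom of the range, matching the priocntl(1) example earlier in the section.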

The TS_NEWUMDPRI code is executed in many places throughout the TS functions: essentially whenever a priority change is required. As an example here, we assume that a user triggered an event by issuing a priocntl(1) command on a TS class thread to change the user priority. The ts_parmsset() function handles setting a user priority when a priocntl(1) command is issued. After permission and boundary testing, the interesting code is as shown below.

ts_parmsset()
. . .
        /*
         * Set ts_nice to the nice value corresponding to the user
         * priority we are setting.  Note that setting the nice field
         * of the parameter struct won't affect upri or nice.
         */
        nice = NZERO - (reqtsupri * NZERO) / ts_maxupri;
        if (nice >= 2 * NZERO)
                nice = 2 * NZERO - 1;
        thread_lock(tx);
        tspp->ts_uprilim = reqtsuprilim;
        tspp->ts_upri = reqtsupri;
        TS_NEWUMDPRI(tspp);
        tspp->ts_nice = nice;
        if ((tspp->ts_flags & TSKPRI) != 0) {
                thread_unlock(tx);
                return (0);
        }
        tspp->ts_dispwait = 0;
        ts_change_priority(tx, tspp);
. . .
                                            See usr/src/uts/common/disp/ts.c


A nice value is derived from the requested user priority, ts_uprilim (user-priority limit) and ts_upri are set according to command-line values, and TS_NEWUMDPRI() is executed to set ts_umdpri.

Setting the new ts_umdpri value is straightforward: sum the three component values and ensure that the new value falls within the maximum and minimum value boundaries. The ts_umdpri value is used subsequently when the thread's global priority is changed. Note that the default for these values is zero when a thread enters the TS class, unless user-defined values have been specified. On a thread create, the values are inherited from the parent thread.

Getting from a user priority to a new thread global priority is handled in ts_change_priority().

ts_change_priority()
. . .
        new_pri = ts_dptbl[tspp->ts_umdpri].ts_globpri;
        ASSERT(new_pri >= 0 && new_pri <= ts_maxglobpri);
        if (t == curthread || t->t_state == TS_ONPROC) {
                /* curthread is always onproc */
                cpu_t *cp = t->t_disp_queue->disp_cpu;
                THREAD_CHANGE_PRI(t, new_pri);
. . .
                                            See usr/src/uts/common/disp/ts.c


If the thread is running (TS_ONPROC), the new global priority is derived from the ts_globpri column of the TS dispatch table, indexed with ts_umdpri, and set with the THREAD_CHANGE_PRI macro. Otherwise, the class-independent thread_change_pri() function is called.

thread_change_pri(kthread_t *t, pri_t disp_pri, int front)
{
        state = t->t_state;
        /*
         * If it's not on a queue, change the priority with
         * impunity.
         */
        if ((state & (TS_SLEEP | TS_RUN)) == 0) {
                t->t_pri = disp_pri;
                if (state == TS_ONPROC) {
                        cpu_t *cp = t->t_disp_queue->disp_cpu;
                        if (t == cp->cpu_dispthread)
                                cp->cpu_dispatch_pri = DISP_PRIO(t);
                }
                return (0);
        }
                                        See usr/src/uts/common/disp/thread.c


The code does another test on the thread state, and if the thread is not on a queue, sets the thread's t_pri directly.

The bottom half of the function handles threads on a run queue or a sleep queue.

thread_change_pri(kthread_t *t, pri_t disp_pri, int front)
. . .
        /*
         * It's either on a sleep queue or a run queue.
         */
        if (state == TS_SLEEP) {
                /*
                 * If the priority has changed, take the thread out of
                 * its sleep queue and change the priority.
                 * Re-enqueue the thread.
                 * Each synchronization object exports a function
                 * to do this in an appropriate manner.
                 */
                if (disp_pri != t->t_pri)
                        SOBJ_CHANGE_PRI(t->t_sobj_ops, t, disp_pri);
        } else {
                /*
                 * The thread is on a run queue.
                 * Note: setbackdq() may not put the thread
                 * back on the same run queue where it originally
                 * resided.
                 *
                 * We still requeue the thread even if the priority
                 * is unchanged to preserve round-robin (and other)
                 * effects between threads of the same priority.
                 */
                on_rq = dispdeq(t);
                ASSERT(on_rq);
                t->t_pri = disp_pri;
                if (front) {
                        setfrontdq(t);
                } else {
                        setbackdq(t);
                }
                                        See usr/src/uts/common/disp/thread.c


For threads on a sleep queue, invoke the synchronization object-specific change priority macro (SOBJ_CHANGE_PRI) to handle changing the priority and managing the thread's position on a sleep queue. If the thread is on a run queue, dequeue the thread, set the priority, and queue the thread (with a different priority, the thread's queue position will change).

TS Tick Processing. ts_tick() tracks thread execution time with the ts_timeleft variable in tsproc_t. ts_timeleft is set to the time quantum (from the dispatch table) when the thread is switched on a CPU to begin execution. It is decremented in ts_tick(), and if ts_timeleft has reached zero, the thread's priority is reset from the dispatch table (the ts_tqexp value), and the CPU's user preemption flag (cp_runrun) is set to force a preemption. If the thread has been assigned a short-term SYS priority (the TSKPRI flag is set in ts_flags), the tick processing is not done on the thread (a thread will be assigned a SYS priority when the thread is holding a critical resource, such as a reader/writer lock or a memory page lock).
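The quantum accounting just described reduces to a decrement-and-test on each tick. The following is a minimal sketch of that logic only; the real ts_tick() also handles TSKPRI threads, preemption control, and the dispatch-table reprioritization, and the structure and function names here are invented.

```c
#include <assert.h>

/* Sketch of per-tick quantum accounting (not kernel code). */
struct quantum {
    int timeleft;   /* ticks remaining in the current time quantum */
    int expired;    /* set once the quantum is used up */
};

static void tick_sketch(struct quantum *q)
{
    if (--q->timeleft <= 0)
        q->expired = 1;  /* priority reset and cp_runrun preemption follow */
}
```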

In the case in which the thread has used its time quantum, the ts_tick() code tests to see if a scheduler activation has been turned on for the thread, in the form of preemption control (see the paragraph beginning "It is in the dispatcher queue insertion code" on page 262).

If preemption control has been turned on for the thread, it is allowed an extra couple of clock ticks to execute, no priority tweaks are done, and ts_tick() is finished with the thread. There is a limit to how many additional clock ticks a kthread with preemption control turned on will be given. If that limit has been exceeded, the kernel sets a flag such that the thread gets one more time slice and on the next pass through ts_tick(), the preemption control test fails and normal tick processing is done. In this way, the kernel does not allow the scheduler activation to keep the thread running indefinitely.

A thread priority adjustment from TS tick processing does a couple of extra steps in setting new values from the TS dispatch table. ts_cpupri is used as an index into the TS dispatch table and is assigned a new value based on ts_tqexp from the indexed location. The user-mode priority is calculated, ts_dispwait is set to 0, and a new dispatcher priority is derived from the TS/IA dispatch table. The new priority is based on the global priority value in the table row corresponding to ts_umdpri, which is used as the dispatch table array index. A call to thread_change_pri() follows. A change in a thread's priority may warrant a change in its position on a queue; thread_change_pri() handles such a case. In the fork return case, we are dealing with a new thread that has not yet been on a queue, so it's not an issue.
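The table-indexing step can be illustrated with a toy dispatch table. The real ts_dptbl has 60 rows and tunable values; the three rows and all numbers below are invented for demonstration, and only the indexing pattern is faithful to the text.

```c
#include <assert.h>

/* Toy three-row stand-in for the TS dispatch table (ts_dptbl). */
struct ts_dpent {
    int ts_globpri;  /* global priority for this row */
    int ts_tqexp;    /* new ts_cpupri when the quantum expires */
    int ts_lwait;    /* new ts_cpupri after waiting too long (update) */
};

static const struct ts_dpent dptbl[3] = {
    { 0, 0, 2 },
    { 1, 0, 2 },
    { 2, 1, 2 },
};

/* Quantum expiration: ts_cpupri is re-derived from the ts_tqexp column
 * of the row it currently indexes (typically a worse priority). */
static int cpupri_after_expire(int cpupri)
{
    return dptbl[cpupri].ts_tqexp;
}

/* ts_update() instead boosts a waiting thread via the ts_lwait column. */
static int cpupri_after_wait(int cpupri)
{
    return dptbl[cpupri].ts_lwait;
}
```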

TS Update Processing. The work of ts_update() is well documented in the source code:

/*
 * Update the ts_dispwait values of all time sharing threads that
 * are currently runnable at a user mode priority and bump the priority
 * if ts_dispwait exceeds ts_maxwait.  Called once per second via
 * timeout which we reset here.
 *
 * There are several lists of time sharing threads broken up by a hash on
 * the thread pointer.  Each list has its own lock.  This avoids blocking
 * all ts_enterclass, ts_fork, and ts_exitclass operations while ts_update
 * runs.  ts_update traverses each list in turn.
 *
 * If multiple threads have their priorities updated to the same value,
 * the system implicitly favors the one that is updated first (since it
 * winds up first on the run queue).  To avoid this unfairness, the
 * traversal of threads starts at the list indicated by a marker.  When
 * threads in more than one list have their priorities updated, the marker
 * is moved.  This changes the order the threads will be placed on the run
 * queue the next time ts_update is called and preserves fairness over the
 * long run.  The marker doesn't need to be protected by a lock since it's
 * only accessed by ts_update, which is inherently single-threaded (only
 * one instance can be running at a time).
 */
                                            See usr/src/uts/common/disp/ts.c


The actual priority tweaks are done in ts_update_list(), which is called by ts_update() to update a list of threads. The basic algorithm implemented in ts_update_list() is represented in the pseudocode flow below.

ts_update()
        set list from ts_plisthead[]    /* lists of tsproc structures */
        ts_update_list()
                while (not at the end of the current list)
                        if (thread is not in TS or IA class)
                                bail out
                        increment thread's dispwait
                        if (thread is at a SYS priority)
                                bail out
                        if (thread has preemption control turned on)
                                bail out
                        if (thread is not TS_RUN) AND
                            ((thread is not TS_SLEEP) OR (ts_sleep_promote is disabled))
                                set thread flags for post-trap processing
                                bail out
                        kthread->tsproc.ts_cpupri = ts_dptbl[ts_cpupri].ts_lwait
                        TS_NEWUMDPRI
                        kthread->tsproc.ts_dispwait = 0
                        if (the thread's global priority changed)
                                ts_change_priority()
                end loop


The actual priority change is handled with the same code previously described: TS_NEWUMDPRI to set the user-mode priority, and ts_change_priority() if the global priority is different.

3.7.3.3. Fair-Share Thread Priorities

The FSS class also implements a user-mode priority, fss_umdpri, that is an integral part of establishing a thread's global priority. The use of fss_umdpri as a knob available to users to make priority adjustments is consistent with its use in the TS/IA class as well. Unlike the TS class, the FSS class does not have a specific code path just for setting fss_umdpri. Rather, fss_umdpri updates are done through fss_newpri(), a function used whenever an FSS priority change is required. By default, when a thread is placed in the FSS class, fss_umdpri is set to 29 (fss_maxumdpri / 2), and when a thread is created from an FSS-class thread, the fss_umdpri and fss_upri are inherited from the parent thread.

FSS Tick Processing. FSS class tick processing does a bit more work than we saw in the TS example. That's due to the share-based priority mechanism and the integration with the Projects and Zones frameworks, which are required as the administrative model for share allocation. Threads in the FSS class are associated with a project, through the projects database (/etc/project), and the execution time of FSS class threads needs to be charged to the project the thread belongs to, in addition to the actual thread. This is accomplished by incrementing fssp_ticks in the project structure, in addition to the per-thread tick count (fss_ticks; see Figure 3.7). Aside from the project update, the work done in fss_tick() is essentially the same as with the other classes. If the target thread has used its time quantum, a new priority is set with fss_newpri(), which is covered in the next section.
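The dual accounting described above amounts to incrementing two counters per tick. This sketch uses simplified stand-in structures (the real fssproc_t and project structures carry many more fields), and the function name is invented.

```c
#include <assert.h>

/* Sketch of fss_tick()'s dual accounting: each tick is charged to the
 * thread and to the project it belongs to. */
struct fssproj_sketch { long fssp_ticks; };
struct fssproc_sketch {
    long fss_ticks;                   /* per-thread tick count */
    struct fssproj_sketch *fss_proj;  /* owning project */
};

static void fss_tick_sketch(struct fssproc_sketch *f)
{
    f->fss_ticks++;              /* charge the thread */
    f->fss_proj->fssp_ticks++;   /* charge the owning project as well */
}
```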

FSS Update Processing. The fss_update() work sets the stage for discussion of our next two topics: the concept of fair-share scheduling and the implementation of usage and shares management with respect to how it affects priority changes. A substantial amount of code and complexity in decay usage processing awaits us, but first we need to lay a foundation.

FSS is based on shares, but the dispatcher schedules threads based on their global priority (see FSS(7)), and the allocation of CPU shares is at the project or zone level, not the process or thread level (via the project.cpu-shares or zone.cpu-shares resource controls). Additionally, one or more projects can be configured within a zone. A project may have just one single-threaded process in it, or it may have many multithreaded processes. The actual number of processes and threads within a project does not factor into the usage measurement or adjustment mechanism. Thus, the FSS code must factor in the number of shares allocated and recent CPU utilization (shares consumed) within projects and zones in order to establish an FSS thread's new priority. Figure 3.10 provides the big picture.

Figure 3.10. Zones and Projects


A few points on Figure 3.10. First, the projects framework includes an abstraction called tasks, which are a subset of projects and a superset of processes and threads: a project can contain one or more tasks, and each task can encapsulate one or more processes. In the interest of space and simplicity, tasks are not shown (also, FSS share allocation is not done at the task level). Second, the objects shown in the figure may have resource allocations for CPU shares, and the shares may be charged to specific processor set configurations. That is, Projects A and B (for example) may both exist in a processor set, and the share allocation is based on the processors in the set, not on all the processors systemwide. These facilities are well documented in the System Administration Guide: Solaris Containers - Resource Management and Solaris Zones guide (http://docs.sun.com). The figure sets the context for the current discussion. A brief summary of the premise of fair-share scheduling and of what the decay usage component needs to accomplish will also help.

The FSS scheduler provides two levels of scheduling. At the top level, zones that compete with each other for the same CPU resources (that is, within the same processor set) are allocated CPU cycles based on the ratio between their shares and the total amount of zone shares. So, the actual number of shares assigned to each zone is not important, but the ratio between them is. If one zone has 5 shares and the only other zone has 10 shares, the first zone will receive 5/15th (or 1/3rd) of the CPU cycles, and the second will receive 10/15th (or 2/3rds). More important, assigning two zones 5 and 10 shares has the same effect as assigning them 10 and 20 shares instead.

At the level below that, projects are allocated their CPU cycles according to the ratio of their assigned shares to the total number of project shares within each zone. The CPU cycles assigned to the zone at the top level are, in turn, distributed among the projects in it if they all compete for the same CPU resources. Note that if there is only one project in a zone, the number of shares assigned to that project doesn't matter: it will get all the CPU cycles that were allocated to the zone. Similarly, if there is only one zone on the system, it will get all available CPU cycles no matter how many shares were assigned to it.

It is important to understand that shares only start to impact CPU allocation when projects or zones actually compete for the same CPU resources. For example, imagine two CPU-bound threads running in projects with different amounts of shares on a two-processor system. Since each thread can run on only one CPU at a time, these two threads do not actually compete for CPU cycles, and therefore the number of shares allocated to each of their projects does not matter here. CPU shares are not reservations. A project or a zone without actively running threads does not affect other projects or zones. When the total number of shares is calculated for each zone running on each processor set, only shares of zones and projects that have at least one actively running thread are counted.

The fair-share scheduler must implement a model by which usage can be tracked and decayed at regular time intervals. The term decay here means normalizing usage according to recent activity and allocated shares, keeping in mind that share allocation can change dynamically (a project, for example, can have its allocation increased from 20 to 50 shares between sampling periods). Simply put, the scheduler needs to calculate share usage over time, factoring actual usage with allocated shares and other share consumption in the project and zone.

In fss_decay_usage(), the usage adjustment is done according to the following formula:

                          fssps_shares^2     fssz_shares^2
    shusage  =  usage  x  --------------  x  -------------
                           kpj_shares^2      zone_shares^2

where the share usage (shusage) is derived from the decayed actual usage (usage), factored with the active shares in the processor set (fssps_shares) and the shares allocated to the project (kpj_shares). The zone's active (fssz_shares) and allocated (zone_shares) shares are factored in as well.

Getting back to fss_update(), the first step in the update process is a call to fss_decay_usage(), which manages usage updates for all projects. Before stepping through the actual code, we need to refer to some constants used in the calculations.

/*
 * Decay rate percentages are based on n/128 rather than n/100 so that
 * calculations can avoid having to do an integer divide by 100 (divide
 * by FSS_DECAY_BASE == 128 optimizes to an arithmetic shift).
 *
 * FSS_DECAY_MIN        =  83/128 ~= 65%
 * FSS_DECAY_MAX        = 108/128 ~= 85%
 * FSS_DECAY_USG        =  96/128 ~= 75%
 */
#define FSS_DECAY_MIN   83      /* fsspri decay pct for threads w/ nice -20 */
#define FSS_DECAY_MAX   108     /* fsspri decay pct for threads w/ nice +19 */
#define FSS_DECAY_USG   96      /* fssusage decay pct for projects */
#define FSS_DECAY_BASE  128     /* base for decay percentages above */

#define FSS_NICE_MIN    0
#define FSS_NICE_MAX    (2 * NZERO - 1)
#define FSS_NICE_RANGE  (FSS_NICE_MAX - FSS_NICE_MIN + 1)

static int      fss_nice_tick[FSS_NICE_RANGE];
static int      fss_nice_decay[FSS_NICE_RANGE];
                                                      See usr/src/uts/common/disp/fss.c


The FSS_DECAY_MIN and FSS_DECAY_MAX constants are used when the FSS class is first initialized, to seed values in the fss_nice_tick[] and fss_nice_decay[] arrays (more on these arrays in a moment). FSS_DECAY_USG is used in the usage decay function to calculate the decayed usage.

First, let's take a look at how the decayed usage is derived.

/*
 * Decay usage for each project running on
 * this cpu partition.
 */
fssproj->fssp_usage =
    (fssproj->fssp_usage * FSS_DECAY_USG) /
    FSS_DECAY_BASE + fssproj->fssp_ticks;
fssproj->fssp_ticks = 0;
                                                      See usr/src/uts/common/disp/fss.c


The project's fssp_usage decay is based on its current value, the decay constants (a rate of 75%), and the number of ticks used by the project, that is, the actual CPU ticks used (see page 212). Even though fssp_usage is stored as a 64-bit value, decaying is necessary to avoid possible integer overflows and to track CPU usage history over a short period; the speed of decay determines the length of that period. Floating-point operations are generally not permitted in the kernel (mostly for performance reasons), so the scheduler uses large integer values (note that the project's fssp_ticks gets charged almost 1000 points for each clock tick) to get reasonable precision at very low cost. To further increase the performance of this code, the integer divisions used for decaying are optimized by the compiler into simple arithmetic shifts because the decay base was carefully chosen: FSS_DECAY_BASE is 128, a power of two.

The next step is to determine the number of actual shares allocated, in case it changed.

/*
 * Readjust our number of shares if it has
 * changed since we checked it last time.
 */
kpj_shares = fssproj->fssp_proj->kpj_shares;
if ((fssproj->fssp_shares != kpj_shares) &&
    (fssproj->fssp_runnable != 0)) {
        fsszone->fssz_shares -= fssproj->fssp_shares;
        fssproj->fssp_shares = kpj_shares;
        fsszone->fssz_shares += kpj_shares;
}
                                                      See usr/src/uts/common/disp/fss.c


In the above code segment, the current share allocation is saved (kpj_shares, from the kproject_t, which is linked to the fssproj_t). If the share allocation changed and there are runnable threads in the project (meaning it's active), the share values are adjusted at the zone and project level. Note that a similar code segment follows in the source, applying the same algorithm: the first case covers a change in a project's shares within a zone, and the second covers a change in the zone's own share allocation.

With the decayed usage and share calculations done, the normalized share usage can be completed.

fssproj->fssp_shusage =
    (fssproj->fssp_usage *
     fsspset->fssps_shares *
     fsspset->fssps_shares *
     fsszone->fssz_shares *
     fsszone->fssz_shares) /
    (kpj_shares * kpj_shares *
     zone_shares * zone_shares);
                                                      See usr/src/uts/common/disp/fss.c


The code segment above is the implementation of the formula shown previously; doing the math with sample values is left as an exercise for the reader.

The fss_decay_usage() algorithm is summarized in the pseudocode below.

fss_decay_usage()
    for (every CPU)
        fsspset = pset                  /* set the pset for the CPU */
        if (there's a partition)
            if (there are projects)
                decay the max FSS priority for the partition
                for (every project in the partition)
                    decay project usage based on accumulated project ticks
                    reset project tick count to zero
                    set the zone object pointers
                    set the allocated share value (in case it changed)
                    if (project allocated shares changed) AND (runnable threads in the project)
                        readjust the number of shares
                    if (zone allocated shares changed) AND (runnable threads in zone)
                        readjust number of shares in the zone
                    calculate normalized share value to be used for fsspri increments


fss_decay_usage() returns to the update function after completing the task of looping through all CPUs, partitions, and zones and updating share usage accordingly. The remaining work required in the update function is to make the actual thread priority adjustments, which happens in fss_update_list(). Looping through a partial list (similar to ts_update()), the code runs the same tests to ensure that the thread is in the FSS class and not currently at a SYS priority. Assuming a nonzero number of shares, the fsspri (priority) value is decayed. If the thread does not have a preemption control enabled and is in the TS_RUN state, fss_newpri() is called to set the thread's new priority.

All the dots get connected in fss_newpri() and fss_change_priority(): the decayed fsspri and the normalized share usage come together as part of the priority calculation process. The decayed fsspri value set in fss_update_list() is the fss_fsspri variable in the thread's fssproc_t and represents the internal FSS priority, not the thread's actual CPU priority, which is t_pri in the thread structure. In fss_newpri(), the fsspri value is readjusted according to the normalized share usage (shusage), the number of runnable threads in the project, and the current tick value of the thread.

/*
 * fsspri += shusage * nrunnable * ticks
 */
ticks = fssproc->fss_ticks;
fssproc->fss_ticks = 0;
fsspri = fssproc->fss_fsspri;
fsspri += fssproj->fssp_shusage * fssproj->fssp_runnable * ticks;
fssproc->fss_fsspri = fsspri;
                                                      See usr/src/uts/common/disp/fss.c


With an updated fsspri value, a new user-mode priority, fss_umdpri, is set by the code segment below.

/*
 * The general priority formula:
 *
 *                      (fsspri * umdprirange)
 *   pri = maxumdpri - ------------------------
 *                              maxfsspri
 *
 * If this thread's fsspri is greater than the previous largest
 * fsspri, then record it as the new high and priority for this
 * thread will be one (the lowest priority assigned to a thread
 * that has non-zero shares).
 * Note that this formula cannot produce out of bounds priority
 * values; if it is changed, additional checks may need to be
 * added.
 */
maxfsspri = fsspset->fssps_maxfsspri;
if (fsspri >= maxfsspri) {
        fsspset->fssps_maxfsspri = fsspri;
        disp_lock_exit_high(&fsspset->fssps_displock);
        fssproc->fss_umdpri = 1;
} else {
        disp_lock_exit_high(&fsspset->fssps_displock);
        invpri = (fsspri * (fss_maxumdpri - 1)) / maxfsspri;
        fssproc->fss_umdpri = fss_maxumdpri - invpri;
}
                                                      See usr/src/uts/common/disp/fss.c


Note that the real dispatcher priority of the thread, t_pri, is calculated by reverse-quantizing the internal FSS priority, fss_fsspri, into the range from 1 to fss_maxumdpri. The quantization works such that lower values of fss_fsspri map to higher dispatcher priorities, and vice versa. The lowest dispatcher priority of 0 is reserved for threads that run in zones and projects with zero shares. This allows administrators to easily run noncritical jobs in the background when other projects or zones with nonzero shares are not using all the available CPU cycles.

The code resets the maximum FSS priority, maxfsspri, if the new value is larger than the previous value, and finally sets the user-mode priority in the thread's fssproc_t. With the user-mode priority set, the code returns to fss_update_list() and calls fss_change_priority(), shown here.

fss_change_priority(kthread_t *t, fssproc_t *fssproc)
{
        pri_t new_pri;

        ASSERT(THREAD_LOCK_HELD(t));
        new_pri = fssproc->fss_umdpri;
        ASSERT(new_pri >= 0 && new_pri <= fss_maxglobpri);

        fssproc->fss_flags &= ~FSSRESTORE;
        if (t == curthread || t->t_state == TS_ONPROC) {
                /*
                 * curthread is always onproc
                 */
                cpu_t *cp = t->t_disp_queue->disp_cpu;
                THREAD_CHANGE_PRI(t, new_pri);
                if (t == cp->cpu_dispthread)
                        cp->cpu_dispatch_pri = DISP_PRIO(t);
                if (DISP_MUST_SURRENDER(t)) {
                        fssproc->fss_flags |= FSSBACKQ;
                        cpu_surrender(t);
                } else {
                        fssproc->fss_timeleft = fss_quantum;
                }
        } else {
                /*
                 * When the priority of a thread is changed, it may be
                 * necessary to adjust its position on a sleep queue or
                 * dispatch queue.  The function thread_change_pri
                 * accomplishes this.
                 */
                if (thread_change_pri(t, new_pri, 0)) {
                        /*
                         * The thread was on a run queue.
                         */
                        fssproc->fss_timeleft = fss_quantum;
                } else {
                        fssproc->fss_flags |= FSSBACKQ;
                }
        }
}
                                                      See usr/src/uts/common/disp/fss.c


The procedure is algorithmically very similar to the TS class equivalent: if the thread is running, THREAD_CHANGE_PRI sets the new t_pri value; otherwise, thread_change_pri() is called, and the time quantum is reset if the thread was on a dispatch queue. If the thread was on a sleep queue, the flag is set to instruct the dispatcher queue function to insert the thread at the back of the appropriate queue.

3.7.3.4. Fixed-Priority Thread Priorities

The FX class is a convenient class to use when you want to keep a process or thread at the same priority throughout its execution and not have the system change the priority over time. The FX priority range is somewhat unique in that it defines 61 priority levels (0-60), as opposed to the other classes (except SYS), which define 60 priority levels (0-59). As such, an FX class thread can be placed at global priority 60 while remaining in the FX class; typically, a thread at priority 60 has been promoted to the SYS class for a short duration. Here's a snapshot of the FX dispatch table with most of the lines deleted for space.

# Fixed Priority Dispatcher Configuration
RES=1000

# TIME QUANTUM                    PRIORITY
# (fx_quantum)                      LEVEL
       200                    #        0
. . .
       160                    #       10
. . .
       120                    #       20
. . .
        80                    #       30
. . .
        40                    #       40
. . .
        40                    #       50
. . .
        20                    #       59
        20                    #       60


The default table implements a descending quantum allocation scheme, in which the time quantum goes down as the priority of the thread goes up. Threads at priorities 0-9 get a 200-millisecond quantum, which drops to 160 milliseconds for priorities 10-19, and so on, down to 20 milliseconds for the highest-priority threads. Note the presence of 61 priority levels (0-60).

The FX class implements user-mode priorities differently than we've seen in the previous examples. The fxproc_t does not include an fx_umdpri variable; the fx_pri field is used to store user priorities, and these translate directly to global priorities. That is, using priocntl(1) to set a process or thread to FX priority 30 (for example) results in a global priority of 30. And because this is a fixed-priority class, the priority remains 30 throughout the execution of the thread unless it is explicitly changed by a user.

A quick note for readers who will be reading the Solaris source code or taking advantage of OpenSolaris and developing kernel software: The FX source and header files include a callback mechanism. The FX callback functionality was added to support a specific OEM some time ago. It is not used by any bundled Solaris software, and use of the callback feature requires a header file that is not included in the standard Solaris distribution. We plan to remove the FX callback framework in the near future, so we do not discuss it here.

FX Tick Processing. Tick processing for FX class threads is consistent with our previous examples: Decrement the counter for the active thread, charging it for another tick of CPU use. If the thread has used its time quantum, force the thread off the CPU and queue it on the appropriate dispatcher queue. A check for an enabled preemption control is also done in fx_tick().

fx_tick()
. . .
                new_pri = fx_dptbl[fxpp->fx_pri].fx_globpri;
                ASSERT(new_pri >= 0 && new_pri <= fx_maxglobpri);
                /*
                 * When the priority of a thread is changed,
                 * it may be necessary to adjust its position
                 * on a sleep queue or dispatch queue. Even
                 * when the priority is not changed, we need
                 * to preserve round robin on dispatch queue.
                 * The function thread_change_pri accomplishes
                 * this.
                 */
                if (thread_change_pri(t, new_pri, 0)) {
                        fxpp->fx_timeleft = fxpp->fx_pquantum;
                } else {
                        fxpp->fx_flags |= FXBACKQ;
                        cpu_surrender(t);
                }
        } else if (t->t_pri < t->t_disp_queue->disp_maxrunpri) {
                fxpp->fx_flags |= FXBACKQ;
                cpu_surrender(t);
        }
. . .
                                                       See usr/src/uts/common/disp/fx.c


If the thread has used its time quantum, a new_pri value is set from the FX dispatch table, with the fxproc_t fx_pri value used as an index. With the FX class, priorities are not changed by the systemthe fx_pri value equates to the thread's global priority, t_pri, and the new global priority from the dispatch table is the same as the existing priority, as long as a user has not explicitly changed the priority. If a user issued a priority change, the fx_pri field reflects the user input value, indexing into a different row in the dispatch table, resulting in a priority change. thread_change_pri() sets the thread's t_pri field and calls into the appropriate subsystem to queue the thread if it's on a run queue or sleep queue.

The FX class does not implement an update function.

3.7.3.5. Real-Time Thread Priorities

Real-time applications require a system that can provide dispatch latency that is fast, bounded, and consistent. Dispatch latency refers to the amount of time that elapses from when a thread becomes runnable to when it is context-switched onto a processor: from runnable to running. Solaris enables rapid context switching for real-time threads through several features.

  • Global priority placement. Real-time threads are the highest-priority threads on the system. Only interrupt threads have priority over real-time threads. Processor control mechanisms can be enabled to keep interrupt threads off processors running real-time threads.

  • Kernel preempt dispatch queue. Real-time threads are managed on a separate dispatch queue from other class threads.

  • Kernel preemption. When a real-time thread becomes runnable, a kernel preemption is triggered, forcing the CPU to switch off its current thread and switch on the real-time thread (see Section 3.9).

Real-time class threads run at one of 60 priorities, 0-59, which translate to global priorities 100-159. Like the FX class, the real-time class does not implement an rt_umdpri variable in support of user-mode priorities. The rt_pri field in rtproc_t stores a user-defined priority, which is used as an index into the RT dispatch table to set the global priority. A user-supplied RT priority of 0 results in a global priority of 100, user priority 1 yields global priority 101, and so on. Like FX, RT is a fixed-priority class: the kernel will not change the priority of a real-time thread over time unless a change is initiated by a user event, such as a priocntl(1) command or a priocntl(2) system call.

Real-Time Tick Processing. Tick processing for real-time threads is consistent with previous examples. The implementation is much simpler because it is not necessary to test for preemption controls or a SYS priority: real-time priorities are already higher than SYS priorities, and preemption controls would be superfluous, since real-time threads get the processor and keep it unless a higher-priority real-time thread comes along or an interrupt needs to be processed.

/*
 * Check for time slice expiration (unless thread has infinite time
 * slice).  If time slice has expired arrange for thread to be preempted
 * and placed on back of queue.
 */
static void
rt_tick(kthread_t *t)
{
        rtproc_t *rtpp = (rtproc_t *)(t->t_cldata);

        ASSERT(MUTEX_HELD(&(ttoproc(t))->p_lock));
        thread_lock(t);
        if ((rtpp->rt_pquantum != RT_TQINF && --rtpp->rt_timeleft == 0) ||
            (DISP_MUST_SURRENDER(t))) {
                if (rtpp->rt_timeleft == 0 && rtpp->rt_tqsignal) {
                        thread_unlock(t);
                        sigtoproc(ttoproc(t), t, rtpp->rt_tqsignal);
                        thread_lock(t);
                }
                rtpp->rt_flags |= RTBACKQ;
                cpu_surrender(t);
        }
        thread_unlock(t);
}
                                                       See usr/src/uts/common/disp/rt.c


The code tests for an infinite time quantum, defined as RT_TQINF. The RT dispatch table can be modified to establish infinite time quantums if required; the FX class provides this capability as well. Also, the priocntl(1) command and priocntl(2) system call let us set the time quantum of an RT or FX class thread; the specifier RT_TQINF (or FX_TQINF for FX class threads) establishes an infinite time quantum. If the RT thread does not have an infinite quantum and has used its allotted quantum (after rt_timeleft is decremented), or if DISP_MUST_SURRENDER() is true, the thread must give up the CPU. Let's quickly look at what this macro expands to.

#define DISP_MUST_SURRENDER(t)                          \
        ((DISP_MAXRUNPRI(t) > DISP_PRIO(t)) ||          \
        (CP_MAXRUNPRI(t->t_cpupart) > DISP_PRIO(t)))
. . .
#define CP_MAXRUNPRI(cp)        ((cp)->cp_kp_queue.disp_maxrunpri)
. . .
/*
 * Macro for use by scheduling classes to decide whether the thread is about
 * to be scheduled or not.  This returns the maximum run priority.
 */
#define DISP_MAXRUNPRI(t)       ((t)->t_disp_queue->disp_maxrunpri)
. . .
/* The dispatch priority of a thread */
#define DISP_PRIO(t)    ((t)->t_epri > (t)->t_pri ? (t)->t_epri : (t)->t_pri)
                                 See usr/src/uts/common/sys/cpupart.h, thread.h, disp.h


The code segment shows the embedded macros as well, to facilitate walking through DISP_MUST_SURRENDER(). The purpose here is to determine if there is a higher-priority thread runnable, which would force the current thread to yield the CPU. This requires testing the thread's priority (DISP_PRIO) against the maxrunpri of the thread's current queue (DISP_MAXRUNPRI) and the CPU partition's kp_queue maxrunpri (CP_MAXRUNPRI).

Getting back to rt_tick(), if a higher-priority thread is runnable or one of the other conditions previously described is true, another test determines whether a signal was set for RT time quantum expiration. This feature is unique to the RT class, where it may be desirable for an application to be notified when its RT threads use up their time quanta. priocntl(1) implements a -t flag that can be used on the command line with RT class threads to set a signal number, which is stored in the rt_tqsignal field in rtproc_t. If a signal has been set, the kernel sigtoproc() function sends the signal. In either case, the code then sets RTBACKQ in rt_flags to instruct the dispatcher to queue this thread at the back of the queue and calls cpu_surrender(). The cpu_surrender() code is discussed in Section 3.8.

The RT class does not implement an update function.

3.7.3.6. Monitoring Thread Priorities

The easiest way to monitor thread priorities is with the prstat(1) command, which by default displays the priority in the PRI column. prstat(1) with the -L flag provides a row for every thread in each process. To determine the scheduling class, use the ps(1) command with the -c flag. A ps -ec command lists all the processes on the system with a CLS and PRI column, displaying the scheduling class and priority, respectively.

To track the various priority-related fields of a thread, run the following DTrace script, which takes a process name as a command-line argument and displays various priority fields for the active thread.

#!/usr/sbin/dtrace -qs

profile-5sec
/ execname == $$1 /
{
        self->cid    = curthread->t_cid;
        self->pri    = curthread->t_pri;
        self->epri   = curthread->t_epri;
        self->cpupri = ((tsproc_t *)curthread->t_cldata)->ts_cpupri;
        self->upril  = ((tsproc_t *)curthread->t_cldata)->ts_uprilim;
        self->upri   = ((tsproc_t *)curthread->t_cldata)->ts_upri;
        self->umdpri = ((tsproc_t *)curthread->t_cldata)->ts_umdpri;
        self->nice   = ((tsproc_t *)curthread->t_cldata)->ts_nice;
        printf("PID: %d, TID: %d, CID: %d, PRI: %d CPUPRI: %d UMDPRI: %d UPRI: %d\n",
            pid, tid, self->cid, self->pri, self->cpupri, self->umdpri, self->upri);
}


Here is an example, running the script on a process called threads:

# ./pri.d threads
PID: 5053, TID: 218149, CID: 1, PRI: 47, CPUPRI: 59, UMDPRI: 47
PID: 5053, TID: 218670, CID: 1, PRI: 47, CPUPRI: 59, UMDPRI: 47
PID: 5053, TID: 219148, CID: 1, PRI: 47, CPUPRI: 59, UMDPRI: 47


Most of the script output is (hopefully!) clear at this point. The CID is the scheduling class ID; 1 is the TS class.

DTrace implements a sched provider that manages several probes for tracking dispatcher activity. Several change-pri probes are implemented in the various scheduling-class-specific functions that initiate a priority change (yield, sleep, wakeup, preempt, and setrun).

# dtrace -l -n change-pri
   ID   PROVIDER            MODULE                           FUNCTION NAME
 1834      sched           genunix                  thread_change_pri change-pri
 1876      sched                TS                 ts_change_priority change-pri
 1877      sched                TS                           ts_yield change-pri
 1878      sched                TS                          ts_wakeup change-pri
 1879      sched                TS                         ts_trapret change-pri
 1880      sched                TS                           ts_sleep change-pri
 1881      sched                TS                          ts_setrun change-pri
 1882      sched                TS                         ts_preempt change-pri
 2554      sched                FX                 fx_change_priority change-pri
 2555      sched                FX                           fx_yield change-pri
 2556      sched                FX                          fx_wakeup change-pri
 2557      sched                FX                         fx_preempt change-pri
 2578      sched                RT                 rt_change_priority change-pri
 2583      sched               FSS                         fss_wakeup change-pri
 2584      sched               FSS                          fss_sleep change-pri
 2585      sched               FSS                         fss_setrun change-pri
 2586      sched               FSS                        fss_preempt change-pri
 2587      sched               FSS                        fss_trapret change-pri
 2588      sched               FSS                fss_change_priority change-pri


Here's a simple script that enables all the change-pri probes, uses the count() aggregating function, and keys the aggregation on the probe function name, the name of the executable, the thread ID, and the thread priority.

#!/usr/sbin/dtrace -qs

change-pri
{
        @[probefunc, execname, tid, curthread->t_pri] = count();
}

END
{
        printf("%-16s %-16s %-8s %-8s %-8s\n", "FUNC", "EXEC", "TID", "PRI", "CNT");
        printa("%-16s %-16s %-8d %-8d %-@8d\n", @);
}

solaris10> ./cpri.d
^C
FUNC             EXEC             TID      PRI      CNT
ts_preempt       threads          34923    47       1
. . .
ts_preempt       threads          35347    37       283
ts_yield         threads          35347    59       283
ts_preempt       threads          35343    37       294
ts_yield         threads          35343    59       294
ts_preempt       threads          35339    37       303
ts_yield         threads          35339    59       303
ts_preempt       threads          35357    37       338
ts_yield         threads          35357    59       338
ts_preempt       threads          35355    37       427
ts_yield         threads          35355    59       427
thread_change_pri sched               0   169       433


In this example, our handy threads program is active and having its priorities changed through preempt and yield functions. Using the probe function makes it easy to see the scheduling class of the threads running when the probe fires. Since we see a lot of preemptions here, it will be interesting to see which threads are getting preempted and which are causing the preemptions. As it happens, /usr/demo/dtrace/whopreempt.d gives us exactly this information.

# dtrace -s ./whopreempt.d
^C
                     PREEMPTOR PRI                     PREEMPTED PRI      #
                          Xorg  59                       threads  27      1
                          Xorg  59                       threads  59      1
. . .
                         sched  99                       threads  47     92
                          Xorg  59                       threads  47    107
                         sched  99                       threads  37    150
                       threads  47                       threads  27    166
                       threads  47              xscreensaver-loc  32   7615
                       threads  47                       threads  37   9638
                       threads  47              xscreensaver-loc  42  13287
                       threads  37              xscreensaver-loc  32  13599


Based on the DTrace output, we can see that the process generating the most preemptions is our threads workload. The PRI columns show the higher priority of the preemptor threads over the threads getting preempted, so we see user preemption in action. If the PRI of the preemptor thread falls in the 100-159 range, we know that it's a real-time thread and that it caused a kernel preemption.

These examples barely scratch the surface of what can be observed and understood regarding dispatcher behavior with the sched provider. We encourage you to explore the endless possibilities DTrace offers.




Solaris Internals: Solaris 10 and OpenSolaris Kernel Architecture (2nd Edition)
ISBN: 0131482092