Section 3.1. Fundamentals | Solaris Internals: Solaris 10 and OpenSolaris Kernel Architecture (2nd Edition)

3.1. Fundamentals

The kernel dispatcher is the code that places runnable threads on a dispatch queue (run queue), selects the next thread to run on a processor, and manages the switching of threads on and off processors. A thread's priority determines how soon it will run, and the kernel implements a global priority scheme that, at any time, selects the highest-priority thread from among all runnable threads. Every thread belongs to one of several possible scheduling classes; the class determines the range of priorities for the thread, as well as which class-specific scheduling algorithms are applied as the thread goes through its state transitions.

By and large, a thread spends its life cycle in the ready-to-run (RUN) state, the running (ONPROC) state, and the waiting-for-an-event (SLEEP) state. A thread's transitions between these states are managed largely by the dispatcher. The PINNED and IDL states in Figure 3.1 are represented in shaded circles because they are not technically thread states. The states are defined as follows.

/*
 * Values that t_state may assume. Note that t_state cannot have more
 * than one of these flags set at a time.
 */
#define TS_FREE         0x00    /* Thread at loose ends */
#define TS_SLEEP        0x01    /* Awaiting an event */
#define TS_RUN          0x02    /* Runnable, but not yet on a processor */
#define TS_ONPROC       0x04    /* Thread is being run on a processor */
#define TS_ZOMB         0x08    /* Thread has died but hasn't been reaped */
#define TS_STOPPED      0x10    /* Stopped, initial state */
                                        See usr/src/uts/common/sys/thread.h
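The note that t_state holds exactly one of these values at a time can be illustrated with a small lookup. This is a minimal sketch: the constants are copied from the listing above, while ts_state_name() is a hypothetical helper written for this example and is not a kernel function.

```c
#include <stddef.h>

/* Values copied from the sys/thread.h listing above. */
#define TS_FREE    0x00
#define TS_SLEEP   0x01
#define TS_RUN     0x02
#define TS_ONPROC  0x04
#define TS_ZOMB    0x08
#define TS_STOPPED 0x10

/*
 * Hypothetical helper (not part of the kernel): return a printable
 * name for a t_state value, or NULL if the value is not exactly one
 * of the defined states.
 */
static const char *
ts_state_name(unsigned int state)
{
    switch (state) {
    case TS_FREE:    return "FREE";
    case TS_SLEEP:   return "SLEEP";
    case TS_RUN:     return "RUN";
    case TS_ONPROC:  return "ONPROC";
    case TS_ZOMB:    return "ZOMB";
    case TS_STOPPED: return "STOPPED";
    default:         return NULL;  /* several flags combined, or an unknown bit */
    }
}
```

Because the values are distinct bits, a combination such as TS_RUN | TS_SLEEP matches no case and is rejected, which mirrors the rule stated in the header comment.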

IDL is a process state set when a process is created. A thread that is in the ONPROC state is pinned when the processor on which the thread is executing fields an interrupt. The processor switches to running an interrupt thread, temporarily moving aside (pinning) the thread that was running. This is discussed in Section 3.11.

As Figure 3.1 suggests, the core of the dispatcher's work can be described as a queue management system. All threads in the RUN state reside on dispatch queues, and all threads in the SLEEP state reside on a sleep queue. The available processors on the system can also be thought of as a queue of resources (execution resources in this case). Thus, we can summarize the core functions of the dispatcher as follows:

Queue management. Insert and remove threads from the dispatch queues.
Thread selection. Determine which thread among all runnable threads will next execute on a processor.
Processor selection. Choose the processor on which a thread will run. The dispatcher needs to make this choice only in some instances.
Context switching. Place a thread on a processor in preparation for execution (switch on) or remove a thread from a processor (switch off). This is referred to as a context switch because the processor sees one thread leave and another arrive, so the execution context changes.
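The queue management and thread selection functions in the list above can be sketched with one FIFO queue per priority level. This is a simplified illustration, not the kernel's data structures: the sk_ names are hypothetical, a single global set of queues is assumed rather than per-processor queues, and the count of 170 priority levels is an assumption made for the example.

```c
#include <stddef.h>

#define SK_NPRI 170  /* ASSUMED number of global priority levels */

typedef struct sk_thread {
    int pri;                 /* global priority, 0..SK_NPRI-1 */
    struct sk_thread *next;  /* link to the next thread on the same queue */
} sk_thread_t;

typedef struct sk_dispq {
    sk_thread_t *head;       /* first thread at this priority */
    sk_thread_t *tail;       /* last thread at this priority */
} sk_dispq_t;

static sk_dispq_t sk_queues[SK_NPRI];  /* one queue per priority level */

/* Queue management: append a runnable thread to the queue for its priority. */
static void
sk_enqueue(sk_thread_t *t)
{
    sk_dispq_t *q = &sk_queues[t->pri];

    t->next = NULL;
    if (q->tail != NULL)
        q->tail->next = t;
    else
        q->head = t;
    q->tail = t;
}

/* Thread selection: remove and return the highest-priority runnable thread. */
static sk_thread_t *
sk_select(void)
{
    for (int pri = SK_NPRI - 1; pri >= 0; pri--) {
        sk_dispq_t *q = &sk_queues[pri];

        if (q->head != NULL) {
            sk_thread_t *t = q->head;

            q->head = t->next;
            if (q->head == NULL)
                q->tail = NULL;
            t->next = NULL;
            return t;
        }
    }
    return NULL;  /* nothing runnable */
}
```

Threads at the same priority leave the queue in the order they arrived, and a higher-priority thread is always selected ahead of a lower-priority one regardless of arrival order.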

Figure 3.1. Thread States

Scheduling decisions and actions taken by the dispatcher code are either time based or event based. That is, some dispatcher functions occur synchronously at regular intervals, while others are asynchronous, originating at unpredictable times while the system is running.

The time-based work is driven by the kernel clock interrupt mechanism and the callout facility. By default, a clock interrupt occurs 100 times per second (every 10 milliseconds). The clock interrupt handler examines the running threads and determines whether their time quanta have expired. The kernel callout queue also holds dispatcher kernel threads that execute at regular intervals.
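The tick arithmetic behind this can be made concrete with a short sketch. The 100 Hz rate comes from the text; the sk_ names, the per-thread tick counter, and the round-up conversion are assumptions made for illustration and do not reproduce the kernel's clock handler.

```c
#define SK_HZ       100              /* clock interrupts per second (default) */
#define SK_TICK_MS  (1000 / SK_HZ)   /* 10 milliseconds per tick */

/* Hypothetical per-thread accounting for this example only. */
typedef struct sk_run {
    int ticks_left;  /* remaining time quantum, in clock ticks */
} sk_run_t;

/* Convert a quantum given in milliseconds to clock ticks, rounding up. */
static int
sk_ms_to_ticks(int ms)
{
    return (ms + SK_TICK_MS - 1) / SK_TICK_MS;
}

/*
 * Called once per clock tick for a thread that is on a processor.
 * Returns 1 when the thread's time quantum has expired, 0 otherwise.
 */
static int
sk_tick(sk_run_t *r)
{
    if (r->ticks_left > 0)
        r->ticks_left--;
    return r->ticks_left == 0;
}
```

Under these assumptions, a 200-millisecond quantum lasts 20 ticks, and a quantum that is not a multiple of 10 milliseconds is rounded up to the next whole tick.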

Events of interest to the dispatcher originate from many places: the creation of a new thread, thread wakeups, etc. Such events may require a thread preemption, which forces the dispatcher to remove a thread running on a processor to make the processor available to run a higher-priority thread.
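The preemption decision described here reduces to a priority comparison. The following is a minimal sketch under stated assumptions: sk_cpu_t and its fields are hypothetical, and a real dispatcher would also take locks and notify the target processor, which is omitted.

```c
/* Hypothetical view of a processor, for this example only. */
typedef struct sk_cpu {
    int running_pri;   /* global priority of the thread now on the processor */
    int need_resched;  /* set when the running thread should be switched off */
} sk_cpu_t;

/*
 * Event-based check: a thread with priority new_pri has just become
 * runnable. Flag the processor for preemption only if the new thread
 * has a strictly higher priority than the one currently running.
 * Returns 1 if a preemption was requested, 0 otherwise.
 */
static int
sk_check_preempt(sk_cpu_t *cpu, int new_pri)
{
    if (new_pri > cpu->running_pri) {
        cpu->need_resched = 1;
        return 1;
    }
    return 0;
}
```

A thread of equal priority does not trigger a preemption in this sketch; it simply waits on its dispatch queue.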

A detailed look at the time-based and event-based work performed by the dispatcher is just around the corner, so stay with us.

Different workloads have different scheduling and execution requirements. By default, a Solaris system prioritizes and runs threads on a time-share basis, attempting to maintain an even distribution of processor resources among the threads. A Solaris desktop system (a workstation or notebook computer running a windowing system) runs threads on a time-share basis as well but accords an extra priority boost for threads bound to active windows on the user's computer. This is done with the interactive scheduling class.

Solaris implements several scheduling classes that constitute a powerful and flexible infrastructure for managing a variety of workloads by establishing the range of priorities a thread will be assigned, as well as which set of scheduling rules will apply. The following scheduling classes are integrated into Solaris 10:

Timeshare (TS). Priority adjustments are made based on the time a thread spends waiting for processor resources or consuming processor resources, and the thread's time quantum (the maximum amount of time the thread can execute on the processor) varies according to its priority.
Interactive (IA). IA is the same as timeshare, with the addition of a mechanism that boosts the priority of a thread connected to the active window on a desktop.
Fair Share (FSS). Available processor cycles are divided into units called shares, and administrative tools allocate shares to processes using the Solaris projects and tasks framework. A thread in the FSS class has its priority adjusted according to its share allocation, recent utilization, and shares consumed by other threads in the FSS class.
Fixed Priority (FX). The assigned priority is not changed or adjusted over the lifetime of the thread.
Real Time (RT). Real-time threads occupy the highest range of assignable priorities. Real-time scheduling provides the fastest possible dispatch latency, that is, the elapsed time between an RT thread becoming runnable and getting scheduled onto a processor.
System (SYS). The kernel uses this class for the execution of operating system threads. The priority range occupied by the SYS class is higher than all other scheduling classes, with the exception of the real-time class.

For the dispatcher to make the appropriate scheduling decisions with thousands of threads at different priorities and scheduling classes, a global priority scheme is required. Every thread has a global priority, allowing the dispatcher to determine its position relative to all other threads on the system.
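One way to picture a global priority scheme is as a set of non-overlapping ranges, one per class, with a class-relative priority offset into its range. The numeric ranges below (0 to 59, 60 to 99, 100 to 159) are assumptions used for illustration, since this section does not state them; only the ordering of the classes is taken from the text. The sk_ names are hypothetical, and the IA, FSS, and FX classes are left out to keep the sketch short.

```c
/* Scheduling classes covered by this sketch (IA, FSS, and FX omitted). */
typedef enum { SK_CLASS_TS, SK_CLASS_SYS, SK_CLASS_RT } sk_class_t;

typedef struct sk_range {
    int lo;  /* lowest global priority in the class */
    int hi;  /* highest global priority in the class */
} sk_range_t;

/* ASSUMED global priority ranges; illustrative values only. */
static const sk_range_t sk_ranges[] = {
    [SK_CLASS_TS]  = {   0,  59 },
    [SK_CLASS_SYS] = {  60,  99 },
    [SK_CLASS_RT]  = { 100, 159 },
};

/*
 * Map a class-relative priority (0 = lowest in the class) to a
 * global priority. Returns -1 if rel falls outside the class range.
 */
static int
sk_global_pri(sk_class_t cls, int rel)
{
    sk_range_t r = sk_ranges[cls];

    if (rel < 0 || r.lo + rel > r.hi)
        return -1;
    return r.lo + rel;
}
```

With ranges laid out this way, comparing two global priorities is enough to order any pair of threads, whatever their classes: every RT thread outranks every SYS thread, which in turn outranks every TS thread.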

In addition to priority, other conditions and configuration parameters factor into dispatcher scheduling decisions. These can be broadly categorized as resource management parameters and system architecture.

Resource management refers to a set of technologies integrated into Solaris that provide the framework, tools, and utilities for allocating and managing different amounts of hardware resources. From the kernel dispatcher perspective, the effects on scheduling decisions have to do with some form of binding or affinity between processors and threads. The specific resource controls are listed below.

Processor binding. Binds processes to processors. The dispatcher would naturally need to honor a user-defined binding and ensure that bound processes have their threads scheduled onto the designated processor.
Processor sets. Enable the creation of one or more user-defined processor sets, each comprising some subset of the total number of installed processors (introduced in Solaris 2.6). Use of processor sets requires explicit binding of processes to the set.
Resource pools. Essentially stateful processor sets (introduced in Solaris 9). The dispatcher accounts for resource pools and process bindings when it schedules threads.
Zones. Provide a virtualized execution environment. Solaris can bind a resource pool to a zone; thus, the dispatcher needs to honor such bindings when making scheduling decisions about threads running in a zone.

The second category, system architecture, refers to enhancements and optimizations made to the dispatcher code to account for the architectural nuances of the system. A good example of this is Memory Placement Optimization (MPO). MPO was introduced in Solaris 9; it mitigates the effects of systems with nonuniform memory access times by scheduling threads onto processors that are close to the thread's allocated physical memory. MPO is described in Kernel Support for NUMA and CMT Hardware.

A second, and more recent, architectural consideration is chip technology, specifically the implementation of chip multithreading (CMT) processors, which integrate multiple execution pipelines (cores) and multiple hardware threads per core on a single piece of silicon. Sun's UltraSPARC T1 processor is a CMT design, with eight execution pipelines and four hardware threads per pipeline. To the Solaris kernel, a single UltraSPARC T1 chip appears as 32 processors (8 cores times 4 hardware threads per core; each hardware thread is viewed as a processor by the kernel dispatcher). The dispatcher has been modified to accommodate certain implementation details of the hardware design, such as the level of sharing of hardware resources among the cores (caches, data paths, etc.), to minimize contention while at the same time maintaining cache warmth through judicious assignments of threads to cores. CMT is discussed in Chapter 16.
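The processor count for a chip like this follows directly from the core and hardware-thread counts. The sketch below works through that arithmetic; the assumption that processor IDs are numbered consecutively within each core is made for illustration, and the sk_ names are hypothetical.

```c
#define SK_CORES            8  /* execution pipelines per chip */
#define SK_STRANDS_PER_CORE 4  /* hardware threads per pipeline */
#define SK_NCPU             (SK_CORES * SK_STRANDS_PER_CORE)  /* 32 */

/*
 * ASSUMPTION: processor IDs 0..31 are assigned consecutively, so
 * IDs 0-3 belong to core 0, IDs 4-7 to core 1, and so on.
 */
static int
sk_core_of(int cpu_id)
{
    return cpu_id / SK_STRANDS_PER_CORE;
}

/* Return 1 if two processors are hardware threads of the same core. */
static int
sk_share_core(int cpu_a, int cpu_b)
{
    return sk_core_of(cpu_a) == sk_core_of(cpu_b);
}
```

A dispatcher that knows which processors share a core can spread runnable threads across cores before doubling up on one, which is the kind of contention-aware placement the paragraph describes.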

The idea of placing unbound threads on the processor on which they last executed is not new. The kernel dispatcher implemented warm affinity as early as Solaris 2.5. The idea again is that a thread placed back on the same processor has a better probability of finding a warm cache (a hardware cache that has some of the thread's instructions and data), thus reducing pipeline stalls for memory references.
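Processor binding and warm affinity can be combined into a simple selection rule: honor a binding if one exists, otherwise prefer the last processor if the thread ran there recently. This is a rough sketch rather than the kernel's algorithm; the sk_ names, the three-tick window, and the caller-supplied fallback processor are all assumptions.

```c
#define SK_NO_BIND     (-1)  /* thread is not bound to a processor */
#define SK_WARM_TICKS  3     /* ASSUMED window in which the cache counts as warm */

/* Hypothetical per-thread placement state for this example. */
typedef struct sk_aff {
    int  bound_cpu;      /* user-defined binding, or SK_NO_BIND */
    int  last_cpu;       /* processor the thread last ran on, or -1 */
    long last_ran_tick;  /* clock tick at which it last ran */
} sk_aff_t;

/*
 * Choose a processor for a runnable thread.
 *  1. A bound thread always goes to its designated processor.
 *  2. An unbound thread returns to its last processor if it ran
 *     there within the warm window.
 *  3. Otherwise the caller's fallback choice is used.
 */
static int
sk_choose_cpu(const sk_aff_t *t, long now, int fallback_cpu)
{
    if (t->bound_cpu != SK_NO_BIND)
        return t->bound_cpu;
    if (t->last_cpu >= 0 && now - t->last_ran_tick <= SK_WARM_TICKS)
        return t->last_cpu;
    return fallback_cpu;
}
```

The time window reflects the reasoning in the text: the longer a thread has been off a processor, the less likely its instructions and data are still in that processor's cache, so affinity is worth less.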

As we move through the remainder of this chapter, we explore the topics introduced here in greater detail:

Processor abstractions and groupings in the kernel
Organization of the dispatcher queues and queue management
Core kernel dispatcher functions for selecting threads
Variables and parameters involved in scheduling decisions
Global priority scheme
Scheduling classes: priorities and algorithms
Sleep and wakeup queues and queue management