

3.2. Processor Abstractions

The kernel dispatcher primarily manages two types of objects: threads and processors. Threads were discussed in the previous chapter. Before we probe the internals of the dispatcher, we need a clear view of how hardware processors (CPUs) are abstracted in the kernel and of the specific groupings of processors the kernel maintains.

Previous releases of Solaris defined a cpu structure (cpu_t), and a one-to-one mapping existed between physical processors and instantiated cpu_t structures in the kernel. The cpu_t maintains information required by the dispatcher and kernel-at-large for thread scheduling, interrupt handling, CPU state transitions, utilization and accounting, processor groupings, and administrative controls (psradm(1M)). Processor resource control facilities (processor sets and resource pools) are implemented through abstractions in the kernel that define groups of processors. Multicore processor technology and multiprocessor system designs introduced architectural considerations that require visibility by the kernel; thus, some new abstractions were needed for the kernel to take full advantage of new processors and systems.

The following processor-related abstractions are defined and maintained in the kernel:

  • cpu_t. A processor abstraction. Each cpu_t instantiated in the kernel is viewed by the dispatcher as an execution resource for a thread.

  • chip_t. The kernel representation of a physical processor chip. Chips with multiple execution cores have a cpu_t for each core. One or more cpu_t structures are linked to the chip_t, affording the kernel a view of which CPUs are associated with which chip. The chip_t was originally implemented to make the kernel aware of which cpu_ts share physical processors. With the introduction of multithreaded, multicore processors, use of the chip_t was extended to track load and facilitate load balancing across groups of cpu_ts sharing processing cores. Chips with multiple cores and multiple hardware threads per core (for example, UltraSPARC T1) will have a chip_t per core and a cpu_t for each hardware thread per core. A Sun Fire T2000 system with an 8-core UltraSPARC T1 chip will have eight chip_t structures and four cpu_t structures per chip_t, so the kernel view is 32 (8 x 4) logical CPUs on which threads can be scheduled.

Figure 3.2. Chips and CPUs
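The chip-to-CPU relationship shown in Figure 3.2 (and the 8 x 4 = 32 arithmetic from the Sun Fire T2000 example) can be modeled with a trivial user-space C sketch. The chip structure below is a simplified stand-in for illustration only, not the kernel's chip_t definition.

#include <stdio.h>

/* Simplified stand-in for the kernel's chip_t: one record per core. */
typedef struct chip {
        int     chip_ncpu;      /* hardware threads (cpu_t instances) on this core */
} chip_t;

int
main(void)
{
        /* An 8-core UltraSPARC T1: one chip_t per core, four hardware threads each. */
        chip_t cores[8];
        int ncpus = 0;

        for (int i = 0; i < 8; i++)
                cores[i].chip_ncpu = 4;
        for (int i = 0; i < 8; i++)
                ncpus += cores[i].chip_ncpu;

        /* The dispatcher's view: 32 logical CPUs on which threads can be scheduled. */
        printf("logical CPUs: %d\n", ncpus);
        return (0);
}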


The chip_t object provides various structure members used by the dispatcher for load balancing, assigning chips to latency groups (lgroups; more on those in a minute), maintaining per-chip statistics, and identifying an enumerated chip type that can be used to make scheduling decisions based on the shared resources implemented in the chip, for example, shared hardware caches. The chip types are enumerated below; a short sketch of how a chip type might inform a scheduling decision follows the list of types.

typedef enum chip_type {
        CHIP_DEFAULT,                    /* Default, non CMT processor */
        CHIP_SMT,                        /* SMT, single core */
        CHIP_CMP_SPLIT_CACHE,            /* CMP with split caches */
        CHIP_CMP_SHARED_CACHE,           /* CMP with shared caches */
        CHIP_NUM_TYPES
} chip_type_t;
                                                See usr/src/uts/common/sys/chip.h


  • CHIP_DEFAULT. A traditional processor chip with one execution core and one thread per core.

  • CHIP_SMT. A symmetric multithreading chip: a chip with a single execution core that supports multiple hardware threads, each of which is visible to the kernel as a logical processor (a cpu_t). The logical processors on an SMT chip share the core's execution pipeline and typically share instruction and data caches and other chip resources.

  • CHIP_CMP_SPLIT_CACHE. A chip multiprocessor (CMP): each chip contains multiple execution cores, and each core is represented by a cpu_t and is visible to the kernel as a logical CPU. This type identifies CMP designs with some level of dedicated (nonshared) cache per core.

  • CHIP_CMP_SHARED_CACHE. As above, but a CMP design with shared hardware caches. The UltraSPARC T1 processor is an example of this type.
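The chip type matters to the dispatcher because it encodes which hardware resources the logical CPUs on a chip share. The following is a minimal, user-space C sketch (not the actual Solaris dispatcher code) of the kind of decision the type enables: a hypothetical chip_cpus_share_cache() predicate that a load balancer could consult to decide whether migrating a thread between CPUs on the same chip preserves the thread's cache footprint.

#include <stdio.h>
#include <stdbool.h>

/* Mirrors the chip_type_t enumeration from usr/src/uts/common/sys/chip.h. */
typedef enum chip_type {
        CHIP_DEFAULT,                   /* Default, non CMT processor */
        CHIP_SMT,                       /* SMT, single core */
        CHIP_CMP_SPLIT_CACHE,           /* CMP with split caches */
        CHIP_CMP_SHARED_CACHE,          /* CMP with shared caches */
        CHIP_NUM_TYPES
} chip_type_t;

/*
 * Hypothetical predicate: do the logical CPUs on a chip of this type
 * share hardware caches? If so, a thread migrated between two CPUs on
 * the same chip keeps its cache footprint, so intra-chip migration is
 * cheaper than migration to another chip.
 */
static bool
chip_cpus_share_cache(chip_type_t type)
{
        switch (type) {
        case CHIP_SMT:                  /* hardware threads share one core */
        case CHIP_CMP_SHARED_CACHE:     /* cores share a cache (e.g., UltraSPARC T1 L2) */
                return (true);
        default:
                return (false);
        }
}

int
main(void)
{
        printf("shared-cache CMP: intra-chip migration cheap? %s\n",
            chip_cpus_share_cache(CHIP_CMP_SHARED_CACHE) ? "yes" : "no");
        printf("split-cache CMP:  intra-chip migration cheap? %s\n",
            chip_cpus_share_cache(CHIP_CMP_SPLIT_CACHE) ? "yes" : "no");
        return (0);
}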

You can determine the kernel's defined chip type for your system by using mdb(1).

An UltraSPARC II-based system:

# mdb -k
> ::walk cpu |::print cpu_t cpu_chip |::print chip_t chip_type
chip_type = 0 (CHIP_DEFAULT)
chip_type = 0 (CHIP_DEFAULT)
. . .
chip_type = 0 (CHIP_DEFAULT)

# prtdiag | more
System Configuration:  Sun Microsystems  sun4u 8-slot Sun Enterprise 4000/5000
System clock frequency: 84 MHz
Memory size: 4096Mb

========================= CPUs =========================

                    Run   Ecache   CPU    CPU
Brd  CPU   Module   MHz     MB    Impl.   Mask
---  ---  -------  -----  ------  ------  ----
 0     0     0      336     4.0   US-II    2.0
 0     1     1      336     4.0   US-II    2.0
. . .

A T2000 UltraSPARC T1 based system:

# mdb -k
> ::walk cpu |::print cpu_t cpu_chip |::print chip_t chip_type
chip_type = 3 (CHIP_CMP_SHARED_CACHE)
chip_type = 3 (CHIP_CMP_SHARED_CACHE)
chip_type = 3 (CHIP_CMP_SHARED_CACHE)
. . .

# prtdiag | more
System Configuration:  Sun Microsystems  sun4v Sun Fire T200
System clock frequency: 200 MHz
Memory size: 32760 Megabytes

========================= CPUs ===============================================
                            CPU                 CPU
Location     CPU   Freq     Implementation      Mask
------------ ----- -------- ------------------- -----
MB/CMP0/P0       0 1200 MHz  SUNW,UltraSPARC-T1
MB/CMP0/P1       1 1200 MHz  SUNW,UltraSPARC-T1
. . .


A CPU will belong to one or more of the CPU groupings listed here:

  • CPU partitions. A kernel abstraction consisting of a set of CPUs and a partition-wide kernel preempt (kp) dispatch queue. At system initialization (boot) time, all CPUs belong to the default (system) partition. The system partition is not visible to users.

  • Processor sets. A user-level abstraction of a set of one or more processors. Processor sets are implemented internally as CPU partitions. Currently, there is a 1-to-1 mapping between processor sets and CPU partitions, with the exception of the default partition. Processor sets are created and managed with the psrset(1) command.

  • Resource pools. A resource pool is essentially a stateful processor set. Processor sets created with psrset(1) are stateless: the kernel does not create or maintain nonvolatile state for processor sets, so created sets and process/thread bindings are lost if the system is restarted. Resource pools, introduced in Solaris 9, address this by maintaining on-disk state, as well as by adding features such as the ability to bind a scheduling class to a resource pool. In the kernel, the CPU grouping configured as a resource pool's processor set is, in fact, a CPU partition. Simply put, internally, processor sets created with psrset(1) and processor sets assigned to a resource pool are both instantiated as CPU partitions. A resource pool may also be assigned to a Solaris 10 Zone.

  • Locality groups (lgroups). Solaris 9 included a feature called memory placement optimization (MPO). The goal of MPO is to mitigate the performance effects of systems with nonuniform memory access (NUMA) times. The kernel needs to know which CPUs and memory banks are close to each other so that it can optimize for locality: keep threads on CPUs close to the thread's memory. MPO is implemented through the kernel's locality group abstraction. An lgroup is an object in the kernel that represents a grouping of processors and memory that exist within a bounded latency of each other.

    Lgroups are organized into a hierarchy or topology that represents the latency topology of the machine. There is always at least a root lgroup in the system. It represents all the hardware resources in the machine at a latency large enough that any hardware resource can at least access any other hardware resource within that latency. A Uniform Memory Access (UMA) machine is represented with one lgroup (the root). In contrast, a NUMA machine is represented at least by the root lgroup and some number of leaf lgroups, where the leaf lgroups contain the hardware resources within the least latency of each other, and the root lgroup still contains all the resources in the machine.

    CPUs are assigned to lgroups at system initialization time according to platform-specific code that creates the lgroups as the architectural characteristics of the system dictate. As an example, a high-end Sun Fire server is configured with one or more system boards, where each system board is populated with CPUs and memory. Such systems will create an lgroup for each system board.

    The kernel uses the lgroup abstraction to know how to allocate resources near a given process/thread. At fork() and lwp/thread_create() time, a "home" lgroup is chosen for a thread. The kernel dispatcher does this by picking the lgroup with the lowest load average. Binding to a processor or processor set changes the home lgroup for a thread. The scheduler has been modified to try to dispatch a thread on a CPU in its home lgroup. Physical memory allocation is lgroup aware, so memory is allocated from the current thread's home lgroup if possible. If the desired resources are not available, the kernel traverses the lgroup hierarchy, going to the parent lgroup to find resources at the next level of locality until it reaches the root lgroup.
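    The following minimal, user-space C sketch illustrates the traversal just described: allocation starts at a thread's home lgroup and walks parent links toward the root until an lgroup can satisfy the request. The lgrp structure and the lgrp_alloc_near() function here are simplified, hypothetical stand-ins for illustration, not the kernel's actual lgroup interfaces.

#include <stddef.h>
#include <stdio.h>

/* Simplified lgroup node: parent pointer plus free pages (illustration only). */
typedef struct lgrp {
        struct lgrp     *lgrp_parent;   /* NULL at the root lgroup */
        const char      *lgrp_name;
        long            lgrp_pages_free;
} lgrp_t;

/*
 * Allocate npages as close as possible to a thread's home lgroup:
 * try the home lgroup first, then walk up toward the root, accepting
 * the first lgroup that can satisfy the request.
 */
static lgrp_t *
lgrp_alloc_near(lgrp_t *home, long npages)
{
        for (lgrp_t *l = home; l != NULL; l = l->lgrp_parent) {
                if (l->lgrp_pages_free >= npages)
                        return (l);     /* closest lgroup with enough memory */
        }
        return (NULL);                  /* not even the root can satisfy it */
}

int
main(void)
{
        lgrp_t root = { NULL, "root", 100000 };
        lgrp_t leaf = { &root, "leaf0", 10 };

        lgrp_t *from = lgrp_alloc_near(&leaf, 512);
        printf("512 pages allocated from lgroup: %s\n",
            from != NULL ? from->lgrp_name : "none");
        return (0);
}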

    The cpu_t structure in the kernel maintains several linked lists to locate all the CPUs in a processor set or lgroup.

/*
 * Per-CPU data.
 */
typedef struct cpu {
        processorid_t   cpu_id;                  /* CPU number */
...
        /*
         * Links to other CPUs. It is safe to walk these lists if
         * one of the following is true:
         *      - cpu_lock held
         *      - preemption disabled via kpreempt_disable
         *      - PIL >= DISP_LEVEL
         *      - acting thread is an interrupt thread
         *      - all other CPUs are paused
         */
        struct cpu      *cpu_next;               /* next existing CPU */
        struct cpu      *cpu_prev;               /* prev existing CPU */
        struct cpu      *cpu_next_onln;          /* next online (enabled) CPU */
        struct cpu      *cpu_prev_onln;          /* prev online (enabled) CPU */
        struct cpu      *cpu_next_part;          /* next CPU in partition */
        struct cpu      *cpu_prev_part;          /* prev CPU in partition */
        struct cpu      *cpu_next_lgrp;          /* next CPU in latency group */
        struct cpu      *cpu_prev_lgrp;          /* prev CPU in latency group */
        struct cpu      *cpu_next_chip;          /* next CPU on chip */
        struct cpu      *cpu_prev_chip;          /* prev CPU on chip */
        struct cpu      *cpu_next_lpl;           /* next CPU in lgrp partition */
        struct cpu      *cpu_prev_lpl;
...
}
                                                See usr/src/uts/common/sys/cpuvar.h

Note the two sets of lgroup-related pointers, cpu_next_lgrp and cpu_next_lpl (and their respective prev pointers). An lgroup can be partitioned when the CPUs in the lgroup reside in different CPU partitions (processor sets). An lgroup partition represents the intersection of an lgroup and a processor set, as shown in Figure 3.3. The scheduling implications of lgroup partitions are discussed in Section 3.9.

Figure 3.3. Lgroup Partitions


In Figure 3.3, CPUs 2, 4, 6, and 8 are in a user-created processor set spanning two lgroups. CPUs 2 and 4 would be on one cpu_[next|prev]_lpl list, and CPUs 6 and 8 on another. CPUs 2, 4, 6, and 8 would be linked together on the same cpu_[next|prev]_part list. CPUs 1, 2, 3, and 4 would be linked in one cpu_[next|prev]_lgrp pointer chain (as would CPUs 5, 6, 7, and 8). Maintaining multiple linked lists that reflect the different group abstractions (partitions, lgroups, and lgroup partitions) simplifies operations in the dispatcher and the kernel-at-large that target a CPU group abstraction, such as determining the size and membership of a specific grouping of interest. For example, the dispatcher uses these lists to determine which CPUs in a given lgroup a thread bound to a particular processor set can legally be scheduled to run on.
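As a concrete illustration of the kind of walk these lists enable, here is a small, self-contained C sketch that counts the CPUs in a partition by following cpu_next_part links around a circular list of simplified cpu structures. The cpupart_ncpus() helper and the pared-down structure are hypothetical, and a real kernel walk must also honor the locking rules noted in the cpu_t comment above.

#include <stdio.h>

/* Simplified cpu structure: only the fields needed for this walk. */
typedef struct cpu {
        int             cpu_id;
        struct cpu      *cpu_next_part; /* next CPU in the same partition */
} cpu_t;

/*
 * Count the CPUs in a partition by walking the circular cpu_next_part
 * list, starting from any CPU in the partition.
 */
static int
cpupart_ncpus(cpu_t *start)
{
        int n = 0;
        cpu_t *c = start;

        do {
                n++;
                c = c->cpu_next_part;
        } while (c != start);
        return (n);
}

int
main(void)
{
        /* CPUs 2, 4, 6, and 8 linked into one partition, as in Figure 3.3. */
        cpu_t c2 = { 2 }, c4 = { 4 }, c6 = { 6 }, c8 = { 8 };
        c2.cpu_next_part = &c4;
        c4.cpu_next_part = &c6;
        c6.cpu_next_part = &c8;
        c8.cpu_next_part = &c2;

        printf("CPUs in partition: %d\n", cpupart_ncpus(&c2));
        return (0);
}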

The chip_t objects are also linked in several useful ways.

typedef struct chip {
        chipid_t        chip_id;                    /* chip's "id" */
        chipid_t        chip_seqid;                 /* sequential id */
        struct chip     *chip_prev;                 /* previous chip on list */
        struct chip     *chip_next;                 /* next chip on list */
        struct chip     *chip_prev_lgrp;            /* prev chip in lgroup */
        struct chip     *chip_next_lgrp;            /* next chip in lgroup */
        chip_type_t     chip_type;                  /* type of chip */
        uint16_t        chip_ncpu;                  /* number of active cpus */
        uint16_t        chip_ref;                   /* chip's reference count */
        struct cpu      *chip_cpus;                 /* per chip cpu list */
        struct lgrp     *chip_lgrp;                 /* chip lives in this lgroup */
...
                                                See usr/src/uts/common/sys/chip.h


A CPU can belong to only one partition and lgroup at any time. CPUs in the same lgroup can be part of different partitions (such as a user-defined processor set or resource pool). All the CPUs in a chip_t belong to the same lgroup as the chip. In Section 3.9, we walk through the process of selecting which CPU's dispatch queue a thread will be inserted on, given configured partitions and lgroups and the possibility of user-defined thread-to-CPU bindings.

At the thread level, several fields in the thread_t structure maintain information on CPU bindings, the partition that contains the thread, and the thread's lgroup affinity. We examine the specific structure members in Section 3.9 as we go through the CPU selection algorithm.

3.2.1. Processor Observability

The next few pages provide examples of lgroup observability[1]: how the CPU and lgroup configurations on a system can be determined, and how to track the execution of all the threads in a target process, showing which CPUs in which lgroups execute those threads.

[1] Solaris kernel engineering has created a wonderful set of tools targeting lgroup observability and control, including an lgroup-aware Perl module. You can find the tools, along with detailed descriptions of their use, on the OpenSolaris Web site:

http://www.opensolaris.org/os/community/performance/numa/observability/

The tool set includes some very useful DTrace scripts:

http://www.opensolaris.org/os/community/performance/numa/observability/dtrace/

We encourage you to spend time on this site and to download and use the tools provided.

Using mdb(1), we can examine a running system and determine which CPUs are members of which lgroups and partitions.

Example from a Sun v40Z 4-way Opteron based system:

> ::walk cpu |::print cpu_t cpu_lpl |::print lgrp_t lgrp_id
lgrp_id = 0x1
lgrp_id = 0x2
lgrp_id = 0x3
lgrp_id = 0x4

Example from a Sun Fire T2000 8-core UltraSPARC T1 based system:

> ::walk cpu |::print cpu_t cpu_lpl |::print lgrp_t lgrp_id
lgrp_id = 0
lgrp_id = 0
...
lgrp_id = 0
lgrp_id = 0


Note in the Sun Fire T2000 example, most of the lines were cut for brevity. All 32 virtual CPUs are in the same lgroup (0) because the T2000 has a uniform memory access architecture.

Here's a handy script created by Jon Haslam (author of the DTrace chapter in Solaris™ Performance and Tools) that reports the lgroup of a particular process and dumps the CPUs and lgroups configured on the system.

# cat getlgrp
/usr/ucb/echo -n "PID $1 lgrp = "
echo "0t$1::pid2proc | ::walk thread | ::print -t kthread_t t_lpl | \
    ::print struct lgrp_ld lpl_lgrpid" | mdb -k
echo
echo "CPUs on system"
echo "cpus::list cpu_t cpu_next | ::print cpu_t cpu_id" | mdb -k
echo
echo "... and their lgrps"
echo "cpus::list cpu_t cpu_next | ::print -t struct cpu cpu_lpl | \
    ::print -t struct lgrp_ld lpl_lgrpid" | mdb -k

# ./getlgrp $$
PID 3321 lgrp = lpl_lgrpid = 0x1

CPUs on system
cpu_id = 0
cpu_id = 0x1
cpu_id = 0x2
cpu_id = 0x3

... and their lgrps
lgrp_id_t lpl_lgrpid = 0x1
lgrp_id_t lpl_lgrpid = 0x2
lgrp_id_t lpl_lgrpid = 0x3
lgrp_id_t lpl_lgrpid = 0x4


The kernel maintains statistics on lgroups through the kstats framework, and these can be examined on a running system to track load, migrations (number of times a thread was migrated to the lgroup), and memory events.

# kstat -m lgrp -n lgrp4
module: lgrp                             instance: 4
name:   lgrp4                            class:    misc
        alloc fail                       2
        cpus                             1
        crtime                           252.646499713
        default policy                   0
        load average                     65516
        lwp migrations                   44
        next-touch policy                119290
        pages avail                      2097152
        pages failed to mark             0
        pages failed to migrate from     0
        pages failed to migrate to       0
        pages free                       2078708
        pages installed                  2097152
        pages marked for migration       0
        pages migrated from              0
        pages migrated to                0
        random policy                    15584
        round robin policy               0
        snaptime                         679637.602945989
        span process policy              0
        span psrset policy               0


And, of course, we can use DTrace to track which CPUs and lgrps the thread was scheduled on:

#!/usr/sbin/dtrace -qs

sched:::on-cpu
/ pid == $1 /
{
        self->lgrp = curthread->t_cpu->cpu_chip->chip_lgrp->lgrp_id;
        @[tid, self->lgrp, cpu] = count();
}

END
{
        printf("Threads CPUs and lgrps for PID %d\n", $1);
        printf("%-8s %-8s %-8s %-8s\n", "TID", "LGRP", "CPUID", "COUNT");
        printf("==================================\n");
        printa("%-8d %-8d %-8d %-@8d\n", @);
}

# ./lgrp.d 3416
^C
Threads CPUs and lgrps for PID 3416
TID      LGRP     CPUID    COUNT
==================================
1        2        1        1
2        2        1        1014
5        3        2        1149
4        3        2        1193
3        2        1        1313
3        1        0        1460
3        3        2        1465
5        4        3        1820
2        1        0        1898
. . .


The D script above was saved in a file called lgrp.d and executed to track process 3416. The aggregation shows the number of times (COUNT) a given thread (TID) executed on a particular CPU and lgroup. We can see from the output that each thread is getting a respectable number of runs on a given CPU, but some migration is also happening, likely the result of load balancing by the dispatcher.



