16.6. Lgroup Implementation

A locality group is represented by a data structure called an lgroup, which holds information about the group and its resources. Each lgroup contains

  • A unique ID that identifies this lgroup

  • A pointer to the parent lgroup

  • A pointer to a list of child lgroups

  • A pointer to chips present in this lgroup

  • A pointer to sets of memory nodes in this lgroup

  • A platform handle used between the common and platform-specific parts of Solaris to identify this lgroup and the hardware resources that are inside it

/*
 * lgroup structure
 *
 * Visible to generic code and contains the lgroup ID, CPUs in this lgroup,
 * and a platform handle used to identify this lgroup to the lgroup platform
 * support code
 */
typedef struct lgrp {
        lgrp_id_t       lgrp_id;        /* which lgroup */
        int             lgrp_latency;   /* latency within this lgroup */
        lgrp_handle_t   lgrp_plathand;  /* handle for platform calls */
        struct lgrp     *lgrp_parent;   /* parent lgroup */
        uint_t          lgrp_reserved1; /* filler */
        uint_t          lgrp_childcnt;  /* number of children lgroups */
        klgrpset_t      lgrp_children;  /* children lgroups */
        klgrpset_t      lgrp_leaves;    /* (direct descendant) leaf lgroups */

        /*
         * set of lgroups containing a given type of resource
         * at this level of locality
         */
        klgrpset_t      lgrp_set[LGRP_RSRC_COUNT];

        mnodeset_t      lgrp_mnodes;    /* set of memory nodes in this lgroup */
        uint_t          lgrp_nmnodes;   /* number of memnodes */
        uint_t          lgrp_reserved2; /* filler */

        struct cpu      *lgrp_cpu;      /* pointer to a cpu may be null */
        uint_t          lgrp_cpucnt;    /* number of cpus in this lgrp */
        uint_t          lgrp_chipcnt;   /* number of chips in this lgrp */
        struct chip     *lgrp_chips;    /* pointer to chips in this lgrp */

        kstat_t         *lgrp_kstat;    /* per-lgrp kstats */
} lgrp_t;
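Given these fields, walking the lgroup hierarchy is a straightforward recursive descent. The sketch below is illustrative rather than actual Solaris code: it assumes a global lgrp_table[] mapping lgroup IDs to lgrp_t pointers, that NLGRPS_MAX bounds the number of lgroups, and that klgrpset_t is a bitmask indexed by lgroup ID.

/*
 * Illustrative sketch: print the lgroup hierarchy from a given root.
 * Assumes klgrpset_t is a bitmask of lgroup IDs and that lgrp_table[]
 * maps an lgroup ID to its lgrp_t (assumptions for illustration, not
 * the exact kernel interfaces).
 */
extern lgrp_t *lgrp_table[];    /* assumed ID-to-lgroup table */

#define LGRP_IN_SET(set, id)    (((set) & ((klgrpset_t)1 << (id))) != 0)

static void
lgrp_walk(lgrp_t *lgrp, int depth)
{
        int id;

        printf("%*slgroup %d: %u cpus, %u memnodes\n", depth * 2, "",
            (int)lgrp->lgrp_id, lgrp->lgrp_cpucnt, lgrp->lgrp_nmnodes);

        /* Recurse into each child whose bit is set in lgrp_children. */
        for (id = 0; id < NLGRPS_MAX; id++) {
                if (LGRP_IN_SET(lgrp->lgrp_children, id))
                        lgrp_walk(lgrp_table[id], depth + 1);
        }
}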


The lgroup platform handle enables the separation of the lgroup implementation into common (platform-independent) and platform-specific components. This separation fosters a clean interface between the two components, makes the implementation more portable, and allows the common part to focus on scheduling, virtual memory, APIs, etc., while the platform-specific part can deal with the hardware resources. The handles are managed and maintained by the platform code, so the platform can implement them as it chooses and decide what CPUs, memory, etc., to associate with each one (and consequently each corresponding lgroup).
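As an illustration, the common code's view of the platform can be reduced to a handful of handle-based queries along the following lines. These prototypes sketch the shape of such an interface; the names are illustrative, not the exact routines in the Solaris source.

/*
 * Illustrative handle-based platform interface.  The common lgroup
 * code deals only in lgrp_handle_t values; the platform code decides
 * what hardware stands behind each handle.
 */
lgrp_handle_t   lgrp_plat_root_hand(void);             /* handle of the root lgroup */
lgrp_handle_t   lgrp_plat_cpu_to_hand(processorid_t);  /* lgroup owning a given CPU */
int             lgrp_plat_latency(lgrp_handle_t from,  /* latency between two lgroups */
                    lgrp_handle_t to);
int             lgrp_plat_max_lgrps(void);             /* platform's lgroup count limit */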

For some machines, proper expression of the latency topology requires that some lgroups have no CPUs or no memory. Our implementation leaves this to the platform-specific code, which decides whether such CPUs or memory should be placed in their own lgroup or associated with another lgroup.

The complete lgrp_t data structure is shown above.

16.6.1. Parameters Affecting MPO

Most of the locality-related optimizations introduced with MPO rely on fairly simple heuristics to provide good performance for most applications. Applications that do not behave as the heuristics expect may experience performance problems with this new functionality. In case of such issues, the values of the MPO internal system variables can help explain the system's behavior, and the controlling APIs described in the next section can help provide a solution.

Important. The description of the MPO system variables is provided here solely to explain the MPO implementation. Changes to these variables are not supported, and customers experiencing problems may be required to restore the variables to their default values before diagnosis can proceed.

Users should keep in mind that these variables are all internal kernel variables and do not constitute a formal interface. Although the commands and variables below are implemented in current releases of Solaris, these variables may change or disappear over time. In addition, since these are internal variables, there may be no error detection should they be changed to unexpected values. Their default values have been carefully chosen to work well together.

lgrp_mem_default_policy. This variable reflects the default memory allocation policy used by the kernel; a short fallback sketch follows the list below. The variable is an integer, and its value corresponds to one of the policies listed in <sys/lgrp.h>. On Sun Fire 3800-6800 servers, this value has been LGRP_MEM_POLICY_NEXT since Solaris 9, signifying that memory allocation defaults to first touch. On Sun Fire 12K and 15K servers, this value is

  • LGRP_MEM_POLICY_RANDOM in the Solaris 9 9/02 OS, meaning that memory allocation defaults to random placement

  • LGRP_MEM_POLICY_NEXT starting with the Solaris 9 12/02 OS, meaning that memory allocation defaults to first touch. However, on Sun Fire 12K and 15K servers without the hardware prerequisite installed, all processors and memory are placed in a single lgroup, essentially disabling the MPO feature.
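Kernel code that is handed no explicit placement policy can simply fall back to this variable. A minimal sketch of that fallback, assuming LGRP_MEM_POLICY_DEFAULT is the "no preference" sentinel among the policies in <sys/lgrp.h>:

/*
 * Sketch: resolve a requested memory allocation policy, falling back
 * to the system-wide default when the caller expressed no preference.
 */
extern int lgrp_mem_default_policy;

static int
resolve_mem_policy(int requested)
{
        if (requested == LGRP_MEM_POLICY_DEFAULT)   /* assumed sentinel */
                return (lgrp_mem_default_policy);
        return (requested);
}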

lgrp_shm_random_thresh. As described above, large shared memory regions are allocated randomly rather than by first touch. This variable controls how large a region can be before we switch to random allocation. The default is 8 Mbytes, which is large enough to allow communication buffers such as those used by MPI programs to be local to one of the ends of the communication pipe; yet it is small enough that memory regions which are likely to become hot spots will be spread across the system's memory controllers.
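The decision itself is a simple size comparison. A minimal sketch, where shm_placement_policy() is a hypothetical helper handed the size of the shared segment being set up:

/*
 * Sketch: choose a placement policy for a shared memory region by
 * size.  Regions larger than lgrp_shm_random_thresh are spread
 * randomly across the system's memory; smaller ones use first touch.
 */
extern uint64_t lgrp_shm_random_thresh;    /* default: 8 Mbytes */

static int
shm_placement_policy(size_t region_size)
{
        if ((uint64_t)region_size > lgrp_shm_random_thresh)
                return (LGRP_MEM_POLICY_RANDOM);
        return (LGRP_MEM_POLICY_NEXT);      /* first touch */
}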

This variable is an unsigned 64-bit integer; it can be modified at runtime with a kernel debugger or through /etc/system.

lgrp_mem_pset_aware. If a process is running within a user processor set (see psrset(1M)), this variable determines whether randomly placed memory for the process is selected from among all the lgroups in the system or only from those lgroups that are spanned by the processors in the processor set. This value defaults to zero, signifying that the kernel will select memory from all the lgroups in the system. This default is appropriate for systems in which processor sets are not used or are only used to isolate applications from operating system threads. If processor sets are used to isolate applications from one another, then setting this value to 1 will likely lead to more reproducible performance.
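In effect, the variable picks the candidate set from which random placement draws. A sketch of that selection; thread_in_user_pset(), pset_spanned_lgrps(), and all_lgrps() are hypothetical helpers standing in for the kernel's own lookups:

/*
 * Sketch: choose the lgroups eligible for random placement for a
 * given thread.  With lgrp_mem_pset_aware nonzero, a thread bound to
 * a user processor set draws only from the lgroups spanned by that
 * set's processors; otherwise the whole system is eligible.
 */
extern int lgrp_mem_pset_aware;

static klgrpset_t
random_placement_candidates(kthread_t *t)
{
        if (lgrp_mem_pset_aware && thread_in_user_pset(t))
                return (pset_spanned_lgrps(t)); /* lgroups in the pset */
        return (all_lgrps());                   /* every lgroup in the system */
}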

lgrp_expand_proc_thresh. This variable controls how quickly a process's threads will spread across multiple lgroups. If the lowest load among all the lgroups across which the process is spread exceeds this threshold, that suggests that our current lgroups are all approaching or exceeding their capacity. Thus, we will consider placing the next thread on a new lgroup.

This value reflects the fraction of an lgroup's capacity that is being used. To allow the kernel to evaluate loads by using only integer arithmetic, we make this value an unsigned 32-bit integer that is set to INT16_MAX times some fractional capacity.

On Sun Fire 12K and 15K servers, this value defaults to (INT16_MAX*3)/4, meaning that we will not consider spreading a process to a new lgroup until each of its existing lgroups is at least 75% loaded. On Sun Fire 3800-6800 servers, this value defaults to (INT16_MAX/4), meaning that we will consider spreading to a new lgroup once our existing lgroups are at least 25% loaded. The different defaults arise from the architectural differences between the two server families. On Sun Fire 12K and 15K servers, the remote latency is significantly higher than on Sun Fire 3800-6800 servers, and, conversely, the available bandwidth is much greater. Thus, these values reflect an attempt to manage load so as to minimize an application's latency on a Sun Fire 12K/15K server and maximize an application's bandwidth on Sun Fire 3800-6800 servers.
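Because loads are kept in these fixed-point capacity units, the check reduces to a single integer comparison. A minimal sketch, where min_lgrp_load() is a hypothetical helper returning the lowest load among the lgroups a process currently spans:

/*
 * Sketch: decide whether a process's next thread should be considered
 * for a new lgroup.  Loads are fixed-point values in which INT16_MAX
 * represents 100% of an lgroup's capacity.
 */
extern uint32_t lgrp_expand_proc_thresh;

static int
should_expand(proc_t *p)
{
        /* Even the least-loaded current lgroup is past the threshold. */
        return (min_lgrp_load(p) > lgrp_expand_proc_thresh);
}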

lgrp_privm_random_thresh. As described above, private memory is allocated by first touch by default. This variable makes it possible to allocate large private memory regions with random placement rather than first touch. By default, this value is ULONG_MAX, so the threshold is never crossed and private memory is always allocated by first touch.

This variable is an unsigned 64-bit integer; it can be safely modified at runtime with a kernel debugger or through /etc/system.

lgrp_expand_proc_diff. Once we have decided to spread a process out to a new lgroup, there is no point in choosing a new lgroup that is just as loaded as the lgroups we are already running on. This variable uses the same capacity units as lgrp_expand_proc_thresh, and it specifies how much lower the load must be on a new lgroup before we will assign a new thread to that lgroup. On both Sun Fire 3800-6800 and 12K/15K servers, this value defaults to (INT16_MAX/4), or a 25% difference in load.

lgrp_loadavg_tolerance. As with system load, an lgroup's load is calculated with a decaying average function; this tends to be more useful than the "instantaneous" load measurement, which can fluctuate widely and quickly. Thus, the load value for an lgroup is really only a constantly changing estimate. When this value is actually used to decide which lgroup a new thread should be placed on, lgrp_loadavg_tolerance is used as a "fudge factor." If the current estimated loads on two lgroups are within lgrp_loadavg_tolerance of each other, we treat those lgroups as being identically loaded and choose randomly between them. The value is specified in the same units as the other load variables. The default value is 0x10000, which leads to good performance results for a variety of database and mixed workloads. Our tests have shown that HPC workloads frequently benefit from a lower value, such as 0x1000.
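Taken together, lgrp_expand_proc_diff and lgrp_loadavg_tolerance govern how a candidate lgroup is weighed against the current one. A sketch of that comparison, where lgrp_load_est() and coin_flip() are hypothetical helpers:

/*
 * Sketch: pick between the current lgroup and a candidate for a new
 * thread.  All values use the same fixed-point capacity units as
 * lgrp_expand_proc_thresh.
 */
extern uint32_t lgrp_expand_proc_diff;
extern uint32_t lgrp_loadavg_tolerance;

static lgrp_t *
pick_lgrp(lgrp_t *cur, lgrp_t *cand)
{
        uint32_t cur_load = lgrp_load_est(cur);
        uint32_t cand_load = lgrp_load_est(cand);
        uint32_t gap = (cur_load > cand_load) ?
            cur_load - cand_load : cand_load - cur_load;

        /* Within the fudge factor: treat as identical, pick randomly. */
        if (gap <= lgrp_loadavg_tolerance)
                return (coin_flip() ? cur : cand);

        /* Move only if the candidate is sufficiently less loaded. */
        if (cand_load + lgrp_expand_proc_diff <= cur_load)
                return (cand);
        return (cur);
}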



