3.6. Scheduling Classes

Before diving into the specifics of dispatcher thread selection and operations, we need to discuss thread priorities and the individual scheduling classes implemented in the kernel. The core dispatcher code and scheduling-class-specific code are tightly integrated, and a thorough explanation of the CPU and thread selection and scheduling process requires a background in the priority scheme and the functions managed by the scheduling-class-specific code.

The dispatcher subsystem can be decomposed into the core dispatcher functions and the scheduling-class-specific functions. While the two are tightly integrated and are maintained in the same source directory (usr/src/uts/common/disp), the architecture allows for a single instance of the dispatcher to support multiple scheduling classes. The different scheduling classes determine the priority range for threads and vary in terms of the algorithms applied to thread-specific functions. Solaris provides six bundled scheduling classes: TS (timeshare), IA (interactive), FX (fixed priority), FSS (fair share scheduler), RT (real time), and SYS (system).
The default scheduling class out of the box is the TS class, or the IA class on desktops and laptops for threads started by the user under a window manager. User and administrative commands exist for placing threads in other classes. priocntl(1) can change the scheduling class and priority of a thread or process (note that improving priorities and using the RT class require a privileged account). Using the FSS class requires a little more administrative work to do the share allocation. See System Administration Guide: Solaris Containers-Resource Management and Solaris Zones (http://docs.sun.com) for specifics.

3.6.1. Scheduling Class Data

Each scheduling class has a unique data structure referenced through a kernel thread's t_cldata pointer. The structures take the name xxproc, where xx is ts, rt, fss, fx, or ia. As an example, the tsproc_t object is shown below. The class-specific structures for the other scheduling classes are similar in terms of the structure members and their use.

    /*
     * time-sharing class specific thread structure
     */
    typedef struct tsproc {
        int           ts_timeleft;  /* time remaining in procs quantum */
        uint_t        ts_dispwait;  /* wall clock seconds since start */
                                    /* of quantum (not reset upon preemption) */
        pri_t         ts_cpupri;    /* system controlled component of ts_umdpri */
        pri_t         ts_uprilim;   /* user priority limit */
        pri_t         ts_upri;      /* user priority */
        pri_t         ts_umdpri;    /* user mode priority within ts class */
        pri_t         ts_scpri;     /* remembered priority, for schedctl */
        char          ts_nice;      /* nice value for compatibility */
        char          ts_boost;     /* interactive priority offset */
        uchar_t       ts_flags;     /* flags defined below */
        kthread_t     *ts_tp;       /* pointer to thread */
        struct tsproc *ts_next;     /* link to next tsproc on list */
        struct tsproc *ts_prev;     /* link to previous tsproc on list */
    } tsproc_t;

See usr/src/uts/common/sys/ts.h

The kernel maintains doubly linked lists of the class-specific structures, with separate lists for each class, with the exception of IA class threads.
Threads in the IA class link to a tsproc structure, and most of the class-supporting code for interactive threads is handled by the TS routines. IA threads are distinguished from TS threads by a flag in the ts_flags field, the TSIA flag. Maintaining the linked lists for the class structures greatly simplifies the dispatcher-supporting code that updates different fields, such as time quantum, in the structures during the clock-driven dispatcher housekeeping functions.

For the TS/IA, FX, and FSS classes, the kernel builds an array of 16 xxproc structure pointers that anchor up to 16 doubly linked lists of the xxproc structures, systemwide. The code implements a hash function, based on the thread pointer, to determine which list to place a thread on, and each list is protected by its own kernel mutex, implemented as a listlock array, one for each class. Implementing multiple linked lists in this way makes for faster traversal of all the xxproc structures for a given scheduling class in a running system, and the use of a lock per list allows for concurrency: multiple kernel threads can traverse the lists concurrently. Here's the implementation for the FSS class.

    /*
     * The fssproc_t structures are kept in an array of circular doubly linked
     * lists. A hash on the thread pointer is used to determine which list each
     * thread should be placed in. Each list has a dummy "head" which is never
     * removed, so the list is never empty. fss_update traverses these lists to
     * update the priorities of threads that have been waiting on the run queue.
     */
    #define FSS_LISTS 16 /* number of lists, must be power of 2 */
    #define FSS_LIST_HASH(t) (((uintptr_t)(t) >> 9) & (FSS_LISTS - 1))
    #define FSS_LIST_NEXT(i) (((i) + 1) & (FSS_LISTS - 1))

    #define FSS_LIST_INSERT(fssproc) \
    { \
        int index = FSS_LIST_HASH(fssproc->fss_tp); \
        kmutex_t *lockp = &fss_listlock[index]; \
        fssproc_t *headp = &fss_listhead[index]; \
    . . .
    #define FSS_LIST_DELETE(fssproc) \
    { \
        int index = FSS_LIST_HASH(fssproc->fss_tp); \
        kmutex_t *lockp = &fss_listlock[index]; \
    . . .
    static fssproc_t fss_listhead[FSS_LISTS];
    static kmutex_t fss_listlock[FSS_LISTS];

See usr/src/uts/common/disp/fss.c

The fss_listhead[] array represents the beginning of the 16 lists of fssproc_t structures, each with a corresponding lock in fss_listlock[]. The lists for the other classes are implemented in much the same fashion, with the exception of the RT list, which is implemented as a single list.

The kernel framework for scheduling classes begins with the sclass array of sclass_t structures.

    extern struct sclass sclass[]; /* the class table */

    typedef struct sclass {
        char         *cl_name;   /* class name */
        /* class specific initialization function */
        pri_t        (*cl_init)(id_t, int, classfuncs_t **);
        classfuncs_t *cl_funcs;  /* pointer to classfuncs structure */
        krwlock_t    *cl_lock;   /* class structure read/write lock */
        int          cl_count;   /* # of threads trying to load class */
    } sclass_t;

See usr/src/uts/common/sys/class.h

For each loaded scheduling class, the sclass array is initialized with the members listed above and indexed with the class ID (cid) kernel variable.

    # mdb -k
    Loading modules: [ unix krtld genunix specfs dtrace uppc pcplusmp ufs ip sctp usba uhci s1394 fctl nca lofs zfs random nfs audiosup cpc fcip crypto ptm sppp ipc ]
    > ::class
    SLOT NAME INIT FCN   CLASS FCN
       0 SYS  sys_init   sys_classfuncs
       1 TS   ts_init    ts_classfuncs
       2 FX   fx_init    fx_classfuncs
       3 IA   ia_init    ia_classfuncs
       4 RT   rt_init    rt_classfuncs
       5      0          0
       6      0          0
    . . .

The example above uses the mdb(1) class dcmd to dump the sclass array. The cid is displayed in the SLOT column. Note that the FSS class is not loaded in the example. The kernel loads required classes at boot time (SYS, TS); other classes get loaded dynamically as needed (as a result of placing a thread in a particular class) or through administrative commands (modload(1M)).
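The list selection used by the FSS_LIST_HASH and FSS_LIST_NEXT macros shown earlier in this section can be tried out in user space. The following is a minimal sketch, not kernel code: the demo_ names are hypothetical, and plain integers stand in for thread pointer values.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_LISTS 16   /* number of lists, must be a power of 2 */

/*
 * Same arithmetic as FSS_LIST_HASH: discard the low 9 bits of the
 * address, then mask the result down to a list index (0..15).
 */
static unsigned
demo_list_hash(uintptr_t addr)
{
    return ((unsigned)((addr >> 9) & (DEMO_LISTS - 1)));
}

/*
 * Same arithmetic as FSS_LIST_NEXT: step to the next list index,
 * wrapping from the last list back to the first.
 */
static unsigned
demo_list_next(unsigned i)
{
    assert(i < DEMO_LISTS);
    return ((i + 1) & (DEMO_LISTS - 1));
}
```

Because the list count is a power of 2, the mask replaces a modulo operation. Addresses that differ only in their low 9 bits (within the same 512-byte span) land on the same list, while addresses 512 bytes apart land on adjacent lists.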
Part of the scheduling class loading and initializing process is the instantiation of the sclass_t object and its entry in the sclass array. Part of each scheduling class is a set of pointers to the functions within the class, referenced with the cl_funcs pointer in the sclass_t. Scheduling class functions are subdivided into two categories: thread operations and class operations. As the names suggest, the thread operations are the class functions that act on a kernel thread, and the class operations are administrative and management functions.

    typedef struct classfuncs {
        class_ops_t  sclass;
        thread_ops_t thread;
    } classfuncs_t;

    typedef struct sclass {
        char         *cl_name;   /* class name */
        /* class specific initialization function */
        pri_t        (*cl_init)(id_t, int, classfuncs_t **);
        classfuncs_t *cl_funcs;  /* pointer to classfuncs structure */
        krwlock_t    *cl_lock;   /* class structure read/write lock */
        int          cl_count;   /* # of threads trying to load class */
    } sclass_t;

See usr/src/uts/common/sys/class.h

The class functions are embedded in an sclass_t object, which is also linked to kernel threads (based, of course, on the scheduling class of the thread). Figure 3.6 illustrates the big picture.

Figure 3.6. Scheduling Class Framework
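The relationship between the class table, the per-class function vectors, and a thread's own link to its class functions can be sketched in a few lines of user-space C. This is an illustration only: every demo_ name is hypothetical, the structures are cut down to one operation each, the slot numbers do not match a real system, and the RT priority offset is a made-up value.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct demo_thread;

typedef struct demo_class_ops {         /* class (administrative) operations */
    int (*op_getclpri)(void);
} demo_class_ops_t;

typedef struct demo_thread_ops {        /* operations that act on a thread */
    int (*op_globpri)(struct demo_thread *);
} demo_thread_ops_t;

typedef struct demo_classfuncs {        /* same layout idea as classfuncs_t */
    demo_class_ops_t sclass;
    demo_thread_ops_t thread;
} demo_classfuncs_t;

typedef struct demo_sclass {            /* cut-down stand-in for sclass_t */
    const char *cl_name;
    demo_classfuncs_t *cl_funcs;
} demo_sclass_t;

typedef struct demo_thread {
    int t_cid;                          /* class ID: index into the table */
    int t_pri;                          /* priority within the class */
    demo_thread_ops_t *t_clfuncs;       /* link to the class thread ops */
} demo_thread_t;

static int ts_getclpri(void) { return (59); }
static int rt_getclpri(void) { return (59); }
static int ts_globpri(demo_thread_t *t) { return (t->t_pri); }
/* The offset of 100 is an illustrative assumption, not taken from the text. */
static int rt_globpri(demo_thread_t *t) { return (100 + t->t_pri); }

static demo_classfuncs_t ts_funcs = { { ts_getclpri }, { ts_globpri } };
static demo_classfuncs_t rt_funcs = { { rt_getclpri }, { rt_globpri } };

static demo_sclass_t demo_sclass[] = {
    { "TS", &ts_funcs },
    { "RT", &rt_funcs },
};
#define DEMO_NCLASS (sizeof (demo_sclass) / sizeof (demo_sclass[0]))

/* Resolved through the class table, indexed by class ID. */
#define DEMO_GETCLPRI(cid) \
    ((*demo_sclass[(cid)].cl_funcs->sclass.op_getclpri)())
/* Resolved through the thread's own t_clfuncs pointer. */
#define DEMO_GLOBPRI(t) ((*(t)->t_clfuncs->op_globpri)(t))

static int
demo_getcid(const char *name)
{
    size_t i;

    for (i = 0; i < DEMO_NCLASS; i++) {
        if (strcmp(demo_sclass[i].cl_name, name) == 0)
            return ((int)i);
    }
    return (-1);                        /* class not loaded */
}

/* Entering a class installs the link from the thread to its class ops. */
static void
demo_enterclass(demo_thread_t *t, int cid)
{
    assert(cid >= 0 && (size_t)cid < DEMO_NCLASS);
    t->t_cid = cid;
    t->t_clfuncs = &demo_sclass[cid].cl_funcs->thread;
}
```

A caller would look up a class ID with demo_getcid("RT"), place a thread in that class with demo_enterclass(), and from then on reach the class routines through the thread itself with DEMO_GLOBPRI(). Moving the thread to another class changes which function the same macro invokes.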
For space and readability, the FSS class framework is shown separately in Figure 3.7. The framework is similar for FSS, with the addition of several FSS-specific objects linked off fssproc_t. The FSS class is unique since it implements a share-based scheduling policy that requires administrative input for share allocation and (optionally) processor sets. Additional support structures, the fssproj_t (project interface) and fsspset_t (processor set interface), are linked to the fssproc_t. There is also an fsszone_t to manage FSS threads running in zones.

Figure 3.7 shows three FSS class threads that are all part of the same project; each thread's fssproc_t references the same fssproj_t project structure. The kernel's internal project structure, kproject_t, maintains the share value allocated to the project and various project-level resource controls. Data on the CPU set allocated to the project is maintained in the fsspset_t, which links to a CPU partition structure (cpupart_t). The fsszone_t object is defined and instantiated by the kernel when a zone is created and shares are allocated. This behavior supports Solaris Zones and the ability to allocate a given number of CPU shares to a zone.

Figure 3.7. FSS Structure Framework
Getting back to Figure 3.6, the scheduling class operations vector (the function pointers in the classfuncs_t object) is at the center of the framework, referenced by the kernel through the system class array and by individual kernel threads through the thread's t_clfuncs pointer. The class and thread operations function prototypes can be found in the class.h header file.

    typedef struct class_ops {
        int   (*cl_admin)(caddr_t, cred_t *);
        int   (*cl_getclinfo)(void *);
        int   (*cl_parmsin)(void *);
        int   (*cl_parmsout)(void *, pc_vaparms_t *);
        int   (*cl_vaparmsin)(void *, pc_vaparms_t *);
        int   (*cl_vaparmsout)(void *, pc_vaparms_t *);
        int   (*cl_getclpri)(pcpri_t *);
        int   (*cl_alloc)(void **, int);
        void  (*cl_free)(void *);
    } class_ops_t;

    typedef struct thread_ops {
        int   (*cl_enterclass)(kthread_id_t, id_t, void *, cred_t *, void *);
        void  (*cl_exitclass)(void *);
        int   (*cl_canexit)(kthread_id_t, cred_t *);
        int   (*cl_fork)(kthread_id_t, kthread_id_t, void *);
        void  (*cl_forkret)(kthread_id_t, kthread_id_t);
        void  (*cl_parmsget)(kthread_id_t, void *);
        int   (*cl_parmsset)(kthread_id_t, void *, id_t, cred_t *);
        void  (*cl_stop)(kthread_id_t, int, int);
        void  (*cl_exit)(kthread_id_t);
        void  (*cl_active)(kthread_id_t);
        void  (*cl_inactive)(kthread_id_t);
        pri_t (*cl_swapin)(kthread_id_t, int);
        pri_t (*cl_swapout)(kthread_id_t, int);
        void  (*cl_trapret)(kthread_id_t);
        void  (*cl_preempt)(kthread_id_t);
        void  (*cl_setrun)(kthread_id_t);
        void  (*cl_sleep)(kthread_id_t);
        void  (*cl_tick)(kthread_id_t);
        void  (*cl_wakeup)(kthread_id_t);
        int   (*cl_donice)(kthread_id_t, cred_t *, int, int *);
        pri_t (*cl_globpri)(kthread_id_t);
        void  (*cl_set_process_group)(pid_t, pid_t, pid_t);
        void  (*cl_yield)(kthread_id_t);
    } thread_ops_t;

See usr/src/uts/common/sys/class.h

The functions are described in the next section.

3.6.2. Scheduling Class Functions

Below is a complete list of the kernel scheduling-class-specific routines and a description of what they do.
More details on many of the functions described below follow in the subsequent discussions on thread priorities and the dispatcher algorithms. The first nine functions fall into the class management category and, in general, support the priocntl(2) system call, which is invoked from the priocntl(1) and dispadmin(1M) commands. priocntl(2) can, of course, be called from an application program as well.
The following functions support and manage threads.
The code is relatively simple; if in softswap mode, set effective priority to 0. If in hardswap mode, calculate an effective priority in a similar fashion as for swap-in, such that threads with a small address space that have been in memory for a relatively long amount of time are swapped out first. A time field, t_stime, in the kthread structure is set by the swapper when a thread is marked for swap-out as well as swap-in.
The dispatcher and the kernel-at-large call the appropriate routine for a specific scheduling class, using essentially the same method used in the VFS/Vnode subsystem. A set of macros resolves to the class-specific function by indexing through either the current kernel thread pointer or the system class array. Certain functions exist in support of setting up a thread for a scheduling class; at that point the links are not yet in place in the thread to locate a function in the class operations array, so those calls are resolved through the system class array.

    #define CL_ENTERCLASS(t, cid, clparmsp, credp, bufp) \
        (sclass[cid].cl_funcs->thread.cl_enterclass) (t, cid, \
            (void *)clparmsp, credp, bufp)

    #define CL_EXITCLASS(cid, clprocp)\
        (sclass[cid].cl_funcs->thread.cl_exitclass) ((void *)clprocp)

    #define CL_CANEXIT(t, cr) (*(t)->t_clfuncs->cl_canexit)(t, cr)

    #define CL_FORK(tp, ct, bufp) (*(tp)->t_clfuncs->cl_fork)(tp, ct, bufp)

    #define CL_FORKRET(t, ct) (*(t)->t_clfuncs->cl_forkret)(t, ct)

    #define CL_GETCLINFO(clp, clinfop) \
        (*(clp)->cl_funcs->sclass.cl_getclinfo)((void *)clinfop)
    . . .

See usr/src/uts/common/sys/class.h

CL_ENTERCLASS, for example, is entered through the system class array, indexed with the class ID (cid). CL_CANEXIT, CL_FORK, etc., are entered through the thread's t_clfuncs pointer. For a complete list of the class operations macros, see usr/src/uts/common/sys/class.h.

3.6.3. Scheduling Class Dispatcher Tables

Threads execute on a CPU until they block (sleep: issue a blocking system call), are preempted (a higher-priority thread becomes runnable), or use up their time quantum. A time quantum is the maximum execution time allotted to a thread before it gets forced off the CPU and must wait for its turn to come around again. The allotted time quantum varies according to the scheduling class and, in some cases, the priority of the thread. Solaris maintains time quanta for each scheduling class in an object called a dispatch table.
The rows and columns in a table vary across the different scheduling classes, but they all provide the user interface for adjusting time quanta. You can examine the dispatch table for a given scheduling class by using dispadmin(1M):

    # dispadmin -g -c FSS
    #
    # Fair Share Scheduler Configuration
    #
    RES=1000
    #
    # Time Quantum
    #
    QUANTUM=110

The -c flag in the command line is followed by the scheduling class we're interested in, FSS in this example. The QUANTUM unit of time is based on a resolution value (reported as RES in the output). The unit of time is the reciprocal of the resolution; thus, a resolution value of 1000 equates to a unit of time of milliseconds (1/1000 = 0.001), meaning the time quantum shown is 110 milliseconds for FSS threads at any priority.

The FX and RT classes allocate different time quanta according to the priority of the thread:

    # Real Time Dispatcher Configuration
    RES=1000

    # TIME QUANTUM                    PRIORITY
    # (rt_quantum)                      LEVEL
          1000                    #        0
    . . .
           800                    #       10
    . . .
           600                    #       20
    . . .
           400                    #       30
    . . .
           200                    #       40
    . . .
           100                    #       50
    . . .
           100                    #       59

The RT table lists a quantum value for each of the 60 (0-59) possible priorities; a representative sample is shown above. Starting with a quantum of 1 second (1000 milliseconds) for the lowest-priority RT threads (priorities 0-9), the quantum is reduced as the priorities get better, providing a balance: higher-priority threads get a smaller time quantum, and lower-priority threads, which tend to wait longer for CPU time, get a larger one. The dispatch table for the FX class is similar, in that the table has two columns, assigning different time quanta to threads of different priorities; only the actual time quantum values differ.

The SYS class is not implemented with a dispatch table, since SYS class threads are not subject to time limits when they execute. A SYS class thread runs until it completes, is preempted, or voluntarily releases the processor.
The TS/IA table has several additional columns for managing the priority of TS/IA class threads based on different events and conditions. The example below shows the default values for a selected group of timeshare/interactive priorities. In the interest of space and readability, we don't list all 60 (0-59) priorities since we only need a representative sample for this discussion.

    # Time Sharing Dispatcher Configuration
    RES=1000

    # ts_quantum  ts_tqexp  ts_slpret  ts_maxwait  ts_lwait  PRIORITY LEVEL
           200         0        50           0        50     #     0
    . . .
           160         0        51           0        51     #    10
    . . .
           120        10        52           0        52     #    20
    . . .
            80        20        53           0        53     #    30
    . . .
            40        30        55           0        55     #    40
    . . .
            20        49        59       32000        59     #    59

Each entry in the TS/IA dispatch table (each row) is defined by the tsdpent (timeshare dispatch entry) data structure.

    /*
     * time-sharing dispatcher parameter table entry
     */
    typedef struct tsdpent {
        pri_t ts_globpri;  /* global (class independent) priority */
        int   ts_quantum;  /* time quantum given to procs at this level */
        pri_t ts_tqexp;    /* ts_umdpri assigned when proc at this level */
                           /* exceeds its time quantum */
        pri_t ts_slpret;   /* ts_umdpri assigned when proc at this level */
                           /* returns to user mode after sleeping */
        short ts_maxwait;  /* bumped to ts_lwait if more than ts_maxwait */
                           /* secs elapse before receiving full quantum */
        short ts_lwait;    /* ts_umdpri assigned if ts_dispwait exceeds */
                           /* ts_maxwait */
    } tsdpent_t;

See usr/src/uts/common/sys/ts.h

RES and the PRIORITY LEVEL column are not defined in tsdpent. Those fields, along with the members defined in the structure, are described below.
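To make the roles of the columns concrete, here is a minimal user-space sketch that applies the sample rows shown above, following only what the tsdpent comments state. The demo_ names and the level field are hypothetical, only the six sample rows are present, and the sketch ignores the user-controlled priority components (ts_upri, ts_boost, ts_nice) that the real TS class also factors in.

```c
#include <assert.h>
#include <stddef.h>

typedef struct demo_tsdpent {
    int level;          /* PRIORITY LEVEL column (row index in the table) */
    int ts_quantum;     /* time quantum at this level, in 1/RES seconds */
    int ts_tqexp;       /* new priority when the quantum is used up */
    int ts_slpret;      /* new priority on return from sleep */
    int ts_maxwait;     /* seconds to wait before ts_lwait applies */
    int ts_lwait;       /* new priority when ts_maxwait is exceeded */
} demo_tsdpent_t;

/* The representative rows from the dispadmin output (RES=1000). */
static const demo_tsdpent_t demo_ts_rows[] = {
    {  0, 200,  0, 50,     0, 50 },
    { 10, 160,  0, 51,     0, 51 },
    { 20, 120, 10, 52,     0, 52 },
    { 30,  80, 20, 53,     0, 53 },
    { 40,  40, 30, 55,     0, 55 },
    { 59,  20, 49, 59, 32000, 59 },
};

/* Find the sample row for a priority level; NULL if it is not listed. */
static const demo_tsdpent_t *
demo_ts_lookup(int level)
{
    size_t i;
    size_t n = sizeof (demo_ts_rows) / sizeof (demo_ts_rows[0]);

    for (i = 0; i < n; i++) {
        if (demo_ts_rows[i].level == level)
            return (&demo_ts_rows[i]);
    }
    return (NULL);
}

/* Thread ran for its full quantum: drop to the ts_tqexp priority. */
static int
demo_pri_after_quantum_expiry(int level)
{
    const demo_tsdpent_t *row = demo_ts_lookup(level);

    assert(row != NULL);
    return (row->ts_tqexp);
}

/* Thread returns to user mode after sleeping: use the ts_slpret priority. */
static int
demo_pri_after_sleep(int level)
{
    const demo_tsdpent_t *row = demo_ts_lookup(level);

    assert(row != NULL);
    return (row->ts_slpret);
}

/* Thread has waited dispwait seconds without completing a quantum. */
static int
demo_pri_after_wait(int level, int dispwait)
{
    const demo_tsdpent_t *row = demo_ts_lookup(level);

    assert(row != NULL);
    return (dispwait > row->ts_maxwait ? row->ts_lwait : level);
}
```

With these rows, a thread at level 20 that burns its whole quantum falls to level 10, while a thread at level 40 that sleeps comes back at level 55: compute-bound threads drift down and threads that block drift up.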
We can change the RES value by using the -r flag with dispadmin(1M).

    # dispadmin -g -c TS -r 100
    # Time Sharing Dispatcher Configuration
    RES=100

    # ts_quantum  ts_tqexp  ts_slpret  ts_maxwait  ts_lwait  PRIORITY LEVEL
            20         0        50           0        50     #     0
            20         0        50           0        50     #     1
    . . .

This command causes the values in the ts_quantum column to change but does not change the actual quantum allocation. For example, at priority 0, instead of a quantum value of 200 with a RES of 1000, we have a quantum value of 20 with a RES of 100. Only the fractional unit is different: instead of 200 milliseconds with a RES value of 1000, we get 20 hundredths of a second, which is the same amount of time, just represented differently (20 x 0.01 = 200 x 0.001 = 0.2 seconds). In general, it makes sense to simply leave the RES value at the default of 1000, which makes it easy to interpret the ts_quantum field as milliseconds.
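The arithmetic behind RES can be written down directly. The sketch below uses hypothetical helper names and integer math (so results truncate); it is only meant to show that a quantum is a count of 1/RES-second units.

```c
#include <assert.h>

/* Convert a quantum expressed in units of 1/res seconds to milliseconds. */
static long
demo_quantum_to_ms(long quantum, long res)
{
    assert(res > 0);
    return ((quantum * 1000L) / res);
}

/* Re-express the same quantum at a different resolution. */
static long
demo_rescale_quantum(long quantum, long old_res, long new_res)
{
    assert(old_res > 0 && new_res > 0);
    return ((quantum * new_res) / old_res);
}
```

For instance, demo_quantum_to_ms(200, 1000) and demo_quantum_to_ms(20, 100) both yield 200 milliseconds, and demo_rescale_quantum(200, 1000, 100) yields the 20 shown in the RES=100 output.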
You can apply user-supplied values to the dispatch tables by using the dispadmin(1M) command or by compiling a new /kernel/sched/TS_DPTBL loadable module and replacing the default module. The ts_dptbl(4) man page provides the source and the instructions for doing this. Either way, any changes to the dispatch tables should be done with extreme caution and tested extensively before going into production.