3.3 Process Scheduling

Another factor that affects processor performance is process scheduling. Process scheduling is how the operating system determines which process to run on the CPU. The next four sections look at different process scheduling models used in the Unix world.

3.3.1 The System V Model: The Linux Model

Linux implements a relatively simple, standard Unix scheduling model. It supports process preemption (a higher priority process can interrupt a running one), although the kernel itself is not preemptable.

The Linux kernel scheduling algorithm works by dividing CPU time into units called epochs. In a given epoch, each process has a specified time quantum whose duration is computed when the epoch begins; therefore, each process can have a different quantum. This quantum is the maximum amount of CPU time the process can consume during that epoch. When a process has exhausted its quantum, it is removed from the processor and replaced by another runnable process; of course, a process can be selected by the scheduler several times in the same epoch, as long as its quantum has not been exhausted (for example, it might have voluntarily relinquished the processor to wait for some disk I/O activity to complete).

When the scheduler is deciding which process to run next, it considers the priority of each process. There are two types of priority: static priority and dynamic priority. Static priority is assigned by users to real-time processes, and is never adjusted. Dynamic priority applies only to conventional processes, and is the sum of the process's base time quantum (also known as its base priority) and the number of ticks of CPU time left in its quantum for the current epoch. The static priority of a real-time process always outranks any dynamic priority: the scheduler will run a conventional process only if there are no runnable real-time processes.

The epoch ends when all runnable processes have exhausted their quanta; the scheduler then recomputes the time quantum durations for all processes, and a new epoch begins. A process that exhausted its quantum in the prior epoch simply gets its base time quantum again; a process that did not use its whole quantum carries half of the unused ticks over into the new quantum. A new process inherits the base time quantum of its parent. Since process 0 (the swapper) has a default base time quantum of 20 ticks (about 200 milliseconds), each process by default has a base time quantum of the same size. The base time quantum for a specific process is adjustable by the nice(2) and setpriority(2) system calls, as well as the nice command.
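To make the mechanics concrete, here is a minimal sketch of the epoch accounting in C. It is modeled loosely on the 2.4-era scheduler and is an illustration, not kernel code; the structure and names are simplified, though the carry-over rule (half of any unused quantum) and the 20-tick default are the ones described above.

    #include <stdio.h>

    #define NPROC 3

    struct proc {
        const char *name;
        int base;     /* base time quantum, in ticks (default 20) */
        int counter;  /* ticks left in the current epoch          */
    };

    /* Dynamic priority: the base quantum plus the ticks remaining.  A
       process with no ticks left is not runnable until the next epoch. */
    static int goodness(const struct proc *p)
    {
        return p->counter ? p->base + p->counter : 0;
    }

    /* End of epoch: recompute every quantum.  Processes that slept
       through part of the epoch keep half of their unused ticks, which
       favors interactive jobs over CPU-bound ones. */
    static void new_epoch(struct proc procs[], int n)
    {
        for (int i = 0; i < n; i++)
            procs[i].counter = procs[i].counter / 2 + procs[i].base;
    }

    int main(void)
    {
        struct proc procs[NPROC] = {
            { "cpu-hog", 20, 0 },   /* used its whole quantum last epoch  */
            { "editor",  20, 14 },  /* mostly slept; 14 ticks left over   */
            { "niced",   10, 0 },   /* smaller base quantum (higher nice) */
        };

        new_epoch(procs, NPROC);
        for (int i = 0; i < NPROC; i++)
            printf("%-8s quantum=%2d goodness=%2d\n",
                   procs[i].name, procs[i].counter, goodness(&procs[i]));
        return 0;
    }

Running this shows the mostly sleeping editor emerging from the recomputation with both a larger quantum and a higher dynamic priority than the CPU hog, even though both have the same base quantum.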

3.3.1.1 Finding a process's priority

You can determine the priority of a running process by examining the PRI column of the output of ps -el, as illustrated in Example 3-1.

Example 3-1. Finding a process's priority in Linux
 % ps -el
   F S   UID   PID  PPID  C PRI  NI ADDR    SZ WCHAN  TTY          TIME CMD
 100 S     0     1     0  0  60   0    -   326 do_sel ?        00:00:48 init
 040 S     0     2     1  0  69   0    -     0 contex ?        00:00:00 keventd
 040 S     0     3     1  0  69   0    -     0 kswapd ?        00:05:32 kswapd
 040 S     0     4     1  0  69   0    -     0 krecla ?        00:00:00 kreclaimd
 040 S     0     5     1  0  69   0    -     0 bdflus ?        00:24:43 bdflush
 040 S     0     6     1  0  69   0    -     0 kupdat ?        00:23:01 kupdate
 ...
 140 S     0 11602   488  0  68   0    -   745 tcp_da ?        00:00:00 sendmail
 140 S     0 11727   488  0  68   0    -   745 tcp_da ?        00:00:00 sendmail
 140 S     0 11793   488  0  69   0    -   745 tcp_da ?        00:00:00 sendmail
 000 R   563 11877  5019  0  75   0    -   759 -      pts/3    00:00:00 ps
3.3.1.2 Adjusting a process's effective priority

Adjusting a process's effective priority is accomplished in a straightforward manner via the renice command, which changes the absolute "niceness" value of the process. This value can range from +20 (the affected processes will run only when the system is otherwise idle) to -20 (the affected processes will be scheduled preferentially). The default value is zero. (As the following examples show, Linux actually caps positive niceness at +19.)

For example, let's look at a certain process (rhof):

 % ps -el | grep rhof
   F S   UID   PID  PPID  C PRI  NI  ADDR    SZ WCHAN  TTY          TIME CMD
 000 S   563  5620  5601  0  69   0    -    514 do_sel pts/6    00:00:00 rhof

Let's decrease the priority of this process by renicing it to +20:

 % renice +20 5620
 5620: old priority 0, new priority 20
 % ps -el | grep rhof
   F S   UID   PID  PPID  C PRI  NI  ADDR    SZ WCHAN  TTY          TIME CMD
 000 S   563  5620  5601  0  69  19    -    514 do_sel pts/6    00:00:00 rhof

Notice that the NI column value has become 19 (the cap on positive niceness), reflecting the process's decreased priority. Nonroot users cannot increase the priority of their own processes, even if they were the ones who decreased it in the first place:

 % ps -el | grep rhof
   F S   UID   PID  PPID  C PRI  NI  ADDR    SZ WCHAN  TTY          TIME CMD
 000 S   563  5865  5601  0  70   0    -    514 do_sel pts/6    00:00:00 rhof
 % renice +5 5865
 5865: old priority 0, new priority 5
 % ps -el | grep rhof
   F S   UID   PID  PPID  C PRI  NI  ADDR    SZ WCHAN  TTY          TIME CMD
 000 S   563  5865  5601  0  71   5    -    514 do_sel pts/6    00:00:00 rhof
 % renice +3 5865
 renice: 5865: setpriority: Permission denied

In order to increase the process's priority, we need to be root:

 % sudo renice -20 5865
 5865: old priority 0, new priority -20
 % ps -el | grep rhof
   F S   UID   PID  PPID  C PRI  NI  ADDR    SZ WCHAN  TTY          TIME CMD
 000 S  1000  3274  3240  0  59 -20    -   1619 select pts/1    00:00:00 rhof

This provides a simple, easy way to adjust the priority of a running process.
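Under the hood, renice is essentially a thin wrapper around the getpriority(2) and setpriority(2) calls. The following minimal sketch of a hypothetical mini-renice shows the whole mechanism (the careful errno handling that getpriority(2) requires is omitted for brevity):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <errno.h>
    #include <sys/time.h>
    #include <sys/resource.h>

    /* mini-renice: set the niceness of one process.
       Usage: mini-renice <niceness> <pid> */
    int main(int argc, char **argv)
    {
        if (argc != 3) {
            fprintf(stderr, "usage: %s niceness pid\n", argv[0]);
            return 1;
        }
        int niceness = atoi(argv[1]);
        int pid = atoi(argv[2]);

        /* getpriority() can legitimately return -1, so real code must
           clear and then check errno; that dance is omitted here. */
        int old = getpriority(PRIO_PROCESS, pid);

        if (setpriority(PRIO_PROCESS, pid, niceness) == -1) {
            fprintf(stderr, "%d: setpriority: %s\n", pid, strerror(errno));
            return 1;
        }
        printf("%d: old priority %d, new priority %d\n",
               pid, old, getpriority(PRIO_PROCESS, pid));
        return 0;
    }

This reproduces the permission behavior shown above: for an unprivileged caller, setpriority() fails when asked to lower a process's niceness.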

3.3.1.3 Modifications for SMP systems

In a multiprocessor system, an interesting conundrum can arise. Say there are two processors: one is idle, and the other is executing a process at relatively low priority. A high priority process that last ran on the busy processor becomes runnable. Should we execute the high priority process immediately on the idle CPU, in which case we must do some work to migrate the process's cache lines between processors, or should we defer it to run on its previous CPU? Linux addresses this issue with an empirical rule tied to the processor's cache size: the larger the processor's cache, the longer a process will wait for time on the processor where it last ran.
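As a rough model, think of the decision as comparing weights: the process's claim on its cache-warm CPU grows with that CPU's cache size. The sketch below is purely illustrative; the scaling constant and bookkeeping are assumptions for the example, not the kernel's actual logic.

    #include <stdio.h>

    struct cpu {
        int  id;
        long cache_kb;   /* size of this CPU's cache */
        int  idle;
    };

    /* Illustrative placement weight: the process's priority, plus an
       affinity bonus on the CPU it last ran on that is proportional to
       that CPU's cache size (more cache, more state to migrate). */
    static long placement_weight(int prio, const struct cpu *c, int last_cpu)
    {
        long w = prio;
        if (c->id == last_cpu)
            w += c->cache_kb / 64;   /* assumed: one point per 64 KB */
        return w;
    }

    int main(void)
    {
        struct cpu cpus[2] = {
            { 0, 2048, 0 },          /* busy, but cache-warm for us */
            { 1, 2048, 1 },          /* idle, but cache-cold        */
        };
        int prio = 30, last_cpu = 0;

        for (int i = 0; i < 2; i++)
            printf("cpu%d: weight %ld (%s)\n", cpus[i].id,
                   placement_weight(prio, &cpus[i], last_cpu),
                   cpus[i].idle ? "idle, cache-cold" : "busy, cache-warm");
        return 0;
    }

With a 2 MB cache, the warm CPU wins by a wide margin; the process migrates to the idle CPU only once the expected wait on its old CPU outweighs the affinity bonus.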

3.3.2 Multilayered Scheduling Classes: The Solaris Model

One of the very nice features of the Solaris operating environment is the ability for administrators to control how the operating system schedules processes for execution. This allows fine-grained control over what processes are running at what priorities. One fundamental thing to keep in mind is that Solaris does not directly schedule processes; rather, it schedules the underlying kernel threads.

3.3.2.1 The Solaris threading model

Before we can continue our discussion of process scheduling, we need to discuss exactly what a kernel thread is. Solaris implements a multilayer threading model, which is intended to decouple the management of user-level threads from the kernel. User-level threads have their own priority scheme, and are scheduled by a scheduling thread that the threading library creates when a multithreaded application runs. This allows multithreaded applications to spawn thousands of threads without significantly loading the kernel; in fact, user threads aren't made visible to the kernel until they are bound to a lightweight process, or LWP, which has some state in the operating system. The scheduling thread is responsible for mapping user threads onto LWPs, each of which is associated with a kernel thread for execution on a processor. [10] Every LWP has a kernel thread, but not every kernel thread has an LWP: some kernel threads are used exclusively by the operating system, and thus an LWP is not required.

[10] A nonthreaded application has one LWP and one kernel thread associated with it.

Some literature suggests that kernel threads and LWPs are the same thing. They are distinctly different, but in a discussion of user processes it is acceptable to think of them as one entity: there is always, without exception, a one-to-one relationship between an LWP and its kernel thread.
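The distinction is visible in the Solaris threads API, where a program chooses whether a user thread is multiplexed by the library or permanently bound to its own LWP. A minimal sketch (compile with -lthread on Solaris; error handling omitted):

    #include <thread.h>   /* Solaris threads API */
    #include <stdio.h>

    static void *work(void *arg)
    {
        printf("thread %ld running\n", (long)arg);
        return arg;
    }

    int main(void)
    {
        thread_t unbound, bound;

        /* An unbound thread is multiplexed onto the process's pool of
           LWPs by the library's scheduling thread. */
        thr_create(NULL, 0, work, (void *)1L, 0, &unbound);

        /* THR_BOUND gives the thread its own dedicated LWP (and thus a
           dedicated kernel thread), making it permanently visible to
           the kernel's scheduler. */
        thr_create(NULL, 0, work, (void *)2L, THR_BOUND, &bound);

        thr_join(unbound, NULL, NULL);
        thr_join(bound, NULL, NULL);
        return 0;
    }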

For the rest of our discussion of scheduling, the word "thread" will refer specifically to kernel threads.

3.3.2.2 Scheduling classes

Threads generally fall into five categories for scheduling: timesharing (ts), interactive (ia), kernel, real-time (rt), and interrupt. The relative priorities in each of these classes can be mapped into absolute priorities, as summarized in Table 3-3.

Table 3-3. Process priority table layout in Solaris

Absolute priority range    Relative priority range    Scheduling queue
-----------------------    -----------------------    ----------------
169-160 (or 109-100)       9-0                        Interrupt threads (109-100 if the
                                                      real-time scheduling class is not loaded)
159-100                    59-0                       Real-time priorities
99-60                      39-0                       Kernel priorities
59-0                       59-0                       Timesharing and interactive priorities

The timeshare class is the default class that processes are scheduled in. However, the interactive class exists to improve interactive performance on windowing systems. This is accomplished by giving the foreground processes in the active window a higher priority. The processes that are chosen to be in the interactive class are those that come up under the windowing environment; remote processes are still scheduled in the timesharing class. However, the interactive and timesharing scheduling classes overlap in priority, so an interactive process at priority 59 is running at the same priority as a timesharing process at priority 59.

Let's discuss how the operating system decides which threads should be run.

Notes on Real-Time Scheduling

Real-time scheduling refers to the means by which a process can be assured of having a slice of the processor when it needs it. This is a requirement for certain types of applications, such as robotics and flight control. In general, operating system support for real-time scheduling falls into three categories: none whatsoever, hard real-time support (in which all hard deadlines for applications are met), or soft real-time support (in which deadlines may be missed, but this is kept to a statistical minimum). Solaris is a soft real-time environment.

In a soft real-time environment, the response time must be finitely bounded (i.e., there must be an upper bound on the time during which the application can be kept off the processor -- in Solaris, this may be as long as five milliseconds). Unix environments typically do not implement real-time support because of two problems. First, page faults can induce unpredictable latency (see Section 4.3); Solaris solves this problem by locking all the pages of a real-time process in memory. Second, in most Unix systems the kernel is not preemptable, meaning that it cannot be forced off the processor so that another process can run. The Solaris kernel, by contrast, can be preempted almost everywhere; the few points at which it cannot are called nonpreemption points. The most common nonpreemptable case is when the kernel is executing in interrupt context, and even then a higher priority interrupt can displace an interrupt that is in the process of being serviced.
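For reference, a process places itself in the real-time class programmatically with the priocntl(2) system call, using the usual PC_GETCID/PC_SETPARMS idiom sketched below. This must run as root, and error handling is trimmed to the minimum.

    #include <sys/types.h>
    #include <sys/procset.h>
    #include <sys/priocntl.h>
    #include <sys/rtpriocntl.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        pcinfo_t  pcinfo;
        pcparms_t pcparms;
        rtparms_t *rtp = (rtparms_t *)pcparms.pc_clparms;

        /* Look up the class ID of the real-time class. */
        (void) strcpy(pcinfo.pc_clname, "RT");
        if (priocntl(0, 0, PC_GETCID, (caddr_t)&pcinfo) == -1)
            return 1;

        /* Move this process into the RT class at relative priority 0,
           with the class's default time quantum. */
        pcparms.pc_cid = pcinfo.pc_cid;
        rtp->rt_pri = 0;
        rtp->rt_tqsecs = 0;
        rtp->rt_tqnsecs = RT_TQDEF;
        if (priocntl(P_PID, getpid(), PC_SETPARMS, (caddr_t)&pcparms) == -1)
            return 1;

        /* ... real-time work here; its pages are now locked in memory ... */
        return 0;
    }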

3.3.2.3 The dispatcher

When a thread is ready to run, it is placed on a dispatch queue. Every processor has a separate set of dispatch queues. This state (waiting on the dispatch queue) is known as the RUN state. When a processor becomes free, the dispatcher takes the thread with the highest priority from that processor's dispatch queue and places it on the processor. The thread is then in the ONPROC state, and can perform computation.
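Structurally, a set of dispatch queues is just an array of per-priority FIFOs, and dispatching means taking the head of the highest nonempty one. Here is a deliberately simplified sketch with a single queue set; the real dispatcher keeps one set per processor and does far more bookkeeping.

    #include <stdio.h>
    #include <stddef.h>

    #define NPRI 170                 /* absolute priorities 0-169 */

    struct kthread {
        const char *name;
        int pri;                     /* absolute priority            */
        struct kthread *next;        /* link within a dispatch queue */
    };

    /* One set of dispatch queues: a FIFO per priority level. */
    struct dispq { struct kthread *head[NPRI]; };

    /* A thread entering the RUN state is queued at the tail for its
       priority level. */
    static void setrun(struct dispq *dq, struct kthread *t)
    {
        struct kthread **pp = &dq->head[t->pri];
        while (*pp)
            pp = &(*pp)->next;
        t->next = NULL;
        *pp = t;
    }

    /* The dispatcher removes the highest-priority runnable thread,
       which then moves to the ONPROC state. */
    static struct kthread *disp(struct dispq *dq)
    {
        for (int pri = NPRI - 1; pri >= 0; pri--)
            if (dq->head[pri]) {
                struct kthread *t = dq->head[pri];
                dq->head[pri] = t->next;
                return t;
            }
        return NULL;                 /* nothing runnable: idle */
    }

    int main(void)
    {
        struct dispq dq = { { NULL } };
        struct kthread a = { "editor", 59 }, b = { "batch", 10 },
                       c = { "rt-job", 100 };

        setrun(&dq, &a); setrun(&dq, &b); setrun(&dq, &c);
        for (struct kthread *t; (t = disp(&dq)) != NULL; )
            printf("dispatching %s (pri %d)\n", t->name, t->pri);
        return 0;
    }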

A thread can leave the processor via several mechanisms. The first is that it computes until it reaches a time limit determined by its priority; this limit is called the quantum. The quantum is usually inversely related to priority (higher priority threads have smaller quanta), since higher priority threads are scheduled more frequently. When a thread exhausts its quantum, its priority is set to a new level, referred to as the tqexp value. This level is almost always lower than the thread's previous priority; this is done to ensure "fair share" scheduling.

Another way for the thread to leave the processor is to block, for example while waiting for an I/O to complete or for a lock to be released. This puts the thread into the SLEEP state. Threads that go to sleep while holding "critical" resources (such as reader/writer locks) are assigned a priority in the kernel class, so that they get a chance to release their locks quickly. When a sleeping thread is returned to the processor, its priority is set to a new value, called slpret. This is usually higher than the thread's earlier priority, so that the thread gets some execution time fairly quickly after waking up.

The third and final way for a thread to leave the processor is to be forced off, usually because the kernel preempts it. If a preempted thread spends enough time starved for the processor (more than the maxwait parameter), its priority is raised to the value specified by the lwait parameter. This compensates threads that have been waiting on the dispatch queues for a long time after being preempted.
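These three exits, and the parameters that govern them, fit in a few lines of illustrative C. The field names anticipate the dispatch-table columns described in Section 3.3.2.5, and the sample row uses the default values for timesharing priority level 59 shown there; this is a model for exposition, not kernel code.

    #include <stdio.h>

    enum leave_reason {
        QUANTUM_EXPIRED,      /* used its entire time quantum           */
        WENT_TO_SLEEP,        /* blocked on I/O or a lock, then woke up */
        STARVED_PAST_MAXWAIT  /* preempted and left waiting too long    */
    };

    /* Illustrative model of one per-priority dispatch-table row. */
    struct dispent {
        int quantum;          /* maximum time on the processor           */
        int tqexp;            /* new, usually lower, priority on expiry  */
        int slpret;           /* new, usually higher, priority on waking */
        int maxwait;          /* starvation threshold                    */
        int lwait;            /* compensating priority after starvation  */
    };

    static int next_priority(const struct dispent *row, enum leave_reason why)
    {
        switch (why) {
        case QUANTUM_EXPIRED:       return row->tqexp;
        case WENT_TO_SLEEP:         return row->slpret;
        case STARVED_PAST_MAXWAIT:  return row->lwait;
        }
        return -1;
    }

    int main(void)
    {
        /* The default TS row for priority level 59 (see Section 3.3.2.5). */
        struct dispent ts59 = { 20, 49, 59, 32000, 59 };

        printf("after quantum expiry: %d\n", next_priority(&ts59, QUANTUM_EXPIRED));
        printf("after waking:         %d\n", next_priority(&ts59, WENT_TO_SLEEP));
        printf("after starvation:     %d\n", next_priority(&ts59, STARVED_PAST_MAXWAIT));
        return 0;
    }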

3.3.2.4 Checking a process's priority

You can find out which lightweight process and scheduling class a specific process is associated with by using the -L and -c switches, respectively, with /usr/bin/ps, as shown in Example 3-2.

Example 3-2. Viewing a process's priority in Solaris
 % /usr/bin/ps -efcL
      UID   PID  PPID   LWP  NLWP  CLS PRI    STIME TTY     LTIME CMD
     root     0     0     1     1  SYS  96   Mar 20 ?        0:00 sched
     root     1     0     1     1   TS  58   Mar 20 ?        0:06 /etc/init -
     root     2     0     1     1  SYS  98   Mar 20 ?        0:00 pageout
     root     3     0     1     1  SYS  60   Mar 20 ?       22:37 fsflush
 ...
     root   200     1     1     7   TS  51   Mar 20 ?        0:00 /usr/sbin/nscd
     root   200     1     2     7   TS  58   Mar 20 ?        0:00 /usr/sbin/nscd
     root   200     1     3     7   TS  59   Mar 20 ?        0:00 /usr/sbin/nscd
     root   200     1     4     7   TS  58   Mar 20 ?        0:00 /usr/sbin/nscd
     root   200     1     5     7   TS  59   Mar 20 ?        0:00 /usr/sbin/nscd
     root   200     1     6     7   TS  59   Mar 20 ?        0:00 /usr/sbin/nscd
     root   200     1     7     7   TS  58   Mar 20 ?        0:01 /usr/sbin/nscd
     root   185     1     1     1   TS  48   Mar 20 ?        0:01 /usr/sbin/cron
 ...
     root  4897   164     1     1   TS  38 20:28:02 ?        0:00 in.telnetd
     root  5235  4899     1     1   TS  48 21:10:19 pts/1    0:00 /usr/bin/ps -cefL
      gdm  4899  4897     1     1   TS  58 20:28:02 pts/1    0:00 -csh

I have significantly trimmed the output of this command. Of particular interest here are the LWP, CLS, and PRI columns, which give the lightweight process number, scheduling class, and priority, respectively. Note also that multithreaded processes (for example, the name service cache daemon, /usr/sbin/nscd) can exist on multiple LWPs concurrently (the number of LWPs a process is associated with is given by the NLWP column).

3.3.2.5 Tuning the dispatch tables

The dispatch tables are tunable via the dispadmin command. This is accomplished by writing the dispatch table for a specific class to a file, editing the file, and then reloading the file into the dispatch table.

To see what's available for editing, you can display the list of active scheduling classes with dispadmin -l:

 % dispadmin -l
 CONFIGURED CLASSES
 ==================

 SYS     (System Class)
 TS      (Time Sharing)
 RT      (Real Time)
 IA      (Interactive)

You can then dump the scheduling table by using dispadmin -c class -g :

 % dispadmin -c TS -g
 # Time Sharing Dispatcher Configuration
 RES=1000

 # ts_quantum  ts_tqexp  ts_slpret  ts_maxwait ts_lwait  PRIORITY LEVEL
        200         0        50           0        50        #     0
        200         0        50           0        50        #     1
        200         0        50           0        50        #     2
        200         0        50           0        50        #     3
 ...
        200         0        50           0        50        #     9
        160         0        51           0        51        #    10
        160         1        51           0        51        #    11
        160         2        51           0        51        #    12
        160         3        51           0        51        #    13
        160         4        51           0        51        #    14
        160         5        51           0        51        #    15
 ...
         40        42        58           0        59        #    52
         40        43        58           0        59        #    53
         40        44        58           0        59        #    54
         40        45        58           0        59        #    55
         40        46        58           0        59        #    56
         40        47        58           0        59        #    57
         40        48        58           0        59        #    58
         20        49        59       32000        59        #    59

The output of this table is significantly trimmed. Let's walk through it piece by piece.

The first line specifies the resolution of the time figures in the table that follows. The reciprocal of RES gives the unit for the ts_quantum column: with the default RES of 1000, the unit is 1/1000 of a second, so ts_quantum values are in milliseconds (the 200 at priority level 0, for example, is a 200-millisecond quantum). If the table is displayed at a higher resolution, the ts_quantum values scale up correspondingly. While the resolution value is changeable, it is usually advisable to leave it at the default.

I've already described the meanings of most of these columns in the discussions of how the dispatcher works (see Section 3.3.2.3 earlier in this chapter). I'll briefly summarize them here:

  • The ts_quantum field represents the maximum amount of time the thread can remain on the processor.

  • The ts_tqexp field specifies the priority the thread will be assigned following its removal from the processor after it uses its entire time quantum.

  • The ts_slpret field gives the priority a thread will have when it returns to execution after sleeping (e.g., after waiting for an I/O to complete).

  • The ts_maxwait and ts_lwait fields control the compensation of a thread that has been preempted by a kernel-priority thread and kept off the processor for a long period of time (greater than ts_maxwait): its priority is set to the value given by ts_lwait.

Some features are of particular interest. The first is that the quantum values decrease as the priority level increases. Another is that the ts_maxwait value for priority level 59 is set very high; this ensures that every thread gets a fair chance at being scheduled.

Because ts_maxwait is zero for every priority level except the highest, all but the highest priority timesharing threads have their priorities readjusted into the 50-59 range about once a second. This keeps a thread that is CPU-bound for an extended period from being penalized indefinitely. Threads that are CPU hogs will, over time, end up spending most of their time in the 0-9 priority range, as they keep using up their quanta and suffering the ts_tqexp priority readjustment. Once every second, they'll be bumped up again, but they will migrate back down if they continue to sustain their CPU-intensive behavior.

The real-time ( RT ) class has a much simpler table:

 # Real Time Dispatcher Configuration
 RES=1000

 # TIME QUANTUM                    PRIORITY
 # (rt_quantum)                      LEVEL
       1000                    #        0
       1000                    #        1
       1000                    #        2
       1000                    #        3
       1000                    #        4
 ...
        100                    #       55
        100                    #       56
        100                    #       57
        100                    #       58
        100                    #       59

This table is simpler because real-time scheduling does not involve any of the complex adjustments described previously: a real-time thread runs until it blocks or until its quantum expires (the quantum matters only when multiple real-time threads are runnable). Also note that even though the priority levels here range from 0 to 59, real-time threads are scheduled at a higher priority than the kernel, because these relative priorities are mapped to absolute priorities as shown in Table 3-3.

You can tweak the dispatch tables by writing the output of dispadmin to a file (as shown), editing the file, and then loading it by running dispadmin -c class -s file .

The most common technique employed here is to fence off the top timesharing class, so that high priority CPU-intensive jobs run without being forced down to lower priority levels. You can accomplish this by editing the timesharing class dispatch table to look something like what's shown in Example 3-3.

Example 3-3. Creating a priority fence in the timesharing class
 # Time Sharing Dispatcher Configuration
 RES=1000

 # ts_quantum  ts_tqexp  ts_slpret  ts_maxwait ts_lwait  PRIORITY LEVEL
 ...
         40        36        57           0        57        #    46
         40        37        58           0        58        #    47
         40        38        58           0        58        #    48
         40        39        58           0        58        #    49
         40        40        58           0        58        #    50
         40        41        58           0        58        #    51
         40        42        58           0        58        #    52
         40        43        58           0        58        #    53
         40        44        58           0        58        #    54
         40        45        58           0        58        #    55
         40        46        58           0        58        #    56
         40        47        58           0        58        #    57
         20        48        58       32000        58        #    58
        200        49        59           0        59        #    59

The changed values are the ts_lwait entries for priority levels 49 through 58, and the ts_quantum and ts_maxwait entries for levels 58 and 59. The net effect is that no process can naturally find its way into priority level 59 (this is why the ts_lwait parameter has been adjusted). Priority level 58 now functions as priority level 59 used to: it has an unusually small quantum and a high value of ts_maxwait. By increasing the quantum of priority 59, threads at that level will run longer before they are forced to relinquish the CPU to other threads that need to run. In order to get a process into this priority level, you will need to run priocntl, as described in the next section.

There's one small catch with this scheme: you will need to flush any existing processes out of priority level 59. This is accomplished by creating a copy of Example 3-3, and changing the last line (for priority level 59) to this:

 # Time Sharing Dispatcher Configuration
 RES=1000

 # ts_quantum  ts_tqexp  ts_slpret  ts_maxwait ts_lwait  PRIORITY LEVEL
 ...
        200                                                  #    59

This ensures that any process in priority level 59 quickly departs from it. You can then use ps -ecL | grep 59 to find processes that are still running at priority 59; they may need to be stopped and restarted.

Please exercise care when adjusting the dispatch tables. It is possible to cause a great amount of damage to the stability and usability of a production system through this interface. Test thoroughly!

3.3.2.6 Adjusting priorities

You can directly adjust the priority of a process via the priocntl command. This is accomplished by specifying the -s flag, which indicates that you are setting scheduling parameters, along with -p priority, which specifies the priority level to set, and -i pid process-id, which identifies the process to be adjusted. For example, let's move the nscd process into priority level 59, after having adjusted the dispatch tables as described in Example 3-3 to fence off timesharing priority level 59:

 # /usr/bin/ps -ecL | grep 59
 # /usr/bin/ps -ecL | /usr/xpg4/bin/grep -E 'nscd|PID'
   PID   LWP  CLS PRI TTY     LTIME CMD
   200     1   TS  51 ?        0:00 nscd
   200     2   TS  58 ?        0:00 nscd
   200     3   TS  58 ?        0:00 nscd
   200     4   TS  58 ?        0:00 nscd
   200     5   TS  58 ?        0:00 nscd
   200     6   TS  58 ?        0:00 nscd
   200     7   TS  58 ?        0:01 nscd
 # priocntl -s -p 59 -i pid 200
 # /usr/bin/ps -ecL | /usr/xpg4/bin/grep -E 'nscd|PID'
   PID   LWP  CLS PRI TTY     LTIME CMD
   200     1   TS  59 ?        0:00 nscd
   200     2   TS  59 ?        0:00 nscd
   200     3   TS  59 ?        0:00 nscd
   200     4   TS  59 ?        0:00 nscd
   200     5   TS  59 ?        0:00 nscd
   200     6   TS  59 ?        0:00 nscd
   200     7   TS  59 ?        0:01 nscd

The advantage of a trick like this is to ensure that high priority jobs get as much processor time as possible, regardless of the environment in which they are running.

The renice command, which we discussed earlier in this chapter in Section 3.3.1.2, provides another way of adjusting a process's priority, by supplying a static offset into the priority tables. This offset is equal to the renice amount with its sign reversed. For example, a process that has been reniced to -20 receives a +20 offset into the priority table (so its priority will never drop below 20).


