The 2.6 Linux Scheduler | Performance Tuning for Linux Servers

Before release 2.4 of the Linux kernel, SMP machines were not seriously considered in terms of performance. As the 2.2 version of the kernel matured, however, it became increasingly evident that good SMP performance was important to Linux users. Although it remains desirable for performance reasons to keep as much of the SMP code as possible under a configuration option, it has also become more important to commercial Linux users that SMP code utilize the processors efficiently and treat applications equitably.

To achieve its scheduling objectives, Linux assigns a static priority to each task that can be modified by the user through the nice() system call. Linux has a range of priority classes, varying from 0 to MAX_PRIO, where MAX_PRIO=140. The first MAX_RT_PRIO priorities, where MAX_RT_PRIO=100, are set aside for real-time tasks. The remaining 40 priority classes, [100..140], are set aside for time sharing (that is, normal) jobs, representing the [20..19] nice value of UNIX processes. Lower nice values correspond to lower static priority and higher importance. Real-time tasks always have a higher priority than normal tasks.

Each task maintains an effective priority, which is determined by the interactivity of the task. The more interactive a task, the more important it becomes. Temporary effective priority bonuses or penalties are given based on the recent sleep average of a given task. The sleep average is a number in the range of [0..MAX_SLEEP_AVG(=10 seconds)], and it accounts for the number of ticks a task spent voluntarily not using its timeslice and waiting for an I/O completion. A bonus is given for higher sleep averages, and a penalty is given for lower sleep averages.

A task is considered interactive when its effective priority exceeds its static priority by a certain level (which can only be due to its accumulating sleep average). High-priority tasks reach interactivity with a much smaller sleep average than lower-priority tasks.

A timeslice is the maximum time a task can run before yielding to another task. It is simply a linear function of the static priority of normal tasks projected into the range of [MIN_TIMESLICE(=10 milliseconds)..MAX_TIMESLICE(=200 milliseconds)]. The higher the static priority of a task, the longer the timeslice. The timeslice in the kernel is defined as multiples of a system tick. A tick is defined by the fixed delta (1 HZ) of two consecutive timer interrupts. In Linux 2.6, HZ=1000the timer interrupt routine is called once every millisecondat which time the currently executing task is charged a tick.

The scheduler needs to decide which task to run next and for how long. Effective priority determines which task to run next; static priority determines for how long.

The Linux scheduler in 2.6 is a multiqueue scheduler that assigns a run queue to each CPU. Each run queue performs per-CPU scheduling. It consists of two arrays of task lists: the active array and the expired array. Each array index represents a list of runnable tasks at their respective effective priority level. A task with the highest effective priority from the active array is chosen to run first. After the allotted timeslice completes, the task is moved from the active list to the expired list after replenishing the timeslice and effective priority to guarantee that all tasks get a chance to execute. When the active array is empty, expired and active arrays are swapped. This makes the scheduler O(1) because it does not have to traverse a potentially large list of tasks, as was needed in the 2.4 scheduler.

Interactive tasks relinquish the CPU before their timeslices expire because they have to wait for I/O completion. When the timeslice of an interactive task expires, the task is put back in the active array instead of in the expired array in order to give it an advantage over the processor hogs. As a result, a situation can arise when the active array continues to have tasks, which in turn may cause the starvation of tasks in the expired array. To avoid this situation, if the first expired task is waiting for more than STARVATION_LIMIT(=10seconds) times the number of tasks in the run queue, the interactive task is simply put in the expired array.

The 2.6 scheduler contains support for different CPU topologies like SMT and NUMA through sched domains. The scheduling policy is set in the sched domain data structure by the architecture depending on the system topology. Those scheduling policies are used while scheduling tasks to the CPUs.

Sched domains are created hierarchically as it appears on the system topology.

Although the 2.6 scheduler can work on different CPU configurations, some of the configurations depend on load balancing to assist. Load balancing keeps the system load, especially for multiprocessers, balanced.