Section 3.6. Keeping Track of Processes: Basic Scheduler Construction | The Linux Kernel Primer. A Top-Down Approach for x86 and PowerPC Architectures

3.6. Keeping Track of Processes: Basic Scheduler Construction

Until this point, we kept the concepts of the states and the transitions process-centric. We have not spoken about how the transition is managed, nor have we spoken about the kernel infrastructure, which performs the running and stopping of processes. The scheduler handles all these details. Having finished the exploration of the process lifecycle, we now introduce the basics of the scheduler and how it interacts with the do_fork() function during process creation.

3.6.1. Basic Structure

The scheduler operates on a structure called a run queue. There is one run queue per CPU on the system. The core data structures within a run queue are two priority-ordered arrays. One of these contains active tasks and the other contains expired tasks. In general, an active task runs for a set amount of time, the length of its timeslice or quantum, and is then inserted into the expired array to wait for more CPU time. When the active array is empty, the scheduler swaps the two arrays by exchanging the active and expired pointers. The scheduler then begins executing tasks on the new active array.

Figure 3.12 illustrates the priority arrays within the run queue. The definition of the priority array structure is as follows:

 ----------------------------------------------------------------------- kernel/sched.c 192  struct prio_array { 193   int nr_active; 194   unsigned long bitmap[BITMAP_SIZE]; 195   struct list_head queue[MAX_PRIO]; 196  }; -----------------------------------------------------------------------

Figure 3.12. Priority Arrays in a Run Queue

The fields of the prio_array struct are as follows:

nr_active. A counter that keeps track of the number of tasks held in the priority array.
bitmap. This keeps track of the priorities within the array. The actual length of bitmap depends on the size of unsigned longs on the system. It will always be enough to store MAX_PRIO bits, but it could be longer.
queue. An array that stores lists of tasks. Each list holds tasks of a certain priority. Thus, queue[0] holds a list of all tasks of priority 0, queue[1] holds a list of all tasks of priority 1, and so on.

With this basic understanding of how a run queue is organized, we can now embark on following a task through the scheduler on a single CPU system.

3.6.2. Waking Up from Waiting or Activation

Recall that when a process calls fork(), a new process is made. As previously mentioned, the process calling fork() is called the parent, and the new process is called the child. The newly created process needs to be scheduled for access to the CPU. This occurs via the do_fork() function.

Two important lines deal with the scheduler in do_fork() related to waking up processes. copy_process(), called on line 1184 of linux/kernel/fork.c, calls the function sched_fork(), which initializes the process for an impending insertion into the scheduler's run queue. wake_up_forked_process(), called on line 1222 of linux/kernel/fork.c, takes the initialized process and inserts it into the run queue. Initialization and insertion have been separated to allow for the new process to be killed, or otherwise terminated, before being scheduled. The process will only be scheduled if it is created, initialized successfully, and has no pending signals.

3.6.2.1. sched_fork(): Scheduler Initialization for Newly Forked Process

The sched_fork()function performs the infrastructure setup the scheduler requires for a newly forked process:

 ----------------------------------------------------------------------- kernel/sched.c 719 void sched_fork(task_t *p) 720 { 721   /*   722   * We mark the process as running here, but have not actually 723   * inserted it onto the runqueue yet. This guarantees that 724   * nobody will actually run it, and a signal or other external 725   * event cannot wake it up and insert it on the runqueue either. 726   */ 727   p->state = TASK_RUNNING; 728   INIT_LIST_HEAD(&p->run_list); 729   p->array = NULL; 730   spin_lock_init(&p->switch_lock); -----------------------------------------------------------------------

Line 727

The process is marked as running by setting the state attribute in the task structure to TASK_RUNNING to ensure that no event can insert it on the run queue and run the process before do_fork() and copy_process() have verified that the process was created properly. When that verification passes, do_fork() adds it to the run queue via wake_up_forked_process().

Line 728730

The process' run_list field is initialized. When the process is activated, its run_list field is linked into the queue structure of a priority array in the run queue. The process' array field is set to NULL to represent that it is not part of either priority array on a run queue. The next block of sched_fork(), lines 731 to 739, deals with kernel preemption. (Refer to Chapter 7 for more information on preemption.)

 ----------------------------------------------------------------------- kernel/sched.c 740   /* 741   * Share the timeslice between parent and child, thus the 742   * total amount of pending timeslices in the system doesn't change, 743   * resulting in more scheduling fairness. 744   */ 745   local_irq_disable(); 746   p->time_slice = (current->time_slice + 1) >> 1; 747   /* 748   * The remainder of the first timeslice might be recovered by 749   * the parent if the child exits early enough. 750   */ 751   p->first_time_slice = 1; 752   current->time_slice >>= 1; 753   p->timestamp = sched_clock(); 754   if (!current->time_slice) { 755     /* 756     * This case is rare, it happens when the parent has only 757     * a single jiffy left from its timeslice. Taking the 758     * runqueue lock is not a problem. 759     */ 760     current->time_slice = 1; 761     preempt_disable(); 762     scheduler_tick(0, 0); 763     local_irq_enable(); 764     preempt_enable(); 765   } else 766     local_irq_enable(); 767 } -----------------------------------------------------------------------

Lines 740753

After disabling local interrupts, we divide the parent's timeslice between the parent and the child using the shift operator. The new process' first timeslice is set to 1 because it hasn't been run yet and its timestamp is initialized to the current time in nanosec units.

Lines 754767

If the parent's timeslice is 1, the division results in the parent having 0 time left to run. Because the parent was the current process on the scheduler, we need the scheduler to choose a new process. This is done by calling scheduler_tick() (on line 762). Preemption is disabled to ensure that the scheduler chooses a new current process without being interrupted. Once all this is done, we enable preemption and restore local interrupts.

At this point, the newly created process has had its scheduler-specific variables initialized and has been given an initial timeslice of half the remaining timeslice of its parent. By forcing a process to sacrifice a portion of the CPU time it's been allocated and giving that time to its child, the kernel prevents processes from seizing large chunks of processor time. If processes were given a set amount of time, a malicious process could spawn many children and quickly become a CPU hog.

After a process has been successfully initialized, and that initialization verified, do_fork() calls wake_up_forked_process():

 ----------------------------------------------------------------------- kernel/sched.c 922 /* 923 * wake_up_forked_process - wake up a freshly forked process. 924 * 925 * This function will do some initial scheduler statistics housekeeping 926 * that must be done for every newly created process. 927 */ 928 void fastcall wake_up_forked_process(task_t * p) 929 { 930   unsigned long flags; 931   runqueue_t *rq = task_rq_lock(current, &flags); 932 933   BUG_ON(p->state != TASK_RUNNING); 934 935   /* 936   * We decrease the sleep average of forking parents 937   * and children as well, to keep max-interactive tasks 938   * from forking tasks that are max-interactive. 939   */ 940   current->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(current) * 941     PARENT_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS); 942 943   p->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(p) * 944     CHILD_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS); 945 946   p->interactive_credit = 0; 947 948   p->prio = effective_prio(p); 949   set_task_cpu(p, smp_processor_id()); 950 951   if (unlikely(!current->array)) 952     __activate_task(p, rq); 953   else { 954     p->prio = current->prio; 955     list_add_tail(&p->run_list, &current->run_list); 956     p->array = current->array; 957     p->array->nr_active++; 958     rq->nr_running++; 959   } 960   task_rq_unlock(rq, &flags); 961  } -----------------------------------------------------------------------

Lines 930934

The first thing that the scheduler does is lock the run queue structure. Any modifications to the run queue must be made with the lock held. We also throw a bug notice if the process isn't marked as TASK_RUNNING, which it should be thanks to the initialization in sched_fork() (see Line 727 in kernel/sched.c shown previously).

Lines 940947

The scheduler calculates the sleep average of the parent and child processes. The sleep average is the value of how much time a process spends sleeping compared to how much time it spends running. It is incremented by the amount of time the process slept, and it is decremented on each timer tick while it's running. An interactive, or I/O bound, process spends most of its time waiting for input and normally has a high sleep average. A non-interactive, or CPU-bound, process spends most of its time using the CPU instead of waiting for I/O and has a low sleep average. Because users want to see results of their input, like keyboard strokes or mouse movements, interactive processes are given more scheduling advantages than non-interactive processes. Specifically, the scheduler reinserts an interactive process into the active priority array after its timeslice expires. To prevent an interactive process from creating a non-interactive child process and thereby seizing a disproportionate share of the CPU, these formulas are used to lower the parent and child sleep averages. If the newly forked process is interactive, it soon sleeps enough to regain any scheduling advantages it might have lost.

Line 948

The function effective_prio() modifies the process' static priority. It returns a priority between 100 and 139 (MAX_RT_PRIO to MAX__PRIO-1). The process' static priority can be modified by up to 5 in either direction based on its previous CPU usage and time spent sleeping, but it always remains in this range. From the command line, we talk about the nice value of a process, which can range from +19 to -20 (lowest to highest priority). A nice priority of 0 corresponds to a static priority of 120.

Line 749

The process has its CPU attribute set to the current CPU.

Lines 951960

The overview of this code block is that the new process, or child, copies the scheduling information from its parent, which is current, and then inserts itself into the run queue in the appropriate place. We have finished our modifications of the run queue, so we unlock it. The following paragraph and Figure 3.13 discuss this process in more detail.

Figure 3.13. Run Queue Insertion

The pointer array points to a priority array in the run queue. If the current process isn't pointing to a priority array, it means that the current process has finished or is asleep. In that case, the current process' runlist field is not in the queue of the run queue's priority array, which means that the list_add_tail() operation (on line 955) would fail. Instead, we insert the newly created process using __activate_task(), which adds the new process to the queue without referring to its parent.

In the normal case, when the current process is waiting for CPU time on a run queue, the process is added to the queue residing at slot p->prio in the priority array. The array that the process was added to has its process counter, nr_active, incremented and the run queue has its process counter, nr_running, incremented. Finally, we unlock the run queue lock.

The case where the current process doesn't point to a priority array on the run queue is useful in seeing how the scheduler manages the run queue and priority array attributes.

 ----------------------------------------------------------------------- kernel/sched.c 366 static inline void __activate_task(task_t *p, runqueue_t *rq) 367 { 368   enqueue_task(p, rq->active); 369   rq->nr_running++; 370 } -----------------------------------------------------------------------

__activate_task() places the given process p on to the active priority array on the run queue rq and increments the run queue's nr_running field, which is the counter for total number of processes that are on the run queue.

 ----------------------------------------------------------------------- kernel/sched.c 311 static void enqueue_task(struct task_struct *p, prio_array_t *array) 312 { 313   list_add_tail(&p->run_list, array->queue + p->prio); 314   __set_bit(p->prio, array->bitmap); 315   array->nr_active++; 316   p->array = array; 317 } -----------------------------------------------------------------------

Lines 311312

enqueue_task() takes a process p and places it on priority array array, while initializing aspects of the priority array.

Line 313

The process' run_list is added to the tail of the queue located at p->prio in the priority array.

Line 314

The priority array's bitmap at priority p->prio is set so when the scheduler runs, it can see that there is a process to run at priority p->prio.

Line 315

The priority array's process counter is incremented to reflect the addition of the new process.

Line 316

The process' array pointer is set to the priority array to which it was just added.

To recap, the act of adding a newly forked process is fairly straightforward, even though the code can be confusing because of similar names throughout the scheduler. A process is placed at the end of a list in a run queue's priority array at the slot specified by the process' priority. The process then records the location of the priority array and the list it's located in within its structure.