7.4. Scheduling

A timesharing system provides the illusion of multiple processes running concurrently by interleaving their execution, context switching from one to another based on various conditions. The set of rules based on which the order of execution of threads is determined is called the scheduling policy. A system component called the scheduler implements the policy through data structures and algorithms. The implementation allows the scheduler to apply the policy while selecting threads for running from among those that are runnable. Execution concurrency and parallelism are important goals of schedulers, especially as multiprocessor systems become commonplace. It is also common for a modern operating system to support multiple scheduling policies, allowing different types of workloads to be treated differently. In its typical operation, the Mac OS X scheduler gives the processor to each thread for a brief period of time, after which it considers switching to another thread. The amount of time a scheduled thread can run before being preempted is called the thread's timeslicing quantum, or simply quantum. Once a thread's quantum expires, it can be preempted because another thread of equal or higher priority wants to run. Moreover, a running thread can be preempted regardless of its quantum if a higher-priority thread becomes runnable.

We will first look at how the Mac OS X scheduling infrastructure is initialized. Then we will discuss the scheduler's operation.

7.4.1. Scheduling Infrastructure Initialization

We saw several aspects of processor initialization during our discussion of kernel startup in Chapter 5. Figure 7-38 shows selected initializations related to scheduling. When ppc_init() starts executing on the master processor, none of the processor set structures, processor structures, and other scheduler structures have been initialized. The master processor's processor structure is initialized by processor_init() [osfmk/kern/processor.c], which sets up the processor's local run queue, sets the processor's state to PROCESSOR_OFF_LINE, marks it as belonging to no processor set, and sets other fields of the structure to their initial values.

Figure 7-38. Scheduling-related initializations during system startup


7.4.1.1. Timeslicing Quantum

As shown in Figure 7-38, processor_init() calls timer_call_setup() [osfmk/kern/timer_call.c] to arrange for the quantum expiration function, thread_quantum_expire() [osfmk/kern/priority.c], to be called. thread_quantum_expire() recalculates the quantum and priority for a thread. Note that timer_call_setup() only initializes a call entry structure specifying which function is to be called and with what parameters. This call entry will be placed on each processor's timer call queue. (The kernel maintains per-processor timer call queues.) Until the real-time clock subsystem is configured, these queues are not serviced.

// osfmk/kern/processor.c

void
processor_init(register processor_t p, int slot_num)
{
    ...
    timer_call_setup(&p->quantum_timer, thread_quantum_expire, p);
    ...
}


ppc_init() finally calls kernel_bootstrap() [osfmk/kern/startup.c] to start the higher-level boot process. One of the latter's first operations is scheduler initialization by calling sched_init() [osfmk/kern/sched_prim.c], which first calculates the standard timeslicing quantum. The built-in default preemption rate, that is, the frequency at which the kernel will preempt threads, is 100 Hz. A preemption rate of 100 Hz yields a timeslicing quantum of 0.01 s (10 ms). The preempt boot argument can be used to specify a custom value of the default preemption rate to the kernel. The kern.clockrate sysctl variable contains the values of the preemption rate and the timeslicing quantum (as microseconds).

$ sysctl kern.clockrate
kern.clockrate: hz = 100, tick = 10000, profhz = 100, stathz = 100


The tick value represents the number of microseconds in a scheduler tick. The hz value can be seen as the frequency of a hardware-independent system clock.
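These values can also be read programmatically. The following is a minimal sketch, assuming the standard BSD struct clockinfo layout declared in <sys/time.h>:

// clockrate.c

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/sysctl.h>
#include <sys/time.h> // struct clockinfo

int
main(void)
{
    struct clockinfo ci;
    size_t len = sizeof(ci);

    if (sysctlbyname("kern.clockrate", &ci, &len, NULL, 0) == -1) {
        perror("sysctlbyname");
        exit(1);
    }

    // tick is the number of microseconds per scheduler tick;
    // hz is the number of ticks per second
    printf("hz = %d, tick = %d us, quantum = %d ms\n",
           ci.hz, ci.tick, ci.tick / 1000);

    exit(0);
}

$ gcc -Wall -o clockrate clockrate.c
$ ./clockrate
hz = 100, tick = 10000 us, quantum = 10 ms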


sched_init() then initializes the global wait queues used by threads for waiting on events, initializes the default processor set by calling pset_init() [osfmk/kern/processor.c], and sets the sched_tick global variable to 0.

7.4.1.2. Timing and Clocks

Scheduling is a clock-based activity in that several of its critical functions are driven by a periodic clock or timer interrupts. Therefore, the clock subsystem must be configured for scheduling to be started. clock_config() [osfmk/kern/clock.c] configures the clock subsystem.

Timer facilities on the PowerPC include the Timebase (TB) Register and the Decrementer Register (DEC). As we saw in Chapter 3, the TB Register is a 64-bit counter driven by an implementation-dependent frequency. On certain processor models, the frequency may be a function of the processor's clock frequency, whereas on some other models the TB Register is updated in accordance with an independent clock. In fact, the frequency is not even required to be constant, although a frequency change must be explicitly managed by the operating system. In any case, each increment of the TB Register adds 1 to its low-order bit. The TB Register is a volatile resource and must be initialized by the kernel during boot.

The DEC is a 32-bit counter that is updated at the same frequency as the TB Register but is decremented by 1 on every update.

For a typical Timebase frequency, it will take thousands of years for the TB Register to attain its maximum value, but the DEC will pass zero in a few hundred seconds for the same frequency.
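We can make this claim concrete with back-of-the-envelope arithmetic. The following sketch assumes a hypothetical 10-MHz Timebase frequency; the actual frequency is implementation-dependent.

// tb_wrap.c

#include <stdio.h>

int
main(void)
{
    double freq = 10000000.0; // 10 MHz (hypothetical)

    // 2^64 - 1 increments for the 64-bit TB Register
    double tb_seconds  = 18446744073709551615.0 / freq;
    // 2^31 decrements before the 32-bit DEC's sign bit changes
    double dec_seconds = 2147483648.0 / freq;

    printf("TB wrap: %.0f years\n", tb_seconds / (365.25 * 86400));
    printf("DEC sign change: %.0f seconds\n", dec_seconds);
    return 0;
}

$ gcc -Wall -o tb_wrap tb_wrap.c
$ ./tb_wrap
TB wrap: 58454 years
DEC sign change: 215 seconds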


When the DEC's value becomes negative, that is, the sign bit of the 32-bit signed integer represented by the DEC's contents changes from 0 to 1, a decrementer interrupt is caused. As we saw in Chapter 5, the PowerPC exception vector entry for this interrupt resides at address 0x900. The low-level handler in osfmk/ppc/lowmem_vectors.s sets the trap code as T_DECREMENTER and passes up the exception's processing to ihandler() [osfmk/ppc/hw_exception.s], the higher-level interrupt handler. ihandler() in turn calls interrupt() [osfmk/ppc/interrupt.c].

// osfmk/ppc/interrupt.c

struct savearea *
interrupt(int type, struct savearea *ssp, ...)
{
    ...
    switch (type) {

    case T_DECREMENTER:
        ...
        rtclock_intr(0, ssp, 0);
        break;
    }
    ...
}


rtclock_intr() [osfmk/ppc/rtclock.c] is the real-time clock device interrupt handler routine. The real-time clock subsystem maintains per-processor data structures such as the following:

  • A real-time clock timer structure with its own, configurable deadline

  • A deadline for the real-time clock tick that is driven at a frequency of HZ, which is defined to be 100, resulting in a clock tick every 10 ms

Figure 7-39 shows an overview of real-time clock interrupt processing.

Figure 7-39. Real-time clock interrupt processing

// osfmk/ppc/exception.h

struct per_proc_info {
    ...
    uint64_t rtclock_tick_deadline;
    struct rtclock_timer {
        uint64_t deadline;
        uint32_t is_set:1,
                 has_expired:1,
                 :0;
    } rtclock_timer;
    ...
};

// osfmk/ppc/rtclock.c

#define NSEC_PER_HZ (NSEC_PER_SEC / HZ)
static uint32_t rtclock_tick_interval;
...
static clock_timer_func_t rtclock_timer_expire;
...
#define DECREMENTER_MAX 0x7FFFFFFFUL
#define DECREMENTER_MIN 0xAUL
...

void
clock_set_timer_deadline(uint64_t deadline)
{
    // set deadline for the current processor's rtclock_timer
    ...
}

void
clock_set_timer_func(clock_timer_func_t func)
{
    spl_t s;

    LOCK_RTC(s);

    // global timer expiration handler
    if (rtclock_timer_expire == NULL)
        rtclock_timer_expire = func;

    UNLOCK_RTC(s);
}

...

// real-time clock device interrupt
void
rtclock_intr(__unused int device, struct savearea *ssp, __unused spl_t old_spl)
{
    uint64_t              abstime;
    int                   decr1, decr2;
    struct rtclock_timer *mytimer;
    struct per_proc_info *pp;

    decr1 = decr2 = DECREMENTER_MAX;

    pp = getPerProc();

    abstime = mach_absolute_time();
    if (pp->rtclock_tick_deadline <= abstime) {
        // set pp->rtclock_tick_deadline to "now" (that is, abstime) plus
        // rtclock_tick_interval
        // call the hertz_tick() function
    }

    mytimer = &pp->rtclock_timer;

    abstime = mach_absolute_time();
    if (mytimer->is_set && mytimer->deadline <= abstime) {
        mytimer->has_expired = TRUE;
        mytimer->is_set = FALSE;
        (*rtclock_timer_expire)(abstime);
        mytimer->has_expired = FALSE;
    }

    // Look at the deadlines in pp->rtclock_tick_deadline and mytimer->deadline.
    // Choose the earlier one. Moreover, if a still earlier deadline is
    // specified via the special variable rtclock_decrementer_min, choose that
    // instead. None of these deadlines can be greater than DECREMENTER_MAX.
    // Now that we have a deadline, load the Decrementer Register with it.
    ...
    treqs(decr1); // sets decrementer using mtdec()
    ...
}

So far, we have seen the following functionality provided by the real-time clock subsystem.

  • The hertz_tick() function is called via the rtclock_intr() function HZ times a second.

  • The function pointed to by the rtclock_timer_expire function pointer is called when the deadline in the processor's timer structure has passed.

A global list of clock devices is maintained by the kernel, with each entry being a clock object structure containing that particular clock's control port, service port, and a machine-dependent operations list. clock_config() calls the "config" function of each clock device on the list. Subsequently, clock_init() [osfmk/kern/clock.c] is called to initialize the clock devices: it calls the "init" function of each clock device. Note that unlike clock_config(), which is called only once during bootstrapping, clock_init() is called on a processor each time the processor is started. Consider the configuration and initialization of the system clock (Figure 7-40), whose "config" and "init" functions are sysclk_config() and sysclk_init(), respectively.

Figure 7-40. System clock configuration

// osfmk/ppc/rtclock.c

static void
timebase_callback(...)
{
    ...
    // Initialize commpage timestamp

    // Set rtclock_tick_interval, which is the global variable used by
    // rtclock_intr() to arrange for the next "tick" to occur by loading
    // the decrementer with the next deadline
    //
    nanoseconds_to_absolutetime(NSEC_PER_HZ, &abstime);
    rtclock_tick_interval = abstime;
    ...
    // This will call sched_timebase_init()
    clock_timebase_init();
}

...

int
sysclk_config(void)
{
    ...
    // The Platform Expert knows the implementation-dependent conversion factor
    // between absolute-time (Timebase-driven) and clock-time values.
    //
    // The following registration will cause the provided function --
    // timebase_callback() -- to be invoked with the Timebase frequency values
    // as parameters.
    //
    PE_register_timebase_callback(timebase_callback);
    ...
}

...

int
sysclk_init(void)
{
    ...
    // set decrementer and our next tick due
    ...
}

clock_config() also calls timer_call_initialize() [osfmk/kern/timer_call.c] to initialize the timer interrupt callout mechanism, which is used by the thread-based callout mechanism.

// osfmk/kern/timer_call.c

void
timer_call_initialize(void)
{
    ...
    clock_set_timer_func((clock_timer_func_t)timer_call_interrupt);
    ...
}


As shown in Figure 7-39, clock_set_timer_func() [osfmk/ppc/rtclock.c] merely sets its parameter (the timer_call_interrupt function pointer in this case) as the value of the rtclock_timer_expire global function pointer. Every time timer_call_interrupt() is called, it will service the timer call queue for the current processor. This way, the scheduler can arrange for thread_quantum_expire() to be invoked on a processor.

clock_timebase_init() [osfmk/kern/clock.c] is a machine-independent function that calls sched_timebase_init() [osfmk/kern/sched_prim.c] to set up various time-related values used by the scheduler, for example:

  • std_quantum (10,000 μs), the standard timeslicing quantum

  • min_std_quantum (250 μs), the smallest remaining quantum

  • min_rt_quantum (50 μs), the smallest real-time computation

  • max_rt_quantum (50 ms), the largest real-time computation

  • sched_tick_interval (1000 >> SCHED_TICK_SHIFT ms)

sched_timebase_init() uses clock_interval_to_absolutetime_interval() [osfmk/ppc/rtclock.c] to convert conventional (clock) intervals to machine-specific absolute-time intervals. SCHED_TICK_SHIFT is defined to be 3 in osfmk/kern/sched.h, yielding a value of 125 ms for sched_tick_interval.

7.4.1.3. Converting between Absolute- and Clock-Time Intervals

The kernel often needs to convert between absolute- and clock-time intervals. Absolute time is based on the machine-dependent TB Register. The Mach trap mach_absolute_time(), which is available in the commpage, retrieves the current value of the TB Register. It is the highest-resolution time-related function on Mac OS X. To convert an absolute-time interval to a conventional clock interval (such as a value expressed in seconds), you need the implementation-dependent conversion factor, which can be retrieved by mach_timebase_info(). The conversion factor consists of a numerator and a denominator. The resultant ratio can be multiplied with an absolute-time interval to yield an equivalent clock interval in nanoseconds. Figure 7-41 shows an example of converting between the two time intervals.

Figure 7-41. Converting between absolute- and clock-time intervals

// timebase_demo.c

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <mach/mach.h>
#include <mach/mach_time.h>

#define DEFAULT_SLEEP_TIME 1
#define MAXIMUM_SLEEP_TIME 60

int
main(int argc, char **argv)
{
    kern_return_t kr;
    u_int64_t     t1, t2, diff;
    double        abs2clock;
    int           sleeptime = DEFAULT_SLEEP_TIME;
    mach_timebase_info_data_t info;

    kr = mach_timebase_info(&info);
    if (kr != KERN_SUCCESS) {
        mach_error("mach_timebase_info:", kr);
        exit(kr);
    }

    if (argc == 2) {
        sleeptime = atoi(argv[1]);
        if ((sleeptime < 0) || (sleeptime > MAXIMUM_SLEEP_TIME))
            sleeptime = DEFAULT_SLEEP_TIME;
    }

    t1 = mach_absolute_time();
    sleep(sleeptime);
    t2 = mach_absolute_time();

    diff = t2 - t1;

    printf("slept for %d seconds of clock time\n", sleeptime);
    printf("TB increments = %llu increments\n", diff);
    printf("absolute-to-clock conversion factor = (%u/%u) ns/increment\n",
           info.numer, info.denom);
    printf("sleeping time according to TB\n");

    abs2clock = (double)info.numer/(double)info.denom;
    abs2clock *= (double)diff;

    printf("\t= %llu increments x (%u/%u) ns/increment\n\t= %f ns\n\t= %f s\n",
           diff, info.numer, info.denom,
           abs2clock, abs2clock/(double)1000000000);

    exit(0);
}

$ gcc -Wall -o timebase_demo timebase_demo.c
$ ./timebase_demo 5
slept for 5 seconds of clock time
TB increments = 166651702 increments
absolute-to-clock conversion factor = (1000000000/33330173) ns/increment
sleeping time according to TB
        = 166651702 increments x (1000000000/33330173) ns/increment
        = 5000025112.380905 ns
        = 5.000025 s

7.4.1.4. Starting the Scheduler

The first thread to execute on the boot processor, kernel_bootstrap_thread() [osfmk/kern/startup.c], is started via load_context() [osfmk/kern/startup.c]. Besides setting up the machine-specific context of the thread, load_context() initializes certain aspects of the processor. In particular, it calls processor_up() [osfmk/kern/machine.c] to add the processor to the default processor set.

kernel_bootstrap_thread() creates an idle thread for the processor, calls sched_startup() [osfmk/kern/sched_prim.c] to initiate the scheduler's periodic activities, and calls thread_bind() [osfmk/kern/sched_prim.c] to bind the current thread to the boot processor. The latter step is required so that execution remains bound to the boot processor and does not move to any other processors as they come online. Figure 7-42 shows an overview of scheduler startup.

Figure 7-42. Scheduler startup

// osfmk/kern/sched_prim.c

void
sched_startup(void)
{
    ...
    result = kernel_thread_start_priority(
                 (thread_continue_t)sched_tick_thread,
                 NULL,
                 MAXPRI_KERNEL,
                 &thread);
    ...
    thread_call_initialize();
}

// perform periodic bookkeeping functions
void
sched_tick_continue(void)
{
    ...
}

void
sched_tick_thread(void)
{
    ...
    sched_tick_deadline = mach_absolute_time();
    sched_tick_continue();
    /* NOTREACHED */
}

sched_startup() also initializes the thread-based callout mechanism that allows functions to be recorded by the kernel for invocation later. For example, setitimer(2), which allows real, virtual, and profiling timers to be set for a process, is implemented using a thread callout.

At this point, we have the following primary scheduling-related periodic activities occurring in the kernel.

  • rtclock_intr() [osfmk/ppc/rtclock.c] is called when there is a decrementer exception. This typically occurs HZ times a second, with the default value of HZ being 100. rtclock_intr() reloads the decrementer (register) with the next deadline value.

  • hertz_tick() [osfmk/kern/mach_clock.c] is called by rtclock_intr().

  • timer_call_interrupt() [osfmk/kern/timer_call.c] is called by rtclock_intr() if the current processor's real-time clock timer's deadline has expired. The rtclock_timer_expire function pointer points to timer_call_interrupt(), as set by clock_set_timer_func().

  • sched_tick_continue() [osfmk/kern/sched_prim.c] runs on every scheduler tick, which occurs once every 125 ms by default.

7.4.1.5. Retrieving the Value of the Scheduler Tick

Let us read the value of the sched_tick variable from kernel memory to examine the rate at which it is incremented. We can determine the address of the variable in kernel memory by running the nm command on the kernel executable. Thereafter, we will use the dd command to read its value from /dev/kmem, sleep for an integral number of seconds, and read its value again. Figure 7-43 shows a shell script that performs these steps. As seen in the output, the variable's value is incremented by 80 in 10 seconds, which is as we expected, since it should increment by 1 every 125 ms (or by 8 every second).

Figure 7-43. Sampling the value of the scheduler tick

#!/bin/sh
# sched_tick.sh

SCHED_TICK_ADDR="0x`nm /mach_kernel | grep -w _sched_tick | awk '{print $1}'`"

if [ "$SCHED_TICK_ADDR" == "0x" ]
then
    echo "address of _sched_tick not found in /mach_kernel"
    exit 1
fi

dd if=/dev/kmem bs=1 count=4 iseek=$SCHED_TICK_ADDR of=/dev/stdout | hexdump -d
sleep 10
dd if=/dev/kmem bs=1 count=4 iseek=$SCHED_TICK_ADDR of=/dev/stdout | hexdump -d

exit 0

$ sudo ./sched_tick.sh 2>/dev/null
0000000   00035   09878
0000004
0000000   00035   09958
0000004

7.4.1.6. Some Periodic Kernel Activities

We have already seen what rtclock_intr() does. Let us briefly look at the operations of hertz_tick(), timer_call_interrupt(), and sched_tick_continue().

hertz_tick() [osfmk/kern/mach_clock.c] performs certain operations on all processors, such as gathering statistics, tracking thread states, and incrementing user-mode and kernel-mode thread timers. Examples of statistics gathered include the total number of clock ticks and profiling information (if profiling is enabled). On the master processor, hertz_tick() additionally calls bsd_hardclock().

bsd_hardclock() [bsd/kern/kern_clock.c] performs several operations if there is a valid, current BSD process and the process is not exiting. If the processor was in user mode, bsd_hardclock() checks whether the process has a virtual interval timer, that is, an interval timer of type ITIMER_VIRTUAL that decrements in process-virtual time (only when the process is executing). Such a timer can be set by setitimer(2). If such a timer exists and has expired, bsd_hardclock() arranges for a SIGVTALRM signal to be delivered to the process.

As we saw in Chapter 6, the USER_MODE() macro, defined in osfmk/ppc/proc_reg.h, is used to examine the saved SRR1, which holds the old contents of the MSR. The PR (privileged) bit of the MSR distinguishes between kernel and user mode.


bsd_hardclock() performs other operations regardless of whether the processor was in user mode, as long as the processor was not idle. It charges the currently scheduled process with resource utilization for a tick. It then checks whether the process has exceeded its CPU time limit (as specified by the RLIMIT_CPU resource limit), sending it a SIGXCPU signal if it has. Next, it checks whether the process has a profiling timer, that is, an interval timer of type ITIMER_PROF. Such a timer decrements both in process-virtual time and when the kernel is running on behalf of the process. It can also be set by setitimer(2). If such a timer exists and has expired, bsd_hardclock() arranges for a SIGPROF signal to be delivered to the process.
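The virtual interval timer is easy to observe from user space. The following minimal sketch (a hypothetical example, not from the kernel sources) sets a one-shot ITIMER_VIRTUAL timer via setitimer(2) and then spins; because the timer decrements only in process-virtual time, the thread must consume processor time for SIGVTALRM to arrive.

// vtimer.c

#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <sys/time.h>

static volatile sig_atomic_t fired = 0;

void
sigvtalrm_handler(int signo)
{
    fired = 1;
}

int
main(void)
{
    // one-shot timer: it_interval = 0, it_value = 1 s of virtual time
    struct itimerval itv = { { 0, 0 }, { 1, 0 } };

    signal(SIGVTALRM, sigvtalrm_handler);

    if (setitimer(ITIMER_VIRTUAL, &itv, NULL) == -1) {
        perror("setitimer");
        exit(1);
    }

    while (!fired) // spin: the timer decrements only while we execute
        ;

    printf("SIGVTALRM after 1 second of process-virtual time\n");
    exit(0);
}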

timer_call_interrupt() [osfmk/kern/timer_call.c] traverses the timer call queue for the current processor and calls handlers for those timers whose deadlines have expired (Figure 7-44).

Figure 7-44. Timer call processing

// osfmk/kern/timer_call.c

#define qe(x) ((queue_entry_t)(x))
#define TC(x) ((timer_call_t)(x))

static void
timer_call_interrupt(uint64_t timestamp)
{
    timer_call_t call;
    queue_t      queue;

    simple_lock(&timer_call_lock);

    queue = &PROCESSOR_DATA(current_processor(), timer_call_queue);

    call = TC(queue_first(queue));

    while (!queue_end(queue, qe(call))) {
        if (call->deadline <= timestamp) {
            ...
            // invoke call->func(), passing it call->param0 and call->param1
            ...
        } else
            break;
        call = TC(queue_first(queue));
    }
    ...
}

sched_tick_continue() [osfmk/kern/sched_prim.c] performs periodic bookkeeping functions for the scheduler. As Figure 7-45 shows, it increments the sched_tick global variable by 1, calls compute_averages() [osfmk/kern/sched_average.c] to compute the load average and the Mach factor, and calls thread_update_scan() [osfmk/kern/sched_prim.c] to scan the run queues of all processor sets and processors to possibly update thread priorities.

Figure 7-45. The scheduler's bookkeeping function

// osfmk/kern/sched_prim.c

void
sched_tick_continue(void)
{
    uint64_t abstime = mach_absolute_time();

    sched_tick++;

    // compute various averages
    compute_averages();

    // scan the run queues to account for timesharing threads that may need
    // to be updated -- the scanner runs in two passes
    thread_update_scan();

    // compute next deadline for our periodic event
    clock_deadline_for_periodic_event(sched_tick_interval, abstime,
                                      &sched_tick_deadline);

    assert_wait_deadline((event_t)sched_tick_thread, THREAD_UNINT,
                         sched_tick_deadline);
    thread_block((thread_continue_t)sched_tick_continue);
    // NOTREACHED
}

7.4.2. Scheduler Operation

Mac OS X is primarily a timesharing system in that threads are subject to timesharing scheduling unless explicitly designated otherwise. Typical timesharing scheduling aims to provide, without guarantees, each competing thread a fair share of processor time, where fairness implies that the threads receive roughly equal amounts of processor resources over a reasonably long time.

Mapping the Scheduler

Figure 7-46 shows a call graph consisting of several key functions that are involved in the execution and scheduling of threads. Given the density of the graph, we will not discuss it in this chapter. However, it can be used as an accessory to further study of the Mac OS X scheduler.


Figure 7-46. A nonexhaustive call graph of functions involved in thread execution and scheduling


The following general points are noteworthy about scheduling on Mac OS X.

  • The scheduler schedules only Mach threads and no other higher-level entities.

  • The scheduler does not use the knowledge that two or more threads may belong to the same task to select between them. In theory, such knowledge could be used to optimize intratask context switching.

  • Mach uses the same scheduler for both multiprocessors and uniprocessors. In fact, Mac OS X uses the same kernel, the multiprocessor version, regardless of the number of processors on a machine.[16]

    [16] The multiprocessor kernel runs with some overhead on uniprocessor systems.

  • The Mac OS X kernel supports handoff scheduling, wherein a thread can directly yield the processor to another thread without fully involving the scheduler. The kernel's message-passing mechanism can use handoff scheduling while sending a message: if a thread is waiting to receive a message, the sending thread can directly switch to the receiving thread. The receiver effectively inherits the sender's scheduling attributes, including the remainder of the sender's current quantum. Once this quantum expires, however, the effects of the "inheritance" disappear.

  • The Mac OS X scheduler supports multiple scheduling policies, including a "soft" real-time policy. However, the scheduler does not provide an interface for loading custom policies.[17]

    [17] For example, the Solaris operating system supports dynamically loadable scheduling policies.

  • Each processor has its own, dedicated idle thread that looks for other threads to execute while it runs.

7.4.2.1. Priority Ranges

The Mac OS X scheduler is priority-based. The selection of threads for running takes into account the priorities of runnable threads. Table 7-2 shows the various priority ranges in the scheduling subsystem; numerically higher values represent higher priorities. The HOST_PRIORITY_INFO flavor of the host_info() Mach routine can be used to retrieve the values of several specific priorities.
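For instance, the following minimal sketch retrieves these values with the HOST_PRIORITY_INFO flavor; the printed numbers should correspond to the levels listed in Table 7-2.

// hostpri.c

#include <stdio.h>
#include <stdlib.h>
#include <mach/mach.h>

int
main(void)
{
    kern_return_t             kr;
    host_priority_info_data_t info;
    mach_msg_type_number_t    count = HOST_PRIORITY_INFO_COUNT;

    kr = host_info(mach_host_self(), HOST_PRIORITY_INFO,
                   (host_info_t)&info, &count);
    if (kr != KERN_SUCCESS) {
        mach_error("host_info:", kr);
        exit(1);
    }

    printf("kernel = %d, system = %d, server = %d, user = %d\n",
           info.kernel_priority, info.system_priority,
           info.server_priority, info.user_priority);
    printf("depress = %d, idle = %d, minimum = %d, maximum = %d\n",
           info.depress_priority, info.idle_priority,
           info.minimum_priority, info.maximum_priority);

    exit(0);
}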

Table 7-2. Mac OS X Scheduler Priorities

Levels   Description

0-10     This range contains the lowest priorities (aged, idle) to lowered
         priorities (aged). The lowest priority (0) has several synonyms, such
         as MINPRI_USER, MINPRI, IDLEPRI (idle priority), and DEPRESSPRI
         (depress priority).

11-30    This range contains lowered priorities.

31       This is the default base priority (BASEPRI_DEFAULT) for user threads.
         host_info() returns this value as the user priority.

32-51    This range contains elevated priorities, such as those attainable
         through task_policy_set(). For example, BASEPRI_BACKGROUND (46),
         BASEPRI_FOREGROUND (47), and BASEPRI_CONTROL (48) correspond to the
         base priorities of tasks that have been designated as background,
         foreground, and control tasks, respectively.

52-63    This range also contains elevated priorities. MAXPRI_USER (63) is set
         as a new task's maximum priority when the task is created.

64-79    This range contains high priorities normally reserved for the system.
         The end points, 64 and 79, are called MINPRI_RESERVED and
         MAXPRI_RESERVED, respectively. MINPRI_RESERVED is returned as the
         server priority by host_info().

80-95    This range contains kernel-only priorities. The priorities 80, 81, 93,
         and 95 are called MINPRI_KERNEL, BASEPRI_KERNEL, BASEPRI_PREEMPT, and
         MAXPRI_KERNEL, respectively. host_info() returns MINPRI_KERNEL as the
         value of both the kernel and system priorities.

96-127   The priorities in this range are reserved for real-time threads and
         are attainable through thread_policy_set(). The priorities 96, 97, and
         127 are called BASEPRI_REALTIME, BASEPRI_RTQUEUES, and MAXPRI,
         respectively.


7.4.2.2. Run Queues

A fundamental data structure maintained by the Mach scheduler is a run queue. Each run queue structure (Figure 7-47) represents a priority queue of runnable threads and contains an array of NRQS doubly linked lists, one corresponding to each priority level. The structure's highq member is a hint that indicates the likely location of the highest-priority thread, which may be at a priority lower than the one specified by highq but will not be at a higher priority. Recall that each processor set has a run queue and each processor has a local run queue.

Figure 7-47. The run queue structure

// osfmk/kern/sched.h

#define NRQS       128         // 128 levels per run queue
#define NRQBM      (NRQS/32)   // number of words per bitmap
#define MAXPRI     (NRQS-1)    // maximum priority possible
#define MINPRI     IDLEPRI     // lowest legal priority schedulable
#define IDLEPRI    0           // idle thread priority
#define DEPRESSPRI MINPRI      // depress priority
...

struct run_queue {
    int highq;                 // highest runnable queue
    int bitmap[NRQBM];         // run queue bitmap array
    int count;                 // number of threads total
    int urgency;               // level of preemption urgency
    queue_head_t queues[NRQS]; // one for each priority
};
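The bitmap array allows the highest-priority nonempty queue to be located without walking all NRQS lists. The following user-space sketch is illustrative only; the bit numbering and helper name are assumptions, not the kernel's exact setbit()/clrbit() conventions.

// runq_hint.c

#include <stdio.h>

#define NRQS  128
#define NRQBM (NRQS / 32)

// Return the highest priority whose bit is set, or -1 if all queues are
// empty. Bit p of the bitmap is assumed set when queues[p] is nonempty.
int
highest_runnable(unsigned int bitmap[NRQBM])
{
    int word, bit;

    for (word = NRQBM - 1; word >= 0; word--) {
        if (bitmap[word] == 0) // 32 empty queues skipped in one comparison
            continue;
        for (bit = 31; bit >= 0; bit--)
            if (bitmap[word] & (1U << bit))
                return word * 32 + bit;
    }
    return -1;
}

int
main(void)
{
    unsigned int bitmap[NRQBM] = { 0 };

    bitmap[31 / 32] |= 1U << (31 % 32); // a thread at BASEPRI_DEFAULT (31)
    bitmap[80 / 32] |= 1U << (80 % 32); // a thread at MINPRI_KERNEL (80)

    printf("highest runnable priority = %d\n", highest_runnable(bitmap));
    return 0;
}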

7.4.2.3. Scheduling Information in Tasks and Threads

To balance processor usage among threads, the scheduler adjusts thread priorities to account for each thread's usage. Associated with each thread and task are several priority-related limits and measurements. Let us revisit the task and thread structures to examine some of the scheduling-related information contained within them. The relevant portions of the structures are annotated in Figure 7-48.

Figure 7-48. Important scheduling-related fields of the task and thread structures

// osfmk/kern/task.h

struct task {
    ...
    // task's role in the system
    // set to TASK_UNSPECIFIED during user task creation
    task_role_t role;

    // default base priority for threads created within this task
    // set to BASEPRI_DEFAULT during user task creation
    integer_t priority;

    // no thread in this task can have priority greater than this
    // set to MAXPRI_USER during user task creation
    integer_t max_priority;
    ...
};

// osfmk/kern/thread.h

struct thread {
    ...
    // scheduling mode bits include TH_MODE_REALTIME (time-constrained thread),
    // TH_MODE_TIMESHARE (uses standard timesharing scheduling),
    // TH_MODE_PREEMPT (can preempt kernel contexts), ...
    //
    // TH_MODE_TIMESHARE is set during user thread creation
    integer_t sched_mode;

    integer_t sched_pri;      // scheduled (current) priority

    // base priority
    // set to parent_task->priority during user thread creation
    integer_t priority;

    // maximum base priority
    // set to parent_task->max_priority during user thread creation
    integer_t max_priority;

    // copy of parent task's base priority
    // set to parent_task->priority during user thread creation
    integer_t task_priority;
    ...
    // task-relative importance
    // set to (self->priority - self->task_priority) during user thread creation
    integer_t importance;

    // parameters for time-constrained scheduling policy
    struct {
        ...
    } realtime;

    uint32_t current_quantum; // duration of current quantum
    ...
    // last scheduler tick
    // set to the global variable sched_tick during user thread creation
    natural_t sched_stamp;

    // timesharing processor usage
    // initialized to zero in the "template" thread
    natural_t sched_usage;

    // factor for converting usage to priority
    // set to the processor set's pri_shift value during user thread creation
    natural_t pri_shift;
};

As shown in Figure 7-48, each thread has a base priority. However, the thread's scheduled priority is the one that the scheduler examines while selecting threads to run.[18] The scheduled priority is computed from the base priority along with an offset derived from the thread's recent processor usage. The default base priority for timesharing user threads is 31, whereas the minimum kernel priority is 80. Consequently, kernel threads are substantially favored over standard user threads.

[18] This discussion applies only to timesharing threads. Real-time threads are treated specially by the scheduler.

7.4.2.4. Processor Usage Accounting

As a thread accumulates processor usage, its priority decreases. Since the scheduler favors higher priorities, this could lead to a situation where a thread has used so much processor time that the scheduler will assign it no further processor time owing to its greatly lowered priority. The Mach scheduler addresses this issue by aging processor usage: it exponentially "forgets" a thread's past processor usage, gradually increasing that thread's priority. However, this creates another problem: If the system is under such heavy load that most (or all) threads receive little processor time, the priorities of all such threads will increase. The resultant contention will deteriorate system response under heavy load. To counter this problem, the scheduler multiplies a thread's processor usage by a conversion factor related to system load, thereby ensuring that thread priorities do not rise because of increased system load alone. Figure 7-49 shows the calculation of a thread's timesharing priority based on its processor usage and the system's load.

Figure 7-49. Computation of the timesharing priority of a thread

// osfmk/kern/priority.c

#define do_priority_computation(thread, pri)                      \
    do {                                                          \
        (pri) = (thread)->priority /* start with base priority */ \
                - ((thread)->sched_usage >> (thread)->pri_shift); \
        if ((pri) < MINPRI_USER)                                  \
            (pri) = MINPRI_USER;                                  \
        else if ((pri) > MAXPRI_KERNEL)                           \
            (pri) = MAXPRI_KERNEL;                                \
    } while (FALSE);

We see in Figure 7-49 that the thread's processor usage (thread->sched_usage), after being lowered by a conversion factor (thread->pri_shift), is subtracted from its base priority (thread->priority) to yield the scheduled priority. Let us now see how the conversion factor is calculated and how the thread's processor usage decays over time.

update_priority() [osfmk/kern/priority.c], which is frequently called as part of the scheduler's operation, under certain conditions updates the thread's conversion factor value by setting it to that of the processor set containing the thread.


The conversion factor consists of two components: a fixed part based on the machine-dependent absolute-time unit and a dynamic part based on system load. The global variable sched_pri_shift contains the fixed part, which is computed during scheduler initialization. The dynamic part is an entry in a constant array, with the array index based on the system load. Figure 7-50 shows a user-space implementation of a function to convert clock intervals to absolute-time intervals. Using this function, we can reconstruct the computation of sched_pri_shift in user space. The program also computes the value of sched_tick_interval, which corresponds to an interval of 125 ms.

Figure 7-50. User-space computation of sched_pri_shift and sched_tick_interval

// sched_pri_shift.c

#include <stdio.h>
#include <stdlib.h>
#include <mach/mach.h>
#include <mach/mach_time.h>

// defined in osfmk/kern/sched.h
#define BASEPRI_DEFAULT  31
#define SCHED_TICK_SHIFT 3

void
clock_interval_to_absolutetime_interval(uint32_t interval,
                                        uint32_t scale_factor,
                                        uint64_t *result)
{
    uint64_t t64;
    uint32_t divisor, rtclock_sec_divisor;
    uint64_t nanosecs = (uint64_t)interval * scale_factor;
    mach_timebase_info_data_t tbinfo;

    (void)mach_timebase_info(&tbinfo);

    // see timebase_callback() [osfmk/ppc/rtclock.c]
    rtclock_sec_divisor = tbinfo.denom / (tbinfo.numer / NSEC_PER_SEC);

    *result = (t64 = nanosecs / NSEC_PER_SEC) * (divisor = rtclock_sec_divisor);
    nanosecs -= (t64 * NSEC_PER_SEC);
    *result += (nanosecs * divisor) / NSEC_PER_SEC;
}

int
main(void)
{
    uint64_t abstime;
    uint32_t sched_pri_shift;
    uint32_t sched_tick_interval;

    clock_interval_to_absolutetime_interval(USEC_PER_SEC >> SCHED_TICK_SHIFT,
                                            NSEC_PER_USEC, &abstime);
    sched_tick_interval = abstime; // lvalue is 32-bit

    abstime = (abstime * 5) / 3;
    for (sched_pri_shift = 0; abstime > BASEPRI_DEFAULT; ++sched_pri_shift)
        abstime >>= 1;

    printf("sched_tick_interval = %u\n", sched_tick_interval);
    printf("sched_pri_shift = %u\n", sched_pri_shift);

    exit(0);
}

$ gcc -Wall -o sched_pri_shift sched_pri_shift.c
$ ./sched_pri_shift
sched_tick_interval = 4166271
sched_pri_shift = 18

Figure 7-51 shows a code excerpt from the computation of the conversion factor's dynamic part.

Figure 7-51. Computation of the usage-to-priority conversion factor for timeshared priorities

// osfmk/kern/sched_prim.c

int8_t sched_load_shifts[NRQS];
...

// called during scheduler initialization
// initializes the array of load shift constants
static void
load_shift_init(void)
{
    int8_t   k, *p = sched_load_shifts;
    uint32_t i, j;

    *p++ = INT8_MIN; *p++ = 0;

    for (i = j = 2, k = 1; i < NRQS; ++k) {
        for (j <<= 1; i < j; ++i)
            *p++ = k;
    }
}

// osfmk/kern/sched_average.c

void
compute_averages(void)
{
    ...
    register int      nthreads, nshared;
    register uint32_t load_now = 0;
    ...
    if ((ncpus = pset->processor_count) > 0) {
        nthreads = pset->run_count - 1; // ignore current thread
        nshared = pset->share_count;    // so many timeshared threads
        ...
        if (nshared > nthreads)
            nshared = nthreads;         // current was timeshared!

        if (nshared > ncpus) {
            if (ncpus > 1)
                load_now = nshared / ncpus;
            else
                load_now = nshared;

            if (load_now > NRQS - 1)
                load_now = NRQS - 1;
        }

        pset->pri_shift = sched_pri_shift - sched_load_shifts[load_now];
    } else {
        ...
        pset->pri_shift = INT8_MAX; // hardcoded to 127
        ...
    }

    // compute other averages
    ...
}

The scheduler ages processor usage of threads in a distributed manner: update_priority() [osfmk/kern/priority.c], which performs the relevant calculations, is called from several places. For example, it is called when a thread's quantum expires. The function call graph in Figure 7-46 shows several invocations of update_priority(). It begins by calculating the difference (ticks) between the current scheduler tick (sched_tick), which is incremented periodically, and the thread's recorded scheduler tick (thread->sched_stamp). The latter is brought up to date by adding ticks to it. If ticks is equal to or more than SCHED_DECAY_TICKS (32), the thread's processor usage is reset to zero. Otherwise, the usage is multiplied by 5/8 for each unit of difference, that is, it is multiplied by (5/8)^ticks. There were two primary reasons behind the choice of 5/8 as the exponential decay factor: It provided scheduling behavior similar to other timesharing systems, and multiplication with it can be approximated by using only shift, addition, and subtraction operations. Consider multiplying a number by 5/8, which can be written as (4 + 1)/8, that is, (4/8 + 1/8), or (1/2 + 1/8). Multiplication with (1/2 + 1/8) can be performed with a right shift by 1, a right shift by 3, and an addition. To facilitate decay calculations, the kernel maintains a static array with SCHED_DECAY_TICKS pairs of integers: the pair at index i contains shift values to approximate (5/8)^i. If the value of ticks falls between 0 and 31, both inclusive, the pair at index ticks is used according to the following formula:

if (/* the pair's second value is positive */)
    usage = (usage >> (first value)) + (usage >> abs(second value));
else
    usage = (usage >> (first value)) - (usage >> abs(second value));


The program in Figure 7-52 computes (5/8)^n, where 0 <= n < 32, using the shift values in the kernel's decay shift array and using functions from the math library. It also calculates the percentage difference, that is, the approximation error, which is less than 15% in the worst case.

Figure 7-52. Approximating multiplication by 5/8 as implemented in the scheduler

// approximate_5by8.c

#include <stdio.h>
#include <math.h>

struct shift {
    int shift1;
    int shift2;
};

#define SCHED_DECAY_TICKS 32

static struct shift sched_decay_shifts[SCHED_DECAY_TICKS] = {
    {1, 1},   {1, 3},    {1, -3},   {2, -7},  {3, 5},    {3, -5},   {4, -8},  {5, 7},
    {5, -7},  {6, -10},  {7, 10},   {7, -9},  {8, -11},  {9, 12},   {9, -11}, {10, -13},
    {11, 14}, {11, -13}, {12, -15}, {13, 17}, {13, -15}, {14, -17}, {15, 19}, {16, 18},
    {16, -19},{17, 22},  {18, 20},  {18, -20},{19, 26},  {20, 22},  {20, -22},{21, -27}
};

int
main(void)
{
    int    i, v, v0 = 10000000;
    double x5_8, y5_8;
    double const5_8 = (double)5/(double)8;
    struct shift *shiftp;

    for (i = 0; i < SCHED_DECAY_TICKS; i++) {
        shiftp = &sched_decay_shifts[i];
        v = v0;

        if (shiftp->shift2 > 0)
            v = (v >> shiftp->shift1) + (v >> shiftp->shift2);
        else
            v = (v >> shiftp->shift1) - (v >> -(shiftp->shift2));

        x5_8 = pow(const5_8, (double)i);
        y5_8 = (double)v/(double)v0;

        printf("%10.10f\t%10.10f\t%10.2f\n", x5_8, y5_8,
               ((x5_8 - y5_8)/x5_8) * 100.0);
    }

    return 0;
}

$ gcc -Wall -o approximate_5by8 approximate_5by8.c
$ ./approximate_5by8
1.0000000000    1.0000000000          0.00
0.6250000000    0.6250000000          0.00
0.3906250000    0.3750000000          4.00
0.2441406250    0.2421875000          0.80
...
0.0000007523    0.0000007000          6.95
0.0000004702    0.0000004000         14.93

Note that it is not sufficient to make a thread responsible for decaying its own processor usage. Threads with low priorities may continue to remain on the run queue without getting a chance to run because of higher-priority threads. In particular, these low-priority threads will be unable to raise their priorities by decaying their own usage; somebody else must do so on their behalf. The scheduler runs a dedicated kernel thread that calls thread_update_scan() for this purpose.

// Pass #1 of thread run queue scanner
//
// Likely threads are referenced in thread_update_array[].
// This pass locks the run queues, but not the threads.
//
static boolean_t
runq_scan(run_queue_t runq)
{
    ...
}

// Pass #2 of thread run queue scanner (invokes pass #1)
//
// A candidate thread may have its priority updated through update_priority().
// This pass locks the thread, but not the run queue.
//
static void
thread_update_scan(void)
{
    ...
}


thread_update_scan() is called from the scheduler tick function sched_tick_continue(), which periodically runs to perform scheduler-related bookkeeping functions. It consists of two logical passes. In the first pass, it iterates over the run queues, comparing the sched_stamp values of timesharing threads with sched_tick. This pass collects up to THREAD_UPDATE_SIZE (128) candidate threads in an array. The second pass iterates over this array's elements, calling update_priority() on timesharing threads that satisfy the following criteria.

  • The thread is neither stopped nor requested to be stopped (the TH_SUSP bit in its state is not set).

  • The thread is not queued for waiting (the TH_WAIT bit in its state is not set).

  • The thread's sched_stamp is still not up to date with sched_tick.

7.4.3. Scheduling Policies

Mac OS X supports multiple scheduling policies, namely, THREAD_STANDARD_POLICY (timesharing), THREAD_EXTENDED_POLICY, THREAD_PRECEDENCE_POLICY, and THREAD_TIME_CONSTRAINT_POLICY (real time). The Mach routines thread_policy_get() and thread_policy_set() can be used to retrieve and modify, respectively, the scheduling policy of a thread. The Pthreads API supports retrieving and setting pthread scheduling policies and scheduling parameters through pthread_getschedparam() and pthread_setschedparam(), respectively. Scheduling policy information can also be specified at pthread creation time as pthread attributes. Note that the Pthreads API uses different policies, namely, SCHED_FIFO (first in, first out), SCHED_RR (round robin), and SCHED_OTHER (a system-specific policy, which maps to the default timesharing policy on Mac OS X). In particular, the Pthreads API does not support specifying a real-time policy.
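As a quick illustration of the Pthreads interfaces, the following minimal sketch, assuming only the standard pthread_getschedparam() and pthread_setschedparam() calls, reads the calling thread's policy and priority and then switches the thread to round-robin scheduling at the same priority:

// pthread_sched.c

#include <stdio.h>
#include <stdlib.h>
#include <sched.h>
#include <pthread.h>

int
main(void)
{
    int                ret, policy;
    struct sched_param param;
    pthread_t          self = pthread_self();

    ret = pthread_getschedparam(self, &policy, &param);
    if (ret != 0) {
        fprintf(stderr, "pthread_getschedparam: %d\n", ret);
        exit(1);
    }

    printf("policy = %s, priority = %d\n",
           (policy == SCHED_FIFO)  ? "SCHED_FIFO"  :
           (policy == SCHED_RR)    ? "SCHED_RR"    :
           (policy == SCHED_OTHER) ? "SCHED_OTHER" : "?",
           param.sched_priority);

    // switch ourselves to round robin at the same priority
    ret = pthread_setschedparam(self, SCHED_RR, &param);
    if (ret != 0)
        fprintf(stderr, "pthread_setschedparam: %d\n", ret);

    exit(0);
}

Let us now look at each of the scheduling policies.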

7.4.3.1. THREAD_STANDARD_POLICY

This is the standard scheduling policy and is the default for timesharing threads. Under this policy, threads running long-running computations are fairly assigned approximately equal processor resources. A count of timesharing threads is maintained for each processor set.

7.4.3.2. THREAD_EXTENDED_POLICY

This is an extended version of the standard policy. In this policy, a Boolean hint designates a thread as non-long-running (nontimesharing) or long-running (timesharing). In the latter case, this policy is identical to THREAD_STANDARD_POLICY. In the former case, the thread will run at a fixed priority, provided its processor usage does not exceed an unsafe limit, in which case the scheduler will temporarily demote it to being a timesharing thread through a fail-safe mechanism (see Section 7.4.3.4).
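The following is a minimal sketch of providing this hint, assuming the thread_extended_policy_data_t structure declared in <mach/thread_policy.h>:

// extended_policy.c

#include <stdio.h>
#include <stdlib.h>
#include <mach/mach.h>
#include <mach/thread_policy.h>

int
main(void)
{
    kern_return_t                 kr;
    thread_extended_policy_data_t policy;

    policy.timeshare = FALSE; // hint: this thread is not long-running

    kr = thread_policy_set(mach_thread_self(), THREAD_EXTENDED_POLICY,
                           (thread_policy_t)&policy,
                           THREAD_EXTENDED_POLICY_COUNT);
    if (kr != KERN_SUCCESS) {
        mach_error("thread_policy_set:", kr);
        exit(1);
    }

    printf("timesharing disabled for this thread\n");
    exit(0);
}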

7.4.3.3. THREAD_PRECEDENCE_POLICY

This policy allows an importance value, a signed integer, to be associated with a thread, thus allowing threads within a task to be designated as more or less important relative to each other. Other aspects being equal (the same time constraint attributes, say), the more important thread in a task will be favored over a less important thread. Note that this policy can be used in conjunction with the other policies.

Let us look at an example of using THREAD_PRECEDENCE_POLICY. The program in Figure 7-53 creates two pthreads within a task. Both threads run a function that continuously increments a counter: the first thread increments ctr1, whereas the second increments ctr2. We set the scheduling policies of both threads to THREAD_PRECEDENCE_POLICY, with the respective importance values specified on the command line. The program runs for several seconds, after which it prints both counter values, which indicate the respective amounts of processing time the two threads received.

Figure 7-53. Experimenting with the THREAD_PRECEDENCE_POLICY scheduling policy

// thread_precedence_policy.c

#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <pthread.h>
#include <sys/param.h>
#include <mach/mach.h>
#include <mach/thread_policy.h>

#define PROGNAME "thread_precedence_policy"

void
usage(void)
{
    fprintf(stderr, "usage: %s <thread1 importance> <thread2 importance>\n"
                    "       where %d <= importance <= %d\n",
            PROGNAME, -MAXPRI, MAXPRI);
    exit(1);
}

void *
adder(void *arg)
{
    unsigned long long *ctr = (unsigned long long *)arg;

    sleep(1);

    while (1)
        (*ctr)++;

    return NULL;
}

int
main(int argc, char **argv)
{
    int                ret, imp1, imp2;
    kern_return_t      kr;
    pthread_t          t1, t2;
    unsigned long long ctr1 = 0, ctr2 = 0;
    thread_precedence_policy_data_t policy;

    if (argc != 3)
        usage();

    imp1 = atoi(argv[1]);
    imp2 = atoi(argv[2]);
    if ((abs(imp1) > MAXPRI) || (abs(imp2) > MAXPRI))
        usage();

    ret = pthread_create(&t1, (pthread_attr_t *)0, adder, (void *)&ctr1);
    ret = pthread_create(&t2, (pthread_attr_t *)0, adder, (void *)&ctr2);

    policy.importance = imp1;
    kr = thread_policy_set(pthread_mach_thread_np(t1),
                           THREAD_PRECEDENCE_POLICY,
                           (thread_policy_t)&policy,
                           THREAD_PRECEDENCE_POLICY_COUNT);

    policy.importance = imp2;
    kr = thread_policy_set(pthread_mach_thread_np(t2),
                           THREAD_PRECEDENCE_POLICY,
                           (thread_policy_t)&policy,
                           THREAD_PRECEDENCE_POLICY_COUNT);

    ret = pthread_detach(t1);
    ret = pthread_detach(t2);

    sleep(10);

    printf("ctr1=%llu ctr2=%llu\n", ctr1, ctr2);

    exit(0);
}

$ gcc -Wall -o thread_precedence_policy thread_precedence_policy.c
$ ./thread_precedence_policy -127 -127
ctr1=953278876 ctr2=938172399
$ ./thread_precedence_policy -127 127
ctr1=173546131 ctr2=1201063747

7.4.3.4. THREAD_TIME_CONSTRAINT_POLICY

This is a real-time scheduling policy intended for threads with real-time constraints on their execution. Using this policy, a thread can specify to the scheduler that it needs a certain fraction of processor time, perhaps periodically. The scheduler will favor a real-time thread over all other threads, except perhaps other real-time threads. The policy can be applied to a thread using thread_policy_set() with the following policy-specific parameters: three integers (period, computation, and constraint) and a Boolean (preemptible). Each of the three integer parameters is specified in absolute-time units. A nonzero period value specifies the nominal periodicity in the computation, that is, the time between two consecutive processing arrivals. The computation value specifies the nominal time needed during a processing span. The constraint value specifies the maximum amount of real time that may elapse from the start of a processing span to the end of the computation. Note that the constraint value cannot be less than the computation value. The difference of the constraint and the computation values is the real-time latency. Finally, the preemptible parameter specifies whether the computation may be interrupted.

Note that the real-time policy does not require special privileges to be used. Therefore, it must be used with care, given that it raises a thread's priority above that of several kernel threads. For example, using a real-time thread may be beneficial if the thread has a time-critical deadline to meet and latency is an issue. However, if the thread consumes too much processor time, using the real-time policy can be counterproductive.

The scheduler includes a fail-safe mechanism for nontimesharing threads whose processor usage exceeds an unsafe threshold. When such a thread's quantum expires, it is demoted to being a timesharing thread, and its priority is set to DEPRESSPRI. However, in the case of a real-time thread, the scheduler remembers the thread's erstwhile real-time desires. After a safe release duration, the thread is promoted to being a real-time thread again, and its priority is set to BASEPRI_RTQUEUES.

The maximum unsafe computation is defined as the product of the standard quantum and the max_unsafe_quanta constant. The default value of max_unsafe_quanta is MAX_UNSAFE_QUANTA, defined to be 800 in osfmk/kern/sched_prim.c; with the standard 10-ms quantum, this yields a maximum unsafe computation of 8 seconds. An alternate value can be provided through the unsafe boot-time argument.


The following are examples of the use of THREAD_TIME_CONSTRAINT_POLICY:

  • The dynamic_pager program

  • Multimedia applications such as GarageBand, iTunes, MIDI Server, QuickTime Player, and the Core Audio layer in general

  • The I/O Kit's FireWire family

  • The WindowServer program

  • The IIDCAssistant program, which is part of the audio plug-in for Apple's iSight camera

You can use the lstasks program from Figure 7-21 to display the scheduling policy of a task's threads.

$ sudo ./lstasks -v
...
Task #70
  BSD process id (pid)   = 605 (QuickTime Player)
...
    thread 2/4 (0x16803) in task 70 (0x5803)
...
      scheduling policy            = TIME_CONSTRAINT
        period                     = 0
        computation                = 166650
        constraint                 = 333301
        preemptible                = TRUE
...


The program in Figure 7-54 is a crude example of time-constrained processing. It creates a thread that performs a periodic computation that involves sleeping for a fixed duration followed by processing for a fixed duration. We use mach_absolute_time() to measure the approximate difference between the time the thread wished to sleep for and the actual sleeping time. If the difference is more than a predefined threshold, we increment an error count. If the program is run with no command-line arguments, it will not modify the thread's scheduling policy. If one or more command-line arguments are provided, the program will set the policy to THREAD_TIME_CONSTRAINT_POLICY using predefined parameters. Thus, we can compare the number of errors in the two cases. Moreover, we can run other programs to load the system. For example, we can run an infinite loop, say, through a command such as perl -e 'while (1) {}'.

Figure 7-54. Experimenting with the THREAD_TIME_CONSTRAINT_POLICY scheduling policy

// thread_time_constraint_policy.c

#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <pthread.h>
#include <mach/mach.h>
#include <mach/mach_time.h>
#include <mach/thread_policy.h>

#define PROGNAME "thread_time_constraint_policy"

#define SLEEP_NS 50000000 // sleep for 50 ms

// if actual sleeping time differs from SLEEP_NS by more than this amount,
// count it as an error
#define ERROR_THRESH_NS ((double)50000) // 50 us

static double             abs2clock;
static unsigned long long nerrors = 0, nsamples = 0;
static struct timespec    rqt = { 0, SLEEP_NS };

// before exiting, print the information we collected
void
atexit_handler(void)
{
    printf("%llu errors in %llu samples\n", nerrors, nsamples);
}

void *
timestamper(void *arg)
{
    int       ret;
    double    diff_ns;
    u_int64_t t1, t2, diff;

    while (1) {
        t1 = mach_absolute_time();   // take a high-resolution timestamp
        ret = nanosleep(&rqt, NULL); // sleep for SLEEP_NS nanoseconds
        t2 = mach_absolute_time();   // take another high-resolution timestamp

        if (ret != 0)                // if sleeping failed, give up
            exit(1);

        diff = t2 - t1;              // how much did we sleep?

        // the "error" (in nanoseconds) in our sleeping time
        diff_ns = ((double)SLEEP_NS) - (double)diff * abs2clock;

        if (diff_ns < 0)
            diff_ns *= -1;

        if (diff_ns > ERROR_THRESH_NS)
            nerrors++;

        nsamples++;
    }

    return NULL;
}

int
main(int argc, char **argv)
{
    int           ret;
    kern_return_t kr;
    pthread_t     t1;
    static double clock2abs;
    mach_timebase_info_data_t            tbinfo;
    thread_time_constraint_policy_data_t policy;

    ret = pthread_create(&t1, (pthread_attr_t *)0, timestamper, (void *)0);
    ret = atexit(atexit_handler);

    (void)mach_timebase_info(&tbinfo);
    abs2clock = ((double)tbinfo.numer / (double)tbinfo.denom);

    // if any command-line argument is given, enable real-time
    if (argc > 1) {
        clock2abs = ((double)tbinfo.denom / (double)tbinfo.numer) * 1000000;
        policy.period      = 50 * clock2abs; // 50 ms periodicity
        policy.computation = 1 * clock2abs;  // 1 ms of work
        policy.constraint  = 2 * clock2abs;
        policy.preemptible = FALSE;
        kr = thread_policy_set(pthread_mach_thread_np(t1),
                               THREAD_TIME_CONSTRAINT_POLICY,
                               (thread_policy_t)&policy,
                               THREAD_TIME_CONSTRAINT_POLICY_COUNT);
        if (kr != KERN_SUCCESS) {
            mach_error("thread_policy_set:", kr);
            goto OUT;
        }
    }

    ret = pthread_detach(t1);

    printf("waiting 10 seconds...\n");
    sleep(10);

OUT:
    exit(0);
}

$ gcc -Wall -o thread_time_constraint_policy thread_time_constraint_policy.c
$ ./thread_time_constraint_policy
waiting 10 seconds...
117 errors in 189 samples
$ ./thread_time_constraint_policy enable_real_time
waiting 10 seconds...
0 errors in 200 samples

7.4.3.5. Priority Recomputation on Policy Change

When thread_policy_set() is used to change a thread's scheduling policy, or to modify the parameters of an existing policy in effect, the kernel recomputes the thread's priority and importance values, subject to the thread's maximum and minimum priority limits. Figure 7-55 shows the relevant calculations.

Figure 7-55. Recomputing a thread's priority on a scheduling-policy change

// osfmk/kern/thread_policy.c

static void
thread_recompute_priority(thread_t thread)
{
    integer_t priority;

    if (thread->sched_mode & TH_MODE_REALTIME)
        priority = BASEPRI_RTQUEUES;           // real-time
    else {
        if (thread->importance > MAXPRI)       // very important thread
            priority = MAXPRI;
        else if (thread->importance < -MAXPRI) // very unimportant thread
            priority = -MAXPRI;
        else
            priority = thread->importance;

        priority += thread->task_priority;     // add base priority

        if (priority > thread->max_priority)   // clip to maximum allowed
            priority = thread->max_priority;
        else if (priority < MINPRI)            // clip to minimum possible
            priority = MINPRI;
    }

    // set the base priority of the thread and reset its scheduled priority
    set_priority(thread, priority);
}

7.4.3.6. Task Roles

As we saw earlier in this chapter, the task_policy_set() routine can be used to set the scheduling policy associated with a task. TASK_CATEGORY_POLICY is an example of a task policy flavor. It informs the kernel about the role of the task in the operating system. With this flavor, task_policy_set() can be used to designate a task's role, as sketched in the code following the list. The following are examples of task roles in Mac OS X.

  • TASK_UNSPECIFIED is the default role.

  • TASK_FOREGROUND_APPLICATION is intended for a normal UI-based application meant to run in the foreground from the UI's standpoint. Assigning this role to a task sets its priority to BASEPRI_FOREGROUND (see Table 7-2). The task's maximum priority remains unchanged.

  • TASK_BACKGROUND_APPLICATION is intended for a normal UI-based application meant to run in the background from the UI's standpoint. Assigning this role to a task sets its priority to BASEPRI_BACKGROUND. Again, the maximum priority is unaltered.

  • TASK_CONTROL_APPLICATION can be assigned to at most one task at a time on a first-come first-serve basis. It designates the task as the UI-based control application. The loginwindow program normally uses this designation. Assigning this role is a privileged action that results in the task's priority being set to BASEPRI_CONTROL without affecting its maximum priority.

  • TASK_GRAPHICS_SERVER should be assigned to the window management server, that is, the WindowServer program. Like TASK_CONTROL_APPLICATION, this role too is assignable, with privileged access, only to one task on a first-come first-serve basis. The task's priority and maximum priority are set to (MAXPRI_RESERVED - 3) and MAXPRI_RESERVED, respectively. The system may or may not use this role.
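The following is a minimal sketch of designating a role, assuming the task_category_policy_data_t structure declared in <mach/task_policy.h>; it assigns TASK_FOREGROUND_APPLICATION to the calling task.

// task_role.c

#include <stdio.h>
#include <stdlib.h>
#include <mach/mach.h>
#include <mach/task_policy.h>

int
main(void)
{
    kern_return_t               kr;
    task_category_policy_data_t role;

    role.role = TASK_FOREGROUND_APPLICATION; // designate ourselves foreground

    kr = task_policy_set(mach_task_self(), TASK_CATEGORY_POLICY,
                         (task_policy_t)&role, TASK_CATEGORY_POLICY_COUNT);
    if (kr != KERN_SUCCESS) {
        mach_error("task_policy_set:", kr);
        exit(1);
    }

    printf("task role set to TASK_FOREGROUND_APPLICATION\n");
    exit(0);
}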

Note that roles are not inherited across tasks. Therefore, every task begins life with TASK_UNSPECIFIED as its role. We can use our lstasks program to examine the roles of various tasks in the system.

$ sudo ./lstasks -v
...
Task #21
  BSD process id (pid)   = 74 (loginwindow)
...
    role                 = CONTROL_APPLICATION
...
Task #29
  BSD process id (pid)   = 153 (Dock)
...
    role                 = BACKGROUND_APPLICATION
...
Task #31
  BSD process id (pid)   = 156 (Finder)
...
    role                 = BACKGROUND_APPLICATION
...
Task #45
  BSD process id (pid)   = 237 (Terminal)
...
    role                 = FOREGROUND_APPLICATION
...




