

7.2. Mach Abstractions, Data Structures, and APIs

Let us examine some of the kernel data structures that play important roles in the Mac OS X process subsystem. These include the following:

  • struct processor_set [osfmk/kern/processor.h], the processor set structure

  • struct processor [osfmk/kern/processor.h], the processor structure

  • struct task [osfmk/kern/task.h], the Mach task structure

  • struct thread [osfmk/kern/thread.h], the machine-independent Mach thread structure

  • struct machine_thread [osfmk/ppc/thread.h], the machine-dependent thread state structure

  • struct proc [bsd/sys/proc.h], the BSD process structure

  • struct uthread [bsd/sys/user.h], the BSD per-thread user structure

  • struct run_queue [osfmk/kern/sched.h], the run queue structure used by the scheduler

7.2.1. Summary of Relationships

Mach groups processors into one or more processor sets. Each processor set has a run queue of runnable threads, and each processor has a local run queue. Besides the run queue, a processor set also maintains a list of all threads in the set, along with the tasks assigned to the set. A task contains a list of its threads. It also refers back to its assigned processor set. A thread's machine-dependent state, including the so-called process control block (PCB), is captured in a machine_thread structure. A BSD process additionally has a proc structure that refers to the associated task. A multithreaded process is implemented as a Mach task containing multiple Mach threads. Each thread in a BSD process contains a pointer to a uthread structure. Moreover, the proc structure contains a list of pointers to uthread structures, one for each thread within the process.

7.2.2. Processor Sets

Mach divides the available processors on a system into one or more processor sets. There is always a default processor set, which is initialized during kernel startup, before the scheduler can run. It initially contains all processors in the system. The first task created by the kernel is assigned to the default set. A processor set may be empty, except for the default set, which must contain at least one processor. A processor belongs to at most one processor set at a time.

The Purpose of Processor Sets

The original motivation behind processor sets was to group processors to allocate them to specific system activities, a coarse-grained allocation. Moreover, in early versions of Mac OS X, processor sets had associated scheduling policies and attributes, which provided uniform control of the scheduling aspects of the threads in the set. Specific policies could be enabled and disabled at the processor set level.


7.2.2.1. Representation

As shown in Figure 7-2, a processor set object has two Mach ports representing it: a name port and a control port. The name port is only an identifier: it can be used only for retrieving information about the processor set. The control port represents the underlying object: it can be used for performing control operations, for example, to assign processors, tasks, and threads to the processor set. This scenario is representative of Mach's architecture, wherein operations on various Mach objects are performed by sending the appropriate messages to the objects' respective control ports.

Figure 7-2. The processor set structure in the xnu kernel

// osfmk/kern/processor.h

struct processor_set {
    queue_head_t     idle_queue;       // queue of idle processors
    int              idle_count;       // how many idle processors?
    queue_head_t     active_queue;     // queue of active processors
    queue_head_t     processors;       // queue of all processors
    int              processor_count;  // how many processors?
    decl_simple_lock_data(,sched_lock) // scheduling lock
    struct run_queue runq;             // run queue for this set
    queue_head_t     tasks;            // tasks assigned to this set
    int              task_count;       // how many tasks assigned?
    queue_head_t     threads;          // threads in this set
    int              thread_count;     // how many threads assigned?
    int              ref_count;        // structure reference count
    int              active;           // is this set in use?
    ...
    struct ipc_port *pset_self;        // control port (for operations)
    struct ipc_port *pset_name_self;   // name port (for information)
    uint32_t         run_count;        // threads running
    uint32_t         share_count;      // timeshare threads running
    integer_t        mach_factor;      // the Mach factor
    integer_t        load_average;     // load average
    uint32_t         pri_shift;        // scheduler load average
};

extern struct processor_set default_pset;

7.2.2.2. The Processor Set API

The processor set Mach API provides routines that can be called from user space to query and manipulate processor sets. Note that processor set manipulation is a privileged operation. The following are examples of routines in the processor set API.

  • host_processor_sets() returns a list of send rights representing all processor set name ports on the host.

  • host_processor_set_priv() translates a processor set name port into a processor set control port.

  • processor_set_default() returns the name port for the default processor set.

  • processor_set_create() creates a new processor set and returns the name and the control ports, whereas processor_set_destroy() destroys the specified processor set while reassigning its processors, tasks, and threads to the default set.[5]

    [5] Since the kernel supports only one processor set, the create and destroy calls always fail.

  • processor_set_info() retrieves information about the specified processor set. As we saw in Chapter 6, "info" calls in the Mach APIs typically require a flavor argument that specifies the type of information desired. This way, the same call may fetch a variety of information depending on the flavor specified. Examples of processor_set_info() flavors are PROCESSOR_SET_BASIC_INFO (the number of assigned processors to the set and the default policy[6] in effect; returned in a processor_set_basic_info structure), PROCESSOR_SET_TIMESHARE_DEFAULT (the base attributes for the timeshare scheduling policy; returned in a policy_timeshare_base structure), and PROCESSOR_SET_TIMESHARE_LIMITS (the limits on the allowed timeshare policy attributes; returned in a policy_timeshare_limit structure).

    [6] The default policy is hardcoded to POLICY_TIMESHARE.

  • processor_set_statistics() retrieves scheduling statistics for the specified processor set. It also requires a flavor argument. For example, the PROCESSOR_SET_LOAD_INFO flavor returns load statistics in a processor_set_load_info structure (a short example of this call appears at the end of this section).

  • processor_set_tasks() returns a list of send rights to the kernel ports of all tasks currently assigned to the specified processor set. Similarly, processor_set_threads() retrieves the processor set's assigned threads.

  • processor_set_stack_usage() is a debugging routine that is enabled only if the kernel was compiled with the MACH_DEBUG option. It retrieves information on thread stack usage in a given processor set.

Note that using the list of processor sets, all tasks and threads in the system can be found.

The processor set interface is deprecated in Mac OS X and is likely to change or disappear at some point. In fact, the xnu kernel supports only a single processor set; the interface routines operate on the default processor set.
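That said, the surviving routines can still be exercised against the default set. The following is a minimal sketch (not from the xnu sources) that retrieves the default processor set's name port with processor_set_default() and queries its load statistics with the PROCESSOR_SET_LOAD_INFO flavor; it assumes the processor_set_load_info fields described earlier and that the kernel scales load_average and mach_factor by 1000.

// pset_load.c

#include <stdio.h>
#include <stdlib.h>
#include <mach/mach.h>

int
main(void)
{
    kern_return_t                  kr;
    processor_set_name_t           pset;
    processor_set_load_info_data_t info;
    mach_msg_type_number_t         count = PROCESSOR_SET_LOAD_INFO_COUNT;

    // the name port suffices for statistics -- no privileged port is needed
    kr = processor_set_default(mach_host_self(), &pset);
    if (kr != KERN_SUCCESS) {
        mach_error("processor_set_default:", kr);
        exit(1);
    }

    kr = processor_set_statistics(pset, PROCESSOR_SET_LOAD_INFO,
                                  (processor_set_info_t)&info, &count);
    if (kr != KERN_SUCCESS) {
        mach_error("processor_set_statistics:", kr);
        exit(1);
    }

    // load_average and mach_factor are assumed to be scaled by 1000
    printf("%d tasks, %d threads, load average %.2f, Mach factor %.2f\n",
           info.task_count, info.thread_count,
           (double)info.load_average / 1000.0,
           (double)info.mach_factor / 1000.0);

    exit(0);
}

Because only the name port is used, this sketch does not require root privileges, unlike the processor manipulation examples that follow.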


7.2.3. Processors

The processor structure is a machine-independent description of a physical processor. Some of the processor structure's fields are similar to those of the processor_set structure, but with a per-processor (local) scope. For example, the processor structure's run queue field is used for threads bound only to that processor. Figure 7-3 shows an annotated excerpt from the processor structure's declaration. The possible states that a processor can be in are PROCESSOR_OFF_LINE (not available), PROCESSOR_RUNNING (in normal execution), PROCESSOR_IDLE (idle), PROCESSOR_DISPATCHING (transitioning from the idle state to the running state), PROCESSOR_SHUTDOWN (going offline), and PROCESSOR_START (being started).

Figure 7-3. The processor structure in the xnu kernel

// osfmk/kern/processor.h

struct processor {
    queue_chain_t     processor_queue; // idle, active, or action queue link
    int               state;           // processor state
    struct thread    *active_thread;   // thread running on processor
    struct thread    *next_thread;     // next thread to run if dispatched
    struct thread    *idle_thread;     // this processor's idle thread
    processor_set_t   processor_set;   // the processor set that we belong to
    int               current_pri;     // current thread's priority
    timer_call_data_t quantum_timer;   // timer for quantum expiration
    uint64_t          quantum_end;     // time when current quantum ends
    uint64_t          last_dispatch;   // time of last dispatch
    int               timeslice;       // quantum before timeslice ends
    int               deadline;        // current deadline
    struct run_queue  runq;            // local run queue for this processor
    queue_chain_t     processors;      // all processors in our processor set
    ...
    struct ipc_port  *processor_self;  // processor's control port
    processor_t       processor_list;  // all existing processors
    processor_data_t  processor_data;  // per-processor data
};
...
extern processor_t master_processor;

7.2.3.1. Interconnections

Figure 7-4 shows how the processor_set and processor structures are interconnected in a system with one processor set and two processors. The processors are shown to be neither on the set's idle queue nor on the active queue. The processors field of each processor, along with the processors field of the processor set, are all chained together in a circular list. In both the processor and the processor_set structures, the processors field is a queue element containing only two pointers: prev (previous) and next. In particular, the next pointer of the processor set's processors field points to the first (master) processor. Thus, you can traverse the list of all processors in a processor set starting from either the set or any of the processors. Similarly, you can traverse the list of all active processors in a set using the active_queue field of the processor_set structure and the processor_queue field of each processor structure in the set.

Figure 7-4. A processor set containing two processors


Figure 7-5 shows the situation when both processors in Figure 7-4 are on the active queue of the default processor set.

Figure 7-5. A processor set with two processors on its active queue


7.2.3.2. The Processor API

The following are examples of Mach routines that deal with processors.

  • host_processors() returns an array of send rights representing all processors in the system. Note that the user-space caller receives the array as out-of-line data in a Mach IPC message; the memory appears implicitly allocated in the caller's virtual address space. In such a case, the caller should explicitly deallocate the memory when it is no longer needed by calling the vm_deallocate() or mach_vm_deallocate() Mach routines.

  • processor_control() runs machine-dependent control operations, or commands, on the specified processor. Examples of such commands include setting performance-monitoring registers and setting or clearing performance-monitoring counters.

  • processor_info() retrieves information about the specified processor. Examples of processor_info() flavors are PROCESSOR_BASIC_INFO (processor type, subtype, and slot number; whether it is running; and whether it is the master processor) and PROCESSOR_CPU_LOAD_INFO (the number of tasks and threads assigned to the processor, its load average, and its Mach factor).

  • On a multiprocessor system, processor_start() starts the given processor if it is currently offline. The processor is assigned to the default processor set after startup. Conversely, processor_exit() stops the given processor and removes it from its assigned processor set.

  • processor_get_assignment() returns the name port for the processor set to which the given processor is currently assigned. The complementary call, processor_assign(), assigns a processor to a processor set. However, since the xnu kernel supports only one processor set, processor_assign() always returns a failure, whereas processor_get_assignment() always returns the default processor set.

Several processor-related Mach routines have machine-dependent behavior. Moreover, routines that affect a processor's global behavior are privileged.


There also exist calls to set or get the processor set affinity of tasks and threads. For example, task_assign() assigns a task, and optionally all threads within the task, to the given processor set. Unless all threads are included, only newly created threads will be assigned to the new processor set. As with other calls dealing with multiple processor sets, task_assign() always returns failure on Mac OS X.


Let us look at two examples of using the Mach processor API. First, we will write a program to retrieve information about the processors in a system. Next, we will write a program to disable a processor on a multiprocessor system.

7.2.3.3. Displaying Processor Information

Figure 7-6 shows the program for retrieving processor information.

Figure 7-6. Retrieving information about processors on the host

// processor_info.c

#include <stdio.h>
#include <stdlib.h>
#include <mach/mach.h>

void
print_basic_info(processor_basic_info_t info)
{
    printf("CPU: slot %d%s %s, type %d, subtype %d\n", info->slot_num,
           (info->is_master) ? " (master)," : ",",
           (info->running) ? "running" : "not running",
           info->cpu_type, info->cpu_subtype);
}

void
print_cpu_load_info(processor_cpu_load_info_t info)
{
    unsigned long ticks;

    // Total ticks do not amount to the uptime if the machine has slept
    ticks = info->cpu_ticks[CPU_STATE_USER]   +
            info->cpu_ticks[CPU_STATE_SYSTEM] +
            info->cpu_ticks[CPU_STATE_IDLE]   +
            info->cpu_ticks[CPU_STATE_NICE];
    printf("     %ld ticks "
           "(user %ld, system %ld, idle %ld, nice %ld)\n", ticks,
           info->cpu_ticks[CPU_STATE_USER],
           info->cpu_ticks[CPU_STATE_SYSTEM],
           info->cpu_ticks[CPU_STATE_IDLE],
           info->cpu_ticks[CPU_STATE_NICE]);
    printf("     cpu uptime %ld h %ld m %ld s\n",
           (ticks / 100) / 3600,        // hours
           ((ticks / 100) % 3600) / 60, // minutes
           (ticks / 100) % 60);         // seconds
}

int
main(void)
{
    int                            i;
    kern_return_t                  kr;
    host_name_port_t               myhost;
    host_priv_t                    host_priv;
    processor_port_array_t         processor_list;
    natural_t                      processor_count;
    processor_basic_info_data_t    basic_info;
    processor_cpu_load_info_data_t cpu_load_info;
    natural_t                      info_count;

    myhost = mach_host_self();

    kr = host_get_host_priv_port(myhost, &host_priv);
    if (kr != KERN_SUCCESS) {
        mach_error("host_get_host_priv_port:", kr);
        exit(1);
    }

    kr = host_processors(host_priv, &processor_list, &processor_count);
    if (kr != KERN_SUCCESS) {
        mach_error("host_processors:", kr);
        exit(1);
    }

    printf("%d processors total.\n", processor_count);

    for (i = 0; i < processor_count; i++) {

        info_count = PROCESSOR_BASIC_INFO_COUNT;
        kr = processor_info(processor_list[i],
                            PROCESSOR_BASIC_INFO,
                            &myhost,
                            (processor_info_t)&basic_info,
                            &info_count);
        if (kr == KERN_SUCCESS)
            print_basic_info((processor_basic_info_t)&basic_info);

        info_count = PROCESSOR_CPU_LOAD_INFO_COUNT;
        kr = processor_info(processor_list[i],
                            PROCESSOR_CPU_LOAD_INFO,
                            &myhost,
                            (processor_info_t)&cpu_load_info,
                            &info_count);
        if (kr == KERN_SUCCESS)
            print_cpu_load_info((processor_cpu_load_info_t)&cpu_load_info);
    }

    // Other processor information flavors (may be unsupported)
    //
    //  PROCESSOR_PM_REGS_INFO,  // performance monitor register information
    //  PROCESSOR_TEMPERATURE,   // core temperature

    // This will deallocate while rounding up to page size
    (void)vm_deallocate(mach_task_self(), (vm_address_t)processor_list,
                        processor_count * sizeof(processor_t *));

    exit(0);
}

$ gcc -Wall -o processor_info processor_info.c
$ sudo ./processor_info
2 processors total.
CPU: slot 0 (master), running, type 18, subtype 100
     16116643 ticks (user 520750, system 338710, idle 15254867, nice 2316)
     cpu uptime 44 h 46 m 6 s
CPU: slot 1, running, type 18, subtype 100
     16116661 ticks (user 599531, system 331140, idle 15182087, nice 3903)
     cpu uptime 44 h 46 m 6 s
$ uptime
16:12 up 2 days, 16:13, 1 user, load averages: 0.01 0.04 0.06

The cpu uptime value printed by the program in Figure 7-6 is calculated based on the number of ticks reported for the processor. There is a tick every 10 ms on Mac OS X; that is, there are 100 ticks a second. Therefore, 16,116,643 ticks correspond to 161,166 seconds, which is 44 hours, 46 minutes, and 6 seconds. When we use the uptime utility to display how long the system has been running, we get a higher value: over 64 hours. This is because the processor uptime is not the same as the system uptime if the system has slept; there are no processor ticks when a processor is sleeping.

Let us modify the program in Figure 7-6 to verify that memory is indeed allocated in the calling task's address space as a side effect of receiving out-of-line data in response to the host_processors() call. We will print the process ID and the value of the processor_list pointer after the call to host_processors() and make the process sleep briefly. While the program sleeps, we will use the vmmap utility to display the virtual memory regions allocated in the process. We expect the region containing the pointer to be listed in vmmap's output. Figure 7-7 shows the modified program excerpt and the corresponding output.

Figure 7-7. Out-of-line data received by a process from the kernel as a result of a Mach call

// processor_info.c

    ...
    processor_list = (processor_port_array_t)0;

    kr = host_processors(host_priv, &processor_list, &processor_count);
    if (kr != KERN_SUCCESS) {
        mach_error("host_processors:", kr);
        exit(1);
    }

    // #include <unistd.h> for getpid(2) and sleep(3)
    printf("processor_list = %p\n", processor_list);
    printf("my process ID is %d\n", getpid());
    sleep(60);
    ...

$ sudo ./processor_info
processor_list = 0x6000
my process ID is 2463
...

$ sudo vmmap 2463
Virtual Memory Map of process 2463 (processor_info)
...
==== Writable regions for process 2463
...
Mach message           00006000-00007000 [    4K] rw-/rwx SM=PRV
...

7.2.3.4. Stopping and Starting a Processor in a Multiprocessor System

In this example, we will programmatically stop and start one of the processors in a multiprocessor system. Figure 7-8 shows a program that calls processor_exit() to take the last processor offline and processor_start() to bring it online.

Figure 7-8. Starting and stopping a processor through the Mach processor interface

// processor_xable.c

#include <stdio.h>
#include <stdlib.h>
#include <mach/mach.h>

#define PROGNAME "processor_xable"

#define EXIT_ON_MACH_ERROR(msg, retval) \
    if (kr != KERN_SUCCESS) { mach_error(msg, kr); exit((retval)); }

int
main(int argc, char **argv)
{
    kern_return_t          kr;
    host_priv_t            host_priv;
    processor_port_array_t processor_list;
    natural_t              processor_count;
    char                  *errmsg = PROGNAME;

    if (argc != 2) {
        fprintf(stderr,
                "usage: %s <cmd>, where <cmd> is \"exit\" or \"start\"\n",
                PROGNAME);
        exit(1);
    }

    kr = host_get_host_priv_port(mach_host_self(), &host_priv);
    EXIT_ON_MACH_ERROR("host_get_host_priv_port:", kr);

    kr = host_processors(host_priv, &processor_list, &processor_count);
    EXIT_ON_MACH_ERROR("host_processors:", kr);

    // disable last processor on a multiprocessor system
    if (processor_count > 1) {
        if (*argv[1] == 'e') {
            kr = processor_exit(processor_list[processor_count - 1]);
            errmsg = "processor_exit:";
        } else if (*argv[1] == 's') {
            kr = processor_start(processor_list[processor_count - 1]);
            errmsg = "processor_start:";
        } else {
            kr = KERN_INVALID_ARGUMENT;
        }
    } else
        printf("Only one processor!\n");

    // this will deallocate while rounding up to page size
    (void)vm_deallocate(mach_task_self(), (vm_address_t)processor_list,
                        processor_count * sizeof(processor_t *));

    EXIT_ON_MACH_ERROR(errmsg, kr);

    fprintf(stderr, "%s successful\n", errmsg);

    exit(0);
}

$ gcc -Wall -o processor_xable processor_xable.c
$ sudo ./processor_info
2 processors total.
CPU: slot 0 (master), running, type 18, subtype 100
     88141653 ticks (user 2974228, system 2170409, idle 82953261, nice 43755)
     cpu uptime 244 h 50 m 16 s
CPU: slot 1, running, type 18, subtype 100
     88128007 ticks (user 3247822, system 2088151, idle 82741221, nice 50813)
     cpu uptime 244 h 48 m 0 s
$ sudo ./processor_xable exit
processor_exit: successful
$ sudo ./processor_info
2 processors total.
CPU: slot 0 (master), running, type 18, subtype 100
     88151172 ticks (user 2975172, system 2170976, idle 82961265, nice 43759)
     cpu uptime 244 h 51 m 51 s
CPU: slot 1, not running, type 18, subtype 100
     88137333 ticks (user 3248807, system 2088588, idle 82749125, nice 50813)
     cpu uptime 244 h 49 m 33 s
$ sudo ./processor_xable start
processor_start: successful
$ sudo ./processor_info
2 processors total.
CPU: slot 0 (master), running, type 18, subtype 100
     88153641 ticks (user 2975752, system 2171100, idle 82963028, nice 43761)
     cpu uptime 244 h 52 m 16 s
CPU: slot 1, running, type 18, subtype 100
     88137496 ticks (user 3248812, system 2088590, idle 82749281, nice 50813)
     cpu uptime 244 h 49 m 34 s

7.2.4. Tasks and the Task API

A Mach task is a machine-independent abstraction of the execution environment of threads. We saw earlier that a task is a container for resources: it encapsulates protected access to a sparse virtual address space, IPC (port) space, processor resources, scheduling control, and threads that use these resources. A task has a few task-specific ports, such as the task's kernel port and task-level exception ports (corresponding to task-level exception handlers). Figure 7-9 shows an annotated excerpt from the task structure.

Figure 7-9. The task structure in the xnu kernel

// osfmk/kern/task.h

struct task {
    ...
    vm_map_t      map;          // address space description
    queue_chain_t pset_tasks;   // list of tasks in our processor set
    ...
    queue_head_t  threads;      // list of threads in this task
    int           thread_count; // number of threads in this task
    ...
    integer_t     priority;     // base priority for threads
    integer_t     max_priority; // maximum priority for threads
    ...
    // IPC structures
    ...
    struct ipc_port *itk_sself;                            // a send right
    struct exception_action  exc_actions[EXC_TYPES_COUNT]; // exception ports
    struct ipc_port         *itk_host;                     // host port
    struct ipc_port         *itk_bootstrap;                // bootstrap port

    // "registered" ports -- these are inherited across task_create()
    struct ipc_port         *itk_registered[TASK_PORT_REGISTER_MAX];

    struct ipc_space        *itk_space;                    // the IPC space
    ...
    // locks and semaphores
    queue_head_t semaphore_list;   // list of owned semaphores
    queue_head_t lock_set_list;    // list of owned lock sets
    int          semaphores_owned; // number of owned semaphores
    int          lock_sets_owned;  // number of owned locks
    ...
#ifdef MACH_BSD
    void        *bsd_info;         // pointer to BSD process structure
#endif
    struct shared_region_mapping *system_shared_region;
    struct tws_hash              *dynamic_working_set;
    ...
};

The following are examples of Mach task routines accessible through the system library.

  • mach_task_self() is the task "identity trap"; it returns a send right to the calling task's kernel port. As we saw in Chapter 6, the system library caches the right returned by this call in a per-task variable.

  • pid_for_task() retrieves the BSD process ID for the task specified by the given port. Note that whereas all BSD processes have a corresponding Mach task, it is technically possible to have a Mach task that is not associated with a BSD process.

  • task_for_pid() retrieves the port for the task corresponding to the specified BSD process ID (see the sketch following this list for an example of its use).

  • task_info() retrieves information about the given task. Examples of task_info() flavors include TASK_BASIC_INFO (suspend count, virtual memory size, resident memory size, and so on), TASK_THREAD_TIMES_INFO (total times for live threads), and TASK_EVENTS_INFO (page faults, system calls, context switches, and so on).

  • task_threads() returns an array of send rights to the kernel ports of all threads within the given task.

  • task_create() creates a new Mach task that either inherits the calling task's address space or is created with an empty address space. The calling task gets access to the kernel port of the newly created task, which contains no threads. Note that this call does not create a BSD process and as such is not useful from user space.

  • task_suspend() increments the suspend count for the given task, stopping all threads within the task. Newly created threads within a task cannot execute if the task's suspend count is positive.

  • task_resume() decrements the suspend count for the given task. If the new suspend count is zero, task_resume() also resumes those threads within the task whose suspend counts are zero. The task suspend count cannot become negative; it is either zero (runnable task) or positive (suspended task).

  • task_terminate() kills the given task and all threads within it. The task's resources are deallocated.

  • task_get_exception_ports() retrieves send rights to a specified set of exception ports for the given task. An exception port is one to which the kernel sends messages when one or more types of exceptions occur. Note that threads can have their own exception ports, which are preferred over the task's. Only if a thread-level exception port is set to the null port (IP_NULL), or returns with a failure, does the task-level exception port come into play.

  • task_set_exception_ports() sets the given task's exception ports.

  • task_swap_exception_ports() performs the combined function of task_get_exception_ports() and task_set_exception_ports().

  • task_get_special_port() retrieves a send right to the given special port in a task. Examples of special ports include TASK_KERNEL_PORT (the same as the port returned by mach_task_self(); used for controlling the task), TASK_BOOTSTRAP_PORT (used in requests for retrieving ports representing system services), and TASK_HOST_NAME_PORT (the same as the port returned by mach_host_self(); used for retrieving host-related information).

  • task_set_special_port() sets one of the task's special ports to the given send right.

  • task_policy_get() retrieves scheduling policy parameters for the specified task. It can also be used to retrieve default task policy parameter values.

  • task_policy_set() is used to set scheduling policy information for a task.
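To tie several of these routines together, here is a minimal sketch (not from the xnu sources) that converts a BSD process ID into a Mach task port with task_for_pid() and then queries the task using the TASK_BASIC_INFO flavor of task_info(). The printed fields assume the task_basic_info structure described above, and the task_for_pid() call is subject to the usual privilege checks (run it as root or against your own processes).

// task_basic_info.c

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <mach/mach.h>

int
main(int argc, char **argv)
{
    kern_return_t          kr;
    pid_t                  pid;
    task_t                 task;
    task_basic_info_data_t info;
    mach_msg_type_number_t count = TASK_BASIC_INFO_COUNT;

    // default to our own process if no PID is given on the command line
    pid = (argc > 1) ? atoi(argv[1]) : getpid();

    kr = task_for_pid(mach_task_self(), pid, &task);
    if (kr != KERN_SUCCESS) {
        mach_error("task_for_pid:", kr);
        exit(1);
    }

    kr = task_info(task, TASK_BASIC_INFO, (task_info_t)&info, &count);
    if (kr != KERN_SUCCESS) {
        mach_error("task_info:", kr);
        exit(1);
    }

    printf("pid %d: suspend count %d, virtual size %lu KB, resident size %lu KB\n",
           pid, info.suspend_count,
           (unsigned long)(info.virtual_size >> 10),
           (unsigned long)(info.resident_size >> 10));

    exit(0);
}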

7.2.5. Threads

A Mach thread is a single flow of control in a Mach task. Depending on an application's nature and architecture, using multiple threads within the application can lead to improved performance. Examples of situations in which multiple threads could be beneficial include the following.

  • When computation and I/O can be separated and are mutually independent, dedicated threads can be used to perform these two types of activities simultaneously.

  • When execution contexts (threads or processes) need to be created and destroyed frequently, using threads may improve performance since a thread is substantially less expensive to create than an entire process.[7]

    [7] The performance improvement will typically be perceptible only if the application creates so many processes (or creates them in such a manner) that the overhead is a limiting factor in the application's performance.

  • On a multiprocessor system, multiple threads within the same task can run truly concurrently, which improves performance if the application can benefit from concurrent computation.

A thread contains information such as the following:

  • Scheduling priority, scheduling policy, and related attributes

  • Processor usage statistics

  • A few thread-specific port rights, including the thread's kernel port and thread-level exception ports (corresponding to thread-level exception handlers)

  • Machine state (through a machine-dependent thread-state structure), which changes as the thread executes

Figure 7-10 shows the important constituents of the thread structure in xnu.

Figure 7-10. The thread structure in the xnu kernel

// osfmk/kern/thread.h

struct thread {
    queue_chain_t      links;          // run/wait queue links
    run_queue_t        runq;           // run queue thread is on
    wait_queue_t       wait_queue;     // wait queue we are currently on
    event64_t          wait_event;     // wait queue event
    ...
    thread_continue_t  continuation;   // continue here next dispatch
    void              *parameter;      // continuation parameter
    ...
    vm_offset_t        kernel_stack;   // current kernel stack
    vm_offset_t        reserved_stack; // reserved kernel stack

    int                state;          // state that thread is in

    // scheduling information
    ...

    // various bits of stashed machine-independent state
    ...

    // IPC data structures
    ...

    // AST/halt data structures
    ...

    // processor set information
    ...

    queue_chain_t          task_threads; // threads in our task
    struct machine_thread  machine;      // machine-dependent state
    struct task           *task;         // containing task
    vm_map_t               map;          // containing task's address map
    ...

    // mutex, suspend count, stop count, pending thread ASTs
    ...

    // other
    ...

    struct ipc_port         *ith_sself;                    // a send right
    struct exception_action  exc_actions[EXC_TYPES_COUNT]; // exception ports
    ...

#ifdef MACH_BSD
    void *uthread; // per-thread user structure
#endif
};

7.2.5.1. The Thread API

A user program controls a Mach thread (normally indirectly, through the Pthreads library[8]) using the thread's kernel port. The following are examples of Mach thread routines accessible through the system library.

[8] The Pthreads library is part of the system library (libSystem.dylib) on Mac OS X.

  • mach_thread_self() returns send rights to the calling thread's kernel port.

  • thread_info() retrieves information about the given thread (see the sketch following this list for an example). Examples of thread_info() flavors include THREAD_BASIC_INFO (user and system run times, scheduling policy in effect, suspend count, and so on) and obsoleted flavors for fetching scheduling policy information, such as THREAD_SCHED_FIFO_INFO, THREAD_SCHED_RR_INFO, and THREAD_SCHED_TIMESHARE_INFO.

  • thread_get_state() retrieves the machine-specific user-mode execution state for the given thread, which must not be the calling thread itself. Depending on the flavor, the returned state contains different sets of machine-specific register contents. Flavor examples include PPC_THREAD_STATE, PPC_FLOAT_STATE, PPC_EXCEPTION_STATE, PPC_VECTOR_STATE, PPC_THREAD_STATE64, and PPC_EXCEPTION_STATE64.

  • thread_set_state() is the converse of thread_get_state(); it takes the given user-mode execution state information and flavor type and sets the target thread's state. Again, the calling thread cannot set its own state using this routine.

  • thread_create() creates a thread within the given task. The newly created thread has a suspend count of one. It has no machine state; its state must be explicitly set by calling thread_set_state() before the thread can be resumed by calling thread_resume().

  • thread_create_running() combines the effect of thread_create(), thread_set_state(), and thread_resume(): It creates a running thread using the given machine state within the given task.

  • thread_suspend() increments the suspend count of the given thread. As long as the suspend count is greater than zero, the thread cannot execute any more user-level instructions. If the thread was already executing within the kernel because of a trap (such as a system call or a page fault), then, depending on the trap, it may block in situ, or it may continue executing until the trap is about to return to user space. Nevertheless, the trap will return only on resumption of the thread. Note that a thread is created in the suspended state so that its machine state can be set appropriately.

  • thread_resume() decrements the suspend count of the given thread. If the decremented count becomes zero, the thread is resumed. Note that if a task's suspend count is greater than zero, a thread within it cannot execute even if the thread's individual suspend count is zero. Similar to the task suspend count, a thread's suspend count is either zero or positive.

  • thread_terminate() destroys the given thread. If the thread is the last thread to terminate in a task that corresponds to a BSD process, the thread termination code also performs a BSD process exit.

  • thread_switch() instructs the scheduler to switch context directly to another thread. The caller can also specify a particular thread as a hint, in which case the scheduler will attempt to switch to the specified thread. Several conditions must hold for the hinted switch to succeed. For example, the hint thread's scheduling priority must not be real time, and it should not be bound to a processor (if at all) other than the current processor. Note that this is an example of handoff scheduling, as the caller's quantum is handed off to the new thread. If no hint thread is specified, thread_switch() forces a reschedule, and a new thread is selected to run. The caller's existing kernel stack is discarded; when it eventually resumes, it executes the continuation function[9] thread_switch_continue() [osfmk/kern/syscall_subr.c] on a new kernel stack. thread_switch() can be optionally instructed to block the calling thread for a specified time, a wait that can be canceled only by thread_abort(). It can also be instructed to depress the thread's priority temporarily by setting its scheduling attributes such that the scheduler provides it with the lowest possible service for the specified time, after which the scheduling depression is aborted. It is also aborted when the current thread is executed next. It can be explicitly aborted through thread_abort() or thread_depress_abort().

    [9] We will look at continuations later in this chapter.

  • thread_wire() marks the given thread as privileged such that it can consume physical memory from the kernel's reserved pool when free memory is scarce. Moreover, when such a thread is to be inserted in a wait queue of threads waiting for a particular event to be posted to that queue, it is inserted at the head of the queue. This routine is meant for threads that are directly involved in the page-out mechanism; it should not be invoked by user programs.

  • thread_abort() can be used by one thread to stop another thread; it aborts a variety of in-progress operations in the target thread, such as clock sleeps, scheduling depressions, page faults, and other Mach message primitive calls (including system calls). If the target thread is in kernel mode, a successful thread_abort() will result in the target appearing to have returned from the kernel. For example, in the case of a system call, the thread's execution will resume in the system call return code, with an "interrupted system call" return code. Note that thread_abort() works even if the target is suspended; the target will be interrupted when it resumes. In fact, thread_abort() should be used only on a thread that is suspended. If the target thread is executing a nonatomic operation when thread_abort() is called on it, the operation will be aborted at an arbitrary point and cannot be restarted. thread_abort() is meant for cleanly stopping the target. In the case of a call to thread_suspend(), if the target is executing in the kernel and the thread's state is modified (through thread_set_state()) when it is suspended, the state may be altered unpredictably as a side effect of the system call when the thread resumes.

  • thread_abort_safely() is similar to thread_abort(). However, unlike thread_abort(), which aborts even nonatomic operations (at arbitrary points and in a nonrestartable manner), thread_abort_safely() returns an error in such cases. The thread must then be resumed, and another thread_abort_safely() call must be attempted.

  • thread_get_exception_ports() retrieves send rights to one or more exception ports of a given thread. The exception types for which to retrieve the ports are specified through a flag word.

  • thread_set_exception_ports() sets the given port (a send right) as the exception port for the specified exception types. Note that a thread's exception ports are each set to the null port (IP_NULL) during thread creation.

  • thread_get_special_port() returns a send right to a specific special port for the given thread. For example, specifying THREAD_KERNEL_PORT returns the target thread's name port, the same as that returned by mach_thread_self() within the thread. Thereafter, the port can be used to perform operations on the thread.

  • thread_set_special_port() sets a specific special port for the given thread by changing it to the caller-provided send right. The old send right is released by the kernel.

  • thread_policy_get() retrieves scheduling policy parameters for the given thread. It can also be used to retrieve default thread-scheduling policy parameter values. thread_policy_set() is used to set scheduling policy information for a thread. Examples of thread-scheduling policy flavors include THREAD_EXTENDED_POLICY, THREAD_TIME_CONSTRAINT_POLICY, and THREAD_PRECEDENCE_POLICY.
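As a minimal sketch of the thread information calls (not from the xnu sources), the following program enumerates the calling task's threads with task_threads() and prints each thread's user and system times via the THREAD_BASIC_INFO flavor; the field names assume the thread_basic_info structure from the Mach headers. Like host_processors(), task_threads() returns its array as out-of-line data, so the program deallocates it when done.

// thread_times.c

#include <stdio.h>
#include <stdlib.h>
#include <mach/mach.h>

int
main(void)
{
    unsigned int             i;
    kern_return_t            kr;
    thread_act_array_t       threads;
    mach_msg_type_number_t   thread_count, count;
    thread_basic_info_data_t info;

    kr = task_threads(mach_task_self(), &threads, &thread_count);
    if (kr != KERN_SUCCESS) {
        mach_error("task_threads:", kr);
        exit(1);
    }

    for (i = 0; i < thread_count; i++) {
        count = THREAD_BASIC_INFO_COUNT;
        kr = thread_info(threads[i], THREAD_BASIC_INFO,
                         (thread_info_t)&info, &count);
        if (kr == KERN_SUCCESS)
            printf("thread %u: user %ds %dus, system %ds %dus, suspend count %d\n",
                   i, info.user_time.seconds, info.user_time.microseconds,
                   info.system_time.seconds, info.system_time.microseconds,
                   info.suspend_count);

        // release the send right we received for this thread
        (void)mach_port_deallocate(mach_task_self(), threads[i]);
    }

    // the thread array arrives as out-of-line data -- deallocate it
    (void)vm_deallocate(mach_task_self(), (vm_address_t)threads,
                        thread_count * sizeof(thread_act_t));

    exit(0);
}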

A thread can send a port right to another thread, including a thread in another task, using Mach IPC. In particular, if a thread sends its containing task's kernel port to a thread in another task, a thread in the receiving task can control all threads in the sending task, since access to a task's kernel port implies access to the kernel ports of its threads.


7.2.5.2. Kernel Threads

We could call a Mach thread a kernel thread since it is the in-kernel representation of a user-space thread. As we will see in Section 7.3, all commonly available user-space thread abstractions on Mac OS X use one Mach thread per instance of the respective user threads. Another connotation of the term kernel thread applies to internal threads that the kernel runs for its own functioning. The following are examples of functions that the kernel runs as dedicated threads[10] to implement kernel functionality such as bootstrapping, scheduling, exception handling, networking, and file system I/O.

[10] Such threads are created within the kernel task.

  • processor_start_thread() [osfmk/kern/startup.c] is the first thread to execute on a processor.

  • kernel_bootstrap_thread() [osfmk/kern/startup.c] starts various kernel services during system startup and eventually becomes the page-out daemon, running vm_pageout() [osfmk/vm/vm_pageout.c]. The latter creates other kernel threads for performing I/O and for garbage collection.

  • idle_thread() [osfmk/kern/sched_prim.c] is the idle processor thread that runs looking for other threads to execute.

  • sched_tick_thread() [osfmk/kern/sched_prim.c] performs scheduler-related periodic bookkeeping functions.

  • thread_terminate_daemon() [osfmk/kern/thread.c] performs the final cleanup for terminating threads.

  • thread_stack_daemon() [osfmk/kern/thread.c] allocates stacks for threads that have been enqueued for stack allocation.

  • serial_keyboard_poll() [osfmk/ppc/serial_io.c] polls for input on the serial port.

  • The kernel's callout mechanism runs functions supplied to it as kernel threads.

  • The IOWorkLoop and IOService I/O Kit classes use IOCreateThread() [iokit/Kernel/IOLib.c], which is a wrapper around a Mach kernel-thread creation function, to create kernel threads (a sketch of its use follows this list).

  • The kernel's asynchronous I/O (AIO) mechanism creates worker threads to handle I/O requests.

  • audit_worker() [bsd/kern/kern_audit.c] processes the queue of audit records by writing them to the audit log file or otherwise removing them from the queue.

  • mbuf_expand_thread() [bsd/kern/uipc_mbuf.c] runs to add more free mbufs by allocating an mbuf cluster.

  • ux_handler() [bsd/uxkern/ux_exception.c] is the Unix exception handler that converts Mach exceptions to Unix signal and code values.

  • nfs_bind_resv_thread() [bsd/nfs/nfs_socket.c] runs to handle bind requests on reserved ports from unprivileged processes.

  • dlil_input_thread() [bsd/net/dlil.c] services mbuf input queues of network interfaces, including that of the loopback interface, by ingesting network packets via dlil_input_packet() [bsd/net/dlil.c]. It also calls proto_input_run() to perform protocol-level packet injection.

  • dlil_call_delayed_detach_thread() [bsd/net/dlil.c] performs delayed (safe) detachment of protocols, filters, and interface filters.

  • bcleanbuf_thread() [bsd/vfs/vfs_bio.c] performs file system buffer laundry: it cleans dirty buffers enqueued on a to-be-cleaned queue by writing them to disk.

  • bufqscan_thread() [bsd/vfs/vfs_bio.c] balances a portion of the buffer queues by issuing cleanup of buffers and by releasing cleaned buffers to the queue of empty buffers.
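As an illustration of the I/O Kit wrapper mentioned in the list above, the following sketch shows how a driver might create a kernel thread with IOCreateThread(). The worker function, its argument, and the IOLog() messages are hypothetical, and the sketch assumes the 10.4-era IOLib interfaces, in which an I/O Kit thread terminates itself by calling IOExitThread().

// A hypothetical driver fragment: spawn a worker kernel thread.

#include <IOKit/IOLib.h>

static void
my_worker(void *arg)
{
    // ... perform background work on behalf of the driver using 'arg' ...
    IOLog("my_worker: running in its own kernel thread\n");

    // an I/O Kit thread terminates itself when its work is done
    IOExitThread();
}

static void
start_worker(void *context)
{
    // IOCreateThread() wraps Mach kernel-thread creation; the new thread
    // starts executing my_worker(context) within the kernel task
    IOThread thread = IOCreateThread(my_worker, context);

    if (thread == NULL)
        IOLog("start_worker: could not create kernel thread\n");
}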

Figure 7-11 shows the high-level kernel functions involved in kernel thread creation. Of these, kernel_thread() and IOCreateThread() should be used from Mach and the I/O Kit, respectively.

Figure 7-11. Functions for creating kernel threads


As we will see in Section 7.3, various user-space application environments in Mac OS X use their own thread abstractions, all of which are eventually layered atop Mach threads.

7.2.6. Thread-Related Abstractions

Let us now look at a few related abstractions that are relevant in the context of Mach threads as implemented in the Mac OS X kernel. In this section, we will discuss the following terms:

  • Remote procedure call (RPC)

  • Thread activation

  • Thread shuttle

  • Thread migration

  • Continuations

7.2.6.1. Remote Procedure Call

Since Mach is a communication-oriented kernel, the remote procedure call (RPC) abstraction is fundamental to Mach's functioning. We define RPC to be the procedure call abstraction when the caller and the callee are in different tasks; that is, the procedure is remote from the caller's standpoint. Although Mac OS X uses only kernel-level RPC between local tasks, the concept is similar even if RPC participants are on different machines. In a typical RPC scenario, execution (control flow) temporarily moves to another location (that corresponding to the remote procedure) and later returns to the original location, akin to a system call. The caller (client) marshals any parameters together into a message and sends the message to the service provider (server). The service provider unmarshals the message, that is, separates it into its original pieces, and processes it as a local operation.

7.2.6.2. Activation and Shuttle

Prior to Mac OS X 10.4, a kernel thread was divided into two logical parts: the activation and the shuttle. The motivation behind the division was to have one part that provides explicit control over the thread (the activation) and another part that is used by the scheduler (the shuttle). The thread activation represents the execution context of a thread. It remains attached to its task and thus always has a fixed, valid task pointer. Until the activation is terminated, it remains on the task's activation stack.[11] The thread shuttle is the scheduled entity corresponding to a thread. At a given time, a shuttle operates within some activation. However, a shuttle may migrate during RPC because of resource contention. It contains scheduling, accounting, and timing information. It also contains messaging support. While a shuttle uses an activation, it holds a reference on the activation.

[11] A thread, as a logical flow of control, is represented by a stack of activations in a task.

Note that the activation is closer to the popular notion of a thread; it is the externally visible thread handle. For example, a thread's programmer-visible kernel port internally translates to a pointer to its activation. In contrast, the shuttle is the internal part of a thread. Within the kernel, current_act() returns a pointer to the current activation, whereas current_thread() returns a pointer to the shuttle.

The shuttle/activation dual abstraction has undergone implementation changes across various Mac OS X versions. In Mac OS X 10.0, a thread's implementation consisted of two primary data structures: the thread_shuttle structure and the thread_activation structure, with the thread data type (thread_t) being a pointer to the thread_shuttle structure. The activation could be accessed from the shuttle, and thus a thread_t represented a thread in its entirety. Figure 7-12 shows the structures in Mac OS X 10.0.

Figure 7-12. The shuttle and thread data structures in Mac OS X 10.0


In later versions of Mac OS X, the thread data structure subsumed both the shuttle and the activation (from a syntactic standpoint) into a single structure. This is shown in Figure 7-13.

Figure 7-13. The shuttle and the thread within a single structure in Mac OS X 10.3


In Mac OS X 10.4, the distinction between shuttle and activation is not present. Figure 7-14 shows the implementation of current_thread() and current_act() in Mac OS X 10.3 and 10.4. In the x86 version of Mac OS X, a pointer to the current (active) thread is stored as a field of the per-cpu data structure[12] (struct cpu_data [osfmk/i386/cpu_data.h]).

[12] The GS segment register is set up such that it is based at the per-cpu data structure.

Figure 7-14. Retrieving the current thread (shuttle) and the current activation on Mac OS X 10.3 and 10.4

// osfmk/ppc/cpu_data.h (Mac OS X 10.3)

extern __inline__ thread_act_t
current_act(void)
{
    thread_act_t act;
    __asm__ volatile("mfsprg %0,1" : "=r" (act));
    return act;
};
...
#define current_thread() current_act()->thread

// osfmk/ppc/cpu_data.h (Mac OS X 10.4)

extern __inline__ thread_t
current_thread(void)
{
    thread_t result;
    __asm__ volatile("mfsprg %0,1" : "=r" (result));
    return (result);
}

// osfmk/ppc/machine_routines_asm.s (Mac OS X 10.4)

/*
 * thread_t current_thread(void)
 * thread_t current_act(void)
 *
 * Return the current thread for outside components.
 */
            .align  5
            .globl  EXT(current_thread)
            .globl  EXT(current_act)

LEXT(current_thread)
LEXT(current_act)

            mfsprg  r3,1
            blr

7.2.6.3. Thread Migration

In the preceding discussion of the shuttle and the activation, we alluded to the concept of thread migration. The migrating threads model was developed at the University of Utah. The term migration refers to the way control is transferred between a client and a server during an RPC. For example, in the case of static threads, an RPC between a client and a server involves a client thread and an unrelated, independent server thread. The sequence of events after the client initiates an RPC involves multiple context switches, among other things. With the split thread model, rather than blocking the client thread on its RPC kernel call, the kernel can migrate it so that it resumes execution in the server's code. Although some context switch is still required (in particular, that of the address space, the stack pointer, and perhaps a subset of the register state), no longer are two entire threads involved. There also is no context switch from a scheduling standpoint. Moreover, the client uses its own processor time while it executes in the server's code; in this sense, thread migration is a priority inheritance mechanism.

This can be compared to the Unix system call model, where a user thread (or process) migrates into the kernel during a system call, without a full-fledged context switch.

Procedural IPC

Mach's original RPC model is based on message-passing facilities, wherein distinct threads read messages and write replies. We have seen that access rights are communicated in Mach through messaging as well. Operating systems have supported procedural IPC in several forms: gates in Multics, lightweight RPC (LRPC) in Taos, doors in Solaris, and event pairs in Windows NT are examples of cross-domain procedure call mechanisms. The thread model in Sun's Spring system had the concept of a shuttle, which was the true kernel-schedulable entity that supported a chain of application-visible threads, analogous to the activations we discussed.


7.2.6.4. Continuations

The Mac OS X kernel uses a per-thread kernel stack size of 16KB (KERNEL_STACK_SIZE, defined in osfmk/mach/ppc/vm_param.h). As the number of threads in a running system increases, the memory consumed by kernel stacks alone can become unreasonable, depending on available resources. Some operating systems multiplex several user threads onto one kernel thread, albeit at the cost of concurrency, since those user threads cannot be scheduled independently of each other. Recall that a Mach thread (or an activation, when a thread is divided into a shuttle and an activation) is bound to its task for the lifetime of the thread, and that each nontrivial task has at least one thread. Therefore, the number of threads in the system will be at least as many as the number of tasks.

Operating systems have historically used one of two models for kernel execution: the process model or the interrupt model. In the process model, the kernel maintains a stack for every thread. When a thread executes within the kernel, say, because of a system call or an exception, its dedicated kernel stack is used to track its execution state. If the thread blocks in the kernel, no explicit state saving is required, as the state is captured in the thread's kernel stack. The simplifying effect of this approach is offset by its higher resource requirements and the fact that the machine state is harder to evaluate if one were to analyze it for optimizing the transfer of control between threads. In the interrupt model, the kernel treats system calls and exceptions as interrupts. A per-processor kernel stack is used for all threads' kernel execution. This requires kernel-blocking threads to explicitly save their execution state somewhere. When resuming the thread at a later point, the kernel will use the saved state.

The typical UNIX kernel used the process model, and so did early versions of Mach. The concept of continuations was used in Mach 3 as a middle-ground approach that gives a blocking thread an option to use either the interrupt or the process model. Mac OS X continues to use continuations. Consequently, a blocking thread in the Mac OS X kernel can choose how to block. The thread_block() function [osfmk/kern/sched_prim.c] takes a single argument, which can be either THREAD_CONTINUE_NULL [osfmk/kern/kern_types.h] or a continuation function.

// osfmk/kern/kern_types.h

typedef void (*thread_continue_t)(void *, wait_result_t);
#define THREAD_CONTINUE_NULL    ((thread_continue_t) 0)
...


thread_block() calls thread_block_reason() [osfmk/kern/sched_prim.c], which calls thread_invoke() [osfmk/kern/sched_prim.c] to perform a context switch and start executing a new thread selected for the current processor to run. thread_invoke() checks whether a valid continuation was specified. If so, it will attempt[13] to hand off the old thread's kernel stack to the new thread. Therefore, as the old thread blocks, its stack-based context is discarded. When the original thread resumes, it will be given a new kernel stack, on which the continuation function will execute. The thread_block_parameter() [osfmk/kern/sched_prim.c] variant accepts a single parameter that thread_block_reason() stores in the thread structure, from where it is retrieved and passed to the continuation function when the latter runs.

[13] A thread with a real-time scheduling policy does not hand off its stack.

Threading code must be explicitly written to use continuations. Consider the example shown in Figure 7-15: someFunc() needs to block for some event, after which it calls someOtherFunc() with an argument. In the interrupt model, the thread must save the argument somewhere, perhaps in a structure that will persist as the thread blocks and resumes (the thread structure itself can be used for this purpose in some cases), and will block using a continuation. Note that a thread uses the assert_wait() primitive to declare the event it wishes to wait on and then calls thread_block() to actually wait.

Figure 7-15. Blocking with and without continuations

someOtherFunc(someArg)
{
    ...
    return;
}

#ifdef USE_PROCESS_MODEL

someFunc(someArg)
{
    ...

    // Assert that the current thread is about to block until the
    // specified event occurs
    assert_wait(...);

    // Pause to let whoever catch up

    // Relinquish the processor by blocking "normally"
    thread_block(THREAD_CONTINUE_NULL);

    // Call someOtherFunc() to do some more work
    someOtherFunc(someArg);

    return;
}

#else // interrupt model, use continuations

someFunc(someArg)
{
    ...

    // Assert that the current thread is about to block until the
    // specified event occurs
    assert_wait(...);

    // "someArg", and any other state that someOtherFunc() will require, must
    // be saved somewhere, since this thread's kernel stack will be discarded

    // Pause to let whoever catch up

    // Relinquish the processor using a continuation
    // someOtherFunc() will be called when the thread resumes
    thread_block(someOtherFunc);

    /* NOTREACHED */
}

#endif

A function specified as a continuation cannot return normally. It may call only other functions or continuations. Moreover, a thread that uses a continuation must save any state that might be needed after resuming. This state may be saved in a dedicated space that the thread structure or an associated structure might have, or the blocking thread may have to allocate additional memory for this purpose. The continuation function must know how the blocking thread stores this state. Let us consider two examples of continuations being used in the kernel.

The implementation of the select() system call on Mac OS X uses continuations. Each BSD thread, that is, a thread belonging to a task with an associated BSD proc structure, has an associated uthread structure, which is akin to the traditional u-area. The uthread structure stores miscellaneous information, including system call parameters and results. It also has space for storing saved state for the select() system call. Figure 7-16 shows this structure's relevant aspects.

Figure 7-16. Continuation-related aspects of the BSD uthread structure

// bsd/sys/user.h

struct uthread {
    ...
    // saved state for select()
    struct _select {
        u_int32_t *ibits, *obits; // bits to select on
        uint       nbytes;        // number of bytes in ibits and obits
        ...
    } uu_select;

    union {
        // saved state for nfsd
        int uu_nfs_myiods;

        // saved state for kevent_scan()
        struct _kevent_scan {
            kevent_callback_t call;     // per-event callback
            kevent_continue_t cont;     // whole call continuation
            uint64_t          deadline; // computed deadline for operation
            void             *data;     // caller's private data
        } ss_kevent_scan;

        // saved state for kevent()
        struct _kevent {
            ...
            int               fd;       // file descriptor for kq
            register_t       *retval;   // for storing the return value
            ...
        } ss_kevent;
    } uu_state;

    int (* uu_continuation)(int);
    ...
};

When a thread calls select() [bsd/kern/sys_generic.c] for the first time, select() allocates space for the descriptor set bit fields. On subsequent invocations, it may reallocate this space if it is not sufficient for the current request. It then calls selprocess() [bsd/kern/sys_generic.c], which, depending on the conditions, calls tsleep1() [bsd/kern/kern_synch.c], with selcontinue() [bsd/kern/sys_generic.c] specified as the continuation function. Figure 7-17 shows how the select() implementation uses continuations.

Figure 7-17. The use of continuations in the implementation of the select() system call

// bsd/kern/sys_generic.c

int
select(struct proc *p, struct select_args *uap, register_t *retval)
{
    ...
    thread_t        th_act;
    struct uthread *uth;
    struct _select *sel;
    ...

    th_act = current_thread();
    uth = get_bsdthread_info(th_act);
    sel = &uth->uu_select;
    ...

    // if this is the first select by the thread, allocate space for bits
    if (sel->nbytes == 0) {
        // allocate memory for sel->ibits and sel->obits
    }

    // if the previously allocated space for the bits is smaller than
    // is requested, reallocate
    if (sel->nbytes < (3 * ni)) {
        // free and reallocate
    }

    // do select-specific processing
    ...

continuation:
    return selprocess(error, SEL_FIRSTPASS);
}

int
selcontinue(int error)
{
    return selprocess(error, SEL_SECONDPASS);
}

int
selprocess(int error, int sel_pass)
{
    // various conditions and processing
    ...
    error = tsleep1(NULL, PSOCK|PCATCH, "select", sel->abstime, selcontinue);
    ...
}

tsleep1() [bsd/kern/kern_synch.c] calls _sleep() [bsd/kern/kern_synch.c], which saves the relevant state in the uthread structure and blocks using the _sleep_continue() continuation. The latter retrieves the saved state from the uthread structure.

// bsd/kern/kern_synch.c

static int
_sleep(caddr_t    chan,
       int        pri,
       char      *wmsg,
       u_int64_t  abstime,
       int      (*continuation)(int),
       lck_mtx_t *mtx)
{
    ...
    if ((thread_continue_t)continuation != THREAD_CONTINUE_NULL) {
        ut->uu_continuation = continuation;
        ut->uu_pri = pri;
        ut->uu_timo = abstime ? 1 : 0;
        ut->uu_mtx = mtx;
        (void)thread_block(_sleep_continue);
        /* NOTREACHED */
    }
    ...
}


Let us look at another example, that of nfsiod, which implements a throughput-improving optimization for NFS clients. nfsiod is a local NFS asynchronous I/O server that runs on an NFS client machine to service asynchronous I/O requests to its server. It is a user-space program, with the skeletal structure shown in Figure 7-18.

Figure 7-18. The skeleton of the nfsiod program

// nfsiod.c

int
main(argc, argv)
{
    ...
    for (i = 0; i < num_servers; i++) {
        ...
        rv = pthread_create(&thd, NULL, nfsiod_thread, (void *)i);
        ...
    }
}

...

void *
nfsiod_thread(void *arg)
{
    ...
    if ((rv = nfssvc(NFSSVC_BIOD, NULL)) < 0) {
        ...
    }
    ...
}

nfssvc() [bsd/nfs/nfs_syscalls.c] is an NFS system call used by both nfsiod and nfsd, the NFS daemon, to enter the kernel. Thereafter, nfsiod and nfsd essentially become in-kernel servers. In the case of nfsiod, nfssvc() dispatches to nfssvc_iod() [bsd/nfs/nfs_syscalls.c].

// bsd/nfs/nfs_syscalls.c

int
nfssvc(proc_t p, struct nfssvc_args *uap, __unused int *retval)
{
    ...
    if (uap->flag & NFSSVC_BIOD)
        error = nfssvc_iod(p);
    ...
}


nfssvc_iod() determines the index of the would-be nfsiod in a global array of such daemons. It saves the index in the uu_nfs_myiod field of the uu_state union within the uthread structure. Thereafter, it calls nfssvc_iod_continue(), which is nfsiod's continuation function.

// bsd/nfs/nfs_syscalls.c

static int
nfssvc_iod(__unused proc_t p)
{
    register int    i, myiod;
    struct uthread *ut;

    // assign my position or return error if too many already running
    myiod = -1;
    for (i = 0; i < NFS_MAXASYNCDAEMON; i++)
        ...

    // stuff myiod into uthread to get off local stack for continuation
    ut = (struct uthread *)get_bsdthread_info(current_thread());
    ut->uu_state.uu_nfs_myiod = myiod; // stow away for continuation

    nfssvc_iod_continue(0);
    /* NOTREACHED */

    return (0);
}


nfssvc_iod_continue() retrieves the daemon's index from its saved location in the uthread structure, performs the necessary processing, and blocks on the continuation, that is, on itself.

// bsd/nfs/nfs_syscalls.c

static int
nfssvc_iod_continue(int error)
{
    register struct nfsbuf *bp;
    register int            i, myiod;
    struct nfsmount        *nmp;
    struct uthread         *ut;
    proc_t                  p;

    // real myiod is stored in uthread, recover it
    ut = (struct uthread *)get_bsdthread_info(current_thread());
    myiod = ut->uu_state.uu_nfs_myiod;
    ...

    for (;;) {
        while (...) {
            ...
            error = msleep0((caddr_t)&nfs_iodwant[myiod],
                            nfs_iod_mutex,
                            PWAIT | PCATCH | PDROP,
                            "nfsidl",
                            0,
                            nfssvc_iod_continue);
            ...
        }
        if (error) {
            ...
            // must use this function to return to user
            unix_syscall_return(error);
        }
        ...
    }
}


Continuations are most useful in cases where little or no state needs to be saved for a thread when it blocks. Other examples of the use of continuations in the Mac OS X kernel include continuations for per-processor idle threads, the scheduler tick thread, the swap-in thread, and the page-out daemon.



