Section 9.18. Synchronization | Mac OS X Internals: A Systems Approach

9.18. Synchronization

Mac OS X provides several synchronization mechanisms, two of which we have already come across in this chapter, namely, POSIX and System V semaphores. Figure 969 shows the important kernel-level and user-level synchronization mechanisms. Frameworks such as Core Foundation and Foundation provide their own wrappers around some of the mechanisms shown in Figure 969.

Figure 969. An overview of Mac OS X synchronization mechanisms

In general, a synchronization mechanism is based on a hardware implementation of a multiprocessor lock. Depending on a specific locking mechanism's semantics, along with the lock's associated storage, there may be additional data structures, such as a queue of threads waiting for the lock.

Typical operations required to implement some form of synchronization include atomic compare-and-store (also called test-and-set) and compare-and-swap. For example, given a hardware implementation of the test-and-set operation, we can treat a word of storage as a simple lock. We initialize the word to 0 (unlocked) and define the lock operation as a successful test-and-set operation that sets the word's value to 1. Conversely, the unlock operation sets the word's value to 0. A test-and-set operation also returns the old value, so the thread attempting to acquire a lock will know if it succeeded. If the lock acquisition attempt failed, what the thread does depends on the nature of the locking mechanism. Two obvious options are that the thread keeps trying actively and that the thread sleeps.

Atomic memory access is required to maintain a consistent and ordered storage state. An atomic access is always performed in its entirety, with no externally visible suboperations. Thus, two or more atomic memory accesses will never overlapthey will always be serialized. Moreover, the order in which memory operations are completed and the order in which they are seen by other processors (in a multiprocessor system) do matter. Therefore, besides atomicity of memory access, we also need to be able to control the order of memory operations. The PowerPC architecture provides special hardware instructions for these purposes.

We discussed the implementation of an atomic compare-and-store function in Section 3.5.2. That function used the lwarx/stwcx. pair of instructions, which can be used to atomically write a memory word. The 64-bit PowerPC 970FX also provides ldarx/stdcx. to atomically write a double-word of memory. The lowest-level synchronization mechanisms in Mac OS X use these instructions as building blocks. Other relevant PowerPC instructions are the following:

sync This instruction is used to synchronize memory with respect to other processors and memory access mechanisms. Executing this instruction ensures that all instructions that appear before it are (effectively) completed before the sync instruction completes. Moreover, no instructions that appear after the sync are executed until the sync completes. It can be used to ensure that the results of all stores into a shared storage locationsay, one that corresponds to a mutexare seen by other processors before performing a store to unlock that mutex. The sync instruction is rather heavy-duty in that it may take a substantial (and variable) amount of time to execute. The eieio instruction is often a better alternative.
eieio The eieio (enforce-in-order-execution-of-I/O) instruction is similar to sync but enforces a weaker orderingit may itself complete before memory accesses caused by instructions that appear before it have completed with respect to main memory. However, it does ensure that the accesses have completed before any instructions that appear after the eieio instruction can access main memory. Thus, eieio can be used to enforce memory ordering without stalling dispatch of further instructions.
lwsync This is a lightweight version of sync available on the 970FX, on which it is faster than eieio. It cannot, however, be used in place of eieio under all circumstances.
isync This instruction ensures that all instructions before it have completed before it itself completes. Until isync completes, the processor does not initiate any instructions that appear after the isync. Moreover, when isync completes, any prefetched instructions are discarded. Note that isync waits only for the preceding instructions to completenot for the completion of any memory accesses caused by the preceding instructions. isync does not affect any other processor or another processor's caches.

With the understanding that the atomic access and memory-ordering instructions are directly or indirectly used as primitives in all Mac OS X synchronization mechanisms, let us look at some of the individual mechanisms shown in Figure 969.

9.18.1. Interfaces for Atomic Operations

The system library provides functions for performing a variety of atomic operations, ordering memory accesses through a memory barrier, and using spinlocks. These functions are actually implemented in the kernel but made available to user space through the commpage mechanism. The implementations reside in osfmk/ppc/commpage/atomic.s and osfmk/ppc/commpage/spinlock.s.

9.18.2. Low-Level Locking

The Mach portion of the kernel provides the following primary types of low-level locks (or lock protocols) that can be held by threads^[18]:

^[18] A lock holder in Mach is always a thread.

Spinlocks (or simple locks)
Mutexes
Read/write locks

9.18.2.1. Spinlocks

A spinlock is a simple locking primitive: It protects a shared resource by making the lock-holder thread busy-wait, or "spin" (in a tight loop). Since a thread holding a spinlock causes a processor to be tied up, it is important not to hold such locks for too long. In general, if a resource is accessed only briefly, it is a likely candidate for protection through a spinlock. Moreover, the use of a spinlock is different on a multiprocessor system compared with a uniprocessor system. On the former, a thread could busy-wait on one processor while the holder of a spinlock uses the protected resource on another processor. On a uniprocessor, a tight loopif not preemptedwill spin forever, since the holder of the lock will never get a chance to run and free the lock!

Mach uses simple locks to protect most of the kernel data structures. It provides three flavors of spinlocks: hw_lock (hw_lock_t), usimple (usimple_lock_t), and simple (lck_spin_t). Only the latter is exported to loadable kernel extensions.

An hw_lock is the lowest-level locking abstraction provided by the kernel. The following primary functions are exported by this lock package:

void         hw_lock_init(hw_lock_t);              [osfmk/ppc/hw_lock.s] void         hw_lock_lock(hw_lock_t);              [osfmk/ppc/hw_lock.s] void         hw_lock_unlock(hw_lock_t);            [osfmk/ppc/hw_lock.s] unsigned int hw_lock_to(hw_lock_t, unsigned int);  [osfmk/ppc/hw_lock.s] unsigned int hw_lock_try(hw_lock_t);               [osfmk/ppc/hw_lock.s] unsigned int hw_lock_held(hw_lock_t);              [osfmk/ppc/hw_lock.s]

The hw_lock_t data type is declared in osfmk/ppc/hw_lock_types.h.

// osfmk/ppc/hw_lock_types.h struct hslock {     int lock_data; }; typedef struct hslock hw_lock_data_t, *hw_lock_t;

A lock attempt for an hw_lock lock can be madethrough hw_lock_to()with a timeout value specified as a number of ticks of the Timebase Register. The lock spins for up to the duration of the timeout. The locking function even disables interruptions for up to 128 ticks of the Timebase Register.

The usimple variant (the "u" stands for uniprocessor) has two implementations: a portable C implementation [osfmk/ppc/locks_ppc.c] built atop hw_lock and an assembly-language implementation [osfmk/ppc/hw_lock.s]. The portable implementation also supports interfaces for debugging and statistics gathering. Unlike a simple lock, which disappears on a uniprocessor, a usimple lock provides actual locking on a uniprocessor. Acquiring a usimple lock returns with preemption disabled, whereas releasing a usimple lock reenables preemption.

The simple lock variant is the primary spin-locking mechanism in Mac OS X for multiprocessor systems. The following primary functions are exported by this lock package:

lck_spin_t *lck_spin_alloc_init(lck_grp_t *grp, lck_attr_t *attr); void        lck_spin_free(lck_spin_t *lck, lck_grp_t *grp); void        lck_spin_init(lck_spin_t *lck, lck_grp_t *grp, lck_attr_t *attr); void        lck_spin_destroy(lck_spin_t *lck, lck_grp_t *grp); void        lck_spin_lock(lck_spin_t *lck); void        lck_spin_unlock(lck_spin_t *lck); wait_result_t lck_spin_sleep(lck_spin_t         *lck,                              lck_sleep_action_t  lck_sleep_action,                              event_t             event,                              wait_interrupt_t    interruptible); wait_result_t lck_spin_sleep_deadline(lck_spin_t         *lck,                                       lck_sleep_action_t  lck_sleep_action,                                       event_t             event,                                       wait_interrupt_t    interruptible,                                       uint64_t            deadline);

When preemption is disabled, the holder of a spinlock must notdirectly or indirectlyacquire a blocking lock (such as a mutex or a semaphore). Doing so will result in a kernel panic.

9.18.2.2. Mutexes

Mach mutexes are blocking mutual-exclusion locks. If a thread attempts to acquire a mutex that is currently locked, it will relinquish the processor and sleep until the mutex is available. In doing so, the thread will also give up any scheduling time quantum that it may have remaining. Although a thread is permitted to block^[19] while holding a Mach mutex, the mutexes are not recursive: If a thread attempts to acquire a mutex that it already holds, it will cause a kernel panic.

^[19] The safety of blocking still depends on whether blocking is allowed in the given context and whether the code is written correctly.

The mutex package exports the following functions, whose prototypes are listed in osfmk/kern/locks.h:

lck_mtx_t     lck_mtx_alloc_init(lck_grp_t *grp, lck_attr_t *attr); void          lck_mtx_free(lck_mtx_t *lck, lck_grp_t *grp); void          lck_mtx_init(lck_mtx_t *lck, lck_grp_t *grp, lck_attr_t *attr); void          lck_mtx_destroy(lck_mtx_t *lck, lck_grp_t *grp); void          lck_mtx_lock(lck_mtx_t *lck); void          lck_mtx_unlock(lck_mtx_t *lck); wait_result_t lck_mtx_assert(lck_mtx_t *lck, int type); wait_result_t lck_mtx_sleep(lck_mtx_t          *lck,                             lck_sleep_action_t  lck_sleep_action,                             event_t             event,                             wait_interrupt_t    interruptible); wait_result_t lck_mtx_sleep_deadline(lck_mtx_t          *lck,                                      lck_sleep_action_t  lck_sleep_action,                                      event_t             event,                                      wait_interrupt_t    interruptible                                      uint64_t            deadline);

The mutex package is implemented in osfmk/ppc/locks_ppc.c, osfmk/ppc/hw_lock.s, and osfmk/kern/locks.c. The lck_mtx_t data type is declared in osfmk/ppc/locks.h.

9.18.2.3. Read/Write Locks

Mach read/write locks are blocking synchronization locks that permit multiple simultaneous readers or a single writer. Before a writer can acquire the lock for writing, it must wait until all readers have released the lock. Moreover, if a writer is already waiting on a lock, a new reader attempting to get the read lock will block until the writer has acquired and released the lock. It is possible to downgrade (write read) or upgrade (read write) a lock. A read-to-write upgrade is favored over a new writer.

The read/write locks package exports the following functions, whose prototypes are listed in osfmk/kern/locks.h:

lck_rw_t *lck_rw_alloc_init(lck_grp_t *grp, lck_attr_t *attr); void      lck_rw_free(lck_rw_t *lck, lck_grp_t *grp); void      lck_rw_init(lck_rw_t *lck, lck_grp_t *grp, lck_attr_t *attr); void      lck_rw_destroy(lck_rw_t *lck, lck_grp_t *grp); void      lck_rw_lock(lck_rw_t *lck, lck_rw_type_t lck_rw_type); void      lck_rw_unlock(lck_rw_t *lck, lck_rw_type_t lck_rw_type); void      lck_rw_lock_shared(lck_rw_t *lck); void      lck_rw_unlock_shared(lck_rw_t *lck); void      lck_rw_lock_exclusive(lck_rw_t *lck); void      lck_rw_unlock_exclusive(lck_rw_t *lck); wait_result_t lck_rw_sleep(lck_rw_t           *lck,                            lck_sleep_action_t  lck_sleep_action,                            event_t             event,                            wait_interrupt_t    interruptible); wait_result_t lck_rw_sleep_deadline(lck_rw_t           *lck,                                     lck_sleep_action_t  lck_sleep_action,                                     event_t             event,                                     wait_interrupt_t    interruptible,                                     uint64_t            deadline);

The implementation of the read/write locks package is split across the same files as those of the mutex package.

9.18.2.4. Lock Groups and Attributes

As we saw in the previous three sections, spinlocks, mutexes, and read/write locks all provide similar interfaces. In particular, the functions in these interfaces deal with lock groups (lck_grp_t) and lock attributes (lck_attr_t). A lock group is a container for one or more locksthat is, it names a set of locks. It is allocated separately, after which it can be used to group together lockssay, based on the purpose the locks are used for. Every lock belongs to exactly one group.

Lock attributes are flagsa collection of bitsthat qualify a lock. Examples of lock attributes are LCK_ATTR_NONE (no attributes specified) and LCK_ATTR_DEBUG (lock debugging enabled). A lock group also has its own attributes (lck_grp_attr_t). Figure 970 shows an example of using the lock interfaces.

Figure 970. Using locks in the kernel

lck_grp_attr_t *my_lock_group_attr; // lock group attributes lck_grp_t      *my_lock_group       // lock group lck_attr_t     *my_lock_attr        // lock attributes lck_mtx_t      *my_mutex; void my_init_locking() // set up locks {     ...     // allocate lock group attributes and the lock group     my_lock_group_attr = lck_grp_attr_alloc_init();     my_lock_group = lck_grp_alloc_init("my-mutexes", my_lock_group_attr);     my_lock_attr = lck_attr_alloc_init(); // allocate lock attribute     lck_attr_setdebug(my_lock_attr);      // enable lock debugging     my_mutex = lck_mtx_alloc_init(my_lock_group, my_lock_attr);     ... } void my_fini_locking() // tear down locks {     lck_mtx_free(my_mutex, my_lock_group);     lck_attr_free(my_lock_attr);     lck_grp_free(my_lock_group);     lck_grp_attr_free(my_lock_group_attr); }

9.18.3. BSD Condition Variables

The BSD portion of the kernel implements the msleep(), wakeup(), and wakeup_one() functions [bsd/kern/kern_synch.c], which provide the semantics of condition variables, with an additional feature that a timeout value can be specified.

9.18.4. Mach Lock Sets

Mach provides an interface for creating and using lock sets, where a set contains one or more ulocks. The contents of a ulock data structure (struct ulock [osfmk/kern/sync_lock.h]) include a mutex and a wait queue of blocked threads. Figure 971 shows examples of routines in the lock set interface.

Figure 971. The Mach lock set interface

// create a lock set with nlocks ulocks kern_return_t lock_set_create(task_t task, lock_t lock_set, int nlocks, int policy); // destroy lock set and all of its associated locks // any blocked threads will unblock and receive KERN_LOCK_SET_DESTROYED kern_return_t lock_set_destroy(task_t task, lock_set_t lock_set); // acquire access rights to the given lock in the lock set kern_return_t lock_acquire(lock_set_t lock_set, int lock_id); // release access rights to the given lock in the lock set kern_return_t lock_release(lock_set_t lock_set, int lock_id); // hand off ownership of lock to an anonymous accepting thread kern_return_t lock_handoff(lock_set_t lock_set, int lock_id); // accept an ownership handoff from an anonymous sending thread // caller will block if nobody is waiting to hand off the lock // at most one thread can wait to accept handoff of a given lock kern_return_t lock_handoff_accept(lock_set_t lock_set, int lock_id); // mark the internal state of the lock as stable // the state destabilizes when a lock-holder thread terminates kern_return_t lock_make_stable(lock_set_t lock_set, int lock_id);

9.18.5. Mach Semaphores

Besides the POSIX and System V semaphore interfaces that we have seen earlier, there is another semaphore interface available in user spaceMach semaphores. In fact, POSIX semaphores in Mac OS X are implemented atop Mach semaphores. Other parts of the kernel that use Mach semaphores include the IOCommandQueue, IOService, and IOGraphics classes in the I/O Kit.

A Mach semaphore is represented as a Mach port (semaphore_t) that names a kernel object of type IKOT_SEMAPHORE. The corresponding kernel structure is struct semaphore [osfmk/kern/sync_sema.h]. A new Mach semaphore is obtained by calling semaphore_create(), which returns with a send right naming the new semaphore.

kern_return_t semaphore_create(task_t task, semaphore_t *semaphore, int policy, int value);

The value argument to semaphore_create() specifies the initial value of the semaphore count, whereas the policy argument specifies the policy (e.g., SYNC_POLICY_FIFO) the kernel will use to select a thread to wake up from among multiple threads that are blocked on the semaphore.

Given a semaphore, semaphore_wait() decrements the semaphore count, blocking if the count is negative after decrementing. semaphore_signal() increments the semaphore count, scheduling a waiting thread to execute if the new count becomes non-negative. semaphore_signal_all() can be used to wake up all threads blocked on a semaphore, while resetting the semaphore count to zero. Finally, semaphore_signal_thread() can be used to signal a specific thread.

Figure 972 shows a program that demonstrates the use of Mach semaphores. The main thread creates two semaphoresboth with an initial value of 0and three threads. It calls semaphore_wait() tHRee times on one of the semaphores. Each thread calls semaphore_signal() on this semaphore as its first operation. Therefore, the main thread blocks until all three threads are ready. Each thread then calls semaphore_wait() on the other semaphore. Since the latter's value is 0, all threads will block. The main thread first wakes up a specific thread using semaphore_signal_thread() and then wakes up the remaining two threads using semaphore_signal_all().

Figure 972. Using Mach semaphores

// mach_semaphore.c #include <stdio.h> #include <unistd.h> #include <stdlib.h> #include <pthread.h> #include <mach/mach.h> #define OUT_ON_MACH_ERROR(msg, retval) \     if (kr != KERN_SUCCESS) { mach_error(msg ":" , kr); goto out; } #define PTHID() (unsigned long)(pthread_self()) #define SEMAPHORE_WAIT(s, n) \     { int i; for (i = 0; i < (n); i++) { semaphore_wait((s)); } } void * start_routine(void *semaphores) {     semaphore_t *sem = (semaphore_t *)semaphores;     semaphore_signal(sem[1]);     printf("thread: %lx about to decrement semaphore count\n", PTHID());     semaphore_wait(sem[0]);     printf("thread: %lx succeeded in decrementing semaphore count\n", PTHID());     semaphore_signal(sem[1]);     return (void *)0; } int main(void) {     pthread_t     pthread1, pthread2, pthread3;     semaphore_t   sem[2] = { 0 };     kern_return_t kr;     setbuf(stdout, NULL);     kr = semaphore_create(mach_task_self(), &sem[0], SYNC_POLICY_FIFO, 0);     OUT_ON_MACH_ERROR("semaphore_create", kr);     kr = semaphore_create(mach_task_self(), &sem[1], SYNC_POLICY_FIFO, 0);     OUT_ON_MACH_ERROR("semaphore_create", kr);     (void)pthread_create(&pthread1, (const pthread_attr_t *)0,                          start_routine, (void *)sem);     printf("created thread1=%lx\n", (unsigned long)pthread1);     (void)pthread_create(&pthread2, (const pthread_attr_t *)0,                          start_routine, (void *)sem);     printf("created thread2=%lx\n", (unsigned long)pthread2);     (void)pthread_create(&pthread3, (const pthread_attr_t *)0,                          start_routine, (void *)sem);     printf("created thread3=%lx\n", (unsigned long)pthread3);     // wait until all three threads are ready     SEMAPHORE_WAIT(sem[1], 3);     printf("main: about to signal thread3\n");     semaphore_signal_thread(sem[0], pthread_mach_thread_np(pthread3));     // wait for thread3 to sem_signal()     semaphore_wait(sem[1]);     printf("main: about to signal all threads\n");     semaphore_signal_all(sem[0]);     // wait for thread1 and thread2 to sem_signal()     SEMAPHORE_WAIT(sem[1], 2); out:     if (sem[0])         semaphore_destroy(mach_task_self(), sem[0]);     if (sem[1])         semaphore_destroy(mach_task_self(), sem[1]);     exit(kr); } $ gcc -Wall -o mach_semaphore mach_semaphore.c $ ./mach_semaphore created thread1=1800400 created thread2=1800800 created thread3=1800c00 thread: 1800400 about to decrement semaphore count thread: 1800800 about to decrement semaphore count thread: 1800c00 about to decrement semaphore count main: about to signal thread3 thread: 1800c00 succeeded in decrementing semaphore count main: about to signal all threads thread: 1800400 succeeded in decrementing semaphore count thread: 1800800 succeeded in decrementing semaphore count

Figure 973 shows the kernel data structure associated with a Mach semaphore. Note that the semaphore lock, which exists within the wait queue structure, is an hw_lock_t spinlock.

Figure 973. Internal structure of a Mach semaphore

// osfmk/kern/sync_sema.h typedef struct semaphore {     queue_chain_t     task_link;  // chain of semaphores owned by a task     struct wait_queue wait_queue; // queue of blocked threads and lock     task_t            owner;      // task that owns semaphore     ipc_port_t        port;       // semaphore port     int               ref_count;  // reference count     int               count;      // current count value     boolean_t         active;     // active status } Semaphore; // osfmk/mach/mach_types.h typedef struct semaphore *semaphore_t; // osfmk/kern/wait_queue.h typedef struct wait_queue {     unsigned int      wq_type    : 16, // the only public field                       wq_fifo    : 1,  // FIFO wakeup policy                       wq_isrepost: 1,  // is waitq preposted?                                  : 0;     hw_lock_data_t    wq_interlock;    // interlock     queue_data_t      wq_queue;        // queue of elements } WaitQueue;

9.18.6. Pthreads Synchronization Interfaces

The Pthreads library provides functions for using mutexes, condition variables, and read/write locks. The internal structures of these abstractions are as follows:

A Pthreads mutex includes two Mach semaphores, a spinlock, and other data.
A Pthreads condition variable internally includes a Mach semaphore, a Pthreads mutex, a spinlock, and other data.
A Pthreads read/write lock internally includes a pair of Pthreads condition variables, a Pthreads mutex, and other data.

The Pthreads library uses the spinlock implementation that the kernel makes available through the commpage mechanism.

9.18.7. Locking in the I/O Kit

The I/O Kit is the object-oriented driver subsystem of the xnu kernel. It provides synchronization primitives that are simple wrappers around the Mach primitives discussed in this chapter.

IOSimpleLock is a wrapper around Mach spinlocks (specifically, lck_spin_t). When used to synchronize between interrupt context and thread context, an IOSimpleLock should be locked with interrupts disabled. The I/O Kit provides IOSimpleLockLockDisableInterrupt as a metafunction that performs both operations. It also provides the corresponding inverse function, IOSimpleLockUnlockEnableInterrupt.
IOLock is a wrapper around Mach mutexes (lck_mtx_t).
IORecursiveLock is also a wrapper around Mach mutexes, along with a reference counter that allows one thread to lock it more than once (recursively). Note that if another thread is holding the recursive mutex, an attempt to lock it would still block.
IORWLock is a wrapper around Mach read/write locks (lck_rw_t).

Besides these, the I/O Kit supports a more sophisticated construct, the IOWorkLoop, which provides both implicit and explicit synchronization, among an extensive array of other features. We will discuss the IOWorkLoop and the I/O Kit in general in Chapter 10.

9.18.8. Funnels

The xnu kernel provides a synchronization abstraction called funnels to serialize access to the BSD portion of the kernel. In the simplest terms, an xnu funnel is a giant mutex with the special property that it gets automatically unlocked when the holding thread sleeps. Funnels were heavily used in the kernel before Mac OS X 10.4for example, in file systems and system call processing. Mac OS X 10.4 replaced the use of funnels with finer-grained locking in many but not all instancesthe kernel still provides funnels for backward compatibility and uses them in some portions that are not performance-critical.

Let us look at the background of funnels and how they are used in Mac OS X.

9.18.8.1. History

Funnels originated in the Digital UNIX operating system as a mechanism to help implement SMP-safe device drivers. A Digital UNIX funnel allowed device drivers to force execution onto a single processor. Therefore, a funneled device driver saw a single-processor environment even on an SMP system. There was no locking of resources or code blocksSMP resource protection was achieved as a side effect of an entire subsystem always running on a single processor. A device driver could be funneled by setting the d_funnel member of its device switch table entry data structure to the value DEV_FUNNEL. Using funnels degraded SMP performance, but then, no locking mechanism is without tradeoffs in preemption latency and performance. An important caveat in using Digital UNIX funnels was that a funneled driver's resources had to be self-contained if they were to be protected by the funnel. If the driver shared resources with the kernel or with another driver, you still had to use another locking mechanism to protect the integrity of those resources. Moreover, the kernel had only one funnel, which was tied to the primary processor.

Digital UNIX funnels were a poor man's way of making a driver SMP-safe transitionally, while the developer worked on making the driver really SMP-safe.

9.18.8.2. Funnels in Mac OS X

We have seen that the xnu kernel is a combination of a few very different components. In particular, Mac OS X file system and networking support comes largely from the kernel's BSD portion. In the traditional BSD architecture, the kernel is logically divided into a top half and a bottom half. When a user thread makes a system call, the top half runs either until it is done or until it is blocked, which can occur when the kernel is waiting on a resource. The bottom half is invoked to handle hardware interruptsit runs synchronously with respect to the interrupt source. Since hardware interrupts have higher priority than threads in the top half, a thread in the top half cannot assume that it will not be preempted by the lower half. Historically, the top half synchronizes with the bottom half by disabling interrupts. Some newer BSD-flavored kernels use mutexes to protect data structures that both halves may try to access concurrently.

Mac OS X's bottom half is not executed in the context of a hardware interrupt, as an interrupt would simply cause an I/O Kit work-loop thread in the kernel to wake up, which would actually run the bottom half. This means that disabling interrupts is no longer a viable synchronization approach because the top and bottom halves in xnu could be running concurrentlyas threads on different processors in a multiprocessor system. In such situations where access to the BSD portion of xnu must be serialized, Mac OS Xdepending on the kernel versionuses funnels as a cooperative serialization mechanism.

Phasing Out Funnels

xnu funnels are implemented differently from Digital UNIX funnels. Notably, there can be multiple funnels and they can run on any processor, not just the primary processor. However, a thread holding a funnel on one processor cannot take that funnel on another processor in an SMP system. Another way of looking at this is that any code that runs under a funnel becomes implicitly single-threaded.

Nevertheless, the reason for the existence of funnels on Mac OS X is similar to that on Digital UNIXthat is, to provide a transitional mechanism for making the xnu kernel SMP-safe. With the evolution of Mac OS X, components of xnu are being rewritten using finer-grained locking with reasonably bounded latencies, thus phasing out dependencies on funnels.

An xnu funnel is built on top of a Mach mutex, as shown in Figure 974.

Figure 974. The structure of a Mac OS X funnel

// osfmk/kern/thread.h struct funnel_lock {     int        fnl_type;       // funnel type     lck_mtx_t *fnl_mutex;      // underlying mutex for the funnel     void      *fnl_mtxholder;  // thread (last) holding mutex     void      *fnl_mtxrelease; // thread (last) releasing mutex     lck_mtx_t *fnl_oldmutex;   // mutex before collapsing split funnel }; typedef struct funnel_lock funnel_t;

Even though a funnel is built on a mutex, there is an important difference in how funnels and mutexes are used: If a thread holding a mutex is blocked (say, in a memory allocation operation), the mutex will still be held. However, the scheduler will release a thread's funnel on descheduling and reacquire it when the thread is rescheduled. Another thread can enter the critical section protected by the funnel in this window. Therefore, any critical state that was being protected by the funnel is not guaranteed to be preserved while the thread is blocked. The thread must ensure that such state is protectedperhaps through other locking mechanisms. Consequently, the programmer must be careful while using potentially blocking operations in kernel code.

Before Mac OS X 10.4, there were two funnels in xnu: the kernel funnel (kernel_flock) and the network funnel (network_flock). Mac OS X 10.4 has only the kernel funnel. When Mach initializes the BSD subsystem at boot time, the first operation performed is allocation of these funnels. The rationale behind having two funnels was that the networking subsystem and the rest of the BSD kernel (file system, process management, device management, and so on) are not likely to contend for the same resources. Hence, one funnel for networking and one for everything else is likely to benefit SMP performance. The kernel funnel ensures that only one thread runs inside the BSD portion of xnu at a time.

Funnels affect only the BSD portion of the kernel. Other components, such as Mach and the I/O Kit, use their own locking and synchronization mechanisms.

In Mac OS X 10.4, the file system and the networking subsystem use fine-grained locks, as shown in these examples.

The domain structure (struct domain [bsd/sys/domain.h]) now contains a mutex.
The protocol switch structure (structure protosw [bsd/sys/protosw.h]) provides locking hooks, namely, pr_lock(), pr_unlock(), and pr_getlock().
The vnode structure (struct vnode [bsd/sys/vnode_internal.h]) contains a mutex.

If a file system is thread- and preemption-safe, this capability (including others, such as whether the file system is 64-bit-safe) is maintained as part of the configuration information within a mount structure (struct mount [bsd/sys/mount_internal.h]). When a vnode corresponding to a file on this file system is created, the v_unsafefs field of the vnode structure inherits this capability as a Boolean value. Thereafter, the file system layer uses the THREAD_SAFE_FS macro to determine whether a given vnode belongs to a reentrant file system.

// bsd/vfs/kpi_vfs.c #define THREAD_SAFE_FS(VP) ((VP)->v_unsafefs ? 0 : 1)

If a file system is not reentrant, the VNOP (vnode operation) and VFS interfaces take the kernel funnel before calling the file system's operations. Figure 975 shows an overview of the relevant kernel code for a VNOP call.

Figure 975. Automatic funnel use in a thread-unsafe file system

// bsd/vfs/kpi_vfs.c errno_t VNOP_OPEN(vnode_t vp, int mode, vfs_context_t context) {     int _err;     struct vnop_open_args a;     int thread_safe;     int funnel_state = 0;     ...     thread_safe = THREAD_SAFE_FS(vp);     if (!thread_safe) {         // take the funnel         funnel_state = thread_funnel_set(kernel_flock, TRUE);         ...     }     // call the file system entry point for open     err = (*vp->v_op[vnop_open_desc.vdesc_offset])(&a);     if (!thread_safe) {         ...         // drop the funnel         (void)thread_funnel_set(kernel_flock, funnel_state);         ...     }     ... }

To determine whether a given file system is thread- and preemption-safe, the VFS interfaces check the vfc_threadsafe field of the vfstable structure [bsd/sys/mount_internal.h] within the mount structure [bsd/sys/mount_internal.h] for that file system.

// bsd/vfs/kpi_vfs.c int VFS_START(struct mount *mp, int flags, vfs_context_t context) {     int thread_safe;     ...     thread_safe = mp->mnt_vtable->vfc_threadsafe;     ... }

A file system can (indirectly) set the vfc_threadsafe field by passing the appropriate flags (VFS_TBLTHREADSAFE or VFS_TBLFSNODELOCK) to the vfs_fsadd() function [bsd/vfs/kpi_vfs.c], which adds a new file system to the kernel.

Certain parts of the Mac OS X 10.4 kernel, such as the audit subsystem [bsd/kern/kern_audit.c], the vnode disk driver [bsd/dev/vn/vn.c], and the console driver [bsd/dev/ppc/cons.c], expressly use funnels.

A thread can hold only one funnel at a time. If the tHRead_funnel_set() function detects that a thread is trying to hold multiple funnels concurrently, it will panic the system. The pre-10.4 funnel implementation provides a function for merging two funnels (tHRead_funnel_merge()), which can merge two funnels into a single funnel. There is no function to get the two original funnels back from a merged funnel.

In contrast to a merged funnel, the multiple-funnel scheme that is normally used may be called a split-funnel scheme. It is possible to disable this scheme (in pre-10.4 kernels) and have both funnel locks point to the same funnel by using the dfnl=1 boot-time argument.

Before Mac OS X 10.4, a network file system was a likely candidate for needing to hold both the kernel and network funnels concurrently. xnu's NFS implementation made heavy use of thread_funnel_switch() to switch between the two funnels. This function was called with two funnels, an old one and a new one, as arguments, where the old funnel must be held by the calling thread.

boolean_t thread_funnel_switch(int oldfnl, int newfnl); ... thread_funnel_switch(KERNEL_FUNNEL, NETWORK_FUNNEL);

Funnels can also be acquired as part of BSD system call entry. As we saw in Chapter 6, a BSD system call table entry in xnu has a member indicating the funnel type to acquire when entering the kernel.

// bsd/sys/sysent.h struct sysent {    ...    int8_t  sy_funnel; // funnel type    ... } sysent[];

The sysent array is initialized in bsd/kern/init_sysent.c. Since Mac OS X 10.4 has only the kernel funnel, a system call that takes this funnel on entry will have the sy_funnel field of its sysent entry set to KERNEL_FUNNEL.

// bsd/kern/init_sysent.c __private_extern__ struct sysent sysent[] = {     ...     { ..., KERNEL_FUNNEL, (sy_call_t *)exit, ... },     { ..., KERNEL_FUNNEL, (sy_call_t *)fork, ...},     ...     { ..., KERNEL_FUNNEL, (sy_call_t *)ptrace, ...},     ... };

Only certain BSD system calls (most of them in pre-10.4 systems, fewer in 10.4) take funnels by default. In both Mac OS X 10.4 and earlier versions, Mach system calls, or system calls related to the I/O Kit, do not take a funnel as they enter the kernel. That said, an I/O Kit driver can take a funnel if it really must. For example, if a driver is bent on invoking certain file system operations using BSD functions within the kernel, it must take the kernel funnel on pre-10.4 systems. I/O Kit work-loop threads do make upcalls into the BSD kernelfor example, to deliver network packets or to complete disk I/O requests. Such a thread in a pre-10.4 kernel will acquire the appropriate funnel before calling the BSD functions. In many cases, the underlying driver family handles funnel-related details. For example, in the case of a USB networking driver, the IONetworkingFamily hides the details of using funnels.

It was said earlier that a thread's funnel is automatically released if the thread sleeps in the kernel. A funnel state is maintained for each thread in the funnel_state field of the thread structure. When the scheduler switches to a new thread, it checks the funnel state of the old thread. If it is TH_FN_OWNED (i.e., the thread owns the funnel pointed to by the funnel_lock member of the tHRead structure), the thread's funnel state is set to TH_FN_REFUNNEL, which marks the funnel to be reacquired on dispatch. After this, the thread's funnel is released. Conversely, if the new thread's funnel_state field is TH_FN_REFUNNEL, the funnel pointed to by the funnel_lock field will be acquired, and funnel_state will be set to TH_FN_OWNED.

9.18.9. SPLs

In traditional BSD kernels, a critical section makes a set-priority-level (SPL) call to block interrupt routines at (and below) a given priority level, for example:

// raise priority level to block network protocol processing // return the current value s = splnet(); // do network-related operations ... // reset priority level to the previous (saved) value splx(s);

The usual repertoire of SPL functions alone would not be sufficient for synchronization on Mac OS X for reasons discussed earlier. Although xnu implements these functions, they are all null implementations on Mac OS X 10.4. On earlier versions, they are still null operations, but they also ensure that the calling thread is running under a funnel (causing a panic otherwise).

// bsd/kern/spl.c (Mac OS X 10.3) ... unsigned splnet(void) {     if (thread_funnel_get() == THR_FUNNEL_NULL)         panic("%s not under funnel", "splnet()");     return(0); } ... // bsd/kern/spl.c (Mac OS X 10.4) ... unsigned splnet(void) {     return(0); } ...

9.18.10. Advisory-Mode File Locking

Mac OS X provides several interfaces for advisory-mode locking of files, both in their entirety and as byte ranges. Figure 976 shows an overview of file locking through the lockf() library function, the flock() system call, and the fcntl() system call.

Figure 976. Interfaces for advisory-mode file locking

// to lock, specify operation as either LOCK_SH (shared) or LOCK_EX (exclusive) // additionally, bitwise OR operation with LOCK_NB to not block when locking // to unlock, specify operation as LOCK_UN int flock(int fd, int operation); // cmd is one of F_GETLK, F_SETLK, or F_SETLKW // arg is a pointer to a flock structure int fcntl(int fd, int cmd, int arg); // function is one of F_ULOCK, F_TEST, F_TLOCK, or F_TEST // size specifies the number of bytes to lock, starting from the current offset int lockf(int fd, int function, off_t size);

The term advisory in advisory-mode locking means that all processes accessing a shared file must cooperate and use the advisory locking mechanism before reading from or writing to the file. If a process accesses such a file without using advisory locking, inconsistencies may result.

As shown in Figure 976, all three interfaces lead to the same locking mechanism in the kernel. The kernel provides a file-system-independent locking implementation in the VFS layer. This is referred to as the local lock implementation. Alternatively, a file system can implement its own advisory locking. Given a vnode, VNOP_ADVLOCK() decides whether to use local locking or to call the file system's advisory locking operation based on the VLOCKLOCAL flag on the vnode. This flag, in turn, depends on the MNTK_LOCK_LOCAL flag of the file system. If a file system wants the VFS layer to handle advisory locking, it can call the vfs_setlocklocal() function [bsd/vfs/vfs_subr.c] in its mount operation to set the MNTK_LOCK_LOCAL flag.

The local lock implementation uses the lockf structure [bsd/sys/lockf.h] to represent a byte-range advisory lock. The vnode structure contains a list of advisory locks for the file. The list is sorted on the starting offset of the lock.