Section 6.2. Mach | Mac OS X Internals: A Systems Approach

6.2. Mach

Let us briefly review our discussion of Mach from Chapters 1 and 2. Mach was designed as a communications-oriented operating system kernel with full multiprocessing support. Various types of operating systems could be built upon Mach. It aimed to be a microkernel in which traditional operating system services such as file systems, I/O, memory managers, networking stacks, and even operating system personalities were meant to reside in user space, with a clean logical and modular separation between them and the kernel. In practice, releases of Mach prior to release 3 had monolithic implementations. Release 3 (a project started at Carnegie Mellon University and continued by the Open Software Foundation) was the first true microkernel version of Mach: BSD ran as a user-space task in this version.

The Mach portions of xnu were originally based on Open Group's Mach Mk 7.3 system, which in turn was based on Mach 3. xnu's Mach contains enhancements from MkLinux and work done on Mach at the University of Utah. Examples of the latter include the migrating thread model, wherein the thread abstraction is further decoupled into an execution context and a schedulable thread of control with an associated chain of contexts.

xnu Is Not a Microkernel

All kernel components reside in a single kernel address space in Mac OS X. Although the kernel is modular and extensible, it is still monolithic. Nevertheless, note that the kernel works closely with a few user-space daemons such as dynamic_pager, kextd, and kuncd.

In this chapter, we will discuss basic Mach concepts and programming abstractions. We will look at some of these concepts in more detail in the next three chapters in the context of process management, memory management, and interprocess communication (IPC).

In this book, Mach-related programming examples are presented to demonstrate the internal working of certain aspects of Mac OS X. However, Apple does not support the direct use of most Mach-level APIs by third-party programs. Consequently, you are advised against using such APIs in software you distribute.

6.2.1. Kernel Fundamentals

Mach provides a virtual machine interface to higher layers by abstracting system hardware, a scenario that is common among many operating systems. The core Mach kernel is designed to be simple and extensible: It provides an IPC mechanism that is the building block for many services offered by the kernel. In particular, Mach's IPC features are unified with its virtual memory subsystem, which leads to various optimizations and simplifications.

The 4.4BSD virtual memory system was based on the Mach 2.0 virtual memory system, with updates from newer versions of Mach.

Mach has five basic abstractions from a programmer's standpoint:

Task
Thread
Port
Message
Memory object

Besides providing the basic kernel abstractions, Mach represents various other hardware and software resources as port objects, allowing manipulation of such resources through its IPC mechanism. For example, Mach represents the overall computer system as a host object, a single physical CPU as a processor object, and one or more groups of CPUs in a multiprocessor system as processor set objects.

6.2.1.1. Tasks and Threads

Mach divides the traditional Unix abstraction of a process into two parts: a task and a thread. As we will see in Chapter 7, the terms thread and process have context-specific connotations in the Mac OS X user space, depending on the environment. Within the kernel, a BSD process, which is analogous to a traditional Unix process, is a data structure with a one-to-one mapping with a Mach task. A Mach task has the following key features.

It is an execution environment and a static entity. A task does not execute (that is, it performs no computation) by itself. It provides a framework within which other entities (threads) execute.
It is the basic unit of resource allocation and can be thought of as a resource container. A task contains a collection of resources such as access to processors, paged virtual address space (virtual memory), IPC space, exception handlers, credentials, file descriptors, protection state, signal management state, and statistics. Note that a task's resources include Unix items too, which on Mac OS X are contained in a task through its one-to-one association with a BSD process structure.
It represents the protection boundary of a program. One task cannot access another task's resources unless the former has obtained explicit access using some well-defined interface.

A thread is the actual executing entity in Mach: it is a point of control flow in a task. It has the following features.

It executes within the context of a task, representing an independent program counter (a stream of instructions) within the task. A thread is also the fundamental schedulable entity, with associated scheduling priority and attributes. Each thread is scheduled preemptively and independently of other threads, whether they are in the same task or in any other task.
The code that a thread executes resides within the address space of its task.
Each task may contain zero or more threads, but each thread belongs to exactly one task. A task with no threads, although legitimate, cannot run.
All threads within a task share all the task's resources. In particular, since all threads share the same memory, a thread can overwrite another thread's memory within the same task, without requiring any additional privileges. Since there may be several concurrently executing threads in one task, threads within a task must cooperate.
A thread may have its own exception handlers.
Each thread has its own computation state, which includes processor registers, a program counter, and a stack. Note that while a thread's stack is designated as private, it resides in the same address space as other threads within the same task. As noted earlier, threads within a task can access each other's stacks if they choose to.
A thread uses a kernel stack for handling system calls. A kernel stack's size is 16KB.

To sum up, a task is passive, owns resources, and is a basic unit of protection. Threads within a task are active, execute instructions, and are basic units of control flow.

A single-threaded traditional Unix process is analogous to a Mach task with only one thread, whereas a multithreaded Unix process is analogous to a Mach task with many threads.

A task is considerably more expensive to create or destroy than a thread.

Whereas every thread has a containing task, a Mach task is not related to its creating task, unlike Unix processes. However, the kernel maintains process-level parent-child relationships in the BSD process structures. Nevertheless, we may consider a task that creates another task to be the parent task and the newly created task to be the child task. During creation, the child inherits certain aspects of the parent, such as registered ports, exception and bootstrap ports, audit and security tokens, shared mapping regions, and the processor set. Note that if the parent's processor set has been marked inactive, the child is assigned to the default processor set.

The Kernel Task

As we saw in our discussion of kernel startup in Chapter 5, the kernel uses the task and thread abstractions to divide its functionality into various execution flows. The kernel uses a single task, the kernel task, with multiple threads that perform kernel operations such as scheduling, thread reaping, callout management, paging, and Unix exception handling. Thus, xnu is a monolithic kernel containing markedly different components such as Mach, BSD, and the I/O Kit, all running as groups of threads in a single task in the same address space.

Once a task is created, anyone with a valid task identifier (and thus the appropriate rights to a Mach IPC port) can perform operations on the task. A task can send its identifier to other tasks in an IPC message, if it so desires.

6.2.1.2. Ports

A Mach port is a multifaceted abstraction. It is a kernel-protected unidirectional IPC channel, a capability, and a name. Traditionally in Mach, a port is implemented as a message queue with a finite length.

Besides Mach ports, Mac OS X provides many other types of IPC mechanisms, both within the kernel and in user space. Examples of such mechanisms include POSIX and System V IPC, multiple notification mechanisms, descriptor passing, and Apple Events. We will examine several IPC mechanisms in Chapter 9.

The port abstraction, along with associated operations (the most fundamental being send and receive), is the basis for communication in Mach. A port has kernel-managed capabilities, or rights, associated with it. A task must hold the appropriate rights to manipulate a port. For example, rights determine which task can send messages to a given port or which task may receive messages destined for it. Several tasks can have send rights to a particular port, but only one task can hold receive rights to a given port.
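The rights discipline described above can be illustrated with a small toy model. This is only a sketch in Python, not the Mach API: the `Port` class, its method names, and the `queue_limit` value are hypothetical and chosen purely for illustration. It models a port as a bounded message queue to which many tasks may send but from which only one task may receive.

```python
from collections import deque


class Port:
    """Toy model of a Mach port: a bounded queue guarded by rights (not the real API)."""

    def __init__(self, receiver, queue_limit=5):
        self.receiver = receiver        # exactly one task holds the receive right
        self.senders = set()            # any number of tasks may hold send rights
        self.queue = deque()
        self.queue_limit = queue_limit  # hypothetical limit, chosen for illustration

    def grant_send_right(self, task):
        self.senders.add(task)

    def send(self, task, message):
        if task not in self.senders:
            raise PermissionError(f"{task} holds no send right")
        if len(self.queue) >= self.queue_limit:
            return False                # queue full; this toy model simply refuses
        self.queue.append((task, message))
        return True

    def receive(self, task):
        if task != self.receiver:
            raise PermissionError(f"{task} holds no receive right")
        return self.queue.popleft() if self.queue else None
```

For instance, after `p = Port(receiver="server")` and `p.grant_send_right("client")`, the call `p.send("client", "hello")` succeeds, whereas `p.receive("client")` raises `PermissionError` because only `"server"` holds the receive right.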

In the object-oriented sense, a port is an object reference. Various abstractions in Mach, including data structures and services, are represented by ports. In this sense, a port acts as a protected access provider to a system resource. You access objects such as tasks, threads, or memory objects^[1] through their respective ports. For example, each task has a task port that represents that task in calls to the kernel. Similarly, a thread's point of control is accessible to user programs through a thread port. Any such access requires a port capability, which is the right to send or receive messages to that port, or rather, to the object the port represents. In particular, you perform operations on an object by sending messages to one of its ports.^[2] The object holding receive rights to the port can then receive the message, process it, and possibly perform an operation requested in the message. The following are two examples of this mechanism.

^[1] With the exception of virtual memory, all Mach system resources are accessed through ports.

^[2] Objects may have multiple ports representing different types of functionality or access level. For example, a privileged resource may have a control port accessible only to the superuser and an information port accessible to all users.

A window manager can represent each window it manages by a port. Its client tasks can perform window operations by sending messages to the appropriate window ports. The window manager task receives and processes these operations.
Each task, and each thread within the task, has an exception port. An error handler can register one of its ports as a thread's exception port. When an exception occurs, a message will be sent to this port. The handler can receive and process this message. Similarly, a debugger can register one of its ports as the task's exception port. Thereafter, unless a thread has explicitly registered its own thread exception port, exceptions in all of the task's threads will be communicated to the debugger.

Since a port is a per-task resource, all threads within a task automatically have access to the task's ports. A task can allow other tasks to access one or more of its ports. It does so by passing port rights in IPC messages to other tasks. Moreover, a thread can access a port only if the port is known to the containing task: there is no global, system-wide port namespace.

Several ports may be grouped together in a port set. All ports in a set share the same queue. Although there still is a single receiver, each message contains an identifier for the specific port within the port set on which the message was received. This functionality is similar to the Unix select() system call.

Network-Transparent Ports

Mach ports were designed to be network transparent, allowing tasks on network-connected machines to communicate with each other without worrying about where other tasks were located. A network message server (netmsgserver) was typically used in such scenarios as a trusted intermediary. Tasks could advertise their services by checking in with the netmsgserver. A check-in operation registered a unique name with the netmsgserver. Other tasks, including tasks on other machines, could look up service names on the netmsgserver, which itself used a port available to all tasks. This way, the netmsgserver could propagate port rights across networks. Mac OS X does not support this distributed IPC feature of Mach, and as such does not have any internal or external network message servers. Distributed IPC is, however, possible on Mac OS X using higher-level mechanisms such as the Cocoa API's Distributed Objects feature.

Note that a port can be used to send messages in only one direction. Therefore, unlike a BSD socket, a port does not represent an end point of a bidirectional communication channel. If a request message is sent on a certain port and the sender needs to receive a reply, another port must be used for the reply.

As we will see in Chapter 9, a task's IPC space includes mappings from port names to the kernel's internal port objects, along with rights for these names. A Mach port's name is an integer, conceptually similar to a Unix file descriptor. However, Mach ports differ from file descriptors in several ways. For example, a file descriptor may be duplicated multiple times, with each descriptor being a different number referring to the same open file. If multiple port rights are similarly opened for a particular port, the port names will coalesce into a single name, which would be reference-counted for the number of rights it represents. Moreover, other than certain standard ports such as registered, bootstrap, and exception ports, Mach ports are not inherited implicitly across the fork() system call.
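The contrast between descriptor duplication and port-name coalescing can be sketched with two toy tables. Both classes below (`FileTable` and `IPCSpace`), their methods, and the starting numbers are hypothetical illustrations written in Python; they are not kernel data structures, and the model covers send rights only.

```python
class FileTable:
    """Toy descriptor table: dup() yields a new number for the same open file."""

    def __init__(self):
        self.entries = {}
        self.next_fd = 3            # arbitrary starting number for illustration

    def open(self, file_obj):
        fd = self.next_fd
        self.next_fd += 1
        self.entries[fd] = file_obj
        return fd

    def dup(self, fd):
        return self.open(self.entries[fd])


class IPCSpace:
    """Toy IPC space: repeated send rights to one port share a single name."""

    def __init__(self):
        self.names = {}             # name -> [port, reference count]
        self.by_port = {}           # port -> name
        self.next_name = 1          # arbitrary; real port names look different

    def add_send_right(self, port):
        if port in self.by_port:    # already known: coalesce and count the reference
            name = self.by_port[port]
            self.names[name][1] += 1
            return name
        name = self.next_name
        self.next_name += 1
        self.names[name] = [port, 1]
        self.by_port[port] = name
        return name
```

In this model, calling `dup()` twice on one descriptor produces two further distinct numbers, whereas calling `add_send_right()` three times for one port returns the same name each time and leaves its reference count at 3.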

6.2.1.3. Messages

Mach IPC messages are data objects that threads exchange with each other to communicate. Typical intertask communication in Mach, including between the kernel and user tasks, occurs using messages. A message may contain actual inline data or a pointer to out-of-line (OOL) data. OOL data transfer is an optimization for large transfers, wherein the kernel allocates a memory region for the message in the receiver's virtual address space, without making a physical copy of the message. The shared memory pages are marked copy-on-write (COW).

A message may contain arbitrary program data, copies of memory ranges, exceptions, notifications, port capabilities, and so on. In particular, the only way to transfer port capabilities from one task to another is through messages.

Mach messages are transferred asynchronously. Even though only one task can hold receive rights to a port, multiple threads within a task may attempt to receive messages on a port. In such a case, only one of the threads will succeed in receiving a given message.
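The single-delivery property can be mimicked with ordinary threads contending for one queue. The following sketch uses Python's standard `queue` and `threading` modules as a stand-in for a port with several receiving threads; the function name `drain_with_threads` and its parameters are hypothetical, and nothing here calls into Mach.

```python
import queue
import threading


def drain_with_threads(messages, n_threads=3):
    """Let several receiver threads contend for one queue of messages.

    Returns one list per thread holding the messages that thread received.
    """
    port = queue.Queue()            # stands in for the port's message queue
    for m in messages:              # sends complete before any receive happens
        port.put(m)

    received = [[] for _ in range(n_threads)]

    def receiver(index):
        while True:
            try:
                msg = port.get_nowait()
            except queue.Empty:
                return              # nothing left to receive
            received[index].append(msg)

    threads = [threading.Thread(target=receiver, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received
```

Which thread ends up with which message varies from run to run, but every message appears in exactly one thread's list.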

6.2.1.4. Virtual Memory and Memory Objects

Mach's virtual memory (VM) system can be cleanly separated into machine-independent and machine-dependent parts. For example, address maps, memory objects, share maps, and resident memory are machine-independent, whereas the physical map (pmap) is machine-dependent. We will discuss VM-related abstractions in detail in Chapter 8.

Features of Mach's VM design include the following.

Mach provides per-task protected address spaces, with a sparse memory layout. A task's address space description is a linear list of memory regions (vm_map_t), where each region points to a memory object (vm_object_t).
The machine-dependent address mappings are contained in a pmap object (pmap_t).
A task can allocate or deallocate regions of virtual memory both within its own address space and in other tasks' address spaces.
Tasks can specify protection and inheritance properties of memory on a per-page basis. Memory pages can be unshared between tasks or shared using either copy-on-write or read-write mode. Each group of pages (a memory region) has two protection values: a current and a maximum. The current protection corresponds to the actual hardware protection being used for the pages, whereas the maximum protection is the highest (most permissive) value that current protection may ever achieve. The maximum protection is an absolute upper limit in that it cannot be elevated (made more permissive), but only lowered (made more restrictive). Therefore, the maximum protection represents the most access that can be had to a memory region.
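The relationship between current and maximum protection can be expressed as a small set of bit-mask rules. The sketch below is a toy model in Python, not the kernel's implementation: the `Region` class and its methods are hypothetical, and clamping the current protection when the maximum is lowered is an assumption of this model. The flag values mirror the conventional read, write, and execute bits.

```python
VM_PROT_NONE = 0x0
VM_PROT_READ = 0x1
VM_PROT_WRITE = 0x2
VM_PROT_EXECUTE = 0x4


class Region:
    """Toy memory region with a current and a maximum protection."""

    def __init__(self, current, maximum):
        if current & ~maximum:
            raise ValueError("current protection exceeds maximum")
        self.current = current
        self.maximum = maximum

    def set_current(self, new):
        if new & ~self.maximum:     # asks for a bit the maximum does not allow
            return False
        self.current = new
        return True

    def set_maximum(self, new):
        if new & ~self.maximum:     # the maximum may only be lowered, never raised
            return False
        self.maximum = new
        self.current &= new         # model assumption: clamp current to the new maximum
        return True
```

Starting from `Region(VM_PROT_READ, VM_PROT_READ | VM_PROT_WRITE)`, raising the current protection to read-write succeeds, adding execute fails, and once the maximum has been lowered to read-only it cannot be restored to read-write.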

A memory object is a container for data (including file data) that is mapped into the address space of a task. It serves as a channel for providing memory to tasks. Mach traditionally allows a memory object to be managed by a user-mode external memory manager, wherein the handling of page faults and page-out data requests can be performed in user space. An external pager can also be used to implement networked virtual memory. This external memory management (EMM) feature of Mach is not used in Mac OS X. xnu provides basic paging services in the kernel through three pagers: the default (anonymous) pager, the vnode pager, and the device pager.

The default pager handles anonymous memory, that is, memory with no explicitly designated pager. It is implemented in the Mach portion of the kernel. With help from the dynamic_pager user-space application,^[3] which manages on-disk backing-store (or swap) files, the default pager pages to swap files on a normal file system.

^[3] The dynamic_pager application is not involved in actual paging operations; it only creates or deletes swap files based on various criteria.

Swap files reside under the /var/vm/ directory by default. The files are named swapfileN, where N is the swap file's number. The first swap file is called swapfile0.

The vnode pager is used for memory-mapped files. Since the Mac OS X VFS is in the BSD portion of the kernel, the vnode pager is implemented in the BSD layer.

The device pager is used for non-general-purpose memory. It is implemented in the Mach layer but used by the I/O Kit.

6.2.2. Exception Handling

A Mach exception is a synchronous interruption of a program's execution that occurs due to the program itself. The causes for exceptions can be erroneous conditions such as executing an illegal instruction, dividing by zero, or accessing invalid memory. Exceptions can also be caused deliberately, such as during debugging, when a debugger breakpoint is hit.

xnu's Mach implementation associates an array of exception ports with each task and another with each thread within a task. Each such array has as many slots as there are exception types defined for the implementation, with slot 0 being invalid. All of a thread's exception ports are set to the null port (IP_NULL) when the thread is created, whereas a task's exception ports are inherited from those of the parent task. The kernel allows a programmer to get or set individual exception ports for both tasks and threads. Consequently, a program can have multiple exception handlers. A single handler may also handle multiple exception types. Typical preparation for exception handling by a program involves allocation of one or more ports to which the kernel will send exception notification messages. The port can then be registered as an exception port for one or more types of exceptions for either a thread or a task. The exception handler code typically runs in a dedicated thread, waiting for notification messages from the kernel.

Exception handling in Mach can be viewed as a metaoperation consisting of several suboperations. The thread that causes an exception is called the victim thread, whereas the thread that runs the exception handler is called the handler thread. When a victim causes (raises) an exception, the kernel suspends the victim thread and sends a message to the appropriate exception port, which may be either a thread exception port (more specific) or a task exception port (if the thread has not set an exception port). Upon receiving (catching) the message, the handler thread processes the exception, an operation that may involve fixing the victim's state, arranging for it to be terminated, logging an error, and so on. The handler replies to the message, indicating whether the exception was processed successfully (cleared). Finally, the kernel either resumes the victim or terminates it.
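The port-selection and reply steps just described can be summarized in a few lines. This is a deliberately simplified sketch in Python, not kernel code: the functions `select_exception_port` and `deliver`, the dictionary-based port tables, and the handler callbacks are all hypothetical, and the model ignores any forwarding between handlers. The exception type numbers start at 1, consistent with slot 0 being invalid.

```python
EXC_BAD_ACCESS = 1
EXC_BAD_INSTRUCTION = 2
EXC_ARITHMETIC = 3


def select_exception_port(thread_ports, task_ports, exc_type):
    """Prefer the thread's port for this exception type, else the task's."""
    port = thread_ports.get(exc_type)   # a missing entry models a null port
    if port is None:
        port = task_ports.get(exc_type)
    return port


def deliver(thread_ports, task_ports, handlers, exc_type):
    """Return 'resume' if a handler clears the exception, else 'terminate'."""
    port = select_exception_port(thread_ports, task_ports, exc_type)
    if port is None:
        return "terminate"              # no handler registered anywhere
    cleared = handlers[port](exc_type)  # the handler's reply: True means cleared
    return "resume" if cleared else "terminate"
```

With a debugger's port registered only in the task table, an exception in any thread that has no port of its own reaches the debugger; a thread that registers its own port for a given type takes precedence for that type.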

A thread exception port is typically relevant for error handling. Each thread may have its own exception handlers that process exceptions corresponding to errors that affect only individual threads. A task exception port is typically relevant for debugging. A debugger can attach to a task by registering one of its own ports as the debugged task's exception port. Since a task inherits its exception ports from the creating task, the debugger will also be able to control child processes of the debugged program. Moreover, exception notifications for all threads that have no registered exception port will be sent to the task exception port. Recall that a thread is created with null exception ports and, correspondingly, with no default handlers. Therefore, this works well in the general case. Even when a thread does have valid exception ports, the corresponding exception handlers can forward exceptions to the task exception port.

We will look at a programming example of Mach exception handling in Chapter 9.