Section 9.2. Mach IPC: An Overview | Mac OS X Internals: A Systems Approach

9.2. Mach IPC: An Overview

Mach provides a message-oriented, capability-based IPC facility that represents an evolution of similar approaches used by Mach's precursors, namely, Accent and RIG. Mach's IPC implementation uses the VM subsystem to efficiently transfer large amounts of data using copy-on-write optimizations. The Mac OS X kernel uses the general message primitive provided by Mach's IPC interface as a low-level building block. In particular, the Mach calls mach_msg() and mach_msg_overwrite() can be used for both sending and receiving messages (in that order), allowing RPC^[2]-style interaction as a special case of IPC. This type of RPC is used for implementing several system services in Mac OS X.

^[2] Remote procedure call.

A Portly Look Back

David C. Walden's 1972 paper titled "A System for Interprocess Communication in a Resource Sharing Computer Network"^[3] describes a set of operations enabling interprocess communication within a single timesharing system, but using techniques that could easily be generalized to permit communication between remote processes. Walden's description included an abstraction called a port, which he defined to be a particular data path to a process (a RECEIVE port) or from a process (a SEND port). All ports had associated unique identifiers called port numbers. The kernel maintained a table of port numbers associated with processes and restart locations. On completion of an IPC transmission, the kernel transferred the participant (sender or receiver) to a restart location, which was specified as part of a SEND or RECEIVE operation.

Although Walden's description was that of a hypothetical system, many parallels can be found in latter-day IPC mechanisms in systems like RIG, Accent, and Mach, including Mac OS X.

As we saw in Chapter 1, Rochester's Intelligent Gateway (RIG) system, whose implementation began in 1975, used an IPC facility as the basic structuring tool. RIG's IPC facility used ports and messages as basic abstractions. A RIG port was a kernel-managed message queue, globally identified by a <process number.port number> pair of integers. A RIG message was a limited-size unit consisting of a header and some data.

The Accent system improved upon RIG's IPC by defining ports to be capabilities as well as communication objects and by using a larger address space along with copy-on-write techniques to handle large objects. An intermediary Network Server process could transparently extend Accent's IPC across the network.

A process was an address space and a single program counter in both RIG and Accent. Mach split the process abstraction into a task and a thread, with the task portion owning port access rights. The use of thread mechanisms to handle errors and certain asynchronous activities simplified Mach's IPC facility. Mach 3.0, from which the Mac OS X kernel's Mach component is derived, incorporated several performance- and functionality-related improvements to IPC.

^[3] "A System for Interprocess Communication in a Resource Sharing Computer Network," by David C. Walden (Communications of the ACM 15:4, April 1972, pp. 221-230).

The Mach IPC facility is built on two basic kernel abstractions: ports and messages, with messages passing between ports as the fundamental communication mechanism. A port is a multifaceted entity, whereas a message is an arbitrarily sized collection of data objects.

9.2.1. Mach Ports

Mach ports serve the following primary purposes in the operating system.

A port is a communications channel: a kernel-protected, kernel-managed, finite-length queue of messages. The most basic operations on a port are sending and receiving messages. Sending to a port allows a task to place messages into the port's underlying queue. Receiving allows a task to retrieve messages from that queue, which holds incoming messages until the recipient removes them. In general, senders are blocked when the queue corresponding to a port is full, and receivers are blocked when it is empty.
Ports are used to represent capabilities in that they themselves are protected by a capability mechanism to prevent arbitrary Mach tasks from accessing them. To access a port, a task must have a port capability, or port right, such as a send right or a receive right. The specific rights a task has to a port limit the set of operations the task may perform on that port. This allows Mach to prevent unauthorized tasks from accessing ports and, in particular, from manipulating objects associated with ports.
Ports are used to represent resources, services, and facilities, thus providing object-style access to these abstractions. For example, Mach uses ports to represent abstractions such as hosts, tasks, threads, memory objects, clocks, timers, processors, and processor sets. Operations on such port-represented objects are performed by sending messages to their representative ports. The kernel, which typically holds the receive rights to such ports, receives and processes the messages. This is analogous to object-oriented method invocation.
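The queue behavior described in the first item can be sketched with a toy model. This is not kernel code: toy_port_t, QUEUE_LIMIT, and the TOY_* return codes are hypothetical names invented for illustration, and where a real task would block, the model simply reports that it would.

```c
#include <assert.h>
#include <stddef.h>

#define QUEUE_LIMIT 4   /* hypothetical limit, chosen only for the example */

enum { TOY_OK = 0, TOY_WOULD_BLOCK = -1 };

/* Toy stand-in for a port: a fixed-size FIFO ring of integer "messages". */
typedef struct {
    int msgs[QUEUE_LIMIT];
    int head;    /* index of the oldest queued message */
    int count;   /* number of messages currently queued */
} toy_port_t;

/* Enqueue a message; a real sender would block when the queue is full. */
static int toy_send(toy_port_t *p, int msg)
{
    assert(p != NULL);
    if (p->count == QUEUE_LIMIT)
        return TOY_WOULD_BLOCK;
    p->msgs[(p->head + p->count) % QUEUE_LIMIT] = msg;
    p->count++;
    return TOY_OK;
}

/* Dequeue the oldest message; a real receiver would block when empty. */
static int toy_receive(toy_port_t *p, int *msg)
{
    assert(p != NULL && msg != NULL);
    if (p->count == 0)
        return TOY_WOULD_BLOCK;
    *msg = p->msgs[p->head];
    p->head = (p->head + 1) % QUEUE_LIMIT;
    p->count--;
    return TOY_OK;
}
```

Once the queue holds QUEUE_LIMIT messages, toy_send() reports TOY_WOULD_BLOCK until toy_receive() removes the oldest one.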

A port's name can stand for several entities, such as a right for sending or receiving messages, a dead name, a port set, or nothing. In general, we refer to what a port name stands for as a port right, although the term right may seem unintuitive in some situations. We will discuss details of these concepts later in this chapter.

9.2.1.1. Ports for Communication

In its role as a communications channel, a Mach port resembles a BSD socket, but there are important differences, such as those listed here.

Mach IPC, by design, is integrated with the virtual memory subsystem.
Whereas sockets are primarily used for remote communication, Mach IPC is primarily used for (and optimized for) intramachine communication. However, Mach IPC, by design, can be transparently extended over the network.
Mach IPC messages can carry typed content.
In general, Mach IPC interfaces are more powerful and flexible than the socket interfaces.

When we talk of a message being sent to a task, we mean that the message is sent to a port that the recipient task has receive rights to. The message is dequeued by a thread within the recipient task.

Integration of IPC with virtual memory allows messages to be mapped (copy-on-write, if possible and appropriate) into the receiving task's address space. In theory, a message could be as large as the size of a task's address space.

Although the Mach kernel itself does not include any explicit support for distributed IPC, communication can be transparently extended over the network by using external (user-level) tasks called Network Servers, which simply act as local proxies for remote tasks. A message sent to a remote port will be sent to a local Network Server, which is responsible for forwarding it to a Network Server on the remote destination machine. The participant tasks are unaware of these details, hence the transparency.

Although the xnu kernel retains most of the semantics of Mach IPC, network-transparent Mach IPC is not used on Mac OS X.

9.2.1.2. Port Rights

The following specific port right types are defined on Mac OS X.

MACH_PORT_RIGHT_SEND A send right to a port implies that the right's holder can send messages to that port. Send rights are reference counted. If a thread acquires a send right that the task already holds, the right's reference count is incremented. Similarly, a right's reference count is decremented when a thread deallocates the right. This mechanism prevents race conditions involving premature deallocation of send rights, as the task will lose the send right only when the right's reference count becomes zero. Therefore, several threads in a multithreaded program can use such rights safely.
MACH_PORT_RIGHT_RECEIVE A receive right to a port implies that the right's holder can dequeue messages from that port. A port may have any number of senders but only one receiver. Moreover, if a task has a receive right to a port, it automatically has a send right to it too.
MACH_PORT_RIGHT_SEND_ONCE A send-once right allows its holder to send only one message, after which the right is deleted. Send-once rights are used as reply ports, wherein a client can include a send-once right in a request message, and the server can use that right to send a reply. A send-once right always results in exactly one message being sent, even if it is destroyed, in which case a send-once notification is generated.
MACH_PORT_RIGHT_PORT_SET A port set name can be considered as a receive right encompassing multiple ports. A port set represents a group of ports to which the task has a receive right. In other words, a port set is a bucket of receive rights. It allows a task to receive a message, the first that is available, from any of the member ports of a set. The message identifies the specific port it was received on.
MACH_PORT_RIGHT_DEAD_NAME A dead name is not really a right; it represents a send or send-once right that has become invalid because the corresponding port was destroyed. As a send right transforms into a dead name on invalidation, its reference count also carries over to the dead name. Attempting to send a message to a dead name results in an error, which allows senders to realize that the port is destroyed. Dead names prevent the port names they take over from being reused prematurely.

A port is considered to be destroyed when its receive right is deallocated. When this happens, existing send or send-once rights transform into dead names, whereas messages still in the port's queue are destroyed, and any associated out-of-line memory is freed.
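The reference counting of send rights and their transformation into dead names can be illustrated with a small model of a single entry in a task's port namespace. This is a sketch under stated assumptions: name_entry_t, its fields, and the helper functions are hypothetical names, and the kernel's actual bookkeeping is considerably more involved.

```c
#include <assert.h>

/* What a name in the toy namespace currently stands for. */
typedef enum { ENTRY_NONE, ENTRY_SEND, ENTRY_DEAD_NAME } entry_type_t;

/* Hypothetical per-name record: a type plus a reference count. */
typedef struct {
    entry_type_t type;
    unsigned     refs;
} name_entry_t;

/* A thread acquires a send right the task may already hold. */
static void acquire_send(name_entry_t *e)
{
    assert(e->type == ENTRY_NONE || e->type == ENTRY_SEND);
    e->type = ENTRY_SEND;
    e->refs++;
}

/* A thread deallocates one reference; the name is freed only at zero. */
static void release_ref(name_entry_t *e)
{
    assert(e->refs > 0);
    if (--e->refs == 0)
        e->type = ENTRY_NONE;
}

/* The port's receive right goes away: a send right becomes a dead name,
 * and the reference count carries over unchanged. */
static void port_destroyed(name_entry_t *e)
{
    if (e->type == ENTRY_SEND)
        e->type = ENTRY_DEAD_NAME;
}

/* Returns 0 if the name still denotes a send right, -1 otherwise. */
static int try_send(const name_entry_t *e)
{
    return (e->type == ENTRY_SEND) ? 0 : -1;
}
```

In this model, two threads that each call acquire_send() must each call release_ref() before the name is freed, and the name stays reserved as a dead name until its carried-over references are released.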

The following are some noteworthy aspects of port rights.

Rights are owned at the task level. For example, although the code to create a port executes in a thread, the associated rights are granted to the thread's task. Thereafter, any other thread within that task can use or manipulate the rights.
The namespace for ports is private to each task; that is, a given port name is valid only within the IPC space of a task. This is analogous to per-task virtual address spaces.
If a task holds both the send right and receive right for a port, the rights have the same name.
No two send-once rights held by the task have the same name.
Rights can be transferred through message passing. In particular, the frequent operation of gaining access to a port involves receiving a message containing a port right.
After a task has sent a message containing one or more port rights, and before the message is dequeued by the receiver, the rights are held by the kernel. Since a receive right can be held by only one task at any time, there is the possibility of messages being sent to a port whose receive right is being transferred. In such a case, the kernel will enqueue the messages until the receiver task receives the rights and dequeues the messages.

9.2.1.3. Ports as Objects

The Mach IPC facility is a general-purpose object-reference mechanism that uses ports as protected access points. In semantic terms, the Mach kernel is a server that serves objects on various ports. This kernel server receives incoming messages, processes them by performing the requested operations, and, if required, sends a reply. This approach allows a more general and useful implementation of several operations that have been historically implemented as intraprocess function calls. For example, one Mach task can allocate a region of virtual memory in another task's address space (if permitted) by sending an appropriate message to the port representing the target task.

Note that the same model is used for accessing both user-level and kernel services. In either case, a task accesses the service by having one of its threads send messages to the service provider, which can be another user task or the kernel.

Besides message passing, little Mach functionality is exposed through Mach traps. Most Mach services are provided through message-passing interfaces. User programs typically access these services by sending messages to the appropriate ports.

We saw earlier that ports are used to represent both tasks and threads. When a task creates another task or a thread, it automatically gets access to the newly created entity's port. Since port ownership is task-level, all per-thread ports in a task are accessible to all threads within that task. A thread can send messages to other threads within its task (say, to suspend or resume their execution). It follows that having access to a task's port implicitly provides access to all threads within that task. The converse does not hold, however: Having access to a thread's port does not give access to its containing task's port.

9.2.1.4. Mach Port Allocation

A user program can acquire a port right in several ways, examples of which we will see later in this chapter. A program creates a new port right through the mach_port_allocate family of routines, of which mach_port_allocate() is the simplest:

int mach_port_allocate(ipc_space_t        task,   // task acquiring the port right
                       mach_port_right_t  right,  // type of right to be created
                       mach_port_name_t  *name);  // returns name for the new right

We will discuss details of port allocation in Section 9.3.5.

9.2.2. Mach IPC Messages

Mach IPC messages can be sent and received through the mach_msg family of functions. The fundamental IPC system call in Mac OS X is a trap called mach_msg_overwrite_trap() [osfmk/ipc/mach_msg.c], which can be used for sending a message, receiving a message, or both sending and receiving in a single call (in that order, which amounts to an RPC).

// osfmk/ipc/mach_msg.c

mach_msg_return_t
mach_msg_overwrite_trap(
    mach_msg_header_t  *snd_msg,   // message buffer to be sent
    mach_msg_option_t   option,    // bitwise OR of commands and modifiers
    mach_msg_size_t     send_size, // size of outgoing message buffer
    mach_msg_size_t     rcv_size,  // maximum size of receive buffer (rcv_msg)
    mach_port_name_t    rcv_name,  // port or port set to receive on
    mach_msg_timeout_t  timeout,   // timeout in milliseconds
    mach_port_name_t    notify,    // receive right for a notify port
    mach_msg_header_t  *rcv_msg,   // message buffer for receiving
    mach_msg_size_t     scatterlist_sz); // size of scatter list control info

The behavior of mach_msg_overwrite_trap() is controlled by setting the appropriate bits in the option argument. These bits determine what the call does and how it does it. Some bits cause the call to use one or more of the other arguments, which may be unused otherwise. The following are some examples of individual bits that can be set in option.

MACH_SEND_MSG If set, send a message.
MACH_RCV_MSG If set, receive a message.
MACH_SEND_TIMEOUT If set, the timeout argument specifies the timeout while sending.
MACH_RCV_TIMEOUT If set, timeout specifies the timeout while receiving.
MACH_SEND_INTERRUPT If set, the call returns MACH_SEND_INTERRUPTED if a software interrupt aborts the call; otherwise, an interrupted send is reattempted.
MACH_RCV_INTERRUPT This bit is similar to MACH_SEND_INTERRUPT, but for receiving.
MACH_RCV_LARGE If set, the kernel will not destroy a received message even if it is larger than the receive limit; this way, the receiver can reattempt to receive the message.

The header file osfmk/mach/message.h contains the full set of modifiers that can be used with the mach_msg family of functions.
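How the option bits select the operation can be sketched as follows. The DEMO_* constants are hypothetical stand-ins with made-up values, not the definitions from osfmk/mach/message.h, and classify() and uses_timeout() are illustrative helpers rather than part of any Mach interface.

```c
#include <assert.h>

/* Hypothetical option bits; the values are NOT those of the real header. */
enum {
    DEMO_SEND_MSG     = 0x1,
    DEMO_RCV_MSG      = 0x2,
    DEMO_SEND_TIMEOUT = 0x4,
    DEMO_RCV_TIMEOUT  = 0x8
};
#define DEMO_ALL_BITS \
    (DEMO_SEND_MSG | DEMO_RCV_MSG | DEMO_SEND_TIMEOUT | DEMO_RCV_TIMEOUT)

typedef enum { CALL_NONE, CALL_SEND, CALL_RECEIVE, CALL_RPC } call_kind_t;

/* Decide what a single call would do, based only on the command bits. */
static call_kind_t classify(unsigned option)
{
    int send, rcv;

    assert((option & ~(unsigned)DEMO_ALL_BITS) == 0); /* reject unknown bits */
    send = (option & DEMO_SEND_MSG) != 0;
    rcv  = (option & DEMO_RCV_MSG) != 0;
    if (send && rcv)
        return CALL_RPC;      /* send first, then receive */
    if (send)
        return CALL_SEND;
    if (rcv)
        return CALL_RECEIVE;
    return CALL_NONE;
}

/* The timeout argument is consulted only when a timeout bit is set. */
static int uses_timeout(unsigned option)
{
    return (option & (DEMO_SEND_TIMEOUT | DEMO_RCV_TIMEOUT)) != 0;
}
```

A call made with both command bits set is the RPC case: the send is performed first, followed by the receive.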

Another Mach trap, mach_msg_trap(), simply calls mach_msg_overwrite_trap() with zeros as the last two arguments. It uses the same buffer when the call is used for both sending and receiving, so the rcv_msg argument is not needed.

The scatterlist_sz argument is used when the receiver, while receiving an out-of-line message (see Section 9.5.5), wants the kernel not to dynamically allocate memory in the receiver's address space but to overwrite one or more preexisting valid regions with the received data. In this case, the caller describes which regions to use through out-of-line descriptors in the ingoing rcv_msg argument, and scatterlist_sz specifies the size of this control information.

The system library provides user-level wrappers around the messaging traps (Figure 9-1). The wrappers handle possible restarting of the appropriate parts of IPC operations in the case of interruptions.

Figure 9-1. System library wrappers around Mach messaging traps

// system library

#define LIBMACH_OPTIONS (MACH_SEND_INTERRUPT|MACH_RCV_INTERRUPT)

mach_msg_return_t
mach_msg(msg, option, /* other arguments */)
{
    mach_msg_return_t mr;

    // try the trap
    mr = mach_msg_trap(msg, option &~ LIBMACH_OPTIONS, /* arguments */);
    if (mr == MACH_MSG_SUCCESS)
        return MACH_MSG_SUCCESS;

    // if send was interrupted, retry, unless instructed to return error
    if ((option & MACH_SEND_INTERRUPT) == 0)
        while (mr == MACH_SEND_INTERRUPTED)
            mr = mach_msg_trap(msg, option &~ LIBMACH_OPTIONS, /* arguments */);

    // if receive was interrupted, retry, unless instructed to return error
    if ((option & MACH_RCV_INTERRUPT) == 0)
        while (mr == MACH_RCV_INTERRUPTED)
            // leave out MACH_SEND_MSG: if we needed to send, we already have
            mr = mach_msg_trap(msg, option &~ (LIBMACH_OPTIONS|MACH_SEND_MSG),
                               /* arguments */);

    return mr;
}

mach_msg_return_t
mach_msg_overwrite(...)
{
    ...
    // use mach_msg_overwrite_trap()
    ...
}

User programs normally use mach_msg() or mach_msg_overwrite() to perform IPC operations. Variants such as mach_msg_receive() and mach_msg_send() are other wrappers around mach_msg().
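The restart logic of the mach_msg() wrapper can be exercised in isolation with a scripted stand-in for the trap. The following is a portable model, not the library's source: fake_kernel_t, fake_trap(), wrapper_msg(), and the OPT_* and RET_* constants are hypothetical names with made-up values.

```c
#include <assert.h>

/* Hypothetical option bits and return codes (values are illustrative). */
enum { OPT_SEND = 1, OPT_RCV = 2, OPT_SEND_INTERRUPT = 4, OPT_RCV_INTERRUPT = 8 };
enum { RET_SUCCESS = 0, RET_SEND_INTERRUPTED = 1, RET_RCV_INTERRUPTED = 2 };
#define WRAPPER_OPTIONS (OPT_SEND_INTERRUPT | OPT_RCV_INTERRUPT)

/* Scripted "kernel": how many times each phase is still to be interrupted. */
typedef struct {
    int send_interrupts_left;
    int rcv_interrupts_left;
    int sends_completed;   /* must end up at most 1 per wrapper call */
    int calls;             /* number of times the trap was entered */
} fake_kernel_t;

static int fake_trap(fake_kernel_t *k, unsigned option)
{
    assert((option & WRAPPER_OPTIONS) == 0); /* wrapper masks these bits */
    k->calls++;
    if (option & OPT_SEND) {
        if (k->send_interrupts_left > 0) {
            k->send_interrupts_left--;
            return RET_SEND_INTERRUPTED;
        }
        k->sends_completed++;
    }
    if ((option & OPT_RCV) && k->rcv_interrupts_left > 0) {
        k->rcv_interrupts_left--;
        return RET_RCV_INTERRUPTED;
    }
    return RET_SUCCESS;
}

/* Same control flow as the wrapper in Figure 9-1, over the fake trap. */
static int wrapper_msg(fake_kernel_t *k, unsigned option)
{
    int mr = fake_trap(k, option & ~WRAPPER_OPTIONS);
    if (mr == RET_SUCCESS)
        return mr;
    if ((option & OPT_SEND_INTERRUPT) == 0)
        while (mr == RET_SEND_INTERRUPTED)
            mr = fake_trap(k, option & ~WRAPPER_OPTIONS);
    if ((option & OPT_RCV_INTERRUPT) == 0)
        while (mr == RET_RCV_INTERRUPTED)
            /* drop the send bit: the message has already been sent */
            mr = fake_trap(k, option & ~(WRAPPER_OPTIONS | OPT_SEND));
    return mr;
}
```

The detail worth noticing is the cleared send bit on a receive retry: without it, an interrupted receive would cause the request to be sent a second time.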

The anatomy of a Mach message has evolved over time, but the basic layout consisting of a fixed-size header^[4] and other variable-size data has remained unchanged. Mach messages in Mac OS X contain the following parts:

^[4] Note that unlike an Internet Protocol packet header, the send- and receive-side headers are not identical for a Mach IPC message.

A fixed-size message header (mach_msg_header_t).
A variable-size, possibly empty, body containing kernel and user data (mach_msg_body_t).
A variable-size trailer (one of several types) containing message attributes appended by the kernel (mach_msg_trailer_t). A trailer is only relevant on the receive side.

A message can be either simple or complex. A simple message contains a header immediately followed by untyped data, whereas a complex message contains a structured message body. Figure 9-2 shows how the parts of a complex Mach message are laid out. The body consists of a descriptor count followed by that many descriptors, which are used to transfer out-of-line memory and port rights. Removing the body from this picture gives the layout of a simple message.

Figure 9-2. The layout of a complex Mach message

9.2.2.1. Message Header

The meanings of the message header fields are as follows.

msgh_bits contains a bitmap describing the properties of the message. The MACH_MSGH_BITS_LOCAL() and MACH_MSGH_BITS_REMOTE() macros can be applied on this field to determine how the local port (msgh_local_port) and remote port (msgh_remote_port) fields will be interpreted. The MACH_MSGH_BITS() macro combines the remote and local bits to yield a single value that can be used as msgh_bits. In particular, the presence of the MACH_MSGH_BITS_COMPLEX flag in msgh_bits marks the message as a complex message.
```
// osfmk/mach/message.h
#define MACH_MSGH_BITS(remote, local) ((remote) | ((local) << 8))
```
msgh_size is ignored while sending because the send size is provided as an explicit argument. In a received message, this field specifies the combined size, in bytes,^[5] of the header and the body.
^[5] It is rather common for Mach routines to deal with sizes in units of natural_t instead of bytes. To avoid mysterious errors, be sure to verify the units that a given routine uses.
msgh_remote_port specifies the destination port (a send or send-once right) while sending.
msgh_local_port can be used to specify the reply port that the recipient will use to send a reply. It can be a valid send or send-once right but can also be MACH_PORT_NULL or MACH_PORT_DEAD.
msgh_id contains an identifier that can be used to convey the meaning or format of the message, to be interpreted by the recipient. For example, a client can use this field to specify an operation to be performed by the server.

The msgh_remote_port and msgh_local_port values are swapped (reversed with respect to the sender's view) in the message header seen by the recipient. Similarly, the bits in msgh_bits are also reversed.
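The packing of the remote and local bits, and the reversal seen by the recipient, can be modeled as follows. This is a simplified sketch: demo_header_t and the DEMO_* macros are hypothetical stand-ins modeled on the header fields described above, the right-type codes used with them are arbitrary, and the model assumes a plain swap, whereas the kernel may also translate the type of right along the way.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for the bit-packing macros: remote bits in the low byte,
 * local bits in the next byte. */
#define DEMO_BITS(remote, local)  ((remote) | ((local) << 8))
#define DEMO_BITS_REMOTE(bits)    ((bits) & 0xffu)
#define DEMO_BITS_LOCAL(bits)     (((bits) >> 8) & 0xffu)

/* Hypothetical, trimmed-down header with only the fields discussed here. */
typedef struct {
    uint32_t bits;
    uint32_t remote_port;
    uint32_t local_port;
    int32_t  id;
} demo_header_t;

/* Produce the recipient's view: ports and their bit fields trade places,
 * any other bits (such as a complex flag) are left untouched. */
static demo_header_t as_seen_by_receiver(demo_header_t sent)
{
    demo_header_t rcvd = sent;

    rcvd.remote_port = sent.local_port;
    rcvd.local_port  = sent.remote_port;
    rcvd.bits = (sent.bits & ~0xffffu) |
                DEMO_BITS(DEMO_BITS_LOCAL(sent.bits),
                          DEMO_BITS_REMOTE(sent.bits));
    assert(DEMO_BITS_REMOTE(rcvd.bits) == DEMO_BITS_LOCAL(sent.bits));
    return rcvd;
}
```

From the recipient's side, the reply port supplied by the sender therefore shows up in the remote slot, which is where a reply's destination port is specified.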

9.2.2.2. Message Body

A nonempty message body may contain data that is passive (uninterpreted by the kernel), active (processed by the kernel), or both. Passive data resides inline in the message body and is meaningful only to the sender and the recipient. Examples of active data include port rights and out-of-line memory regions. Note that a message that carries anything but inline passive data is a complex message.

As noted earlier, a complex message body contains a descriptor count followed by that many descriptors. Figure 9-3 shows some descriptor types that are available for carrying different types of content.

Figure 9-3. Descriptors for sending ports and out-of-line memory in Mach IPC messages

// osfmk/mach/message.h

// for carrying a single port
typedef struct {
    mach_port_t                name; // names the port whose right is being sent
    mach_msg_size_t            pad1;
    unsigned int               pad2        : 16;
    mach_msg_type_name_t       disposition : 8; // what to do with the right
    mach_msg_descriptor_type_t type        : 8; // MACH_MSG_PORT_DESCRIPTOR
} mach_msg_port_descriptor_t;

// for carrying an out-of-line data array
typedef struct {
    void                       *address; // address of the out-of-line memory
#if !defined(__LP64__)
    mach_msg_size_t            size;     // bytes in the out-of-line region
#endif
    boolean_t                  deallocate  : 8; // deallocate after sending?
    mach_msg_copy_options_t    copy        : 8; // how to copy?
    unsigned int               pad1        : 8;
    mach_msg_descriptor_type_t type        : 8; // MACH_MSG_OOL_DESCRIPTOR
#if defined(__LP64__)
    mach_msg_size_t            size;     // bytes in the out-of-line region
#endif
} mach_msg_ool_descriptor_t;

// for carrying an out-of-line array of ports
typedef struct {
    void                      *address; // address of the port name array
#if !defined(__LP64__)
    mach_msg_size_t            count;   // number of port names in the array
#endif
    boolean_t                  deallocate  : 8;
    mach_msg_copy_options_t    copy        : 8; // how to copy?
    mach_msg_type_name_t       disposition : 8; // what to do with the rights?
    mach_msg_descriptor_type_t type        : 8; // MACH_MSG_OOL_PORTS_DESCRIPTOR
#if defined(__LP64__)
    mach_msg_size_t            count;   // number of port names in the array
#endif
} mach_msg_ool_ports_descriptor_t;

A mach_msg_port_descriptor_t is used for passing a port right. Its name field specifies the name of the port right being carried in the message, whereas the disposition field specifies the IPC processing to be performed for the right, based on which the kernel passes the appropriate right to the recipient. The following are examples of disposition types.

MACH_MSG_TYPE_PORT_NONE The message carries neither a port name nor a port right.
MACH_MSG_TYPE_PORT_NAME The message carries only a port name and no rights. The kernel does not interpret the name.
MACH_MSG_TYPE_PORT_RECEIVE The message carries a receive right.
MACH_MSG_TYPE_PORT_SEND The message carries a send right.
MACH_MSG_TYPE_PORT_SEND_ONCE The message carries a send-once right.

A mach_msg_ool_descriptor_t is used for passing out-of-line memory. Its address field specifies the starting address of the memory in the sender's address space, whereas the size field specifies the memory's size in bytes. If the deallocate Boolean value is true, the set of pages containing the data will be deallocated in the sender's address space after the message is sent. The copy field is used by the sender to specify how the data is to be copied: either virtually (MACH_MSG_VIRTUAL_COPY) or physically (MACH_MSG_PHYSICAL_COPY). The recipient uses the copy field to specify whether to dynamically allocate space for the received out-of-line memory regions (MACH_MSG_ALLOCATE) or to write over existing specified regions of the receiver's address space (MACH_MSG_OVERWRITE). As far as possible, and unless explicitly overridden, memory transferred in this manner is shared copy-on-write between senders and recipients.

Once a send call returns, the sender can modify the message buffer used in the send call without affecting the message contents. Similarly, the sender can also modify any out-of-line memory regions transferred.

A mach_msg_ool_ports_descriptor_t is used to pass an out-of-line array of ports. Note that such an array is always physically copied while being sent.

9.2.2.3. Message Trailer

A received Mach message contains a trailer after the message data. The trailer is aligned on a natural boundary. The msgh_size field in the received message header does not include the size of the received trailer. The trailer itself contains the trailer size in its msgh_trailer_size field.

The kernel may provide several trailer formats, and within each format, there can be multiple trailer attributes. Mac OS X 10.4 provides only one trailer format: MACH_MSG_TRAILER_FORMAT_0. This format provides the following attributes (in this order): a sequence number, a security token, and an audit token. During messaging, the receiver can request the kernel to append one or more of these attributes as part of the received trailer on a per-message basis. However, there is a caveat: To include a later attribute in the trailer, the receiver must accept all previous attributes, where the later/previous qualifiers are with respect to the aforementioned order. For example, including the audit token in the trailer will automatically include the security token and the sequence number. The following types are defined to represent valid combinations of trailer attributes:

mach_msg_trailer_t the simplest trailer; contains a mach_msg_trailer_type_t and a mach_msg_trailer_size_t, with no attributes
mach_msg_seqno_trailer_t also contains the sequence number (mach_port_seqno_t) of the message with respect to its port
mach_msg_security_trailer_t also contains the security token (security_token_t) of the task that sent the message
mach_msg_audit_trailer_t also contains an audit token (audit_token_t)
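The nesting of trailer attributes can be made concrete with simplified stand-ins for the four trailer types. The struct and field names below are hypothetical, and the field widths are assumptions chosen for illustration; the authoritative definitions are in osfmk/mach/message.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Each stand-in repeats the fields of the previous one and appends one
 * more attribute, mirroring the "later implies earlier" rule. */
typedef struct {
    uint32_t type;
    uint32_t size;
} demo_trailer_t;

typedef struct {
    uint32_t type;
    uint32_t size;
    uint32_t seqno;       /* sequence number */
} demo_seqno_trailer_t;

typedef struct {
    uint32_t type;
    uint32_t size;
    uint32_t seqno;
    uint32_t sender[2];   /* security token */
} demo_security_trailer_t;

typedef struct {
    uint32_t type;
    uint32_t size;
    uint32_t seqno;
    uint32_t sender[2];
    uint32_t audit[8];    /* audit token (assumed width) */
} demo_audit_trailer_t;

enum { ELEM_NULL = 0, ELEM_SEQNO = 1, ELEM_SENDER = 2, ELEM_AUDIT = 3 };

/* Bytes a receiver must reserve for the trailer when it requests
 * attributes up to and including last_element. */
static size_t trailer_size_for(int last_element)
{
    switch (last_element) {
    case ELEM_NULL:   return sizeof(demo_trailer_t);
    case ELEM_SEQNO:  return sizeof(demo_seqno_trailer_t);
    case ELEM_SENDER: return sizeof(demo_security_trailer_t);
    case ELEM_AUDIT:  return sizeof(demo_audit_trailer_t);
    default:
        assert(0 && "unknown trailer element");
        return 0;
    }
}
```

Under these assumed widths, each step up the list strictly increases the space that the receive buffer must set aside.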

A security token is a structure containing the effective user and group IDs of the sending task (technically, of the associated BSD process). These are populated by the kernel securely and cannot be spoofed by the sender. An audit token is an opaque object that identifies the sender of a Mach message as a subject to the kernel's BSM auditing subsystem. It is also filled in securely by the kernel. Its contents can be interpreted using routines in the BSM library.

A task inherits its security and audit tokens from the task that creates it. A task without a parent (i.e., the kernel task) has its security and audit tokens set to KERNEL_SECURITY_TOKEN and KERNEL_AUDIT_TOKEN, respectively. These are declared in osfmk/ipc/mach_msg.c. As the kernel evolves, it is likely that other types of tokens that include more comprehensive information could be supported.

Figure 9-4 shows an example of how to request the kernel to include the security token in the trailer of a received message.

Figure 9-4. Requesting the kernel to include the sender's security token in the message trailer

typedef struct { // simple message with only an integer as inline data
    mach_msg_header_t           header;
    int                         data;
    mach_msg_security_trailer_t trailer;
} msg_format_recv_t;
...
int
main(int argc, char **argv)
{
    kern_return_t      kr;
    msg_format_recv_t  recv_msg;
    msg_format_send_t  send_msg;
    mach_msg_header_t *recv_hdr, *send_hdr;
    mach_msg_option_t  options;
    ...
    options  = MACH_RCV_MSG | MACH_RCV_LARGE;
    options |= MACH_RCV_TRAILER_TYPE(MACH_MSG_TRAILER_FORMAT_0);

    // the following will include all trailer elements up to the specified one
    options |= MACH_RCV_TRAILER_ELEMENTS(MACH_RCV_TRAILER_SENDER);

    kr = mach_msg(recv_hdr, options, ...);
    ...
    printf("security token = %u %u\n",
           recv_msg.trailer.msgh_sender.val[0],  // sender's user ID
           recv_msg.trailer.msgh_sender.val[1]); // sender's group ID
    ...
}

The MACH_RCV_TRAILER_ELEMENTS() macro is used to encode the number of trailer elements desired. Valid numbers are defined in osfmk/mach/message.h:

#define MACH_RCV_TRAILER_NULL   0 // mach_msg_trailer_t
#define MACH_RCV_TRAILER_SEQNO  1 // mach_msg_seqno_trailer_t
#define MACH_RCV_TRAILER_SENDER 2 // mach_msg_security_trailer_t
#define MACH_RCV_TRAILER_AUDIT  3 // mach_msg_audit_trailer_t

Note that the receive buffer must contain sufficient space to hold the requested trailer type.

In a client-server system, both the client and the server can request the other party's security token to be appended to the incoming message trailer.

An Empty Message Sounds Much

Because of the trailer, the size of the smallest message you can send is different from the size of the smallest message you can receive. On the send side, an empty message consists of only the message header. The receiver must account for a trailer, so the smallest message that can be received consists of a header and the smallest trailer possible.
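The asymmetry between the smallest sendable and receivable messages can be expressed with two stand-in structures. This is a sketch only: the type names are hypothetical, the header is trimmed to the five fields discussed earlier, and the field widths are assumptions rather than the real definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified header and minimal trailer. */
typedef struct {
    uint32_t bits;
    uint32_t size;
    uint32_t remote_port;
    uint32_t local_port;
    int32_t  id;
} demo_header_t;

typedef struct {
    uint32_t type;
    uint32_t size;
} demo_trailer_t;

/* An empty message as sent: header only. */
typedef struct {
    demo_header_t header;
} empty_send_t;

/* An empty message as received: header plus the smallest trailer. */
typedef struct {
    demo_header_t  header;
    demo_trailer_t trailer;
} empty_recv_t;

/* Returns nonzero if a receive buffer of buf_size bytes has room for a
 * message of msg_size bytes (header plus body) and a minimal trailer. */
static int recv_buffer_fits(size_t buf_size, size_t msg_size)
{
    assert(msg_size >= sizeof(demo_header_t));
    return buf_size >= msg_size + sizeof(demo_trailer_t);
}
```

A buffer sized for the sent form of an empty message is therefore too small to receive that same message.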