6.7. System Call Categories

Let us now look at details and examples of the various system call categories, beginning with the variety most familiar from a developer's standpoint: the BSD system calls.

6.7.1. BSD System Calls

shandler() calls unix_syscall() [bsd/dev/ppc/systemcalls.c] to handle BSD system calls. unix_syscall() receives as its argument a pointer to a save area (the process control block). Before we discuss unix_syscall()'s operation, let us look at some relevant data structures and mechanisms.

6.7.1.1. Data Structures

BSD system calls on Mac OS X have numbers that start from zero and go as high as the highest numbered BSD system call. These numbers are defined in <sys/syscall.h>.

// <sys/syscall.h>

#ifdef __APPLE_API_PRIVATE
#define SYS_syscall        0
#define SYS_exit           1
#define SYS_fork           2
#define SYS_read           3
#define SYS_write          4
#define SYS_open           5
#define SYS_close          6
...
#define SYS_MAXSYSCALL 370
#endif

Several system call numbers are reserved or simply unused. In some cases, they may represent calls that have been obsoleted and removed, creating holes in the sequence of implemented system calls.

Note that the zeroth system call, syscall(), is the indirect system call: it allows another system call to be invoked given the latter's number, which is provided as the first argument to syscall(), followed by the actual arguments required by the target system call. The indirect system call has traditionally been used to allow testing (say, from a high-level language like C) of new system calls that do not have stubs in the C library.

// Normal invocation of system call number SYS_foo
ret = foo(arg1, arg2, ..., argN);

// Indirect invocation of foo using the indirect system call
ret = syscall(SYS_foo, arg1, arg2, ..., argN);
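As a concrete instance of the pattern above, the following minimal sketch fetches the process ID through the indirect system call. It assumes a Unix-like system whose C library declares syscall() in <unistd.h> and defines SYS_getpid in <sys/syscall.h>; the helper name indirect_getpid() is ours, and some newer systems mark syscall() as deprecated.

```c
#define _DEFAULT_SOURCE        /* ask glibc to declare syscall(); harmless elsewhere */
#include <sys/syscall.h>       /* SYS_getpid */
#include <sys/types.h>
#include <unistd.h>            /* getpid(), syscall() */

/* Hypothetical helper: obtain the process ID via the indirect system call. */
long indirect_getpid(void)
{
    return (long)syscall(SYS_getpid);
}
```

Calling indirect_getpid() should yield the same value as the library stub getpid(), since both end up in the same kernel handler.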

The syscall.h file is generated during kernel compilation by the bsd/kern/makesyscalls.sh shell script,^[6] which processes the system call master file bsd/kern/syscalls.master. The master file contains a line for each system call number, with the following entities in each column within the line (in this order):

^[6] The script makes heavy use of the awk and sed Unix utilities.

The system call number
The type of cancellation supported by the system call in the case of a thread cancellation: one of PRE (can be canceled on entry itself), POST (can be canceled only after the call is run), or NONE (not a cancellation point)
The type of funnel^[7] to be taken before executing the system call: one of KERN (the kernel funnel) or NONE
^[7] Beginning with Mac OS X 10.4, the network funnel is not used.
The files to which an entry for the system call will be added: either ALL or a combination of T (bsd/kern/init_sysent.c, the system call table), N (bsd/kern/syscalls.c, the table of system call names), H (bsd/sys/syscall.h, system call numbers), and P (bsd/sys/sysproto.h, system call prototypes)
The system call function's prototype
Comments that will be copied to output files

; bsd/kern/syscalls.master
;
; Call# Cancel  Funnel  Files   { Name and Args }      { Comments }
;
...
0       NONE    NONE    ALL     { int nosys(void); }   { indirect syscall }
1       NONE    KERN    ALL     { void exit(int rval); }
2       NONE    KERN    ALL     { int fork(void); }
...
368     NONE    NONE    ALL     { int nosys(void); }
369     NONE    NONE    ALL     { int nosys(void); }
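To make the column layout concrete, here is a small C sketch that splits one master-file line into its first five columns. It is only an illustration of the format: the real makesyscalls.sh does this work with awk and sed, and the structure and function names below (master_entry, parse_master_line) are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical record for one syscalls.master line (field names are ours). */
struct master_entry {
    int  number;       /* system call number */
    char cancel[8];    /* PRE, POST, or NONE */
    char funnel[8];    /* KERN or NONE */
    char files[8];     /* ALL or a combination of T, N, H, P */
    char proto[128];   /* prototype text between the first pair of braces */
};

/* Returns 1 on success, 0 if the line does not match the expected layout. */
int parse_master_line(const char *line, struct master_entry *e)
{
    const char *lbrace, *rbrace;
    size_t len;

    if (sscanf(line, "%d %7s %7s %7s",
               &e->number, e->cancel, e->funnel, e->files) != 4)
        return 0;

    lbrace = strchr(line, '{');
    rbrace = lbrace ? strchr(lbrace, '}') : NULL;
    if (rbrace == NULL)
        return 0;

    /* Trim the braces and the blanks just inside them. */
    lbrace++;
    while (*lbrace == ' ')
        lbrace++;
    while (rbrace > lbrace && rbrace[-1] == ' ')
        rbrace--;

    len = (size_t)(rbrace - lbrace);
    if (len >= sizeof(e->proto))
        return 0;
    memcpy(e->proto, lbrace, len);
    e->proto[len] = '\0';
    return 1;
}
```

Comment lines, which begin with a semicolon, fail the initial numeric match and are rejected.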

The file bsd/kern/syscalls.c contains an array of strings, syscallnames[], that holds each system call's textual name.

// bsd/kern/syscalls.c

const char *syscallnames[] = {
        "syscall",    /* 0 = syscall indirect syscall */
        "exit",       /* 1 = exit */
        "fork",       /* 2 = fork */
        ...
        "#368",       /* 368 = */
        "#369",       /* 369 = */
};

We can examine the contents of the syscallnames[] array (and, for that matter, other kernel data structures) from user space by reading from the kernel memory device /dev/kmem.^[8]

^[8] The /dev/kmem and /dev/mem devices have been removed from the x86 version of Mac OS X. A simple kernel extension can be written to provide /dev/kmem's functionality, allowing experiments such as this one. This book's accompanying web site provides information about writing such a driver.

Running nm on the kernel binary gives us the address of the symbol syscallnames, which we can dereference to access the array.

$ nm /mach_kernel | grep syscallnames
0037f3ac D _syscallnames
$ sudo dd if=/dev/kmem of=/dev/stdout bs=1 count=4 iseek=0x37f3ac | od -x
...
0000000      0032    a8b4
0000004
$ sudo dd if=/dev/kmem of=/dev/stdout bs=1 count=1024 iseek=0x32a8b4 | strings
syscall
exit
fork
...

The file bsd/kern/init_sysent.c contains the system call switch table, sysent[], which is an array of sysent structures, containing one structure for each system call number. This file is generated from the master file during kernel compilation.

// bsd/kern/init_sysent.c

#ifdef __ppc__
#define AC(name) (sizeof(struct name) / sizeof(uint64_t))
#else
#define AC(name) (sizeof(struct name) / sizeof(register_t))
#endif

__private_extern__ struct sysent sysent[] = {
    {
        0,
        _SYSCALL_CANCEL_NONE,
        NO_FUNNEL,
        (sy_call_t *)nosys,
        NULL,
        NULL,
        _SYSCALL_RET_INT_T
    }, /* 0 = nosys indirect syscall */
    {
        AC(exit_args),
        _SYSCALL_CANCEL_NONE,
        KERNEL_FUNNEL,
        (sy_call_t *)exit,
        munge_w,
        munge_d,
        _SYSCALL_RET_NONE
    }, /* 1 = exit */
    ...
    {
        0,
        _SYSCALL_CANCEL_NONE,
        NO_FUNNEL,
        (sy_call_t *)nosys,
        NULL,
        NULL,
        _SYSCALL_RET_INT_T
    }, /* 369 = nosys */
};

int nsysent = sizeof(sysent) / sizeof(sysent[0]);

The sysent structure is declared in bsd/sys/sysent.h.

// bsd/sys/sysent.h

typedef int32_t sy_call_t(struct proc *, void *, int *);
typedef void    sy_munge_t(const void *, void *);

extern struct sysent {
    int16_t     sy_narg;        // number of arguments
    int8_t      sy_cancel;      // how to cancel, if at all
    int8_t      sy_funnel;      // funnel type, if any, to take upon entry
    sy_call_t  *sy_call;        // implementing function
    sy_munge_t *sy_arg_munge32; // arguments munger for 32-bit process
    sy_munge_t *sy_arg_munge64; // arguments munger for 64-bit process
    int32_t     sy_return_type; // return type
} sysent[];

The sysent structure's fields have the following meanings.

sy_narg is the number of arguments (at most eight) taken by the system call. In the case of the indirect system call, the number of arguments is limited to seven since the first argument is dedicated to the target system call's number.
As we saw earlier, a system call specifies whether it can be canceled before execution, after execution, or not at all. The sy_cancel field holds the cancellation type, which is one of _SYSCALL_CANCEL_PRE, _SYSCALL_CANCEL_POST, or _SYSCALL_CANCEL_NONE (corresponding to the PRE, POST, and NONE cancellation specifiers, respectively, in the master file). This feature is used in the implementation of the pthread_cancel(3) library call, which in turn invokes the __pthread_markcancel() [bsd/kern/kern_sig.c] system call to cancel a thread's execution. Most system calls cannot be canceled. Examples of those that can be canceled include read(), write(), open(), close(), recvmsg(), sendmsg(), and select().
The sy_funnel field may contain a funnel type that causes the system call's processing to take (lock) the corresponding funnel before the system call is executed, and drop (unlock) the funnel after it has executed. The possible values for this field in Mac OS X 10.4 are NO_FUNNEL and KERNEL_FUNNEL (corresponding to the NONE and KERN funnel specifiers, respectively, in the master file).
The sy_call field points to the kernel function that implements the system call.
The sy_arg_munge32 and sy_arg_munge64 fields point to functions that are used for munging^[9] system call arguments for 32-bit and 64-bit processes, respectively. We will discuss munging in Section 6.7.1.2.
^[9] Munging a data structure means rewriting or transforming it in some way.
The sy_return_type field contains one of the following to represent the system call's return type: _SYSCALL_RET_NONE, _SYSCALL_RET_INT_T, _SYSCALL_RET_UINT_T, _SYSCALL_RET_OFF_T, _SYSCALL_RET_ADDR_T, _SYSCALL_RET_SIZE_T, and _SYSCALL_RET_SSIZE_T.

Recall that unix_syscall() receives a pointer to the process control block, which is a savearea structure. The system call's arguments are received as saved registers GPR3 through GPR10 in the save area. In the case of an indirect system call, the actual system call arguments start with GPR4, since GPR3 is used for the system call number. unix_syscall() copies these arguments to the uu_arg field within the uthread structure before passing them to the call handler.

// bsd/sys/user.h

struct uthread {
    int       *uu_ar0;     // address of user's saved GPR0
    u_int64_t  uu_arg[8];  // arguments to current system call
    int       *uu_ap;      // pointer to argument list
    int        uu_rval[2]; // system call return values
    ...
};

As we will see in Chapter 7, an xnu thread structure contains a pointer to the thread's user structure, roughly analogous to the user area in BSD. Execution within the xnu kernel refers to several structures, such as the Mach task structure, the Mach thread structure, the BSD process structure, and the BSD uthread structure. The last of these contains several fields used during system call processing.

The U-Area

Historically, the UNIX kernel maintained an entry for every process in a process table, which always remained in memory. Each process was also allocated a user structure, or u-area, that was an extension of the process structure. The u-area contained process-related information that needed to be accessible to the kernel only when the process was executing. Even though the kernel would not swap out a process structure, it could swap out the associated u-area. Over time, the criticality of memory as a resource has gradually lessened, but operating systems have become more complex. Correspondingly, the process structure has grown in size and the u-area has become less important, with much of its information being moved into the process structure.

6.7.1.2. Argument Munging

Note that uu_arg is an array of 64-bit unsigned integers: each element represents a 64-bit register. This is problematic since a parameter passed in a register from 32-bit user space will not map as is to the uu_arg array. For example, a long long parameter will be passed in a single GPR in a 64-bit program, but in two GPRs in a 32-bit program.

unix_syscall() addresses the issue arising from the difference depicted in Figure 6-14 by calling the system call's specified argument munger, which copies arguments from the save area to the uu_arg array while adjusting for the differences.

Figure 6-14. Passing a long long parameter in 32-bit and 64-bit ABIs

$ cat foo.c
extern void bar(long long arg);

void
foo(void)
{
    bar((long long)1);
}
$ gcc -static -S foo.c
$ cat foo.s
...
        li r3,0
        li r4,1
        bl _bar
...
$ gcc -arch ppc64 -static -S foo.c
$ cat foo.s
...
        li r3,1
        bl _bar
...
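The register assignment in the 32-bit case can be mimicked in portable C. The following sketch, with hypothetical names (gpr_pair, split_long_long, join_long_long), models how a 64-bit value is spread over two 32-bit registers, upper half first, and how it is put back together.

```c
#include <stdint.h>

/* Hypothetical model of the two GPRs that carry a long long in a 32-bit ABI. */
struct gpr_pair {
    uint32_t r3;   /* upper 32 bits */
    uint32_t r4;   /* lower 32 bits */
};

/* Split a 64-bit value the way the 32-bit compiler output above does. */
struct gpr_pair split_long_long(long long value)
{
    struct gpr_pair regs;
    uint64_t bits = (uint64_t)value;

    regs.r3 = (uint32_t)(bits >> 32);
    regs.r4 = (uint32_t)(bits & 0xffffffffu);
    return regs;
}

/* Reassemble the original value from the two halves. */
long long join_long_long(struct gpr_pair regs)
{
    return (long long)(((uint64_t)regs.r3 << 32) | regs.r4);
}
```

For the value 1, this yields r3 = 0 and r4 = 1, matching the two li instructions in the 32-bit assembly.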

The munger functions are implemented in bsd/dev/ppc/munge.s. Each function takes two arguments: a pointer to the beginning of the system call parameters within the save area and a pointer to the uu_arg array. A munger function is named munge_<encoding>, where <encoding> is a string that encodes the number and types of system call parameters. <encoding> is a combination of one or more of the d, l, s, and w characters. The characters mean the following:

d represents a 32-bit integer, a 64-bit pointer, or a 64-bit long when the calling process is 64-bit; that is, in each case, the parameter was passed in a 64-bit GPR. Such an argument is munged by copying two words from input to output.
l represents a 64-bit long long passed in two GPRs. Such an argument is munged by skipping a word of input (the upper 32 bits of the first GPR), copying a word of input to output (the lower 32 bits of the first GPR), skipping another word of input, and copying another word from input to output.
s represents a 32-bit signed value. Such an argument is munged by skipping a word of input, loading and sign-extending the next word of input to yield two words, and copying the two words to output.
w represents a 32-bit unsigned value. Such an argument is munged by skipping a word of input, copying a zero word to output, and copying a word from input to output.

Moreover, multiple munger functions are aliased to a common implementation if each function, except one, is a prefix of another. For example, munge_w, munge_ww, munge_www, and munge_wwww are aliased to the same implementation; consequently, four arguments are munged in each case, regardless of the actual number of arguments. Similarly, munge_wwwww, munge_wwwwww, munge_wwwwwww, and munge_wwwwwwww are aliased to the same implementation, whose operation is shown in Figure 6-15.

Figure 6-15. An example of system call argument munging

Consider the example of the read() system call. It takes three arguments: a file descriptor, a pointer to a buffer, and the number of bytes to read.

ssize_t read(int d, void *buf, size_t nbytes);

The 32-bit and 64-bit mungers for the read() system call are munge_www() and munge_ddd(), respectively.
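The per-character rules can be modeled in a few lines of C. The sketch below is a toy, not the assembly in bsd/dev/ppc/munge.s: it represents each saved 64-bit GPR as two 32-bit words (upper word first), interprets an encoding string at run time, and does not reproduce the aliasing of several mungers to one implementation. The name toy_munge is hypothetical, and the caller is assumed to provide enough input words and output slots.

```c
#include <stdint.h>

/*
 * Toy model of an argument munger. 'in' holds the saved GPRs as 32-bit
 * words, upper word first (two words per register); 'out' models uu_arg[].
 * Returns the number of arguments written, or -1 on an unknown character.
 */
int toy_munge(const char *encoding, const uint32_t *in, uint64_t *out)
{
    int n = 0;

    for (; *encoding != '\0'; encoding++, n++) {
        switch (*encoding) {
        case 'd':   /* value passed in a full 64-bit GPR: copy both words */
            out[n] = ((uint64_t)in[0] << 32) | in[1];
            in += 2;
            break;
        case 'l':   /* long long in two GPRs: take the low word of each */
            out[n] = ((uint64_t)in[1] << 32) | in[3];
            in += 4;
            break;
        case 's':   /* signed 32-bit value: sign-extend the low word */
            out[n] = (uint64_t)(int64_t)(int32_t)in[1];
            in += 2;
            break;
        case 'w':   /* unsigned 32-bit value: zero-extend the low word */
            out[n] = in[1];
            in += 2;
            break;
        default:
            return -1;
        }
    }
    return n;
}
```

Munging the three arguments of read() from a 32-bit process then amounts to toy_munge("www", in, out), which ignores whatever happens to be in the upper half of each saved register.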

6.7.1.3. Kernel Processing of BSD System Calls

Figure 6-16 shows pseudocode depicting the working of unix_syscall(), which, as we saw earlier, is called by shandler() to process BSD system calls.

Figure 6-16. Details of the final dispatching of BSD system calls

// bsd/dev/ppc/systemcalls.c

void
unix_syscall(struct savearea *regs)
{
    thread_t        thread_act;
    struct uthread *uthread;
    struct proc    *proc;
    struct sysent  *callp;
    int             error;
    unsigned short  code;
    ...
    // Determine if this is a direct or indirect system call (the "flavor").
    // Set the 'code' variable to either GPR3 or GPR0, depending on flavor.
    ...
    // If kdebug tracing is enabled, log an entry indicating that a BSD
    // system call is starting, unless this system call is kdebug_trace().
    ...
    // Retrieve the current thread and the corresponding uthread structure.
    thread_act = current_thread();
    uthread = get_bsdthread_info(thread_act);
    ...
    // Ensure that the current task has a non-NULL proc structure associated
    // with it; if not, terminate the current task.
    ...
    // uu_ar0 is the address of user's saved GPR0.
    uthread->uu_ar0 = (int *)regs;

    // Use the system call number to retrieve the corresponding sysent
    // structure. If system call number is too large, use the number 63, which
    // is an internal reserved number for a nosys().
    //
    // In early UNIX, the sysent array had space for 64 system calls. The last
    // entry (that is, sysent[63]) was a special system call.
    callp = (code >= nsysent) ? &sysent[63] : &sysent[code];

    if (callp->sy_narg != 0) { // if the call takes one or more arguments
        void       *regsp;
        sy_munge_t *mungerp;

        if (/* this is a 64-bit process */) {
            if (/* this is a 64-bit unsafe call */) {
                // Turn it into a nosys() -- use system call #63 and bail out.
                ...
            }
            // 64-bit argument munger
            mungerp = callp->sy_arg_munge64;
        } else { /* 32-bit process */
            // 32-bit argument munger
            mungerp = callp->sy_arg_munge32;
        }

        // Set regsp to point to either the saved GPR3 in the save area (for a
        // direct system call), or to the saved GPR4 (for an indirect system
        // call). An indirect system call can take at most 7 arguments.
        ...
        // Call the argument munger.
        (*mungerp)(regsp, (void *)&uthread->uu_arg[0]);
    }

    // Evaluate call for cancellation, and cancel, if required and possible.
    ...
    // Take the kernel funnel if the call requires so.
    ...
    // Assume there will be no error.
    error = 0;

    // Increment saved SRR0 by one instruction.
    regs->save_srr0 += 4;

    // Test if this is a kernel trace point -- that is, if system call tracing
    // through ktrace(2) is enabled for this process. If so, write a trace
    // record for this system call.
    ...
    // If auditing is enabled, set up an audit record for the system call.
    ...
    // Call the system call's specific handler.
    error = (*(callp->sy_call))(proc, (void *)uthread->uu_arg,
                                &(uthread->uu_rval[0]));

    // If auditing is enabled, commit the audit record.
    ...
    // Handle return value(s)
    ...
    // If this is a ktrace(2) trace point, write a trace record for the
    // return of this system call.
    ...
    // Drop the funnel if one was taken.
    ...
    // If kdebug tracing is enabled, log an entry indicating that a BSD
    // system call is ending, unless this system call is kdebug_trace().
    ...
    thread_exception_return();
    /* NOTREACHED */
}
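The table lookup at the heart of the pseudocode above, including the fallback for out-of-range numbers, can be tried out in user space. The following sketch uses a three-entry toy table in which entry 2 plays the role that the reserved entry 63 plays in the kernel. All names (toy_sysent, toy_nosys, toy_add, toy_dispatch) are hypothetical, and returning ENOSYS from the stand-in for nosys() is our simplification.

```c
#include <errno.h>
#include <stdint.h>

typedef int toy_call_t(uint64_t *args, int *retval);

struct toy_sysent {
    int         narg;   /* number of arguments */
    toy_call_t *call;   /* implementing function */
};

/* Stand-in for nosys(): report that the call is not implemented. */
static int toy_nosys(uint64_t *args, int *retval)
{
    (void)args;
    (void)retval;
    return ENOSYS;
}

/* Stand-in handler: adds its two arguments. */
static int toy_add(uint64_t *args, int *retval)
{
    *retval = (int)(args[0] + args[1]);
    return 0;
}

#define TOY_FALLBACK 2  /* plays the role of the reserved entry 63 */

static struct toy_sysent toy_table[] = {
    { 0, toy_nosys },   /* 0 */
    { 2, toy_add   },   /* 1 */
    { 0, toy_nosys },   /* 2 = reserved fallback */
};

static const unsigned toy_nsysent = sizeof(toy_table) / sizeof(toy_table[0]);

/* Look up the handler for 'code' and call it, as unix_syscall() does. */
int toy_dispatch(unsigned code, uint64_t *args, int *retval)
{
    struct toy_sysent *callp =
        (code >= toy_nsysent) ? &toy_table[TOY_FALLBACK] : &toy_table[code];

    return callp->call(args, retval);
}
```

Because an out-of-range number is redirected to a valid entry before the call is made, the dispatcher never indexes past the end of the table.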

unix_syscall() potentially performs several types of tracing or logging: kdebug tracing, ktrace(2) tracing, and audit logging. We will discuss kdebug and ktrace(2) later in this chapter.

Arguments are passed packaged into a structure to the call-specific handler. Let us consider the example of the socketpair(2) system call, which takes four arguments: three integers and a pointer to a buffer for holding two integers.

int socketpair(int domain, int type, int protocol, int *rsv);

The bsd/sys/sysproto.h file, which, as noted earlier, is generated by bsd/kern/makesyscalls.sh, contains argument structure declarations for all BSD system calls. Note also the use of left and right padding in the declaration of the socketpair_args structure.

// bsd/sys/sysproto.h

#ifdef __ppc__
#define PAD_(t) (sizeof(uint64_t) <= sizeof(t) \
                         ? 0 : sizeof(uint64_t) - sizeof(t))
#else
...
#endif

#if BYTE_ORDER == LITTLE_ENDIAN
...
#else
#define PADL_(t) PAD_(t)
#define PADR_(t) 0
#endif
...
struct socketpair_args {
    char domain_l_[PADL_(int)]; int domain; char domain_r_[PADR_(int)];
    char type_l_[PADL_(int)]; int type; char type_r_[PADR_(int)];
    char protocol_l_[PADL_(int)]; int protocol; char protocol_r_[PADR_(int)];
    char rsv_l_[PADL_(user_addr_t)]; user_addr_t rsv; \
                                             char rsv_r_[PADR_(user_addr_t)];
};
...
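The effect of the padding is that every argument occupies exactly one 64-bit slot, which is also what lets an AC()-style macro count arguments by dividing the structure size by eight. The following sketch checks this with a simplified structure; it assumes a 32-bit int, uses uint64_t in place of user_addr_t, and omits the zero-length padding members so that it stays within standard C. The names toy_socketpair_args and toy_socketpair_nargs are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes needed to pad a value of type t out to one 64-bit slot. */
#define PAD_(t) (sizeof(uint64_t) <= sizeof(t) \
                     ? 0 : sizeof(uint64_t) - sizeof(t))

/* Argument count: one 64-bit slot per argument. */
#define AC(name) (sizeof(struct name) / sizeof(uint64_t))

/*
 * Big-endian-style layout: left padding only, so each int sits in the
 * low-order (second) half of its slot.
 */
struct toy_socketpair_args {
    char     domain_l_[PAD_(int)];   int domain;
    char     type_l_[PAD_(int)];     int type;
    char     protocol_l_[PAD_(int)]; int protocol;
    uint64_t rsv;                    /* stands in for user_addr_t */
};

/* Number of 64-bit argument slots occupied by the structure. */
size_t toy_socketpair_nargs(void)
{
    return AC(toy_socketpair_args);
}
```

With this layout, offsetof() places domain, type, and protocol at byte offsets 4, 12, and 20, and rsv at offset 24, so the structure lines up with four consecutive uu_arg elements.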

The system call handler function for socketpair(2) retrieves its arguments as fields of the incoming socketpair_args structure.

// bsd/kern/uipc_syscalls.c

// Create a pair of connected sockets
int
socketpair(struct proc            *p,
           struct socketpair_args *uap,
           __unused register_t    *retval)
{
    struct fileproc *fp1, *fp2;
    struct socket   *so1, *so2;
    int fd, error, sv[2];
    ...
    error = socreate(uap->domain, &so1, uap->type, &uap->protocol);
    ...
    error = socreate(uap->domain, &so2, uap->type, &uap->protocol);
    ...
    error = falloc(p, &fp1, &fd);
    ...
    sv[0] = fd;

    error = falloc(p, &fp2, &fd);
    ...
    sv[1] = fd;
    ...
    error = copyout((caddr_t)sv, uap->rsv, 2 * sizeof(int));
    ...
    return (error);
}

Note that before calling the system call handler, unix_syscall() sets the error status to zero, assuming that there will be no error. Recall that the saved SRR0 register contains the address of the instruction immediately following the system call instruction. This is where execution would resume after returning to user space from the system call. As we will shortly see, a standard user-space library stub for a BSD system call invokes the cerror() library function to set the errno variable; this should be done only if there is an error. unix_syscall() increments the saved SRR0 by one instruction, so that the call to cerror() will be skipped if there is no error. If the system call handler indeed does return an error, the SRR0 value is decremented by one instruction.

After returning from the handler, unix_syscall() examines the error variable to take the appropriate action.

If error is ERESTART, this is a restartable system call that needs to be restarted. unix_syscall() decrements SRR0 by 8 bytes (two instructions) to cause execution to resume at the original system call instruction.
If error is EJUSTRETURN, this system call wants to be returned to user space without any further processing of return values.
If error is nonzero, the system call returned an error, which unix_syscall() copies to the saved GPR3 in the process control block. It also decrements SRR0 by one instruction to cause the cerror() routine to be executed upon return to user space.
If error is 0, the system call returned success. unix_syscall() copies the return values from the uthread structure to the saved GPR3 and GPR4 in the process control block. Table 6-10 shows how the return values are handled.

Table 6-10. Handling of BSD System Call Return Values
Call Return Type	Source for GPR3	Source for GPR4
Erroneous return	The `error` variable	Nothing
`_SYSCALL_RET_INT_T`	`uu_rval[0]`	`uu_rval[1]`
`_SYSCALL_RET_UINT_T`	`uu_rval[0]`	`uu_rval[1]`
`_SYSCALL_RET_OFF_T` (32-bit process)	`uu_rval[0]`	`uu_rval[1]`
`_SYSCALL_RET_OFF_T` (64-bit process)	`uu_rval[0]` and `uu_rval[1]` as a single `u_int64_t` value	The value `0`
`_SYSCALL_RET_ADDR_T`	`uu_rval[0]` and `uu_rval[1]` as a single `user_addr_t` value	The value `0`
`_SYSCALL_RET_SIZE_T`	`uu_rval[0]` and `uu_rval[1]` as a single `user_addr_t` value	The value `0`
`_SYSCALL_RET_SSIZE_T`	`uu_rval[0]` and `uu_rval[1]` as a single `user_addr_t` value	The value `0`
`_SYSCALL_RET_NONE`	Nothing	Nothing
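The SRR0 adjustments described above reduce to simple address arithmetic. The sketch below computes where user-space execution resumes for each outcome. The function name and the numeric values chosen for the two kernel-internal pseudo-errors are placeholders of ours, and we assume that SRR0 is left untouched in the EJUSTRETURN case.

```c
#include <stdint.h>

/* Placeholder values for the kernel-internal pseudo-errors (assumptions). */
#define TOY_ERESTART    (-1)
#define TOY_EJUSTRETURN (-2)

#define INSN_SIZE 4u    /* one PowerPC instruction */

/*
 * 'after_sc' is the address of the instruction following sc, as saved in
 * SRR0 on kernel entry. Returns the address at which execution resumes.
 */
uint64_t toy_resume_address(uint64_t after_sc, int error)
{
    /* Optimistic default: skip the branch that leads to cerror(). */
    uint64_t srr0 = after_sc + INSN_SIZE;

    if (error == TOY_ERESTART)
        srr0 -= 2 * INSN_SIZE;      /* back to the sc instruction itself */
    else if (error != 0 && error != TOY_EJUSTRETURN)
        srr0 -= INSN_SIZE;          /* ordinary error: run the cerror() branch */

    return srr0;
}
```

With the sc instruction at address 0x1000, success resumes at 0x1008, an ordinary error resumes at 0x1004, and a restart resumes at 0x1000.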

Finally, to return to user mode, unix_syscall() calls thread_exception_return() [osfmk/ppc/hw_exception.s], which checks for outstanding ASTs. If any ASTs are found, ast_taken() is called. After ast_taken() returns, thread_exception_return() checks for outstanding ASTs one more time (and so on). It then jumps to .L_thread_syscall_return() [osfmk/ppc/hw_exception.s], which branches to chkfac() [osfmk/ppc/hw_exception.s], which branches to exception_exit() [osfmk/ppc/lowmem_vectors.s]. Some of the context is restored during these calls. exception_exit() eventually branches to EatRupt [osfmk/ppc/lowmem_vectors.s], which releases the save area, performs the remaining context restoration and state cleanup, and finally executes the rfid (rfi for 32-bit) instruction to return from the interrupt.

Looking Back at System Calls

The system call mechanism in early UNIX operated similarly in concept to the one we have discussed here: It allowed a user program to call on the kernel by executing the trap instruction in user mode. The low-order byte of the instruction word encoded the system call number. Therefore, in theory, there could be up to 256 system calls. Their handler functions in the kernel were contained in a sysent table whose first entry was the indirect system call. First Edition UNIX (circa November 1971) had fewer than 35 documented system calls. Figure 6-17 shows a code excerpt from Third Edition UNIX (circa February 1973); note that the system call numbers for various system calls are identical to those in Mac OS X.

Figure 6-17. System call data structures in Third Edition UNIX

/* Third Edition UNIX */

/* ken/trap.c */
...
struct {
      int count;
      int (*call)();
} sysent[64];
...

/* ken/sysent.c */
int sysent[] {
    0, &nullsys,      /* 0 = indir */
    0, &rexit,        /* 1 = exit */
    0, &fork,         /* 2 = fork */
    2, &read,         /* 3 = read */
    2, &write,        /* 4 = write */
    2, &open,         /* 5 = open */
    ...
    0, &nosys,        /* 62 = x */
    0, &prproc        /* 63 = special */
...

6.7.1.4. User Processing of BSD System Calls

A typical BSD system call stub in the C library is constructed using a set of macros, some of which are shown in Figure 6-18. The figure also shows an assembly-language fragment for the exit() system call. Note that the assembly code is shown with a static call to cerror() for simplicity, as the invocation is somewhat more complicated in the case of dynamic linking.

Figure 6-18. Creating a user-space system call stub

$ cat testsyscall.h
// for system call numbers
#include <sys/syscall.h>

// taken from <architecture/ppc/mode_independent_asm.h>
#define MI_ENTRY_POINT(name)     \
    .globl  name                @\
    .text                       @\
    .align  2                   @\
name:

#if defined(__DYNAMIC__)
#define MI_BRANCH_EXTERNAL(var)  \
    MI_GET_ADDRESS(r12,var)     @\
    mtctr   r12                 @\
    bctr
#else /* ! __DYNAMIC__ */
#define MI_BRANCH_EXTERNAL(var)  \
    b       var
#endif

// taken from Libc/ppc/sys/SYS.h
#define kernel_trap_args_0
#define kernel_trap_args_1
#define kernel_trap_args_2
#define kernel_trap_args_3
#define kernel_trap_args_4
#define kernel_trap_args_5
#define kernel_trap_args_6
#define kernel_trap_args_7

#define SYSCALL(name, nargs)        \
        .globl  cerror             @\
    MI_ENTRY_POINT(_##name)        @\
        kernel_trap_args_##nargs   @\
        li      r0,SYS_##name      @\
        sc                         @\
        b       1f                 @\
        blr                        @\
1:      MI_BRANCH_EXTERNAL(cerror)

// let us define the stub for SYS_exit
SYSCALL(exit, 1)
$ gcc -static -E testsyscall.h | tr '@' '\n'
...
; indented and annotated for clarity
    .globl cerror
    .globl _exit
    .text
    .align 2
_exit:
     li r0,1    ; load system call number in r0
     sc         ; execute the sc instruction
     b 1f       ; jump over blr, to the cerror call
     blr        ; return
1:   b cerror   ; call cerror, which will also return to the user

The f in the unconditional branch instruction to 1f in Figure 6-18 specifies the direction: forward, in this case. If you have another label named 1 before the branch instruction, you can jump to it using 1b as the operand.

Figure 6-18 also shows the placement of the call to cerror() in the case of an error. When the sc instruction is executed, the processor places the effective address of the instruction following the sc instruction in SRR0. Therefore, the stub is set to call the cerror() function by default after the system call returns. cerror() copies the system call's return value (contained in GPR3) to the errno variable, calls cthread_set_errno_self() to set the per-thread errno value for the current thread, and sets both GPR3 and GPR4 to -1, thereby causing the calling program to receive return values of -1 whether the expected return value is one word (in GPR3) or two words (in GPR3 and GPR4).
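What cerror() does to the return registers can be expressed as a short C model. This is only a sketch: the real routine is written in assembly, the toy_regs structure and toy_cerror() name are hypothetical, and the separate call to cthread_set_errno_self() is not modeled.

```c
#include <errno.h>

/* Hypothetical stand-in for the two return-value registers. */
struct toy_regs {
    long r3;
    long r4;
};

/* Model of cerror(): publish the error code, then make the call return -1. */
void toy_cerror(struct toy_regs *regs)
{
    errno = (int)regs->r3;   /* the kernel left the error number in GPR3 */
    regs->r3 = -1;           /* one-word return value becomes -1 */
    regs->r4 = -1;           /* so does the second word of a two-word value */
}
```

After the call, the caller sees -1 regardless of the width of the expected return value, and can consult errno for the cause.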

Let us now look at an example of directly invoking a system call using the sc instruction. Although doing so is useful for demonstration, a nonexperimental user program should not use the sc instruction directly. The only API-compliant and future-proof way to invoke system calls under Mac OS X is through user libraries. Almost all supported system calls have stubs in the system library (libSystem), of which the standard C library is a subset.

As we noted in Chapter 2, the primary reason system calls must not be invoked directly in user programs (especially shipping products) is that the interfaces between system shared libraries and the kernel are private to Apple and are subject to change. Moreover, user programs are allowed to link with system libraries (including libSystem) only dynamically. This allows Apple flexibility in modifying and extending its private interfaces without affecting user programs.

With that caveat, let us use the sc instruction to invoke a simple BSD system call, say, getpid(). Figure 6-19 shows a program that uses both the library stub and our custom stub to call getpid(). We need an extra instruction (say, a no-op) immediately following the sc instruction; otherwise, the program will behave incorrectly.

Figure 6-19. Directly invoking a BSD system call

// getpid_demo.c

#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>
#include <sys/syscall.h>

pid_t
my_getpid(void)
{
    int syscallnum = SYS_getpid;

    __asm__ volatile(
        "lwz r0,%0\n"
        "sc\n"
        "nop\n" // The kernel will arrange for this to be skipped
      :
      : "g" (syscallnum)
    );

    // GPR3 already has the right return value
    // Compiler warning here because of the lack of a return statement
}

int
main(void)
{
    printf("my pid is %d\n", getpid());
    printf("my pid is %d\n", my_getpid());

    return 0;
}
$ gcc -Wall -o getpid_demo getpid_demo.c
getpid_demo.c: In function 'my_getpid':
getpid_demo.c:24: warning: control reaches end of non-void function
$ ./getpid_demo
my pid is 2345
my pid is 2345
$

Note that since user programs on Mac OS X can only be dynamically linked with Apple-provided libraries, one would expect a user program not to have any sc instructions at all: it should only have dynamically resolved symbols to system call stubs. However, dynamically linked 32-bit C and C++ programs do have a couple of embedded sc instructions that come from the language runtime startup code, specifically, the __dyld_init_check() function.

; dyld.s in the source for the C startup code
/*
 * At this point the dynamic linker initialization was not run so print a
 * message on stderr and exit non-zero.  Since we can't use any libraries the
 * raw system call interfaces must be used.
 *
 *      write(stderr, error_message, sizeof(error_message));
 */
        li      r5,78
        lis     r4,hi16(error_message)
        ori     r4,r4,lo16(error_message)
        li      r3,2
        li      r0,4    ; write() is system call number 4
        sc
        nop             ; return here on error
/*
 *      _exit(59);
 */
        li      r3,59
        li      r0,1    ; exit() is system call number 1
        sc
        trap            ; this call to _exit() should not fall through
        trap

6.7.2. Mach Traps

Although Mach traps are similar to traditional system calls in that they are entry points into the kernel, they are different in that Mach kernel services are typically not offered directly through these traps. Instead, certain Mach traps are IPC entry points through which user-space clients, such as the system library, access kernel services by exchanging IPC messages with the server that implements those services, just as if the server were in user space.

There are almost ten times as many BSD system calls as there are Mach traps.

Consider an example of a simple Mach trap, say, task_self_trap(), which returns send rights to the task's kernel port. The documented mach_task_self() library function is redefined in <mach/mach_init.h> to be the value of the global variable mach_task_self_, which is populated by the system library during the initialization of a user process. Specifically, the library stub for the fork() system call^[10] sets up the child process by calling several initialization routines, including one that initializes Mach in the process. This latter step caches the return value of task_self_trap() in the mach_task_self_ variable.

^[10] We will see how fork() is implemented in Chapter 7.

// <mach/mach_init.h>

extern mach_port_t mach_task_self_;
#define mach_task_self() mach_task_self_
...

The program shown in Figure 6-20 uses several apparently different ways of retrieving the same information: the current task's self port.

Figure 6-20. Multiple ways of retrieving a Mach task's self port

// mach_task_self.c

#include <stdio.h>
#include <mach/mach.h>
#include <mach/mach_traps.h>

int
main(void)
{
    printf("%#x\n", mach_task_self());
#undef mach_task_self
    printf("%#x\n", mach_task_self());
    printf("%#x\n", task_self_trap());
    printf("%#x\n", mach_task_self_);

    return 0;
}
$ gcc -Wall -o mach_task_self mach_task_self.c
$ ./mach_task_self
0x807
0x807
0x807
0x807

The value returned by task_self_trap() is not a unique identifier like a Unix process ID. In fact, its value will be the same for all tasks, even on different machines, provided the machines are running identical kernels.

An example of a complex Mach trap is mach_msg_overwrite_trap() [osfmk/ipc/mach_msg.c], which is used for sending and receiving IPC messages. Its implementation contains over a thousand lines of C code. mach_msg_trap() is a simplified wrapper around mach_msg_overwrite_trap(). The C library provides the documented mach_msg() and mach_msg_overwrite() functions, which use these traps but can also restart message sending or receiving in the case of interruptions.

User programs access kernel services by performing IPC with the kernel using these "msg" traps. The paradigm used is essentially client-server, wherein the clients (programs) request information from the server (the kernel) by sending messages and usually, but not always, receiving replies.

Consider the example of Mach's virtual memory services. As we will see in Chapter 8, a user program can allocate a region of virtual memory using the Mach vm_allocate() function. Now, although vm_allocate() is implemented in the kernel, it is not exported by the kernel as a regular system call. It is available as a remote procedure in the "Kernel Server" and is callable by user clients. The vm_allocate() function that user programs call lives in the C library, representing the client end of the remote procedure call. Various other Mach services, such as those that allow the manipulation of tasks, threads, processors, and ports, are provided similarly.

Mach Interface Generator (MIG)

Implementations of Mach services commonly use the Mach Interface Generator (MIG), which simplifies the task of creating Mach clients and servers by subsuming a considerable portion of frequently used IPC code. MIG accepts a definition file that describes IPC-related interfaces using a predefined syntax. Running the MIG program (/usr/bin/mig) on a definition file generates a C header, a client (user) interface module, and a server interface module. We will see an example of using MIG in Chapter 9. MIG definition files for various kernel services are located in the /usr/include/mach/ directory. A MIG definition file conventionally has a .defs extension.

Mach traps are maintained in an array of structures called mach_trap_table, which is similar to BSD's sysent table. Each element of this array is a structure of type mach_trap_t, which is declared in osfmk/kern/syscall_sw.h. Figure 6-21 shows the MACH_TRAP() macro.

Figure 6-21. Mach trap table data structures and definitions

// osfmk/kern/syscall_sw.h

typedef void mach_munge_t(const void *, void *);

typedef struct {
    int mach_trap_arg_count;
    int (* mach_trap_function)(void);
#if defined(__i386__)
    boolean_t  mach_trap_stack;
#else
    mach_munge_t *mach_trap_arg_munge32;
    mach_munge_t *mach_trap_arg_munge64;
#endif
#if !MACH_ASSERT
    int mach_trap_unused;
#else
    const char * mach_trap_name;
#endif
} mach_trap_t;

#define MACH_TRAP_TABLE_COUNT   128

extern mach_trap_t mach_trap_table[];
extern int         mach_trap_count;
...
#if !MACH_ASSERT
#define MACH_TRAP(name, arg_count, munge32, munge64) \
    { (arg_count), (int (*)(void)) (name), (munge32), (munge64), 0 }
#else
#define MACH_TRAP(name, arg_count, munge32, munge64) \
    { (arg_count), (int (*)(void)) (name), (munge32), (munge64), #name }
#endif
...

The MACH_ASSERT compile-time configuration option controls the ASSERT() and assert() macros and is used while compiling debug versions of the kernel.

The MACH_TRAP() macro shown in Figure 6-21 is used to populate the Mach trap table in osfmk/kern/syscall_sw.c; Figure 6-22 shows how this is done. Mach traps on Mac OS X have numbers that start from -10, decrease monotonically, and go as high in absolute value as the highest numbered Mach trap. Numbers 0 through -9 are reserved for Unix system calls and are unused. Note also that the argument munger functions are the same as those used in BSD system call processing.

Figure 6-22. Mach trap table initialization

// osfmk/kern/syscall_sw.c

mach_trap_t mach_trap_table[MACH_TRAP_TABLE_COUNT] = {
    MACH_TRAP(kern_invalid, 0, NULL, NULL), /* Unix */         /* 0 */
    MACH_TRAP(kern_invalid, 0, NULL, NULL), /* Unix */         /* -1 */
    ...
    MACH_TRAP(kern_invalid, 0, NULL, NULL), /* Unix */         /* -9 */
    MACH_TRAP(kern_invalid, 0, NULL, NULL),                    /* -10 */
    ...
    MACH_TRAP(kern_invalid, 0, NULL, NULL),                    /* -25 */
    MACH_TRAP(mach_reply_port, 0, NULL, NULL),                 /* -26 */
    MACH_TRAP(thread_self_trap, 0, NULL, NULL),                /* -27 */
    ...
    MACH_TRAP(mach_msg_trap, 7, munge_wwwwwww, munge_ddddddd), /* -31 */
    ...
    MACH_TRAP(task_for_pid, 3, munge_www, munge_ddd),          /* -45 */
    MACH_TRAP(pid_for_task, 2, munge_ww, munge_dd),            /* -46 */
    ...
    MACH_TRAP(kern_invalid, 0, NULL, NULL),                    /* -127 */
};

int mach_trap_count = (sizeof(mach_trap_table) / \
                        sizeof(mach_trap_table[0]));
...
kern_return_t
kern_invalid(void)
{
    if (kern_invalid_debug)
        Debugger("kern_invalid mach_trap");
    return KERN_INVALID_ARGUMENT;
}
...

The assembly stubs for Mach traps are defined in osfmk/mach/syscall_sw.h using the machine-dependent kernel_trap() macro defined in osfmk/mach/ppc/syscall_sw.h. Table 6-11 enumerates the key files used in the implementation of these traps.

Table 6-11. Implementing Mach Traps in xnu
File	Contents
`osfmk/kern/syscall_sw.h`	Declaration of the trap table structure
`osfmk/kern/syscall_sw.c`	Population of the trap table; definitions of default error functions
`osfmk/mach/mach_interface.h`	Master header file that includes headers for the various Mach APIs, specifically the kernel RPC functions corresponding to these APIs (the headers are generated from MIG definition files)
`osfmk/mach/mach_traps.h`	Prototypes for traps as seen from user space, including declaration of each trap's argument structure
`osfmk/mach/syscall_sw.h`	Instantiation of traps by defining assembly stubs, using the machine-dependent `kernel_trap()` macro (note that some traps may have different versions for the 32-bit and 64-bit system libraries, whereas some traps may not be available in one of the two libraries)
`osfmk/mach/ppc/syscall_sw.h`	PowerPC definitions of the `kernel_trap()` macro and associated macros; definitions of other PowerPC-only system calls

The kernel_trap() macro takes three arguments for a trap: its name, its index in the trap table, and its argument count.

// osfmk/mach/syscall_sw.h

kernel_trap(mach_reply_port, -26, 0);
kernel_trap(thread_self_trap, -27, 0);
...
kernel_trap(task_for_pid, -45, 3);
kernel_trap(pid_for_task, -46, 2);
...

Let us look at a specific example, say, pid_for_task(), and see how its stub is instantiated. pid_for_task() attempts to find the BSD process ID for the given Mach task. It takes two arguments: the port for a task and a pointer to an integer for holding the returned process ID. Figure 6-23 shows the implementation of this trap.

Figure 6-23. Setting up the `pid_for_task()` Mach trap

// osfmk/mach/syscall_sw.h

kernel_trap(pid_for_task, -46, 2);
...
// osfmk/mach/ppc/syscall_sw.h

#include <mach/machine/asm.h>

#define kernel_trap(trap_name, trap_number, trap_args) \
ENTRY(trap_name, TAG_NO_FRAME_USED) @\
        li      r0,     trap_number @\
        sc      @\
        blr
...
// osfmk/ppc/asm.h

// included from <mach/machine/asm.h>
#define TAG_NO_FRAME_USED 0x00000000
#define EXT(x) _ ## x
#define LEXT(x) _ ## x ## :
#define FALIGN 4
#define MCOUNT
#define Entry(x,tag)    .text@.align FALIGN@ .globl EXT(x)@ LEXT(x)
#define ENTRY(x,tag)    Entry(x,tag)@MCOUNT
...
// osfmk/mach/mach_traps.h

#ifndef KERNEL
extern kern_return_t pid_for_task(mach_port_name_t t, int *x);
...
#else /* KERNEL */
...
struct pid_for_task_args {
    PAD_ARG_(mach_port_name_t, t);
    PAD_ARG_(user_addr_t, pid);
};
extern kern_return_t pid_for_task(struct pid_for_task_args *args);
...
// bsd/vm/vm_unix.c

kern_return_t
pid_for_task(struct pid_for_task_args *args)
{
    mach_port_name_t t = args->t;
    user_addr_t pid_addr = args->pid;
    ...
}

Using the information shown in Figure 6-23, the trap definition for pid_for_task() will have the following assembly stub:

        .text
        .align 4
        .globl _pid_for_task
_pid_for_task:
        li r0,-46
        sc
        blr

Let us test the assembly stub by changing the stub's function name from _pid_for_task to _my_pid_for_task, placing it in a file called my_pid_for_task.S, and using it in a C program. Moreover, we can call the regular pid_for_task() to verify the operation of our stub, as shown in Figure 6-24.

Figure 6-24. Testing the `pid_for_task()` Mach trap

// traptest.c

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>
#include <mach/mach.h>
#include <mach/mach_error.h>

extern kern_return_t my_pid_for_task(mach_port_t, int *);

int
main(void)
{
    pid_t         pid;
    kern_return_t kr;
    mach_port_t   myTask;

    myTask = mach_task_self();

    // call the regular trap
    kr = pid_for_task(myTask, (int *)&pid);
    if (kr != KERN_SUCCESS)
        mach_error("pid_for_task:", kr);
    else
        printf("pid_for_task says %d\n", pid);

    // call our version of the trap
    kr = my_pid_for_task(myTask, (int *)&pid);
    if (kr != KERN_SUCCESS)
        mach_error("my_pid_for_task:", kr);
    else
        printf("my_pid_for_task says %d\n", pid);

    exit(0);
}

$ gcc -Wall -o traptest traptest.c my_pid_for_task.S
$ ./traptest
pid_for_task says 20040
my_pid_for_task says 20040

In general, the handling of Mach traps follows a path in the kernel similar to that of BSD system calls. shandler() identifies Mach traps by virtue of their call numbers being negative. It looks up the trap handler in mach_trap_table and performs the call.

Mach traps in Mac OS X support up to eight parameters that are passed in GPRs 3 through 10. Although mach_msg_overwrite_trap() takes nine parameters, the ninth parameter is not used in practice: in the trap's processing, a zero is passed as the ninth parameter.

6.7.3. I/O Kit Traps

Trap numbers 100 through 107 in the Mach trap table are reserved for I/O Kit traps. In Mac OS X 10.4, only one I/O Kit trap is implemented (but not used): iokit_user_client_trap() [iokit/Kernel/IOUserClient.cpp]. The I/O Kit framework (IOKit.framework) implements the user-space stub for this trap.

6.7.4. PowerPC-Only System Calls

The Mac OS X kernel maintains yet another system call table called PPCcalls, which contains a few special PowerPC-only system calls. PPCcalls is defined in osfmk/ppc/PPCcalls.h. Each of its entries is a pointer to a function that takes one argument (a pointer to a save area) and returns an integer.

// osfmk/ppc/PPCcalls.h

typedef int (*PPCcallEnt)(struct savearea *save);

#define PPCcall(rout) rout
#define dis (PPCcallEnt)0

PPCcallEnt PPCcalls[] = {
    PPCcall(diagCall),         // 0x6000
    PPCcall(vmm_get_version),  // 0x6001
    PPCcall(vmm_get_features), // 0x6002
    ...                        // ...
    PPCcall(dis),
    ...
};
...

Call numbers for the PowerPC system calls begin at 0x6000 and can go up to 0x6FFF; that is, there can be at most 4096 such calls. The assembly stubs for these calls are instantiated in osfmk/mach/ppc/syscall_sw.h.

// osfmk/mach/ppc/syscall_sw.h

#define ppc_trap(trap_name,trap_number) \
ENTRY(trap_name, TAG_NO_FRAME_USED) @\
        li      r0,     trap_number @\
        sc      @\
        blr
...
ppc_trap(diagCall, 0x6000);
ppc_trap(vmm_get_version, 0x6001);
ppc_trap(vmm_get_features, 0x6002);
...

Note that the ppc_trap() macro is similar to the kernel_trap() macro used for defining assembly stubs for Mach traps. shandler() passes most of these calls to ppcscall() [osfmk/ppc/hw_exception.s], which looks up the appropriate handler in the PPCcalls table.

Depending on their purpose, these calls can be categorized as follows:

Calls that are used for low-level performance monitoring, diagnostics, and power management (Table 6-12)

Table 6-12. PowerPC-Only Calls for Performance Monitoring, Diagnostics, and Power Management
Call Number	Call Name	Purpose
`0x6000`	`diagCall`	Calls the routines implemented in the kernel's built-in diagnostics facility (see Section 6.8.8.2)
`0x6009`	`CHUDCall`	Acts as a hook for the Computer Hardware Understanding Development (CHUD) interface; it is disabled to begin with, but is set to a private system call callback function when such a callback is registered by CHUD
`0x600A`	`ppcNull`	Does nothing and simply returns (a null system call); used for performance testing
`0x600B`	`perfmon_control`	Allows manipulation of the PowerPC performance-monitoring facility
`0x600C`	`ppcNullinst`	Does nothing but forces various timestamps to be returned (an instrumented null system call); used for performance testing
`0x600D`	`pmsCntrl`	Controls the Power Management Stepper

Calls that allow a user program to instantiate and control a virtual machine using the kernel's virtual machine monitor (VMM) facility (Table 6-13)

Table 6-13. PowerPC-Only Calls for the Virtual Machine Monitor
Call Number	Call Name	Purpose
`0x6001`	`vmm_get_version`	Retrieves the VMM facility's version
`0x6002`	`vmm_get_features`	Retrieves the VMM facility's supported features
`0x6003`	`vmm_init_context`	Initializes a new VMM context
`0x6004`	`vmm_dispatch`	Used as an indirect system call for dispatching various VMM system calls; it is also an ultra-fast trap (see Section 6.7.5)
`0x6008`	`vmm_stop_vm`	Stops a running virtual machine

Calls that provide kernel assistance to the Blue Box (Classic) environment (Table 6-14)

Table 6-14. PowerPC-Only Calls for the Blue Box
Call Number	Call Name	Purpose
`0x6005`	`bb_enable_bluebox`	Enables a thread for use in the Blue Box virtual machine
`0x6006`	`bb_disable_bluebox`	Disables a thread for use in the Blue Box virtual machine
`0x6007`	`bb_settaskenv`	Sets the Blue Box per-thread task environment data

6.7.5. Ultra-Fast Traps

Certain traps are handled entirely by the low-level exception handlers in osfmk/ppc/lowmem_vectors.s, without saving or restoring much (or any) state. Such traps also return from the system call interrupt very rapidly. These are the ultra-fast traps (UFTs). As shown in Figure 6-13, these calls have dedicated handlers in the scTable, from where the exception vector at 0xC00 loads them. Table 6-15 lists the ultra-fast traps.

Table 6-15. Ultra-Fast Traps
Call Number	Association	Purpose
`0xFFFF_FFFE`	Blue Box only	Determines whether the given Blue Box task is preemptive, and also loads GPR0 with the shadowed task environment (`MkIsPreemptiveTaskEnv`)
`0xFFFF_FFFF`	Blue Box only	Determines whether the given Blue Box task is preemptive (`MkIsPreemptiveTask`)
`0x8000_0000`	`CutTrace` firmware call	Used for low-level tracing (see Section 6.8.9.2)
`0x6004`	`vmm_dispatch`	Treats certain calls (those belonging to a specific range of selectors supported by this dispatcher call) as ultra-fast traps, which are eventually handled by `vmm_ufp()` [`osfmk/ppc/vmachmon_asm.s`]
`0x7FF2`	User only	Returns the `pthread_self` value, i.e., the thread-specific pointer (Thread Info UFT)
`0x7FF3`	User only	Returns floating-point and AltiVec facility status, i.e., whether they are being used by the current thread (Facility Status UFT)
`0x7FF4`	Kernel only	Loads the Machine State Register; not used on 64-bit hardware (Load MSR UFT)

A comm area (see Section 6.7.6) routine uses the Thread Info UFT for retrieving the thread-specific (self) pointer, which is also called the per-thread cookie. The pthread_self(3) library function retrieves this value. The following assembly stub, which directly uses the UFT, retrieves the same value as the pthread_self() function in a user program.

; my_pthread_self.S

        .text
        .globl _my_pthread_self
_my_pthread_self:
        li r0,0x7FF2
        sc
        blr

Note that on certain PowerPC processors (for example, the 970 and the 970FX) the special-purpose register SPRG3, which Mac OS X uses to hold the per-thread cookie, can be read from user space.

; my_pthread_self_970.S

        .text
        .globl _my_pthread_self_970
_my_pthread_self_970:
        mfspr r3,259 ; 259 is user SPRG3
        blr

Let us test our versions of pthread_self() by using them in a 32-bit program on both a G4 and a G5, as shown in Figure 6-25.

Figure 6-25. Testing the Thread Info UFT

$ cat main.c
#include <stdio.h>
#include <pthread.h>

extern pthread_t my_pthread_self();
extern pthread_t my_pthread_self_970();

int
main(void)
{
    printf("library: %p\n", pthread_self());        // call library function
    printf("UFT    : %p\n", my_pthread_self());     // use 0x7FF2 UFT
    printf("SPRG3  : %p\n", my_pthread_self_970()); // read from SPRG3

    return 0;
}
$ machine
ppc970
$ gcc -Wall -o my_pthread_self main.c my_pthread_self.S my_pthread_self_970.S
$ ./my_pthread_self
library: 0xa000ef98
UFT    : 0xa000ef98
SPRG3  : 0xa000ef98
$ machine
ppc7450
$ ./my_pthread_self
library: 0xa000ef98
UFT    : 0xa000ef98
zsh: illegal hardware instruction  ./f

The Facility Status UFT can be used to determine which processor facilities (such as floating-point and AltiVec) are being used by the current thread. The following function, which directly uses the UFT, will return with a word whose bits specify the processor facilities in use.

; my_facstat.S

        .text
        .globl _my_facstat
_my_facstat:
        li r0,0x7FF3
        sc
        blr

The program in Figure 6-26 initializes a vector variable only if you run it with one or more arguments on the command line. Therefore, it should report that AltiVec is being used only if you run it with an argument.

Figure 6-26. Testing the Facility Status UFT

// isvector.c

#include <stdio.h>

// defined in osfmk/ppc/thread_act.h
#define vectorUsed 0x20000000
#define floatUsed  0x40000000
#define runningVM  0x80000000

extern int my_facstat(void);

int
main(int argc, char **argv)
{
    int facstat;
    vector signed int c;

    if (argc > 1)
        c  = (vector signed int){ 1, 2, 3, 4 };

    facstat = my_facstat();
    printf("%s\n", (facstat & vectorUsed) ? \
           "vector used" : "vector not used");

    return 0;
}

$ gcc -Wall -o isvector isvector.c my_facstat.S
$ ./isvector
vector not used
$ ./isvector usevector
vector used

6.7.5.1. Fast Traps

A few other traps that need somewhat more processing than ultra-fast traps, or are not as beneficial to handle so urgently, are handled by shandler() in osfmk/ppc/hw_exception.s. These are called fast traps, or fastpath calls. Table 6-16 lists the fastpath calls. Figure 6-12 shows the handling of both ultra-fast and fast traps.

Table 6-16. Fastpath System Calls
Call Number	Call Name	Purpose
`0x7FF1`	`CthreadSetSelf`	Sets a thread's identifier. This call is used by the Pthread library to implement `pthread_set_self()`, which is used during thread creation.
`0x7FF5`	Null fastpath	Does nothing. It branches straight to `exception_exit()` in `lowmem_vectors.s`.
`0x7FFA`	Blue Box interrupt notification	Results in the invocation of `syscall_notify_interrupt()` [`osfmk/ppc/PseudoKernel.c`], which queues an interrupt for the Blue Box and sets an asynchronous procedure call (APC) AST. The Blue Box interrupt handler, `bbsetRupt()` [`osfmk/ppc/PseudoKernel.c`], runs asynchronously to handle the interrupt.

6.7.5.2. Blue Box Calls

The Mac OS X kernel includes support code for the Blue Box virtualizer that provides the Classic runtime environment. The support is implemented as a small layer of software called the PseudoKernel, whose functionality is exported via a set of fast/ultra-fast system calls. We came across these calls in Tables 6-14, 6-15, and 6-16.

The truBlueEnvironment program, which resides within the Resources subdirectory of the Classic application package (Classic Startup.app), directly uses the 0x6005 (bb_enable_bluebox), 0x6006 (bb_disable_bluebox), 0x6007 (bb_settaskenv), and 0x7FFA (interrupt notification) system calls.

A specially designated thread, the Blue thread, runs Mac OS while handling Blue Box interrupts, traps, and system calls. Other threads can only issue system calls. The bb_enable_bluebox() [osfmk/ppc/PseudoKernel.c] PowerPC-only system call is used to enable the support code in the kernel. It receives three arguments from the user-space caller: a task identifier, a pointer to the trap table (TWI_TableStart), and a pointer to a descriptor table (Desc_TableStart). bb_enable_bluebox() passes these arguments in a call to enable_bluebox() [osfmk/ppc/PseudoKernel.c], which aligns the passed-in descriptor address to a page, wires the page, and maps it into the kernel. The page holds a BlueThreadTrapDescriptor structure (BTTD_t), which is declared in osfmk/ppc/PseudoKernel.h. Thereafter, enable_bluebox() initializes several Blue Box-related fields of the thread's machine-specific state (the machine_thread structure). Figure 6-27 shows pseudocode depicting the operation of enable_bluebox().

Figure 6-27. Enabling the kernel's Blue Box support

// osfmk/ppc/thread.h

struct machine_thread {
    ...
    // Points to Blue Box Trap descriptor area in kernel (page aligned)
    unsigned int bbDescAddr;

    // Points to Blue Box Trap descriptor area in user (page aligned)
    unsigned int bbUserDA;

    unsigned int bbTableStart;// Points to Blue Box Trap dispatch area in user
    unsigned int emPendRupts; // Number of pending emulated interruptions
    unsigned int bbTaskID;    // Opaque task ID for Blue Box threads
    unsigned int bbTaskEnv;   // Opaque task data reference for Blue Box threads
    unsigned int specFlags;   // Special flags
    ...
    unsigned int bbTrap;      // Blue Box trap vector
    unsigned int bbSysCall;   // Blue Box syscall vector
    unsigned int bbInterrupt; // Blue Box interrupt vector
    unsigned int bbPending;   // Blue Box pending interrupt vector
    ...
};

// osfmk/ppc/PseudoKernel.c

kern_return_t
enable_bluebox(host_t host, void *taskID, void *TWI_TableStart,
               char *Desc_TableStart)
{
    thread_t       th;
    vm_offset_t    kerndescaddr, origdescoffset;
    kern_return_t  ret;
    ppnum_t        physdescpage;
    BTTD_t        *bttd;

    th = current_thread(); // Get our thread.

    // Ensure descriptor is non-NULL.

    // Get page offset of the descriptor in 'origdescoffset'.

    // Now align descriptor to a page.

    // Kernel wire the descriptor in the user's map.

    // Map the descriptor's physical page into the kernel's virtual address
    // space, calling the resultant address 'kerndescaddr'. Set the 'bttd'
    // pointer to 'kerndescaddr'.

    // Set the thread's Blue Box machine state.

    // Kernel address of the table
    th->machine.bbDescAddr = (unsigned int)kerndescaddr + origdescoffset;

    // User address of the table
    th->machine.bbUserDA = (unsigned int)Desc_TableStart;

    // Address of the trap table
    th->machine.bbTableStart = (unsigned int)TWI_TableStart;
    ...
    // Remember trap vector.
    th->machine.bbTrap = bttd->TrapVector;

    // Remember syscall vector.
    th->machine.bbSysCall = bttd->SysCallVector;

    // Remember pending interrupt vector.
    th->machine.bbPending = bttd->PendingIntVector;

    // Ensure Mach system calls are enabled and we are not marked preemptive.
    th->machine.specFlags &= ~(bbNoMachSC | bbPreemptive);

    // Set that we are the Classic thread.
    th->machine.specFlags |= bbThread;
    ...
}

Once the Blue Box trap and system call tables are established, the PseudoKernel can be invoked^[11] while changing Blue Box interruption state atomically. Both thandler() and shandler() check for the Blue Box during trap and system call processing, respectively.

^[11] The PseudoKernel can be invoked both from PowerPC (native) and 68K (system) contexts.

thandler() checks the specFlags field of the current activation's machine_thread structure to see if the bbThread bit is set. If the bit is set, thandler() calls checkassist() [osfmk/ppc/hw_exception.s], which checks whether all the following conditions hold true.

The SRR1_PRG_TRAP_BIT bit^[12] of SRR1 specifies that this is a trap.
^[12] The kernel uses bit 24 of SRR1 for this purpose. This reserved bit can be implementation-defined.
The trapped address is in user space.
This is not an AST; that is, the trap type is not a T_AST.
The trap number is not out of range; that is, it is not more than a predefined maximum.

If all of these conditions are satisfied, checkassist() branches to atomic_switch_trap() [osfmk/ppc/atomic_switch.s], which loads the trap table (the bbTrap field of the machine_thread structure) in GPR5 and jumps to .L_CallPseudoKernel() [osfmk/ppc/atomic_switch.s].

shandler() checks whether system calls are being redirected to the Blue Box by examining the value of the bbNoMachSC bit of the specFlags field. If this bit is set, shandler() calls atomic_switch_syscall() [osfmk/ppc/atomic_switch.s], which loads the system call table (the bbSysCall field of the machine_thread structure) in GPR5 and falls through to .L_CallPseudoKernel().

In both cases, .L_CallPseudoKernel(), among other things, stores the vector contained in GPR5 in the saved SRR0 as the instruction at which execution will resume. Thereafter, it jumps to fastexit() [osfmk/ppc/hw_exception.s], which jumps to exception_exit() [osfmk/ppc/lowmem_vectors.s], thus causing a return to the caller.

A particular Blue Box trap value (bbMaxTrap) is used to simulate a return-from-interrupt from the PseudoKernel to user context. Returning Blue Box traps and system calls use this trap, which results in the invocation of .L_ExitPseudoKernel() [osfmk/ppc/atomic_switch.s].

6.7.6. The Commpage

The kernel reserves the last eight pages of every address space for the kernel-user comm area, also referred to as the commpage. Besides being wired in kernel memory, these pages are mapped (shared and read-only) into the address space of every process. Their contents include code and data that are frequently accessed systemwide. The following are examples of commpage contents:

Specifications of processor features available on the machine, such as whether the processor is 64-bit, what the cache-line size is, and whether AltiVec is present
Frequently used routines, such as functions for copying, moving, and zeroing memory; for using spinlocks; for flushing the data cache and invalidating the instruction cache; and for retrieving the per-thread cookie
Various time-related values maintained by the kernel, allowing the current seconds and microseconds to be retrieved by user programs without making system calls

There are separate comm areas for 32-bit and 64-bit address spaces, although they are conceptually similar. We will discuss only the 32-bit comm area in this section.

Using the end of the address space for the comm area has an important benefit: It is possible to access both code and data in the comm area from anywhere in the address space, without involving the dynamic link editor or requiring complex address calculations. Absolute branch instructions, such as ba, bca, and bla, can branch to a location in the comm area from anywhere because they have enough bits in their target address encoding fields to allow them to reach the comm area pages using a sign-extended target address specification. Similarly, absolute loads and stores can comfortably access the comm area. Consequently, accessing the comm area is both efficient and convenient.

The comm area is populated during kernel initialization in a processor-specific and platform-specific manner. commpage_populate() [osfmk/ppc/commpage/commpage.c] performs this initialization. In fact, functionality contained in the comm area can be considered as processor capabilities: a software extension to the native instruction set. Various comm-area-related constants are defined in osfmk/ppc/cpu_capabilities.h.

// osfmk/ppc/cpu_capabilities.h

// Start at page -8, ie 0xFFFF8000
#define _COMM_PAGE_BASE_ADDRESS (-8*4096)

// Reserved length of entire comm area
#define _COMM_PAGE_AREA_LENGTH  (7*4096)

// Mac OS X uses two pages so far
#define _COMM_PAGE_AREA_USED    (2*4096)

// The Objective-C runtime fixed address page to optimize message dispatch
#define OBJC_PAGE_BASE_ADDRESS  (-20*4096)

// Data in the comm page
...
// Code in the comm page (routines)
...
// Used by gettimeofday()
#define _COMM_PAGE_GETTIMEOFDAY \
                               (_COMM_PAGE_BASE_ADDRESS+0x2e0)
...

The comm area's actual maximum length is seven pages (not eight) since Mach's virtual memory subsystem does not map the last page of an address space.

Each routine in the commpage is described by a commpage_descriptor structure, which is declared in osfmk/ppc/commpage/commpage.h.

// osfmk/ppc/commpage/commpage.h

typedef struct commpage_descriptor {
    short code_offset;      // offset to code from this descriptor
    short code_length;      // length in bytes
    short commpage_address; // put at this address
    short special;          // special handling bits for DCBA, SYNC, etc.
    long  musthave;         // _cpu_capability bits we must have
    long  canthave;         // _cpu_capability bits we cannot have
} commpage_descriptor;

Implementations of the comm area routines are in the osfmk/ppc/commpage/ directory. Let us look at the example of gettimeofday(), which is both a system call and a comm area routine. It is substantially more expensive to retrieve the current time using the system call. Besides a regular system call stub for gettimeofday(), the C library contains the following entry point for calling the comm area version of gettimeofday().

        .globl __commpage_gettimeofday
        .text
        .align 2
__commpage_gettimeofday:
        ba _COMM_PAGE_GETTIMEOFDAY

Note that _COMM_PAGE_GETTIMEOFDAY is a leaf procedure that must be jumped to, instead of being called as a returning function.

Note that comm area contents are not guaranteed to be available on all machines. Moreover, in the particular case of gettimeofday(), the time values are updated asynchronously by the kernel and read atomically from user space, leading to occasional failures in reading. The C library falls back to the system call version in the case of failure.

// <darwin>/<Libc>/sys/gettimeofday.c

int
gettimeofday(struct timeval *tp, struct timezone *tzp)
{
    ...
#if defined(__ppc__) || defined(__ppc64__)
    {
        ...
        // first try commpage
        if (__commpage_gettimeofday(tp)) {
            // if it fails, try the system call
            if (__ppc_gettimeofday(tp,tzp)) {
                return (-1);
            }
        }
    }
#else
    if (syscall(SYS_gettimeofday, tp, tzp) < 0) {
        return -1;
    }
#endif
    ...
}

Since the comm area is readable from within every process, let us write a program to display the information contained in it. Since the comm area API is private, you must include the required headers from the kernel source tree rather than a standard header directory. The program shown in Figure 6-28 displays the data and routine descriptors contained in the 32-bit comm area.

Figure 6-28. Displaying the contents of the comm area

// commpage32.c

#include <stdio.h>
#include <stdlib.h>
#include <inttypes.h>

#define PRIVATE
#define KERNEL_PRIVATE

#include <machine/cpu_capabilities.h>
#include <machine/commpage.h>

#define WSPACE_FMT_SZ "24"
#define WSPACE_FMT "%-" WSPACE_FMT_SZ "s = "

#define CP_CAST_TO_U_INT32(x)  (u_int32_t)(*(u_int32_t *)(x))
#define ADDR2DESC(x)           (commpage_descriptor *)&(CP_CAST_TO_U_INT32(x))

#define CP_PRINT_U_INT8_BOOL(label, item) \
    printf(WSPACE_FMT "%s\n", label, \
        ((u_int8_t)(*(u_int8_t *)(item))) ? "yes" : "no")

#define CP_PRINT_U_INT16(label, item) \
    printf(WSPACE_FMT "%hd\n", label, (u_int16_t)(*(u_int16_t *)(item)))

#define CP_PRINT_U_INT32(label, item) \
    printf(WSPACE_FMT "%u\n", label, (u_int32_t)(*(u_int32_t *)(item)))

#define CP_PRINT_U_INT64(label, item) \
    printf(WSPACE_FMT "%#llx\n", label, (u_int64_t)(*(u_int64_t *)(item)))

#define CP_PRINT_D_FLOAT(label, item) \
    printf(WSPACE_FMT "%lf\n", label, (double)(*(double *)(item)))

const char *
cpuCapStrings[] = {
#if defined (__ppc__)
    "kHasAltivec",             // << 0
    "k64Bit",                  // << 1
    "kCache32",                // << 2
    "kCache64",                // << 3
    "kCache128",               // << 4
    "kDcbaRecommended",        // << 5
    "kDcbaAvailable",          // << 6
    "kDataStreamsRecommended", // << 7
    "kDataStreamsAvailable",   // << 8
    "kDcbtStreamsRecommended", // << 9
    "kDcbtStreamsAvailable",   // << 10
    "kFastThreadLocalStorage", // << 11
#else /* __i386__ */
    "kHasMMX",                 // << 0
    "kHasSSE",                 // << 1
    "kHasSSE2",                // << 2
    "kHasSSE3",                // << 3
    "kCache32",                // << 4
    "kCache64",                // << 5
    "kCache128",               // << 6
    "kFastThreadLocalStorage", // << 7
    NULL,                      // << 8
    NULL,                      // << 9
    NULL,                      // << 10
    NULL,                      // << 11
#endif
    NULL,                      // << 12
    NULL,                      // << 13
    NULL,                      // << 14
    "kUP",                     // << 15
    NULL,                      // << 16
    NULL,                      // << 17
    NULL,                      // << 18
    NULL,                      // << 19
    NULL,                      // << 20
    NULL,                      // << 21
    NULL,                      // << 22
    NULL,                      // << 23
    NULL,                      // << 24
    NULL,                      // << 25
    NULL,                      // << 26
    "kHasGraphicsOps",         // << 27
    "kHasStfiwx",              // << 28
    "kHasFsqrt",               // << 29
    NULL,                      // << 30
    NULL,                      // << 31
};

void print_bits32(u_int32_t);
void print_cpu_capabilities(u_int32_t);
void print_commpage_descriptor(const char *, u_int32_t);

void
print_bits32(u_int32_t u)
{
    u_int32_t i;

    // print the most significant bit first
    for (i = 32; i--; putchar((u & (1u << i)) ? '1' : '0'));
}

void
print_cpu_capabilities(u_int32_t cap)
{
    int i;

    printf(WSPACE_FMT, "cpu capabilities (bits)");
    print_bits32(cap);
    printf("\n");
    for (i = 0; i < 31; i++)
        if (cpuCapStrings[i] && (cap & (1 << i)))
            printf("%-" WSPACE_FMT_SZ "s  + %s\n", " ", cpuCapStrings[i]);
}

void
print_commpage_descriptor(const char *label, u_int32_t addr)
{
    commpage_descriptor *d = ADDR2DESC(addr);

    printf("%s @ %08x\n", label, addr);
#if defined (__ppc__)
    printf("  code_offset      = %hd\n", d->code_offset);
    printf("  code_length      = %hd\n", d->code_length);
    printf("  commpage_address = %hx\n", d->commpage_address);
    printf("  special          = %#hx\n", d->special);
#else /* __i386__ */
    printf("  code_address     = %p\n", d->code_address);
    printf("  code_length      = %ld\n", d->code_length);
    printf("  commpage_address = %#lx\n", d->commpage_address);
#endif
    printf("  musthave         = %#lx\n", d->musthave);
    printf("  canthave         = %#lx\n", d->canthave);
}

int
main(void)
{
    u_int32_t u;

    printf(WSPACE_FMT "%#08x\n", "base address", _COMM_PAGE_BASE_ADDRESS);
    printf(WSPACE_FMT "%s\n", "signature", (char *)_COMM_PAGE_BASE_ADDRESS);
    CP_PRINT_U_INT16("version", _COMM_PAGE_VERSION);

    u = CP_CAST_TO_U_INT32(_COMM_PAGE_CPU_CAPABILITIES);
    printf(WSPACE_FMT "%u\n", "number of processors",
          (u & kNumCPUs) >> kNumCPUsShift);
    print_cpu_capabilities(u);
    CP_PRINT_U_INT16("cache line size", _COMM_PAGE_CACHE_LINESIZE);

#if defined (__ppc__)
    CP_PRINT_U_INT8_BOOL("AltiVec available?", _COMM_PAGE_ALTIVEC);
    CP_PRINT_U_INT8_BOOL("64-bit processor?", _COMM_PAGE_64_BIT);
#endif

    CP_PRINT_D_FLOAT("two52 (2^52)", _COMM_PAGE_2_TO_52);
    CP_PRINT_D_FLOAT("ten6 (10^6)", _COMM_PAGE_10_TO_6);
    CP_PRINT_U_INT64("timebase", _COMM_PAGE_TIMEBASE);
    CP_PRINT_U_INT32("timestamp (s)", _COMM_PAGE_TIMESTAMP);
    CP_PRINT_U_INT32("timestamp (us)", _COMM_PAGE_TIMESTAMP + 
0x04);     CP_PRINT_U_INT64("seconds per tick", _COMM_PAGE_SEC_PER_TICK);     printf("\n");     printf(WSPACE_FMT "%s", "descriptors", "\n");     // example descriptor     print_commpage_descriptor("  mach_absolute_time()",                               _COMM_PAGE_ABSOLUTE_TIME);     exit(0); } $ gcc -Wall -I /path/to/xnu/osfmk/ -o commpage32 commpage32.c $ ./commpage32 base address             = 0xffff8000 signature                = commpage 32-bit version                  = 2 number of processors     = 2 cpu capabilities (bits)  = 00111000000000100000011100010011                            + kHasAltivec                            + k64Bit                            + kCache128                            + kDataStreamsAvailable                            + kDcbtStreamsRecommended                            + kDcbtStreamsAvailable                            + kFastThreadLocalStorage                            + kHasGraphicsOps                            + kHasStfiwx                            + kHasFsqrt cache line size          = 128 AltiVec available?       = yes 64-bit processor?        = yes two52 (2^52)             = 4503599627370496.000000 ten6 (10^6)              = 1000000.000000 timebase                 = 0x18f0d27c48c timestamp (s)            = 1104103731 timestamp (us)           = 876851 seconds per tick         = 0x3e601b8f3f3f8d9b descriptors              =   mach_absolute_time() @ ffff8200   code_offset      = 31884   code_length      = 17126   commpage_address = 7883   special          = 0x22   musthave         = 0x4e800020   canthave         = 0
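
The capabilities word in the sample run packs single-bit feature flags together with a multi-bit processor-count field, which the program extracts with (u & kNumCPUs) >> kNumCPUsShift. The following is a minimal, portable sketch of that decoding that does not read the comm page at all. The mask 0x00FF0000 and shift 16 are assumptions chosen to agree with the sample output; the authoritative values are those of kNumCPUs and kNumCPUsShift in <machine/cpu_capabilities.h> for your xnu version. The sketch_ names are hypothetical and exist only for this illustration.

```c
#include <stdint.h>

/* Assumed stand-ins for kNumCPUs and kNumCPUsShift; verify against
 * <machine/cpu_capabilities.h> before relying on them. */
#define SKETCH_NUM_CPUS_MASK  0x00FF0000u
#define SKETCH_NUM_CPUS_SHIFT 16

/* Extract the processor-count field from a capabilities word. */
unsigned
sketch_num_cpus(uint32_t cap)
{
    return (cap & SKETCH_NUM_CPUS_MASK) >> SKETCH_NUM_CPUS_SHIFT;
}

/* Return 1 if bit number `bit` (0 through 31) is set in cap, else 0. */
int
sketch_bit_is_set(uint32_t cap, unsigned bit)
{
    return bit < 32 && ((cap >> bit) & 1u);
}

/* Render cap in binary, most significant bit first.
 * buf must have room for 32 digits plus the terminating NUL. */
void
sketch_bits32(uint32_t cap, char buf[33])
{
    unsigned i;

    for (i = 0; i < 32; i++)
        buf[i] = ((cap >> (31 - i)) & 1u) ? '1' : '0';
    buf[32] = '\0';
}
```

The bit string printed in the sample run corresponds to the value 0x38020713. Under the assumed mask and shift, sketch_num_cpus(0x38020713) evaluates to 2, which agrees with the "number of processors" line in the output, and sketch_bits32() reproduces the same 32-character string.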