Let us now look at details and examples of the various system call categories, beginning with the most staple variety from a developer's standpoint: the BSD system calls.
BSD system calls on Mac OS X have numbers that start from zero and go as high as the highest numbered BSD system call. These numbers are defined in <sys/syscall.h>.
Several system call numbers are reserved or simply unused. In some cases, they may represent calls that have been obsoleted and removed, creating holes in the sequence of implemented system calls.
Note that the zeroth system call, syscall(), is the indirect system call: It allows another system call to be invoked given the latter's number, which is provided as the first argument to syscall(), followed by the actual arguments required by the target system call. The indirect system call has traditionally been used to allow testing (say, from a high-level language like C) of new system calls that do not have stubs in the C library.
// Normal invocation of system call number SYS_foo
ret = foo(arg1, arg2, ..., argN);
// Indirect invocation of foo using the indirect system call
ret = syscall(SYS_foo, arg1, arg2, ..., argN);
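As a concrete instance of the pattern above, the following minimal sketch fetches the process ID through the indirect call instead of the getpid() stub. The helper name indirect_getpid() is ours, and the _DEFAULT_SOURCE definition is only an assumption about what some C libraries need before they declare syscall(); the call itself takes the system call number first, as described.

```c
/* Assumption: some C libraries declare syscall() only when a feature
 * macro such as this one is defined before the first include. */
#define _DEFAULT_SOURCE
#include <assert.h>
#include <sys/syscall.h>  /* SYS_getpid */
#include <sys/types.h>    /* pid_t */
#include <unistd.h>       /* getpid(), syscall() */

/* Hypothetical helper: obtain the process ID via the indirect system call. */
pid_t indirect_getpid(void)
{
    long ret = syscall(SYS_getpid);  /* number first; getpid takes no arguments */

    assert(ret > 0);                 /* getpid() is not expected to fail */
    return (pid_t)ret;
}
```

Calling indirect_getpid() and getpid() from the same process should yield the same value, since both reach the same kernel handler.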
The syscall.h file is generated during kernel compilation by the bsd/kern/makesyscalls.sh shell script, which processes the system call master file bsd/kern/syscalls.master. The master file contains a line for each system call number, with the following entities in each column within the line (in this order):
- The system call number
- The type of cancellation supported by the system call in the case of a thread cancellation: one of PRE (can be canceled on entry itself), POST (can be canceled only after the call is run), or NONE (not a cancellation point)
- The type of funnel to be taken before executing the system call: one of KERN (the kernel funnel) or NONE
- The files to which an entry for the system call will be added: either ALL or a combination of T (bsd/kern/init_sysent.c, the system call table), N (bsd/kern/syscalls.c, the table of system call names), H (bsd/sys/syscall.h, system call numbers), and P (bsd/sys/sysproto.h, system call prototypes)
- The system call function's prototype
- Comments that will be copied to output files
; bsd/kern/syscalls.master
;
; Call# Cancel Funnel Files { Name and Args } { Comments }
;
...
0 NONE NONE ALL { int nosys(void); } { indirect syscall }
1 NONE KERN ALL { void exit(int rval); }
2 NONE KERN ALL { int fork(void); }
...
368 NONE NONE ALL { int nosys(void); }
369 NONE NONE ALL { int nosys(void); }
The file bsd/kern/syscalls.c contains an array of strings, syscallnames[], that holds each system call's textual name.
// bsd/kern/syscalls.c
const char *syscallnames[] = {
    "syscall", /* 0 = syscall indirect syscall */
    "exit",    /* 1 = exit */
    "fork",    /* 2 = fork */
    ...
    "#368",    /* 368 = */
    "#369",    /* 369 = */
};
We can examine the contents of the syscallnames[] array (and, for that matter, other kernel data structures) from user space by reading from the kernel memory device /dev/kmem. Running nm on the kernel binary gives us the address of the symbol syscallnames, which we can dereference to access the array.
$ nm /mach_kernel | grep syscallnames
0037f3ac D _syscallnames
$ sudo dd if=/dev/kmem of=/dev/stdout bs=1 count=4 iseek=0x37f3ac | od -x
...
0000000 0032 a8b4
0000004
$ sudo dd if=/dev/kmem of=/dev/stdout bs=1 count=1024 iseek=0x32a8b4 | strings
syscall
exit
fork
...
The file bsd/kern/init_sysent.c contains the system call switch table, sysent[], which is an array of sysent structures, containing one structure for each system call number. This file is generated from the master file during kernel compilation.
// bsd/kern/init_sysent.c
#ifdef __ppc__
#define AC(name) (sizeof(struct name) / sizeof(uint64_t))
#else
#define AC(name) (sizeof(struct name) / sizeof(register_t))
#endif
__private_extern__ struct sysent sysent[] = {
{
0,
_SYSCALL_CANCEL_NONE,
NO_FUNNEL,
(sy_call_t *)nosys,
NULL,
NULL,
_SYSCALL_RET_INT_T
},
/* 0 = nosys indirect syscall */
{
AC(exit_args),
_SYSCALL_CANCEL_NONE,
KERNEL_FUNNEL,
(sy_call_t *)exit,
munge_w,
munge_d,
_SYSCALL_RET_NONE
},
/* 1 = exit */
...
{
0,
_SYSCALL_CANCEL_NONE,
NO_FUNNEL,
(sy_call_t *)nosys,
NULL,
NULL,
_SYSCALL_RET_INT_T
},
/* 369 = nosys */
};
int nsysent = sizeof(sysent) / sizeof(sysent[0]);
The sysent structure is declared in bsd/sys/sysent.h.
// bsd/sys/sysent.h
typedef int32_t sy_call_t(struct proc *, void *, int *);
typedef void sy_munge_t(const void *, void *);
extern struct sysent {
    int16_t sy_narg;            // number of arguments
    int8_t sy_cancel;           // how to cancel, if at all
    int8_t sy_funnel;           // funnel type, if any, to take upon entry
    sy_call_t *sy_call;         // implementing function
    sy_munge_t *sy_arg_munge32; // arguments munger for 32-bit process
    sy_munge_t *sy_arg_munge64; // arguments munger for 64-bit process
    int32_t sy_return_type;     // return type
} sysent[];
The sysent structure's fields have the following meanings.
- sy_narg is the number of arguments (at most eight) taken by the system call. In the case of the indirect system call, the number of arguments is limited to seven, since the first argument is dedicated to the target system call's number.
- As we saw earlier, a system call specifies whether it can be canceled before execution, after execution, or not at all. The sy_cancel field holds the cancellation type, which is one of _SYSCALL_CANCEL_PRE, _SYSCALL_CANCEL_POST, or _SYSCALL_CANCEL_NONE (corresponding to the PRE, POST, and NONE cancellation specifiers, respectively, in the master file). This feature is used in the implementation of the pthread_cancel(3) library call, which in turn invokes the __pthread_markcancel() [bsd/kern/kern_sig.c] system call to cancel a thread's execution. Most system calls cannot be canceled. Examples of those that can be canceled include read(), write(), open(), close(), recvmsg(), sendmsg(), and select().
- The sy_funnel field may contain a funnel type that causes the system call's processing to take (lock) the corresponding funnel before the system call is executed, and drop (unlock) the funnel after it has executed. The possible values for this field in Mac OS X 10.4 are NO_FUNNEL and KERNEL_FUNNEL (corresponding to the NONE and KERN funnel specifiers, respectively, in the master file).
- The sy_call field points to the kernel function that implements the system call.
- The sy_arg_munge32 and sy_arg_munge64 fields point to functions that are used for munging system call arguments for 32-bit and 64-bit processes, respectively. We will discuss munging in Section 6.7.1.2.
- The sy_return_type field contains one of the following to represent the system call's return type: _SYSCALL_RET_NONE, _SYSCALL_RET_INT_T, _SYSCALL_RET_UINT_T, _SYSCALL_RET_OFF_T, _SYSCALL_RET_ADDR_T, _SYSCALL_RET_SIZE_T, and _SYSCALL_RET_SSIZE_T.
Recall that unix_syscall() receives a pointer to the process control block, which is a savearea structure. The system call's arguments are received as saved registers GPR3 through GPR10 in the save area. In the case of an indirect system call, the actual system call arguments start with GPR4, since GPR3 is used for the system call number. unix_syscall() copies these arguments to the uu_arg field within the uthread structure before passing them to the call handler.
// bsd/sys/user.h
struct uthread {
    int *uu_ar0;          // address of user's saved GPR0
    u_int64_t uu_arg[8];  // arguments to current system call
    int *uu_ap;           // pointer to argument list
    int uu_rval[2];       // system call return values
    ...
};
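The difference between direct and indirect calls in locating the arguments can be sketched as follows. This is a user-space model, not kernel code: the toy_savearea and toy_decoded types and the toy_decode() function are hypothetical, and the model assumes that a zero in GPR0 is what marks the call as indirect, in line with system call number zero being the indirect call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_ARGS 8   /* GPR3 through GPR10 */

/* Hypothetical stand-in for the saved registers in the save area. */
struct toy_savearea {
    uint64_t r0;                  /* system call number loaded by the stub */
    uint64_t gpr[TOY_MAX_ARGS];   /* gpr[0] is GPR3, ..., gpr[7] is GPR10 */
};

/* Which call is meant, and where its arguments start. */
struct toy_decoded {
    uint64_t code;                /* effective system call number */
    const uint64_t *args;         /* first argument register */
    size_t max_args;              /* argument registers still available */
};

struct toy_decoded toy_decode(const struct toy_savearea *regs)
{
    struct toy_decoded d;

    assert(regs != NULL);
    if (regs->r0 == 0) {
        /* Indirect: GPR3 holds the number, arguments begin at GPR4. */
        d.code = regs->gpr[0];
        d.args = &regs->gpr[1];
        d.max_args = TOY_MAX_ARGS - 1;
    } else {
        /* Direct: GPR0 holds the number, arguments begin at GPR3. */
        d.code = regs->r0;
        d.args = &regs->gpr[0];
        d.max_args = TOY_MAX_ARGS;
    }
    return d;
}
```

Decoding a direct call and an indirect call to the same target yields the same code and the same first argument; only the number of registers left for arguments differs (eight versus seven).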
As we will see in Chapter 7, an xnu thread structure contains a pointer to the thread's user structure, roughly analogous to the user area in BSD. Execution within the xnu kernel refers to several structures, such as the Mach task structure, the Mach thread structure, the BSD process structure, and the BSD uthread structure. The latter contains several fields used during system call processing.
|
Historically, the UNIX kernel maintained an entry for every process in a process table, which always remained in memory. Each process was also allocated a user structure (or a u-area) that was an extension of the process structure. The u-area contained process-related information that needed to be accessible to the kernel only when the process was executing. Even though the kernel would not swap out a process structure, it could swap out the associated u-area. Over time, the criticality of memory as a resource has gradually lessened, but operating systems have become more complex. Correspondingly, the process structure has grown in size and the u-area has become less important, with much of its information being moved into the process structure.
|
6.7.1.2. Argument Munging
Note that uu_arg is an array of 64-bit unsigned integers: each element represents a 64-bit register. This is problematic since a parameter passed in a register from 32-bit user space will not map as is to the uu_arg array. For example, a long long parameter will be passed in a single GPR in a 64-bit program, but in two GPRs in a 32-bit program. unix_syscall() addresses the issue arising from the difference depicted in Figure 6-14 by calling the system call's specified argument munger, which copies arguments from the save area to the uu_arg array while adjusting for the differences.
Figure 6-14. Passing a long long parameter in 32-bit and 64-bit ABIs
$ cat foo.c
extern void bar(long long arg);
void
foo(void)
{
    bar((long long)1);
}
$ gcc -static -S foo.c
$ cat foo.s
...
    li r3,0
    li r4,1
    bl _bar
...
$ gcc -arch ppc64 -static -S foo.c
$ cat foo.s
...
    li r3,1
    bl _bar
...
|
The munger functions are implemented in bsd/dev/ppc/munge.s. Each function takes two arguments: a pointer to the beginning of the system call parameters within the save area and a pointer to the uu_arg array. A munger function is named munge_<encoding>, where <encoding> is a string that encodes the number and types of system call parameters. <encoding> is a combination of one or more of the d, l, s, and w characters. The characters mean the following:
- d represents a 32-bit integer, a 64-bit pointer, or a 64-bit long when the calling process is 64-bit; that is, in each case, the parameter was passed in a 64-bit GPR. Such an argument is munged by copying two words from input to output.
- l represents a 64-bit long long passed in two GPRs. Such an argument is munged by skipping a word of input (the upper 32 bits of the first GPR), copying a word of input to output (the lower 32 bits of the first GPR), skipping another word of input, and copying another word from input to output.
- s represents a 32-bit signed value. Such an argument is munged by skipping a word of input, loading and sign-extending the next word of input to yield two words, and copying the two words to output.
- w represents a 32-bit unsigned value. Such an argument is munged by skipping a word of input, copying a zero word to output, and copying a word from input to output.
Moreover, multiple munger functions are aliased to a common implementation if each function, except one, is a prefix of another. For example, munge_w, munge_ww, munge_www, and munge_wwww are aliased to the same implementation; consequently, four arguments are munged in each case, regardless of the actual number of arguments. Similarly, munge_wwwww, munge_wwwwww, munge_wwwwwww, and munge_wwwwwwww are aliased to the same implementation, whose operation is shown in Figure 6-15.
Consider the example of the read() system call. It takes three arguments: a file descriptor, a pointer to a buffer, and the number of bytes to read.
ssize_t read(int d, void *buf, size_t nbytes);
The 32-bit and 64-bit mungers for the read() system call are munge_www() and munge_ddd(), respectively.
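The four encoding characters can be modeled in portable C as shown below. This is only a sketch of the word shuffling described above, not the assembly in munge.s: the toy_munge() function is hypothetical, it interprets the encoding string at run time rather than providing one function per encoding, and it represents both the save area and uu_arg as arrays of 32-bit words with the upper word of each 64-bit slot first (the big-endian order).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Toy munger. 'in' holds the saved GPRs as word pairs (upper word
 * first); 'out' receives the uu_arg[] contents in the same word order.
 * The caller must size both arrays for the given encoding.
 * Returns the number of 32-bit words written, or 0 on a bad encoding.
 */
size_t toy_munge(const char *encoding, const uint32_t *in, uint32_t *out)
{
    size_t i = 0;   /* input word index */
    size_t o = 0;   /* output word index */

    assert(encoding != NULL && in != NULL && out != NULL);
    for (; *encoding != '\0'; encoding++) {
        switch (*encoding) {
        case 'd':   /* value fills a 64-bit GPR: copy two words */
            out[o++] = in[i++];
            out[o++] = in[i++];
            break;
        case 'l':   /* long long split across two GPRs */
            i++;                    /* skip upper word of first GPR */
            out[o++] = in[i++];     /* lower word of first GPR */
            i++;                    /* skip upper word of second GPR */
            out[o++] = in[i++];     /* lower word of second GPR */
            break;
        case 's':   /* 32-bit signed: sign-extend to two words */
            i++;
            out[o++] = (in[i] & 0x80000000u) ? 0xFFFFFFFFu : 0u;
            out[o++] = in[i++];
            break;
        case 'w':   /* 32-bit unsigned: zero-extend to two words */
            i++;
            out[o++] = 0u;
            out[o++] = in[i++];
            break;
        default:
            return 0;
        }
    }
    return o;
}
```

For the 32-bit case shown earlier, in which (long long)1 arrives as 0 in one GPR and 1 in the next, the l encoding produces the output words 0 and 1, which together form the 64-bit value 1 in a single uu_arg element.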
6.7.1.3. Kernel Processing of BSD System Calls
Figure 6-16 shows pseudocode depicting the working of unix_syscall(), which, as we saw earlier, is called by shandler() to process BSD system calls.
Figure 6-16. Details of the final dispatching of BSD system calls
// bsd/dev/ppc/systemcalls.c
void
unix_syscall(struct savearea *regs)
{
thread_t thread_act;
struct uthread *uthread;
struct proc *proc;
struct sysent *callp;
int error;
unsigned short code;
...
// Determine if this is a direct or indirect system call (the "flavor").
// Set the 'code' variable to either GPR3 or GPR0, depending on flavor.
...
// If kdebug tracing is enabled, log an entry indicating that a BSD
// system call is starting, unless this system call is kdebug_trace().
...
// Retrieve the current thread and the corresponding uthread structure.
thread_act = current_thread();
uthread = get_bsdthread_info(thread_act);
...
// Ensure that the current task has a non-NULL proc structure associated
// with it; if not, terminate the current task.
...
// uu_ar0 is the address of user's saved GPR0.
uthread->uu_ar0 = (int *)regs;
// Use the system call number to retrieve the corresponding sysent
// structure. If system call number is too large, use the number 63, which
// is an internal reserved number for a nosys().
//
// In early UNIX, the sysent array had space for 64 system calls. The last
// entry (that is, sysent[63]) was a special system call.
callp = (code >= nsysent) ? &sysent[63] : &sysent[code];
if (callp->sy_narg != 0) {
// if the call takes one or more arguments
void *regsp;
sy_munge_t *mungerp;
if (/* this is a 64-bit process */) {
if (/* this is a 64-bit unsafe call */) {
// Turn it into a nosys() -- use system call #63 and bail out.
...
}
// 64-bit argument munger
mungerp = callp->sy_arg_munge64;
} else { /* 32-bit process */
// 32-bit argument munger
mungerp = callp->sy_arg_munge32;
}
// Set regsp to point to either the saved GPR3 in the save area (for a
// direct system call), or to the saved GPR4 (for an indirect system
// call). An indirect system call can take at most 7 arguments.
...
// Call the argument munger.
(*mungerp)(regsp, (void *)&uthread->uu_arg[0]);
}
// Evaluate call for cancellation, and cancel, if required and possible.
...
// Take the kernel funnel if the call requires so.
...
// Assume there will be no error.
error = 0;
// Increment saved SRR0 by one instruction.
regs->save_srr0 += 4;
// Test if this is a kernel trace point -- that is, if system call tracing
// through ktrace(2) is enabled for this process. If so, write a trace
// record for this system call.
...
// If auditing is enabled, set up an audit record for the system call.
...
// Call the system call's specific handler.
error = (*(callp->sy_call))(proc, (void *)uthread->uu_arg,
&(uthread->uu_rval[0]));
// If auditing is enabled, commit the audit record.
...
// Handle return value(s)
...
// If this is a ktrace(2) trace point, write a trace record for the
// return of this system call.
...
// Drop the funnel if one was taken.
...
// If kdebug tracing is enabled, log an entry indicating that a BSD
// system call is ending, unless this system call is kdebug_trace().
...
thread_exception_return();
/* NOTREACHED */
}
|
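The table lookup in the pseudocode above, including the fallback to entry 63 for out-of-range numbers, can be exercised with a small stand-alone model. The toy_sysent structure is a simplified, hypothetical version of struct sysent (the handler takes no arguments here), and toy_nosys() and toy_getpid() are placeholder handlers invented for the sketch.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NOSYS_SLOT 63   /* reserved nosys() entry used as the fallback */

/* Simplified, hypothetical stand-in for struct sysent. */
struct toy_sysent {
    int sy_narg;                /* number of arguments */
    int (*sy_call)(void);       /* implementing function (simplified) */
};

int toy_nosys(void)  { return -1; }  /* placeholder: unimplemented call */
int toy_getpid(void) { return 42; }  /* placeholder handler */

/* Select the table entry for 'code', falling back to the nosys slot. */
const struct toy_sysent *
toy_lookup(const struct toy_sysent *table, size_t nsysent, unsigned short code)
{
    assert(table != NULL && nsysent > TOY_NOSYS_SLOT);
    return (code >= nsysent) ? &table[TOY_NOSYS_SLOT] : &table[code];
}
```

A caller fills a table of at least 64 entries, passes its length as nsysent, and invokes sy_call on whatever toy_lookup() returns; a number past the end of the table therefore reaches the placeholder in slot 63 instead of indexing out of bounds.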
unix_syscall() potentially performs several types of tracing or logging: kdebug tracing, ktrace(2) tracing, and audit logging. We will discuss kdebug and ktrace(2) later in this chapter.
Arguments are passed to the call-specific handler packaged into a structure. Let us consider the example of the socketpair(2) system call, which takes four arguments: three integers and a pointer to a buffer for holding two integers.
int socketpair(int domain, int type, int protocol, int *rsv);
The bsd/sys/sysproto.h file, which, as noted earlier, is generated by bsd/kern/makesyscalls.sh, contains argument structure declarations for all BSD system calls. Note also the use of left and right padding in the declaration of the socketpair_args structure.
// bsd/sys/sysproto.h
#ifdef __ppc__
#define PAD_(t) (sizeof(uint64_t) <= sizeof(t) \
? 0 : sizeof(uint64_t) - sizeof(t))
#else
...
#endif
#if BYTE_ORDER == LITTLE_ENDIAN
...
#else
#define PADL_(t) PAD_(t)
#define PADR_(t) 0
#endif
...
struct socketpair_args {
char domain_l_[PADL_(int)];
int domain;
char domain_r_[PADR_(int)];
char type_l_[PADL_(int)];
int type;
char type_r_[PADR_(int)];
char protocol_l_[PADL_(int)];
int protocol;
char protocol_r_[PADR_(int)];
char rsv_l_[PADL_(user_addr_t)];
user_addr_t rsv;
char rsv_r_[PADR_(user_addr_t)];
};
...
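The effect of the left padding on a big-endian layout can be checked with a small model. The sketch below is not the generated header: the toy_ names are hypothetical, user addresses are assumed to be 64 bits wide, int is assumed to be 4 bytes, and the zero-length right padding is simply omitted. TOY_AC() follows the idea of the AC() macro seen earlier, which counts 64-bit argument slots.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t toy_user_addr_t;   /* assumption: 64-bit user address */

/* The sketch assumes int is narrower than a 64-bit slot. */
typedef char toy_int_fits_slot[(sizeof(int) < sizeof(uint64_t)) ? 1 : -1];

/* Left padding that right-justifies a type within a 64-bit slot. */
#define TOY_PADL(t) (sizeof(uint64_t) - sizeof(t))

/* Number of 64-bit argument slots occupied by an argument structure. */
#define TOY_AC(name) (sizeof(struct name) / sizeof(uint64_t))

struct toy_socketpair_args {
    char domain_l_[TOY_PADL(int)];
    int domain;
    char type_l_[TOY_PADL(int)];
    int type;
    char protocol_l_[TOY_PADL(int)];
    int protocol;
    toy_user_addr_t rsv;            /* already slot-sized: no padding */
};

/* Verify that every member ends exactly on a 64-bit slot boundary. */
void toy_check_layout(void)
{
    assert(offsetof(struct toy_socketpair_args, domain)   == 4);
    assert(offsetof(struct toy_socketpair_args, type)     == 12);
    assert(offsetof(struct toy_socketpair_args, protocol) == 20);
    assert(offsetof(struct toy_socketpair_args, rsv)      == 24);
    assert(TOY_AC(toy_socketpair_args) == 4);
}
```

Because each member occupies the low-order end of its own 8-byte slot, a structure laid out this way lines up with consecutive uu_arg elements on a big-endian machine, which is what lets a handler treat the munged argument array as an argument structure.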
The system call handler function for socketpair(2) retrieves its arguments as fields of the incoming socketpair_args structure.
// bsd/kern/uipc_syscalls.c
// Create a pair of connected sockets
int
socketpair(struct proc *p,
struct socketpair_args *uap,
__unused register_t *retval)
{
struct fileproc *fp1, *fp2;
struct socket *so1, *so2;
int fd, error, sv[2];
...
error = socreate(uap->domain, &so1, uap->type, &uap->protocol);
...
error = socreate(uap->domain, &so2, uap->type, &uap->protocol);
...
error = falloc(p, &fp1, &fd);
...
sv[0] = fd;
error = falloc(p, &fp2, &fd);
...
sv[1] = fd;
...
error = copyout((caddr_t)sv, uap->rsv, 2 * sizeof(int));
...
return (error);
}
Note that before calling the system call handler, unix_syscall() sets the error status to zero, assuming that there will be no error. Recall that the saved SRR0 register contains the address of the instruction immediately following the system call instruction. This is where execution would resume after returning to user space from the system call. As we will shortly see, a standard user-space library stub for a BSD system call invokes the cerror() library function to set the errno variable; this should be done only if there is an error. unix_syscall() increments the saved SRR0 by one instruction, so that the call to cerror() will be skipped if there is no error. If the system call handler indeed does return an error, the SRR0 value is decremented by one instruction.
After returning from the handler, unix_syscall() examines the error variable to take the appropriate action.
- If error is ERESTART, this is a restartable system call that needs to be restarted. unix_syscall() decrements SRR0 by 8 bytes (two instructions) to cause execution to resume at the original system call instruction.
- If error is EJUSTRETURN, this system call wants to be returned to user space without any further processing of return values.
- If error is any other nonzero value, the system call returned an error, which unix_syscall() copies to the saved GPR3 in the process control block. It also decrements SRR0 by one instruction to cause the cerror() routine to be executed upon return to user space.
- If error is 0, the system call returned success. unix_syscall() copies the return values from the uthread structure to the saved GPR3 and GPR4 in the process control block. Table 6-10 shows how the return values are handled.
Table 6-10. Handling of BSD System Call Return Values
| Call Return Type | Source for GPR3 | Source for GPR4 |
|---|---|---|
| Erroneous return | The error variable | Nothing |
| _SYSCALL_RET_INT_T | uu_rval[0] | uu_rval[1] |
| _SYSCALL_RET_UINT_T | uu_rval[0] | uu_rval[1] |
| _SYSCALL_RET_OFF_T (32-bit process) | uu_rval[0] | uu_rval[1] |
| _SYSCALL_RET_OFF_T (64-bit process) | uu_rval[0] and uu_rval[1] as a single u_int64_t value | The value 0 |
| _SYSCALL_RET_ADDR_T | uu_rval[0] and uu_rval[1] as a single user_addr_t value | The value 0 |
| _SYSCALL_RET_SIZE_T | uu_rval[0] and uu_rval[1] as a single user_addr_t value | The value 0 |
| _SYSCALL_RET_SSIZE_T | uu_rval[0] and uu_rval[1] as a single user_addr_t value | The value 0 |
| _SYSCALL_RET_NONE | Nothing | Nothing |
Finally, to return to user mode, unix_syscall() calls thread_exception_return() [osfmk/ppc/hw_exception.s], which checks for outstanding ASTs. If any ASTs are found, ast_taken() is called. After ast_taken() returns, thread_exception_return() checks for outstanding ASTs one more time (and so on). It then jumps to .L_thread_syscall_return() [osfmk/ppc/hw_exception.s], which branches to chkfac() [osfmk/ppc/hw_exception.s], which branches to exception_exit() [osfmk/ppc/lowmem_vectors.s]. Some of the context is restored during these calls. exception_exit() eventually branches to EatRupt [osfmk/ppc/lowmem_vectors.s], which releases the save area, performs the remaining context restoration and state cleanup, and finally executes the rfid (rfi for 32-bit) instruction to return from the interrupt.
|
The system call mechanism in early UNIX operated similarly in concept to the one we have discussed here: It allowed a user program to call on the kernel by executing the trap instruction in user mode. The low-order byte of the instruction word encoded the system call number. Therefore, in theory, there could be up to 256 system calls. Their handler functions in the kernel were contained in a sysent table whose first entry was the indirect system call. First Edition UNIX (circa November 1971) had fewer than 35 documented system calls. Figure 6-17 shows a code excerpt from Third Edition UNIX (circa February 1973); note that the system call numbers for various system calls are identical to those in Mac OS X.
|
Figure 6-17. System call data structures in Third Edition UNIX
/* Third Edition UNIX */
/* ken/trap.c */
...
struct {
    int count;
    int (*call)();
} sysent[64];
...
/* ken/sysent.c */
int sysent[]
{
    0, &nullsys, /* 0 = indir */
    0, &rexit,   /* 1 = exit */
    0, &fork,    /* 2 = fork */
    2, &read,    /* 3 = read */
    2, &write,   /* 4 = write */
    2, &open,    /* 5 = open */
    ...
    0, &nosys,   /* 62 = x */
    0, &prproc   /* 63 = special */
...
|
6.7.1.4. User Processing of BSD System Calls
A typical BSD system call stub in the C library is constructed using a set of macros, some of which are shown in Figure 6-18. The figure also shows an assembly-language fragment for the exit() system call. Note that the assembly code is shown with a static call to cerror() for simplicity, as the invocation is somewhat more complicated in the case of dynamic linking.
Figure 6-18. Creating a user-space system call stub
$ cat testsyscall.h
// for system call numbers
#include <sys/syscall.h>
// taken from <architecture/ppc/mode_independent_asm.h>
#define MI_ENTRY_POINT(name) \
.globl name @\
.text @\
.align 2 @\
name:
#if defined(__DYNAMIC__)
#define MI_BRANCH_EXTERNAL(var) \
MI_GET_ADDRESS(r12,var) @\
mtctr r12 @\
bctr
#else /* ! __DYNAMIC__ */
#define MI_BRANCH_EXTERNAL(var) \
b var
#endif
// taken from Libc/ppc/sys/SYS.h
#define kernel_trap_args_0
#define kernel_trap_args_1
#define kernel_trap_args_2
#define kernel_trap_args_3
#define kernel_trap_args_4
#define kernel_trap_args_5
#define kernel_trap_args_6
#define kernel_trap_args_7
#define SYSCALL(name, nargs) \
.globl cerror @\
MI_ENTRY_POINT(_##name) @\
kernel_trap_args_##nargs @\
li r0,SYS_##name @\
sc @\
b 1f @\
blr @\
1: MI_BRANCH_EXTERNAL(cerror)
// let us define the stub for SYS_exit
SYSCALL(exit, 1)
$ gcc -static -E testsyscall.h | tr '@' '\n'
...
; indented and annotated for clarity
    .globl cerror
    .globl _exit
    .text
    .align 2
_exit:
    li r0,1      ; load system call number in r0
    sc           ; execute the sc instruction
    b 1f         ; jump over blr, to the cerror call
    blr          ; return
1:  b cerror     ; call cerror, which will also return to the user
|
The f in the unconditional branch instruction to 1f in Figure 6-18 specifies the direction (forward, in this case). If you have another label named 1 before the branch instruction, you can jump to it using 1b as the operand.
Figure 6-18 also shows the placement of the call to cerror() in the case of an error. When the sc instruction is executed, the processor places the effective address of the instruction following the sc instruction in SRR0. Therefore, the stub is set to call the cerror() function by default after the system call returns. cerror() copies the system call's return value (contained in GPR3) to the errno variable, calls cthread_set_errno_self() to set the per-thread errno value for the current thread, and sets both GPR3 and GPR4 to -1, thereby causing the calling program to receive return values of -1 whether the expected return value is one word (in GPR3) or two words (in GPR3 and GPR4).
Let us now look at an example of directly invoking a system call using the sc instruction. Although doing so is useful for demonstration, a nonexperimental user program should not use the sc instruction directly. The only API-compliant and future-proof way to invoke system calls under Mac OS X is through user libraries. Almost all supported system calls have stubs in the system library (libSystem), of which the standard C library is a subset.
As we noted in Chapter 2, the primary reason system calls must not be invoked directly in user programs (especially shipping products) is that the interfaces between system shared libraries and the kernel are private to Apple and are subject to change. Moreover, user programs are allowed to link with system libraries (including libSystem) only dynamically. This allows Apple flexibility in modifying and extending its private interfaces without affecting user programs.
With that caveat, let us use the sc instruction to invoke a simple BSD system call: say, getpid(). Figure 6-19 shows a program that uses both the library stub and our custom stub to call getpid(). We need an extra instruction (say, a no-op) immediately following the sc instruction; otherwise, the program will behave incorrectly.
Figure 6-19. Directly invoking a BSD system call
// getpid_demo.c
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>
#include <sys/syscall.h>
pid_t
my_getpid(void)
{
int syscallnum = SYS_getpid;
__asm__ volatile(
"lwz r0,%0\n"
"sc\n"
"nop\n"
// The kernel will arrange for this to be skipped
:
: "g" (syscallnum)
);
// GPR3 already has the right return value
// Compiler warning here because of the lack of a return statement
}
int
main(void)
{
printf("my pid is %d\n", getpid());
printf("my pid is %d\n", my_getpid());
return 0;
}
$ gcc -Wall -o getpid_demo getpid_demo.c
getpid_demo.c: In function 'my_getpid':
getpid_demo.c:24: warning: control reaches end of non-void function
$ ./getpid_demo
my pid is 2345
my pid is 2345
$
|
Note that since user programs on Mac OS X can only be dynamically linked with Apple-provided libraries, one would expect a user program not to have any sc instructions at all: it should only have dynamically resolved symbols to system call stubs. However, dynamically linked 32-bit C and C++ programs do have a couple of embedded sc instructions that come from the language runtime startup code, specifically, the __dyld_init_check() function.
; dyld.s in the source for the C startup code
/*
 * At this point the dynamic linker initialization was not run so print a
 * message on stderr and exit non-zero. Since we can't use any libraries the
 * raw system call interfaces must be used.
 *
 * write(stderr, error_message, sizeof(error_message));
 */
    li r5,78
    lis r4,hi16(error_message)
    ori r4,r4,lo16(error_message)
    li r3,2
    li r0,4      ; write() is system call number 4
    sc
    nop          ; return here on error
/*
 * _exit(59);
 */
    li r3,59
    li r0,1      ; exit() is system call number 1
    sc
    trap         ; this call to _exit() should not fall through
    trap
6.7.2. Mach Traps
Although Mach traps are similar to traditional system calls in that they are entry points into the kernel, they are different in that Mach kernel services are typically not offered directly through these traps. Instead, certain Mach traps are IPC entry points through which user-space clients (such as the system library) access kernel services by exchanging IPC messages with the server that implements those services, just as if the server were in user space.
There are almost ten times as many BSD system calls as there are Mach traps.
Consider an example of a simple Mach trap: say, task_self_trap(), which returns send rights to the task's kernel port. The documented mach_task_self() library function is redefined in <mach/mach_init.h> to be the value of the global variable mach_task_self_, which is populated by the system library during the initialization of a user process. Specifically, the library stub for the fork() system call sets up the child process by calling several initialization routines, including one that initializes Mach in the process. This latter step caches the return value of task_self_trap() in the mach_task_self_ variable.
// <mach/mach_init.h>
extern mach_port_t mach_task_self_;
#define mach_task_self() mach_task_self_
...
The program shown in Figure 6-20 uses several apparently different ways of retrieving the same information: the current task's self port.
Figure 6-20. Multiple ways of retrieving a Mach task's self port
// mach_task_self.c
#include <stdio.h>
#include <mach/mach.h>
#include <mach/mach_traps.h>
int
main(void)
{
printf("%#x\n", mach_task_self());
#undef mach_task_self
printf("%#x\n", mach_task_self());
printf("%#x\n", task_self_trap());
printf("%#x\n", mach_task_self_);
return 0;
}
$ gcc -Wall -o mach_task_self mach_task_self.c
$ ./mach_task_self
0x807
0x807
0x807
0x807
|
The value returned by task_self_trap() is not a unique identifier like a Unix process ID. In fact, its value will be the same for all tasks, even on different machines, provided the machines are running identical kernels.
An example of a complex Mach trap is mach_msg_overwrite_trap() [osfmk/ipc/mach_msg.c], which is used for sending and receiving IPC messages. Its implementation contains over a thousand lines of C code. mach_msg_trap() is a simplified wrapper around mach_msg_overwrite_trap(). The C library provides the documented mach_msg() and mach_msg_overwrite() functions, which use these traps but can also restart message sending or receiving in the case of interruptions. User programs access kernel services by performing IPC with the kernel using these "msg" traps. The paradigm used is essentially client-server, wherein the clients (programs) request information from the server (the kernel) by sending messages and usually, but not always, receiving replies. Consider the example of Mach's virtual memory services. As we will see in Chapter 8, a user program can allocate a region of virtual memory using the Mach vm_allocate() function. Now, although vm_allocate() is implemented in the kernel, it is not exported by the kernel as a regular system call. It is available as a remote procedure in the "Kernel Server" and is callable by user clients. The vm_allocate() function that user programs call lives in the C library, representing the client end of the remote procedure call. Various other Mach services, such as those that allow the manipulation of tasks, threads, processors, and ports, are provided similarly.
|
Implementations of Mach services commonly use the Mach Interface Generator (MIG), which simplifies the task of creating Mach clients and servers by subsuming a considerable portion of frequently used IPC code. MIG accepts a definition file that describes IPC-related interfaces using a predefined syntax. Running the MIG program /usr/bin/mig on a definition file generates a C header, a client (user) interface module, and a server interface module. We will see an example of using MIG in Chapter 9. MIG definition files for various kernel services are located in the /usr/include/mach/ directory. A MIG definition file conventionally has a .defs extension.
|
Mach traps are maintained in an array of structures called mach_trap_table, which is similar to BSD's sysent table. Each element of this array is a structure of type mach_trap_t, which is declared in osfmk/kern/syscall_sw.h. Figure 6-21 shows the MACH_TRAP() macro.
Figure 6-21. Mach trap table data structures and definitions
// osfmk/kern/syscall_sw.h
typedef void mach_munge_t(const void *, void *);
typedef struct {
int mach_trap_arg_count;
int (* mach_trap_function)(void);
#if defined(__i386__)
boolean_t mach_trap_stack;
#else
mach_munge_t *mach_trap_arg_munge32;
mach_munge_t *mach_trap_arg_munge64;
#endif
#if !MACH_ASSERT
int mach_trap_unused;
#else
const char * mach_trap_name;
#endif
} mach_trap_t;
#define MACH_TRAP_TABLE_COUNT 128
extern mach_trap_t mach_trap_table[];
extern int mach_trap_count;
...
#if !MACH_ASSERT
#define MACH_TRAP(name, arg_count, munge32, munge64) \
{ (arg_count), (int (*)(void)) (name), (munge32), (munge64), 0 }
#else
#define MACH_TRAP(name, arg_count, munge32, munge64) \
{ (arg_count), (int (*)(void)) (name), (munge32), (munge64), #name }
#endif
...
The
MACH_ASSERT
compile-time configuration option controls the
ASSERT()
and
assert()
macros and is used while compiling debug versions of the kernel.
The
MACH_TRAP()
macro shown in Figure 6-21 is used to populate the Mach trap table in
osfmk/kern/syscall_sw.c
. Figure 6-22 shows how this is done. Mach traps on Mac OS X have numbers that start from
-10
, decrease monotonically, and go as high in absolute value as the highest numbered Mach trap. Numbers
0
through
-9
are reserved for Unix system calls and are unused. Note also that the argument munger functions are the same as those used in BSD system call processing.
Figure 6-22. Mach trap table initialization
// osfmk/kern/syscall_sw.c
mach_trap_t mach_trap_table[MACH_TRAP_TABLE_COUNT] = {
MACH_TRAP(kern_invalid, 0, NULL, NULL),
/* Unix */
/* 0 */
MACH_TRAP(kern_invalid, 0, NULL, NULL),
/* Unix */
/* -1 */
... ... ...
MACH_TRAP(kern_invalid, 0, NULL, NULL),
/* Unix */
/* -9 */
MACH_TRAP(kern_invalid, 0, NULL, NULL),
/* -10 */
... ...
MACH_TRAP(kern_invalid, 0, NULL, NULL),
/* -25 */
MACH_TRAP(mach_reply_port, 0, NULL, NULL),
/* -26 */
MACH_TRAP(thread_self_trap, 0, NULL, NULL),
/* -27 */
... ...
MACH_TRAP(mach_msg_trap, 7, munge_wwwwwww, munge_ddddddd),
/* -31 */
... ...
MACH_TRAP(task_for_pid, 3, munge_www, munge_ddd),
/* -45 */
MACH_TRAP(pid_for_task, 2, munge_ww, munge_dd),
/* -46 */
... ...
MACH_TRAP(kern_invalid, 0, NULL, NULL),
/* -127 */
};
int mach_trap_count = (sizeof(mach_trap_table) / \
sizeof(mach_trap_table[0]));
...
kern_return_t
kern_invalid(void)
{
if (kern_invalid_debug)
Debugger("kern_invalid mach_trap");
return KERN_INVALID_ARGUMENT;
}
...
The assembly stubs for Mach traps are defined in
osfmk/mach/syscall_sw.h
using the
machine-dependent
kernel_trap()
macro defined in
osfmk/mach/ppc/syscall_sw.h
. Table 6-11 enumerates the key files used in the implementation of these traps.
Table 6-11. Implementing Mach Traps in xnu
| File | Contents |
| osfmk/kern/syscall_sw.h | Declaration of the trap table structure |
| osfmk/kern/syscall_sw.c | Population of the trap table; definitions of default error functions |
| osfmk/mach/mach_interface.h | Master header file that includes headers for the various Mach APIs, specifically the kernel RPC functions corresponding to these APIs (the headers are generated from MIG definition files) |
| osfmk/mach/mach_traps.h | Prototypes for traps as seen from user space, including declaration of each trap's argument structure |
| osfmk/mach/syscall_sw.h | Instantiation of traps by defining assembly stubs, using the machine-dependent kernel_trap() macro (note that some traps may have different versions for the 32-bit and 64-bit system libraries, whereas some traps may not be available in one of the two libraries) |
| osfmk/mach/ppc/syscall_sw.h | PowerPC definitions of the kernel_trap() macro and associated macros; definitions of other PowerPC-only system calls |
The
kernel_trap()
macro takes three arguments for a trap: its name, its index in the trap table, and its argument count.
// osfmk/mach/syscall_sw.h
kernel_trap(mach_reply_port, -26, 0);
kernel_trap(thread_self_trap, -27, 0);
...
kernel_trap(task_for_pid, -45, 3);
kernel_trap(pid_for_task, -46, 2);
...
Let us look at a specific example, say,
pid_for_task()
, and see how its stub is
instantiated
.
pid_for_task()
attempts to find the BSD process ID for the given Mach task. It takes two arguments: the port for a task and a pointer to an integer for holding the returned process ID. Figure 6-23 shows the implementation of this trap.
Figure 6-23. Setting up the pid_for_task() Mach trap
// osfmk/mach/syscall_sw.h
kernel_trap(pid_for_task, -46, 2);
...
// osfmk/mach/ppc/syscall_sw.h
#include <mach/machine/asm.h>
#define kernel_trap(trap_name, trap_number, trap_args) \
ENTRY(trap_name, TAG_NO_FRAME_USED) @\
li r0, trap_number @\
sc @\
blr
...
// osfmk/ppc/asm.h
// included from <mach/machine/asm.h>
#define TAG_NO_FRAME_USED 0x00000000
#define EXT(x) _ ## x
#define LEXT(x) _ ## x ## :
#define FALIGN 4
#define MCOUNT
#define Entry(x,tag) .text@.align FALIGN@ .globl EXT(x)@ LEXT(x)
#define ENTRY(x,tag) Entry(x,tag)@MCOUNT
...
// osfmk/mach/mach_traps.h
#ifndef KERNEL
extern kern_return_t pid_for_task(mach_port_name_t t, int *x);
...
#else /* KERNEL */
...
struct pid_for_task_args {
PAD_ARG_(mach_port_name_t, t);
PAD_ARG_(user_addr_t, pid);
};
extern kern_return_t pid_for_task(struct pid_for_task_args *args);
...
// bsd/vm/vm_unix.c
kern_return_t
pid_for_task(struct pid_for_task_args *args)
{
mach_port_name_t t = args->t;
user_addr_t pid_addr = args->pid;
...
}
Using the information shown in Figure 6-23, the trap definition for
pid_for_task()
will have the following assembly stub:
.text
.align 4
.globl _pid_for_task
_pid_for_task:
li r0,-46
sc
blr
Let us test the assembly stub by changing the stub's function name from
_pid_for_task
to
_my_pid_for_task
, placing it in a file called
my_pid_for_task.S
, and using it in a C program. Moreover, we can call the regular
pid_for_task()
to verify the operation of our stub, as shown in Figure 6-24.
Figure 6-24. Testing the pid_for_task() Mach trap
// traptest.c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>
#include <mach/mach.h>
#include <mach/mach_error.h>
extern kern_return_t my_pid_for_task(mach_port_t, int *);
int
main(void)
{
pid_t pid;
kern_return_t kr;
mach_port_t myTask;
myTask = mach_task_self();
// call the regular trap
kr = pid_for_task(myTask, (int *)&pid);
if (kr != KERN_SUCCESS)
mach_error("pid_for_task:", kr);
else
printf("pid_for_task says %d\n", pid);
// call our version of the trap
kr = my_pid_for_task(myTask, (int *)&pid);
if (kr != KERN_SUCCESS)
mach_error("my_pid_for_task:", kr);
else
printf("my_pid_for_task says %d\n", pid);
exit(0);
}
$ gcc -Wall -o traptest traptest.c my_pid_for_task.S
$ ./traptest
pid_for_task says 20040
my_pid_for_task says 20040
In general, the handling of Mach traps follows a path in the kernel similar to that of BSD system calls.
shandler()
identifies Mach traps by virtue of their call numbers being negative. It looks up the trap handler in
mach_trap_table
and performs the call.
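The lookup step can be pictured with a small user-space model. The sketch below is not kernel code: every demo_ name, the single installed handler, and the choice to reject nonnegative numbers with an error are assumptions made for illustration. It shows only the idea that a negative call number n selects slot -n of the trap table and that unused slots lead to an "invalid" handler.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-ins for the Mach return codes and table size. */
#define DEMO_KERN_SUCCESS          0
#define DEMO_KERN_INVALID_ARGUMENT 4
#define DEMO_TRAP_TABLE_COUNT      128

typedef int (*demo_trap_fn)(void);

typedef struct {
    int          arg_count;  /* number of arguments the trap takes */
    demo_trap_fn function;   /* handler for this slot */
} demo_trap_t;

static demo_trap_t demo_trap_table[DEMO_TRAP_TABLE_COUNT];

static int demo_kern_invalid(void) { return DEMO_KERN_INVALID_ARGUMENT; }
static int demo_reply_port(void)   { return DEMO_KERN_SUCCESS; }

/* Mark every slot invalid, then install one hypothetical trap at -26. */
static void demo_init_table(void)
{
    for (size_t i = 0; i < DEMO_TRAP_TABLE_COUNT; i++) {
        demo_trap_table[i].arg_count = 0;
        demo_trap_table[i].function  = demo_kern_invalid;
    }
    demo_trap_table[26].function = demo_reply_port;
}

/* Treat negative call numbers as Mach traps and index the table with -n. */
static int demo_dispatch(int call_number)
{
    if (call_number >= 0)
        return DEMO_KERN_INVALID_ARGUMENT;   /* not a Mach trap in this model */

    long long index = -(long long)call_number;
    if (index >= DEMO_TRAP_TABLE_COUNT)
        return DEMO_KERN_INVALID_ARGUMENT;   /* past the end of the table */

    assert(demo_trap_table[index].function != NULL);  /* table must be initialized */
    return demo_trap_table[index].function();
}
```

After calling demo_init_table(), demo_dispatch(-26) reaches the installed handler, whereas demo_dispatch(-10) lands on the invalid handler, much as an unused slot in the real table leads to kern_invalid().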
Mach traps in Mac OS X support up to eight parameters, which are passed in GPRs 3 through 10. Although mach_msg_overwrite_trap() takes nine parameters, the ninth parameter is not used in practice: in the trap's processing, a zero is passed as the ninth parameter.
6.7.3. I/O Kit Traps
Trap numbers 100 through 107 in the Mach trap table are reserved for I/O Kit traps. In Mac OS X 10.4, only one I/O Kit trap is implemented (but not used):
iokit_user_client_trap()
[
iokit/Kernel/IOUserClient.cpp
]. The I/O Kit framework (
IOKit.framework
) implements the user-space stub for this trap.
6.7.4. PowerPC-Only System Calls
The Mac OS X kernel maintains yet another system call table called
PPCcalls
, which contains a few special PowerPC-only system calls.
PPCcalls
is defined in
osfmk/ppc/PPCcalls.h
. Each of its entries is a pointer to a function that takes one argument (a pointer to a save area) and returns an integer.
// osfmk/ppc/PPCcalls.h
typedef int (*PPCcallEnt)(struct savearea *save);
#define PPCcall(rout) rout
#define dis (PPCcallEnt)0
PPCcallEnt PPCcalls[] = {
PPCcall(diagCall),
// 0x6000
PPCcall(vmm_get_version),
// 0x6001
PPCcall(vmm_get_features),
// 0x6002
...
// ...
PPCcall(dis),
...
};
...
Call numbers for the PowerPC system calls begin at 0x6000 and can go up to 0x6FFF; that is, there can be at most 4096 such calls. The assembly stubs for these calls are instantiated in osfmk/mach/ppc/syscall_sw.h.
// osfmk/mach/ppc/syscall_sw.h
#define ppc_trap(trap_name,trap_number) \
ENTRY(trap_name, TAG_NO_FRAME_USED) @\
li r0, trap_number @\
sc @\
blr
...
ppc_trap(diagCall, 0x6000);
ppc_trap(vmm_get_version, 0x6001);
ppc_trap(vmm_get_features, 0x6002);
...
Note that the
ppc_trap()
macro is similar to the
kernel_trap()
macro used for defining assembly stubs for Mach traps.
shandler() passes most of these calls to ppcscall() [osfmk/ppc/hw_exception.s], which looks up the appropriate handler in the PPCcalls table.
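As a rough model of that lookup, the following sketch maps a call number in the 0x6000 through 0x6FFF range to a slot in a small table and treats a zero entry (like the dis entries in PPCcalls) as disabled. The demo_ names, the three-entry table, the save-area layout, and the return conventions are all hypothetical; the real lookup is done in assembly and is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_PPC_BASE  0x6000u   /* first PowerPC-only call number */
#define DEMO_PPC_LIMIT 0x6FFFu   /* last possible PowerPC-only call number */

struct demo_savearea { int result; };            /* hypothetical save area */
typedef int (*demo_call_ent)(struct demo_savearea *save);

#define demo_dis ((demo_call_ent)0)              /* disabled slot */

static int demo_diag(struct demo_savearea *save)    { save->result = 1; return 1; }
static int demo_version(struct demo_savearea *save) { save->result = 2; return 1; }

static demo_call_ent demo_calls[] = {
    demo_diag,      /* 0x6000 */
    demo_version,   /* 0x6001 */
    demo_dis,       /* 0x6002: disabled in this demo */
};

/* Return the handler for a call number, or NULL if out of range or disabled. */
static demo_call_ent demo_lookup(unsigned int call_number)
{
    size_t count = sizeof(demo_calls) / sizeof(demo_calls[0]);

    if (call_number < DEMO_PPC_BASE || call_number > DEMO_PPC_LIMIT)
        return NULL;

    size_t index = call_number - DEMO_PPC_BASE;
    if (index >= count)
        return NULL;
    return demo_calls[index];
}

/* Run the handler if there is one; 0 means "not handled" in this demo. */
static int demo_invoke(unsigned int call_number, struct demo_savearea *save)
{
    assert(save != NULL);
    demo_call_ent fn = demo_lookup(call_number);
    return fn ? fn(save) : 0;
}
```

Subtracting the base before indexing is what limits the table to at most 4096 entries, since 0x6FFF - 0x6000 + 1 is 4096.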
Depending on their purpose, these calls can be categorized as follows:
- Calls that are used for low-level performance monitoring, diagnostics, and power management (Table 6-12)
Table 6-12. PowerPC-Only Calls for Performance Monitoring, Diagnostics, and Power Management
| Call Number | Call Name | Purpose |
| 0x6000 | diagCall | Calls the routines implemented in the kernel's built-in diagnostics facility (see Section 6.8.8.2) |
| 0x6009 | CHUDCall | Acts as a hook for the Computer Hardware Understanding Development (CHUD) interface; disabled to begin with, but is set to a private system call callback function when such a callback is registered by CHUD |
| 0x600A | ppcNull | Does nothing and simply returns (a null system call); used for performance testing |
| 0x600B | perfmon_control | Allows manipulation of the PowerPC performance-monitoring facility |
| 0x600C | ppcNullinst | Does nothing but forces various timestamps to be returned (an instrumented null system call); used for performance testing |
| 0x600D | pmsCntrl | Controls the Power Management Stepper |
- Calls that allow a user program to instantiate and control a virtual machine using the kernel's virtual machine monitor (VMM) facility (Table 6-13)
Table 6-13. PowerPC-Only Calls for the Virtual Machine Monitor
| Call Number | Call Name | Purpose |
| 0x6001 | vmm_get_version | Retrieves the VMM facility's version |
| 0x6002 | vmm_get_features | Retrieves the VMM facility's supported features |
| 0x6003 | vmm_init_context | Initializes a new VMM context |
| 0x6004 | vmm_dispatch | Used as an indirect system call for dispatching various VMM system calls; is also an ultra-fast trap (see Section 6.7.5) |
| 0x6008 | vmm_stop_vm | Stops a running virtual machine |
- Calls that provide kernel assistance to the Blue Box (Classic) environment (Table 6-14)
Table 6-14. PowerPC-Only Calls for the Blue Box
| Call Number | Call Name | Purpose |
| 0x6005 | bb_enable_bluebox | Enables a thread for use in the Blue Box virtual machine |
| 0x6006 | bb_disable_bluebox | Disables a thread for use in the Blue Box virtual machine |
| 0x6007 | bb_settaskenv | Sets the Blue Box per-thread task environment data |
6.7.5. Ultra-Fast Traps
Certain traps are handled entirely by the low-level exception handlers in osfmk/ppc/lowmem_vectors.s, without saving or restoring much (or any) state. Such traps also return from the system call interrupt very rapidly. These are the ultra-fast traps (UFTs). As shown in Figure 6-13, these calls have dedicated handlers in the scTable, from where the exception vector at 0xC00 loads them. Table 6-15 lists the ultra-fast traps.
Table 6-15. Ultra-Fast Traps
| Call Number | Association | Purpose |
| 0xFFFF_FFFE | Blue Box only | Determines whether the given Blue Box task is preemptive, and also loads GPR0 with the shadowed task environment (MkIsPreemptiveTaskEnv) |
| 0xFFFF_FFFF | Blue Box only | Determines whether the given Blue Box task is preemptive (MkIsPreemptiveTask) |
| 0x8000_0000 | CutTrace firmware call | Used for low-level tracing (see Section 6.8.9.2) |
| 0x6004 | vmm_dispatch | Treats certain calls (those belonging to a specific range of selectors supported by this dispatcher call) as ultra-fast traps, eventually handled by vmm_ufp() [osfmk/ppc/vmachmon_asm.s] |
| 0x7FF2 | User only | Returns the pthread_self value, i.e., the thread-specific pointer (Thread Info UFT) |
| 0x7FF3 | User only | Returns floating-point and AltiVec facility status, i.e., if they are being used by the current thread (Facility Status UFT) |
| 0x7FF4 | Kernel only | Loads the Machine State Register; not used on 64-bit hardware (Load MSR UFT) |
A comm area (see Section 6.7.6) routine uses the Thread Info UFT for retrieving the thread-specific (self) pointer, which is also called the
per-thread cookie
. The
pthread_self(3)
library function retrieves this value. The following assembly stub, which directly uses the UFT, retrieves the same value as the
pthread_self()
function in a user program.
; my_pthread_self.S
.text
.globl _my_pthread_self
_my_pthread_self:
li r0,0x7FF2
sc
blr
Note that on certain PowerPC processors (for example, the 970 and the 970FX), the special-purpose register SPRG3, which Mac OS X uses to hold the per-thread cookie, can be read from user space.
; my_pthread_self_970.S
.text
.globl _my_pthread_self_970
_my_pthread_self_970:
mfspr r3,259 ; 259 is user SPRG3
blr
Let us test our versions of
pthread_self()
by using them in a 32-bit program on both a G4 and a G5, as shown in Figure 6-25.
Figure 6-25. Testing the Thread Info UFT
$ cat main.c
#include <stdio.h>
#include <pthread.h>
extern pthread_t my_pthread_self();
extern pthread_t my_pthread_self_970();
int
main(void)
{
printf("library: %p\n", pthread_self());
// call library function
printf("UFT : %p\n", my_pthread_self());
// use 0x7FF2 UFT
printf("SPRG3 : %p\n", my_pthread_self_970());
// read from SPRG3
return 0;
}
$ machine
ppc970
$ gcc -Wall -o my_pthread_self main.c my_pthread_self.S my_pthread_self_970.S
$ ./my_pthread_self
library: 0xa000ef98
UFT : 0xa000ef98
SPRG3 : 0xa000ef98
$ machine
ppc7450
$ ./my_pthread_self
library: 0xa000ef98
UFT : 0xa000ef98
zsh: illegal hardware instruction ./my_pthread_self
The Facility Status UFT can be used to determine which processor facilities (such as floating-point and AltiVec) are being used by the current thread. The following function, which directly uses the UFT, will return with a word whose bits specify the processor facilities in use.
; my_facstat.S
.text
.globl _my_facstat
_my_facstat:
li r0,0x7FF3
sc
blr
The program in Figure 6-26 initializes a vector variable only if you run it with one or more arguments on the command line. Therefore, it should report that AltiVec is being used only if you run it with an argument.
Figure 6-26. Testing the Facility Status UFT
// isvector.c
#include <stdio.h>
// defined in osfmk/ppc/thread_act.h
#define vectorUsed 0x20000000
#define floatUsed 0x40000000
#define runningVM 0x80000000
extern int my_facstat(void);
int
main(int argc, char **argv)
{
int facstat;
vector signed int c;
if (argc > 1)
c = (vector signed int){ 1, 2, 3, 4 };
facstat = my_facstat();
printf("%s\n", (facstat & vectorUsed) ? \
"vector used" : "vector not used");
return 0;
}
$ gcc -Wall -o isvector isvector.c my_facstat.S
$ ./isvector
vector not used
$ ./isvector usevector
vector used
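The word returned by the Facility Status UFT can also be decoded without any PowerPC-specific code. The following portable sketch reuses the three bit values quoted in the isvector.c listing; the helper names (demo_append, demo_describe_facstat) and the output format are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Bit values as quoted in isvector.c (from osfmk/ppc/thread_act.h). */
#define vectorUsed 0x20000000u
#define floatUsed  0x40000000u
#define runningVM  0x80000000u

/* Append a facility name to buf, comma-separated, without overflowing it. */
static void demo_append(char *buf, size_t len, const char *name)
{
    if (buf[0] != '\0')
        strncat(buf, ",", len - strlen(buf) - 1);
    strncat(buf, name, len - strlen(buf) - 1);
}

/* Describe which facilities a facility-status word reports as in use. */
static const char *demo_describe_facstat(unsigned int facstat,
                                         char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    buf[0] = '\0';
    if (facstat & floatUsed)  demo_append(buf, len, "float");
    if (facstat & vectorUsed) demo_append(buf, len, "vector");
    if (facstat & runningVM)  demo_append(buf, len, "vm");
    if (buf[0] == '\0')       demo_append(buf, len, "none");
    return buf;
}
```

On a PowerPC machine, the value returned by my_facstat() could be passed straight to demo_describe_facstat() instead of being tested against a single bit.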
6.7.5.1. Fast Traps
A few other traps that need somewhat more processing than ultra-fast traps, or are not as beneficial to handle so urgently, are handled by
shandler()
in
osfmk/ppc/hw_exception.s
. These are called
fast traps
, or
fastpath calls
. Table 6-16 lists the fastpath calls. Figure 6-12 shows the handling of both ultra-fast and fast traps.
Table 6-16. Fastpath System Calls
| Call Number | Call Name | Purpose |
| 0x7FF1 | CthreadSetSelf | Sets a thread's identifier. This call is used by the Pthread library to implement pthread_set_self(), which is used during thread creation. |
| 0x7FF5 | Null fastpath | Does nothing. It branches straight to exception_exit() in lowmem_vectors.s. |
| 0x7FFA | Blue Box interrupt notification | Results in the invocation of syscall_notify_interrupt() [osfmk/ppc/PseudoKernel.c], which queues an interrupt for the Blue Box and sets an asynchronous procedure call (APC) AST. The Blue Box interrupt handler, bbsetRupt() [osfmk/ppc/PseudoKernel.c], runs asynchronously to handle the interrupt. |
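To keep the various number spaces apart, it can help to classify a call number the way the tables in this section do. The sketch below is only a summary of those tables in C: the demo_ names are hypothetical, the Blue Box only, user only, and kernel only restrictions are ignored, 0x6004 is reported simply as a PowerPC-only call, and the kernel's actual dispatch logic is not modeled.

```c
#include <assert.h>
#include <stdint.h>

typedef enum {
    DEMO_BSD_SYSCALL,
    DEMO_MACH_TRAP,
    DEMO_PPC_ONLY,
    DEMO_FAST_TRAP,
    DEMO_ULTRA_FAST_TRAP
} demo_call_class;

/* Classify the call number a stub would place in GPR0. */
static demo_call_class demo_classify(uint32_t r0)
{
    switch (r0) {
    case 0xFFFFFFFEu: case 0xFFFFFFFFu: case 0x80000000u:
    case 0x7FF2u: case 0x7FF3u: case 0x7FF4u:
        return DEMO_ULTRA_FAST_TRAP;
    case 0x7FF1u: case 0x7FF5u: case 0x7FFAu:
        return DEMO_FAST_TRAP;
    default:
        break;
    }
    if (r0 >= 0x6000u && r0 <= 0x6FFFu)
        return DEMO_PPC_ONLY;
    if (r0 & 0x80000000u)            /* negative when viewed as signed */
        return DEMO_MACH_TRAP;
    return DEMO_BSD_SYSCALL;
}

/* Human-readable name for a class. */
static const char *demo_class_name(demo_call_class c)
{
    static const char *const names[] = {
        "BSD system call", "Mach trap", "PowerPC-only call",
        "fast trap", "ultra-fast trap"
    };
    assert((unsigned int)c <= (unsigned int)DEMO_ULTRA_FAST_TRAP);
    return names[c];
}
```

The special values are tested first because 0xFFFF_FFFE, 0xFFFF_FFFF, and 0x8000_0000 would otherwise look like negative (Mach trap) numbers.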
6.7.5.2. Blue Box Calls
The Mac OS X kernel includes support code for the Blue Box virtualizer that provides the Classic runtime environment. The support is implemented as a small layer of software called the
PseudoKernel
, whose functionality is exported via a set of fast/ultra-fast system calls. We came across these calls in Tables 6-14, 6-15, and 6-16.
The truBlueEnvironment program, which resides within the Resources subdirectory of the Classic application package (Classic Startup.app), directly uses the 0x6005 (bb_enable_bluebox), 0x6006 (bb_disable_bluebox), 0x6007 (bb_settaskenv), and 0x7FFA (interrupt notification) system calls.
A specially designated thread, the Blue thread, runs Mac OS while handling Blue Box interrupts, traps, and system calls. Other threads can only issue system calls. The
bb_enable_bluebox()
[
osfmk/ppc/PseudoKernel.c
] PowerPC-only system call is used to enable the support code in the kernel. It receives three arguments from the user-space caller: a task identifier, a pointer to the trap table (
TWI_TableStart
), and a pointer to a descriptor table (
Desc_TableStart
).
bb_enable_bluebox()
passes these arguments in a call to
enable_bluebox()
[
osfmk/ppc/PseudoKernel.c
], which aligns the passed-in descriptor address to a page, wires the page, and maps it into the kernel. The page holds a
BlueThreadTrapDescriptor
structure (
BTTD_t
), which is declared in
osfmk/ppc/PseudoKernel.h
. Thereafter,
enable_bluebox()
initializes several Blue Box-related fields of the thread's machine-specific state (the machine_thread structure). Figure 6-27 shows pseudocode depicting the operation of
enable_bluebox()
.
Figure 6-27. Enabling the kernel's Blue Box support
// osfmk/ppc/thread.h
struct machine_thread {
...
// Points to Blue Box Trap descriptor area in kernel (page aligned)
unsigned int bbDescAddr;
// Points to Blue Box Trap descriptor area in user (page aligned)
unsigned int bbUserDA;
unsigned int bbTableStart;
// Points to Blue Box Trap dispatch area in user
unsigned int emPendRupts;
// Number of pending emulated interruptions
unsigned int bbTaskID;
// Opaque task ID for Blue Box threads
unsigned int bbTaskEnv;
// Opaque task data reference for Blue Box threads
unsigned int specFlags;
// Special flags
...
unsigned int bbTrap;
// Blue Box trap vector
unsigned int bbSysCall;
// Blue Box syscall vector
unsigned int bbInterrupt;
// Blue Box interrupt vector
unsigned int bbPending;
// Blue Box pending interrupt vector
...
};
// osfmk/ppc/PseudoKernel.c
kern_return_t
enable_bluebox(host_t host, void *taskID, void *TWI_TableStart,
char *Desc_TableStart)
{
thread_t th;
vm_offset_t kerndescaddr, origdescoffset;
kern_return_t ret;
ppnum_t physdescpage;
BTTD_t *bttd;
th = current_thread();
// Get our thread.
// Ensure descriptor is non-NULL.
// Get page offset of the descriptor in 'origdescoffset'.
// Now align descriptor to a page.
// Kernel wire the descriptor in the user's map.
// Map the descriptor's physical page into the kernel's virtual address
// space, calling the resultant address 'kerndescaddr'. Set the 'bttd'
// pointer to 'kerndescaddr'.
// Set the thread's Blue Box machine state.
// Kernel address of the table
th->machine.bbDescAddr = (unsigned int)kerndescaddr + origdescoffset;
// User address of the table
th->machine.bbUserDA = (unsigned int)Desc_TableStart;
// Address of the trap table
th->machine.bbTableStart = (unsigned int)TWI_TableStart;
...
// Remember trap vector.
th->machine.bbTrap = bttd->TrapVector;
// Remember syscall vector.
th->machine.bbSysCall = bttd->SysCallVector;
// Remember pending interrupt vector.
th->machine.bbPending = bttd->PendingIntVector;
// Ensure Mach system calls are enabled and we are not marked preemptive.
th->machine.specFlags &= ~(bbNoMachSC | bbPreemptive);
// Set that we are the Classic thread.
th->machine.specFlags |= bbThread;
...
}
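The address arithmetic that the pseudocode describes in comments (take the page offset, align to a page, then add the offset back to the kernel mapping) can be sketched in portable C. This is a minimal illustration that assumes a 4096-byte page and 32-bit addresses; the demo_ names are hypothetical, and the wiring and mapping steps themselves are not modeled.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096u   /* assumed page size */

typedef struct {
    uint32_t page_base;   /* address rounded down to a page boundary */
    uint32_t offset;      /* offset of the address within that page */
} demo_page_split;

/* Split an address into its page base and its offset within the page. */
static demo_page_split demo_split_address(uint32_t addr)
{
    demo_page_split s;
    s.offset    = addr & (DEMO_PAGE_SIZE - 1u);
    s.page_base = addr & ~(DEMO_PAGE_SIZE - 1u);
    return s;
}

/* Address of the same byte once its page is mapped at new_page_base. */
static uint32_t demo_rebase(uint32_t new_page_base, demo_page_split s)
{
    assert((new_page_base & (DEMO_PAGE_SIZE - 1u)) == 0);  /* page aligned */
    return new_page_base + s.offset;
}
```

demo_rebase() corresponds to the expression kerndescaddr + origdescoffset that is stored in bbDescAddr.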
Once the Blue Box trap and system call tables are established, the PseudoKernel can be invoked
while changing Blue Box interruption state atomically. Both
thandler()
and
shandler()
check for the Blue Box during trap and system call processing, respectively.
thandler()
checks the
specFlags
field of the current activation's
machine_thread
structure to see if the
bbThread
bit is set. If the bit is set,
thandler()
calls
checkassist()
[
osfmk/ppc/hw_exception.s
], which checks whether all the following conditions hold true.
- The SRR1_PRG_TRAP_BIT bit of SRR1 specifies that this is a trap.
- The trapped address is in user space.
- This is not an AST; that is, the trap type is not a T_AST.
- The trap number is not out of range; that is, it is not more than a predefined maximum.
If all of these conditions are satisfied,
checkassist()
branches to
atomic_switch_trap()
[
osfmk/ppc/atomic_switch.s
], which loads the trap table (the
bbTrap
field of the
machine_thread
structure) in GPR5 and jumps to
.L_CallPseudoKernel()
[
osfmk/ppc/atomic_switch.s
].
shandler()
checks whether system calls are being redirected to the Blue Box by examining the value of the
bbNoMachSC
bit of the
specFlags
field. If this bit is set,
shandler()
calls
atomic_switch_syscall()
[
osfmk/ppc/atomic_switch.s
], which loads the system call table (the
bbSysCall
field of the
machine_thread
structure) in GPR5 and
falls
through to
.L_CallPseudoKernel()
.
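The two redirection decisions just described can be condensed into a small model. In the sketch below, the flag values, the structures, and the function names are all hypothetical stand-ins (the real definitions live in the xnu headers and the real checks are written in assembly); it is meant only to show how the bbThread and bbNoMachSC bits select between the Blue Box trap vector, the Blue Box system call vector, and normal kernel handling.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative flag values; the real ones are defined in the xnu headers. */
#define DEMO_bbThread   0x1u
#define DEMO_bbNoMachSC 0x2u

struct demo_machine_thread {
    unsigned int specFlags;
    unsigned int bbTrap;      /* Blue Box trap vector */
    unsigned int bbSysCall;   /* Blue Box syscall vector */
};

struct demo_trap_info {
    bool is_program_trap;     /* SRR1 says this is a trap */
    bool from_user_space;     /* trapped address is in user space */
    bool is_ast;              /* trap type is T_AST */
    unsigned int number;      /* trap number */
    unsigned int max;         /* predefined maximum trap number */
};

/* The four conditions that must all hold before the trap is redirected. */
static bool demo_checkassist(const struct demo_trap_info *t)
{
    assert(t != NULL);
    return t->is_program_trap && t->from_user_space &&
           !t->is_ast && t->number <= t->max;
}

/* Vector to resume at for a trap, or 0 for normal kernel handling. */
static unsigned int demo_trap_vector(const struct demo_machine_thread *m,
                                     const struct demo_trap_info *t)
{
    assert(m != NULL);
    if ((m->specFlags & DEMO_bbThread) && demo_checkassist(t))
        return m->bbTrap;
    return 0;
}

/* Vector to resume at for a system call, or 0 for normal handling. */
static unsigned int demo_syscall_vector(const struct demo_machine_thread *m)
{
    assert(m != NULL);
    return (m->specFlags & DEMO_bbNoMachSC) ? m->bbSysCall : 0;
}
```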
In both cases,
.L_CallPseudoKernel(), among other things, stores the vector contained in GPR5 in the saved SRR0 as the instruction at which execution will resume. Thereafter, it jumps to
fastexit()
[
osfmk/ppc/hw_exception.s
], which jumps to
exception_exit()
[
osfmk/ppc/lowmem_vectors.s
], thus causing a return to the caller.
A particular Blue Box trap value (
bbMaxTrap
) is used to simulate a return-from-interrupt from the PseudoKernel to user context. Returning Blue Box traps and system calls use this trap, which results in the invocation of
.L_ExitPseudoKernel()
[
osfmk/ppc/atomic_switch.s
].
6.7.6. The Commpage
The kernel reserves the last eight pages of every address space for the kernel-user comm area, also referred to as the commpage. Besides being wired in kernel memory, these pages are mapped (shared and read-only) into the address space of every process. Their contents include code and data that are frequently accessed systemwide. The following are examples of commpage contents:
- Specifications of processor features available on the machine, such as whether the processor is 64-bit, what the cache-line size is, and whether AltiVec is present
- Frequently used routines, such as functions for copying, moving, and zeroing memory; for using spinlocks; for flushing the data cache and invalidating the instruction cache; and for retrieving the per-thread cookie
- Various time-related values maintained by the kernel, allowing the current seconds and microseconds to be retrieved by user programs without making system calls
There are separate comm areas for 32-bit and 64-bit address spaces, although they are conceptually similar. We will discuss only the 32-bit comm area in this section.
Using the end of the address space for the comm area has an important benefit: It is possible to access both code and data in the comm area from
anywhere
in the address space, without involving the dynamic link editor or requiring complex address calculations. Absolute unconditional branch instructions, such as
ba
,
bca
, and
bla
, can branch to a location in the comm area from anywhere because they have enough bits in their target address encoding fields to allow them to reach the comm area pages using a sign-extended target address specification. Similarly, absolute loads and stores can comfortably access the comm area. Consequently, accessing the comm area is both efficient and
convenient
.
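The reach of these absolute branches can be checked with a little arithmetic. The sketch below assumes the standard PowerPC encodings, in which ba and bla carry a 24-bit word-offset field and bca carries a 14-bit one; in both cases two zero bits are appended and the result is sign-extended to form the target address. The demo_ function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Compute the 32-bit target of an absolute branch from its offset field:
 * append two zero bits, then sign-extend the (field_bits + 2)-bit value.
 */
static uint32_t demo_absolute_target(uint32_t field, unsigned int field_bits)
{
    assert(field_bits >= 1 && field_bits <= 24);
    unsigned int bits  = field_bits + 2;
    uint32_t     value = (field << 2) & ((1u << bits) - 1u);
    uint32_t     sign  = 1u << (bits - 1);
    return (value ^ sign) - sign;      /* wraps modulo 2^32 */
}

/* Can an absolute branch with a field of this width reach target? */
static bool demo_reachable(uint32_t target, unsigned int field_bits)
{
    assert(field_bits >= 1 && field_bits <= 24);
    uint32_t half = 1u << (field_bits + 1);   /* size of each reachable half */
    if (target & 3u)
        return false;                         /* instructions are word aligned */
    return target < half || target >= (uint32_t)(0u - half);
}
```

With a 14-bit field, the reachable high region begins at 0xFFFF8000, which is the start of the last eight pages of a 32-bit address space; with a 24-bit field it begins at 0xFE000000.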
The comm area is populated during kernel initialization in a processor-specific and platform-specific manner. commpage_populate() [osfmk/ppc/commpage/commpage.c] performs this initialization. In fact, functionality contained in the comm area can be regarded as processor capabilities: a software extension to the native instruction set. Various comm-area-related constants are defined in osfmk/ppc/cpu_capabilities.h.
// osfmk/ppc/cpu_capabilities.h
// Start at page -8, ie 0xFFFF8000
#define _COMM_PAGE_BASE_ADDRESS (-8*4096)
// Reserved length of entire comm area
#define _COMM_PAGE_AREA_LENGTH (7*4096)
// Mac OS X uses two pages so far
#define _COMM_PAGE_AREA_USED (2*4096)
// The Objective-C runtime fixed address page to optimize message dispatch
#define OBJC_PAGE_BASE_ADDRESS (-20*4096)
// Data in the comm page
...
// Code in the comm page (routines)
...
// Used by gettimeofday()
#define _COMM_PAGE_GETTIMEOFDAY \
(_COMM_PAGE_BASE_ADDRESS+0x2e0)
...
The comm area's actual maximum length is seven pages (not eight) since Mach's virtual memory subsystem does not map the last page of an address space.
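The constants above are easy to sanity-check with 32-bit arithmetic. The following sketch recomputes the base address, the end of the seven-page area, and the address used by gettimeofday(); the DEMO_ macros mirror the values quoted from cpu_capabilities.h, whereas the helper functions are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE         4096
#define DEMO_COMM_BASE    ((uint32_t)(-8 * DEMO_PAGE))   /* 0xFFFF8000 */
#define DEMO_COMM_LENGTH  ((uint32_t)(7 * DEMO_PAGE))    /* seven pages */
#define DEMO_GETTIMEOFDAY (DEMO_COMM_BASE + 0x2e0u)

/* Does a 32-bit address fall inside the seven usable comm area pages? */
static int demo_in_comm_area(uint32_t addr)
{
    return addr >= DEMO_COMM_BASE &&
           addr - DEMO_COMM_BASE < DEMO_COMM_LENGTH;
}

/* Offset of an address from the start of the comm area. */
static uint32_t demo_comm_offset(uint32_t addr)
{
    assert(demo_in_comm_area(addr));
    return addr - DEMO_COMM_BASE;
}
```

The area therefore spans 0xFFFF8000 through 0xFFFFEFFF, leaving the final page (0xFFFFF000 onward) outside it.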
Each routine in the commpage is described by a
commpage_descriptor
structure, which is declared in
osfmk/ppc/commpage/commpage.h
.
// osfmk/ppc/commpage/commpage.h
typedef struct commpage_descriptor {
short code_offset;
// offset to code from this descriptor
short code_length;
// length in bytes
short commpage_address;
// put at this address
short special;
// special handling bits for DCBA, SYNC, etc.
long musthave;
// _cpu_capability bits we must have
long canthave;
// _cpu_capability bits we cannot have
} commpage_descriptor;
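The musthave and canthave fields suggest a simple eligibility test: a routine is usable only if the processor has every required capability bit and none of the forbidden ones. The sketch below shows that test and a first-match selection over several variants of a routine. Whether commpage_populate() applies exactly this rule is an assumption here, and the demo_ names, the reduced descriptor, and the variant names in the usage note are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Two capability bits, used only for illustration. */
#define DEMO_kHasAltivec 0x1ul
#define DEMO_k64Bit      0x2ul

struct demo_descriptor {
    const char   *name;       /* label for this variant of a routine */
    unsigned long musthave;   /* capability bits that must be present */
    unsigned long canthave;   /* capability bits that must be absent */
};

/* A variant is usable if all musthave bits and no canthave bits are set. */
static bool demo_usable(const struct demo_descriptor *d, unsigned long caps)
{
    assert(d != NULL);
    return (caps & d->musthave) == d->musthave &&
           (caps & d->canthave) == 0;
}

/* Return the first usable variant in the list, or NULL if none fits. */
static const struct demo_descriptor *
demo_pick(const struct demo_descriptor *list, size_t count, unsigned long caps)
{
    for (size_t i = 0; i < count; i++)
        if (demo_usable(&list[i], caps))
            return &list[i];
    return NULL;
}
```

For example, given a scalar variant that cannot have AltiVec, an AltiVec variant that cannot have 64-bit support, and a variant that requires both, a processor reporting both bits would be matched only by the last one.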
Implementations of the comm area routines are in the
osfmk/ppc/commpage/
directory. Let us look at the example of
gettimeofday()
, which is both a system call and a comm area routine. It is substantially more expensive to retrieve the current time using the system call. Besides a regular system call stub for
gettimeofday()
, the C library contains the following entry point for calling the comm area version of
gettimeofday()
.
.globl __commpage_gettimeofday
.text
.align 2
__commpage_gettimeofday:
ba __COMM_PAGE_GETTIMEOFDAY
Note that
_COMM_PAGE_GETTIMEOFDAY
is a leaf procedure that must be jumped to, instead of being called as a returning function.
Note that comm area contents are not
guaranteed
to be available on all machines. Moreover, in the particular case of
gettimeofday()
, the time values are updated asynchronously by the kernel and read atomically from user space, leading to
occasional
failures in reading. The C library falls back to the system call version in the case of failure.
// <darwin>/<Libc>/sys/gettimeofday.c
int
gettimeofday(struct timeval *tp, struct timezone *tzp)
{
...
#if defined(__ppc__) || defined(__ppc64__)
{
...
// first try commpage
if (__commpage_gettimeofday(tp)) {
// if it fails, try the system call
if (__ppc_gettimeofday(tp,tzp)) {
return (-1);
}
}
}
#else
if (syscall(SYS_gettimeofday, tp, tzp) < 0) {
return -1;
}
#endif
...
}
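The try-fast-then-fall-back structure of this function can be exercised on any machine by substituting ordinary functions for the comm area routine and the system call. In the sketch below, the function-pointer type, the demo_ providers, and the fixed time values are all made up for illustration; only the control flow mirrors the listing above.

```c
#include <assert.h>
#include <stddef.h>

struct demo_timeval { long tv_sec; long tv_usec; };

/* A time source returns 0 on success and nonzero on failure. */
typedef int (*demo_time_fn)(struct demo_timeval *tp);

/* Try the fast source first; fall back to the slow one if it fails. */
static int demo_gettimeofday(demo_time_fn fast, demo_time_fn slow,
                             struct demo_timeval *tp)
{
    assert(fast != NULL && slow != NULL && tp != NULL);
    if (fast(tp) == 0)
        return 0;          /* fast path succeeded */
    if (slow(tp) == 0)
        return 0;          /* fallback succeeded */
    return -1;             /* both sources failed */
}

/* Demo sources with fixed, recognizable values. */
static int demo_fast_ok(struct demo_timeval *tp)
{
    tp->tv_sec = 100; tp->tv_usec = 1; return 0;
}
static int demo_fast_fail(struct demo_timeval *tp) { (void)tp; return 1; }
static int demo_slow_ok(struct demo_timeval *tp)
{
    tp->tv_sec = 200; tp->tv_usec = 2; return 0;
}
static int demo_slow_fail(struct demo_timeval *tp) { (void)tp; return 1; }
```

Because the slow source is consulted only after a failure, an occasional failed read on the fast path costs one extra call rather than an error for the caller.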
Since the comm area is readable from within every process, let us write a program to display the information contained in it. Since the comm area API is private, you must include the required headers from the kernel source tree rather than a standard header directory. The program shown in Figure 6-28 displays the data and routine descriptors contained in the 32-bit comm area.
Figure 6-28. Displaying the contents of the comm area
// commpage32.c
#include <stdio.h>
#include <stdlib.h>
#include <inttypes.h>
#define PRIVATE
#define KERNEL_PRIVATE
#include <machine/cpu_capabilities.h>
#include <machine/commpage.h>
#define WSPACE_FMT_SZ "24"
#define WSPACE_FMT "%-" WSPACE_FMT_SZ "s = "
#define CP_CAST_TO_U_INT32(x) (u_int32_t)(*(u_int32_t *)(x))
#define ADDR2DESC(x) (commpage_descriptor *)&(CP_CAST_TO_U_INT32(x))
#define CP_PRINT_U_INT8_BOOL(label, item) \
printf(WSPACE_FMT "%s\n", label, \
((u_int8_t)(*(u_int8_t *)(item))) ? "yes" : "no")
#define CP_PRINT_U_INT16(label, item) \
printf(WSPACE_FMT "%hd\n", label, (u_int16_t)(*(u_int16_t *)(item)))
#define CP_PRINT_U_INT32(label, item) \
printf(WSPACE_FMT "%u\n", label, (u_int32_t)(*(u_int32_t *)(item)))
#define CP_PRINT_U_INT64(label, item) \
printf(WSPACE_FMT "%#llx\n", label, (u_int64_t)(*(u_int64_t *)(item)))
#define CP_PRINT_D_FLOAT(label, item) \
printf(WSPACE_FMT "%lf\n", label, (double)(*(double *)(item)))
const char *
cpuCapStrings[] = {
#if defined (__ppc__)
"kHasAltivec", // << 0
"k64Bit", // << 1
"kCache32", // << 2
"kCache64", // << 3
"kCache128", // << 4
"kDcbaRecommended", // << 5
"kDcbaAvailable", // << 6
"kDataStreamsRecommended", // << 7
"kDataStreamsAvailable", // << 8
"kDcbtStreamsRecommended", // << 9
"kDcbtStreamsAvailable", // << 10
"kFastThreadLocalStorage", // << 11
#else /* __i386__ */
"kHasMMX", // << 0
"kHasSSE", // << 1
"kHasSSE2", // << 2
"kHasSSE3", // << 3
"kCache32", // << 4
"kCache64", // << 5
"kCache128", // << 6
"kFastThreadLocalStorage", // << 7
NULL, // << 8
NULL, // << 9
NULL, // << 10
NULL, // << 11
#endif
NULL, // << 12
NULL, // << 13
NULL, // << 14
"kUP", // << 15
NULL, // << 16
NULL, // << 17
NULL, // << 18
NULL, // << 19
NULL, // << 20
NULL, // << 21
NULL, // << 22
NULL, // << 23
NULL, // << 24
NULL, // << 25
NULL, // << 26
"kHasGraphicsOps", // << 27
"kHasStfiwx", // << 28
"kHasFsqrt", // << 29
NULL, // << 30
NULL, // << 31
};
void print_bits32(u_int32_t);
void print_cpu_capabilities(u_int32_t);
void print_commpage_descriptor(const char *, u_int32_t);
void
print_bits32(u_int32_t u)
{
u_int32_t i;
for (i = 32; i--; putchar(u & 1 << i ? '1' : '0'));
}
void
print_cpu_capabilities(u_int32_t cap)
{
int i;
printf(WSPACE_FMT, "cpu capabilities (bits)");
print_bits32(cap);
printf("\n");
for (i = 0; i < 31; i++)
if (cpuCapStrings[i] && (cap & (1 << i)))
printf("%-" WSPACE_FMT_SZ "s + %s\n", " ", cpuCapStrings[i]);
}
void
print_commpage_descriptor(const char *label, u_int32_t addr)
{
commpage_descriptor *d = ADDR2DESC(addr);
printf("%s @ %08x\n", label, addr);
#if defined (__ppc__)
printf(" code_offset = %hd\n", d->code_offset);
printf(" code_length = %hd\n", d->code_length);
printf(" commpage_address = %hx\n", d->commpage_address);
printf(" special = %#hx\n", d->special);
#else /* __i386__ */
printf(" code_address = %p\n", d->code_address);
printf(" code_length = %ld\n", d->code_length);
printf(" commpage_address = %#lx\n", d->commpage_address);
#endif
printf(" musthave = %#lx\n", d->musthave);
printf(" canthave = %#lx\n", d->canthave);
}
int
main(void)
{
u_int32_t u;
printf(WSPACE_FMT "%#08x\n", "base address", _COMM_PAGE_BASE_ADDRESS);
printf(WSPACE_FMT "%s\n", "signature", (char *)_COMM_PAGE_BASE_ADDRESS);
CP_PRINT_U_INT16("version", _COMM_PAGE_VERSION);
u = CP_CAST_TO_U_INT32(_COMM_PAGE_CPU_CAPABILITIES);
printf(WSPACE_FMT "%u\n", "number of processors",
(u & kNumCPUs) >> kNumCPUsShift);
print_cpu_capabilities(u);
CP_PRINT_U_INT16("cache line size", _COMM_PAGE_CACHE_LINESIZE);
#if defined (__ppc__)
CP_PRINT_U_INT8_BOOL("AltiVec available?", _COMM_PAGE_ALTIVEC);
CP_PRINT_U_INT8_BOOL("64-bit processor?", _COMM_PAGE_64_BIT);
#endif
CP_PRINT_D_FLOAT("two52 (2^52)", _COMM_PAGE_2_TO_52);
CP_PRINT_D_FLOAT("ten6 (10^6)", _COMM_PAGE_10_TO_6);
CP_PRINT_U_INT64("timebase", _COMM_PAGE_TIMEBASE);
CP_PRINT_U_INT32("timestamp (s)", _COMM_PAGE_TIMESTAMP);
CP_PRINT_U_INT32("timestamp (us)", _COMM_PAGE_TIMESTAMP + 0x04);
CP_PRINT_U_INT64("seconds per tick", _COMM_PAGE_SEC_PER_TICK);
printf("\n");
printf(WSPACE_FMT "%s", "descriptors", "\n");
// example descriptor
print_commpage_descriptor(" mach_absolute_time()",
_COMM_PAGE_ABSOLUTE_TIME);
exit(0);
}
$ gcc -Wall -I /path/to/xnu/osfmk/ -o commpage32 commpage32.c
$ ./commpage32
base address = 0xffff8000
signature = commpage 32-bit
version = 2
number of processors = 2
cpu capabilities (bits) = 00111000000000100000011100010011
+ kHasAltivec
+ k64Bit
+ kCache128
+ kDataStreamsAvailable
+ kDcbtStreamsRecommended
+ kDcbtStreamsAvailable
+ kFastThreadLocalStorage
+ kHasGraphicsOps
+ kHasStfiwx
+ kHasFsqrt
cache line size = 128
AltiVec available? = yes
64-bit processor? = yes
two52 (2^52) = 4503599627370496.000000
ten6 (10^6) = 1000000.000000
timebase = 0x18f0d27c48c
timestamp (s) = 1104103731
timestamp (us) = 876851
seconds per tick = 0x3e601b8f3f3f8d9b
descriptors =
mach_absolute_time() @ ffff8200
code_offset = 31884
code_length = 17126
commpage_address = 7883
special = 0x22
musthave = 0x4e800020
canthave = 0
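The processor count in this output comes from a field packed into the capabilities word. As a portable cross-check, the sketch below parses the bit string printed above and extracts that field. The mask and shift are assumed to match kNumCPUs and kNumCPUsShift in cpu_capabilities.h (they are not shown in the listing), and the demo_ helpers are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Assumed to match kNumCPUs and kNumCPUsShift in cpu_capabilities.h. */
#define DEMO_kNumCPUs      0x00FF0000u
#define DEMO_kNumCPUsShift 16

/* Extract the processor count from a capabilities word. */
static unsigned int demo_num_cpus(uint32_t caps)
{
    return (caps & DEMO_kNumCPUs) >> DEMO_kNumCPUsShift;
}

/* Parse a string of '0' and '1' characters, most significant bit first. */
static uint32_t demo_parse_bits(const char *s)
{
    assert(s != NULL);
    uint32_t v = 0;
    for (; *s == '0' || *s == '1'; s++)
        v = (v << 1) | (uint32_t)(*s - '0');
    return v;
}
```

Under these assumptions, the bit string shown in the output corresponds to the word 0x38020713, whose processor-count field is 2, in agreement with the "number of processors" line.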