Section 6.4. Entering the Kernel | Mac OS X Internals: A Systems Approach

6.4. Entering the Kernel

On a typical operating system, user processes are logically insulated from the kernel's memory by using different processor execution modes. The Mac OS X kernel executes in a higher-privileged mode (PowerPC OEA) than any user program (PowerPC UISA and VEA). Each user process (that is, each Mach task) has its own virtual address space. Similarly, the kernel has its own, distinct virtual address space that does not occupy a subrange of the maximum possible address space of a user process. Specifically, the Mac OS X kernel has a private 32-bit (4GB) virtual address space, and so does each 32-bit user process. Similarly, a 64-bit user process also gets a private virtual address space that is not subdivided into kernel and user parts.

Although the Mac OS X user and kernel virtual address spaces are not subdivisions of a single virtual address space, the amounts of virtual memory usable within both are restricted due to conventional mappings. For example, kernel addresses in the 32-bit kernel virtual address space lie between 0x1000 and 0xDFFFFFFF (3.5GB). Similarly, the amount of virtual memory a 32-bit user process can use is significantly less than 4GB, since various system libraries are mapped by default into each user address space. We will see specific examples of such mappings in Chapter 8.

We will refer to the kernel virtual address space simply as the kernel space. Moreover, even though each user process has its own address space, we will often use the phrase the user space when the specific process is not relevant. In this sense, we can think of all user processes as residing in the user space. The following are some important characteristics of the kernel and user spaces.

The kernel space is inaccessible to user tasks. The kernel enforces this protection by using memory management hardware to create a boundary between kernel-level and user-level code.
The user space is fully accessible to the kernel.
The kernel normally prevents one user task from modifying, or even accessing, another task's memory. However, such protection is usually subject to task and system ownership. For example, there exist kernel-provided mechanisms through which a task T1 can access the address space of another task T2 if T1 is running with root privileges, or if T1 and T2 are both owned by the same user. Tasks can also explicitly share memory with other tasks.
The user space cannot directly access hardware. However, it is possible to have user-space device drivers that access hardware after mediation by the kernel.

Since the kernel mediates access to physical resources, a user program must exchange information with the kernel to avail itself of the kernel's services. Typical user-space execution requires exchange of both control information and data. In such an exchange between a Mach task and the kernel, a thread within the task transitions to kernel space from user space, transferring control to the kernel. After handling the user thread's request, the kernel returns control to the thread, allowing it to continue normal execution. At other times, the kernel can acquire control even though the current thread was not involved in the reason for the transfer; in fact, the transfer is often not explicitly requested by the programmer. We refer to execution within the kernel and user spaces as being in the kernel mode and the user mode, respectively.

A Modal Dialog

Technically, even the Mac OS X kernel mode can be thought of as consisting of two submodes. The first mode refers to the environment in which the kernel's own threads run, that is, the kernel task and its resources. The kernel task is a proper Mach task (it is the first Mach task to be created) that runs several dozen kernel threads on a typical system.

The second mode refers to threads running in the kernel after they enter the kernel from user space through a system call, that is, threads that trap from user space into the kernel. Kernel subsystems that need to be aware of the two modes may handle them differently.

6.4.1. Types of Control Transfer

Although such transfers of control are traditionally divided into categories based on the events that caused them, at the PowerPC processor level, all categories are handled by the same exception mechanism. Examples of events that can cause the processor to change execution mode include the following:

External signals, such as from the interrupt controller hardware
Abnormal conditions encountered while executing an instruction
Expected system events such as rescheduling and page faults
Trace exceptions caused by deliberate enabling of single-stepping (setting the SE bit of the MSR) or branch-tracing (setting the BE bit of the MSR)
Conditions internal to the processor, such as the detection of a parity error in the L1 D-cache
Execution of the system call instruction

Nevertheless, it is still useful to categorize control transfers in Mac OS X based on the events causing them. Let us look at some broad categories.

6.4.1.1. External Hardware Interrupts

An external hardware interrupt is a transfer of control into the kernel that is typically initiated by a hardware device to indicate an event. Such interrupts are signaled to the processor by the assertion of the processor's external interrupt input signal, which causes an external interrupt exception in the processor. External interrupts are asynchronous, and their occurrence is typically unrelated to the currently executing thread. Note that external interrupts can be masked.

An example of an external interrupt is a storage device controller causing an interrupt to signal the completion of an I/O request. On certain processors, such as the 970FX, a thermal exception (used to notify the processor of an abnormal condition) is signaled by the assertion of the thermal interrupt input signal. In this case, even though the abnormal condition is internal to the processor, the source of the interrupt is external.

6.4.1.2. Processor Traps

A processor trap is a transfer of control into the kernel that is initiated by the processor itself because of some event that needs attention. Processor traps may be synchronous or asynchronous. Although the conditions that cause traps could all be termed abnormal in that they are all exceptional (hence the exception), it is helpful to subclassify them as expected (such as page faults) or unexpected (such as a hardware failure). Other examples of reasons for traps include divide-by-zero errors, completion of a traced instruction, illegal access to memory, and the execution of an illegal instruction.

6.4.1.3. Software Traps

The Mac OS X kernel implements a mechanism called asynchronous system traps (ASTs), wherein one or more reason bits can be set by software for a processor or a thread. Each bit represents a particular software trap. When a processor is about to return from an interrupt context, including returns from system calls, it checks for these bits, and takes a trap if it finds one. The latter operation involves executing the corresponding interrupt-handling code. A thread checks for such traps in many cases when it is about to change its execution state, such as from being suspended to running. The kernel's clock interrupt handler also periodically checks for ASTs. We categorize ASTs on Mac OS X as software traps because they are both initiated and handled by software. Some AST implementations may use hardware support.

6.4.1.4. System Calls

The PowerPC system call instruction is used by programs to generate a system call exception, which causes the processor to prepare and execute the system call handler in the kernel. The system call exception is synchronous. Hundreds of system calls constitute a well-defined set of interfaces serving as entry points into the kernel for user programs.

POSIX

A standard set of system calls, along with their behavior, error handling, return values, and so on, is defined by a Portable Operating System Interface (POSIX) standard, which defines an interface and not its implementation. Mac OS X provides a large subset of the POSIX API.

The name POSIX was suggested by Richard Stallman. POSIX documentation suggests that the word should be pronounced pahz-icks, as in positive, and not poh-six or other variations.

To sum up, a hardware interrupt from an external device generates an external interrupt exception, a system call generates a system call exception, and other situations result in a variety of exceptions.

6.4.2. Implementing System Entry Mechanisms

PowerPC exceptions are the fundamental vehicles for propagating any kind of interrupts (other than ASTs), whether hardware- or software-generated. Before we discuss how some of these exceptions are processed, let us look at the key components of the overall PowerPC exception-processing mechanism on Mac OS X. These include the following, some of which we have come across in earlier chapters:

The kernel's exception vectors that reside in a designated memory area starting at physical memory address 0x0
PowerPC exception-handling registers
The rfid (64-bit) and rfi (32-bit) system linkage instructions, which are used for returning from interrupts
The sc system linkage instruction, which is used to cause a system call exception
Machine-dependent thread state, including memory areas called exception save areas, which are used for saving various types of context during exception processing

A system linkage instruction connects user-mode and supervisor-mode software. For example, by using a system linkage instruction (such as sc), a program can call on the operating system to perform a service. Conversely, after performing the service, the operating system can return to user-mode software by using another system linkage instruction (such as rfid).

6.4.2.1. Exceptions and Exception Vectors

The __VECTORS segment of the kernel executable (Figure 6-5) contains the kernel's exception vectors. As we saw in Chapter 4, BootX copies these to their designated location (starting at 0x0) before transferring control to the kernel. These vectors are implemented in osfmk/ppc/lowmem_vectors.s.

Figure 6-5. The Mach-O segment containing the exception vectors in the kernel executable

$ otool -l /mach_kernel
...
Load command 2
      cmd LC_SEGMENT
  cmdsize 124
  segname __VECTORS
   vmaddr 0x00000000
   vmsize 0x00007000
  fileoff 3624960
 filesize 28672
  maxprot 0x00000007
 initprot 0x00000003
   nsects 1
    flags 0x0
Section
  sectname __interrupts
   segname __VECTORS
      addr 0x00000000
      size 0x00007000
    offset 3624960
     align 2^12 (4096)
    reloff 0
    nreloc 0
     flags 0x00000000
 reserved1 0
 reserved2 0
...

Table 5-1 lists various PowerPC processor exceptions and some of their details. Recall that most exceptions are subject to one or more conditions; for example, most exceptions can occur only when no higher-priority exception exists. Similarly, exceptions caused by failed effective-to-virtual address translations can occur only if address translation is enabled. Moreover, depending on a system's specific hardware, or whether the kernel is being debugged, some exceptions listed in Table 5-1 may be inconsequential. Figure 6-6 shows an excerpt from lowmem_vectors.s. For example, when there is a system call exception, the processor executes the code starting at the label .L_handlerC00 (vector offset 0xC00).

Figure 6-6. The kernel's exception vectors

; osfmk/ppc/lowmem_vectors.s
...
#define VECTOR_SEGMENT .section __VECTORS, __interrupts

            VECTOR_SEGMENT

            .globl EXT(lowGlo)
EXT(lowGlo):

            .globl EXT(ExceptionVectorsStart)
EXT(ExceptionVectorsStart):
baseR:
            ...
            . = 0x100 ; T_RESET
            .globl EXT(ResetHandler)
.L_handler100:
            ...
            . = 0x200 ; T_MACHINE_CHECK
.L_handler200:
            ...
            . = 0x300 ; T_DATA_ACCESS
.L_handler300:
            ...
            . = 0xC00 ; T_SYSTEM_CALL
.L_handlerC00:
            ...

The exception vectors for the x86 version of Darwin are implemented in osfmk/i386/locore.s.

Exception Vectors in Early UNIX

The concept of exception vectors in early UNIX was very similar to the one being discussed here, although there were far fewer vectors. The UNIX trap vectors were defined in an assembly file called low.s or l.s, representing that the vectors resided in low memory. Figure 6-7 shows an excerpt from the low.s file in the Third Edition UNIX source.

Figure 6-7. Trap vectors in Third Edition UNIX

/ PDP-11 Research UNIX V3 (Third Edition), circa 1973
/ ken/low.s
/ low core
...
.globl start

. = 0^.
        4
        br      1f

/ trap vectors
        trap; br7+0             / bus error
        trap; br7+1             / illegal instruction
        trap; br7+2             / bpt-trace trap
        trap; br7+3             / iot trap
        trap; br7+4             / power fail
        trap; br7+5             / emulator trap
        trap; br7+6             / system entry

. = 040^.
1:      jmp     start

. = 060^.
        klin; br4
        klou; br4
...

6.4.2.2. Exception-Handling Registers

The Machine Status Save/Restore Register 0 (SRR0) is a special branch-processing register in the PowerPC architecture. It is used to save machine status on interrupts and to restore machine status on return from interrupts. When an interrupt occurs, SRR0 is set to the current or next instruction address, depending on the nature of the interrupt. For example, if an interrupt is caused by an illegal instruction exception, then SRR0 will contain the address of the current instruction (the one that failed to execute).

SRR1 is used for a related purpose: It is loaded with interrupt-specific information when an interrupt occurs. It also mirrors certain bits of the Machine State Register (MSR) in the case of an interrupt.

The special-purpose registers SPRG0, SPRG1, SPRG2, and SPRG3 are used as support registers (in an implementation-dependent manner) in various stages of exception processing. For example, the Mac OS X kernel uses SPRG2 and SPRG3 to save interrupt-time general-purpose registers GPR13 and GPR11, respectively, in the implementation of the low-level exception vectors. Furthermore, it uses SPRG0 to hold a pointer to the per_proc structure.

6.4.2.3. System Linkage Instructions

System Call

When a system call is invoked from user space, GPR0 is loaded with the system call number, and the sc instruction is executed. The effective address of the instruction following the system call instruction is placed in SRR0, certain bit ranges of the MSR are placed into the corresponding bits of SRR1, certain bits of the SRR1 are cleared, and a system call exception is generated. The processor fetches the next instruction from the well-defined effective address of the system call exception handler.

Return from Interrupt

rfid (return-from-interrupt-double-word) is a privileged, context-altering, and context-synchronizing instruction used to continue execution after an interrupt. Upon its execution, among other things, the next instruction is fetched from the address specified by SRR0. rfid's 32-bit counterpart is the rfi instruction.

A context-altering instruction is one that alters the context in which instructions are executed, data is accessed, or data and instruction addresses are interpreted in general. A context-synchronizing instruction is one that ensures that any address translations associated with instructions following it will be discarded if the translations were performed using the old contents of the page table entry (PTE).

6.4.2.4. Machine-Dependent Thread State

We will examine the in-kernel thread data structure [osfmk/kern/thread.h] and related structures in Chapter 7. Each thread contains a machine-dependent state, represented by a machine_thread structure [osfmk/ppc/thread.h].

Figure 6-8 shows a portion of the machine_thread structure. Its fields include the following.

The kernel and user save area pointers (pcb and upcb, respectively) refer to saved kernel-state and user-state contexts. The contents of a save area in xnu are analogous to those of the process control block (PCB) in a traditional BSD kernel.
The current, deferred, and normal facility context structures (curctx, deferctx, and facctx, respectively) encapsulate the contexts for the floating-point and AltiVec facilities. Note that the save area holds only a normal context that does not include floating-point or vector contexts.
The vmmCEntry and vmmControl pointers point to data structures related to the kernel's virtual machine monitor (VMM) facility, which allows a user program to create, manipulate, and run virtual machine (VM) instances. A VMM instance includes a processor state and an address space. The VMM facility and its use are discussed in Section 6.9.
The kernel stack pointer (ksp) either points to the top of the thread's kernel stack or is zero.
The machine_thread structure also contains several data structures related to the kernel's support for the Blue Box, that is, the Classic environment.

Figure 6-8. Structure for a thread's machine-dependent state

// osfmk/kern/thread.h

struct thread {
    ...
    struct machine_thread machine;
    ...
};

// osfmk/ppc/thread.h

struct facility_context {
    savearea_fpu             *FPUsave;  // FP save area
    savearea                 *FPUlevel; // FP context level
    unsigned int              FPUcpu;   // last processor to enable FP
    unsigned int              FPUsync;  // synchronization lock
    savearea_vec             *VMXsave;  // VMX save area
    savearea                 *VMXlevel; // VMX context level
    unsigned int              VMXcpu;   // last processor to enable VMX
    unsigned int              VMXsync;  // synchronization lock
    struct thread_activation *facAct;   // context's activation
};
typedef struct facility_context facility_context;
...
struct machine_thread {
    savearea             *pcb;        // the "normal" save area
    savearea             *upcb;       // the "normal" user save area
    facility_context     *curctx;     // current facility context pointer
    facility_context     *deferctx;   // deferred facility context pointer
    facility_context      facctx;     // "normal" facility context structure
    struct vmmCntrlEntry *vmmCEntry;  // pointer to current emulation context
    struct vmmCntrlTable *vmmControl; // pointer to VMM control table
    ...
    unsigned int          ksp;        // top of stack or zero
    unsigned int          preemption_count;
    struct per_proc_info *PerProc;    // current per-processor data
    ...
};

6.4.2.5. Exception Save Areas

Save areas are fundamental to xnu's exception processing. Important characteristics of the kernel's save area management include the following.

Save areas are stored in pages, with each page logically divided into an integral number of save area slots. Consequently, a save area never spans a page boundary.
The kernel accesses save areas using both virtual and physical addressing. A low-level interrupt vector refers to a save area using its physical address, as exceptions (including PTE misses) must not occur at that level. Certain queuing operations during save area management are also performed using physical addresses.
Save areas can be permanent, or they can be dynamically allocated. Permanent save areas are allocated at boot time and are necessary so that interrupts can be taken. The initial save areas are allocated from physical memory. The number of initial save areas is defined in osfmk/ppc/savearea.h, as are other save area management parameters. Eight "back-pocket" save areas are also allocated at boot time for use in emergencies.
Save areas are managed using two global free lists: the save area free list, and the save area free pool. Each processor additionally has a local list. The pool contains entire pages, with each slot within a page being marked free or otherwise. The free list gets its save areas from pool pages. The free list can be grown or shrunk as necessary. An unused save area from the free list is returned to its pool page. If all slots in a pool page are marked free, it is taken off the free pool list and entered into a pending release queue.

We can write a simple program as follows to display some save-area-related sizes used by the kernel.

$ cat savearea_sizes.c
// savearea_sizes.c

#include <stdio.h>
#include <stdlib.h>

#define XNU_KERNEL_PRIVATE
#define __APPLE_API_PRIVATE
#define MACH_KERNEL_PRIVATE

#include <osfmk/ppc/savearea.h>

int
main(void)
{
     printf("size of a save area structure in bytes = %ld\n", sizeof(savearea));
     printf("# of save areas per page               = %ld\n", sac_cnt);
     printf("# of save areas to make at boot time   = %ld\n", InitialSaveAreas);
     printf("# of save areas for an initial target  = %ld\n", InitialSaveTarget);

     exit(0);
}

$ gcc -I /work/xnu -Wall -o savearea_sizes savearea_sizes.c
$ ./savearea_sizes
size of a save area structure in bytes = 640
# of save areas per page               = 6
# of save areas to make at boot time   = 48
# of save areas for an initial target  = 24

Structure declarations for the various save area types are also contained in osfmk/ppc/savearea.h.

// osfmk/ppc/savearea.h

#ifdef MACH_KERNEL_PRIVATE
typedef struct savearea_comm {
    // ... fields common to all save areas
    // ... fields used to manage individual contexts
} savearea_comm;
#endif

#ifdef BSD_KERNEL_PRIVATE
typedef struct savearea_comm {
    unsigned int save_000[24];
} savearea_comm;
#endif

typedef struct savearea {
    savearea_comm save_hdr;

    // general context: exception data, all GPRs, SRR0, SRR1, XER, LR, CTR,
    // DAR, CR, DSISR, VRSAVE, VSCR, FPSCR, Performance Monitoring Counters,
    // MMCR0, MMCR1, MMCR2, and so on
    ...
} savearea;

typedef struct savearea_fpu {
    savearea_comm save_hdr;
    ...
    // floating-point context, that is, all FPRs
} savearea_fpu;

typedef struct savearea_vec {
    savearea_comm save_hdr;
    ...
    save_vrvalid; // valid VRs in saved context
    // vector context, that is, all VRs
} savearea_vec;
...

When a new thread is created, a save area is allocated for it by machine_thread_create() [osfmk/ppc/pcb.c]. The save area is populated with the thread's initial context. Thereafter, a user thread begins life with a taken interrupt; that is, to an observer it looks as though the thread is in the kernel because of an interrupt. It returns to user space through thread_return() [osfmk/ppc/hw_exception.s], retrieving its context from the save area. In the case of kernel threads, machine_stack_attach() [osfmk/ppc/pcb.c] is called to attach a kernel stack to a thread and initialize its state, including the address where the thread will continue execution.

// osfmk/ppc/pcb.c

kern_return_t
machine_thread_create(thread_t thread, task_t task)
{
    savearea *sv;                   // pointer to newly allocated save area
    ...
    sv = save_alloc();              // allocate a save area

    bzero((char *)((unsigned int)sv // clear the save area
          + sizeof(savearea_comm)),
          (sizeof(savearea) - sizeof(savearea_comm)));

    sv->save_hdr.save_prev = 0;     // clear the back pointer
    ...
    sv->save_hdr.save_act = thread; // set who owns it

    thread->machine.pcb = sv;       // point to the save area

    // initialize facility context
    thread->machine.curctx = &thread->machine.facctx;

    // initialize facility context pointer to activation
    thread->machine.facctx.facAct = thread;
    ...
    thread->machine.upcb = sv;      // set user pcb
    ...
    sv->save_fpscr = 0;             // clear all floating-point exceptions
    sv->save_vrsave = 0;            // set the vector save state
    ...
    return KERN_SUCCESS;
}

What's in a Context?

When a thread executes, its execution environment is described by a context, which in turn relates to the thread's memory state and its execution state. The memory state refers to the thread's address space, as defined by the virtual-to-real address mappings that have been set up for it. The execution state's contents depend on whether the thread is running as part of a user task, running as part of the kernel task to perform some kernel operation, or running as part of the kernel task to service an interrupt.^[4]

^[4] All threads in the kernel are created within the kernel task.