6.8. Kernel Support for Debugging, Diagnostics, and Tracing

In this section, we will look at various facilities in the Mac OS X kernel for both kernel-level and application-level debugging, diagnostics, and tracing. Note that we will not discuss how to actually use the kernel debuggers here; we visit that topic in Chapter 10, in the context of creating kernel extensions.

6.8.1. GDB (Network-Based or FireWire-Based Debugging)

The most convenient way to perform kernel-level debugging on Mac OS X is through the GNU debugger, GDB, which supports either a network-based or FireWire-based kernel debugging configuration, both of which require two machines.

The standard Mac OS X kernel includes support for two-machine network-based debugging using GDB.

In the network-based configuration, GDB running on the debugging machine communicates over Ethernet with a stub in the target machine's kernel. This remote debugger protocol is called the Kernel Debugging Protocol (KDP). It uses a variant of TFTP over UDP as the core transfer protocol. The default debugger-side UDP port number is 41139. The following are examples of requests in the KDP protocol:

Connection-oriented requests (KDP_CONNECT, KDP_DISCONNECT)
Requests for obtaining client information (KDP_HOSTINFO, KDP_VERSION, KDP_MAXBYTES)
Requests for obtaining executable image information (KDP_LOAD, KDP_IMAGEPATH)
Requests for accessing memory (KDP_READMEM, KDP_WRITEMEM)
Requests for accessing registers (KDP_READREGS, KDP_WRITEREGS)
Requests for manipulating breakpoints (KDP_BREAKPOINT_SET, KDP_BREAKPOINT_REMOVE)

Each KDP request, and the corresponding reply, has its own packet format. Note that the target-side in-kernel KDP implementation does not use the kernel's networking stack but has its own minimal UDP/IP implementation.

Two fundamental functions used by the KDP implementation are for sending and receiving protocol packets. A network driver that supports kernel debugging must provide polled-mode implementations of these two functions: sendPacket(), the transmit handler, and receivePacket(), the receive handler. These functions are used only when the kernel debugger is active.

Since a network driver must explicitly support KDP, remote debugging is possible only when using network interfaces driven by such drivers. In particular, the AirPort driver does not support KDP. Hence, remote debugging cannot be done over a wireless network.

In the FireWire debugging configuration, KDP is used over a FireWire cable courtesy of a kernel extension (AppleFireWireKDP.kext) on the target machine and a translator program (FireWireKDPProxy) on the debugger machine. The translator routes data between the FireWire connection and UDP port 41139 on the debugger system; that is, it acts as a local proxy for the target machine. GDB still performs network-based debugging, except that it communicates with localhost instead of directly communicating with the shim on the target machine.

6.8.2. KDB (Serial-Line-Based Debugging)

Although GDB is typically sufficient as a kernel debugger, the Mac OS X kernel also supports a built-in kernel debugger called KDB, which is more suitable (and in some cases may be the only option) for debugging low-level kernel components. Since remote debugging with GDB uses network or FireWire hardware, it cannot be used for kernel debugging before the required hardware is operational. For example, debugging the built-in Ethernet hardware used by GDB or low-level hardware interrupt handlers requires the use of the built-in debugger.

KDB's nature and functionality are roughly similar to those of the kdb debugger in BSD variants.

KDB also requires two machines for debugging, although the entire debugger is built into the kernel. You interact with KDB over a serial line, which means that both the target machine and the debugging machine must have serial ports. Whereas the debugging machine can have any type of serial port, including those provided by USB-based or PCI-based serial port adapters, the target machine must have a built-in hardware serial port, typically on the main logic board. Xserve is an example of a system model that has such a real serial port.

Recall the discussion of kprintf() initialization from Chapter 5. Serial output from kprintf() is disabled unless DB_KPRT is set in the debug boot argument. When debug output to a serial port is enabled, kprintf() requires a serial device that can be directly addressed, since scc_putc() performs polled I/O: it directly reads and writes serial chip registers. This allows debugging to work even with interrupts disabled. Figure 6-29 shows an excerpt from the initialization of the kprintf() function.

Figure 6-29. Initialization of the `kprintf()` function

// pexpert/ppc/pe_kprintf.c

void
PE_init_kprintf(boolean_t vm_initialized)
{
    ...
    if ((scc = PE_find_scc())) {        // See if we can find the serial port
        scc = io_map_spec(scc, 0x1000); // Map it in
        initialize_serial((void *)scc); // Start the serial driver
        PE_kputc = serial_putc;
        simple_lock_init(&kprintf_lock, 0);
    } else
        PE_kputc = cnputc;
    ...
}

void
serial_putc(char c)
{
    (void)scc_putc(0, 1, c);
    if (c == '\n')
        (void)scc_putc(0, 1, '\r');
}

void
kprintf(const char *fmt, ...)
{
    ...
    if (!disableSerialOutput) {
        va_start(listp, fmt);
        _doprnt(fmt, &listp, PE_kputc, 16);
        va_end(listp);
    }
    ...
}

Moreover, unlike GDB-based remote debugging, which works with the default Mac OS X kernel, using KDB requires building a custom kernel with the DEBUG configuration (see Section 6.10 for a discussion of kernel compilation).

6.8.3. CHUD Support

The Computer Hardware Understanding Development (CHUD) Tools software is a suite of graphical and command-line programs for measuring and optimizing software performance on Mac OS X.^[13] It is also used for benchmarking and analyzing various aspects of system hardware. Besides user-space programs, the CHUD Tools suite employs kernel extensions (CHUDProf.kext and CHUDUtils.kext), which export a variety of functions to the user space through I/O Kit user clients.^[14] Finally, the kernel implements several functions and callback hooks for use by CHUD software. The bsd/dev/ppc/chud/ and osfmk/ppc/chud/ directories contain the implementations of these functions and hooks. Many of the functions are analogous to functions in the Mach API. The CHUD-related functions and hooks in the kernel can be categorized as described in the following subsections.

^[13] We enumerated the individual programs in the CHUD Tools suite in Chapter 2.

^[14] An I/O Kit user client is an in-kernel object that allows a user-space application to communicate with a device that the user client represents. We will look at user clients in Chapter 10.

6.8.3.1. Task-Related Functions

Examples include chudxnu_current_task(), chudxnu_task_read(), chudxnu_task_write(), chudxnu_pid_for_task(), chudxnu_task_for_pid(), chudxnu_current_pid(), and chudxnu_is_64bit_task().

6.8.3.2. Thread-Related Functions

Examples include chudxnu_bind_thread(), chudxnu_unbind_thread(), chudxnu_thread_get_state(), chudxnu_thread_set_state(), chudxnu_thread_user_state_available(), chudxnu_thread_get_callstack(), chudxnu_thread_get_callstack64(), chudxnu_current_thread(), chudxnu_task_for_thread(), chudxnu_all_threads(), chudxnu_thread_info(), and chudxnu_thread_last_context_switch().

6.8.3.3. Memory-Related Functions

Examples include chudxnu_avail_memory_size(), chudxnu_phys_memory_size(), chudxnu_io_map(), and chudxnu_phys_addr_wimg().

6.8.3.4. CPU-Related Functions

These include functions for the following operations:

Retrieving the numbers of available and physical CPUs
Retrieving the current CPU's index
Starting and stopping CPUs
Enabling, disabling, and querying napping on a CPU
Enabling and disabling interrupts
Checking whether the current CPU is running in an interrupt context (determined by checking whether the pointer to the interrupt stack is NULL or not)
Generating a fake I/O interrupt
Reading and writing special-purpose registers
Flushing and enabling CPU caches
Acquiring and releasing the performance-monitoring facility
Performing SCOM facility reads and writes
Retrieving a pointer to and the size of the branch-tracing buffer
Retrieving and clearing interrupt counters

6.8.3.5. Callback-Related Functions

The kernel supports registration of CHUD callbacks in several kernel subsystems, such as those listed here.

Per-CPU timer callback: The CPU's per_proc_info structure's pp_chud field, which is a pointer to a chudcpu_data_t structure, is used to hold timer-related data structures for this callback.
System-wide trap callback: The perfTrapHook function pointer points to the callback. It is invoked from trap() [osfmk/ppc/trap.c] during trap processing.
System-wide interrupt callback: The perfIntHook function pointer points to the callback. It is invoked from interrupt() [osfmk/ppc/interrupt.c] during interrupt processing.
System-wide AST callback: The perfASTHook function pointer points to the callback. It is invoked from trap() [osfmk/ppc/trap.c] during trap processing.
System-wide CPU signal callback: The perfCpuSigHook function pointer points to the callback. It is invoked from cpu_signal_handler() [osfmk/ppc/cpu.c] during the processing of inter-CPU signals.
System-wide kdebug callback: The kdebug_chudhook function pointer points to the callback. It is invoked from kernel_debug() [bsd/kern/kdebug.c] and its variants, which are used (as part of KERNEL_DEBUG macros) throughout the kernel for fine-grained tracing of kernel events.
System-wide system call callback: The PowerPC-only system call 0x6009 becomes alive; calling it from user space invokes the callback in the kernel.
Timer callbacks: Multiple timers can be allocated, and thread-based callouts can be established to run the CHUD timer callbacks.

Figure 6-30 shows how the kernel invokes CHUD system-wide hooks for traps and ASTs.

Figure 6-30. Invocation of CHUD system-wide hooks for traps and ASTs

// osfmk/ppc/trap.c

struct savearea *
trap(int trapno, struct savearea *ssp, unsigned int dsisr, addr64_t dar)
{
    ...
    ast_t *myast;
    ...
    myast = ast_pending();
    if (perfASTHook) {
        if (*myast & AST_PPC_CHUD_ALL) {
            perfASTHook(trapno, ssp, dsisr, (unsigned int)dar);
        }
    } else {
        *myast &= ~AST_PPC_CHUD_ALL;
    }

    if (perfTrapHook) {
        if (perfTrapHook(trapno, ssp, dsisr, (unsigned int)dar) ==
            KERN_SUCCESS)
            return ssp; // if it succeeds, we are done...
    }
    ...
}

Let us look at the operation of the CHUD system call hook. We noted in Table 6-12 that the PowerPC-only system call 0x6009 (CHUDCall) is disabled by default. Figure 6-31 shows that invoking a disabled PowerPC-only system call results in an erroneous (nonzero) return.

Figure 6-31. Invoking a disabled PowerPC-only system call

// CHUDCall.c

#include <stdio.h>

int
CHUDCall(void)
{
    int ret;

    __asm__ volatile(
        "li r0,0x6009\n"
        "sc\n"
        "mr %0,r3\n"
        : "=r" (ret) // output
        :            // no input
    );

    return ret;
}

int
main(void)
{
    int ret = CHUDCall();

    printf("%d\n", ret);

    return ret;
}

$ gcc -Wall -o CHUDCall CHUDCall.c
$ ./CHUDCall
1

Let us now see what is involved in registering a CHUD system call callback with the kernel. If successfully registered, the callback will be invoked in the kernel when the 0x6009 system call is executed. We can carry out such an experiment by using the following steps.

Create a trivial loadable kernel extension that has only the start and stop entry points. We will discuss kernel extensions in Chapter 10. You can create a trivial kernel extension using the Generic Kernel Extension template in Xcode.
Implement a wrapper function that first checks whether the callback function pointer is non-NULL, and if so, it copies the save area that the 0x6009 system call receives as an argument to a thread state structure. It then invokes the callback with a pointer to the thread state structure as an argument.
Implement a function that copies information from a save area to a thread state structure.
Implement a function that sets PPCcalls[9] (the table entry for system call 0x6009) to point to the wrapper. Call this function from the kernel extension's start routine.
Implement a function that disables the 0x6009 system call by setting PPCcalls[9] to NULL. Call this function from the kernel extension's stop routine.

Figure 6-32 shows most of the code for implementing these steps. Note that the code, including the portion that is not shown, is largely identical to the code in osfmk/ppc/chud/chud_osfmk_callback.c. To create a working kernel extension for this experiment, you need to provide the missing code.

Figure 6-32. Implementing a kernel extension to register a PowerPC-only system call

// CHUDSyscallExtension.c

#include <sys/systm.h>
#include <mach/mach_types.h>

#define XNU_KERNEL_PRIVATE
#define __APPLE_API_PRIVATE
#define MACH_KERNEL_PRIVATE

// Either include the appropriate headers or provide structure declarations
// for the following:
//
// struct savearea
// struct ppc_thread_state
// struct ppc_thread_state64

// PowerPC-only system call table (from osfmk/ppc/PPCcalls.h)
typedef int (* PPCcallEnt)(struct savearea *save);
extern PPCcallEnt PPCcalls[];

// The callback function's prototype
typedef kern_return_t (* ppc_syscall_callback_func_t) \
                      (thread_flavor_t flavor, thread_state_t tstate, \
                       mach_msg_type_number_t count);

// Pointer for referring to the incoming callback function
static ppc_syscall_callback_func_t callback_func = NULL;

// Identical to chudxnu_copy_savearea_to_threadstate(), which is implemented
// in osfmk/ppc/chud/chud_osfmk_callback.c
kern_return_t
ppc_copy_savearea_to_threadstate(thread_flavor_t         flavor,
                                 thread_state_t          tstate,
                                 mach_msg_type_number_t *count,
                                 struct savearea        *sv)
{
    ...
}

// PPCcalls[9] will point to this when a callback is registered
kern_return_t
callback_wrapper(struct savearea *ssp)
{
    if (ssp) {
        if (callback_func) {
            struct my_ppc_thread_state64 state;
            mach_msg_type_number_t       count = PPC_THREAD_STATE64_COUNT;
            ppc_copy_savearea_to_threadstate(PPC_THREAD_STATE64,
                                             (thread_state_t)&state,
                                             &count, ssp);
            ssp->save_r3 = (callback_func)(PPC_THREAD_STATE64,
                                           (thread_state_t)&state, count);
        } else {
            ssp->save_r3 = KERN_FAILURE;
        }
    }

    return 1; // Check for ASTs
}

// Example callback function
kern_return_t
callback_func_example(thread_flavor_t        flavor,
                      thread_state_t         tstate,
                      mach_msg_type_number_t count)
{
    printf("Hello, CHUD!\n");
    return KERN_SUCCESS;
}

// Callback registration
kern_return_t
ppc_syscall_callback_enter(ppc_syscall_callback_func_t func)
{
    callback_func = func;
    PPCcalls[9] = callback_wrapper;
    __asm__ volatile("eieio");
    __asm__ volatile("sync");
    return KERN_SUCCESS;
}

// Callback cancellation
kern_return_t
ppc_syscall_callback_cancel(void)
{
    callback_func = NULL;
    PPCcalls[9] = NULL;
    __asm__ volatile("eieio");
    __asm__ volatile("sync");
    return KERN_SUCCESS;
}

kern_return_t
PPCSysCallKEXT_start(kmod_info_t *ki, void *d)
{
    ppc_syscall_callback_enter(callback_func_example);
    printf("PPCSysCallKEXT_start\n");
    return KERN_SUCCESS;
}

kern_return_t
PPCSysCallKEXT_stop(kmod_info_t *ki, void *d)
{
    ppc_syscall_callback_cancel();
    printf("PPCSysCallKEXT_stop\n");
    return KERN_SUCCESS;
}

If you run the program from Figure 6-31 after the kernel extension shown in Figure 6-32 is loaded, you should get a zero return from the system call, and the "Hello, CHUD!" message should appear in the system log.

6.8.4. Kernel Profiling (`kgmon` and `gprof`)

The Mac OS X kernel can be compiled with support for profiling its own code. Such a compilation is achieved by selecting the PROFILE configuration before initiating a kernel build. Doing so enables several aspects of the kernel profiling mechanism, such as the following.

The kernel is compiled with the -pg GCC option, thereby generating extra code to write profiling information for subsequent analysis.
BSD initialization during system startup calls kmstartup() [bsd/kern/subr_prof.c] to initialize profiling data structures that reside in a global gmonparam structure [bsd/sys/gmon.h] called _gmonparam. Moreover, kmstartup() allocates kernel memory for holding profile data.
When running on the master processor, the kernel invokes bsd_hardclock() [bsd/kern/kern_clock.c] every time hertz_tick() [osfmk/kern/mach_clock.c] is called from the real-time clock interrupt handler. When kernel profiling is enabled, bsd_hardclock() updates the information in _gmonparam.
When kernel profiling is enabled, the code for servicing the KERN_PROF sysctl is included in kern_sysctl() [bsd/kern/kern_sysctl.c], the dispatcher for kernel-related sysctl calls.

The kgmon command-line program is used to enable or disable profiling, reset in-kernel profile buffers, and dump the contents of the profile buffers to a gmon.out file. kgmon communicates with the kernel primarily using management information base (MIB) names of the format CTL_KERN → KERN_PROF → <terminal name>, where <terminal name> can be one of the following:

GPROF_STATE: enable or disable profiling
GPROF_COUNT: retrieve buffer containing profile tick counts
GPROF_FROMS: retrieve buffer containing "from" hash buckets
GPROF_TOS: retrieve buffer containing "to" (destination) structures
GPROF_GMONPARAM: retrieve the state of kernel profiling

Once profiling data is dumped to the gmon.out file, the standard gprof command-line program can be used to display the execution profile. Figure 6-33 shows an example of using kgmon and gprof.

Figure 6-33. Using `kgmon` and `gprof` for kernel profiling

$ uname -v      # This kernel was compiled with profiling support
Darwin Kernel Version.../BUILD/obj/PROFILE_PPC
$ kgmon         # Profiling should be disabled to begin with
kgmon: kernel profiling is off.
$ sudo kgmon -b # Resume the collection of profile data
kgmon: kernel profiling is running.
...             # Wait for the data of interest
$ sudo kgmon -h # Stop the collection of profile data
kgmon: kernel profiling is off.
$ ls            # No output files yet
$ kgmon -p      # Dump the contents of the profile buffers
kgmon: kernel profiling is off.
$ ls            # We should have a gmon.out file now
gmon.out
$ gprof /mach_kernel.profile gmon.out
...
granularity: each sample hit covers 4 byte(s) for 0.03% of 34.23 seconds

                                  called/total       parents
index  %time    self descendents  called+self    name           index
                                  called/total       children

                                                     <spontaneous>
[1]     98.8   33.81        0.00                 _machine_idle_ret [1]
-----------------------------------------------
                                                     <spontaneous>
[2]      0.6    0.22        0.00                 _ml_set_interrupts_enabled [2]
...
      0.00        0.00       6/117         _thread_setstatus [818]
      0.00        0.00       6/6           _thread_userstack [1392]
      0.00        0.00       6/6           _thread_entrypoint [1388]
      0.00        0.00       3/203         _current_map [725]
      0.00        0.00       3/3           _swap_task_map [1516]
      0.00        0.00       3/3037        _pmap_switch [436]
...
Index by function name

[1149] _BTFlushPath        [257] _fdesc_readdir     [1029] _psignal_lock
...
 [782] __ZN18IOMemoryDescr [697] _ipc_kobject_destro[1516] _swap_task_map
...
  [27] _devfs_make_link    [436] _pmap_switch       [1213] _wait_queue_member
...
$

6.8.4.1. Per-Process Profiling (`profil(2)`)

The xnu kernel implements the profil() system call that allows a user process to gather its own CPU-usage statistics by profiling the program counter.

int profil(char *samples, size_t size, u_long offset, u_int scale)

samples is a buffer whose length is size bytes. It is divided into sequential bins of 16 bits each. offset specifies the lowest program counter (PC) value at which the kernel will sample the PC upon every clock tick; it is the starting address of the program region to be sampled. For each sampled PC, the kernel increments the value in the bin whose number is computed based on the scale argument. To compute the bin number, that is, the index in the samples array, the kernel subtracts offset from the sampled PC and multiplies the result by scale. If the resultant index is within the bounds of the samples array, the corresponding bin's value is incremented; otherwise, profiling is turned off.

scale is passed as an unsigned integer but represents 16 bits' worth of fraction, with the value 1 being in the middle. Therefore, a scale value of 0x10000 results in a one-to-one mapping from PC value to bins, whereas a lower value results in a many-to-one mapping, because multiplying by a fraction smaller than 1 compresses the PC range. A scale value of 0 or 1 turns off profiling.

// bsd/kern/subr_prof.c

#define PC_TO_INDEX(pc, prof) \
        ((int)(((u_quad_t)((pc) - (prof)->pr_off) * \
                        (u_quad_t)((prof)->pr_scale)) >> 16) & ~1)

Enabling profiling sets the P_PROFIL bit in the p_flag field of the BSD process structure. Moreover, the pr_scale field of the profiling substructure within the process structure is set to the scale value. Thereafter, as shown in Figure 6-34, every time there is a BSD-level clock tick (100 times a second), bsd_hardclock() checks whether the process is running in user mode and has a nonzero scale value. If so, it sets a flag (P_OWEUPC) in the process, indicating that a call to addupc_task() [bsd/kern/subr_prof.c] is owed to the task the next time there is an AST. It then calls astbsd_on() to generate an AST_BSD, which is serviced by bsd_ast(), which in turn calls addupc_task(). The latter updates the profiling buffer, turning off profiling in the case of an error.

Figure 6-34. Implementation of the `profil()` system call

// bsd/kern/kern_clock.c

void
bsd_hardclock(boolean_t usermode, caddr_t pc, int numticks)
{
    register struct proc *p;
    ...
    p = (struct proc *)current_proc();
    ...
        if (usermode) {
            if (p->p_stats && p->p_stats->p_prof.pr_scale) {
                // Owe process an addupc_task() call at next AST
                p->p_flag |= P_OWEUPC;
                astbsd_on();
            }
            ...
        }
    ...
}

// bsd/kern/kern_sig.c

// called when there is an AST_BSD
void
bsd_ast(thread_t thr_act)
{
    ...
    if ((p->p_flag & P_OWEUPC) && (p->p_flag & P_PROFIL)) {
        pc = get_useraddr();
        addupc_task(p, pc, 1);
        p->p_flag &= ~P_OWEUPC;
    }
    ...
}

// bsd/kern/subr_prof.c

void
addupc_task(register struct proc *p, user_addr_t pc, u_int ticks)
{
    ...
    // 64-bit or 32-bit profiling statistics collection
    if (/* 64-bit process */) {
        // calculate offset in profile buffer using PC_TO_INDEX()
        // if target location lies within the buffer, copyin() existing
        // count value from that location into the kernel

        // increment the count by ticks

        // copyout() the updated information to user buffer

        // if there is any error, turn off profiling
    } else {
        // do 32-bit counterpart
    }
}

Another system call, add_profil(), can be used to profile multiple, noncontiguous areas of program memory. Multiple calls to add_profil() can follow a single call to profil(). Note that calling profil() removes any buffers that were allocated by one or more previous calls to add_profil().

6.8.4.2. Mach Task and Thread Sampling

Mach 3 provides calls for sampling tasks and threads. The mach_sample_thread() call periodically samples the program counter of the specified thread, saves the sampled values in buffers, and sends the buffers to the specified reply port when the buffers become full. The mach_sample_task() call performs a similar function, but for all threads of a given task. The sampled values are not tagged by thread, which means that samples for various threads will typically be intermingled.

The Mac OS X kernel includes code for Mach-based task and thread sampling. The code is conditionally compiled: it is included only if MACH_PROF is defined during kernel compilation. However, note that the code is not functional.

kern_return_t
task_sample(task_t sample_task, mach_port_make_send_t reply_port);

kern_return_t
thread_sample(thread_act_t sample_thread, mach_port_make_send_t reply_port);

kern_return_t
receive_samples(mach_port_t sample_port, sample_array_t samples,
                mach_msg_type_number_t sample_count);

The receive_samples() call is used to receive a message containing sampled values. It is invoked by prof_server(), a MIG-generated library function that simplifies the work involved in processing the incoming IPC message.^[15] We will discuss this style of IPC in Chapter 9 when we look at Mach exception handling.

^[15] The prof_server() function does not exist in the default Mac OS X system library.

The osfmk/mach/prof.defs file contains the MIG definitions for this profiling interface.

6.8.5. Per-Process Kernel Tracing (`ktrace(2)` and `kdump`)

The ktrace() system call can be used on Mac OS X to enable or disable tracing of selected operations in one or more processes.

int
ktrace(const char *tracefile, // pathname of file in which to save trace records
       int         ops,       // ktrace operation
       int         trpoints,  // trace points of interest (what to trace)
       int         pid);      // primary process of interest
                              // a negative pid specifies a process group

The ops argument to ktrace() can be one of the following:

KTROP_SET: enable trace points specified in the trpoints argument
KTROP_CLEAR: disable trace points specified in trpoints
KTROP_CLEARFILE: stop all tracing
KTRFLAG_DESCEND: also apply tracing change to all children of the process specified by the pid argument

The ktrace command uses the ktrace() system call, allowing trace data to be logged to a specified file (ktrace.out by default). The kdump command displays the data in human-readable format. Only the superuser may trace setuid and setgid processes, or another user's processes.

Categories of operations to trace (trace points) are specified by logically OR'ing respective bits and passing the resultant value as the trpoints argument. Each selection causes the corresponding types of events to be generated at one or more locations in the kernel. The following bit values can be used to specify operation types when calling ktrace().

KTRFAC_SYSCALL traces BSD system calls. When this bit is set, ktrsyscall() is called by unix_syscall() before the system call handler is invoked. ktrsyscall() writes a "system call" trace record (struct ktr_syscall).
KTRFAC_SYSRET traces return from BSD system calls. When this bit is set, ktrsysret() is called by unix_syscall() after the system call handler has returned and return values have been processed. ktrsysret() writes a "return from system call" trace record (struct ktr_sysret).
KTRFAC_NAMEI traces name lookup operations. When this bit is set, ktrnamei() is called by namei() [bsd/vfs/vfs_lookup], sem_open() [bsd/kern/posix_sem.c], and shm_open() [bsd/kern/posix_shm.c]. ktrnamei() writes a string (the relevant pathname) as the trace data.
KTRFAC_GENIO traces a variety of I/O operations. When this bit is set, ktrgenio() is called by recvit() and sendit() [both in bsd/kern/uipc_syscalls.c], and dofileread(), dofilewrite(), rd_uio(), and wr_uio() [all in bsd/kern/sys_generic.c]. ktrgenio() writes a "generic process I/O" trace record (struct ktr_genio).
KTRFAC_PSIG traces posted signals. When this bit is set, ktrpsig() is called by postsig() [bsd/kern/kern_sig.c]. ktrpsig() writes a "processed signal" trace record (struct ktr_psig).
KTRFAC_CSW traces context switches. When this bit is set, ktrcsw() is called by _sleep_continue() and _sleep() [both in bsd/kern/kern_synch.c]. ktrcsw() writes a "context switch" trace record (struct ktr_csw).

All ktrace event-logging functions are implemented in bsd/kern/kern_ktrace.c.

The p_traceflag field of the process structure holds the operation-related bits. This field also contains other relevant flags, such as the following.

KTRFAC_ACTIVE specifies that ktrace logging is in progress.
KTRFAC_ROOT specifies that the tracing status for that process was previously established by the superuser, and only the superuser can now change it further.
KTRFAC_INHERIT specifies that the child process arising out of a fork() system call will inherit the parent's p_traceflag field.

Note that a record resulting from some type of tracing event consists of a generic header (struct ktr_header) followed by an event-specific structure.

// bsd/sys/ktrace.h

struct ktr_header {
    int     ktr_len;               // length of buffer that follows this header
    short   ktr_type;              // trace record type
    pid_t   ktr_pid;               // process ID generating the record
    char    ktr_comm[MAXCOMLEN+1]; // command name generating the record
    struct  timeval ktr_time;      // record generation timestamp (microsecond)
    caddr_t ktr_buf;               // buffer
};

Event-specific structures can vary in length even for a given event type. For example, the ktr_syscall structure contains a system call number, the number of arguments passed to that system call, and an array of 64-bit unsigned integers containing the arguments.

struct ktr_syscall {
    short     ktr_code;    // system call number
    short     ktr_narg;    // number of arguments
    u_int64_t ktr_args[1]; // a 64-bit "GPR" for each argument
};

6.8.6. Auditing Support

We briefly looked at the user-space aspects of the auditing system in Chapter 2. The Mac OS X kernel supports auditing of system events using the Basic Security Module (BSM) format for audit records. Figure 6-35 shows the key interactions between the user and kernel components of the auditing system.

Figure 6-35. The Mac OS X auditing system

During the initialization of the kernel's BSD portion, audit_init() [bsd/kern/kern_audit.c] is called to initialize the auditing system. Besides allocating the relevant data structures and initializing various parameters, audit_init() also calls kau_init() [bsd/kern/kern_bsm_audit.c] to initialize the BSM audit subsystem. In particular, kau_init() sets up the initial event-to-class mappings for BSD system calls, Mach system calls, and several types of open events.

Initially, BSD and Mach system call events map to the null audit class (AU_NULL). Note that at this point, auditing is initialized but not started in the kernel. When the user-space audit daemon (auditd) starts up, it establishes a log file to which the kernel will write audit records; auditing in the kernel begins as a side effect of this operation. Let us first look at auditd's operation while referring to Figure 6-35.

auditd is a simple daemon whose responsibilities are limited to managing audit log files and enabling or disabling auditing. It is not involved in the actual writing of the audit records to disk.^[16] The following are auditd's primary initialization steps.

^[16] In this sense, the audit daemon can be likened to the dynamic_pager program, which manages swap files. We will see the operation of dynamic_pager in Chapter 8.

It installs a signal handler for the SIGTERM and SIGCHLD signals. The handler sends a Mach IPC message to auditd's main server loop. The reason these signals are not handled in the signal handler itself is that auditd may not be at a clean point while it is in the signal handler.
It allocates a Mach port for the aforementioned signal "reflection," and another (the audit control port) for communication with user programs (such as the audit utility) and the kernel.
It registers the control port with the kernel as a special host-level port by calling host_set_audit_control_port(). Thereafter, clients of auditd can communicate with it by retrieving the control port via host_get_audit_control_port(). Note that it places both ports in a single port set,^[17] allowing it to later wait for incoming messages on either port by using the port set alone.
^[17] We will discuss the port set concept in Chapter 9.
It reads the audit control file (/etc/security/audit_control), which contains system-wide audit parameters. Once it has determined the pathname of the directory for storing audit log files (/var/audit/ by default), auditd generates the pathname for the audit log file itself and invokes the auditctl() system call to convey the pathname to the kernel. auditctl()'s kernel implementation creates a kernel thread running audit_worker() [bsd/kern/kern_audit.c], unless the thread already exists,^[18] in which case it is woken up.
^[18] If you enable, disable, and reenable auditing on a system, the audit_worker kernel thread will exist during the reenabling and, therefore, will be woken up.

An open() system call is classified as one of several types of open events, depending on the flags passed to it. For example, invocations of open() with the flags argument being O_RDONLY, O_WRONLY, and O_RDWR correspond to the open events AUE_OPEN_R, AUE_OPEN_W, and AUE_OPEN_RW, respectively.
It generates an "audit startup" audit record indicating its own startup. Note that this event is "artificial" in that it is programmatically generated by auditd. It uses functions in the BSM library (libbsm.dylib) for this purpose, which in turn use the audit() system call, a function that allows user-space applications to explicitly submit audit records to the kernel for inclusion in the audit log.
It parses the audit event file (/etc/security/audit_event), which contains descriptions of auditable events on the system, mapping an audit event number to a name (such as AUE_OPEN), a description (such as "open(2) - attr only"), and a class name (such as "fa"). The classes are described in /etc/security/audit_class; information in this file must conform with the audit class information known to the kernel. For each event line in audit_event, auditd registers the class mapping by invoking the A_SETCLASS command through the auditon() system call.
It finally enters its server loop, waiting for messages to arrive on either of the control or signal ports.
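The flag-dependent classification of open events mentioned above can be sketched in user space. This is only an illustration: the enumerators below are placeholders rather than the kernel's numeric AUE_OPEN_* values, and open_flags_to_event() is a hypothetical helper, not a kernel function.

```c
#include <fcntl.h>

/* Placeholder event identifiers; the kernel's real numeric values differ. */
enum open_event {
    EV_OPEN_R,   /* stands in for AUE_OPEN_R  */
    EV_OPEN_W,   /* stands in for AUE_OPEN_W  */
    EV_OPEN_RW   /* stands in for AUE_OPEN_RW */
};

/* Hypothetical helper: classify an open() call by its access mode. */
static enum open_event open_flags_to_event(int flags)
{
    switch (flags & O_ACCMODE) {
    case O_WRONLY: return EV_OPEN_W;
    case O_RDWR:   return EV_OPEN_RW;
    default:       return EV_OPEN_R;   /* O_RDONLY */
    }
}
```

Only the access-mode bits are examined here; other flags, such as O_CREAT, do not change the result in this simplified model.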

The in-kernel audit record structure (struct kaudit_record) for a thread resides in the thread's corresponding uthread structure (struct uthread).

// bsd/sys/user.h
struct uthread {
    ...
    struct kaudit_record *uu_ar;
    ...
};

// bsd/bsm/audit_kernel.h
struct kaudit_record {
    struct audit_record         k_ar;        // standard audit record
    u_int32_t                   k_ar_commit; // AR_COMMIT_KERNEL, AR_COMMIT_USER
    void                       *k_udata;     // opaque user data, if any
    u_int                       k_ulen;      // user data length
    struct uthread             *k_uthread;   // thread that we are auditing
    TAILQ_ENTRY(kaudit_record)  k_q;         // queue metadata
};

The kernel allocates and initializes a new audit record for a thread by calling audit_new() [bsd/kern/kern_audit.c] when the first auditable event occurs. This is typically when either a BSD or Mach system call is invoked. However, this can also occur during the audit() system call, which, as we saw earlier, is used by a user program to submit an audit record. Since the audit() system call itself is not audited, it is possible for a uthread structure's uu_ar field to be NULL if no auditable event has occurred so far for that thread. Various portions of the kernel code use audit macros to add information to a thread's existing audit record. These macros resolve to conditional code that operates only if auditing is currently enabled. Figure 6-36 shows an example of audit macros.

Figure 6-36. Audit macros in the kernel and how they are used

// bsd/bsm/audit_kernel.h
#define AUDIT_ARG(op, args...)  do { \
    if (audit_enabled)               \
        audit_arg_ ## op (args);     \
} while (0)

#define AUDIT_SYSCALL_ENTER(args...) do { \
    if (audit_enabled) {                  \
        audit_syscall_enter(args);        \
    }                                     \
} while (0)

// Additional check for uu_ar since it is possible that an audit record
// was begun before auditing was disabled
#define AUDIT_SYSCALL_EXIT(error, proc, uthread) do { \
    if (audit_enabled || (uthread->uu_ar != NULL)) {  \
        audit_syscall_exit(error, proc, uthread);     \
    }                                                 \
} while (0)

// bsd/dev/ppc/systemcalls.c
void
unix_syscall(struct savearea *regs)
{
    ...
    AUDIT_SYSCALL_ENTER(code, proc, uthread);
    // call the system call handler
    error = (*(callp->sy_call))(proc, (void *)uthread->uu_arg,
                                &(uthread->uu_rval[0]));
    AUDIT_SYSCALL_EXIT(error, proc, uthread);
    ...
}

// bsd/vfs/vfs_syscalls.c
static int
open1(...)
{
    ...
    AUDIT_ARG(fflags, oflags);
    AUDIT_ARG(mode, vap->va_mode);
    ...
}

When audit_syscall_enter() [bsd/kern/kern_audit.c] is called, the current uthread structure's uu_ar field will be NULL. If the event corresponding to the current system call and its arguments is auditable, audit_syscall_enter() allocates an audit record and sets uu_ar to point to it. As long as the system call remains in the kernel, any audit_arg_xxx() functions that may be called by kernel code will append information to the thread's audit record. When the system call finishes, audit_syscall_exit() [bsd/kern/kern_audit.c] commits the record by calling audit_commit() [bsd/kern/kern_audit.c] and sets the uu_ar field to NULL. audit_commit() inserts the record in a queue of audit records that are ready to be written to disk. The queue is serviced by the audit worker thread, which, as we saw earlier, is created when auditd first specifies a log file pathname to the kernel. The audit worker thread writes audit records to the log file by calling audit_write() [bsd/kern/kern_audit.c], which directly writes to the log file's vnode (the audit_vp variable) through the vn_rdwr() kernel function. Auditing is disabled by passing a NULL log file pathname, which results in the audit_vp variable being set to NULL. The worker thread checks for a valid audit_vp in each iteration of its loop; if audit_vp is NULL, it sets audit_enabled to 0, causing the various audit logger functions to be ineffective.
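The record life cycle just described (allocate on entry, append while in the kernel, commit and clear on exit) can be modeled with a small user-space sketch. Everything here is a simplified stand-in: the sim_ names are hypothetical, the record holds only an event number and an argument count, and "committing" merely increments a counter instead of queueing the record for the worker thread.

```c
#include <stddef.h>
#include <stdlib.h>

/* Simplified stand-ins for the kernel structures; all names are hypothetical. */
struct sim_record  { int event; int nargs; };
struct sim_uthread { struct sim_record *uu_ar; };

static int sim_audit_enabled = 1;   /* models audit_enabled */
static int sim_committed     = 0;   /* records handed to the "worker" */

/* Models audit_syscall_enter(): allocate a record for an auditable event. */
static void sim_syscall_enter(struct sim_uthread *ut, int event, int auditable)
{
    if (!sim_audit_enabled || !auditable)
        return;
    ut->uu_ar = calloc(1, sizeof(*ut->uu_ar));
    if (ut->uu_ar != NULL)
        ut->uu_ar->event = event;
}

/* Models an audit_arg_xxx() call: append only if a record exists. */
static void sim_audit_arg(struct sim_uthread *ut)
{
    if (sim_audit_enabled && ut->uu_ar != NULL)
        ut->uu_ar->nargs++;
}

/* Models audit_syscall_exit(): commit the record, then clear uu_ar.
 * As in AUDIT_SYSCALL_EXIT, a record begun before auditing was
 * disabled is still handled. */
static void sim_syscall_exit(struct sim_uthread *ut)
{
    if (ut->uu_ar == NULL)
        return;
    sim_committed++;     /* stands in for audit_commit() queueing */
    free(ut->uu_ar);     /* the real worker thread writes the record first */
    ut->uu_ar = NULL;
}
```

The point of the sketch is the asymmetry between entry and exit: entry does nothing once auditing is disabled, whereas exit must still dispose of a record that was started earlier.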

Other responsibilities of the worker thread include sending trigger messages to auditd if the log file becomes full or if the amount of free disk space on the volume containing the log file falls below a configured threshold.

6.8.7. Fine-Grained Kernel Event Tracing (kdebug)

The Mac OS X kernel provides a fine-grained kernel-tracing facility called kdebug, which can be enabled or disabled on a per-process basis. The sc_usage, fs_usage, and latency command-line tools use the kdebug facility. sc_usage displays ongoing system calls and various types of page faults. The output of fs_usage is restricted to system calls and page faults related to file system activity. latency monitors and displays scheduling and interrupt latency statistics. The kdebug facility also allows the CHUD toolkit to register a callback function, kdebug_chudhook(), that will be invoked each time kdebug logging is attempted, regardless of whether kdebug tracing is enabled or not. Finally, the kdebug facility can be enabled to collect entropy and therefore can be used as a source of entropy for random number generation. The Mac OS X Security Server uses the kdebug facility to sample entropy.

Specifically, the kdebug facility can be enabled to operate in modes corresponding to one or more of the following mode bits at any given time: KDEBUG_ENABLE_TRACE, KDEBUG_ENABLE_ENTROPY, and KDEBUG_ENABLE_CHUD. The kdebug_enable global variable in the kernel holds these bits.

6.8.7.1. kdebug Tracing

The kdebug facility categorizes traced operations into classes, subclasses within classes, and codes within subclasses. Moreover, if a traced operation marks the beginning or end of a kernel function, its trace is tagged with the DBG_FUNC_START or DBG_FUNC_END function qualifier, respectively. Nonfunction traces are tagged with DBG_FUNC_NONE. Figure 6-37 shows an excerpt from kdebug's hierarchy of traced operation classification. Each trace record has a 32-bit debug code whose bits represent the operation's class, subclass, code, and function qualifier. The entire hierarchy is defined in bsd/sys/kdebug.h.

Figure 6-37. The composition of a debug code in the kdebug facility
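As a rough illustration of this composition, the following sketch splits a 32-bit debug code into its four fields. The bit positions follow the KDBG_CODE() macro in bsd/sys/kdebug.h (8 bits of class, 8 bits of subclass, 14 bits of code, and 2 low-order qualifier bits); the structure and the decode_debug_code() helper are hypothetical names introduced only for this example.

```c
#include <stdint.h>

/* Fields of a 32-bit kdebug debug code, per the KDBG_CODE() layout. */
struct debug_code_fields {
    unsigned cls;       /* bits 24-31: class */
    unsigned subclass;  /* bits 16-23: subclass */
    unsigned code;      /* bits  2-15: code within the subclass */
    unsigned func;      /* bits  0-1 : function qualifier */
};

/* Hypothetical helper: split a debug code into its fields. */
static struct debug_code_fields decode_debug_code(uint32_t debugid)
{
    struct debug_code_fields f;
    f.cls      = (debugid >> 24) & 0xff;
    f.subclass = (debugid >> 16) & 0xff;
    f.code     = (debugid >> 2)  & 0x3fff;
    f.func     = debugid & 0x3;
    return f;
}
```

For instance, decoding 0x040c0031 yields class 4, subclass 0x0c, code 12, and a qualifier value of 1.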

Code throughout the kernel submits operations for kdebug tracing using macros that resolve to either kernel_debug() or kernel_debug1(), both of which are implemented in bsd/kern/kdebug.c.

// bsd/sys/kdebug.h
#define KERNEL_DEBUG_CONSTANT(x,a,b,c,d,e)  \
do {                                        \
    if (kdebug_enable)                      \
        kernel_debug(x,a,b,c,d,e);          \
} while(0)

#define KERNEL_DEBUG_CONSTANT1(x,a,b,c,d,e) \
do {                                        \
    if (kdebug_enable)                      \
        kernel_debug1(x,a,b,c,d,e);         \
} while(0)
...

kernel_debug1() is used during an execve() operation that follows a vfork() operation. It is a special version of kernel_debug() that receives the thread's identity as an argument instead of calling current_thread(), which kernel_debug() uses. current_thread() cannot be used in this particular case since it will return the parent thread.

Let us see how the BSD system call handler uses these macros. As shown in Figure 6-38, the handler submits trace records soon after starting and just before finishing. Note that it skips record generation if the system call number is 180 (SYS_kdebug_trace), which corresponds to the kdebug_trace() system call. Similar conceptually to the audit() system call, this call can be used by user programs to explicitly submit a trace record to the kdebug facility. As we will shortly see, the BSDDBG_CODE() macro computes a debug code given a subclass and a code within that subclass.

Figure 6-38. kdebug tracing in the BSD system call handler

// bsd/dev/ppc/systemcalls.c
void
unix_syscall(struct savearea *regs)
{
    ...
    unsigned int cancel_enable;

    flavor = (((unsigned int)regs->save_r0) == 0) ? 1 : 0;
    if (flavor)
        code = regs->save_r3;
    else
        code = regs->save_r0;

    if (kdebug_enable && (code != 180)) {
        if (flavor) // indirect system call
            KERNEL_DEBUG_CONSTANT(
                BSDDBG_CODE(DBG_BSD_EXCP_SC, code) | DBG_FUNC_START,
                            regs->save_r4, regs->save_r5,
                            regs->save_r6, regs->save_r7, 0);
        else        // direct system call
            KERNEL_DEBUG_CONSTANT(
                BSDDBG_CODE(DBG_BSD_EXCP_SC, code) | DBG_FUNC_START,
                            regs->save_r3, regs->save_r4, regs->save_r5,
                            regs->save_r6, 0);
    }
    ...
    // call the system call handler
    ...
    if (kdebug_enable && (code != 180)) {
        if (callp->sy_return_type == _SYSCALL_REG_SSIZE_T)
            KERNEL_DEBUG_CONSTANT(
                BSDDBG_CODE(DBG_BSD_EXCP_SC, code) | DBG_FUNC_END,
                            error, uthread->uu_rval[1], 0, 0, 0);
        else
            KERNEL_DEBUG_CONSTANT(
                BSDDBG_CODE(DBG_BSD_EXCP_SC, code) | DBG_FUNC_END,
                            error, uthread->uu_rval[0], uthread->uu_rval[1],
                            0, 0);
    }

    thread_exception_return();
    /* NOTREACHED */
}

The kdebug facility is accessed from user space through KERN_KDEBUG sysctl operations with CTL_KERN as the top-level sysctl identifier. Examples of supported operations include the following:

Enabling or disabling tracing (KERN_KDENABLE)
Cleaning up the relevant trace buffers (KERN_KDREMOVE)
Reinitializing the tracing facility (KERN_KDSETUP)
Specifying the trace buffer size to the kernel (KERN_KDSETBUF)
Specifying which process IDs to trace (KERN_KDPIDTR)
Specifying which process IDs to exclude (KERN_KDPIDEX)
Specifying trace points of interest to the kernel by class, by subclass, by debug code values, or by a range of debug code values (KERN_KDSETREG)
Retrieving trace buffer metainformation from the kernel (KERN_KDGETBUF)
Retrieving the trace buffer from the kernel (KERN_KDREADTR)

Note that process IDs can be either selected for or excluded from tracing. In the former case, the KDBG_PIDCHECK bit is set in the global kdebug-flags kernel variable (kdebug_flags), and the P_KDEBUG bit is set in the p_flag process structure field of each selected process. Thereafter, any process for which P_KDEBUG is not set is not traced. In the case of process ID exclusion, the KDBG_PIDEXCLUDE bit is set instead in kdebug_flags, and the P_KDEBUG bit is set for each excluded process.

As seen in the examples of supported kdebug operations, trace points of interest can be specified to the kernel in several ways: through a kdebug class (such as DBG_BSD), a subclass (such as DBG_BSD_EXCP_SC, which represents BSD system calls), up to four specific debug code values, or a range of such values. Figure 6-37 shows the structure of a debug code. The bsd/sys/kdebug.h header provides macros for constructing a debug code from its constituents.

Let us consider a specific example. Suppose we wish to use the kdebug facility to trace the use of the chdir() system call. The debug code for chdir() will have DBG_BSD as its class, DBG_BSD_EXCP_SC as its subclass, and the system call's number (SYS_chdir) as its code. We can use the BSDDBG_CODE() macro to compute the code.

// bsd/sys/kdebug.h
#define KDBG_CODE(Class, SubClass, code) (((Class & 0xff) << 24) | \
                  ((SubClass & 0xff) << 16) | ((code & 0x3fff) << 2))
...
#define MACHDBG_CODE(SubClass, code) KDBG_CODE(DBG_MACH, SubClass, code)
#define NETDBG_CODE(SubClass, code) KDBG_CODE(DBG_NETWORK, SubClass, code)
#define FSDBG_CODE(SubClass, code) KDBG_CODE(DBG_FSYSTEM, SubClass, code)
#define BSDDBG_CODE(SubClass, code) KDBG_CODE(DBG_BSD, SubClass, code)
...
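To make the arithmetic concrete, the following sketch computes the debug code for chdir() by hand. The constants are copies of the Mac OS X values (DBG_BSD is 4, DBG_BSD_EXCP_SC is 0x0c, SYS_chdir is 12, and the start and end qualifiers are 1 and 2), given SK_ prefixes so that the example builds without the kernel headers; sk_kdbg_code() is a hypothetical function that mirrors the intent of KDBG_CODE().

```c
#include <stdint.h>

/* Local copies of the relevant constants (prefixed to avoid clashes). */
enum {
    SK_DBG_BSD         = 4,     /* class */
    SK_DBG_BSD_EXCP_SC = 0x0c,  /* subclass: BSD system calls */
    SK_SYS_chdir       = 12,    /* code: the system call number */
    SK_DBG_FUNC_START  = 1,     /* function qualifiers */
    SK_DBG_FUNC_END    = 2
};

/* Mirrors KDBG_CODE(): class, subclass, and code packed into 32 bits,
 * leaving the two low-order bits free for the function qualifier. */
static uint32_t sk_kdbg_code(uint32_t cls, uint32_t subclass, uint32_t code)
{
    return ((cls & 0xff) << 24) | ((subclass & 0xff) << 16) |
           ((code & 0x3fff) << 2);
}
```

With these inputs, (4 << 24) | (0x0c << 16) | (12 << 2) evaluates to 0x040c0030, which agrees with the "code 0x40c0030" value printed in the sample output of the kdebug program later in this section. OR-ing in the start qualifier gives 0x040c0031.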

Before we look at a programming example, let us briefly discuss the operation of kernel_debug(), which is the center of kdebug activity in the kernel. It performs the following primary operations.

If the CHUD kdebug hook is registered, it calls the hook.
If entropy is being sampled, it adds an entry to the entropy buffer unless the buffer is full. We will look at entropy sampling in Section 6.8.7.2.
If KDBG_PIDCHECK is set, it returns without adding a trace record if the current process does not have P_KDEBUG set.
If KDBG_PIDEXCLUDE, which is mutually exclusive with KDBG_PIDCHECK, is set, kernel_debug() returns if the current process has P_KDEBUG set (recall that, in exclusion mode, P_KDEBUG marks the excluded processes).
If KDBG_RANGECHECK is set, it checks whether the current debug code falls within the configured range of interesting trace points. If not, kernel_debug() returns.
If KDBG_VALCHECK is set, it compares the debug code (minus the function qualifier bits) with the four specific values, at least one of which must have been configured. If there is no match, kernel_debug() returns.
At this point, kernel_debug() records the trace entry, updates its bookkeeping data structures, and returns.
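The filtering steps in this list can be summarized as a single predicate. The sketch below is a user-space model, not the kernel's code: the flag bits are placeholders rather than the values from bsd/sys/kdebug.h, the range is assumed to be inclusive at both ends, and struct trace_filter and should_record() are hypothetical names. It follows the earlier description in which P_KDEBUG marks selected processes under KDBG_PIDCHECK and excluded processes under KDBG_PIDEXCLUDE.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder flag bits, not the values from bsd/sys/kdebug.h. */
enum {
    F_PIDCHECK   = 1 << 0,  /* models KDBG_PIDCHECK   */
    F_PIDEXCLUDE = 1 << 1,  /* models KDBG_PIDEXCLUDE */
    F_RANGECHECK = 1 << 2,  /* models KDBG_RANGECHECK */
    F_VALCHECK   = 1 << 3   /* models KDBG_VALCHECK   */
};

struct trace_filter {
    unsigned flags;
    uint32_t range_lo, range_hi;  /* used with F_RANGECHECK (inclusive) */
    uint32_t values[4];           /* used with F_VALCHECK */
};

/* Returns true if a trace entry would be recorded. proc_marked models
 * the P_KDEBUG bit of the current process. */
static bool should_record(const struct trace_filter *tf, uint32_t debugid,
                          bool proc_marked)
{
    if ((tf->flags & F_PIDCHECK) && !proc_marked)
        return false;               /* not one of the selected processes */
    if ((tf->flags & F_PIDEXCLUDE) && proc_marked)
        return false;               /* one of the excluded processes */
    if ((tf->flags & F_RANGECHECK) &&
        (debugid < tf->range_lo || debugid > tf->range_hi))
        return false;
    if (tf->flags & F_VALCHECK) {
        uint32_t bare = debugid & ~(uint32_t)0x3;  /* drop qualifier bits */
        bool match = false;
        for (int i = 0; i < 4; i++) {
            if (tf->values[i] == bare)
                match = true;
        }
        if (!match)
            return false;
    }
    return true;
}
```

Because the qualifier bits are masked off before the value comparison, a single configured value such as 0x040c0030 matches both the start and the end records of the corresponding system call.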

Let us now look at an example of using the kdebug facility in a user program. We will use kdebug to trace the chdir() system call. If a process ID is passed as an argument to the program, we will configure kdebug to trace only that process; otherwise, kdebug will trace on a system-wide basis. We will use kdebug's value-checking feature to configure it to trace only one specific debug code: the one corresponding to the chdir() system call. Figure 6-39 shows the program and a sample of its usage. Note that only one program can use the kdebug tracing facility at a time.

Figure 6-39. Using the kdebug facility in a program

// kdebug.c

#define PROGNAME "kdebug"

#include <stdlib.h>
#include <stdio.h>
#include <signal.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/sysctl.h>
#include <sys/ptrace.h>
#include <sys/syscall.h>

struct proc;

// Kernel Debug definitions
#define PRIVATE
#define KERNEL_PRIVATE
#include <sys/kdebug.h>
#undef KERNEL_PRIVATE
#undef PRIVATE

// Configurable parameters
enum {
    KDBG_BSD_SYSTEM_CALL_OF_INTEREST = SYS_chdir,
    KDBG_SAMPLE_SIZE                 = 16384,
    KDBG_SAMPLE_INTERVAL             = 100000, // in microseconds
};

// Useful constants
enum {
    KDBG_FUNC_MASK   = 0xfffffffc, // for extracting function type
    KDBG_CLASS_MASK  = 0xff000000, // for extracting class type
    KDBG_CLASS_SHIFT = 24          // for extracting class type
};

// Global variables
int    exiting = 0; // avoid recursion in exit handlers
size_t oldlen;      // used while calling sysctl()
int    mib[8];      // used while calling sysctl()
pid_t  pid = -1;    // process ID of the traced process

// Global flags
int trace_enabled   = 0;
int set_remove_flag = 1;

// Mapping of kdebug class IDs to class names
const char *KDBG_CLASS_NAMES[256] = {
    NULL,           // 0
    "DBG_MACH",     // 1
    "DBG_NETWORK",  // 2
    "DBG_FSYSTEM",  // 3
    "DBG_BSD",      // 4
    "DBG_IOKIT",    // 5
    "DBG_DRIVERS",  // 6
    "DBG_TRACE",    // 7
    "DBG_DLIL",     // 8
    "DBG_SECURITY", // 9
    NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL,
    "DBG_MISC",     // 20
    NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL,
    "DBG_DYLD",     // 31
    "DBG_QT",       // 32
    "DBG_APPS",     // 33
    NULL,
};

// Functions that we implement (the 'u' in ukdbg represents user space)
void ukdbg_exit_handler(int);
void ukdbg_exit(const char *);
void ukdbg_setenable(int);
void ukdbg_clear(void);
void ukdbg_reinit(void);
void ukdbg_setbuf(int);
void ukdbg_getbuf(kbufinfo_t *);
void ukdbg_setpidcheck(pid_t, int);
void ukdbg_read(char *, size_t *);
void ukdbg_setreg_valcheck(int val1, int val2, int val3, int val4);

void
ukdbg_exit_handler(int s)
{
    exiting = 1;

    if (trace_enabled)
        ukdbg_setenable(0);

    if (pid > 0)
        ukdbg_setpidcheck(pid, 0);

    if (set_remove_flag)
        ukdbg_clear();

    fprintf(stderr, "cleaning up...\n");

    exit(s);
}

void
ukdbg_exit(const char *msg)
{
    if (msg)
        perror(msg);

    ukdbg_exit_handler(0);
}

// Enable or disable trace
// enable=1 enables (trace buffer must already be initialized)
// enable=0 disables
void
ukdbg_setenable(int enable)
{
    mib[0] = CTL_KERN;
    mib[1] = KERN_KDEBUG;
    mib[2] = KERN_KDENABLE;
    mib[3] = enable;
    if ((sysctl(mib, 4, NULL, &oldlen, NULL, 0) < 0) && !exiting)
        ukdbg_exit("ukdbg_setenable::sysctl");

    trace_enabled = enable;
}

// Clean up relevant buffers
void
ukdbg_clear(void)
{
    mib[0] = CTL_KERN;
    mib[1] = KERN_KDEBUG;
    mib[2] = KERN_KDREMOVE;
    if ((sysctl(mib, 3, NULL, &oldlen, NULL, 0) < 0) && !exiting) {
        set_remove_flag = 0;
        ukdbg_exit("ukdbg_clear::sysctl");
    }
}

// Disable any ongoing trace collection and reinitialize the facility
void
ukdbg_reinit(void)
{
    mib[0] = CTL_KERN;
    mib[1] = KERN_KDEBUG;
    mib[2] = KERN_KDSETUP;
    if (sysctl(mib, 3, NULL, &oldlen, NULL, 0) < 0)
        ukdbg_exit("ukdbg_reinit::sysctl");
}

// Set buffer for the desired number of trace entries
// Buffer size is limited to either 25% of physical memory (sane_size),
// or to the maximum mapped address, whichever is smaller
void
ukdbg_setbuf(int nbufs)
{
    mib[0] = CTL_KERN;
    mib[1] = KERN_KDEBUG;
    mib[2] = KERN_KDSETBUF;
    mib[3] = nbufs;
    if (sysctl(mib, 4, NULL, &oldlen, NULL, 0) < 0)
        ukdbg_exit("ukdbg_setbuf::sysctl");
}

// Turn pid check on or off in the trace buffer
// check=1 turns on pid check for this and all pids
// check=0 turns off pid check for this pid (but not all pids)
void
ukdbg_setpidcheck(pid_t pid, int check)
{
    kd_regtype kr;

    kr.type = KDBG_TYPENONE;
    kr.value1 = pid;
    kr.value2 = check;
    oldlen = sizeof(kd_regtype);
    mib[0] = CTL_KERN;
    mib[1] = KERN_KDEBUG;
    mib[2] = KERN_KDPIDTR;
    if ((sysctl(mib, 3, &kr, &oldlen, NULL, 0) < 0) && !exiting)
        ukdbg_exit("ukdbg_setpidcheck::sysctl");
}

// Set specific value checking
void
ukdbg_setreg_valcheck(int val1, int val2, int val3, int val4)
{
    kd_regtype kr;

    kr.type = KDBG_VALCHECK;
    kr.value1 = val1;
    kr.value2 = val2;
    kr.value3 = val3;
    kr.value4 = val4;
    oldlen = sizeof(kd_regtype);
    mib[0] = CTL_KERN;
    mib[1] = KERN_KDEBUG;
    mib[2] = KERN_KDSETREG;
    if (sysctl(mib, 3, &kr, &oldlen, NULL, 0) < 0)
        ukdbg_exit("ukdbg_setreg_valcheck::sysctl");
}

// Retrieve trace buffer information from the kernel
void
ukdbg_getbuf(kbufinfo_t *bufinfop)
{
    oldlen = sizeof(*bufinfop); // size of the structure, not of the pointer
    mib[0] = CTL_KERN;
    mib[1] = KERN_KDEBUG;
    mib[2] = KERN_KDGETBUF;
    if (sysctl(mib, 3, bufinfop, &oldlen, 0, 0) < 0)
        ukdbg_exit("ukdbg_getbuf::sysctl");
}

// Retrieve some of the trace buffer from the kernel
void
ukdbg_read(char *buf, size_t *len)
{
    mib[0] = CTL_KERN;
    mib[1] = KERN_KDEBUG;
    mib[2] = KERN_KDREADTR;
    if (sysctl(mib, 3, buf, len, NULL, 0) < 0)
        ukdbg_exit("ukdbg_read::sysctl");
}

int
main(int argc, char **argv)
{
    int            i, count;
    kd_buf        *kd;
    char          *kd_buf_memory;
    kbufinfo_t     bufinfo = { 0, 0, 0, 0 };
    unsigned short code;

    KDBG_CLASS_NAMES[255] = "DBG_MIG";

    if (argc > 2) {
        fprintf(stderr, "usage: %s [<pid>]\n", PROGNAME);
        exit(1);
    }

    if (argc == 2)
        pid = atoi(argv[1]);

    code = KDBG_BSD_SYSTEM_CALL_OF_INTEREST;

    // Arrange for cleanup
    signal(SIGHUP, ukdbg_exit_handler);
    signal(SIGINT, ukdbg_exit_handler);
    signal(SIGQUIT, ukdbg_exit_handler);
    signal(SIGTERM, ukdbg_exit_handler);

    kd_buf_memory = malloc(KDBG_SAMPLE_SIZE * sizeof(kd_buf));
    if (!kd_buf_memory) {
        perror("malloc");
        exit(1);
    }

    ukdbg_clear();                  // Clean up related buffers
    ukdbg_setbuf(KDBG_SAMPLE_SIZE); // Set buffer for the desired # of entries
    ukdbg_reinit();                 // Reinitialize the facility
    if (pid > 0)
        ukdbg_setpidcheck(pid, 1);  // We want this pid
    // We want this particular BSD system call
    ukdbg_setreg_valcheck(BSDDBG_CODE(DBG_BSD_EXCP_SC, code), 0, 0, 0);
    ukdbg_setenable(1);             // Enable tracing

    while (1) {
        ukdbg_getbuf(&bufinfo);                    // Query information
        oldlen = bufinfo.nkdbufs * sizeof(kd_buf); // How much to read?
        ukdbg_read(kd_buf_memory, &oldlen);        // Read that much
        count = oldlen;

        kd = (kd_buf *)kd_buf_memory;
        for (i = 0; i < count; i++) {
            char     *qual = "";
            uint64_t  cpu, now;
            int       debugid, thread, type, class;

            thread = kd[i].arg5;
            debugid = kd[i].debugid;
            type = debugid & KDBG_FUNC_MASK;
            class = (debugid & KDBG_CLASS_MASK) >> KDBG_CLASS_SHIFT;
            now = kd[i].timestamp & KDBG_TIMESTAMP_MASK;
            cpu = (kd[i].timestamp & KDBG_CPU_MASK) >> KDBG_CPU_SHIFT;

            if (debugid & DBG_FUNC_START)
                qual = "DBG_FUNC_START";
            else if (debugid & DBG_FUNC_END)
                qual = "DBG_FUNC_END";

            // Note that 'type' should be the system call we were looking for
            // (type == BSDDBG_CODE(DBG_BSD_EXCP_SC, code)) is true
            printf("%lld: cpu %lld %s code %#x thread %p %s\n",
                   now,
                   cpu,
                   (KDBG_CLASS_NAMES[class]) ? KDBG_CLASS_NAMES[class] : "",
                   type,
                   (void *)thread,
                   qual);
        }

        usleep(KDBG_SAMPLE_INTERVAL);
    }
}

$ gcc -Wall -I /path/to/xnu/bsd/ -o kdebug kdebug.c
$ ./kdebug # now use the 'cd' command from another shell
9009708884894: cpu 1 DBG_BSD code 0x40c0030 thread 0x47f9948 DBG_FUNC_START
9009708885712: cpu 1 DBG_BSD code 0x40c0030 thread 0x47f9948 DBG_FUNC_END
^Ccleaning up...

The file /usr/share/misc/trace.codes maps kdebug codes to operation names. It is used by programs such as sc_usage and latency to display debug codes in human-readable forms.

6.8.7.2. kdebug Entropy Collection

As we noted earlier, the kdebug facility can be enabled to sample system entropy. The samples are collected in a kernel buffer, which is then copied to user space. These steps occur within a single sysctl call. Input to this call includes a user buffer for receiving entropy entries, the buffer's size, and a timeout value. The Mac OS X Security Server (/usr/sbin/securityd) contains a system entropy management module that uses kdebug entropy collection. It performs the following operations.

It reads the saved entropy file (/var/db/SystemEntropyCache) on startup and seeds the random number generator (RNG) for initial use.
It periodically calls the kdebug facility to collect and retrieve system entropy, which it uses to seed the RNG.
It periodically saves the RNG's entropy to the entropy file for use across reboots.

// <darwin>/<securityd>/src/entropy.cpp
void
EntropyManager::collectEntropy()
{
    int mib[4];
    mib[0] = CTL_KERN;
    mib[1] = KERN_KDEBUG;
    mib[2] = KERN_KDGETENTROPY;
    mib[3] = 1; // milliseconds of maximum delay

    mach_timespec_t timings[timingsToCollect];
    size_t size = sizeof(timings);
    int ret = sysctl(mib, 4, timings, &size, NULL, 0);
    ...
}

The in-kernel handler for the KERN_KDGETENTROPY sysctl calls kdbg_getentropy() [bsd/kern/kdebug.c], which performs the following operations.

If entropy is already being sampled, it returns EBUSY. This is because there can be at most one instance of entropy collection in the kdebug facility.
It uses kmem_alloc() to allocate a kernel buffer large enough to hold the number of entropy entries requested. If this allocation fails, it returns EINVAL.
If the caller-specified timeout value is less than 10 ms, it alters the timeout to be 10 ms.
It enables entropy sampling by setting the KDEBUG_ENABLE_ENTROPY bit in the kdebug_enable variable.
It calls tsleep() [bsd/kern/kern_synch.c] to sleep for the timeout duration.
Upon being woken up, it disables entropy sampling by clearing the KDEBUG_ENABLE_ENTROPY bit in kdebug_enable.
It copies the kernel entropy buffer to the user buffer that was passed in the sysctl call.
It calls kmem_free() to free the in-kernel entropy buffer.

Enabling entropy collection causes kernel_debug() to collect timestamp entries in the entropy buffer. Note that this is independent of whether kdebug tracing is enabled or the CHUD hook is installed.

// bsd/kern/kdebug.c
void
kernel_debug(debugid, arg1, arg2, arg3, arg4, arg5)
{
    ...
    if (kdebug_enable & KDEBUG_ENABLE_ENTROPY) {
        // collect some more entropy
        if (kd_entropy_index < kd_entropy_count) {
            kd_entropy_buffer[kd_entropy_index] = mach_absolute_time();
            kd_entropy_index++;
        }

        // do we have enough timestamp entries to fill the entropy buffer?
        if (kd_entropy_index == kd_entropy_count) {
            // disable entropy collection
            kdebug_enable &= ~KDEBUG_ENABLE_ENTROPY;
            kdebug_slowcheck &= ~SLOW_ENTROPY;
        }
    }
    ...
}
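The interplay between the sysctl handler and kernel_debug() can be exercised in a small user-space model. This is a sketch under simplifying assumptions: the sim_ names are hypothetical, the enable bit is a placeholder value, the caller supplies the timestamp that mach_absolute_time() would provide, and there is no sleeping or copying to a user buffer.

```c
#include <stddef.h>
#include <stdint.h>

enum { SIM_ENABLE_ENTROPY = 0x2 };  /* placeholder for KDEBUG_ENABLE_ENTROPY */

struct sim_kdebug {
    unsigned  enable;        /* models kdebug_enable */
    uint64_t *buffer;        /* models kd_entropy_buffer */
    size_t    count, index;  /* models kd_entropy_count and kd_entropy_index */
};

/* Models the entropy branch of kernel_debug(); 'now' stands in for the
 * value returned by mach_absolute_time(). */
static void sim_kernel_debug(struct sim_kdebug *kd, uint64_t now)
{
    if (!(kd->enable & SIM_ENABLE_ENTROPY))
        return;
    if (kd->index < kd->count)
        kd->buffer[kd->index++] = now;
    if (kd->index == kd->count)
        kd->enable &= ~(unsigned)SIM_ENABLE_ENTROPY;  /* buffer is full */
}

/* Models the timeout adjustment in kdbg_getentropy(): at least 10 ms. */
static unsigned sim_clamp_timeout_ms(unsigned ms)
{
    return ms < 10 ? 10 : ms;
}
```

Two details are worth noticing. Sampling switches itself off as soon as the buffer fills, so later trace points cost nothing extra. Also, the 1 ms delay requested by securityd in the excerpt above would be raised to 10 ms by the clamp.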

6.8.8. Low-Level Diagnostics and Debugging Interfaces

The Mac OS X kernel provides a low-level diagnostics and debugging interface that can be enabled at boot time by passing the diag argument to the kernel. Depending on the specific flags passed through this argument, the kernel enables specific features and behavior. The various flag bits that can be set in the diag value^[19] are defined in osfmk/ppc/Diagnostics.h. Table 6-17 lists these flags and their purpose.

^[19] The diag value is passed as a logical OR of these flag bits.

Table 6-17. Flags for Enabling Diagnostic Features at Boot Time
Name	Value	Description
`enaExpTrace`	`0x0000_0001`	This bit enables ultra-low-level diagnostics tracing in the kernel. The built-in kernel debugger, KDB, can be used to view the trace records.
`enaUsrFCall`	`0x0000_0002`	This bit enables the firmware call interface to be used from user space. The interface provides hardware-dependent low-level functionality.
`enaUsrPhyMp`	`0x0000_0004`	This bit is unused.
`enaDiagSCs`	`0x0000_0008`	This bit enables the diagnostics system call interface.
`enaDiagDM`	`0x0000_0010`	If this bit is set, the driver for `/dev/mem` allows access to the entire physical memory (`mem_actual`), even if available physical memory has been limited through the `maxmem` boot-time argument.
`enaDiagEM`	`0x0000_0020`	This bit causes the special opcode `0`, along with an extended opcode that is one of the X-form instructions capable of taking an alignment interrupt, to simulate an alignment exception. This is to facilitate debugging of the alignment handler.
`enaDiagTrap`	`0x0000_0040`	This bit enables the special diagnostics trap, which is of the form `twi 31,r31,0xFFFX`, with `X` being a hexadecimal digit. When enabled, the trap returns from the kernel with a return value of `1`.
`enaNotifyEM`	`0x0000_0080`	Instructions that fail because of operand alignment can be emulated by low-level kernel code. This bit enables notification of such emulation (say, for logging of unaligned accesses). `chudxnu_passup_alignment_exceptions()` [`osfmk/ppc/chud/chud_cpu.c`], a CHUD toolkit internal function, can set or clear this bit.
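Since the diag value is a logical OR of the flag bits in the table above, composing and testing such a value is simple bit arithmetic. In the sketch below, the flag values are taken from the table; diag_feature_enabled() and diag_example are hypothetical names used only for illustration.

```c
#include <stdint.h>

/* Flag values as listed in the table (from osfmk/ppc/Diagnostics.h). */
enum {
    enaExpTrace = 0x00000001,
    enaUsrFCall = 0x00000002,
    enaUsrPhyMp = 0x00000004,
    enaDiagSCs  = 0x00000008,
    enaDiagDM   = 0x00000010,
    enaDiagEM   = 0x00000020,
    enaDiagTrap = 0x00000040,
    enaNotifyEM = 0x00000080
};

/* Hypothetical helper: is a given feature enabled in a diag value? */
static int diag_feature_enabled(uint32_t diag, uint32_t flag)
{
    return (diag & flag) != 0;
}

/* A value that would enable both user-space firmware calls and the
 * diagnostics system call interface: 0x2 | 0x8 == 0xa. */
static const uint32_t diag_example = enaUsrFCall | enaDiagSCs;
```

Such a value would then be supplied through the diag boot-time argument, for instance as diag=0xa, assuming the usual name=value form of boot arguments.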

The flags listed in Table 6-17 are likely to be useful only to Mac OS X core kernel developers. Enabling any of the diagnostics or low-level debugging features on production systems is not recommended.

6.8.8.1. Firmware Call Interface

The firmware call interface is so called because it provides functionality that is considered an extension of the hardware; it is not an interface to Open Firmware. Its functionality includes the following:

Writing debugging information (such as register contents) to a printer or modem port (dbgDispCall)
Storing words to physical memory and clearing physical pages (StoreRealCall and ClearRealCall, respectively)
Loading BAT registers (LoadDBATsCall and LoadIBATsCall)
Creating fake I/O and decrementer interrupts (CreateFakeIOCall and CreateFakeDECCall, respectively)
Crashing the system immediately (Choke)
Shutting down the system immediately (CreateShutdownCTXCall)
Switching context (SwitchContextCall)
Preempting (DoPreemptCall)

Note that the kernel uses certain firmware calls during its regular operation. For example, _ml_set_interrupts_enabled() [osfmk/ppc/machine_routines_asm.s] can conditionally invoke the DoPreemptCall() system call from within the kernel. Similarly, context-switching assembly code in osfmk/ppc/cswtch.s uses the SwitchContextCall() firmware call.

Let us look at an example of using the firmware call interface from user space. A system call is treated as a firmware call if its number has its high-order bit set to 1. As we saw in Figure 6-12, the xcpSyscall() handler tests for firmware calls early. Whereas firmware calls are always allowed from supervisor state, they must be explicitly enabled for user space, as noted in Table 6-17. If xcpSyscall() finds the high-order bit set in the system call number, and the call is allowed, it forwards the call to FirmwareCall() [xnu/osfmk/ppc/Firmware.s]. The low-order bits in the system call number represent the index of the desired call in the firmware call table (FWtable), which is declared in osfmk/ppc/Firmware.s and populated in osfmk/ppc/FirmwareCalls.h. The lowest numbered firmware call, CutTraceCall(), has the number 0x80000000. Looking at osfmk/ppc/FirmwareCalls.h, we find that the Choke() call is at index 0xa, whereas the CreateShutdownCTXCall() call is at index 0xd.

Firmware call indices may not be the same across all kernel versions.

For example, if we make a system call with the value 0x8000000a in GPR0, the system will crash. Similarly, a call with the value 0x8000000d in GPR0 will cause the system to shut down instantly, as if all power to the system has been cut.
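The numbering scheme described above can be sketched in portable C. This is only an illustration of the bit arithmetic: the helper names are hypothetical, and the indices are the ones quoted in the text, which may differ in other kernel versions. Nothing here actually issues a system call.

```c
#include <stdbool.h>
#include <stdint.h>

/* The high-order bit of the 32-bit system call number marks a firmware call. */
#define FW_CALL_FLAG      0x80000000U

/* Indices quoted in the text; they may not hold for every kernel version. */
#define FW_INDEX_CHOKE    0xaU
#define FW_INDEX_SHUTDOWN 0xdU

/* Returns true if the system call number selects a firmware call. */
bool is_firmware_call(uint32_t scnum)
{
    return (scnum & FW_CALL_FLAG) != 0;
}

/* Extracts the firmware call table index from the low-order bits. */
uint32_t firmware_call_index(uint32_t scnum)
{
    return scnum & ~FW_CALL_FLAG;
}

/* Builds the value that would be placed in GPR0 for a given table index. */
uint32_t firmware_call_number(uint32_t index)
{
    return FW_CALL_FLAG | index;
}
```

With these helpers, firmware_call_number(FW_INDEX_CHOKE) yields 0x8000000a and firmware_call_number(FW_INDEX_SHUTDOWN) yields 0x8000000d, the two GPR0 values mentioned above.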

6.8.8.2. Diagnostics System Call Interface

The diagnostics system call interface is somewhat more generally interesting for experimentation than the firmware call interface. Its functionality includes routines for the following operations:

Adjusting the Timebase Register (used for testing drift recovery)
Returning the physical address of a page
Accessing physical memory (including copying physical memory pages, and reading or writing individual bytes)
Soft-resetting the processor
Forcing all caches, including the TLB, to be reinitialized
Retrieving boot-screen information

A complete list of implemented diagnostics system calls can be seen in osfmk/ppc/Diagnostics.c. As shown in Table 6-12, the PowerPC-specific diagCall() system call has the number 0x6000. This system call invokes diagCall() [osfmk/ppc/Diagnostics.c], the dispatcher for these system calls. diagCall() uses the value in GPR3 to determine the specific diagnostics operation to perform. The available operations are defined in osfmk/ppc/Diagnostics.h.

// osfmk/ppc/Diagnostics.h
#define diagSCnum 0x00006000
#define dgAdjTB      0
#define dgLRA        1
#define dgpcpy       2
#define dgreset      3
#define dgtest       4
#define dgBMphys     5
#define dgUnMap      6
#define dgBootScreen 7
...
#define dgKfree      22
#define dgWar        23
...

Let us look at examples of using the diagnostics system calls. First, we create a common header file, diagCommon.h, containing code we will use in all examples in this section. Figure 6-40 shows diagCommon.h.

Figure 6-40. Common header file for using the diagnostics system call interface

// diagCommon.h

#ifndef _DIAG_COMMON_H_
#define _DIAG_COMMON_H_

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <ppc/types.h>
#define _POSIX_C_SOURCE
#include <stdlib.h>
#include <unistd.h>

struct savearea;

// These headers are not available outside of the kernel source tree
#define KERNEL_PRIVATE
#include <ppc/Diagnostics.h>
#include <console/video_console.h>
#undef KERNEL_PRIVATE

// The diagCall() prototype in Diagnostics.h is from the kernel's standpoint
// -- having only one argument: a pointer to the caller's save area. Our user-
// space call takes a variable number of arguments.
//
// Note that diagCall() does have a stub in libSystem.
//
// Here we declare a prototype with a variable number of arguments, define
// a function pointer per that prototype, and point it to the library stub.
typedef int (*diagCall_t)(int op, ...);
diagCall_t diagCall_ = (diagCall_t)diagCall;

// Defined in osfmk/vm/pmap.h, which may not be included from user space
#define cppvPsrc        2
#define cppvNoRefSrc    32

// Arbitrary upper limit on the number of bytes of memory we will handle
#define MAXBYTES        (8 * 1024 * 1024)

#endif // _DIAG_COMMON_H_

Before the diagnostics system call interface can be used, it must be enabled at boot time by passing the diag=<number> boot argument, where <number> contains set bits corresponding to the enaDiagSCs constant (0x8), as noted in Table 6-17. Similarly, the firmware interface is enabled by passing set bits corresponding to the enaUsrFCall constant (0x2). For example, to enable both interfaces, pass diag=0xa, since 0xa is the logical OR of 0x8 and 0x2.
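The flag arithmetic can be checked with a short C sketch. The two constants carry the names and values given in the text; the helper functions are hypothetical and merely test bits in a diag value.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flag values quoted in the text for the diag boot argument. */
#define enaUsrFCall 0x2U  /* enable firmware calls from user space */
#define enaDiagSCs  0x8U  /* enable diagnostics system calls */

/* Returns true if the diag value enables the diagnostics system calls. */
bool diag_calls_enabled(uint32_t diag_flags)
{
    return (diag_flags & enaDiagSCs) != 0;
}

/* Returns true if the diag value enables firmware calls from user space. */
bool user_firmware_calls_enabled(uint32_t diag_flags)
{
    return (diag_flags & enaUsrFCall) != 0;
}
```

A value of 0xa satisfies both predicates, whereas 0x8 alone enables only the diagnostics system calls.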

Retrieving Boot-Screen Information

In this example, we will write a program that uses the dgBootScreen() call to retrieve the boot-screen "video" information from the kernel. The information is maintained in a structure of type vc_info. The system console code in osfmk/console/video_console.c manages this structure. Figure 6-41 shows the output of our program on a system with a 1280x854 display attached to it. Note that the quantity labeled "physical address" shows the location of the raw framebuffer in physical memory.

Figure 6-41. Retrieving boot-screen information using a diagnostics system call

// diagBootScreen.c

#include "diagCommon.h"

int main(int argc, char **argv)
{
    struct vc_info vc_info;

    if (diagCall_(dgBootScreen, &vc_info) < 0)
        exit(1);

    printf("%ldx%ld pixels, %ldx%ld characters, %ld-bit\n",
           vc_info.v_width, vc_info.v_height,
           vc_info.v_columns, vc_info.v_rows,
           vc_info.v_depth);
    printf("base address %#08lx, physical address %#08lx\n",
           vc_info.v_baseaddr, vc_info.v_physaddr);
    printf("%ld bytes used for display per row\n",
           vc_info.v_rowscanbytes);

    exit(0);
}

$ gcc -Wall -I /path/to/xnu/osfmk/ -o diagBootScreen diagBootScreen.c
$ ./diagBootScreen
1280x854 pixels, 160x53 characters, 32-bit
base address 0x2f72c000, physical address 0xb8010000
5120 bytes used for display per row

Retrieving the Physical Address for a Virtual Address

In this example, we will use the dgLRA() call (LRA stands for logical-to-real address) to retrieve the physical page, and therefore the physical address, for a given virtual address in the address space of the calling process. If the virtual address is not mapped in the caller's address space, the dgLRA() system call returns zero rather than a physical page number. We can verify this program by retrieving the physical address of the page beginning at virtual address 0xFFFF8000. As we saw earlier, this is the base virtual address of the comm area and should map to the same physical page in all user address spaces. Figure 6-42 shows the program.

Figure 6-42. Retrieving the physical address (if any) for a virtual address in the caller's address space

// diagLRA.c

#include "diagCommon.h"

#define PROGNAME "diagLRA"

int main(int argc, char **argv)
{
    u_int32_t phys, virt;
    u_int64_t physaddr;

    if (argc != 2) {
        printf("usage: %s <virtual address in hex>\n", PROGNAME);
        exit(1);
    }

    // Must be in hexadecimal
    virt = strtoul(argv[1], NULL, 16);

    phys = diagCall_(dgLRA, virt);
    if (!phys) {
        printf("virtual address %08x :: physical page none\n", virt);
        exit(1);
    }

    physaddr = (u_int64_t)phys * 0x1000ULL + (u_int64_t)(virt & 0xFFF);
    printf("virtual address %#08x :: physical page %#x (address %#llx)\n",
            virt, phys, physaddr);

    exit(0);
}

$ gcc -Wall -I /path/to/xnu/osfmk/ -o diagLRA diagLRA.c
$ ./diagLRA 0x0
virtual address 00000000 :: physical page none
$ ./diagLRA 0xFFFF8000
virtual address 0xFFFF8000 :: physical page 0x1669 (address 0x1669000)
...
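The address computation in diagLRA.c can be isolated as a small portable function, which makes it easy to try on other values. This is a sketch that assumes the 4KB page size used by the program above; the function and macro names are hypothetical.

```c
#include <stdint.h>

#define PAGE_SIZE_BYTES  0x1000ULL  /* 4KB pages, as assumed in diagLRA.c */
#define PAGE_OFFSET_MASK 0xFFFU     /* low 12 bits: offset within the page */

/* Combines a physical page number with the page offset of a virtual address. */
uint64_t physical_address(uint32_t phys_page, uint32_t virt)
{
    return (uint64_t)phys_page * PAGE_SIZE_BYTES + (virt & PAGE_OFFSET_MASK);
}
```

For the comm area example, page 0x1669 and virtual address 0xFFFF8000 give 0x1669000; a virtual address of 0xFFFF8020 in the same page would give 0x1669020.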

Examining Physical Memory

The dgpcpy() diagnostics system call copies physical memory into a supplied buffer. In this example, we will write a program that uses this call to fetch physical memory and dumps it on the standard output. Thereafter, we can either redirect the program's output to a file or pipe it through a utility such as hexdump to view the memory's contents in different formats. Figure 6-43 shows the program.

Figure 6-43. Retrieving physical memory using a diagnostics system call

// diagpcpy.c

#include "diagCommon.h"

#define PROGNAME "diagpcpy"

void usage(void);

int main(int argc, char **argv)
{
    int        ret;
    u_int32_t  phys;
    u_int32_t  nbytes;
    char      *buffer;

    if (argc != 3)
        usage();

    phys = strtoul(argv[1], NULL, 16);
    nbytes = strtoul(argv[2], NULL, 10);

    // Both values are unsigned, so only a zero byte count needs rejecting
    if (nbytes == 0)
        usage();

    nbytes = (nbytes > MAXBYTES) ? MAXBYTES : nbytes;
    buffer = (char *)malloc(nbytes);
    if (buffer == NULL) {
        perror("malloc");
        exit(1);
    }

    // copy physical to virtual
    ret = diagCall_(dgpcpy, 0, phys, 0, buffer, nbytes, cppvPsrc|cppvNoRefSrc);

    (void)write(1, buffer, nbytes);

    free(buffer);

    exit(0);
}

void usage(void)
{
    printf("usage: %s <physical addr> <bytes>\n", PROGNAME);
    printf("\tphysical address must be specified in hexadecimal\n");
    printf("\tnumber of bytes to copy must be specified in decimal\n");

    exit(1);
}

$ gcc -Wall -I /path/to/xnu/osfmk/ -o diagpcpy diagpcpy.c
...

We can test the operation of our diagpcpy program by examining physical memory that is known to contain specific information. Recall from Figure 6-28 that the beginning of the comm area contains a string signature. Moreover, we know that the comm area should begin at virtual address 0xFFFF8000 in every user virtual address space, and we determined the corresponding physical address using our diagLRA program in a previous example.

$ ./diagLRA 0xFFFF8000
virtual address 0xFFFF8000 :: physical page 0x1669 (address 0x1669000)
$ ./diagpcpy 0x1669000 16 | strings
commpage 32-bit
$

Let us look at another example. We know that the exception vectors reside in physical memory starting at address 0x0. We can retrieve the contents of that page and compare them with the contents of the __interrupts section in the __VECTORS segment within the kernel executable.

$ ./diagpcpy 0x0 4096 > /tmp/phys0.out
$ hexdump -v /tmp/phys0.out | less
...
0000100 7db2 43a6 7d73 43a6 81a0 00f0 7d60 0026
0000110 2c0d 0001 4082 001c 3960 0000 9160 00f0
0000120 8080 00f4 8060 00f8 7c88 03a6 4e80 0020
...
$ otool -s __VECTORS __interrupts /mach_kernel | less
/mach_kernel:
Contents of (__VECTORS,__interrupts) section
...
00000100 7db243a6 7d7343a6 81a000f0 7d600026
00000110 2c0d0001 4082001c 39600000 916000f0
00000120 808000f4 806000f8 7c8803a6 4e800020
...

Note that the hexdump output is formatted somewhat differently from otool's output. The modern-day hexdump program's output format can be configured through format strings. In this example, you can make the hexdump output identical to otool's as follows:

$ echo '"%07.7_Ax\\n"\n"%07.7_ax  " 4/4 "%08x " "\\n"' | \
    hexdump -v -f /dev/stdin /tmp/phys0.out
...
00000100 7db243a6 7d7343a6 81a000f0 7d600026
00000110 2c0d0001 4082001c 39600000 916000f0
00000120 808000f4 806000f8 7c8803a6 4e800020
...
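The difference between the two layouts is only one of grouping: the same 16 bytes are shown either as eight 2-byte units or as four 4-byte words. The following C sketch formats one row in the otool style. It assembles each word in big-endian order, matching the PowerPC byte order of the dump; the function name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Formats 16 bytes as an 8-digit hexadecimal offset followed by four
 * 32-bit words, each assembled most significant byte first.
 * Returns the value returned by snprintf().
 */
int format_words_line(uint32_t offset, const uint8_t bytes[16],
                      char *out, size_t outlen)
{
    unsigned int w[4];

    for (int i = 0; i < 4; i++) {
        const uint8_t *p = bytes + 4 * i;
        w[i] = ((unsigned int)p[0] << 24) | ((unsigned int)p[1] << 16) |
               ((unsigned int)p[2] << 8)  |  (unsigned int)p[3];
    }

    return snprintf(out, outlen, "%08x %08x %08x %08x %08x",
                    (unsigned int)offset, w[0], w[1], w[2], w[3]);
}
```

Feeding it the 16 bytes at offset 0x100 of the dump reproduces the corresponding otool line.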

Finally, let us retrieve a few bytes from the physical address 0x5000 and attempt to interpret them as a character string. We will see the reason for this string in Section 6.8.9.1.

$ ./diagpcpy 0x5000 8 | strings
Hagfish

Capturing a Screenshot of the Text Console

We earlier determined the physical base address of the boot display's framebuffer using the diagBootScreen program. Since diagpcpy allows us to dump physical memory, we could capture a raw screenshot of the display using these two programs. In particular, this gives us a way of capturing a screenshot of the textual console. Let us consider the example of the display shown in Figure 6-41; the relevant information is repeated here.

$ ./diagBootScreen
1280x854 pixels, 160x53 characters, 32-bit
base address 0x2f72c000, physical address 0xb8010000
5120 bytes used for display per row

Given the information dumped by diagBootScreen, we can see that our screenshot capturing on this particular system would involve copying a certain amount of physical memory starting at physical address 0xb8010000. Since this is a 32-bit framebuffer and there are 1280 x 854 pixels, the number of bytes we need to retrieve is 4 x 1280 x 854, which is 4,372,480.
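The byte count can be derived in two equivalent ways for this display: from the pixel dimensions and depth, or from the per-row byte count reported by diagBootScreen. The sketch below shows both; the function names are hypothetical, and the first form assumes rows are packed with no padding, which holds here because 5120 equals 4 x 1280.

```c
#include <stdint.h>

/* Bytes in a packed framebuffer: width * height * bytes per pixel. */
uint64_t framebuffer_bytes(uint32_t width, uint32_t height, uint32_t depth_bits)
{
    return (uint64_t)width * height * (depth_bits / 8);
}

/* Bytes computed from the row stride, which also covers any row padding. */
uint64_t framebuffer_bytes_from_rows(uint32_t row_bytes, uint32_t height)
{
    return (uint64_t)row_bytes * height;
}
```

Both framebuffer_bytes(1280, 854, 32) and framebuffer_bytes_from_rows(5120, 854) evaluate to 4372480.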

$ ./diagpcpy 0xb8010000 4372480 > display.dump
$ file display.dump
display.dump: data

Note that diagpcpy is not the only way to read physical memory from user space. Given the appropriate privileges, a user program can read physical memory and kernel virtual memory through the /dev/mem and /dev/kmem devices, respectively. The kernel functions that serve these devices are implemented in bsd/dev/ppc/mem.c.

At this point, the display.dump file contains raw pixel data: a linear sequence of 32-bit pixel values in row-major order. When treated as big-endian, each pixel value contains a leading pad byte followed by 8-bit red, green, and blue components, in that order. We can convert this raw data to an image format (say, TIFF or JPEG) by using a variety of image-processing tools. For example, we can write a trivial Perl script to remove the padding from each pixel value to create a new raw pixel data file, which can then be converted to easily viewable image formats. The following example uses the freely available rawtoppm and ppmtojpeg command-line programs.

$ cat unpad.pl
#! /usr/bin/perl -w

my $PROGNAME = "unpad";

if ($#ARGV != 1) {
    die "usage: $PROGNAME <infile> <outfile>\n";
}

open(I, "<$ARGV[0]") or die "$!\n";
open(O, ">$ARGV[1]") or die "$!\n";

my $buf;

while (sysread(I, $buf, 4) == 4) {
    my ($pad, $r, $g, $b) = unpack('C4', $buf);
    $buf = pack('C3', $r, $g, $b);
    syswrite(O, $buf, 3);
}

close(I);
close(O);
exit(0);
$ ./unpad.pl display.dump display-rgb.raw
$ rawtoppm -rgb -interpixel 1280 854 display-rgb.raw > display.ppm
$ ppmtojpeg display.ppm > display.jpg
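The same transformation can be expressed in C as an in-memory conversion, which may be convenient if the dump is already held in a buffer. This is a sketch equivalent in intent to the Perl script, not a replacement for it; the function name is hypothetical, and the caller is assumed to supply an output buffer of at least 3 bytes per pixel.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Converts npixels 4-byte pixels laid out as (pad, R, G, B) into
 * 3-byte (R, G, B) pixels. Returns the number of bytes written to out.
 */
size_t unpad_pixels(const uint8_t *in, size_t npixels, uint8_t *out)
{
    for (size_t i = 0; i < npixels; i++) {
        out[3 * i + 0] = in[4 * i + 1];  /* red */
        out[3 * i + 1] = in[4 * i + 2];  /* green */
        out[3 * i + 2] = in[4 * i + 3];  /* blue */
    }
    return 3 * npixels;
}
```

For the 1280x854 display, the 4,372,480-byte dump would shrink to 3,279,360 bytes of RGB data.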

A user program can retrieve the base address of a framebuffer using the CGDisplayBaseAddress() Quartz Services API call. Thereafter, the program can access and modify the framebuffer memory, say, using the read() and write() system calls. We will see an API-compliant version of the screenshot-capturing example in Chapter 10.

6.8.9. Low-Level Kernel Tracing

In addition to the various tracing facilities we have seen so far, the Mac OS X kernel contains yet another tracing facility used for low-level tracing. We have alluded to this facility earlier in this chapter in the context of the CutTrace() system call, which is the means by which kernel code logs low-level traces into a kernel buffer. The buffer can be examined from within KDB. Before we discuss this tracing mechanism, let us look at the low-memory global (lowglo) data structures, which also include the work area for this mechanism.

6.8.9.1. Low-Memory Global Data Structures

We have seen earlier that the first 32KB of physical memory (the low memory) contains critical kernel data and code. For example, the PowerPC exception vectors begin at physical address 0x0. The low-level exception filter table (xcpTable) and the first-level system call dispatch table (scTable), both of which are implemented in osfmk/ppc/lowmem_vectors.s, also reside in low memory. Another low-memory area instantiated in lowmem_vectors.s is lowGlo, a structure of type lowglo that contains global (as opposed to per-processor) constants, data areas, and pointers. These entities are accessed by kernel code directly using absolute addresses. Hence, they must reside in low physical memory. The lowGlo area starts at physical address 0x5000 and is a page long. The next physical page, starting at address 0x6000, is a shared page mapped into the kernel's address space; it can be used for low-level kernel debugging. Figure 6-44 shows the structure of the lowGlo area.

Figure 6-44. The low-memory global data area

In Section 6.8.8.2, we found that memory at physical address 0x5000 contains the word Hagfish. To be precise, it is the string "Hagfish " (with a trailing space). It is an "eyecatcher" string used as a system verification code at the beginning of the lowGlo area. Let us use our diagpcpy program from Section 6.8.8.2 to glean some more information from this area.
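A program that reads the start of the lowGlo area could verify the eyecatcher before trusting anything else in the page. The following is a minimal sketch of such a check. It assumes the eyecatcher occupies exactly the 8 bytes fetched in Section 6.8.8.2 and is not NUL-terminated; the macro and function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Eyecatcher as described in the text: seven letters and a trailing space. */
#define LOWGLO_EYECATCHER     "Hagfish "
#define LOWGLO_EYECATCHER_LEN 8

/* Returns true if buf begins with the 8-byte eyecatcher. */
bool has_lowglo_eyecatcher(const void *buf, size_t len)
{
    return len >= LOWGLO_EYECATCHER_LEN &&
           memcmp(buf, LOWGLO_EYECATCHER, LOWGLO_EYECATCHER_LEN) == 0;
}
```

The comparison uses memcmp() rather than strcmp() because the bytes that follow the eyecatcher in memory are unrelated data.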

Kernel Version String

As shown in Figure 6-44, physical address 0x501C contains a pointer to the kernel version string.

$ ./diagpcpy 0x501C 4 | hexdump
0000000 0033 1da0
0000004
$ ./diagpcpy 0x00331da0 128 | strings
Darwin Kernel Version 8.6.0: ... root:xnu-792.6.70.obj~1/RELEASE_PPC
8.6.0
Darwin
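The first command yields four raw bytes, which we read by eye as the pointer 0x00331da0. A program doing the same would need to assemble the bytes in big-endian order regardless of the host's byte order. A minimal sketch follows; the function name is hypothetical.

```c
#include <stdint.h>

/* Decodes four bytes, most significant first, into a 32-bit value. */
uint32_t be32_decode(const uint8_t b[4])
{
    return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
           ((uint32_t)b[2] << 8)  |  (uint32_t)b[3];
}
```

Given the bytes 00 33 1d a0 from the dump, be32_decode() returns 0x00331da0, the address passed to the second diagpcpy invocation.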

Per-Processor Information Areas

As shown in Figure 6-44, physical address 0x5010 points to the start of the array containing per-processor information entries. Each entry is a structure of type per_proc_entry, which contains a pointer to a structure of type per_proc_info. The latter contains a variety of static and dynamic information about a processor, for example, the hardware exception counters (struct hwCtrs). Figure 6-45 shows excerpts from these structures, some of which we came across in Chapter 5.

Figure 6-45. Data structures for holding per-processor information

// osfmk/ppc/exception.h

#pragma pack(4)
struct hwCtrs {
    unsigned int hwInVains;
    unsigned int hwResets;
    unsigned int hwMachineChecks;
    unsigned int hwDSIs;
    unsigned int hwISIs;
    unsigned int hwExternals;
    unsigned int hwAlignments;
    unsigned int hwPrograms;
    ...
};
#pragma pack()
typedef struct hwCtrs hwCtrs;
...
#pragma pack(4)
struct per_proc_info {
    unsigned int cpu_number;
    ...
    hwCtrs hwCtr; // begins at offset 0x800 within the structure
    ...
};
#pragma pack()
...
#define MAX_CPUS 256

struct per_proc_entry {
    addr64_t              ppe_paddr;
    unsigned int          ppe_pad4[1];
    struct per_proc_info *ppe_vaddr;
};

extern struct per_proc_entry PerProcTable[MAX_CPUS-1];

Let us retrieve the value of a particular hardware counter, say, hwPrograms, using diagpcpy. We will do so for the first processor, in case there are multiple processors. Given the information in Figures 6-44 and 6-45, we can compute the physical address of interest using the following information.

The address 0x5010 contains a pointer to PerProcTable; we are interested in the first entry of PerProcTable.
The ppe_vaddr field is at an offset of 12 bytes, the sum of the sizes of an addr64_t and an unsigned int.
Resolving the ppe_vaddr pointer gives us the per_proc_info structure of the processor.
The hwCtr structure is at an offset of 0x800 within the per_proc_info structure.
The hwPrograms counter is at an offset of 28 bytes (seven times the size of an unsigned int) within the hwCtr structure, and therefore, at an offset of 0x81c (0x800 + 28) bytes within per_proc_info.
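The offsets listed above can be double-checked with offsetof() on stand-in structures. The structures below are illustrative mirrors, not the kernel's definitions: they use fixed-width types, represent the 32-bit kernel pointer as a uint32_t, and include only the first eight counters. The 0x800 offset of hwCtr is taken from the text rather than computed.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative mirror of the first eight hwCtrs fields (32-bit counters). */
struct hw_ctrs_mirror {
    uint32_t hwInVains;
    uint32_t hwResets;
    uint32_t hwMachineChecks;
    uint32_t hwDSIs;
    uint32_t hwISIs;
    uint32_t hwExternals;
    uint32_t hwAlignments;
    uint32_t hwPrograms;
};

/* Illustrative mirror of per_proc_entry with a 32-bit pointer field. */
struct per_proc_entry_mirror {
    uint64_t ppe_paddr;
    uint32_t ppe_pad4[1];
    uint32_t ppe_vaddr;
};

/* Offset of hwCtr within per_proc_info, as stated in the text. */
#define HWCTR_OFFSET 0x800U

/* Address of hwPrograms given the address of a per_proc_info structure. */
uint32_t hw_programs_address(uint32_t per_proc_info_addr)
{
    return per_proc_info_addr + HWCTR_OFFSET +
           (uint32_t)offsetof(struct hw_ctrs_mirror, hwPrograms);
}
```

Here offsetof(struct per_proc_entry_mirror, ppe_vaddr) is 12 and offsetof(struct hw_ctrs_mirror, hwPrograms) is 28, so a per_proc_info structure at 0x35e000 places hwPrograms at 0x35e81c.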

Let us now retrieve the value of hwPrograms.

$ ./diagpcpy 0x5010 4 | hexdump # this will give us the address of PerProcTable
00000000 0035 d000
00000004
$ ./diagpcpy 0x35d000 16 | hexdump # fourth 32-bit word is the first
                                   # processor's ppe_vaddr
00000000 0000 0000 0035 e000 0000 0000 0035 e000
00000010
$ ./diagpcpy 0x35e81c 4 | hexdump # add 0x81c to get the address of hwPrograms
00000000 0000 0000
00000004

In this example, the counter's value is 0. Let us execute a trivial program that we know will increment the value of this counter. For example, executing an illegal hardware instruction (say, a supervisor-only instruction in user mode) will result in a T_PROGRAM exception, which is counted by hwPrograms. Figure 6-46 shows a program that causes a T_PROGRAM exception to be generated.

Figure 6-46. Causing an exception and retrieving the corresponding counter from the kernel

// gentprogram.c

#if defined(__GNUC__)
#include <ppc_intrinsics.h>
#endif

int main(void)
{
    return __mfspr(1023);
}

$ gcc -Wall -o gentprogram gentprogram.c
$ ./gentprogram
zsh: illegal hardware instruction  ./gentprogram
$ ./diagpcpy 0x35e81c 4 | hexdump
00000000 0000 0001
00000004
$ ./gentprogram; ./gentprogram; ./gentprogram
zsh: illegal hardware instruction  ./gentprogram
zsh: illegal hardware instruction  ./gentprogram
zsh: illegal hardware instruction  ./gentprogram
$ ./diagpcpy 0x35e81c 4 | hexdump
00000000 0000 0004
00000004

Figure 6-47 shows a portion of exception-handling code that increments various hardware exception counters. Note that the order of counter fields within the hwCtr structure is matched with the definition of exception numbers: an exception number is the offset of the corresponding counter within hwCtr. For example, T_PROGRAM is defined to be (0x07 * T_VECTOR_SIZE) in osfmk/ppc/exception.h, with T_VECTOR_SIZE being 4.

Figure 6-47. Maintenance of hardware exception counters in the kernel

// osfmk/ppc/genassym.c
...
DECLARE("hwCounts", offsetof(struct per_proc_info *, hwCtr));
...

; osfmk/ppc/lowmem_vectors.s

.L_exception_entry:
            ...
xcpCommon:
            ...
Redrive:
            ...
            mfsprg r2,0         ; restore per_proc (SPRG0 contains per_proc ptr)
            ...
            la r12,hwCounts(r2) ; point to the exception count area
            ...
            add r12,r12,r11     ; point to the count (r11 contains T_XXX)
            lwz r25,0(r12)      ; get the old value
            ...
            add r25,r25,r24     ; count this one (r24 will be 1 or 0)
            ...
            stw r25,0(r12)      ; store it back
            ...
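The counting logic in the assembly excerpt amounts to a load, add, and store at a byte offset given by the exception code. A rough C rendering follows, intended only to make the addressing explicit. The function name is hypothetical, the T_PROGRAM and T_VECTOR_SIZE values are those quoted in the text, and the counter area is modeled as a plain array of 32-bit values.

```c
#include <stdint.h>

#define T_VECTOR_SIZE 4
#define T_PROGRAM     (0x07 * T_VECTOR_SIZE)

/*
 * hw_counts:      start of the exception count area (hwCounts(r2) above)
 * exception_code: a T_XXX value, used directly as a byte offset (r11)
 * increment:      1 to count this exception, 0 to leave it uncounted (r24)
 */
void count_exception(uint32_t *hw_counts, uint32_t exception_code,
                     uint32_t increment)
{
    /* add r12,r12,r11: point to the counter for this exception */
    uint32_t *slot = (uint32_t *)((unsigned char *)hw_counts + exception_code);

    /* lwz, add, stw: read the old value, add the increment, store it back */
    *slot = *slot + increment;
}
```

Because T_VECTOR_SIZE equals the size of one counter, T_PROGRAM (28) lands on the eighth counter, which is hwPrograms.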

6.8.9.2. Low Tracing

Let us now continue our discussion of "CutTrace" low-level tracing, or simply low tracing. Figure 6-44 shows an area called lgTrcWork: this is the control block for low tracing. It is a structure of type traceWork.

// osfmk/ppc/low_trace.h

typedef struct traceWork {
    unsigned int traceCurr;   // Address of next slot
    unsigned int traceMask;   // Types to be traced
    unsigned int traceStart;  // Start of trace table
    unsigned int traceEnd;    // End of trace table
    unsigned int traceMsnd;   // Saved trace mask
    unsigned int traceSize;   // Size of trace table
    unsigned int traceGas[2];
} traceWork;

Similarly, lgdgWork is the diagnostic work area; it is a structure of type diagWork, which is declared in osfmk/ppc/Diagnostics.h. One of this structure's fields, dgFlags, stores the diagnostic flags. The flags passed through the diag boot-time argument are stored here.

Low tracing is enabled by setting the enaExpTrace bit in the value of the diag boot-time argument. It can be limited to a specific processor by providing the processor number through the ctrc boot-time argument. Moreover, the size of the kernel buffer used for low tracing can be adjusted through the tb boot-time argument. ppc_init() [osfmk/ppc/ppc_init.c] processes these arguments during early system startup. Figure 6-48 shows this processing.

Figure 6-48. Processing of low-tracing-related boot-time arguments during system startup

// osfmk/ppc/genassym.c
...
DECLARE("trcWork", offsetof(struct lowglo *, lgTrcWork));
...

// osfmk/ppc/ppc_init.c

void
ppc_init(boot_args *args)
{
    ...
    // Set diagnostic flags
    if (!PE_parse_boot_arg("diag", &dgWork.dgFlags))
        dgWork.dgFlags = 0;
    ...
    // Enable low tracing if it is requested
    if (dgWork.dgFlags & enaExpTrace)
        trcWork.traceMask = 0xFFFFFFFF;

    // See if tracing is limited to a specific processor
    if (PE_parse_boot_arg("ctrc", &cputrace)) {
        trcWork.traceMask = (trcWork.traceMask & 0xFFFFFFF0) | (cputrace & 0xF);
    }

    // See if we have a nondefault trace-buffer size
    if (!PE_parse_boot_arg("tb", &trcWork.traceSize)) {
#if DEBUG
        trcWork.traceSize = 32;  // Default 32-page trace table for DEBUG
#else
        trcWork.traceSize = 8;   // Default 8-page trace table for RELEASE
#endif
    }

    // Adjust trace table size, if not within minimum/maximum limits
    if (trcWork.traceSize < 1)
        trcWork.traceSize = 1;   // Must be at least 1 page
    if (trcWork.traceSize > 256)
        trcWork.traceSize = 256; // Can be at most 256 pages

    // Convert from number of pages to number of bytes
    trcWork.traceSize = trcWork.traceSize * 4096;
    ...
}
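The two pieces of arithmetic in ppc_init(), the clamping of the trace table size and the narrowing of the trace mask to one processor, can be tried out in isolation. The sketch below extracts them into standalone functions with hypothetical names; the limits and masks are the ones shown in Figure 6-48.

```c
#include <stdint.h>

#define TRACE_PAGE_SIZE 4096U
#define TRACE_MIN_PAGES 1U
#define TRACE_MAX_PAGES 256U

/* Clamps a page count to [1, 256] and converts it to a byte count. */
uint32_t trace_table_bytes(uint32_t pages)
{
    if (pages < TRACE_MIN_PAGES)
        pages = TRACE_MIN_PAGES;
    if (pages > TRACE_MAX_PAGES)
        pages = TRACE_MAX_PAGES;
    return pages * TRACE_PAGE_SIZE;
}

/* Replaces the low four bits of the trace mask with the ctrc value. */
uint32_t limit_trace_mask(uint32_t mask, uint32_t cputrace)
{
    return (mask & 0xFFFFFFF0U) | (cputrace & 0xFU);
}
```

The default of 8 pages gives a 32,768-byte table, a tb value of 1000 is clamped to 256 pages (1,048,576 bytes), and limit_trace_mask(0xFFFFFFFF, 1) yields 0xFFFFFFF1.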

Later during the system startup, pmap_bootstrap() [osfmk/ppc/pmap.c] reserves physical memory for the trace table.

// osfmk/ppc/pmap.c

void
pmap_bootstrap(uint64_t memsize, vm_offset_t *first_avail, unsigned int kmapsize)
{
    ...
    trcWork.traceCurr = (unsigned int)addr;  // set first trace slot to use
    trcWork.traceStart = (unsigned int)addr; // set start of trace table
    trcWork.traceEnd = (unsigned int)addr + trcWork.traceSize; // set end
    ...
}

As we noted earlier, low tracing is performed through the CutTrace() system call, a firmware call that is also an ultra-fast trap: it is handled in osfmk/ppc/lowmem_vectors.s (in fact, before any other ultra-fast trap). The firmware call interface provides a stub, dbgTrace() [osfmk/ppc/Firmware.s], to call CutTrace().

; osfmk/ppc/Firmware.s
;
; dbgTrace(traceID, item1, item2, item3, item4)
;
            .align 5
            .globl EXT(dbgTrace)
LEXT(dbgTrace)
            mr    r2,r3                    ; trace ID
            mr    r3,r4                    ; item1
            lis   r0,HIGH_ADDR(CutTrace)   ; top half of firmware call number
            mr    r4,r5                    ; item2
            mr    r5,r6                    ; item3
            ori   r0,r0,LOW_ADDR(CutTrace) ; bottom half
            mr    r6,r7                    ; item4
            sc                             ; invoke the system call
            blr                            ; done
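The lis and ori pair in the stub builds a 32-bit constant in GPR0 from two 16-bit immediates. The following C sketch emulates that construction. It assumes that HIGH_ADDR() and LOW_ADDR() select the upper and lower 16 bits of their argument, which the listing suggests but does not spell out; all function names are hypothetical.

```c
#include <stdint.h>

/* Upper 16 bits of a 32-bit value (assumed role of HIGH_ADDR). */
uint32_t high_half(uint32_t value)
{
    return value >> 16;
}

/* Lower 16 bits of a 32-bit value (assumed role of LOW_ADDR). */
uint32_t low_half(uint32_t value)
{
    return value & 0xFFFFU;
}

/* Emulates "lis r0,high" followed by "ori r0,r0,low" on a 32-bit register. */
uint32_t lis_ori(uint32_t high, uint32_t low)
{
    uint32_t r0 = (high & 0xFFFFU) << 16;  /* lis: load immediate shifted */
    r0 |= low & 0xFFFFU;                   /* ori: OR in the low half */
    return r0;
}
```

For the lowest-numbered firmware call, 0x80000000, the halves are 0x8000 and 0x0000, and recombining them with lis_ori() gives back 0x80000000.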

Various portions of the kernel add low-trace records either by calling dbgTrace() or by directly invoking the CutTrace() system call. In the latter case, the trace identifier is passed in GPR2. Figure 6-49 shows an example of kernel code creating a low-trace record.

Figure 6-49. An example of low-trace record generation by kernel code

// osfmk/ipc/ipc_kmsg.c

mach_msg_return_t
ipc_kmsg_get(mach_vm_address_t msg_addr,
             mach_msg_size_t size,
             ipc_kmsg_t *kmsgp)
{
    ...
#ifdef ppc
    if (trcWork.traceMask)
        dbgTrace(0x1100,
                 (unsigned int)kmsg->ikm_header->msgh_id,
                 (unsigned int)kmsg->ikm_header->msgh_remote_port,
                 (unsigned int)kmsg->ikm_header->msgh_local_port,
                 0);
#endif
    ...
}

A low-trace record is a structure of type LowTraceRecord, which is declared in osfmk/ppc/low_trace.h. Information contained in a LowTraceRecord structure includes the following:

Processor number
Exception code
Upper and lower parts of the Timebase Register
Contents of the following registers: CR, DSISR, SRR0, SRR1, DAR, LR, CTR, and GPRs 0 through 6
Save area

A convenient way to view low-trace records is through the built-in kernel debugger KDB, whose lt command formats and displays these records.

db{0}> lt 0
...
 00ADEA80  0  00000002 FD6D0959 - 0C00
              DAR/DSR/CR: 00000000A000201C 40000000 84024A92
                SRR0/SRR1 00000000000D6D00 1000000000001030
                LR/CTR    00000000000D61F4 00000000000344A8
                R0/R1/R2  FFFFFFFF80000000 000000001759BD00 0000000000004400
                R3/R4/R5  0000000002626E60 000000000002CD38 0000000001E5791C
              R6/sv/rsv   000000002FD78780 0000000000000000 00000000
...