8.2. An Overview of Mac OS X Memory Management

Besides the Mach-based core VM subsystem, memory management in Mac OS X encompasses several other mechanisms, some of which are not strictly parts of the VM subsystem but are closely related nonetheless.

Figure 8-1 shows an overview of key VM and VM-related components in Mac OS X. Let us briefly look at each of them in this section. The rest of the chapter discusses these components in detail.

The Mach VM subsystem consists of the machine-dependent physical map (pmap) module and other, machine-independent modules for managing data structures corresponding to abstractions such as virtual address space maps (VM maps), VM objects, named entries, and resident pages. The kernel exports several routines to user space as part of the Mach VM API.
The kernel uses the universal page list (UPL) data structure to describe a bounded set of physical pages. A UPL is created based on the pages associated with a VM object. It can also be created for an object underlying an address range in a VM map. UPLs include various attributes of the pages they describe. Kernel subsystems, particularly file systems, use UPLs while communicating with the VM subsystem.
The unified buffer cache (UBC) is a pool of pages for caching the contents of files and the anonymous portions of task address spaces. Anonymous memory is not backed by regular files, devices, or some other named source of memory; the most common example is dynamically allocated memory. The "unification" in the UBC comes from a single pool being used for file-backed and anonymous memory.

The kernel includes three kernel-internal pagers, namely, the default (anonymous) pager, the device pager, and the vnode pager. These handle page-in and page-out operations over memory regions. The pagers communicate with the Mach VM subsystem using UPL interfaces and derivatives of the Mach pager interfaces.

Vnode

As we will see in Chapter 11, a vnode (virtual node) is a file-system-independent abstraction of a file system object, very much analogous to an abstract base class from which file-system-specific instances are derived. Each active file or directory (where "active" has context-dependent connotations) has an in-memory vnode.

The device pager, which handles device memory, is implemented in the I/O Kit portion of the kernel. On 64-bit hardware, the device pager uses a part of the memory controller, the Device Address Resolution Table (DART), that is enabled by default on such hardware. The DART maps addresses from 64-bit memory into the 32-bit address space of PCI devices.
The page-out daemon is a set of kernel threads that write portions of task address spaces to disk as part of the paging operation in virtual memory. It examines the usage of resident pages and employs an LRU^[2]-like scheme to page out those pages that have not been used for over a certain time.
^[2] Least recently used.
The dynamic_pager user-space program creates and deletes swap files for the kernel's use. The "pager" in its name notwithstanding, dynamic_pager does not perform any paging operations.
The update user-space daemon periodically invokes the sync() system call to flush file system caches to disk.
The task working set (TWS) detection subsystem maintains profiles of page-fault behaviors of tasks on a per-application basis. When an application causes a page fault, the kernel's page-fault-handling mechanism consults this subsystem to determine which additional pages, if any, should be paged in. Usually the additional pages are adjacent to those being faulted in. The goal is to improve performance by speculatively making resident the pages that may be needed soon.
The kernel provides several memory allocation mechanisms, some of which are subsystem-specific wrappers around others. All such mechanisms eventually use the kernel's page-level allocator. User-space memory allocation schemes are built atop the Mach VM API.
The Shared Memory Server subsystem is a kernel service that provides two globally shared memory regions: one for text (starting at user virtual address 0x9000_0000) and the other for data (starting at user virtual address 0xA000_0000). Both regions are 256MB in size. The text region is read-only and is completely shared between tasks. The data region is shared copy-on-write. The dynamic link editor (dyld) uses this mechanism to load shared libraries into task address spaces.
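The address ranges of the two shared regions follow directly from the numbers given above. The following is a minimal sketch, not kernel code: the names `classify_address` and `enum shared_region` are hypothetical, and only the base addresses and the 256MB size come from the text.

```c
#include <assert.h>
#include <stdint.h>

/* Values taken from the text: two globally shared 256MB regions. */
#define SHARED_TEXT_BASE   0x90000000u
#define SHARED_DATA_BASE   0xA0000000u
#define SHARED_REGION_SIZE 0x10000000u   /* 256MB */

/* Hypothetical classification; these names are not part of any kernel API. */
enum shared_region {
    REGION_NONE,
    REGION_SHARED_TEXT,   /* read-only, completely shared between tasks */
    REGION_SHARED_DATA    /* shared copy-on-write */
};

/* Report which shared region, if any, a 32-bit user virtual address falls in. */
enum shared_region classify_address(uint32_t addr)
{
    /* With these values the data region begins where the text region ends. */
    assert(SHARED_TEXT_BASE + SHARED_REGION_SIZE == SHARED_DATA_BASE);

    if (addr >= SHARED_TEXT_BASE && addr - SHARED_TEXT_BASE < SHARED_REGION_SIZE)
        return REGION_SHARED_TEXT;
    if (addr >= SHARED_DATA_BASE && addr - SHARED_DATA_BASE < SHARED_REGION_SIZE)
        return REGION_SHARED_DATA;
    return REGION_NONE;
}
```

Under these assumptions, an address such as 0x90001000 is classified as shared text, and 0xB0000000 lies just past the end of the data region.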

Figure 8-1. An overview of the Mac OS X memory subsystem
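To make the page-out daemon's policy more concrete, here is a deliberately simplified sketch of an age-threshold sweep, which is one way to approximate an LRU-like scheme. It is not the kernel's algorithm: `struct page_info`, `page_out_idle`, and the notion of a single tick counter are assumptions made purely for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified page descriptor; not the kernel's page structure. */
struct page_info {
    uint32_t id;
    uint64_t last_used;   /* tick at which the page was last referenced */
    int      resident;    /* nonzero if the page is currently in memory */
};

/*
 * Sweep the pages and mark as non-resident every resident page that has
 * not been referenced for more than max_idle ticks. Returns the number of
 * pages selected for page-out.
 */
size_t page_out_idle(struct page_info *pages, size_t n,
                     uint64_t now, uint64_t max_idle)
{
    size_t evicted = 0;

    assert(pages != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        if (!pages[i].resident || pages[i].last_used > now)
            continue;   /* already paged out, or timestamp is in the future */
        if (now - pages[i].last_used > max_idle) {
            pages[i].resident = 0;   /* a real pager would write the page out here */
            evicted++;
        }
    }
    return evicted;
}
```

A real implementation would also have to deal with dirty versus clean pages, wired pages, and reference bits maintained by hardware, all of which this sketch ignores.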

8.2.1. Reading Kernel Memory from User Space

Let us look at a couple of ways of reading kernel memory; these are useful in examining kernel data structures from user space.

8.2.1.1. `dd` and `/dev/kmem`

The Mac OS X kernel provides the /dev/kmem character device, which can be used to read kernel virtual memory from user space. The device driver for this pseudo-device does not allow memory at addresses below VM_MIN_KERNEL_ADDRESS (4096) to be read; that is, the page at address 0 cannot be read.

Recall that we used the dd command in Chapter 7 to sample the sched_tick kernel variable by reading from /dev/kmem. In this chapter, we will again read from this device to retrieve the contents of kernel data structures. Let us generalize our dd-based technique so we can read kernel memory at a given address or at the address of a given kernel symbol. Figure 8-2 shows a shell script that accepts a symbol name or an address in hexadecimal, attempts to read the corresponding kernel memory, and, if successful, displays the memory on the standard output. By default, the program pipes raw memory bytes through the hexdump program using hexdump's -x (hexadecimal output) option. If the -raw option is specified, the program prints raw memory on the standard output, which is desirable if you wish to pipe it through another program yourself.

Figure 8-2. A shell script for reading kernel virtual memory

    #!/bin/sh
    #
    # readksym.sh

    PROGNAME=readksym

    if [ $# -lt 2 ]
    then
        echo "usage: $PROGNAME <symbol>  <bytes to read> [hexdump option|-raw]"
        echo "       $PROGNAME <address> <bytes to read> [hexdump option|-raw]"
        exit 1
    fi

    SYMBOL=$1                 # first argument is a kernel symbol
    SYMBOL_ADDR=$1            # or a kernel address in hexadecimal
    IS_HEX=${SYMBOL_ADDR:0:2} # get the first two characters
    NBYTES=$2                 # second argument is the number of bytes to read
    HEXDUMP_OPTION=${3:--x}   # by default, we pass '-x' to hexdump
    RAW="no"                  # by default, we don't print memory as "raw"

    if [ ${HEXDUMP_OPTION:0:2} == "-r" ]
    then
        RAW="yes" # raw... don't pipe through hexdump -- print as is
    fi

    KERN_SYMFILE=`sysctl -n kern.symfile | tr '\\' '/'` # typically /mach.sym
    if [ X"$KERN_SYMFILE" == "X" ]
    then
        echo "failed to determine the kernel symbol file's name"
        exit 1
    fi

    if [ "$IS_HEX" != "0x" ]
    then
        # use nm to determine the address of the kernel symbol
        SYMBOL_ADDR="0x`nm $KERN_SYMFILE | grep -w $SYMBOL | awk '{print $1}'`"
    fi

    if [ "$SYMBOL_ADDR" == "0x" ] # at this point, we should have an address
    then
        echo "address of $SYMBOL not found in $KERN_SYMFILE"
        exit 1
    fi

    if [ ${HEXDUMP_OPTION:0:2} == "-r" ] # raw... no hexdump
    then
        dd if=/dev/kmem bs=1 count=$NBYTES iseek=$SYMBOL_ADDR of=/dev/stdout \
            2>/dev/null
    else
        dd if=/dev/kmem bs=1 count=$NBYTES iseek=$SYMBOL_ADDR of=/dev/stdout \
            2>/dev/null | hexdump $HEXDUMP_OPTION
    fi

    exit 0

    $ sudo ./readksym.sh 0x5000 8 -c # string seen only on the PowerPC
    0000000   H   a   g   f   i   s   h
    0000008
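The core of the script is a positioned read: seek to a byte offset in a file and read a fixed number of bytes. The same step can be sketched in C with pread(). The helper name `read_at` is hypothetical, and whether a particular /dev/kmem driver honors pread() exactly as it does dd's seek is an assumption; the sketch works on any seekable file.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read up to nbytes from the file at path, starting at the given byte
 * offset. This mirrors "dd bs=1 iseek=<offset> count=<nbytes>".
 * Returns the number of bytes read, or -1 on error.
 */
ssize_t read_at(const char *path, off_t offset, void *buf, size_t nbytes)
{
    int     fd;
    ssize_t n;

    assert(path != NULL && buf != NULL);

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    n = pread(fd, buf, nbytes, offset);   /* positioned read; no separate lseek() */
    close(fd);

    return n;
}
```

On a system that provides /dev/kmem, a call such as `read_at("/dev/kmem", 0x5000, buf, 8)` would correspond to the sample invocation of the script, and would likewise require superuser privileges.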

8.2.1.2. The `kvm(3)` Interface

Mac OS X also provides the kvm(3) interface for accessing kernel memory. It includes the following functions:

kvm_read(): read from kernel memory
kvm_write(): write to kernel memory
kvm_getprocs(), kvm_getargv(), kvm_getenvv(): retrieve user process state
kvm_nlist(): retrieve kernel symbol table names

Figure 8-3 shows an example of using the kvm(3) interface.

Figure 8-3. Using the `kvm(3)` interface to read kernel memory

    // kvm_hagfish.c

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <kvm.h>

    #define TARGET_ADDRESS (u_long)0x5000
    #define TARGET_NBYTES  (size_t)7
    #define PROGNAME        "kvm_hagfish"

    int
    main(void)
    {
        kvm_t *kd;
        char buf[8] = { '\0' };

        kd = kvm_open(NULL,      // kernel executable; use default
                      NULL,      // kernel memory device; use default
                      NULL,      // swap device; use default
                      O_RDONLY,  // flags
                      PROGNAME); // error prefix string
        if (!kd)
            exit(1);

        if (kvm_read(kd, TARGET_ADDRESS, buf, TARGET_NBYTES) != TARGET_NBYTES)
            perror("kvm_read");
        else
            printf("%s\n", buf);

        kvm_close(kd);

        exit(0);
    }

    $ gcc -Wall -o kvm_hagfish kvm_hagfish.c # string seen only on the PowerPC
    $ sudo ./kvm_hagfish
    Hagfish
    $

Raw Kernel Memory Access: Caveats

Exchanging information with the kernel by having raw access to its memory is unsatisfactory for several reasons. To begin with, a program must know the actual names, sizes, and formats of kernel structures. If these change across kernel versions, the program would need to be recompiled and perhaps even modified. Besides, it is cumbersome to access complicated data structures. Consider a linked list of deep structures, that is, structures with one or more fields that are pointers. To read such a list, a program must read each element individually and then must separately read the data referenced by the pointer fields. It would also be difficult for the kernel to guarantee the consistency of such information.
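The cost of reading a list of deep structures can be seen in a small sketch. Everything here is hypothetical: `struct knode` is an invented layout, `kread_fn` stands in for whatever primitive reads kernel memory (such as kvm_read() or a read from /dev/kmem), and `fake_kread` simulates kernel memory with a byte array so that the sketch can be run anywhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NAME_LEN 8

/* Hypothetical read primitive: copy n bytes at kernel address addr into buf. */
typedef int (*kread_fn)(void *ctx, uint32_t addr, void *buf, size_t n);

/* Invented in-kernel layout; the pointer fields hold kernel addresses. */
struct knode {
    uint32_t next;    /* address of the next knode; 0 terminates the list */
    uint32_t name;    /* address of a NAME_LEN-byte name buffer */
    uint32_t value;
};

/*
 * Walk the list starting at head. Each element costs two reads: one for
 * the node itself and one for the data its pointer field references.
 * Returns the number of nodes copied out, or -1 if any read fails.
 */
int walk_list(kread_fn kread, void *ctx, uint32_t head,
              char names[][NAME_LEN], uint32_t *values, int max)
{
    int      count = 0;
    uint32_t addr = head;

    assert(kread != NULL && names != NULL && values != NULL);

    while (addr != 0 && count < max) {
        struct knode node;

        if (kread(ctx, addr, &node, sizeof(node)) != 0)           /* the element */
            return -1;
        if (kread(ctx, node.name, names[count], NAME_LEN) != 0)   /* referenced data */
            return -1;
        names[count][NAME_LEN - 1] = '\0';   /* never trust the remote terminator */
        values[count] = node.value;
        count++;
        addr = node.next;
    }
    return count;
}

/* Simulated kernel memory: a byte array that starts at address base. */
struct fake_kmem {
    const uint8_t *bytes;
    uint32_t       base;
    uint32_t       size;
};

int fake_kread(void *ctx, uint32_t addr, void *buf, size_t n)
{
    const struct fake_kmem *m = ctx;

    if (addr < m->base || n > m->size || addr - m->base > m->size - n)
        return -1;   /* out of range */
    memcpy(buf, m->bytes + (addr - m->base), n);
    return 0;
}
```

Note that nothing in this sketch prevents the list from changing between the individual reads, which is the consistency problem described above.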

Moreover, the information sought by a user program must be either kernel-resident in its final form (i.e., the kernel must compute it), or it must be computed from its components by the program. The former requires the kernel to know about the various types of information user programs might need, precompute it, and store it. The latter does not guarantee consistency and requires additional hardcoded logic in the program.

Direct user-program access to all kernel memory may also be a security and stability concern, even though such access normally requires superuser privileges. It is difficult to both specify and enforce limits on the accessibility of certain parts of kernel memory. In particular, the kernel cannot do sanity checking of the data that is written to its raw memory.

Several approaches have been used in operating systems to address these issues. The sysctl() system call was introduced in 4.4BSD as a safe, reliable, and portable (across kernel versions) way to perform user-kernel data exchange. The Plan 9 operating system extended the file metaphor to export services (such as I/O devices, network interfaces, and the windowing system) as files. With these services, one could perform file I/O for most things that would require access to /dev/kmem on traditional systems. The /proc file system uses the file metaphor to provide both a view of currently running processes and an interface to control them. Linux extended the concept further by providing formatted I/O to files in /proc. For example, kernel parameters can be modified by writing strings to the appropriate files; the Linux kernel will parse, validate, and accept or reject the information. Newer versions of Linux provide sysfs, which is another in-memory file system used to export kernel data structures, their properties, and interconnections to user space.
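The parse, validate, and accept-or-reject step is what raw memory writes lack. The following sketch shows the general idea in user-space C; it is not Linux kernel code, and the function name `set_param_from_string` and the notion of a simple numeric range are assumptions for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Parse text as a decimal integer and store it in *param only if it is
 * well formed and lies within [min, max]. Returns 0 if the value was
 * accepted and -1 if it was rejected, in which case *param is untouched.
 */
int set_param_from_string(const char *text, long min, long max, long *param)
{
    char *end;
    long  v;

    assert(text != NULL && param != NULL);

    errno = 0;
    v = strtol(text, &end, 10);
    if (end == text || errno == ERANGE)
        return -1;                  /* not a number, or out of range for long */

    while (*end == ' ' || *end == '\t' || *end == '\n')
        end++;                      /* tolerate trailing whitespace */
    if (*end != '\0')
        return -1;                  /* trailing junk */

    if (v < min || v > max)
        return -1;                  /* outside the permitted range */

    *param = v;
    return 0;
}
```

With a permitted range of 1 to 1024, the string "128\n" would be accepted, whereas "4096" and "12x" would be rejected and leave the current value in place.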

8.2.2. Querying Physical Memory Size

The size of physical memory on a system can be programmatically determined through the sysctl() or sysctlbyname() functions. Figure 8-4 shows an example. Note that the retrieved size is the value of the max_mem kernel variable, which, as we saw in earlier chapters, can be artificially limited.

Figure 8-4. Determining the size of physical memory on a system

    // hw_memsize.c

    #include <stdio.h>
    #include <sys/sysctl.h>

    int
    main(void)
    {
        int                ret;
        unsigned long long memsize;
        size_t             len = sizeof(memsize);

        if (!(ret = sysctlbyname("hw.memsize", &memsize, &len, NULL, 0)))
            printf("%llu MB\n", (memsize >> 20ULL)); // memsize is unsigned
        else
            perror("sysctlbyname");

        return ret;
    }

    $ gcc -Wall -o hw_memsize hw_memsize.c
    $ ./hw_memsize
    4096 MB
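The same value can also be requested through sysctl() with a numeric name rather than a string. The sketch below uses the CTL_HW and HW_MEMSIZE identifiers from <sys/sysctl.h> when compiled on Mac OS X. The function name `phys_mem_bytes` is hypothetical, and the fallback branch for other systems, which relies on the widely available but non-POSIX `_SC_PHYS_PAGES` sysconf() name, is an assumption added only so that the sketch compiles elsewhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

#if defined(__APPLE__)
#include <sys/types.h>
#include <sys/sysctl.h>
#endif

/* Return the size of physical memory in bytes, or 0 if it cannot be determined. */
uint64_t phys_mem_bytes(void)
{
#if defined(__APPLE__)
    int      mib[2] = { CTL_HW, HW_MEMSIZE };   /* numeric form of "hw.memsize" */
    uint64_t memsize = 0;
    size_t   len = sizeof(memsize);

    if (sysctl(mib, 2, &memsize, &len, NULL, 0) != 0)
        return 0;
    assert(len == sizeof(memsize));             /* hw.memsize is a 64-bit value */
    return memsize;
#else
    /* Assumption: _SC_PHYS_PAGES is an extension, not required by POSIX. */
    long pages = sysconf(_SC_PHYS_PAGES);
    long page_size = sysconf(_SC_PAGESIZE);

    if (pages <= 0 || page_size <= 0)
        return 0;
    return (uint64_t)pages * (uint64_t)page_size;
#endif
}
```

Shifting the result right by 20 bits, as the program in Figure 8-4 does, converts it to megabytes.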