8.16. Memory Allocation in the Kernel

Figure 8-43 shows an overview of kernel-level memory allocation functions in Mac OS X. The numerical labels are rough indicators of how low-level each group of functions is. For example, page-level allocation, which is labeled with the lowest number, is the lowest-level allocation mechanism, since it allocates memory directly from the list of free pages in the Mach VM subsystem.

Figure 8-43. An overview of memory allocation in the Mac OS X kernel

Figure 8-44 shows an overview of kernel-level memory deallocation functions.

Figure 8-44. An overview of memory deallocation in the Mac OS X kernel

8.16.1. Page-Level Allocation

Page-level allocation is performed in the kernel by vm_page_alloc() [osfmk/vm/vm_resident.c]. This function requires a VM object and an offset as arguments. It then attempts to allocate a page associated with the VM object/offset pair. The VM object can be the kernel VM object (kernel_object), or it can be a newly allocated VM object.

vm_page_alloc() first calls vm_page_grab() [osfmk/vm/vm_resident.c] to remove a page from the free list. If the free list is too small, vm_page_grab() fails, returning a VM_PAGE_NULL. However, if the requesting thread is a VM-privileged thread, vm_page_grab() consumes a page from the reserved pool. If there are no reserved pages available, vm_page_grab() waits for a page to become available.

If vm_page_grab() returns a valid page, vm_page_alloc() calls vm_page_insert() [osfmk/vm/vm_resident.c] to insert the page into the hash table that maps VM object/offset pairs to pages, that is, the virtual-to-physical (VP) table. The VM object's resident page count is also incremented.

kernel_memory_allocate() [osfmk/vm/vm_kern.c] is the master entry point for allocating kernel memory in that most but not all pathways to memory allocation go through this function.

    kern_return_t
    kernel_memory_allocate(
        vm_map_t     map,    // the VM map to allocate into
        vm_offset_t *addrp,  // pointer to start of new memory
        vm_size_t    size,   // size to allocate (rounded up to a page size multiple)
        vm_offset_t  mask,   // mask specifying a particular alignment
        int          flags); // KMA_HERE, KMA_NOPAGEWAIT, KMA_KOBJECT

The flag bits are used as follows:

If KMA_HERE is set, the address pointer contains the base address to use; otherwise, the caller doesn't care where the memory is allocated. For example, if the caller has a newly created submap that the caller knows is empty, the caller may want to allocate memory at the beginning of the map.
If KMA_NOPAGEWAIT is set, the function does not wait for pages if memory is not available.
If KMA_KOBJECT is set, the function uses the kernel VM object (kernel_object); otherwise, a new VM object is allocated.

kernel_memory_allocate() calls vm_map_find_space() [osfmk/vm/vm_map.c] to find and allocate a virtual address range in the VM map. A new VM map entry is initialized as a result. As shown in Figure 8-43, kernel_memory_allocate() calls vm_page_alloc() to allocate pages. If the VM object is newly allocated, it passes a zero offset to vm_page_alloc(). If the kernel object is being used, the offset is the difference between the address returned by vm_map_find_space() and the minimum kernel address (VM_MIN_KERNEL_ADDRESS, defined to be 0x1000 in osfmk/mach/ppc/vm_param.h).
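The size rounding and the kernel-object offset computation described above can be sketched in user-space C. This is only an illustration: it assumes a 4KB page size, and the sketch_ names are hypothetical helpers, not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE       4096u    /* assumed page size */
#define SKETCH_MIN_KERNEL_ADDR 0x1000u  /* VM_MIN_KERNEL_ADDRESS, per the text */

/* Round a requested size up to a multiple of the page size. */
static uint32_t sketch_round_page(uint32_t size)
{
    return (size + SKETCH_PAGE_SIZE - 1) & ~(SKETCH_PAGE_SIZE - 1);
}

/* Offset into the kernel object for a kernel virtual address:
 * the address minus the minimum kernel address. */
static uint32_t sketch_kobject_offset(uint32_t addr)
{
    assert(addr >= SKETCH_MIN_KERNEL_ADDR);
    return addr - SKETCH_MIN_KERNEL_ADDR;
}
```

For instance, a 1-byte request rounds up to one full page, and an address of 0x5000 corresponds to offset 0x4000 in the kernel object.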

8.16.2. kmem_alloc

The kmem_alloc family of functions is implemented in osfmk/vm/vm_kern.c. These functions are intended for use in the Mach portion of the kernel.

    kern_return_t
    kmem_alloc(vm_map_t map, vm_offset_t *addrp, vm_size_t size);

    kern_return_t
    kmem_alloc_wired(vm_map_t map, vm_offset_t *addrp, vm_size_t size);

    kern_return_t
    kmem_alloc_aligned(vm_map_t map, vm_offset_t *addrp, vm_size_t size);

    kern_return_t
    kmem_alloc_pageable(vm_map_t map, vm_offset_t *addrp, vm_size_t size);

    kern_return_t
    kmem_alloc_contig(vm_map_t map, vm_offset_t *addrp,
                      vm_size_t size, vm_offset_t mask, int flags);

    kern_return_t
    kmem_realloc(vm_map_t map, vm_offset_t oldaddr, vm_size_t oldsize,
                 vm_offset_t *newaddrp, vm_size_t newsize);

    void
    kmem_free(vm_map_t map, vm_offset_t addr, vm_size_t size);

kmem_alloc() simply forwards its arguments to kernel_memory_allocate() and also sets the latter's mask and flags parameters to 0 each.
kmem_alloc_wired() simply forwards its arguments to kernel_memory_allocate() and also sets the latter's mask and flags parameters to 0 and KMA_KOBJECT, respectively. Consequently, memory is allocated in the kernel object, in either the kernel's map or a submap. The memory is not zero-filled.
kmem_alloc_aligned() simply forwards its arguments to kernel_memory_allocate() after ensuring that the requested allocation size is a power of 2. Additionally, it sets the latter's flags parameter to KMA_KOBJECT and the mask parameter to (size -1), where size is the requested allocation size.
kmem_alloc_pageable() allocates pageable kernel memory in the given address map. It only calls vm_map_enter() to allocate a range in the given VM map. In particular, it does not back the range with physical memory. The execve() system call implementation uses this function to allocate memory in the BSD pageable map (bsd_pageable_map) for execve() arguments.
kmem_alloc_contig() allocates physically contiguous, wired kernel memory. The I/O Kit uses this function.
kmem_realloc() reallocates wired kernel memory given a region that is already allocated using kmem_alloc().
kmem_free() releases allocated kernel memory.

Except kmem_alloc_pageable(), all kmem_alloc functions allocate wired memory.
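The power-of-2 check and the (size - 1) mask that kmem_alloc_aligned() passes along can be illustrated with a small user-space sketch. The helper names are hypothetical, and the interpretation of the mask (an address is acceptable when none of the mask bits are set in it) is an assumption made for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Nonzero if size is a nonzero power of 2. */
static int sketch_is_power_of_2(uint32_t size)
{
    return size != 0 && (size & (size - 1)) == 0;
}

/* Alignment mask for a power-of-2 size, as described in the text. */
static uint32_t sketch_alignment_mask(uint32_t size)
{
    assert(sketch_is_power_of_2(size));
    return size - 1;
}

/* Lowest address >= addr with no mask bits set (assumed mask semantics). */
static uint32_t sketch_align_up(uint32_t addr, uint32_t mask)
{
    return (addr + mask) & ~mask;
}
```

With an 8KB request, the mask is 0x1FFF, so a candidate address of 0x1001 would be bumped to 0x2000, whereas 0x4000 is already suitably aligned.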

8.16.3. The Mach Zone Allocator

The Mach zone allocator is a fast memory allocation mechanism with garbage collection. As shown in Figure 8-43, several allocation functions in the kernel directly or indirectly use the zone allocator.

A zone is a collection of fixed-size memory blocks that are accessible through an efficient interface for allocation and deallocation. The kernel typically creates a zone for each class of data structure to be managed. Examples of data structures for which the Mac OS X kernel creates individual zones include the following:

Asynchronous I/O work queue entries (struct aio_workq_entry)
Alarms (struct alarm) and timer data (mk_timer_data_t)
Kernel audit records (struct kaudit_record)
Kernel notifications (struct knote)
Tasks (struct task), threads (struct thread), and uthreads (struct uthread)
Pipes (struct pipe)
Semaphores (struct semaphore)
Buffer headers (struct buf) and metadata buffers
Various protocol control blocks in the network stack
Unified buffer cache "info" structures (struct ubc_info)
Vnode pagers (struct vnode_pager) and device pagers (struct device_pager)
Mach VM data structures, such as VM maps (struct vm_map), VM map entries (struct vm_map_entry), VM map copy objects (struct vm_map_copy), VM objects (struct vm_object), VM object hash entries (struct vm_object_hash_entry), and pages (struct vm_page)
Mach IPC data structures, such as IPC spaces (struct ipc_space), IPC tree entries (struct ipc_tree_entry), ports (struct ipc_port), port sets (struct ipc_pset), and IPC messages (ipc_kmsg_t)

The host_zone_info() Mach routine retrieves information about Mach zones from the kernel. It returns an array of zone names and another array of zone_info structures [<mach_debug/zone_info.h>]. The zprint command-line program uses host_zone_info() to retrieve and display information about all zones in the kernel.

    $ zprint
                               elem    cur    max    cur    max    cur  alloc  alloc
    zone name                  size   size   size  #elts  #elts  inuse   size  count
    --------------------------------------------------------------------------------
    zones                        80    11K    12K    152    153     89     4K     51
    vm.objects                  136  6562K  8748K  49410  65867  39804     4K     30 C
    vm.object.hash.entries       20   693K   768K  35496  39321  24754     4K    204 C
    ...
    pmap_mappings                64 25861K 52479K 413789 839665 272627     4K     64 C
    kalloc.large              59229  2949K  4360K     51     75     51    57K      1

Note that zprint's output includes the size of an object in each zone (the elem size column). You can pipe zprint's output through the sort command to see that several zones have the same element sizes. A single physical page is never shared between two or more zones. In other words, all zone-allocated objects on a physical page will be of the same type.

    $ zprint | sort +1 -n
    ...
    alarms                       44     3K     4K     93     93      1     4K     93 C
    kernel.map.entries           44  4151K  4152K  96628  96628   9582     4K     93
    non-kernel.map.entries       44  1194K  1536K  27807  35746  18963     4K     93 C
    semaphores                   44    35K  1092K    837  25413    680     4K     93 C
    vm.pages                     44 32834K     0K 764153      0 763069     4K     93 C
    ...

A zone is described in the kernel by a zone structure (struct zone).

    // osfmk/kern/zalloc.h

    struct zone {
        int         count;          // number of elements used now
        vm_offset_t free_elements;
        decl_mutex_data(,lock);     // generic lock
        vm_size_t   cur_size;       // current memory utilization
        vm_size_t   max_size;       // how large this zone can grow
        vm_size_t   elem_size;      // size of an element
        vm_size_t   alloc_size;     // chunk size for more memory
        char        *zone_name;     // string describing the zone
        ...
        struct zone *next_zone;     // link for all-zones list
        ...
    };

A new zone is initialized by calling zinit(), which returns a pointer to a newly created zone structure (zone_t). Various subsystems use zinit() to initialize the zones they need.

    zone_t
    zinit(vm_size_t   size,  // size of each object
          vm_size_t   max,   // maximum size in bytes the zone may reach
          vm_size_t   alloc, // allocation size
          const char *name); // a string that describes the objects in the zone

The allocation size specified in the zinit() call is the amount of memory to add to the zone each time the zone becomes empty, that is, when there are no free elements on the zone's free list. The allocation size is automatically rounded up to an integral number of pages. Note that zone structures are themselves allocated from a zone of zones (zone_zone). When the zone allocator is initialized during kernel bootstrap, it calls zinit() to initialize the zone of zones. zinit() treats this initialization specially: It calls zget_space() [osfmk/kern/zalloc.c] to allocate contiguous, nonpaged space through the master kernel memory allocator (kernel_memory_allocate() [osfmk/vm/vm_kern.c]). Other calls to zinit() allocate zone structures from the zone of zones through zalloc() [osfmk/kern/zalloc.c].

    // osfmk/kern/zalloc.c

    // zone data structures are themselves stored in a zone
    zone_t zone_zone = ZONE_NULL;

    zone_t
    zinit(vm_size_t size, vm_size_t max, vm_size_t alloc, const char *name)
    {
        zone_t z;

        if (zone_zone == ZONE_NULL) {
            // first call: the zone of zones does not exist yet
            if (zget_space(sizeof(struct zone), (vm_offset_t *)&z)
                != KERN_SUCCESS)
                return(ZONE_NULL);
        } else
            z = (zone_t)zalloc(zone_zone);

        // initialize various fields of the newly allocated zone structure

        thread_call_setup(&z->call_async_alloc, zalloc_async, z);

        // add the zone structure to the end of the list of all zones

        return(z);
    }

    void
    zone_bootstrap(void)
    {
        ...
        // this is the first call to zinit()
        zone_zone = zinit(sizeof(struct zone), 128 * sizeof(struct zone),
                          sizeof(struct zone), "zones");

        // this zone's empty pages will not be garbage collected
        zone_change(zone_zone, Z_COLLECT, FALSE);
        ...
    }

zinit() populates the various fields of a newly allocated zone structure. In particular, it sets the zone's current size to 0 and the zone's free list to NULL. Therefore, at this point, the zone's memory pool is empty. Before returning, zinit() arranges for zalloc_async() [osfmk/kern/zalloc.c] to run by setting up a callout. zalloc_async() attempts to allocate a single element from the empty zone, which causes memory to be allocated for the zone. zalloc_async() then immediately frees the dummy allocation.

    // osfmk/kern/zalloc.c

    void
    zalloc_async(thread_call_param_t p0, __unused thread_call_param_t p1)
    {
        void *elt;

        elt = zalloc_canblock((zone_t)p0, TRUE);
        zfree((zone_t)p0, elt);
        lock_zone((zone_t)p0);
        ((zone_t)p0)->async_pending = FALSE;
        unlock_zone((zone_t)p0);
    }

The zone allocator exports several functions for memory allocation, deallocation, and zone configuration. Figure 8-45 shows the important functions.

Figure 8-45. Zone allocator functions

    // Allocate an element from the specified zone
    void *zalloc(zone_t zone);

    // Allocate an element from the specified zone without blocking
    void *zalloc_noblock(zone_t zone);

    // A special version of a nonblocking zalloc() that does not block
    // even for locking the zone's mutex: It will return an element only
    // if it can get it from the zone's free list
    void *zget(zone_t zone);

    // Free a zone element
    void zfree(zone_t zone, void *elem);

    // Add ("cram") the given memory to the given zone
    void zcram(zone_t zone, void *newmem, vm_size_t size);

    // Fill the zone with enough memory for at least the given number of elements
    int zfill(zone_t zone, int nelem);

    // Change zone parameters (must be called immediately after zinit())
    void zone_change(zone_t zone, unsigned int item, boolean_t value);

    // Preallocate wired memory for the given zone from zone_map, expanding the
    // zone to the given size
    void zprealloc(zone_t zone, vm_size_t size);

    // Return a hint for the current number of free elements in the zone
    integer_t zone_free_count(zone_t zone);

The zone_change() function allows the following Boolean flags to be modified for a zone.

Z_EXHAUST If this flag is true, the zone is exhaustible, and an allocation attempt simply returns if the zone is empty. This flag is false by default.
Z_COLLECT If this flag is true, the zone is collectable: Its empty pages are garbage collected. This flag is true by default.
Z_EXPAND If this flag is true, the zone is expandable: It can be grown by sending an IPC message. This flag is true by default.
Z_FOREIGN If this flag is true, the zone can contain foreign objects, that is, those objects that are not allocated through zalloc(). This flag is false by default.

The typical kernel usage of zalloc() is blocking, that is, the caller is willing to wait if memory is not available immediately. The zalloc_noblock() and zget() functions attempt to allocate memory with no allowance for blocking and therefore can return NULL if no memory is available.

As shown in Figure 8-43, the zone allocator eventually allocates memory through kernel_memory_allocate() [osfmk/vm/vm_kern.c]. If the system is low on available memory, this function returns KERN_RESOURCE_SHORTAGE, which causes the zone allocator to wait for a page to become available. However, if kernel_memory_allocate() fails because there is no more kernel virtual address space left, the zone allocator causes a kernel panic.

Freeing a zone element through zfree() [osfmk/kern/zalloc.c] causes the element to be added to the zone's free list and the zone's count of in-use elements to be decremented. A collectable zone's unused pages are periodically garbage collected.
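The free-list mechanics behind zinit(), zcram(), zalloc(), and zfree() can be approximated by a toy, user-space allocator. The following is a simplified sketch, not the kernel's implementation: the toy_ names are hypothetical, malloc() stands in for kernel_memory_allocate(), and locking, blocking, and garbage collection are omitted (chunks are never returned to the system).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct toy_zone {
    int     count;          /* number of elements in use */
    void   *free_elements;  /* free list threaded through the free elements */
    size_t  cur_size;       /* memory currently owned by the zone */
    size_t  max_size;       /* upper bound on cur_size */
    size_t  elem_size;      /* size of one element */
    size_t  alloc_size;     /* chunk size used when the zone must grow */
};

static void toy_zinit(struct toy_zone *z, size_t size, size_t max, size_t alloc)
{
    /* Each free element stores a "next" pointer in its first bytes. */
    assert(size >= sizeof(void *) && size % sizeof(void *) == 0);
    assert(alloc >= size);
    z->count = 0;
    z->free_elements = NULL;
    z->cur_size = 0;
    z->max_size = max;
    z->elem_size = size;
    z->alloc_size = alloc;
}

/* Carve new memory into elements and push each onto the free list. */
static void toy_zcram(struct toy_zone *z, void *newmem, size_t size)
{
    char *p = newmem;
    while (size >= z->elem_size) {
        *(void **)p = z->free_elements;
        z->free_elements = p;
        p += z->elem_size;
        size -= z->elem_size;
        z->cur_size += z->elem_size;
    }
}

/* Pop an element; grow the zone by one chunk if the free list is empty. */
static void *toy_zalloc(struct toy_zone *z)
{
    if (z->free_elements == NULL) {
        if (z->cur_size + z->alloc_size > z->max_size)
            return NULL;                    /* zone may not grow further */
        void *chunk = malloc(z->alloc_size);
        if (chunk == NULL)
            return NULL;
        toy_zcram(z, chunk, z->alloc_size);
    }
    void *elem = z->free_elements;
    z->free_elements = *(void **)elem;
    z->count++;
    return elem;
}

/* Push the element back onto the free list and decrement the in-use count. */
static void toy_zfree(struct toy_zone *z, void *elem)
{
    *(void **)elem = z->free_elements;
    z->free_elements = elem;
    z->count--;
}
```

Because the free list in this sketch is last-in, first-out, an element that has just been freed is the next one handed out, which is one reason zone allocation is fast for frequently recycled structures.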

During VM subsystem initialization, the kernel calls zone_init() [osfmk/kern/zalloc.c] to create a map for the zone allocator (zone_map) as a submap of the kernel map. zone_init() also sets up garbage collection information: It allocates wired memory for the zone page table, a linked list that contains one element, a zone_page_table_entry structure, for each page assigned to a zone.

    // osfmk/kern/zalloc.c

    struct zone_page_table_entry {
        struct zone_page_table_entry *link;
        short                         alloc_count;
        short                         collect_count;
    };

The alloc_count field of the zone_page_table_entry structure is the total number of elements from that page assigned to the zone, whereas the collect_count field is the number of elements from that page on the zone's free list. Consider the following sequence of steps as an example of new memory being added to a zone.

A caller invokes zalloc() to request memory. zalloc() is a wrapper around zalloc_canblock(), which it calls with the "can block" Boolean parameter (canblock) set to true.
zalloc_canblock() attempts to remove an element from the zone's free list. If it succeeds, it returns; otherwise, the zone's free list is empty.
zalloc_canblock() checks whether the zone is currently undergoing garbage collection. If so, it sets the zone structure's waiting bit field and goes to sleep. The garbage collector will wake it up, after which it can retry removing an element from the zone's free list.
If allocation still doesn't succeed, zalloc_canblock() checks the zone structure's doing_alloc bit field to check whether someone else is allocating memory for the zone. If so, it goes to sleep again while setting the waiting bit field.
If nobody else is allocating memory for the zone, zalloc_canblock() attempts to allocate memory for the zone by calling kernel_memory_allocate(). The size of this allocation is normally the zone's allocation size (the zone structure's alloc_size field), but it can be just the size of a single element (rounded up to an integral number of pages) if the system is low on memory.
On a successful return from kernel_memory_allocate(), zalloc_canblock() calls zone_page_init() on the new memory. For each page in the memory, zone_page_init() sets both the alloc_count and collect_count fields of the corresponding zone_page_table_entry structure to 0.
zalloc_canblock() then calls zcram() on the new memory, which in turn calls zone_page_alloc() for each newly available element. zone_page_alloc() increments the appropriate alloc_count value by one for each element.

The zone garbage collector, zone_gc() [osfmk/kern/zalloc.c], is invoked by consider_zone_gc() [osfmk/kern/zalloc.c]. The latter ensures that garbage collection is performed at most once per minute, unless someone else has explicitly requested a garbage collection. The page-out daemon calls consider_zone_gc().

zfree() can request explicit garbage collection if the system is low on memory and the zone from which the element is being freed has an element size of a page size or more.

zone_gc() makes two passes on each collectable zone.^[24] In the first pass, it calls zone_page_collect() [osfmk/kern/zalloc.c] on each free element. zone_page_collect() increments the appropriate collect_count value by one. In the second pass, it calls zone_page_collectable() on each element, which compares the collect_count and alloc_count values for that page. If the values are equal, the page can be reclaimed since all elements on that page are free. zone_gc() tracks such pages in a list of pages to be freed and eventually frees them by calling kmem_free().

^[24] zone_gc() can skip a collectable zone if the zone has less than 10% of its elements free or if the amount of free memory in the zone is less than twice its allocation size.
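The two-pass accounting can be sketched as follows. This is a simplified user-space model under stated assumptions: the toy_ names are hypothetical, free elements are represented only by the index of the page they live on, the second pass walks pages rather than elements, and a page with an alloc_count of 0 is treated as not belonging to the zone.

```c
#include <assert.h>
#include <stddef.h>

struct toy_page_entry {
    short alloc_count;    /* elements from this page assigned to the zone */
    short collect_count;  /* elements from this page found on the free list */
};

/* Pass 1: for every free element, bump the collect_count of its page. */
static void toy_collect_pass(struct toy_page_entry *table, size_t npages,
                             const size_t *free_elem_page, size_t nfree)
{
    for (size_t i = 0; i < nfree; i++) {
        assert(free_elem_page[i] < npages);
        table[free_elem_page[i]].collect_count++;
    }
}

/* A page is reclaimable when every element on it is free. */
static int toy_page_collectable(const struct toy_page_entry *e)
{
    return e->alloc_count > 0 && e->collect_count == e->alloc_count;
}

/* Run both passes and return the number of reclaimable pages. */
static size_t toy_count_reclaimable(struct toy_page_entry *table, size_t npages,
                                    const size_t *free_elem_page, size_t nfree)
{
    size_t reclaimable = 0;

    for (size_t i = 0; i < npages; i++)
        table[i].collect_count = 0;
    toy_collect_pass(table, npages, free_elem_page, nfree);
    for (size_t i = 0; i < npages; i++)
        if (toy_page_collectable(&table[i]))
            reclaimable++;
    return reclaimable;
}
```

For example, with three pages of four elements each, where pages 0 and 2 are entirely free and page 1 has two elements in use, only pages 0 and 2 would be candidates for kmem_free().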

8.16.4. The Kalloc Family

The kalloc family of functions, implemented in osfmk/kern/kalloc.c, provides access to a fast general-purpose memory allocator built atop the zone allocator. kalloc() uses a 16MB submap (kalloc_map) of the kernel map from which it allocates its memory. The limited submap size avoids virtual memory fragmentation. kalloc() supports a set of allocation sizes, ranging from as little as KALLOC_MINSIZE bytes (16 bytes by default) to several kilobytes. Note that each size is a power of 2. When the allocator is initialized, it calls zinit() to create a zone for each allocation size that it handles. Each zone's name is set to reflect the zone's associated size, as shown in Figure 8-46. These are the so-called power-of-2 zones.

Figure 8-46. Printing sizes of kalloc zones supported in the kernel

    $ zprint | grep kalloc
    kalloc.16                    16   484K   615K  30976  39366  26998     4K    256 C
    kalloc.32                    32  1452K  1458K  46464  46656  38240     4K    128 C
    kalloc.64                    64  2404K  2916K  38464  46656  24429     4K     64 C
    kalloc.128                  128  1172K  1728K   9376  13824   2987     4K     32 C
    kalloc.256                  256   692K  1024K   2768   4096   2449     4K     16 C
    kalloc.512                  512   916K  1152K   1832   2304   1437     4K      8 C
    kalloc.1024                1024   804K  1024K    804   1024    702     4K      4 C
    kalloc.2048                2048  1504K  2048K    752   1024    663     4K      2 C
    kalloc.4096                4096   488K  4096K    122   1024     70     4K      1 C
    kalloc.8192                8192  2824K 32768K    353   4096    307     8K      1 C
    kalloc.large              60648  2842K  4360K     48     73     48    59K      1

Note that the zone named kalloc.large in the zprint output in Figure 8-46 is not real; it is a fake zone used for reporting on too-large-for-a-zone objects that were allocated through kmem_alloc().

The kalloc family provides malloc-style functions, along with a version that attempts memory allocation without blocking.

    void *
    kalloc(vm_size_t size);

    void *
    kalloc_noblock(vm_size_t size);

    void *
    kalloc_canblock(vm_size_t size, boolean_t canblock);

    void
    krealloc(void **addrp, vm_size_t old_size, vm_size_t new_size,
             simple_lock_t lock);

    void
    kfree(void *data, vm_size_t size);

Both kalloc() and kalloc_noblock() are simple wrappers around kalloc_canblock(), which prefers to get memory through zalloc_canblock(), unless the allocation size is too large, that is, kalloc_max_prerounded (8193 bytes by default) or more. krealloc() uses kmem_realloc() if the existing allocation is already too large for a kalloc zone. If the new size is also too large, krealloc() uses kmem_alloc() to allocate new memory, copies existing data into it using bcopy(), and frees the old memory. If the new memory fits in a kalloc zone, krealloc() uses zalloc() to allocate new memory. It still must copy existing data and free the old memory, since there is no "zrealloc" function.
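The way a request size maps to a power-of-2 zone can be illustrated with a short sketch. The function below is hypothetical: it hard-codes the default values quoted in the text (a 16-byte minimum and an 8193-byte threshold) and returns 0 for requests that would bypass the zones.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_KALLOC_MINSIZE        16u    /* smallest zone element size */
#define TOY_KALLOC_MAX_PREROUNDED 8193u  /* requests this large bypass zones */

/* Return the element size of the power-of-2 zone that would serve the
 * request, or 0 if the request is too large for any zone. */
static size_t toy_kalloc_zone_size(size_t size)
{
    if (size >= TOY_KALLOC_MAX_PREROUNDED)
        return 0;

    size_t zsize = TOY_KALLOC_MINSIZE;
    while (zsize < size)
        zsize <<= 1;

    assert(zsize <= 8192);
    return zsize;
}
```

Under these assumptions, a 17-byte request would be served from kalloc.32 and a 100-byte request from kalloc.128, whereas an 8193-byte request would fall through to the large-allocation path.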

8.16.5. The OSMalloc Family

The file osfmk/kern/kalloc.c implements another family of memory allocation functions: the OSMalloc family.

    OSMallocTag
    OSMalloc_Tagalloc(const char *str, uint32_t flags);

    void
    OSMalloc_Tagfree(OSMallocTag tag);

    void *
    OSMalloc(uint32_t size, OSMallocTag tag);

    void *
    OSMalloc_nowait(uint32_t size, OSMallocTag tag);

    void *
    OSMalloc_noblock(uint32_t size, OSMallocTag tag);

    void
    OSFree(void *addr, uint32_t size, OSMallocTag tag);

The key aspect of these functions is their use of a tag structure, which encapsulates certain properties of allocations made with that tag.

    #define OSMT_MAX_NAME 64

    typedef struct _OSMallocTag_ {
        queue_chain_t OSMT_link;
        uint32_t      OSMT_refcnt;
        uint32_t      OSMT_state;
        uint32_t      OSMT_attr;
        char          OSMT_name[OSMT_MAX_NAME];
    } *OSMallocTag;

Here is an example use of the OSMalloc functions:

    #include <libkern/OSMalloc.h>

    OSMallocTag my_tag;

    void
    my_init(void)
    {
        my_tag = OSMalloc_Tagalloc("My Tag Name", OSMT_ATTR_PAGEABLE);
        ...
    }

    void
    my_uninit(void)
    {
        OSMalloc_Tagfree(my_tag);
    }

    void
    some_function(...)
    {
        void *p = OSMalloc(some_size, my_tag);
    }

OSMalloc_Tagalloc() calls kalloc() to allocate a tag structure. The tag's name and attributes are set based on the arguments passed to OSMalloc_Tagalloc(). The tag's reference count is initialized to one, and the tag is placed on a global list of tags. Thereafter, memory is allocated using one of the OSMalloc allocation functions, which in turn uses one of kalloc(), kalloc_noblock(), or kmem_alloc_pageable() for the actual allocation. Each allocation increments the tag's reference count by one.

8.16.6. Memory Allocation in the I/O Kit

The I/O Kit provides its own interface for memory allocation in the kernel.

    void *
    IOMalloc(vm_size_t size);

    void *
    IOMallocPageable(vm_size_t size, vm_size_t alignment);

    void *
    IOMallocAligned(vm_size_t size, vm_size_t alignment);

    void *
    IOMallocContiguous(vm_size_t size, vm_size_t alignment,
                       IOPhysicalAddress *physicalAddress);

    void
    IOFree(void *address, vm_size_t size);

    void
    IOFreePageable(void *address, vm_size_t size);

    void
    IOFreeAligned(void *address, vm_size_t size);

    void
    IOFreeContiguous(void *address, vm_size_t size);

IOMalloc() allocates general-purpose, wired memory in the kernel map by simply calling kalloc(). Since kalloc() can block, IOMalloc() must not be called while holding a simple lock or from an interrupt context. Moreover, since kalloc() offers no alignment guarantees, IOMalloc() should not be called when a specific alignment is desired. Memory allocated through IOMalloc() is freed through IOFree(), which simply calls kfree(). The latter too can block.

Pageable memory with alignment restriction is allocated through IOMallocPageable(), whose alignment argument specifies the desired alignment in bytes. The I/O Kit maintains a bookkeeping data structure (gIOKitPageableSpace) for pageable memory.

    // iokit/Kernel/IOLib.c

    enum { kIOMaxPageableMaps    = 16 };
    enum { kIOPageableMapSize    = 96 * 1024 * 1024 };
    enum { kIOPageableMaxMapSize = 96 * 1024 * 1024 };

    static struct {
        UInt32     count;
        UInt32     hint;
        IOMapData  maps[kIOMaxPageableMaps];
        lck_mtx_t *lock;
    } gIOKitPageableSpace;

The maps array of gIOKitPageableSpace contains submaps allocated from the kernel map. During bootstrap, the I/O Kit initializes the first entry of this array by allocating a 96MB (kIOPageableMapSize) pageable map. IOMallocPageable() calls IOIteratePageableMaps(), which first attempts to allocate memory from an existing pageable map, failing which it fills the next slot of the maps array (up to a maximum of kIOMaxPageableMaps slots) with a newly allocated map. The eventual memory allocation is done through kmem_alloc_pageable(). When such memory is freed through IOFreePageable(), the maps array is consulted to determine which map the address being freed belongs to, after which kmem_free() is called to actually free the memory.

Wired memory with alignment restriction is allocated through IOMallocAligned(), whose alignment argument specifies the desired alignment in bytes. If the adjusted allocation size (after accounting for the alignment) is equal to or more than the page size, IOMallocAligned() uses kernel_memory_allocate(); otherwise, it uses kalloc(). Correspondingly, the memory is freed through kmem_free() or kfree().

IOMallocContiguous() allocates physically contiguous, wired, alignment-restricted memory in the kernel map. Optionally, this function returns the physical address of the allocated memory if a non-NULL pointer for holding the physical address is passed as an argument. When the adjusted allocation size is less than or equal to a page, physical contiguity is trivially present. In these two cases (less than a page, and exactly a page), IOMallocContiguous() uses kalloc() and kernel_memory_allocate(), respectively, for the underlying allocation. When multiple physically contiguous pages are requested, the allocation is handled by kmem_alloc_contig(). Like vm_page_alloc(), this function also causes memory allocation directly from the free list: It in turn relies on vm_page_find_contiguous() [osfmk/vm/vm_resident.c]. The latter traverses the free list, inserting free pages into a private sublist sorted on the physical address. As soon as a contiguous range large enough to fulfill the contiguous allocation request is detected in the sublist, the function allocates the corresponding pages and returns the remaining pages collected on the sublist to the free list. Because of the free list sorting, this function can take a substantial time to run when the free list is very large, for example, soon after bootstrapping on a system with a large amount of physical memory.
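The core idea of detecting a contiguous range in a list sorted on the physical address can be sketched in a few lines. This is not the kernel's algorithm: the function name is hypothetical, and the sketch assumes that the page numbers are already available as a sorted array with no duplicates, whereas vm_page_find_contiguous() builds its sorted sublist incrementally while traversing the free list.

```c
#include <assert.h>
#include <stddef.h>

/* Given physical page numbers sorted in ascending order, return the index
 * at which the first run of npages consecutive page numbers starts, or -1
 * if there is no such run. */
static long toy_find_contiguous(const unsigned *ppn, size_t count, size_t npages)
{
    assert(npages > 0);

    size_t run_start = 0;
    for (size_t i = 0; i < count; i++) {
        /* A gap in the page numbers starts a new candidate run. */
        if (i > 0 && ppn[i] != ppn[i - 1] + 1)
            run_start = i;
        if (i - run_start + 1 >= npages)
            return (long)run_start;
    }
    return -1;
}
```

Given free pages numbered 3, 4, 7, 8, 9, and 12, a request for three contiguous pages would be satisfied by the run starting at page 7, and a request for four would fail.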

When the caller requests the newly allocated memory's physical address to be returned, IOMallocContiguous() first retrieves the corresponding physical page from the pmap layer by calling pmap_find_phys() [osfmk/ppc/pmap.c]. If the DART IOMMU^[25] is present and active on the system, the address of this page is not returned as is. As we noted earlier, the DART translates I/O Kit-visible 32-bit "physical for I/O" addresses to 64-bit "true" physical addresses. Code running in the I/O Kit environment cannot even see the true physical address. In fact, even if such code attempted to use a 64-bit physical address, the DART would not be able to translate it, and an error would occur.

^[25] We will discuss the DART in Section 10.3.

If the DART is active, IOMallocContiguous() calls it to allocate an appropriately sized I/O memory range; the address of this allocation is the "physical" address that is returned. Moreover, IOMallocContiguous() has to insert each "true" physical page into the I/O memory range by calling the DART's "insert" function. Since IOFreeContiguous() must call the DART to undo this work, IOMallocContiguous() saves the virtual address and the I/O address in an _IOMallocContiguousEntry structure. The I/O Kit maintains these structures in a linked list. When the memory is freed, the caller provides the virtual address, using which the I/O Kit can search for the I/O address on this linked list. Once the I/O address is found, the structure is removed from the list and the DART allocation is freed.

    // iokit/Kernel/IOLib.c

    struct _IOMallocContiguousEntry {
        void          *virtual; // caller-visible virtual address
        ppnum_t        ioBase;  // caller-visible "physical" address
        queue_chain_t  link;    // chained to other contiguous entries
    };
    typedef struct _IOMallocContiguousEntry _IOMallocContiguousEntry;

8.16.7. Memory Allocation in the Kernel's BSD Portion

The BSD portion of the kernel provides _MALLOC() [bsd/kern/kern_malloc.c] and _MALLOC_ZONE() [bsd/kern/kern_malloc.c] for memory allocation. The header file bsd/sys/malloc.h defines the MALLOC() and MALLOC_ZONE() macros, which are trivial wrappers around _MALLOC() and _MALLOC_ZONE(), respectively.

    void *
    _MALLOC(size_t size, int type, int flags);

    void
    _FREE(void *addr, int type);

    void *
    _MALLOC_ZONE(size_t size, int type, int flags);

    void
    _FREE_ZONE(void *elem, size_t size, int type);

The BSD-specific allocator designates different types of memory with different numerical values, where the "memory type" (the type argument), which is specified by the caller, represents the purpose of the memory. For example, M_FILEPROC memory is used for open file structures, and M_SOCKET memory is used for socket structures. The various known types are defined in bsd/sys/malloc.h. The value M_LAST is one more than the last known type's value. This allocator is initialized during kernel bootstrap by a call to kmeminit() [bsd/kern/kern_malloc.c], which goes through a predefined array of kmzones structures (struct kmzones [bsd/kern/kern_malloc.c]). As shown in Figure 8-47, there is one kmzones structure for each type of memory supported by the BSD allocator.

Figure 8-47. Array of memory types supported by the BSD memory allocator

    // bsd/kern/kern_malloc.c

    char *memname[] = INITKMEMNAMES;

    struct kmzones {
        size_t  kz_elemsize;
        void   *kz_zalloczone;
    #define KMZ_CREATEZONE ((void *)-2)
    #define KMZ_LOOKUPZONE ((void *)-1)
    #define KMZ_MALLOC     ((void *)0)
    #define KMZ_SHAREZONE  ((void *)1)
    } kmzones[M_LAST] = {
    #define SOS(sname)     sizeof (struct sname)
    #define SOX(sname)     -1
        -1,          0,                /* 0 M_FREE       */
        MSIZE,       KMZ_CREATEZONE,   /* 1 M_MBUF       */
        0,           KMZ_MALLOC,       /* 2 M_DEVBUF     */
        SOS(socket), KMZ_CREATEZONE,   /* 3 M_SOCKET     */
        SOS(inpcb),  KMZ_LOOKUPZONE,   /* 4 M_PCB        */
        M_MBUF,      KMZ_SHAREZONE,    /* 5 M_RTABLE     */
        ...
        SOS(unsafe_fsnode),KMZ_CREATEZONE, /* 102 M_UNSAFEFS */
    #undef  SOS
    #undef  SOX
    };
    ...

Moreover, each type has a string name. These names are defined in bsd/sys/malloc.h in another array.

    // bsd/sys/malloc.h

    #define INITKMEMNAMES { \
            "free",         /* 0 M_FREE       */ \
            "mbuf",         /* 1 M_MBUF       */ \
            "devbuf",       /* 2 M_DEVBUF     */ \
            "socket",       /* 3 M_SOCKET     */ \
            "pcb",          /* 4 M_PCB        */ \
            "routetbl",     /* 5 M_RTABLE     */ \
            ...
            "kauth",        /* 100 M_KAUTH    */ \
            "dummynet",     /* 101 M_DUMMYNET */ \
            "unsafe_fsnode" /* 102 M_UNSAFEFS */ \
    }
    ...

As kmeminit() iterates over the array of kmzones, it analyses each entry's kz_elemsize and kz_zalloczone fields. Entries with kz_elemsize values of -1 are skipped. For the other entries, if kz_zalloczone is KMZ_CREATEZONE, kmeminit() calls zinit() to initialize a zone using kz_elemsize as the size of an element of the zone, 1MB as the maximum memory to use, PAGE_SIZE as the allocation size, and the corresponding string in the memname array as the zone's name. The kz_zalloczone field is set to this newly initialized zone.

If kz_zalloczone is KMZ_LOOKUPZONE, kmeminit() calls kalloc_zone() to simply look up the kernel memory allocator (kalloc) zone with the appropriate allocation size. The kz_zalloczone field is set to the found zone or to ZONE_NULL if none is found.

If kz_zalloczone is KMZ_SHAREZONE, the entry shares the zone with the entry at index kz_elemsize in the kmzones array. For example, the kmzones entry for M_RTABLE shares the zone with the entry for M_MBUF. kmeminit() sets the kz_zalloczone and kz_elemsize fields of a KMZ_SHAREZONE entry to those of the "shared with" zone.

Thereafter, _MALLOC_ZONE() uses its type argument as an index into the kmzones array. If the specified type is greater than the last known type, there is a kernel panic. If the allocation request's size matches the kz_elemsize field of kmzones[type], _MALLOC_ZONE() calls the Mach zone allocator to allocate from the zone pointed to by the kz_zalloczone field of kmzones[type]. If their sizes do not match, _MALLOC_ZONE() uses kalloc() or kalloc_noblock(), depending on whether the M_NOWAIT bit is clear or set, respectively, in the flags argument.
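The dispatch logic just described can be modeled with a small sketch. Everything here is hypothetical and chosen for illustration: the three-entry table, the flag value, and the has_zone field (which stands in for a resolved kz_zalloczone pointer). An assertion takes the place of the kernel panic.

```c
#include <assert.h>
#include <stddef.h>

enum toy_path { TOY_PATH_ZONE, TOY_PATH_KALLOC, TOY_PATH_KALLOC_NOBLOCK };

#define TOY_M_NOWAIT 0x0001  /* hypothetical flag value */
#define TOY_M_LAST   3       /* one more than the last known type */

struct toy_kmzone {
    size_t kz_elemsize;  /* element size of the type's zone */
    int    has_zone;     /* nonzero if a zone was resolved for this type */
};

static const struct toy_kmzone toy_kmzones[TOY_M_LAST] = {
    { 256, 1 },  /* type 0: own zone with 256-byte elements */
    { 0,   0 },  /* type 1: no zone; always served by kalloc */
    { 256, 1 },  /* type 2: shares type 0's zone */
};

/* Decide which allocator would serve a request of the given size and type. */
static enum toy_path toy_malloc_zone_path(size_t size, int type, int flags)
{
    assert(type >= 0 && type < TOY_M_LAST);  /* the kernel panics instead */

    const struct toy_kmzone *kmz = &toy_kmzones[type];
    if (kmz->has_zone && size == kmz->kz_elemsize)
        return TOY_PATH_ZONE;
    return (flags & TOY_M_NOWAIT) ? TOY_PATH_KALLOC_NOBLOCK : TOY_PATH_KALLOC;
}
```

In this model, only a request whose size exactly matches the type's element size reaches the zone; any other size falls back to the kalloc family, with the nonblocking variant chosen when the no-wait flag is set.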

Similarly, _MALLOC() calls kalloc() or kalloc_noblock() to allocate memory. The type argument is not used, but if its value exceeds the last known BSD malloc type, _MALLOC() still causes a kernel panic. _MALLOC() uses a bookkeeping data structure of its own to track allocated memory. It adds the size of this data structure (struct _mhead) to the size of the incoming allocation request.

    struct _mhead {
        size_t  mlen;   // used to record the length of allocated memory
        char    dat[0]; // this is returned by _MALLOC()
    };

Moreover, if the M_ZERO bit is set in the flags argument, _MALLOC() calls bzero() to zero-fill the memory.
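The header technique used by _MALLOC() can be demonstrated in user space. The following sketch is an approximation under stated assumptions: the toy_ names and the flag value are hypothetical, malloc() and free() stand in for kalloc() and kfree(), and the recorded length is assumed to include the header itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define TOY_M_ZERO 0x0004  /* hypothetical flag value */

struct toy_mhead {
    size_t mlen;   /* total length of the allocation, header included */
    char   dat[];  /* start of the memory handed back to the caller */
};

/* Allocate size bytes preceded by a bookkeeping header. */
static void *toy_malloc(size_t size, int flags)
{
    size_t total = sizeof(struct toy_mhead) + size;
    if (total < size)
        return NULL;  /* size overflow */

    struct toy_mhead *hdr = malloc(total);
    if (hdr == NULL)
        return NULL;
    hdr->mlen = total;
    if (flags & TOY_M_ZERO)
        memset(hdr->dat, 0, size);
    return hdr->dat;
}

/* Recover the header from the caller-visible pointer. */
static struct toy_mhead *toy_header_of(void *addr)
{
    assert(addr != NULL);
    return (struct toy_mhead *)((char *)addr - offsetof(struct toy_mhead, dat));
}

/* Length recorded at allocation time; a kfree()-style release needs it. */
static size_t toy_recorded_len(void *addr)
{
    return toy_header_of(addr)->mlen;
}

static void toy_free(void *addr)
{
    if (addr != NULL)
        free(toy_header_of(addr));
}
```

Because the length travels with the allocation, the caller of a _FREE()-style function does not need to supply a size, which is consistent with _FREE() taking only an address and a type.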

8.16.8. Memory Allocation in libkern's C++ Environment

As we noted in Section 2.4.4, libkern defines OSObject as the root base class for the Mac OS X kernel. The new and delete operators for OSObject call kalloc() and kfree(), respectively.

    // libkern/c++/OSObject.cpp

    void *
    OSObject::operator new(size_t size)
    {
        void *mem = (void *)kalloc(size);
        ...
        return mem;
    }

    void
    OSObject::operator delete(void *mem, size_t size)
    {
        kfree((vm_offset_t)mem, size);
        ...
    }