8.15. Memory Allocation in User SpaceThere are several user-space and kernel-space memory allocation APIs in Mac OS X, although all such APIs are built atop a single low-level mechanism. In the kernel, memory is fundamentally allocated through a page-level allocator in the Mach VM subsystem. In user space, memory is fundamentally allocated through the Mach vm_allocate() API,[20] although user programs typically use application-environment-specific APIs for memory allocation. The system library provides malloc(), which is the preferred user-space memory allocation function. malloc() is implemented using the Mach API. Memory allocation APIs in Carbon and Core Foundation are implemented on top of malloc(). Besides malloc-based memory allocation, user programs can also use the stack-based alloca(3) memory allocator, which allocates temporary space in the runtime stack. The space is automatically reclaimed when the function returns.[21] The alloca() function is built into the C compiler on Mac OS X.
Table 86 shows a sampling of ways in which memory can be allocated by user-space programs. Note that the list shown is not exhaustiveits purpose is to illustrate that a variety of APIs exist. Moreover, in many cases, you can use a function listed under a particular environment in other environments as well.
8.15.1. A Historical BreakHistorically, in UNIX the sbrk() and brk() system calls are used to dynamically change the amount of space allocated for the calling program's data segment. sbrk() is implemented but not supported by the Mac OS X kerneldirectly invoking the system call will result in an ENOTSUP error being returned. brk() is not even implemented by the kernel. However, the system library implements both sbrk() and brk(). Whereas brk() always returns a value of -1, sbrk() simulates a program break region by using a 4MB memory area that it allocates through vm_allocate() on the first invocation of sbrk(). Subsequent invocations of sbrk() adjust the current break value within the simulated region. If the adjusted size falls outside the region, a value of -1 is returned. // libSystem: mach/sbrk.c static int sbrk_needs_init = TRUE; static vm_size_t sbrk_region_size = 4*1024*1024; static vm_address_t sbrk_curbrk; caddr_t sbrk(int size) { kern_return_t ret; if (sbrk_needs_init) { sbrk_needs_init = FALSE; ret = vm_allocate(mach_task_self(), &sbrk_curbrk, sbrk_region_size, VM_MAKE_TAG(VM_MEMORY_SBRK)|TRUE); ... } if (size <= 0) return((caddr_t)sbrk_curbrk); else if (size > sbrk_region_size) return((caddr_t)-1); sbrk_curbrk += size; sbrk_region_size -= size; return((caddr_t)(sbrk_curbrk size)); } Note the use of the VM_MAKE_TAG macro by sbrk(). The macro can be used to tag any vm_allocate() allocation, thereby indicating the purpose of that allocation. The available tag values are defined in <mach/vm_statistics.h>. Besides system-reserved tags, there are tags available that user programs can use for tagging application-specific allocations. The vmmap tool displays memory regions in a program along with the region types. For example, running vmmap with the process ID of the WindowServer program produces an output like the following. $ sudo vmmap 71 ... REGION TYPE [ VIRTUAL] =========== [ =======] Carbon [ 4080K] CoreGraphics [ 271960K] IOKit [ 139880K] MALLOC [ 46776K] STACK GUARD [ 8K] Stack [ 9216K] VM_ALLOCATE ? [ 7724K] ... In a program that has called sbrk(), the 4MB region can be seen in the output of the vmmap command as a writable region labeled SBRK. // sbrk.c #include <stdio.h> #include <stdlib.h> #include <unistd.h> int main(void) { char cmdbuf[32]; sbrk(0); snprintf(cmdbuf, 32, "vmmap %d", getpid()); return system(cmdbuf); } $ gcc -Wall -o sbrk sbrk.c $ ./sbrk ... SBRK [ 4096K] ... 8.15.2. Memory Allocator InternalsThe Mac OS X system library's malloc implementation uses an abstraction called a malloc zone (unrelated to the Mach zone allocator). Malloc zones are variable-size blocks of virtual memory from which malloc draws memory for its allocations. The system library creates a default malloc zone when the malloc package is initialized in a program, which occurs when the zone is accessed for the first time (e.g., during the program's first call to malloc() or calloc()). A zone is analogous to a UNIX program's heap from the standpoint of memory allocation. The malloc implementation supports the creation of multiple zones. Although creating your own malloc zones is typically unnecessary, it may be useful in certain circumstances. Destroying a zone[22] frees all objects allocated from that zone; therefore, using a custom zone may improve performance if a large number of temporary objects need to be deallocated. Figure 829 shows the API exported by the malloc zone layer.
Figure 829. The malloc zones API
The actual zone allocator is implemented in an independent moduleseparate from the malloc layer that consists of the programmer-visible malloc family of functions. The malloc layer uses well-defined functions exported by the zone layer, and in fact, the malloc front-end can support an alternate underlying allocator or even multiple allocators. Figure 830 shows an overview of the communication between the malloc layer and the zone allocator. The latter is called the scalable zone allocator because it uses allocation strategies that scale from very small to very large allocation sizes. Figure 830. An overview of the Mac OS X malloc implementationIn the malloc layer, a malloc zone is represented by the malloc_zone_t structure, which is a substructure of the szone_t structure. The latter is visible only to the scalable zone layer. The malloc layer provides the malloc_create_zone function for creating zones. This function calls create_scalable_zone(), which creates a new scalable zone by allocating and initializing an szone_t structure. As part of this initialization, the scalable malloc layer populates the malloc_zone_t substructure, a pointer to which is returned to the malloc layer. Before returning, malloc_create_zone() calls malloc_zone_register(), which saves the malloc_zone_t pointer in a global array of malloc zones. Note that although the malloc layer directly calls create_scalable_zone(), it calls the other scalable zone functions only through the function pointers set up in the malloc_zone_t structure. As shown in Figure 830, the scalable zone layer provides standard malloc family functions, batch allocation and deallocation functions, and functions for introspection. The scalable zone allocator categorizes memory allocation requests based on size into tiny, small, large, and huge requests. As shown in Figure 831, it maintains internal bookkeeping data structures for each of these categories. Tiny and small allocations are made from tiny and small regions, respectively. We earlier said that a zone is analogous to a program's heap. A specific tiny or small region can be regarded as a bounded subheap from which allocations falling within a certain size range are made. Figure 831. Scalable zone data structures
Table 87 shows the size ranges and the corresponding allocation quanta for the categories in the case of the 32-bit system library. Table 88 shows the numbers for the 64-bit version.
The following are some noteworthy points about scalable malloc's allocation strategies.
Let us now look at the details of the individual strategies. 8.15.2.1. Tiny AllocationsTiny allocations are made from tiny regions. The allocator divides a tiny region into equal-size pieces called tiny quanta. Each tiny quantum is TINY_QUANTUM bytes in size.
TINY_QUANTUM is the minimum allocation size, regardless of the actual amount requested by the caller. For example, the amounts of memory actually allocated to satisfy allocation requests of one byte, TINY_QUANTUM bytes, and (TINY_QUANTUM+1) bytes are TINY_QUANTUM bytes, TINY_QUANTUM bytes, and 2xTINY_QUANTUM bytes, respectively. The tiny region is laid out as a contiguous chunk of memory containing NUM_TINY_BLOCKS quanta. Immediately following the last quantum is a metadata area that has the following structure.
Even though a tiny quantum's size is fixed, a tiny allocation can range from as small as 1 byte to as large as 31xTINY_QUANTUM. Allocations larger than a quantum will consist of multiple contiguous quanta. A set of such contiguous tiny quanta, whether allocated or free, is called a tiny block. The header and in-use bitmaps are used to maintain the notions of free and allocated blocks as follows.
When a pointer is freed, the corresponding block's header and in-use bits are appropriately modified. Moreover, information is written to the first few bytes of the pointed-to memory to convert that block into a free block. Free blocks are chained together in free lists. The allocator maintains 32 free lists for tiny region memory. Each list contains free blocks containing a particular number of quantathe first list contains free blocks that are all one quantum in size, the second list contains free blocks that are all two quanta in size, and so on. Although the maximum allocation size contains 31 quanta, it is possible to have free blocks that are larger, since adjacent blocks can be coalesced. The last list holds these extra-large free blocks. The structure of a free block is as follows:
Consider the free and allocated blocks shown in Figure 832. The free block starting at quantum qi contains m quanta, whereas the allocated block starting at quantum qk contains n quanta. Bits i and k are both set in the header bitmap. However, only bit k is set in the in-use block. Figure 832. Internal bookkeeping of tiny allocations (32-bit)Let us examine the working of the allocator by using the debugger on a simple program. Figure 833 shows a program that performs four tiny allocations. We will examine the allocator's state immediately after performing these allocations. Next, we will free the pointers and examine the state again. We will use a debugger watchpoint to stop execution of the program at the desired locations. Figure 833. A program that performs tiny allocations
To examine the tiny region's metadata area, we first need to determine the area's base address. Given a pointer known to be allocated from a tiny region, we can determine the base address of that tiny region since the region is always aligned on a boundary defined by the product of NUM_TINY_BLOCKS and TINY_QUANTUMthat is, the total allocatable size of the tiny region. The allocator ensures this alignment when it allocates memory for the tiny region itself. Therefore, given a tiny region pointer p, the following formula will give us the tiny region's base address: TINY_REGION_FOR_PTR(p) = p & ~((NUM_TINY_BLOCKS * TINY_QUANTUM) - 1) Since we are dealing with a 32-bit program, we must use the 32-bit-specific values of NUM_TINY_BLOCKS and TINY_QUANTUM. As shown in Figure 832, these values are 65,536 and 16, respectively. Using these values, the product is calculated to be 0x100000. Our formula then reduces to the following: TINY_REGION_FOR_PTR(p) = p & 0xFFF00000 Given the tiny region's base address, we can readily compute the metadata area's base address, since the metadata area immediately follows the last tiny quantum. This means that the metadata area starts (NUM_TINY_BLOCKS * TINY_QUANTUM) bytes from the tiny region's base address, which gives us the following formula for the metadata area's base address: TINY_REGION_END(p) = (p & 0xFFF00000) + 0x100000 The metadata area's base address is where the header bitmap is located. The location of the in-use bitmap can be calculated by adding NUM_TINY_BLOCKS/8 bytes (the size of the header bitmap) and a further 4 bytes (the size of the pad word). We finally have the following expressions: HEADER_BITMAP_32(p) = (p & 0xFFF00000) + 0x100000 INUSE_BITMAP_32(p) = HEADER_BITMAP_32(p) + 0x2004 Now that we know how to calculate the addresses we wish to look at, let us compile and run the program shown in Figure 833. $ gcc -Wall -g -o test_malloc test_malloc.c $ gdb ./test_malloc ... (gdb) watch watch Hardware watchpoint 1: watch (gdb) run Starting program: /private/tmp/test_malloc Reading symbols for shared libraries . done Hardware watchpoint 1: watch Old value = -1 New value = 1 main () at test_malloc.c:18 18 free(ptr1); (gdb) where full #0 main () at test_malloc.c:18 ptr1 = (void *) 0x300120 ptr2 = (void *) 0x300310 ptr3 = (void *) 0x300500 ptr4 = (void *) 0x3006f0 (gdb) When the first watchpoint is hit, our program has performed the four tiny allocations. Given the pointer values and the formulas devised earlier, we see that for all four pointers, the tiny region's base address is 0x300000, the header bitmap is at address 0x400000, and the in-use bitmap is at address 0x400000+0x2004. Thus, the pointers are 0x120, 0x310, 0x500, and 0x6f0 bytes away, respectively, from the base address. We can divide these distances by TINY_QUANTUMthat is, 16to get the starting quantum number for each pointer. The starting quanta are 18, 49, 80, and 111, with the first quantum being numbered 0. If we call the first bit in a bitmap as bit 0, we should examine bits 18, 49, 80, and 111 in both the header and in-use bitmaps. (gdb) x/4x 0x400000 0x400000: 0x25920400 0x00000200 0x00000100 0x00800000 (gdb) x/4x 0x400000+0x2004 0x402004: 0x25920400 0x00000200 0x00000100 0x00800000 Indeed, each of the four bits is set in both the bitmaps. Note that while accessing a bit, the allocator first locates the byte that contains the bit. The leftmost byte is byte 0, the next byte to the right is byte 1, and so on. Within a byte, the lowest numbered bit is the rightmost one. Let us continue the program, which will cause ptr1 and ptr3 to be freed before the next watchpoint is hit. (gdb) cont Continuing. Hardware watchpoint 1: watch Old value = 1 New value = 2 main () at test_malloc.c:22 22 free(ptr2); At this point, we expect the header bitmap to remain unchanged, but bits 18 and 80corresponding to ptr1 and ptr3should have been cleared in the in-use bitmap. (gdb) x/4x 0x400000 0x400000: 0x25920400 0x00000200 0x00000100 0x00800000 (gdb) x/4x 0x400000+0x2004 0x402004: 0x25920000 0x00000200 0x00000000 0x00800000 Moreover, the memory contents of ptr1 and ptr3 should have been populated to convert the corresponding blocks to free blocks. (gdb) x/4x ptr1 0x300120: 0x0030307b 0x00300500 0x00000000 0x001f0000 (gdb) x/4x ptr3 0x300500: 0x0030345b 0x00000000 0x00300120 0x001f0000 Recall that the free block begins with a checksum, which is followed by pointers to previous and next free blocks, with a trailing unsigned short value specifying the size of the block in units of quanta. We see that the freed blocks are chained on the same free listptr1's previous pointer points to ptr3, whereas ptr3's next pointer points to ptr1. This is expected, as both blocks have the same number of quanta (0x1f, or 31). Moreover, we can verify that the two checksums are correct by XOR'ing the relevant pointers with the magic number we saw earlier (0x357B). Let us continue the program again, which will free ptr2 and ptr4 before the next watchpoint is hit. (gdb) cont Continuing. Hardware watchpoint 1: watch Old value = 2 New value = 3 main () at test_malloc.c:26 26 return 0; (gdb) x/4x 0x400000 0x400000: 0x25920400 0x00000200 0x00000000 0x00800000 (gdb) x/4x 0x400000+0x2004 0x402004: 0x25920000 0x00000200 0x00000000 0x00800000 (gdb) x/4x ptr2 0x300310: 0x00000000 0x00000000 0x00000000 0x00000000 (gdb) x/4x ptr4 0x3006f0: 0x00000000 0x00000000 0x00000000 0x00000000 (gdb) We see that although ptr2 and ptr4 have been freed, the corresponding bits in both the header and in-use bitmaps are unchanged. Moreover, there is no free block information in the contents of ptr2 and ptr4. This is so because the allocator has coalesced the four adjacent free blocks into a single big free block starting at ptr1. Examining ptr1's contents should confirm this. (gdb) x/4x ptr1 0x300120: 0x0000357b 0x00000000 0x00000000 0x007c0000 The free block at ptr1 is now the only block on its free list, as both its previous and next pointers contain zeros. Moreover, the number of quanta in this free block is 0x7c (124), which is as expected. 8.15.2.2. Small AllocationsSmall allocations are handled in a manner conceptually similar to tiny allocations. Figure 834 shows the key data structures involved in small allocations. Besides the larger quantum size of a small region, the metadata area following the region is structured differently from the tiny region metadata area: Instead of the header and in-use bitmaps, the small region metadata area is an array of 16-bit quantities. The ith element of this array provides two pieces of information about the ith quantum in the small region. If the most significant bit of the array element is set, the corresponding quantum is the first quantum of a free block. The remaining 15 bits, which are nonzero only in a first quantum (whether of a free block or an allocated block), represent the number of quanta contained in the block. Like the set of tiny free lists, there are 32 small free lists, each of which is used for referring to free objects of the corresponding quantum size. Figure 834. Internal bookkeeping of small malloc allocations (32-bit)8.15.2.3. Large AllocationsLarge allocations are described by large_entry_t structures. As shown in Figure 831, a large_entry_t structure is an alias for a compact_range_t, which contains a single field called address_and_num_pages that represents both an address (high bits) and the number of pages allocated at that address (low bits). Since the maximum large allocation is 4095 pages, the low 12 bits are used to specify the number of pages. Large allocations are performed through vm_allocate(), which always returns a page-aligned addressthat is, the low 12 bits of the address are always zero. Therefore, it can be OR'ed with the number of pages to yield address_and_num_pages with no loss of information. The allocator tracks large entries by hashing their addresses in a table referenced by the large_entries field of the szone_t structure. An entry's hash index is its address modulo the total number of entries. If there is a hash collision, the index is incremented by one, and so onwhen the index becomes equal to the size of the table, it wraps around to zero. If the number of large allocations in use becomes higher than approximately 25% of the number of entries in the hash table, the table's density is deemed too high, and the allocator grows the hash table. 8.15.2.4. Huge AllocationsHuge allocations have a relatively simpler bookkeeping. For each huge allocation, the allocator maintains a huge_entry_t structure. These structures are kept in the array referenced by the huge_entries field of the szone_t structure. Upon a huge entry allocation, which is performed through vm_allocate(), the allocator grows this array by internally allocating memory, copying the old entries to the grown array, and freeing the old array. As shown in Figure 831, huge_entry_t is an alias for vm_range_t, which contains an address and a size. 8.15.3. The malloc() RoutineThe various malloc API functions, such as malloc() and free(), are simple wrappers around internal functions that in turn call the scalable zone functions through the function pointers in the malloc_zone_t structure corresponding to the default zone. To use any other zone than the default, the malloc_zone_* functions from Figure 829 must be used directly. Figure 835 shows the processing of the malloc() library function, which simply calls malloc_zone_malloc(), with the return value from inline_malloc_default_zone() being the zone to allocate from. The inline function checks the value of the global variable malloc_num_zones to determine whether it is being called for the first time. If so, it first initializes the malloc package by creating a scalable zone, which will be used as the default malloc zone. It calls malloc_create_zone(), which first checks for the presence of several environment variables that affect malloc's behavior. These variables are documented in the malloc(3) manual page, and a summary of the available variables can be printed by running any program that uses malloc() with the MallocHelp environment variable set. $ MallocHelp=1 /usr/bin/true (10067) malloc: environment variables that can be set for debug: - MallocLogFile <f> to create/append messages to file <f> instead of stderr - MallocGuardEdges to add 2 guard pages for each large block - MallocDoNotProtectPrelude to disable protection (when previous flag set) - MallocDoNotProtectPostlude to disable protection (when previous flag set) ... - MallocBadFreeAbort <b> to abort on a bad free if <b> is non-zero - MallocHelp - this help! Figure 835. Processing of the malloc() function in the system libraryOnce the default zone is available, malloc_zone_malloc() calls the malloc() function exported by the scalable malloc layer. The latter functionszone_malloc()calls the szone_malloc_should_clear() internal function with the cleared_request Boolean parameter set to false. In contrast, the szone_calloc() function calls szone_malloc_should_clear() with cleared_request set to true. As shown in Figure 835, szone_malloc_should_clear() classifies the allocation request based on size and dispatches it to the appropriate handler. While changing the size of an allocation because of a realloc() call, the allocator first attempts to reallocate in place. If that fails, a new buffer is allocated, into which the allocator copies the contents of the old buffer. If the old buffer had at least VM_COPY_THRESHOLD bytes (defined to be 40KB) of memory, the allocator uses vm_copy() for copying. If the old buffer had less memory, or if vm_copy() fails, the allocator uses memcpy(). 8.15.4. The Largest Single Allocation (32-bit)Depending on a program's needs, it may be affected by the most amount of contiguous memory it can allocate. Given the size of a 64-bit virtual address space, a 64-bit program is highly unlikely to face this issue. However, a 32-bit address space is limited to 4GB of virtual memory, not all of which is available to the program. As we saw in Table 84, several virtual address ranges are unavailable for the program's use as the system uses these ranges for predefined mappings. The program shown in Figure 836 can be used to determine the maximum size of a single allocation, which can be no larger than the largest free contiguous virtual address range. Note that the exact number is likely to differ across operating system versions and even across specific installations of the same version, although the difference may not be great. Figure 836. Determining the size of the largest single malloc() allocation
As the output in Figure 836 shows, on this particular system, at the instant the program was run, a contiguous allocation is limited to 577,016 pages, or 2,363,457,536 bytes (approximately 2.2GB). Note that there still exist other, smaller free virtual address ranges, so the program could allocate a sum total that may be considerably larger than the single largest allocation. For example, on the system shown in Figure 836, allocating another 1GB (approximately) of memory through malloc() succeeds, after which several much smaller allocations can still be made before the process's virtual address space is exhausted. 8.15.5. The Largest Single Allocation (64-bit)Let us see how much virtual memory we can allocate using mach_vm_allocate() in a 64-bit task. Figure 837 shows a program that attempts to allocate the amount of memory given as a command-line argument. Figure 837. Allocating 2 petabytes of virtual memory
8.15.6. Enumerating All PointersThe scalable zone implementation provides several functions for debugging and analysis. The malloc layer exports some of this functionality to user programs through functions such as malloc_zone_from_ptr(), malloc_zone_get_all_zones(), malloc_zone_print(), malloc_zone_print_ptr_info(), malloc_zone_statistics(), and malloc_zone_log(). Moreover, those scalable zone functions that have function pointers in the malloc_zone_t structure or the szone_t structure can also be called by a user program, although doing so would be a layering violation since the szone_t structure is meant to be opaque in the malloc layer. Note that we cannot use printf(3) in our program since printf()'s implementation itself uses malloc(). We will use a custom printf()-like functionlet us call it nomalloc_printf()that does not use printf() and in turn does not use malloc(). Figure 838 shows the implementation of nomalloc_printf(). We will also use this function in other examples that follow. Figure 838. Implementing a version of the printf() function without using malloc()
The program shown in Figure 839 calls a pointer enumeration function implemented by the scalable zone layer. The function is available to the malloc layer through the enumerator field of the malloc_introspection_t substructure within the malloc_zone_t structure. One of the arguments to the function is a pointer to a recorder functiona callback invoked for each allocated range encountered. The function also accepts a type mask that limits the enumeration to specific types of memory. Figure 839. Enumerating all malloc()-allocated pointers in a program
8.15.7. Displaying Scalable Zone StatisticsThe program shown in Figure 840 retrieves and displays statistics on the various types of malloc regions. In particular, you can use this program to see how the allocator classifies an allocation request as tiny, small, large, or huge based on the request's size. Figure 840. Displaying scalable-zone statistics
8.15.8. Logging Malloc OperationsThe malloc implementation supports logging of malloc operations to help in the analysis of memory-related bugs. The MallocStackLogging environment variable can be set to cause the allocator to remember the function call stack at the time of each allocation. A variantthe MallocStackLoggingNoCompact environment variablecauses all allocations to be logged, regardless of their sizes or lifetimes. Mac OS X provides several tools for memory-related debugging, for example, heap, leaks, malloc_history, and MallocDebug.app.[23]
The malloc layer allows a custom malloc logger to be installed by setting the malloc_logger global function pointer. In fact, setting the aforementioned environment variables results in this pointer being set to an internal logger function. Figure 841 shows a program that implements its own malloc logging through this mechanism. Figure 841. Logging malloc operations
8.15.9. Intercepting the Malloc LayerSince the malloc layer calls the allocator functions through function pointers, we can easily intercept invocations of these functionssay, for debugging or for experimenting with an alternate allocator. Specifically, we can retrieve a pointer to the default zone by calling malloc_default_zone(). We can then save the original function pointers from the malloc_zone_t structure and insert pointers to our own functions in the structure. Thereafter, we can call the original functions from our functions or provide an alternate allocator implementation altogether. Figure 842 shows an example of using such interception. Figure 842. Intercepting the malloc layer
|