13.2. Memory Mapping

Linux allows a process to map files into its address space. Such a mapping creates a one-to-one correspondence between data in the file and data in the mapped memory region. Memory mapping has a number of applications.

High-speed file access. Normal I/O mechanisms, such as read() and write(), force the kernel to copy the data through a kernel buffer rather than directly between the device that holds the file and the user-space process. Memory maps eliminate this middle buffer, saving a memory copy.^[12]
^[12] Although saving a memory copy may not seem that important, thanks to Linux's efficient caching mechanism, these copy latencies are the slowest part of writing to data files that do not have O_SYNC set.
Executable files can be mapped into a program's memory, allowing a program to dynamically load new executable sections. This is how dynamic loading, described in Chapter 27, is implemented.
New memory can be allocated by mapping portions of /dev/zero, a special device that is full of zeros,^[13] or through an anonymous mapping. Electric Fence, described in Chapter 7, uses this mechanism to allocate memory.
^[13] Although most character devices cannot be mapped, /dev/zero can be mapped for exactly this type of application.
New memory allocated through memory maps can be made executable, allowing it to be filled with machine instructions, which are then executed. This feature is used by just-in-time compilers.
Files can be treated just like memory and read using pointers instead of system calls. This can greatly simplify programs by eliminating the need for read(), write(), and lseek() calls.
Memory mapping allows processes to share memory regions that persist across process creation and destruction. The memory contents are stored in the mapped file, making it independent of any process.

13.2.1. Page Alignment

System memory is divided into chunks called pages. The size of a page varies with architecture, and on some processors the page size can be changed by the kernel. The getpagesize() function returns the size, in bytes, of each page on the system.

 #include <unistd.h>

 int getpagesize(void);

For each page on the system, the kernel tells the hardware how each process may access the page (such as write, execute, or not at all). When a process attempts to access a page in a manner that violates the kernel's restrictions, a segmentation fault (SIGSEGV) results, which normally terminates the process.

A memory address is said to be page aligned if it is the address of the beginning of a page. In other words, the address must be an integral multiple of the architecture's page size. On a system with 4K pages, 0, 4,096, 16,384, and 32,768 are all page-aligned addresses (of course, there are many more) because the first, second, fifth, and ninth pages in the system begin at those addresses.
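The alignment rule above comes down to a little modular arithmetic. The following is a minimal sketch; the helper names (is_page_aligned(), page_floor(), page_ceil(), is_system_page_aligned()) are hypothetical and are not part of any library. Only getpagesize() comes from the text.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* exposes getpagesize() under strict -std= modes */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Nonzero if addr is an integral multiple of page_size. */
int is_page_aligned(uintptr_t addr, size_t page_size) {
    assert(page_size > 0);
    return addr % page_size == 0;
}

/* Address of the first byte of the page that contains addr. */
uintptr_t page_floor(uintptr_t addr, size_t page_size) {
    assert(page_size > 0);
    return addr - (addr % page_size);
}

/* Smallest multiple of page_size that is greater than or equal to length. */
size_t page_ceil(size_t length, size_t page_size) {
    assert(page_size > 0);
    return (length + page_size - 1) / page_size * page_size;
}

/* The same alignment check, using the running system's page size. */
int is_system_page_aligned(const void *p) {
    return is_page_aligned((uintptr_t) p, (size_t) getpagesize());
}
```

With 4K pages, page_floor(5000, 4096) is 4,096 and page_ceil(5000, 4096) is 8,192. Helpers of this kind are useful when an address or file offset has to be rounded to a page boundary before it is handed to the kernel.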

13.2.2. Establishing Memory Mappings

New memory maps are created by the mmap() system call.

 #include <sys/mman.h>

 caddr_t mmap(caddr_t address, size_t length, int protection, int flags,
              int fd, off_t offset);

The address specifies where in memory the data should be mapped. Normally, address is NULL, which means the process does not care where the new mapping is, allowing the kernel to pick any address. If an address is specified, it must be page aligned and not already in use. If the requested mapping would conflict with another mapping or would not be page aligned, mmap() may fail.

The second parameter, length, tells the kernel how much of the file to map into memory. You can successfully map more memory than the file has data, but attempting to access the part of the mapping that lies beyond the file's data may result in a fatal signal.^[14]

^[14] The signal is raised when you try to access a page that has no file data behind it. On Linux this case is reported as SIGBUS rather than SIGSEGV.

The protection parameter controls which types of access are allowed to the new memory region. It should be one or more of the values from Table 13.2 bitwise OR'ed together, or PROT_NONE if no access to the mapped region should be allowed. A file can be mapped only for access types that were also requested when the file was originally opened. For example, a file that was opened O_RDONLY cannot be mapped for writing with PROT_WRITE.

Table 13.2. `mmap()` Protections
Flag	Description
`PROT_READ`	The mapped region may be read.
`PROT_WRITE`	The mapped region may be written.
`PROT_EXEC`	The mapped region may be executed.

The enforcement of the specified protection is limited by the hardware platform on which the program is running. Many architectures cannot allow code to execute in a memory region while disallowing reading from that memory region. On such hardware, mapping a region with PROT_EXEC is equivalent to mapping it with PROT_EXEC | PROT_READ. For this reason, the memory protections passed to mmap() should be relied on only as minimal protections.
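The link between a file's open mode and the protections it can be mapped with is easy to probe. The sketch below uses a hypothetical helper, try_shared_map(), which opens a file, attempts a one-byte MAP_SHARED mapping with the requested protection, and reports the errno value of the call that failed. It compares the result against MAP_FAILED, the (void *) -1 constant that <sys/mman.h> provides for this test. On Linux, requesting PROT_WRITE on a descriptor opened O_RDONLY is expected to fail with EACCES. Other systems may report the failure differently.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Open path with open_flags, try to map its first byte MAP_SHARED with
 * the given protection, and return 0 on success or the errno value from
 * whichever call failed.
 */
int try_shared_map(const char *path, int open_flags, int protection) {
    int fd, saved;
    void *region;

    if ((fd = open(path, open_flags)) < 0)
        return errno;

    region = mmap(NULL, 1, protection, MAP_SHARED, fd, 0);
    saved = (region == MAP_FAILED) ? errno : 0;

    if (region != MAP_FAILED)
        munmap(region, 1);
    close(fd);
    return saved;
}
```

Calling try_shared_map(path, O_RDONLY, PROT_READ | PROT_WRITE) on an existing file should return a nonzero error code, while the same call with O_RDWR should return 0. Note that the restriction shown here applies to MAP_SHARED mappings, where writes would reach the file.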

The flags specify other attributes of the mapped region. Table 13.3 summarizes all the flags. Many of the flags that Linux supports are not standard but may be useful in special circumstances. Table 13.3 differentiates between the standard mmap() flags and Linux's extra flags. All calls to mmap() must specify one of MAP_PRIVATE or MAP_SHARED; the remainder of the flags are optional.

Table 13.3. `mmap()` Flags
Flag	POSIX?	Description
`MAP_ANONYMOUS`	Yes	Ignore `fd`; create an anonymous mapping.
`MAP_FIXED`	Yes	Fail if `address` is invalid.
`MAP_PRIVATE`	Yes	Writes are private to process.
`MAP_SHARED`	Yes	Writes are copied to the file.
`MAP_DENYWRITE`	No	Do not allow normal writes to the file.
`MAP_GROWSDOWN`	No	Grow the memory region downward.
`MAP_LOCKED`	No	Lock the pages into memory.

`MAP_ANONYMOUS`	Rather than mapping a file, an anonymous mapping is returned. It behaves like a normal mapping, but no physical file is involved. Although this memory region cannot be shared with other processes, nor is it automatically saved to a file, anonymous mappings allow processes to allocate new memory for private use. Such mapping is often used by implementations of `malloc()`, as well as by more specialized applications. The `fd` parameter is ignored if this flag is used.
`MAP_FIXED`	If the mapping cannot be placed at the requested `address`, `mmap()` fails. If this flag is not specified, the kernel will try to place the map at `address` but will map it at an alternate address if it cannot. If the `address` has already been used by `mmap()`, the item mapped into that region will be replaced by a new memory map. This means that it is a very good idea to pass only addresses that were returned by previous calls to `mmap()`; if arbitrary addresses are used, the memory region used by system libraries may be overwritten.
`MAP_PRIVATE`	Modifications to the memory region should be private to the process, neither shared with other processes that map the same file (other than related processes that are forked after the memory map is created) nor reflected in the file itself. Either `MAP_SHARED` or `MAP_PRIVATE` must be used. If the memory region is not writeable, it does not matter which is used.
`MAP_SHARED`	Changes that are made to the memory region are copied back to the file that was mapped and shared with other processes that are mapping the same file. (To write changes to the memory region, `PROT_WRITE` must have been specified; otherwise, the memory region is immutable.) Either `MAP_SHARED` or `MAP_PRIVATE` must be specified.
`MAP_DENYWRITE`	Usually, system calls for normal file access (like `write()`) may modify a mapped file. If the region is being executed, this may not be a good idea, however. `MAP_DENYWRITE` causes writes to the file, other than those writes done through memory maps, to return `ETXTBSY`.
`MAP_GROWSDOWN`	Trying to access the memory immediately before a mapped region normally causes a `SIGSEGV`. This flag tells the kernel to extend the region to lower memory addresses, one page at a time, if a process tries to access the memory in the lower adjacent page, and continue the process as normal. This is used to allow the kernel to automatically grow processes' stacks on platforms that have stacks that grow down (the most common case). This is a platform-specific flag that is normally used only for system code.
	The only limit on `MAP_GROWSDOWN` is the stack-size resource limit, discussed on page 120. If no limit is set, the kernel will grow the mapped segment whenever doing so would be beneficial. It will not grow the segment past other mapped regions, however.
`MAP_GROWSUP`	This flag works just like `MAP_GROWSDOWN`, but it is for those (rare) platforms that have stacks that grow up, which means that the region is extended only with higher rather than lower addresses. (As of kernel 2.6.7, only the `parisc` architecture has stacks that grow up.) Like `MAP_GROWSDOWN`, this flag is normally reserved for system code, and the stack-size resource limit is applied.
`MAP_LOCKED`	The region is locked into memory, meaning it will never be swapped. This is important for real-time applications (`mlock()`, discussed on page 275, provides another method for memory locking). This normally may be specified only by the root user; normal users cannot lock pages into memory. Some Linux systems allow limited allocation of locked memory by users other than root, and it is possible that this capability will be added to the standard Linux kernel in the future.
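The difference between MAP_PRIVATE and MAP_SHARED described above can be observed by storing a byte through each kind of mapping and then reading the file back with read(). This is a minimal sketch. The helpers poke_first_byte() and first_byte() are hypothetical names, and the file is assumed to exist and to contain at least one byte.

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map the first byte of a non-empty file read/write using map_flag
 * (MAP_SHARED or MAP_PRIVATE), store value through the mapping, then
 * unmap. Returns 0 on success, -1 on error.
 */
int poke_first_byte(const char *path, int map_flag, char value) {
    int fd;
    char *region;

    if ((fd = open(path, O_RDWR)) < 0)
        return -1;

    region = mmap(NULL, 1, PROT_READ | PROT_WRITE, map_flag, fd, 0);
    close(fd);                  /* the mapping outlives the descriptor */
    if (region == MAP_FAILED)
        return -1;

    region[0] = value;          /* an ordinary store, no write() call */
    return munmap(region, 1);
}

/* Read the first byte of path with read(); returns -1 on error. */
int first_byte(const char *path) {
    unsigned char c;
    ssize_t n;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    n = read(fd, &c, 1);
    close(fd);
    return n == 1 ? c : -1;
}
```

After poke_first_byte(path, MAP_PRIVATE, 'X') the file's first byte is unchanged, because the store went to a private copy of the page. After poke_first_byte(path, MAP_SHARED, 'Y'), first_byte(path) returns 'Y'.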

After the flags comes the file descriptor, fd, for the file that is to be mapped into memory. If MAP_ANONYMOUS was used, this value is ignored. The final parameter specifies where in the file the mapping should begin, and it must be an integral multiple of the page size. Most applications begin the mapping from the start of the file by specifying an offset of zero.
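Because an anonymous mapping ignores fd, it can serve as a simple page-granular allocator. The sketch below wraps that idea in two hypothetical helpers, anon_alloc() and anon_free(). Passing -1 as the descriptor is a common convention that some systems require. The offset is 0, which trivially satisfies the page-multiple rule.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* exposes MAP_ANONYMOUS under strict -std= modes */
#endif
#include <stddef.h>
#include <sys/mman.h>

/* Allocate length bytes of zero-filled, private memory; NULL on failure. */
void *anon_alloc(size_t length) {
    void *region = mmap(NULL, length, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    return region == MAP_FAILED ? NULL : region;
}

/* Release memory obtained from anon_alloc(); length must match. */
int anon_free(void *region, size_t length) {
    return munmap(region, length);
}
```

The kernel rounds the request up to whole pages, so this approach suits large buffers better than small objects. Unlike malloc(), the caller must remember the length in order to release the region.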

mmap() returns an address that should be stored in a pointer. If an error occurred, it returns the address that is equivalent to -1. To test for this, the -1 constant should be typecast to a caddr_t rather than typecasting the returned address to an int. This ensures that you get the right result no matter what the sizes of pointers and integers are.

Here is a program that acts like cat and expects a single file name as a command-line argument. It opens that file, maps it into memory, and writes the entire file to standard output through a single write() call. It may be instructive to compare this example with the simple cat implementation on page 174. This example also illustrates that memory mappings stay in place after the mapped file is closed.

 /* map-cat.c */

 #include <errno.h>
 #include <fcntl.h>
 #include <sys/mman.h>
 #include <sys/stat.h>
 #include <sys/types.h>
 #include <stdio.h>
 #include <unistd.h>

 int main(int argc, const char ** argv) {
     int fd;
     struct stat sb;
     void * region;

     /* exactly one file name is expected */
     if (argc != 2) {
         fprintf(stderr, "usage: %s <file>\n", argv[0]);
         return 1;
     }

     if ((fd = open(argv[1], O_RDONLY)) < 0) {
         perror("open");
         return 1;
     }

     /* stat the file so we know how much of it to map into memory */
     if (fstat(fd, &sb)) {
         perror("fstat");
         return 1;
     }

     /* we could just as well map it MAP_PRIVATE as we aren't writing
        to it anyway */
     region = mmap(NULL, sb.st_size, PROT_READ, MAP_SHARED, fd, 0);
     if (region == ((caddr_t) -1)) {
         perror("mmap");
         return 1;
     }

     /* the mapping remains valid after the descriptor is closed */
     close(fd);

     if (write(1, region, sb.st_size) != sb.st_size) {
         perror("write");
         return 1;
     }

     return 0;
 }

13.2.3. Unmapping Regions

After a process is finished with a memory mapping, it can unmap the memory region through munmap(). This causes future accesses to that address to generate a SIGSEGV (unless the memory is subsequently remapped) and saves some system resources. All memory regions are unmapped when a process terminates or begins a new program through an exec() system call.

 #include <sys/mman.h>

 int munmap(caddr_t addr, size_t length);

The addr is the address of the beginning of the memory region to unmap, and length specifies how much of the memory region should be unmapped. Normally, each mapped region is unmapped by a single munmap() call. Linux can fragment maps if only a portion of a mapped region is unmapped, but this is not a portable technique.
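The fragmenting behavior mentioned above can be seen by unmapping the middle page of a three-page region. This is a sketch for Linux only, with hypothetical helper names. It probes each page with msync(), which reports ENOMEM for an address range that is not mapped, and the pages come from an anonymous mapping so no file is needed.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* exposes getpagesize() and MAP_ANONYMOUS */
#endif
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* 1 if the page starting at addr is mapped, 0 if not, -1 on other errors. */
int page_is_mapped(void *addr, size_t page_size) {
    if (msync(addr, page_size, MS_ASYNC) == 0)
        return 1;
    return errno == ENOMEM ? 0 : -1;
}

/*
 * Map three anonymous pages, unmap only the middle one, and record in
 * mapped[i] whether page i is still mapped. Returns 0 on success.
 */
int unmap_middle_page(int mapped[3]) {
    size_t page = (size_t) getpagesize();
    char *region = mmap(NULL, 3 * page, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    int i;

    if (region == MAP_FAILED)
        return -1;

    /* partial unmap: the one region becomes two fragments */
    if (munmap(region + page, page) != 0) {
        munmap(region, 3 * page);
        return -1;
    }

    for (i = 0; i < 3; i++)
        mapped[i] = page_is_mapped(region + i * page, page);

    /* each surviving fragment is released separately */
    munmap(region, page);
    munmap(region + 2 * page, page);
    return 0;
}
```

On Linux the expected result is that the first and third pages remain mapped while the middle one does not. Code that must be portable should unmap each region with one munmap() call covering the whole mapping.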

13.2.4. Syncing Memory Regions to Disk

If a memory map is being used to write to a file, the modified memory pages and the file will be different for a period of time. If a process wishes to immediately write the pages to disk, it may use msync().

 #include <sys/mman.h>

 int msync(caddr_t addr, size_t length, int flags);

The first two parameters, addr and length, specify the region to sync to disk. The flags parameter specifies how the memory and disk should be synchronized. It consists of one or more of the following flags bitwise OR'ed together:

`MS_ASYNC`	The modified portions of the memory region are scheduled to be synchronized "soon." Only one of `MS_ASYNC` and `MS_SYNC` may be used.
`MS_SYNC`	The modified pages in the memory region are written to disk before the `msync()` system call returns. Only one of `MS_ASYNC` and `MS_SYNC` may be used.
`MS_INVALIDATE`	This option lets the kernel decide whether the changes are ever written to disk. Although this does not ensure that they will not be written, it tells the kernel that it does not have to save the changes. This flag is used only under special circumstances.
`0`	Passing `0` to `msync()` works on Linux kernels, though it is not well documented. It is similar to `MS_ASYNC`, but means that pages should be written out to disk whenever it is appropriate to do so. This normally means they will be flushed when the `bdflush` kernel thread next runs (it normally runs every 30 seconds), whereas `MS_ASYNC` writes the pages out more aggressively.
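A typical use of MS_SYNC is to modify a file through a shared mapping and return only once the change has been written. The following sketch assumes a hypothetical helper, overwrite_synced(), that replaces the beginning of an existing file with a string. It refuses to write past the end of the file, because a mapping cannot extend a file.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Overwrite the start of an existing file with text through a shared
 * mapping, returning only after msync() reports the pages written.
 * Returns 0 on success, -1 on error (including a file shorter than text).
 */
int overwrite_synced(const char *path, const char *text) {
    size_t len = strlen(text);
    struct stat sb;
    char *region;
    int fd, rc;

    if ((fd = open(path, O_RDWR)) < 0)
        return -1;
    if (fstat(fd, &sb) != 0 || len == 0 || (size_t) sb.st_size < len) {
        close(fd);
        return -1;
    }

    region = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    if (region == MAP_FAILED)
        return -1;

    memcpy(region, text, len);          /* dirty the mapped pages */
    rc = msync(region, len, MS_SYNC);   /* block until they are written */
    munmap(region, len);
    return rc;
}
```

Replacing MS_SYNC with MS_ASYNC would let the call return before the write completes, which is usually preferable when the caller does not need a durability guarantee at that point.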

13.2.5. Locking Memory Regions

Under Linux and most other modern operating systems, memory regions may be paged to disk (or discarded if they can be replaced in some other manner) when memory becomes scarce. Applications that are sensitive to external timing constraints may be adversely affected by the delay that results from the kernel paging memory back into RAM when the process needs it. To make these applications more robust, Linux allows a process to lock memory in RAM to make these timings more predictable. For security reasons, only processes running with root permission may lock memory.^[15] If any process could lock regions of memory, a rogue process could lock all the system's RAM, making the system unusable. The total amount of memory locked by a process cannot exceed its RLIMIT_MEMLOCK usage limit.^[16]

^[15] This may change in the future as finer-grained system permissions are implemented in the kernel.

^[16] See pages 118-119 for information on resource limits.

The following calls are used to lock and unlock memory regions:

 #include <sys/mman.h>

 int mlock(caddr_t addr, size_t length);
 int mlockall(int flags);
 int munlock(caddr_t addr, size_t length);
 int munlockall(void);

The first of these, mlock(), locks length bytes starting at address addr. An entire page of memory must be locked at a time, so mlock() actually locks all the pages between the page containing the first address and the page containing the final address to lock, inclusively. When mlock() returns, all the affected pages will be in RAM.

If a process wants to lock its entire address space, mlockall() should be used. The flags argument is one or both of the following flags bitwise OR'ed together:

`MCL_CURRENT`	All the pages currently in the process's address space are locked into RAM. They will all be in RAM when `mlockall()` returns.
`MCL_FUTURE`	All pages added to the process's address space will be locked into RAM.

Unlocking memory is nearly the same as locking it. If a process no longer needs any of its memory locked, munlockall() unlocks all the process's pages. munlock() takes the same arguments as mlock() and unlocks the pages containing the indicated region.

Locking a page multiple times is equivalent to locking it once. In either case, a single call to munlock() unlocks the affected pages.
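The calls above are usually paired around the code that needs predictable timing. The sketch below shows one way to do that with a hypothetical helper, use_locked_buffer(), which allocates a buffer with an anonymous mapping, locks it, touches it, and then unlocks and frees it. Whether mlock() succeeds depends on the privileges of the process and on its RLIMIT_MEMLOCK limit, so the helper returns the errno value instead of assuming success.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* exposes MAP_ANONYMOUS under strict -std= modes */
#endif
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Allocate length bytes, lock them into RAM, use them, then unlock and
 * free them. Returns 0 on success, otherwise the errno value from the
 * call that failed.
 */
int use_locked_buffer(size_t length) {
    char *buf = mmap(NULL, length, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    int rc = 0;

    if (buf == MAP_FAILED)
        return errno;

    if (mlock(buf, length) != 0) {
        rc = errno;                  /* typically EPERM or ENOMEM */
    } else {
        memset(buf, 0xA5, length);   /* the pages are already resident */
        if (munlock(buf, length) != 0)
            rc = errno;
    }

    munmap(buf, length);
    return rc;
}
```

A caller should treat a nonzero return as "run without the lock or give up" and should not retry in a loop, since the usual causes (insufficient privilege or a resource limit) do not go away on their own.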
