13.5. Virtual Memory Management

The Linux kernel also manages fair access to physical memory on behalf of all user processes. Each process has its own view of system memory that does not correspond exactly to reality, so a process is said to use a virtual memory space rather than a physical memory space. The beauty of this scheme is that each process sees system memory as an area that can be much larger than the amount of physical memory actually in the system. Addressing always begins at zero (rather than at some large, varying address where the process might happen to reside in physical memory), and the process has no access to any other process's memory space, so every process is protected from being overwritten by any other.

13.5.1. The Page Table

For the sake of convenience, memory is (conceptually) divided into regions called pages, analogous to disk blocks. In fact, the size of a memory page is usually the same as the size of a disk block, to facilitate moving data back and forth between the two efficiently. For example, memory pages on an Alpha platform are 8K, while on an Intel platform they are 4K.
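
A program can ask the system for the page size in use at run time. The following minimal C sketch uses the standard sysconf() call; the value it prints will vary by platform, as noted above.

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* Ask the system for its memory page size in bytes. */
        long page_size = sysconf(_SC_PAGESIZE);

        printf("Page size: %ld bytes\n", page_size);
        return 0;
    }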

The virtual memory manager in the Linux kernel, often with hardware assistance from the Memory Management Unit (MMU) if available, maps the address of a virtual page of memory from the user process's address space to the address of a physical page of memory on the system. This is accomplished through the use of a page table, as shown in Figure 13-26.


Figure 13-26. The page table maps virtual to physical memory addresses.


This is an oversimplification of a Linux page table. Linux uses a three-level page table to map a virtual memory address to a physical one. The first level is a page directory, which is a list of pointers to other page directories; these, in turn, are lists of pointers to page tables. On platforms where the MMU only supports two-level page tables (e.g., Intel platforms), Linux gives the middle directory a length of one.
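
Conceptually, a three-level lookup amounts to slicing a virtual address into three table indices plus an offset within the page. The C sketch below is purely illustrative; the field widths (and the helper names) are assumptions chosen for the example, not the kernel's actual values.

    /* Illustrative only: the field widths are assumptions, not kernel values. */
    #include <stdint.h>

    #define OFFSET_BITS 12    /* offset within a 4K page        */
    #define TABLE_BITS  10    /* index into a page table        */
    #define MIDDLE_BITS 10    /* index into a middle directory  */

    uint64_t dir_index(uint64_t vaddr)    /* top-level page-directory index */
    {
        return vaddr >> (MIDDLE_BITS + TABLE_BITS + OFFSET_BITS);
    }

    uint64_t mid_index(uint64_t vaddr)    /* middle-directory index */
    {
        return (vaddr >> (TABLE_BITS + OFFSET_BITS)) & ((1u << MIDDLE_BITS) - 1);
    }

    uint64_t table_index(uint64_t vaddr)  /* page-table index */
    {
        return (vaddr >> OFFSET_BITS) & ((1u << TABLE_BITS) - 1);
    }

    uint64_t page_offset(uint64_t vaddr)  /* offset within the page */
    {
        return vaddr & ((1u << OFFSET_BITS) - 1);
    }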

A page-table entry in the memory management system is analogous to an inode in the file system, as each tracks the location of individual storage units. Each page-table entry contains information about an individual page of memory, including:

  • valid flag (invalid means page is in swap area)

  • physical page address

  • access control information (executable or shared data, modified flag, last time used, etc.)

When the proper page-table entry has been identified, the memory page can be accessed and the data transferred to the proper location in memory or read from memory and transferred back to the process. We'll see uses for the other information in the page-table entry a bit later.
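
To make the contents of an entry concrete, here is a hypothetical C structure; the field names and widths are illustrative only and are not the kernel's actual definitions.

    /* Illustrative page-table entry; not the kernel's real layout. */
    struct pte_sketch {
        unsigned int valid    : 1;   /* 0 => page is in the swap area       */
        unsigned int writable : 1;   /* access control                      */
        unsigned int shared   : 1;   /* shared data or executable           */
        unsigned int modified : 1;   /* "dirty": written since loaded       */
        unsigned int accessed : 1;   /* referenced recently                 */
        unsigned long frame;         /* physical page (frame) address       */
    };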

13.5.2. Paging

When a process is running, not all of its memory pages need to be resident in physical memory. In fact, since the virtual memory space may be larger than the physical memory space, this might well be impossible. The only pages that must be in physical memory are those the process is actually using. Linux uses a scheme called demand paging to bring memory pages into memory as they are needed.


When a running process references an address in a page of memory that is not currently in physical memory (i.e., the page's valid bit is not set), it generates a page fault. This causes the kernel to bring the required data into memory and update the page tables to reflect where the page is now located in physical memory.

If there is no room in physical memory to bring in a new page, then a page currently in physical memory (either a page belonging to this process that it is no longer using or a page belonging to another process) will need to be removed to make room. We'll see the algorithm for this a bit later.
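
In outline, handling a page fault looks something like the following sketch, written here in terms of the illustrative pte_sketch structure above. The helper functions are hypothetical; this is a picture of the idea, not actual kernel code.

    /* Sketch of demand paging; all helper functions are hypothetical. */
    void handle_page_fault(struct pte_sketch *pte, unsigned long vaddr)
    {
        if (!free_page_available())
            evict_some_page();              /* page something out first     */

        unsigned long frame = allocate_physical_page();
        load_page_from_disk(frame, vaddr);  /* swap area or mapped file     */

        pte->frame = frame;                 /* record where the page now is */
        pte->valid = 1;                     /* mark it resident             */
        /* The faulting instruction is then restarted.                      */
    }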

13.5.3. Memory-Mapped Files

Data from files that needs to be loaded into memory (such as executables or shared libraries) is treated a bit differently. Rather than being loaded and then treated as ordinary memory, the data is mapped into memory directly from the file. Such a file is a memory-mapped file, since its contents are mapped into the virtual memory space of the process. Whenever a page fault occurs on a page from a memory-mapped file, the page is paged in directly from the disk file. The kernel also maintains a page cache so that recently used pages can be obtained again without having to go back to disk.
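
User processes can use the same mechanism directly through the mmap() system call. The short C sketch below maps a file into the process's address space and reads its first byte; pages are faulted in from the file only as they are touched. (The file name is just an example; any readable, non-empty file will do.)

    #include <stdio.h>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/etc/hostname", O_RDONLY);    /* any readable file */
        if (fd < 0) { perror("open"); return 1; }

        struct stat sb;
        if (fstat(fd, &sb) < 0) { perror("fstat"); return 1; }

        /* Map the whole file; no data is read until a page is touched. */
        char *data = mmap(NULL, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (data == MAP_FAILED) { perror("mmap"); return 1; }

        printf("first byte: %c\n", data[0]);   /* this access may page-fault */

        munmap(data, sb.st_size);
        close(fd);
        return 0;
    }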

13.5.4. Swapping

The term swapping refers to the opposite of paging into memory; that is, paging out. The name is somewhat historical and harkens back to a time when you had to swap out an entire process in order to run a new one. Some people use the terms "swapping in" and "swapping out," which are essentially the same as "paging in" and "paging out."

If a process needs to bring a new page into physical memory but there is no physical memory left in which to store it, then some existing page of memory must be paged out first. This is accomplished by the swapper, the kernel process kswapd, which is started by init when the system boots.

The swapper will run anytime real memory falls below a certain threshold (the kernel value free_pages_low) and will try to release memory until the number of free pages is back up to the value of free_pages_high. It looks at each process to see what pages might be swapped out. Some of the criteria the swapper uses to decide to swap out or release a memory page include:

  • if the page hasn't been used for some time

  • if the page hasn't been modified, especially if it is still in the swap cache

  • if the page is from a memory-mapped file, especially if it is still in the page cache

If the page is still available in a cache, then it could easily be brought back in without having to be read from disk. If it is not easily available, then it must be saved to the swap area before it can be released from memory and that space in memory can be used by another process. The swap area is a disk partition created solely for use by the kernel to store copies of memory pages.

If a page is a memory-mapped page from a disk file or data that has not been modified since the last time it was brought in from the swap area, the page does not need to be saved in the swap area and may simply be discarded, since a copy of the page as it currently exists in memory is still available on the disk.
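
Putting the above criteria together, the page-out decision can be pictured roughly as follows, again using the illustrative pte_sketch structure. The helper functions are hypothetical; this is a simplified sketch, not actual kswapd code.

    /* Rough sketch of the page-out decision; helpers are hypothetical. */
    void try_to_release_page(struct pte_sketch *pte)
    {
        if (recently_used(pte))
            return;                          /* leave active pages alone     */

        if (!pte->modified || from_mapped_file(pte)) {
            /* A clean copy already exists on disk (swap area or file),
               so the page can simply be discarded. */
            release_physical_page(pte->frame);
        } else {
            write_to_swap_area(pte);         /* save the page first          */
            release_physical_page(pte->frame);
        }
        pte->valid = 0;                      /* page is no longer resident   */
    }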


13.5.5. Allocation and Deallocation

Linux maintains a list of free pages of physical memory, so that when a process requests one or more new pages, the request can be quickly fulfilled. When pages are deallocated and added back into the list of free pages, the memory manager attempts to group them into larger contiguous groups whenever possible so that memory does not become badly fragmented. It accomplishes this with the free_area data structure shown in Figure 13-27.

Figure 13-27. The free_area list of free memory pages.


The free_area is an array of pointers to groups of pages. free_area[0] points to a doubly linked list of single pages that are available. free_area[1] points to a doubly linked list of groups of two pages that are free. Each successive entry points to a doubly linked list of groups of pages whose size increases by powers of two, up to the sixth entry, free_area[5], which points to a doubly linked list of groups of 32 pages.

When a process allocates new memory and requests free pages from the kernel, the kernel looks in the free_area structure to find the largest contiguous group of pages of memory to give the process. If the largest page group is more pages than the process is requesting, the extra pages are removed from the group and put back into the proper list in the free_area, depending on how many pages are involved. If there is no single page group that will satisfy the process, then multiple smaller groups are selected until the process receives what it requested.
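
One way to picture the allocation side in C: search free_area for a group of free pages large enough for the request, hand it to the process, and return any excess pages to the smaller lists. The helper functions, types, and exact search order below are assumptions for illustration, not the kernel's actual code.

    /* Sketch of allocation from free_area; helpers and types are hypothetical. */
    struct page_group *alloc_pages_sketch(int pages_wanted)
    {
        for (int order = 0; order < 6; order++) {
            int group_size = 1 << order;             /* 1, 2, 4, ..., 32 pages */
            if (group_size >= pages_wanted && !list_empty(&free_area[order])) {
                struct page_group *grp = list_remove_head(&free_area[order]);
                /* Put any extra pages back on the appropriate smaller lists. */
                return_excess_pages(grp, group_size - pages_wanted);
                return grp;
            }
        }
        return NULL;   /* no single group is big enough: combine smaller ones */
    }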


When a process releases memory or is swapped out (so that its real memory is returned to the system), its pages are deallocated and returned to the free_area. The memory manager attempts to recombine these pages with adjacent groups of pages of equal size to form a contiguous page group of the next larger size. For example, when returning a single page of memory, it checks whether the page on either side of the page being returned is also in the list pointed to by free_area[0]. If one is, the two pages are combined into a two-page group, and the same check is repeated at the next level: if one of the adjacent two-page groups is also free, the two are combined into a four-page group, and so on. If no free neighbor of matching size is found, the group is simply added to the list for its current size. This way, freed pages are recombined into the largest possible contiguous group of free pages and put in the appropriate list in the free_area array.
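
Deallocation can be sketched as the mirror image: repeatedly look for the adjacent ("buddy") group of the same size and merge upward while possible. Again, the names below are hypothetical and the code is only a sketch of the idea, not kernel code.

    /* Sketch of coalescing freed pages back into free_area; helpers hypothetical. */
    void free_pages_sketch(unsigned long page, int order)
    {
        while (order < 5) {
            unsigned long buddy = page ^ (1UL << order);   /* adjacent group  */
            if (!group_is_free(buddy, order))
                break;                        /* neighbor busy: stop merging  */
            remove_from_list(&free_area[order], buddy);
            if (buddy < page)
                page = buddy;                 /* merged group starts lower    */
            order++;                          /* group is now twice as large  */
        }
        add_to_list(&free_area[order], page);
    }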

13.5.6. Loading an Executable: execl ()/execv ()

When a process performs an execl () or execv (), the kernel creates new page tables for the process. At this point, all of the code and initialized data resides on disk in the executable file, and so the page-table entries are set to contain the locations of the corresponding disk blocks of the memory-mapped file. When the process accesses one of these pages for the first time, its corresponding block is copied from disk into memory, and the page-table entry is updated with the physical memory page number.
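
Conceptually, the freshly created entries might look like the following, again in terms of the illustrative pte_sketch structure; the helper and the reuse of the frame field to hold a disk-block location are assumptions made for the sketch.

    /* Conceptual sketch only; helper and fields are hypothetical. */
    void map_executable_page(struct pte_sketch *pte, unsigned long disk_block)
    {
        pte->valid = 0;            /* not yet in physical memory             */
        pte->frame = disk_block;   /* remember where the data lives on disk  */
        /* The first access faults; the handler copies the disk block into a
           physical page and rewrites pte->frame and pte->valid.             */
    }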

13.5.7. Duplicating a Process: fork ()

When a process forks, the child process must be allocated a copy of its parent's memory space. A process, however, often follows a fork () immediately with an execl (), thereby deallocating the memory areas it has just been given. To avoid the unnecessary and costly copying this would entail, the kernel puts off copying the process's memory until it is absolutely necessary.

When the fork occurs, the child inherits a page table of its own that points to the same pages of memory, but they are marked read-only (for both processes). If the process later tries to write to an individual memory location, the fact that the page is marked read-only will trigger a page fault. The memory manager will see that the page is valid and only then will it create an entirely new copy of the page for the process trying to write. This is known as copy-on-write. In many cases, the child process simply execs a new program, so none of this ever takes place and an unnecessary copy is avoided. If the child process really does continue to run the current program, it does get its very own copy, just not at the moment of the fork.
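
Copy-on-write is invisible to the program. The small C program below shows the observable effect: after the fork, the child writes to a variable and sees its new value, while the parent's copy is unaffected because the child was given its own page at the moment of the write.

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int value = 42;                 /* shared (read-only) after the fork   */

    int main(void)
    {
        pid_t pid = fork();
        if (pid < 0) { perror("fork"); return 1; }

        if (pid == 0) {             /* child */
            value = 99;             /* this write triggers copy-on-write   */
            printf("child:  value = %d\n", value);
            exit(0);
        }

        wait(NULL);                 /* parent */
        printf("parent: value = %d\n", value);   /* still 42 */
        return 0;
    }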



