4.2 Virtual Memory Architecture

A virtual memory system exists to provide a framework for the system to manage memory on behalf of various processes. The virtual memory system provides two primary benefits. It allows software developers to write to a simple memory model, which shields the programmer from the memory subsystem's hardware architecture and allows the use of memory sizes substantially greater than physical memory through backing stores. It also permits processes to have nonfragmented address spaces, regardless of how physical memory is organized or fragmented. In order to implement such a scheme, four key functions are required.

First, each process is presented with its own virtual address space; that is, it can "see" and potentially use a range of memory. The size of this range is determined by the maximum address width of the machine. For example, a process running on a 32-bit system will have a virtual address space of about 4 GB (2^32 bytes). The virtual memory system is responsible for managing the mappings between the used portions of this virtual address space and physical memory.

Second, several processes might have substantial sharing between their address spaces. For example, let's say that two copies of the shell /bin/csh are running. Both copies will have separate virtual address spaces, each with its own copy of the executable itself, the libc library, and other shared resources. The virtual memory system transparently maps these shared segments to the same area of physical memory, so that multiple copies are not stored. This is analogous to making a hard link in a filesystem instead of duplicating the file.

However, sometimes physical memory will become insufficient to hold the used portions of all virtual address spaces. In this case, the virtual memory system selects less frequently used portions of memory and pushes them out to secondary storage (e.g., disk) in order to optimize the use of physical memory.

Finally, the virtual memory system plays one of the roles of an elementary school teacher: keeping the children in his care from interfering with each other's private things. Hardware facilities in the memory management unit perform this function by preventing a process from accessing memory outside its own address space.

4.2.1 Pages

Just like energy in quantum mechanics, memory is quantized; that is, it is organized into indivisible units. These units are called pages. The exact size of a page varies from system to system and is dependent on the processor's memory management unit (MMU) implementation. Large page sizes reduce MMU activity by reducing the number of page faults, and save kernel memory. However, they also waste some memory, since the system can only deal with memory in page-sized segments. A request for less than one full page of memory will be answered with the allocation of a full page. Typically, page sizes are 4 KB or 8 KB. On Solaris systems, you can find the page size via /usr/bin/pagesize or getpagesize(3C), while on Linux it is defined in the kernel header file asm/param.h as EXEC_PAGESIZE.
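
If you need the page size from within a program, sysconf(3C) reports it portably (getpagesize() returns the same value). The following is a minimal sketch that simply prints whatever the running system reports.

    /* Minimal sketch: print the system's page size in bytes. */
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        long psize = sysconf(_SC_PAGESIZE);   /* bytes per page */
        if (psize < 0) {
            perror("sysconf");
            return 1;
        }
        printf("page size: %ld bytes\n", psize);
        return 0;
    }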

Page Sizes on Various Architectures

There are only three commonly encountered microprocessor designs that implement an 8 KB page size: the DEC Alpha, the original Sun SPARC processors (e.g., the Ross RT601/Cypress CY7C601/Texas Instruments TMS390C601A, which were used in the SPARCstation 2), and the Sun UltraSPARC designs. The Intel 80x86, the MIPS processors used in SGI systems, the Motorola/IBM PowerPC, and Sun's microSPARC and SuperSPARC series processors all use a 4 KB page size.

4.2.2 Segments

The pages that compose a process are grouped into several segments. Each process has at least four of these segments:

Executable text

Consists of the actual executable instructions in the binary. This is mapped from the on-disk binary and is read-only.

Executable data

Contains initialized variables present in the executable. This is mapped from the on-disk binary, but has permissions set to read/write/private (the private mapping ensures that changes to these variables during runtime are not reflected in the on-disk file, or in other processes sharing the same executable). A small sketch of such a private mapping follows this list.

Heap space

Consists of memory allocated by means of malloc(3). This is called anonymous memory, because it has no mapping in the filesystem (we'll discuss it further in Section 4.3.2.1 later in this chapter).

Stack

Also allocated from anonymous memory.
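
To make the read/write/private mapping of the data segment concrete, here is a minimal sketch using mmap(2). With MAP_PRIVATE, the process gets copy-on-write copies of any pages it modifies, so the on-disk file is never changed. The file name example.dat is purely hypothetical, and the file is assumed to be at least one byte long.

    /* Sketch: a private, writable mapping of a file, analogous to the
     * executable's data segment.  Writes modify private copies of the
     * affected pages (copy-on-write); the on-disk file is untouched. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("example.dat", O_RDONLY);     /* hypothetical file */
        if (fd < 0) { perror("open"); return 1; }

        size_t len = (size_t)sysconf(_SC_PAGESIZE); /* map one page */
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        p[0] = 'X';      /* triggers a copy-on-write fault; file unchanged */

        munmap(p, len);
        close(fd);
        return 0;
    }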

4.2.3 Estimating Memory Requirements

Sometimes you have it easy, and a system only needs to run one commercial software package, for which the vendor has nicely told you precisely how much memory you need for optimal performance. Unfortunately, the real world is rarely so nice: the systems administrator in a typical computing environment has hundreds of different processes to worry about. Rather than try to solve every possible problem, we'll work out a general mechanism for thinking about how much memory is required for a given system.

The most important thing is to establish what a reasonable workload is on the system. Let's say that you have a system that's being used for serving typical low-end users: they run a shell, maybe an editor, a mailreader (probably pine or elm), or perhaps a newsreader. If you've been carefully monitoring your current environment, you'll be able to get a handle on how many of each sort of process is running at peak usage times. If you haven't, or if you're starting from scratch, you'll need to make a reasonable guess. Let's say that you've decided that there will be 50 users logged in at peak hours, and the usage pattern works out like this:

  • 5 invocations each of ksh and csh, 40 of tcsh

  • 25 pine processes and 10 elm processes

  • Other applications that defy categorization

We'll work through calculating the memory requirements of the csh processes. If we run pmap on a representative process, we get output that resembles this:

    % pmap -x 7522
    7522:   -csh
    Address   Kbytes Resident Shared Private Permissions        Mapped File
    0803C000      48      48       -      48 read/write/exec    [ stack ]
    08048000     116     116     116       - read/exec          csh
    08065000      16      16       -      16 read/write/exec    csh
    08069000      72      72       -      72 read/write/exec    [ heap ]
    DFF19000     548     444     408      36 read/exec          libc.so.1
    DFFA2000      28      28       4      24 read/write/exec    libc.so.1
    DFFA9000       4       4       -       4 read/write/exec    [ anon ]
    DFFAB000       4       4       4       - read/exec          libmapmalloc.so.1
    DFFAC000       8       8       -       8 read/write/exec    libmapmalloc.so.1
    DFFAF000     148     136     136       - read/exec          libcurses.so.1
    DFFD4000      28      28       -      28 read/write/exec    libcurses.so.1
    DFFDB000      12       -       -       - read/write/exec    [ anon ]
    DFFDF000       4       4       -       4 read/write/exec    [ anon ]
    DFFE1000       4       4       4       - read/exec          libdl.so.1
    DFFE3000     100     100     100       - read/exec          ld.so.1
    DFFFC000      12      12       -      12 read/write/exec    ld.so.1
    --------  ------  ------  ------  ------
    total Kb    1152    1024     772     252

It looks like each process is taking about 1,150 KB of memory, with 1,024 KB resident in memory. Of that 1,024 KB, 772 KB is shared with other processes, and 252 KB is private. So, for five invocations, we could estimate memory usage at roughly 2 MB (about 770 KB shared, plus five times about 250 KB). These are back-of-the-envelope calculations, but they are remarkably good at giving you an idea how much memory you need.
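
The back-of-the-envelope method generalizes easily: count the shared portion once and the private portion once per process. The helper below is a hypothetical illustration of that arithmetic, using the csh figures from the pmap output above.

    /* Sketch: shared memory is counted once, private memory once per
     * process.  Figures are in kilobytes, taken from pmap output. */
    #include <stdio.h>

    static long estimate_kb(long shared_kb, long private_kb, int nprocs)
    {
        return shared_kb + (long)nprocs * private_kb;
    }

    int main(void)
    {
        /* 772 KB shared, 252 KB private, 5 csh processes */
        printf("five csh processes: about %ld KB\n",
               estimate_kb(772, 252, 5));       /* prints 2032 KB */
        return 0;
    }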

Remember, though, that memory is consumed by three other things (the filesystem cache, intimately shared memory, and the kernel) aside from processes! Unless you're running Oracle or another database application, you probably won't need to worry about intimately shared memory. As a rule of thumb, you can safely figure on about 32 MB for the kernel and other applications, plus another 16 MB if you're running a windowing system. If your users only access a few hundred megabytes of data, but access it frequently, it may be advisable to buy sufficient memory to cache the entire data set. This will drastically improve I/O performance to that information.

4.2.4 Address Space Layout

The exact details of a process's address space vary depending on the architecture it is implemented on. For Solaris systems, there are essentially four address space layouts:

  • SPARC V7 32-bit combined kernel and process address space (used on sun4c , sun4m , and sun4d architectures)

  • SPARC V9 32-bit separated kernel and process address space (used on sun4u machines)

  • SPARC V9 64-bit separated kernel and process address space (used on sun4u machines)

  • Intel IA-32 32-bit combined kernel and process address space (used on Solaris for Intel systems)

The SPARC V7 32-bit combined model maps the kernel's address space into the top of the process's address space. This means that the virtual address space available to the process is restricted by the amount consumed by the kernel's address space (256 MB on sun4c and sun4m architectures, and 512 MB on sun4d architectures). The kernel's address space is protected from the user's process by means of the processor's privilege levels. The stack begins just beneath the kernel at address 0xEFFFC000 (0xDFFFE000 on sun4d systems) and grows downwards to address 0xEF7EA000 (0xDF7F9000 on sun4d), where libraries are loaded into memory. The text and data portions of the executable are loaded into the bottom of the address space, and the heap grows from the top of those upwards towards the libraries.

The UltraSPARC microprocessor architecture allows the kernel its own private address space, which removes the size limit on the kernel's address space. [4] Thus, for the 32-bit sun4u memory model, the stack begins at address 0xFFBEC000 and grows downwards until 0xFF3DC000 (the small space at the absolute top of the address space is reserved for the OpenBoot PROM). Otherwise, it is the same as the SPARC V7 models.

[4] This was a significant problem in the large pre-UltraSPARC systems, such as the SPARCcenter 2000E.

Although the UltraSPARC processor supports the SPARC V9 64-bit mode, the UltraSPARC-I and UltraSPARC-II processor implementations support only 44-bit virtual addresses. This creates a "hole" in the middle of the virtual address space, spanning addresses 0x00000800.00000000 through 0xFFFFF7FF.FFFFFFFF.

The Intel architecture address space is similar to the sun4c/sun4m SPARC V7 model, in that it does not separate user space from kernel space. However, it has a significant difference: the stack is mapped beneath the executable segments (which start at address 0x8048000) and permitted to grow downwards to the very bottom of the address space.
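
A quick, if rough, way to get a feel for your own platform's layout is to print the address of a function (text), a malloc'd block (heap), and a local variable (stack). This sketch assumes nothing about the absolute values, which differ across the layouts described above; the function-pointer-to-void* cast is a common POSIX-style liberty rather than strict ISO C.

    /* Sketch: print rough addresses of the text, heap, and stack.  Only
     * the relative ordering is meaningful; absolute values vary. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        int on_stack;                    /* lives on the stack */
        void *on_heap = malloc(64);      /* lives on the heap */

        printf("text  (main)          : %p\n", (void *)main);
        printf("heap  (malloc block)  : %p\n", on_heap);
        printf("stack (local variable): %p\n", (void *)&on_stack);

        free(on_heap);
        return 0;
    }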

4.2.5 The Free List

The free list is the mechanism by which the system dynamically manages the flow of memory between processes. Memory is taken from the free list by processes, and returned to it when the process exits or by the action of the page scanner, which tries to ensure that there is a small amount of memory free for immediate use at all times. Every time memory is requested, a page fault is incurred. There are three kinds of faults, which we'll discuss in more detail later in this section (a short sketch for observing fault counts in a running process follows these definitions):

Minor page fault

Occurs every time a process needs some memory, or when a process tries to access a page that has been stolen by the page scanner but not recycled.

Major page fault

Occurs when a process tries to access a page that has been taken from it by the page scanner, completely recycled, and is now being used by another process. A major page fault is always preceded by a minor page fault.

Copy-on-write fault

Caused by a process trying to write to a page of memory that it shares with other processes.
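
These categories are visible from user space: getrusage(3C) reports how many minor faults (serviced without disk I/O) and major faults (requiring a read from disk) a process has taken. The sketch below is purely observational and is not part of the free-list machinery itself; the exact accounting differs somewhat between Solaris and Linux.

    /* Sketch: report this process's minor and major page fault counts. */
    #include <stdio.h>
    #include <sys/time.h>
    #include <sys/resource.h>

    int main(void)
    {
        struct rusage ru;

        if (getrusage(RUSAGE_SELF, &ru) != 0) {
            perror("getrusage");
            return 1;
        }
        printf("minor faults: %ld\n", ru.ru_minflt);  /* no I/O needed */
        printf("major faults: %ld\n", ru.ru_majflt);  /* read from disk */
        return 0;
    }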

Let's examine how the free list is managed under a Solaris system.

When a system boots, all of its memory is formed into pages, and a kernel data structure is created to hold their states. The kernel reserves a few megabytes of memory for itself and releases the rest to the free list. At some point, a process will request memory, and a minor page fault occurs. A page is then taken from the free list, zeroed, and made available to the process. This sort of behavior, in which memory is given out on an "as-needed" basis, has traditionally been called demand paging.

Pages are always taken from the head of the free list.

When the free list shrinks to a certain size (set by lotsfree, in units of pages), the kernel wakes up the page scanner (also called the pagedaemon), which begins to search for pages that it can steal to replenish the free list. The page scanner implements a two-step algorithm in order to avoid stealing pages that are being frequently accessed. The pagedaemon looks through memory in physical memory order and clears the MMU reference bit for each page. When a page is accessed, this bit is set. The page scanner then delays for a short while, waiting for actively used pages to be accessed and their reference bits set again. This delay is controlled by two parameters:

  • slowscan is the initial scan rate. Increasing this value causes the page scanner to run less often, but do more work when it does run.

  • fastscan is the scan rate when the free list is completely empty.

The pagedaemon then goes through memory again. If a particular page still has the reference bit cleared, then the page hasn't been touched, so it is stolen for recycling. Think of memory as a circular track. A train rides on this track; as the engine in the front of the train passes a tie, that tie's reference bit is cleared. When the caboose gets around to a tie, if that tie's reference bit is still cleared, that page is recycled.
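
The two hands can be sketched in a few lines of C. Everything here (NPAGES, HANDSPREAD, steal_page) is invented for illustration and corresponds to no actual kernel symbol; in a real system the MMU sets reference bits as processes touch pages in the interval between the two hands, which this toy does not model.

    /* Illustrative sketch of the two-handed clock, not kernel code. */
    #include <stdio.h>

    #define NPAGES     8192   /* hypothetical number of physical pages */
    #define HANDSPREAD 2048   /* hypothetical distance between the hands */

    struct page { int ref_bit; };    /* set to 1 by the MMU on access */
    static struct page page[NPAGES];
    static int stolen;

    static void steal_page(struct page *p)  /* stand-in for page reclaim */
    {
        (void)p;
        stolen++;
    }

    static void scan_once(void)
    {
        for (int i = 0; i < NPAGES; i++) {
            int front = i;                                   /* leading hand  */
            int back  = (i + NPAGES - HANDSPREAD) % NPAGES;  /* trailing hand */

            page[front].ref_bit = 0;        /* front hand: clear the bit */

            if (page[back].ref_bit == 0)    /* untouched since it was cleared */
                steal_page(&page[back]);    /* back hand: reclaim the page */
        }
    }

    int main(void)
    {
        for (int i = 0; i < NPAGES; i += 4)  /* pretend every 4th page is hot */
            page[i].ref_bit = 1;

        scan_once();
        printf("pages stolen in one pass: %d\n", stolen);
        return 0;
    }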

Some pages are very resistant to recycling, such as those that belong to the kernel and those that are shared between more than eight processes (this helps to keep shared libraries in memory). If the stolen page does not contain cached filesystem data, then it is moved to the page-out queue, marked as pending I/O, and eventually written out to swap space with a cluster of other pages. If the page is being used for filesystem caching, it is not written to disk. In either case, the page is placed on the end of the free list, but not cleared; the kernel remembers that this particular page still stores valid data.

If the average size of the free list over a 30-second interval is less than desfree , the kernel begins to take desperate measures to free memory: inactive processes are swapped out, and pages are written out immediately rather than being clustered. If the 5-second average of free memory is less than minfree , active processes start to be swapped out. The page scanner stops looking for memory to reclaim when the size of the free list increases above lotsfree .

The page scanner is governed by a parameter, maxpgio , which limits the rate at which I/O occurs to the paging devices. It is, by default, set very low (40 pages per second on sun4c , sun4m , and sun4u architectures, and 60 pages per second on sun4d ), in order to prevent saturating the paging device with I/Os. This rate is inadequate for a modern system with fast disks, and should be set to 100 times the number of spindles of configured swap space. If a process tries to access a recycled page, another minor page fault is incurred. There are now two possibilities:

  • The page that was recycled has been placed back on the free list, but hasn't been reused yet. The kernel gives the page back to the process.

  • The page has been recycled completely, and is now in use by another process. A major page fault is incurred and the paged-out data is read into a new page, taken from the free list.

Since pages can be shared between multiple processes, special precautions need to be taken to ensure that changes to a page remain local to the process that made them. If a process tries to write to a shared page, it incurs a copy-on-write fault. [5] A page is taken from the free list, and a copy of the original shared page is made for the process. When a process completes, all of its nonshared pages are returned to the free list.

[5] Often called a COW fault; not to be confused with a ruminant falling into a chasm.
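
A copy-on-write fault is easy to provoke on purpose: after fork(2), parent and child initially share the same physical pages, and the first write by either side faults and receives a private copy. This minimal sketch shows that the child's store is not visible to the parent.

    /* Sketch: copy-on-write in action.  The child's write triggers a COW
     * fault and a private copy, so the parent still sees the old value. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        int *value = malloc(sizeof(*value));
        *value = 42;

        pid_t pid = fork();
        if (pid < 0) { perror("fork"); return 1; }

        if (pid == 0) {              /* child */
            *value = 99;             /* write faults; page is copied */
            printf("child  sees %d\n", *value);   /* 99 */
            _exit(0);
        }

        wait(NULL);                  /* parent */
        printf("parent sees %d\n", *value);       /* still 42 */
        free(value);
        return 0;
    }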

4.2.5.1 Virtual memory management in Linux

This mechanism is used in most modern memory management systems, but often with some minor changes. Here, I discuss the Linux 2.2 kernel, as at the time of this writing the 2.4 kernel was still undergoing significant revisions to the VM subsystem. The Linux kernel implements a different set of kernel variables that tune memory behavior in slightly different ways.

When the size of a Linux system's free list drops below the value of freepages.high , the system begins to gently page. When the available memory drops below the freepages.low value, the system pages heavily, and if the free list shrinks beneath freepages.min , only the kernel can allocate more memory. The freepages tunables are located in the /proc/sys/vm/freepages file, which is formatted as freepages.min freepages.low freepages.high .

The pagedaemon under Linux is called kswapd, which frees as many pages as necessary to get the system's free list back over the freepages.high mark. kswapd's behavior is controlled by three parameters, called tries_base, tries_min, and swap_cluster, located in that order in the /proc/sys/vm/kswapd file. The most important tunable is swap_cluster, which is the number of pages that kswapd will write out in a single turn. This value should be large enough so that kswapd does its I/O in large chunks, but small enough that it won't overload the disk I/O request queue. If you find yourself getting short on memory and paging heavily, you probably will want to experiment with increasing this value, to buy yourself more bandwidth to the paging space.

When a page fault is taken in Linux, multiple pages are actually read in to avoid multiple short trips to disk; the precise number of pages that are brought into memory is given by 2^n, where n is the value of the page-cluster tunable (the only value in the /proc/sys/vm/page-cluster tunable file). It is pointless to set this value above 5.
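
If you want to inspect these tunables from a program rather than the shell, the sketch below simply reads the three 2.2-era files discussed above; it assumes a kernel that still provides them (later kernels replaced this interface with other sysctls).

    /* Sketch: dump the Linux 2.2 VM tunables discussed in the text. */
    #include <stdio.h>

    static void show(const char *path)
    {
        char line[256];
        FILE *fp = fopen(path, "r");
        if (fp == NULL) {
            perror(path);
            return;
        }
        if (fgets(line, sizeof(line), fp) != NULL)
            printf("%s: %s", path, line);
        fclose(fp);
    }

    int main(void)
    {
        show("/proc/sys/vm/freepages");     /* min low high */
        show("/proc/sys/vm/kswapd");        /* tries_base tries_min swap_cluster */
        show("/proc/sys/vm/page-cluster");  /* n: 2^n pages read per fault */
        return 0;
    }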

There are two important things to take from the page lifecycle described here. The first is the manner by which the page scanner operates, and the parameters that define its behavior. The page scanner manages the ebb and flow of the free list by controlling paging, which can have a tremendous impact on performance. The second is the way pages are allocated and released; minor page faults occur when a process requests a page of memory, and major page faults occur if that requested page doesn't contain the data the process expects it to. Table 4-1 summarizes the kernel variables related to the behavior of the virtual memory system.

Table 4-1. A summary of memory-related kernel variables

Solaris

lotsfree

The target size of the free list, in pages. Set by default to one sixty-fourth of physical memory, with a minimum of 512 KB (64 pages on an 8 KB page size system, or 128 pages on a 4 KB page size system). The Linux equivalent is freepages.high.

desfree

The minimum number of free pages, averaged over a 30-second interval, before swapping starts. Set to one-half of lotsfree by default. Roughly equivalent to Linux freepages.low.

minfree

The threshold of minimum free memory, averaged over a 5-second interval, before active swapping of processes starts. Represented in pages. Set by default to one-half of desfree. There is no equivalent parameter under Linux.

slowscan

The slowest scan rate (e.g., when scanning first starts), in pages. Set by default to one sixty-fourth of the size of physical memory, with a minimum of 512 KB; that is, 64 pages on an 8 KB page size system, or 128 pages on a 4 KB page size system. There is no Linux equivalent.

fastscan

The fastest scan rate (e.g., with the free list completely empty), in pages. Set by default to half of physical memory, with a maximum of 64 MB; that is, 8,192 pages on an 8 KB page size system, or 16,384 pages on a 4 KB page size system. There is no Linux equivalent.

maxpgio

The maximum number of I/Os per second that will be queued by the page scanner. Set by default to 40 on all systems except those based on the sun4d architecture, which have a default setting of 60. There is no Linux equivalent.

Linux (2.2.x)

freepages.high

The target size of the free list, in pages. When free memory drops below this level, the system starts to gently page. Set by default to 786. The Solaris equivalent is lotsfree.

freepages.low

The threshold for aggressive memory reclamation, in pages. By default, this is set to 512. A rough Solaris equivalent is minfree.

freepages.min

The minimum number of pages on the free list. When this value is reached, only the kernel can allocate more memory. Set to 256 by default. There is no equivalent parameter under Solaris.

page-cluster

The number of pages read in on a page fault is given by 2^page-cluster. The default value is 4. There is no equivalent parameter under Solaris.

4.2.6 Page Coloring

The precise way that pages are organized in processor caches can have a dramatic effect on application performance. The optimal placement of pages depends on how the application utilizes memory -- some applications access memory almost randomly , whereas others access memory in a strict sequential order -- so there is no single algorithm that provides universally optimal results.

The free list is actually organized into a number of specifically colored bins. The number of bins is given by the physical second-level cache size divided by the page size (for example, a system with a 64 KB L2 cache and an 8 KB page size would have 8 bins). When a page is returned to the free list, it is assigned to a specific bin. When a page is required, one is taken from a specific colored bin. That choice is made based on the virtual address of the requested page. (Recall that pages exist in physical memory, and are physically addressable, but processes only understand virtual addresses; a process incurs a page fault on a virtual address in its own address space, and that page fault is filled by a page with a physical address.) The algorithms we are discussing here decide how pages are assigned colors. There are three available page coloring algorithms in Solaris 7 and later:

  • The default algorithm (numbered 0) uses a hashing algorithm against the virtual address, in order to distribute pages as evenly as possible.

  • The first optional algorithm (numbered 1) assigns colors to pages so that physical addresses map directly to virtual addresses.

  • The second optional algorithm (numbered 2) assigns pages to bins in a round robin mechanism.

Another algorithm, called Kessler's Best Bin (numbered 6), was supported on Ultra Enterprise 10000 systems running Solaris 2.5.1 and 2.6 only. It used a history-based mechanism to try to assign pages to the least-used bins.

The default page coloring algorithm was chosen because it consistently provides good performance. You are not likely to see any performance improvement from tuning these parameters on typical commercial workloads. However, scientific applications may see a significant performance gain -- some testing will reveal the page coloring algorithm that fits best with your application's memory usage patterns. You can change the algorithm by setting the system parameter consistent_coloring to the number of the algorithm you'd like to use.
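
Returning to the bin arithmetic at the start of this subsection, the sketch below simply divides an assumed L2 cache size by the system's page size; the 512 KB cache figure is an assumption for illustration, so substitute your processor's actual external cache size.

    /* Sketch: number of page coloring bins = L2 cache size / page size. */
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        long l2_cache = 512 * 1024;             /* assumed: 512 KB L2 cache */
        long pagesize = sysconf(_SC_PAGESIZE);  /* e.g., 4 KB or 8 KB */

        if (pagesize > 0)
            printf("page coloring bins: %ld\n", l2_cache / pagesize);
        return 0;
    }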

4.2.7 Translation Lookaside Buffers (TLB)

While we're discussing the physical to virtual address relationship, it's important to mention the translation lookaside buffer (TLB). One of the jobs of the hardware memory management unit is to translate a virtual address, as provided by the operating system, into a physical address. It does this by looking up entries in a page translation table. The most recent translations are cached in the translation lookaside buffer. Intel and older SPARC processors (pre-UltraSPARC) use hardware mechanisms to populate the TLB, whereas the UltraSPARC architecture uses software algorithms. While a detailed discussion of the impact of the TLB on performance is a bit beyond the scope of this book, you should be generally aware of what a TLB is and what it does.


