

10.2. Pages: The Basic Unit of Solaris Memory

Pages are the fundamental unit of physical memory in the Solaris memory management subsystem. In this section, we discuss how pages are structured, how they are located, and how free lists manage pools of pages within the system. Physical memory is divided into pages, and every active (not free) page in the Solaris kernel is a mapping between a file (vnode) and memory. A page's identity is its vnode/offset pair: the vnode pointer and the page-aligned offset within that vnode. The vnode/offset pair also names the page's backing store, the file and offset that the page is mapping. The page structure and associated lists are shown in Figure 10.2.

Figure 10.2. The Page Structure


The hardware address translation (HAT) and address space layers manage the mapping between a physical page and its virtual address space (which is described in Chapter 12). The key property of the vnode/offset pair is reusability; that is, we can reuse each physical page for another task by simply synchronizing its contents in RAM with its backing store (the vnode and offset) before the page is reused.

For example, we can reuse a page of heap memory from a process by simply copying the contents to its vnode and offset, which in this case will copy the contents to the swap device. The same mechanism is used for caching files, and we simply use the vnode/offset pair to reference the file that the page is caching. If we were to reuse a page of memory that was caching a regular file, then we simply synchronize the page with its backing store (if the page has been modified) or just reuse the page if it is not modified and does not need resyncing with its backing store.

10.2.1. The Page Hash List

The VM system hashes pages with identity (a valid vnode/offset pair) onto a global hash list so that they can be located by vnode and offset. Three page functions search the global page hash list: page_find(), page_lookup(), and page_lookup_nowait(). These functions take a vnode and offset as arguments and return a pointer to a page structure if found.

The global hash list is an array of pointers to linked lists of pages. The functions use a hash to index into the page_hash array to locate the list of pages that contains the page with the matching vnode/offset pair. Figure 10.3 shows how the page_find() function indexes into the page_hash array to locate a page matching a given vnode/offset.

Figure 10.3. Locating Pages by Their Vnode/Offset Identity


page_find() locates a page as follows:

  1. It calculates the slot in the page_hash array containing a list of potential pages by using the PAGE_HASH_FUNC macro, shown below.

    #define PAGE_HASHSZ             page_hashsz
    #define PAGE_HASHAVELEN         4
    #define PAGE_HASH_FUNC(vp, off) \
            ((((uintptr_t)(off) >> PAGESHIFT) + \
                    ((uintptr_t)(off) >> (PAGESHIFT + PH_SHIFT_SIZE)) + \
                    ((uintptr_t)(vp) >> 3) + \
                    ((uintptr_t)(vp) >> (3 + PH_SHIFT_SIZE)) + \
                    ((uintptr_t)(vp) >> (3 + 2 * PH_SHIFT_SIZE))) & \
                    (PAGE_HASHSZ - 1))
                                                            See vm/page.h

  2. It uses the PAGE_HASH_SEARCH macro, shown below, to search the list referenced by the slot for a page matching vnode/offset. The macro traverses the linked list of pages until it finds such a page.


    #define PAGE_HASH_SEARCH(index, pp, vp, off) { \
            for ((pp) = page_hash[(index)]; (pp); (pp) = (pp)->p_hash) { \
                    if ((pp)->p_vnode == (vp) && (pp)->p_offset == (off)) \
                            break; \
            } \
    }
                                                            See vm/page.h
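
The two macros can be modeled in user space. The sketch below is illustrative only: demo_page, demo_hash_func(), demo_enter(), and demo_find() are invented names, the hash is a simplified stand-in for PAGE_HASH_FUNC, and all locking is omitted.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_PAGESHIFT 13            /* 8-Kbyte pages, as on UltraSPARC */
#define DEMO_HASHSZ    256           /* must be a power of two */

/* Simplified page: its identity is the vnode/offset pair. */
struct demo_page {
    uintptr_t         vp;            /* stand-in for the vnode pointer */
    uint64_t          off;           /* offset within that vnode */
    struct demo_page *hash_next;     /* chain link, like p_hash */
};

static struct demo_page *demo_hash[DEMO_HASHSZ];

/* Same shape as PAGE_HASH_FUNC: mix offset and vnode bits, mask to a slot. */
static unsigned demo_hash_func(uintptr_t vp, uint64_t off)
{
    return (unsigned)(((off >> DEMO_PAGESHIFT) + (vp >> 3)) &
                      (DEMO_HASHSZ - 1));
}

/* Hash a page with identity onto the global list (head insertion). */
static void demo_enter(struct demo_page *pp)
{
    unsigned slot = demo_hash_func(pp->vp, pp->off);
    pp->hash_next = demo_hash[slot];
    demo_hash[slot] = pp;
}

/* Same shape as PAGE_HASH_SEARCH: walk the chain comparing identity. */
static struct demo_page *demo_find(uintptr_t vp, uint64_t off)
{
    struct demo_page *pp;
    for (pp = demo_hash[demo_hash_func(vp, off)]; pp != NULL;
        pp = pp->hash_next)
        if (pp->vp == vp && pp->off == off)
            break;
    return pp;
}
```

As in the kernel, a lookup hashes the vnode/offset pair to a slot and then walks only that slot's chain, so the average chain stays short when the table is sized sensibly.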

10.2.2. Page Structures

The page structure is as follows:

typedef struct page {
        u_offset_t      p_offset;        /* offset into vnode for this page */
        struct vnode    *p_vnode;        /* vnode that this page is named by */
        selock_t        p_selock;        /* shared/exclusive lock on the page */
#if defined(_LP64)
        int             p_selockpad;     /* pad for growing selock */
#endif
        struct page     *p_hash;         /* hash by [vnode, offset] */
        struct page     *p_vpnext;       /* next page in vnode list */
        struct page     *p_vpprev;       /* prev page in vnode list */
        struct page     *p_next;         /* next page in free/intrans lists */
        struct page     *p_prev;         /* prev page in free/intrans lists */
        ushort_t        p_lckcnt;        /* number of locks on page data */
        ushort_t        p_cowcnt;        /* number of copy on write lock */
        kcondvar_t      p_cv;            /* page struct's condition var */
        kcondvar_t      p_io_cv;         /* for iolock */
        uchar_t         p_iolock_state;  /* replaces p_iolock */
        uchar_t         p_szc;           /* page size code */
        uchar_t         p_fsdata;        /* file system dependent byte */
        uchar_t         p_state;         /* p_free, p_noreloc */
        uchar_t         p_nrm;           /* non-cache, ref, mod readonly bits */
#if defined(__sparc)
        uchar_t         p_vcolor;        /* virtual color */
#else
        uchar_t         p_embed;         /* x86 - changes p_mapping & p_index */
#endif
        uchar_t         p_index;         /* MPSS mapping info. Not used on x86 */
        uchar_t         p_toxic;         /* page has an unrecoverable error */
        void            *p_mapping;      /* hat specific translation info */
        pfn_t           p_pagenum;       /* physical page number */
        uint_t          p_share;         /* number of translations */
#if defined(_LP64)
        uint_t          p_sharepad;      /* pad for growing p_share */
#endif
        uint_t          p_msresv_1;      /* reserved for future use */
#if defined(__sparc)
        uint_t          p_kpmref;        /* number of kpm mapping sharers */
        struct kpme     *p_kpmelist;     /* kpm specific mapping info */
#else
        /* index of entry in p_map when p_embed is set */
        uint_t          p_mlentry;
#endif
        uint64_t        p_msresv_2;      /* page allocation debugging */
} page_t;
                                                            See vm/page.h


The stored information includes bits that indicate whether the page has been referenced or modified, for use in the page scanner (covered later in the chapter). The page structure also contains a pointer to the HAT-specific mapping information, p_mapping, which points to a machine-specific hat structure.

10.2.3. Free List and Cache List

The free list and the cache list hold pages that are not mapped into any address space and that have been freed by page_free(). The sum of these lists is reported in the free column of vmstat. Even though vmstat reports these pages as free, they can still contain a valid page from a vnode/offset and hence are still part of the global page cache. Memory on the cache list is not really free; it is a valid cache of a page from a file. However, pages are moved from the cache list to the free list, and their contents discarded, if the free list becomes exhausted. The cache list exemplifies how the file systems use memory as a file system cache.

The free list contains pages that no longer have a vnode and offset associated with them, which can only occur if the page has been destroyed and removed from a vnode's hash list. The free list is generally very small, since most pages that are no longer used by a process or the kernel still keep their vnode/offset information intact. Pages are put on the free list when a process exits, at which point all of the anonymous memory pages (heap, stack, and copy-on-write pages) are freed.

The cache list is a hashed list of pages that still have mappings to a valid vnode and offset. Recall that pages can be obtained from the cache list by the page_lookup() routine. This function accepts a vnode and offset as arguments and returns a page structure. If the page is found on the cache list, then it is removed from the cache list and returned to the caller. When we find and remove pages from the cache list, we are reclaiming a page. Page reclaims are reported by vmstat in the "re" column.
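
As a rough user-space model of this reclaim behavior (cached_page, cache_lookup(), and reclaim_count are hypothetical names, and the real cache list is hashed by vnode/offset, not a flat array):

```c
#include <stddef.h>
#include <stdint.h>

/* Toy cache list: pages that are "free" but still cache vnode/offset data. */
struct cached_page {
    uintptr_t vp;                    /* stand-in for the vnode pointer */
    uint64_t  off;                   /* offset within that vnode */
    int       valid;                 /* still on the cache list? */
};

static int reclaim_count;            /* vmstat's "re" column, in miniature */

/* Model of page_lookup() against the cache list: a hit removes the page
 * from the list (a reclaim) and hands it back with its contents intact;
 * a miss means the caller must create and fill a new page. */
static struct cached_page *cache_lookup(struct cached_page *list, size_t n,
                                        uintptr_t vp, uint64_t off)
{
    for (size_t i = 0; i < n; i++) {
        if (list[i].valid && list[i].vp == vp && list[i].off == off) {
            list[i].valid = 0;       /* removed from the cache list */
            reclaim_count++;
            return &list[i];
        }
    }
    return NULL;
}
```

The point of the model is the asymmetry: a reclaim is cheap (no I/O, the file data is already in memory), whereas a miss forces the caller to allocate a page and read the data from backing store.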

10.2.4. Physical Page "memseg" Lists

The Solaris kernel uses a segmented global physical page list, consisting of segments of contiguous physical memory. (Many hardware platforms now present memory in noncontiguous groups.) Contiguous physical memory segments are added during system boot. They are also added and deleted dynamically when physical memory is added and removed while the system is running. Figure 10.4 shows the arrangement of the physical page lists into contiguous segments.

Figure 10.4. Contiguous Physical Memory Segments


10.2.5. The Page-Level Interfaces

The Solaris virtual memory system implementation has grouped page management and manipulation into a central group of functions. These functions are used by the segment drivers and file systems to create, delete, and modify pages. The major page-level interfaces are shown in Table 10.1.

Table 10.1. Solaris 10 Page-Level Interfaces

page_create()
    Creates pages. Page coloring is based on a hash of the vnode offset. page_create() is provided for backward compatibility only. Don't use it if you don't have to. Instead, use the page_create_va() function so that pages are correctly colored.

page_create_va()
    Creates pages, taking into account the virtual address they will be mapped to. The address is used to calculate page coloring.

page_exists()
    Tests whether a page for a given vnode/offset exists.

page_find()
    Searches the hash list for a page with the specified vnode and offset that is known to exist and is already locked.

page_first()
    Finds the first page on the global page hash list.

page_free()
    Frees a page. Pages with a vnode/offset go onto the cache list; other pages go onto the free list.

page_isfree()
    Checks whether a page is on the free list.

page_ismod()
    Checks whether a page is modified. This function checks only the software bit in the page structure. To sync the MMU bits with the page structure, you may need to call hat_pagesync() before calling page_ismod().

page_isref()
    Checks whether a page has been referenced; checks only the software bit in the page structure. To sync the MMU bits with the page structure, you may need to call hat_pagesync() before calling page_isref().

page_isshared()
    Checks whether a page is shared across more than one address space.

page_lookup()
    Finds a page representing the specified vnode/offset. If the page is found on a free list, then it is removed from the free list.

page_lookup_nowait()
    Finds a page representing the specified vnode/offset that is not locked or on the free list.

page_needfree()
    Informs the VM system that we need some pages freed up. Calls to page_needfree() must be symmetric; that is, each call must be followed by another page_needfree() with the negation of the same amount once the task is complete.

page_next()
    Finds the next page on the global page hash list.

page_lock()
    Locks a page structure, either exclusively or shared.

page_unlock()
    Unlocks a page structure.

page_release()
    Unlocks a page structure after unmapping it and, if appropriate, places it back on the cache list. This allows the file systems to recycle the page cache through the cache list rather than waiting for the page scanner to garbage collect it later.


The page_create_va() function allocates pages. It takes the number of pages to allocate as an argument and returns a linked list of pages taken from the free list. page_create_va() also takes a virtual address as an argument so that it can implement page coloring (discussed in Section 10.2.7). page_create_va() subsumes the older page_create() function and should be used by all newly developed subsystems because page_create() may not correctly color the allocated pages.
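
The allocation contract can be sketched with a toy free list. This is a model, not kernel code: demo_page, demo_init(), and demo_page_create_va() are invented names, color-bin selection is elided, and the real function can block or back out partial allocations rather than simply failing.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy free list backing a page_create_va()-style allocator. */
struct demo_page {
    struct demo_page *next;          /* like p_next on the free list */
};

static struct demo_page pool[8];
static struct demo_page *freelist;

static void demo_init(void)
{
    freelist = NULL;
    for (int i = 0; i < 8; i++) {
        pool[i].next = freelist;
        freelist = &pool[i];
    }
}

/* Sketch of the contract: take npages off the free list and return them
 * as a linked page list. The virtual address argument is what lets the
 * real function choose correctly colored pages; here it is unused. */
static struct demo_page *demo_page_create_va(uintptr_t vaddr, size_t npages)
{
    (void)vaddr;                     /* color-bin selection elided */
    struct demo_page *list = NULL;
    for (size_t i = 0; i < npages; i++) {
        if (freelist == NULL)
            return NULL;             /* real code blocks or backs out */
        struct demo_page *pp = freelist;
        freelist = pp->next;
        pp->next = list;
        list = pp;
    }
    return list;
}
```

Callers receive the pages already linked through their list pointers, which is why the interface returns a page list rather than a single page.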

10.2.6. The Page Throttle

Solaris implements a page creation throttle so that a small core of memory remains available for critical parts of the kernel. The throttle, implemented in the page_create() and page_create_va() functions, causes page creation to block when the PG_WAIT flag is specified and available memory is less than the system global parameter throttlefree. By default, throttlefree is set to the same value as the system global parameter minfree. By default, memory allocated through the kernel memory allocator specifies PG_WAIT and is subject to the page-create throttle. (See Section 11.2 for more information on kernel memory allocation.)
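
The throttle test reduces to a simple predicate. The sketch below is illustrative: would_throttle() is an invented name, the tunable values are placeholders (the real values are derived from physmem at boot), and "blocking" is modeled as returning true.

```c
#include <stdbool.h>

#define PG_WAIT 0x1                  /* caller is willing to sleep */

/* Placeholder tunables, in pages; real values are set at boot. */
static long freemem      = 1000;     /* currently available memory */
static long throttlefree = 128;      /* defaults to minfree */

/* Sketch of the throttle check inside page_create_va(): a PG_WAIT
 * allocation blocks (here: returns true) once freemem drops below
 * throttlefree; callers without PG_WAIT are not throttled here. */
static bool would_throttle(int flags)
{
    return (flags & PG_WAIT) != 0 && freemem < throttlefree;
}
```

Because the kernel memory allocator passes PG_WAIT by default, ordinary kernel allocations stall at this point under memory pressure instead of consuming the last reserve of pages.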

10.2.7. Page Coloring

Some interesting effects result from the organization of pages within the processor caches, and as a result, the page placement policy within these caches can dramatically affect processor performance. When pages overlay other pages in the cache, they can displace cache data that we might not want overlaid, resulting in less cache utilization and "hot spots."

The optimal placement of pages in the cache often depends on the memory access patterns of the application; that is, is the application accessing memory in a random order, or is it doing some sort of strided ordered access? Several different algorithms can be selected in the Solaris kernel to implement page placement; the default attempts to provide the best overall performance.

To understand how page placement can affect performance, let's look at the cache configuration and see when page overlaying and displacement can occur. The UltraSPARC-I and -II implementations use virtually addressed L1 caches and physically addressed L2 caches. The L2 cache is arranged in lines of 64 bytes, and transfers are done to and from physical memory in 64-byte units. Figure 12.2 shows the architecture of the UltraSPARC-I and -II CPU modules with their caches. The L1 cache is 16 Kbytes, and the L2 (external) cache can vary between 512 Kbytes and 8 Mbytes. We can query the operating system with mdb to see the size of the caches reported to the operating system. The L1 cache size is recorded in the vac_size parameter, and the L2 cache size is recorded in the ecache_size parameter.

# mdb -k
> vac_size/D
vac_size:       16384
> ecache_size/D
ecache_size:    1048576


We'll start by using the L2 cache as an example of how page placement can affect performance. The physical addressing of the L2 cache means that the cache is organized in page-sized multiples of the physical address space, which means that the cache effectively has only a limited number of page-aligned slots. The number of effective page slots in the cache is the cache size divided by the page size. To simplify our examples, let's assume we have a 32-Kbyte L2 cache (much smaller than reality), which means that if we have a page size of 8 Kbytes, there are four page-sized slots on the L2 cache. The cache does not necessarily read and write 8-Kbyte units from memory; it does that in 64-byte chunks, so in reality our 32-Kbyte cache has 512 addressable slots. Figure 10.5 shows how our cache would look if we laid it out linearly.

Figure 10.5. Physical Page Mapping into a 32-Kbyte Physical Cache


The L2 cache is direct-mapped from physical memory. If we were to access physical addresses on a 32-Kbyte boundary, for example, offsets 0 and 32768, then both memory locations would map to the same cache line. If we were to alternate between these two addresses, the cache line for the offset 0 address would be read in and then flushed (cleared), the cache line for the offset 32768 address read in and then flushed, the first reloaded, and so on. This ping-pong effect in the cache is known as cache flushing (or cache ping-ponging), and it effectively reduces our performance to real-memory speed rather than cache speed. By accessing memory on our 32-Kbyte cache-size boundary, we have effectively used only 64 bytes of the cache (one cache line), rather than the full cache size. Memory is often 10 to 20 times slower than cache, so this effect can have a dramatic impact on performance.
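
The arithmetic behind the example can be checked directly. The constants below describe the hypothetical 32-Kbyte cache from the text, and cache_line_index() is an invented helper, not a kernel function.

```c
#include <stdint.h>

/* The example cache: 32 Kbytes, direct mapped, 64-byte lines, 8-Kbyte pages. */
enum {
    DEMO_CACHE_SIZE = 32 * 1024,
    DEMO_PAGE_SIZE  = 8 * 1024,
    DEMO_LINE_SIZE  = 64,
    DEMO_PAGE_SLOTS = DEMO_CACHE_SIZE / DEMO_PAGE_SIZE,  /* 4 page slots   */
    DEMO_LINE_SLOTS = DEMO_CACHE_SIZE / DEMO_LINE_SIZE,  /* 512 line slots */
};

/* Direct mapping: a physical address selects exactly one cache line, so
 * any two addresses exactly DEMO_CACHE_SIZE apart collide on that line. */
static unsigned cache_line_index(uint64_t paddr)
{
    return (unsigned)((paddr / DEMO_LINE_SIZE) % DEMO_LINE_SLOTS);
}
```

Offsets 0 and 32768 produce the same line index, which is exactly the ping-pong collision described above, while addresses one line apart land in different slots.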

Our simple example was based on the assumption that we were accessing physical memory in a regular pattern, but we don't program to physical memory; rather, we program to virtual memory. Therefore, the operating system must provide a sensible mapping between virtual memory and physical memory; otherwise, effects such as our example can occur.

By default, physical pages are assigned to an address space in the order in which they appear in the free list. In general, the first time a machine boots, the free list may have physical memory in a linear order, and we may end up with the behavior described in our "ping-pong" example. Once a machine has been running, the physical page free list becomes randomly ordered, and subsequent reruns of an identical application could get very different physical page placement and, as a result, very different performance. On early Solaris implementations, this is exactly what customers saw: differing performance for identical runs, by as much as 30 percent.

To provide better and consistent performance, the Solaris kernel uses a page coloring algorithm when pages are allocated to a virtual address space. Rather than being randomly allocated, the pages are allocated with a specific predetermined relationship between the virtual address to which they are being mapped and their underlying physical address. The virtual-to-physical relationship is predetermined as follows: The free list of physical pages is organized into specifically colored bins, one color bin for each slot in the physical cache; the number of color bins is determined by the ecache size divided by the page size. (In our example, there would be exactly four colored bins.)

When a page is put on the free list, the page_free() algorithms assign it to a color bin corresponding to its physical address. When a page is consumed from the free list, the virtual-to-physical algorithm takes the page from a physical color bin, chosen as a function of the virtual address to which the page will be mapped. The algorithm requires that when allocating pages from the free list, the page create function must know the virtual address to which a page will be mapped.
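The bin bookkeeping for the running 32-Kbyte example can be sketched as follows. The constants and the helpers color_of_paddr() and bin_for_vaddr() are illustrative names; the second helper models the simple "P. Addr = V. Addr" policy (algorithm 1 in Table 10.2), not the default hashed algorithm.

```c
#include <stdint.h>

enum {
    ECACHE_DEMO = 32 * 1024,                 /* the example L2 cache  */
    PAGESZ_DEMO = 8 * 1024,                  /* 8-Kbyte pages         */
    NBINS_DEMO  = ECACHE_DEMO / PAGESZ_DEMO, /* 4 color bins          */
};

/* page_free() side: a page's color bin follows from its physical
 * address, i.e., which cache slot the page occupies. */
static unsigned color_of_paddr(uint64_t paddr)
{
    return (unsigned)((paddr / PAGESZ_DEMO) % NBINS_DEMO);
}

/* Allocation side: under the "P. Addr = V. Addr" policy, the bin is
 * chosen from the virtual address the page will be mapped at, so
 * virtual and physical addresses land on the same cache slot. */
static unsigned bin_for_vaddr(uint64_t vaddr)
{
    return (unsigned)((vaddr / PAGESZ_DEMO) % NBINS_DEMO);
}
```

Two physical pages exactly one cache size apart share a bin (they would collide in the cache), and an allocation for a given virtual address draws from the matching bin.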

New pages are allocated by calling the page_create_va() function [1]. The page_create_va() function accepts the virtual address of the location to which the page is going to be mapped as an argument; then, the virtual-to-physical color bin algorithm can decide which color bin to take physical pages from. The page_create_va() function is described with the page management functions in Table 10.1.

[1] The page_create_va() function deprecates the older page_create() function. We chose to add a new function rather than add an argument to the existing page_create() so that existing third-party loadable kernel modules that call page_create() remain functional. However, because page_create() does not know about virtual addresses, it has to pick a color at random, which can cause significant performance degradation. The page_create_va() function should always be used for new code.

No one algorithm suits all applications because different applications have different memory access patterns. Over time, the page coloring algorithms used in the Solaris kernel have been refined as a result of extensive simulation, benchmarks, and customer feedback. The kernel supports a default algorithm and two optional algorithms. The default algorithm was chosen according to the following criteria:

  • Fairly consistent, repeatable results

  • Good overall performance for the majority of applications

  • Acceptable performance across a wide range of applications

The default algorithm uses a hashing algorithm to distribute pages as evenly as possible throughout the cache. The default and other available page coloring algorithms are shown in Table 10.2.

Table 10.2. Solaris Page Coloring Algorithms

0. Hashed VA
    The physical page color bin is chosen by a hashed algorithm to ensure even distribution of virtual addresses across the cache. A skew based on a hash of the process address is included so that each process uses a different address range, preventing pathological cache conflicts when many similar processes are running.

1. P. Addr = V. Addr
    The physical page color is chosen so that physical addresses map directly to the virtual addresses (as in our example).

2. Bin Hopping
    Physical pages are allocated with a round-robin method.


You can change the default algorithm by setting the system parameter consistent_coloring, either on-the-fly with mdb or permanently in /etc/system.

# mdb -kw
> consistent_coloring/D
consistent_coloring:            0
> consistent_coloring/W 1
consistent_coloring:            0x0             =       0x1


So, which algorithm is best? Well, your mileage will vary, depending on your application. Page coloring usually only makes a difference on memory-intensive scientific applications, and the defaults are usually fine for commercial or database systems. If you have a time-critical scientific application, then we recommend that you experiment with the different algorithms and see which is best. Remember that some algorithms will produce different results for each run, so aggregate as many runs as possible.




Solaris Internals: Solaris 10 and OpenSolaris Kernel Architecture (2nd Edition)
ISBN: 0131482092