9.10. Changes to Support Large Pages

Solaris provides support for large MMU pages, as part of the Multiple Page Sizes for Solaris (MPSS) infrastructure. In this section, we discuss enhancements to page allocation and segment drivers.

9.10.1. System View of a Large Page

Solaris implements large pages in a way that localizes changes to only a few layers of the virtual memory system and that does not slow down operations and applications not using large pages. The file systems and I/O layers need no knowledge of large pages. The only exception is the swap file system (swapfs), which is explained in more detail later.

Solaris 9 added a p_szc field to the page_t structure. This field used to exist in the sun4u version of the machpage_t structure. Since support for multiple page sizes above the HAT and platform layers is generic, the representation of large page sizes is also generic. As in Solaris 2.6, a large page is represented as a physically contiguous range of the kernel's minimum page-size (PAGESIZE) pages that is physically aligned on the large-page-size boundary. For example, a 64-Kbyte page is represented by eight page_t structures, each with its p_szc field set to 1, and the physical address of the first page_t aligned on a 64-Kbyte boundary. The size codes for 8-Kbyte, 64-Kbyte, 512-Kbyte, and 4-Mbyte pages are 0, 1, 2, and 3, respectively.

Most large-page operations loop across the large page, performing operations in PAGESIZE increments. Another advantage of representing a large page as a group of PAGESIZE page_t structures is that page_lookup() operations are not slowed down. If the implementation had instead represented a large page with a single, variable-size page_t, page_lookup() would have had to repeatedly rehash on misses across all supported page sizes. The implementation instead does one lookup on the page_t and checks its p_szc field. At that point, it is easy to compute the index of the base page_t (the first constituent page of the large page). As long as any one of the constituent pages of a large page is locked, the size of the large page cannot change; the constituent pages must be exclusively locked to change their size.
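
The base-page computation is cheap. A minimal sketch (with hypothetical helper names, assuming the sun4u size codes above, where each size step multiplies the page size by 8, so a size-code-szc page spans 8^szc constituent pages):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative sketch, not the Solaris source: size codes 0..3 map to
 * 8-Kbyte, 64-Kbyte, 512-Kbyte, and 4-Mbyte pages, so a large page of
 * size code szc spans 8^szc = 1 << (3 * szc) PAGESIZE constituent
 * pages. */
static uint64_t
pages_per_large_page(uint32_t szc)
{
        return ((uint64_t)1 << (3 * szc));
}

/* Given any constituent page frame number and the p_szc found by a
 * single page_lookup(), mask down to the base (first) constituent
 * page, which is aligned on the large-page-size boundary. */
static uint64_t
base_pfn(uint64_t pfn, uint32_t szc)
{
        return (pfn & ~(pages_per_large_page(szc) - 1));
}
```

The masking works because, as described above, a large page is always physically aligned on its own size boundary.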

9.10.2. Free List Organization

The page free list consists of two logical free lists: the cache list and the free list. The cache list contains free pages that still have identity. That is, they contain valid data. The free list contains pages without identity. In Solaris 2.6, the free list was subdivided into free lists per page size for large-page ISM support. The cache list supports only PAGESIZE pages since we do not support large pages for mapped files.

Each free list and the cache list are subdivided into bin lists, one for each page color. UltraSPARC uses direct-mapped, physically indexed external caches; in this context, the number of page colors is the size of the external cache divided by the page size. There is a group of these lists for each memory node on the machine to support NUMA.
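
The color arithmetic can be sketched as follows (illustrative helpers, not Solaris functions): for a direct-mapped, physically indexed cache, a page's color is simply which of the cache-sized bins its physical address falls into.

```c
#include <assert.h>
#include <stdint.h>

/* Number of page colors for a direct-mapped, physically indexed
 * external cache: the cache size divided by the page size. */
static uint64_t
page_colors(uint64_t ecache_size, uint64_t page_size)
{
        return (ecache_size / page_size);
}

/* A page's color: which cache bin its physical address maps to.
 * Pages of the same color compete for the same cache lines. */
static uint64_t
page_color(uint64_t paddr, uint64_t ecache_size, uint64_t page_size)
{
        return ((paddr / page_size) % page_colors(ecache_size, page_size));
}
```

Spreading the free list into per-color bins lets the allocator hand out pages that map to different parts of the cache.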

9.10.3. Large-Page Faulting

In this section we describe the following aspects of large-page faulting:

  • How a large page is created on first access.

  • How a large page is paged in from the swap device.

  • How we handle COW (copy-on-write) faults.

  • How we do locking for I/O.

9.10.3.1. Page Size-Up/Size-Down Policy

A page-size code field (s_szc) in the segment structure provides the preferred page size to the lower levels of the virtual memory system. The segment driver attempts to allocate pages in accordance with this page size. This field is normally set with SEGOP_SETPAGESIZE() (e.g., from memcntl(2)) or at creation time in segvn_create(). The latter is based on two kernel variables, mpss_brkpgszsel and mpss_stkpgszsel, which determine the default page size for heaps and stacks at exec time. They are passed to segvn_create() through the exec code and can be used to set the default heap and stack page size for all processes in the system.

These kernel variables are also useful for MPSS stress testing. If a variable is set to -1, a random page size is selected, on the basis of the value of the tick register, and passed into segvn_create(). At fault time, if the page is not found in the system, segvn allocates a page of the size specified in the segment size-code field s_szc. If a large page of that size cannot be allocated, segvn immediately tries to allocate a smaller one. This approach is better than blocking until a large page becomes available. At this point the choice is between the next size down and just trying PAGESIZE pages. For processors that have fully associative TLBs, it makes more sense to keep trying the next size down until successful.
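
The fall-back policy can be sketched as follows (hypothetical names; the real segvn allocation path is more involved):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical allocator callback: returns true if a page of the
 * given size code could be allocated. Stands in for the real
 * page-level allocation, which this sketch does not reproduce. */
typedef bool (*alloc_fn_t)(uint32_t szc);

/* Demo allocator for the usage example below: pretend only 8-Kbyte
 * (szc 0) and 64-Kbyte (szc 1) pages are currently available. */
static bool
demo_alloc(uint32_t szc)
{
        return (szc <= 1);
}

/* Size-down policy sketch: start at the segment's preferred size
 * code (s_szc) and, rather than blocking for a large page, fall back
 * immediately. On processors with fully associative TLBs, retry each
 * smaller size code in turn; otherwise drop straight to PAGESIZE
 * (szc 0). Returns the size code allocated, or -1 if even PAGESIZE
 * fails. */
static int
alloc_with_size_down(uint32_t s_szc, bool fully_assoc_tlb, alloc_fn_t alloc)
{
        uint32_t szc = s_szc;

        for (;;) {
                if (alloc(szc))
                        return ((int)szc);
                if (szc == 0)
                        return (-1);
                szc = fully_assoc_tlb ? szc - 1 : 0;
        }
}
```

With demo_alloc, a fully-associative-TLB caller asking for a 4-Mbyte page (szc 3) steps down through 512 Kbytes to 64 Kbytes, while the other policy jumps straight to PAGESIZE.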

9.10.3.2. Page-In

MAP_PRIVATE anonymous pages are allocated in swapfs. At fault time, if the ANON layer cannot find the page in the system, it invokes the swapfs VOP_GETPAGE() interface to create the page. If the contents are on backing store, they are brought in by swapfs to fill the newly allocated page.

The page identity (vnode, offset) naming rules for anonymous pages made it impossible to handle large pages with a single call to the current VOP_GETPAGE() interface for swapfs. It was designed to deal only with PAGESIZE pages. The ANON layer allocates an anon slot for each anonymous page.

struct anon {
        struct vnode *an_vp;     /* vnode of anon page */
        struct vnode *an_pvp;    /* vnode of physical backing store */
        anoff_t       an_off;    /* offset of anon page */
        anoff_t       an_poff;   /* offset in vnode */
        struct anon  *an_hash;   /* hash table of anon slots */
        int           an_refcnt; /* # sharing slot */
};
                                                           See vm/anon.h


The ANON layer then calls into swapfs to assign a unique (vnode, offset) identity to the anonymous page. This identity is later used by the ANON layer and swapfs for page lookups and creations for the life of the anonymous page. To generate a unique vnode and offset, swapfs simply splits the virtual address of the anon slot into two portions: one generates a vnode index, and the other an offset. swapfs maintains a pool of fake vnodes identified by their vnode index. Since the address of the anon slot in the system is unique, the vnode/offset page name generated from it is also unique. VOP_GETPAGE() takes a vnode, offset, and length as part of its arguments. To make a single call through this interface for handling a large page would mean that the vnode would have to be the same for all constituent pages of the large page.
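
The naming scheme can be illustrated with a toy split (the split point and helper names here are invented; swapfs's actual encoding differs):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative only: derive a (vnode index, offset) pair by splitting
 * the unique kernel virtual address of the anon slot. The high bits
 * pick one of a pool of fake swapfs vnodes; the low bits become the
 * offset within that vnode. DEMO_OFF_BITS is an assumed split point,
 * not swapfs's real one. */
#define DEMO_OFF_BITS   22

static uint64_t
anon_to_vnode_index(uint64_t anon_addr)
{
        return (anon_addr >> DEMO_OFF_BITS);
}

static uint64_t
anon_to_offset(uint64_t anon_addr)
{
        return (anon_addr & (((uint64_t)1 << DEMO_OFF_BITS) - 1));
}
```

Because the split is reversible, distinct anon-slot addresses always yield distinct (vnode, offset) names, which is the property the text relies on.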

The vnode offset for the base constituent page should be aligned on the large-page-size boundary with each successive constituent page's offset increasing in PAGESIZE increments. With the current naming scheme, a large page can actually have different vnodes for its constituent pages with improperly aligned offsets.

A simplified solution to this problem is used: for a large page, the ANON layer first preallocates the large page. It then loops across the large page in PAGESIZE increments, invoking swapfs's VOP_GETPAGE() interface to insert the page in the system and fill in the contents from backing store if necessary. Each constituent page of the preallocated large page is passed down to swapfs through a new pointer (t_vmdata) added to the thread structure. If swapfs sees that a preallocated constituent page was passed down, it knows it is handling a large page. It first checks whether a page already exists in the system. If a page of the same size or larger is found, that page's constituent page is returned and the preallocated large page is freed. If a smaller page is found, swapfs relocates the constituent page of the smaller page into the constituent page of the preallocated page and frees the old one. This is done in PAGESIZE steps as the ANON layer loops across the range of the large page.

9.10.3.3. Copy-on-Write Faults

New functions in the ANON layer help segvn deal with large pages. Copy-on-write (COW) faults for MAP_PRIVATE anonymous large pages are handled much as they are for PAGESIZE pages. The anon structure keeps a reference count so that the system knows when the last reference to a page has been removed and the page can be freed. A large page is made up of PAGESIZE constituent pages; therefore, each constituent page of the large page also has an anon slot associated with it.

When we handle a COW fault for a large page, we need to make sure we do not partially share a large page. That is, from the ANON layer's point of view we do not allow partial sharing of a subrange of the anon slots representing a large page. This restriction greatly simplifies the freeing of large pages, since we can free them only when the reference counts for the anon slots of all constituent pages are zero. While handling a COW fault in the ANON layer, if we find that a large page cannot be allocated, we allocate enough smaller pages (adding up to the large-page size) to satisfy the COW. This guarantees that our reference counts stay consistent across all the anon slots in the large-page region, even if we fail to allocate a large page.
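
The freeing rule can be sketched as follows (hypothetical helper operating on an array of the an_refcnt values for the large page's constituent anon slots):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Sketch of the rule above: because partial sharing of the anon
 * slots behind a large page is never allowed, the page may be freed
 * as a large page only once the reference count of every constituent
 * page's anon slot has dropped to zero. */
static bool
large_page_freeable(const int *an_refcnt, uint64_t npgs)
{
        for (uint64_t i = 0; i < npgs; i++) {
                if (an_refcnt[i] != 0)
                        return (false);
        }
        return (true);
}
```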

9.10.3.4. Page Locking for I/O

In Solaris 2.6, a fast path was added for locking pages during I/O and for caching the page list. The caching was an improvement over the old method, as_fault(F_SOFTLOCK), for locking the pages and their MMU translations during I/O setup. The MMU translations had to be locked so that the nexus driver could call into the HAT and build a list of page frame numbers for DMA. Later, when the I/O completed, the translations and pages were unlocked. For I/O-intensive applications, the overhead of repetitively locking the same pages and their MMU translations, calling into the HAT for the page frame list, and then unlocking was expensive.

The new fast path locks the pages once, inserts the list into a segment page cache, and then passes the page list (shadow list) down to the nexus driver through the buf(9S) framework. With the fast path we no longer need to lock the MMU translations, since we already have the page list. When the I/O completes, the shadow list can be left in the segment page cache, which means the pages remain locked. If the I/Os continue to come in, are of a short enough duration, and keep using the same buffer in the user address space, then the as_pagelock() fast path will probably find the shadow list in the segment page cache with an inexpensive lookup. It then passes the shadow list down to the nexus driver, thereby reducing relock/unlock overhead.

When we insert a shadow list into the segment page cache with segvn_pagelock(), we need to decrement availrmem by the number of the pages in the shadow list. For I/O with large pages, holding the shared/exclusive lock of one or more constituent pages in a large page has the effect of locking the entire large page. The reason is that a large page cannot be freed or demoted unless we first lock all constituent pages exclusively. Therefore, in segvn_pagelock(), for large pages we adjust the pagelock region, the address, and the length to include the entire large page(s) and decrement availrmem according to the new size.

We do not want to just decrement availrmem by the large-page size without also adjusting the address and length of the request. If we did not resize the request, we could end up decrementing availrmem by large-page amounts for every constituent page locked by a new as_pagelock() request. Resizing thus has the advantage of coalescing multiple requests within the large page(s) into a single segment page cache entry and decrementing availrmem by the correct amount.
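
The range adjustment is simple alignment arithmetic. A sketch (hypothetical helpers; the real segvn_pagelock() operates on segment addresses):

```c
#include <assert.h>
#include <stdint.h>

/* Round a pagelock request down to the large-page boundary
 * containing its start. pgsz must be a power of two. */
static uint64_t
pagelock_start(uint64_t addr, uint64_t pgsz)
{
        return (addr & ~(pgsz - 1));
}

/* Length of the adjusted request: round the end of the request up to
 * the next large-page boundary, so the locked region always covers
 * whole large pages and availrmem is decremented once per large
 * page rather than once per constituent-page request. */
static uint64_t
pagelock_len(uint64_t addr, uint64_t len, uint64_t pgsz)
{
        uint64_t end = (addr + len + pgsz - 1) & ~(pgsz - 1);

        return (end - pagelock_start(addr, pgsz));
}
```

For example, an 8-Kbyte request in the middle of a 512-Kbyte page expands to the one enclosing 512-Kbyte page, while a request straddling a boundary expands to cover both large pages.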

9.10.4. Large-Page Freeing

Large pages are handled mostly as regular pages are: they are returned to the free list when no longer needed. There are, however, the following differences:

  • Dirty page cleanup with page-out and msync(3C). When memory pressure becomes high, the page-out code tries to clean dirty pages by writing them out to their backing store so that they can possibly be freed. Applications themselves can also synchronize the contents of their dirty pages with their backing store through msync(3C). In both cases, this synchronization involves the use of VOP_PUTPAGE() to write out the dirty page. As explained earlier with anonymous page naming, the VOP interface for swapfs cannot operate across a large page in a single call.

    For both page-out and msync(3C), we simply try to demote the large page to PAGESIZE pages. To demote a large page, we try to exclusively lock each constituent page of the large page. If successful, we can unload the mappings and safely reset the p_szc to 0 for each constituent page. This allows us to handle the old large page as PAGESIZE pages are handled today.

  • Placement of freed large pages. Freed large pages are placed back on the free list, never the cache list. They are freed as large pages only when they are being destroyed; for example, when a process exits or unmaps a segment and there is no low-level sharing of the large page because of fork(2).

  • User page locking with mlock(3C). Using mlock(3C) for large pages is no different from using it for PAGESIZE pages. If an application wants to lock a large page in memory, it must lock the entire range of the large-page mapping. If the application locks only a subrange of a large-page mapping, the system is still free to demote the page and take away the unlocked portions if free memory is low and pressure is high.
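
The demotion performed for page-out and msync(3C) above can be sketched as follows (toy page model and lock flags, not the real page_t or its lock primitives):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical page model for illustration only. */
typedef struct demo_page {
        uint32_t p_szc;         /* size code; 0 == PAGESIZE */
        bool     locked_excl;   /* exclusive lock held? */
} demo_page_t;

/* Demotion sketch: a large page may be reset to PAGESIZE pages only
 * if every constituent page can be exclusively locked. On success,
 * the caller would unload the mappings; here we just reset p_szc to
 * 0 and drop the locks. If any constituent page is already locked,
 * back out and report failure. */
static bool
demote_large_page(demo_page_t *pp, uint64_t npgs)
{
        uint64_t i;

        for (i = 0; i < npgs; i++) {
                if (pp[i].locked_excl)  /* someone else holds it */
                        goto fail;
                pp[i].locked_excl = true;
        }
        for (i = 0; i < npgs; i++) {
                pp[i].p_szc = 0;
                pp[i].locked_excl = false;
        }
        return (true);
fail:
        while (i-- > 0)
                pp[i].locked_excl = false;
        return (false);
}

/* Demo driver: build an npgs-constituent 64-Kbyte page, optionally
 * pre-lock one constituent page, and attempt demotion. */
static bool
demote_demo(uint64_t npgs, int prelocked_idx)
{
        demo_page_t pages[512];
        uint64_t i;

        for (i = 0; i < npgs; i++) {
                pages[i].p_szc = 1;
                pages[i].locked_excl = (i == (uint64_t)prelocked_idx);
        }
        return (demote_large_page(pages, npgs));
}
```

This mirrors why an mlock(3C) on a subrange does not protect the rest of the page: a single held lock blocks demotion, but once all constituent pages can be locked exclusively, the large page quietly becomes ordinary PAGESIZE pages.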

9.10.5. Operations That Interfere with Large Pages

The implementation adds support for large pages as transparently as possible. Two operations, however, can destroy large pages: unaligned requests for changing memory protections and unaligned requests for unloading mappings.

Consider the example of a process using 512-Kbyte pages for a MAP_PRIVATE /dev/zero mapping in its address space. It then attempts to execute munmap(2) or mprotect(2) on an 8-Kbyte range in the middle of the mapping. The segment driver, segvn, detects this, demotes the range affected by the operation to PAGESIZE pages, and returns an internal retry value (IE_RETRY) back up to the address space (AS) layer. The AS layer then drops the address space lock, reacquires it, and restarts the operation. Page size is a property of a segment.

So, demoting a range in the address space, as in the example above, can involve splitting existing segments to accommodate the unaligned request. Manipulating the seg list in an address space requires that the address space lock be held as writer. This requirement was already met for SEGOP_UNMAP() but not for SEGOP_SETPROT(). Since large pages can now involve manipulation of the seg lists for possible demote operations, the address space lock is now also taken as writer for SEGOP_SETPROT() operations. This locking is not a performance problem, since SETPROTs are relatively rare.

Also, the watchpoint facility in /proc works by changing protections at PAGESIZE granularity. The implications for large pages are the same as above. The process may end up losing some or all of its large pages if it had any.

9.10.6. HAT Support

The Solaris sfmmu, the UltraSPARC HAT, has always supported large pages. Prior to MPSS, on the user side, only large (4-, 32-, or 256-Mbyte) pages for ISM were optimally supported. Only 8-Kbyte and 4-Mbyte pages are supported in the TSB (translation storage buffer). This means the trap handler for user dTLB misses probes the TSB for both 4-Mbyte and 8-Kbyte pages. If both miss, the trap handler searches the hardware mapping entry (HME) hash for the correct translation table entry (TTE). Since ISM is optimized to use 4-Mbyte pages, the trap handler first does a hash search for the 4-Mbyte TTE. If the search fails, the trap handler rehashes and searches again for an 8-Kbyte TTE.

For MPSS we generalized large-page support for any kind of user data TLB miss. Today, the HAT probes the TSB at most twice on user data TLB misses. That is, we use the 8-Kbyte pointer generated in hardware, and if that fails, we recalculate a 4-Mbyte pointer and reprobe. Although we could use the 64-Kbyte pointer generated by hardware and calculate a 512-Kbyte pointer in software, the expense of the extra TSB lookups would degrade performance of all applications. So, for 64-Kbyte and 512-Kbyte pages, we simply replicate from the 8-Kbyte pointer.

For example, a 64-Kbyte TTE can occupy 8 TSB entries, and a 512-Kbyte TTE can occupy 64 entries. This situation has two disadvantages. First, the TSB reach is the same for 64-Kbyte and 512-Kbyte pages as for 8-Kbyte pages. Second, it takes longer to warm up the TSB with the replicated TTEs and to demap TTEs of these two page sizes. The performance of handling TSB misses is critical. For that reason, we wanted to handle TSB misses for any page size in the trap handler and not in C code (at tl = 0). To that end, the HAT now keeps accurate per-mapping-size TTE counts. This information helps the TSB miss handler determine whether a particular mapping size even exists for the current context, saving rehashes for a process not using large pages, or for a particular page size that is not in use.

9.10.7. procfs Changes

The procfs xmap file exports page mapping size information. A new HAT page-size mapping field was added to the prxmap structure.




Solaris Internals: Solaris 10 and OpenSolaris Kernel Architecture (2nd Edition)
ISBN: 0131482092
Year: 2004
Pages: 244
