5.11. Paging

When the memory-management hardware detects an invalid virtual address, it generates a trap to the system. This page-fault trap can occur for several reasons. Most BSD programs are created in a format that permits the executable image to be paged into main memory directly from the filesystem. When a program in a demand-paged format is first run, the kernel marks as invalid the pages for the text and initialized-data regions of the executing process. The text and initialized-data regions share an object that provides fill-on-demand from the filesystem. As part of mapping in the object, the kernel traverses the list of pages associated with the object and marks them as resident and copy-on-write in the newly created process. For a heavily used executable with most of its pages already resident, this prepaging eliminates many of its initial page faults. Page faults occur as missing pages of the text or initialized-data region are first referenced, or as write attempts are made on pages in the initialized-data region.

Page faults can also occur when a process first references a page in the uninitialized-data region of a program. Here, the anonymous object managing the region automatically allocates memory to the process and initializes the newly assigned page to zero. The idle process fills part of its time by zeroing out pages on the free list so that the system does not have to do the zero fill when servicing the first reference to a page of an anonymous object. Other types of page faults arise when previously resident pages have been reclaimed by the system in response to a memory shortage.

All page faults are serviced by the vm_fault() routine. Each time vm_fault() is invoked, it is provided the virtual address that caused the fault. The first action of vm_fault() is to traverse the vm_map_entry list of the faulting process to find the entry associated with the fault. The routine then computes the logical page within the underlying object and traverses the list of objects to find or create the needed page. Once the page has been found, vm_fault() must call the machine-dependent layer to validate the faulted page, then return to restart the process.

The details of calculating the address within the object are described in Section 5.4. Having computed the offset within the object and determined the object's protection and object list from the vm_map_entry, the kernel is ready to find or create the associated page.
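In outline, the lookup at the start of vm_fault() works as in the following sketch. This is a minimal illustration, not the kernel's code: the structure and field names (start, end, offset, object) follow the description in Section 5.4, and vm_offset_t is assumed here to be an unsigned integral type wide enough to hold an address.

/* Illustrative types; not the kernel's declarations. */
typedef unsigned long vm_offset_t;

struct sketch_entry {
	vm_offset_t start;		/* first address mapped by entry */
	vm_offset_t end;		/* last address mapped, plus one */
	vm_offset_t offset;		/* offset of mapping within object */
	struct sketch_object *object;	/* head of backing-object list */
	struct sketch_entry *next;	/* next entry in the map */
};

/*
 * Find the map entry covering addr; on success, return the entry
 * and set *offsetp to the offset of addr within the underlying
 * object.
 */
struct sketch_entry *
sketch_map_lookup(struct sketch_entry *map, vm_offset_t addr,
    vm_offset_t *offsetp)
{
	struct sketch_entry *entry;

	for (entry = map; entry != NULL; entry = entry->next)
		if (addr >= entry->start && addr < entry->end) {
			*offsetp = entry->offset + (addr - entry->start);
			return (entry);
		}
	return (NULL);		/* invalid address: deliver a signal */
}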
The page-fault-handling algorithm is shown in Figure 5.11. In the following overview, the lettered points are references to the tags down the left side of the code.

Figure 5.11. Page-fault handling.

/*
 * Handle a page fault occurring at the given address,
 * requiring the given permissions, in the map specified.
 * If successful, insert the page into the associated
 * physical map.
 */
int
vm_fault(vm_map_t map, vm_offset_t addr, vm_prot_t type)
{
RetryFault:
	lookup address in map returning object/offset/prot;
	first_object = object;
[A]	for (;;) {
		page = lookup page at object/offset;
[B]		if (page found) {
			if (page busy)
				block and goto RetryFault;
			remove from paging queues;
			mark page as busy;
			break;
		}
[C]		if (object has nondefault pager or
		    object == first_object) {
			page = allocate a page for object/offset;
			if (no pages available)
				block and goto RetryFault;
		}
[D]		if (object has nondefault pager) {
			scan for pages to cluster;
			call pager to fill page(s);
			if (IO error)
				return an error;
			if (pager has page)
				break;
			if (object != first_object)
				free page;
		}
		/* no pager, or pager does not have page */
[E]		if (object == first_object)
			first_page = page;
		next_object = next object;
[F]		if (no next object) {
			if (object != first_object) {
				object = first_object;
				page = first_page;
			}
			first_page = NULL;
			zero fill page;
			break;
		}
		object = next_object;
	}
[G]	/* appropriate page has been found or allocated */
	orig_page = page;
[H]	if (object != first_object) {
		if (fault type == WRITE) {
			copy page to first_page;
			deactivate page;
			page = first_page;
			object = first_object;
		} else {
			prot &= ~WRITE;
			mark page copy-on-write;
		}
	}
[I]	if (prot & WRITE)
		mark page not copy-on-write;
	enter mapping for page;
	enter read-only mapping for clustered pages;
[J]	activate and unbusy page;
	if (first_page != NULL)
		unbusy and free first_page;
}

A. The loop traverses the list of shadow, anonymous, and file objects until it either finds an object that holds the sought-after page or reaches the final object in the list. If no page is found, the final object will be requested to produce it.

B. An object with the desired page has been found. If the page is busy, another process may be in the middle of faulting it in, so this process is blocked until the page is no longer busy. Since many things could have happened to the affected object while the process was blocked, it must restart the entire fault-handling algorithm. If the page was not busy, the algorithm exits the loop with the page.

C. Anonymous objects (such as those used to represent shadow objects) do not upgrade from the default pager to the swap pager until the first time that they need to push a page to backing store. Thus, if an object has a pager other than the default pager, there is a chance that the page previously existed but was paged out. If the object has a nondefault pager, the kernel needs to allocate a page to give to the pager to be filled (see D). The special case for the object being the first object is to avoid a race condition between two processes trying to get the same page. The first process through will create the sought-after page in the first object but keep it marked as busy. When the second process tries to fault the same page, it will find the page created by the first process and block on it (see B). When the first process completes the pagein processing, it will unlock the first page, causing the second process to awaken, retry the fault, and find the page created by the first process.

D. Before calling the pager, the kernel checks whether any of the eight pages on either side of the faulting page are eligible to be brought in at the same time. To be eligible, a page must be part of the object and neither already in memory nor part of another I/O operation. The pager is given the range of possible pages and told which one is the required page. It must return the required page if it holds a copy of it. The other pages are produced only if they are held by the object and can be easily read at the same time. If the required page is present in the file or swap area, the pager will bring it back into the newly allocated page. If the pagein succeeds, then the sought-after page has been found. If the page never existed, then the pagein request will fail. Unless this object is the first, the page is freed and the search continues. If this object is the first, the page is not freed, so it will act as a block to further searches by other processes (as described in C).
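The cluster scan in D can be sketched as follows. This is an illustration only: the CLUSTER constant and the predicates page_in_object(), page_resident(), and page_busy() stand in for the eligibility checks described above and are not real kernel interfaces.

#define	CLUSTER	8		/* pages to examine on each side */

struct sketch_object;

/* Illustrative predicates standing in for the checks above. */
int	page_in_object(struct sketch_object *, vm_offset_t);
int	page_resident(struct sketch_object *, vm_offset_t);
int	page_busy(struct sketch_object *, vm_offset_t);

/* Eligible: part of the object, not in memory, not in other I/O. */
static int
page_eligible(struct sketch_object *object, vm_offset_t pindex)
{
	return (page_in_object(object, pindex) &&
	    !page_resident(object, pindex) &&
	    !page_busy(object, pindex));
}

/*
 * Widen the request around the faulting page pindex to the range
 * [*firstp, *lastp] of eligible pages; only pindex itself must be
 * returned by the pager.
 */
static void
scan_for_cluster(struct sketch_object *object, vm_offset_t pindex,
    vm_offset_t *firstp, vm_offset_t *lastp)
{
	vm_offset_t i;

	*firstp = *lastp = pindex;
	for (i = 1; i <= CLUSTER && i <= pindex; i++) {
		if (!page_eligible(object, pindex - i))
			break;
		*firstp = pindex - i;
	}
	for (i = 1; i <= CLUSTER; i++) {
		if (!page_eligible(object, pindex + i))
			break;
		*lastp = pindex + i;
	}
}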
E. If the kernel created a page in the first object but did not use that page, it will have to remember that page so that it can free the page when the pagein is done (see J).

F. If the search has reached the end of the object list and has not found the page, then the fault is on an anonymous-object chain, and the first object in the list will handle the page fault using the page allocated in C. The first_page entry is set to NULL to show that it does not need to be freed, the page is zero filled (if the idle process has not already done so), and the loop is exited.

G. The search exits the loop with page as the page that has been found or allocated and initialized, and with object as the owner of that page. The page has been filled with the correct data at this point.

H. If the object providing the page is not the first object, then this mapping must be private, with the first object being a shadow object of the object providing the page. If pagein is handling a write fault, then the contents of the page that it has found have to be copied to the page that it allocated for the first object. Having made the copy, it can release the object and page from which the copy came, since the first object and first page will be used to finish the page-fault service. If pagein is handling a read fault, it can use the page that it found, but it has to mark the page copy-on-write to prevent the page from being modified in the future.

I. If pagein is handling a write fault, then it has made any copies that were necessary, so it can safely make the page writable. Because any pages around the required page that were brought into memory as part of the clustering were not copied, they are mapped read-only; if a write is later done on one of them, the full page-fault analysis will be done and a copy made at that time if necessary.

J. As the page and possibly the first_page are released, any processes waiting for that page of the object will get a chance to run to get their own references.

Note that the page and object locking has been elided in Figure 5.11 to simplify the explanation.

Hardware-Cache Design

Because the speed of CPUs has increased far more rapidly than the speed of main memory, most machines today require the use of a memory cache to allow the CPU to operate near its full potential. Code that describes the operation of a hardware cache is given in Figure 5.12. An actual cache is entirely implemented in hardware, so the loop shown in Figure 5.12 would really be done by parallel comparisons rather than iteratively.

Historically, most machines had a direct-mapped cache. With a direct-mapped cache, an access to byte N followed by an access to byte N + (CACHELINES × LINESIZE) causes the cached data for byte N to be lost. For example, with 64-byte lines and 8192 cache lines, two addresses that are 512 Kbytes apart map to the same cache line and evict each other. Most modern caches are either 2-way or 4-way set associative. A 4-way set-associative cache allows four different memory regions that overlap the same cache memory to be accessed without destroying the previously cached data. But on the fifth access at that offset, an earlier cached value is lost.
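The collision in a direct-mapped cache can be demonstrated with the line computation used in Figure 5.12. The program below is a minimal, self-contained illustration using the typical sizes given there; an address is modeled as an unsigned long.

#include <stdio.h>

#define	LINESIZE	64	/* bytes per cache line */
#define	CACHELINES	8192	/* lines in the cache */

int
main(void)
{
	unsigned long n, m;

	n = 0x12345;
	m = n + (unsigned long)CACHELINES * LINESIZE;

	/*
	 * Both addresses select the same line in a direct-mapped
	 * cache, so each access evicts the other's data.
	 */
	printf("line(n) = %lu\n", (n / LINESIZE) % CACHELINES);
	printf("line(m) = %lu\n", (m / LINESIZE) % CACHELINES);
	return (0);
}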
There are several cache-design choices that require cooperation with the virtual-memory system. The design option with the biggest effect is whether the cache uses virtual or physical addressing. A physically addressed cache takes the address from the CPU, runs it through the MMU to get the address of the physical page, and then uses this physical address to find out whether the requested memory location is available in the cache. Although a translation lookaside buffer (described in Section 5.13) significantly reduces the average latency of the translation, there is still a delay in going through the MMU. A virtually addressed cache uses the virtual address exactly as it comes from the CPU to find out whether the requested memory location is available in the cache. The virtual-address cache is faster than the physical-address cache because it avoids the time to run the address through the MMU. However, the virtual-address cache must be flushed completely after each context switch, because the virtual addresses of one process are indistinguishable from the virtual addresses of another process. By contrast, a physical-address cache needs to flush only a few individual entries when their associated physical page is reassigned. In a system with many short-running processes, a virtual-address cache is flushed so frequently that it is seldom useful.

Figure 5.12. Hardware-cache algorithm. Key:

LINESIZE	number of bytes in each cache line; 64 or 128 bytes is typical
CACHELINES	number of lines in the cache; 8192 is a typical size
SETSIZE		1 for a direct-mapped cache, 2 for a 2-way set-associative cache, or 4 for a 4-way set-associative cache

struct cache {
	vm_offset_t key;	/* address of data */
	char data[LINESIZE];	/* cache data */
} cache[CACHELINES][SETSIZE];

/*
 * If present, get data for addr from cache. Otherwise fetch
 * data from main memory, place in cache, and return it.
 */
hardware_cache_fetch(vm_offset_t addr)
{
	vm_offset_t key, line;

	line = (addr / LINESIZE) % CACHELINES;
	for (key = 0; key < SETSIZE; key++)
		if (cache[line][key].key == addr)
			break;
	if (key < SETSIZE)
		return (cache[line][key].data);
	key = select_replacement_key(line);
	cache[line][key].key = addr;
	return (cache[line][key].data = fetch_from_RAM(addr));
}

A further refinement to the virtual-address cache is to add a process tag to the key field for each cache line. At each context switch, the kernel loads a hardware context register with the tag assigned to the process. Each time an entry is recorded in the cache, both the virtual address and the process tag that faulted it are recorded in the key field of the cache line. The cache looks up the virtual address as before, but, when it finds an entry, it compares the tag associated with that entry to the hardware context register. If they match, the cached value is returned. If they do not match, the correct value and current process tag replace the old cached value. When this technique is used, the cache does not need to be flushed completely at each context switch, since multiple processes can have entries in the cache. The drawback is that the kernel must manage the process tags. Usually, there are fewer tags (8 to 16) than there are processes. The kernel must assign the tags to the active set of processes. When an old process drops out of the active set to allow a new one to enter, the kernel must flush the cache entries associated with the tag that it is about to reuse.
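A sketch of this refinement, built on the structures of Figure 5.12, might look as follows. The context register is modeled as a variable loaded by the kernel at each context switch; the names are illustrative, select_replacement_key() and fetch_from_RAM() are the same placeholders used in Figure 5.12 (here assumed to return an index and a pointer to the fetched line, respectively), and <string.h> is assumed for memcpy().

struct tagged_cache {
	vm_offset_t key;	/* virtual address of data */
	int tag;		/* tag of process that faulted it in */
	char data[LINESIZE];	/* cache data */
} tagged_cache[CACHELINES][SETSIZE];

int context_register;		/* loaded at each context switch */

char *
tagged_cache_fetch(vm_offset_t addr)
{
	vm_offset_t set;
	int way;

	set = (addr / LINESIZE) % CACHELINES;
	for (way = 0; way < SETSIZE; way++)
		if (tagged_cache[set][way].key == addr &&
		    tagged_cache[set][way].tag == context_register)
			return (tagged_cache[set][way].data);
	/* Miss, or a hit on another process's entry: replace it. */
	way = select_replacement_key(set);
	tagged_cache[set][way].key = addr;
	tagged_cache[set][way].tag = context_register;
	memcpy(tagged_cache[set][way].data, fetch_from_RAM(addr),
	    LINESIZE);
	return (tagged_cache[set][way].data);
}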
A final consideration is a write-through versus a write-back cache. A write-through cache writes the data to main memory at the same time that it writes the cache, forcing the CPU to wait for the memory access to complete. A write-back cache writes the data only to the cache, delaying the memory write until an explicit request or until the cache entry is reused. The write-back cache allows the CPU to resume execution more quickly and permits multiple writes to the same cache block to be consolidated into a single memory write. However, the writes must be forced out any time that the data must be visible to other CPUs on a multiprocessor.

Page Coloring

Page coloring is a performance optimization designed to ensure that accesses to contiguous pages in virtual memory make the best use of a physical-address cache. The Intel architecture uses a physical-address cache referred to as the L1 cache. The performance of the system is closely tied to the percentage of memory references that can be serviced by the L1 cache, because the L1 cache operates at the processor frequency. A miss in the L1 cache costs a few cycles of processor delay to wait for data in the L2 cache, or tens of cycles to wait for the dynamic RAM that makes up main memory.

A process on FreeBSD runs in a virtual-address space, not a physical-address space. The physical pages underlying the virtual-address space are rarely contiguous. In the worst case, two consecutive virtual pages could be stored in two physical pages backed by the same location in the L1 cache. Consider a process running on a CPU with a 1 Mbyte physical cache, with its first virtual page mapped at the physical page at address 1 Mbyte and its second virtual page mapped at the physical page at address 2 Mbyte. Because both addresses are multiples of the 1 Mbyte cache size, both pages occupy the same location in the cache. Despite maintaining locality of reference between its two consecutive virtual pages, the process unintentionally keeps missing the L1 cache. The effect is mitigated somewhat by multiway set-associative caches, but it is still problematic.

The role of the page-coloring algorithm is to ensure that consecutive pages in virtual memory are contiguous from the point of view of the cache. Instead of assigning random physical pages to virtual addresses, page coloring assigns cache-contiguous physical pages to virtual addresses. Two physical pages are cache-contiguous if they hit consecutive pages in the cache; the pages themselves may be far apart in physical memory. Figure 5.13 shows a sample assignment of pages to a process. In Figure 5.13, the cache holds 4 pages, and page 10 of physical memory is assigned to page 0 of the process's virtual memory. Physical-memory page 10 is color 2 of the cache (10 modulo 4 equals 2). The page-coloring code has assigned page 15 of physical memory to page 1 of the process's virtual memory because page 15 has color 3. The page-coloring code attempts to avoid assigning pages 6 or 14, because they map over the same cache memory as page 10 and would result in nonoptimal caching, whereas any of pages 7, 11, or 15 is optimal, since they all follow the cache memory used for page 10.

Figure 5.13. Assignment of physical pages to a process.

Page coloring is implemented by creating as many free lists as there are pages in the cache. Each of the pages in the cache is assigned a color. A machine with 1 Mbyte of cache and 4 Kbyte pages would have 256 cache pages and hence 256 colors. At system startup, each vm_page records its color in the L1 cache. When a page is freed, it is put onto the free list corresponding to its color.
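A minimal sketch of these structures, for the 256-color example above, might look like the following. The types and names are illustrative, not the kernel's.

#define	SKETCH_PAGE_SIZE	4096		/* 4 Kbyte pages */
#define	SKETCH_CACHE_SIZE	(1024 * 1024)	/* 1 Mbyte L1 cache */
#define	NCOLORS	(SKETCH_CACHE_SIZE / SKETCH_PAGE_SIZE)	/* 256 colors */

struct sketch_vm_page {
	unsigned long phys_page;	/* physical page number */
	int color;			/* phys_page % NCOLORS, set at startup */
	struct sketch_vm_page *next;	/* free-list linkage */
};

/* One free list per color. */
struct sketch_vm_page *free_lists[NCOLORS];

/* When a page is freed, place it on the list for its color. */
void
sketch_page_free(struct sketch_vm_page *page)
{
	page->next = free_lists[page->color];
	free_lists[page->color] = page;
}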
When an object is created, it is assigned a starting color. To keep the demand for colors moderately balanced, the starting color to be used for the next object is advanced by the smaller of the size of the object and one-third of the size of the L1 cache. When a page fault occurs on an object, the preferred color for the new page is the object's starting color plus the logical page number of the faulting page within the object, modulo the number of colors. A page is taken from the free list of that color if one is available. Otherwise, the colors furthest away from the requested color are checked until an available page is found. A page furthest away is used because the algorithm wants to minimize the possibility of a cache collision with a neighboring page in the virtual-address space. For example, if the next-higher-colored page were used, it would likely collide with the page that the algorithm would select for the next-higher virtual address.

Page coloring makes virtual memory as deterministic as physical memory from the perspective of cache performance. Thus, programs can be written under the assumption that the characteristics of the underlying hardware cache are the same for their virtual-address space as they would be if the program had been run directly in a physical-address space.
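As a concrete illustration of the allocation policy just described, the following sketch uses the structures from the previous sketch. The object field and the exact order in which fallback colors are probed are illustrative assumptions, not the kernel's implementation.

struct sketch_vmobject {
	int start_color;	/* assigned when the object is created */
};

struct sketch_vm_page *
sketch_page_alloc(struct sketch_vmobject *object,
    unsigned long logical_page)
{
	struct sketch_vm_page *page;
	int preferred, delta, color, sign;

	preferred = (int)((object->start_color + logical_page) % NCOLORS);

	/* Best case: a free page of the preferred color. */
	if ((page = free_lists[preferred]) != NULL) {
		free_lists[preferred] = page->next;
		return (page);
	}

	/*
	 * Otherwise search from the color furthest from the preferred
	 * one inward, minimizing collisions with the colors that
	 * neighboring virtual pages will request.
	 */
	for (delta = NCOLORS / 2; delta > 0; delta--)
		for (sign = -1; sign <= 1; sign += 2) {
			color = (preferred + sign * delta + NCOLORS) %
			    NCOLORS;
			if ((page = free_lists[color]) != NULL) {
				free_lists[color] = page->next;
				return (page);
			}
		}
	return (NULL);		/* no free pages of any color */
}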