Whereas working sets describe the resident pages owned by a process or the system, the page frame number (PFN) database describes the state of each page in physical memory. The page states are listed in Table 7-20.
The PFN database consists of an array of structures that represent each physical page of memory on the system. The PFN database and its relationship to page tables are shown in Figure 7-33. As this figure shows, valid PTEs point to entries in the PFN database, and the PFN database entries (for nonprototype PFNs) point back to the page table that is using them. For prototype PFNs, the PFN database entries point back to the prototype PTE.
Figure 7-33. Page tables and the page frame number database
Of the page states listed in Table 7-20, six are organized into linked lists so that the memory manager can quickly locate pages of a specific type. (Active/valid pages and transition pages aren't in any systemwide page list.) Figure 7-34 shows an example of how these entries are linked together.
Figure 7-34. Page lists in the PFN database
In the next section, you'll find out how these linked lists are used to satisfy page faults and how pages move to and from the various lists.
EXPERIMENT: Viewing the PFN Database
Using the kernel debugger !memusage command, you can dump the size of the various paging lists. The following is the output from this command:
lkd> !memusage
loading PFN database
loading (100% complete)
Compiling memory usage data (99% Complete).
             Zeroed:   8474 (  33896 kb)
               Free:    256 (   1024 kb)
            Standby:  50790 ( 203160 kb)
           Modified:    496 (   1984 kb)
    ModifiedNoWrite:      0 (      0 kb)
       Active/Valid: 201980 ( 807920 kb)
         Transition:      1 (      4 kb)
            Unknown:      0 (      0 kb)
              TOTAL: 261997 (1047988 kb)
Page List Dynamics
Figure 7-35 shows a state diagram for page frame transitions. For simplicity, the modified-no-write list isn't shown.
Figure 7-35. State diagram for page frames
Page frames move between the paging lists in the following ways:
When the memory manager needs a zero-initialized page to service a demand-zero page fault (a reference to a page that is defined to be all zeros or to a user-mode committed private page that has never been accessed), it first attempts to get one from the zero page list. If the list is empty, it gets one from the free page list and zeroes the page. If the free list is empty, it goes to the standby list and zeroes that page.
One reason zero-initialized pages are required is to meet C2 security requirements. C2 specifies that user-mode processes must be given initialized page frames to prevent them from reading a previous process's memory contents. Therefore, the memory manager gives user-mode processes zeroed page frames unless the page is being read in from a mapped file. If that's the case, the memory manager prefers to use nonzeroed page frames, initializing them with the data off the disk.
The zero page list is populated from the free list by a system thread called the zero page thread (thread 0 in the System process). The zero page thread waits on an event object to signal it to go to work. When the free list has eight or more pages, this event is signaled. However, the zero page thread will run only if no other threads are running, because the zero page thread runs at priority 0 and the lowest priority that a user thread can be set to is 1.
In Windows Server 2003 and later, when memory needs to be zeroed as a result of a physical page allocation by a driver that calls MmAllocatePagesForMdl, by a Windows application that calls AllocateUserPhysicalPages, or by an application that allocates large pages, the memory manager zeroes the memory using a higher-performing function called MiZeroInParallel, which maps and zeroes much larger regions than the single page at a time handled by the zero page thread. In addition, on multiprocessor systems, the memory manager creates additional system threads to perform the zeroing in parallel (and in a NUMA-optimized fashion on NUMA platforms).
When the memory manager doesn't require a zero-initialized page, it goes first to the free list. If that's empty, it goes to the zeroed list. If the zeroed list is empty, it goes to the standby list. Before the memory manager can use a page frame from the standby list, it must first backtrack and remove the reference from the invalid PTE (or prototype PTE) that still points to the page frame. Because entries in the PFN database contain pointers back to the previous user's page table (or to a prototype PTE for shared pages), the memory manager can quickly find the PTE and make the appropriate change.
When a process has to give up a page out of its working set (either because it referenced a new page and its working set was full or because the memory manager trimmed its working set), the page goes to the standby list if it was clean (not modified) or to the modified list if it was modified while resident. When a process exits, all its private pages go to the free list. Likewise, when the last reference to a page-file-backed section is closed, the section's pages go to the free list.
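The allocation fallback orders and eviction routing described above can be sketched as a small model. This is purely illustrative: the PageLists class and its field names are invented for this sketch, not actual kernel structures, and the zeroing and PTE-unlink steps are elided.

```python
class PageLists:
    """Toy model of the zeroed, free, standby, and modified page lists."""

    def __init__(self):
        self.zeroed, self.free = [], []
        self.standby, self.modified = [], []

    def allocate(self, zero_needed):
        """Return a PFN, using the fallback orders described above."""
        if zero_needed:
            # Demand-zero fault: zero list, then free (zero it), then standby.
            order = [self.zeroed, self.free, self.standby]
        else:
            # Zeroed contents not required: free list first, then zeroed,
            # then standby.
            order = [self.free, self.zeroed, self.standby]
        for lst in order:
            if lst:
                # A standby page would also need its old transition PTE
                # unlinked, and a non-zeroed frame would be zeroed when
                # zero_needed is set; both steps are elided here.
                return lst.pop(0)
        raise MemoryError("no pages available")

    def evict(self, pfn, dirty, process_exiting=False):
        """Route a page leaving a working set to the proper list."""
        if process_exiting:
            self.free.append(pfn)      # private pages of an exiting process
        elif dirty:
            self.modified.append(pfn)  # must be written back before reuse
        else:
            self.standby.append(pfn)   # clean copy already exists on disk
```

For example, with the zero list empty and PFN 7 on the free list, `allocate(True)` takes PFN 7 from the free list (and, in the real system, zeroes it) before ever touching the standby list.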
EXPERIMENT: Viewing Page Fault Behavior
With the Pfmon tool (in the Windows 2000 and 2003 resource kits, as well as in the Windows XP Support Tools), you can watch page fault behavior as it occurs. A soft fault refers to a page fault satisfied from one of the transition lists. Hard faults refer to a disk-read. The following example is a portion of output you'll see if you start Notepad with Pfmon and then exit. Be sure to notice the summary of page fault activity at the end.
C:\> pfmon notepad
SOFT: KiUserApcDispatcher : KiUserApcDispatcher
SOFT: LdrInitializeThunk : LdrInitializeThunk
SOFT: 0x77f61016 : : 0x77f61016
SOFT: 0x77f6105b : : fltused+0xe00
HARD: 0x77f6105b : : fltused+0xe00
SOFT: LdrQueryImageFileExecutionOptions : LdrQueryImageFileExecutionOptions
SOFT: RtlAppendUnicodeToString : RtlAppendUnicodeToString
SOFT: RtlInitUnicodeString : RtlInitUnicodeString
notepad  Caused  8 faults had  9 Soft 5 Hard faulted VA's
ntdll    Caused 94 faults had 42 Soft 8 Hard faulted VA's
comdlg32 Caused  3 faults had  0 Soft 3 Hard faulted VA's
shlwapi  Caused  2 faults had  2 Soft 2 Hard faulted VA's
gdi32    Caused 18 faults had 10 Soft 2 Hard faulted VA's
kernel32 Caused 48 faults had 36 Soft 3 Hard faulted VA's
user32   Caused 38 faults had 26 Soft 6 Hard faulted VA's
advapi32 Caused  7 faults had  6 Soft 3 Hard faulted VA's
rpcrt4   Caused  6 faults had  4 Soft 2 Hard faulted VA's
comctl32 Caused  6 faults had  5 Soft 2 Hard faulted VA's
shell32  Caused  6 faults had  5 Soft 2 Hard faulted VA's
         Caused 10 faults had  9 Soft 5 Hard faulted VA's
winspool Caused  4 faults had  2 Soft 2 Hard faulted VA's
PFMON: Total Faults 250 (KM 74 UM 250 Soft 204, Hard 46, Code 121, Data 129)
Modified Page Writer
When the modified list gets too big, or if the size of the zeroed and standby lists falls below a minimum threshold (as indicated by the kernel variable MmMinimumFreePages, which is computed at system boot time), one of two system threads is awakened to write pages back to disk and move the pages to the standby list. One system thread (MiModifiedPageWriter) writes modified pages to the paging file, and a second (MiMappedPageWriter) writes modified pages to mapped files. Two threads are required to avoid a deadlock, which would occur if the writing of mapped file pages caused a page fault that in turn required a free page when no free pages were available (thus requiring the modified page writer to create more free pages). By having the modified page writer perform mapped file paging I/Os from a second system thread, that thread can wait without blocking regular page file I/O.
Both threads run at priority 17 and, after initialization, wait for separate event objects to trigger their operation. The modified page writer event is triggered for one of two reasons:
When the number of modified pages exceeds the maximum value computed at system initialization (MmModifiedPageMaximum), currently 800 pages for all systems
When the number of available pages (MmAvailablePages) drops below MmMinimumFreePages
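The two wakeup conditions can be checked as follows. Note that MmModifiedPageMaximum is 800 per the text, but MmMinimumFreePages is computed at boot, so the value 1024 below is an arbitrary placeholder, not the real threshold.

```python
# Placeholder values: MODIFIED_PAGE_MAXIMUM models MmModifiedPageMaximum
# (800 per the text); MINIMUM_FREE_PAGES models MmMinimumFreePages, whose
# real value is computed at boot (1024 here is arbitrary).
MODIFIED_PAGE_MAXIMUM = 800
MINIMUM_FREE_PAGES = 1024

def should_signal_modified_writer(modified_pages, available_pages):
    """Mirror the two trigger conditions listed above."""
    return (modified_pages > MODIFIED_PAGE_MAXIMUM
            or available_pages < MINIMUM_FREE_PAGES)
```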
The mapped page writer waits for an additional event (MiMappedPagesTooOldEvent) that is set after a predetermined number of seconds (MmModifiedPageLifeInSeconds) to indicate that mapped pages (not modified pages) should be written to disk. By default, this value is 300 seconds (5 minutes). (You can override this value by adding the DWORD registry value HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management\ModifiedPageLife.) The reason for this additional event is to reduce data loss in the case of a system crash or power failure by eventually writing out modified mapped pages even if the modified list hasn't reached its threshold of 800 pages.
When invoked, the mapped page writer attempts to write as many pages as possible to disk with a single I/O request. It accomplishes this by examining the original PTE field of the PFN database elements for pages on the modified page list to locate pages in contiguous locations on the disk. Once a list is created, the pages are removed from the modified list, an I/O request is issued, and at successful completion of the I/O request, the pages are placed at the tail of the standby list.
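The clustering step can be sketched as follows. For simplicity this sketch sorts the pages by disk location and groups adjacent blocks; the real scan over the PFN database's original-PTE fields works differently, and the (pfn, disk_block) tuples are an invented representation.

```python
def cluster_modified_pages(modified):
    """Group modified pages into runs that are contiguous on disk.

    Each entry is (pfn, disk_block), where disk_block stands in for the
    location recorded in the PFN entry's original-PTE field. Adjacent
    blocks are grouped into runs, each of which could be written with a
    single I/O request.
    """
    runs, run = [], []
    for pfn, block in sorted(modified, key=lambda e: e[1]):
        if run and block == run[-1][1] + 1:
            run.append((pfn, block))        # extends the current run
        else:
            if run:
                runs.append(run)            # close out the previous run
            run = [(pfn, block)]
    if run:
        runs.append(run)
    return runs
```

For example, pages at disk blocks 10, 11, and 20 yield two runs, so two I/O requests instead of three.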
Pages that are in the process of being written can be referenced by another thread. When this happens, the reference count and the share count in the PFN entry that represents the physical page are incremented to indicate that another process is using the page. When the I/O operation completes, the modified page writer notices that the share count is no longer 0 and doesn't place the page on the standby list.
PFN Data Structures
Although PFN database entries are of fixed length, they can be in several different states, depending on the state of the page. Thus, individual fields have different meanings depending on the state. The states of a PFN entry are shown in Figure 7-36.
Figure 7-36. States of PFN database entries
Some fields are common to several PFN types, while others are specific to a particular type. The following fields appear in more than one PFN type:
PTE address Virtual address of the PTE that points to this page.
Reference count The number of references to this page. The reference count is incremented when a page is first added to a working set and/or when the page is locked in memory for I/O (for example, by a device driver). The reference count is decremented when the share count becomes 0 or when pages are unlocked from memory. When the share count becomes 0, the page is no longer owned by a working set. Then, if the reference count is also zero, the PFN database entry that describes the page is updated to add the page to the free, standby, or modified list.
Type The type of page represented by this PFN. (Types include active/valid, standby, modified, modified-no-write, transition, free, zeroed, and bad.)
Flags The information contained in the flags field is shown in Table 7-21.
Table 7-21. Flags Within PFN Database Entries
Modified state
Indicates whether the page was modified. (If the page is modified, its contents must be saved to disk before removing it from memory.)
Prototype PTE
Indicates that the PTE referenced by the PFN entry is a prototype PTE. (For example, this page is shareable.)
Parity error
Indicates that the physical page contains parity or error correction control errors.
Read in progress
Indicates that an in-page operation is in progress for the page. The first DWORD contains the address of the event object that will be signaled when the I/O is complete; also used to indicate the first PFN for nonpaged pool allocations.
Write in progress
Indicates that a page write operation is in progress. The first DWORD contains the address of the event object that will be signaled when the I/O is complete; also used to indicate the last PFN for nonpaged pool allocations.
Start of nonpaged pool
For nonpaged pool pages, indicates that this is the first PFN for a given nonpaged pool allocation.
End of nonpaged pool
For nonpaged pool pages, indicates that this is the last PFN for a given nonpaged pool allocation.
In-page error
Indicates that an I/O error occurred during the in-page operation on this page. (In this case, the first field in the PFN contains the error code.)
Original PTE contents All PFN database entries contain the original contents of the PTE that pointed to the page (which could be a prototype PTE). Saving the contents of the PTE allows it to be restored when the physical page is no longer resident.
PFN of PTE Physical page number of the page table page containing the PTE that points to this page.
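The reference count and share count rules described above can be sketched as a small model. The dict-based entry and the choice of list names are illustrative, not the kernel's actual layout.

```python
def release_share(entry, lists):
    """Drop one share reference on a PFN entry (simplified model).

    entry is a dict with "share", "ref", "dirty", and "pfn" keys. When
    the share count reaches 0, the page leaves its working set and the
    reference count is decremented; when that also reaches 0, the page
    is placed on the modified list if dirty, else on the standby list.
    Returns the name of the list the page moved to, or None.
    """
    entry["share"] -= 1
    if entry["share"] == 0:
        entry["ref"] -= 1                  # no working set owns the page now
        if entry["ref"] == 0:
            target = "modified" if entry["dirty"] else "standby"
            lists[target].append(entry["pfn"])
            return target
    return None
```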
The remaining fields are specific to the type of PFN. For example, the first PFN in Figure 7-36 represents a page that is active and part of a working set. The share count field represents the number of PTEs that refer to this page. (Pages marked read-only, copy-on-write, or shared read/write can be shared by multiple processes.) For page table pages, this field is the number of valid PTEs in the page table. As long as the share count is greater than 0, the page isn't eligible for removal from memory.
The working set index field is an index into the process working set list (or the system or session working set list, or zero if not in any working set) where the virtual address that maps this physical page resides. If the page is a private page, the working set index field refers directly to the entry in the working set list because the page is mapped only at a single virtual address. In the case of a shared page, the working set index is a hint that is guaranteed to be correct only for the first process that made the page valid. (Other processes will try to use the same index where possible.) The process that initially sets this field is guaranteed to refer to the proper index and doesn't need to add a working set list hash entry referenced by the virtual address into its working set hash tree. This guarantee reduces the size of the working set hash tree and makes searches faster for these particular direct entries.
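The hint behavior can be sketched as follows. The field names (first_process, ws_index, va) and the hash keyed by virtual address are assumptions made for this model, not the actual kernel structures.

```python
def resolve_ws_entry(pfn_entry, process_id, ws_hash):
    """Find where a shared page sits in a process's working set list.

    The PFN's working set index is guaranteed correct only for the
    process recorded as having first made the page valid; any other
    process must consult its own working set hash, modeled here as a
    per-process dict keyed by virtual address.
    """
    if process_id == pfn_entry["first_process"]:
        return pfn_entry["ws_index"]   # direct entry, no hash lookup needed
    return ws_hash[process_id].get(pfn_entry["va"])
```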
The second PFN in Figure 7-36 is for a page on either the standby or the modified list. In this case, the forward and backward link fields link the elements of the list together within the list. This linking allows pages to be easily manipulated to satisfy page faults. When a page is on one of the lists, the share count is by definition 0 (because no working set is using the page), so the share count field can be overlaid with the backward link. The reference count is also 0 if the page is on one of the lists. If it is nonzero (because an I/O could be in progress for the page, for example when the page is being written to disk), the page is first removed from the list.
The third PFN in Figure 7-36 is for a page on the free or zeroed list. Besides being linked together within the two lists, these PFN database entries use an additional field to link physical pages by "color," their location in the processor's CPU memory cache. Windows attempts to minimize unnecessary thrashing of CPU memory caches by spreading physical pages across the cache, avoiding the use of the same cache entry for two different pages wherever possible. For systems with direct-mapped caches, optimally using the hardware's capabilities can result in a significant performance advantage.
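A color-aware allocator can be sketched as follows. The round-robin fallback and the dict-of-lists layout are simplifications for illustration; the kernel's color lists are threaded through the PFN entries themselves.

```python
def pick_colored_page(free_by_color, wanted_color, n_colors):
    """Prefer a free page whose color matches the desired cache slot.

    free_by_color maps a color (cache-slot index) to a list of free
    PFNs. If the preferred color is exhausted, fall back round-robin
    to the next color so the allocation still succeeds.
    """
    for i in range(n_colors):
        color = (wanted_color + i) % n_colors
        if free_by_color.get(color):
            return free_by_color[color].pop(0)
    raise MemoryError("no free pages")
```

For example, with color 1 exhausted, a request for color 1 falls back to a page of color 0 rather than failing.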
The fourth PFN in Figure 7-36 is for a page that has an I/O in progress (for example, a page read). While the I/O is in progress, the first field points to an event object that will be signaled when the I/O completes. If an in-page error occurs, this field contains the Windows error status code representing the I/O error. This PFN type is used to resolve collided page faults.
In addition to the PFN database, the system variables in Table 7-22 describe the overall state of physical memory.
Table 7-22. System Variables That Describe Physical Memory
MmNumberOfPhysicalPages
Total number of physical pages available on the system
MmAvailablePages
Total number of available pages on the system (the sum of the pages on the zeroed, free, and standby lists)
MmResidentAvailablePages
Total number of physical pages that would be available if every process were at its minimum working set size
Low and High Memory Notification
Windows XP and Windows Server 2003 provide a way for user-mode processes to be notified when physical memory is low and/or plentiful. Applications can use this information to adjust their memory usage appropriately. For example, if available memory is low, an application can reduce its memory consumption; if available memory is high, it can allocate more.
To be notified of low or high memory conditions, call the CreateMemoryResourceNotification function, specifying whether low or high memory notification is desired. The returned handle can be passed to any of the wait functions; when memory is low (or high), the wait completes, notifying the thread of the condition. Alternatively, the QueryMemoryResourceNotification function can be used to query the system memory condition at any time.
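A minimal polling sketch using these APIs through Python's ctypes might look as follows. The calls are only meaningful on Windows (the sketch guards on the platform), and error handling is omitted; the enum values come from winnt.h's MEMORY_RESOURCE_NOTIFICATION_TYPE.

```python
import ctypes
import sys

# MEMORY_RESOURCE_NOTIFICATION_TYPE values from winnt.h
LowMemoryResourceNotification = 0
HighMemoryResourceNotification = 1

def query_memory_condition(notification_type):
    """Return True if the given memory condition currently holds.

    Creates a memory resource notification object and polls it with
    QueryMemoryResourceNotification; callable only on Windows, where
    ctypes.windll exposes kernel32.
    """
    kernel32 = ctypes.windll.kernel32
    handle = kernel32.CreateMemoryResourceNotification(notification_type)
    state = ctypes.c_int(0)
    kernel32.QueryMemoryResourceNotification(handle, ctypes.byref(state))
    kernel32.CloseHandle(handle)
    return bool(state.value)

if sys.platform == "win32":
    print("low memory:", query_memory_condition(LowMemoryResourceNotification))
```

To block until the condition holds instead of polling, the handle returned by CreateMemoryResourceNotification can be passed to WaitForSingleObject.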
Notification is implemented by the memory manager signaling a globally named event object LowMemoryCondition or HighMemoryCondition. These named objects are not in the normal \BaseNamedObjects object manager directory, but in a special directory called \KernelObjects. When low (or high) memory condition is detected, the appropriate event is signaled, thus waking up any waiting threads.
The default level of available memory that signals a low-memory-resource notification event is approximately 32 MB per 4 GB, to a maximum of 64 MB. The default level that signals a high-memory-resource notification event is three times the default low-memory value. These values can be overridden by adding a DWORD registry value LowMemoryThreshold or HighMemoryThreshold under HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management that specifies the number of megabytes to use as the low or high threshold.
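The default thresholds described above can be approximated as follows; the kernel's exact computation may differ, so treat this as an estimate based on the stated rule of thumb.

```python
GB = 1024 ** 3

def default_notification_thresholds(ram_bytes):
    """Approximate the default low/high notification thresholds in MB.

    Per the text: roughly 32 MB per 4 GB of RAM for the low threshold,
    capped at 64 MB, with the high threshold three times the low one.
    Returns (low_mb, high_mb).
    """
    low_mb = min(32 * ram_bytes / (4 * GB), 64)
    return low_mb, 3 * low_mb
```

A 4-GB machine thus gets roughly a 32-MB low threshold and a 96-MB high threshold, while anything with 8 GB or more hits the 64-MB cap.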
EXPERIMENT: Viewing the Memory Resource Notification Events
To see the memory resource notification events, run Winobj from http://www.sysinternals.com and click on the KernelObjects folder. You will see both the low and high memory condition events shown on the right pane:
If you double-click either event, you can see how many handles and/or references have been made to the objects.
To see whether any processes in the system have requested memory resource notification, search the handle table for references to "LowMemoryCondition" or "HighMemoryCondition". You can do this by using Process Explorer's Find menu and choosing the Handle capability or by using the Oh.exe tool from the Resource Kits. (For a description of the handle table, see the section on the "Object Manager" in Chapter 3.)
RAM Optimizers: Fact or Fiction?
As you've surfed the Web, you've probably seen browser pop-ups such as "Defragment your memory and improve performance" and "Minimize application and system failures and free unused memory." The links lead you to utilities that promise to do all that and more. Do they really work?
Memory optimizers typically present a UI that shows a graph labeled "available memory" and a line representing a threshold below which the product will take action. Another line typically shows the amount of memory that the optimizer will try to free when it runs. You can usually configure one or both levels, as well as trigger manual memory optimization or schedule optimizations. Some tools also display the processes running on the system. When a scheduled optimization job runs, the utility's available memory counter often goes up, sometimes dramatically, which implies that the tool is actually freeing up memory for your applications to use. But what it's really doing is causing useful memory to be zeroed to artificially increase the available memory.
RAM optimizers work by allocating and then freeing large amounts of virtual memory. The figure below shows the effect a RAM optimizer has on a system:
The "before" bar depicts the working sets and available memory before optimization. The "during" bar shows that the RAM optimizer creates a high memory demand, which it does by incurring many page faults in a short time. In response, the memory manager increases the RAM optimizer's working set. This working-set expansion occurs at the expense of available memory and when available memory becomes low at the expense of other process working sets. The "after" bar illustrates how, after the RAM optimizer frees its memory, the memory manager moves all the pages that were assigned to the RAM optimizer to the free page list (which ultimately get zeroed by the zero page thread and moved to the zero page list), thus contributing to the available memory value. Most optimizers hide the rapid decline in available memory that occurs during the first step, but if you run Task Manager during an optimization, you can often see the decline as it takes place.
While gaining more available memory might seem like a good thing, it isn't. As RAM optimizers force the available memory counter up, they force other processes' data and code out of memory. If you're running Word, for example, the text of open documents and the program code that was part of Word's working set before the optimization (and was therefore present in physical memory) must be reread from disk as you continue to edit your document. The performance degradation can be severe on servers, on which file data that was cached in the standby list and in the system working set (as well as the code and data used by any running server applications) would be discarded.
Some vendors make additional claims for their RAM-optimizer products. One claim you might see is that a product frees memory that's needlessly consumed by unused processes, such as those that run in the Taskbar tray. That claim could be true only if those processes had sizable working sets at the time of optimization. However, because Windows automatically trims idle processes' working sets, all such claims are untrue. The memory manager handles all necessary memory optimization.
Developers of RAM optimizers also claim that their products defragment memory. The act of allocating and then freeing a large amount of virtual memory might, as a side effect, lead to large blocks of contiguous available memory. However, because virtual memory masks the layout of physical memory from processes, processes can't directly benefit from having virtual memory backed by contiguous physical memory. As processes execute and undergo working-set trimming and growth, their virtual-to-physical memory mappings become fragmented despite the availability of contiguous memory. Having contiguous available memory helps performance in only one case: to make the best use of CPU memory caches, the memory manager uses a mechanism called page coloring to decide which page from the free or zeroed list to assign to a process. However, any minor benefit that might result from making available physical memory contiguous is heavily outweighed by the negative effect of discarding valuable code and data from memory.
Finally, vendors often claim that RAM optimizers regain memory lost to leaks. This is perhaps the most patently false assertion of all. The memory manager knows at all times what physical and virtual memory belongs to a process. However, if a process allocates memory and then doesn't free it because of a bug (an occurrence known as a leak), the memory manager can't know that the allocated memory will never be accessed again, so it must wait until the process exits to reclaim the memory. Even for leaking processes that never exit, working-set trimming means the memory manager will eventually remove from the process's working set any physical pages assigned to leaked virtual memory, sending the leaked pages to the paging file and freeing physical memory for other uses. Thus, a memory leak has only a limited effect on available physical memory. The real effect is on virtual memory consumption, which Task Manager calls both PF Usage and Commit Charge. No utility can do anything about virtual memory consumption other than kill processes that are consuming memory.
In summary, common sense suggests that if RAM optimization were possible (and could be implemented by so many small-time upstarts), Microsoft developers would have long since integrated the technology into the kernel.