Logical Prefetcher

During a typical system boot or application startup, the order of faults is such that some pages are brought in from one part of a file, then perhaps from a distant part of the same file, then from a different file, perhaps from a directory, and then again from the first file. This jumping around slows down each access considerably, and analysis shows that disk seek times are a dominant factor in slowing boot and application startup times. By prefetching batches of pages all at once, a more sensible ordering of access, without excessive backtracking, can be achieved, thus improving the overall time for system and application startup. The pages that are needed can be known in advance because of the high correlation in accesses across boots or application starts.
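To make the cost of jumping around concrete, the following minimal sketch compares the total head travel for a fault order that hops between files with the same accesses issued in ascending disk order. The offsets are invented for illustration, and the model (seek cost proportional to distance moved) is a simplifying assumption rather than anything the prefetcher itself computes.

```python
def total_seek_distance(offsets, start=0):
    """Sum of absolute head movements when visiting offsets in the given order."""
    distance = 0
    position = start
    for offset in offsets:
        distance += abs(offset - position)
        position = offset
    return distance


# Hypothetical disk offsets (in clusters), listed in the order the faults occurred.
fault_order = [100, 9000, 120, 4500, 9100, 140]

scattered = total_seek_distance(fault_order)        # demand-faulted order
batched = total_seek_distance(sorted(fault_order))  # one ordered batch
print(scattered, batched)
```

With these made-up offsets, the demand-faulted order travels 35820 clusters, whereas the sorted batch travels 9100.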

The prefetcher, introduced in Windows XP, tries to speed the boot process and application startup by monitoring the data and code accessed by boot and application startups and using that information at the beginning of a subsequent boot or application startup to read in the code and data. When the prefetcher is active, the memory manager notifies the prefetcher code in the kernel of page faults, both those that require that data be read from disk (hard faults) and those that simply require data already in memory be added to a process's working set (soft faults). The prefetcher monitors the first 10 seconds of application startup. For boot, the prefetcher by default traces from system start through the 30 seconds following the start of the user's shell (typically Explorer) or, failing that, up through 60 seconds following Windows service initialization or through 120 seconds, whichever comes first.

The trace assembled in the kernel notes faults taken on the NTFS Master File Table (MFT) metadata file (if the application accesses files or directories on NTFS volumes), on referenced files, and on referenced directories. With the trace assembled, the kernel prefetcher code notifies the prefetcher component of the Task Scheduler service (\Windows\System32\Schedsvc.dll), running in a copy of Svchost, by signaling an event object named PrefetchTracesReady.

Note

You can enable or disable prefetching of the boot or application startups by editing the DWORD registry value HKLM\System\CurrentControlSet\Control\Session Manager\Memory Management\PrefetchParameters\EnablePrefetcher. Set it to 0 to disable prefetching altogether, 1 to enable prefetching of only applications, 2 for prefetching of boot only, and 3 for both boot and applications.
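The four documented values behave like a two-bit mask (1 for applications, 2 for boot). The sketch below decodes a value on that reading; the function name is hypothetical, and it only interprets a number you have already read from the registry rather than touching the registry itself.

```python
def describe_enable_prefetcher(value):
    """Decode an EnablePrefetcher DWORD into which scenarios are prefetched."""
    if value not in (0, 1, 2, 3):
        raise ValueError("only the values 0 through 3 are described in the text")
    return {
        "applications": bool(value & 1),  # bit 0: application startup
        "boot": bool(value & 2),          # bit 1: system boot
    }


for setting in range(4):
    print(setting, describe_enable_prefetcher(setting))
```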

When the PrefetchTracesReady event is signaled, the Task Scheduler performs a call to the internal NtQuerySystemInformation system call requesting the trace data. The Task Scheduler postprocesses the trace data, combining it with previously collected data, and writes it out to a file in the \Windows\Prefetch folder, which is shown in Figure 7-31. The file's name is the name of the application to which the trace applies followed by a dash and the hexadecimal representation of a hash of the file's path. The file has a ".pf" extension, so an example would be NOTEPAD.EXE-AF43252301.PF.

Figure 7-31. Prefetch folder

There are two exceptions to the file name rule. The first is for images that host other components, including the Microsoft Management Console (\Windows\System32\Mmc.exe) and Dllhost (\Windows\System32\Dllhost.exe). Because add-on components are specified on the command line for these applications, the prefetcher includes the command line in the generated hash. Thus, invocations of these applications with different components on the command line will result in different traces. The prefetcher reads the list of executables that it should treat this way from the HostingAppList value in its parameters registry key, HKLM\System\CurrentControlSet\Control\Session Manager\Memory Management\PrefetchParameters.

The other exception to the file name rule is the file that stores the boot's trace, which is always named NTOSBOOT-B00DFAAD.PF. (If read as a word, "boodfaad" sounds similar to the English words boot fast.) Only after the prefetcher has finished the boot trace (the time of which was defined earlier) does it collect page fault information for specific applications.
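The naming rule can be checked mechanically. The following sketch splits a prefetch file name into the image name and the hexadecimal hash text and flags the fixed boot trace name. It does not attempt to recompute the hash, because the hash function is not described here, and the helper name is hypothetical.

```python
BOOT_TRACE_NAME = "NTOSBOOT-B00DFAAD.PF"


def parse_prefetch_name(file_name):
    """Split '<IMAGE>-<HEXHASH>.PF' into (image, hash_text, is_boot_trace)."""
    upper = file_name.upper()
    if not upper.endswith(".PF"):
        raise ValueError("not a prefetch file name: " + file_name)
    stem = upper[:-3]
    # Split on the last dash, since an image name may itself contain dashes.
    image, dash, hash_text = stem.rpartition("-")
    if not dash or not image or not hash_text:
        raise ValueError("expected <image>-<hash>.pf: " + file_name)
    int(hash_text, 16)  # raises ValueError if the suffix is not hexadecimal
    return image, hash_text, upper == BOOT_TRACE_NAME


print(parse_prefetch_name("NOTEPAD.EXE-AF43252301.PF"))
print(parse_prefetch_name("ntosboot-b00dfaad.pf"))
```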

EXPERIMENT: Looking Inside a Prefetch File

A prefetch file's contents serve as a record of files and directories accessed during the boot or an application startup, and you can use the Strings utility from http://www.sysinternals.com to see the record. The following command lists all the files and directories referenced during the last boot:

C:\Windows\Prefetch>Strings ntosboot-b00dfaad.pf

Strings v2.1
Copyright(C) 1999-2003 Mark Russinovich
Systems Internals - www.sysinternals.com

NTOSBOOT
SCCA
\DEVICE\HARDDISKVOLUME2\$MFT
\DEVICE\HARDDISKVOLUME2\WINDOWS\PREFETCH\NTOSBOOT-B00DFAAD.PF
\DEVICE\HARDDISKVOLUME2\SYSTEM VOLUME INFORMATION\_RESTORE{987E0331-0F01-427C-A58A-7A2E4AABF84D}\RP24\CHANGE.LOG
\DEVICE\HARDDISKVOLUME2\WINDOWS\SYSTEM32\DRIVERS\PROCESSR.SYS
\DEVICE\HARDDISKVOLUME2\WINDOWS\SYSTEM32\DRIVERS\FGLRYM.SYS
\DEVICE\HARDDISKVOLUME2\WINDOWS\SYSTEM32\DRIVERS\VIDEOPRT.SYS
\DEVICE\HARDDISKVOLUME2\WINDOWS\SYSTEM32\DRIVERS\E1000325.SYS
\DEVICE\HARDDISKVOLUME2\WINDOWS\SYSTEM32\DRIVERS\USBUHCI.SYS
\DEVICE\HARDDISKVOLUME2\WINDOWS\SYSTEM32\DRIVERS\USBPORT.SYS
\DEVICE\HARDDISKVOLUME2\WINDOWS\SYSTEM32\DRIVERS\USBEHCI.SYS
\DEVICE\HARDDISKVOLUME2\WINDOWS\SYSTEM32\DRIVERS\NIC1394.SYS
...

When the system boots or an application starts, the prefetcher is called to give it an opportunity to perform prefetching. The prefetcher looks in the prefetch directory to see if a trace file exists for the prefetch scenario in question. If it does, the prefetcher calls NTFS to prefetch any MFT metadata file references, reads in the contents of each of the directories referenced, and finally opens each file referenced. It then calls the memory manager function MmPrefetchPages to read in any data and code specified in the trace that's not already in memory. The memory manager initiates all the reads asynchronously and then waits for them to complete before letting an application's startup continue.

EXPERIMENT: Watching Prefetch File Reads and Writes

If you capture a trace of application startup with Filemon from http://www.sysinternals.com in Windows XP, you can see the prefetcher check for and read the application's prefetch file (if it exists), and roughly ten seconds after the application started, see the prefetcher write out a new copy of the file. Below is a capture of Notepad startup with an Include filter set to "prefetch" so that Filemon shows only accesses to the \Windows\Prefetch directory:

Lines 1 through 3 show the Notepad prefetch file being read in the context of the Notepad process during its startup. Lines 4 through 10, which have time stamps 10 seconds later than the first 3 lines, show the Task Scheduler, which is running in the context of a Svchost process, write out the updated prefetch file.

To minimize seeking even further, every three days or so, during system idle periods, the Task Scheduler organizes a list of files and directories in the order that they are referenced during a boot or application start and stores the list in a file named \Windows\Prefetch\Layout.ini, shown in Figure 7-32.

Figure 7-32. Prefetch defragmentation layout file

Then it launches the system defragmenter with a command-line option that tells the defragmenter to defragment based on the contents of the file instead of performing a full defrag. The defragmenter finds a contiguous area on each volume large enough to hold all the listed files and directories that reside on that volume and then moves them in their entirety into the area so that they are stored one after the other. Thus, future prefetch operations will be even more efficient because all the data read in is now stored physically on the disk in the order it will be read. Because the files defragmented for prefetching usually number only in the hundreds, this defragmentation is much faster than full volume defragmentations. (See Chapter 12 for more information on defragmentation.)

Placement Policy

When a thread receives a page fault, the memory manager must also determine where in physical memory to put the virtual page. The set of rules it uses to determine the best position is called a placement policy. Windows considers the size of CPU memory caches when choosing page frames to minimize unnecessary thrashing of the cache.

If physical memory is full when a page fault occurs, a replacement policy is used to determine which virtual page must be removed from memory to make room for the new page. Common replacement policies include least recently used (LRU) and first in, first out (FIFO). The LRU algorithm (also known as the clock algorithm, as implemented in most versions of UNIX) requires the virtual memory system to track when a page in memory is used. When a new page frame is required, the page that hasn't been used for the greatest amount of time is removed from the working set. The FIFO algorithm is somewhat simpler; it removes the page that has been in physical memory for the greatest amount of time, regardless of how often it's been used.
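The difference between the two policies is easiest to see by counting faults on a small reference string. The sketch below is a textbook simulation, not Windows code: it uses exact LRU bookkeeping rather than a clock-style approximation, and the reference string and frame count are arbitrary.

```python
from collections import OrderedDict, deque


def count_faults_fifo(references, frames):
    """Evict the page that has been resident longest, ignoring how often it is used."""
    resident = set()
    arrival_order = deque()
    faults = 0
    for page in references:
        if page in resident:
            continue
        faults += 1
        if len(resident) == frames:
            resident.remove(arrival_order.popleft())
        resident.add(page)
        arrival_order.append(page)
    return faults


def count_faults_lru(references, frames):
    """Evict the page whose most recent use is furthest in the past."""
    resident = OrderedDict()  # oldest use first, most recent use last
    faults = 0
    for page in references:
        if page in resident:
            resident.move_to_end(page)  # record the new use
            continue
        faults += 1
        if len(resident) == frames:
            resident.popitem(last=False)
        resident[page] = True
    return faults


references = [1, 2, 3, 1, 4, 1, 5, 1, 2]
print(count_faults_fifo(references, 3), count_faults_lru(references, 3))
```

On this string with three frames, FIFO takes 7 faults and LRU takes 6, because FIFO evicts page 1 even though it is referenced repeatedly.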

Replacement policies can be further characterized as either global or local. A global replacement policy allows a page fault to be satisfied by any page frame, whether or not that frame is owned by another process. For example, a global replacement policy using the FIFO algorithm would locate the page that has been in memory the longest and would free it to satisfy a page fault; a local replacement policy would limit its search for the oldest page to the set of pages already owned by the process that incurred the page fault. Global replacement policies make processes vulnerable to the behavior of other processes; an ill-behaved application can undermine the entire operating system by inducing excessive paging activity in all processes.

Windows implements a combination of local and global replacement policy. When a working set reaches its limit and/or needs to be trimmed because of demands for physical memory, the memory manager removes pages from working sets until it has determined there are enough free pages.

Working Set Management

Every process starts with a default working set minimum of 50 pages and a working set maximum of 345 pages. Although it has little effect, you can change the process working set limits with the Windows SetProcessWorkingSetSize function, though you must have the "increase scheduling priority" user right to do this. However, unless you have configured the process to use hard working set limits (new in Windows Server 2003), these limits are ignored, in that the memory manager will permit a process to grow beyond its maximum if it is paging heavily and there is ample memory (and conversely, the memory manager will shrink a process below its working set minimum if it is not paging and there is a high demand for physical memory on the system). Although in Windows 2000, the extent to which a process was over its working set maximum was a factor in its priority to be trimmed, as of Windows XP, this decision is based solely on how many pages have been accessed.

In Windows Server 2003, hard working set limits can be set using the SetProcessWorkingSetSizeEx function along with the QUOTA_LIMITS_HARDWS_ENABLE flag. A sample consumer of this function is the Windows System Resource Manager (WSRM), described in Chapter 6.

The maximum working set size can't exceed the systemwide maximum calculated at system initialization time and stored in the kernel variable MmMaximumWorkingSetSize. This value is set to be the number of available pages (the size of the zero, free, and standby list) at the time the computation is made minus 512 pages. However, there are hard upper limits for working set sizes; these are listed in Table 7-17.

Table 7-17. Upper Limit for Working Set Maximums
Windows Version	Working Set Maximum
x86 versions of Windows 2000, Windows XP, Windows XP SP1, Windows Server 2003	1984 MB
x86 versions of Windows XP SP2, Windows Server 2003 SP1	2047.9 MB
x86 versions of Windows booted /3GB	3008 MB
IA-64	7152 GB
x64	8192 GB
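The arithmetic behind the systemwide maximum can be sketched as follows. The function and parameter names are hypothetical, and combining the computed value with the table's hard limit by taking the smaller of the two is an assumption about how the two rules interact, not something stated above.

```python
def systemwide_working_set_maximum(zero_pages, free_pages, standby_pages,
                                   hard_limit_pages):
    """Available pages minus 512, capped at the platform's hard upper limit."""
    available = zero_pages + free_pages + standby_pages
    computed = available - 512
    # Assumption: the hard limit from Table 7-17 acts as a simple cap.
    return min(computed, hard_limit_pages)


# 1984 MB expressed in 4 KB pages (x86 page size assumed).
X86_HARD_LIMIT_PAGES = 1984 * 1024 // 4

print(systemwide_working_set_maximum(1000, 2000, 3000, X86_HARD_LIMIT_PAGES))
```

With 6000 pages available, the result is 5488 pages; the 507904-page hard limit only matters on systems with far more available memory.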

When a page fault occurs, the process's working set limits and the amount of free memory on the system are examined. If conditions permit, the memory manager allows a process to grow to its working set maximum (or beyond if the process does not have a hard working set limit and there are enough free pages available). However, if memory is tight, Windows replaces rather than adds pages in a working set when a fault occurs.

Although Windows attempts to keep memory available by writing modified pages to disk, when modified pages are being generated at a very high rate, more memory is required in order to meet memory demands. Therefore, when physical memory runs low, the working set manager, a routine that runs in the context of the balance set manager system thread (described in the next section), initiates automatic working set trimming to increase the amount of free memory available in the system. (With the Windows SetProcessWorkingSetSize function mentioned earlier, you can also initiate working set trimming of your own process, for example, after process initialization.)

The working set manager examines available memory and decides which, if any, working sets need to be trimmed. If there is ample memory, the working set manager calculates how many pages could be removed from working sets if needed. If trimming is needed, it looks at working sets that are above their minimum setting. It also dynamically adjusts the rate at which it examines working sets as well as arranges the list of processes that are candidates to be trimmed into an optimal order. For example, processes with many pages that have not been accessed recently are examined first; larger processes that have been idle longer are considered before smaller processes that are running more often; the process running the foreground application is considered last; and so on.

When it finds processes using more than their minimums, the working set manager looks for pages to remove from their working sets, making the pages available for other uses. If the amount of free memory is still too low, the working set manager continues removing pages from processes' working sets until it achieves a minimum number of free pages on the system.

On a single-processor Windows 2000 system and all systems running Windows XP and Windows Server 2003, the working set manager tries to remove pages that haven't been accessed recently. It does this by checking the accessed bit in the hardware PTE to see whether the page has been accessed. If the bit is clear, the page is aged, that is, a count is incremented indicating that the page hasn't been referenced since the last working set trim scan. Later, the age of pages is used to locate candidate pages to remove from the working set.

Note

On a Windows 2000 multiprocessor system, the working set manager mistakenly didn't check the accessed bit, resulting in pages being removed from the working set without regard to the state of the accessed bit.

If the hardware PTE accessed bit is set, the working set manager clears it and goes on to examine the next page in the working set. In this way, if the accessed bit is clear the next time the working set manager examines the page, it knows that the page hasn't been accessed since the last time it was examined. This scan for pages to remove continues through the working set list until either the number of desired pages has been removed or the scan has returned to the starting point. (The next time the working set is trimmed, the scan picks up where it left off last.)
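The accessed-bit scan described above can be modeled in a few lines. This is a simplified sketch, not the memory manager's implementation: the Entry class, the min_age threshold, and resetting the age to zero when the accessed bit is found set are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    page: int
    accessed: bool         # stands in for the hardware PTE accessed bit
    age: int = 0           # scans in a row that found the bit clear
    resident: bool = True  # False once trimmed from the working set


def trim_scan(entries, start, pages_wanted, min_age=2):
    """Run one trim pass; return (removed_pages, index_to_resume_from)."""
    removed = []
    index = start
    for _ in range(len(entries)):        # at most one full lap of the list
        if len(removed) >= pages_wanted:
            break
        entry = entries[index]
        if entry.resident:
            if entry.accessed:
                entry.accessed = False   # clear it; check again next scan
                entry.age = 0            # assumption: a reference resets the age
            else:
                entry.age += 1           # not referenced since the last scan
                if entry.age >= min_age:
                    entry.resident = False
                    removed.append(entry.page)
        index = (index + 1) % len(entries)
    return removed, index


working_set = [Entry(0x1000, True), Entry(0x2000, False), Entry(0x3000, False, age=1)]
print(trim_scan(working_set, 0, 1))
```

The returned index lets the next trim pick up where the previous one stopped, matching the behavior described in the parenthetical note above.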

EXPERIMENT: Viewing Process Working Set Sizes

You can use the Performance tool to examine process working set sizes by looking at the following performance counters:

Counter	Description
Process: Working Set	Current size of the selected process's working set in bytes
Process: Working Set Peak	Peak size of the selected process's working set in bytes
Process: Page Faults/Sec	Number of page faults for the process that occur each second

Several other process viewer utilities (such as Task Manager, Pview, and Pviewer) also display the process working set size.

You can also get the total of all the process working sets by selecting the _Total process in the instance box in the Performance tool. This process isn't real; it's simply a total of the process-specific counters for all processes currently running on the system. The total you see is misleading, however, because the size of each process working set includes pages being shared by other processes. Thus, if two or more processes share a page, the page is counted in each process's working set.
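The overcounting is simple to demonstrate. In the sketch below, each process's working set is modeled as a set of physical page numbers; the process names and page numbers are invented.

```python
def working_set_totals(working_sets):
    """Return (sum of per-process sizes, number of distinct physical pages)."""
    summed = sum(len(pages) for pages in working_sets.values())
    distinct = len(set().union(*working_sets.values()))
    return summed, distinct


# Hypothetical processes; page 3 is shared by both.
example = {"a.exe": {1, 2, 3}, "b.exe": {3, 4}}
print(working_set_totals(example))
```

Here the per-process sizes add up to 5 pages although only 4 distinct pages are in memory, which is the same effect that inflates the _Total instance.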

EXPERIMENT: Viewing the Working Set List

You can view the individual entries in the working set by using the kernel debugger !wsle command. The following example shows a partial output of the working set list of LiveKd. (This command was run on the LiveKd process.)

kd> !wsle 7

Working Set @ c0502000
    Quota:        9f  FirstFree:  40   FirstDynamic:      3
    LastEntry    1fe  NextSlot:    3   LastInitialize   257
    NonDirect     5c  HashTable:   0   HashTableSize:     0

Virtual Address   Age    Locked    ReferenceCount
c0300203           0          1          1
c0301203           0          1          1
c0502203           0          1          1
c01df201           0          0          1
c01ff201           0          0          1
c0005201           0          0          1
c0001201           0          0          1
c0002201           0          0          1
c0000201           0          0          1
c0006201           0          0          1
77e87119           0          0          1
00402319           0          0          1
77e01201           0          0          1
7ffdf201           0          0          1
00130201           0          0          1
77e9e119           0          0          1
78033201           0          0          1
00230221           0          0          1
00131201           0          0          1
77d50119           0          0          1
00132201           0          0          1
c01e0201           0          0          1
00411309           0          0          1
0040d201           0          0          1
77edf201           0          0          1
77ee0201           0          0          1
77fcd201           0          0          1
0040e201           0          0          1
7ffc1009           0          0          1
00401319           0          0          1

Notice that some entries in the working set list are page table pages (the ones with addresses greater than 0xC0000000), some are from system DLLs (the ones in the 0x7nnnnnnn range), and some are from the code of LiveKd.exe itself (those in the 0x004nnnnn range).
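The grouping by address range can be sketched as a small classifier. It assumes 4 KB pages on x86 and treats the low 12 bits of each displayed value as per-entry flag bits rather than part of the page address; the category labels simply restate the ranges named above and say nothing more precise about what lives at each address.

```python
PAGE_MASK = 0xFFFFF000  # assumption: 4 KB pages, low 12 bits are entry flags


def classify_wsle(entry_text):
    """Return (page_address, category) for one Virtual Address column value."""
    page = int(entry_text, 16) & PAGE_MASK
    if page >= 0xC0000000:
        category = "page table page"
    elif 0x70000000 <= page <= 0x7FFFFFFF:
        category = "system DLL range"
    elif 0x00400000 <= page <= 0x004FFFFF:
        category = "LiveKd.exe image range"
    else:
        category = "other"
    return page, category


for text in ("c0300203", "77e87119", "00402319", "00130201"):
    page, category = classify_wsle(text)
    print(f"{text} -> {page:08x} {category}")
```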

Balance Set Manager and Swapper

Working set expansion and trimming take place in the context of a system thread called the balance set manager (routine KeBalanceSetManager). The balance set manager is created during system initialization. Although the balance set manager is technically part of the kernel, it calls the memory manager's working set manager to perform working set analysis and adjustment.

The balance set manager waits for two different event objects: an event that is signaled when a periodic timer set to fire once per second expires and an internal working set manager event that the memory manager signals at various points when it determines that working sets need to be adjusted. For example, if the system is experiencing a high page fault rate or the free list is too small, the memory manager wakes up the balance set manager so that it will call the working set manager to begin trimming working sets. When memory is more plentiful, the working set manager will permit faulting processes to gradually increase the size of their working sets by faulting pages back into memory, but the working sets will grow only as needed.

When the balance set manager wakes up as the result of its 1-second timer expiring, it takes the following four steps:

1. Every fourth time the balance set manager wakes up because its 1-second timer has expired, it signals an event that wakes up another system thread called the swapper (routine KeSwapProcessOrStack).
2. The balance set manager then checks the look-aside lists and adjusts their depths if necessary (to improve access time and to reduce pool usage and pool fragmentation).
3. It looks for threads that might warrant having their priority boosted because they are CPU starved. (See the section "Priority Boosts for CPU Starvation" in Chapter 6.)
4. It calls the memory manager's working set manager. (The working set manager has its own internal counters that regulate when to perform working set trimming and how aggressively to trim.)
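The four steps above can be sketched as a loop over timer expirations. The action strings are descriptive labels rather than kernel routine names, and counting wakeups from 1 (so that the fourth wakeup is the first to signal the swapper) is an assumption.

```python
def balance_set_manager_wakeups(seconds):
    """Return a list of (wakeup_number, actions) for each 1-second timer expiry."""
    log = []
    for wakeup in range(1, seconds + 1):
        actions = []
        if wakeup % 4 == 0:
            actions.append("signal swapper")        # only every fourth wakeup
        actions.append("adjust look-aside lists")
        actions.append("boost CPU-starved threads")
        actions.append("call working set manager")
        log.append((wakeup, actions))
    return log


for wakeup, actions in balance_set_manager_wakeups(4):
    print(wakeup, actions)
```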

The swapper is also awakened by the scheduling code in the kernel if a thread that needs to run has its kernel stack swapped out or if the process has been swapped out. The swapper looks for threads that have been in a wait state for 7 seconds in Windows 2000 and 15 seconds in Windows XP and Windows Server 2003. If it finds one, it puts the thread's kernel stack in transition (moving the pages to the modified or standby lists) so as to reclaim its physical memory, operating on the principle that if a thread's been waiting that long, it's going to be waiting even longer. When the last thread in a process has its kernel stack removed from memory, the process is marked to be entirely outswapped. That's why, for example, processes that have been idle for a long time (such as Winlogon after you log on) can have a zero working set size.

System Working Set

Just as processes have working sets, the pageable code and data in the operating system are managed by a single system working set. Five different kinds of pages can reside in the system working set:

System cache pages
Paged pool
Pageable code and data in Ntoskrnl.exe
Pageable code and data in device drivers
System mapped views

You can examine the size of the system working set or the size of the five components that contribute to it with the performance counters or system variables shown in Table 7-18. Keep in mind that the performance counter values are in bytes whereas the system variables are measured in terms of pages.

Table 7-18. System Working Set Performance Counters
Performance Counter (in Bytes)	System Variable (in Pages)	Description
Memory: Cache Bytes^[*]	MmSystemCacheWs.WorkingSetSize	Total size of system working set (including the cache, paged pool, pageable Ntoskrnl and driver code, and system mapped views); this is not the size of the system cache alone, even though the name implies that it is.
Memory: Cache Bytes Peak	MmSystemCacheWs.Peak	Peak system working set size.
Memory: System Cache Resident Bytes	MmSystemCachePage	Physical memory consumed by the system cache.
Memory: System Code Resident Bytes	MmSystemCodePage	Physical memory consumed by pageable code in Ntoskrnl.exe.
Memory: System Driver Resident Bytes	MmSystemDriverPage	Physical memory consumed by pageable device driver code.
Memory: Pool Paged Resident Bytes	MmPagedPoolPage	Physical memory consumed by paged pool.

^[*] Internally, this working set is called the system cache working set, even though the system cache is just one of five different components in it. Thus, several utilities think they are displaying the size of the file cache when they are displaying the total size of the system working set.
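Because the counters report bytes and the kernel variables hold page counts, comparing the two requires a conversion by the page size. The sketch below assumes a 4 KB page, as on x86; IA-64 systems use 8 KB pages, so pass a different page_size there. The sample value is made up.

```python
PAGE_SIZE = 4096  # assumption: x86 page size in bytes


def counter_bytes_to_pages(counter_bytes, page_size=PAGE_SIZE):
    """Convert a byte-valued performance counter to a page count."""
    pages, remainder = divmod(counter_bytes, page_size)
    if remainder:
        raise ValueError("value is not a whole number of pages")
    return pages


def pages_to_counter_bytes(pages, page_size=PAGE_SIZE):
    """Convert a page-valued kernel variable to bytes."""
    return pages * page_size


# A hypothetical Memory: Cache Bytes reading of 50 MB.
print(counter_bytes_to_pages(50 * 1024 * 1024))
```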

You can also examine the paging activity in the system working set by examining the Memory: Cache Faults/Sec performance counter, which describes page faults that occur in the system working set (both hard and soft).

The system variable that contains the value for this counter is MmSystemCacheWs.PageFaultCount.

The minimum and maximum system working set size is computed at system initialization time based on the amount of physical memory on the machine and whether the system is running a client or server edition of Windows. The calculated working set minimum and maximum are stored in the kernel variables shown in Table 7-19. These variables aren't available through any performance counter, but you can examine them with the kernel debugger.

Table 7-19. System Variables That Store Working Set Minimums or Maximums
Variable	Type	Description
MmSystemCacheWs.MinimumWorkingSetSize	ULONG	Minimum working set size
MmSystemCacheWs.MaximumWorkingSetSize	ULONG	Maximum working set size

You can configure the system to give priority to the system working set (as opposed to process working sets) by changing the value of HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management\LargeSystemCache. In Windows 2000 Server systems, you can adjust this value indirectly by setting the properties of the file server service; in Windows XP and Windows Server 2003, you can adjust this by going to My Computer, choosing Properties, clicking the Advanced tab, pressing the Settings button in the Performance section, and finally clicking the Advanced tab. (See Chapter 11 for details.)