Keeping Track of Free Physical Page Frames

   

The original incarnation of pfdat was quite simple. Each page had a pfdat structure, and the entries were arranged in a simple array, with the physical page frame number serving as the index. As we mentioned in our discussion of the pfn_to_virt_ptr structure, modern HP-UX systems must be able to handle partitioned memory configurations, and the advent of today's cell-based systems, with their physical and virtual partitioning schemes, has brought this requirement into sharper focus. Figure 6-14 illustrates the current incarnation of the page free data table.

Figure 6-14. Page Free Data, pfdat

graphics/06fig14.gif


The pfdat_ptr partition size is set by PFN_CONTIGUOUS_PAGES and currently defaults to 4096 pages, or 16 MB, of contiguous memory. Only those physical page frames that may be dynamically paged in or out by the kernel have pfdat structures. The kernel's static pages do not require inclusion in this table, as they will never be part of the dynamic memory pool.

The two-tier nature of this structure makes it easy to skip blocks of memory pages, such as kernel static memory pages or holes in the physical hardware map. In pfdat_ptr, a pp_tbl set to zero simply means that the corresponding 16-MB block of memory is not dynamically mapped. Note that this does not mean the memory does not physically exist, only that it is not managed by this structure. (See Listing 6.11.)

Listing 6.11. q4> fields struct pfdat_ptr
 The pp_tbl points to a block of "page free data" entries and the
 pp_offset directs us to the first valid entry in the block. Each
 block maps 4096 physical page translations (determined by the
 kernel parameter PFN_CONTIGUOUS_PAGES)
 0 0 4 0 *   pp_tbl
 4 0 4 0 int pp_offset
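
Before moving on to the pfdat entries themselves, the following sketch shows one plausible reading of the two-level lookup just described. It is illustrative only and not the kernel's actual code: the names pfdat_ptr, pp_tbl, pp_offset, and PFN_CONTIGUOUS_PAGES come from the text, while the pfn_to_pfdat() function and the exact handling of pp_offset (treated here as the index of the first page frame actually covered by pp_tbl) are assumptions.

/*
 * Illustrative sketch only: one plausible reading of the two-level
 * pfn-to-pfdat lookup.  Not the kernel's actual code.
 */
#define PFN_CONTIGUOUS_PAGES 4096       /* default: 4096 pages = 16 MB  */

struct pfdat {                          /* see Listing 6.12 for the     */
        struct pfdat *pf_hchain;        /* full field list              */
        /* ... */
};

struct pfdat_ptr {
        struct pfdat *pp_tbl;           /* block of pfdat entries       */
        int           pp_offset;        /* first valid entry in block   */
};

extern struct pfdat_ptr pfdat_ptr[];    /* one entry per 16-MB block    */

static struct pfdat *
pfn_to_pfdat(unsigned int pfn)
{
        struct pfdat_ptr *pp = &pfdat_ptr[pfn / PFN_CONTIGUOUS_PAGES];

        if (pp->pp_tbl == 0)            /* block not dynamically managed */
                return 0;
        return &pp->pp_tbl[(pfn % PFN_CONTIGUOUS_PAGES) - pp->pp_offset];
}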

Once we have used the physical page number and have followed the appropriate *pp_tbl pointer, we will find the page's pfdat data structure. Let's examine an annotated listing of this structure (Listing 6.12).

Listing 6.12. q4> fields struct pfdat
 Hash chain linkage pointer
  0 0 4 0 *       pf_hchain
 Pointer to the device file vnode (devvp) from which this page
 was mapped (if applicable)
  4 0 4 0 *       pf_devvp
 Double-linked list of free page frames
  8 0 4 0 *       pf_next
 12 0 4 0 *       pf_prev
 Double-linked list connecting all page frames sharing a vnode
 16 0 4 0 *       pf_vnext
 20 0 4 0 *       pf_vprev
 Locking structures used during table maintenance
 24 0 1 0 u_char  pf_lock.b_lock
 26 0 2 0 u_short pf_lock.order
 28 0 4 0 *       pf_lock.owner
 Process-dependent information
 32 0 2 5 u_int   pf_pdk.pf_pfn
 34 5 1 3 u_int   pf_pdk.pf_fill
 Number of regions using this page (0 means the page should be
 on a free list)
 36 0 2 0 short   pf_use
 Incremented prior to requesting a page frame lock
 38 0 2 0 u_short pf_cache_waiting
 Disk block number the page is mapped from (if applicable)
 40 0 4 0 u_int   pf_data
 VPS system data for this page and its page group association
 44 0 0 6 u_int   pf_sizeidx
 44 6 0 5 u_int   pf_size
 45 3 0 5 u_int   pf_size_fill
 46 0 2 0 u_short pf_group_id
 Flags showing the page frame's current status:
   0x001 P_FREE or P_QUEUE  Page is on a free list
   0x002 P_BAD              Page has been marked as bad (do not use!)
   0x004 P_HASH             Page is on a hash queue
   0x008 P_CLIC             Page is being used by the Cluster InterConnect
   0x010 P_SYS              This page belongs to the kernel
   0x040 P_DMEM             Page is locked by the memory deallocation subsystem
   0x080 P_LCOW             Page is being remapped by the copy-on-write subsystem
   0x100 P_UAREA            Page is mapped to a kthread's UAREA
   0x200 P_KERN_DYNAMIC     Page is allocated as dynamic kernel memory
                            (it may be returned by the kernel later)
 48 0 2 0 short   pf_flags
 50 0 1 0 u_char  pf_fs_priv_data
 The pf_hdl structure contains hardware-dependent layer data.
 The hdl flag values are:
   0x01 HDLPF_TRANS    A virtual address translation exists
   0x02 HDLPF_NOTIFY   Virtual DMA, not used by HP-UX
   0x04 HDLPF_PROTECT  User access not allowed
   0x08 HDLPF_STEAL    Remove translation when I/O completes
   0x10 HDLPF_ONULLPG  Used for null dereference emulation
   0x20 HDLPF_MOD      A shortcut to setting the pde_mod flag
   0x40 HDLPF_REF      A shortcut to setting the pde_ref flag
   0x80 HDLPF_READA    Read-ahead page in transit
 52 0 2 0 u_short pf_hdl.hdlpf_flags
 During an alias operation we save the original page's access
 rights and protection id bits here
 54 0 2 0 u_short pf_hdl.hdlpf_savear
 56 0 4 0 u_int   pf_hdl.hdlpf_saveprot

As we see in this structure, page frames are linked to several lists. Of course, there is a free list to facilitate page allocation schemes, and all pages related to a specific open file are linked together as well. Another linkage is the pf_hchain at the top of the structure; let's take a look at how this hash is utilized.

Is a Page I/O Really Needed? (phash and phead)

To ensure that free pages are always available when one is requested, the kernel invests a portion of its resources in proactively identifying stale pages and transferring them to a swap location or back store. Moving pages between disk and physical memory is a very costly operation in terms of instruction cycles and I/O bandwidth. However, not being able to find a free page when one is needed can really throw a wrench into the works! As part of the page-out operation, a page's pfdat structure is added to a hashed list (see Figure 6-15). The pages are hashed according to their disk block descriptor and a pointer to the device they have been written to. The actual hash algorithm is

((dbd_data >> 5) + dbd_data) xor (devvp >> 6) & phashmask

Figure 6-15. phash and phead

graphics/06fig15.gif


To minimize the cost of paging, the phash is searched before a page-in operation is called just in case the page frame has not been reallocated since it was paged out. If the required page is found on the phash, it is simply removed from its free list and reinstated in the region, htbl, and pfn_to_virt tables.
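
To make the lookup concrete, here is an illustrative C sketch of the hash and the subsequent chain walk. It is not the actual kernel routine: the field names pf_hchain, pf_devvp, and pf_data are taken from Listing 6.12, but the phead[] bucket array, the simplified structure layout, and the assumption that the final mask is applied to the whole hash value are ours.

/*
 * Illustrative sketch of a phash lookup, not the actual kernel code.
 */
#include <stddef.h>
#include <stdint.h>

struct pfdat {
        struct pfdat *pf_hchain;        /* hash chain linkage               */
        void         *pf_devvp;         /* device vnode page was written to */
        uint32_t      pf_data;          /* disk block number (dbd_data)     */
        /* ... remaining fields as in Listing 6.12 ... */
};

extern struct pfdat *phead[];           /* hash bucket heads                */
extern unsigned int  phashmask;         /* number of buckets minus one      */

/* ((dbd_data >> 5) + dbd_data) xor (devvp >> 6), masked to a bucket index */
static unsigned int
phash_index(uint32_t dbd_data, void *devvp)
{
        return (((dbd_data >> 5) + dbd_data) ^
                (unsigned int)((uintptr_t)devvp >> 6)) & phashmask;
}

/* Walk the bucket's pf_hchain looking for a cached copy of the page. */
static struct pfdat *
phash_lookup(uint32_t dbd_data, void *devvp)
{
        struct pfdat *pf;

        for (pf = phead[phash_index(dbd_data, devvp)];
             pf != NULL; pf = pf->pf_hchain)
                if (pf->pf_data == dbd_data && pf->pf_devvp == devvp)
                        return pf;      /* found: the page-in can be skipped */
        return NULL;                    /* not cached: must read from disk   */
}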

If you are starting to feel comfortable with the logical to virtual to physical to front store and back store relationships, great! That's the goal of this work, but we need to take a deep breath and press on. Recent developments have introduced another level of complexity to our strategy for managing this precious resource.

Views, Ponds, Pools, Pages

An operating system's physical page allocator must work in harmony with the underlying physical hardware. The kernel claims physical pages for its own memory arena scheme and for dynamic assignment to meet the needs of its processes. As HP-UX migrates from the purely proprietary PA-RISC architecture to the new IA-64 architecture, the memory allocator must accommodate the differences between them. In preparation for this change, and to facilitate the recently introduced partitioning schemes, the kernel memory allocator has been modified. Figure 6-16 illustrates an abstraction of the new model.

Figure 6-16. Pools, Ponds, Views

graphics/06fig16.gif


This new model includes a combination of views, ponds, pools, and groups. Let's try to define these terms and concepts.

Physical Memory View

The highest level of organization is the physical memory view, or pmv, which allows a system's physical memory to be divided into sets depending on specific properties or aligned according to physical configurations. One use would be to divide memory according to its access latency. On an SMP (symmetric multiprocessor) or UMA (uniform memory access) machine there would be only one pmv, while on a multinode design or a ccNUMA (cache-coherent non-uniform memory access) platform, each node (or cell) might have its own pmv. To date, this level has not been coded into the HP-UX kernel and is presented here only as a consideration for the future.

Logical Memory Views

The logical memory view, or lmv, is the next level of division within a pmv. Each lmv maintains its own free lists and lock structures. Currently, the lmvs are aligned with the system's processor sets. By having each processor set allocate memory from its own free list and use its own locking mechanism, overall system performance is improved through reduced lock contention. This is currently the highest abstraction level incorporated in the HP-UX kernel. The lmvs are managed by an array of lmv structures allocated at boot time; the number of lmvs is currently calculated as the total number of processors divided by four.

Memory Pools

Each lmv is made up of one or more pools of memory. A memory pool boundary is set in alignment with intrinsic hardware restrictions. For instance, on an HP-UX 32-bit kernel the first gigabyte of physical memory belongs to one pool, and the memory above 1 GB belongs to another. Kernel pages must reside in the first physical gigabyte of memory so that they can be equivalently mapped. By having the two pools, the memory allocator may be instructed to always allocate kernel pages from the first pool and to allocate user pages from the second pool (unless none are available, in which case the first pool is used).
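
A simplified sketch of this allocation preference for a 32-bit kernel follows. Only the policy itself comes from the text; the pool_has_free() helper, the pool constants, and the choose_pool() function are hypothetical names introduced for illustration.

/*
 * Hypothetical sketch of the pool-selection policy described above,
 * not the allocator's real interface.
 */
enum { POOL_EQUIV = 0, POOL_NONEQUIV = 1 };    /* below / above 1 GB */

extern int pool_has_free(int pool);            /* hypothetical helper */

static int
choose_pool(int kernel_page)
{
        if (kernel_page)                       /* must be equivalently mapped */
                return POOL_EQUIV;
        if (pool_has_free(POOL_NONEQUIV))      /* user pages prefer pool 1    */
                return POOL_NONEQUIV;
        return POOL_EQUIV;                     /* fall back to pool 0         */
}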

On current PA-RISC systems, there are only two pools, one for equivalently mapped memory and one for nonequivalently mapped memory. On a wide system, the entire physical memory may be equivalently mapped, as the addressing range for the 64-bit system is quite large (2^42).

The page pool data is stored in an array of page_pool structures, which reside inside their respective lmv.

Memory Ponds

As mentioned, if a page is swapped out to increase the supply of free pages, its pfdat structure is linked to the phash. These pages are called cached free pages. As pages are freed, they are placed into one of two ponds of memory: one for cached free pages and the other for uncached pages. If the system uses variable page sizing, then a separate free list is maintained for each page size. The number of free list headers created per pond is determined by the kernel tunable parameter pf_offset_idx and set at boot time. The pond structure is contained as an array in its corresponding pool structure.
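
As a rough illustration, the choice of pond for a freed page frame might be keyed off the P_HASH flag from Listing 6.12, since a page still linked to the phash retains its cached identity. The pond index constants and the helper below are assumptions for the sketch, not kernel code.

/*
 * Illustrative only: choosing between the two ponds when a page frame
 * is freed.  P_HASH and pf_flags come from Listing 6.12; everything
 * else here is assumed.
 */
#define P_HASH  0x004                          /* page is on a hash queue */

enum { POND_UNCACHED = 0, POND_CACHED = 1 };   /* assumed ordering */

static int
choose_pond(unsigned short pf_flags)
{
        /* A freed page still hanging on the phash keeps its identity
         * (device vnode + disk block), so it goes to the cached pond;
         * otherwise it goes to the uncached pond. */
        return (pf_flags & P_HASH) ? POND_CACHED : POND_UNCACHED;
}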

The Nested Structures

In Listing 6.13, we see that the pond array is nested in the pool structure, which in turn is nested in the lmv structure.

Listing 6.13. q4> fields lmv_t
 Number of pools in this logical memory view
   0 0 4 0 int   lmv_npools
 Index of this logical memory view
   4 0 4 0 int   lmv_id
 Amount of free memory in this view
   8 0 4 0 int   lmv_freemem
 Pointer to spinlock for protection of this view's free lists
  12 0 4 0 *     lmv_fl_lock
 Individual page pool structures (for PA-RISC there are two
 pools per logical memory view). We start with a linked list of
 all page groups in the pool, a free list spinlock for this
 pool, a linked list of the pools in this logical view, a
 pointer to the parent lmv structure, and a "used" count for
 this pool
  16 0 4 0 *     lmv_pools[0].pp_next
  20 0 4 0 *     lmv_pools[0].pp_fl_lock
  24 0 4 0 *     lmv_pools[0].pp_next_pool
  28 0 4 0 *     lmv_pools[0].pp_lmv
  32 0 4 0 *     lmv_pools[0].pp_n_used
 Within the pool structure we have two pond structures. Each
 pond starts with a pointer to its free list headers, the free
 count for this pond, and a pointer to the parent pool structure
  36 0 4 0 *     lmv_pools[0].pp_ponds[0].pp_phead
  40 0 4 0 *     lmv_pools[0].pp_ponds[0].pp_n_free
  44 0 4 0 *     lmv_pools[0].pp_ponds[0].pp_pool
 In addition, there is an index count of the number of active
 entries in pp_phead[] (the number of page sizes allowed), the
 index number of this pond, and low- and high-water marks used
 by the lazy coalesce routines
  48 0 4 0 u_int lmv_pools[0].pp_ponds[0].pp_max_idx
  52 0 4 0 u_int lmv_pools[0].pp_ponds[0].pp_pond_index
  56 0 4 0 u_int lmv_pools[0].pp_ponds[0].pp_lb_shift
  60 0 4 0 u_int lmv_pools[0].pp_ponds[0].pp_ub_shift
  64 0 4 0 *     lmv_pools[0].pp_ponds[1].pp_phead
  68 0 4 0 *     lmv_pools[0].pp_ponds[1].pp_n_free
  72 0 4 0 *     lmv_pools[0].pp_ponds[1].pp_pool
  76 0 4 0 u_int lmv_pools[0].pp_ponds[1].pp_max_idx
  80 0 4 0 u_int lmv_pools[0].pp_ponds[1].pp_pond_index
  84 0 4 0 u_int lmv_pools[0].pp_ponds[1].pp_lb_shift
  88 0 4 0 u_int lmv_pools[0].pp_ponds[1].pp_ub_shift
 Each pool maintains a total count of available page frames and
 their status flags
  92 0 4 0 int   lmv_pools[0].pp_pfdatnumentries
  96 0 4 0 u_int lmv_pools[0].pp_flags
 100 0 4 0 *     lmv_pools[1].pp_next
 104 0 4 0 *     lmv_pools[1].pp_fl_lock
 108 0 4 0 *     lmv_pools[1].pp_next_pool
 112 0 4 0 *     lmv_pools[1].pp_lmv
 116 0 4 0 *     lmv_pools[1].pp_n_used
 120 0 4 0 *     lmv_pools[1].pp_ponds[0].pp_phead
 124 0 4 0 *     lmv_pools[1].pp_ponds[0].pp_n_free
 128 0 4 0 *     lmv_pools[1].pp_ponds[0].pp_pool
 132 0 4 0 u_int lmv_pools[1].pp_ponds[0].pp_max_idx
 136 0 4 0 u_int lmv_pools[1].pp_ponds[0].pp_pond_index
 140 0 4 0 u_int lmv_pools[1].pp_ponds[0].pp_lb_shift
 144 0 4 0 u_int lmv_pools[1].pp_ponds[0].pp_ub_shift
 148 0 4 0 *     lmv_pools[1].pp_ponds[1].pp_phead
 152 0 4 0 *     lmv_pools[1].pp_ponds[1].pp_n_free
 156 0 4 0 *     lmv_pools[1].pp_ponds[1].pp_pool
 160 0 4 0 u_int lmv_pools[1].pp_ponds[1].pp_max_idx
 164 0 4 0 u_int lmv_pools[1].pp_ponds[1].pp_pond_index
 168 0 4 0 u_int lmv_pools[1].pp_ponds[1].pp_lb_shift
 172 0 4 0 u_int lmv_pools[1].pp_ponds[1].pp_ub_shift
 176 0 4 0 int   lmv_pools[1].pp_pfdatnumentries
 180 0 4 0 u_int lmv_pools[1].pp_flags
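
For readers more comfortable with C declarations, the nesting in Listing 6.13 can be sketched as follows. The field names follow the q4 output, but the type names, the use of generic pointers, and the fixed array bounds (two pools per lmv and two ponds per pool, as on PA-RISC) are simplifying assumptions, not the kernel's actual declarations.

/* Sketch only: a C-style view of the nesting shown in Listing 6.13. */
typedef unsigned int u_int;

typedef struct pond {
        void  *pp_phead;            /* free list headers, one per page size */
        void  *pp_n_free;           /* free count for this pond             */
        void  *pp_pool;             /* back-pointer to the parent pool      */
        u_int  pp_max_idx;          /* active entries in pp_phead[]         */
        u_int  pp_pond_index;       /* index of this pond                   */
        u_int  pp_lb_shift;         /* lazy-coalesce low-water mark         */
        u_int  pp_ub_shift;         /* lazy-coalesce high-water mark        */
} pond_t;

typedef struct pool {
        void   *pp_next;            /* linked list of page groups in pool   */
        void   *pp_fl_lock;         /* free list spinlock for this pool     */
        void   *pp_next_pool;       /* linked list of pools in this view    */
        void   *pp_lmv;             /* back-pointer to the parent lmv       */
        void   *pp_n_used;          /* "used" count for this pool           */
        pond_t  pp_ponds[2];        /* cached and uncached ponds            */
        int     pp_pfdatnumentries; /* total page frames in this pool       */
        u_int   pp_flags;           /* pool status flags                    */
} pool_t;

typedef struct lmv {
        int     lmv_npools;         /* number of pools in this view         */
        int     lmv_id;             /* index of this logical memory view    */
        int     lmv_freemem;        /* amount of free memory in this view   */
        void   *lmv_fl_lock;        /* spinlock protecting the free lists   */
        pool_t  lmv_pools[2];       /* two pools per lmv on PA-RISC         */
} lmv_t;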

The pp_phead[] array contains the actual free list pointers for each pond. The first element in this array holds the next and previous pointers for the one-page frame free list, the second entry holds the linkage pointers for the four-page frame free list, the third for the 16-page frame free list, and so on up to the maximum configured page size. To facilitate management of multiple page sizes, an additional structure called the memory page_group was created.
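
The pp_phead[] index calculation just described (one-page, four-page, 16-page lists, and so on) might be expressed as in the following sketch; it is illustrative only, and the function name is ours, not the kernel's.

/*
 * Illustrative only: mapping a power-of-four page-frame count
 * (1, 4, 16, ...) to its pp_phead[] free-list index.
 */
static int
pages_to_phead_index(unsigned int npages)
{
        int idx = 0;

        while (npages > 1) {
                npages >>= 2;   /* each successive list holds 4x larger frames */
                idx++;
        }
        return idx;             /* 1 page -> 0, 4 pages -> 1, 16 pages -> 2, ... */
}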


