12.3. The x64 HAT Layer

12.3.1. MMU Configuration

Modern x64 processors provide both segmentation and paging mechanisms to assist an operating system in memory management. In general, Solaris uses a "flat" model that ignores the x64 segmentation mechanisms as a means of protection (isolating applications and kernel memory) in favor of a pure page table model. The base page size for memory management is always 4096 bytes (4 Kbytes). Once paging mode is activated, every memory address generated by kernel or application instructions is a virtual address that the processor translates to a physical address by interpreting the page tables. The processor control register CR3 holds the physical memory address of the top-level page table, and successive bit fields of the virtual address index the page tables at each level to derive the physical address. Figure 12.11 shows how this works in 64-bit mode.

Figure 12.11. x64 Page Table Translation (64-Bit Mode)


Note that bits 48..63 of every 64-bit virtual address must be either all 0s or all 1s on all current x64 processor implementations. This means that every 64-bit virtual address space has a "hole" of unusable addresses; however, we still have many terabytes of virtual address space.
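
The translation mechanics can be sketched in C. The fragment below is illustrative only (not Solaris source): it checks the canonical-address rule just described and extracts the 9-bit field that indexes the page table at each level in 64-bit mode.

#include <stdint.h>

/* Illustrative sketch, not Solaris source. */

/* Bits 48..63 of a 64-bit VA must equal bit 47 (all 0s or all 1s). */
static int
va_is_canonical(uint64_t va)
{
        return ((uint64_t)((int64_t)(va << 16) >> 16) == va);
}

/*
 * Each level consumes 9 VA bits above the 12-bit page offset:
 * level 0 uses bits 12..20, level 1 bits 21..29, level 2 bits
 * 30..38, and level 3 bits 39..47.
 */
static unsigned int
va_table_index(uint64_t va, int level)
{
        return ((unsigned int)((va >> (12 + 9 * level)) & 0x1ff));
}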

The exact features and configuration used by Solaris for virtual memory management vary by processor type and operating mode.

  • 32-bit non-PAE. Two levels of page tables with 1024 entries, 4 bytes per entry. Used when there is less than 4 Gbytes of physical memory. Large pages, if supported by the processor, are 4 Mbytes.

  • 32-bit PAE. Three levels of page tables. The top table has 4 entries; the other levels have 512 entries, 8 bytes per entry. Used when there are 4 or more Gbytes of memory or the processor supports the NX bit in page table entries. Large pages, if supported by the processor, are 2 Mbytes.

  • 64-bit. Four levels of page tables of 512 entries, 8 bytes per entry. Used in 64-bit Solaris. Large pages, if supported by the processor, are 2 Mbytes.

Note that the Solaris source code ignores the architectural names that Intel and AMD use for the page tables at different levels; instead, the tables are always referred to by level number: 0 (Page Table), 1 (Page Directory), 2 (Page Directory Pointer), or 3 (Page Map Level-4).
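
For reference, here is a hedged mapping of those level numbers to the architectural names and to the amount of memory a single entry at each level maps in 64-bit mode (the sizes follow from 4-Kbyte base pages and 512 entries per table):

/* Illustrative only: Solaris level numbers vs. architectural names. */
static const char *pt_level_name[] = {
        "Page Table",               /* level 0: one entry maps 4 Kbytes   */
        "Page Directory",           /* level 1: one entry maps 2 Mbytes   */
        "Page Directory Pointer",   /* level 2: one entry maps 1 Gbyte    */
        "Page Map Level-4",         /* level 3: one entry maps 512 Gbytes */
};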

12.3.2. struct mmu Variable

In Solaris, a single variable, mmu (a struct hat_mmu_info), is filled in during startup to hold the configuration of the MMU. Its definition follows; the values described below are typical of a 64-bit kernel running on an AMD Opteron processor.

/*
 * HAT/MMU parameters that depend on kernel mode and/or processor type
 */
struct htable;
struct hat_mmu_info {
        x86pte_t pt_nx;          /* either 0 or PT_NX */
        x86pte_t pt_global;      /* either 0 or PT_GLOBAL */

        pfn_t highest_pfn;

        uint_t num_level;        /* number of page table levels in use */
        uint_t max_level;        /* just num_level - 1 */
        uint_t max_page_level;   /* maximum level at which we can map a page */
        uint_t ptes_per_table;   /* # of entries in lower level page tables */
        uint_t top_level_count;  /* # of entries in top most level page table */

        uint_t  hash_cnt;        /* cnt of entries in htable_hash_cache */
        uint_t  vlp_hash_cnt;    /* cnt of entries in vlp htable_hash_cache */

        uint_t pae_hat;          /* either 0 or 1 */

        uintptr_t hole_start;    /* start of VA hole (or -1 if none) */
        uintptr_t hole_end;      /* end of VA hole (or 0 if none) */

        struct htable **kmap_htables; /* htables for segmap + 32 bit heap */
        x86pte_t *kmap_ptes;     /* mapping of pagetables that map kmap */
        uintptr_t kmap_addr;     /* start addr of kmap */
        uintptr_t kmap_eaddr;    /* end addr of kmap */

        uint_t pte_size;         /* either 4 or 8 */
        uint_t pte_size_shift;   /* either 2 or 3 */
        x86pte_t ptp_bits[MAX_NUM_LEVEL];        /* bits set for interior PTP */
        x86pte_t pte_bits[MAX_NUM_LEVEL];        /* bits set for leaf PTE */

        /*
         * The following tables are equivalent to PAGEXXXXX at different levels
         * in the page table hierarchy.
         */
        uint_t level_shift[MAX_NUM_LEVEL];       /* PAGESHIFT for given level */
        uintptr_t level_size[MAX_NUM_LEVEL];     /* PAGESIZE for given level */
        uintptr_t level_offset[MAX_NUM_LEVEL];   /* PAGEOFFSET for given level */
        uintptr_t level_mask[MAX_NUM_LEVEL];     /* PAGEMASK for given level */

        uint_t tlb_entries[MAX_NUM_LEVEL];       /* tlb entries per pagesize */
};
                                                        See i86pc/vm/hat_pte.h


The structure members are as follows (a sketch of how the typical 64-bit values are derived appears after the list):

  • pt_nx. Indicates if and where the processor supports the NX (No eXecute) bit in page table entries.

  • pt_global. Indicates if and where the processor supports the Global Page bit in page table entries.

  • highest_pfn. highest_pfn x 4 Kbytes is the highest physical address the processor can access.

  • num_level. Number of page table levels in use.

  • max_level. num_level - 1.

  • max_page_level. Indicates the highest page table level at which a large page can be mapped. A value of 1 means the processor supports large (2-Mbyte) pages.

  • ptes_per_table. Number of entries in each page table except the top level.

  • top_level_count. Number of entries in top-level page table.

  • pae_hat. Indicates we are using PAE mode (8 bytes / PTE).

  • hole_start and hole_end. Give the boundaries (if any) of a virtual address hole.

  • pte_size. Number of bytes in each page table entry.

  • pte_size_shift. Log (base 2) of pte_size.

  • level_size[l]. Indexed by level l, tells the page size that an entry at that page table level covers.

  • level_shift[l]. Log (base 2) of level_size[l]; that is, 1 << level_shift[l] == level_size[l].

  • level_offset[l]. level_size[l] - 1.

  • level_mask[l]. Same as ~level_offset[l].
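
The following sketch (not the actual Solaris startup code) shows how the per-level arrays described above fit together for the 64-bit case; the constants come from the 64-bit configuration listed in Section 12.3.1.

/* Sketch only: plausible 64-bit values for struct hat_mmu_info. */
static void
mmu_init_sketch(struct hat_mmu_info *mmu)
{
        uint_t l;

        mmu->num_level = 4;              /* four-level page tables */
        mmu->max_level = mmu->num_level - 1;
        mmu->ptes_per_table = 512;
        mmu->top_level_count = 512;
        mmu->pte_size = 8;               /* 8-byte PTEs */
        mmu->pte_size_shift = 3;

        for (l = 0; l < mmu->num_level; l++) {
                /* level_shift is 12, 21, 30, 39: 9 more VA bits per level */
                mmu->level_shift[l] = 12 + 9 * l;
                mmu->level_size[l] = (uintptr_t)1 << mmu->level_shift[l];
                mmu->level_offset[l] = mmu->level_size[l] - 1;
                mmu->level_mask[l] = ~mmu->level_offset[l];
        }
}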

12.3.3. Virtual Address Space Layout

On x64 platforms, the Solaris kernel shares the virtual address space with applications by initializing the top few entries in every application's top-level page table to the same values used in the kernel's page tables. A protection bit in each page table entry prevents user-mode access to kernel memory. Additionally, if the processor supports the global page bit in PTEs, then all kernel mappings are marked global. This prevents kernel TLB entries from being lost when context-switching between applications.

In the 32-bit kernel, the dividing line between kernel and user virtual addresses moves lower on systems with large amounts of physical memory installed, to leave enough kernel virtual address space for the page_t structures.

To satisfy the AMD64 addressing-mode limits, the 64-bit kernel has a "core heap" area that ensures all loadable kernel modules reside within 2 Gbytes of the kernel text and data.

The 64-bit kernel also reserves a 1-Gbyte region at toxic_addr in which to map I/O device control registers. This makes it possible to use ::kgrep to search kernel memory without adverse side effects from reads of memory-mapped device control registers. Since the 32-bit kernel has stricter virtual address limitations, it instead maintains a bitmap (1 bit per 4-Kbyte page) to indicate which kernel virtual addresses are mapped to I/O devices; a sketch of that lookup follows the mdb commands below.

(64-bit)
> ::print toxic_addr
> ::print toxic_size

(32-bit)
> ::print toxic_bit_map
> ::print toxic_bit_map_len        (in bits)
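
A minimal sketch of how the 32-bit kernel's bitmap lookup might work; the indexing base (kernelbase) and word type here are assumptions for illustration, not Solaris source.

/*
 * Sketch only: test whether a kernel VA maps a device register.
 * Assumes 1 bit per 4-Kbyte page, indexed from kernelbase.
 */
static int
va_is_toxic(uintptr_t va, const unsigned long *toxic_bit_map,
    uintptr_t kernelbase)
{
        size_t pg = (va - kernelbase) >> 12;        /* page index */
        size_t bpw = 8 * sizeof (unsigned long);    /* bits per word */

        return ((int)((toxic_bit_map[pg / bpw] >> (pg % bpw)) & 1));
}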


On the 64-bit kernel, the segkpm segment provides immutable virtual mappings to all physical memory on the system. When the 64-bit kernel needs to briefly access memory in any given physical page, it can add the physical address to the value given in kpm_vbase and directly use that virtual address. The segkpm mappings are established with the largest possible page size to reduce the number of page tables needed. Again, because of virtual addressing limitations, seg_kpm is not available in the 32-bit kernel, so the 32-bit kernel must modify its page tables to establish a temporary mapping to the physical page.
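
In other words, on the 64-bit kernel the physical-to-virtual step is just an addition. A minimal sketch, using the kpm_vbase name from the text (the rest is illustrative):

#define PAGESHIFT       12      /* 4-Kbyte base pages */

/* Sketch only: the segkpm virtual address for a physical page. */
static void *
pfn_to_kpm_va(uintptr_t kpm_vbase, uintptr_t pfn)
{
        return ((void *)(kpm_vbase + (pfn << PAGESHIFT)));
}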

12.3.4. 64-Bit Address Space Layout

The kernel text and data are the first two things loaded into memory. Early kernel startup code allocates the initial page_t structures, as well as other data structures required by the kernel's memory allocator, immediately below the kernel text. The Core Heap is used for loadable kernel modules (all kernel text needs to live in the same 2-Gbyte region for the AMD64 memory address model we use). The address space layout was chosen to give maximal usable virtual memory to 64-bit applications while leaving Solaris enough memory to map all physical memory in segkpm and to have a large kernel heap. (See Figure 12.12.)

Figure 12.12. Solaris 64-Bit Address Space Layout (with 64-Bit Application)

                      +------------------------+
                      | Kernel Debugger (kmdb) |  (optionally loaded module)
 0xFFFFFFFF.FF800000  |------------------------|- SEGDEBUGBASE
                      |    Kernel Data / BSS   |
 0xFFFFFFFF.FBC00000  |------------------------|
                      |       Kernel Text      |
 0xFFFFFFFF.FB800000  |------------------------|- KERNEL_TEXT
                      |        Core Heap       |  (used for loadable modules)
 0xFFFFFFFF.C0000000  |------------------------|- core_base
                      |    initial page_t's    |
                      |   memsegs, memlists,   |  (allocated during early startup)
                      |    page hash, etc.     |
 ---                  |------------------------|- valloc_base / ekernelheap
                      |   Kernel Dynamic Heap  |
 0xFFFFFXXX.XXX00000  |------------------------|- kernelheap (floating)
                      |         seg_map        |
 0xFFFFFXXX.XXX00000  |------------------------|- segkmap_start (floating)
                      |   Mem Mapped Devices   |
 0xFFFFFXXX.XXX00000  |------------------------|- toxic_addr (floating)
                      |          segkp         |
 ---                  |------------------------|- segkp_base
                      |         segkpm         |
 0xFFFFFE00.00000000  |------------------------|
                      |///////Red Zone/////////|
 0xFFFFFD80.00000000  |------------------------|- KERNELBASE
                      |       User Stack       |  User space memory
                      |                        |
                      |   shared objects, etc  |  (grows downward)
                      :                        :
 0xFFFF8000.00000000  |------------------------|
                      |////////////////////////|
                      |    Virt. Addr. Hole    |
                      |////////////////////////|
 0x00008000.00000000  |------------------------|
                      :                        :
                      |        User Heap       |  (grows upward)
                      |------------------------|
                      |      User Data/BSS     |
                      |------------------------|
                      |        User Text       |
 0x00000000.04000000  |------------------------|
                      |         Invalid        |  VA==0 not mapped to catch NULL ptr refs
 0x00000000.00000000  +------------------------+

12.3.5. 32-Bit Address Space Layout

Unlike the case in Linux, the kernel text and data live at the top of the virtual address range. This allows the value of kernelbase on machines with large physical memory to be automatically adjusted downward to accommodate additional page_t structures without relinking the kernel image.

Although not shown, when a 32-bit application runs on the 64-bit kernel, the application can use all virtual memory below 0xFE000000; on a few very early x64 processors, though, this is reduced to 0xC0000000 to work around a processor erratum. This means that 32-bit applications on the 64-bit kernel can use almost 1 Gbyte more virtual memory than they can on the 32-bit kernel. (See Figure 12.13.)

Figure 12.13. 32-Bit Kernel's Address Space Layout

              +-----------------------+
              |                       |
 0xFFC00000  -|-----------------------|
              | Kernel Debugger - kmdb|  (optionally loaded module)
 0xFF800000  -|-----------------------|- SEGDEBUGBASE
              |    Kernel Data/BSS    |
 0xFEC00000  -|-----------------------|
              |      Kernel Text      |
 0xFE800000  -|-----------------------|- KERNEL_TEXT
              |       page_t's        |
              |  memsegs, memlists,   |  (allocated during early startup)
              |    page hash, etc.    |
 ---         -|-----------------------|- valloc_base (floating)
              |  Kernel Dynamic Heap  |
 ---         -|-----------------------|- kernelheap (floating)
              |        Seg_map        |
 0xC3002000  -|-----------------------|- segkmap_start (floating)
              |///////RED ZONE////////|
 0xC3000000  -|-----------------------|- kernelbase / userlimit (floating)
              |                       |
              | User Shared Libraries |  (grows downward)
              |                       |
              :                       :
              |       User Heap       |  (grows upward)
              |-----------------------|
              |    User Data / BSS    |
              |-----------------------|
              |       User Text       |
 0x08048000  -|-----------------------|
              |      User Stack       |  (grows downward)
              :                       :
              |        Invalid        |  VA==0 not mapped to catch NULL ptr refs
 0x00000000   +-----------------------+

12.3.6. HAT Implementation

The x64 HAT layer has several major data structures:

  • struct hat (a.k.a. hat_t). Contains information about each address space. These are arranged in a doubly linked list starting with the kernel address space's hat_t.

  • struct htable (htable_t). Manages a page table (one per page table needed in an address space). These are stored in a hash table rooted in the hat_t.

  • struct hment (hment_t). Tracks which physical memory pages are mapped by which page tables. These are in a linked list off the struct page.

12.3.6.1. struct hat Data Structure

/*
 * The hat struct exists for each address space.
 */
struct hat {
        kmutex_t        hat_mutex;
        kmutex_t        hat_switch_mutex;
        struct as       *hat_as;
        uint_t          hat_stats;
        pgcnt_t         hat_pages_mapped[MAX_PAGE_LEVEL + 1];
        cpuset_t        hat_cpus;
        uint16_t        hat_flags;
        htable_t        *hat_htable;    /* top-level htable */
        struct hat      *hat_next;
        struct hat      *hat_prev;
        uint_t          hat_num_hash;   /* number of htable hash buckets */
        htable_t        **hat_ht_hash;  /* htable hash buckets */
        htable_t        *hat_ht_cached; /* cached free htables */
        x86pte_t        hat_vlp_ptes[VLP_NUM_PTES];
};
typedef struct hat hat_t;


  • hat_pages_mapped. An array that gives the number of pages currently mapped (by page size) in a process.

  • hat_htable. The htable_t for the top-level page table in this address space.

  • hat_ht_hash. The hash table of htable_t's that belong to this address space.

  • hat_vlp_ptes. Explained next.

The 64-bit kernel (and the 32-bit PAE kernel) creates page tables only for the bottom levels (0 and 1) of the page table tree for 32-bit user processes. The upper-level page table entries are instead stored in hat_vlp_ptes[] in the hat_t and are copied to/from a per-CPU set of upper-level page tables whenever the process is running. On the 64-bit kernel this reduces memory consumption by two 4-Kbyte pages for each 32-bit process, and the hat_vlp_ptes[] values correspond to level-2 page table entries.
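
The copy itself is simple. A hedged sketch of the idea follows; the per-CPU destination table here is hypothetical, and this is not the actual context-switch code.

/*
 * Sketch only: on a switch to a 32-bit process, load its cached
 * upper-level entries into this CPU's private page table.
 */
static void
vlp_load_sketch(hat_t *hat, x86pte_t *cpu_vlp_pagetable)
{
        int i;

        for (i = 0; i < VLP_NUM_PTES; i++)
                cpu_vlp_pagetable[i] = hat->hat_vlp_ptes[i];
}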

12.3.6.2. Page Tables and struct htable

For each page table needed, there is a struct htable to manage it. To find the htables being used to map an address, you can use the ::vatopfn and ::pte dcmds in mdb.

> ffffffff80aa8000::vatopfn
        level=0 htable=ffffffff8067f5e8 pte=800000007fc78763
        level=1 htable=ffffffff8067ec68 pte=7fc88027
        level=2 htable=ffffffff8067e290 pte=9697027
        level=3 htable=ffffffff8067f678 pte=2d2da992435c7878
Virtual ffffffff80aa8000 maps Physical 7fc78000


The ::pte command can decode a page table entry.

> 800000007fc78763::pte
PTE=800000007fc78763: noexec page=0x7fc78 noconsist nosync global write


The example shows that the kernel heap virtual address 0xffffffff80aa8000 is mapped to physical address 0x7fc78000 and that it's a global kernel writable mapping.
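
The decoding follows directly from the architectural PTE bit layout. Below is a sketch (standard x64 bit positions; not the ::pte implementation) that reproduces the interesting parts of the output above.

#include <stdint.h>
#include <stdio.h>

#define PT_VALID        (1ULL << 0)     /* present */
#define PT_WRITABLE     (1ULL << 1)
#define PT_USER         (1ULL << 2)     /* user-mode access allowed */
#define PT_GLOBAL       (1ULL << 8)
#define PT_NX           (1ULL << 63)    /* no-execute */

/* The page frame number occupies bits 12..51. */
#define PTE_PFN(pte)    (((pte) >> 12) & 0xffffffffffULL)

int
main(void)
{
        uint64_t pte = 0x800000007fc78763ULL;

        /* Prints: page=0x7fc78 noexec global write */
        printf("page=0x%llx%s%s%s\n",
            (unsigned long long)PTE_PFN(pte),
            (pte & PT_NX) ? " noexec" : "",
            (pte & PT_GLOBAL) ? " global" : "",
            (pte & PT_WRITABLE) ? " write" : "");
        return (0);
}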

To go from a physical page to all its usage in the HAT, use the ::report_maps dcmd.

> 7fc78::report_maps
Pagetable for hat=ffffffff8012ff38 htable=ffffffff8067f438
hat=ffffffff8012ff38 maps addr=fffffe80bf800000
hat=ffffffff8012ff38 maps addr=ffffffff80aa8000


The example shows that the page 7fc78 is a page table and that 0xffffffff8067f438 is where the corresponding htable_t is. The page is also mapped by a mapping starting at virtual address 0xfffffe80bf800000 (the segkpm mapping to the page) and by an additional kernel mapping at virtual address 0xffffffff80aa8000. Let's look at the actual htable.

> ffffffff8067f438::print -t htable_t
{
    struct htable *ht_next = 0
    struct hat *ht_hat = 0xffffffff8012ff38
    uintptr_t ht_vaddr = 0xfffffe80bf800000
    level_t ht_level = 0
    uint16_t ht_flags = 0
    int16_t ht_busy = 0x1
    uint16_t ht_num_ptes = 0x200
    int16_t ht_valid_cnt = 0xe6
    uint32_t ht_lock_cnt = 0
    pfn_t ht_pfn = 0x7fc78
    struct htable *ht_prev = 0xffffffff8067e9e0
    struct htable *ht_parent = 0xffffffff8067f480
    struct htable *ht_shares = 0
}


  • ht_next. Links htables in the hash table rooted in the hat_t.

  • ht_hat. Identifies the hat_t that is using this htable.

  • ht_pfn. Is the physical page number of the page table that goes with this htable.

  • ht_vaddr. Is the virtual address that corresponds to the first entry of the page table.

  • ht_level. Is the level (0, 1, 2, 3) of the page table; together with ht_vaddr, it determines which page table entry covers a given virtual address (see the sketch after this list).

  • ht_valid_cnt. Number of entries in use in the page table.

  • ht_parent. The htable at one level higher (ht_level + 1) whose page table has an entry for this htable.
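
Together, ht_vaddr, ht_level, and ht_num_ptes are enough to locate the PTE slot for any address the page table covers. Here is a sketch of the computation, assuming the global mmu variable from Section 12.3.2 (the real HAT code has a helper for this):

/*
 * Sketch only: which PTE slot within this htable's page table
 * covers va (va must lie within the range the table maps).
 */
static uint_t
va_to_entry_sketch(htable_t *ht, uintptr_t va)
{
        return (((va - ht->ht_vaddr) >> mmu.level_shift[ht->ht_level]) &
            (ht->ht_num_ptes - 1));
}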

The HAT does not normally maintain virtual kernel mappings to all page tables. Instead, it remembers only the page frame number (PFN) of the page tables and then either remaps them when needed in a window (32-bit kernel) or uses seg_kpm (64-bit kernel) to access them. When debugging, you can use the ::ptable command to examine the contents of a page table when you know its PFN.

> 0x7fc78::ptable
htable=ffffffff8067f438
[  0] va=fffffe80bf800000 PTE=800000006f3c8121: noexec page=0x6f3c8 global
[  4] va=fffffe80bf804000 PTE=8000000070b5c121: noexec page=0x70b5c global
[  5] va=fffffe80bf805000 PTE=800000007633d121: noexec page=0x7633d global
[  6] va=fffffe80bf806000 PTE=800000007137e121: noexec page=0x7137e global
[  9] va=fffffe80bf809000 PTE=800000006f871121: noexec page=0x6f871 global
[ 10] va=fffffe80bf80a000 PTE=800000006f852121: noexec page=0x6f852 global
[ 12] va=fffffe80bf80c000 PTE=800000006f874121: noexec page=0x6f874 global
[ 14] va=fffffe80bf80e000 PTE=800000006f896121: noexec page=0x6f896 global
[ 16] va=fffffe80bf810000 PTE=800000006f3f8121: noexec page=0x6f3f8 global
[ 18] va=fffffe80bf812000 PTE=800000006f3da121: noexec page=0x6f3da global
[ 20] va=fffffe80bf814000 PTE=800000006f33c121: noexec page=0x6f33c global
[ 21] va=fffffe80bf815000 PTE=800000006f35d121: noexec page=0x6f35d global
...


The example shows active page table entries by index, the VA mapped, the actual PTE, and its decoded meaning. These are all kernel heap addresses, so the global bit is set and they are marked with the NX (no execute) bit.

12.3.6.3. struct hment and struct Page

The HAT also maintains a reverse mapping that, given a page, can tell which address spaces have virtual mappings to it and at which addresses. To save memory, the HAT tries to embed the mapping information directly in the page's page_t structure. Let's start by looking at page number 0x6f3c8.

> 6f3c8::page_num2pp
6f3c8 has page at fffffffffaee77d0
> fffffffffaee77d0::print -t page_t
{
    u_offset_t p_offset = 0
    struct vnode *p_vnode = 0xffffffff83227440
...
    uchar_t p_nrm = 0x2
    uchar_t p_embed = 0x1
    uchar_t p_index = 0
    uchar_t p_toxic = 0
    void *p_mapping = 0xffffffff8067f438
    pfn_t p_pagenum = 0x6f3c8
    uint_t p_share = 0
    uint_t p_sharepad = 0
    uint_t p_msresv_1 = 0
    uint_t p_mlentry = 0
    uint64_t p_msresv_2 = 0
}


  • p_embed. Non-zero indicates that the page is mapped only once and that the mapping information is embedded in the page_t structure.

  • p_mapping. Since p_embed is 1, the htable whose page table maps this page.

  • p_mlentry. Index for the page table entry.

If there is more than one mapping to a page, then a list is used and these fields change their meaning.

> f7a41::page_num2pp
f7a41 has page at fffffffffb5d7e30
> fffffffffb5d7e30::print page_t
{
    p_offset = 0
    p_vnode = 0xffffffff833a8840
...
    p_nrm = 0x2
    p_embed = 0
    p_index = 0
    p_toxic = 0
    p_mapping = 0xffffffff83249a28
    p_pagenum = 0xf7a41
    p_share = 0x1
    p_sharepad = 0
    p_msresv_1 = 0
    p_mlentry = 0x1f0
    p_msresv_2 = 0
}
> 0xffffffff83249a28::print -t hment_t
{
    struct hment *hm_hashnext = 0
    struct hment *hm_next = 0
    struct hment *hm_prev = 0
    htable_t *hm_htable = 0xffffffff83246dc8
    uint16_t hm_entry = 0x1f0
    uint16_t hm_pad = 0x8324
    uint32_t hm_pad2 = 0xffffffff
}


In this case p_embed is 0, so p_mapping is interpreted as an hment_t pointer. The hment then gives the htable and the entry index in the htable's page table. If there were multiple mappings to this page, the hment list pointers would be filled in. Since most process memory is not shared, mapping information is usually found directly in the page_t structures.
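
A sketch of how a consumer would interpret p_mapping, using the field names from the structures above (the callback is hypothetical):

/* Sketch only: visit every (htable, entry) pair that maps a page. */
static void
page_walk_mappings(page_t *pp, void (*cb)(htable_t *ht, uint_t entry))
{
        if (pp->p_mapping == NULL)
                return;         /* page is not mapped anywhere */

        if (pp->p_embed) {
                /* Single mapping, embedded directly in the page_t. */
                cb((htable_t *)pp->p_mapping, pp->p_mlentry);
        } else {
                /* Multiple mappings: p_mapping heads an hment list. */
                hment_t *hm;

                for (hm = (hment_t *)pp->p_mapping; hm != NULL;
                    hm = hm->hm_next)
                        cb(hm->hm_htable, hm->hm_entry);
        }
}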

12.3.6.4. Physical Memory and DMA

The lists used to track free physical memory pages are broken up into four ranges (by address) that roughly track the sort of legacy DMA ranges needed in the past on the PC architecture:

  • 0 to 16 Mbytes

  • 0 to 2 Gbytes

  • 2 Gbytes to 4 Gbytes

  • More than 4 Gbytes

Memory allocations for devices with DMA range limitations are directed to the appropriate memory range. All other memory allocations go to the highest range with available memory. Once physical pages are allocated and mapped into the kernel for DMA, the I/O system tracks the memory, using power-of-two allocation bins (16 Mbytes, 32 Mbytes, 64 Mbytes, etc.) to speed up subsequent DMA allocate() and free() operations.
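
A hedged sketch of the range-selection policy just described; the enum and function are illustrative, not the Solaris page freelist code.

#include <stdint.h>

/* Sketch only: the four physical ranges described above. */
typedef enum {
        MR_BELOW_16M,           /* 0 - 16 Mbytes (legacy ISA DMA) */
        MR_BELOW_2G,            /* 0 - 2 Gbytes */
        MR_BELOW_4G,            /* 2 - 4 Gbytes */
        MR_ABOVE_4G             /* above 4 Gbytes */
} memrange_t;

/*
 * Pick the highest range a device can address, given the highest
 * physical address (inclusive) its DMA engine supports.
 */
static memrange_t
dma_pick_range(uint64_t dma_addr_hi)
{
        if (dma_addr_hi < (1ULL << 24))
                return (MR_BELOW_16M);
        if (dma_addr_hi < (1ULL << 31))
                return (MR_BELOW_2G);
        if (dma_addr_hi < (1ULL << 32))
                return (MR_BELOW_4G);
        return (MR_ABOVE_4G);
}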



