12.2. The UltraSPARC HAT Layer

In this section, we discuss the implementation of the Solaris HAT layer on UltraSPARC processors.

12.2.1. Introduction

As shown in Figure 12.2, UltraSPARC processors use a memory management unit in the microprocessor to convert virtual addresses to physical addresses on-the-fly. The MMU uses a table known as the translation lookaside buffer (TLB) to manage these translations. The HAT layer programs the microprocessor's TLB with entries that map virtual addresses to their corresponding physical addresses.

Figure 12.2. UltraSPARC-I-IV MMU Topology


Since the size of the TLB is limited by hardware, the TLB is typically supplemented by a larger (but slower) in-memory table of virtual-to-physical translations. On UltraSPARC processors, this table is known as the translation storage buffer (TSB); on most other architectures, it is known as the page table. When the microprocessor needs to convert a virtual address into a physical address, it first searches the TLB (a hardware search), and if a physical address is not found (that is, hardware encountered a TLB miss), the microprocessor searches the larger in-memory table. The relationship of these components is shown in Figure 12.3.

Figure 12.3. Virtual Address Translation Hardware and Software


UltraSPARC microprocessors use a software TLB replacement strategy: When a TLB miss occurs, software is invoked to search the in-memory table (the TSB) for the required translation entry.

Let's walk through a simple example. Suppose a process allocates some memory within its heap by calling malloc(), and further suppose that malloc() returns to the program a virtual address for the requested memory. When that memory is first referenced, the virtual memory layer requests a physical memory page from the system's free lists. This newly acquired page has an associated physical address within physical memory. The virtual memory system then constructs in software a translation entry containing the virtual address (the start of the page returned by malloc()) and the physical address of the new page. This newly created translation entry is inserted into the TSB and programmed into an available slot in the microprocessor's TLB. The entry is also kept in software, linked to the address space of the process to which it belongs. Later, when the program reads from the virtual address, if the new entry still resides in the TLB (it may have been evicted by other activity), the virtual address is translated to the physical address on-the-fly. If the TLB entry has been evicted, a TLB miss occurs: a hardware exception is raised, and the translation entry is looked up in the larger TSB.

The TSB is also limited in size, and in extreme circumstances a TSB miss can occur, requiring a lengthy search of the software structures linked to the process.
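The lookup hierarchy just described can be modeled in a few lines of C. The following standalone sketch uses toy direct-mapped arrays for the "TLB" and "TSB" and a linear list standing in for the software hash structures; all sizes and names are illustrative, not the real kernel's.

#include <stdint.h>
#include <stdio.h>

/*
 * Toy model of the lookup hierarchy: a tiny "TLB" and "TSB" as
 * direct-mapped arrays, with a linear list standing in for the
 * software hash structures. All sizes and names are illustrative.
 */
#define PAGEMASK 0x1fffULL              /* 8-Kbyte pages */
#define TLB_SZ  4
#define TSB_SZ  16

typedef struct { uint64_t va, pa; int valid; } xlate_t;

static xlate_t tlb[TLB_SZ], tsb[TSB_SZ];
static xlate_t soft[] = { { 0x10000, 0xb490000, 1 } };  /* "hash chains" */

static uint64_t
translate(uint64_t va)
{
        uint64_t vpn = va >> 13, base = va & ~PAGEMASK;
        xlate_t *t = &tlb[vpn % TLB_SZ], *s = &tsb[vpn % TSB_SZ];
        size_t i;

        if (t->valid && t->va == base)          /* TLB hit (hardware) */
                return (t->pa | (va & PAGEMASK));
        if (s->valid && s->va == base) {        /* TLB miss, TSB hit */
                *t = *s;                        /* reload the TLB */
                return (s->pa | (va & PAGEMASK));
        }
        for (i = 0; i < sizeof (soft) / sizeof (soft[0]); i++)
                if (soft[i].va == base) {       /* TSB miss: long search */
                        *s = soft[i];           /* refill TSB, then TLB */
                        *t = soft[i];
                        return (soft[i].pa | (va & PAGEMASK));
                }
        return ((uint64_t)-1);                  /* page fault */
}

int
main(void)
{
        printf("va 0x10028 -> pa 0x%llx\n",
            (unsigned long long)translate(0x10028));
        return (0);
}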

12.2.2. struct hat

The UltraSPARC hat structure anchors all HAT layer information and structures relating to a single process address space. These include the process's context ID (also known as the context number), a pointer to its as structure, its TSBs, and various flags and status bits. Let's look at an example of how we obtain the contents of the hat structure for a running process on the system:

# ps
   PID TTY          TIME CMD
  5152 pts/6        0:00 sh
  5153 pts/6        0:00 bash
  5162 pts/6        0:00 ps


To get to the hat structure associated with sh we first need to find the address of its proc structure. We can do this in mdb using the PID we obtained from above:

> 0t5152::pid2proc
30eb1c840b8


Alternatively, we could have just used the ::ps dcmd which lists the proc address as part of its output:

> ::ps ! grep 5152
R   5152    5140   5152   5140      0 0x00004000 0000030eb1c840b8 sh
R   5153    5152   5153   5140      0 0x00014000 0000030077a64020 bash


Having obtained the proc address, we can walk the link chain as illustrated in Figure 12.4 to get to the hat structure. Note that the proc structure is also known as proc_t:

> 30eb1c840b8::print proc_t p_as
p_as = 0x32b8832da68
> 0x32b8832da68::print struct as a_hat
a_hat = 0x32b8831dea8
> 0x32b8831dea8::print -t struct hat
{
     void *sfmmu_xhat_provider = 0
     cpuset_t sfmmu_cpusran = {
         ulong_t [9] cpub = [ 0x10, 0, 0, 0, 0, 0, 0, 0, 0 ]
     }
     struct as *sfmmu_as = 0x32b8832da68
     ulong_t [4] sfmmu_ttecnt = [ 0x26, 0, 0, 0 ]
     ulong_t [4] sfmmu_ismttecnt = [ 0, 0, 0, 0 ]
     union _h_un h_un = {
         ism_blk_t *sfmmu_iblkp = 0
         ism_ment_t *sfmmu_imentp = 0
     }
     unsigned sfmmu_free = 0
     unsigned sfmmu_ismhat = 0
     unsigned sfmmu_ctxflushed = 1
     uchar_t sfmmu_rmstat = 0
     uchar_t sfmmu_clrstart = 0xac
     ushort_t sfmmu_clrbin = 0xac
     short sfmmu_cnum = 0xbf7
     uchar_t sfmmu_cext = 0
     uchar_t sfmmu_flags = 0
     struct tsb_info *sfmmu_tsb = 0x30004c92270
     uint64_t sfmmu_ismblkpa = 0xffffffffffffffff
     kcondvar_t sfmmu_tsb_cv = {
         ushort_t _opaque = 0
     }
     uint8_t [4] sfmmu_pgsz = [ 0, 0, 0, 0 ]
}


Figure 12.4. Linkage from the proc Structure to the hat Structure


The hat structure fields are as follows:

  • sfmmu_xhat_provider. Used by XHAT, an extension to the Solaris HAT layer that allows a device with its own memory management unit (MMU) to share virtual address space with processes and the kernel. It is set to NULL for "regular" CPU hat structures.

  • sfmmu_cpusran. CPU bit-mask used for efficient cross-calling.

  • sfmmu_as. Pointer to the as this hat provides mapping for.

  • sfmmu_ttecnt[]. Array of per-pagesize TTE counts.

  • sfmmu_ismttecnt[]. Array of per-page-size ISM TTE counts (estimated).

  • sfmmu_iblkp. Pointer to ISM mapping block. See Section 12.2.5.

  • sfmmu_imentp. Used by the ISM hat to point to its mapping list. See Section 12.2.5.

  • sfmmu_free. A bit, if set, indicates that this hat is in the process of being freed. It is set by as_free() when an address space is being torn down.

  • sfmmu_ismhat. A bit, if set, indicates that this is a dummy ISM hat. See Section 12.2.5.

  • sfmmu_ctxflushed. A bit, if set, indicates that the ctx has been flushed.

  • sfmmu_rmstat. Refmod stats reference count.

  • sfmmu_clrstart. Start color bin for page coloring.

  • sfmmu_clrbin. Per as physical page coloring bin.

  • sfmmu_cnum. Context number (a.k.a. context ID).

  • sfmmu_flags. hat disposition flags.

  • sfmmu_tsb. List of per as TSBs.

  • sfmmu_ismblkpa. PA of ISM mapping block. If there are no ISM mappings, this is set to -1.

  • sfmmu_tsb_cv. Signals TSB swap-in or relocation.

  • sfmmu_cext. Encoding of large page sizes used to program the TLBs.

  • sfmmu_mflags. MMU-specific page size exclusivity. The UltraSPARC IV+ MMU supports the use of either 32-Mbyte or 256-Mbyte TTEs, but not both, on a per-context basis. This field flags the exclusive page size used by the process.

  • sfmmu_mcnt. Keeps track of the number of segments using the exclusive page size.

  • sfmmu_pgsz[]. Preferred page size ranking for programming the TLBs.

12.2.3. The Translation Table

There are many ways to implement translation tables or page tables. The older SPARC (sun4m and sun4d) architectures employ a three-level page table as described in the SPARC Reference MMU (SRMMU) specification. The first-level page table consists of 256 entries that point to 256 second-level tables. In turn, each second-level table points to 64 third-level tables that contain the actual page table entries.

The problem with multilevel page tables in general is that they are inefficient in terms of space when sparse address spaces are mapped, because space for nonmapped pages needs to be allocated in the table. A 32-bit address space mapped with 4-Kbyte pages requires a total of 2³² ÷ 4096 = 1,048,576 entries in the lowest-level page tables per context. With 32-bit page table entries, this translates to 4 Mbytes of memory. So, if a process uses just 12 Kbytes (three 4-Kbyte pages) to map in its text, data, and stack segments, it would still need at least 4 Mbytes for its page table.

Nor do multilevel page tables scale in terms of space with larger address spaces. If we were to map a 64-bit virtual address space using 8-Kbyte pages, we would need more than 2 quadrillion entries in the lowest-level table alone per context! Of course, we could reduce the number of entries needed by increasing the page size, but this wastes memory because it increases the allocation granularity.
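The arithmetic behind these two claims is easy to verify. This small standalone program reproduces the numbers quoted above (4 Mbytes of lowest-level page table for a 32-bit space, and roughly 2.25 quadrillion entries for a 64-bit space with 8-Kbyte pages):

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
        /* 32-bit address space, 4-Kbyte pages, 4-byte PTEs */
        uint64_t entries32 = (1ULL << 32) / 4096;
        printf("32-bit: %llu entries = %llu Mbytes of PTEs\n",
            (unsigned long long)entries32,
            (unsigned long long)(entries32 * 4 >> 20));

        /* 64-bit address space, 8-Kbyte (2^13 byte) pages */
        uint64_t entries64 = 1ULL << (64 - 13);
        printf("64-bit: %llu entries (~%.2f quadrillion)\n",
            (unsigned long long)entries64, entries64 / 1e15);
        return (0);
}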

So, as we have seen, page tables do not scale to sparse 64-bit address spaces. One possible solution for such an address space, pioneered in the IBM System/38, is the inverted page table (IPT). Inverted page tables have entries only for each physical page of memory, and hence their size does not depend on the size of the virtual address space. The sun4u architecture employs an improvement over IPTs called hashed page tables (HPTs). In general, HPTs use a hash of the virtual address to index into a hash table. The resulting hash bucket points to the head of a list of data nodes containing table entries, which are searched for a matching virtual address and context.

Solaris implements a translation table based on the hme_blk and its associated data structures, which are used to keep track of active mappings. There are two tables: one for kernel mappings and another for user mappings. Figure 12.5 illustrates the relationship between the different structures, which perform a function similar to that of the page tables in the sun4m architecture. Each hme_blk structure defines virtual-to-physical mappings for a particular address space and virtual address range. The hme_blk structures are organized into a series of hash buckets based on an address space identifier, the virtual address, and the page size used. In the event of a TSB miss, a hash of these elements is used to locate the correct hash bucket, and then a linear search of the bucket's list finds the corresponding hme_blk for the mapping.

Figure 12.5. Hash Table Data Structures


In the following sections we will describe the data structures and functions associated with the hash table in detail.

12.2.3.1. The Translation Table Entry

Each entry of the TLB consists of a translation table entry (TTE), which describes the mapping and provides details of its associated properties. The TTE may be thought of as corresponding to a page table entry (PTE) in the sun4m architecture. A TTE is made up of two components, the tag and the translation data, each 64 bits long. The TTE tag contains the encoded virtual address and context ID (Figure 12.7), and the TTE data contains the corresponding physical address together with various properties associated with the translation (Figure 12.6). The context ID is a 13-bit quantity used to distinguish between different address spaces, so that the same virtual addresses in different address spaces can coexist in the TLB. One of the most significant properties of the mapping is its size. Each TTE maps a contiguous area of memory, which can be 8 Kbytes, 64 Kbytes, 512 Kbytes, or 4 Mbytes in size (and additionally 32 Mbytes or 256 Mbytes on UltraSPARC IV+). Note that this mapping size is not directly related to the underlying virtual page size, which remains 8 Kbytes (see Section 9.10.1). Other properties of the mapping include the write and execute permissions and cacheability in the physically indexed and virtually indexed caches.

Figure 12.6. TTE Data Fields


Figure 12.7. Hardware and Software Representations of the TTE Tag


A TLB hit occurs if both the virtual address and context supplied to the TLB correspond to those of a particular TTE loaded in the TLB. The comparison is based on the MMU TTE tag field. Address aliasing is permitted, so multiple TLB entries with the same physical address but different virtual addresses may exist. However, the reverse situation of multiple entries with the same virtual address but different physical addresses produces undefined results. In the event of a TLB miss trap, the TSB provides a software-managed, direct-mapped cache that is used to reload the TLB.
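Conceptually, the tag comparison reduces to matching the virtual page bits and the 13-bit context ID against each entry's tag. The struct below is a deliberately simplified model for illustration only; the exact hardware bit layout is the one shown in Figure 12.7.

#include <stdint.h>
#include <stdio.h>

/*
 * Simplified model of a TLB tag match: VA bits 63..22 plus the
 * 13-bit context ID. Illustrative only; the exact bit layout is
 * shown in Figure 12.7.
 */
typedef struct {
        uint64_t va_63_22;      /* virtual address bits 63..22 */
        uint16_t ctx;           /* 13-bit context ID */
} tte_tag_t;

static int
tag_match(const tte_tag_t *tag, uint64_t va, uint16_t ctx)
{
        return (tag->va_63_22 == (va >> 22) &&
            (tag->ctx & 0x1fff) == (ctx & 0x1fff));
}

int
main(void)
{
        tte_tag_t t = { 0x10028ULL >> 22, 0xbf7 };

        printf("same ctx: hit=%d\n", tag_match(&t, 0x10028, 0xbf7));
        printf("other ctx: hit=%d\n", tag_match(&t, 0x10028, 0x001));
        return (0);
}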

In Solaris 9 and prior versions, the sun4u kernel implemented TSBs that could be shared amongst many contexts just like the TLB, so the TTE tag used in the TSB contained the context ID. However, with the introduction of the per-process dynamic TSB framework in Solaris 10, TSBs are now private to a process and hence the context ID is no longer required in the TSB TTE tag (see Figure 12.7 for a description of the TSB TTE tag fields). A TSB hit occurs if the virtual address supplied corresponds to the tag of a particular entry in the faulting process's TSB.

12.2.3.2. sf_hment Structure

The HAT layer uses a HAT mapping entry (HME) structure to keep track of virtual-to-physical address translations. On the sun4u kernel architecture, this is called the sf_hment structure and it contains the TTE for a particular mapping.

struct sf_hment {
        tte_t hme_tte;                  /* tte for this hment */
        union {
                struct page *page;      /* what page this maps */
                struct pa_hment *data;  /* pa_hment */
        } sf_hment_un;
        struct  sf_hment *hme_next;     /* next hment */
        struct  sf_hment *hme_prev;     /* prev hment */
};
                                                See sfmmu/vm/hat_sfmmu.h


The sf_hment structure points to the physical page it maps through the page pointer. There is a one-to-one correspondence between sf_hment structures and TTEs. The hme_next and hme_prev pointers form a chain that links all the virtual mappings for this physical page. Since a single physical page can be mapped into multiple address spaces at differing virtual addresses, one physical address can be referred to by many virtual addresses (virtual address aliasing), meaning that one page can be pointed to by many sf_hment structures. Therefore, to speed up the search for mappings of a particular page, we put related sf_hment structures on a null-terminated, doubly linked list.
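Walking a page's mapping list in C mirrors what we will do shortly with the ::list dcmd. This is a minimal sketch: the page structure here is pared down to its p_mapping field, and the sf_hment union is collapsed to a plain pointer for brevity.

#include <stddef.h>
#include <stdio.h>

typedef struct tte { unsigned long long ll; } tte_t;

struct sf_hment {                       /* union collapsed for brevity */
        tte_t hme_tte;
        struct page *page;
        struct sf_hment *hme_next;
        struct sf_hment *hme_prev;
};

struct page {                           /* pared down to one field */
        struct sf_hment *p_mapping;
};

/*
 * Count the virtual mappings of a physical page by walking the
 * null-terminated, doubly linked list rooted at p_mapping.
 */
static int
count_mappings(const struct page *pp)
{
        int n = 0;
        const struct sf_hment *hme;

        for (hme = pp->p_mapping; hme != NULL; hme = hme->hme_next)
                n++;
        return (n);
}

int
main(void)
{
        struct page pg = { NULL };
        struct sf_hment h1 = { { 0 }, &pg, NULL, NULL };
        struct sf_hment h2 = { { 0 }, &pg, NULL, &h1 };

        h1.hme_next = &h2;
        pg.p_mapping = &h1;
        printf("p_share would be %d\n", count_mappings(&pg));
        return (0);
}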

Let's dig into the instances of bash running on a particular system.

> ::ps ! grep bash
R   4147   4145   4147    4147  75447  0x4a014000 000006000b255828 bash
R   4160   4159   4160    4147      0  0x4a014000 0000060002518010 bash
R   4112   4110   4112    4112  75447  0x4a014000 000006000b2587a8 bash
R   4126   4125   4126    4112      0  0x4a014000 000006000ba4c418 bash
R   4053   4051   4053    4053  75447  0x4a014000 000006000a43bb90 bash


The ::ps mdb dcmd lists five running instances of bash, with the eighth column of the output being the address of each process's proc structure. As an illustration, let's attempt to find the mapping for the virtual address 0x10028 belonging to the first reported process. The dcmd that will help us is ::sfmmu_vtop, which prints the virtual-to-physical mapping of a given address. But before we can use it, we first need to get the address space belonging to the process from its proc structure.

> 000006000b255828::print proc_t p_as
p_as = 0x6000a96a2b8
> 0x10028::sfmmu_vtop -v -a 0x6000a96a2b8
sfmmup=6000b3a5c40 hmebp=70001cb8820 hmeblkp=3000493f7d8 tte=800000000b4906a1 pfn=5a48 pp=70002aa2400
address space 6000a96a2b8: virtual 10028 mapped to physical b490028


We found it! Virtual address 0x10028 is mapped to physical address 0xb490028. Following are the other values reported by ::sfmmu_vtop when the -v flag is used:

  • sfmmup. Address of hat structure for this as

  • hmebp. Pointer to the HME hash table entry this virtual address maps to (see Section 12.2.3.6)

  • hmeblkp. Pointer to the hme_blk that contains this mapping (see Section 12.2.3.3)

  • tte. TTE for this mapping

  • pfn. Page frame number

  • pp. Pointer to the page structure

Let's look at the page this virtual address belongs to.

> 70002aa2400::print struct page
{
     p_offset = 0
     p_vnode = 0x60014fe90c0
...
     p_mapping = 0x3000f0c8f10
     p_pagenum = 0x5a48
     p_share = 0x5
...
}


As a point of validation, notice that the p_pagenum field and the PFN reported by ::sfmmu_vtop agree with each other, so we are in fact looking at the correct page. The p_share field is 5, indicating that this physical page is being mapped by five TTEs. Remember that five bash processes were reported by ::ps. It so happens that the virtual address of 0x10028 that we picked for our example falls on a text page and so is shared among all the instances of the bash binary. The sf_hment structures containing these related TTEs are linked through the page's p_mapping list. We can use the ::list dcmd to help us traverse the list.

> 0x3000f0c8f10::list struct sf_hment hme_next
3000f0c8f10
3000493f810
3000f0d2a90
300053d3b10
300049eba20
> 0x3000f0c8f10::list struct sf_hment hme_next | ::print struct sf_hment sf_hment_un.page
sf_hment_un.page = 0x70002aa2400
sf_hment_un.page = 0x70002aa2400
sf_hment_un.page = 0x70002aa2400
sf_hment_un.page = 0x70002aa2400
sf_hment_un.page = 0x70002aa2400


The second command in the above example prints the page each sf_hment is pointing to. As expected, they all refer to the same page.

12.2.3.3. hme_blk

Solaris uses the hme_blk structures to keep track of active virtual-to-physical address mappings. Each hme_blk represents a contiguous area of mapped virtual memory for a particular address space, defined by a base page address and a span.

struct hme_blk_misc {
        ushort_t locked_cnt;            /* HAT_LOAD_LOCK ref cnt */
        uint_t  notused:10;
        uint_t  xhat_bit:1;             /* set for an xhat hme_blk */
        uint_t  shadow_bit:1;           /* set for a shadow hme_blk */
        uint_t  nucleus_bit:1;          /* set for a nucleus hme_blk */
        uint_t  ttesize:3;              /* contains ttesz of hmeblk */
};

struct hme_blk {
        uint64_t        hblk_nextpa;    /* physical address for hash list */
        hmeblk_tag      hblk_tag;       /* tag used to obtain an hmeblk match */
        struct hme_blk  *hblk_next;     /* on free list or on hash list */
                                        /* protected by hash lock */
        struct hme_blk  *hblk_shadow;   /* pts to shadow hblk */
                                        /* protected by hash lock */
        uint_t          hblk_span;      /* span of memory hmeblk maps */
        struct hme_blk_misc      hblk_misc;
        union {
                struct {
                        ushort_t hblk_hmecount; /* hment on mlists counter */
                        ushort_t hblk_validcnt; /* valid tte reference count */
                } hblk_counts;
                uint_t          hblk_shadow_mask;
        } hblk_un;
#ifdef  HBLK_TRACE
        kmutex_t        hblk_audit_lock;        /* lock to protect index */
        uint_t          hblk_audit_index;       /* index into audit_cache */
        struct  hblk_lockcnt_audit hblk_audit_cache[HBLK_AUDIT_CACHE_SIZE];
#endif  /* HBLK_AUDIT */
        struct sf_hment hblk_hme[1];    /* hment array */
};

#define hblk_lckcnt     hblk_misc.locked_cnt
#define hblk_xhat_bit   hblk_misc.xhat_bit
#define hblk_shw_bit    hblk_misc.shadow_bit
#define hblk_nuc_bit    hblk_misc.nucleus_bit
#define hblk_ttesz      hblk_misc.ttesize
#define hblk_hmecnt     hblk_un.hblk_counts.hblk_hmecount
#define hblk_vcnt       hblk_un.hblk_counts.hblk_validcnt
#define hblk_shw_mask   hblk_un.hblk_shadow_mask
                                                See sfmmu/vm/hat_sfmmu.h


An hme_blk can have two different sizes, depending on the number of sf_hment elements it implicitly contains. For 64-Kbyte, 512-Kbyte, and 4-Mbyte mappings, there is one sf_hment per hme_blk. For 8-Kbyte mappings, we allocate an hme_blk plus an additional seven sf_hment structures, giving a total of eight (NHMENTS) sf_hment structures that can be referenced through one hme_blk.
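For an 8-Kbyte hme_blk, the sf_hment for a given virtual address can be located by indexing hblk_hme[] with the low-order bits of the virtual page number. A sketch of that computation, assuming the 8-Kbyte base page size and the NHMENTS of 8 described above:

#include <stdint.h>
#include <stdio.h>

#define MMU_PAGESHIFT   13      /* 8-Kbyte base pages */
#define NHMENTS         8       /* sf_hments per 8-Kbyte hme_blk */

/*
 * Index of the sf_hment within hblk_hme[] for a VA mapped by an
 * 8-Kbyte hme_blk: the low three bits of the virtual page number.
 */
static int
hment_index(uint64_t vaddr)
{
        return ((vaddr >> MMU_PAGESHIFT) & (NHMENTS - 1));
}

int
main(void)
{
        uint64_t va;

        /* eight consecutive 8-Kbyte pages land in slots 0..7 */
        for (va = 0x10000; va < 0x10000 + 8 * 8192; va += 8192)
                printf("va 0x%llx -> hblk_hme[%d]\n",
                    (unsigned long long)va, hment_index(va));
        return (0);
}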

In the following example, the hme_blk at address 0x3000722a750 contains four sf_hment structures in its hblk_hme[].

> 3000722a750::print struct hme_blk hblk_un.hblk_counts.hblk_hmecount
hblk_un.hblk_counts.hblk_hmecount = 0x4


Using the ::array dcmd, we can then list the address of each sf_hment in hblk_hme[].

> ::offsetof struct hme_blk hblk_hme
offsetof (struct hme_blk, hblk_hme) = 0x38
> 3000722a750+0x38::array struct sf_hment 4
3000722a788
3000722a7a8
3000722a7c8
3000722a7e8


The hme_blk structure contains two TTE reference counters that determine whether it is all right to free the HME block. Both counters must be zero for the HME block to be freed, and both are protected by compare-and-swap (cas) operations. hblk_hmecnt is the number of sf_hment structures present on page mapping lists. hblk_vcnt reflects the number of sf_hment elements with valid TTEs in the hme_blk. The hme_blk also has per-TTE lock counts, likewise protected by cas. These are required because physio currently requires us to lock a page in memory, since the driver needs to get to the page frame number (PFN). If multiple threads use the same buffer for physio, they all lock that page, causing the lock count to grow larger than the number of bits available in the TTE lckcnt field.
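The freeability test itself reduces to checking that both counters are zero; a minimal sketch (in the kernel the counters are updated with cas rather than examined under a lock):

#include <stdint.h>
#include <stdio.h>

struct hblk_counts {
        uint16_t hblk_hmecount;         /* hments on page mapping lists */
        uint16_t hblk_validcnt;         /* sf_hments with valid TTEs */
};

/*
 * An hme_blk may be freed only when it maps nothing and no page
 * mapping list still references one of its sf_hments.
 */
static int
hblk_can_free(const struct hblk_counts *c)
{
        return (c->hblk_hmecount == 0 && c->hblk_validcnt == 0);
}

int
main(void)
{
        struct hblk_counts c = { 0, 0 };

        printf("can free: %d\n", hblk_can_free(&c));
        return (0);
}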

The hmeblk_tag structure that obtains a match on a hme_blk is shown below.

typedef union {
        struct {
                uint64_t        hblk_basepg: 51,        /* hme_blk base pg # */
                                hblk_rehash: 13;        /* rehash number */
                sfmmu_t         *sfmmup;
        } hblk_tag_un;
        uint64_t                htag_tag[2];
} hmeblk_tag;
                                                See sfmmu/vm/hat_sfmmu.h


  • hblk_basepg. Bits 63..13 of the virtual address.

  • hblk_rehash. Rehash number. This is actually only 3 bits, encoding the span (mapping size) of the hme_blk as shown in Table 12.2. When we search the hash table to find the translation for a VA, we usually do not know the page size in advance, so we start by looking for a 64-Kbyte mapping block (which may contain either a matching 8-Kbyte or 64-Kbyte TTE). If we do not find a match, we rehash with the next mapping size up. The cycle continues until we find a match or have exhausted all possible mapping sizes (see the sketch following Table 12.2). We require hblk_rehash because we don't want a false hit on a 512-Kbyte or larger page rehash whose base address corresponds to an 8-Kbyte or 64-Kbyte HME block.

Table 12.2. HME Block Rehash Values

Rehash Number    Mapping Size
1                64 Kbytes (1 x 64-Kbyte TTE or 8 x 8-Kbyte TTEs)
2                512 Kbytes
3                4 Mbytes
4                32 Mbytes
5                256 Mbytes
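The rehash cycle referenced in the hblk_rehash description can be sketched as a loop over the mapping sizes in Table 12.2. The following toy program models hash buckets as short chains tagged with <hat, base page, rehash>; the span shifts follow the table, but the bucket count, tag layout, and hash are illustrative, not the kernel's (the real search is done by the HME_HASH_SEARCH macros described in Section 12.2.3.6).

#include <stdint.h>
#include <stdio.h>

#define NBUCKETS 8

struct blk {
        uintptr_t hat;                  /* address space identity */
        uint64_t basepg;                /* VA bits 63..13 of block base */
        int rehash;                     /* 1..5, encodes the span */
        struct blk *next;
};

static struct blk *buckets[NBUCKETS];
static const int shifts[] = { 0, 16, 19, 22, 25, 28 }; /* 64K..256M */

static struct blk *
find_blk(uintptr_t hat, uint64_t va)
{
        int rehash;

        for (rehash = 1; rehash <= 5; rehash++) {
                int shift = shifts[rehash];
                uint64_t basepg = (va >> shift) << (shift - 13);
                struct blk *b = buckets[(hat ^ (va >> shift)) % NBUCKETS];

                for (; b != NULL; b = b->next)
                        if (b->hat == hat && b->basepg == basepg &&
                            b->rehash == rehash)
                                return (b);     /* tag match */
                /* no match: rehash with the next mapping size up */
        }
        return (NULL);
}

int
main(void)
{
        /* one 64-Kbyte block covering VA 0x10000..0x1ffff for hat 1 */
        struct blk b = { 1, (0x10000ULL >> 16) << 3, 1, NULL };

        buckets[(1 ^ (0x10028ULL >> 16)) % NBUCKETS] = &b;
        printf("found=%d\n", find_blk(1, 0x10028) != NULL);
        return (0);
}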


The macros provided to build fields of the hmeblk_tag are listed below.

#define HME_HASH_SHIFT(ttesz)                                           \
        ((ttesz == TTE8K)? HBLK_RANGE_SHIFT : TTE_PAGE_SHIFT(ttesz))

#define HME_HASH_ADDR(vaddr, hmeshift)                                  \
        ((caddr_t)(((uintptr_t)(vaddr) >> (hmeshift)) << (hmeshift)))

#define HME_HASH_BSPAGE(vaddr, hmeshift)                                \
        (((uintptr_t)(vaddr) >> (hmeshift)) << ((hmeshift) - MMU_PAGESHIFT))

#define HME_HASH_REHASH(ttesz)                                          \
        (((ttesz) < TTE512K)? 1 : (ttesz))
                                                See sfmmu/vm/hat_sfmmu.h


The advantage of the hme_blk structures is that much less memory is consumed when large address spaces that are only sparsely populated with translations are handled. A problem arises, however, when an entire user address space is unmapped. The obvious way to unmap the address space is to search the hash chains for each hme_blk associated with the process address space, iterating through the entire process virtual address range in steps of the range spanned by one hme_blk. With a large user address space, and given that with 8-Kbyte pages an hme_blk can map only 64 Kbytes, this approach would be time consuming and inefficient. To speed up this process, the shadow hme_blk was introduced.

12.2.3.4. Shadow HME Blocks

Each HME block allocated for a process has additional, larger, shadow hme_blk structures associated with it, up to a span of 4 Mbytes (256 Mbytes on UltraSPARC IV+). These are dummy hme_blk structures that have the shadow flag set. For example, on pre-UltraSPARC IV+ platforms, when a 64-Kbyte HME block is allocated, it is associated with a 512-Kbyte shadow HME block. That shadow HME block is in turn associated with its own 4-Mbyte shadow HME block. Each shadow HME block maintains a bit-mask, in hblk_shw_mask, of which of its constituent virtual address subranges are mapped by the next mapping size down. In the previous example, the 4-Mbyte shadow HME block would maintain a bit-mask of which of its 8 x 512-Kbyte VA subranges have 512-Kbyte HME blocks associated with them; each of those may be a mapping using a 512-Kbyte TTE, or itself a 512-Kbyte shadow block. The 512-Kbyte shadow HME blocks in turn record which of their 8 x 64-Kbyte VA subranges have HME blocks allocated (those may be hblk1 structures with 64-Kbyte TTEs or hblk8 structures with 8-Kbyte TTEs). Note that since each shadow hme_blk spans 8 times as much as the size below it, we only need bits <7:0> of hblk_shw_mask.

The strategy for unmapping an address range is to walk through it using 4-Mbyte/256-Mbyte aligned VA strides, searching the hash chains for mappings. If no HME block (either "regular" or shadow) is found for a particular VA, then we can be assured that there are no mappings for that 4-Mbyte/256-Mbyte range, even if the VA is not mapped with a 4-Mbyte or 256-Mbyte TTE. However, if we do find a shadow HME block, then we know that there might be a mapping, though we are not guaranteed to find one, because as we drill down on the HME blocks we may find that all the mappings in them have been invalidated without the blocks being freed.

In an address space sparsely populated with mappings, the usual case is that no mapping exists for a given range, so no shadow HME block is found and the address range can be skipped. If, however, a shadow hme_blk is found, then there is likely at least one smaller, real mapping, so we step through the address range mapped by the shadow hme_blk looking for the real mappings.

In summary, shadow HME blocks allow us to probe through a VA segment at a stride of 4 or 256 Mbytes and drill down to smaller page sizes only for those subranges that likely do contain mappings. But even this algorithm proves too slow when probing very large address spaces. In these situations, more precisely when the number of 4-Mbyte probes required is greater than UHMEHASH_SZ, the HAT layer resorts to a brute-force search of the HME hash chains: it loops through the entire uhme_hash table searching the hash chains for matching HME blocks.

12.2.3.5. HME Block Allocation

The sfmmu_hblk_alloc() routine allocates kernel and user HME blocks. It also allocates any required shadow HME blocks for a user address space by calling sfmmu_shadow_hcreate(). Under normal circumstances, sfmmu_hblk_alloc() dynamically allocates hblk8s and hblk1s from the sfmmu8_cache and sfmmu1_cache kmem caches, respectively. For kernel allocations, kmem_cache_alloc() is called with KM_NOSLEEP, while for user allocations it is called with KM_SLEEP. The sfmmu8_cache kmem cache allocates its memory from the hat_memload_arena vmem arena, while sfmmu1_cache draws on the kmem_default_arena vmem arena. During boot, however, hme_blk structures are used out of a static pool of preallocated blocks until segkmem is ready to allocate memory. The kernel allocates this static pool of nucleus hme_blk structures early in the boot process by calling sfmmu_init_nucleus_hblks().

The HAT layer also maintains a reserve pool of free hblk8s pointed to by freehblkp. When sfmmu_hblk_alloc() successfully allocates a hblk8 for a user mapping from the sfmmu8_cache kmem cache it checks to see if the reserve pool is full. If it is not, sfmmu_hblk_alloc() adds the hblk8 to it and a new hblk8 is allocated. The free pool is rechecked and if it is still not full the cycle repeats.

If HME block allocations from the kmem caches fail because of resource constraints, an hme_blk is "stolen" by sfmmu_hblk_steal(), which searches for an unused or unlocked hme_blk in the user hash table. If it finds a used HME block, that block is stolen from the address space using it. In the worst case, if a block cannot be found in the user hash table, the kernel hash table is searched for a free HME block. If, in the most extreme case, a suitable block still cannot be found, sfmmu_hblk_steal() retries the search, looping indefinitely until it finds one. However, we should never reach this case, since enough hme_blks were allocated at startup (nucleus hme_blks) and more are added dynamically.

Just before initializing and returning an allocated HME block, sfmmu_hblk_alloc() goes through a verification step that checks whether a suitable HME block already exists in the HME hash table. If it finds one, it frees the newly allocated HME block, and if the HME block found is not hblk_reserve (see below), that block is initialized and returned for use. If the current thread is mapping into user space, the allocated block is freed by first trying to put it into the free pool; if the free pool is full, it is freed back to segkmem. On the other hand, if the current thread is mapping into kernel space, the hblk8 is added to the free pool even if the pool is full, to avoid freeing it to segkmem. This prevents stack overflow due to possible recursion, since kmem_cache_free() might require the creation of a slab, which in turn needs an hme_blk to map that slab. We don't need to worry about freeing hblk1s to segkmem since they don't map any kmem slabs.

When we attempt to allocate an hblk8 from the sfmmu8_cache, it is possible that the kmem cache itself needs to map in memory, so the HAT layer must take steps to prevent infinite recursion. If an hme_blk is being requested for an sfmmu8_cache slab, sfmmu_hblk_alloc() tries to allocate it from the free pool. If the free pool is empty, a specially reserved, preallocated hme_blk, the hblk_reserve, is returned, with the current thread set as its owner and the hblk_reserve_lock held to prevent another thread from attempting to use the reserved HME block. With this scheme, there is a possibility that a recursive condition could arise in which a thread owning hblk_reserve tries to allocate another hblk8. In anticipation of this kind of scenario, the HAT layer specifically sets aside HBLK_RESERVE_MIN HME blocks in the reserve pool to be used exclusively by an owner of hblk_reserve. If these reserves are exhausted, the system panics.

When the thread holding hblk_reserve successfully allocates an hblk8 from the sfmmu8_cache on a successive call to sfmmu_hblk_alloc() it atomically swaps the new hme_blk with hblk_reserve and tries to allocate another new HME block to satisfy the pending request.

During the verification step, if sfmmu_hblk_alloc() finds an HME block in the HME hash table that is hblk_reserve and the current thread is not the owner, sfmmu_hblk_alloc() blocks, waiting for the hblk_reserve_lock to be released, before retrying the entire allocation process. But if the thread is the owner, hblk_reserve is released, since it is no longer needed, and the new HME block is used.

12.2.3.6. hme_blk Hash Tables

The sun4u kernel maintains two hash tables of hme_blk structures: one for the kernel address space and one for all user address spaces. The kernel table is pointed to by the kernel variable khme_hash, and the user table is pointed to by uhme_hash. These tables are arrays of hmehash_bucket structures. The number of buckets in the user hash is given by the variable uhmehash_num; the number of buckets in the kernel hash, by khmehash_num.

struct hmehash_bucket {
        kmutex_t        hmehash_mutex;
        uint64_t        hmeh_nextpa;    /* physical address for hash list */
        struct hme_blk *hmeblkp;
        uint_t          hmeh_listlock;
};
                                                See sfmmu/vm/hat_sfmmu.h


There are two locks in the hmehash_bucket. The hmehash_mutex is a regular mutex that ensures that operations on a hash link are done by only one thread at a time. Any operation that comes into the HAT with a <virtual address, as> will grab the hmehash_mutex. Normally, we would expect the TSB miss handlers to grab the hash lock to make sure the hash list is consistent while we traverse it. Unfortunately, this can lead to deadlocks or recursive mutex enters, since someone holding the lock could take a TLB/TSB miss. To solve this problem, we added the hmehash_listlock. This lock is grabbed only by the TSB miss handlers and sfmmu_vatopfn(), and while adding or removing an hme_blk from the hash list. The code is written to guarantee that we won't take a TLB miss while holding this lock.

The number of buckets in the user hash table, uhmehash_num, is a power of 2, computed as a function of physical memory multiplied by a predefined overmapping factor (HMEHASH_FACTOR) such that the average hash chain length is HMENT_HASHAVELEN. To place an upper limit on how much kernel memory is required for the user hash table, uhmehash_num is capped at MAX_UHME_BUCKETS. Unlike the user hash table, the kernel hash table has its bucket count, khmehash_num, set as a power of 2 based on a function of physical memory such that it maintains an average chain length of 1. The kernel table is capped at MAX_KHME_BUCKETS, and a minimum size, MIN_KHME_BUCKETS, is also defined for it. Table 12.3 shows the values of the HME hash table constants; a sketch of the user hash sizing policy follows the table.

Table 12.3. HME Hash Table Constants

Name                    Value
HMENT_HASHAVELEN        4
HMEHASH_FACTOR          16
MAX_UHME_BUCKETS        2M
MAX_KHME_BUCKETS        2M
MIN_KHME_BUCKETS        2K
MAX_NUCUHME_BUCKETS     16K
MAX_NUCKHME_BUCKETS     8K
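The user hash sizing policy described above can be sketched as follows. This is a paraphrase of the stated policy using the constants from Table 12.3, not the literal startup code; physical memory is expressed in 8-Kbyte pages.

#include <stdint.h>
#include <stdio.h>

#define HMENT_HASHAVELEN        4
#define HMEHASH_FACTOR          16
#define MAX_UHME_BUCKETS        (2 * 1024 * 1024)

static uint64_t
p2roundup(uint64_t n)                   /* next power of two >= n */
{
        uint64_t p = 1;

        while (p < n)
                p <<= 1;
        return (p);
}

/*
 * Sketch of the user hash sizing: enough buckets that, with each
 * physical page overmapped HMEHASH_FACTOR times, the average chain
 * length is HMENT_HASHAVELEN; capped at MAX_UHME_BUCKETS.
 */
static uint64_t
uhmehash_buckets(uint64_t physmem_pages)
{
        uint64_t n = p2roundup(physmem_pages * HMEHASH_FACTOR /
            HMENT_HASHAVELEN);

        return (n > MAX_UHME_BUCKETS ? MAX_UHME_BUCKETS : n);
}

int
main(void)
{
        /* for example, 4 Gbytes of 8-Kbyte pages = 524,288 pages */
        printf("%llu buckets\n",
            (unsigned long long)uhmehash_buckets(524288));
        return (0);
}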


The hash tables are allocated during system startup by the function startup_memlist(), which calls ndata_alloc_hat() to allocate the hash tables out of the nucleus data area. Depending on the amount of physical memory available on a 64-bit platform, the size of either the kernel hash table or the user hash table could exceed the maximum permitted to be allocated from the kernel nucleus, controlled by the variables max_nuckhme_buckets and max_nucuhme_buckets, respectively. In that case, ndata_alloc_hat() does not create the tables; instead, startup_memlist() calls alloc_hme_buckets() to allocate the hash tables from the kernel's 64-bit heap (kmem64).

Indexing into the hme hash table is by means of the HME_HASH_FUNCTION macro shown below. (HMEHASH_FUNC_ASM is an assembly version of HME_HASH_FUNCTION.) The hashing function is based on the address of the HAT structure (hatid), virtual address, and size of the mapping.

#define HME_HASH_FUNCTION(hatid, vaddr, shift)                            \
        ((hatid != KHATID)?                                               \
        (&uhme_hash[ (((uintptr_t)(hatid) ^ ((uintptr_t)vaddr >> (shift))) \
                & UHMEHASH_SZ) ]):                                        \
        (&khme_hash[ (((uintptr_t)(hatid) ^ ((uintptr_t)vaddr >> (shift))) \
                & KHMEHASH_SZ) ]))
                                                See sfmmu/vm/hat_sfmmu.h


The algorithm to find an hme_blk is as follows.

  1. Create a tag for the hme_blk structure being searched for.

  2. Find the hmehash_bucket structure in the hme hash table by using the HME_HASH_FUNCTION macro.

  3. Linearly search the hme hash chain associated with the hmehash_bucket for an element with a matching hmeblk_tag.

Three macros help perform the linear search:

  • HME_HASH_SEARCH, which removes empty hme_blk structures from the linked list as it traverses the hme hash chain

  • HME_HASH_SEARCH_PREV, which is identical to HME_HASH_SEARCH but additionally returns pointers to the previous hme_blk to the one found

  • HME_HASH_FAST_SEARCH, which simply searches the list

Searching for an hme_blk in mdb is implemented by the ::sfmmu_vtop -v dcmd. See the example on page 592.

12.2.4. The Translation Storage Buffer (TSB)

Since searching the HME hash chains for a translation on every TLB miss would be very expensive, Solaris caches TTEs in a software-controlled cache, the TSB. In Solaris 10, a process can have up to two TSBs, which are allocated, grown, and shrunk on demand. Each TSB in the system is represented by its own tsb_info structure, and the HAT maintains a list of tsb_info structures for the TSBs used by a process.

Let's look at an actual TSB. We can get the list of tsb_info structures from the hat structure by examining the sfmmu_tsb field.

> 0x300001d5d18::print struct hat sfmmu_tsb
sfmmu_tsb = 0x3000435ff88
> 0x3000435ff88::print -t struct tsb_info
{
    caddr_t tsb_va = 0x50000000000
    uint64_t tsb_pa = 0x3fe000000
    struct tsb_info *tsb_next = 0
    uint16_t tsb_szc = 0
    uint16_t tsb_flags = 0
    uint_t tsb_ttesz_mask = 0x7
    tte_t tsb_tte = {
...
    }
    sfmmu_t *tsb_sfmmu = 0x300001d5d18
    kmem_cache_t *tsb_cache = 0x3000083a008
    vmem_t *tsb_vmp = 0
}


  • tsb_va. Base virtual address of TSB

  • tsb_pa. Base physical address of TSB

  • tsb_next. Pointer to next TSB, if any, used by this process

  • tsb_szc. TSB size code; possible values range from 0 (8 Kbytes) to tsb_max_growsize

  • tsb_flags. Flags giving the disposition of this TSB; defined as TSB_* in hat_sfmmu.h

  • tsb_ttesz_mask. Bit mask of page sizes cached in TSB

  • tsb_tte. TTE of TSB itself that is locked in the dTLB

  • tsb_sfmmu. Pointer to process hat structure

  • tsb_cache. Pointer to the kmem cache from which TSB memory is allocated

  • tsb_vmp. Pointer to the vmem arena from which TSB memory is allocated

The ::tsbinfo dcmd prints information on a TSB and its contents. The following example lists every entry in the TSB associated with the tsb_info structure at address 0x3000435ff88.

> 0x3000435ff88::tsbinfo -l -a
TSBINFO          TSB              SIZE     FLAGS                TTE SIZES
000003000435ff88 0000050000000000 8K       -                    8K,64K,512K
TSB @ 50000000000 (512 entries)
                 TAG               TTE
ADDR             G I L VA 63:22    V S N I H LC PA 42:13 R W N X L P V E P W G
0000050000000000 1 1 1 3ffffffffff 0 0 0 0 0 0  00000000 0 0 0 0 0 0 0 0 0 0 0
0000050000000010 1 1 1 3ffffffffff 0 0 0 0 0 0  00000000 0 0 0 0 0 0 0 0 0 0 0
0000050000000020 1 1 1 3ffffffffff 0 0 0 0 0 0  00000000 0 0 0 0 0 0 0 0 0 0 0
0000050000000030 1 1 1 3ffffffffff 0 0 0 0 0 0  00000000 0 0 0 0 0 0 0 0 0 0 0
0000050000000040 0 0 0 000000003fc 1 0 0 0 4 0  001e5c44 1 1 0 1 0 1 1 0 0 1 0
0000050000000050 1 1 1 3ffffffffff 0 0 0 0 0 0  00000000 0 0 0 0 0 0 0 0 0 0 0
0000050000000060 1 1 1 3ffffffffff 0 0 0 0 0 0  00000000 0 0 0 0 0 0 0 0 0 0 0
0000050000000070 1 1 1 3ffffffffff 0 0 0 0 0 0  00000000 0 0 0 0 0 0 0 0 0 0 0
0000050000000080 0 0 0 000000003fc 1 0 0 0 0 0  001fb5cc 1 0 0 1 0 1 1 0 0 0 0
0000050000000090 0 0 0 00000000000 1 0 0 0 1 0  001fec03 1 0 0 1 0 1 1 0 0 0 0
00000500000000a0 0 0 0 00000000000 1 0 0 0 2 0  001fe00a 1 0 0 1 0 1 1 0 0 0 0
00000500000000b0 0 0 0 00000000000 1 0 0 0 3 0  001fe00b 1 0 0 1 0 1 1 0 0 0 0
00000500000000c0 0 0 0 00000000000 1 0 0 0 4 0  001fe00c 1 0 0 1 0 1 1 0 0 0 0
...


When a process is created, it starts out with an 8-Kbyte TSB, and a second TSB can be added later. The size of the TSB here relates to the total number of entries that can be cached in the TSB and not to the page size being mapped.
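Because the TSB is a direct-mapped cache, the entry consulted for a given virtual address is determined by low-order virtual page number bits. The sketch below assumes 8-Kbyte pages and 16-byte TSB entries (an 8-byte tag plus an 8-byte TTE, which is why the 8-Kbyte TSB in the example above holds 512 entries):

#include <stdint.h>
#include <stdio.h>

#define MMU_PAGESHIFT   13              /* 8-Kbyte pages */
#define TSB_ENTRY_SZ    16              /* 8-byte tag + 8-byte TTE */

/* Index into a direct-mapped TSB of nentries slots (a power of 2). */
static uint64_t
tsb_index(uint64_t va, uint64_t nentries)
{
        return ((va >> MMU_PAGESHIFT) & (nentries - 1));
}

int
main(void)
{
        uint64_t nentries = 8192 / TSB_ENTRY_SZ;        /* 512 in 8K TSB */

        /* addresses 512 pages apart collide in the same slot */
        printf("slot %llu\n",
            (unsigned long long)tsb_index(0x10028, nentries));
        printf("slot %llu\n", (unsigned long long)
            tsb_index(0x10028 + (512ULL << MMU_PAGESHIFT), nentries));
        return (0);
}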

When an address space is first created, hat_alloc() calls tsb_alloc() to create and initialize the tsb_info structure. At this point, memory for the TSB itself is not allocated. When the first MMU miss occurs, the miss handler enters sfmmu_tsbmiss_exception(), which then places another call to tsb_alloc() to actually allocate the TSB. TSBs by definition are always physically contiguous and size aligned in order to allow the following:

  • Use of hardware-generated TSB pointers to access the TSB

  • Physical addressing of the TSB on platforms that support it

  • Hardware TSB walks on platforms that support it

After performing a mapping operation, the HAT looks at the number of TTEs for each page size. Based on the page sizes that are cached in each TSB, the number of mappings is compared with the number of entries in the TSB. If the number of TTEs exceeds the capacity of the TSB (a multiple of tsb_rss_factor that depends on the TSB size), the TSB is grown synchronously. The default TSB RSS factor is 0.75 times the number of entries in an 8-Kbyte TSB, so the TSB is actually grown before its entire capacity is reached, since some conflicts (mappings of multiple addresses to the same TSB entry) are anticipated. If the TSB needs to be grown but the system is low on memory (that is, freemem is below desfree) or TSB memory usage has reached the limit set by tsb_alloc_hiwater, the resize request is denied. Should the program attempt to map more memory later, the grow procedure is reattempted.
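A sketch of the grow test under the defaults just described, assuming an RSS factor of 0.75 x 512 = 384 entries. The text says only that the threshold is a multiple of tsb_rss_factor depending on the TSB size, so the left shift by the size code below is an assumption for illustration:

#include <stdint.h>
#include <stdio.h>

#define TSB_ENTRIES_8K  512     /* entries in the smallest (8K) TSB */

/*
 * Sketch: grow when the TTE count for the page sizes cached in this
 * TSB exceeds tsb_rss_factor scaled by the TSB size code (szc). The
 * left shift is an assumed scaling for illustration.
 */
static int
tsb_should_grow(uint64_t ttecnt, int szc, uint64_t tsb_rss_factor)
{
        return (ttecnt > (tsb_rss_factor << szc));
}

int
main(void)
{
        uint64_t rss = TSB_ENTRIES_8K * 3 / 4;          /* 0.75 x 512 */

        printf("%d\n", tsb_should_grow(300, 0, rss));   /* 0: fits */
        printf("%d\n", tsb_should_grow(400, 0, rss));   /* 1: grow 8K TSB */
        printf("%d\n", tsb_should_grow(400, 1, rss));   /* 0: 16K TSB ok */
        return (0);
}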

Figure 12.8. TSB Data Structures


When the system is under memory pressure or TSB memory usage exceeds tsb_alloc_hiwater, TSB memory may be reclaimed as pages are unmapped. If a process unmaps part of its address space and twice the resulting resident set size falls below the tsb_rss_factor threshold, the TSB is reduced in size. The resident set size is doubled to prevent thrashing (for example, growing the TSB very soon after shrinking it) and to avoid the overhead of throwing away all the mappings in the TSB (unless the whole system can potentially benefit from the cleanup).

If the number of 4-Mbyte mappings residing in a process reaches tsb_sectsb_threshold, a second TSB is allocated to the process to cache 4-Mbyte mappings. Setting tsb_sectsb_threshold very high essentially disables the TSB for 4-Mbyte mappings and causes all 4-Mbyte mappings to be retrieved from the hash.

The maximum user TSB size is limited by tsb_max_growsize to the maximum supported by hardware (currently 1 Mbyte). The system's choice can be overridden by setting a different value for this variable in /etc/system if deemed necessary; however, the overriding value must not exceed tsb_slab_size. For kernel TSBs we may go beyond the hardware-supported sizes and implement larger TSBs in software.

To prevent TSBs from using up too much physical memory, tsb_alloc_hiwater imposes a resource limit that defaults to 1/32 of physical memory. Once the high-water mark is reached, or if freemem falls below desfree, the TSB memory allocation algorithms start throttling. The value of tsb_alloc_hiwater may be updated following DR events, in which case physmem / tsb_alloc_hiwater_factor is used to compute the new value. Note that swapfs_minfree and segspt_minfree must be kept considerably larger than tsb_alloc_hiwater to prevent system hangs under system stress, so exercise care when tuning this limit. You can safely decrease it, however.

12.2.4.1. TSB Memory Allocation

Where and how TSBs are allocated is based on how the size of the TSB compares to the base page size and what memory conditions dictate according to the following algorithm in sfmmu_init_tsbinfo().

If allocating a "large" TSB (> 8 Kbytes)
        Allocate from the kmem_tsb_default_arena vmem arena with VM_NOSLEEP
else if low on memory or TSB_FORCEALLOC flag is set
        Allocate from kernel heap via sfmmu_tsb8k_cache with KM_SLEEP (never fails)
else
        Allocate from sfmmu_tsb_cache with KM_NOSLEEP
endif


Note that we always do nonblocking allocations from the TSB arena since we don't want memory fragmentation to cause processes to block indefinitely waiting for memory while the kernel algorithms coalesce large pages.

The sfmmu_tsb_cache, which is used by default, draws its memory from the kmem_tsb_default_arena vmem arena. The sfmmu_tsb8k_cache draws its memory from the kernel heap in 8-Kbyte chunks rather than from the large TSB slabs, and it is created without magazines (see Section 11.2.3.6) so that the memory is returned to the system as quickly as possible when the process terminates or calls exec().

Since TSBs larger than 8 Kbytes in size are allocated a lot less frequently than their smaller counterparts, large TSBs are all allocated directly (with a best-fit algorithm) from the kmem_tsb_default_arena.

The kmem_tsb_default_arena vmem arena allocates large physical memory slabs and maps them to the virtual memory space it has allocated from the kmem_tsb_arena, which is its vmem source. The source for the kmem_tsb_arena is the heap_arena, which provides the virtual addresses for the TSBs. The intermediate layer of the kmem_tsb_arena at first glance seems superfluous, but it enforces slab-sized alignment on the allocated virtual memory, which vmem cannot do by default.
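In outline, the arena stacking amounts to two vmem_create() calls, sketched below for kernel context (not a standalone program). The argument values are illustrative; the real startup code passes platform-specific sizes, and the backing routines are the sfmmu_tsb_segkmem_alloc()/sfmmu_tsb_segkmem_free() functions described in Section 12.2.4.2.

/*
 * Hedged sketch of the TSB arena stacking; argument values are
 * illustrative. heap_arena supplies virtual addresses;
 * kmem_tsb_arena imports them with tsb_slab_size as its quantum,
 * which is what enforces slab-sized alignment;
 * kmem_tsb_default_arena backs that VA with large physical pages.
 */
kmem_tsb_arena = vmem_create("kmem_tsb", NULL, 0,
    tsb_slab_size,                      /* quantum = slab size */
    vmem_alloc, vmem_free, heap_arena,  /* VA imported from the heap */
    0, VM_SLEEP);

kmem_tsb_default_arena = vmem_create("kmem_tsb_default", NULL, 0,
    PAGESIZE,
    sfmmu_tsb_segkmem_alloc,            /* backs VA with large pages */
    sfmmu_tsb_segkmem_free,
    kmem_tsb_arena, 0, VM_SLEEP);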

Regardless of whether the TSB is allocated from the kmem_tsb_default_arena or one of the kmem caches, the remainder of the allocation process is the same. The virtual and physical addresses of the TSB are added to a new tsbinfo structure, and relocation callbacks are registered with the HAT layer, since the TSB has special relocation requirements.

12.2.4.2. Large Kernel Page Support

The quantum size for the kmem_tsb_default_arena is chosen to be a large page size in order to minimize external memory fragmentation and to reduce the number of TLB misses encountered by the kernel while accessing large user TSBs. On sun4u systems, this quantum will usually be 4 Mbytes, stored in the variable tsb_slab_size. This value must be a supported MMU mapping size, but otherwise has no restrictions. For memory conservation, small memory machines (for example, 1 Gbyte of memory or less) set the slab size at 512 Kbytes during startup since the largest TSB required to map that much of a resident set size is 512 Kbytes or smaller.

Currently, there is no generic framework for the dynamic allocation of large kernel pages, so physical memory for the kmem_tsb_default_arena must be acquired through special large-page-allocation routines, summarized in Table 12.4. This limited support provides mappings backed by large pages. All allocations must be page-sized and page-aligned for the underlying platform, so this interface is unstable and not suitable for general-purpose kernel memory allocations.

Table 12.4. Large Kernel Page Allocation Routines

Function                        Description
sfmmu_tsb_page_create()         Counterpart of segkmem_page_create(). Acquires a large
                                physical page of memory from the page free lists, does
                                some setup, and calls page_create_va_large().
sfmmu_tsb_xalloc()              Reserves physical memory being allocated into the
                                kmem_tsb_arena. Calls sfmmu_tsb_page_create(), takes
                                care of any locking necessary, and establishes a kernel
                                virtual mapping for the new TSB slab page before it is
                                placed into kmem_tsb_default_arena.
sfmmu_tsb_segkmem_alloc()       Wrapper around sfmmu_tsb_xalloc() that specifies
                                additional parameters required for TSB allocations.
sfmmu_tsb_segkmem_free()        Unmaps a TSB slab page from the kernel virtual address
                                space, frees the physical page of memory, and returns
                                the freed virtual memory to kmem_tsb_arena.
page_create_va_large()          Large-page counterpart of page_create_va().


12.2.4.3. TSB Page Relocation

Since TSB memory can be allocated from outside the kernel cage, we must be able to relocate TSBs during dynamic reconfiguration events or cage expansion; these events are asynchronous with respect to process execution. Upon a request to relocate a page that contains a TSB, page_relocate() invokes the kernel memory relocation framework. This framework is necessary because we need to prevent processes from accessing the TSB through a cached physical address while it is being relocated. It is all right to access the TSB through a virtual address, since such an access simply faults on that virtual address once the mapping has been suspended.

For proper notification of relocations, at allocation time a callback is registered with the HAT layer before the tsb_pa or tsb_tte fields of the tsbinfo structure are updated. When a relocation is initiated, hat_page_relocate() invokes the pre-relocation callback with the tsbinfo pointer prior to the memory being locked so that accesses to the TSB can be quiesced. It then relocates the page and calls the post-relocation callback to complete the move of the TSB page. Moving the TSB around in physical memory requires updating the locked TTE used to access the TSB from the trap handlers while preventing accesses to it.

The pre-relocation and post-relocation callbacks for TSB pages are sfmmu_tsb_pre_relocator() and sfmmu_tsb_post_relocator(). The sfmmu_tsb_pre_relocator() routine acquires the hat_lock and sets the TSB_RELOC_FLAG flag in the tsbinfo structure, signifying that the TSB is being relocated. This relocation state is required because another thread (such as one destroying an ISM segment) may need to unmap a TTE from the TSB while data is being copied from the original location to the new location; without the flag, TTEs might be unmapped from the old location after they have been copied to the new location, resulting in data corruption. The sfmmu_tsb_post_relocator() routine acquires the hat_lock, updates tsb_pa and tsb_tte, checks whether a flush is required, and clears the TSB_RELOC_FLAG before releasing the hat_lock.

12.2.4.4. TSB Replacement

Whether a process is first starting to run, swapping in from disk, growing its TSB, or shrinking it, the Solaris 10 HAT layer treats all four cases as equivalent. The specifics of the TSB replacement algorithm are covered in this section.

When a TSB is replaced, the tsbinfo structure and the TSB must be completely replaced. The reason is that TSB relocation or growth should not need to pause all CPUs to prevent a race with resume(), which may be traversing the process's TSB list since it cannot acquire locks.

The general algorithm to replace a TSB is as follows:

  1. Allocate a new TSB info/TSB pair.

  2. Prevent further updates to the current TSB.

  3. Temporarily set the process's context to invalid context.

  4. Remap entries from the old TSB to the new one, if necessary.

  5. Atomically update the pointers to the new tsbinfo structure.

  6. Cross-call all CPUs running the process to reload the TSB base register and locked TTE.

  7. Restore the old context.

  8. Resume updates to the TSB.

  9. Discard the old TSB and tsbinfo structure.

The first step in replacing a TSB is to allocate the new tsbinfo/TSB pair of the appropriate size. This allocation needs to be done while no locks are held, in order to avoid deadlock scenarios as discussed in Section 12.2.6.

Since we will be updating the HAT's tsbinfo list, we need to grab the appropriate hat_lock. This prevents any other threads from walking the list while it is being updated, and it also prevents any other threads from inserting or removing mappings into this process's TSB from kernel context until we release the lock. To prevent a thread executing in resume() during this window and actually accessing its TSB, we temporarily set the context to INVALID_CONTEXT in the hat structure. This generates a TSB exception should the process try to access the TSB before we are finished replacing it.

Depending on the type of replacement that is occurring, we can remap the entries in the old TSB into the new TSB at this point. If the TSB is growing from a small TSB to a larger TSB and the value of the kernel tuneable tsb_remap_ttes is non-zero, we remap the old entries into the new TSB since we expect those entries to be reused. The default value of this tuneable is zero; most workloads grow the TSB only during their warm-up phase and hence would realize little benefit from remapping. If a TSB is shrinking, there will be no copying of TSB entries since a simple one-to-one mapping cannot be done.

Next, we modify the process's sfmmu_tsb linked list to contain the new tsbinfo. From this point, any new threads that start to run pick up the new tsbinfo and thus program their TSB base register(s) and locked TTEs with the new TSB pointer rather than the old one. So we issue a barrier instruction to guarantee this and then store the value of the hat structure's sfmmu_cpusran field.

To handle those threads that may be on-processor and running with the old TSB, we execute an xt_some() call that causes those CPUs to update their TSB base registers if they are currently using the old TSB. This same logic is executed on the CPU that is currently relocating the TSB, to make sure that it is using the new TSB as needed. Following the cross-call and restoration of the context, there can be no further references to the old TSB, so we can drop the hat_lock and free the old tsbinfo/TSB pair.

12.2.4.5. TLB Miss Handling

The CPU generates a trap when the MMU is unable to find a translation for a virtual memory operation. For a data load or store, the CPU generates a fast_data_access_MMU_miss, and for an instruction fetch, a fast_instruction_access_MMU_miss, which are handled by the DTLB_MISS() and ITLB_MISS() handlers, respectively.

DTLB_MISS() loads the MMU TLB Tag Access register and TSB 8-Kbyte Pointer register into temporary registers. The ID of the faulting context is extracted from the Tag Access register and checked to see whether it is less than or equal to INVALID_CONTEXT. If it is, the MMU miss occurred within the kernel or invalid context, and the handler branches to the kernel's TLB miss handler, sfmmu_kdtlb_miss(). The kernel's miss handler treats faults within the invalid context as a special case, since processes may not actually have a TSB when running in invalid context.

For a user process, DTLB_MISS() checks whether the most significant bit of the TSB 8-Kbyte Pointer register is set; if it is (signifying that there is more than one TSB), the handler branches to sfmmu_udtlb_slowpath(). Otherwise, the TSB entry, which contains the TSB tag and corresponding TTE, is atomically loaded from the TSB, and the TSB tag is compared with bits 63..22 of the virtual address held in the Tag Access register. If they are the same, a TSB hit has occurred: the TTE is loaded into the dTLB and a retry instruction is issued. In the event of a TSB miss, the handler branches to the sfmmu_tsb_miss() routine.
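Restated as a runnable toy, the dispatch logic of DTLB_MISS() looks like the following. The real handler is hand-coded trap assembly reading the Tag Access and TSB 8-Kbyte Pointer registers, and the INVALID_CONTEXT value here is illustrative.

#include <stdint.h>
#include <stdio.h>

#define INVALID_CONTEXT 1               /* illustrative value */

/*
 * Toy restatement of the DTLB_MISS() dispatch: which path a miss
 * takes, based on the context ID and the most significant bit of
 * the TSB 8-Kbyte Pointer register.
 */
static const char *
dtlb_miss_dispatch(uint16_t ctx, uint64_t tsb8k_ptr)
{
        if (ctx <= INVALID_CONTEXT)
                return ("sfmmu_kdtlb_miss");            /* kernel/invalid */
        if (tsb8k_ptr >> 63)
                return ("sfmmu_udtlb_slowpath");        /* two TSBs */
        return ("probe TSB; on miss -> sfmmu_tsb_miss");
}

int
main(void)
{
        printf("%s\n", dtlb_miss_dispatch(0, 0));
        printf("%s\n", dtlb_miss_dispatch(0xbf7, 1ULL << 63));
        printf("%s\n", dtlb_miss_dispatch(0xbf7, 0x50000000040ULL));
        return (0);
}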

ITLB_MISS() works in the same way but with a few exceptions:

  • Kernel and invalid context misses are handled by sfmmu_kitlb_miss().

  • If a TTE is found, execute permission on the page is checked. If the execute bit is not set in the TTE, exec_fault() is called and the program is ultimately terminated.

  • The TTE is programmed into the iTLB.

  • A TSB miss results in a branch to sfmmu_uitlb_slowpath().

Kernel TLB Miss Handling. The kernel dTLB miss handler sfmmu_kdtlb_miss() starts by probing the first TSB, using the 8-Kbyte virtual page as the index, looking for an 8-Kbyte, 64-Kbyte, or 512-Kbyte mapping. The second TSB is probed only under one of two conditions:

  1. The 64-bit kernel physical mapping segment (segkpm) is mapped with large pages.

  2. The missing virtual address is below 0x80000000.00000000. This optimization is possible since in this case segkpm is using small pages and we know no large kernel mappings will be located above kpm_vbase, which is at least 0x80000000.00000000.

If we miss in the TSBs while searching for a segkpm address, we branch to sfmmu_kpm_dtsb_miss_small() or sfmmu_kpm_dtsb_miss(), depending on whether segkpm is mapped with small or large pages. Otherwise, the TSB miss is handled by sfmmu_tsb_miss().

Since the non-nucleus (TLB unlocked) instruction pages are mapped with 8-Kbyte pages, the kernel iTLB handler sfmmu_kitlb_miss() probes only the first TSB. If there is a TSB hit, execute permissions are checked and the TTE is programmed into the iTLB. Otherwise, the handler branches to the sfmmu_tsb_miss() routine.

Multiple TSB Probes. In Solaris 10, each process can have up to two TSBs. On sun4u architectures, the first TSB caches 8-Kbyte page-size entries, with 64-Kbyte and 512-Kbyte entries replicated in it so they can be found via the TSB 8-Kbyte Pointer. The second TSB holds 4-Mbyte entries; on UltraSPARC IV+, 32-Mbyte and 256-Mbyte entries are replicated there and found via the 4-Mbyte pointer. Since the hardware-generated pointer is used for the first probe, the first TSB is limited to 1 Mbyte in size on sun4u systems for now, though in the future another case could be added to sfmmu_udtlb_slowpath() to support larger TSB sizes purely in software.

In the fast path of the TLB trap vectors, the most significant bit of the TSB 8-Kbyte Pointer register indicates whether a second TSB exists. In the usual case, where it does not, the 8-Kbyte Pointer register contents can be used without modification to probe the only TSB. If the second TSB does exist, the miss handler branches to sfmmu_udtlb_slowpath() or sfmmu_uitlb_slowpath(), and the GET_1ST_TSBE_PTR() and GET_2ND_TSBE_PTR() macros generate pointers into the first and second TSBs. The slow-path handlers probe the first TSB for a TTE and, if no match is found, probe the second. If a matching TTE is still not found, a TSB miss results and a branch is made to sfmmu_tsb_miss().

For 64-bit processes, Solaris attempts to map all ISM segments above the 8-Gbyte boundary in the virtual address space. If the faulting address lies beyond 8 Gbytes but does not have the upper bit set (which would indicate mapped libraries or stack), the dTLB miss is predicted to be on an ISM page. Because ISM is optimized to use 4-Mbyte pages, in this case sfmmu_udtlb_slowpath() starts by probing the second TSB, looking for a 4-Mbyte mapping. Only if that probe fails is the first TSB searched before a branch to sfmmu_tsb_miss().
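
The resulting probe order can be sketched as follows; the helpers and constants are invented names for the tests described above (the 8-Gbyte base and the upper address bit).

/* Illustrative sketch of sfmmu_udtlb_slowpath() probe ordering. */
int predicted_ism =
    (va >= (8ULL << 30)) &&                /* above 8 Gbytes      */
    ((va >> 63) == 0);                     /* upper bit not set   */

if (predicted_ism) {
        if (probe_2nd_tsb(va)) return;     /* 4-Mbyte ISM pages   */
        if (probe_1st_tsb(va)) return;
} else {
        if (probe_1st_tsb(va)) return;
        if (probe_2nd_tsb(va)) return;
}
sfmmu_tsb_miss(va);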

12.2.4.6. TSB Miss Handling

The TSB miss handler sfmmu_tsb_miss() searches the page tables for virtual-to-physical translations that are not cached in either the TLB or TSB and partially handles protection faults and page faults. It also handles TLB misses that occur in invalid context.

In resolving a TSB miss, sfmmu_tsb_miss() uses per-CPU tsbmiss areas to avoid cache misses. Each CPU's tsbmiss area contains a tsbmiss structure that duplicates some information needed for TSB miss handling and provides scratch space for temporary variable storage. The search for a TTE in the hash tables is performed by the GET_TTE() macro, whose parameters are described in Table 12.5.

Table 12.5. GET_TTE() Parameters

tagacc
    Tag Access register containing the faulting virtual address and context ID. In the case of ISM, the virtual address used is the offset into the ISM segment (clobbered).
hatid
    sfmmu pointer (clobbered).
tte
    TTE for the TLB miss if found; otherwise clobbered (return value).
hmeblkpa
    Physical address of the hment if found; otherwise clobbered (return value).
hmeblkva
    Virtual address of the hment if found; otherwise clobbered (return value).
tsbarea
    Pointer to the CPU's tsbmiss area.
hmentoff
    Temporarily stores the hment offset (clobbered).
hmeshift
    Constant/register to shift the virtual address to obtain the virtual page number for the page size being searched.
hashno
    Constant/register hash number; the coded page-size value used to form the hash tag.
label
    Temporary label for branching within the macro.
foundlabel
    Label to jump to when the TTE is found.
exitlabel
    Label to jump to when the TTE is not found. The hmebp lock is still held at this time.


If the virtual address that caused the miss is not in an ISM segment for the process, GET_TTE() is called with hatid set to the HAT address loaded from the tsbmiss area and hashno set to TTE64K to specify a search for 8-Kbyte or 64-Kbyte pages. If a mapping is not found in the HME hash chains, then GET_TTE() is called with hashno set to TTE512K to search for a 512-Kbyte page.

As a user TSB miss handling optimization, the sfmmu HAT flags stored in the tsbmiss area are checked to see whether any 512-Kbyte pages have been mapped into the process's address space, and GET_TTE() is called only if such pages exist. If no 512-Kbyte mapping is found, the search continues in the same manner through every other valid page size supported on the platform. (Note that the sun4u kernel is mapped only with page sizes up to 4 Mbytes.)

If the TSB miss is for an ISM segment, GET_TTE() is called with hatid set to the ISM hatid and the virtual address in tagacc set to the offset within the segment. As an optimization in this case, sfmmu_tsb_miss() searches from the largest page size down to the smallest.
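
The overall search order can be sketched as below. get_tte() stands in for the GET_TTE() macro with most of its parameters omitted, and the flag tests correspond to the optimization just described; treat the sketch as illustrative rather than as the actual implementation.

/* Illustrative sketch of the hash-search order in sfmmu_tsb_miss(). */
if (!ism_miss) {
        /* Ascend through page sizes, skipping sizes never used. */
        if (get_tte(tagacc, hatid, TTE64K))  goto found; /* 8K + 64K */
        if ((hat_flags & HAT_512K_FLAG) &&
            get_tte(tagacc, hatid, TTE512K)) goto found;
        if ((hat_flags & HAT_4M_FLAG) &&
            get_tte(tagacc, hatid, TTE4M))   goto found;
        /* ...and similarly for 32M/256M where supported... */
} else {
        /*
         * ISM: search from the largest page size down to the smallest,
         * with tagacc holding the offset into the ISM segment.
         */
}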

If a valid mapping is found, it is loaded into a TSB by the TSB_UPDATE_TL() routine. In the case of page sizes less than 4 Mbytes, the first TSB is used. For 4-Mbyte pages and larger, the second TSB is used if it exists. Finally, the translation is programmed into the appropriate TLB and program execution resumes.

The UltraSPARC IV+ iTLB miss handler code simulates 32-Mbyte and 256-Mbyte page sizes with 4-Mbyte pages, to provide support for programs, for example, Java programs, that may copy instructions into a 32-Mbyte or 256-Mbyte data page and then execute them. The code generates the 4-Mbyte PFN bits and saves them in the modified 32-Mbyte/256-Mbyte TTEs in the TSB by calling TSB_UPDATE_TL_PN(). If the TTE is stored in the dTLB to map a 32-Mbyte/256-Mbyte page, the 4-Mbyte PFN offset bits are ignored by hardware.

If no mapping is found in the HME hash chain search, then the behavior depends on the trap level at which the handler was called. Both the DTLB_MISS() and ITLB_MISS() handlers are common to the trap level 0 and trap level > 0 portions of the trap table.

In the case of a kernel TLB miss, if the current trap level is 1, a page fault has occurred in the kernel on a kernel address and the sfmmu_pagefault() routine is called. If CPU_DTRACE_NOFAULT is set in the cpuc_dtrace_flags, we don't actually want to call sfmmu_pagefault(); instead, we note that a fault has occurred by setting CPU_DTRACE_BADADDR and issuing a done instruction (instead of a retry), which steps over the faulting instruction. On the other hand, if the trap level is greater than 1, the kernel panics with a call to ptl1_panic(). Likewise, if the fault occurs on the same page as the stack pointer, we know the stack is bad and the trap handler would fail, so ptl1_panic() is called.

In the case of a user TLB miss, if the current trap level is > 1, the sfmmu_window_trap() routine is called. This deals with the case of a dTLB miss when handling a register window overflow or underflow trap. If the trap level is 1, a branch is made to sfmmu_pagefault().

12.2.5. Intimate Shared Memory (ISM)

Every process that attaches a particular shared memory segment to its address space creates its own page table structures to map its virtual pages to the shared physical pages. That is, each process maintains its own private hme_blk and sf_hment structures, even though the sf_hment structures contain the same mappings across the different processes sharing the memory segment. For very large shared memory segments shared by a large number of processes, as is typical of many commercial database installations, this practice can waste kernel memory. To overcome this drawback, Solaris implements a form of shared memory known as Intimate Shared Memory (ISM), whereby the page table structures are shared among all attaching processes. ISM primitives are discussed further in Section 4.4.2.

To share hme_blk structures across different address spaces, we need to be able to construct identical tags for the HME hash chain search. But recall that the hmeblk_tag is formed from the hatid, virtual address, and page size. Since each address space can map the shared segment at any virtual address, only the page size is guaranteed to be in common. To solve this problem, Solaris represents each ISM segment on the system with a separate dummy hat structure and uses the virtual address offset within the ISM segment to create the hmeblk_tag.
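
Conceptually, the tag construction for an ISM mapping looks like the following sketch. The field and macro names are approximations for illustration; the essential point is that the shared dummy hatid and the in-segment offset replace the per-process hatid and virtual address.

/*
 * Illustrative sketch: building an hmeblk_tag for an ISM segment.
 * Every process sharing the segment computes the same tag, because
 * the dummy ISM hatid and the offset are common to all of them.
 */
offset     = va - ism_seg_base;           /* identical in every process */
tag.hatid  = ism_hatid;                   /* shared dummy hat           */
tag.bspage = PAGE_NUM(offset, hmeshift);  /* derived from the offset    */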

Each mapping of an ISM segment into a process address space is represented by an ism_ment structure, which links all the process hat structures sharing an ISM hat. This is similar in function to a page's p_mapping list of sf_hment structures. If a process uses ISM, the hat structure points to an ISM mapping block, ism_blk[], an array of map entries that maintain information for each ISM segment attached to the process.

Two Solaris segment drivers support ISM: the segspt and segspt_shm drivers. Each instance of an ISM segment attached to a user address space has one segspt_shm segment. And each ISM segment on the system has one segspt segment. Therefore, in the simplest case, if two processes share an ISM segment, each process having a single mapping to it, then there will be two segspt_shm segments, one in each process's address space segment list. These will both point to a single common segspt segment that describes the memory and swap space allocated for the ISM mapping. Following is the sequence of events involved in the creation of these segments.

Figure 12.9. ISM Data Structures


-> shmat                              Entry point from shmat(2)
  -> ipc_lookup                       Look up the id created by shmget(2)
  <- ipc_lookup
  -> sptcreate                        Create the ISM segment
    -> as_alloc                       Allocate an ISM as and hat
    <- as_alloc
    -> as_map                         Create a segspt segment in the ISM as
      -> segspt_create                Get physical pages for segment
        -> anon_swap_adjust           Move reserved phys swap into memory swap
        <- anon_swap_adjust
        -> anon_map_createpages       Allocate physical pages
        <- anon_map_createpages
        -> hat_memload_array          Add hme_blks to hme hash chains
        <- hat_memload_array
      <- segspt_create
    <- as_map
  <- sptcreate
  -> as_map                           Create process's segspt_shm segment
    -> segspt_shmattach               Attach ISM segment
      -> hat_share                    Add ism_ment to ism_blk
      <- hat_share
    <- segspt_shmattach
  <- as_map
<- shmat                              All done


In more detail, the shmat() routine checks whether the segspt segment for the ISM has been created; if it has not, then sptcreate() is called. sptcreate() calls as_alloc() to allocate the common ISM as and hat structures and then calls as_map() to create the segspt segment and attach it to the ISM address space at virtual address SEGSPTADDR, which is 0x0. The segspt_create() routine allocates a vnode and calls anon_map_createpages() to allocate physical pages for the ISM area.

The anon_map_createpages() routine allocates anonymous pages, using the largest page size possible for the given virtual address and range. The segspt_create() routine then calls hat_memload_array() to map the physical pages allocated for the ISM address space at the virtual address SEGSPTADDR. hat_memload_array() adds the appropriate hme_blks to the hme hash chains to create the mappings. Two flags are passed to hat_memload_array(): the HAT_LOAD_LOCK flag causes the routine to increment the lock field in the hme_blk structure, hblk_lckcnt; the HAT_LOAD_SHARE flag causes the routine to try to use the largest page size that is not disabled for ISM.
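
Stepping back to user level, this entire creation path is triggered by requesting ISM at attach time. The sketch below uses the documented Solaris interface, the SHM_SHARE_MMU flag to shmat(2); the segment size is arbitrary.

/* Requesting ISM from user level with shmat(2) and SHM_SHARE_MMU. */
#include <sys/ipc.h>
#include <sys/shm.h>
#include <stdio.h>

int
main(void)
{
        size_t sz = 64 * 1024 * 1024;          /* 64-Mbyte segment */
        int id = shmget(IPC_PRIVATE, sz, IPC_CREAT | 0600);
        if (id == -1) {
                perror("shmget");
                return (1);
        }
        void *va = shmat(id, NULL, SHM_SHARE_MMU);  /* ISM attach */
        if (va == (void *)-1) {
                perror("shmat");
                return (1);
        }
        printf("ISM segment attached at %p\n", va);
        (void) shmdt(va);
        (void) shmctl(id, IPC_RMID, NULL);
        return (0);
}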

Large pages for ISM are disabled with the disable_ism_large_pages flag. A value of 1 disables all large pages. Bits 1 through 5, if set, disable 64-Kbyte, 512-Kbyte, 4-Mbyte, 32-Mbyte, and 256-Mbyte pages, respectively. 512-Kbyte pages must be disabled for ISM; otherwise, a process would page fault indefinitely if it tried to access a 512-Kbyte page.
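
As a sketch, the bit assignments just described correspond to a mask like the following; the macro names are invented for illustration.

/* Illustrative encoding of disable_ism_large_pages (names invented). */
#define ISM_DISABLE_ALL_LP   (1 << 0)  /* value 1: disable all large pages */
#define ISM_DISABLE_64K      (1 << 1)
#define ISM_DISABLE_512K     (1 << 2)  /* always set for ISM               */
#define ISM_DISABLE_4M       (1 << 3)
#define ISM_DISABLE_32M      (1 << 4)
#define ISM_DISABLE_256M     (1 << 5)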

The shmat() routine then creates the segspt_shm segment and attaches it to the current process's address space by passing the user-specified attach address and the routine segspt_shmattach() to as_map(). The segspt_shmattach() routine creates and initializes the segment-private sptshm_data area and then calls hat_share(), which allocates the ism_blk structures and links them to the process hat structure. If the user-specified address is 0, then shmat() picks a suitable address. In a 64-bit address space, it tries to put ISM segments between PREDISM_BASE and PREDISM_BOUND. The HAT can use these constants to predict that a virtual address is contained within an ISM segment, a capability that may optimize translation. The range is treated as advisory; ISM segments may fall outside the range, and non-ISM segments may be contained within it. To avoid collisions between ISM addresses and, say, process heap addresses, shmat() tries to put ISM segments above PREDISM_1T_BASE. The HAT still expects that any virtual address larger than PREDISM_BASE may belong to ISM.

The segspt driver can allocate memory up to availrmem - segspt_minfree. segspt_minfree is the memory left for the rest of the system after the ISM pages have been created; it is set to 5% of availrmem in sptcreate() when ISM is created. ISM should not use more than around 90% of availrmem; if it does, the performance of the system may decrease. Machines with large memories may be able to use more memory for ISM, so the default segspt_minfree is 5% (which gives ISM a maximum of 95% of availrmem). Kernel programmers who want even more memory for ISM (at the risk of hanging the system) can patch segspt_minfree to a smaller number.
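
In other words, the ceiling on ISM allocation can be sketched as the following computation (variable names other than availrmem and segspt_minfree are illustrative):

/* Sketch of the segspt allocation ceiling described above. */
pgcnt_t segspt_minfree = availrmem / 20;             /* 5% of availrmem */
pgcnt_t ism_max_pages  = availrmem - segspt_minfree; /* ISM ceiling     */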

12.2.6. Synchronization in the HAT Layer

The HAT layer must do a considerable amount of synchronization to function properly. This synchronization includes waiting for the TLB to be flushed before acquiring a new context, protecting the list of available contexts, preventing a TSB from being removed from a process while handlers are accessing that TSB, and waiting for a TSB's physical memory to be relocated by another thread.

The locking scheme used in the Solaris 10 HAT layer has been significantly improved over that of previous OS versions. The new scheme decreases complexity while increasing scalability and minimizing bottlenecks. Table 12.6 describes the HAT locks.

Table 12.6. Summary of HAT Locks

hat_lock
    New to Solaris 10, this is an array of adaptive mutex locks hashed on the hatid. It replaces the old ctx_lock to synchronize context acquisition as well as TLB and TSB flushing.
ctx_rwlock
    A reimplementation of the obsolete halfword reader-writer spinlock (the ctx_refcnt lock in struct ctx) as a true kernel reader-writer lock. It exists solely to protect a context from being stolen.
ctx_list_lock
    New to Solaris 10, this lock protects access to the ctxfree and ctxdirty lists.
ism_mlist_lock
    A global adaptive mutex protecting access to an ISM hat structure's mapping list.
mml_table
    An array of adaptive mutex locks hashed on the address of the page structure, protecting a page's p_mapping list and p_nrm field. These locks are also referred to as mapping list locks.
p_selock
    This lock (a.k.a. the page lock) is a shared/exclusive lock in the page structure that synchronizes page manipulations.


12.2.6.1. hat_lock

The hat_lock is an array of kmutex_t adaptive mutex locks hashed on the hatid of the process address space. When a context needs to be acquired or entries need to be flushed from the TLB or TSB, the lock corresponding to the HAT's bucket is acquired. The hat_lock, when combined with the new sfmmu_flags, replaces the uses of the old sfmmu_mutex lock.

The hat_lock protects the tsbinfo list within each HAT from changing. This is convenient since TSB map and unmap operations need to access TSBs without fear that the TSB will be relocated during the operation, and map and unmap operations need to be synchronized with TSB flushing operations to prevent data corruption.

The hat_lock is acquired within the HAT through the utility function sfmmu_hat_enter() and is dropped by a call to sfmmu_hat_exit() with the hatlock_t that was returned from the sfmmu_hat_enter() call.
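
A typical usage pattern, sketched from the interfaces named above (the work done between the calls is illustrative):

/* Sketch: serializing a TSB operation under the hat_lock. */
hatlock_t *hatlockp = sfmmu_hat_enter(sfmmup); /* hashes on the hatid */
/* ... walk or modify the tsbinfo list, flush TSB entries, etc. ... */
sfmmu_hat_exit(hatlockp);                      /* release the lock    */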

Note that in the case of a virtual address cache flush, such as when temporary noncacheable (TNC) entries are set up, it is not known ahead of time which address spaces are affected. Because of locking order constraints, the hat_lock must be acquired before the lower-level routines that have this information are entered. As with the old ctx_lock, all the elements of the hat_lock array must be acquired. A set of utility functions, sfmmu_hat_lock_all() and sfmmu_hat_unlock_all(), is provided for this purpose.

To prevent a process from becoming bottlenecked on the hat_lock, operations that will take some time to complete (such as relocating a TSB's physical memory) are guarded by new flags in the hat and tsbinfo structures. Table 12.7 describes the flags.

Table 12.7. HAT Flags

HAT_SWAPPED
    This process is swapped out, and all virtual memory translation resources have been freed.
HAT_SWAPIN
    This process is being swapped in.
HAT_BUSY
    The tsbinfo list of this hat structure is being changed because of TSB resizing.
HAT_ISMBUSY
    ISM mappings are being mapped or unmapped by this process. Use of this flag is restricted to the locking primitives sfmmu_ismhat_enter() and sfmmu_ismhat_exit().
HAT_64K_FLAG
    This process does or might use 64-Kbyte mappings.
HAT_512K_FLAG
    This process does or might use 512-Kbyte mappings.
HAT_4M_FLAG
    This process does or might use 4-Mbyte mappings.
HAT_32M_FLAG
    This process does or might use 32-Mbyte mappings.
HAT_256M_FLAG
    This process does or might use 256-Mbyte mappings.


To synchronize between processes mapping or unmapping shared segments, a new locking primitive, called sfmmu_ismhat_enter(), was built on top of the hat_lock. This primitive controls access to an sfmmu_flags flag with priority inheritance to provide mutual exclusion without introducing unnecessary lock contention, since ISM unmap operations can take a long time to complete.

12.2.6.2. Locking Order

The order of lock acquisition is shown in Figure 12.10. One important note about the locking order is that the mapping list lock is acquired before many of the HAT operations that need to acquire the hat_lock are called. Thus, attempting to allocate kernel memory while hat_lock is held can result in deadlock. All memory allocations are therefore done without the hat_lock.

Figure 12.10. SPARC HAT Locking Order


Attempting to acquire the hat_lock for the kernel's HAT is a no-op, so there is no chance of recursively trying to take the same lock while holding the hat_lock on behalf of the current process.

12.2.6.3. TSB Consistency

In older versions of Solaris, multiple contexts could share a TSB; when a context was stolen from a process, the TSB associated with that context had to be flushed of all TTEs matching that context. Because of this, individual TSB entries did not need to be unmapped if the context was the invalid context.

With per-address space TSBs in Solaris 10, the context is no longer associated with the TSB, so flushing the TSB "per context" is no longer a supported operation and no unmapping is done if a context is stolen from a process. Instead, the process faults, acquires a new context, and continues using TTEs cached earlier in its TSB. This results in significant savings should a process's context be stolen. The trade-off, albeit a small one, is that now individual TSB entries must always be unmapped, regardless of the active context. However, the TLB flush is still skipped in this case, since the invalid context is never allowed to have TTEs in the TLB, and we know that the TLB will be flushed of the last context before it is reused on a given CPU.

In the new implementation, TTEs are unmapped from the TSB in the following instances:

  • Address range unmap.

  • Temporary noncacheable (TNC) entries due to virtual address cache (VAC) conflict with a new mapping.

  • A change in protections of an existing mapping.

  • A thread needs to unmap an entry in a process's TSB but cannot acquire exclusive access to the TSB because it is being relocated in physical memory.

The entire TSB is flushed of all mappings in the following instances:

  • ISM shared address range unmap

  • Initial TSB memory allocation

Of note is that in the new implementation, sfmmu_inv_tsb() has been completely rewritten to use block store VIS instructions. This has the desired side effect of invalidating any data in level-1 caches that will be stale when the TSB is updated with its physical address, while simultaneously providing a tremendous boost in the rate at which TSBs may be flushed. Although the TSB is still flushed on an ISM address range unmap or when a TSB is first allocated, it is no longer flushed when an address space is freed and so conserves valuable CPU time when processes terminate or call exec().

12.2.7. SPARC HAT Layer Kernel Tunables

Table 12.8. TSB-Related Tunables

default_tsb_size
    Selects the size of the initial TSB allocated to all processes. The default is 8 Kbytes (size code 0). It is OK to tune this, but a value that is too large may waste kernel memory. In any case, the default TSB size may never exceed the TSB slab size or the system will panic. The possible size codes are as follows:
        0    8 Kbytes
        1    16 Kbytes
        2    32 Kbytes
        3    64 Kbytes
        4    128 Kbytes
        5    256 Kbytes
        6    512 Kbytes
        7    1 Mbyte
enable_tsb_rss_sizing
    When set to 1 (the default), TSBs may be resized per the RSS sizing algorithm discussed on page 603. When set to 0, TSBs remain at default_tsb_size.
tsb_remap_ttes
    When set to 1 (the default), TTEs are remapped when a TSB is grown to a larger size. This saves the overhead of faulting on those mappings and repopulating the TSB, at the expense of doing the copy with the locks held (blocking all threads of that process from running for the duration of the copy).
tsb_rss_factor
    The value of tsb_rss_factor / 512 gives the fraction of the TSB that must be used before the TSB is grown to a larger size. The default corresponds to 75%, since some virtual addresses are expected to map to the same slot (conflict) in the TSB.
tsb_alloc_hiwater_factor
    The factor of physical memory used to determine the value of tsb_alloc_hiwater, below. The default is 32, that is, 1/32 of physical memory.
tsb_alloc_hiwater
    The limit on TSB memory usage, in bytes, beyond which TSBs are no longer grown and may be shrunk, as described on page 603. Exercise care when increasing this value: it must remain lower than other kernel values, such as swapfs_minfree, to prevent system hangs. It is always safe to set this value smaller than the default, which is 1/tsb_alloc_hiwater_factor of the physical memory installed at boot time or after a DR reconfiguration event.
tsb_sectsb_threshold
    The number of 4-Mbyte mappings a process must have resident before a second TSB is allocated to the process to cache 4-Mbyte mappings. The default value varies with the TLB characteristics of the machine. Setting it very high essentially disables the TSB for 4-Mbyte mappings and causes all 4-Mbyte mappings to be retrieved from the hash.
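
These tunables are normally set in /etc/system and take effect at the next boot. For example (the values shown are illustrative, not recommendations):

* /etc/system fragment: illustrative HAT/TSB tunable settings.
* Start all processes with a 16-Kbyte initial TSB (size code 1).
set default_tsb_size = 1
* Leave RSS-based TSB resizing enabled (the default).
set enable_tsb_rss_sizing = 1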


12.2.8. SPARC HAT Layer kstats

The sfmmu_global_stat kstats (described in Table 12.9) track general HAT layer statistics, while the sfmmu_tsbsize_stat kstats provide information on TSB size allocations:

$ kstat -n sfmmu_tsbsize_stat
module: unix                             instance: 0
name:   sfmmu_tsbsize_stat               class:    hat
        crtime                           799.82297424
        sf_tsbsz_128k                    275
        sf_tsbsz_16k                     3891
        sf_tsbsz_1m                      0
        sf_tsbsz_256k                    75
        sf_tsbsz_2m                      0
        sf_tsbsz_32k                     3721
        sf_tsbsz_4m                      0
        sf_tsbsz_512k                    0
        sf_tsbsz_64k                     371
        sf_tsbsz_8k                      228875
        snaptime                         2081123.44794904


Table 12.9. HAT Layer Kstats from unix:hat:sfmmu_global_stat

sf_tsb_exceptions
    The number of calls to sfmmu_tsbmiss_exception().
sf_tsb_raise_exception
    The number of TSB exceptions raised to synchronize MMU state or to update the TSB base register and TTE after a TSB replacement.
sf_pagefaults
    The number of page faults.
sf_uhash_searches
    The number of user hash searches.
sf_uhash_links
    The number of user hash links.
sf_khash_searches
    The number of kernel hash searches.
sf_khash_links
    The number of kernel hash links.
sf_swapout
    The number of times process TSBs were swapped out.
sf_ctxfree
    The number of contexts allocated from the free list.
sf_ctxdirty
    The number of contexts allocated from the dirty list.
sf_ctxsteal
    The number of contexts allocated by stealing one from another process.
sf_tsb_alloc
    The number of TSB allocations.
sf_tsb_allocfail
    The number of times TSB allocations failed because of resource exhaustion.
sf_tsb_sectsb_create
    The number of times a second TSB was allocated for a process.
sf_tteload8k
    The number of calls to sfmmu_tteload_addentry() to add an 8-Kbyte TTE to an hme_blk.
sf_tteload64k
    The number of calls to sfmmu_tteload_addentry() to add a 64-Kbyte TTE to an hme_blk.
sf_tteload512k
    The number of calls to sfmmu_tteload_addentry() to add a 512-Kbyte TTE to an hme_blk.
sf_tteload4m
    The number of calls to sfmmu_tteload_addentry() to add a 4-Mbyte TTE to an hme_blk.
sf_tteload32m
    The number of calls to sfmmu_tteload_addentry() to add a 32-Mbyte TTE to an hme_blk.
sf_tteload256m
    The number of calls to sfmmu_tteload_addentry() to add a 256-Mbyte TTE to an hme_blk.
sf_tsb_load8k
    The number of times a TTE was preloaded into the 8-Kbyte-indexed TSB at map time.
sf_tsb_load4m
    The number of times a TTE was preloaded into the 4-Mbyte-indexed TSB at map time.
sf_hblk_hit
    The number of times an existing hme_blk was found to which a TTE could be added.
sf_hblk8_ncreate
    The number of static nucleus hblk8 structures created at startup.
sf_hblk8_nalloc
    The number of hblk8 structures allocated from the static nucleus pool during startup.
sf_hblk1_ncreate
    The number of static hblk1 structures created at startup.
sf_hblk1_nalloc
    The number of hblk1 structures allocated from the static nucleus pool during startup.
sf_hblk_slab_cnt
    The number of sfmmu8_cache slabs created.
sf_hblk_reserve_cnt
    The hblk_reserve usage count: the number of times we could not get an hme_blk to map an sfmmu8_cache slab from the reserve pool and had to use hblk_reserve.
sf_hblk_recurse_cnt
    The number of recursive hblk_reserve owner requests, that is, cases where we already own hblk_reserve but need another hblk8.
sf_hblk_reserve_hit
    The number of hblk_reserve hash hits: the number of times hblk_reserve was encountered in the HME hash chains.
sf_get_free_success
    The number of times an hme_blk was found in the reserve pool.
sf_get_free_throttle
    The number of times we failed to obtain an hme_blk from the reserve pool because of throttling. Reserve pool allocations are throttled when the number of HME blocks in the pool reaches HBLK_RESERVE_MIN.
sf_get_free_fail
    The number of times the reserve pool was empty when we went to grab an hme_blk.
sf_put_free_success
    The number of times an hme_blk was freed by adding it to the reserve pool.
sf_put_free_fail
    The number of times an hme_blk could not be freed to the reserve pool because the pool was full.
sf_pgcolor_conflict
    The number of times a virtual address cache (VAC) conflict was encountered while a mapping was loaded.
sf_uncache_conflict
    The number of mappings made temporarily noncacheable (TNC) to resolve VAC conflicts. This happens for a conflict within an existing large page (in which case all constituent mappings are uncached) or with a locked 8-Kbyte page.
sf_unload_conflict
    The number of 8-Kbyte page mappings unloaded to resolve a VAC conflict.
sf_ism_uncache
    The number of ISM mappings made TNC to resolve VAC conflicts.
sf_ism_recache
    The number of ISM mappings, previously uncached, that were recached after the VAC conflict passed.
sf_recache
    The number of mappings, previously uncached, that were recached after the VAC conflict passed.
sf_steal_count
    The number of HME blocks stolen when we failed to allocate a new one.
sf_pagesync
    The number of times sfmmu_pagesync() was called to synchronize a page structure's hardware-dependent attributes.
sf_clrwrt
    The number of times write permission was removed from a mapping to a page so that the next modification of the page can be detected.
sf_pagesync_invalid
    The number of pagesync operations that encountered an invalid TTE.
sf_kernel_xcalls
    The number of kernel cross-calls.
sf_user_xcalls
    The number of user cross-calls.
sf_tsb_grow
    The number of times process TSBs were successfully increased in size.
sf_tsb_shrink
    The number of times process TSBs were successfully decreased in size on unmap to free up memory.
sf_tsb_resize_failures
    The number of times a process TSB resize was attempted but could not succeed because of an allocation failure.
sf_tsb_reloc
    The number of times a TSB was relocated.
sf_user_vtop
    The number of sfmmu_uvatopfn() calls to translate a user virtual address into a PFN.
sf_ctx_swap
    The number of times a process context was changed to serialize ISM demap with the trap handlers.
sf_tlbflush_all
    The number of times all TLBs were flushed.
sf_tlbflush_ctx
    The number of times a context was flushed from the TLBs.
sf_tlbflush_deferred
    The number of times a TLB context flush was not performed immediately. As a performance optimization on platforms that support the TLB demap-all operation, requests to flush a context are deferred until a context is allocated. At that time, if the context free list is empty, a context is allocated from the dirty list, a TLB demap-all is issued, and the dirty list is moved to the free list.
sf_tlb_reprog_pgsz
    The number of times the TLBs were reprogrammed with new page sizes.




