9.4. Virtual Address Space Management

A virtual address space is a container for a set of mappings. There is one address space for each process within the system, and one for the kernel. The address space layers manage setup and teardown changes to the address spaces on behalf of a process or the kernel, and MMU faults within.

9.4.1. Address Space Management

The Solaris kernel is implemented with a central address management subsystem that other parts of the kernel call into. The address space module is a wrapper around the segment drivers, so that subsystems need not know what segment driver is used for a memory range. The address space object shown in Figure 9.7 is linked from the process's address space and contains pointers to the segments that constitute the address space.

Figure 9.7. The Address Space

The address space subsystem manages the following functions:
Recall that the process and kernel subsystems call into the address space subsystem to manage their address spaces. The address space subsystem consists of a series of functions, grouped to perform the functions listed above. Although the subsystem has many entry points, the implementation is fairly simple because most of the functions simply look up the segment that the operation applies to and then route the request to the appropriate segment driver.

A call to the as_alloc() function creates an address space, but as_alloc() is invoked only once, when the system boots and the init process is created. After the init process is created, all address spaces are created by duplication of the init process's address space with fork(). The fork() system call in turn calls the as_dup() function to duplicate the address space of the current process as it creates a new process, and the entire address space configuration, including the stack and heap, is replicated at this point.

The behavior of vfork() at this point is somewhat different. Rather than calling as_dup() to replicate the address space, vfork() creates a new process by borrowing the parent's existing address space. The vfork() function is useful if the child is going to call exec(), since it saves all the effort of duplicating an address space that would otherwise be discarded once exec() is called. The parent process is suspended while the child is using its address space, until exec() is called.

Once the process is created, the address space object is allocated and set up. The Solaris 10 data structure for the address space object is shown below.
    struct as {
            kmutex_t    a_contents;    /* protect certain fields in the structure */
            uchar_t     a_flags;       /* as attributes */
            uchar_t     a_vbits;       /* used for collecting statistics */
            kcondvar_t  a_cv;          /* used by as_rangelock */
            struct hat  *a_hat;        /* hat structure */
            struct hrmstat *a_hrm;     /* ref and mod bits */
            caddr_t     a_userlimit;   /* highest allowable address in this as */
            struct seg  *a_seglast;    /* last segment hit on the addr space */
            krwlock_t   a_lock;        /* protects segment related fields */
            size_t      a_size;        /* size of address space */
            struct seg  *a_lastgap;    /* last seg found by as_gap() w/ AS_HI (mmap) */
            avl_tree_t  a_segtree;     /* segments in this address space. (AVL tree) */
            struct watched_page *a_wpage; /* list of watched pages (procfs) */
            int         a_nwpage;      /* number of watched pages */
            uchar_t     a_updatedir;   /* mappings changed, rebuild a_objectdir */
            vnode_t     **a_objectdir; /* object directory (procfs) */
            size_t      a_sizedir;     /* size of object directory */
            struct as_callback *a_callbacks; /* callback list */
    };
                                                                    See vm/as.h

The following output from the DTrace vm.d script shows the path through the virtual memory layers as a process allocates more memory via brk():

    0  => brk                             13076
    0    -> as_rangelock                  13077
    0    <- as_rangelock                  13078
    0    -> as_map                        13082
    0      -> seg_alloc                   13087
    0        -> seg_attach                13091
    0          -> as_addseg               13093
    0          <- as_addseg               13101
    0        <- seg_attach                13102
    0      <- seg_alloc                   13104
    0      -> segvn_create                13106
    0        -> anon_resvmem              13108
    0        <- anon_resvmem              13110
    0        -> anon_grow                 13117
    0        <- anon_grow                 13123
    0        -> seg_free                  13125
    0          -> as_removeseg            13127
    0          <- as_removeseg            13132
    0        <- seg_free                  13134
    0      <- segvn_create                13137
    0      -> as_setwatch                 13139
    0      <- as_setwatch                 13141
    0    <- as_map                        13143
    0    -> as_rangeunlock                13144
    0    <- as_rangeunlock                13146
    0  <= brk                             13147

Address space fault handling is performed in the address space subsystem; some of the faults are handled by the common address space code, and others are redirected to the segment handlers.
When a page fault occurs, the Solaris trap handlers call the as_fault() function, which determines which segment the page fault occurred in by calling the as_segat() function. If the fault does not lie in any of the address space's segments, then as_fault() sends a SIGSEGV signal to the process. If the fault does lie within one of the segments, then the segment's fault method is called and the segment handles the page fault.

Table 9.3 lists the segment functions in alphabetical order.
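The routing performed by as_fault() can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the kernel code: every name prefixed with model_ is hypothetical, the segments are kept in a sorted array searched by binary search rather than in the a_segtree AVL tree, and locking, fault types, and signal delivery are left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space model; none of these names are the kernel's. */
enum model_fault_result { MODEL_FAULT_OK = 0, MODEL_FAULT_SIGSEGV = 1 };

struct model_seg;
typedef enum model_fault_result (*model_fault_fn)(struct model_seg *seg,
    uintptr_t addr);

struct model_seg {
    uintptr_t      base;     /* first address covered by the segment */
    size_t         size;     /* length of the segment in bytes */
    model_fault_fn fault;    /* per-segment fault method */
    unsigned       nfaults;  /* number of faults routed to this segment */
};

struct model_as {
    struct model_seg *segs;  /* sorted by base, non-overlapping */
    size_t            nsegs;
};

/* Find the segment containing addr; return NULL if addr lies in a hole. */
static struct model_seg *
model_as_segat(struct model_as *as, uintptr_t addr)
{
    size_t lo = 0, hi = as->nsegs;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        struct model_seg *seg = &as->segs[mid];

        if (addr < seg->base)
            hi = mid;                 /* look in the lower half */
        else if (addr - seg->base >= seg->size)
            lo = mid + 1;             /* look in the upper half */
        else
            return seg;               /* addr is inside this segment */
    }
    return NULL;
}

/* A trivial segment fault method that only counts the faults it receives. */
static enum model_fault_result
model_seg_fault(struct model_seg *seg, uintptr_t addr)
{
    assert(addr >= seg->base && addr - seg->base < seg->size);
    seg->nfaults++;
    return MODEL_FAULT_OK;
}

/* Route a fault: no segment means SIGSEGV, otherwise the segment decides. */
static enum model_fault_result
model_as_fault(struct model_as *as, uintptr_t addr)
{
    struct model_seg *seg = model_as_segat(as, addr);

    if (seg == NULL)
        return MODEL_FAULT_SIGSEGV;
    return seg->fault(seg, addr);
}
```

The point of the model is the shape of the dispatch: the address space layer knows only how to find a segment, and everything specific to the kind of memory behind the address is hidden behind the segment's fault method.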
9.4.2. Address Space Callbacks

An address space callback is a facility for informing clients of specific events pertaining to address space management. An example of such an event is an address space unmap request: to avoid holding the address space's lock (a_lock) for a long time during an unmap (which can cause ps(1) and other tools to hang), the unmap is performed as a callback without holding a_lock. As one example, we use this facility to prevent an NFS server timeout from hanging ps.

A client calls as_add_callback() to register an address space callback for a range of pages, specifying the events for which the callback should be invoked. When as_do_callbacks() is called and finds a matching entry, the callback is called once, and the callback function MUST call as_delete_callback() when all callback activities are complete. The thread calling as_do_callbacks() blocks until as_delete_callback() is called. This allows asynchronous events to subside before the as_do_callbacks() thread continues.

An example of the need for this is a driver that has done long-term locking of memory. Address space management operations (events) such as as_free(), as_unmap(), and as_setprot() will block indefinitely until the pertinent memory is unlocked. The callback mechanism provides a way to inform the driver of the event so that the driver can do the necessary unlocking.

9.4.3. Virtual Memory Protection Modes

We break each process into segments so that we can treat each part of the address space differently. For example, the kernel maps the machine code portion of the executable binary into the process as read-only to prevent the process from modifying its machine code instructions. The virtual memory subsystem does this by taking advantage of the hardware MMU's virtual memory protection capabilities. Solaris relies on the MMU having the following protection modes:
The implementation of protection modes is done in the segment and HAT layers.

9.4.4. Page Faults in Address Spaces

The Solaris virtual memory system uses the hardware MMU's memory management capabilities. MMU-generated exceptions tell the operating system when a memory access cannot continue without the kernel's intervention, by interrupting the executing process with a trap and then invoking the appropriate piece of memory management code. Three major types of memory-related hardware exceptions can occur: major page faults, minor page faults, and protection faults.

A major page fault occurs when a process attempts to access a virtual memory location that is mapped by a segment, but no physical page of memory is mapped to it and the page does not exist in physical memory. The page fault allows the virtual memory system to hide the management of physical memory allocation from the process. The virtual memory system traps accesses to memory that the process believes is accessible and arranges either to have a new page created for that address (in the case of the first access) or to have the page copied in from the swap device. Once the memory system places a real page behind the memory address, the process can continue normal execution.

If a reference is made to a memory address that is not mapped by any segment, then a segmentation violation signal (SIGSEGV) is sent to the process. The signal is sent as a result of a hardware exception caught by the processor and translated to a signal by the address space layer.

A minor page fault occurs when an attempt is made to access a virtual memory location that resides within a segment and the page is in physical memory, but no current MMU translation is established to the physical page from the address space that caused the fault. For example, a process maps in the libc.so library and makes a reference to a page within it.
A page fault occurs, but the physical page of memory is already present and the process simply needs to establish a mapping to the existing physical page. Minor faults are also referred to as attaches.

A page protection fault occurs when a program attempts to access a memory address in a manner that violates the preconfigured access protection for a memory segment. Protection modes can enable any of read, write, or execute access. For example, the text portion of a binary is mapped read-only, and if we attempt to write to any memory address within that segment, we will cause a memory protection fault. The memory protection fault is also initiated by the hardware MMU as a trap that is then handled by the segment page fault handling routine.

Figure 9.8 shows the relationship between a virtual address space, its segments, and the hardware MMU.

Figure 9.8. Virtual Address Space Page Fault Example

In the figure, we see what happens when a process accesses a memory location within its heap space that does not have physical memory mapped to it. This has most likely occurred because the page of physical memory has previously been stolen by the page scanner as a result of a memory shortage. In the numbered events in the figure, we see:
Once this process is completed, the process can continue execution.

The following example using the DTrace vm.d script shows the logical flow for a zero-fill-on-demand page fault.

    0  -> as_fault                        4210
    0    | as_fault:as_fault              4210
    0    -> as_segat                      4211
    0    <- as_segat                      4212
    0    -> segvn_fault                   4213
    0      -> anonmap_alloc               4216
    0        -> anon_create               4219
    0        <- anon_create               4224
    0      <- anonmap_alloc               4225
    0      -> anon_array_enter            4227
    0        -> page_get_pagecnt          4228
    0        <- page_get_pagecnt          4229
    0        -> anon_get_slot             4230
    0        <- anon_get_slot             4231
    0      <- anon_array_enter            4232
    0      -> anon_zero                   4234
    0        -> anon_alloc                4236
    0        <- anon_alloc                4240
    0        -> page_lookup               4243
    0          -> page_lookup_create      4244
    0          <- page_lookup_create      4245
    0        <- page_lookup               4246
    0        -> page_create_va            4248
    0          -> page_get_freelist       4251
    0            -> page_trylock          4258
    0            <- page_trylock          4259
    0            -> page_sub              4261
    0            <- page_sub              4262
    0          <- page_get_freelist       4266
    0          -> page_hashin             4268
    0          <- page_hashin             4270
    0          -> page_io_lock            4272
    0          <- page_io_lock            4273
    0          -> page_add                4275
    0          <- page_add                4276
    0        <- page_create_va            4278
    0        -> pvn_plist_init            4279
    0          -> page_sub                4280
    0          <- page_sub                4281
    0          -> page_io_unlock          4283
    0          <- page_io_unlock          4285
    0        <- pvn_plist_init            4286
    0        -> pagezero                  4289
    0        <- pagezero                  4294
    0        -> page_downgrade            4296
    0        <- page_downgrade            4297
    0        | anon_zero:zfod             4298
    0        -> hat_page_setattr          4299
    0        <- hat_page_setattr          4301
    0      <- anon_zero                   4302
    0      -> anon_set_ptr                4303
    0      <- anon_set_ptr                4305
    0      -> hat_memload                 4306
    0      <- hat_memload                 4316
    0      -> page_unlock                 4317
    0      <- page_unlock                 4318
    0      -> anon_array_exit             4319
    0      <- anon_array_exit             4320
    0    <- segvn_fault                   4321
    0  <- as_fault                        4322
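The outcomes described in this section (SIGSEGV, protection fault, minor fault, and major fault) can be summarized as a small decision procedure. The following is a user-space sketch only: the model_ and MODEL_ names are hypothetical, and the order of the checks is our assumption about one reasonable way to read the definitions above, not a description of how the trap handler or the segment drivers are actually coded.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical protection bits for the model (not the kernel's PROT_* values). */
#define MODEL_PROT_READ   0x1u
#define MODEL_PROT_WRITE  0x2u
#define MODEL_PROT_EXEC   0x4u
#define MODEL_PROT_ALL    (MODEL_PROT_READ | MODEL_PROT_WRITE | MODEL_PROT_EXEC)

enum model_fault_kind {
    MODEL_NO_FAULT,          /* access proceeds without kernel intervention */
    MODEL_SEGV,              /* address not mapped by any segment */
    MODEL_PROTECTION_FAULT,  /* access violates the segment's protections */
    MODEL_MINOR_FAULT,       /* page in memory, no MMU translation yet */
    MODEL_MAJOR_FAULT        /* page not in physical memory */
};

/* The facts about one memory access that the classification depends on. */
struct model_access {
    bool     in_segment;         /* address lies within some segment */
    unsigned seg_prot;           /* protections the segment allows */
    unsigned wanted;             /* access being attempted */
    bool     page_resident;      /* page exists in physical memory */
    bool     translation_valid;  /* MMU translation already established */
};

static enum model_fault_kind
model_classify_fault(const struct model_access *a)
{
    assert((a->wanted & ~MODEL_PROT_ALL) == 0);

    if (!a->in_segment)
        return MODEL_SEGV;
    if ((a->wanted & ~a->seg_prot) != 0)
        return MODEL_PROTECTION_FAULT;
    if (a->translation_valid)
        return MODEL_NO_FAULT;
    /* The segment covers the address, so the only question is residency. */
    return a->page_resident ? MODEL_MINOR_FAULT : MODEL_MAJOR_FAULT;
}
```

In this model, the zero-fill-on-demand trace above corresponds to the major fault case on first access: the address is inside a segment, the access is permitted, and no page exists yet, so one must be created and zeroed before the translation is loaded.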