5.5. Shared Memory

In Section 5.4, we explained how the address space of a process is organized. This section shows the additional data structures needed to support shared address space between processes. Traditionally, the address space of each process was completely isolated from the address space of all other processes running on the system. The only exception was read-only sharing of program text. All interprocess communication was done through well-defined channels that passed through the kernel: pipes, sockets, files, and special devices. The benefit of this isolated approach is that, no matter how badly a process destroys its own address space, it cannot affect the address space of any other process running on the system. Each process can precisely control when data are sent or received; it can also precisely identify the locations within its address space that are read or written. The drawback of this approach is that all interprocess communication requires at least two system calls: one from the sending process and one from the receiving process. For high volumes of interprocess communication, especially when small packets of data are being exchanged, the overhead of the system calls dominates the communications cost.

Shared memory provides a way to reduce interprocess-communication costs dramatically. Two or more processes that wish to communicate map the same piece of read-write memory into their address space. Once all the processes have mapped the memory into their address space, any changes to that piece of memory are visible to all the other processes, without any intervention by the kernel. Thus, interprocess communication can be achieved without any system-call overhead other than the cost of the initial mapping. The drawback to this approach is that, if a process that has the memory mapped corrupts the data structures in that memory, the corruption is seen by all the other processes mapping that memory. In addition, the application developer faces the complexity of designing data structures to control access to the shared memory and of coping with the race conditions inherent in manipulating and controlling data structures that are accessed concurrently.

Some UNIX variants have a kernel-based semaphore mechanism to provide the needed serialization of access to the shared memory. However, both getting and setting such semaphores require system calls. The overhead of using such semaphores is comparable to that of using the traditional interprocess-communication methods. Unfortunately, these semaphores have all the complexity of shared memory, yet confer little of its speed advantage. The primary reason to introduce the complexity of shared memory is for the commensurate speed gain. If this gain is to be obtained, most of the data-structure locking needs to be done in the shared memory segment itself. The kernel-based semaphores should be used for only those rare cases where there is contention for a lock and one process must wait. Consequently, modern interfaces, such as POSIX Pthreads, are designed such that the semaphores can be located in the shared memory region. The common case of setting or clearing an uncontested semaphore can be done by the user process, without calling the kernel. There are two cases where a process must do a system call. If a process tries to set an already locked semaphore, it must call the kernel to block until the semaphore is available. This system call has little effect on performance because the lock is contested, so it is impossible to proceed, and the kernel must be invoked to do a context switch anyway. If a process clears a semaphore that is wanted by another process, it must call the kernel to awaken that process. Since most locks are uncontested, the applications can run at full speed without kernel intervention.
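As a concrete illustration, the following minimal sketch places a POSIX unnamed semaphore in an anonymous shared mapping and uses it to serialize a parent and its child. The structure layout and names are illustrative, not taken from the text; sem_init's nonzero pshared argument marks the semaphore as shareable between processes.

    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <err.h>
    #include <semaphore.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Illustrative layout: the semaphore lives in the shared region
       alongside the data that it protects. */
    struct shared {
            sem_t   lock;
            int     counter;
    };

    int
    main(void)
    {
            struct shared *sp;

            /* Anonymous shared mapping, inherited across fork. */
            sp = mmap(NULL, sizeof(*sp), PROT_READ | PROT_WRITE,
                MAP_SHARED | MAP_ANON, -1, 0);
            if (sp == MAP_FAILED)
                    err(1, "mmap");
            sem_init(&sp->lock, 1, 1);      /* pshared = 1: cross-process */

            if (fork() == 0) {
                    sem_wait(&sp->lock);    /* usually no kernel call */
                    sp->counter++;
                    sem_post(&sp->lock);
                    _exit(0);
            }
            sem_wait(&sp->lock);
            sp->counter++;
            sem_post(&sp->lock);
            wait(NULL);
            printf("counter = %d\n", sp->counter);
            return (0);
    }

Only when sem_wait finds the semaphore already held, or sem_post must awaken a waiter, does the library enter the kernel, matching the two cases described above.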

Mmap Model

When two processes wish to create an area of shared memory, they must have some way to name the piece of memory that they wish to share, and they must be able to describe its size and initial contents. The system interface accomplishes all these goals by using files as the basis for describing a shared memory segment. A process creates a shared memory segment by using

 caddr_t addr = mmap(
     caddr_t address,   /* base address */
     size_t length,     /* length of region */
     int protection,    /* protection of region */
     int flags,         /* mapping flags */
     int fd,            /* file to map */
     off_t offset);     /* offset to begin mapping */

to map the file referenced by descriptor fd, starting at file offset offset, into its address space starting at address and continuing for length bytes with access permission protection. The flags parameter allows a process to specify whether it wants to make a shared or private mapping. Changes made to a shared mapping are written back to the file and are visible to other processes. Changes made to a private mapping are not written back to the file and are not visible to other processes. Two processes that wish to share a piece of memory request a shared mapping of the same file into their address space. Thus, the existing and well-understood filesystem name space identifies shared objects. The contents of the file are used as the initial value of the memory segment. All changes made to the mapping are reflected back into the contents of the file, so long-term state can be maintained in the shared memory region, even across invocations of the sharing processes.
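As a usage sketch, a process might share the first 4 Kbyte of a file this way; the file name is hypothetical, and the file is assumed to be at least 4 Kbyte long.

    #include <sys/mman.h>
    #include <err.h>
    #include <fcntl.h>
    #include <unistd.h>

    int
    main(void)
    {
            int fd;
            char *p;

            fd = open("/var/db/counter", O_RDWR);   /* hypothetical file */
            if (fd < 0)
                    err(1, "open");
            /* A shared mapping: stores are written back to the file
               and are seen by every other process that has a shared
               mapping of it. */
            p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                MAP_SHARED, fd, 0);
            if (p == MAP_FAILED)
                    err(1, "mmap");
            p[0]++;                 /* update the shared state */
            munmap(p, 4096);
            close(fd);
            return (0);
    }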

Some applications want to use shared memory purely as a short-term interprocess-communication mechanism. They need an area of memory that is initially zeroed and whose contents are abandoned when they are done using it. Such processes want neither to pay the relatively high startup cost of paging in the contents of a file to initialize a shared memory segment nor to pay the shutdown cost of writing modified pages back to the file when they are done with the memory. Although FreeBSD does provide the limited and quirky naming scheme of the System V shmem interface as a rendezvous mechanism for such short-term shared memory, the designers ultimately decided that all naming of memory objects for mmap should use the filesystem name space. To provide an efficient mechanism for short-term shared memory, they created a virtual-memory-resident filesystem for transient objects. Unless memory is in high demand, files created in the virtual-memory-resident filesystem reside entirely in memory. Thus, both the initial paging and later write-back costs are eliminated. Typically, a virtual-memory-resident filesystem is mounted on /tmp. Two processes wishing to create a transient area of shared memory create a file in /tmp that they can then both map into their address space, as the sketch below shows.
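A minimal rendezvous along those lines might look like the following sketch; the path, size, and function name are hypothetical. Each cooperating process runs the same code: O_CREAT lets the open succeed no matter which process arrives first, and ftruncate extends a newly created file with zero-filled pages.

    #include <sys/mman.h>
    #include <err.h>
    #include <fcntl.h>
    #include <unistd.h>

    #define SCRATCH_PATH    "/tmp/ipc.scratch"      /* hypothetical name */
    #define SCRATCH_SIZE    65536

    void *
    scratch_segment(void)
    {
            int fd;
            void *p;

            fd = open(SCRATCH_PATH, O_RDWR | O_CREAT, 0600);
            if (fd < 0)
                    err(1, "open");
            if (ftruncate(fd, SCRATCH_SIZE) < 0)    /* zero filled */
                    err(1, "ftruncate");
            p = mmap(NULL, SCRATCH_SIZE, PROT_READ | PROT_WRITE,
                MAP_SHARED, fd, 0);
            if (p == MAP_FAILED)
                    err(1, "mmap");
            close(fd);      /* the mapping survives the close */
            return (p);
    }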

When a mapping is no longer needed, it can be removed using

 munmap(caddr_t address, size_t length); 

The munmap system call removes any mappings that exist in the address space, starting at address and continuing for length bytes. There are no constraints between previous mappings and a later munmap. The specified range may be a subset of a previous mmap, or it may encompass an area that contains many mmap'ed files. When a process exits, the system does an implied munmap over its entire address space.
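For instance, the sketch below (assuming a 4-Kbyte page size) maps two pages anonymously and then discards only the second one, which is a subset of the original mapping:

    #include <sys/mman.h>
    #include <err.h>

    char *
    map_one_keep_one(void)
    {
            char *p;

            p = mmap(NULL, 2 * 4096, PROT_READ | PROT_WRITE,
                MAP_SHARED | MAP_ANON, -1, 0);
            if (p == MAP_FAILED)
                    err(1, "mmap");
            /* The range passed to munmap need not match a previous
               mmap; here it covers only the second page. */
            if (munmap(p + 4096, 4096) < 0)
                    err(1, "munmap");
            return (p);     /* the first page remains mapped */
    }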

During its initial mapping, a process can set the protections on a page to allow reading, writing, and/or execution. The process can change these protections later by using

 mprotect(caddr_t address, size_t length, int protection); 

This feature can be used by debuggers when they are trying to track down a memory-corruption bug. By disabling writing on the page containing the data structure that is being corrupted, the debugger can trap all writes to the page and verify that they are correct before allowing them to occur.
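A simplified version of that technique is sketched below: write permission on the watched page is revoked, so every store delivers a SIGSEGV whose handler reports the faulting address and re-enables writing before the store is retried. The page is assumed to be page-aligned and 4 Kbyte in size; a real debugger would single-step the store, verify the new value, and re-protect the page, refinements omitted here.

    #include <sys/mman.h>
    #include <err.h>
    #include <signal.h>
    #include <stdio.h>

    #define PAGE_SIZE       4096            /* assumed for illustration */

    static char *watched;                   /* page under observation */

    static void
    on_write(int sig, siginfo_t *si, void *ctx)
    {
            /* Simplified: fprintf is not async-signal-safe; a real
               tool would record si_addr and examine the write. */
            fprintf(stderr, "write attempted at %p\n", si->si_addr);
            mprotect(watched, PAGE_SIZE, PROT_READ | PROT_WRITE);
    }

    void
    watch_page(char *page)
    {
            struct sigaction sa;

            sa.sa_sigaction = on_write;
            sa.sa_flags = SA_SIGINFO;
            sigemptyset(&sa.sa_mask);
            if (sigaction(SIGSEGV, &sa, NULL) < 0)
                    err(1, "sigaction");
            watched = page;
            if (mprotect(watched, PAGE_SIZE, PROT_READ) < 0)
                    err(1, "mprotect");     /* stores now trap */
    }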

Traditionally, programming for real-time systems has been done with specially written operating systems. In the interests of reducing the costs of real-time applications and of using the skills of the large body of UNIX programmers, companies developing real-time applications have expressed increased interest in using UNIX-based systems for writing these applications. Two fundamental requirements of a real-time system are guaranteed maximum latencies and predictable execution times. Predictable execution time is difficult to provide in a virtual-memory-based system, since a page fault may occur at any point in the execution of a program, resulting in a potentially large delay while the faulting page is retrieved from the disk or network. To avoid paging delays, the system allows a process to force its pages to be resident, and not paged out, by using

 mlock(caddr_t address, size_t length); 

As long as the process limits its accesses to the locked area of its address space, it can be sure that it will not be delayed by page faults. To prevent a single process from acquiring all the physical memory on the machine to the detriment of all other processes, the system imposes a resource limit to control the amount of memory that may be locked. Typically, this limit is set to no more than one-third of the physical memory, and it may be set to zero by a system administrator who does not want random processes to be able to monopolize system resources.

When a process has finished with its time-critical use of an mlock'ed region, it can release the pages using

 munlock(caddr_t address, size_t length); 

After the munlock call, the pages in the specified address range are still accessible, but they may be paged out if memory is needed and they are not accessed.
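A time-critical phase might be bracketed as in this sketch; the buffer is assumed to be page-aligned (for example, obtained from mmap), and the function name is illustrative.

    #include <sys/mman.h>
    #include <err.h>

    void
    time_critical_phase(char *buf, size_t len)
    {
            /* Wire the pages: accesses below cannot page fault.  The
               call fails if it would exceed the locked-memory
               resource limit. */
            if (mlock(buf, len) < 0)
                    err(1, "mlock");

            /* ... time-critical accesses to buf ... */

            /* The pages remain mapped but become pageable again. */
            if (munlock(buf, len) < 0)
                    err(1, "munlock");
    }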

An application may need to ensure that certain records are committed to disk without forcing the write-back of all of a file's dirty pages, as the fsync system call would do. For example, a database program may want to commit a single piece of metadata without writing back all the dirty blocks in its database file. A process does this selective synchronization using

 msync(caddr_t address, size_t length); 

Only those modified pages within the specified address range are written back to the filesystem. The msync system call has no effect on anonymous regions.
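For example, a database might flush only the page holding a critical record, as in the sketch below; a 4-Kbyte page size is assumed, and note that the call in current systems takes a third flags argument, where MS_SYNC asks that the call not return until the write has completed.

    #include <sys/mman.h>
    #include <err.h>

    /* db points to a MAP_SHARED mapping of the database file; commit
       only the page containing the record at byte offset off, leaving
       the file's other dirty pages to be written back later. */
    void
    commit_record(char *db, size_t off)
    {
            char *page = db + (off & ~(size_t)4095);

            if (msync(page, 4096, MS_SYNC) < 0)
                    err(1, "msync");
    }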

Shared Mapping

When multiple processes map the same file into their address space, the system must ensure that all the processes view the same set of memory pages. As shown in Section 5.4, each file that is being used actively by a client of the virtual-memory system is represented by an object. Each mapping that a process has to a piece of a file is described by a vm_map_entry structure. An example of two processes mapping the same file into their address space is shown in Figure 5.7. When a page fault occurs in one of these processes, the process's vm_map_entry references the object to find the appropriate page. Since all mappings reference the same object, the processes will all get references to the same set of physical memory, thus ensuring that changes made by one process will be visible in the address spaces of the other processes as well.

Figure 5.7. Multiple mappings to a file.


Private Mapping

A process may request a private mapping of a file. A private mapping has two main effects:

  1. Changes made to the memory mapping the file are not reflected back into the mapped file.

  2. Changes made to the memory mapping the file are not visible to other processes mapping the file.

An example of the use of a private mapping would be during program debugging. The debugger will request a private mapping of the program text so that, when it sets a breakpoint, the modification is not written back into the executable stored on the disk and is not visible to the other (presumably nondebugging) processes executing the program.

The kernel uses shadow objects to prevent changes made by a process from being reflected back to the underlying object. The use of a shadow object is shown in Figure 5.8. When the initial private mapping is requested, the file object is mapped into the requesting-process address space, with copy-on-write semantics. If the process attempts to write a page of the object, a page fault occurs and traps into the kernel. The kernel makes a copy of the page to be modified and hangs it from the shadow object. In this example, process A has modified page 0 of the file object. The kernel has copied page 0 to the shadow object that is being used to provide the private mapping for process A.

Figure 5.8. Use of a shadow object for a private mapping.


If free memory is limited, it would be better simply to move the modified page from the file object to the shadow object. The move would reduce the immediate demand on the free memory, because a new page would not have to be allocated. The drawback to this optimization is that, if there is a later access to the file object by some other process, the kernel will have to allocate a new page. The kernel will also have to pay the cost of doing an I/O operation to reload the page contents. In FreeBSD, the virtual-memory system always copies the page rather than moving it.

When a page fault for the private mapping occurs, the kernel traverses the list of objects headed by the vm_map_entry, looking for the faulted page. The first object in the chain that has the desired page is the one that is used. If the search gets to the final object on the chain without finding the desired page, then the page is requested from that final object. Thus, pages on a shadow object will be used in preference to the same pages in the file object itself. The details of page-fault handling are given in Section 5.11.
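In outline, the search works as in the sketch below. The structures are simplified stand-ins for the kernel's vm_object (the real code must also translate the page index as it steps between objects), and page_in is a hypothetical stand-in for the pager interface of Section 5.10.

    /* Simplified stand-ins for the kernel's structures. */
    struct page;
    struct object {
            struct object   *backing;       /* next object down the chain */
            struct page     *(*lookup)(struct object *, int);
    };

    struct page *page_in(struct object *, int);  /* hypothetical pager call */

    struct page *
    find_page(struct object *first, int pindex)
    {
            struct object *obj;
            struct page *p;

            /* Walk from the topmost shadow object toward the file
               object; the first copy found takes precedence. */
            for (obj = first;; obj = obj->backing) {
                    p = obj->lookup(obj, pindex);
                    if (p != NULL)
                            return (p);
                    if (obj->backing == NULL)
                            break;
            }
            /* The final object did not have the page either, so
               request it from that object (e.g., read the file). */
            return (page_in(obj, pindex));
    }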

When a process removes a mapping from its address space (either explicitly from an munmap request or implicitly when the address space is freed on process exit), pages held by its shadow object are not written back to the file object. The shadow-object pages are simply placed back on the memory free list for immediate reuse.

When a process forks, it does not want changes to its private mappings made after it forked to be visible in its child; similarly, the child does not want its changes to be visible in its parent. The result is that each process needs to create a shadow object if it continues to make changes in a private mapping. When process A in Figure 5.8 forks, a set of shadow-object chains is created, as shown in Figure 5.9. In this example, process A modified page 0 before it forked and then later modified page 1. Its modified version of page 1 hangs off its new shadow object, so those modifications will not be visible to its child. Similarly, its child has modified page 0. If the child were to modify page 0 in the original shadow object, that change would be visible in its parent. Thus, the child process must make a new copy of page 0 in its own shadow object.

Figure 5.9. Shadow-object chains.


If the system runs short of memory, the kernel may need to reclaim inactive memory held in a shadow object. The kernel assigns to the swap pager the task of backing the shadow object. The swap pager sets up data structures (described in Section 5.10) that can describe the entire contents of the shadow object. It then allocates enough swap space to hold the requested shadow pages and writes them to that area. These pages can then be freed for other uses. If a later page fault requests a swapped-out page, then a new page of memory is allocated and its contents are reloaded with an I/O from the swap area.

Collapsing of Shadow Chains

When a process with a private mapping removes that mapping either explicitly with an munmap system call or implicitly by exiting, its parent or child process may be left with a chain of shadow objects. Usually, these chains of shadow objects can be collapsed into a single shadow object, often freeing up memory as part of the collapse. Consider what happens when process A exits in Figure 5.9. First, shadow object 3 can be freed along with its associated page of memory. This deallocation leaves shadow objects 1 and 2 in a chain with no intervening references. Thus, these two objects can be collapsed into a single shadow object. Since they both contain a copy of page 0, and since only the page 0 in shadow object 2 can be accessed by the remaining child process, the page 0 in shadow object 1 can be freed along with shadow object 1 itself.

If the child of process A were to exit, then shadow object 2 and the associated page of memory could be freed. Shadow objects 1 and 3 would then be in a chain that would be eligible for collapse. Here, there are no common pages, so object 3 would retain its own page 1 and acquire page 0 from shadow object 1. Object 1 would then be freed. In addition to merging the pages from the two objects, the collapse operation requires a similar merger of any swap space that has been allocated by the two objects. If page 2 had been copied to object 3 and page 4 had been copied to object 1, but these pages were later reclaimed, the pager for object 3 would hold a swap block for page 2, and the pager for object 1 would hold a swap block for page 4. Before freeing object 1, its swap block for page 4 would have to be moved over to object 3.

A performance problem can arise if either a process or its children repeatedly fork. Without some intervention, they can create long chains of shadow objects. If the processes are long-lived, the system does not get an opportunity to collapse these shadow-object chains. Traversing these long chains of shadow objects to resolve page faults is time consuming, and many inaccessible pages can build up, forcing the system to page them out needlessly in order to reclaim the memory they hold.

One alternative would be to calculate the number of live references to a page after each copy-on-write fault. When only one live reference remained, the page could be moved to the shadow object that still referenced it. When all the pages had been moved out of a shadow object, it could be removed from the chain. For example, in Figure 5.9, when the child of process A wrote to page 0, a copy of page 0 was made in shadow object 2. At that point, the only live reference to page 0 in object 1 was from process A. Thus, the page 0 in object 1 could be moved to object 3. That would leave object 1 with no pages, so it could be reclaimed, leaving objects 2 and 3 pointing at the file object directly. Unfortunately, this strategy would add considerable overhead to the page-fault-handling routine, which would noticeably slow the overall performance of the system, so FreeBSD does not make this optimization.

FreeBSD uses a lower-cost heuristic to reduce the length of shadow chains. When taking a copy-on-write fault, it checks whether the object taking the fault completely shadows the object below it in the chain. When complete coverage is achieved, the faulting object can bypass the object below and reference the next object down the chain. Removing the reference has the same effect as if the referencing process had exited, allowing the normal shadow-chain collapse to be done through the sole remaining reference to the object. For example, in Figure 5.9, when the child of process A wrote to page 0, a copy of page 0 was made in shadow object 2. At that point object 2 completely shadows object 1, so object 2 can drop its reference to object 1 and point to the file object directly. With only one reference to object 1, the collapse of object 3 and object 1 will happen as already described. Determining whether one object completely covers the object below it is much cheaper than tracking live references, so FreeBSD takes this approach; the test is sketched below.
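Using the simplified structures from the earlier sketch, the coverage test amounts to a scan of the shadowed range; the real kernel must also count pages that the shadow object has written out to swap.

    /* Does shadow hold its own copy of every page it maps, so that
       no fault on it can ever reach the object it shadows? */
    static int
    completely_shadows(struct object *shadow, int npages)
    {
            int i;

            for (i = 0; i < npages; i++)
                    if (shadow->lookup(shadow, i) == NULL)
                            return (0);     /* a fault could fall through */
            return (1);     /* safe to bypass the backing object */
    }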

This heuristic works well in practice. Even with the heavily forking parent or child processes found in many network servers, shadow chains generally stabilize at a depth of about three or four levels.

Private Snapshots

When a process makes read accesses to a private mapping of an object, it continues to see changes made to that object by other processes that are writing to the object through the filesystem or that have a shared mapping to the object. When a process makes a write access to a private mapping of an object, a snapshot of the corresponding page of the object is made and is stored in the shadow object, and the modification is made to that snapshot. Thus, further changes to that page made by other processes that are writing to the page through the filesystem or that have a shared mapping to the object are no longer visible. However, changes to unmodified pages of the object continue to be visible. This mix of changing and unchanging parts of the file can be confusing.

To provide a more consistent view of a file, a process may want to take a snapshot of the file at the time that it is initially privately mapped. Historically, both Mach and 4.4BSD provided a copy object whose effect was to take a snapshot of an object at the time that the private mapping was set up. The copy object tracked changes to an object by other processes and kept original copies of any pages that changed. None of the other UNIX vendors implemented copy objects, and there are no major applications that depend on them. The copy-object code in the virtual-memory system was large and complex, and it noticeably slowed virtual-memory performance. Consequently, copy objects were deemed unnecessary and were removed from FreeBSD as part of the early cleanup and performance work done on the virtual-memory system. Applications that want to get a snapshot of a file can do so by reading it into their address space or by making a copy of it in the filesystem and then referring to the copy.
