Section 5.6. Creation of a New Process | The Design and Implementation of the FreeBSD Operating System

5.6. Creation of a New Process

Processes are created with a fork system call. The fork is usually followed shortly thereafter by an exec system call that overlays the virtual address space of the child process with the contents of an executable image that resides in the filesystem. The process then executes until it terminates by exiting, either voluntarily or involuntarily by receiving a signal. In Sections 5.6 to 5.9, we trace the management of the memory resources used at each step in this cycle.
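The cycle of fork, exec, and exit can be sketched from the point of view of a user program. The following is a minimal illustration rather than anything taken from the kernel; the helper name run_program is hypothetical, and the sketch assumes a POSIX environment.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Run the program at `path` with arguments `argv` and return its exit
 * status. Returns 127 if the image could not be executed, and -1 if
 * fork or waitpid failed or the child was terminated by a signal.
 * run_program is a hypothetical helper name, not a system interface.
 */
int
run_program(const char *path, char *const argv[])
{
    pid_t pid;
    int status;

    assert(path != NULL && argv != NULL);

    pid = fork();               /* duplicate the calling process */
    if (pid == -1)
        return (-1);            /* no child was created */

    if (pid == 0) {
        /* Child: overlay this address space with the new image. */
        execv(path, argv);
        _exit(127);             /* reached only if execv failed */
    }

    /* Parent: wait for the child to terminate. */
    if (waitpid(pid, &status, 0) == -1)
        return (-1);
    if (WIFEXITED(status))
        return (WEXITSTATUS(status));   /* voluntary exit */
    return (-1);                /* involuntary: killed by a signal */
}
```

A caller might invoke run_program("/bin/sh", args) with args holding "sh", "-c", a command string, and a terminating NULL pointer.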

A fork system call duplicates the address space of an existing process, creating an identical child process. The fork family of system calls is the only way that new processes are created in FreeBSD. In addition to copying the address space, fork duplicates all the other resources of the original process.
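That the child receives its own copy of the address space, rather than sharing the parent's, can be observed from user level. In the following sketch, which assumes a POSIX environment and uses the hypothetical names counter and fork_private_copy, the child modifies a variable in its data area and the parent then checks that its own copy is unchanged.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int counter = 1;     /* lives in the data area, which fork duplicates */

/*
 * Fork a child that modifies `counter`, then record what each process
 * observed. Returns 0 on success, -1 on failure.
 */
int
fork_private_copy(int *parent_sees, int *child_sees)
{
    pid_t pid;
    int status;

    assert(parent_sees != NULL && child_sees != NULL);

    counter = 1;
    pid = fork();
    if (pid == -1)
        return (-1);

    if (pid == 0) {
        /* fork returned 0, so this is the child. */
        counter = 42;       /* changes only the child's copy */
        _exit(counter);     /* report the value through the exit status */
    }

    /* fork returned the child's PID, so this is the parent. */
    if (waitpid(pid, &status, 0) == -1 || !WIFEXITED(status))
        return (-1);
    *child_sees = WEXITSTATUS(status);
    *parent_sees = counter; /* untouched by the child's write */
    return (0);
}
```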

The virtual-memory resources of the process that must be allocated for the child include the process structure and its associated substructures, and the user structure and kernel stack. In addition, the kernel must reserve storage (memory, filesystem space, or swap space) used to back the process. The general outline of the implementation of a fork is as follows:

Reserve virtual address space for the child process.
Allocate a process entry and thread structure for the child process, and fill them in.
Copy to the child the parent's process group, credentials, file descriptors, limits, and signal actions.
Allocate a new user structure and kernel stack, copying the current ones to initialize them.
Allocate a vmspace structure.
Duplicate the address space, by creating copies of the parent vm_map_entry structures marked copy-on-write.
Arrange for the child process to return 0, to distinguish its return value from the new PID that is returned to the parent process.

The allocation and initialization of the process structure, and the arrangement of the return value, were covered in Chapter 4. The remainder of this section discusses the other steps involved in duplicating a process.

Reserving Kernel Resources

The first resource to be reserved when an address space is duplicated is the required virtual address space. To avoid running out of memory resources, the kernel must ensure that it does not promise to provide more virtual memory than it is able to deliver. The total virtual memory that can be provided by the system is limited to the amount of physical memory available for paging plus the amount of swap space that is provided. A few pages are held in reserve to stage I/O between the swap area and main memory.
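The arithmetic behind this limit can be written down as a small model. The sketch below is illustrative only: the structure vm_budget, its fields, and both functions are hypothetical names invented here, and all quantities are counted in pages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bookkeeping for the system-wide virtual-memory promise. */
struct vm_budget {
    size_t pageable_pages;  /* physical memory available for paging */
    size_t swap_pages;      /* configured swap space */
    size_t reserved_pages;  /* held back to stage swap I/O */
    size_t promised_pages;  /* already promised to processes */
};

/* Total virtual memory that may be promised. */
size_t
vm_budget_limit(const struct vm_budget *b)
{
    size_t total;

    assert(b != NULL);
    total = b->pageable_pages + b->swap_pages;
    return (total > b->reserved_pages ? total - b->reserved_pages : 0);
}

/* Promise `pages` more pages only if they can still be delivered. */
bool
vm_budget_reserve(struct vm_budget *b, size_t pages)
{
    size_t limit = vm_budget_limit(b);

    if (b->promised_pages > limit || pages > limit - b->promised_pages)
        return (false);     /* the caller should fail the system call */
    b->promised_pages += pages;
    return (true);
}
```

With 1000 pageable pages, 500 swap pages, and 20 pages held in reserve, the model allows 1480 pages to be promised and refuses any request beyond that.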

The reason for this restriction is to ensure that processes get synchronous notification of memory limitations. Specifically, a process should get an error back from a system call (such as sbrk, fork, or mmap) if there are insufficient resources to allocate the needed virtual memory. If the kernel promises more virtual memory than it can support, it can deadlock trying to service a page fault. Trouble arises when it has no free pages to service the fault and no available swap space to save an active page. Here, the kernel has no choice but to send a kill signal to the process unfortunate enough to be page faulting. Such asynchronous notification of insufficient memory resources is unacceptable.
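From the application's side, synchronous notification means that a failed allocation shows up as an error return that can be checked on the spot. A minimal sketch, assuming a system that provides anonymous mappings through the MAP_ANON flag; the helper name reserve_anon is hypothetical.

```c
#define _DEFAULT_SOURCE     /* glibc: expose MAP_ANON under strict modes */
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Try to obtain `len` bytes of private anonymous memory. Returns 0 and
 * stores the address in `*out` on success, or the errno value reported
 * by mmap (for example ENOMEM) on failure.
 */
int
reserve_anon(size_t len, void **out)
{
    void *p;

    assert(out != NULL && len > 0);

    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
        MAP_PRIVATE | MAP_ANON, -1, 0);
    if (p == MAP_FAILED)
        return (errno);     /* the process learns of the shortage here */
    *out = p;
    return (0);
}
```

The caller can then scale back its request or report the failure, instead of being killed later while touching the memory.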

Excluded from this limit are those parts of the address space that are mapped read-only, such as the program text. Any pages that are being used for a read-only part of the address space can be reclaimed for another use without being saved because their contents can be refilled from the original source. Also excluded from this limit are parts of the address space that map shared files. The kernel can reclaim any pages that are being used for a shared mapping after writing their contents back to the filesystem from which they are mapped. Here, the filesystem is being used as an extension of the swap area. Finally, any piece of memory that is used by more than one process (such as an area of anonymous memory being shared by several processes) needs to be counted only once toward the virtual-memory limit.
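These exclusions amount to a rule for how many pages a region charges against the limit. The following sketch models that rule; the structure region, its fields, and the function region_charge are hypothetical and much simpler than any real kernel accounting.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of one region of an address space. */
struct region {
    size_t pages;
    bool writable;          /* false for read-only regions such as text */
    bool file_backed;       /* true if mapped from a file */
    bool shared;            /* true for shared mappings */
    bool already_counted;   /* shared anonymous memory charged elsewhere */
};

/* Number of pages this region counts toward the virtual-memory limit. */
size_t
region_charge(const struct region *r)
{
    assert(r != NULL);

    if (!r->writable)
        return (0);         /* can be refilled from the original source */
    if (r->shared && r->file_backed)
        return (0);         /* written back to the file, not to swap */
    if (r->shared && r->already_counted)
        return (0);         /* shared anonymous memory counts only once */
    return (r->pages);      /* private writable, or first sharer */
}
```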

The limit on the amount of virtual address space that can be allocated causes problems for applications that want to allocate a large piece of address space but want to use the piece only sparsely. For example, a process may wish to make a private mapping of a large database from which it will access only a small part. Because the kernel has no way to guarantee that the access will be sparse, it takes the pessimistic view that the entire file will be modified and denies the request. One extension that many BSD-derived systems have made to the mmap system call is to add a flag that tells the kernel that the process is prepared to accept asynchronous faults in the mapping. Such a mapping would be permitted to use up to the amount of virtual memory that had not been promised to other processes. If the process then modifies more of the file than this available memory, or if the limit is reduced by other processes allocating promised memory, the kernel can then send a segmentation-fault or a more specific out-of-memory signal to the process. On receiving the signal, the process must munmap an unneeded part of the file to release resources back to the system. The process must ensure that the code, stack, and data structures needed to handle the segmentation-fault signal do not reside in the part of the address space that is subject to such faults.
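A sparse private mapping of a large file might look like the sketch below. The text does not name the flag; MAP_NORESERVE is the name some systems (Linux and Solaris, for example) use for a flag of this kind, so it is applied only where the headers define it, and its availability and exact behavior should be treated as an assumption. The helper name sparse_private_demo and the temporary-file path are hypothetical.

```c
#define _DEFAULT_SOURCE     /* glibc: expose mkstemp and MAP_NORESERVE */
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create a sparse file of `len` bytes, map it privately, and modify a
 * single byte in the middle. Returns 0 on success, -1 on failure.
 */
int
sparse_private_demo(size_t len)
{
    char path[] = "/tmp/sparse-demo-XXXXXX";
    int fd, flags = MAP_PRIVATE;
    char *p;

    assert(len > 0);

    fd = mkstemp(path);
    if (fd == -1)
        return (-1);
    (void)unlink(path);     /* the file vanishes once it is closed */
    if (ftruncate(fd, (off_t)len) == -1) {
        (void)close(fd);
        return (-1);
    }

#ifdef MAP_NORESERVE
    /* Assumption: ask the kernel not to reserve backing store up front. */
    flags |= MAP_NORESERVE;
#endif
    p = mmap(NULL, len, PROT_READ | PROT_WRITE, flags, fd, 0);
    (void)close(fd);        /* the mapping remains valid after close */
    if (p == MAP_FAILED)
        return (-1);

    p[len / 2] = 1;         /* only this page needs private backing */

    return (munmap(p, len) == 0 ? 0 : -1);
}
```

A program that relies on such a mapping would also need a signal handler prepared as the paragraph above describes; the sketch omits it.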

Tracking the outstanding virtual memory accurately and determining when to limit further allocation is a complex task. Because most processes use only about half of their virtual address space, limiting outstanding virtual memory to the sum of process address spaces is needlessly conservative. However, allowing greater allocation runs the risk of running out of virtual-memory resources. Although FreeBSD calculates the outstanding-memory load, it does not enforce any total memory limit, so it can be made to promise more than it can deliver. When memory resources run out, it picks a process to kill, favoring processes with large memory use. An important future enhancement will be to develop a heuristic for determining when virtual-memory resources are in danger of running out and need to be limited.
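The choice of a process to kill can be illustrated with a deliberately crude model that always selects the largest consumer. The text says only that large processes are favored, so the strict maximum used here is a simplification, and the names proc_usage and pick_victim are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-process summary of memory use, in pages. */
struct proc_usage {
    int pid;
    size_t pages;
};

/*
 * Return the entry with the largest memory use, or NULL if the table is
 * empty. Ties are resolved in favor of the earliest entry.
 */
const struct proc_usage *
pick_victim(const struct proc_usage *procs, size_t n)
{
    const struct proc_usage *victim = NULL;
    size_t i;

    assert(procs != NULL || n == 0);

    for (i = 0; i < n; i++) {
        if (victim == NULL || procs[i].pages > victim->pages)
            victim = &procs[i];
    }
    return (victim);
}
```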

Duplication of the User Address Space

The next step in fork is to allocate and initialize a new process structure. This operation must be done before the address space of the current process is duplicated because the duplication records state in the process structure. From the time that the process structure is allocated until all the needed resources are allocated, the parent process is locked against swapping to avoid deadlock. The child is in an inconsistent state and cannot yet run or be swapped, so the parent is needed to complete the copy of its address space. To ensure that the child process is ignored by the scheduler, the kernel sets the process's state to NEW during the entire fork procedure.

Historically, the fork system call operated by copying the entire address space of the parent process. When large processes fork, copying the entire user address space is expensive. All the pages that are on secondary storage must be read back into memory to be copied. If there is not enough free memory for both complete copies of the process, this memory shortage will cause the system to begin paging to create enough memory to do the copy (see Section 5.12). The copy operation may result in parts of the parent and child processes being paged out, as well as the paging out of parts of unrelated processes.

The technique used by FreeBSD to create processes without this overhead is called copy-on-write. Rather than copy each page of a parent process, both the child and parent processes resulting from a fork are given references to the same physical pages. The page tables are changed to prevent either process from modifying a shared page. Instead, when a process attempts to modify a page, the kernel is entered with a protection fault. On discovering that the fault was caused by an attempt to modify a shared page, the kernel simply copies the page and changes the protection field for the page to allow modification once again. Only pages modified by one of the processes need to be copied. Because processes that fork typically overlay the child process with a new image with exec shortly thereafter, this technique significantly improves the performance of fork.
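The mechanism can be modeled at the level of a single page. In the sketch below, a reference count stands in for the sharing of a physical page and a writable flag stands in for the protection field of the page-table entry; the types and functions are hypothetical, and the real kernel manages this through its own data structures rather than per-page counters like these.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_BYTES 64       /* tiny "page" for the model */

struct page {
    int refs;               /* number of mappings that reference it */
    char data[PAGE_BYTES];
};

struct mapping {
    struct page *pg;
    int writable;           /* models the page-table protection field */
};

/* Give a child a reference to the parent's page; both lose write access. */
struct mapping
cow_share(struct mapping *parent)
{
    struct mapping child;

    assert(parent != NULL && parent->pg != NULL);

    parent->writable = 0;
    parent->pg->refs++;
    child.pg = parent->pg;
    child.writable = 0;
    return (child);
}

/*
 * Write one byte through a mapping, resolving the modeled protection
 * fault first if needed. Returns 0 on success, -1 if no memory is left.
 */
int
cow_write(struct mapping *m, size_t off, char value)
{
    assert(m != NULL && m->pg != NULL && off < PAGE_BYTES);

    if (!m->writable) {
        if (m->pg->refs > 1) {
            /* Still shared: copy the page for this mapping alone. */
            struct page *copy = malloc(sizeof(*copy));

            if (copy == NULL)
                return (-1);
            memcpy(copy->data, m->pg->data, PAGE_BYTES);
            copy->refs = 1;
            m->pg->refs--;
            m->pg = copy;
        }
        m->writable = 1;    /* sole owner: allow modification again */
    }
    m->pg->data[off] = value;
    return (0);
}
```

Note that the second process to write finds itself the only remaining reference and simply regains write access without a copy.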

The next step in fork is to traverse the list of vm_map_entry structures in the parent and to create a corresponding entry in the child. Each entry must be analyzed and the appropriate action taken:

If the entry maps a read-only or shared region, the child can take a reference to it.
If the entry maps a privately mapped region (such as the data area or stack), the child must create a copy-on-write mapping of the region. The parent's mapping of the region must also be converted to copy-on-write. If either process later tries to write the region, it will create a shadow object to hold the modified pages.
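The two cases above can be sketched as a loop over the parent's entries. The structures below are hypothetical, greatly simplified stand-ins for the kernel's vm_map_entry and its backing object; they carry only the fields needed to show the decision made for each entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the object that backs a region. */
struct sketch_object {
    int refs;
};

/* Hypothetical, much-simplified stand-in for vm_map_entry. */
struct sketch_entry {
    struct sketch_object *object;
    bool read_only;
    bool shared;
    bool copy_on_write;
};

/* Build the child's entries from the parent's, one entry at a time. */
void
sketch_fork_entries(struct sketch_entry *parent, struct sketch_entry *child,
    size_t n)
{
    size_t i;

    assert((parent != NULL && child != NULL) || n == 0);

    for (i = 0; i < n; i++) {
        /* In every case the child references the same backing object. */
        child[i] = parent[i];
        child[i].object->refs++;

        /* Private, writable regions become copy-on-write in both. */
        if (!parent[i].read_only && !parent[i].shared) {
            parent[i].copy_on_write = true;
            child[i].copy_on_write = true;
        }
    }
}
```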

Map entries for a process are never merged (simplified). Only entries for the kernel map itself can be merged. The kernel-map entries need to be simplified so that excess growth is avoided. It might be worthwhile to do such a merge of the map entries for a process when it forks, especially for large or long-running processes.

With the virtual-memory resources allocated, the system sets up the kernel- and user-mode state of the new process, including the hardware memory-management registers, user structure, and stack. It then clears the NEW flag and places the process's thread on the run queue; the new process can then begin execution.

Creation of a New Process Without Copying

When a process (such as a shell) wishes to start another program, it will generally fork, do a few simple operations such as redirecting I/O descriptors and changing signal actions, and then start the new program with an exec. In the meantime, the parent shell suspends itself with wait until the new program completes. For such operations, it is not necessary for both parent and child to run simultaneously, and therefore only one copy of the address space is required. This frequently occurring set of system calls led to the implementation of the vfork system call. Although it is extremely efficient, vfork has peculiar semantics and is generally considered to be an architectural blemish.

The implementation of vfork will always be more efficient than the copy-on-write implementation because the kernel avoids copying the address space for the child. Instead, the kernel simply passes the parent's address space to the child and suspends the parent. The child process does not need to allocate any virtual-memory structures, receiving the vmspace structure and all its pieces from its parent. The child process returns from the vfork system call with the parent still suspended. The child does the usual activities in preparation for starting a new program, then calls exec. At that point, the address space is passed back to the parent process, rather than being abandoned, as in a normal exec. Alternatively, if the child process encounters an error and is unable to execute the new program, it will exit. Again, the address space is passed back to the parent instead of being abandoned.
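A user-level sketch of the usual vfork pattern follows, assuming a system that still provides vfork. Because the child runs in the parent's address space, it does nothing except call exec or _exit. The helper name run_program_vfork is hypothetical.

```c
#define _DEFAULT_SOURCE     /* glibc: expose vfork under strict modes */
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Run the program at `path` with arguments `argv` using vfork. Returns
 * the exit status, 127 if the image could not be executed, or -1 if
 * vfork or waitpid failed or the child was killed by a signal.
 */
int
run_program_vfork(const char *path, char *const argv[])
{
    pid_t pid;
    int status;

    assert(path != NULL && argv != NULL);

    pid = vfork();          /* the parent is suspended until exec or exit */
    if (pid == -1)
        return (-1);

    if (pid == 0) {
        /* Child: borrows the parent's address space, so touch nothing. */
        execv(path, argv);
        _exit(127);         /* _exit, not exit: leave parent state alone */
    }

    /* Parent: resumes here once the address space has been handed back. */
    if (waitpid(pid, &status, 0) == -1)
        return (-1);
    if (WIFEXITED(status))
        return (WEXITSTATUS(status));
    return (-1);
}
```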

With vfork, the entries describing the address space do not need to be copied, and the page-table entries do not need to be marked and then cleared of copy-on-write. Vfork is likely to remain more efficient than copy-on-write or other schemes that must duplicate the process's virtual address space. The architectural quirk of the vfork call is that the child process may modify the contents and even the size of the parent's address space while the child has control. Although modification of the parent's address space is bad programming practice, some programs have been known to take advantage of this quirk.