The fork1() Kernel Routine

Today the two system calls fork() and vfork() enter the kernel at the same point: fork1(). This kernel routine looks for the passed forktype argument (set to either FORK_PROCESS or FORK_VFORK) to determine which set of rules to play by. In either case, a number of kernel procedures are used. Let's first examine the actions for forktype=FORK_PROCESS, the full copy-on-write version.
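Before we dig into the FORK_PROCESS path, the dispatch itself can be pictured with a minimal sketch. Only the FORK_PROCESS and FORK_VFORK names come from the discussion above; the types, the helper, and the behavior shown here are simplified assumptions, not the HP-UX source.

#include <stdio.h>

/* Sketch only: the forktype values come from the text; the helper name
 * and everything else is an illustrative stand-in. */
enum forktype { FORK_PROCESS, FORK_VFORK };

/* hypothetical stand-in for newproc(), which builds the child's
 * proc/kthread structures along one of two paths */
static int newproc_sketch(enum forktype ft)
{
    if (ft == FORK_VFORK)
        printf("vfork path: child borrows the parent's vas\n");
    else
        printf("fork path: duplicate the memory view copy-on-write\n");
    return 0;
}

/* both fork() and vfork() funnel into this single entry point */
static int fork1_sketch(enum forktype ft)
{
    return newproc_sketch(ft);
}

int main(void)
{
    fork1_sketch(FORK_PROCESS);
    fork1_sketch(FORK_VFORK);
    return 0;
}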

The fork() System Call Mechanics

After entering fork1(), calls are made to obtain unique process and thread identification numbers. The current release of HP-UX reserves pid 0 through 7 for system use (0 for the swapper, 1 for init, 2 for page-out, 3 for statdaemon, 4 for the unhash daemon, 5 for netisr, 6 for the socket registration daemon, and 7 for commit). Thread identification numbers are unique within the scope of the kernel, and in a similar manner tids 0 to 7 are reserved for the system threads associated with the first eight system processes.

NOTE

In the future, HP-UX may be modified to allow each process to start numbering its threads with kt_tid = 0, but for now all threads have a unique tid.


Once we have a unique process and thread ID, we allocate the process table and thread table structures. Kernel routines check for available structures by following the freeproc_list and freekthread pointers. Although both of these structures are currently maintained as dynamic lists, the current implementation does not return unused process and thread structures to the kernel memory arenas; instead the structures are returned to the appropriate free list when their process or thread is deconstructed.

These free lists maintain a kind of high-water mark for structure utilization and save time when the kernel is asked to find new process or thread structures. If a free list is empty and we haven't exceeded the respective kernel parameter nproc or nkthread (the tunable maximums), the kernel memory allocator is asked to provide space for the new structure.
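The allocation policy amounts to this: reuse a structure from the free list if one is available, otherwise grow as long as the tunable maximum has not been reached. The sketch below illustrates that policy with simplified, assumed names (proc_sketch, nproc_max, and so on); only the free-list reuse and the nproc cap come from the discussion above.

#include <stdlib.h>

/* Simplified sketch of free-list reuse capped by a tunable maximum. */
struct proc_sketch {
    struct proc_sketch *next_free;
    /* ... the real proc fields would follow ... */
};

static struct proc_sketch *free_head;   /* stand-in for the proc free list */
static int nproc_used;                  /* structures handed out so far */
static int nproc_max = 4096;            /* stand-in for the nproc tunable */

struct proc_sketch *alloc_proc_sketch(void)
{
    if (free_head) {                    /* reuse a previously freed entry */
        struct proc_sketch *p = free_head;
        free_head = p->next_free;
        return p;
    }
    if (nproc_used >= nproc_max)        /* table is at its tunable limit */
        return NULL;
    nproc_used++;                       /* grow: ask the allocator for space */
    return calloc(1, sizeof(struct proc_sketch));
}

void free_proc_sketch(struct proc_sketch *p)
{
    /* structures go back on the free list, not to the memory arena */
    p->next_free = free_head;
    free_head = p;
}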

As our new structures are allocated, they are initialized and set to the idle state. The newly allocated proc and kthread structures are linked together and attached to a variety of hash chains.

  • tidhash[] links kthread structures to chains based on the tid.

  • pidhash[] links proc structures to chains based on the pid.

  • uidhash[] links proc structures to chains based on the UID.

  • sidhash[] links proc structures to chains based on the session ID.

These hashtables provide quick access to the process and thread structures based on common attributes. Once this work is completed, the process and thread table entries are linked to their appropriate active lists.
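A hashed lookup of this kind can be sketched as follows. The table size, hash function, and chain-link field name are assumptions made for illustration; only the idea of pid-keyed chains of proc structures comes from the list above.

#include <stddef.h>

/* Illustrative pid hash: each bucket heads a singly linked chain of proc
 * structures whose pids hash to the same value. */
#define PIDHASH_SZ 256                      /* assumed table size */
#define PIDHASH(pid) ((pid) & (PIDHASH_SZ - 1))

struct proc_sketch {
    int                 p_pid;
    struct proc_sketch *p_hashnext;         /* next entry on this chain */
};

struct proc_sketch *pidhash_sketch[PIDHASH_SZ];

/* add a newly allocated proc to the chain selected by its pid */
void pid_hash_insert(struct proc_sketch *p)
{
    int bucket = PIDHASH(p->p_pid);
    p->p_hashnext = pidhash_sketch[bucket];
    pidhash_sketch[bucket] = p;
}

/* quick pid-to-proc lookup: hash, then walk the (short) chain */
struct proc_sketch *pid_hash_find(int pid)
{
    struct proc_sketch *p = pidhash_sketch[PIDHASH(pid)];
    while (p != NULL && p->p_pid != pid)
        p = p->p_hashnext;
    return p;
}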

Next, control is passed to the kernel routine newproc(). Before we continue with the creation of our new process, a sanity check is made on the thread state of the new child's kthread structure. If it is anything other than kt_stat=TSIDL, the kernel panics, as we shouldn't be here under any other circumstance. Assuming we pass this test, the child's proc and kthread structures inherit basic data from the parent's environment, such as the UID and GID. We reload the child's scheduling and usage counters and temporarily disable swapping for the parent (after all, we will be copying the parent's memory view to the child and don't want vhand attempting to swap it out while we are making our copy).
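That sanity check boils down to a guard like the one sketched here. kt_stat and TSIDL come from the text; the panic wrapper, the structure layout, and the short list of states are simplified for illustration.

#include <stdio.h>
#include <stdlib.h>

/* Sketch of the TSIDL sanity check performed in newproc(). */
enum kt_state { TSIDL, TSRUN /* other thread states elided */ };

struct kthread_sketch {
    enum kt_state kt_stat;
};

void panic_sketch(const char *msg)      /* stand-in for the kernel panic */
{
    fprintf(stderr, "panic: %s\n", msg);
    abort();
}

void newproc_check(struct kthread_sketch *child)
{
    /* a brand-new child thread must still be idle; any other state means
     * we arrived here by an impossible path */
    if (child->kt_stat != TSIDL)
        panic_sketch("newproc: child kthread not in TSIDL");
}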

Now, as we are rounding the far turn and entering the home stretch, control is passed to procdup(), where we start duplicating the parent vas, pregion, and private region structures to create a new memory view for the child.

Copying a Process Memory View

The kernel manages regions of pages to be used by a process for any of a number of purposes. Each process maintains its own virtual memory view, facilitated by assigning virtual space values to work in conjunction with an offset address for each region. Since the virtual view belongs primarily to the process, it is in the process pregion that p_space and p_vaddr (the space and offset used to create a virtual address) are recorded. The entire vas/pregion structure of the parent process must be duplicated for the child and attached to its proc structure (using p_vas).

The kernel checks each pregion in the parent's list to see if it points to a region of type RT_SHARED or RT_PRIVATE. In the case of a shared region, the kernel needs only copy the pregion to the child's list and increment the reference count (r_refcnt) in the region. In the case of a private region the kernel must build a duplicate region and pregion and map it to the child's list. This involves duplicating the region's page list (containing the b-tree's broot, bnodes, and chunks) and reserving swap space for it and for the potential number of pages it could map. A unique "space" value must also be obtained so that the child will have its own virtual view.
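In outline, that decision looks like the sketch below. RT_SHARED, RT_PRIVATE, and r_refcnt appear in the text; the structure layouts and the dup_private_region() helper are simplified stand-ins for the page-list duplication, swap reservation, and space allocation just described.

#include <stdlib.h>
#include <string.h>

/* Simplified sketch of the per-pregion decision made in procdup(). */
enum r_type { RT_PRIVATE, RT_SHARED };

struct region_sketch {
    enum r_type r_type;
    int         r_refcnt;           /* processes referencing this region */
};

struct pregion_sketch {
    struct region_sketch  *p_reg;
    struct pregion_sketch *p_next;  /* next pregion in this vas */
};

/* stand-in for the real duplication work (page list, swap reservation,
 * new space id); here we only clone the descriptor */
static struct region_sketch *dup_private_region(struct region_sketch *src)
{
    struct region_sketch *copy = malloc(sizeof(*copy));
    memcpy(copy, src, sizeof(*copy));
    copy->r_refcnt = 1;
    return copy;
}

/* returns the region the child's new pregion should reference */
struct region_sketch *region_for_child(struct pregion_sketch *prp)
{
    if (prp->p_reg->r_type == RT_SHARED) {
        prp->p_reg->r_refcnt++;             /* share it: bump the reference count */
        return prp->p_reg;
    }
    return dup_private_region(prp->p_reg);  /* private: build a duplicate */
}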

This is where copy-on-write comes into play. As a duplicate private region is created, all of its vfds are set to copy-on-write, and any in-core page frames have their usage count incremented (pf_use++) in their pfdat. The page's entry in the pdir is flagged for copy-on-write behavior (this includes updating multiple pdes in the case of an aliased page), and any existing tlb entries are purged. This ensures a page fault on first access and allows the fault handler to update access information and perform any physical page copying at that time, if necessary. In the case of a child that makes an immediate exec(), this minimizes the amount of page copying done at the time of the fork().
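Marking a duplicated private region copy-on-write amounts to a loop along these lines. pf_use and pg_cw come from the text; the page structures and the purge helper are simplified assumptions, and the per-pde aliasing detail is omitted.

#include <stdbool.h>

/* Sketch of flagging a duplicated private region's pages copy-on-write. */
struct pfdat_sketch {
    int pf_use;                     /* number of views sharing this frame */
};

struct vfd_sketch {
    bool                 pg_cw;     /* copy-on-write flag */
    bool                 in_core;   /* page frame currently resident? */
    struct pfdat_sketch *pfd;       /* physical frame, if resident */
};

void purge_tlb_entry(struct vfd_sketch *vfd)    /* stand-in for the tlb purge */
{
    (void)vfd;      /* real code flushes any cached translation for this page */
}

void mark_region_cow(struct vfd_sketch *vfds, int npages)
{
    for (int i = 0; i < npages; i++) {
        vfds[i].pg_cw = true;           /* force a fault on first access */
        if (vfds[i].in_core)
            vfds[i].pfd->pf_use++;      /* frame now shared by parent and child */
        purge_tlb_entry(&vfds[i]);      /* drop any stale translation */
    }
}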

Let's examine what happens when a copy-on-write page is accessed.

Copy-on-Write, First Read Access from the Parent

If the first access to the page is by the parent, the system experiences a tlb fault (since we purged the tlbs during the region duplication process). The resulting fault loads the translation from the pdir and sets the copy-on-write access bit.

We try one more time and, as there is now a valid translation in the tlb and this is a read request, access is granted (effectively creating a shared read access mode).

Copy-on-Write, First Read Access from a Child

If the first access to the page is by a child, a tlb fault occurs, no pde entry will exist (the child was placed in its own space, and its virtual page numbers have not been mapped in the pdir), and the kernel page fault handler is called next.

This handler determines the faulting thread's identity and follows pointers from its kthread structure to the proc structure to the vas to the pregion list. The pregions are searched (using their skiplist links) to find the one containing the faulting virtual page. Finally, we calculate the page offset within the region and search the b-tree to locate the vfd/dbd data for the faulted page. This search will reveal a valid vfd entry with pg_cw=1.
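That pointer walk can be pictured roughly as follows. The structure layouts, the flattened skip-list, and the btree_lookup() stub are assumptions made for illustration; only the order of the walk comes from the text.

#include <stddef.h>

/* Rough sketch of the handler's walk from the faulting thread to the
 * vfd for the faulted page. */
struct vfd_sketch { int pg_cw; };

struct pregion_sketch {
    unsigned long          p_vaddr;     /* start of this mapping */
    unsigned long          p_len;       /* length in bytes */
    struct pregion_sketch *p_next;      /* skip-list link (flattened here) */
};
struct vas_sketch     { struct pregion_sketch *va_plist; };
struct proc_sketch    { struct vas_sketch *p_vas; };
struct kthread_sketch { struct proc_sketch *kt_procp; };

/* stand-in for the search of the region's b-tree (broot, bnodes, chunks) */
struct vfd_sketch *btree_lookup(struct pregion_sketch *prp,
                                unsigned long page_offset)
{
    (void)prp; (void)page_offset;
    return NULL;                        /* real code returns the vfd/dbd pair */
}

struct vfd_sketch *find_faulting_vfd(struct kthread_sketch *kt,
                                     unsigned long fault_addr)
{
    struct vas_sketch *vas = kt->kt_procp->p_vas;   /* kthread -> proc -> vas */
    for (struct pregion_sketch *prp = vas->va_plist; prp; prp = prp->p_next) {
        if (fault_addr >= prp->p_vaddr &&
            fault_addr <  prp->p_vaddr + prp->p_len) {
            /* the offset within the region selects the entry in the b-tree */
            return btree_lookup(prp, fault_addr - prp->p_vaddr);
        }
    }
    return NULL;                        /* no pregion maps this address */
}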

Since this is a first access, the kernel also needs to add an alias entry to the pfn_to_virt table entry for this physical page before we continue. In this manner, we postpone the creation of the alias links until it is absolutely necessary. We also make an entry in the pdir for the new virtual page number. If the pde pointed to by the initial hash for the new virtual page is not available and we have to link to a sparse pde, we allocate it from the aa_pdirfreelist.

At this point the instruction is retried, but since the tlb has not been updated, it will fault. The handler finds the new valid pde in the pdir and uses it to update the tlb, setting the copy-on-write access bit in the process. We try one more time and finally our read request is allowed.

Copy-on-Write, First Write Access

As with a first read request, the virtual page translation will have been purged from the tlbs. For a parent the pdir will exist, and for a child there will not be a pdir entry. We mirror the actions taken during the first read mechanics we just discussed.

Once we get to the point where we have valid pdir and tlb entries, the instruction is retried but fails this time with an access protection fault (we tried to write to a page with the pg_cw bit set).

Once we enter the fault handler, the current page use count is checked. If pf_use=1 (in pfdat), then we know we are the last to reference the page and we simply change the access rights by clearing the copy-on-write bits in the tlb, pde, and vfd for the page. The instruction is retried, and this time around it should succeed. If, however, pf_use>1, then the kernel must create a copy of the page for use by this process.
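The handler's choice reduces to this: the last referencer keeps the frame, anyone else takes a private copy. In the sketch below the copy_frame() helper stands in for the page copy described in the next two paragraphs, the copy-on-write bits in the tlb, pde, and vfd are collapsed into a single flag, and the structure names are assumptions.

#include <stdbool.h>
#include <stdlib.h>

/* Sketch of the copy-on-write write-fault decision. */
struct pfdat_sketch { int pf_use; };

struct page_sketch {
    bool                 pg_cw;     /* tlb/pde/vfd copy-on-write bits, collapsed */
    struct pfdat_sketch *pfd;       /* physical frame backing this page */
};

/* stand-in for the frame copy described below: grab a free frame, copy
 * the data, and rebuild the translations */
static struct pfdat_sketch *copy_frame(struct pfdat_sketch *src)
{
    struct pfdat_sketch *dst = calloc(1, sizeof(*dst));
    (void)src;                      /* real code copies the page contents here */
    return dst;
}

void cow_write_fault(struct page_sketch *pg)
{
    if (pg->pfd->pf_use == 1) {
        pg->pg_cw = false;          /* last referencer: just drop the protection */
    } else {
        struct pfdat_sketch *newfrm = copy_frame(pg->pfd);
        pg->pfd->pf_use--;          /* the original frame loses one user */
        newfrm->pf_use = 1;
        pg->pfd = newfrm;           /* this process now maps its private copy */
        pg->pg_cw = false;
    }
}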

To copy the page, we first need to get a write lock so no one may change its contents while our copy is in progress. We remove the current virtual-to-physical translations for the page (purge the tlb entry, free the pdir, and invalidate the vfd), as we will be creating new ones. Next, we find a free pfdat entry (or block until one is available) and update the appropriate virtual-to-physical and physical-to-virtual tables.

We set the access type in the pde to PDE_AR_KRW for our new page to keep anyone from accessing it as user data until we finish the copy. Once the copy is completed, the original page's pf_use is decremented and the tlb entry for the new page is purged. This purge is necessary because the entry was loaded during the copy process, and we need the next access to fault and reload the access rights from the newly created pdir entry. Now we are ready to continue using our new private copy of the data.

Copy-on-Write for the Parent, Copy-on-Access for the Child

Before we move on, let's talk a bit about the older copy-on-write/access mechanics. In most cases, it behaves similarly to copy-on-write except that virtual page alias entries are not utilized and all first access by a child results in a private copy being made.

If the kernel parameter COW=1, copy-on-write is used for the duplication of all r_type=RT_PRIVATE regions. Currently, the HP-UX kernel defaults to COW=0 and uses copy-on-write/access. The exception is the case of private text regions encountered in EXEC_MAGIC programs, for which copy-on-write is always used (first implemented with the HP-UX 10.0 release). This parameter is not user-tunable in HP-UX 11.x and may only be changed using adb (not for the faint of heart).

In HP-UX 10.x, COW was a user-tunable parameter. Why this was changed remains a bit of a mystery, since Hewlett-Packard has stated it intends to move completely to copy-on-write in the future.

It's time to return to our discussion of duplicating a process's memory view during a fork(). A special case exists when the region to be copied holds the current thread's uarea.

Creating the New uarea

The pregion referencing the parent's uarea is somewhat of a special case and is replaced by one for a new uarea created for the child's initial thread. Swap reservation must be performed to reserve space for the uarea and its b-tree structure. When a new uarea is created, its pregion is flagged as PF_NOPAGE to keep it from being paged by vhand. A thread's uarea may be paged but only if the entire process has been marked for deallocation/deactivation during times of extreme memory pressure.

Ready, Set, Run Queue

As soon as all the new kernel structures are created, we are ready to allow the new thread to compete for runtime. We change its proc table state to p_stat=SINUSE, the thread state to kt_stat=TSRUN, and link the kthread to the end of the appropriate run queue in accordance with its kt_spu_wanted value. We also flag the parent to allow swapping again, as we are through making copies.
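Placing the thread on a run queue is essentially a tail insertion keyed by the spu it wants, roughly as sketched here; the per-spu queue layout and names are assumptions for illustration.

#include <stddef.h>

/* Sketch of linking a runnable kthread to the tail of the run queue for
 * the spu named by kt_spu_wanted. */
struct kthread_sketch {
    int                    kt_spu_wanted;    /* spu this thread wants to run on */
    struct kthread_sketch *kt_link;          /* next thread on the queue */
};

struct runq_sketch {
    struct kthread_sketch *head;
    struct kthread_sketch *tail;
};

#define NSPU_SKETCH 8
struct runq_sketch runq_sketch_tbl[NSPU_SKETCH];  /* one queue per spu (assumed) */

void setrq_sketch(struct kthread_sketch *kt)
{
    struct runq_sketch *q = &runq_sketch_tbl[kt->kt_spu_wanted % NSPU_SKETCH];
    kt->kt_link = NULL;
    if (q->tail)
        q->tail->kt_link = kt;              /* append behind the current tail */
    else
        q->head = kt;                       /* the queue was empty */
    q->tail = kt;
}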

In general, a new process and its first thread start on the same spu as the parent thread that created it. This makes sense because both initially share the same execution environment and need access to the same text page for their next instruction. See Listing 9.1.

Listing 9.1. q4> fields struct kthread (an edited listing)

These fields store the spu number of the run queue that the thread is
currently on, the spu it would like to be on, and the spu group number
(processor set) that it wants to be part of

 92 0  4 0 int      kt_spu
 96 0  4 0 int      kt_spu_wanted
100 0  4 0 int      kt_spu_group

The next field is used to express a mandatory processor locking strategy
(sometimes called a processor affinity lock)

104 0  1 0 u_char   kt_spu_mandatory

By sharing the same processor, we minimize cache-coherency issues that might arise if the parent and child were scheduled on separate processors. Various load-balancing mechanisms are employed by the HP-UX kernel to disperse threads to all the processors and locality domains on multiple processor systems. We cover processor load-balancing in greater detail when we talk about multiprocessor systems in Chapter 12, "Multiprocessing and HP-UX."

As we unwind from this litany of kernel procedural calls, an interesting situation occurs. We started down this path when a thread made a fork() system call. In most cases, every call results in a single return at some future point in time. With the fork() call, two separate returns result from a successful call, one in the parent context and one in the newly created child context. Upon returning from the fork(), the pid of the newly created child is passed to the parent thread, while the new child thread receives a 0 to let it know that it is the new kid on the block.
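The two returns are easy to observe from user space. The program below is the standard fork() idiom, not anything HP-UX specific: the parent sees the child's pid, the child sees 0.

#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();             /* one call, two returns */

    if (pid < 0) {
        perror("fork");
        return 1;
    }
    if (pid == 0) {
        /* child context: fork() returned 0 */
        printf("child:  my pid is %d\n", (int)getpid());
        return 0;
    }
    /* parent context: fork() returned the child's pid */
    printf("parent: created child %d\n", (int)pid);
    waitpid(pid, NULL, 0);
    return 0;
}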

The vfork() System Call Mechanics

When you call vfork(), you enter the same kernel procedure, fork1(), but there are several deviations from the fork() sequence we just examined. The first point of departure occurs when control is passed to newproc(). In the case of forktype=FORK_VFORK, the newly created child shares the parent's memory view, which is accomplished by having the child's proc table simply point to the parent's vas structure, as shown in Figure 9-3.

Figure 9-3. The vfork()



To preserve the sanctity of the parent's data, the kernel allocates a vforkinfo buffer. This structure is used to hold state information about the vfork() and a copy of the parent's uarea and stack. See Listing 9.2.

Listing 9.2. q4> fields struct vforkinfo (an edited listing)
The first field is used to store the state of the vfork. Since this is
visible to both the parent and the child, the current state of the call
only needs to be stored in this one location

 0 0 4 0 enum4   vfork_state

Next we store a pointer to the parent and child kthread structure

 4 0 4 0 *       pthreadp
 8 0 4 0 *       cthreadp

We also store the number of buffer pages and the size of the parent
thread's uarea (which will be stored in the buffer)

12 0 4 0 int     buffer_pages
16 0 4 0 u_long  u_and_stack_len

Space is provided to save the rp and a pointer to the last vforkbuffer
(to handle nesting of vforks)

20 0 4 0 *       saved_rp_ptr
24 0 4 0 long    saved_rp

This pointer directs us to the beginning of the uarea copy in the buffer

28 0 4 0 *       u_and_stack_buf

Next is the pointer to the previous p_vforkbuf

32 0 4 0 *       prev

And finally a pointer to the parent thread's uarea

36 0 4 0 *       p_upreg

When the buffer is initialized, it is linked by the pointer p_vforkbuf in the proc tables of both the parent and child processes, and its state is set to vfork_state=VFORK_INIT. The structure remains in use until the child calls either exit() or exec(). As the waiting parent resumes operation, the parent's uarea and stack are restored from the buffer and its space is then freed.
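That initialization step can be sketched as follows. vfork_state, VFORK_INIT, p_vforkbuf, and u_and_stack_buf come from the text and Listing 9.2; the structure layouts and the helper are simplified assumptions.

#include <stdlib.h>

/* Sketch of allocating a vforkinfo buffer and linking it into both the
 * parent's and the child's proc tables. */
enum vfork_state_sketch { VFORK_INIT /* later: VFORK_PARENT, VFORK_CHILDRUN */ };

struct vforkinfo_sketch {
    enum vfork_state_sketch vfork_state;
    void                   *u_and_stack_buf;    /* saved uarea and stack */
};

struct proc_sketch {
    struct vforkinfo_sketch *p_vforkbuf;        /* visible to both sides */
};

struct vforkinfo_sketch *vfork_setup(struct proc_sketch *parent,
                                     struct proc_sketch *child)
{
    struct vforkinfo_sketch *vi = calloc(1, sizeof(*vi));
    vi->vfork_state = VFORK_INIT;
    parent->p_vforkbuf = vi;        /* both processes see the same state */
    child->p_vforkbuf  = vi;
    return vi;                      /* freed once the child exec()s or exit()s */
}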

The next step in the vfork() sequence is to check if the parent is a multithreaded process. If so, the thread obtains exclusive write access to the process memory view and suspends all the sibling threads. This is necessary because the child will soon begin running in the parent's memory view, sharing its uarea and data regions; we can't have sibling threads also working in the same view!

This is an important issue for programmers to understand. When deciding whether to use vfork() in a multithreaded process, remember that until the child calls exec() or exit(), the parent thread and all of its siblings are suspended. This suspension should last a very short time, but in a symmetric multiprocessing (SMP) environment with siblings distributed across various processors, it could result in a number of forced context switches.

The next deviation for vfork() occurs when control is passed to procdup() where we set vfork_state=VFORK_PARENT in the vforkinfo buffer. The child is prepared for running, and the call returns in both the parent and the child. Note that in the vfork there is no time spent copying the vas and pregion structures, as the child's proc pointer, p_vas, simply points to the parent process's vas.

Following the return in the parent thread, a call is made to sleep(). As this call is entered, a check is made to see if the requesting thread is currently in a vfork and if vfork_state=VFORK_PARENT. If this is the case, several additional steps are taken. First, the size of the parent's uarea and stack is calculated, and they are copied to the address pointed to by u_and_stack_buf in the vforkinfo buffer so they may be restored when the parent wakes.

We now set vfork_state=VFORK_CHILDRUN, and the parent thread sleeps, waiting for a wakeup call. The correct use of the vfork() is for the child to immediately call either exec() or exit() upon its return. While a programmer may choose an alternate logic, any deviation from the stated use is not supported by Hewlett-Packard!
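From user space, the supported pattern, a child that immediately calls exec() or _exit(), looks like the program below. This is the standard vfork() idiom rather than anything taken from the kernel source.

#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = vfork();

    if (pid < 0) {
        perror("vfork");
        return 1;
    }
    if (pid == 0) {
        /* child: still running on the parent's memory view */
        execlp("ls", "ls", "-l", (char *)NULL);
        _exit(127);                 /* exec failed; do not fall back into main() */
    }
    /* parent: was asleep until the child called exec() or _exit() */
    waitpid(pid, NULL, 0);
    return 0;
}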

Before we take a look at exec(), let's spend a little time defining the process and kthread states.


