2.7. Process Creation

The fork(2) system call creates a new process. The newly created process is assigned a unique process identification number (PID) and is a child of the process that called fork(2); the calling process is the parent. The exec(2) system call overlays the new process with an executable specified as a path name in the first argument to the exec(2) call. The model, in pseudocode form, looks like this.

main(int argc, char *argv[], char *envp[])
{
        pid_t child_pid;

        child_pid = fork();
        if (child_pid == -1)
                perror("fork");                   /* fork system call failed */
        else if (child_pid == 0)
                execv("/path/new_binary", argv);  /* in the child, so exec */
        else
                wait(NULL);                       /* pid > 0, we're in the parent */
}

The pseudocode above calls fork(2) and checks its return value (a PID). Remember, once fork(2) executes successfully, there are two processes: fork returns 0 to the child process and returns the PID of the child to the parent process. In the example, we call exec(2) to execute new_binary once in the child. Back in the parent, we simply wait for the child to complete (we come back to this notion of "waiting" later).

The default behavior of fork(2) has changed in Solaris 10. In prior releases, fork(2) replicated all the threads in the calling process unless the code was linked with the POSIX thread library (-lpthread), in which case fork(2) created a new process with only the calling thread. Previous releases provided a fork1(2) interface for programs that needed to replicate only the calling thread in the new process and did not link with libpthread. In Solaris 10, fork(2) replicates only the calling thread in the new process, and a forkall(2) interface replicates all threads in the new process if desired. Finally, there's vfork(2), which is described as a "virtual memory efficient" version of fork.
A call to vfork(2) results in the child process "borrowing" the address space of the parent, rather than the kernel duplicating the parent's address space for the child, as it does in fork(2) and fork1(2). The child's address space following a vfork(2) is the same address space as that of the parent. More precisely, the physical memory pages of the new process's (the child's) address space are the same memory pages as the parent's. The implication here is that the child must not change any state while executing in the parent's address space until the child either exits or executes an exec(2) system call; once an exec(2) call is executed, the child gets its own address space. In fork(2) and fork1(2), the address space of the parent is copied for the child by means of the kernel address space duplicate routine.

Another addition to Solaris 10 on the process creation front is the posix_spawn(3C) interface, which does the fork/exec in one library call and does so in a memory-efficient way (the vfork(2) style of not replicating the address space).

We can trace the entire code path through the kernel for process creation when fork(2) is called, using the following dtrace script.

#!/usr/sbin/dtrace -s

#pragma D option flowindent

syscall::fork1:entry
{
        self->trace = 1;
}

fbt:::
/ self->trace /
{
}

syscall::fork1:return
{
        self->trace = 0;
        exit(0);
}

The dtrace script sets a probe to fire at the entry point of the fork1() system call. The reason we're using fork1() here instead of fork() is an implementation detail that has to do with the new fork() behavior described previously. In order to maintain source and binary compatibility for applications that use fork1(), the interface is still in the library. fork(2) now does the same thing fork1(2) did, so fork(2) and fork1(2) can be implemented with a single source file, as long as both interfaces resolve to the same code.

/*
 * fork() is fork1() for both POSIX threads and Solaris threads.
 * The forkall() interface exists for applications that require
 * the semantics of replicating all threads.
 */
#pragma weak fork = _fork1
#pragma weak _fork = _fork1
#pragma weak fork1 = _fork1
                                        See usr/src/lib/libc/port/threads/scalls.c

The #pragma binding directives associate fork and fork1. The dtrace system call provider cannot enable a fork:entry probe because, technically, the system call table does not contain a unique entry for fork(2). This is implemented in such a way as to be transparent to applications; the fork(2) system call behaves exactly as expected when used in application code.

Running the dtrace script on a test program that issues a fork(2) generates over 2000 lines of kernel function-flow output. The text below is cut from the output and aggressively edited, including replacing the CPU column with a LINE column; this was done by postprocessing the output file and is not a dtrace option.

LINE  FUNCTION
   1  -> fork1
   2  -> cfork
   3  -> holdlwps
   4  -> schedctl_finish_sigblock
   5  -> pokelwps
   6  -> getproc
   7  -> pid_assign
   8  <- pid_assign
   9  -> crgetruid
  10  -> task_attach
  11  -> task_hold
  12  -> rctl_set_dup
  13  <- task_attach
  14  <- getproc
  15  -> as_dup
  16  -> forklwp
  17  -> flush_user_windows_to_stack
  18  -> save_syscall_args
  19  -> lwp_getsysent
  20  -> lwp_getdatamodel
  21  -> lwp_create
  22  -> segkp_cache_get
  23  -> thread_create
  24  -> lgrp_affinity_init
  25  -> lgrp_move_thread
  26  <- thread_create
  27  -> lwp_stk_init
  28  -> thread_load
  29  -> lgrp_choose
  30  -> lgrp_move_thread
  31  -> init_mstate
  32  -> ts_alloc
  33  -> ts_fork
  34  -> thread_lock
  35  -> disp_lock_exit
  36  <- ts_fork
  37  <- forklwp
  38  -> pgjoin
  39  -> ts_forkret
  40  -> continuelwps
  41  -> setrun_locked
  42  -> thread_transition
  43  -> disp_lock_exit_high
  44  -> ts_setrun
  45  -> setbackdq
  46  -> cpu_update_pct
  47  -> cpu_decay
  48  -> exp_x
  49  -> cpu_choose
  50  -> disp_lowpri_cpu
  51  -> disp_lock_enter_high
  52  -> cpu_resched
  53  <- fork1

The kernel cfork() function is the common fork code that executes for any variant of fork(2).
After some preliminary checking, holdlwps() is called to suspend the calling LWP (or, in the case of forkall(), all the LWPs in the process) so that it can be safely replicated into the new process. getproc() (LINE 6) is where the new proc_t is created, allocated from the kmem process_cache. Once the proc_t is allocated, the state for the new process is set to SIDL, and a PID is assigned with the pid_assign() function, where a pid_t structure is allocated and initialized (see Figure 2.4).

The Solaris kernel implements an interesting throttle here in the event of a process forking out of control and thus consuming an inordinate amount of system resources. A failure by the kernel pid_assign() code or a lack of an available process table slot indicates a large amount of process creation activity. In this circumstance, the kernel implements a delay mechanism by which the process that issued the fork call is forced to sleep for an extra clock tick (a tick occurs every 10 milliseconds). By implementing this mechanism, the kernel ensures that no more than one fork can fail per CPU per clock tick. The throttle also scales up, such that an increased rate of fork failures results in an increased delay before the code returns the failure and the issuing process can try again. In that situation, you'll see the console message "out of processes," and the ov (overflow) column in the sar -v output will have a non-zero value. You can also look at the kernel fork_fail_pending variable; if this value is non-zero, the system has entered the fork throttle code segment. Below is an example of examining the fork_fail_pending kernel variable with mdb(1).
# mdb -k
Loading modules: [ unix krtld genunix specfs dtrace uppc pcplusmp ufs ip sctp
usba fcp fctl nca md lofs zfs random nfs fcip cpc crypto logindmux ptm sppp ipc ]
> fork_fail_pending/D
fork_fail_pending:
fork_fail_pending:              0

Much of the remaining work in getproc() is the initialization of the fields in the new proc_t, which includes copying the parent's uarea, updating the open file pointers and reference counts, and copying the list of open files into the new process. Also, the resource controls of the parent process are replicated for the child with rctl_set_dup() (LINE 12).

Back in cfork(), the code tests whether a vfork() was issued, in which case the new (child) process's address space is set to the parent's address space. Otherwise, as_dup() is called to duplicate the parent's address space for the new process, looping through all of the parent's address space segments and duplicating each one to construct the address space for the child. The next set of functions (LINES 16-37) creates a new LWP and a new kernel thread in the new process, which involves stack initialization and setting the home lgroup (see Section 3.2). With the LWP and thread work completed, pgjoin() sets up the process group links in the new process. Depending on the scheduling class of the calling thread, the class-specific forkret() code sets up the scheduling class information and handles CPU selection for placing the kernel thread in the new process on a dispatch queue (see Section 3.3). At this point, the new process is created, initialized, and ready to run.

With a newly created process/LWP/kthread infrastructure in place, most applications invoke exec(2). The exec(2) system call overlays the calling program with a new executable image. (Not following a fork(2) with an exec(2) results in two processes executing the same code; the parent and child execute whatever code exists after the fork(2) call.)
There are several flavors of the exec(2) call; the basic differences are in what they take as arguments. The exec(2) calls vary in whether they take a path name or a file name as the first argument (which specifies the new executable program to start), whether they take a comma-separated list of arguments or an argv[] array, and whether the existing environment is used or an envp[] array is passed.

Because Solaris supports the execution of several different file types, the kernel exec code is split into object-file-format-dependent and object-file-format-independent code segments. Most common is the previously discussed ELF format. Among the other supported file types is a.out, which is included to provide a degree of binary compatibility that enables executables created on a SunOS 4.X system to run on SunOS 5.X. Other inclusions are a format-specific exec routine for programs that run under an interpreter, such as shell scripts and awk programs, and support code for programs in the Java programming language with a Java-specific exec code segment.

Calls into the object-specific exec code are made through a switch table mechanism. During system startup, an execsw[] array is initialized with the magic numbers of the supported object file types. Magic numbers uniquely identify different object file types on UNIX systems; see /etc/magic and the magic(4) man page. Each array member is an execsw structure.

struct execsw {
        char    *exec_magic;
        int     exec_magoff;
        int     exec_maglen;
        int     (*exec_func)(struct vnode *vp, struct execa *uap,
                    struct uarg *args, struct intpdata *idata, int level,
                    long *execsz, int setid, caddr_t exec_file,
                    struct cred *cred);
        int     (*exec_core)(struct vnode *vp, struct proc *p,
                    struct cred *cred, rlim64_t rlimit, int sig,
                    core_content_t content);
        krwlock_t *exec_lock;
};
                                        See usr/src/uts/common/sys/exec.h
The object file exec code is implemented as dynamically loadable kernel modules, found in the /kernel/exec directory (aoutexec, elfexec, intpexec) and in /usr/kernel/exec (javaexec). The elf and intp modules load through the normal boot process because these two modules are used minimally by the kernel startup processes and startup shell scripts. The a.out and java modules load automatically when needed, as a result of exec'ing a SunOS 4.X binary or a Java program. When each module loads into RAM (kernel address space in memory), the mod_install() support code loads the execsw structure information into the execsw[] array.

You can examine the execsw[] array on your system by using mdb(1). Because execsw[] is an array of execsw structures, you need to calculate the address of each array entry based on the size of an execsw structure. Fortunately, mdb(1) offers a couple of features that make this relatively painless. In the following example, we use the ::sizeof dcmd to determine the size of an execsw structure and let mdb(1) do the math for us.

> execsw::print "struct execsw"
{
    exec_magic = elf32magicstr
    exec_magoff = 0
    exec_maglen = 0x5
    exec_func = elf32exec
    exec_core = elf32core
    exec_lock = 0xffffffff80132a18
}
> ::sizeof "struct execsw"
sizeof (struct execsw) = 0x28
> execsw+0x28::print "struct execsw"
{
    exec_magic = elf64magicstr
    exec_magoff = 0
    exec_maglen = 0x5
    exec_func = elfexec
    exec_core = elfcore
    exec_lock = 0xffffffff80132a10
}
> execsw+0x50::print "struct execsw"
{
    exec_magic = intpmagicstr "#!"
    exec_magoff = 0
    exec_maglen = 0x2
    exec_func = intpexec
    exec_core = 0
    exec_lock = 0xffffffff80132a08
}
> execsw+0x78::print "struct execsw"
{
    exec_magic = javamagicstr
    exec_magoff = 0
    exec_maglen = 0x4
    exec_func = 0
    exec_core = 0
    exec_lock = 0xffffffff80132a00
}

The first entry is examined with the execsw symbol in mdb(1), which represents the address of the beginning of the execsw[] array.
After examining the first entry, we use ::sizeof to determine how large an execsw structure is, add that value to the base address of the array, and get the second array entry. By doing additional arithmetic based on how far into the array we want to look, we can move down the array and examine each entry. We see that the array is initialized for 32-bit ELF files, 64-bit ELF files, interpreter files (shell scripts, Perl, etc.), and Java programs. Figure 2.8 illustrates the flow of exec for an ELF file.

Figure 2.8. exec() Flow
All variants of the exec(2) system call resolve in the kernel to a common routine, exec_common(), where some initial processing is done. The path name for the executable file is retrieved, exitlwps() is called to force all but the calling LWP to exit, any POSIX.4 interval timers in the process are cleared (the p_itimer field in the proc structure), and the sysexec counter in the cpu_sysinfo structure is incremented (it counts exec system calls and is readable with sar(1M)). If scheduler activations have been set up for the process, the door interface used for that purpose is closed (that is, scheduler activations are not inherited), and any other doors that exist within the process are closed. The SPREXEC flag is set in p_flags (a proc structure field), signifying that an exec is in the works for the process. The SPREXEC flag blocks any subsequent process operations until exec() has completed, at which point the flag is cleared.

The kernel generic exec code, gexec(), is now called; this is where we switch to the object-file-specific exec routine through the execsw[] array. The correct array entry for the type of file being exec'd is determined by a call to the kernel vn_rdwr() (vnode read/write) routine and a read of the first four bytes of the file, which is where the file's magic number is stored. Once the magic number has been retrieved, the code looks for a match by comparing the magic number of the exec'd file to the exec_magic field in each structure in the execsw[] array. Before entering the exec switch table, the code checks the credentials of the process against the permissions of the object file being exec'd. If the object file is not executable or the caller does not have execute permission, exec fails with an EACCES error. If the object file has the setuid or setgid bit set, the effective UID or GID is set in the new process credentials at this time.
Note that there are separate execsw[] array entries for each data model supported: 32-bit ILP32 ELF files and 64-bit LP64 ELF files. Let's examine the flow of the elfexec() function, since that is the most common type of executable run on Solaris systems.

Upon entry to the elfexec() code, the kernel reads the ELF header and program header table (PHT) sections of the object file (see Section 2.3 for an overview of the ELF header and PHT). These two main header sections of the object file give the system the information it needs to proceed with mapping the binary to the address space of the newly forked process. The kernel next gets the argument and environment arrays from the exec(2) call and places both on the user stack of the process, using the exec_args() function. The arguments are also copied into the u_psargs[] array of the process uarea at this time (see Figure 2.9).

Figure 2.9. Initial Stack Frame
Before actually setting up the user stack with the argv[] and envp[] arrays, a 64-bit kernel must first determine whether a 32-bit or a 64-bit binary is being exec'd. A 32-bit Solaris 10 system can run only 32-bit binaries. On SPARC systems, Solaris 10 is 64-bit only, but on x64, either a 32-bit or a 64-bit kernel can be booted. The binary type information is maintained in the ELF header, where the system checks the e_ident[] array for either an ELFCLASS32 or an ELFCLASS64 file identity. With the data model established, the kernel sets the initial sizes of the exec file sections: 4 Kbytes for the stack, 4 Kbytes for stack growth (stack increment), and 1 Mbyte (ELF32) or 2 Mbytes (ELF64) for the argument list.

Once the kernel has established the process user stack and argument list, it calls the mapelfexec() function to map the various program segments into the process address space. mapelfexec() walks through the program header table (PHT), and for each PT_LOAD type (a loadable segment), maps the segment into the process's address space. mapelfexec() bases the mapping on the p_filesz and p_memsz fields of the program header that defines the segment, using the lower-level kernel address space support code. Once the program's loadable segments have been mapped into the address space, the dynamic linker (for dynamically linked executables), referenced through the PHT, is also mapped into the process's address space. The elfexec code checks the process resource limit RLIMIT_VMEM (maximum virtual memory size) against the size required to map the object file and the runtime linker; an ENOMEM error is returned if an address space requirement exceeds the limit.

All that remains for exec(2) to complete is some additional housekeeping and structure initialization, which is done when the code returns to gexec().
This last part deals with clearing the signal stack and setting the signal disposition to default for any signals that have had a handler installed by the parent process. The p_lwptotal field is set to 1 in the new process. Finally, all open files with the close-on-exec flag set are closed, and the exec is complete. A call made into procfs clears the SPREXEC flag and unlocks access to the process by means of /proc.

As you'll see in the next chapter, threads inherit their scheduling class and priority from the parent. Some scheduling-class-specific fork code executes at the tail end of the fork process and takes care of placing the newly created kthread on a dispatch queue. This practice gets the child executing before the parent, in anticipation that the child will immediately execute an exec(2) call to load in the new object file. In the case of a vfork(2), where the child is mapped to the address space of the parent, the parent is forced to wait until the child execs and gets its own address space.