2.7. Process Creation

The fork(2) system call creates a new process. The newly created process is assigned a unique process identification number (PID) and is a child of the process that called fork(2); the calling process is the parent. The exec(2) system call overlays the new process with an executable specified as a path name in the first argument to the exec(2) call. The model, in pseudocode form, looks like this.

main(int argc, char *argv[], char *envp[])
{
        pid_t child_pid;

        child_pid = fork();
        if (child_pid == -1)
                perror("fork");                   /* fork system call failed */
        else if (child_pid == 0)
                execv("/path/new_binary", argv);  /* in the child, so exec */
        else
                wait(NULL);                       /* pid > 0, we're in the parent */
}

The pseudocode above calls fork(2) and checks its return value (a PID). Remember, once fork(2) executes successfully, there are two processes: fork returns 0 to the child process and returns the PID of the child to the parent process. In the example, we call exec(2) to execute new_binary once in the child. Back in the parent, we simply wait for the child to complete (we come back to this notion of "waiting" later).

The default behavior of fork(2) has changed in Solaris 10. In prior releases, fork(2) replicated all the threads in the calling process unless the code was linked with the POSIX thread library (-lpthread), in which case fork(2) created a new process with only the calling thread. Previous releases provided a fork1(2) interface for programs that needed to replicate only the calling thread in the new process and did not link with libpthread. In Solaris 10, fork(2) replicates only the calling thread in the new process, and a forkall(2) interface replicates all threads in the new process if desired. Finally, there's vfork(2), which is described as a "virtual memory efficient" version of fork.
A call to vfork(2) results in the child process "borrowing" the address space of the parent, rather than the kernel duplicating the parent's address space for the child, as it does in fork(2) and fork1(2). The child's address space following a vfork(2) is the same address space as that of the parent. More precisely, the physical memory pages of the new process's (the child's) address space are the same memory pages as the parent's. The implication here is that the child must not change any state while executing in the parent's address space until the child either exits or executes an exec(2) system call; once an exec(2) call is executed, the child gets its own address space. In fork(2) and fork1(2), the address space of the parent is copied for the child by means of the kernel address space duplicate routine.

Another addition to Solaris 10 on the process creation front is the posix_spawn(3C) interface, which does the fork/exec in one library call and does so in a memory-efficient way (the vfork(2) style of not replicating the address space).

We can trace the entire code path through the kernel for process creation when fork(2) is called, using the following dtrace script.

#!/usr/sbin/dtrace -s

#pragma D option flowindent

syscall::fork1:entry
{
        self->trace = 1;
}

fbt:::
/ self->trace /
{
}

syscall::fork1:return
{
        self->trace = 0;
        exit(0);
}

The dtrace script sets a probe to fire at the entry point of the fork1() system call. The reason we're using fork1() here instead of fork() is an implementation detail that has to do with the new fork() behavior described previously. In order to maintain source and binary compatibility for applications that use fork1(), the interface is still in the library. fork(2) now does the same thing fork1(2) did, so fork(2) and fork1(2) can be implemented with a single source file, as long as both interfaces resolve to the same code.

/*
 * fork() is fork1() for both POSIX threads and Solaris threads.
 * The forkall() interface exists for applications that require
 * the semantics of replicating all threads.
 */
#pragma weak fork = _fork1
#pragma weak _fork = _fork1
#pragma weak fork1 = _fork1
                                        See usr/src/lib/libc/port/threads/scalls.c

The #pragma binding directives associate fork and fork1. The dtrace system call provider cannot enable a fork:entry probe because, technically, the system call table does not contain a unique entry for fork(2). This is implemented in such a way as to be transparent to applications; the fork(2) system call behaves exactly as expected when used in application code.

Running the dtrace script on a test program that issues a fork(2) generates over 2000 lines of kernel function-flow output. The text below is cut from the output and aggressively edited, including replacing the CPU column with a LINE column; this was done by postprocessing the output file and is not a dtrace option.

LINE  FUNCTION
   1  -> fork1
   2  -> cfork
   3  -> holdlwps
   4  -> schedctl_finish_sigblock
   5  -> pokelwps
   6  -> getproc
   7  -> pid_assign
   8  <- pid_assign
   9  -> crgetruid
  10  -> task_attach
  11  -> task_hold
  12  -> rctl_set_dup
  13  <- task_attach
  14  <- getproc
  15  -> as_dup
  16  -> forklwp
  17  -> flush_user_windows_to_stack
  18  -> save_syscall_args
  19  -> lwp_getsysent
  20  -> lwp_getdatamodel
  21  -> lwp_create
  22  -> segkp_cache_get
  23  -> thread_create
  24  -> lgrp_affinity_init
  25  -> lgrp_move_thread
  26  <- thread_create
  27  -> lwp_stk_init
  28  -> thread_load
  29  -> lgrp_choose
  30  -> lgrp_move_thread
  31  -> init_mstate
  32  -> ts_alloc
  33  -> ts_fork
  34  -> thread_lock
  35  -> disp_lock_exit
  36  <- ts_fork
  37  <- forklwp
  38  -> pgjoin
  39  -> ts_forkret
  40  -> continuelwps
  41  -> setrun_locked
  42  -> thread_transition
  43  -> disp_lock_exit_high
  44  -> ts_setrun
  45  -> setbackdq
  46  -> cpu_update_pct
  47  -> cpu_decay
  48  -> exp_x
  49  -> cpu_choose
  50  -> disp_lowpri_cpu
  51  -> disp_lock_enter_high
  52  -> cpu_resched
  53  <- fork1

The kernel cfork() function is the common fork code that executes for any variant of fork(2).
After some preliminary checking, holdlwps() is called to suspend the calling LWP (or, in the case of forkall(), all the LWPs in the process) so that it can be safely replicated into the new process. getproc() (LINE 6) is where the new proc_t is created, allocated from the kmem process_cache. Once the proc_t is allocated, the state for the new process is set to SIDL, and a PID is assigned with the pid_assign() function, where a pid_t structure is allocated and initialized (see Figure 2.4).

The Solaris kernel implements an interesting throttle here in the event of a process forking out of control and thus consuming an inordinate amount of system resources. A failure by the kernel pid_assign() code or a lack of an available process table slot indicates a large amount of process creation activity. In this circumstance, the kernel implements a delay mechanism by which the process that issued the fork call is forced to sleep for an extra clock tick (a tick occurs every 10 milliseconds). By implementing this mechanism, the kernel ensures that no more than one fork can fail per CPU per clock tick. The throttle also scales up, such that an increased rate of fork failures results in an increased delay before the code returns the failure and the issuing process can try again. In that situation, you'll see the console message "out of processes," and the ov (overflow) column in the sar -v output will have a non-zero value. You can also look at the kernel fork_fail_pending variable; if this value is non-zero, the system has entered the fork throttle code segment. Below is an example of examining the fork_fail_pending kernel variable with mdb(1).
# mdb -k
Loading modules: [ unix krtld genunix specfs dtrace uppc pcplusmp ufs ip sctp
usba fcp fctl nca md lofs zfs random nfs fcip cpc crypto logindmux ptm sppp ipc ]
> fork_fail_pending/D
fork_fail_pending:
fork_fail_pending:              0

Much of the remaining work in getproc() is the initialization of the fields in the new proc_t, which includes copying the parent's uarea, updating the open file pointers and reference counts, and copying the list of open files into the new process. Also, the resource controls of the parent process are replicated for the child with rctl_set_dup() (LINE 12).

Back in cfork(), the code tests whether a vfork() was issued, in which case the new (child) process's address space is set to the parent's address space. Otherwise, as_dup() is called to duplicate the parent's address space for the new process, looping through all of the parent's address space segments and duplicating each one to construct the address space for the child. The next set of functions (LINES 16-37) creates a new LWP and a new kernel thread in the new process, which involves stack initialization and setting the home lgroup (see Section 3.2). With the LWP and thread work completed, pgjoin() sets up the process group links in the new process. Depending on the scheduling class of the calling thread, the class-specific forkret() code sets up the scheduling class information and handles CPU selection for placing the kernel thread in the new process on a dispatch queue (see Section 3.3). At this point, the new process is created, initialized, and ready to run.

With a newly created process/LWP/kthread infrastructure in place, most applications invoke exec(2). The exec(2) system call overlays the calling program with a new executable image. (Not following a fork(2) with an exec(2) results in two processes executing the same code; the parent and child execute whatever code exists after the fork(2) call.)
There are several flavors of the exec(2) call; the basic differences are in what they take as arguments. The exec(2) calls vary in whether they take a path name or a file name as the first argument (which specifies the new executable program to start), whether they take a comma-separated list of arguments or an argv[] array, and whether the existing environment is used or an envp[] array is passed.

Because Solaris supports the execution of several different file types, the kernel exec code is split into object-file-format-dependent and object-file-format-independent code segments. Most common is the previously discussed ELF format. Among the other supported file types is a.out, which is included to provide a degree of binary compatibility that enables executables created on a SunOS 4.X system to run on SunOS 5.X. Other inclusions are a format-specific exec routine for programs that run under an interpreter, such as shell scripts and awk programs, and support code for programs in the Java programming language with a Java-specific exec code segment.

Calls into the object-specific exec code are made through a switch table mechanism. During system startup, an execsw[] array is initialized with the magic numbers of the supported object file types. Magic numbers uniquely identify different object file types on UNIX systems; see /etc/magic and the magic(4) man page. Each array member is an execsw structure.

struct execsw {
        char    *exec_magic;
        int     exec_magoff;
        int     exec_maglen;
        int     (*exec_func)(struct vnode *vp, struct execa *uap,
                    struct uarg *args, struct intpdata *idata, int level,
                    long *execsz, int setid, caddr_t exec_file,
                    struct cred *cred);
        int     (*exec_core)(struct vnode *vp, struct proc *p,
                    struct cred *cred, rlim64_t rlimit, int sig,
                    core_content_t content);
        krwlock_t *exec_lock;
};
                                        See usr/src/uts/common/sys/exec.h
The object file exec code is implemented as dynamically loadable kernel modules, found in the /kernel/exec directory (aoutexec, elfexec, intpexec) and in /usr/kernel/exec (javaexec). The elf and intp modules load through the normal boot process because these two modules are used minimally by the kernel startup processes and startup shell scripts. The a.out and java modules load automatically when needed, as a result of exec'ing a SunOS 4.X binary or a Java program. When each module loads into RAM (kernel address space in memory), the mod_install() support code loads the execsw structure information into the execsw[] array.

You can examine the execsw[] array on your system by using mdb(1). Because execsw[] is an array of execsw structures, you need to calculate the address of each array entry based on the size of an execsw structure. Fortunately, mdb(1) offers a couple of features that make this relatively painless. In the following example, we use the ::sizeof dcmd to determine the size of an execsw structure and let mdb(1) do the math for us.

> execsw::print "struct execsw"
{
    exec_magic = elf32magicstr
    exec_magoff = 0
    exec_maglen = 0x5
    exec_func = elf32exec
    exec_core = elf32core
    exec_lock = 0xffffffff80132a18
}
> ::sizeof "struct execsw"
sizeof (struct execsw) = 0x28
> execsw+0x28::print "struct execsw"
{
    exec_magic = elf64magicstr
    exec_magoff = 0
    exec_maglen = 0x5
    exec_func = elfexec
    exec_core = elfcore
    exec_lock = 0xffffffff80132a10
}
> execsw+0x50::print "struct execsw"
{
    exec_magic = intpmagicstr "#!"
    exec_magoff = 0
    exec_maglen = 0x2
    exec_func = intpexec
    exec_core = 0
    exec_lock = 0xffffffff80132a08
}
> execsw+0x78::print "struct execsw"
{
    exec_magic = javamagicstr
    exec_magoff = 0
    exec_maglen = 0x4
    exec_func = 0
    exec_core = 0
    exec_lock = 0xffffffff80132a00
}

The first entry is examined with the execsw symbol in mdb(1), which represents the address of the beginning of the execsw[] array.
After examining the first entry, we use ::sizeof to determine how large an execsw structure is, add that value to the base address of the array, and get the second array entry. By doing additional arithmetic based on how far into the array we want to look, we can move down the array and examine each entry. We see that the array is initialized for 32-bit ELF files, 64-bit ELF files, interpreter files (shell scripts, Perl, etc.), and Java programs. Figure 2.8 illustrates the flow of exec for an ELF file.

Figure 2.8. exec() Flow
All variants of the exec(2) system call resolve in the kernel to a common routine, exec_common(), where some initial processing is done. The path name for the executable file is retrieved, exitlwps() is called to force all but the calling LWP to exit, any POSIX.4 interval timers in the process are cleared (the p_itimer field in the proc structure), and the sysexec counter in the cpu_sysinfo structure is incremented (it counts exec system calls and is readable with sar(1M)). If scheduler activations have been set up for the process, the door interface used for that purpose is closed (that is, scheduler activations are not inherited), and any other doors that exist within the process are closed. The SPREXEC flag is set in p_flags (a proc structure field), signifying that an exec is in the works for the process. The SPREXEC flag blocks any subsequent process operations until exec() has completed, at which point the flag is cleared.

The kernel generic exec code, gexec(), is now called; this is where we switch to the object-file-specific exec routine through the execsw[] array. The correct array entry for the type of file being exec'd is determined by a call to the kernel vn_rdwr() (vnode read/write) routine and a read of the first four bytes of the file, which is where the file's magic number is stored. Once the magic number has been retrieved, the code looks for a match by comparing the magic number of the exec'd file to the exec_magic field in each structure in the execsw[] array. Before entering the exec switch table, the code checks the credentials of the process against the permissions of the object file being exec'd. If the object file is not executable or the caller does not have execute permission, exec fails with an EACCES error. If the object file has the setuid or setgid bit set, the effective UID or GID is set in the new process credentials at this time.
Note that there are separate execsw[] array entries for each data model supported: 32-bit ILP32 ELF files and 64-bit LP64 ELF files. Let's examine the flow of the elfexec() function, since that is the most common type of executable run on Solaris systems.

Upon entry to the elfexec() code, the kernel reads the ELF header and program header table (PHT) sections of the object file (see Section 2.3 for an overview of the ELF header and PHT). These two main header sections of the object file give the system the information it needs to proceed with mapping the binary to the address space of the newly forked process. The kernel next gets the argument and environment arrays from the exec(2) call and places both on the user stack of the process, using the exec_args() function. The arguments are also copied into the u_psargs[] array of the process uarea at this time (see Figure 2.9).

Figure 2.9. Initial Stack Frame
Before actually setting up the user stack with the argv[] and envp[] arrays, a 64-bit kernel must first determine whether a 32-bit or a 64-bit binary is being exec'd. A 32-bit Solaris 10 system can run only 32-bit binaries. On SPARC systems, Solaris 10 is 64-bit only, but on x64, either a 32-bit or a 64-bit kernel can be booted. The binary type information is maintained in the ELF header, where the system checks the e_ident[] array for either an ELFCLASS32 or an ELFCLASS64 file identity. With the data model established, the kernel sets the initial sizes of the exec file sections: 4 Kbytes for the stack, 4 Kbytes for stack growth (stack increment), and 1 Mbyte (ELF32) or 2 Mbytes (ELF64) for the argument list.

Once the kernel has established the process user stack and argument list, it calls the mapelfexec() function to map the various program segments into the process address space. mapelfexec() walks through the program header table (PHT), and for each PT_LOAD type (a loadable segment), maps the segment into the process's address space. mapelfexec() bases the mapping on the p_filesz and p_memsz fields of the program header that defines the segment, using the lower-level kernel address space support code. Once the program's loadable segments have been mapped into the address space, the dynamic linker (for dynamically linked executables), referenced through the PHT, is also mapped into the process's address space. The elfexec code checks the process resource limit RLIMIT_VMEM (maximum virtual memory size) against the size required to map the object file and the runtime linker; an ENOMEM error is returned if an address space requirement exceeds the limit.

All that remains for exec(2) to complete is some additional housekeeping and structure initialization, which is done when the code returns to gexec().
This last part deals with clearing the signal stack and setting the signal disposition to default for any signals that have had a handler installed by the parent process. The p_lwptotal field is set to 1 in the new process. Finally, all open files with the close-on-exec flag set are closed, and the exec is complete. A call made into procfs clears the SPREXEC flag and unlocks access to the process by means of /proc.

As you'll see in the next chapter, threads inherit their scheduling class and priority from the parent. Some scheduling-class-specific fork code executes at the tail end of the fork process and takes care of placing the newly created kthread on a dispatch queue. This practice gets the child executing before the parent, in anticipation that the child will immediately execute an exec(2) call to load in the new object file. In the case of a vfork(2), where the child is mapped to the address space of the parent, the parent is forced to wait until the child execs and gets its own address space.