3.3. Process Creation: fork(), vfork(), and clone() System Calls

After the sample code is compiled into a file (in our case, an ELF executable^[2]), we call it from the command line. Look at what happens when we press the Return key. We already mentioned that any given process is created by another process. The operating system provides the functionality to do this by means of the fork(), vfork(), and clone() system calls.

^[2] ELF executable is an executable format that Linux supports. Chapter 9 discusses the ELF executable format.

The C library provides three functions that issue these three system calls. The prototypes of fork() and vfork() are declared in <unistd.h>; clone() is declared in <sched.h>. Figure 3.9 shows how a process that calls fork() executes the system call sys_fork(), and how kernel code then performs the actual process creation. In a similar manner, vfork() calls sys_vfork(), and clone() calls sys_clone().

Figure 3.9. Process Creation System Calls

All three of these system calls eventually call do_fork(), which is a kernel function that performs the bulk of the actions related to process creation. You might wonder why three different functions are available to create a process. Each function slightly differs in how it creates a process, and there are specific reasons why one would be chosen over the other.

When we press Return at the shell prompt, the shell creates the new process that executes our program by means of a call to fork(). In fact, if we type the command foo at the shell and press Return, the pseudocode of the shell at that moment looks something like this:

 if( (pid = fork()) == 0 )
    execve("foo");
 else
    waitpid(pid);
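The pseudocode leaves out the arguments and error handling that real calls need. The following is a minimal user-space sketch of the same fork/exec/wait pattern, not the code of any actual shell. The helper name run_and_wait() and the exit code 127 for a failed exec are assumptions made for this illustration.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/* Hypothetical helper: run the program named by argv[0] in a child process.
 * Returns the child's exit status, or -1 if the child could not be created
 * or did not exit normally. */
int run_and_wait(char *const argv[])
{
    int status;
    pid_t pid;

    assert(argv != NULL && argv[0] != NULL);

    pid = fork();
    if (pid < 0)
        return -1;                 /* fork() failed: no child exists */

    if (pid == 0) {
        /* Child: replace this process image with the requested program. */
        execve(argv[0], argv, environ);
        _exit(127);                /* reached only if execve() failed */
    }

    /* Parent: block until the child terminates, as the shell does. */
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling run_and_wait() with an argv of { "/bin/sh", "-c", "exit 7", NULL } should return 7, the status that the child passed to exit.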

We can now look at the functions and trace them down to the system call. Although our program calls fork(), it could just as easily have called vfork() or clone(), which is why we introduced all three functions in this section. The first function we look at is fork(). We delve through the calls fork(), sys_fork(), and do_fork(). We follow that with vfork() and finally look at clone() and trace them down to the do_fork() call.

3.3.1. fork() Function

The fork() function returns twice: once in the parent and once in the child process. If it returns in the child process, fork() returns 0. If it returns in the parent, fork() returns the child's PID. When the fork() function is called, the function places the necessary information in the appropriate registers, including the index into the system call table where the pointer to the system call resides. The processor we are running on determines the registers into which this information is placed.
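The two return values can be observed from user space. In the sketch below, the child sends its own PID through a pipe and the parent compares it with the value that fork() returned. The helper name fork_returns_child_pid() is hypothetical, and the pipe is only one of several ways the two processes could exchange the value.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if the PID that fork() hands back to the
 * parent matches the PID the child reports for itself, 0 if it does not,
 * and -1 on error. */
int fork_returns_child_pid(void)
{
    int fds[2];
    pid_t pid, reported = -1;
    ssize_t n;

    if (pipe(fds) < 0)
        return -1;

    pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {
        /* Child: fork() returned 0 here. Report our own PID to the parent. */
        pid_t self = getpid();
        close(fds[0]);
        n = write(fds[1], &self, sizeof(self));
        _exit(n == (ssize_t)sizeof(self) ? 0 : 1);
    }

    /* Parent: fork() returned the child's PID here, never 0. */
    assert(pid > 0);
    close(fds[1]);
    n = read(fds[0], &reported, sizeof(reported));
    close(fds[0]);
    waitpid(pid, NULL, 0);

    if (n != (ssize_t)sizeof(reported))
        return -1;
    return reported == pid;
}
```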

At this point, if you want to continue the sequential ordering of events, look at the "Interrupts" section in this chapter to see how sys_fork() is called. However, you do not need that material to understand how a new process gets created.

Let's now look at the sys_fork() function. This function does little else than call the do_fork() function. Notice that the sys_fork() function is architecture dependent because it accesses function parameters passed in through the system registers.

 -----------------------------------------------------------------------
 arch/i386/kernel/process.c
 asmlinkage int sys_fork(struct pt_regs regs)
 {
    return do_fork(SIGCHLD, regs.esp, &regs, 0, NULL, NULL);
 }
 -----------------------------------------------------------------------
 -----------------------------------------------------------------------
 arch/ppc/kernel/process.c
 int sys_fork(int p1, int p2, int p3, int p4, int p5, int p6,
              struct pt_regs *regs)
 {
         CHECK_FULL_REGS(regs);
         return do_fork(SIGCHLD, regs->gpr[1], regs, 0, NULL, NULL);
 }
 -----------------------------------------------------------------------

The two architectures take in different parameters to the system call. The structure pt_regs holds information such as the stack pointer. The fact that gpr[1] holds the stack pointer in PPC, whereas %esp^[3] holds the stack pointer in x86, is known by convention.

^[3] Recall that in code produced in "AT&T" format, registers are prefixed with a %.

3.3.2. vfork() Function

The vfork() function is similar to the fork() function with the exception that the parent process is blocked until the child calls exit() or exec().

 -----------------------------------------------------------------------
 sys_vfork()
 arch/i386/kernel/process.c
 asmlinkage int sys_vfork(struct pt_regs regs)
 {
    return do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, regs.esp, &regs, 0, NULL, NULL);
 }
 -----------------------------------------------------------------------
 -----------------------------------------------------------------------
 arch/ppc/kernel/process.c
 int sys_vfork(int p1, int p2, int p3, int p4, int p5, int p6,
               struct pt_regs *regs)
 {
         CHECK_FULL_REGS(regs);
         return do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, regs->gpr[1],
                         regs, 0, NULL, NULL);
 }
 -----------------------------------------------------------------------

The only difference between the calls to do_fork() in sys_vfork() and sys_fork() is the set of flags that do_fork() is passed. The presence of these flags is used later to determine whether the added behavior just described (blocking the parent) is executed.
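From user space, vfork() is used in much the same way as fork(), with one restriction that follows from CLONE_VM: the child runs in the parent's address space, so it should do nothing except call an exec function or _exit(). The sketch below assumes a glibc system; the helper name vfork_and_exec() and the exit code 127 for a failed exec are choices made for this illustration.

```c
#define _DEFAULT_SOURCE   /* asks glibc to declare vfork() */
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: start argv[0] through vfork() and return its exit
 * status, or -1 on failure. */
int vfork_and_exec(char *const argv[])
{
    int status;
    pid_t pid;

    assert(argv != NULL && argv[0] != NULL);

    pid = vfork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: it borrows the parent's address space, so it only
         * calls exec or _exit and touches nothing else. */
        execv(argv[0], argv);
        _exit(127);                /* reached only if execv() failed */
    }

    /* Parent: resumes here only after the child has called exec or _exit. */
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```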

3.3.3. clone() Function

The clone() library function, unlike fork() and vfork(), takes in a pointer to a function along with its argument. The child process created by do_fork() calls this function as soon as it gets created.
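A minimal sketch of the library call, assuming Linux with glibc, shows the function pointer and its argument in use. The names child_entry() and clone_and_wait() and the 64KB stack size are hypothetical choices for this example. Only SIGCHLD is passed as a flag here, so the child gets its own copy of the address space, much as it would with fork().

```c
#define _GNU_SOURCE       /* glibc declares clone() only with this defined */
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

#define CHILD_STACK_SIZE (64 * 1024)   /* arbitrary size for this sketch */

/* Entry point of the child. Its return value becomes the exit status. */
static int child_entry(void *arg)
{
    return *(int *)arg + 1;
}

/* Hypothetical helper: run child_entry(&value) in a new process created by
 * clone() and return the child's exit status, or -1 on failure. */
int clone_and_wait(int value)
{
    char *stack;
    int status, result = -1;
    pid_t pid;

    assert(value >= 0 && value < 255);   /* an exit status is 8 bits wide */

    stack = malloc(CHILD_STACK_SIZE);
    if (stack == NULL)
        return -1;

    /* The stack grows downward on x86 and PPC, so pass its highest address. */
    pid = clone(child_entry, stack + CHILD_STACK_SIZE, SIGCHLD, &value);
    if (pid > 0 && waitpid(pid, &status, 0) == pid && WIFEXITED(status))
        result = WEXITSTATUS(status);

    free(stack);
    return result;
}
```

With these definitions, clone_and_wait(41) should return 42: the child starts in child_entry(), adds one to its argument, and the value it returns is reported to the parent as the exit status.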

 -----------------------------------------------------------------------
 sys_clone()
 arch/i386/kernel/process.c
 asmlinkage int sys_clone(struct pt_regs regs)
 {
         unsigned long clone_flags;
         unsigned long newsp;
         int __user *parent_tidptr, *child_tidptr;

         clone_flags = regs.ebx;
         newsp = regs.ecx;
         parent_tidptr = (int __user *)regs.edx;
         child_tidptr = (int __user *)regs.edi;
         if (!newsp)
                 newsp = regs.esp;
         return do_fork(clone_flags & ~CLONE_IDLETASK, newsp, &regs, 0,
                         parent_tidptr, child_tidptr);
 }
 -----------------------------------------------------------------------
 -----------------------------------------------------------------------
 arch/ppc/kernel/process.c
 int sys_clone(unsigned long clone_flags, unsigned long usp,
               int __user *parent_tidp, void __user *child_threadptr,
               int __user *child_tidp, int p6,
               struct pt_regs *regs)
 {
         CHECK_FULL_REGS(regs);
         if (usp == 0)
                 usp = regs->gpr[1];     /* stack pointer for child */
         return do_fork(clone_flags & ~CLONE_IDLETASK, usp, regs, 0,
                         parent_tidp, child_tidp);
 }
 -----------------------------------------------------------------------

As Table 3.4 shows, the only difference between fork(), vfork(), and clone() is which flags are set in the subsequent calls to do_fork().

Table 3.4. Flags Passed to do_fork by fork(), vfork(), and clone()
Flag	fork()	vfork()
`SIGCHLD`	X	X
`CLONE_VFORK`		X
`CLONE_VM`		X
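The table can be restated as a small user-space C sketch. It assumes the flag constants that glibc exposes in <sched.h> on Linux. The function names are hypothetical and exist only to mirror the table; they are not kernel code.

```c
#define _GNU_SOURCE       /* glibc exposes the CLONE_* constants with this */
#include <assert.h>
#include <sched.h>
#include <signal.h>

/* Flags that sys_fork() passes to do_fork(), per Table 3.4. */
unsigned long fork_flags(void)
{
    return SIGCHLD;
}

/* Flags that sys_vfork() passes to do_fork(), per Table 3.4. */
unsigned long vfork_flags(void)
{
    return CLONE_VFORK | CLONE_VM | SIGCHLD;
}

/* The low byte of the flag word (mask CSIGNAL) holds the signal that is
 * sent to the parent when the child exits. */
int exit_signal(unsigned long flags)
{
    return (int)(flags & CSIGNAL);
}

int shares_memory(unsigned long flags) { return (flags & CLONE_VM) != 0; }
int blocks_parent(unsigned long flags) { return (flags & CLONE_VFORK) != 0; }

/* Self-check of the table: both calls notify the parent with SIGCHLD, and
 * only vfork() shares memory and blocks the parent. */
void check_table(void)
{
    assert(exit_signal(fork_flags()) == SIGCHLD);
    assert(exit_signal(vfork_flags()) == SIGCHLD);
    assert(!shares_memory(fork_flags()) && !blocks_parent(fork_flags()));
    assert(shares_memory(vfork_flags()) && blocks_parent(vfork_flags()));
}
```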

Finally, we get to do_fork(), which performs the real process creation. Recall that up to this point, we only have the parent executing the call to fork(), which then invokes the system call sys_fork(); we still do not have a new process. Our program foo still exists only as an executable file on disk. It is not running or in memory.

3.3.4. do_fork() Function

We follow the kernel side execution of do_fork() line by line as we describe the details behind the creation of a new process.

 -----------------------------------------------------------------------
 kernel/fork.c
 1167   long do_fork(unsigned long clone_flags,
 1168       unsigned long stack_start,
 1169       struct pt_regs *regs,
 1170       unsigned long stack_size,
 1171       int __user *parent_tidptr,
 1172       int __user *child_tidptr)
 1173   {
 1174     struct task_struct *p;
 1175     int trace = 0;
 1176     long pid;
 1177
 1178     if (unlikely(current->ptrace)) {
 1179        trace = fork_traceflag (clone_flags);
 1180        if (trace)
 1181          clone_flags |= CLONE_PTRACE;
 1182     }
 1183
 1184     p = copy_process(clone_flags, stack_start, regs, stack_size, parent_tidptr, child_tidptr);
 -----------------------------------------------------------------------

Lines 1178-1183

The code begins by checking whether the parent wants the new process ptraced. References to ptrace are prevalent within functions dealing with processes; this book explains them only at a high level. To determine whether a child can be traced, fork_traceflag() must check the value of clone_flags. If CLONE_VFORK is set in clone_flags, if SIGCHLD is not to be caught by the parent, or if the current process also has PT_TRACE_FORK set, the child is traced, unless the CLONE_UNTRACED or CLONE_IDLETASK flags have also been set.

Line 1184

This line is where a new process is created and where the values in the registers are copied out. The copy_process() function performs the bulk of the new process space creation and descriptor field definition. However, the start of the new process does not take place until later. The details of copy_process() make more sense when the explanation is scheduler-centric. See the "Keeping Track of Processes: Basic Scheduler Construction" section in this chapter for more detail on what happens here.

 -----------------------------------------------------------------------
 kernel/fork.c
 ...
 1189     pid = IS_ERR(p) ? PTR_ERR(p) : p->pid;
 1190
 1191     if (!IS_ERR(p)) {
 1192        struct completion vfork;
 1193
 1194        if (clone_flags & CLONE_VFORK) {
 1195          p->vfork_done = &vfork;
 1196          init_completion(&vfork);
 1197        }
 1198
 1199        if ((p->ptrace & PT_PTRACED) || (clone_flags & CLONE_STOPPED)) {
 ...
 1203          sigaddset(&p->pending.signal, SIGSTOP);
 1204          set_tsk_thread_flag(p, TIF_SIGPENDING);
 1205        }
 ...
 -----------------------------------------------------------------------

Line 1189

This is a check for pointer errors. If we find a pointer error, we return the pointer error without further ado.
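The idea behind IS_ERR() and PTR_ERR() is that one return value can carry either a valid pointer or a small negative error code. The following user-space sketch imitates that idiom. It is not the kernel's implementation: the names err_ptr(), is_err(), ptr_err(), and pid_or_error() are hypothetical, and the bound of 4095 error codes is an assumption borrowed from later kernel versions.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define SKETCH_MAX_ERRNO 4095   /* assumed bound on error codes */

/* Encode a negative error code as a pointer near the top of the address
 * space, where no valid object is expected to live. */
static inline void *err_ptr(long error)
{
    assert(error < 0 && error >= -SKETCH_MAX_ERRNO);
    return (void *)(intptr_t)error;
}

/* A pointer is an "error pointer" if it falls in the reserved top range. */
static inline int is_err(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-SKETCH_MAX_ERRNO;
}

/* Recover the negative error code from an error pointer. */
static inline long ptr_err(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

/* Mirrors the shape of line 1189: yield either the error code or the
 * real result. */
long pid_or_error(const void *p, long pid_if_ok)
{
    return is_err(p) ? ptr_err(p) : pid_if_ok;
}
```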

Lines 1194-1197

At this point, we check whether do_fork() was called from vfork(). If it was, we initialize the completion structure on which the parent later waits and store a pointer to it in the child's vfork_done field.

Lines 1199-1205

If the new process is being traced (PT_PTRACED is set in p->ptrace) or CLONE_STOPPED is set in clone_flags, a SIGSTOP signal is queued for the child, so it starts in a stopped state.
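The effect of a SIGSTOP that arrives at startup can be approximated from user space. In the sketch below, which is only an analogy, the child stops itself with raise(SIGSTOP) instead of having the kernel queue the signal on its behalf. The parent observes the stopped state through waitpid() and then resumes the child with SIGCONT. The helper name observe_stopped_child() is hypothetical.

```c
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if the child was seen in the stopped state
 * and then ran to completion after SIGCONT, 0 if not, and -1 on error. */
int observe_stopped_child(void)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;

    if (pid == 0) {
        raise(SIGSTOP);            /* stop before doing any work */
        _exit(0);                  /* runs only after SIGCONT arrives */
    }

    /* WUNTRACED makes waitpid() report a stopped child as well. */
    if (waitpid(pid, &status, WUNTRACED) != pid || !WIFSTOPPED(status)) {
        kill(pid, SIGKILL);
        waitpid(pid, NULL, 0);
        return 0;
    }
    assert(WSTOPSIG(status) == SIGSTOP);

    kill(pid, SIGCONT);            /* let the child continue */
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```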

 -----------------------------------------------------------------------
 kernel/fork.c
 1207     if (!(clone_flags & CLONE_STOPPED)) {
 ...
 1222          wake_up_forked_process(p);
 1223     } else {
 1224        int cpu = get_cpu();
 1225
 1226        p->state = TASK_STOPPED;
 1227        if (!(clone_flags & CLONE_STOPPED))
 1228          wake_up_forked_process(p);   /* do this last */
 1229        ++total_forks;
 1230
 1231        if (unlikely (trace)) {
 1232          current->ptrace_message = pid;
 1233          ptrace_notify ((trace << 8) | SIGTRAP);
 1234        }
 1235
 1236        if (clone_flags & CLONE_VFORK) {
 1237          wait_for_completion(&vfork);
 1238          if (unlikely (current->ptrace & PT_TRACE_VFORK_DONE))
 1239             ptrace_notify ((PTRACE_EVENT_VFORK_DONE << 8) | SIGTRAP);
 1240        } else
 ...
 1248          set_need_resched();
 1249     }
 1250     return pid;
 1251   }
 -----------------------------------------------------------------------

Lines 1226-1229

In this block, we set the state of the task to TASK_STOPPED. If the CLONE_STOPPED flag was not set in clone_flags, we wake up the child process; otherwise, we leave it waiting for its wakeup signal.

Lines 1231-1234

If ptracing has been enabled on the parent, we send a notification.

Lines 1236-1239

If this was originally a call to vfork(), this is where we block the parent and, if tracing is enabled, send a notification to the tracer. The parent is placed in a wait queue and remains there in a TASK_UNINTERRUPTIBLE state until the child calls exit() or execve().
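A rough user-space analogy to this wait can be built from fork() and a pipe whose write end is marked close-on-exec. The parent's read() returns end-of-file only once the child has called exec or exited, because either event closes the last write end. This is a sketch of the idea and not how the kernel implements the completion; unlike the kernel's wait, the read() here can be interrupted by a signal. The helper name fork_and_wait_for_exec() is hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork, block until the child has called exec or
 * exited, then reap it. Returns the child's exit status or -1 on failure. */
int fork_and_wait_for_exec(char *const argv[])
{
    int fds[2], status;
    char byte;
    pid_t pid;
    ssize_t n;

    assert(argv != NULL && argv[0] != NULL);

    if (pipe(fds) < 0)
        return -1;

    /* Close-on-exec: a successful exec closes the write end for us. */
    if (fcntl(fds[1], F_SETFD, FD_CLOEXEC) < 0 || (pid = fork()) < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {
        close(fds[0]);
        execv(argv[0], argv);
        _exit(127);                /* exiting also closes the write end */
    }

    close(fds[1]);
    /* Blocks until every write end is closed, then reports EOF (0 bytes). */
    n = read(fds[0], &byte, 1);
    close(fds[0]);

    if (waitpid(pid, &status, 0) != pid || n != 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```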

Line 1248

We set need_resched in the current task (the parent). This allows the child process to run first.