3.2. Process Descriptor

In the kernel, the process descriptor is a structure called task_struct, which keeps track of process attributes and information. All kernel information regarding a process is found there. Throughout its lifecycle, a process interacts with many aspects of the kernel, such as memory management and scheduling. The process descriptor keeps track of information regarding these interactions, as well as the standard UNIX process attributes. The kernel stores all the process descriptors in a circular doubly linked list called the task_list. The kernel also keeps a reference to the task_struct of the currently running process by means of the global variable current. (We refer to current throughout this book to indicate the process descriptor of the currently running process.)

A process may comprise one or more threads. Each thread has a task_struct associated with it, including a unique thread ID. Threads in a common process share the same memory address space.

The following categories describe some of the types of things a process descriptor must keep track of during a process' lifespan:

Process attributes
Process relationships
Process memory space
File management
Signal management
Process credentials
Resource limits
Scheduling related fields

We now look closely at the fields in the task_struct structure. This section describes what they do and refers to the actual processing with which each field is involved. Although many of the fields are used for activities related to the aforementioned categories, some are beyond the scope of this book. The task_struct structure is defined in include/linux/sched.h:

-----------------------------------------------------------------------
include/linux/sched.h
struct task_struct {
    volatile long state;
    struct thread_info *thread_info;
    atomic_t usage;
    unsigned long flags;
    unsigned long ptrace;

    int lock_depth;

    int prio, static_prio;
    struct list_head run_list;
    prio_array_t *array;

    unsigned long sleep_avg;
    long interactive_credit;
    unsigned long long timestamp;
    int activated;

    unsigned long policy;
    cpumask_t cpus_allowed;
    unsigned int time_slice, first_time_slice;

    struct list_head tasks;
    struct list_head ptrace_children;
    struct list_head ptrace_list;

    struct mm_struct *mm, *active_mm;
    ...
    struct linux_binfmt *binfmt;
    int exit_code, exit_signal;
    int pdeath_signal;
    ...
    pid_t pid;
    pid_t tgid;
    ...
    struct task_struct *real_parent;
    struct task_struct *parent;
    struct list_head children;
    struct list_head sibling;
    struct task_struct *group_leader;
    ...
    struct pid_link pids[PIDTYPE_MAX];

    wait_queue_head_t wait_chldexit;
    struct completion *vfork_done;
    int __user *set_child_tid;
    int __user *clear_child_tid;

    unsigned long rt_priority;
    unsigned long it_real_value, it_prof_value, it_virt_value;
    unsigned long it_real_incr, it_prof_incr, it_virt_incr;
    struct timer_list real_timer;
    unsigned long utime, stime, cutime, cstime;
    unsigned long nvcsw, nivcsw, cnvcsw, cnivcsw;
    u64 start_time;
    ...
    uid_t uid,euid,suid,fsuid;
    gid_t gid,egid,sgid,fsgid;
    struct group_info *group_info;
    kernel_cap_t cap_effective, cap_inheritable, cap_permitted;
    int keep_capabilities:1;
    struct user_struct *user;
    ...
    struct rlimit rlim[RLIM_NLIMITS];
    unsigned short used_math;
    char comm[16];
    ...
    int link_count, total_link_count;
    ...
    struct fs_struct *fs;
    ...
    struct files_struct *files;
    ...
    unsigned long ptrace_message;
    siginfo_t *last_siginfo;
    ...
};
-----------------------------------------------------------------------

3.2.1. Process Attribute-Related Fields

The process attribute category is a catch-all category we defined for task characteristics related to the state and identification of a task. Examining these fields' values at any time gives the kernel hacker an idea of the current status of a process. Figure 3.2 illustrates the process attribute-related fields of the task_struct.

Figure 3.2. Process Attribute-Related Fields

3.2.1.1. state

The state field keeps track of the state a process finds itself in during its execution lifecycle. Possible values it can hold are TASK_RUNNING, TASK_INTERRUPTIBLE, TASK_UNINTERRUPTIBLE, TASK_ZOMBIE, TASK_STOPPED, and TASK_DEAD (see the "Process Lifespan" section in this chapter for more detail).

3.2.1.2. pid

In Linux, each process has a unique process identifier (pid). This pid is stored in the task_struct as a value of type pid_t. Although this type ultimately resolves to an integer type, the default maximum value of a pid is 32,768 (the value pertaining to a short int).

3.2.1.3. flags

The flags field defines special attributes that belong to the task. Per-process flags are defined in include/linux/sched.h and include those listed in Table 3.1. The value of flags gives the kernel hacker more information about what the task is undergoing.

Table 3.1. Selected task_struct flags Field Values
Flag Name	When Set
`PF_STARTING`	Set when the process is being created.
`PF_EXITING`	Set during the call to `do_exit()`.
`PF_DEAD`	Set during the call to `exit_notify()` in the process of exiting. At this point, the state of the process is either `TASK_ZOMBIE` or `TASK_DEAD`.
`PF_FORKNOEXEC`	The parent upon forking sets this flag.

3.2.1.4. binfmt

Linux supports a number of executable formats. An executable format is what defines the structure of how your program code is to be loaded into memory. Figure 3.2 illustrates the association between the task_struct and the linux_binfmt struct, the structure that contains all the information related to a particular binary format (see Chapter 9 for more detail).

3.2.1.5. exit_code and exit_signal

The exit_code and exit_signal fields hold a task's exit value and the terminating signal (if one was used). This is the way a child's exit value is passed to its parent.

3.2.1.6. pdeath_signal

pdeath_signal is the signal sent to the process upon its parent's death.

3.2.1.7. comm

A process is often created by means of a command-line call to an executable. The comm field holds the name of the executable as it is called on the command line.

3.2.1.8. ptrace

ptrace is set when the ptrace() system call is called on the process so that it can be traced, for example, by a debugger. Possible ptrace() flags are defined in include/linux/ptrace.h.

3.2.2. Scheduling Related Fields

A process operates as though it has its own virtual CPU. However, in reality, it shares the CPU with other processes. To sustain the switching between process executions, each process closely interrelates with the scheduler (see Chapter 7 for more detail).

To understand some of these fields, you need to understand a few basic scheduling concepts. When more than one process is ready to run, the scheduler decides which one runs first and for how long. The scheduler achieves fairness and efficiency by allotting each process a timeslice and a priority. The timeslice defines the amount of time the process is allowed to execute before it is switched off for another process. The priority of a process is a value that defines the relative order in which it will be allowed to be executed with respect to other waiting processes: the higher the priority, the sooner it is scheduled to run. The fields shown in Figure 3.3 keep track of the values necessary for scheduling purposes.

Figure 3.3. Scheduling Related Fields

3.2.2.1. prio

In Chapter 7, we see that the dynamic priority of a process is a value that depends on the process' scheduling history and the specified nice value. (See the following sidebar for more information about nice values.) It is updated at sleep time, which is when the process is not being executed, and when the timeslice is used up. This value, prio, is related to the value of the static_prio field described next. The prio field holds a value within +/- 5 of static_prio, depending on the process' history; it gets a +5 bonus if it has slept a lot and a -5 handicap if it has been a processing hog and used up its timeslice.

3.2.2.2. static_prio

static_prio is equivalent to the nice value. The default value of static_prio is MAX_PRIO-20. In our kernel, MAX_PRIO defaults to 140.

Nice

The nice() system call allows a user to modify the static scheduling priority of a process. The nice value can range from -20 to 19. The nice() function then calls set_user_nice() to set the static_prio field of the task_struct. The static_prio value is computed from the nice value by way of the NICE_TO_PRIO macro. Likewise, the nice value is computed from the static_prio value by means of the PRIO_TO_NICE macro.

-----------------------------------------------------------------------
kernel/sched.c
#define NICE_TO_PRIO(nice) (MAX_RT_PRIO + nice + 20)
#define PRIO_TO_NICE(prio) ((prio) - MAX_RT_PRIO - 20)
-----------------------------------------------------------------------

3.2.2.3. run_list

The run_list field points to the runqueue. A runqueue holds a list of all the processes to run. See the "Basic Structure" section for more information on the runqueue struct.

3.2.2.4. array

The array field points to the priority array of a runqueue. The "Keeping Track of Processes: Basic Scheduler Construction" section in this chapter explains this array in detail.

3.2.2.5. sleep_avg

The sleep_avg field holds the average number of clock ticks the task has spent sleeping. It is used to calculate the effective priority of the task.

3.2.2.6. timestamp

The timestamp field is used to calculate sleep_avg when a task sleeps or yields.

3.2.2.7. interactive_credit

The interactive_credit field is used along with the sleep_avg and activated fields to calculate sleep_avg.

3.2.2.8. policy

The policy determines the type of process (for example, time sharing or real time). The type of a process heavily influences the priority scheduling. For more information on this field, see Chapter 7.

3.2.2.9. cpus_allowed

The cpus_allowed field specifies which CPUs might handle a task. This is one way in which we can specify which CPU a particular task can run on when in a multiprocessor system.

3.2.2.10. time_slice

The time_slice field defines the maximum amount of time the task is allowed to run.

3.2.2.11. first_time_slice

The first_time_slice field is repeatedly set to 0 and keeps track of the scheduling time.

3.2.2.12. activated

The activated field keeps track of the incrementing and decrementing of sleep averages. If an uninterruptible task gets woken, this field gets set to -1.

3.2.2.13. rt_priority

rt_priority is a static value that can only be updated through schedule(). This value is necessary to support real-time tasks.

3.2.2.14. nivcsw and nvcsw

Different kinds of context switches exist. The kernel keeps track of these for profiling reasons. A global switch count gets set to one of the four different context switch counts, depending on the kind of transition involved in the context switch (see Chapter 7 for more information on context switches). These are the counters for the basic context switch:

The nivcsw field (number of involuntary context switches) keeps count of kernel preemptions applied on the task. It gets incremented only upon a task's return from a kernel preemption where the switch count is set to nivcsw.
The nvcsw field (number of voluntary context switches) keeps count of context switches that are not based on kernel preemption. The switch count gets set to nvcsw if the previous state was not an active preemption.

3.2.3. Process Relations-Related Fields

The following fields of the task_struct are those related to process relationships. Each task or process p has a parent that created it. Process p can also create processes and, therefore, might have children. Because p's parent could have created more than one process, it is possible that process p might have siblings. Figure 3.4 illustrates how the task_structs of all these processes relate.

Figure 3.4. Process Relations-Related Fields

3.2.3.1. real_parent

real_parent points to the descriptor of the current process' parent. It points to the process descriptor of init if the original parent of the current process has been destroyed. In previous kernels, this was known as p_opptr.

3.2.3.2. parent

parent is a pointer to the descriptor of the parent process. In Figure 3.4, we see that this points to the ptrace task_struct. When ptrace is run on a process, the parent field of task_struct points to the ptrace process.

3.2.3.3. children

children is the struct that points to the list of our current process' children.

3.2.3.4. sibling

sibling is the struct that points to the list of the current process' siblings.

3.2.3.5. group_leader

A process can be a member of a group of processes, and each group has one process defined as the group leader. If our process is a member of a group, group_leader is a pointer to the descriptor of the leader of that group. A group leader generally owns the tty from which the process was created, called the controlling terminal.

3.2.4. Process Credentials-Related Fields

In multiuser systems, it is necessary to distinguish among processes that are created by different users. This is necessary for the security and protection of user data. To this end, each process has credentials that help the system determine what it can and cannot access. Figure 3.5 illustrates the fields in the task_struct related to process credentials.

Figure 3.5. Process Credentials-Related Fields

3.2.4.1. uid and gid

The uid field holds the user ID number of the user who created the process. This field is used for protection and security purposes. Likewise, the gid field holds the group ID of the group who owns the process. A uid or gid of 0 corresponds to the root user and group.

3.2.4.2. euid and egid

The effective user ID usually holds the same value as the user ID field. This changes if the executed program has the set UID (SUID) bit on. In this case, the effective user ID is that of the owner of the program file. Generally, this is used to allow any user to run a particular program with the same permissions as another user (for example, root). The effective group ID works in much the same way, holding a value different from the gid field only if the set group ID (SGID) bit is on.

3.2.4.3. suid and sgid

suid (saved user ID) and sgid (saved group ID) are used in the setuid() system calls.

3.2.4.4. fsuid and fsgid

The fsuid and fsgid values are checked specifically for filesystem checks. They generally hold the same values as uid and gid except for when a setuid() system call is made.

3.2.4.5. group_info

In Linux, a user may be part of more than one group. These groups may have varying permissions with respect to system and data accesses. For this reason, the processes need to inherit this credential. The group_info field is a pointer to a structure of type group_info, which holds all the information regarding the various groups of which the process can be a member.

The group_info structure allows a process to associate with a number of groups that is bound by available memory. In Figure 3.5, you can see that a field of group_info called small_block is an array of NGROUPS_SMALL (in our case, 32) gid_t units. If a task belongs to more than 32 groups, the kernel can allocate blocks or pages that hold the necessary number of gid_ts beyond NGROUPS_SMALL. The field nblocks holds the number of blocks allocated, while ngroups holds the value of units in the small_block array that hold a gid_t value.

3.2.5. Process Capabilities-Related Fields

Traditionally, UNIX systems offer process-related protection of certain accesses and actions by defining any given process as privileged (super user or UID = 0) or unprivileged (any other process). In Linux, capabilities were introduced to partition the activities previously available only to the superuser; that is, capabilities are individual "privileges" that may be conferred upon a process independently of each other and of its UID. In this manner, particular processes can be given permission to perform particular administrative tasks without necessarily getting all the privileges or having to be owned by the superuser. A capability is thus defined as a given administrative operation. Figure 3.6 shows the fields that are related to process capabilities.

Figure 3.6. Process Capabilities-Related Fields

3.2.5.1. cap_effective, cap_inheritable, cap_permitted, and keep_capabilities

The structure used to support the capabilities model is defined in include/linux/security.h as an unsigned 32-bit value. Each 32-bit mask corresponds to a capability set; each capability is assigned a bit in each of:

cap_effective. The capabilities that can be currently used by the process.
cap_inheritable. The capabilities that are passed through a call to execve.
cap_permitted. The capabilities that can be made either effective or inheritable.
One way to understand the distinction between these three types is to consider the permitted capabilities to be similar to a trivialized gene pool made available by one's parents. Of the genetic qualities made available by one's parents, we can display a subset of them (effective qualities) and/or pass them on (inheritable). Permitted capabilities constitute more of a potentiality whereas effective capabilities are an actuality.
Therefore, cap_effective and cap_inheritable are always subsets of cap_permitted.
keep_capabilities. Keeps track of whether the process will drop or maintain its capabilities on a call to setuid().

Table 3.2 lists some of the supported capabilities that are defined in include/linux/capability.h.

Table 3.2. Selected Capabilities
Capability	Description
`CAP_CHOWN`	Ignores the restrictions imposed by `chown()`
`CAP_FOWNER`	Ignores file-permission restrictions
`CAP_FSETID`	Ignores `setuid` and `setgid` restrictions on files
`CAP_KILL`	Ignores `ruid` and `euid`s when sending signals
`CAP_SETGID`	Ignores group-related permissions checks
`CAP_SETUID`	Ignores `uid`-related permissions checks
`CAP_SETCAP`	Allows a process to set its capabilities

The kernel checks whether a particular capability is set with a call to capable(), passing the capability as a parameter. Generally, the function checks whether the capability bit is set in the cap_effective set; if so, it sets the PF_SUPERPRIV bit in current->flags, which indicates that the capability is granted. The function returns 1 if the capability is granted and 0 if it is not.

Three system calls are associated with the manipulation of capabilities: capget(), capset(), and prctl(). The first two allow a process to get and set its capabilities, while the prctl() system call allows manipulation of current->keep_capabilities.

3.2.6. Process Limitations-Related Fields

A task uses a number of the resources made available by hardware and the scheduler. To keep track of how they are used and any limitations that might be applied to a process, we have the following fields.

3.2.6.1. rlim

The rlim field holds an array that provides for resource control and accounting by maintaining resource limit values. Figure 3.7 illustrates the rlim field of the task_struct.

Figure 3.7. task_struct Resource Limits

Linux recognizes the need to limit the amount of certain resources that a process is allowed to use. Because the kinds and amounts of resources processes might use varies from process to process, it is necessary to keep this information on a per process basis. What better place than to keep a reference to it in the process descriptor?

The rlimit descriptor (include/linux/resource.h) has the fields rlim_cur and rlim_max, which are the current and maximum limits that apply to that resource. The limit "units" vary by the kind of resource to which the structure refers.

-----------------------------------------------------------------------
include/linux/resource.h
struct rlimit {
    unsigned long   rlim_cur;
    unsigned long   rlim_max;
};
-----------------------------------------------------------------------

Table 3.3 lists the resources whose limits are defined in include/asm/resource.h. Although this header is architecture specific, both x86 and PPC have the same resource limits list and default values.

Table 3.3. Resource Limits Values
RL Name	Description	Default rlim_cur	Default rlim_max
`RLIMIT_CPU`	The amount of CPU time in seconds this process may run.	`RLIM_INFINITY`	`RLIM_INFINITY`
`RLIMIT_FSIZE`	The size of a file in 1KB blocks.	`RLIM_INFINITY`	`RLIM_INFINITY`
`RLIMIT_DATA`	The size of the heap in bytes.	`RLIM_INFINITY`	`RLIM_INFINITY`
`RLIMIT_STACK`	The size of the stack in bytes.	`_STK_LIM`	`RLIM_INFINITY`
`RLIMIT_CORE`	The size of the core dump file.	`0`	`RLIM_INFINITY`
`RLIMIT_RSS`	The maximum resident set size (real memory).	`RLIM_INFINITY`	`RLIM_INFINITY`
`RLIMIT_NPROC`	The number of processes owned by this process.	`0`	`0`
`RLIMIT_NOFILE`	The number of open files this process may have at one time.	`INR_OPEN`	`INR_OPEN`
`RLIMIT_MEMLOCK`	Physical memory that can be locked (not swapped).	`RLIM_INFINITY`	`RLIM_INFINITY`
`RLIMIT_AS`	Size of process address space in bytes.	`RLIM_INFINITY`	`RLIM_INFINITY`
`RLIMIT_LOCKS`	Number of file locks.	`RLIM_INFINITY`	`RLIM_INFINITY`

When a value is set to RLIM_INFINITY, the resource is unlimited for that process.

The current limit (rlim_cur) is a soft limit that can be changed via a call to setrlimit(). The maximum limit is defined by rlim_max and cannot be exceeded by an unprivileged process. The getrlimit() system call returns the value of the resource limits. Both setrlimit() and getrlimit() take as parameters the resource name and a pointer to a structure of type rlimit.

3.2.7. Filesystem- and Address Space-Related Fields

Processes can be heavily involved with files throughout their lifecycle, performing tasks such as opening, closing, reading, and writing. The task_struct has two fields that are associated with file- and filesystem-related data: fs and files (see Chapter 6, "Filesystems," for more detail). The two fields related to address space are active_mm and mm (see Chapter 4, "Memory Management," for more detail on mm_struct). Figure 3.8 shows the filesystem- and address space-related fields of the task_struct.

Figure 3.8. Filesystem- and Address Space-Related Fields

3.2.7.1. fs

The fs field holds a pointer to filesystem information.

3.2.7.2. files

The files field holds a pointer to the file descriptor table for the task. This table holds pointers to files (more specifically, to their descriptors) that the task has open.

3.2.7.3. mm

mm points to address-space and memory-management-related information.

3.2.7.4. active_mm

active_mm is a pointer to the most recently accessed address space. Both the mm and active_mm fields start pointing at the same mm_struct.

Evaluating the process descriptor gives us an idea of the type of data that a process is involved with throughout its lifetime. Now, we can look at what happens throughout the lifespan of a process. The following sections explain the various stages and states of a process and go through the sample program line by line to explain what happens in the kernel.