Section 13.7. Interprocess Communication | Linux for Programmers and Users

13.7. Interprocess Communication

In this section, I describe the data structures and algorithms that the Linux kernel uses to support basic IPC using signals and the more sophisticated forms of IPC using pipes and sockets.

13.7.1. Signals

Signals inform processes of asynchronous events like keyboard input, error conditions, and job control. Linux supports the following types of signals:

traditional UNIX signals (non-real-time)
real-time or queued signals (mandated by POSIX 1003.1b)

Using queued signals, multiple instances (i.e., duplicates) of the same signal can be delivered to a process. With traditional signals, only one instance of each signal type can be pending at any time, so further instances that arrive before the pending one is delivered are lost.

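The kernel or a process running as root can send a signal to any process. An unprivileged process can signal another process only if its real or effective UID matches the real or saved set-user-ID of the target; group IDs play no part in the check.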

The data structures that support signals are stored in the task_struct and signal_struct associated with the process. Every process includes the following data structures associated with signal handling:

a pointer to the signal_struct, which contains an array of pointers to signal handler functions (that process each signal)
a signal bitmap, used to specify, one bit per type of signal, that a particular type of signal has arrived for processing
a blocked signal bitmap, used to specify, one bit per type of signal, that a particular type of signal should not be processed yet (except SIGSTOP and SIGKILL)
a signal queue (sigqueue), used to keep track of queued signals
a process group ID, which is used when distributing signals

Note that the size of the signal bitmaps dictates how many signals can be supported. A bitmap that is one machine word in size holds 32 signals on a 32-bit architecture (e.g., PCs) and 64 signals on a 64-bit architecture like Itanium or Alpha. Current Linux kernels define each bitmap as an array of words large enough for every signal the kernel supports (_NSIG), which is 64 on most architectures, including 32-bit PCs; signals 32 and above are the real-time signals.

Blocked signals remain as pending in the signal bitmap until they are unblocked and processed or ignored. A signal handler function pointer that is SIG_DFL (zero) will cause the kernel to perform the default action for that signal (see "man 7 signal"). Set the pointer to SIG_IGN to ignore the particular signal.

13.7.1.1. setpgrp ()

setpgrp () sets the calling process's process group number to its own PID, thereby placing it in its own unique process group. A fork'ed process inherits its parent's process group. setpgrp () works by changing the process group number entry in the task_struct. The process group number is used by kill (), as you'll see later.
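
The inheritance and the effect of setpgrp () can be observed from user space. Here is a minimal sketch, with a hypothetical function name, in which a child first checks that it inherited its parent's group and then moves into a group of its own.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if a forked child inherits the parent's process group and then
   becomes the leader of its own group after setpgrp(); 0 if not; -1 on error. */
int child_gets_own_group(void)
{
    pid_t parent_group = getpgrp();
    pid_t pid = fork();
    if (pid == -1)
        return -1;

    if (pid == 0) {
        int inherited = (getpgrp() == parent_group);
        setpgrp();                          /* group number becomes our PID */
        int own = (getpgrp() == getpid());
        _exit(inherited && own ? 0 : 1);
    }

    assert(pid > 0);                        /* only the parent gets here */
    int status;
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

The child reports its findings through its exit code, which the parent collects with waitpid ().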

13.7.1.2. signal ()

signal () sets the way that a process responds to a particular type of signal. There are three options: ignore the signal, perform the default kernel action, or execute a user-installed signal handler. The entries in the signal handler array are set as follows:

If the signal is to be ignored, the entry is set to 1.
If the signal is to cause the default action, the entry is set to 0.
If the signal is to be processed using a user-installed handler, the entry is set to the address of the handler.
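
The three options can be exercised with a short sketch. The names handler_runs and count_hit are hypothetical. The sketch raises the same signal once with a handler installed and once while it is ignored, then puts back whatever disposition was there before.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>

static volatile sig_atomic_t hits = 0;

static void count_hit(int signo)
{
    (void)signo;
    hits++;
}

/* Raises signo once with a handler installed and once while it is ignored.
   Returns how many times the handler ran, or -1 on error. */
int handler_runs(int signo)
{
    /* SIGKILL and SIGSTOP can be neither caught nor ignored. */
    assert(signo != SIGKILL && signo != SIGSTOP);

    hits = 0;
    void (*previous)(int) = signal(signo, count_hit);  /* handler address */
    if (previous == SIG_ERR)
        return -1;
    raise(signo);                /* the handler runs before raise() returns */

    signal(signo, SIG_IGN);      /* entry now says "ignore" */
    raise(signo);                /* discarded */

    signal(signo, previous);     /* restore the original entry */
    return hits;
}
```

On Linux, handler_runs (SIGUSR1) should return 1: the first raise () reaches the handler and the second is ignored.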

When a signal is sent to a process, the kernel sets the appropriate bit in the receiving process's signal bitmap or adds the signal to the process's sigqueue, depending on the type of signal. If the receiving process is sleeping at an interruptible priority, it is awakened so that it may process the signal. The kernel checks a process's signal bitmap for pending signals whenever the process returns from kernel mode to user mode (for example, when returning from a system call) or when the process returns from a sleep state.

When using non-real-time signals, the pending signal bitmap does not keep a count of how many of a particular type of signal are pending. This means that if three SIGINT signals arrive in close succession, it's possible that only one of them will be noticed.
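
The difference between the bitmap and the queue can be seen by sending the same signal three times while it is blocked. The following is a minimal sketch under a few assumptions: a single-threaded process on Linux, SIGUSR1 standing in for SIGINT, and hypothetical function names.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <unistd.h>

static volatile sig_atomic_t deliveries = 0;

static void count_delivery(int signo)
{
    (void)signo;
    deliveries++;
}

/* Sends signo to this process three times while it is blocked, unblocks it,
   and returns how many times the handler ran (-1 on error). */
int deliveries_after_three_sends(int signo)
{
    struct sigaction sa = {0}, old_sa;
    sigset_t set, old_mask;
    union sigval value = {0};

    assert(signo != SIGKILL && signo != SIGSTOP);

    sa.sa_handler = count_delivery;
    sigemptyset(&sa.sa_mask);
    if (sigaction(signo, &sa, &old_sa) == -1)
        return -1;

    sigemptyset(&set);
    sigaddset(&set, signo);
    if (sigprocmask(SIG_BLOCK, &set, &old_mask) == -1)
        return -1;

    deliveries = 0;
    for (int i = 0; i < 3; i++)
        sigqueue(getpid(), signo, value);   /* bitmap bit or queue entry */

    sigprocmask(SIG_UNBLOCK, &set, NULL);   /* pending signals delivered */
    int count = deliveries;

    sigprocmask(SIG_SETMASK, &old_mask, NULL);
    sigaction(signo, &old_sa, NULL);
    return count;
}
```

On Linux, deliveries_after_three_sends (SIGUSR1) should return 1, because the bitmap records only that the signal is pending, whereas deliveries_after_three_sends (SIGRTMIN) should return 3, because each real-time signal is queued separately.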

13.7.1.3. Signals After a fork or an exec

A fork'ed process inherits the contents of its parent's signal handler array. When a process execs, the signals that were originally ignored continue to be ignored, and all others are set to their default setting. In other words, all entries equal to 1 are unchanged, and all others are set to 0.
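
The exec rule can be checked with a small experiment. This is a sketch under stated assumptions: /bin/sh exists, SIGUSR1 is not blocked, and the function names are hypothetical. The child execs a shell that sends SIGUSR1 to itself and then exits with status 7, so the shell survives only if the signal is still being ignored after the exec.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static void noop_handler(int signo)
{
    (void)signo;
}

/* Sets SIGUSR1 to the given disposition, forks, and execs a shell that
   signals itself with SIGUSR1 and then exits with status 7.
   Returns the child's wait status, or -1 on error. */
int status_after_exec(void (*disposition)(int))
{
    struct sigaction sa = {0}, old_sa;

    assert(disposition != SIG_ERR);
    sa.sa_handler = disposition;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, &old_sa) == -1)
        return -1;

    pid_t pid = fork();             /* child inherits the handler array */
    if (pid == 0) {
        execl("/bin/sh", "sh", "-c", "kill -s USR1 $$; exit 7", (char *)NULL);
        _exit(127);                 /* reached only if exec failed */
    }

    sigaction(SIGUSR1, &old_sa, NULL);
    if (pid == -1)
        return -1;

    int status;
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return status;
}
```

With status_after_exec (SIG_IGN) the shell should exit normally with status 7. With status_after_exec (noop_handler) the handler entry is reset to the default across the exec, so the shell should be terminated by SIGUSR1.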

13.7.1.4. Processing a Signal

When the kernel detects that a process has a pending signal, it either ignores it, performs the default action, or invokes a user-installed handler. To invoke the handler, it appends a new stack frame to the process's stack and modifies the process's program counter to make the receiving process act as if it had called the signal handler from its current program location. When the kernel returns the process to user mode, the process executes the handler and then returns from the function back to the previous program location. The "death of a child" signal (SIGCHLD) is processed slightly differently, as you'll see when I describe the wait () system call.

13.7.1.5. exit ()

When a process terminates, it leaves its exit code in a field (exit_code) in its task_struct, and is marked as a zombie process. This exit code is obtainable by the parent process via the wait () system call. The kernel always informs a parent process that one of its children has died by sending it a "death of child" (SIGCHLD) signal.

The reason you can't kill a zombie process by sending it a SIGKILL is that it never again looks at its signal bitmap. A zombie process is merely a task_struct in the task list, but it is no longer a running process, so nothing can check the signal bitmap and process it.
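
This behavior can be demonstrated from user space. In the following sketch (the function name is hypothetical), the parent uses waitid () with WNOWAIT to wait until the child has become a zombie without reaping it, sends the zombie a SIGKILL, and then collects the exit code, which should be unaffected.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Forks a child that exits with `code`, sends SIGKILL to the resulting
   zombie, and returns the exit code that wait reports afterward (-1 on
   error or if the child was reported as killed by a signal). */
int exit_code_after_killing_zombie(int code)
{
    assert(code >= 0 && code <= 255);

    pid_t pid = fork();
    if (pid == -1)
        return -1;
    if (pid == 0)
        _exit(code);

    /* Wait until the child has exited, but leave it as a zombie. */
    siginfo_t info;
    if (waitid(P_PID, (id_t)pid, &info, WEXITED | WNOWAIT) == -1)
        return -1;

    /* The zombie still has a task list entry, so kill() succeeds, but
       nothing is left running to act on the signal. */
    if (kill(pid, SIGKILL) == -1)
        return -1;

    int status;
    if (waitpid(pid, &status, 0) != pid)    /* reap the zombie */
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

exit_code_after_killing_zombie (42) should return 42, showing that the zombie was reported as having exited normally rather than as having been killed.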

13.7.1.6. wait ()

wait () returns in one of two conditions: either the calling process has no children, in which case it returns an error code, or one of the calling process's children has terminated, in which case it returns the child process's PID and exit code. The way that the kernel processes a wait () system call may be split up into a three-step algorithm:

1.	If a process calls wait () and doesn't have any children, wait () returns an error code.
2.	If a process calls wait () and one or more of its children are already zombies, the kernel picks a child at random, removes it from the task list, and returns its PID and exit code.
3.	If a process calls wait () and none of its children is a zombie, the wait call goes to sleep. It is awakened by the kernel when any signals are received, at which point it resumes from step 1.
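
Seen from the caller's side, the steps above can be sketched as follows. The function names are hypothetical, and the first function assumes that the calling process really has no children at the time of the call.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Step 1: with no children, wait() fails at once with ECHILD.
   Returns 1 if that is what happened. */
int wait_without_children(void)
{
    errno = 0;
    return wait(NULL) == -1 && errno == ECHILD;
}

/* Steps 2 and 3: the parent either finds the zombie immediately or sleeps
   until the child exits. Returns the child's exit code, or -1 on error. */
int wait_for_child(int code)
{
    assert(code >= 0 && code <= 255);

    pid_t pid = fork();
    if (pid == -1)
        return -1;
    if (pid == 0)
        _exit(code);

    int status;
    pid_t reaped;
    do {
        reaped = wait(&status);     /* retry if a signal woke us early */
    } while (reaped == -1 && errno == EINTR);

    if (reaped != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

For example, wait_for_child (5) should return 5, and a following call to wait_without_children () should return 1 because the only child has already been removed from the task list.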

Although this algorithm would work as it stands, there's one small problem: if a process chose to ignore SIGCHLD signals, all of its children would remain zombies, and this could clog up the task list. To avoid this problem, the kernel treats an ignored SIGCHLD signal as a special case. If a SIGCHLD signal is received and the signal is ignored, the kernel immediately removes all the parent's zombie children from the task list and then allows the wait () system call to proceed as normal. When the wait () call resumes, it doesn't find any zombie children, and so it goes back to sleep. Eventually, when the last child's death signal is ignored, the wait () system call returns with an error code to signify that the calling process has no child processes.
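
The special case is visible from user space on Linux. In this minimal sketch (the function name is hypothetical, and it assumes the caller has no other children), the parent ignores SIGCHLD, forks a child, and calls wait (). The call sleeps until the child is gone and then fails with ECHILD instead of returning the child's PID.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if wait() fails with ECHILD after the only child exits while
   SIGCHLD is ignored, 0 if it behaves otherwise, -1 on error. */
int wait_with_sigchld_ignored(void)
{
    struct sigaction sa = {0}, old_sa;

    sa.sa_handler = SIG_IGN;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGCHLD, &sa, &old_sa) == -1)
        return -1;

    pid_t pid = fork();
    if (pid == 0)
        _exit(3);                   /* this exit code is never collected */

    int result = -1;
    if (pid != -1) {
        assert(pid > 0);            /* only the parent gets here */
        errno = 0;
        pid_t reaped = wait(NULL);  /* sleeps until no children remain */
        result = (reaped == -1 && errno == ECHILD);
    }

    sigaction(SIGCHLD, &old_sa, NULL);
    return result;
}
```

The function restores the previous SIGCHLD setting before returning, so later wait () calls in the same program behave normally.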

13.7.1.7. kill ()

kill () makes use of the real user ID and process group ID fields in the task list. For example, when the following line of code is executed:

kill (0, SIGINT);

the kernel sets the bit in the signal bitmap corresponding to SIGINT in every process whose process group ID matches that of the calling process. Linux uses this facility to distribute the signals triggered by Control-C and Control-Z to all of the processes in the controlling terminal's foreground process group.
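
To try kill (0, SIGINT) safely, the sender should first move into a process group of its own, or it will also signal the shell pipeline that started it. The sketch below does that; the function name is hypothetical. A group leader forks some workers, which inherit its group, ignores SIGINT itself, signals the group, and counts how many workers died from SIGINT.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns how many of `workers` processes were terminated by a single
   kill(0, SIGINT) sent within their process group, or -1 on error. */
int signal_whole_group(int workers)
{
    assert(workers > 0 && workers < 255);   /* count travels in an exit code */

    pid_t leader = fork();
    if (leader == -1)
        return -1;

    if (leader == 0) {
        /* New process group, so that only this demo is signalled. */
        if (setpgid(0, 0) == -1)
            _exit(255);
        signal(SIGINT, SIG_DFL);            /* workers inherit the default */

        for (int i = 0; i < workers; i++) {
            pid_t w = fork();               /* worker inherits the group */
            if (w == -1)
                kill(0, SIGKILL);           /* give up: kill the whole group */
            if (w == 0)
                for (;;)
                    pause();
        }

        signal(SIGINT, SIG_IGN);            /* the leader must survive */
        kill(0, SIGINT);                    /* every process in the group */

        int killed = 0, status;
        while (wait(&status) > 0)
            if (WIFSIGNALED(status) && WTERMSIG(status) == SIGINT)
                killed++;
        _exit(killed);
    }

    int status;
    if (waitpid(leader, &status, 0) == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status) == 255 ? -1 : WEXITSTATUS(status);
}
```

signal_whole_group (3) should return 3: one call to kill () reaches all three workers because they share the leader's process group ID.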

13.7.2. Pipes

There are two kinds of pipes in Linux: unnamed pipes and named pipes (also called FIFOs). Unnamed pipes are created by pipe (), and named pipes are created using mkfifo () or mknod (), or with the mkfifo utility. In the classic UNIX design that this section follows, data written to a pipe is stored in the file system. When either kind of pipe is created, the kernel allocates a VFS inode, two open file entries, and two file descriptors. Initially, the inode describes an empty file. If the pipe is named, a hard link is made from the specified directory to the pipe's inode; otherwise, no hard link is created and the pipe remains anonymous.
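
Some of this bookkeeping can be observed with fstat () and stat (). The following sketch uses hypothetical function names, and the path passed to the second function is the caller's choice. It checks that both descriptors of an unnamed pipe refer to the same inode and that a named pipe has a directory entry.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 1 if both ends of an unnamed pipe share a single FIFO inode. */
int pipe_ends_share_inode(void)
{
    int fds[2];
    struct stat rd, wr;

    if (pipe(fds) == -1)
        return -1;
    int ok = fstat(fds[0], &rd) == 0 && fstat(fds[1], &wr) == 0 &&
             S_ISFIFO(rd.st_mode) && S_ISFIFO(wr.st_mode) &&
             rd.st_dev == wr.st_dev && rd.st_ino == wr.st_ino;
    close(fds[0]);
    close(fds[1]);
    return ok;
}

/* Creates a named pipe at `path`, checks that the directory entry is the
   one hard link to a FIFO inode, and removes it again. */
int named_pipe_has_link(const char *path)
{
    struct stat st;

    assert(path != NULL);
    if (mkfifo(path, 0600) == -1)
        return -1;
    int ok = stat(path, &st) == 0 && S_ISFIFO(st.st_mode) && st.st_nlink == 1;
    unlink(path);
    return ok;
}
```

Both functions should return 1 on Linux, provided the path given to named_pipe_has_link () does not already exist and its directory is writable.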

The kernel maintains the current write position and current read position of each pipe in its inode, rather than in the file list structure. This ensures that each byte in the pipe is read by exactly one process. It also keeps track of the number of processes reading from the pipe and writing to the pipe. As you'll soon see, it needs both of these counts to process a close () properly.

13.7.2.1. Writing to a Pipe

In the classic UNIX implementation, when data is written to a pipe, the kernel allocates disk blocks and increments the current write position as necessary, until the last direct block has been allocated. For reasons of simplicity and efficiency, a pipe is never allocated indirect blocks, thereby limiting the size of a pipe to about 40K, depending on the file system's block size. If a write to a pipe would overflow its storage capacity, the writing process writes as much as it can to the pipe and then sleeps until some of the data is drained by reader processes. If a writer tries to write past the end of the last direct block, the write position "wraps around" to the beginning of the file, starting at offset 0. Thus, the direct blocks are treated like a circular buffer. Although it might seem that using the file system for implementing pipes would be slow, remember that disk blocks are buffered in the buffer cache, and so most pipe I/O is buffered in RAM. Linux goes a step further and holds pipe data purely in kernel memory pages, again used as a circular buffer: the capacity was one page (4,096 bytes on i386) before kernel 2.6.11 and has been 16 pages (65,536 bytes with 4K pages) since.
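
The storage limit can be measured without putting a process to sleep by making the write end non-blocking, so that write () fails with EAGAIN at the point where a blocking writer would have slept. This is a minimal sketch with a hypothetical function name; the 1,024-byte chunk size is an arbitrary choice.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Fills a fresh pipe that nobody reads from and returns the number of
   bytes it accepted before a write would have blocked (-1 on error). */
long bytes_until_pipe_full(void)
{
    int fds[2];
    char chunk[1024];
    long total = 0;

    if (pipe(fds) == -1)
        return -1;

    int flags = fcntl(fds[1], F_GETFL);
    if (flags == -1 || fcntl(fds[1], F_SETFL, flags | O_NONBLOCK) == -1)
        total = -1;

    memset(chunk, 'x', sizeof chunk);
    while (total >= 0) {
        ssize_t n = write(fds[1], chunk, sizeof chunk);
        assert(n != 0);             /* a nonzero-length write never returns 0 */
        if (n > 0)
            total += n;
        else if (errno == EAGAIN)
            break;                  /* full: a blocking writer would sleep */
        else if (errno != EINTR)
            total = -1;
    }

    close(fds[0]);
    close(fds[1]);
    return total;
}
```

The value returned depends on the kernel version and configuration, so it is best treated as a measurement rather than a constant.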

13.7.2.2. Reading from a Pipe

As data is read from a pipe, its current read position is updated accordingly. The kernel ensures that the read position never overtakes the write position. If a process attempts to read from an empty pipe, it is sent to sleep until output becomes available.
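
Rather than actually putting a process to sleep, the following sketch uses poll () with a zero timeout to ask whether a read () would find data. The function names are hypothetical. It shows that a new pipe is empty, that written bytes come back in order, and that the pipe is empty again once the read position has caught up with the write position.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <poll.h>
#include <string.h>
#include <unistd.h>

/* Returns 1 if a read() on fd would find data right now, without sleeping. */
static int readable_now(int fd)
{
    struct pollfd p = { .fd = fd, .events = POLLIN };

    assert(fd >= 0);
    return poll(&p, 1, 0) == 1 && (p.revents & POLLIN) != 0;
}

/* Returns 1 if the pipe behaved as described above, 0 if not, -1 on error. */
int empty_pipe_demo(void)
{
    int fds[2];
    char buf[8] = {0};

    if (pipe(fds) == -1)
        return -1;

    int empty_before = !readable_now(fds[0]);
    int wrote = (write(fds[1], "abc", 3) == 3);
    int ready = readable_now(fds[0]);
    int got = (read(fds[0], buf, sizeof buf) == 3) && memcmp(buf, "abc", 3) == 0;
    int empty_after = !readable_now(fds[0]);   /* read caught up with write */

    close(fds[0]);
    close(fds[1]);
    return empty_before && wrote && ready && got && empty_after;
}
```

A blocking read () issued at either of the two empty points would sleep until a writer supplied more data.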

13.7.2.3. Closing a Pipe

When a pipe's file descriptor is closed, the kernel does some special processing:

It updates the count of the pipe's reader and writer processes.
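If the writer count drops to zero and there are processes trying to read from the pipe, they return from read () with an end-of-file indication (a return value of 0) once any remaining data has been read.
If the reader count drops to zero and there are processes trying to write to the pipe, they are sent a SIGPIPE signal; if that signal is ignored or caught, write () fails with the error EPIPE.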
If the reader and writer counts drop to zero, all of the pipe's blocks are deallocated and the inode's current write and read positions are reset. If the pipe is unnamed, the inode is also deallocated.
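
The effect of the two counts can be seen from user space. In this minimal sketch (function names hypothetical), the first function closes the only writer and then reads, which Linux reports as end-of-file by returning 0. The second closes the only reader and then writes, with SIGPIPE ignored so that the failure shows up as an errno value instead of terminating the process.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <unistd.h>

/* Returns what read() returns once the last writer has closed an empty pipe. */
int read_after_writer_closed(void)
{
    int fds[2];
    char c;

    if (pipe(fds) == -1)
        return -1;
    close(fds[1]);                      /* writer count drops to zero */
    ssize_t n = read(fds[0], &c, 1);
    assert(n <= 0);                     /* nothing was ever written */
    close(fds[0]);
    return (int)n;
}

/* Returns the errno set by write() once the last reader has closed,
   or 0 if the write unexpectedly succeeded, or -1 on setup failure. */
int write_errno_after_reader_closed(void)
{
    int fds[2];
    struct sigaction sa = {0}, old_sa;

    if (pipe(fds) == -1)
        return -1;
    sa.sa_handler = SIG_IGN;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGPIPE, &sa, &old_sa);   /* otherwise SIGPIPE would kill us */

    close(fds[0]);                      /* reader count drops to zero */
    errno = 0;
    ssize_t n = write(fds[1], "x", 1);
    int err = (n == -1) ? errno : 0;

    close(fds[1]);
    sigaction(SIGPIPE, &old_sa, NULL);
    return err;
}
```

read_after_writer_closed () should return 0, and write_errno_after_reader_closed () should return EPIPE.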

13.7.3. Sockets

We examined sockets from the application point of view in Chapter 12, "Systems Programming," on page 508. Now we want to see how things work in the kernel to support this.

When a socket is created using socket (), the system creates a sock structure that holds all of the information pertaining to the socket, like the socket domain and socket protocol. A VFS inode is created that points to the socket structure and it is associated with an open file in the process's file structure. The associated file descriptor is returned to the calling process.

From then on, a socket works much like a pipe. Data written to the socket is buffered by the kernel until it is read by the other end of the socket. One difference is that the data is encapsulated in a socket buffer (sk_buff) that also carries other data needed by the network layers. The file structure contains pointers to the socket operation functions that are used when performing I/O on the socket.
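
The pipe-like behavior is easy to see with a connected pair of local sockets. The following sketch uses socketpair () in the AF_UNIX domain as a convenient stand-in for a socket () plus connect () sequence; the function name is hypothetical. It checks that the descriptor's inode is of socket type and that ordinary write () and read () calls move the data from one end to the other.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <unistd.h>

/* Sends `msg` through a connected socket pair using plain write()/read().
   Returns 1 if it arrived intact on the other end, 0 if not, -1 on error. */
int socket_roundtrip(const char *msg)
{
    int sv[2];
    char buf[64] = {0};
    struct stat st;

    assert(msg != NULL && strlen(msg) > 0 && strlen(msg) < sizeof buf);
    size_t len = strlen(msg);

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
        return -1;

    /* The descriptor refers to an inode whose type is "socket". */
    int is_socket = fstat(sv[0], &st) == 0 && S_ISSOCK(st.st_mode);

    /* The kernel buffers the data until the other end reads it. */
    int sent = (write(sv[0], msg, len) == (ssize_t)len);
    int received = (read(sv[1], buf, sizeof buf) == (ssize_t)len) &&
                   memcmp(buf, msg, len) == 0;

    close(sv[0]);
    close(sv[1]);
    return is_socket && sent && received;
}
```

For example, socket_roundtrip ("ping") should return 1.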