11.8. Local Interprocess Communication

The socket interfaces are not the only APIs that provide interprocess communication. Applications that divide up work on a single host can use semaphores, message queues, and shared memory to communicate among their processes. Each type of local IPC has different performance characteristics and provides a different form of communication. The local IPC mechanisms supported in FreeBSD 5.2 are derived from System V, as described in Bach [1986]. For this reason they are often referred to as System V semaphores, message queues, and shared memory.

Every type of IPC must make it possible for independently executing processes to rendezvous and find the resources they are sharing. This piece of information must be known to all of them and must be unique enough that no other process could come across the same information by accident. In the local IPC realm this piece of information is called a key.

A key is a long integer that is treated by the cooperating processes as an opaque piece of data, meaning that they do not attempt to decipher or attribute any meaning to it. The library routine ftok() generates a key from a pathname. As long as each process uses the same pathname, each is guaranteed to get the same key.

All of the local IPC subsystems were designed and implemented to be used in a similar way. Once a process has a key, it uses it to create or retrieve the relevant object with a subsystem-specific get call, which is similar to a file open or creat. To create an object, the IPC_CREAT flag is passed as an argument to the get call. All get calls return an integer to be used in all subsequent local IPC system calls. Just like a file descriptor, this integer identifies the object that the process is manipulating.

Each subsystem has its own way of operating on the underlying object, and these functions are described in the following sections. All control operations, such as retrieving statistics or removing a previously created object, are handled by a subsystem-specific ctl routine. A summary of all the user-level APIs is given in Table 11.6, and an excellent introduction to using them can be found in Stevens [1999].

Table 11.6. Local IPC, user-level APIs.

Subsystem        Create   Control         Communicate
-----------------------------------------------------
semaphores       semget   semctl          semop
message queues   msgget   msgctl          msgrcv, msgsnd
shared memory    shmget   shmctl, shmdt   n/a


Semaphores

A semaphore is the smallest atom of IPC available to a set of cooperating processes. Each semaphore contains a short integer that can be increased or decreased. A process that attempts to reduce the value of the semaphore below 0 will either be blocked or, if called in a nonblocking mode, will return immediately with an errno value of EAGAIN. The concept of semaphores and how they are used in multiprocess programs was originally proposed in Dijkstra & Genuys [1968].

Unlike the semaphores described in most computer science textbooks, semaphores in FreeBSD 5.2 are grouped into arrays so that the code in the kernel can protect the processes using them from deadlock. Deadlocks were discussed in terms of locking within the kernel in Section 4.3 but are discussed here as well.

With System V semaphores the deadlock occurs between two user-level processes rather than between kernel threads. A deadlock arises between two processes, A and B, when both attempt to take two semaphores, S1 and S2. If process A acquires S1 and process B acquires S2, then a deadlock occurs when process A tries to acquire S2 and process B tries to acquire S1, because there is no way for either process to give up the semaphore that the other one needs to make progress. When using semaphores, it is always important that all cooperating processes acquire and release them in the same order to avoid this situation.

The implementation of semaphores in System V protected against deadlock by forcing users of the API to group their semaphores into arrays and to submit semaphore operations as a sequence applied to the array as a whole. If the sequence submitted in the call could cause a deadlock, then an error was returned. The section on semaphores in Bach [1986] points out that this complexity should never have been placed in the kernel, but in order to adhere to the previously defined API, the same complexity exists in FreeBSD as well. At some point in the future the kernel should provide a simpler form of semaphores to replace the current implementation.

Creating and attaching to a semaphore is done with the semget system call. Although semaphores were designed to look like file descriptors, they are not stored in the file descriptor table. All the semaphores in the system are contained in a single table in the kernel, whose size and shape are described by several tunable parameters. This table is protected by the Giant lock (see Section 4.3) when entries are created in it. The lock is taken only when creating or attaching to a semaphore, so it is not a bottleneck in the actual use of existing semaphores.

Once a process has created a semaphore, or attached to a preexisting one, it calls semop to perform operations on it. The operations on the semaphore are passed to the system call as an array. Each element of the array includes the semaphore number to operate on (its index within the semaphore array created by the earlier semget call), the operation to perform, and a set of flags. Calling the value an operation is a misnomer, because it is not a command but simply a signed number. If the number is positive, then the corresponding semaphore's value is increased by that amount. If the operation is 0 and the semaphore's value is not 0, then either the process is put to sleep until the value reaches 0 or, if the IPC_NOWAIT flag was passed, EAGAIN is returned to the caller. When the operation is negative, there are several possible outcomes. If the value of the semaphore is greater than or equal to the absolute value of the operation, then the absolute value of the operation is subtracted from the semaphore and the call returns. If subtracting the absolute value of the operation would force the semaphore's value below zero, then the process is put to sleep, unless the IPC_NOWAIT flag was passed, in which case EAGAIN is returned to the caller.

All of this logic is implemented in the semop system call. The call first does some rudimentary checks to make sure that it has a chance of succeeding, including making sure there is enough memory to execute all the operations in one pass and that the calling process has the proper permissions to access the semaphore. Each semaphore ID returned to a process by the kernel has its own mutex to protect against multiple processes modifying the same semaphore at the same time. The routine locks this mutex and then attempts to perform all the operations passed to it in the array. It walks the array and attempts to perform each operation in order. There is the potential for this call to sleep before it completes all its work. If this situation occurs, then the code rolls back all its work before it goes to sleep. When it reawakens, the routine starts at the beginning of the array and attempts to do the operations again. Either the routine will complete all its work, return with an appropriate error, or go back to sleep. Rolling back all the work is necessary to keep the call atomic: either all the work is done or none of it is.

Message Queues

A message queue facilitates the sending and receiving of typed, arbitrary-length messages. The sending process adds messages at one end of the queue, and the receiving process removes messages from the other. The queue's size and other characteristics are controlled by a set of tunable kernel parameters. Message queues are inherently half duplex, meaning that one process is always the sender and the other the receiver, but there are ways to use them as a form of full-duplex communication, as we will see later.

The messages passed between the endpoints contain a type and a data area, as shown in Figure 11.14. This data structure should not be confused with the mbufs that are used by the networking code (see Section 11.3). MSGMNB is a tunable kernel parameter that defines the size of a message queue, and therefore the largest possible message that can be sent between two processes, and is set to 2048 by default.

Figure 11.14. Message data structure.


Message queues can be used to implement either a pure first-in first-out queue, where all messages are delivered in the order in which they were sent, or a priority queue, where messages with a certain type can be retrieved ahead of others. This ability is provided by the type field of the message structure.

When a process sends a message, it invokes the msgsnd system call, which checks all the arguments in the call for correctness and then attempts to get enough resources to place the message into the queue. If there aren't enough resources, and the caller did not pass the IPC_NOWAIT flag, then the caller is put to sleep until such time as resources are available. The resources come from a pool of memory that is allocated by the kernel at boot time. The pool is arranged in fixed segments whose length is defined by MSGSSZ. The memory pool is managed as a large array so the segments can be located efficiently.

The kernel data structures that control the message queues in the system are protected by a single lock (msq_mtx), which is taken and held by both msgsnd and msgrcv for the duration of their execution. The use of a single lock for both routines protects the queue from being read and written simultaneously, possibly causing data corruption. It is also a performance bottleneck because it means that all other message queues are blocked when any one of them is being used.

Once the kernel has enough resources, it copies the message into the segments in the array and updates the rest of the data structures related to this queue.

To retrieve a message from the queue, a process calls msgrcv. If the processes are using the queue as a simple FIFO, then the receiver passes 0 in the msgtype argument to this call to retrieve the first available message in the queue. To retrieve the first message in the queue of a particular type, a positive integer is passed. Processes implement a priority queue by using the type as the priority of the message. To implement a full-duplex channel, each process picks a different type, say 1 and 2. Messages of type 1 are from process A, and messages of type 2 are from process B. Process A sends messages with type 1 and receives messages with type 2, while process B does exactly the opposite.

After acquiring the message queue mutex, the receive routine finds the correct queue from which to retrieve data, and if there is an appropriate message, it returns data from the segments to the caller. If no data are available and the caller specified the IPC_NOWAIT flag, then the call returns immediately; otherwise, the calling process is put to sleep until there are data to be returned. When a message is retrieved from a message queue, its data are deallocated after they have been delivered to the receiving process.

Shared Memory

Shared memory is used when two or more processes need to communicate large amounts of data between them. Each process stores data in the shared memory just as it would within its own, per-process, memory. Care must be taken to serialize access to the shared memory so that processes do not write over each other. Hence, shared memory is often used with semaphores to synchronize read and write access.

Processes that are using shared memory are really sharing virtual memory (see Chapter 5). When a process creates a segment of shared memory by calling the shmget system call, the kernel allocates a set of virtual memory pages and places a pointer to them in the shared memory handle that is then returned to the calling process. To actually use the shared memory within a process, it must call the shmat system call, which attaches the virtual memory pages into the calling process. The attach routine uses the shared memory handle passed to it as an argument to find the relevant pages and returns an appropriate virtual address to the caller. Once this call completes, the process can then access the memory pointed to by the returned address as it would any other kind of memory.

When the process is through using the shared memory, it detaches from it using the shmdt system call. This routine does not free the associated memory, because other processes may be using it, but it removes the virtual memory mapping from the calling process.

The shared memory subsystem depends on the virtual memory system to do most of the real work (mapping pages, handling dirty pages, etc.), so its implementation is relatively simple.


   
 


The Design and Implementation of the FreeBSD Operating System
ISBN: 0201702452
Year: 2003