4.7. POSIX IPC

The evolution of the POSIX standard and associated application programming interfaces (APIs) resulted in a set of industry-standard interfaces that provide the same types of facilities as the System V IPC set: shared memory, semaphores, and message queues. They are quite similar in form and function to their System V equivalents but very different in implementation.

The POSIX implementation of all three IPC facilities is built in userland libraries on top of existing IPC facilities. It uses the notion of POSIX IPC names, which essentially look like file names but need not be actual files in a file system. This POSIX name convention provides the necessary abstraction, a file descriptor, to use the Solaris file memory mapping interface, mmap(2), on which all the POSIX IPC mechanisms are built. This is very different from the System V IPC functions, for which a key value was required to fetch the proper identifier of the desired IPC resource. In System V IPC, a common method used for generating key values was the ftok(3C) (file-to-key) function, whereby a key value was generated, based on the path name of a file. POSIX eliminates the use of the key, and processes acquire the desired resource by using a file name convention.
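To make the contrast concrete, the fragment below sketches the two naming styles side by side. The file path, object name, and sizes are purely illustrative, not taken from any particular application.

    #include <sys/ipc.h>
    #include <sys/shm.h>
    #include <sys/mman.h>
    #include <fcntl.h>

    void
    compare_naming(void)
    {
            /* System V: derive a key from an existing file, then fetch an identifier. */
            key_t key   = ftok("/tmp/anchor_file", 'A');    /* hypothetical path; must exist */
            int   shmid = shmget(key, 4096, IPC_CREAT | 0600);

            /* POSIX: the name itself locates the object; no key step is needed. */
            int   fd = shm_open("/myseg", O_CREAT | O_RDWR, 0600);

            (void) shmid;
            (void) fd;
    }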

No kernel tuneable parameters are required (or available) for the POSIX IPC code. The per-process limits of the number of open files and memory address space are the only potentially limiting factors in POSIX IPC.

Table 4.8 lists the POSIX APIs for the three IPC facilities.

Table 4.8. POSIX IPC Interfaces

Semaphores       Message Queues    Shared Memory
sem_open         mq_open           shm_open
sem_close        mq_close          shm_unlink
sem_unlink       mq_unlink
sem_init         mq_getattr
sem_destroy      mq_setattr
sem_wait         mq_send
sem_trywait      mq_receive
sem_post         mq_notify
sem_getvalue


All the POSIX IPC functions are either directly or indirectly based on memory mapped files. The message queue and semaphore functions make direct calls to mmap(2), creating a memory mapped file, based on the file descriptor returned from the xx_open(3R) call. Using POSIX shared memory requires the programmer to make the mmap(2) call explicitly from the application code.

The details of mmap(2) and memory mapped files are covered in subsequent chapters, but, briefly, the mmap(2) system call maps a file or some other named object into a process's address space, as shown in Figure 4.3. The address space mapping created by mmap(2) can be private or shared. It is the shared mapping capability that the POSIX IPC implementation relies on.

Figure 4.3. Process Address Space with mmap(2)


4.7.1. POSIX Shared Memory

The POSIX shared memory interfaces provide an API for support of the POSIX IPC name abstraction. The interfaces shm_open(3R) and shm_unlink(3R) do not allocate or map memory into a calling process's address space. The programmer using POSIX shared memory must create the address space mapping with an explicit call to mmap(2). Different processes that must access the same shared segment can execute shm_open(3R) on the same object, for example, shm_open("seg1",...), and then execute mmap(2) on the file descriptor returned from shm_open(3R). Any writes to the shared segment are directed to the underlying file and thus made visible to other processes that mmap(2) a descriptor obtained by opening the same file or, in this case, the same POSIX object name.
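The following sketch illustrates that sequence: shm_open(3R) supplies only the descriptor, and the explicit mmap(2) call creates the shared mapping. The object name, size, and message are illustrative; on older Solaris releases the program would typically be linked with the POSIX/realtime library (-lposix4, later -lrt).

    #include <sys/mman.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <stdio.h>

    #define SEG_SIZE 4096                   /* illustrative segment size */

    int
    main(void)
    {
            /* shm_open(3R) returns a descriptor; it does not map anything. */
            int fd = shm_open("/seg1", O_CREAT | O_RDWR, 0600);
            if (fd == -1) {
                    perror("shm_open");
                    return (1);
            }

            /* A newly created object has zero size; set it before mapping. */
            if (ftruncate(fd, SEG_SIZE) == -1) {
                    perror("ftruncate");
                    return (1);
            }

            /* The explicit mmap(2) call creates the shared address space mapping. */
            char *addr = mmap(NULL, SEG_SIZE, PROT_READ | PROT_WRITE,
                MAP_SHARED, fd, 0);
            if (addr == MAP_FAILED) {
                    perror("mmap");
                    return (1);
            }
            (void) close(fd);               /* the mapping remains valid */

            /* Writes land in the underlying object, visible to other mappers. */
            (void) sprintf(addr, "hello from pid %ld", (long)getpid());

            (void) munmap(addr, SEG_SIZE);
            /* shm_unlink("/seg1") would remove the object when no longer needed. */
            return (0);
    }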

Under the covers, the shm_open(3R) call invokes open() to open the named object (file). shm_unlink(3R) also uses the unlink(2) system call to remove the directory entry. That is, the file (object) is removed.

4.7.2. POSIX Semaphores

The POSIX specification provides for two types of semaphores that can be used for the same purposes as System V semaphores but that are implemented differently. POSIX named semaphores follow the POSIX IPC name convention discussed earlier and are created with the sem_open(3R) call. POSIX also defines unnamed semaphores, which do not have a name in the file system space and are memory based. Additionally, a set of semaphore interfaces that are part of the Solaris threads library provides the same level of functionality as POSIX unnamed semaphores but uses a different API. Table 4.9 lists the different semaphore interfaces that currently ship with Solaris.

Table 4.9. Solaris Semaphore APIs

Origin or Type     Interfaces                                        Library      Manual Section
System V           semget(), semctl(), semop()                       libc         Section (2)
POSIX named        sem_open(), sem_close(), sem_unlink(),            libposix4    Section (3R)
                   sem_wait(), sem_trywait(), sem_post(),
                   sem_getvalue()
POSIX unnamed      sem_init(), sem_destroy(), sem_wait(),            libposix4    Section (3R)
                   sem_trywait(), sem_post(), sem_getvalue()
Solaris threads    sema_init(), sema_destroy(), sema_wait(),         libthread    Section (3T)
                   sema_trywait(), sema_post()


Note the common functions for named and unnamed POSIX semaphores: the actual semaphore operations (sem_wait(3R), sem_trywait(3R), sem_post(3R), and sem_getvalue(3R)) are used for both types of semaphores. The creation and destruction interfaces are different. The Solaris implementation of the POSIX sem_init(3R), sem_destroy(3R), sem_wait(3R), sem_trywait(3R), and sem_post(3R) functions actually invokes the Solaris threads library functions of the same name through a jump-table mechanism in the Solaris POSIX library. The jump table is a data structure that contains function pointers to semaphore routines in the Solaris threads library, libthread.so.1.

The use of POSIX named semaphores begins with a call to sem_open(3R), which returns a pointer to an object defined in the /usr/include/semaphore.h header file, sem_t. The sem_t structure defines what a POSIX semaphore looks like, and subsequent semaphore operations reference the sem_t object. The fields in the sem_t structure include a count (sem_count), a semaphore type (sem_type), and a magic number (sem_magic). sem_count reflects the actual semaphore value. sem_type defines the scope or visibility of the semaphore, either USYNC_THREAD, which means the semaphore is visible only to other threads in the same process, or USYNC_PROCESS, which means the semaphore is visible to other processes running on the same system. sem_magic is simply a value that uniquely identifies the synchronization object type as a semaphore rather than a condition variable, mutex lock, or reader/writer lock (see /usr/include/synch.h).
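A simplified sketch of those fields is shown below. It is not the exact header definition (field types and padding differ in /usr/include/semaphore.h); it is only an illustration of the layout just described.

    /* Simplified illustration only; see /usr/include/semaphore.h for the real sem_t. */
    typedef struct posix_sem_sketch {
            unsigned int    sem_count;      /* current semaphore value */
            unsigned short  sem_type;       /* USYNC_THREAD or USYNC_PROCESS */
            unsigned short  sem_magic;      /* marks the object as a semaphore */
            /* remaining space (padding) holds a mutex and condition variable */
    } posix_sem_sketch_t;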

Semaphores within the same process are maintained by the POSIX library code on a linked list of semaddr structures. The structure fields and linkage are illustrated in Figure 4.4.

Figure 4.4. POSIX Named Semaphores


The linked list exists within the process's address space, not in the kernel. semheadp points to the first semaddr structure on the list, and sad_next provides the pointer for support of a singly linked list. The character array sad_name[] holds the object name (file name), sad_addr points to the actual semaphore, and sad_inode contains the inode number of the file that was passed in the sem_open(3R) call. Here is the sequence of events.

  1. When entered, sem_open(3R) obtains a file lock on the passed file argument, using the pos4obj_lock() internal interface.

  2. Once the lock is acquired, pos4obj_open() and underlying routines open the file and return a file descriptor.

  3. If this is a new semaphore, the file is truncated with ftruncate(3C) to the size of a sem_t structure (it does not need to be any larger than that).

  4. If the process is opening an existing semaphore, then the linked list is searched, beginning at semheadp, until the inode number of the file argument to sem_open(3R) matches the sad_inode field of one of the semaddr structures, which means the code has found the desired semaphore.

  5. Once the semaphore is found, the code returns sad_addr, a pointer to the semaphore, to the calling program.

The POSIX semaphore code uses the /tmp file system for the creation and storage of the files that the code memory maps, according to the name argument passed in the sem_open(3R) call. For each semaphore, a lock file and a data file are created in /tmp, with a file name prefix of .SEML for the lock file and .SEMD for the data file. The full file name is the prefix plus the string passed as an argument to sem_open(3R), without the leading slash character. For example, if a sem_open(3R) call was issued with "/sem1" as the first argument, the resulting file names in /tmp would be .SEMLsem1 and .SEMDsem1. This file name convention is used in the message queue code as well, as we'll see shortly.
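As a usage sketch under that naming convention, the fragment below opens (or creates) the named semaphore "/sem1", uses it to bracket a critical section, and then closes and unlinks it. The name and initial value are illustrative; on older Solaris releases the program would be linked with -lposix4.

    #include <semaphore.h>
    #include <fcntl.h>
    #include <stdio.h>

    int
    main(void)
    {
            /* Creates /tmp/.SEMLsem1 and /tmp/.SEMDsem1 under the covers. */
            sem_t *sp = sem_open("/sem1", O_CREAT, 0600, 1);
            if (sp == SEM_FAILED) {
                    perror("sem_open");
                    return (1);
            }

            (void) sem_wait(sp);            /* enter the critical section */
            /* ... access the shared resource ... */
            (void) sem_post(sp);            /* leave the critical section */

            (void) sem_close(sp);           /* semaphore still exists in the system */
            (void) sem_unlink("/sem1");     /* remove it when no longer needed */
            return (0);
    }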

If a new semaphore is being created, the following events occur.

  1. Memory for a semaddr structure is malloc'd, the passed file descriptor is mmap'd, the semaphore (sem_t) fields and semaddr fields are initialized, and the file descriptor is closed.

    Part of the initialization process is done with the jump table and a call to sema_init(). (sema_init() is used for semaphore calls from the Solaris threads library, libthread, and also used for POSIX unnamed semaphores.) sema_init() is passed a pointer to a sem_t (either from the user code or when invoked from sem_open(3R), as is the case here), an initial semaphore value, and a type.

  2. The fields in sem_t are set according to the passed arguments, and the code returns. If a type is not specified, the type is set to USYNC_PROCESS.

The sem_t structure contains two additional fields not shown in the diagram. In semaphore.h, they are initialized as extra space in the structure (padding). The space stores a mutex lock and condition variable used by the library code to synchronize access to the semaphore and to manage blocking on a semaphore that's not available to a calling thread.
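For comparison, a minimal sketch of an unnamed semaphore follows. It places the sem_t in an anonymous shared mapping (assuming MAP_ANON is available) so that a forked child would see the same semaphore, which corresponds to the USYNC_PROCESS behavior described above.

    #include <semaphore.h>
    #include <sys/mman.h>
    #include <stdio.h>

    int
    main(void)
    {
            /* Anonymous shared mapping, inherited across fork(2). */
            sem_t *sp = mmap(NULL, sizeof (sem_t), PROT_READ | PROT_WRITE,
                MAP_SHARED | MAP_ANON, -1, 0);
            if (sp == MAP_FAILED) {
                    perror("mmap");
                    return (1);
            }

            /* Nonzero second argument: share the semaphore between processes. */
            if (sem_init(sp, 1, 1) == -1) {
                    perror("sem_init");
                    return (1);
            }

            (void) sem_wait(sp);
            /* ... critical section ... */
            (void) sem_post(sp);

            (void) sem_destroy(sp);
            return (0);
    }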

The remaining semaphore operations follow the expected, documented behavior for using semaphores in code.

  1. sem_close(3R) frees the allocated space for the semaddr structure and unmaps the mmap'd file.

    Once closed, the semaphore is no longer accessible to the process, but it still exists in the system, similar to what happens with a file close.

  2. sem_unlink(3R) removes the semaphore from the system.

4.7.3. POSIX Message Queues

POSIX message queues are constructed on a linked list built by the internal libposix4 library code. Several data structures are defined in the implementation, as shown in Figure 4.5. We opted not to show every member of the message queue structure, in the interests of space and readability.

Figure 4.5. POSIX Message Queue Structures


The essential interfaces for using message queues are mq_open(3R), which opens, or creates and opens, a queue, making it available to the calling process, and mq_send(3R) and mq_receive(3R), which send and receive messages. Other interfaces (see Table 4.8) manage queues and set attributes, but our discussion focuses on the message queue infrastructure, built on the open, send, and receive functions.

A POSIX message queue is described by a message queue header, a data structure created and initialized when the message queue is first created. The message queue header contains information on the queue, such as the total size in bytes (mq_totsize), maximum size of each message (mq_maxsz), maximum number of messages allowed on the queue (mq_maxmsq), current number of messages (mq_current), current number of threads waiting to receive messages (mq_waiters), and the current maximum message priority (mq_curmaxprio).

Some attributes are tuneable with mq_setattr(3R). The library code sets default values of 128 for the maximum number of messages, 1024 bytes for the maximum size of a single message, and 32 for the maximum number of message priorities. If necessary, you can increase the message size and number of messages by using mq_setattr(3R), or you can increase them initially when the queue is created, by populating an attributes structure and passing it on the mq_open(3R) call.
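A hedged sketch of overriding those defaults at creation time follows; the queue name and limit values are illustrative.

    #include <mqueue.h>
    #include <fcntl.h>
    #include <stdio.h>

    int
    main(void)
    {
            struct mq_attr attr;

            attr.mq_flags   = 0;
            attr.mq_maxmsg  = 256;          /* default is 128 messages */
            attr.mq_msgsize = 2048;         /* default is 1024 bytes per message */
            attr.mq_curmsgs = 0;

            /* The attributes argument is applied when the queue is first created. */
            mqd_t mqd = mq_open("/mq1", O_CREAT | O_RDWR, 0600, &attr);
            if (mqd == (mqd_t)-1) {
                    perror("mq_open");
                    return (1);
            }
            (void) mq_close(mqd);
            return (0);
    }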

The message pointers, mq_headpp and mq_tailpp, in the header do not point directly to the messages on the linked list; that is, they do not contain the addresses of the message headers. Because each process that maps the shared region may see the mapping at a different virtual address within its own address space, mq_headpp and mq_tailpp are implemented as offsets into the shared region rather than as absolute addresses.
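A minimal sketch of that offset technique is shown below; the macro name is illustrative, not the library's.

    /*
     * Convert a stored offset into a pointer relative to this process's
     * mapping of the shared region (mqhp is the local base address).
     */
    #define MQ_OFFSET_TO_PTR(mqhp, off)     ((void *)((char *)(mqhp) + (off)))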

A message descriptor maintains additional information about the queue, such as the file permission flags (read-only or read/write) and the magic number identifying the type of POSIX named object. A second structure (mq_dn) maintains per-process flags on the message queue, allowing different processes to specify either blocking or nonblocking behavior on the message queue files. This is analogous to regular file flags, for which a file descriptor for an open file is maintained at the process level, and different processes can have different flags set on the same file. (For example, one process could have the file opened for read/write and another process could have the same file opened read-only.)

With the big picture in place (Figure 4.5), let's look at what happens when a message queue is created and opened.

  1. When mq_open() is entered, it creates the lock file and acquires a file lock. All the message queue files use the /tmp directory and follow a file name convention similar to that described in the semaphore section. That is, file names begin with a prefix (.MQD for the data file, .MQL for the lock file, .MQP for the permission file, or .MQN for the description file) and end with the file name passed as an argument to mq_open(3R), minus the leading slash character.

  2. If a new message queue is being created, the maximum message size and messages per-queue sizes are set, either with the default values or from a passed attributes structure in the mq_open(3R) call.

  3. The permission file is opened, permissions are verified, and the file is closed.

  4. The total amount of space needed for messages, based on the limits and structure size, is calculated, and the data file is created, opened, and set to the appropriate size with ftruncate(3C).

  5. If a new message queue is not being created, then an existing queue is being opened, in which case the permission test is done and the queue data file is tested to ensure the queue has been initialized.

    The steps described next apply to a new or existing message queue; the latter case is a queue being opened by another process.

  6. Space for a message queue descriptor is malloc'd (mqdes_t), and the data file is mmap'd into a shared address space, setting the mqhp pointer (Figure 4.5) as the return address from the mmap(2) call.

  7. The message queue descriptor file is created, opened, mmap'd (also into a shared address space), and closed.

  8. For new message queues, the mq_init() function (not part of the API) is called to complete the initialization process. Each message queue header has several semaphores (not shown in Figure 4.5) that are used to synchronize access to the messages, the header structure, and other areas of the queue infrastructure.

    mq_init() initializes the semaphores with calls to sem_init(), which is part of the libposix4.so library.

  9. The queue head (mq_headpp), tail (mq_tailpp), and free (mq_freep) pointers are set on the message header structure, and mq_init() returns to mq_open(), completing the open process.

    Once a queue is established, processes insert and remove messages by using mq_send(3R) and mq_receive(3R).

  10. mq_send(3R) does some up-front tests on the file type (mqd_magic) and tests the mq_notfull semaphore for space on the queue.

  11. If the process's queue flag is set for nonblocking mode, sem_trywait() is called and returns to the process if the semaphore is not available, meaning there's no space on the queue.

  12. Otherwise, sem_wait() is called, causing the process to block until space is available.

  13. Once space is available, sem_wait() is called to acquire the mq_exclusive mutex, which protects the queue during message insertions and removals.
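The nonblocking path in steps 11 and 12 surfaces to the application as an EAGAIN error. A hedged sketch follows, assuming the queue descriptor was opened with O_NONBLOCK.

    #include <mqueue.h>
    #include <errno.h>
    #include <string.h>
    #include <stdio.h>

    /* Attempt a send on a descriptor opened with O_NONBLOCK. */
    void
    try_send(mqd_t mqd, const char *msg)
    {
            if (mq_send(mqd, msg, strlen(msg) + 1, 0) == -1) {
                    if (errno == EAGAIN)
                            (void) fprintf(stderr, "queue full, message not sent\n");
                    else
                            perror("mq_send");
            }
    }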

POSIX message queues offer an interesting feature that is not available with System V message queues: automatic notification to a process or thread when a message has been added to a queue. An mq_notify(3R) call can be issued by a process that needs to be notified of the arrival of a message. To continue with the send sequence:

  1. mq_send() checks to determine if a notification has been set up by testing the mq_sigid structure's sn_pid field. If it is non-NULL, the process has requested notification, and a notification signal is sent if no other processes are already blocked, waiting for a message.

  2. Finally, the library's internal mq_putmsg() function is called to locate the next free message block of the free list (mq_freep) and to place the message on the queue.
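A hedged sketch of registering for that notification is shown below; it asks for SIGUSR1 when a message arrives on an empty queue. The registration is one-shot and must be rearmed after each notification.

    #include <mqueue.h>
    #include <signal.h>
    #include <string.h>
    #include <stdio.h>

    int
    request_notify(mqd_t mqd)
    {
            struct sigevent se;

            (void) memset(&se, 0, sizeof (se));
            se.sigev_notify = SIGEV_SIGNAL;         /* deliver a signal ... */
            se.sigev_signo  = SIGUSR1;              /* ... specifically SIGUSR1 */

            if (mq_notify(mqd, &se) == -1) {
                    perror("mq_notify");
                    return (-1);
            }
            return (0);
    }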

For receiving messages:

  1. mq_receive() issues a sem_trywait() call on the mq_notempty semaphore.

  2. If the queue is empty and the descriptor has been set to nonblock, sem_trywait() returns with an EAGAIN error to the caller.

  3. Otherwise, the mq_rblocked semaphore is incremented (sem_post), and sem_wait() is called.

  4. Once a message shows up on the queue, the mq_exclusive semaphore is acquired, and the internal mq_getmsg() function is called.

  5. The next message is pulled off the head of the queue, and the pointers are appropriately adjusted.

Our description omits some subtle details, mostly around the priority mechanism available for POSIX message queues. A message priority can be specified in the mq_send(3R) and mq_receive(3R) calls. Messages with better priorities (larger numeric values) are inserted into the queue before messages of lower priority, so higher-priority messages are kept at the front of the queue and are removed first.
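To tie the send, receive, and priority behavior together, here is a hedged end-to-end sketch; the queue name and message contents are illustrative, and default attributes (128 messages of up to 1024 bytes) are assumed.

    #include <mqueue.h>
    #include <fcntl.h>
    #include <string.h>
    #include <stdio.h>

    int
    main(void)
    {
            char buf[1024];                 /* must be at least mq_msgsize bytes */
            unsigned int prio;

            mqd_t mqd = mq_open("/mq1", O_CREAT | O_RDWR, 0600, NULL);
            if (mqd == (mqd_t)-1) {
                    perror("mq_open");
                    return (1);
            }

            /* A routine message at priority 0 and an urgent one at priority 10. */
            (void) mq_send(mqd, "routine", strlen("routine") + 1, 0);
            (void) mq_send(mqd, "urgent", strlen("urgent") + 1, 10);

            /* Higher priorities sit at the front of the queue: "urgent" comes off first. */
            if (mq_receive(mqd, buf, sizeof (buf), &prio) != -1)
                    (void) printf("received \"%s\" at priority %u\n", buf, prio);

            (void) mq_close(mqd);
            (void) mq_unlink("/mq1");
            return (0);
    }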



