4.1. The System V IPC Framework

The System V interprocess communication (IPC) facilities provide three services (message queues, semaphore arrays, and shared memory segments), which are managed by file-system-like namespaces. Unlike a file system, these namespaces aren't mounted and accessible via a path. Instead, a special API interacts with the different facilities (nothing precludes a VFS-based interface, but the standards require the special APIs). Furthermore, these special APIs don't use file descriptors, nor do they have an equivalent. This means that every operation that acts on an object needs to perform the equivalent of a lookup, which in turn means that every operation can fail if the specified object doesn't exist in the facility's namespace.

4.1.1. IPC Objects

Each object in a namespace has a unique ID, which the system assigns and uses to identify the object when performing operations on it. An object can also have a key, which is selected by the user at allocation time and is used as a primitive rendezvous mechanism. An object without a key is said to have a "private" key.

To perform an operation on an object given its key, you first perform a lookup and obtain its ID. The ID is then used to identify the object when the operation is performed. If the object has a private key, the ID must be known or obtained by other means.
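The key-to-ID lookup can be tried from user space with the standard msgget(2) and msgctl(2) calls. The following is a minimal sketch, not kernel code: create_keyed_queue and lookup_queue are hypothetical helper names, and starting the key search at the process ID is only an assumption made to find an unused key.

```c
#include <assert.h>
#include <errno.h>
#include <sys/ipc.h>
#include <sys/msg.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Creates a message queue under a key that no other object uses.
 * Returns the new ID and stores the chosen key in *keyp, or returns -1.
 */
int create_keyed_queue(key_t *keyp)
{
    key_t key = (key_t)getpid();    /* arbitrary starting point */

    assert(keyp != NULL);
    for (int tries = 0; tries < 1000; tries++, key++) {
        if (key == IPC_PRIVATE)
            continue;               /* IPC_PRIVATE is not a real key */
        /* IPC_EXCL makes the call fail if the key already exists. */
        int id = msgget(key, IPC_CREAT | IPC_EXCL | 0600);
        if (id != -1) {
            *keyp = key;
            return id;
        }
        if (errno != EEXIST)
            return -1;              /* a real failure, not a collision */
    }
    return -1;
}

/* Pure lookup: returns the ID for an existing key, or -1 (ENOENT). */
int lookup_queue(key_t key)
{
    return msgget(key, 0);
}
```

A second process that knows only the key would call lookup_queue to obtain the same ID; an object created with IPC_PRIVATE has no key to look up, so its ID has to be passed along some other way.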

Each object in the namespace has a creator UID and GID, as well as an owner UID and GID. Both are initialized with the RUID and RGID of the process that created the object. The creator or current owner can change the owner of the object. Each object in the namespace has a set of file-like permissions, which, in conjunction with the creator and owner UID and GID, control read and write access to the object (execute is ignored). Each object also has a creator project, which accounts for the object's resource usage.

All three facilities have five operations in common: GET, SET, STAT, RMID, and IDS:

GET, like open, allocates a new object or obtains an existing one (using its key). It takes a key, a set of flags and mode bits, and, optionally, facility-specific arguments. If the key is IPC_PRIVATE, a new object with the requested mode bits and facility-specific attributes is created. If the key isn't IPC_PRIVATE, the GET attempts to look up the specified key and either returns the existing object or creates a new one under that key, depending on the state of the IPC_CREAT and IPC_EXCL flags, much like open. If GET needs to allocate an object, it can fail if there is insufficient space in the namespace (the maximum number of IDs for the facility has been exceeded) or if the facility-specific initialization fails. If GET finds an object it can return, it can still fail if that object's permissions or facility-specific attributes are less than those requested.
SET adjusts facility-specific parameters of an object, in addition to the owner UID and GID and mode bits. It can fail if the caller isn't the creator or owner.
STAT obtains information about an object, including the general attributes as well as facility-specific information. It can fail if the caller doesn't have read permission.
RMID removes an object from the namespace. Subsequent operations using the object's ID or key will fail (until another object is created with the same key or ID). Since an RMID can be performed asynchronously with other operations, it is possible that other threads or processes will have references to the object. While a facility may have actions that need to be performed at RMID time, only when all references are dropped can the object be destroyed. RMID fails if the caller isn't the creator or owner.
IDS obtains a list of all IDs in a facility's namespace. There are no facility-specific behaviors of IDS.
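From user space, four of these operations map onto the standard shmget(2) and shmctl(2) calls. The sketch below walks one private shared memory segment through GET, STAT, SET, and RMID; shm_lifecycle is a hypothetical name, and the mode values are arbitrary choices for illustration. IDS is left out because it has no equally portable user-level call.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/types.h>

/*
 * Walks one private segment through GET, STAT, SET, and RMID.
 * Returns 0 if every step behaved as described, -1 otherwise.
 * On success *seen_size holds the segment size reported by STAT.
 */
int shm_lifecycle(size_t size, size_t *seen_size)
{
    struct shmid_ds ds;
    int ok = 0;

    assert(seen_size != NULL);

    /* GET: IPC_PRIVATE always allocates a new object. */
    int id = shmget(IPC_PRIVATE, size, IPC_CREAT | 0600);
    if (id == -1)
        return -1;

    /* STAT: requires read permission on the object. */
    if (shmctl(id, IPC_STAT, &ds) == 0) {
        *seen_size = ds.shm_segsz;
        /* SET: only the creator or owner may change the mode bits. */
        ds.shm_perm.mode = 0400;
        ok = (shmctl(id, IPC_SET, &ds) == 0);
    }

    /* RMID: remove the ID from the namespace, even if a step failed. */
    if (shmctl(id, IPC_RMID, NULL) != 0)
        return -1;

    /* The ID is gone, so a further operation on it must fail. */
    if (shmctl(id, IPC_STAT, &ds) != -1)
        return -1;

    return ok ? 0 : -1;
}
```

Every call after the GET passes the ID again, which illustrates the point made earlier: there is no descriptor, so each operation repeats the lookup and can fail if the object has disappeared.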

4.1.2. IPC Framework Design

Because some IPC facilities provide services whose operations must scale, a mechanism that allows fast, concurrent access to individual objects is needed. Of primary importance is object lookup based on ID (SET, STAT, others). Allocation (GET), deallocation (RMID), ID enumeration (IDS), and key lookups (GET) are lesser concerns but should be implemented in such a way that ID lookup isn't affected (at least not in the common case).

Starting from the bottom up, each object is represented by a structure, the first member of which must be a kipc_perm_t. The kipc_perm_t contains the information described above in Section 4.1.1, a reference count (since the object may continue to exist after it has been removed from the namespace), as well as some additional metadata that manages data structure membership. These objects are dynamically allocated.
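The role of the reference count can be illustrated with a small user-space model. This is only a sketch under stated assumptions: obj_t and its functions are hypothetical names (not the kernel's kipc_perm_t interfaces), the namespace is assumed to hold one reference of its own while the object is visible, and locking is omitted entirely.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct obj {
    unsigned refs;              /* namespace reference plus active users */
    bool     visible;           /* still reachable by ID or key? */
    void   (*dtor)(struct obj *);
} obj_t;

int destroyed_count;            /* bumped by the example destructor */

void counting_dtor(obj_t *o)
{
    (void)o;
    destroyed_count++;
}

/* A new object starts with the namespace's own reference. */
void obj_init(obj_t *o, void (*dtor)(obj_t *))
{
    o->refs = 1;
    o->visible = true;
    o->dtor = dtor;
}

void obj_hold(obj_t *o)
{
    assert(o->refs > 0);
    o->refs++;
}

/* Drops one reference; returns true if the destructor was called. */
bool obj_rele(obj_t *o)
{
    assert(o->refs > 0);
    if (--o->refs == 0) {
        o->dtor(o);
        return true;
    }
    return false;
}

/* RMID: hide the object, then drop the namespace's reference. */
bool obj_rmid(obj_t *o)
{
    assert(o->visible);
    o->visible = false;
    return obj_rele(o);
}
```

If a user still holds a reference when obj_rmid runs, the object becomes invisible but survives; the destructor runs only when that last user calls obj_rele.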

Above the objects is a power-of-2 sized table of ID slots. Each slot contains a pointer to an object, a sequence number, and a lock. An object's ID is a function of its slot's index in the table and its slot's sequence number. Every time a slot is released (by RMID), its sequence number is increased. Strictly speaking, the sequence number is unnecessary. However, checking the sequence number after a lookup provides a certain degree of robustness against the use of stale IDs (useful since nothing else does). When the table fills up, it is resized (see Section 4.1.3).

Of an ID's 31 bits (an ID is, as defined by the standards, a signed int) the top IPC_SEQ_BITS are used for the sequence number with the remainder holding the index into the table. The size of the table is therefore bounded at 2 ^ (31 - IPC_SEQ_BITS) slots.
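The bit layout can be made concrete with a few helpers. This is a sketch, not the kernel's code: the function names and the IPC_SEQ_SHIFT, IPC_SEQ_MASK, and IPC_MAX_SLOTS macros are assumptions introduced here, and IPC_SEQ_BITS is set to 28 only to match the example in Figure 4.1.

```c
#include <assert.h>
#include <stdint.h>

#define IPC_SEQ_BITS   28                        /* value from Figure 4.1 */
#define IPC_SEQ_SHIFT  (31 - IPC_SEQ_BITS)       /* bits left for the index */
#define IPC_SEQ_MASK   ((1u << IPC_SEQ_BITS) - 1)
#define IPC_MAX_SLOTS  (1u << IPC_SEQ_SHIFT)     /* upper bound on table size */

/* Builds an ID: sequence number in the top bits, slot index below it. */
int id_make(uint32_t seq, uint32_t index)
{
    assert(index < IPC_MAX_SLOTS);
    return (int)(((seq & IPC_SEQ_MASK) << IPC_SEQ_SHIFT) | index);
}

/* Extracts the slot index from an ID. */
uint32_t id_index(int id)
{
    return (uint32_t)id & (IPC_MAX_SLOTS - 1);
}

/* Extracts the sequence number from an ID. */
uint32_t id_seq(int id)
{
    return (uint32_t)id >> IPC_SEQ_SHIFT;
}

/* Stale-ID check: does the ID carry the slot's current sequence number? */
int id_is_current(int id, uint32_t slot_seq)
{
    return id_seq(id) == (slot_seq & IPC_SEQ_MASK);
}
```

With these values, sequence number 5 in slot 2 yields ID 42 (5 shifted left by 3, plus 2). Once that slot is released and its sequence number becomes 6, the old ID 42 no longer passes id_is_current.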

Managing this table is the ipc_service structure. It contains a pointer to the dynamically allocated ID table, a namespace-global lock, an id_space for managing the free space in the table, and sundry other metadata necessary for the maintenance of the namespace. An AVL tree of all keyed objects in the table (sorted by key) is used for key lookups. An unordered doubly linked list of all objects in the namespace (keyed or not) is maintained to facilitate ID enumeration.

To help visualize these relationships, Figure 4.1 illustrates a namespace with a table of size 8 containing three objects (IPC_SEQ_BITS = 28).

Figure 4.1. IPC Namespace Example

4.1.3. Locking

Three locks (or sets of locks) ensure correctness: the slot locks, the namespace lock, and p_lock (needed when checking resource controls). Their ordering is

namespace lock -> slot lock 0 -> ... -> slot lock t -> p_lock

Generally, the namespace lock protects allocation and removal from the namespace, ID enumeration, and resizing the ID table. Specifically,

Write access to all fields of the ipc_service structure; read access to all variable fields of ipc_service except ipcs_tabsz (table size) and ipcs_table (the table pointer)
Read/write access to ipc_avl, ipc_list in visible objects' kipc_perm structures (that is, objects that have been removed from the namespace don't have this restriction); write access to ipct_seq and ipct_data in the table entries

A slot lock by itself is meaningless (except when resizing). Of greater interest conceptually is the notion of an ID lock: a "virtual lock" that refers to whichever slot lock an object's ID currently hashes to.

An ID lock protects all objects with that ID. Normally, there will only be one such object: the one pointed to by the locked slot. However, if an object is removed from the namespace but retains references (for example, an attached shared memory segment that has been RMID'd), it continues to use the lock associated with its original ID. While this can result in increased contention, operations that require taking the ID lock of removed objects are infrequent.

Specifically, an ID lock protects the contents of an object's structure, including the contents of the embedded kipc_perm structure (but excluding those fields protected by the namespace lock). It also protects the ipct_seq and ipct_data fields in its slot (it is really a slot lock, after all).

Recall that the table is resizable. To avoid requiring every ID lookup to take a global lock, a scheme much like that employed for file descriptors (see Section 14.2.1) is used. Note that the sequence number and data pointer are protected by both the namespace lock and their slot lock. When the table is resized, the following operations take place:

1. A new table is allocated.
2. The global lock is taken.
3. All old slots are locked, in order.
4. The first half of the new slots are locked.
5. All table entries are copied to the new table and cleared from the old table.
6. The ipc_service structure is updated to point to the new table.
7. The ipc_service structure is updated with the new table size.
8. All slot locks (old and new) are dropped.

Because the slot locks are embedded in the table, ID lookups and other operations that require taking a slot lock need to verify that the lock taken wasn't part of a stale table. To verify that, we check the table size before and after dereferencing the table pointer and taking the lock: if the size changes, the lock must be dropped and reacquired. It is this additional work that distinguishes an ID lock from a slot lock.
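The resize steps and the size check can be modeled in user space with POSIX mutexes and C11 atomics. This is a sketch under stated assumptions, not the kernel implementation: slot_t, service_t, id_lock, and table_grow are hypothetical names, the table always doubles, and the model drops the namespace lock itself at the end of the resize.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

typedef struct slot {
    pthread_mutex_t lock;
    void *data;                 /* object pointer (ipct_data in the text) */
    unsigned seq;               /* sequence number (ipct_seq in the text) */
    struct slot *chain;         /* slot 0 only: the discarded table */
} slot_t;

typedef struct service {
    pthread_mutex_t ns_lock;    /* namespace lock */
    _Atomic(slot_t *) table;
    atomic_uint tabsz;          /* always a power of two */
} service_t;

static slot_t *table_alloc(unsigned n)
{
    slot_t *t = calloc(n, sizeof (*t));
    assert(t != NULL);
    for (unsigned i = 0; i < n; i++)
        pthread_mutex_init(&t[i].lock, NULL);
    return t;
}

void service_init(service_t *s, unsigned n)
{
    pthread_mutex_init(&s->ns_lock, NULL);
    atomic_init(&s->table, table_alloc(n));
    atomic_init(&s->tabsz, n);
}

/* ID lock: take the slot lock, then confirm the table was not resized. */
slot_t *id_lock(service_t *s, unsigned id)
{
    for (;;) {
        unsigned sz = atomic_load(&s->tabsz);     /* size first ... */
        slot_t *tab = atomic_load(&s->table);     /* ... then pointer */
        slot_t *slot = &tab[id & (sz - 1)];
        pthread_mutex_lock(&slot->lock);
        if (sz == atomic_load(&s->tabsz))
            return slot;                          /* lock is current */
        pthread_mutex_unlock(&slot->lock);        /* stale table: retry */
    }
}

/* Doubles the table. Returns 0, or -1 if another thread grew it first. */
int table_grow(service_t *s)
{
    unsigned old_sz = atomic_load(&s->tabsz), new_sz = 2 * old_sz, i;
    slot_t *new_tab = table_alloc(new_sz);        /* step 1 */

    pthread_mutex_lock(&s->ns_lock);              /* step 2 */
    if (atomic_load(&s->tabsz) != old_sz) {       /* lost a race to grow */
        pthread_mutex_unlock(&s->ns_lock);
        for (i = 0; i < new_sz; i++)
            pthread_mutex_destroy(&new_tab[i].lock);
        free(new_tab);
        return -1;
    }
    slot_t *old_tab = atomic_load(&s->table);
    for (i = 0; i < old_sz; i++)                  /* step 3 */
        pthread_mutex_lock(&old_tab[i].lock);
    for (i = 0; i < old_sz; i++)                  /* step 4 */
        pthread_mutex_lock(&new_tab[i].lock);
    for (i = 0; i < old_sz; i++) {                /* step 5 */
        new_tab[i].data = old_tab[i].data;
        new_tab[i].seq = old_tab[i].seq;
        old_tab[i].data = NULL;
    }
    new_tab[0].chain = old_tab;      /* old table is never freed */
    atomic_store(&s->table, new_tab);             /* step 6 */
    atomic_store(&s->tabsz, new_sz);              /* step 7 */
    for (i = 0; i < old_sz; i++) {                /* step 8 */
        pthread_mutex_unlock(&new_tab[i].lock);
        pthread_mutex_unlock(&old_tab[i].lock);
    }
    pthread_mutex_unlock(&s->ns_lock);
    return 0;
}
```

The ordering matters: the resizer publishes the pointer before the size, while id_lock reads the size before the pointer. A reader that saw the old size therefore lands either on an old slot or on a slot in the locked first half of the new table, blocks until the resize finishes, and then notices the size change and retries.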

Because we can't guarantee that threads aren't accessing the old tables' locks, they are never deallocated. To prevent spurious reports of memory leaks, a pointer to the discarded table is stored in the new one in step 5. (Theoretically, ipcs_destroy will delete the discarded tables, but it is only ever called from a failed _init invocation; that is, when there aren't any.)

The following interfaces are provided by the ipc module for use by the individual IPC facilities.

int ipcperm_access(kipc_perm_t *, int, cred_t *);

Given an object and a cred structure, determines if the requested access type is allowed.

int ipcperm_set(ipc_service_t *, struct cred *, kipc_perm_t *, struct ipc_perm *, model_t);
int ipcperm_set64(ipc_service_t *, struct cred *, kipc_perm_t *, ipc_perm64_t *);
void ipcperm_stat(struct ipc_perm *, kipc_perm_t *, model_t);
void ipcperm_stat64(ipc_perm64_t *, kipc_perm_t *);

Performs the common portion of a STAT or SET operation. All (except stat and stat64) can fail, so they should be called before any facility-specific non-reversible changes are made to an object. Similarly, the set operations have side effects, so they should only be called once the possibility of a facility-specific failure is eliminated.

ipc_service_t *ipcs_create(const char *, rctl_hndl_t, size_t, ipc_func_t *, ipc_func_t *, int, size_t);

Creates an IPC namespace for use by an IPC facility.

void ipcs_destroy(ipc_service_t *);

Destroys an IPC namespace.

void ipcs_lock(ipc_service_t *);
void ipcs_unlock(ipc_service_t *);

Takes and releases the namespace lock. Ideally such access wouldn't be necessary, but there may be facility-specific data protected by this lock (e.g., project-wide resource consumption).

kmutex_t *ipc_lock(ipc_service_t *, int);

Takes the lock associated with an ID. Can't fail.

kmutex_t *ipc_relock(ipc_service_t *, int, kmutex_t *);

Like ipc_lock, but takes a pointer to a held lock. Drops the lock unless it is the one that would have been returned by ipc_lock. Used after calls to cv_wait.

kmutex_t *ipc_lookup(ipc_service_t *, int, kipc_perm_t **);

Performs an ID lookup and returns with the ID lock held. Fails if the ID doesn't exist in the namespace.

void ipc_hold(ipc_service_t *, kipc_perm_t *);

Takes a reference on an object.

void ipc_rele(ipc_service_t *, kipc_perm_t *);

Releases a reference on an object and drops the object's lock. Calls the object's destructor if the last reference is being released.

void ipc_rele_locked(ipc_service_t *, kipc_perm_t *);

Releases a reference on an object. Doesn't drop the lock, and may only be called when there is more than one reference to the object.

int ipc_get(ipc_service_t *, key_t, int, kipc_perm_t **, kmutex_t **);
int ipc_commit_begin(ipc_service_t *, key_t, int, kipc_perm_t *);
kmutex_t *ipc_commit_end(ipc_service_t *, kipc_perm_t *);
void ipc_cleanup(ipc_service_t *, kipc_perm_t *);

Components of a GET operation. ipc_get performs a key lookup, allocating an object if the key isn't found (returning with the namespace lock and p_lock held), and returning the existing object if it is (with the object lock held). ipc_get doesn't modify the namespace. ipc_commit_begin begins the process of inserting an object allocated by ipc_get into the namespace and can fail. If successful, it returns with the namespace lock and p_lock held. ipc_commit_end completes the process of inserting an object into the namespace and can't fail. The facility can call ipc_cleanup at any time following a successful ipc_get and before ipc_commit_end or a failed ipc_commit_begin to fail the allocation.

Pseudocode for the suggested GET implementation:

top:
        ipc_get
        if failure
                return
        if found {
                if object meets criteria
                        unlock object and return success
                else
                        unlock object and return failure
        } else {
                perform resource control tests
                drop namespace lock, p_lock
                if failure
                        ipc_cleanup
                perform facility-specific initialization
                if failure {
                        facility-specific cleanup
                        ipc_cleanup
                }
                ( At this point the object should be destructible using the
                  destructor given to ipcs_create )
                ipc_commit_begin
                if retry
                        goto top
                else if failure
                        return
                perform facility-specific resource control tests/allocations
                if failure
                        ipc_cleanup
                ipc_commit_end
                perform any infallible post-creation actions, unlock, and return
        }

int ipc_rmid(ipc_service_t *, int, cred_t *);

Performs the common portion of an RMID operation: looks up an ID, removes it, and calls a facility-specific function to do RMID-time cleanup on the private portions of the object.

int ipc_ids(ipc_service_t *, int *, uint_t, uint_t *);

Performs the common portion of an IDS operation.

4.1.4. Module Creation

The System V IPC kernel modules are implemented as dynamically loadable modules. Each facility has a corresponding loadable module in the /kernel/sys directory (shmsys, semsys, and msgsys). In addition, all three methods of IPC require loading of the /kernel/misc/ipc module, which provides two low-level routines shared by all three facilities. The ipcperm_access() routine verifies access permissions to a particular IPC resource, for example, a shared memory segment, a semaphore, or a message queue. The ipcget() code fetches a data structure associated with a particular IPC resource that generated the call, based on a key value that is passed as an argument in the shmget(2), msgget(2), and semget(2) system calls.

When an IPC resource is initially created, a positive integer, known as an identifier, is assigned to identify the IPC object. The identifier is derived from a key value. The kernel IPC xxxget(2) system call will return the same identifier to processes or threads, using the same key value, which is how different processes can be sure to access the desired message queue, semaphore or shared memory segment. An ftok(3C), or file-to-key interface, is the most common method of having different processes obtain the correct key before they call one of the IPC xxxget() routines.

Associated with each IPC resource is an id data structure, which the kernel allocates and initializes the first time an xxxget(2) system call is invoked with the appropriate flags set. The xxxget(2) system call for each facility returns the identifier to the calling application, again based on arguments passed in the call and permissions. The structures are similar in name and are defined in the header file for each facility (see Table 4.1).

Table 4.1. IPC ID Structure Names
Facility Type	xxxget(2)	ID Structure Name
semaphores	`semget(2)`	`semid_ds`
shared memory	`shmget(2)`	`shmid_ds`
message queues	`msgget(2)`	`msqid_ds`

The number of xxxid_ds structures available is capped by each facility's project.max-xxx-ids resource control limit (see Chapter 7). That is, max-shm-ids, max-sem-ids, and max-msg-ids determine the maximum number of shmid_ds, semid_ds, and msqid_ds structures available, respectively.

Most fields in the ID structures are unique for each IPC type, but they all include as the first structure member an ipc_perm data structure, which defines the access permissions for that resource, much as access to files is defined by permissions maintained in each file's inode. The ipc_perm structure is defined as follows.

/* Common IPC access structure */
struct ipc_perm {
        uid_t           uid;    /* owner's user id */
        gid_t           gid;    /* owner's group id */
        uid_t           cuid;   /* creator's user id */
        gid_t           cgid;   /* creator's group id */
        mode_t          mode;   /* access modes */
        uint_t          seq;    /* slot usage sequence number */
        key_t           key;    /* key */
#if !defined(_LP64)
        int             pad[4]; /* reserve area */
#endif
};

See /usr/include/sys/ipc.h

When an IPC resource is created, the UID and GID of the owner are the same as those of the creator. Ownership can subsequently be changed through a control system call, but the creator's IDs never change. The access mode bits are similar to file access modes, differing in that there is no execute mode for IPC objects; thus, the mode bits define read/write permissions for the owner, group, and all others. The seq field, described as the slot usage sequence number, is used by the kernel to establish the unique identifier of the IPC resource when it is first created.