17.6. Reader/Writer Locks

Reader/writer (RW) locks provide mutual exclusion semantics on write locks. Only one thread at a time may own the write lock, but readers can access the data concurrently. These locks are designed for scenarios in which it is acceptable to have multiple threads reading the data at the same time, but only one writer. While a writer holds the lock, no readers are allowed. Also, because of the wakeup mechanism, an RW lock is a better solution for kernel code segments that require relatively long hold times, as we will see shortly.

The basic mechanics of RW locks are similar to those of mutexes: RW locks have an initialization function (rw_init()), an entry function to acquire the lock (rw_enter()), and an exit function to release the lock (rw_exit()). The entry and exit points are optimized in assembly code to handle the simple cases, calling into C language functions only when something more complex must be dealt with. As with mutex locks, the simple case is that the requested lock is available on an entry (acquire) call and no threads are waiting for the lock on the exit (release) call.
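From the programmer's point of view, the interfaces mirror the mutex interfaces. The sketch below shows typical usage through the documented rw_init(9F), rw_enter(9F), and rw_exit(9F) entry points; the statistics counters and function names are invented for illustration.

#include <sys/ksynch.h>

static krwlock_t stats_rwlock;          /* protects the counters below */
static uint64_t  reads_done, writes_done;

void
stats_init(void)
{
        /* RW_DEFAULT selects the standard RW lock type */
        rw_init(&stats_rwlock, NULL, RW_DEFAULT, NULL);
}

uint64_t
stats_snapshot(void)
{
        uint64_t total;

        rw_enter(&stats_rwlock, RW_READER);     /* shared: many readers allowed */
        total = reads_done + writes_done;
        rw_exit(&stats_rwlock);
        return (total);
}

void
stats_bump_writes(void)
{
        rw_enter(&stats_rwlock, RW_WRITER);     /* exclusive: one writer only */
        writes_done++;
        rw_exit(&stats_rwlock);
}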

17.6.1. Solaris Reader/Writer Locks

Reader/writer locks are implemented as a single-word data structure in the kernel, either 32 bits or 64 bits wide, depending on the data model of the running kernel, as depicted in Figure 17.6.

Figure 17.6. Reader/Writer Lock


typedef struct rwlock_impl {
        uintptr_t       rw_wwwh;        /* waiters, write wanted, hold count */
} rwlock_impl_t;

#endif  /* _ASM */

#define RW_HAS_WAITERS          1
#define RW_WRITE_WANTED         2
#define RW_WRITE_LOCKED         4
#define RW_READ_LOCK            8
#define RW_WRITE_LOCK(thread)   ((uintptr_t)(thread) | RW_WRITE_LOCKED)
#define RW_HOLD_COUNT           (-RW_READ_LOCK)
#define RW_HOLD_COUNT_SHIFT     3               /* log2(RW_READ_LOCK) */
#define RW_READ_COUNT           RW_HOLD_COUNT
#define RW_OWNER                RW_HOLD_COUNT
#define RW_LOCKED               RW_HOLD_COUNT
#define RW_WRITE_CLAIMED        (RW_WRITE_LOCKED | RW_WRITE_WANTED)
#define RW_DOUBLE_LOCK          (RW_WRITE_LOCK(0) | RW_READ_LOCK)

                                                        See sys/rwlock.h


The reader/writer lock has two states, depending on whether the lock is held by a writer. Bit 2, wrlock, is the actual write lock, and it determines the meaning of the high-order bits. If the write lock is held (bit 2 set), the upper bits contain a pointer to the kernel thread holding the write lock. If bit 2 is clear, the upper bits contain a count of the number of threads holding the lock as readers.
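Given the macros in Figure 17.6, decoding the lock word is mechanical. The two helpers below are a minimal sketch of the two interpretations; they are illustrative, not kernel source.

#include <sys/rwlock.h>         /* RW_WRITE_LOCKED, RW_OWNER, ... */
#include <sys/thread.h>         /* kthread_t */

/* Number of read holds; meaningful only when the write lock is not held. */
static uint_t
rw_read_holds(uintptr_t wwwh)
{
        if (wwwh & RW_WRITE_LOCKED)
                return (0);             /* a writer owns the lock */
        /* Mask off the three low state bits, then shift the count down. */
        return ((uint_t)((wwwh & RW_READ_COUNT) >> RW_HOLD_COUNT_SHIFT));
}

/* Owning thread; meaningful only when bit 2 (wrlock) is set. */
static kthread_t *
rw_write_owner(uintptr_t wwwh)
{
        if ((wwwh & RW_WRITE_LOCKED) == 0)
                return (NULL);          /* free or read-locked */
        return ((kthread_t *)(wwwh & RW_OWNER));
}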

The Solaris 10 RW lock defines bit 0 as the wait bit, which is set to signify that threads are waiting for the lock. The wrwant bit (write wanted, bit 1) indicates that at least one thread is waiting for a write lock. The simple cases for lock acquisition through rw_enter() are the circumstances listed below:

  • The write lock is wanted and is available.

  • The read lock is wanted, the write lock is not held, and no threads are waiting for the write lock (wrwant is clear).

Acquiring the write lock sets bit 2 and loads a pointer to the acquiring kernel thread into the upper bits. For a reader, the hold count (upper bits) is incremented. Conditions where the write lock is held, causing a lock request to fail, or where a thread is waiting for a write lock, causing a read lock request to fail, result in a call to the rw_enter_sleep() function.
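Conceptually, the fast path amounts to a single compare-and-swap on the lock word. The sketch below is illustrative only (the real entry point is hand-tuned assembly) and assumes the atomic_cas_ulong(9F) primitive; uintptr_t and ulong_t have the same width in the kernel.

#include <sys/atomic.h>

/*
 * Illustrative fast path; returns 1 on success, 0 if the caller
 * must fall into rw_enter_sleep().
 */
static int
rw_try_enter(volatile uintptr_t *wwwh, kthread_t *self, int writer)
{
        uintptr_t old = *wwwh;

        if (writer) {
                /* Writer simple case: the lock word must be zero. */
                if (old != 0)
                        return (0);
                return (atomic_cas_ulong((volatile ulong_t *)wwwh, 0,
                    RW_WRITE_LOCK(self)) == 0);
        }
        /* Reader simple case: no writer holds or wants the lock. */
        if (old & RW_WRITE_CLAIMED)
                return (0);
        return (atomic_cas_ulong((volatile ulong_t *)wwwh, old,
            old + RW_READ_LOCK) == old);
}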

It is important to note that the rw_enter() code sets a flag in the kernel thread that the dispatcher uses when establishing the thread's priority before preemption or a change of state to sleep. We cover this in more detail in the paragraph beginning "It is in the dispatcher queue insertion code" on Page 262. Briefly, the kernel thread structure contains a t_kpri_req (kernel priority request) field that the dispatcher checks when a thread is about to be preempted (forced off the processor on which it is executing because a higher-priority thread becomes runnable) or is about to have its state changed to sleep. If t_kpri_req is set, the dispatcher assigns a kernel priority to the thread, such that when the thread resumes execution, it runs before threads in lower-priority scheduling classes (timeshare and interactive class threads). More succinctly, a thread holding a write lock is boosted to a better priority to minimize the hold time of the lock.
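Assuming the THREAD_KPRI_REQUEST() and THREAD_KPRI_RELEASE() macros from sys/thread.h, which increment and decrement t_kpri_req on the current thread, the write-lock path can be pictured roughly as follows. This is a sketch to show where the priority request fits, not kernel source; rw_enter_sleep_sketch() and rw_exit_sketch() are defined in later sketches.

void
rw_enter_write_sketch(rwlock_impl_t *lp)
{
        THREAD_KPRI_REQUEST();          /* curthread->t_kpri_req++ */
        if (!rw_try_enter(&lp->rw_wwwh, curthread, 1))
                rw_enter_sleep_sketch(lp, RW_WRITER);
}

void
rw_exit_write_sketch(rwlock_impl_t *lp)
{
        rw_exit_sketch(lp);
        THREAD_KPRI_RELEASE();          /* boost no longer needed */
}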

Getting back to the rw_enter() flow: if the code falls through the simple case, we need to set up the kernel thread requesting the RW lock to block. The steps are listed below, followed by a condensed sketch of the flow.

  1. rw_enter_sleep() establishes whether the calling thread is requesting a read or write lock and does another test to see if the lock is available. If it is, the caller gets the lock, the lockstat(1M) statistics are updated, and the code returns. If the lock is not available, then the turnstile code is called to look up a turnstile in preparation for putting the calling thread to sleep.

  2. With a turnstile now available, another test is made on the lock availability. (On today's fast processors, and especially multiprocessor systems, it's quite possible that the thread holding the lock finished what it was doing and the lock became available.) Assuming the lock is still held, the thread is set to a sleep state and placed on a turnstile.

  3. The RW lock structure will have the wait bit set for a reader waiting (forced to block because a writer has the lock) or the wrwant bit set to signify that a thread wanting the write lock is blocking.

  4. The cpu_sysinfo structure for the processor maintains two counters for failures to get a read lock or write lock on the first pass: rw_rdfails and rw_wrfails. The appropriate counter is incremented just prior to the turnstile call that places the thread on a turnstile sleep queue. The mpstat(1M) command sums the counters and displays the fails-per-second in the srw column of its output.
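Putting the four steps together, the blocking path can be condensed as shown below. This is a sketch, not the rw_enter_sleep() source: turnstile_lookup(), turnstile_exit(), and turnstile_block() are the real turnstile interfaces, but the call details are simplified, and rw_set_wait_bits() is a hypothetical helper.

#include <sys/turnstile.h>
#include <sys/cpuvar.h>

static void
rw_enter_sleep_sketch(rwlock_impl_t *lp, krw_t rw)
{
        turnstile_t *ts;

        /* Step 1: retest; the holder may have just released the lock. */
        if (rw_try_enter(&lp->rw_wwwh, curthread, rw == RW_WRITER))
                return;

        ts = turnstile_lookup(lp);              /* find and lock the turnstile */

        /* Step 2: test once more now that the turnstile is held. */
        if (rw_try_enter(&lp->rw_wwwh, curthread, rw == RW_WRITER)) {
                turnstile_exit(lp);
                return;
        }

        /* Step 3: set wait (reader) or wrwant (writer) in the lock word. */
        rw_set_wait_bits(lp, rw);               /* hypothetical helper */

        /* Step 4: count the failure; mpstat(1M) reports these in srw. */
        if (rw == RW_WRITER)
                CPU_STATS_ADDQ(CPU, sys, rw_wrfails, 1);
        else
                CPU_STATS_ADDQ(CPU, sys, rw_rdfails, 1);

        /*
         * Sleep.  Because of direct handoff (described below), the
         * thread owns the lock when turnstile_block() returns.
         */
        (void) turnstile_block(ts, rw == RW_WRITER ? TS_WRITER_Q :
            TS_READER_Q, lp, &rw_sobj_ops, NULL, NULL);
}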

The acquisition of an RW lock and the subsequent behavior when the lock is held are straightforward and similar in many ways to the mutex case. Things get interesting when a thread calls rw_exit() to release a lock it is holding; there are several potential solutions to the problem of determining which thread gets the lock next. In the mutex case, a wakeup is issued to all threads sleeping on the lock, and we know from empirical data that this solution works well for reasons previously discussed. With RW locks, we're dealing with potentially longer hold times, which could result in more sleepers; a desire to give writers priority over readers (it's typically best not to have a reader read data that's about to be changed by a pending writer); and the potential for the priority inversion problem described in Section 17.7.

For rw_exit(), which is called by the lock holder when it is ready to release the lock, the simple case is that there are no waiters: the wrlock bit is cleared if the holder was a writer, or the hold count field is decremented to reflect one less reader. The more complex case, in which the system has waiters when the lock is released, is dealt with in the following manner (a sketch of the release path follows these steps):

  1. The kernel does a direct transfer of ownership of the lock to one or more of the threads waiting for the lock when the lock is released, either to the next writer or to a group of readers if more than one reader is blocking and no writers are blocking.

    This situation is very different from the case of the mutex implementation, for which the wakeup is issued and a thread must obtain lock ownership in the usual fashion. Here, a thread or threads wake up owning the lock they were blocking on.

    The algorithm used to figure out who gets the lock next addresses several requirements that provide for generally balanced system performance. The kernel needs to minimize the possibility of starvation (a thread never getting the resource it needs to continue executing) while allowing writers to take precedence whenever possible.

  2. rw_exit_wakeup() retests for the simple case and drops the lock if there are no waiters (clear wrlock or decrement the hold count).

  3. When waiters are present, the code grabs the turnstile (sleep queue) associated with the lock and saves the pointer to the kernel thread of the next write waiter that was on the turnstile's sleep queue (if one exists).

    The turnstile sleep queues are organized as a FIFO (first in, first out) queue, so the queue management (turnstile code) makes sure that the thread that was waiting the longest (the first in) is the thread that is selected as the next writer (first out). Thus, part of the fairness policy we want to enforce is covered.
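In sketch form, the release entry path looks like the following; atomic_cas_ulong() is assumed as before, and rw_exit_wakeup_sketch(), shown after the next list, stands in for the handoff code. (A reader that is not the last holder simply decrements the hold count; that detail is omitted here.)

/*
 * Sketch of rw_exit(): handle the no-waiters case inline,
 * otherwise fall into the wakeup/handoff code.
 */
static void
rw_exit_sketch(rwlock_impl_t *lp)
{
        uintptr_t old = lp->rw_wwwh;

        if (!(old & RW_HAS_WAITERS)) {
                /* Writer: clear the word.  Reader: drop one hold. */
                uintptr_t new = (old & RW_WRITE_LOCKED) ?
                    0 : old - RW_READ_LOCK;
                if (atomic_cas_ulong((volatile ulong_t *)&lp->rw_wwwh,
                    old, new) == old)
                        return;                 /* simple case: done */
        }
        rw_exit_wakeup_sketch(lp, (old & RW_WRITE_LOCKED) != 0);
}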

The remaining bits of the algorithm go as follows (a pseudocode sketch follows the list):

  1. If a writer is releasing the write lock and there are waiting readers and writers, readers of the same or higher priority than the highest-priority blocked writer are granted the read lock.

  2. The readers are handed ownership and then woken up by the turnstile_wakeup() kernel function.

    These readers also inherit the priority of the writer that released the lock if the reader thread is of a lower priority (inheritance is done on a per-reader thread basis when more than one thread is being woken up). Lock ownership handoff is a relatively simple operation. For read locks, there is no notion of a lock owner, so it's a matter of setting the hold count in the lock to reflect the number of readers coming off the turnstile, then issuing the wakeup of each reader.

  3. An exiting reader always grants the lock to a waiting writer, even if there are higher-priority readers blocked.

  4. It is possible for a reader freeing the lock to have waiting readers, although it may not be intuitive, given the multiple-reader design of the lock. If a reader is holding the lock and a writer comes along, the wrwant bit is set to signify that a writer is waiting for the lock. With wrwant set, subsequent readers cannot get the lock; we want the holding readers to finish so the writer can acquire the lock. Therefore, it is possible for a reader to execute rw_exit_wakeup() with waiting writers and readers.
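Finally, the policy itself in pseudocode form. All the helper names below (next_waiting_writer(), wake_all_readers(), readers_at_or_above(), wake_readers_at_or_above(), handoff_to_writer()) are invented for illustration; in the kernel, this work is done through the turnstile interfaces, ending in turnstile_wakeup().

/*
 * Sketch of the "favor writers, but be fair to readers" policy.
 * was_writer is B_TRUE when a writer is releasing the lock.
 */
static void
rw_exit_wakeup_sketch(rwlock_impl_t *lp, boolean_t was_writer)
{
        kthread_t *w = next_waiting_writer(lp); /* FIFO: longest waiter */

        if (w == NULL) {
                /* Only readers are blocked: all wake up owning the lock. */
                wake_all_readers(lp);
        } else if (!was_writer) {
                /* An exiting reader always grants a waiting writer. */
                handoff_to_writer(lp, w);
        } else if (readers_at_or_above(lp, w->t_pri) != 0) {
                /*
                 * Exiting writer: readers at or above the best blocked
                 * writer's priority get the read lock, inheriting the
                 * exiting writer's priority if theirs is lower.
                 */
                wake_readers_at_or_above(lp, w->t_pri);
        } else {
                handoff_to_writer(lp, w);
        }
}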

The "let's favor writers but be fair to readers" policy described above was first implemented in Solaris 2.6.



