17.3. Hardware Considerations for Locks and Synchronization
Hardware-specific considerations must enter into the implementation of lock primitives on a system. The first consideration has to do with the processor's instruction set and the availability of machine instructions suitable for locking code. The second deals with the visibility of a lock's state when it is examined by executing kernel threads.
To understand how these considerations apply to lock primitives, keep in mind that a lock is a piece of data at a specific location in the system's memory. In its simplest form, a lock is a single byte location in RAM. A lock that is
Most modern processors shipping today provide some form of byte-level
instruction that is
Consider what could happen without an instruction that has atomic properties. A thread executing on one processor could issue a load (read) of the lock and while it is doing a test operation to determine if the lock is held or not, another thread executing on another processor issues a lock call to get the same lock at the same time. If the lock is not held, both threads would assume the lock is available and would issue a store to hold the lock. Obviously, more than one thread cannot own the same lock at the same time, but that would be the result of such a sequence of events. Atomic instructions prevent such things from happening.
SPARC processors implement memory access instructions that provide atomic test-and-set semantics for mutual exclusion primitives, as well as instructions that can force a particular ordering of memory operations (more on the latter feature in a moment). UltraSPARC processors (the SPARC V9 instruction set) provide three memory access instructions that guarantee atomic behavior:
(load and store unsigned byte),
(compare and swap), and
(swap byte locations). These instructions
Figure 17.2 illustrates the ldstub and cas instructions. The swap instruction (not shown) simply swaps a 32-bit value between a hardware register and a memory location, similar to what cas does if the compare phase of the instruction sequence is equal.
Figure 17.2. Atomic Instructions for Locks on SPARC Systems
The implementation of locking code with the assembly language test-and-set style of instructions requires a
For example, the ldstub instruction retrieves the byte value (the lock) from memory and stores it in the specified hardware register. Locking code must test the value of the register to determine if the lock was held or available when the ldstub executed. If the register value is all 1's, the lock was held, so the code must branch off and deal with that condition. If the register value is all 0's, the lock was not held and the code can progress as being the current lock holder. Note that in both cases, the lock value in memory is set to all 1's, by virtue of the behavior of the ldstub instruction (store 0xFF at designated address). If the lock was already held, the value simply didn't change. If the lock was 0 (available), it will now reflect that the lock is held (all 1's). The code that releases a lock sets the lock value to all 0's, indicating the lock is no longer being held.
The Solaris lock code uses assembly language instructions when the lock code is entered. The basic design is such that the entry point to acquire a lock enters an assembly language routine, which uses either
to grab the lock. The assembly code is designed to deal with the simple case, meaning that the desired lock is available. If the lock is being held, a C language code
The second hardware consideration referred to earlier has to do with the visibility of the lock state to the running processors when the lock value is changed. It is critically important on multiprocessor systems that all processors have a consistent view of data in memory,
Modern processors implement hardware buffering to provide optimal performance. In addition to the hardware caches, processors also use load and store buffers to hold data being read from (load) or written to (store) memory in order to keep the instruction pipeline running and not have the processor stall waiting for data or a data write-to-memory cycle. The data hierarchy is
Figure 17.3. Hardware Data Hierarchy
The illustration in Figure 17.3 does not depict a specific processor; it is a generic representation of the various levels of data flow in a typical modern high-end microprocessor. It shows the flow of data to and from physical memory from a processor's main execution units (integer units, floating point units, etc.).
The sizes of the load/store buffers vary across processor
The implementation of a store buffer is part of the memory model implemented by the hardware. The memory model defines the constraints that can be imposed on the order of memory operations (loads and stores) by the system. Many processors implement a sequential consistency model, where loads and stores to memory are executed in the same order in which they were issued by the processor. This model has advantages in terms of memory consistency, but there are performance trade-offs with such a model because the hardware cannot optimize cache and memory operations for speed. The SPARC architecture specification  provides for building SPARC-based processors that support multiple memory models, the choice being left up to the implementors as to which memory models they wish to support. All current SPARC processors implement a Total Store Ordering (TSO) model, which requires compliance with the following rules for loads and stores:
The TSO model is not quite as strict as the sequential consistency model but not as relaxed as two additional memory models defined by the SPARC architecture. SPARC-based processors also support
A final consideration in data visibility applies also to the memory model and concerns instruction ordering. The execution unit in modern processors can reorder the incoming instruction stream for processing through the execution units. The goals again are performance and creation of a sequence of instructions that will keep the processor pipeline full.
The hardware considerations described in this section are summarized in Table 17.1, along with the solution or implementation detail that applies to the particular issue.
Table 17.1. Hardware Considerations and Solutions for Locks
The issues of consistent memory views in the face of a processor's load and store buffers, relaxed memory models, and atomic test-and-set capability for locks are addressed at the processor instruction-set level. The mutex lock and RW lock primitives implemented in the Solaris kernel use the
instructions for lock testing and acquisition on UltraSPARC-based systems and use the
(compare/exchange long) instruction on x86. The lock primitive routines are part of the
SPARC processors provide various forms of memory barrier (membar) instructions, which, depending on options that are set in the instruction, impose specific constraints on the ordering of memory access operations (loads and stores) relative to the sequence with which they were issued. To ensure a consistent memory view when a mutex or RW lock operation has been issued, the Solaris kernel issues the appropriate membar instruction after the lock bits have changed.
As we move from the strongest consistency model (sequential consistency) to the weakest model (RMO), we can build a system with
Solaris Performance and Tools: DTrace and MDB Techniques for Solaris 10 and OpenSolaris
DTrace: Dynamic Tracing in Oracle Solaris, Mac OS X and FreeBSD (Oracle Solaris Series)
Solaris 10 Security Essentials
Oracle Solaris 10 System Virtualization Essentials (Oracle Solaris System Administration Series)