12.6 Semaphore Support for Parallel Processes | Itanium® Architecture for Programmers: Understanding 64-Bit Processors and EPIC Principles

Modern operating systems that support either multiple processors or cooperating processes (or threads) require some minimal hardware mechanisms in order to avoid conflicts for shared memory, which could otherwise lead to deadlock or to incorrectly computed results.

The work of Edsger Wybe Dijkstra (1930–2002) in 1965 on the foundations of a theory of cooperating processes has been continued by many researchers, and books on operating systems treat this subject in detail. Here we shall only indicate that the Itanium architecture provides machine instructions that modify memory atomically, from which semaphores or higher-level mechanisms for resource management and conflict resolution can be constructed.

12.6.1 Previous Architectures

CISC architectures have generally provided one or more machine instructions that can modify a memory location atomically. The versatile XCHG instruction of the IA-32 architecture can function as a locking semaphore instruction when one operand is a memory location and the other is a general register. In essence, it interchanges the contents of its two operands.

A programmer in a high-level language would have to introduce an intermediate temporary variable in order to swap two quantities, using three steps:

 tmp = mem(x)
 mem(x) = y
 y = tmp

Such a nonatomic method fails to satisfy the theoretically provable requirements as a basis for building a semaphore. During the normal course of scheduling, some other process that can access the quantity x in shared memory could become active between these steps and change x to a value different from the one copied into tmp. If the first process then made decisions based on tmp after being rescheduled, it would be acting on stale information.
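By contrast, an atomic exchange closes that window. The following is a minimal sketch in C11, assuming a hypothetical lock word in which 0 means free and 1 means held; the names `xchg_lock`, `xchg_try_lock`, `xchg_lock_acquire`, and `xchg_unlock` are invented for illustration, and `atomic_exchange` from `<stdatomic.h>` stands in for a hardware exchange instruction.

```c
#include <stdatomic.h>

/* Hypothetical lock word: 0 = free, 1 = held (an assumed convention). */
typedef struct { atomic_int word; } xchg_lock;

/* Try once to take the lock. The swap writes 1 and returns the old
   value as one indivisible step, so no other process can slip in
   between the read and the write. Returns 1 on success, 0 otherwise. */
int xchg_try_lock(xchg_lock *l) {
    return atomic_exchange(&l->word, 1) == 0;
}

/* Spin until the lock is obtained. */
void xchg_lock_acquire(xchg_lock *l) {
    while (!xchg_try_lock(l)) {
        /* busy-wait; a real system would yield or back off here */
    }
}

/* Release the lock by storing the "free" value. */
void xchg_unlock(xchg_lock *l) {
    atomic_store(&l->word, 0);
}
```

Only the process that receives 0 from the exchange has acquired the lock; every other contender receives 1 and knows that it must retry.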

Atomic instructions of a read-modify-write variety constitute an exception to the general rule in RISC-like architectures that memory access instructions (loads and stores) should not perform any other operations on the data being copied between memory and processor registers. This exception is essential, however, if the ISA is to satisfy the criterion of providing a basis for construction of semaphores or other mechanisms for resource management and conflict resolution.

PA-RISC architecture implements load-and-clear (LDC) instructions that can serve as a basis for the construction of semaphores. These instructions atomically bring a quantity from memory into a processor register and write zero back into the memory location.
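The load-and-clear behavior can be modeled in C11 as an atomic exchange with zero. This is only a sketch: the function names are hypothetical, and the convention that a nonzero word means "free" while zero means "held" is an assumption made here because clearing is the only modification the operation performs.

```c
#include <stdatomic.h>

/* Model of load-and-clear: return the old word and leave zero in
   memory. A C11 atomic exchange stands in for the instruction. */
unsigned ldc_model(atomic_uint *word) {
    return atomic_exchange(word, 0u);
}

/* Assumed convention: nonzero = free, zero = held.
   Whoever reads back a nonzero value has taken the lock. */
int ldc_try_lock(atomic_uint *word) {
    return ldc_model(word) != 0u;
}

/* Release by restoring a nonzero value. */
void ldc_unlock(atomic_uint *word) {
    atomic_store(word, 1u);
}
```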

12.6.2 Itanium Architecture

The Itanium architecture provides several suitably atomic read-modify-write instructions from which semaphores and other synchronization schemes can be built, as well as related 16-byte variants of integer load and store operations. Not surprisingly, the first of those is upwardly compatible with both the IA-32 XCHG instruction and the PA-RISC LDC instruction. The others offer additional functionality.

Each of the following instruction types requires that the information unit in memory be naturally aligned. The two exchange instructions also require that the memory location be accessed through the cache system. These instructions have longer latency than many other integer instructions because they must stall the pipeline in order to ensure atomicity.

Exchange instruction

The xchg instruction, which replaces the content of a memory location with a segment of equivalent size from a general register, has the following assembler syntax:

 xchgsz.ldhint r1 = [r3],r2   // val = [r3]
                              // [r3] = lo sz bytes of r2
                              // r1 = zero-extend(val)

where we indicate in the comments the hypothetical nonatomic operations carried out by this instruction. Actually, the read-modify-write memory operations are guaranteed to execute atomically. That is achieved by stalling the pipeline and making the indivisible instruction not subject to interruption.

The referenced information unit in memory must be aligned according to its size, sz, which may be 1, 2, 4, or 8 bytes. The value from memory is zero-extended when it is placed into register r1.

The values of ldhint (the load hint completer) are the same as for ordinary load instructions (none at all, nt1, nta) as previously discussed (Section 4.5.3).

The memory access is performed with acquire semantics; that is, this memory read-write is made visible (to other processes and processors) prior to all subsequent data memory accesses.
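The size and zero-extension rules can be illustrated with a small C11 model of the 4-byte form. This is a sketch rather than the instruction itself: `model_xchg4` is a hypothetical name, a `uint64_t` stands in for a general register, and `memory_order_acquire` is used as the closest C11 analogue of the acquire semantics described above.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Model of a 4-byte exchange: store the low 4 bytes of r2 into memory
   and return the previous memory value zero-extended to 64 bits. */
uint64_t model_xchg4(_Atomic uint32_t *addr, uint64_t r2) {
    uint32_t old = atomic_exchange_explicit(addr, (uint32_t)r2,
                                            memory_order_acquire);
    return (uint64_t)old;   /* upper 32 bits are zero */
}
```

For example, exchanging the register value 0x1122334455667788 with a memory word holding 0xFFFFFFFF leaves 0x55667788 in memory and returns 0x00000000FFFFFFFF.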

Fetch and add immediate instruction

The fetchadd instruction modifies the content of a memory location by a small signed increment and copies the original value into a general register. The assembler syntax is:

 fetchaddsz.sem.ldhint r1 = [r3], inc3
                              // val = [r3]
                              // val = zero-extend(val)
                              // r1 = val
                              // val = val + sign-ext(inc3)
                              // [r3] = lo sz bytes of val

The referenced information unit in memory must be aligned according to its size, sz, which may be 4 or 8 bytes. The value from memory is zero-extended before it is placed into register r1 and incremented.

The possible increments that can be encoded in the inc3 immediate field in the instruction are ±1, ±4, ±8, ±16 (note that ±2 is not included).

The values of sem (the ordering semantics completer) are: acq (acquire), which means that the memory read-write is made visible prior to all subsequent data memory accesses; or rel (release), which means that the memory read-write is made visible after all previous data memory accesses.

The values of ldhint (the load hint completer) are the same as for ordinary load instructions (none at all, nt1, nta) as previously discussed (Section 4.5.3).
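The fetch-and-add behavior described in this subsection can be sketched in C11 as follows. The function names are hypothetical, `atomic_fetch_add_explicit` stands in for the hardware operation, and only the acquire form of the 4-byte variant is modeled; the helper simply checks an increment against the encodable set listed above.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Returns 1 if inc is one of the encodable increments
   (+/-1, +/-4, +/-8, +/-16), 0 otherwise. */
int fetchadd_inc_ok(int inc) {
    int mag = inc < 0 ? -inc : inc;
    return mag == 1 || mag == 4 || mag == 8 || mag == 16;
}

/* Model of the 4-byte acquire form: add a signed increment to memory
   and return the ORIGINAL value zero-extended to 64 bits. The sum
   wraps to 4 bytes, as only the low sz bytes are written back. */
uint64_t model_fetchadd4_acq(_Atomic uint32_t *addr, int inc) {
    uint32_t old = atomic_fetch_add_explicit(addr, (uint32_t)inc,
                                             memory_order_acquire);
    return (uint64_t)old;
}
```

Because the original value is returned, each caller of such an operation on a shared counter observes a distinct value, which is what makes it usable for ticket-style or counting schemes.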

Compare criterion value register and compare and store register

Two application registers (Appendix D.6), ar.ccv (compare criterion value) and ar.csd (compare and store data) are associated with a third type of atomic read-test-modify-write memory operation and with special 16-byte load and store operations. These registers can be read or written using mov instructions (Section 4.5.6).

Compare and exchange instruction

The cmpxchg instruction performs an atomic compare-and-exchange, but only conditionally modifies the content of a memory location. The assembler syntax is:

 cmpxchgsz.sem.ldhint r1 = [r3],r2,ar.ccv
                              // val = [r3]
                              // val = zero-extend(val)
                              // r1 = val
                              // if val == ccv then
                              //   [r3] = lo sz bytes of r2
 cmp8xchg16.sem.ldhint r1 = [r3],r2,ar.csd,ar.ccv
                              // val = [r3]
                              // val = zero-extend(val)
                              // r1 = val
                              // if val == ccv then
                              //   [r3 & ~0x8] =  r2
                              //   [(r3 & ~0x8) + 8] = ar.csd

where we indicate in the comments the hypothetical nonatomic operations carried out by this instruction. Actually, the read-test-modify-write memory operations are guaranteed to execute atomically. That is achieved by stalling the pipeline and making the indivisible instruction not subject to interruption.

The referenced information unit in memory must be aligned according to its size, sz, which may be 1, 2, 4, or 8 bytes for cmpxchgsz or 8 bytes for cmp8xchg16. The value from memory is zero-extended when it is placed into register r1. If the value from memory did not match the value in register ar.ccv, then no write-back to memory is performed.

The values of ldhint (the load hint completer) are the same as for ordinary load instructions (none at all, nt1, nta) as previously discussed (Section 4.5.3).
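A conditional update of this kind is typically wrapped in a retry loop. The sketch below builds a nonblocking "try wait" for a counting semaphore from the C11 compare-and-exchange, which stands in for the hardware instruction; the names `csem_try_wait` and `csem_post` are hypothetical, and the local variable `seen` plays the role that the comparison value in ar.ccv plays above.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Try to decrement a counting semaphore without blocking.
   Returns 1 if a unit was obtained, 0 if the count was zero. */
int csem_try_wait(_Atomic uint64_t *count) {
    uint64_t seen = atomic_load_explicit(count, memory_order_relaxed);
    while (seen != 0) {
        /* Write seen - 1 only if memory still holds 'seen'. On failure
           the call refreshes 'seen' with the current value and we retry. */
        if (atomic_compare_exchange_weak_explicit(
                count, &seen, seen - 1,
                memory_order_acquire, memory_order_relaxed)) {
            return 1;
        }
    }
    return 0;
}

/* Return one unit to the semaphore. */
void csem_post(_Atomic uint64_t *count) {
    atomic_fetch_add_explicit(count, 1, memory_order_release);
}
```

The comparison guards against the stale-value problem: if another process changed the count after it was read, the write is suppressed and the loop simply tries again with the fresh value.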

Special 16-byte load and store instructions

The ar.csd application register is also involved with special 16-byte atomic forms of integer load and store instructions:

 ld16.ldhint         r1,ar.csd = [r3]     // r1 = [r3]
                                          // ar.csd = [r3 + 8]
 ld16.acq.ldhint     r1,ar.csd = [r3]     // r1 = [r3]
                                          // ar.csd = [r3 + 8]
 st16.sttype.sthint  [r3] = r2,ar.csd     // [r3] = r2
                                          // [r3 + 8] = ar.csd

where register r3 specifies a memory address that must be aligned on a 16-byte addressing boundary.

The ld16 instruction performs a single 16-byte atomic read operation from memory and places the quad word from the lower address in register r1 and the value from the higher address into register ar.csd.

The only recognized values of the load type completer are none at all and acq (Section 4.5.3). The values of ldhint (the load hint completer) are the same as for ordinary load instructions as previously discussed (Section 4.5.3).

The st16 instruction stores two quad word values from registers r2 and ar.csd in a single 16-byte atomic write operation to memory at the address in register r3 and the address 8 bytes higher, respectively.

The values of sttype (the store type completer) and sthint (the store hint completer) are the same as for ordinary store instructions as previously discussed (Section 4.5.2).