Section 16.1. Memory Hierarchy Designs | Solaris Internals: Solaris 10 and OpenSolaris Kernel Architecture (2nd Edition)

16.1. Memory Hierarchy Designs

In this section we take a closer look at NUMA and CMT: what they are and why we need them.

16.1.1. What Is NUMA?

Typically, NUMA machines are made up of a number of nodes. Each node has CPUs, memory, and (possibly) I/O devices connected by a small, fast local bus, and special hardware connects the buses of the various nodes. All the nodes are interconnected such that they share one physical address space and can access the memory in all the other nodes. However, it takes longer to access the memory in a remote node than in the local one. There may also be varying degrees of remote latency. (That is, some memory will be close, some farther away, and some farther away still.) A node may physically be a board, a machine consisting of multiple boards, or even a single processor with local memory.

Among the factors affecting just how much longer a remote access takes are the speed of the interconnect, the topology of how the nodes are connected, whether the memory location is currently in cache (and which cache it's in), and whether any cache coherency must be maintained. For example, incurring a remote read miss on Starcat is about 1.5 to 3 times slower than a local miss, depending on these factors.
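To make the cost of remote accesses concrete, the following minimal sketch computes an average miss latency from the fraction of misses served by a remote node. The 1.5x and 3x ratios are the Starcat figures quoted above; the 200 ns local latency, the function name, and the simple weighted-average model are illustrative assumptions, not measurements.

```python
def average_miss_latency(local_ns, remote_ratio, remote_fraction):
    """Average miss latency when some misses are served by a remote node.

    local_ns        -- latency of a local miss (hypothetical value)
    remote_ratio    -- how many times slower a remote miss is than a local one
    remote_fraction -- share of misses served remotely, from 0.0 to 1.0
    """
    if not 0.0 <= remote_fraction <= 1.0:
        raise ValueError("remote_fraction must be between 0 and 1")
    remote_ns = local_ns * remote_ratio
    # Weighted average of the local and remote miss costs.
    return (1.0 - remote_fraction) * local_ns + remote_fraction * remote_ns


if __name__ == "__main__":
    LOCAL_NS = 200.0  # assumed local miss latency, for illustration only
    for ratio in (1.5, 3.0):
        for fraction in (0.0, 0.25, 0.5, 1.0):
            avg = average_miss_latency(LOCAL_NS, ratio, fraction)
            print(f"ratio={ratio} remote={fraction:.0%} avg={avg:.0f} ns")
```

Under this model, keeping a thread's memory local matters most when the remote ratio is high: at a 3x ratio, a thread whose misses are half remote sees twice the average latency of a thread whose misses are all local.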

The Starcat and the AMD Opteron HyperTransport-based machines are examples of these systems.

16.1.1.1. Why NUMA?

Building an SMP with a large number of fast CPUs is a hard problem. The physical size of the backplane required to allow connection of all the CPUs, I/O devices, etc., tends to limit the speed at which the backplane can operate, while at the same time the increasing number and speed of the CPUs places a constant upward pressure on the desired backplane speed (and thus capacity).

Various techniques have been employed to work around these issues, for example, plugging cards in from both sides (centerplane), using a crossbar switch rather than a traditional bus (basically making the bus more parallel), and increasing the width of the bus. However, over the long term it is likely that if SMP machines are to continue to grow in overall capacity, another approach is needed. NUMA offers a solution to this problem by allowing the computer to scale beyond a single SMP.

NUMA is a design trade-off, however. NUMA designs, while allowing for larger systems, can introduce prohibitively large memory latencies, which in turn can impact performance. Because the amount of memory latency experienced by a given thread may vary depending on where the thread is running and which memory it is accessing, application performance on NUMA systems can be nondeterministic. This can be especially problematic for programmers who predicate their application's performance on the assumption that multiple parallel threads of execution will complete a given task in a constant amount of time (as is common in barrier synchronization).
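The effect on barrier synchronization can be sketched with a toy model: a barrier releases only when the slowest thread arrives, so a single thread with poor memory locality sets the pace for all of them. The access counts, latencies, and function names below are hypothetical and chosen only to illustrate the point.

```python
def thread_time_ns(accesses, local_ns, remote_ratio, remote_fraction):
    """Time one thread spends on memory misses in a phase (toy model)."""
    local_part = accesses * (1.0 - remote_fraction) * local_ns
    remote_part = accesses * remote_fraction * local_ns * remote_ratio
    return local_part + remote_part


def barrier_phase_time_ns(thread_times):
    """A barrier completes when the slowest participating thread arrives."""
    return max(thread_times)


if __name__ == "__main__":
    # Four threads doing identical work; only their memory placement differs.
    # Assumed values: 1000 misses each, 100 ns local miss, remote 3x slower.
    remote_fractions = [0.0, 0.0, 0.0, 0.5]
    times = [thread_time_ns(1000, 100.0, 3.0, f) for f in remote_fractions]
    print("per-thread times (ns):", times)
    print("barrier phase time (ns):", barrier_phase_time_ns(times))
```

In this example three threads finish in the same time, but the phase takes as long as the one thread whose accesses are half remote.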

Programmers concerned with extracting the highest possible performance from the entire machine will need to tune their applications to reflect the processor-memory-I/O topology. However, our experience so far with the adoption of new Solaris APIs has shown that ISVs would rather have their software work well without any changes. They do not want to optimize their code for a particular platform, because their code would be less portable and their testing and maintenance costs would increase. To address that concern, Solaris OS introduces a set of optional APIs that allow the ISV to advise the kernel of any intentional relationship between threads and memory, advice that is completely portable to any machine topology. If ISVs are unable or unwilling to modify their application to use the optional APIs, the Solaris kernel will employ a default set of policies and optimizations to enhance the application's performance while reducing the performance variability the application would otherwise experience.

16.1.1.2. What Is Cache Coherent NUMA?

Cache coherent NUMA (ccNUMA) is a fairly common flavor of NUMA among the computer vendors, including Sun, who make NUMA machines today. In ccNUMA machines, hardware keeps the memory cache lines coherent across all the nodes in the machine. This approach is much faster than the alternatives of ensuring the coherency with software or disabling the caching of remote memory altogether. The MPO enhancements in Solaris are found only with ccNUMA machines.

16.1.2. What Is CMT?

Chip multithreading (CMT) refers to the family of processor technologies that allow a given physical processor to simultaneously execute multiple threads of execution. Several techniques presently exist for implementing CMT.

The first is chip multiprocessing (CMP), wherein multiple processing cores are implemented in a single physical processor package. UltraSPARC IV is Sun's first CMP, incorporating two UltraSPARC III+ cores per chip. Each UltraSPARC IV appears to the operating system as two logical processors (one per core).

Another technique is vertical multithreading (VT), wherein a single processor core may multiplex multiple threads of execution across its pipeline. Rather than stalling the pipeline when waiting for a memory request, the core can simply switch to another thread. Because this multiplexing is managed by the hardware, each VT core appears to the operating system as multiple logical CPUs upon which threads may be scheduled to run. Sun's UltraSPARC T1 (Niagara) is an example of a vertically threaded CMP processor, incorporating 8 cores with 4 threads per core. Each UltraSPARC T1 chip therefore presents to the OS 32 logical CPUs.
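The logical CPU counts quoted above follow from multiplying cores by hardware threads per core. A minimal sketch, using the core and thread counts given in the text (the `Chip` class itself is a hypothetical helper for illustration):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chip:
    """Hypothetical description of a CMT processor package."""
    name: str
    cores: int
    threads_per_core: int

    def logical_cpus(self):
        # Each hardware thread appears to the OS as one logical CPU.
        return self.cores * self.threads_per_core


# Core and thread counts as stated in the text above.
ULTRASPARC_IV = Chip("UltraSPARC IV", cores=2, threads_per_core=1)
ULTRASPARC_T1 = Chip("UltraSPARC T1", cores=8, threads_per_core=4)

if __name__ == "__main__":
    for chip in (ULTRASPARC_IV, ULTRASPARC_T1):
        print(f"{chip.name}: {chip.logical_cpus()} logical CPUs")
```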

Like vertical threading, simultaneous multithreading (SMT) allows a single processor core to execute multiple threads, but SMT differs from VT in that the core can process instructions from multiple instruction streams simultaneously.

Intel's Pentium 4 (P4) and Xeon processors with Hyper-Threading are examples of SMT processors.

16.1.2.1. Why CMT?

Chip multithreading represents a divergence from the traditional set of techniques used to increase the performance of a given processor architecture.

As the gap between processor and memory speeds widens, trying to increase performance by ramping up the processor clock speed begins to have diminishing returns because the time spent by the processor stalled waiting for memory will tend to dominate. CMT attacks this problem by allowing useful work from other instruction streams to fill what would otherwise be a stalled pipeline.

The design of CMT therefore focuses not on executing a single instruction stream as quickly as possible (stalling along the way) but on increasing the aggregate amount of work done by the processor in a given unit of time (throughput), a goal fulfilled by multiple threads running in parallel. In VT and SMT, we achieve this parallelism by filling pipeline stalls with instructions from other streams. In CMP, we achieve additional parallelism by adding more processing cores (each with the capability of running one or more threads) to the chip.
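The throughput argument can be illustrated with a simplified model that is not taken from the text: assume each thread alternates a fixed number of compute cycles with a fixed number of stall cycles, and that the core switches between hardware threads at no cost. The cycle counts and the function name are assumptions chosen for illustration.

```python
def pipeline_utilization(compute_cycles, stall_cycles, hw_threads):
    """Fraction of cycles in which the pipeline does useful work.

    Idealized model: each thread computes for compute_cycles, then stalls
    for stall_cycles; other threads fill the stall at zero switch cost.
    """
    if compute_cycles <= 0 or stall_cycles < 0 or hw_threads < 1:
        raise ValueError("invalid cycle counts or thread count")
    single = compute_cycles / (compute_cycles + stall_cycles)
    # Extra threads add useful work until the pipeline is fully busy.
    return min(1.0, hw_threads * single)


if __name__ == "__main__":
    # Assumed workload: 25 compute cycles per 75 stall cycles.
    for n in (1, 2, 4, 8):
        print(f"{n} thread(s): {pipeline_utilization(25, 75, n):.0%} busy")
```

In this model a single thread keeps the pipeline busy a quarter of the time, and four threads are enough to fill it; beyond that point additional threads add no throughput on that core, which is where adding cores (CMP) helps.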

16.1.2.2. CMT and Solaris

Without CMT support, the kernel would see and treat each logical CPU presented by the chip no differently than it would any other CPU. It is important for the kernel to consider, however, the various sharing relationships that exist among a CMT chip's logical CPUs. Some CPUs may share a pipeline, for example, while others may share caches or perhaps a data path to cache or memory. The performance of a thread running on a given logical CPU can therefore be impacted (for better or worse) by threads running elsewhere on the core or chip.

CMT support in Solaris allows the dispatcher to be aware of the sharing relationships that exist among a given chip's logical CPUs. To reduce contention over shared processor resources and to improve bandwidth, the dispatcher load-balances running threads across the system's physical processors and cores. Where caches are shared among multiple logical CPUs, threads are given an affinity for the set of CPUs sharing a cache, so that a thread that must migrate tries to run next on another CPU sharing that same cache.
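The cache-affinity preference can be sketched as a simple selection rule. This is a toy illustration, not the Solaris dispatcher's implementation; the function name, the CPU-to-cache mapping, and the lowest-id tie-break are all assumptions.

```python
def pick_migration_target(current_cpu, idle_cpus, cpu_to_cache):
    """Choose where a migrating thread should run next (toy model).

    Prefers an idle CPU that shares a cache with current_cpu; otherwise
    falls back to any other idle CPU. Returns None if nothing is idle.
    """
    candidates = [c for c in idle_cpus if c != current_cpu]
    same_cache = [
        c for c in candidates
        if cpu_to_cache[c] == cpu_to_cache[current_cpu]
    ]
    if same_cache:
        return min(same_cache)  # lowest id as an arbitrary tie-break
    return min(candidates) if candidates else None


if __name__ == "__main__":
    # Hypothetical topology: CPUs 0-1 share cache "A", CPUs 2-3 share "B".
    topology = {0: "A", 1: "A", 2: "B", 3: "B"}
    print(pick_migration_target(0, [1, 2, 3], topology))  # cache-sharing CPU
    print(pick_migration_target(0, [2, 3], topology))     # fallback choice
```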

Without CMT awareness, the dispatcher could, for example, schedule multiple memory-bandwidth-hungry threads to run on CPUs all sharing the same memory controller, when it would have been far better to load-balance the threads across the physical processors such that each thread has dedicated use of a memory controller and need not contend.
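The difference between the two placements can be sketched with a toy model in which each chip has one memory controller shared by its logical CPUs. Both placement policies, the topology, and the function names are hypothetical simplifications for illustration and do not reflect the actual dispatcher code.

```python
from collections import Counter


def place_threads(cpu_to_chip, n_threads, chip_aware):
    """Return the logical CPU ids chosen for n_threads (toy model)."""
    idle = sorted(cpu_to_chip)  # idle logical CPUs, lowest id first
    load = Counter({chip: 0 for chip in cpu_to_chip.values()})
    chosen = []
    for _ in range(n_threads):
        if not idle:
            raise ValueError("more threads than logical CPUs")
        if chip_aware:
            # Prefer an idle CPU on the chip running the fewest threads.
            cpu = min(idle, key=lambda c: (load[cpu_to_chip[c]], c))
        else:
            # Unaware policy: treat every logical CPU as equivalent.
            cpu = idle[0]
        idle.remove(cpu)
        load[cpu_to_chip[cpu]] += 1
        chosen.append(cpu)
    return chosen


def max_threads_per_chip(cpu_to_chip, chosen):
    """Worst-case number of threads sharing one chip's memory controller."""
    return max(Counter(cpu_to_chip[c] for c in chosen).values())


if __name__ == "__main__":
    # Hypothetical system: 2 chips with 4 logical CPUs each.
    topology = {0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1, 7: 1}
    for aware in (False, True):
        cpus = place_threads(topology, 2, chip_aware=aware)
        worst = max_threads_per_chip(topology, cpus)
        print(f"chip_aware={aware}: CPUs {cpus}, max per chip {worst}")
```

With two bandwidth-hungry threads, the unaware policy places both on the first chip, while the chip-aware policy gives each thread its own chip and therefore its own memory controller.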