Each year, hardware vendors introduce larger and larger servers with ever increasing capacity. At the same time, software places ever increasing demands on systems as more users undertake more compute-intensive tasks than ever before. For the last three decades, however, Moore's Law meant that even if a fast enough machine wasn't available when you began a large software development project, one probably would be by the time you finished. Today, while CPU speeds continue to double annually, server designs are simply running into speed of light constraints as vendors try to place more and more CPUs into the same sized cabinet. At the same time, the growth of corporate intranets and the Internet has meant that user populations, instead of doubling, often grow by a factor of ten or more overnight. So how can more complicated software meet these growth requirements when individual CPU speeds only double annually and physical constraints limit potential server growth? The answer is various combinations of hardware architectures, including SMP (symmetric multiprocessing), MPP (massively parallel processing), NUMA (non-uniform memory access), and clustering. Below we introduce the reader to the major server architectures being promoted by hardware vendors today and describe the impact of each on software design.
The SMP, or symmetric multiprocessor, hardware architecture is the most popular multiprocessor hardware architecture in use today. The "symmetric" in SMP refers to the symmetric relation between CPUs and memory. SMP designs use a shared memory architecture in which all memory is shared equally among all CPUs. An SMP server typically has multiple CPUs and memory banks, and any CPU can access any memory bank in an equal amount of time, hence the "symmetric" nomenclature. A typical SMP design is shown in Figure 16-6. SMP servers range in size from two-CPU Pentium PCs to Sun's 64-CPU Enterprise 10000 (Starfire) server. Because CPU-to-memory access times are constant, no special tuning or software changes are required for multithreaded applications to gain near-linear performance improvements as CPUs and memory are added. Typical multithreaded applications (see Chapter 19) will see over 90% performance improvement for each CPU added in an SMP system. Eventually, however, physical and electrical limitations restrict how many CPUs can be connected in an SMP architecture.
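The near-linear scaling claim can be made concrete with a simple back-of-the-envelope model, in which each added CPU contributes a fixed fraction of one CPU's worth of throughput. The 90% efficiency figure follows the text; treating it as constant for every added CPU is a simplifying assumption for illustration, not a vendor formula.

```python
# Rough SMP scaling model: the first CPU gives full performance, and
# each additional CPU contributes a fixed efficiency fraction.
def smp_speedup(n_cpus, per_cpu_efficiency=0.9):
    return 1 + per_cpu_efficiency * (n_cpus - 1)

print(smp_speedup(2))    # 1.9 -- a second CPU adds 0.9 of a CPU
print(smp_speedup(64))   # roughly 57.7 CPUs' worth of work from 64
```

In practice the per-CPU efficiency itself degrades as the backplane saturates, which is exactly the limitation the next paragraphs quantify.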
In any SMP design, the ultimate hardware limit is the performance of the system interconnect, or backplane. There are two primary measures of backplane performance, throughput and latency. Throughput is determined by the combination of a backplane's clock rate and its width. A 64 bit (8 byte) wide backplane running at 100 MHz has a theoretical bandwidth limitation of 800 Mbytes/second. This bandwidth must be divided among all the CPUs, memory banks, and I/O devices on the backplane. At 100 MHz, typical speed of light propagation delays in silicon limit backplane lengths to about 19 inches, or about the width of a standard computer equipment rack. This in turn limits the number of CPUs that can be mechanically connected to a backplane in a 19 inch rack to about 64, give or take a factor of 2. The second measure of backplane performance is its latency. Backplane latency is measured as the delay from when a backplane transaction is initiated (i.e., a CPU to memory transfer) to when it is completed. Once again, speed of light limitations in a 19 inch backplane lead to best case latencies of several hundred nanoseconds.
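The throughput arithmetic is simple enough to sketch directly; this is a worked version of the 64-bit, 100 MHz example above, assuming one transfer per clock cycle:

```python
# Theoretical backplane throughput = bus width in bytes x clock rate,
# assuming one transfer per clock cycle.
def backplane_mbytes_per_sec(width_bits, clock_mhz):
    return (width_bits // 8) * clock_mhz

# The 64-bit, 100 MHz backplane from the text:
print(backplane_mbytes_per_sec(64, 100))   # 800 Mbytes/second
```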
Let's examine a real SMP example to illustrate some of these limitations. Sun's Enterprise 6000 SMP server has a backplane that is clocked at 83 MHz with a 500 nanosecond worst case latency. The peak transfer rate of the backplane is 2600 Mbytes/second, with a sustained transfer rate of 2500 Mbytes/second. Sun's backplane is packet switched, which allows multiple transactions to be outstanding on the backplane at the same time. This leads to the high sustained-to-peak transfer rate ratio when compared to more traditional circuit-switched backplanes such as VME designs. A single Enterprise 6000 CPU, however, is able to sustain a 400 Mbytes/second transfer rate to a single memory bank. Thus, if all CPUs needed to access memory at their peak rate, the backplane could not sustain more than 2500/400 = 6.25 CPUs. In practice, it would be nearly impossible to design a program that drove all CPUs to access memory at peak rates continuously. In addition, each CPU has a local cache of up to 4 Mbytes, which further reduces required CPU-to-memory bandwidth. As a result, the Enterprise 6000 shows nearly linear scalability on many benchmarks all the way up to its maximum of 30 CPUs. Other than using a correctly multithreaded application, therefore, there are few software architecture considerations to take into account when using an SMP hardware architecture.
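The saturation arithmetic from the Enterprise 6000 example can be expressed directly; the figures below are the ones quoted in the text:

```python
# How many CPUs can the backplane feed at full memory bandwidth?
# Figures are the Enterprise 6000 numbers quoted in the text.
SUSTAINED_BACKPLANE = 2500   # Mbytes/second, sustained backplane rate
PER_CPU_MEMORY_RATE = 400    # Mbytes/second, one CPU at full demand

print(SUSTAINED_BACKPLANE / PER_CPU_MEMORY_RATE)   # 6.25 CPUs at peak
```

The reason 30 CPUs scale anyway is that caches and realistic workloads keep each CPU's actual demand well below the 400 Mbytes/second worst case.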
An MPP, or massively parallel processor, typically has several hundred or more CPUs, each with local memory instead of shared memory. CPUs are interconnected to one or more adjacent processors with high-speed interfaces. A CPU wishing to access the memory of another processor must do so by requesting the access through that processor. Typically, MPP architectures have special message passing interfaces that are tuned to provide high inter-CPU bandwidth. For applications that can be partitioned so that all memory access is local to a given CPU, MPP architectures scale extremely well. This, however, requires careful consideration in your software design. MPP architectures are thus well suited to certain classes of scientific problems but are generally not good choices for general commercial computing. A sample MPP architecture is shown in Figure 16-7.
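The "request the access through that processor" pattern can be sketched as ordinary message passing between nodes. This is a toy illustration, not a real MPP programming interface; production systems use tuned message-passing libraries such as MPI, and the round trip shown here is where their inter-node latency comes from.

```python
# Toy model of MPP remote memory access: node A cannot dereference
# node B's memory directly; it must send a read request and wait
# for node B to service it and reply.
class Node:
    def __init__(self, name, memory):
        self.name = name
        self.memory = memory            # local memory, private to this node

    def handle_request(self, address):
        # Services a remote read on behalf of another node.
        return self.memory[address]

def remote_read(requester, owner, address):
    # In real hardware this is a message round trip over the interconnect.
    return owner.handle_request(address)

node_a = Node("A", {})
node_b = Node("B", {"x": 42})
print(remote_read(node_a, node_b, "x"))   # 42, after the round trip
```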
Another variant of the MPP architecture is the so-called MPP-SMP hybrid design, such as IBM's SP2. The SP2 is an MPP architecture that consists of multiple RS/6000 nodes with a high speed interconnect between nodes. Each RS/6000 node in the SP2 is actually an SMP architecture with up to 8 CPUs. A much larger class of problems can be partitioned so as to fit within an 8-CPU node, and these will scale well on an architecture such as the SP2. If the problem requires more than 8 CPUs, memory access must then take place over the system interconnect, which runs much slower than the gigabyte-per-second speeds of single-backplane systems. If you are planning on using such a hardware architecture, you should pay close attention to your software architecture to allow for application partitioning. If you cannot predict the memory access patterns of your application in advance, such as in a large data warehouse with ad-hoc queries, then you will have a much harder time scaling your application.
While a NUMA, or non-uniform memory access, hardware architecture is fundamentally a bad architecture from a computer science perspective, the physical limits of SMP designs are leading all vendors to investigate NUMA or NUMA-like hardware architectures. NUMA architectures were envisioned on the premise that as CPU and backplane speeds increase, physical packaging limitations will limit shared memory SMP designs to fewer and fewer CPUs. Even as backplane and cache technologies improve, an SMP system supporting 32 CPUs at 300 MHz in 1998 might support only 24 CPUs at 600 MHz in 2000 and 18 CPUs at 1200 MHz in 2002. To continue building servers whose performance scales with CPU speeds, some sort of NUMA architecture becomes a fundamental requirement by around the year 2000.
An early adopter of NUMA technology is Sequent, with their NUMA-Q architecture. Sequent uses a building block of quad Intel Pentium CPUs, hence the NUMA-Q name. These four CPUs are basically arranged in an SMP architecture. A total of 8 nodes, or 32 CPUs, are supported in a fully populated system. The local memory on each node acts as a large cache for the node, and thus a 32 CPU system can be supported with a slower backplane than would be required in a 32 CPU SMP design. The NUMA-Q implementation, for instance, supports a 500 Mbyte/second system bus local to each quad, with a 1.6 Gbyte/second interconnect between quads. Special cache coherency algorithms allow all memory to be accessed in a shared fashion, although there is a significantly higher latency, up to eight times longer, to access memory outside the local quad node. The NUMA-Q architecture will thus support standard multithreaded code as long as the application is partitioned such that enough memory accesses are local to each node to offset the higher latency of off-node accesses. Until more industry experience is gained with NUMA architectures and with NUMA cache coherency, the best suggestion is to either partition your software architecture much as you would in an MPP system or to carefully prototype and benchmark your application to ensure anticipated NUMA performance is realized.
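The partitioning tradeoff can be quantified with a simple weighted-average latency model. The 8x remote penalty comes from the NUMA-Q description above; the 100 nanosecond local latency and the local-access fractions are illustrative assumptions, not measured Sequent figures.

```python
# Effective NUMA memory latency as a weighted average of local and
# remote accesses.  The local-access fraction is the knob a well
# partitioned software architecture controls.
def effective_latency_ns(local_ns, remote_penalty, local_fraction):
    remote_ns = local_ns * remote_penalty
    return local_fraction * local_ns + (1 - local_fraction) * remote_ns

# With 95% of accesses local, average cost is only ~1.35x local:
print(effective_latency_ns(100, 8, 0.95))   # 135.0 ns
# With only 50% local, memory is effectively 4.5x slower:
print(effective_latency_ns(100, 8, 0.50))   # 450.0 ns
```

The steep difference between the two cases is why the text recommends MPP-style partitioning or careful benchmarking before committing to a NUMA design.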
Another way to extend system performance is to cluster servers together so that, at least at some levels, they appear as one resource to an application. Compared to MPP or NUMA architectures, clustered nodes typically are more loosely coupled and have lower bandwidth/higher latency interconnects. Since cluster nodes are more loosely coupled, they generally provide greater levels of high availability and fault tolerance than non-clustered systems. One of the earliest examples of a mainstream clustered architecture was the DEC VMS Cluster. Today, the two primary cluster architectures being implemented are the Microsoft "Wolfpack" cluster and Sun's "Full Moon" cluster. Except for specialized parallel database applications such as Oracle's Parallel Database Server, clusters require that the software architecture clearly separate applications running on one cluster node from those running on another. Both vendors, however, have announced roadmaps moving toward a single operating system image running over all nodes in the cluster. When such capability is reached in several years, the differences between MPP, NUMA, SMP, and clusters will be further blurred.
While closely related, highly available and fault-tolerant systems differ significantly in hardware architecture. A highly available system architecture minimizes service interruption, typically by providing IP address failover to a redundant host in the event of a system failure. Fault-tolerant systems employ hardware component redundancy within a single host to allow for non-stop operation even in the presence of a hardware failure. Stratus and Tandem are examples of vendors with fault-tolerant system architectures. In their designs, two or more CPUs execute in lock-step fashion, executing the same instructions each memory cycle. Specialized hardware compares the results and disables any CPU or other component upon detection of a failure. Failure recovery time is typically less than a second. Highly available system architectures do not operate redundant hardware in lock-step at the CPU level, but rather provide for redundancy at the system level.
Most highly available architectures rely on IP failover techniques to minimize service interruption in the event of system failures. A typical high availability system architecture is shown in Figure 16-8. This system is based on the concept of physical and logical hosts. In such a system, each logical host typically runs its own set of applications on top of a physical host. A heartbeat signal is exchanged between the two or more physical hosts, usually once a second or more. Should any host stop responding to the heartbeat signal, the assigned failover host will take over control of the failed host's disks and restart all applications on the new host. The host will then take over the IP address of the failed host and resume responding to client requests. The typical failover time for such a system is on the order of several minutes. The client impact of a failover will depend on whether the application is stateful or stateless. In a stateless application such as NFS file serving or web page serving, clients may see an application timeout, but will otherwise resume normal operation after the failover action completes. Stateful or connection oriented applications will typically require the client application to restart or otherwise reconnect to the server.
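The heartbeat-and-takeover logic described above can be sketched as follows. The miss threshold and the particular takeover steps are illustrative assumptions, not the behavior of any specific vendor's HA framework:

```python
HEARTBEAT_INTERVAL = 1.0   # seconds between heartbeats, per the text
MISSED_LIMIT = 3           # beats missed before declaring failure (assumed)

def primary_failed(last_heartbeat, now):
    """True once the primary has gone silent past the miss threshold."""
    return now - last_heartbeat > MISSED_LIMIT * HEARTBEAT_INTERVAL

def failover_steps():
    # Ordering matters: storage first, applications next, IP address
    # last, so clients reconnect only once services are actually up.
    return ["take over failed host's disks",
            "restart applications on standby host",
            "assume failed host's IP address"]

if primary_failed(last_heartbeat=10.0, now=15.0):
    for step in failover_steps():
        print(step)
```

The several-minute failover time quoted in the text is dominated by the middle step, restarting the applications, rather than by failure detection itself.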
As illustrated by the many hardware architectures presented in this chapter, hardware, both at the system level and at the CPU level, is becoming ever more complicated. To build a fast overall system it is no longer sufficient to have the fastest CPU. The operating system must also take advantage of the features provided by the CPU and vice versa. That is one of the reasons Microsoft has so many engineers working on-site at Intel and Intel has so many engineers working on-site at Microsoft. Companies that develop their own CPU architectures and operating systems, such as Sun Microsystems with their SPARC CPU architecture and Solaris operating system, also place hardware and OS engineers together on the same project team when designing a new CPU. Compiler writers also work in parallel with CPU and OS designers to make architecture tradeoffs at design time. For instance, instruction reordering was a new CPU optimization technique that SPARC designers considered adding to the UltraSPARC 1 CPU. This feature ended up being left out of the UltraSPARC 1, in favor of other functionality, when compiler writers convinced the hardware team they could more efficiently reorder instructions in the compiler. As system architectures continue to get more complicated, you should consider the potential performance improvement of selecting vendors who have a close relationship among their product's hardware, OS, and compiler design teams.