A.6 Concurrency: Getting Someone Else to Help You


We have seen that simply working harder is no longer the only choice available to us; the laws of physics mean that time is running out on that approach. We have looked at how processor designers try to work smarter in order to extract as much as possible from every CPU cycle, discussed some of the prevalent architectures available today, and examined cache memory as well as main memory. Many applications now require so much processing power that it is unrealistic to expect a machine with a single processor to be capable of the task. Economic forecasting, fluid dynamics, and digital image processing all manipulate vast amounts of data and in many cases demand output in real time. To provide this level of performance, we need to harness the power of multiple processors. This can be achieved in a number of ways: everything from interconnecting individual processors in a large, array-like configuration to harnessing multiple computers interconnected by a high-speed, dedicated network. A level of concurrency in our architecture can lead to dramatic improvements in overall throughput. Our first task is to classify the myriad design decisions into a manageable scheme; in this way, we can categorize what we are trying to achieve and compare it with the characteristics of competing architectures. A convenient and well-used method for studying these concepts goes back to 1972: Michael Flynn's Classification.

A.6.1 Flynn's Classification

In 1972, Michael Flynn devised a classification system to encapsulate the differing computer architectures that attempt some level of concurrency in operation. To this day, Flynn's Classification is a well-used and convenient way to study concurrency. Its beauty lies in its simplicity: Flynn noted that concurrency can occur in one of two places, the instruction stream and the data stream. A computer can work on a single instruction or multiple instructions at a time, and, similarly, on a single data element or multiple data elements. With such a simple philosophy, we end up with four basic models to study.

A.6.1.1 SISD: SINGLE INSTRUCTION SINGLE DATA

The PC on which I'm writing this book is an SISD machine. Here, we have a control unit fetching and decoding instructions from some form of memory. This single instruction stream is passed to the processor to execute. As we have seen, most processors have this control unit as part of the processor hardware itself. Any data that the processor needs will come from the memory system via a single data stream. This is the traditional von Neumann machine (see Figure A-11).

Figure A-11. SISD machine.



A.6.1.2 SIMD: SINGLE INSTRUCTION MULTIPLE DATA

In this design, we still have a single instruction stream. The Controller transmits each instruction to all the processors. The processors themselves have some local memory to store the data elements they retrieve from the memory system. When instructed by the Controller, all processors execute the single instruction on their own data elements. In its purest form, it is hard to see how an individual processor would communicate directly with the memory system to access its own data elements; a common feature of SIMD machines is therefore that a host computer provides a user interface and storage and supplies data elements directly to the individual processors. The Controller has its own control processor and control memory. A large proportion of the control operations of the executing program are performed by the Controller, which calls on the processors to perform arithmetic and logical operations only when necessary. Where we have large collections of data elements and want to perform the same operation on all data points, e.g., vector or array manipulation, such an approach may be appropriate. SIMD machines have come to be seen as too specialized in their approach to problem solving and have consequently gone slightly out of favor. Machines such as Cray and Thinking Machines supercomputers are SIMD machines; their specialized approach lends them to particular classes of problems, e.g., forecasting and simulations, where manipulation of large collections of data is central to the problems encountered.
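
To make the idea concrete, the following small C fragment applies one operation to every element of an array. On a SIMD or vector machine, the Controller would issue the multiply once and each processing element (or vector lane) would apply it to its own slice of the array; on an SISD machine the loop simply runs one element at a time. The function name, array, and scale factor are illustrative only.

#include <stddef.h>

/* Apply one instruction ("multiply by scale") to many data elements.
 * On a SIMD machine the Controller issues this multiply once and every
 * processing element applies it to its own slice of a[]. */
void scale_vector(double *a, size_t n, double scale)
{
    for (size_t i = 0; i < n; i++)
        a[i] *= scale;     /* same operation, different data element */
}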

Figure A-12. SIMD machine.

A.6.1.3 MISD: MULTIPLE INSTRUCTION SINGLE DATA

This type of machine is not common: here we have multiple processors issuing different instructions against the same data, and Flynn himself had trouble visualizing it. One commonly cited example is machines used for compute-intensive mathematics such as cracking security codes. Each processor is given the same data (the ciphertext) and works on it by trying different encryption keys to break the code. In this way, we achieve concurrency by dividing the set of possible keys among the processors.
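
As a hedged sketch of that code-breaking example, the fragment below hands the same ciphertext to several POSIX threads, each of which searches its own slice of the key space. The key-space size, the try_key() routine, and the ciphertext are hypothetical placeholders, not a real cryptographic interface.

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_WORKERS 4
#define KEYSPACE    (1UL << 24)            /* illustrative key-space size */

static const unsigned char ciphertext[] = "dummy ciphertext";  /* shared data */

/* Stand-in for a real decryption test; a real implementation would decrypt
 * the ciphertext with 'key' and check whether the result looks sensible. */
static int try_key(const unsigned char *ct, unsigned long key)
{
    (void)ct;
    (void)key;
    return 0;
}

static void *search_slice(void *arg)
{
    unsigned long worker = (unsigned long)(uintptr_t)arg;
    unsigned long lo = worker * (KEYSPACE / NUM_WORKERS);
    unsigned long hi = lo + (KEYSPACE / NUM_WORKERS);

    /* Same data, different keys: each worker covers its own slice. */
    for (unsigned long key = lo; key < hi; key++)
        if (try_key(ciphertext, key))
            printf("worker %lu found key %lu\n", worker, key);
    return NULL;
}

int main(void)
{
    pthread_t tid[NUM_WORKERS];

    for (unsigned long w = 0; w < NUM_WORKERS; w++)
        pthread_create(&tid[w], NULL, search_slice, (void *)(uintptr_t)w);
    for (int w = 0; w < NUM_WORKERS; w++)
        pthread_join(tid[w], NULL);
    return 0;
}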

Figure A-13. MISD machine.



A.6.1.4 MIMD: MULTIPLE INSTRUCTION MULTIPLE DATA

These are the most common concurrent machines you will find in the workplace, and they form the cornerstone of multiprocessing as we know it. Each processor issues its own instructions, delivered by its own controller, and operates on its own data. Processors can communicate with each other via the shared memory pool or via some primitive messaging system. Within this classification, you will find variants such as Symmetrical Multi-Processing (SMP), Non-Uniform Memory Access (NUMA, and its variants such as cc-NUMA and NORMA), and Clusters Of Workstations (COWs). Machines that house processors and memory in a single chassis are known as tightly coupled, whereas machines where collections of processors and memory are housed in separate chassis are known as loosely coupled.
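
From the programmer's point of view, a shared memory pool is what makes mechanisms such as System V shared memory possible: two processes, which on an MIMD machine may well be scheduled on different processors, attach the same segment and exchange data through it. The following is a minimal sketch (error checking omitted for brevity).

#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    /* Create a small shared segment visible to both processes. */
    int shmid = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);
    char *shared = shmat(shmid, NULL, 0);

    if (fork() == 0) {                       /* child: the "writer" */
        strcpy(shared, "hello from another processor");
        _exit(0);
    }
    wait(NULL);                              /* parent: the "reader" */
    printf("parent read: %s\n", shared);

    shmdt(shared);
    shmctl(shmid, IPC_RMID, NULL);           /* clean up the segment */
    return 0;
}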

Because this type of machine is the most common in the marketplace, we look a little closer at some of the variants found within the MIMD classification, namely SMP and NUMA (see Figure A-14).

Figure A-14. MIMD machine using shared memory as a means by which processors communicate with each other.

A.6.1.4.1 Symmetrical Multi-Processor (SMP)

This architecture is best described by explaining the symmetry hinted at in its title. The symmetry comes from the notion that all processors have equal access to main memory and to the IO subsystem in terms of priority and access time. It also means that the operating system, as well as user processes, can run on any processor. Older attempts at SMP failed this last test:

  • Master/Slave : One processor, the master, is dedicated to running the operating system. All other processors are used for user processes.

  • Asymmetric : The operating system can run on any processor, but only on one processor at a time.

Both these implementations have serious limitations. In a system that performs lots of IO, all IO interrupts coming from external and internal devices will be handled solely by the master, or monarch, processor. Without distributing IO interrupts, the monarch could be swamped with processing requests, leaving the user processors waiting for IO events to be acknowledged before they can continue with their computations. HP-UX still has the concept of a monarch processor, but only to boot HP-UX; once booted, all processors are treated as equals.

A simplified diagram of an SMP architecture can be seen in Figure A-15.

Figure A-15. Simplified view of an SMP architecture.



It may seem a trivial exercise to extend this architecture to accommodate an arbitrary number of processors. If you recall our discussion on memory, we found that even a uni-processor machine can quite easily swamp a DRAM-based memory system; the solution was to employ SRAM cache. The size of our cache is one aspect of our SMP design that will be crucial in determining whether the architecture scales well. By this we mean that as we add processors, overall system performance should scale proportionally: if we add a second processor, all else being equal, we would expect a 100 percent improvement in performance. To avoid too much memory contention, we need to look at two aspects of this design:

  • Cache size

  • Memory bus interface

We have discussed cache performance previously. The increased cost of providing a significant amount of L1 SRAM-based cache means that a hardware vendor needs to feel confident that its customers will pay the price; a single L1 cache architecture may be used in both uni-processor and multi-processor models. Many database systems work on many megabytes of data and, hence, have a large spatial locality; these and other similar applications will benefit from larger caches. The second aspect of SMP design that we need to consider is the architecture of the memory bus. In its simplest form, it is a collection of wires that pass information between the processor(s) and the memory subsystem. The number and bandwidth of each data path will influence the overall scalability of the SMP design. Having a single, large memory bus will require some forward thinking: we need to take the largest SMP configuration we will ever support and the maximum expected bandwidth required per processor, and multiply these figures together to give us a maximum peak bandwidth requirement (for example, a hypothetical 32-processor configuration in which each processor demands 1 GB/s at peak implies a 32 GB/s peak requirement on the bus). This will probably require additional wires connected to each processor. Additional wires are expensive and require additional pins to be available on the processor; with all the other resources requiring access to the processor, this can be a limiting factor in the overall design. If we cannot provide the required bandwidth, our processors will stall due to cache line misses and the latency of fetching data into cache. As mentioned previously, overall cache size can alleviate this, as can the size of each cache line. Another method to improve overall scalability is to provide sufficient connectivity to allow every processor independent access to every memory bank; multiple non-conflicting requests can then all be satisfied simultaneously. Such a design is often referred to as a crossbar architecture. The interconnection of wires is managed by circuitry that arbitrates among conflicting requests; this arbiter is commonly known as a crossbar switch. Figure A-16 shows diagrammatically how a crossbar switch may look.

Figure A-16. Crossbar switch to support SMP.



As you can see, the scalability of such a solution is intrinsically linked to the ability to scale the connectivity within the crossbar switch. If we were to double the number of processors, we would quadruple the number of connections within the crossbar switch; in general, increasing the number of processors by a factor of n increases the number of connections by a factor of n² (going from 8 to 16 processors, for instance, grows the number of crosspoints from 64 to 256). Although vendors such as HP have supplied this architecture in the V-Class machines, the cost of supplying significant scalability is the limiting factor in whether such a design will succeed. Before moving on to look at NUMA architectures, we need to look at how a number of caches operate together in a multiprocessor architecture.

A.6.1.4.2 Cache coherency protocols

As we mentioned previously, a common policy employed in a cache for writing data to memory is known as write back; this means that data stays in-cache until that cache line needs to be freed for some other data. In a multi-processor model, this becomes more complicated. What would happen if one processor required data that currently resides in the cache of another processor? Some people might think this unlikely. Consider the case where multiple threads are executing on separate processors (processor 1 running thread 1 and processor 2 running thread 2); it is not unreasonable to assume that in some cases threads from the same process will require access to the same data (a small sketch of this scenario follows the list below). How does processor 2 get the data from the cache of processor 1? We could build into the operating system a low-level routine that checked all the caches for the data required and, having found it, managed how the data got transferred to processor 2. Due to the overheads involved, however, very few architectures employ software cache coherency; what is required is some mechanism by which the hardware can manage this. There are a number of hardware solutions to cache coherency. The most common are:

  • A snoop bus

  • Directory-based cache coherency
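
Before looking at each mechanism, here is a minimal sketch of the sharing scenario described above: two threads of one process update the same counter, and if the operating system schedules them on different processors, the cache line holding that counter must migrate between the two caches. The loop count and variable names are illustrative only.

#include <pthread.h>
#include <stdio.h>

static long counter;                          /* lives in one cache line */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);
        counter++;        /* the cache line holding 'counter' migrates
                             between the two processors' caches         */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);   /* may run on processor 1 */
    pthread_create(&t2, NULL, worker, NULL);   /* may run on processor 2 */
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);        /* 2000000 */
    return 0;
}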

A.6.1.4.3 Snoop bus

The idea of a snoop bus is to provide a mechanism for caches to snoop on each other. If processor 1 requests a cache line, it makes that request via a broadcast to all other processors and to the memory subsystem. If the cache line currently resides in another processor's cache, that processor transfers the cache line to processor 1 and signals to the memory subsystem that the request has been met. To provide this, we would have to use existing bandwidth within the memory bus itself to communicate cache coherency between processors. This does not scale well, with the memory bus quickly becoming overwhelmed by the additional traffic; each broadcast has to wait until the bus is free before it can be sent (a process known as bus arbitration). One solution is to provide a private, separate snoop bus that the processors and the memory subsystem can use exclusively for snooping. Figure A-17 shows such a configuration.

Figure A-17. SMP architecture with a private snoop bus.



This design works well, but again we need to consider the problem of scaling. If we want to support large SMP configurations, we need to consider how much additional workload cache coherency will impose on the processor infrastructure; every processor we add increases proportionally the likelihood that any given processor will be asked to transfer cache lines to others. If we could increase the bandwidth of the snoop bus indefinitely while keeping latencies low enough to satisfy each additional request, this design might continue to work, but the cost of growing the snoop bus puts realistic limits on its scalability. The snoop bus (whether private or implemented as a function of the memory bus) still has many supporters in the processor industry, even though various hardware vendors have found that, above approximately 20 processors, the design requires very careful engineering.

A.6.1.4.4 Directory-based cache coherency

This design also requires additional hardware. It is more complex and more expensive; however, it has been shown to scale better than a pure snoop bus. In this solution, we have a hardware directory that is hardwired to all the caches; you can imagine it as building extra intelligence into our crossbar design, or even as an independent snooping crossbar. When we make a request for a cache line, it is resolved either by looking up the directory or by going to main memory. If the directory resolves the request, it manages the move of the data from one cache to another. This cuts down the number of cache coherency requests sent to individual processors, because we no longer broadcast requests to all processors but make individual requests via the directory. This technology works very well in small-scale implementations, but, like a crossbar, its complexity grows with the square of the number of processors. In some mainframe installations, it was found that the directory (commonly known as a storage control unit) was more complicated than the processors themselves. Although its use has waned recently, a form of the directory has been used in some NUMA systems.
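
The toy C structure below is one way to picture what a directory entry tracks; it is purely a conceptual model under my own naming, since real directories are implemented in hardware and differ considerably in detail.

#include <stdint.h>

#define MAX_CPUS 16

/* Conceptual model only: one directory entry per memory block, recording
 * which caches hold a copy and which cache (if any) holds a dirty copy. */
struct dir_entry {
    uint16_t sharers;   /* bit i set => CPU i's cache holds the block */
    int8_t   owner;     /* CPU holding a modified copy, or -1 if none */
};

/* On a read miss by 'cpu', decide where the data should come from:
 * the owning cache if the block is dirty elsewhere, or main memory.
 * Either way, no broadcast to every processor is needed. */
static int read_miss(struct dir_entry *e, int cpu)
{
    int source = (e->owner >= 0) ? e->owner : -1;   /* -1 => main memory */
    e->sharers |= (uint16_t)(1u << cpu);            /* record the new sharer */
    return source;
}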

With these limitations on scale and, some would say, flexibility in mind, let's look at an alternative MIMD architecture: NUMA.

A.6.1.5 NON-UNIFORM MEMORY ACCESS

Let's start by explaining the name. To be a true SMP system, all processors must have fair and equal access to all of main memory; in this sense, an SMP system can be thought of as a Uniform Memory Access (UMA) system. It follows that in a Non-Uniform Memory Access design, groups of processors have different access times to different portions of memory, and that is exactly the case: what we will call local memory performs as it does in an SMP system, while access to what we will call remote memory incurs additional latencies. This type of architecture requires additional hardware, and the operating system needs to understand the cost of using remote memory: because of spatial locality, it is much better to have a processor's data in its local memory. Another aspect of such a design deals with processor affinity, a concept that applies not only to NUMA systems but also to SMP systems. Processor affinity builds on the observation that a running thread will have its working set loaded into a processor's cache; when that thread next runs, it makes sense, where possible, to run it on the processor it previously ran on, because there is a chance that the data it was using will still be in-cache, removing the need to load it from main memory again. We see later how HP-UX offers processor affinity. When considering a NUMA architecture, we also have to consider the additional latencies involved when a process is moved to a remote location. Before going any further, let's discuss how some NUMA implementations look from a schematic point of view:

  • First, groups of processors have access to local memory . This grouping is implemented in hardware as a node or cell .

  • Cells can communicate with each other over a high-speed interconnect .

  • The high-speed interconnect needs to be able to reach every cell and, hence, every part of memory.

Figure A-18 shows an example of a NUMA architecture.

Figure A-18. Example of a NUMA architecture.

The individual cells or nodes could be fully functioning SMP systems themselves. In fact, a number of implementations have taken this form, namely IBM's (formerly Sequent's) NUMA-Q and Hewlett-Packard's Scalable Computing Architecture (SCA). The high-speed interconnect in those cases is governed by a set of IEEE standards known as the Scalable Coherent Interconnect (SCI). SCI is a layered protocol covering everything from electrical signaling, connectors, and cabling all the way up to how to maintain cache coherency between individual caches in remote nodes. An alternative to SCI is a project undertaken by Stanford University in which regular SMP systems are connected together using a device similar to the storage control unit we saw with directory-based cache coherency; this device interconnects whole groups of processors rather than the individual processors that SCI connects. The directory controller performs cache coherency between the nodes/cells whenever a request for a particular memory address or active cache line is made. The architecture is known as DASH (Directory Architecture for SHared memory). In this way, we attempt to provide a single uniform address space; however, fundamental changes are needed in how memory is addressed so that an address can be decoded to tell one node/cell in which other node/cell a particular memory address is actually located. This should have no impact on individual nodes/cells when they are operating independently.

The success or failure of a NUMA solution rests on how non-uniform the access times to remote memory locations are. If the latencies are large, applications whose processes/threads communicate with each other may see significant performance problems as message-passing via memory slows down due to the increased memory latencies. On the other hand, if the latencies are small, the NUMA system will perform like a conventional SMP system but without the inherent scalability problems.

Many hardware vendors are starting to employ NUMA architectures in their offerings, because the plug-and-play nature of constructing a complex of cells allows customers to configure virtual servers with varying numbers of cells depending on performance and high-availability considerations. Hewlett-Packard's cell-based systems such as the rp84X0 and Superdome fall into this category: a cell coherency controller maintains the state of cache lines and memory references within and between individual cells. The processors used in these servers (PA-8600, PA-8700, PA-8700+) utilize a physical memory address that encodes the cell number.

Sun Microsystems offers similar solutions in its SunFire servers, as does IBM with various servers, including its high-end p690 UNIX server.

Although the hardware may support a NUMA-based architecture, the operating system needs to understand these features. HP-UX 11i version 1 had no concept of the NUMA features of the underlying architecture and treated available memory as if it belonged to a traditional SMP system. HP-UX 11i version 2 has started to utilize NUMA features by introducing concepts into the operating system such as cell-local memory.
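
As an illustration of the processor affinity mentioned earlier, the sketch below binds the calling process to a single processor using HP-UX's mpctl(2) interface. The request names shown (MPC_GETNUMSPUS, MPC_GETFIRSTSPU, MPC_SETPROCESS_FORCE) and the minimal error handling make this an assumption-laden sketch; check the mpctl(2) manual page for your HP-UX release before relying on it.

#include <stdio.h>
#include <unistd.h>
#include <sys/mpctl.h>          /* HP-UX processor-control interface */

int main(void)
{
    /* How many processors (SPUs) does this system have? */
    int nspus = mpctl(MPC_GETNUMSPUS, 0, 0);

    /* Identify the first processor and ask the scheduler to keep this
     * process on it, so its working set stays warm in that CPU's cache. */
    spu_t first = mpctl(MPC_GETFIRSTSPU, 0, 0);
    if (mpctl(MPC_SETPROCESS_FORCE, first, getpid()) == -1)
        perror("mpctl(MPC_SETPROCESS_FORCE)");

    printf("%d SPUs available; process %ld bound to SPU %d\n",
           nspus, (long)getpid(), (int)first);
    return 0;
}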

A.6.1.6 OTHER NUMA VARIANTS

The following NUMA variants are seldom seen as standalone solutions. Commonly, they are implemented as "configuration alternatives" of a cc-NUMA architecture where a particular application/user requirement dictates such a configuration.

  • NORMA (No Remote Memory Access): Access to memory in other nodes/cells is not allowed.

  • COMA (Cache-Only Memory Architecture): Here, processors are allowed to access data only in their own cache/memory. If a "remote page" is accessed, the "cache controller" brings the data into memory on the local node, allowing a process on the local node to continue operation. It sounds a lot like cc-NUMA, and in some ways it is; some would say that early cc-NUMA machines were more like COMA machines. The differences are subtle but distinguishable. For more detail, I would direct you to the superb book In Search of Clusters, 2nd Edition, by Gregory F. Pfister (ISBN: 0-13-899709-8).

A.6.1.7 MASSIVELY PARALLEL PROCESSORS (MPP)

I have left MPP systems out of our discussions because the topic could fill another entire book. It's not that they are unimportant; in fact, the Top 500 (http://www.top500.org) supercomputers all employ MPP as their underlying architecture. I will refer you to a quote from the well-respected computer scientist Andrew S. Tanenbaum (1999): "When you come down to it, an MPP is a collection of more-or-less standard processing nodes connected by a very fast interconnection network."

Many vendors offer this as a means of gluing together a vast number of individual machines in order to provide massive computational power. Usually, this comes at such an expense that only a few organizations can afford, or need, such machines. Some would argue that an MPP system is loosely coupled because the individual nodes are physically separate. Others would argue that the high-speed interconnect effectively acts in the same manner as the interconnect in a NUMA architecture, with individual nodes viewed as cells, rendering the architecture tightly coupled.

A cheaper alternative is to use what is known as a COW (Cluster Of Workstations). Unlike the tightly coupled architecture of MPP, COWs are loosely coupled in that they may use an existing TCP/IP network as their means of communication. The COW is formed by advanced scheduling software that distributes tasks to the individual nodes, which could be as simple as desktop PCs. Some people would say that "COWs are supercomputers on the cheap."

A.6.2 SPMD: Single Program Multiple Data

Flynn's Classification traditionally covers only four architectural definitions, and few people would argue that this fifth definition truly belongs under its banner; it is more a model for parallel processing, almost a hybrid between SIMD and MIMD. This time, the single program is something akin to a standard UNIX program or process. The difference is that within the process we have multiple independent and differing operations to perform. For example, fluid dynamics involves complex fluid flows where complex equations are required to calculate the behavior at boundaries, e.g., between a pipe and the fluid itself. Our traditional UNIX process can deal with this by spawning a new thread for each boundary condition; the calculations are more complex for boundary conditions than for non-boundary conditions. We achieve parallel processing through the use of threads, and these programs run on classical MIMD machines. More system time may be required to manage the additional threads, but overall we hope that the problem will be completed in less real time.
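
A skeletal illustration of this SPMD style within a single UNIX process follows: one program, several threads, each running the boundary or interior routine over its own region of the data. The region structure and the two solve routines are empty placeholders.

#include <pthread.h>

#define NREGIONS 8

struct region { int is_boundary; /* plus the region's data ... */ };

static struct region regions[NREGIONS];

/* Placeholder computations: the boundary case does the more complex work. */
static void solve_boundary(struct region *r) { (void)r; }
static void solve_interior(struct region *r) { (void)r; }

static void *solve(void *arg)
{
    struct region *r = arg;
    if (r->is_boundary)
        solve_boundary(r);      /* complex equations at the pipe wall   */
    else
        solve_interior(r);      /* simpler equations for the bulk fluid */
    return NULL;
}

int main(void)
{
    pthread_t tid[NREGIONS];

    for (int i = 0; i < NREGIONS; i++) {
        regions[i].is_boundary = (i == 0 || i == NREGIONS - 1);
        pthread_create(&tid[i], NULL, solve, &regions[i]);
    }
    for (int i = 0; i < NREGIONS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}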

We have spent considerable time looking at the memory hierarchy and issues surrounding processor architectures. Let's discuss other aspects of system design that can influence overall performance.


