3.1 Microprocessor Architecture

At a fundamental level, a microprocessor is the physical implementation of a set of rules. These rules, called instructions, specify exactly what tasks the microprocessor is allowed to perform. The set of all the rules together is called the instruction set; in combination with other information, they define the instruction set architecture, or ISA, which contains all the information you'd need to write a program that runs correctly. For a more detailed discussion of these abstract underpinnings, see Chapter 1.

In the last 20 years, microprocessor designers have split into two competing camps based on design philosophy. One camp espoused creating an instruction set in which individual instructions were very powerful, almost on the level of C primitives. These instructions are quite complex, and this camp came to be known for building complex instruction set computing (CISC) designs. The other camp decided to create instruction sets in which the individual instructions were minimal, but their simplicity allowed many optimizations to be performed. This is known as reduced instruction set computing (RISC) processor design. Both methods accomplish the same amount of work; the difference is that a RISC design will probably take more instructions, but each of those instructions will most likely execute in less time.

RISC has won the war of performance. Increasingly, CISC processor designs are becoming more RISC-like. For example, Intel's P6 processor core actually translates the complex IA-32 instructions into a much simpler internal format before executing them. Another good example is the Transmeta Crusoe processor, which uses a software layer (called code morphing) to, in part, translate IA-32 instructions into the Crusoe's internal instructions.

Lately, designers have started to add extra instructions, typically for graphics support. Sun's VIS and Intel's MMX extensions are good examples of this. [2] These instructions are based on the idea that while an ordinary memory or arithmetic operation works on a full 32-bit word, we might only care about 8 of those bits at a time; handling four such values the ordinary way takes four memory operations and four arithmetic operations, even though the data would have fit in a single word. These extensions work by packing four 8-bit values into the full 32 bits, so that only one memory operation and one arithmetic operation are required. However, using these extensions to maximum effect involves quite a bit of effort on the part of the compiler design team. Figure 3-1 shows multimedia extension parallelism.

[2] Intel's IA-32 architecture is, as we've said, a CISC design. MMX is still a good example of a graphics-specialized instruction set, however.

Figure 3-1. Multimedia extension parallelism
figs/spf2_0301.gif
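To make the packing idea concrete, here is a minimal sketch in plain C. It is not real VIS or MMX code (those extensions are exposed through dedicated instructions and compiler intrinsics); instead it mimics in software what a single packed-add instruction does in hardware: four 8-bit additions carried out in one 32-bit operation, with masking to keep each lane's carry from spilling into its neighbor.

#include <stdint.h>
#include <stdio.h>

/* Add four packed 8-bit lanes in a single 32-bit operation (each lane wraps
 * modulo 256). The masks keep a carry in one lane from spilling into the
 * next; a real packed-add instruction does this in hardware. */
static uint32_t packed_add8(uint32_t a, uint32_t b)
{
    uint32_t low = (a & 0x7f7f7f7fU) + (b & 0x7f7f7f7fU);  /* add the low 7 bits of each lane */
    return low ^ ((a ^ b) & 0x80808080U);                  /* fold the top bit of each lane back in */
}

int main(void)
{
    uint32_t a = 0x010203faU;   /* packed bytes {1, 2, 3, 250} */
    uint32_t b = 0x0405060aU;   /* packed bytes {4, 5, 6, 10}  */
    printf("%08x\n", packed_add8(a, b));   /* prints 05070904: {5, 7, 9, 260 mod 256 = 4} */
    return 0;
}

Done the ordinary way, the same work would take four separate loads and four separate adds, which is exactly the overhead these extensions were designed to eliminate.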

So, are RISC processors with these extensions added on (for example, Sun's UltraSPARC) still RISC? The important thing to remember is that RISC was a means to an end -- building the fastest microprocessor possible within specific technological constraints -- not an end in and of itself. Some people have taken to saying that these "second generation" RISC processors aren't actually RISC at all: instead they implement fast instruction set computing (FISC), which perhaps defines their intent a little more clearly.

3.1.1 Clock Rates

Let's imagine our microprocessor as an Athenian trireme [3] moving through the Mediterranean. In order for the ship to move forward at peak efficiency, all the rowers must row at the same pace: their actions are synchronized by the beat of a large drum. In much the same way, everything on a microprocessor happens in synchronization by means of an external clock signal. This signal, which typically cycles millions of times per second (one hertz is one cycle per second; processor clock signals are generally counted in megahertz (MHz), or millions of hertz), is usually generated by a clock crystal on the processor's logic board. In large datacenter systems, the clock signal is sometimes generated on a dedicated clock board. This external clock rate is increased by a set multiplier to produce the internal clock rate, which is the number that is typically quoted; for example, a 100 MHz board clock with a 3.5x multiplier gives a 350 MHz internal clock rate. This multiplier determines how fast a given processor will perform in comparison to otherwise similar processors.

[3] A trireme is a seafaring vessel, typically a warship, characterized by three rows of oars. The Greek city-state of Athens was known for its naval strength, much as Sparta was known for its army.

I can't emphasize enough how dangerous it is to make performance comparisons based only on clock rate. Table 3-1 compares the IBM PowerPC 604e microprocessor, which is used in IBM's RS/6000 line, with the MIPS R10000, which is used by SGI.

Table 3-1. Processor/clock rate performance comparison

Processor            Clock rate   SPECint95   SPECfp95
IBM PowerPC 604e     332 MHz      12.9        6.2
IBM PowerPC 604e     266 MHz      9.2         5.8
MIPS R10000          250 MHz      12.4        9.7

We'll talk more about the SPEC benchmark suite later (also see Section 2.3.2.2 in Chapter 2), but a gross analysis of the data is possible here. From a simple comparison of the clock rates, it would seem that the MIPS R10000 processor, at a mere 250 MHz, should be slightly slower than the 266 MHz PowerPC, and 25% slower than the 332 MHz part. However, the benchmarks tell a very different tale: the R10000 is almost as fast as the 332 MHz PowerPC in integer performance, and bests it by a substantial margin in floating-point computation.

The astute reader will have noticed that Table 3-1 also provides evidence that increasing clock rate does not imply an equivalent increase in performance. While the 332 MHz IBM PowerPC processor has an almost 25% higher clock rate than the 266 MHz version, it has only a slight (about 7%) edge in floating-point performance. This is because increasing the clock rate of the processor tends to accentuate the constant and substantial delay involved in accessing memory. The effect is particularly pronounced in floating-point benchmarks, because their working sets tend to be rather large.

The upper limit on the clock rate depends on some rather esoteric parameters of the processor design, such as the underlying semiconductor technology. While it is possible to optimize for clock speed by adjusting these parameters, the real-world characteristics of the resulting design -- such as cost, heat, and power consumption -- may make it infeasible for the marketplace. The clock rate also depends on the complexity of the work being done: each stage of the processor needs a certain amount of time to do its job, and the slowest path through the processor is called the critical path. Shortening the critical path, for example by simplifying the design, lets the clock speed go ever higher.

Another way of thinking about the performance of a processor is to consider how many cycles it takes to execute an instruction, on average. This is called the cycles per instruction (CPI); obviously, lower numbers are better.
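The relationship between clock rate and CPI can be made concrete: execution time is roughly (instruction count x CPI) / clock rate. The sketch below uses made-up numbers purely for illustration; nothing here comes from a measured system.

#include <stdio.h>

int main(void)
{
    /* All values are hypothetical, for illustration only. */
    double instructions = 500e6;   /* 500 million instructions executed */
    double cpi          = 1.4;     /* average cycles per instruction    */
    double clock_hz     = 250e6;   /* 250 MHz clock                     */

    /* execution time = (instruction count * CPI) / clock rate */
    double seconds = (instructions * cpi) / clock_hz;
    printf("estimated run time: %.2f seconds\n", seconds);   /* 2.80 seconds */
    return 0;
}

A processor with a higher clock rate but a worse CPI can easily come out behind, which is exactly what Table 3-1 shows.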

3.1.2 Pipelining

Since we can't always increase the performance of a processor simply by increasing the clock rate, we must be more devious in our approach. One method involves placing two functional units in the design; for example, instead of having just one unit that can perform integer arithmetic, we'll have two. [4] However, this is rather expensive in terms of processor die space.

[4] A processor that uses this technique is called superscalar . I discuss this in Section 3.1.3 later in this chapter.

Instead, we might look closer at how a microprocessor operates. While an instruction is being retrieved from memory, the unit responsible for integer arithmetic is sitting idle; surely this isn't the best way to make a fast processor! For example, we might design a processor that breaks up instructions into five stages:

Instruction fetch (IF)

Retrieve the next instruction to be executed from memory.

Instruction decode (ID)

Figure out what the instruction actually does.

Execute (EX)

Execute the instruction.

Memory (MEM)

Write data to and read data from memory.

Writeback (WB)

Write computed results into registers.

A single instruction might not use all of these stages -- for example, a memory load instruction might not have any need of the circuitry provided in the execute stage. [5] So, let's try overlapping the execution of the instructions.

[5] Then again, it might. This depends on the particulars of the processor design.

The first instruction is fetched. As it moves on to the decode stage, the next instruction is fetched. As the processor continues to work, it keeps every stage busy in this fashion, so that at any given instant there are five instructions in some phase of execution. This process is called pipelining.

In a five-stage pipeline such as we've described, we ideally get the results of an instruction every clock cycle. However, this does not mean that each instruction takes only one cycle to execute; rather, each one takes five cycles. It's important to be careful about this when you're thinking about CPI measurements; in a pipelined processor operating perfectly, the CPI is 1.0. This just means that, on average, one instruction is completed every clock cycle. So, pipelining does not increase the performance of a single instruction; rather, it increases the throughput of the microprocessor. In addition, as we break each instruction up into a set of stages, we reduce the complexity of each stage. The more stages we have, the simpler we can make each one, which means that we can probably increase the clock rate of the processor as well. Figure 3-2 illustrates instructions in flight through a pipeline.

Figure 3-2. Instructions in flight through a pipeline
figs/spf2_0302.gif
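The overlap in Figure 3-2 is easy to reproduce for yourself. The following short C program (an illustrative sketch, not tied to any real processor) prints which stage each instruction of an ideal, stall-free five-stage pipeline occupies on every clock cycle; once the pipeline is full, one instruction completes per cycle.

#include <stdio.h>

/* Print the stage each instruction occupies on every cycle of an ideal,
 * stall-free five-stage pipeline. Illustrative only. */
int main(void)
{
    const char *stages[] = { "IF", "ID", "EX", "MEM", "WB" };
    const int n_stages = 5;
    const int n_instr = 7;
    const int n_cycles = n_instr + n_stages - 1;   /* 11 cycles for 7 instructions */

    printf("cycle:");
    for (int cycle = 1; cycle <= n_cycles; cycle++)
        printf(" %4d", cycle);
    printf("\n");

    for (int i = 0; i < n_instr; i++) {
        printf("i%-4d:", i + 1);
        for (int cycle = 1; cycle <= n_cycles; cycle++) {
            int stage = cycle - 1 - i;             /* instruction i enters IF on cycle i+1 */
            if (stage >= 0 && stage < n_stages)
                printf(" %4s", stages[stage]);
            else
                printf("     ");                   /* not yet started, or already retired */
        }
        printf("\n");
    }
    return 0;
}

The first result appears after five cycles; after that, seven instructions retire in eleven cycles, which is how the average CPI approaches 1.0 even though each individual instruction still takes five cycles.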

Pipelining is one of the most dominant factors in the historical improvement of processor performance. However, it is not quite so simple as we have made it out to be. Sometimes, an instruction might take a particularly long time; for example, reading data from memory might take substantially longer if the data to be read doesn't reside in a cache. This causes the pipeline to "gum up," which is called a stall. Microprocessor designers go to great lengths to avoid pipeline stalls; stalls have a very detrimental effect on performance.

3.1.2.1 Variable-length instructions

The smooth functioning of our pipeline depends on all our instructions taking approximately the same amount of time. In a RISC design, each instruction is generally the same size. However, because of their complexity, CISC instructions often vary in length. For example, a "return from routine" instruction is pretty compact, but a complex "multiply-add" instruction might be a lot longer. Figure 3-3 illustrates the difference between fixed-length instructions and variable-length instructions.

Figure 3-3. Fixed-length and variable-length instructions
figs/spf2_0303.gif

The processor has no idea how long an instruction will be until it decodes it and finds out what it is. Since the processor has to fetch the instruction before it can be decoded, if the instruction turns out to be longer than what was fetched, the processor has to go back to the instruction fetch stage and retrieve the rest. This involves a costly trip to memory, and stalls the pipeline. If, as in most RISC instruction sets, the length of each instruction is fixed by definition, this problem doesn't occur, and the pipeline can flow more smoothly.

3.1.2.2 Branches

Another problem in pipelining comes from instructions known as conditional branches. A conditional branch specifies the next instruction to be executed, depending on the result of a specific computation. Consider the following code snippet:

if (i == j) {
    delta++;
}

The if statement here is an example of a conditional branch. These are implemented in most RISC instruction sets as branch if equal (BEQ) and branch if not equal (BNE) instructions.

When the processor retrieves an instruction from memory, it has no idea what the instruction is, so it simply fetches the next instruction in memory. By the time the processor knows that the instruction is a branch, it has finished decoding it. If the branch falls through (that is, the branch is not taken and execution continues with the next instruction in memory), everything works well. However, if the branch is taken (that is, the next instruction executed is not the instruction that was fetched), the processor needs to go back, discard all the pending instructions in the pipeline, and fetch the instruction we actually want to execute.

Unfortunately, analysis has shown that branch instructions occur on average about every seven instructions. If only half of our branches fall through, we lose about 20% of the pipeline's efficiency. Rather than take a penalty for clearing out the pipeline, many RISC processors implement a branch delay slot: the instruction immediately after the branch is always executed, whichever way the branch goes, so the compiler tries to place an instruction there that is useful (or at least harmless) in either case. For example, consider the pseudocode below:

loop:
    c--;
    i = j++;
    j = k + 4;
    l = m + n;
    if (c != 0) goto loop;

Because adding m and n and storing the result in l does not depend on any earlier instructions in the loop, we can reorganize the instructions like this in order to use a branch delay slot:

loop:
    c--;
    i = j++;
    j = k + 4;
    if (c != 0) goto loop;
    l = m + n;

By the time the processor has decided which way the branch should go, the delay-slot instruction is already in the pipeline and the correct next instruction is about to be fetched. If we can't find a good instruction to fill the branch delay slot, we simply insert a no-operation (NOP) instruction. This is generally handled by the compiler or assembler and does not require any effort from the programmer. This is a neat way to reduce the impact of branch-related processor stalls; however, as processors became capable of running more concurrent instructions, more powerful approaches were needed.

If we could decrease the branch miss rate further, the pipeline would be more efficient. With this in mind, processor designers devised schemes that allow the processor to predict the direction of a branch. As an instruction is decoded, the processor notices that it is a branch and consults a table recording that branch's recent behavior. Based on this recent behavior, it makes a guess and immediately begins fetching instructions from the guessed location. If the prediction is wrong, we have to cancel all the instructions fetched along the wrong path. One trivially simple branch-prediction mechanism is to assume that a branch to an earlier address is always taken. This is surprisingly accurate, because many programs iterate over a loop many times:

for (i = 0; i <= 100000; i++) {
    j[i] = k[i] + j[i];
}

If the branch prediction is correct, branches are no more expensive than any other instruction. If not, we must stall the processor, but hopefully this won't occur very frequently. A simple branch-prediction scheme is over 95% correct, which significantly reduces the overall penalty of branch-related pipeline stalls.
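A common way to build the prediction table mentioned above is with two-bit saturating counters, one per table entry. The sketch below is a minimal illustration of that classic scheme; the table size and the way the branch address is hashed into the table are assumptions made for the example, not a description of any particular processor.

#include <stdbool.h>
#include <stdint.h>

/* Two-bit saturating counter predictor: each entry holds a value 0..3.
 * Values 0-1 predict "not taken", 2-3 predict "taken"; a branch has to be
 * mispredicted twice in a row before the prediction flips direction. */
#define TABLE_SIZE 1024

static uint8_t counters[TABLE_SIZE];   /* entries start at 0: "strongly not taken" */

static unsigned table_index(uint32_t branch_addr)
{
    return (branch_addr >> 2) % TABLE_SIZE;   /* illustrative hash of the branch address */
}

bool predict_taken(uint32_t branch_addr)
{
    return counters[table_index(branch_addr)] >= 2;
}

void update_predictor(uint32_t branch_addr, bool was_taken)
{
    uint8_t *c = &counters[table_index(branch_addr)];
    if (was_taken && *c < 3)
        (*c)++;        /* drift toward "strongly taken" */
    else if (!was_taken && *c > 0)
        (*c)--;        /* drift toward "strongly not taken" */
}

The two-bit counter matters for loops like the one above: the single not-taken outcome at loop exit only nudges the counter, so the branch is still predicted taken the next time the loop is entered.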

3.1.3 The Second Generation of RISC Processor Design

The ideal first-generation RISC processor would be able to execute one instruction per clock cycle. Although a five-stage pipeline takes five cycles to fill completely, if we keep it full, the overall CPI approaches 1.0. Once advanced first-generation designs approached this ideal, microprocessor design companies began to compete to see who could build the fastest RISC processor. These strategies fell into three basic areas: speed freaks, superscalar designs, and superpipelined designs.

Speed freak designs are based on optimizing a simple, small, efficient core for very high clock rates. This approach is epitomized by the Digital Equipment Corporation (DEC) Alpha processors, which have historically exhibited clock rates approximately twice those of their nearest competitors.

By adding more computational elements to the processor die, performance can be increased without increasing the clock rate. This is known as superscalar processor design. The number and type of operations that can be run concurrently depends on both the application, which must have some degree of parallelism, and the processor, which must have enough functional units and a way of keeping them busy. The idea seems simple, but it is quite difficult to implement well, for both the compiler writer and the hardware engineer. Current state-of-the-art superscalar processors can issue six instructions per clock cycle; at peak performance, they achieve a CPI of about 0.17!

One variation on superscalar processing is called very long instruction word (VLIW) design. In such a scheme, the processor itself is very simple and lacks almost all of the advanced scheduling hardware found in superscalar processors. In a VLIW processor, such as the Transmeta Crusoe, instructions are grouped together into bundles (instruction words) of three or four. [6] The processor then loads each instruction word in sequence and executes all of the instructions in it in parallel on multiple functional units. This places an enormous burden on the compiler writers: writing an optimizing compiler for a VLIW processor is exceptionally hard, and is still an area of active research. The Crusoe design lightens this burden by implementing what amounts to an on-the-fly recompiler in its code morphing software.

[6] In the early VLIW processors built by Multiflow Computer, as many as 28 operations could be bundled in a single instruction word.

In RISC processors, simpler often means faster; perhaps some of the instruction steps in your pipeline are overly complex. The hallmark of a superpipelined design is to reduce per-step complexity by increasing the number of steps. If the reduced complexity allows the processor to execute faster, superpipelined processors can achieve the same performance improvements as superscalar processors. However, deep pipelines tend to mean that branches cannot be resolved until relatively late in the pipeline, which makes mispredicted branch stalls very expensive. Superpipelining can also be combined with other approaches, as is evidenced by the MIPS R8000, which was a superscalar design with a pair of deep pipelines.

Designing a fast computer is more than just a function of a fast microprocessor, however. Removing the bottlenecks in processing exposes more bottlenecks, particularly in the way that information is moved in and out of the CPU.


