10.1 Processor-Level Parallelism

Most contemporary processors, whether RISC or CISC, have come to rely upon instruction pipelining as an important means of enhancing performance. In any architecture, the various stages of the instruction cycle (such as in Figure 2-4) involve specialized internal hardware. Some electronic components perform instruction decoding, for example, while others handle the retention of results.

An analogy is often drawn to the industrial model of an assembly line in a manufacturing process, but the philosophical concept is at least as old as the discussion of specialized skills in relation to social organization in Plato's Republic.

10.1.1 Simplified Instruction Pipeline

The basic tenets of instruction pipelining involve specialization of function and an attempt to keep all of the major components productive at all times. To illustrate this, consider the passage of a sequence of instructions I1, I2, I3, … through a hypothetical processor whose instruction cycle involves the five major stages that were shown in Figure 2-4.

An instruction pipeline has a depth, generally taken to be the number of stages that handle various distinguishable phases of a complete instruction cycle. The depth for the cycle shown in Figure 2-4 would be five:

Stage 0: Fetch instruction

Stage 1: Decode instruction

Stage 2: Obtain operands

Stage 3: Compute result

Stage 4: Store result

Every instruction passes through these stages in the same sequential manner. For actual instruction sets, there might be no work to be done at any given stage for a few specific instruction types, but here we ignore such complicating irregularities.

For purposes of illustration, it is helpful to envision an ideal situation where each stage takes the same amount of time as any other, with that time also being identical for all instruction types.

Ideally, one new instruction can be started at every clock cycle and another instruction already in progress can be concluded. With five stages, the CPU can be working on different phases of up to five instructions at once, keeping all the physically distinct functional stages busy. The throughput of a pipelined CPU with a depth of five can thus ideally be five times greater than that of a similar, but nonpipelined, CPU, all other factors being the same. We diagram these ideas in Figure 10-1, showing the progression of instructions (I1, I2, …) through each functional stage over time.
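The ideal five-fold throughput claim can be checked with a simple cycle count. The following sketch (an illustration added here, not from the text) counts the cycles needed to complete k instructions in an ideal s-stage pipeline versus a nonpipelined CPU with the same stage time:

```python
# Illustrative sketch: cycle counts for an ideal s-stage pipeline versus a
# nonpipelined CPU executing k instructions, assuming uniform stage times.

def pipelined_cycles(stages: int, instructions: int) -> int:
    # The first instruction takes 'stages' cycles to emerge; thereafter,
    # one instruction completes every cycle.
    return stages + (instructions - 1)

def nonpipelined_cycles(stages: int, instructions: int) -> int:
    # Each instruction occupies the whole datapath for 'stages' cycles.
    return stages * instructions

k = 1000
print(pipelined_cycles(5, k))       # 1004 cycles
print(nonpipelined_cycles(5, k))    # 5000 cycles
print(nonpipelined_cycles(5, k) / pipelined_cycles(5, k))  # ~4.98x
```

For long instruction streams the ratio approaches the pipeline depth of five, matching the ideal throughput argument above.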

Figure 10-1. Five-stage pipeline

Time Interval     1    2    3    4    5    6    7    8    9   10   11   12   13   14
Stage 0          I1   I2   I3   I4   I5                  I6   I7   I8   I9  I10  I11
Stage 1               I1   I2   I3   I4                  I5   I6   I7   I8   I9  I10
Stage 2                    I1   I2   I3  [*]  [*]  [*]   I4   I5   I6   I7   I8   I9
Stage 3                         I1   I2   I2   I2   I2   I3   I4   I5   I6   I7   I8
Stage 4                              I1                  I2   I3   I4   I5   I6   I7

[*] Latency of three extra cycles assumed to be caused by instruction I2.

In this diagram, we show the consequences if instruction I3 has to wait three extra cycles before it can be issued, i.e., be allowed to pass through stages 3 and 4, which execute an instruction and store its results. We assume that instruction I2 dwells in stage 3 much longer than any of the others. We say that instruction I2 has a latency of four units, relative to the single-unit latency of the other instructions. An instruction is retired only when it has finished executing, in stage 4.

Since stages 0 through 2 already hold instructions, no further instructions can enter the pipeline until the blocked instruction I2 begins to move along again. If we are watching to see the succession of instructions as they are completed, we will observe a "bubble," instead of a steady flow, because of pipeline stalling.
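The timing of Figure 10-1 can be reproduced with a small recurrence. The sketch below (added here for illustration) computes the cycle in which each instruction retires, given a per-instruction latency in stage 3; cycle numbering is 1-based to match the figure:

```python
# Hedged sketch: retirement times in an in-order five-stage pipeline when
# one instruction dwells in stage 3 (Compute result) for extra cycles.

def retire_cycles(latencies):
    retire = []
    exit3_prev = 0
    for i, lat in enumerate(latencies, start=1):
        enter3 = max(exit3_prev + 1, i + 3)  # stage 3 is occupied in order
        exit3 = enter3 + lat - 1             # dwell for 'lat' cycles
        retire.append(exit3 + 1)             # one more cycle in stage 4
        exit3_prev = exit3
    return retire

# I2 has a latency of four units; all others take one unit.
print(retire_cycles([1, 4, 1, 1, 1]))   # [5, 9, 10, 11, 12]
```

No instruction retires in cycles 6 through 8: that gap in the retirement stream is the "bubble" caused by the stall.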

10.1.2 Superscalar Pipelining

Both RISC and advanced CISC architectures have been enhanced with variations of the basic pipelining scheme, with the goal of executing the most instructions per second using available chip technology.

In one approach, sometimes called superpipelining, the pipeline is described as deeper, longer, or more fine-grained. If one stage for a given ISA would inherently take more time than the others, the pipeline could be redesigned to split that stage into two or three simpler stages. The clocking rate could then be increased, and throughput should increase. The benefits of this approach diminish rapidly, owing to the nonuniform requirements of different instruction types and the prevalence of pipeline hazards with costly recovery times. A single deep pipeline can be tuned only to "one size fits all" parameters.
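The diminishing returns of superpipelining can be made quantitative with a simple model. The numbers below are assumptions chosen for illustration, not measurements from any real processor: total logic delay is divided across the pipeline depth, each stage adds fixed latch overhead, and hazard recovery costs grow with depth.

```python
# Illustrative model (assumed parameters, not from the text): deepening a
# pipeline shortens the clock cycle but lengthens hazard recovery, so the
# time per instruction eventually stops improving and then worsens.

def time_per_instruction(depth, total_logic_ns=10.0, latch_ns=0.5,
                         hazard_rate=0.15):
    cycle = total_logic_ns / depth + latch_ns  # faster clock as depth grows
    flush = depth - 1                          # recovery cost grows with depth
    cpi = 1.0 + hazard_rate * flush
    return cycle * cpi

for depth in (2, 5, 10, 20, 40):
    print(depth, round(time_per_instruction(depth), 3))
```

With these assumed parameters the best time per instruction occurs near a depth of ten; doubling or quadrupling the depth beyond that point makes performance worse, not better.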

Most contemporary CPU architectures are described as superscalar because they contain two or more pipelines that can operate in parallel on independent scalar data elements. Two or more instructions can be fetched together, decoded at the same time, and possibly executed simultaneously. Such designs come at greater cost, since the CPU must contain duplicate logical and functional units. The aggregate throughput from N pipelines never attains an N-fold improvement over a single pipeline of the same technology because of pipeline hazards, which are described later in this chapter.

Notice that we did not say N identical pipelines. Industry experience has shown that it is wiser to build differentiated pipelines, each of which specializes in handling portions of an instruction set. The most obvious differentiation is floating-point versus integer operations, for each pipeline can then have a depth appropriate to the intrinsic complexity of the operations. When a program has a good mix of instruction types, the differentiated pipeline types can be kept busy while avoiding some of the pipeline hazards that would otherwise introduce bubbles and lower overall performance.

Contemporary designs use several different ways of controlling multiple pipelines. In some (e.g., PowerPC), the hardware itself takes responsibility for efficiently directing instructions from the sequence prepared by the programmer into the pipelines. In others (e.g., first-generation Alpha), the programmer or compiler must take greater responsibility for the detailed sequencing of different instruction types.

Very long instruction word (VLIW) processor designs require that the programmer fill instruction fields according to corresponding and available pipelines. The VLIW approach welds the instructions too rigidly to the requirements of pipelines of a particular CPU implementation, thus blurring the distinction between architecture and implementation. As we shall see, the EPIC design principles of the Itanium architecture achieve the main benefits of the VLIW approach while avoiding most of its limitations, thus balancing performance and convenience.

10.1.3 Itanium 2 Processor Pipelines

The design of the Itanium processors introduces an advanced degree of CPU pipeline differentiation. As we mentioned in Chapter 4, the processor contains specialized execution units: M units can perform memory accesses (loads and stores) and some arithmetic operations; I units perform arithmetic and other integer operations; F units perform floating-point operations; and B units perform branches of all kinds, including special cases for loop control and procedure calls.

Intel describes the general pipeline of the Itanium 2 processor using the stages briefly outlined in Table 10-1. The eight stages can be grouped into two larger phases, front-end (IPG, ROT) and back-end (EXP, REN, REG, EXE, DET, WRB).

The overall pattern demonstrated in Table 10-1 adapts to accommodate the execution of all Itanium instruction types. Specialized circuitry makes possible parallel execution in the units that handle A, I, M, F, and B instruction types.

The floating-point pipeline is two stages longer than the general pipeline because of the greater complexity of the fused multiply-add operation upon which many floating-point instructions are based. In effect, four execution stages for floating-point operations replace stages 6 and 7 (EXE and DET) for integer operations.

Although the Itanium 2 processor issues instructions strictly in bundle order, internal buffering permits instructions to complete out of order. This prevents integer instructions from having to wait for the more time-consuming floating-point instructions from earlier bundles to finish their execution.

Table 10-1. Itanium 2 Processor General Pipeline

Stage  Mnemonic  Description of Activities
1      IPG       Instruction pointer generation
2      ROT       Rotate instructions of current groups into position so that
                 bundle 0 contains the first instruction that should be
                 executed
3      EXP       Expand bundle templates and disperse up to 6 instructions
                 through 11 ports in conjunction with opcode information for
                 the execution units
4      REN       Rename (remap) registers for the register stack engine;
                 decode instructions
5      REG       Register read (delivery of data from the Gr, Fr, and Pr
                 registers)
6      EXE       Execute operations
7      DET       Detect exceptions; abandon result of execution if instruction
                 predicate was not true; resteer mispredicted branches
8      WRB       Write-back (store results in Gr, Fr, and/or Pr registers)

If predicted correctly, a taken branch incurs no additional cycles of overhead. Additional delays may result if the instruction cache cannot deliver the appropriate bundle when the cycle returns from stage 8 (WRB) to stage 1 (IPG). A correctly predicted fall-through also incurs no penalty.

For the Itanium 2 processor, a mispredicted branch incurs a six-cycle penalty when the pipeline is cleared. Any instructions begun at the falsely predicted branch target must not be permitted to affect the machine state.
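The cost of misprediction can be folded into an effective cycles-per-instruction figure. The sketch below uses the six-cycle flush penalty quoted above for the Itanium 2 processor; the branch frequency and misprediction rate are assumed values chosen only for illustration:

```python
# Sketch of effective CPI under branch misprediction. The six-cycle flush
# penalty is the Itanium 2 figure quoted in the text; branch_freq and
# mispredict_rate are illustrative assumptions.

def effective_cpi(base_cpi=1.0, branch_freq=0.15, mispredict_rate=0.05,
                  flush_penalty=6):
    return base_cpi + branch_freq * mispredict_rate * flush_penalty

print(round(effective_cpi(), 3))   # 1.045
```

Even a 95%-accurate predictor thus adds measurable overhead, which is why correctly predicted branches (taken or fall-through) costing zero extra cycles matters so much.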

10.1.4 Pipeline Hazards

Instruction pipeline performance can be degraded by hazards that can be grouped into three broad causes: resource conflicts, procedural dependencies, and data dependencies. In this section, we describe this topic from a generic perspective, i.e., with applicability to processors in general. Later we will discuss some of the mitigation strategies that are specific to the Itanium design.

Resource conflicts can arise when, for instance, instructions in different stages of the pipeline require simultaneous access to the same memory or the same functional unit. Designing a CPU with only one pipeline stage that reads or writes data, or with duplicate functional units, can mitigate these conflicts.

Procedural dependencies are typically the result of branches, which comprise 10 to 20% of a program's instructions. Sometimes partially completed instructions in a pipeline have to be abandoned, but in such a way that the logical state of the machine is not altered. Strategies to resolve the breakdown of pipeline flow include instruction buffers, dynamic branch prediction logic, delayed branches, and static prediction arranged by compilers.

Data dependencies may include dataflow and output dependencies, as well as antidependency. Dataflow issues occur when an instruction requires data that are not yet computed or written, while output dependency refers to two instructions attempting to write to the same destination. Antidependency, when one instruction overwrites data that another instruction still requires, can perhaps be the most confusing.

Data dependencies can be alleviated using a combination of remedies: intelligent compilers, interlocking, forwarding (deriving source information earlier in the execution process), and register renaming. The Itanium architecture makes extensive use of register rotation, which is similar to register renaming. An Itanium processor has a large pool of generic registers that can allow two instructions to proceed as if they are both working with the same logically named, but physically different, registers. Substitution through register renaming in some architectures is transparent to the programmer, while, in keeping with EPIC principles, the Itanium register rotation is visible to the programmer.
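The three kinds of data dependency named above can be detected mechanically from the registers each instruction reads and writes. The following sketch (an added illustration; the register names come from this chapter's loop example) classifies a pair of instructions:

```python
# Hedged sketch: classifying the data dependencies between two instructions,
# each described by the set of registers it reads and the set it writes.

def classify(first_reads, first_writes, second_reads, second_writes):
    deps = []
    if first_writes & second_reads:
        deps.append("dataflow (read-after-write)")
    if first_writes & second_writes:
        deps.append("output (write-after-write)")
    if first_reads & second_writes:
        deps.append("antidependency (write-after-read)")
    return deps

# add r21 = -1,r21 followed by cmp.eq p6,p0 = 0,r21:
# the compare consumes the counter value the add produces.
print(classify({"r21"}, {"r21"}, {"r21"}, {"p6", "p0"}))
```

Register renaming (or Itanium register rotation) removes output and antidependencies by giving each write a distinct physical register, but a true dataflow dependency can only be hidden by forwarding or tolerated as latency.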

The smooth flow of instructions through the pipeline ensures maximum software performance, but that flow can be disrupted by any of the following:

  • data stalls from the cache and slower memory components;

  • branch-induced pipeline flushing;

  • producer-consumer dependencies; and

  • multiple-issue difficulties.

Such disturbances can cause the throughput to degrade all the way down to the performance level of a similar, but nonpipelined, CPU. A significant goal of the programmer or optimizing compiler is to avoid these situations.

Data stalls

Data stalls can occur when writing new data or when reading previously stored data that are not currently contained in the optimal cache level. Since the cost of a stall can be significant, frequently used values like loop control elements should, whenever feasible, be held in processor registers.

From a programmer's point of view, most cache systems work at a level of granularity that spans several adjacent information units. The smallest number of adjacent bytes from main memory that can be shadowed in a cache is called the line size of that cache. Thus items stored adjacently may be held in the same cache line.

For Itanium programs, it is wise to specify at least quad word alignment for data elements or arrays. Convention specifies that stack space in memory must be allocated in 16-byte multiples.
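Cache-line granularity can be made concrete with simple address arithmetic. In the sketch below, the 128-byte line size is an assumption chosen for illustration, not a quoted Itanium parameter; the point is only that adjacency and alignment determine whether two items share a line:

```python
# Illustrative sketch (assumed 128-byte line size): items stored adjacently
# usually share a cache line, while an item that straddles a line boundary
# touches two lines.

LINE_SIZE = 128   # assumed line size in bytes

def cache_line(addr: int) -> int:
    # Index of the line containing this byte address.
    return addr // LINE_SIZE

a = 0x1000
print(cache_line(a) == cache_line(a + 8))            # adjacent: same line
print(cache_line(a + 120) == cache_line(a + 136))    # crosses a boundary
```

This is why 16-byte (quad word) alignment is advantageous: an aligned quad word can never straddle a line boundary whose size is a multiple of 16.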

Branch effects

In an ordinary sequence, instructions are fetched from the instruction cache or main memory as the address in the instruction pointer (IP) is incremented.

Branches taken, subroutine calls, and subroutine returns determine a new value for the IP. Any instructions already in progress must be ignored, i.e., flushed from the pipeline. This may completely empty the pipeline for a substantial number of cycles unless the branch, call, or return target instruction is already held in the instruction cache.

Certain architectures (and compilers) support some form of prefetching, which can initiate, well ahead of an actual jump, the copying of a new sequence of instructions from main memory into the instruction cache on a speculative basis that does not intrude upon current operations.

With relative ease, an architecture can predict all forward conditional branches as not taken and all backward branches as taken, which is the generic behavior of idealized loops. Branch prediction is a major topic of research and heuristic strategies.
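That static heuristic reduces to a comparison of addresses. The sketch below (an added illustration with hypothetical addresses) predicts a branch from the direction of its displacement:

```python
# Sketch of the static heuristic described above: predict backward branches
# (negative displacement, typical of loop-closing branches) as taken, and
# forward branches as not taken.

def static_predict(branch_addr: int, target_addr: int) -> bool:
    """Return True if the branch is statically predicted taken."""
    return target_addr < branch_addr   # backward branch => assume a loop

print(static_predict(0x4010, 0x4000))  # backward, loop-closing: True
print(static_predict(0x4010, 0x4040))  # forward, e.g. an error path: False
```

A loop of N iterations is then mispredicted only once, on exit, which is why this simple rule performs well on idealized loops.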

Producer-consumer effects

In addition to possibly prolonged data stalls caused by cache and memory response times, the dependence of an instruction upon a result computed by a logically prior instruction can lead to restrictions on pipeline flow. Consider decrementing a counter for loop control:

        add     r21 = -1,r21;;          // r21 counts down...
        cmp.eq  p6,p0 = 0,r21           // ...until zero (p6)
        br.cond.sptk.few back           // Go back for more?

where the Itanium architecture would require that the add and cmp instructions be separated into different execution groups, but provides the special case of cmp and br instructions occurring within the same execution group. The add instruction produces (writes) a new value of the counter in register r21, which the cmp instruction consumes (reads). Similarly, the cmp instruction produces (writes) a new predicate value in register p6, which the br instruction consumes (reads).

Within practical limits, a CPU can contain redundant internal connections that permit some measure of helpful "spying" between the pipeline stages for the sake of reducing delays, or latency. In the absence of such efforts, we might expect that stage 5 for the cmp instruction would follow immediately after stage 8 for the add instruction, if the compare were detained at stage 4 while the addition goes through stages 6, 7, and 8. Latency would thus be four cycles instead of the one cycle for a nonconsumer instruction that might otherwise follow the add instruction. The actual latency can be as little as one cycle in this case, since the compare instruction can derive its source information through fast internal CPU connections; it does not have to read, literally, from register r21. This is what is meant by forwarding, which is part of most contemporary processor designs. When a CPU is implemented, its latency rules become a documented feature of its design, not of its architecture.

The Itanium architecture requires that stops be inserted between groups of instructions where a later group depends on some result computed by an earlier group. A particular hardware implementation then adjusts the duration of pipeline bubbles to as many cycles as needed to satisfy its actual latency rules. Because of differences in pipeline staging, a sequence of dependent integer computations may execute one per cycle, while a sequence of dependent floating-point computations may complete only one per four cycles, reflecting a latency of four. For both data types, however, independent computations can be completed at higher density. Software must analyze the dependencies and construct optimal instruction groups.
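The dependent-versus-independent contrast can be counted directly. The sketch below (an added illustration using the four-cycle floating-point latency discussed above) compares a chain of dependent operations with a stream of independent ones:

```python
# Hedged sketch: with a four-cycle floating-point latency, dependent
# operations complete one per four cycles, while independent operations
# overlap and complete one per cycle once the pipeline fills.

FP_LATENCY = 4

def cycles_dependent_chain(n_ops: int) -> int:
    # Each operation must wait for its predecessor's result.
    return n_ops * FP_LATENCY

def cycles_independent(n_ops: int) -> int:
    # Operations overlap; after the first result, one finishes per cycle.
    return FP_LATENCY + (n_ops - 1)

print(cycles_dependent_chain(8))   # 32 cycles
print(cycles_independent(8))       # 11 cycles
```

This gap is precisely what motivates software (or the compiler) to interleave independent computations between dependent ones when constructing instruction groups.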

Multiple-issue effects

Up to this point, the execution stages of the Itanium pipeline have been considered general purpose, capable of working on every opcode in the instruction set. In reality, there is considerable differentiation according to function or data type. Separate execution units perform integer operations, floating-point operations, loads or stores, and branching. Separate units allow the issuing of multiple instructions, which can then be processed in parallel, provided that they have orthogonal resource requirements.

Some modern processors select instructions from a holding buffer and may issue them out of order, i.e., not strictly according to their addressing sequence, so long as the programmer's logical intent can still be satisfied. Similarly, some modern processors can retire or complete instructions out of order, again subject to the constraint of the logical contract with the programmer.

The Itanium architecture expects that the results from instructions in one group will be determined prior to the results of any of the instructions in the following group. Within each group, the instructions are completed independently, excepting the special allowance for a branch whose predicate is determined by a comparison within the same instruction group.



Itanium® Architecture for Programmers: Understanding 64-Bit Processors and EPIC Principles (2003)