10.2 Instruction-Level Parallelism | Itanium® Architecture for Programmers: Understanding 64-Bit Processors and EPIC Principles

Partitioning the instruction cycle into pipeline stages does not, by itself, shorten the time to complete any single instruction. Rather, partitioning simply makes more efficient use of the hardware by staging several instructions at the same time.

Pipelining is transparent to the programmer. If there are no pipeline bubbles, a program runs faster. If bubbles do arise, a program still runs no slower than on a nonpipelined machine using the same level of technology.
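Both claims can be checked with a simple cycle count. The sketch below is only an idealized model, not a description of any real processor: it assumes that every stage takes exactly one cycle, that each bubble costs exactly one cycle, and that an instruction never waits longer than it takes its predecessor to drain completely. The function names are illustrative.

```python
def nonpipelined_cycles(n_instructions: int, n_stages: int) -> int:
    """Each instruction passes through every stage before the next one starts."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions: int, n_stages: int, bubbles: int = 0) -> int:
    """The first instruction fills the pipeline; each later instruction
    completes one cycle after its predecessor, plus one cycle per bubble."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1) + bubbles


if __name__ == "__main__":
    n, k = 100, 5
    print(nonpipelined_cycles(n, k))      # 500
    print(pipelined_cycles(n, k))         # 104
    print(pipelined_cycles(1, k))         # 5: latency of one instruction is unchanged
    # Assumed worst case: every instruction after the first stalls k - 1 cycles.
    worst_bubbles = (n - 1) * (k - 1)
    print(pipelined_cycles(n, k, worst_bubbles))  # 500
```

Under these assumptions a single instruction still needs all five cycles, a bubble-free stream approaches one instruction per cycle, and the worst case merely matches the nonpipelined count.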

Computer designers turn to instruction-level parallelism as a means to attain greater throughput than pipelining and large cache structures can provide. Advanced hardware can execute several instructions in parallel, but this often places a greater burden on the programmer and compiler (Rau and Fisher). Major approaches to parallelism exist in RISC, VLIW (very long instruction word), and EPIC architectures.

10.2.1 RISC Approaches

Tanenbaum whimsically remarks that RISC is a directive to relegate the important stuff to compilers, by which he means that the software should generate logical sequences of instructions that can execute most smoothly on a particular RISC architecture.

Early RISC designs (e.g., MIPS, SPARC) had a branch delay slot, which arose because determining a branch destination takes longer than computing an arithmetic result. If a compiler could fill most delay slots with useful instructions instead of no-ops, overall throughput improved.
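Delay-slot filling can be sketched as a small compiler pass. The following is a minimal illustration, not the algorithm of any particular compiler. It assumes a toy instruction form (an opcode, one optional destination register, and source registers), considers register dependences only, ignores memory operations, and assumes the instruction in the delay slot always executes. The names Instr, independent, and fill_delay_slot are hypothetical.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    op: str
    dest: Optional[str]
    srcs: Tuple[str, ...]


NOP = Instr("nop", None, ())


def independent(moved: Instr, other: Instr) -> bool:
    """True if `moved` may be reordered to follow `other` (registers only)."""
    if moved.dest is not None and moved.dest in other.srcs:
        return False  # `other` reads the value that `moved` produces
    if moved.dest is not None and moved.dest == other.dest:
        return False  # both write the same register
    if other.dest is not None and other.dest in moved.srcs:
        return False  # `other` overwrites an input of `moved`
    return True


def fill_delay_slot(block):
    """`block` is a basic block ending in a branch. Return the block with one
    delay slot after the branch, filled from the body if that is safe."""
    *body, branch = block
    for i in range(len(body) - 1, -1, -1):
        candidate = body[i]
        # The candidate must be movable past everything after it and the branch.
        if all(independent(candidate, later) for later in body[i + 1:] + [branch]):
            return body[:i] + body[i + 1:] + [branch, candidate]
    return body + [branch, NOP]  # nothing safe to move: pad with a no-op


if __name__ == "__main__":
    block = [
        Instr("add", "r3", ("r1", "r2")),
        Instr("sub", "r4", ("r5", "r6")),
        Instr("beq", None, ("r3", "r0")),
    ]
    for instr in fill_delay_slot(block):
        print(instr.op)  # add, beq, sub
```

In the example, sub does not feed the branch and so can occupy the delay slot, whereas add computes the register that the branch tests and must stay ahead of it.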

Advanced RISC designs are superscalar, and some of them (e.g., PowerPC and later Alpha implementations) can determine, in hardware, which nondependent instructions can be issued or completed out of order. Programmers and compilers are mostly relieved of the burden of analyzing dependency among instructions, though Tanenbaum suggests that there may still be advantages if highly optimizing compilers perform such analysis.
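The kind of dependency analysis involved can be illustrated in software. The sketch below is a simplified dataflow model rather than a description of any actual PowerPC or Alpha implementation. It assumes unlimited execution units, a uniform latency for every instruction, and register renaming that removes all but true (read-after-write) dependences. The function name and the tuple format are illustrative.

```python
def earliest_issue_cycles(program, latency=1):
    """program: list of (dest, srcs) tuples in program order.
    Returns, for each instruction, the earliest cycle at which all of its
    source operands are available under the stated assumptions."""
    ready_at = {}  # register -> cycle at which its most recent value is available
    cycles = []
    for dest, srcs in program:
        cycle = max((ready_at.get(src, 0) for src in srcs), default=0)
        cycles.append(cycle)
        if dest is not None:
            ready_at[dest] = cycle + latency
    return cycles


if __name__ == "__main__":
    program = [
        ("r1", ("r8", "r9")),   # r1 = r8 op r9
        ("r2", ("r1",)),        # depends on the first instruction
        ("r3", ("r8",)),        # independent of the first two
        ("r4", ("r2", "r3")),   # depends on the second and third
    ]
    print(earliest_issue_cycles(program))  # [0, 1, 0, 2]
```

The third instruction can issue in cycle 0, ahead of the second, because it depends on neither of its predecessors.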

10.2.2 The VLIW Idea

Advanced CISC instruction sets involve, at the microarchitectural level, wide internal instruction words retrieved from very fast control stores. Each bit in a wide representation serves as the logical source for one of the great many internal control signals needed to operate the ALU and other components of the CPU (see Tanenbaum).

RISC implementations typically eschew the microarchitectural level in favor of uniformly narrow instruction words. A few, mostly experimental, RISC architectures implement very long instruction words (VLIW), whose many bits guide the simultaneous execution of several RISC-like instructions.

The drawbacks of this approach generally outweigh any advantages. Compilers take on an unprecedented responsibility to analyze and implement instruction-level parallelism and to accommodate the architectural latency among all instructions. Delay slots must be embedded into instruction words, reflecting the varying complexity of instructions and different pipeline depths of execution units. These timing gaps become part of compiled software, requiring recompilation (and redistribution) for each new hardware implementation.
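The recompilation problem can be made concrete with a toy scheduler. The sketch below is an assumption-laden illustration, not the method of any real VLIW compiler. Each operation is a hypothetical (name, kind, dest, srcs) tuple, every register is written only once so that only true dependences matter, and the latency table stands in for one particular hardware implementation.

```python
def schedule_vliw(ops, latency, width=2):
    """Pack ops (in program order) into instruction words of `width` slots.
    latency: dict mapping operation kind -> words until its result is usable.
    Empty slots and empty words are filled with explicit 'nop' entries."""
    ready_at = {}  # register -> index of the first word that may read it
    words = []
    for name, kind, dest, srcs in ops:
        w = max((ready_at.get(src, 0) for src in srcs), default=0)
        while w < len(words) and len(words[w]) >= width:
            w += 1  # word is full: try the next one
        while len(words) <= w:
            words.append([])  # timing gap: may remain an all-nop word
        words[w].append(name)
        ready_at[dest] = w + latency[kind]
    return [word + ["nop"] * (width - len(word)) for word in words]


if __name__ == "__main__":
    ops = [
        ("ld1", "load", "r1", ()),
        ("ld2", "load", "r2", ()),
        ("add", "alu", "r3", ("r1", "r2")),
        ("sub", "alu", "r4", ("r3", "r1")),
    ]
    for word in schedule_vliw(ops, {"load": 2, "alu": 1}):
        print(word)
    print(len(schedule_vliw(ops, {"load": 3, "alu": 1})))  # 5 words instead of 4
```

With a two-word load latency the schedule contains one all-nop word between the loads and the add; with a three-word latency it needs two. Code compiled for the first table would read r1 too early on hardware described by the second, which is why such timing gaps force recompilation.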

Rau analyzed these challenges and proposed certain minimal modifications to the VLIW approach that could overcome these implementation difficulties. Essentially, that analysis led to what Hewlett-Packard and Intel have called the EPIC approach to architecture.

10.2.3 EPIC as a Way Forward

A research architecture called PlayDoh (see Schlansker et al.), developed at Hewlett-Packard Company, had many features that were ultimately incorporated into the Itanium architecture. PlayDoh served as a test bed to explore the practical implications of Rau's theoretical analysis.

Although the Itanium architecture does have a very long instruction word (128 bits wide), it departs from and expands upon VLIW concepts in several ways. The most common ALU operations (type A instructions in Appendix C) execute either in M-units or in I-units. Templates determine which instruction units execute which instructions, requiring architectural stops to separate the instruction stream into groups of independent operations. Finally, most of the instruction set can be explicitly predicated.
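The 128-bit instruction word can be taken apart with a few shifts and masks. The sketch below assumes the bundle layout given in Intel's architecture documentation, namely a 5-bit template in the least significant bits followed by three 41-bit instruction slots. The function names are illustrative, and the template and slot values in the usage note are arbitrary test numbers, not real encodings.

```python
TEMPLATE_BITS = 5
SLOT_BITS = 41  # 5 + 3 * 41 = 128


def make_bundle(template: int, slot0: int, slot1: int, slot2: int) -> int:
    """Assemble a 128-bit bundle value from its four fields."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    bundle = template
    for i, slot in enumerate((slot0, slot1, slot2)):
        if not 0 <= slot < (1 << SLOT_BITS):
            raise ValueError("each slot must fit in 41 bits")
        bundle |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle


def split_bundle(bundle: int):
    """Return (template, slot0, slot1, slot2) for a 128-bit bundle value."""
    if not 0 <= bundle < (1 << 128):
        raise ValueError("bundle must fit in 128 bits")
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slot_mask = (1 << SLOT_BITS) - 1
    slots = tuple(
        (bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & slot_mask for i in range(3)
    )
    return (template, *slots)
```

For example, split_bundle(make_bundle(0x10, 1, 2, 3)) recovers the four fields unchanged. Interpreting the template value (which units the slots go to, and where the stops fall) requires the template table in the architecture manual, which this sketch does not attempt to reproduce.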