Previous chapters of this book have steadfastly focused on the Itanium architecture, with only the occasional clarification about details of implementation, where necessary. As long as the architecture does not change, the material you have learned should remain applicable to new implementations. We now illustrate the factors influencing the evolution of implementations of computer architectures by comparing features and characteristics of the original Itanium processor to the more widely known Itanium 2 processor.

13.3.1 Comparison to the Itanium 2 Processor

Table 13-1 compares the first Itanium processor and the Itanium 2 processor along several key dimensions. We will not define or discuss every entry in this table, but we present a diverse set of measures customarily used for comparisons in the industry, as well as a few Itanium-specific characteristics to contrast the two implementations. Many of the quantitative comparisons in Table 13-1 are self-evident evolutionary modifications, but we draw your attention to the very few instruction differences. The architecture has remained intact. The Itanium 2 processor implements the brl and chk instructions more fully in hardware, but a program that ran on the original Itanium processor will run on an Itanium 2 processor. In a different way, the change involving the brp instruction is also benign, since the Itanium 2 processor uses different heuristics to predict branches and does not require the same type of compile-time assistance.
13.3.2 Cache Hierarchy

The original Itanium processor contained slower cache structures of different sizes than the Itanium 2 processor implementation (Section 4.5.1). Table 13-2 gives some characteristics of the components of the original Itanium cache hierarchy. As compared to the Itanium 2 processor (Table 4-3), all of the cache access times were slower, the line sizes were smaller, and the L2 cache was smaller. Although the L3 cache could be larger, its accessibility was poorer. The L3 cache structure was manufactured separately and connected to the main processor chip inside a cartridge assembly, and the access bus between cache levels L2 and L3 was within the cartridge (Figure 13-1).

Figure 13-1. Original Itanium system cache relationships
In contrast to the fixed response times of the original Itanium processor, the latency of responses from the L2 and L3 cache structures in the Itanium 2 processor varies (Table 4-3), because the cache line sought might be held in a bank of the cache that is busy satisfying another request from the processor. A compiler can mitigate this effect by the way it arranges data and especially through very careful scheduling of load instructions. This provides another illustration of the increased burden on compiler technology as the full impact of EPIC principles and the consequences of parallelism come into play. The new cache design illustrates the extraordinary impact of a breakthrough cache implementation: benchmarks have shown that the Itanium 2 processor achieves roughly twice the instruction throughput of its predecessor, although its clock frequency is less than twice that of the original (Table 13-1).

13.3.3 Execution Units and Issue Ports

The original Itanium processor had two M-type execution units; the internal parallelism of the Itanium 2 processor was increased by adding two more M-type units. The number of issue ports was 9 for the original Itanium processor, but is 11 for the Itanium 2 processor (refer to Figure 10-2). Code sequences that were previously limited by memory access can utilize more of the parallelism of the Itanium 2 processor. For example, a loop sequence in a software pipeline can potentially load four floating-point source values using load pair instructions (Section 8.3.3), execute two independent two-operand floating-point operations (add, subtract, or multiply), and store two floating-point results per cycle of register rotation. Alternatively, since all four M-units and the two I-units can execute type A instructions, the superscalar degree for selected integer operations is six.
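Such a loop body can be sketched in Itanium assembly. The following is a simplified, hypothetical illustration: the register numbers are arbitrary, and the stage predicates and rotating-register bookkeeping of a real software pipeline are omitted for clarity.

```asm
// One cycle of a hypothetical pipelined loop: four M-slots, two F-slots,
// and the loop branch. Two ldfpd instructions bring in four source
// values, two independent multiplies execute, and two results are stored.
loop:
        ldfpd   f32, f33 = [r14], 16   // M: load a pair of doubles
        ldfpd   f34, f35 = [r15], 16   // M: load a second pair
        fmpy.d  f36 = f40, f44         // F: first independent multiply
        fmpy.d  f37 = f41, f45         // F: second independent multiply
        stfd    [r16] = f48, 8         // M: store one result
        stfd    [r17] = f49, 8         // M: store another result
        br.ctop.sptk  loop ;;          // B: counted loop branch; rotates registers
```

The operand registers of the multiplies and stores differ from the load targets because, in a genuine software pipeline, values migrate through the rotating register file across iterations.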
Compilers can exploit this high superscalar degree and the large integer register set to build instruction groups for compute-intensive integer code sequences. Again, it is easiest to envision fully exploiting the parallelism of the processor using software pipelining. Having four, instead of only two, M-units makes possible dual issue of many more pairings of bundle templates, as summarized in Table 13-3. Remember that nop instructions must have an execution unit assigned to them, even though nothing productive occurs. The very large number of cells containing "I2" in Table 13-3 gives compilers substantial latitude in composing templates for large instruction groups.

13.3.4 Pipelines

The general pipeline of the original Itanium processor, which was two stages longer than that of the Itanium 2 processor (Table 10-1), can be described using the stages briefly outlined in Table 13-4. The 10 stages can be grouped into four larger phases: front end (IPG, FET, ROT); instruction delivery (EXP, REN); operand delivery (WLD, REG); and finally execution and change of machine state (EXE, DET, WRB).
The floating-point pipeline was two stages longer than the integer pipeline. In effect, four execution stages for floating-point operations replaced stages 8 and 9 (EXE, DET) for integer operations. A correctly predicted taken branch incurred one to three cycles of overhead. Additional delays might result if the I-cache structures could not deliver a bundle from the address shown in the instruction pointer when the cycle returned from stage 10 (WRB) to stage 1 (IPG). Only a correctly predicted fall-through incurred no penalty at all. When a branch was mispredicted, the first-generation Itanium processor incurred a nine-cycle penalty, because all eight of the following instructions already in the pipeline had to be abandoned without having been allowed to affect the machine state.

13.3.5 Latency Factors

The Itanium 2 processor implementation contains numerous optimizations in its pipelines, which, in conjunction with the efficient cache design, result in reduced latencies for many kinds of producer-consumer situations between instructions. Table 13-5 summarizes such latencies for the initial Itanium processor implementation (values in parentheses) and the Itanium 2 processor.
The best-case L1-D cache delay for loading integers (R) has dropped from two cycles to one cycle. The best-case L2 cache delay for loading floating-point data (F) has dropped from nine cycles to six cycles; at least one additional cycle is always needed to convert floating-point data from the 32- or 64-bit memory representation to the 82-bit register representation. Noncyclic instruction sequences on an Itanium 2 processor thus obtain their data much more quickly, while the software pipelines for cyclic situations can be tightened considerably.

13.3.6 Branch Prediction

The great cost of mispredicted branches motivates the computer industry to invent techniques for avoiding branches altogether, through mechanisms such as conditional moves and predication, as well as techniques for conveying branch hints and branch history information to the execution pipelines in contemporary processors. The Itanium architecture not only provides branch hint completers on branch instructions themselves (Section 5.3.1), but also includes a branch predict instruction that could help attain optimal performance with the original Itanium processor:

    brp.ipwh.ih      target25, tag13   // IP-relative form
    brp.indwh.ih     b2, tag13         // Indirect form
    brp.ret.indwh.ih b2, tag13         // Return form

where the tag13 operand is encoded by the assembler from the symbolic address of the branch instruction itself, and the target25 operand is encoded by the assembler to match the symbolic destination address represented in the branch instruction itself. The indirect forms are used when the target address has been placed in a branch register, possibly in one of several predicated ways. The return form indicates that the branch will be a return. There are four values for ipwh (the IP-relative predict-whether hint). A completer sptk or dptk specifies that the branch should be predicted statically taken or dynamically taken, respectively. The completer loop specifies that the branch will be one of br.cloop, br.ctop, or br.wtop.
The completer exit specifies that the branch will be either br.cexit or br.wexit. There are two values for indwh (the indirect predict-whether hint): a completer sptk or dptk specifies that the branch should be predicted statically taken or dynamically taken, respectively. There are two possibilities for ih (the importance hint): no completer at all indicates that the branch is relatively unimportant, while imp indicates that the branch is one of a small number of very important branches (e.g., a branch in an inner loop). An implementation of the Itanium architecture may provide various associated hardware support elements for these implied branch prediction capabilities. The brp instruction operates as a nop instruction except for its performance effects, such as prefetching instructions at the predicted branch destination into the cache structures. The Itanium 2 processor exploits different and more powerful technologies for hardware branch prediction, but does use information from a brp instruction to improve I-cache performance.

13.3.7 Other Differences and Features

The initial Itanium processor implementation lacked hardware support for the brl instruction; executing one raised an exception, and system software was expected to provide a fault handler that emulated the instruction to complete the jump. There could be a severe time penalty for that fault recovery, thus illustrating how implementation-specific details can affect the performance of code. Likewise, the implementation of chk instructions was not as complete, and operating system assistance was needed in order to branch to the recovery code. We have not attempted an exhaustive review of features where the Itanium 2 processor has improved upon its predecessor. In all, the Itanium 2 processor refines the implementation of the architecture in a manner that an engineer or physicist might call good impedance matching of its many powerful but complex component parts.
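The chk recovery pattern whose cost differed so markedly between the two implementations can be sketched as follows. This is a hypothetical illustration: the labels and register numbers are arbitrary, and the surrounding scheduling that would justify hoisting the load is omitted.

```asm
// Hypothetical control-speculation pattern. The load is hoisted above
// the point where its use is known to be safe; a deferred fault sets
// the NaT bit of the target register instead of trapping immediately.
        ld8.s   r30 = [r14]        // speculative load; fault deferred via NaT
        // (other useful work would be scheduled here)
        chk.s   r30, recover       // branch to recovery if r30 carries a NaT
back:
        add     r31 = r30, r15 ;;  // first real use of the loaded value

recover:
        ld8     r30 = [r14] ;;     // non-speculative reload raises any fault now
        br      back               // resume the main sequence
```

On the original Itanium processor, taking the branch to the recovery label required operating system assistance, whereas the Itanium 2 processor performs it fully in hardware, so the cost of a failed speculation is an implementation-specific property of this otherwise architecture-level pattern.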