13.3 The Original Itanium Processor

Previous chapters of this book have focused on the Itanium architecture, clarifying details of implementation only where necessary. As long as the architecture does not change, the material you have learned should remain applicable to new implementations.

We now illustrate the factors influencing the evolution of implementations of computer architectures by comparing features and characteristics of the original Itanium processor to the more widely known Itanium 2 processor.

13.3.1 Comparison to the Itanium 2 Processor

Table 13-1 compares the first Itanium processor and the Itanium 2 processor along several key dimensions. We will not define or discuss every entry in this table; our aim is to present a diverse set of measures customarily used for comparisons in the industry, together with a few Itanium-specific characteristics that contrast the two implementations.

Many of the quantitative comparisons in Table 13-1 are self-evident evolutionary modifications, but we draw your attention to the very few instruction differences. The architecture has remained intact. The Itanium 2 processor implements the brl and chk instructions more fully in hardware, but a program that ran on the original Itanium processor will run on an Itanium 2 processor. In a different way, the change involving the brp instruction is also benign, since the Itanium 2 processor uses different heuristics to predict branches and does not require the same type of compile-time assistance.

Table 13-1. Selected Characteristics of the First Two Itanium Processors

                                  Itanium               Itanium 2
Code name during development      Merced                McKinley
Year of market release            2001                  2002

Chip technology:
  speeds marketed                 733 MHz, 800 MHz      900 MHz, 1.0 GHz
  process feature size            180 nm                180 nm
  transistor count                25 x 10^6             221 x 10^6
  layers                          6                     6
  operating voltage               1.5 V                 1.5 V
  power (including cache)         116-130 W             130 W

Processor features:
  physical stacked registers      96                    96
  RSE modes                       only enforced lazy    only enforced lazy
  integer units                   2 M and 2 I           4 M and 2 I
  memory units                    2 load or store       2 load and 2 store
  parallel floating-point units   2                     1

Pipeline depth:
  integer                         10                    8
  floating-point                  12                    10

Memory support:
  physical address bits           44                    50
  virtual address bits            54                    64
  data bus width                  64 bits               128 bits
  maximum page size               256 MiB               4 GiB

System bus:
  speed                           133 MHz               200 MHz
  width                           64 bits               128 bits
  bandwidth                       2.1 GB/s              6.4 GB/s

Instruction differences:
  brl instruction                 emulated in op sys    implemented
  brp instruction                 useful                affects instruction prefetching
  chk.a or chk.s handling         needs the op sys      usually done by hardware

L3 cache location                 off-chip, in-package  on-chip
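
The bandwidth row of Table 13-1 can be cross-checked against the bus speed and width rows. The following minimal sketch assumes that each bus transfers data twice per bus clock; that factor is not stated in the table and is introduced here only because it reconciles the listed figures. The function name peak_bandwidth_gb_per_s is illustrative.

```python
def peak_bandwidth_gb_per_s(clock_mhz, width_bits, transfers_per_clock=2):
    """Peak bus bandwidth in GB/s (10^9 bytes per second).

    transfers_per_clock=2 is an assumption (two transfers per bus clock),
    not a value taken from Table 13-1.
    """
    bytes_per_transfer = width_bits / 8
    transfers_per_second = clock_mhz * 1e6 * transfers_per_clock
    return bytes_per_transfer * transfers_per_second / 1e9


itanium_bw = peak_bandwidth_gb_per_s(133, 64)    # original Itanium bus
itanium2_bw = peak_bandwidth_gb_per_s(200, 128)  # Itanium 2 bus

print(f"Itanium:   {itanium_bw:.1f} GB/s")
print(f"Itanium 2: {itanium2_bw:.1f} GB/s")
```

Under that assumption the roughly threefold bandwidth increase comes from two sources: a bus that is twice as wide and a clock that is about 1.5 times faster.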

13.3.2 Cache Hierarchy

The original Itanium processor contained slower cache structures of different sizes than the Itanium 2 processor implementation (Section 4.5.1). Table 13-2 gives some characteristics of the components of the original Itanium cache hierarchy.

Compared with the Itanium 2 processor (Table 4-3), all of the cache access times were slower, the line sizes were smaller, and the L2 cache was smaller. Although the L3 cache could be larger, its accessibility was poorer. The L3 cache structure was manufactured separately and connected to the main processor chip inside a cartridge assembly, and the bus between cache levels L2 and L3 ran within the cartridge (Figure 13-1).

Figure 13-1. Original Itanium system cache relationships


Table 13-2. Characteristics of the Cache Hierarchy for the Original Itanium Processor

                                                           Load Latency Cycles
Level    Capacity       Line Size   Type    Write Policy   Integer   Floating
L1-I     16 KiB         32 B        4-way   n/a            2         2
L1-D     16 KiB         32 B        4-way   write-through  2         n/a
L2       96 KiB         64 B        6-way   write-back     6         9
L3       2 or 4 MiB     64 B        4-way   write-back     21        24
Memory   up to 16 TiB                                      > 100     > 100
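
The memory capacity in the last row of Table 13-2 follows from the number of physical address bits listed in Table 13-1. The short sketch below shows the arithmetic, assuming byte-addressable memory; the helper name max_physical_memory_tib is hypothetical.

```python
TIB = 2 ** 40  # bytes in one tebibyte


def max_physical_memory_tib(address_bits):
    """Largest byte-addressable memory, in TiB, for a given address width."""
    return 2 ** address_bits // TIB


itanium_tib = max_physical_memory_tib(44)   # original Itanium: 44 physical address bits
itanium2_tib = max_physical_memory_tib(50)  # Itanium 2: 50 physical address bits

print(itanium_tib, "TiB")
print(itanium2_tib, "TiB")
```

The six additional physical address bits of the Itanium 2 processor multiply the addressable memory by 2^6 = 64.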

In contrast to fixed response times for the original Itanium processor, the latency for response from the L2 and L3 cache structures in the Itanium 2 processor varies (Table 4-3) because the cache line sought might be held in a bank of the cache that is busy satisfying another request from the processor. A compiler can mitigate this effect by the way it arranges data and especially through very careful scheduling of load instructions. This provides another illustration of the increased burden on compiler technology as the full impact of EPIC principles and the consequences of parallelism come into play.

The new cache design shows how much a better cache implementation can contribute to overall performance. Benchmarks have shown that the Itanium 2 processor achieves roughly twice the instruction throughput of its predecessor, although its clock frequency is less than twice as high (Table 13-1).
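
To see how much of that gain cannot be attributed to clock speed alone, the throughput ratio can be divided by the clock ratio. This is only a rough sketch: the factor of 2.0 stands in for "roughly twice" and is not a measured value, and the function name per_clock_gain is illustrative.

```python
def per_clock_gain(throughput_ratio, old_mhz, new_mhz):
    """Throughput improvement that remains after factoring out the clock increase."""
    clock_ratio = new_mhz / old_mhz
    return throughput_ratio / clock_ratio


THROUGHPUT_RATIO = 2.0  # "roughly twice", taken from the text; not a benchmark result

# Fastest marketed parts from Table 13-1: 800 MHz versus 1.0 GHz.
gain_fastest = per_clock_gain(THROUGHPUT_RATIO, 800, 1000)

print(f"clock ratio:    {1000 / 800:.2f}")
print(f"per-clock gain: {gain_fastest:.2f}")
```

On these figures most of the improvement comes from doing more work per cycle, which is consistent with the emphasis the text places on the cache design and the wider execution resources.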

13.3.3 Execution Units and Issue Ports

The original Itanium processor had two M-type execution units, while the internal parallelism of the Itanium 2 processor was increased by adding two more M-type units. The number of issue ports was 9 for the original Itanium processor, but is 11 for the Itanium 2 processor (refer to Figure 10-2).

Code sequences that were previously limited by memory access can utilize more of the parallelism of the Itanium 2 processor. For example, a loop sequence in a software pipeline can potentially load four floating-point source values using the load pair instruction (Section 8.3.3), execute two independent two-operand floating-point operations (add, subtract, or multiply), and store two floating-point results per cycle of register rotation.

Alternatively, since all four M-units and the two I-units can execute type A instructions, the superscalar degree for selected integer operations is six. Compilers can exploit this high superscalar degree and the large integer register set to build instruction groups for compute-intensive integer code sequences. Again, it is easiest to envision fully exploiting the parallelism of the processor using software pipelining.
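
The two scenarios above can be checked with a simple count of execution units. The sketch below is a simplification under stated assumptions: the M and I counts and the port totals (9 and 11) come from the text, while the counts of two F-units and three B-units are assumptions chosen to be consistent with those totals. It ignores bundle templates, the loop-closing branch, and any restrictions on which M-unit handles loads or stores.

```python
from collections import Counter

# F and B counts are assumptions consistent with 9 and 11 issue ports.
UNITS = {
    "Itanium":   {"M": 2, "I": 2, "F": 2, "B": 3},
    "Itanium 2": {"M": 4, "I": 2, "F": 2, "B": 3},
}

# Two load-pair instructions, two floating-point operations, two stores.
FP_LOOP = ["M", "M", "F", "F", "M", "M"]
# Six type A integer instructions, each able to use an M-unit or an I-unit.
INTEGER_GROUP = ["A"] * 6


def fits_in_one_cycle(instructions, units):
    """True if the unit counts alone allow all instructions to issue together."""
    need = Counter(instructions)
    for kind in "MIFB":
        if need[kind] > units[kind]:
            return False
    # Type A instructions take whatever M- and I-units are left over.
    spare = (units["M"] - need["M"]) + (units["I"] - need["I"])
    return need["A"] <= spare


for name, units in UNITS.items():
    print(name,
          "ports:", sum(units.values()),
          "FP loop fits:", fits_in_one_cycle(FP_LOOP, units),
          "integer group fits:", fits_in_one_cycle(INTEGER_GROUP, units))
```

With only two M-units the original processor can satisfy neither case in a single cycle, whereas the Itanium 2 unit counts accommodate both.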

Having four, instead of only two, M-units makes possible a dual issue of many more pairings of bundle templates, as summarized in Table 13-3. Remember that nop instructions must have an execution unit assigned to them, even though nothing productive occurs. The very large number of cells containing "I2" in Table 13-3 gives compilers substantial latitude in composing templates for large instruction groups.

13.3.4 Pipelines

The general pipeline of the original Itanium processor, which was two stages longer than that in the Itanium 2 processor (Table 10-1), can be described using the stages briefly outlined in Table 13-4. The 10 stages can be grouped into four larger phases: front-end (IPG, FET, ROT); instruction delivery (EXP, REN); operand delivery (WLD, REG); and execution and change of machine state (EXE, DET, WRB).

Table 13-3. Possible Dual Issue of Bundles for Original and Itanium 2 (I2) Processors[*]

       MII   MLX   MMI   MFI   MMF   MIB   MBB   BBB   MMB   MFB
MII    I2    no    I2    I2    I2    I2    both  both  I2    both
MLX    I2    I2    I2    both  I2    both  I2    both  I2    both
MMI    I2    I2    I2    I2    I2    I2    I2    both  I2    I2
MFI    I2    both  I2    both  I2    both  both  both  I2    both
MMF    I2    I2    I2    I2    I2    I2    I2    both  I2    I2
MIB    I2    both  I2    both  I2    both  both  no    I2    both
MBB    no    no    no    no    no    no    no    no    no    no
BBB    no    no    no    no    no    no    no    no    no    no
MMB    I2    I2    I2    I2    I2    I2    I2    no    I2    I2
MFB    both  both  I2    both  I2    both  both  no    I2    both

[*] Row is the first bundle in a group, and column is the second bundle.
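
Table 13-3 lends itself to a simple lookup structure. The sketch below transcribes the table as printed and counts how many of the 100 ordered template pairings each processor can dual-issue; the names ROWS, DUAL_ISSUE, and can_dual_issue are illustrative and do not come from any Intel tool.

```python
TEMPLATES = ["MII", "MLX", "MMI", "MFI", "MMF", "MIB", "MBB", "BBB", "MMB", "MFB"]

# Key: first bundle. Cells: second bundle, in TEMPLATES order.
# "I2" = Itanium 2 only, "both" = both processors, "no" = neither.
ROWS = {
    "MII": "I2 no I2 I2 I2 I2 both both I2 both",
    "MLX": "I2 I2 I2 both I2 both I2 both I2 both",
    "MMI": "I2 I2 I2 I2 I2 I2 I2 both I2 I2",
    "MFI": "I2 both I2 both I2 both both both I2 both",
    "MMF": "I2 I2 I2 I2 I2 I2 I2 both I2 I2",
    "MIB": "I2 both I2 both I2 both both no I2 both",
    "MBB": "no no no no no no no no no no",
    "BBB": "no no no no no no no no no no",
    "MMB": "I2 I2 I2 I2 I2 I2 I2 no I2 I2",
    "MFB": "both both I2 both I2 both both no I2 both",
}

DUAL_ISSUE = {
    (first, second): cell
    for first, row in ROWS.items()
    for second, cell in zip(TEMPLATES, row.split())
}


def can_dual_issue(first, second, processor):
    """Look up whether `first` followed by `second` can issue together."""
    cell = DUAL_ISSUE[(first, second)]
    if processor == "Itanium 2":
        return cell in ("I2", "both")
    return cell == "both"  # original Itanium


itanium_pairs = sum(can_dual_issue(a, b, "Itanium")
                    for a in TEMPLATES for b in TEMPLATES)
itanium2_pairs = sum(can_dual_issue(a, b, "Itanium 2")
                     for a in TEMPLATES for b in TEMPLATES)

print("Itanium:  ", itanium_pairs, "of", len(DUAL_ISSUE))
print("Itanium 2:", itanium2_pairs, "of", len(DUAL_ISSUE))
```

A query such as can_dual_issue("MMF", "MMF", "Itanium 2") corresponds to reading the MMF row and MMF column of the table.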

Table 13-4. Original Itanium Processor General Pipeline

Stage  Mnemonic  Description of Activities
1      IPG       Instruction Pointer generation
2      FET       Prefetch up to 6 instructions per cycle into an 8-bundle buffer; predict branch direction
3      ROT       Rotate instructions of current group into position; calculate branch addresses
4      EXP       Disperse up to 6 instructions through 9 ports using bundle templates in conjunction with opcode information
5      REN       Rename (remap) registers, especially the register stack engine
6      WLD       Word line decode (delivery of data loaded from memory, after a latency of one or more cycles)
7      REG       Register read (delivery of data from the Gr, Fr, and Pr registers)
8      EXE       Execute (ALU operations; branch address validation)
9      DET       Exception detect (and abandon result of execution if instruction predicate was not true)
10     WRB       Write-back (store ALU result in Gr, Fr, and/or Pr registers)

The floating-point pipeline was two stages longer than the integer pipeline. In effect, four execution stages for floating-point operations replaced stages 8 and 9 (EXE, DET) for integer operations.
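
The relationship between the two pipeline depths can be written out explicitly. In this sketch the integer stage mnemonics come from Table 13-4, while the names FP1 through FP4 are placeholders, since the text gives only the number of floating-point execution stages and not their mnemonics.

```python
INTEGER_PIPELINE = ["IPG", "FET", "ROT", "EXP", "REN",
                    "WLD", "REG", "EXE", "DET", "WRB"]

# Placeholder names: only the count of four stages is given in the text.
FP_EXECUTE_STAGES = ["FP1", "FP2", "FP3", "FP4"]


def floating_point_pipeline(integer_stages, fp_stages):
    """Replace the EXE and DET stages with the floating-point execution stages."""
    start = integer_stages.index("EXE")
    end = integer_stages.index("DET") + 1
    return integer_stages[:start] + fp_stages + integer_stages[end:]


fp_pipeline = floating_point_pipeline(INTEGER_PIPELINE, FP_EXECUTE_STAGES)

print(len(INTEGER_PIPELINE), "integer stages")
print(len(fp_pipeline), "floating-point stages:", " ".join(fp_pipeline))
```

The resulting depths agree with the pipeline depth entries for the original Itanium processor in Table 13-1.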

A correctly predicted taken branch incurred one to three cycles of overhead. Additional delays might result if the I-cache structures could not deliver a bundle from the address in the instruction pointer when the cycle returned from stage 10 (WRB) to stage 1 (IPG). Only a correctly predicted fall-through incurred no penalty at all.

When a branch was mispredicted, the first-generation Itanium processor incurred a nine-cycle penalty because all eight of the following instructions already in the pipeline had to be abandoned without having been allowed to affect the machine state.
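
The branch costs just described can be combined into an expected overhead per branch. The following is a simplified model, not a description of how the processor accounts for cycles: the nine-cycle misprediction penalty and the one-to-three-cycle taken-branch overhead come from the text, whereas the misprediction rate and taken fraction in the example call are purely hypothetical inputs.

```python
def expected_branch_cycles(mispredict_rate, taken_fraction,
                           taken_overhead=1, mispredict_penalty=9):
    """Average overhead cycles per branch under a simple three-outcome model.

    mispredicted             -> mispredict_penalty cycles
    predicted, taken         -> taken_overhead cycles (1 to 3 in the text)
    predicted, fall-through  -> 0 cycles
    """
    predicted = 1.0 - mispredict_rate
    return (mispredict_rate * mispredict_penalty
            + predicted * taken_fraction * taken_overhead)


# Hypothetical workload: 5% of branches mispredicted, 60% of the rest taken.
best = expected_branch_cycles(0.05, 0.60, taken_overhead=1)
worst = expected_branch_cycles(0.05, 0.60, taken_overhead=3)

print(f"{best:.2f} to {worst:.2f} overhead cycles per branch")
```

Even at a low misprediction rate, the misprediction term contributes a noticeable share of the total, which is why the techniques discussed in Section 13.3.6 matter.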

13.3.5 Latency Factors

The Itanium 2 processor implementation contains numerous optimizations in its pipelines, which, in conjunction with the efficient cache design, result in reduced latencies for many kinds of producer-consumer situations between instructions. Table 13-5 summarizes such latencies for the initial Itanium processor implementation (values in parentheses) and the Itanium 2 processor.

Table 13-5. Producer-Consumer Latencies for the (Itanium) and Itanium 2 Processors[*]

                 Consumer Instruction
Producer         Qp   branch qp  MM      Mem Addr  setf    ALU     Store   Fmac, Fmisc, getf
Adder            n/a  n/a        3       (1-2) 1   1       1       1       n/a
Multimedia       n/a  n/a        2       (4) 3     3       (4) 3   (4) 3   n/a
getf             n/a  n/a        (9) 6   (9) 6     (9) 6   (9) 5   (9) 5   n/a
setf             n/a  n/a        n/a     n/a       n/a     n/a     (2) 6   (2) 6
Fmac, Fmisc      n/a  n/a        n/a     n/a       n/a     n/a     (8) 4   (5-8) 4
cmp, tbit, tnat  1    0          n/a     n/a       n/a     n/a     n/a     n/a
fcmp             2    1          n/a     n/a       n/a     n/a     n/a     n/a
FP predicates    2    2          n/a     n/a       n/a     n/a     n/a     n/a
Integer load     n/a  n/a        R+1     R+1       R       R       R       R
FP load          n/a  n/a        F+2     F+2       F+1     F+1     F+1     F+1
mov =br, alloc   n/a  n/a        3       2         2       2       2       n/a
mov ar, cr       n/a  n/a        A       A         A       A       A       n/a
mov pr=          1    0          3       2         n/a     2       2       n/a
mov indirect     n/a  n/a        I       I         I       I       I       n/a

Adder: arithmetic, logical, compare, immediate moves, and postincrementing aspect of load/store.

Fmac: arithmetic calculations, both 64- and parallel 32-bit, including the reciprocal approximations.

Fmisc: compare, abs/min/max, merge/mix/pack, sign extend, swap, select, class, and logical.

FP predicates: frcpa, fprcpa, frsqrta, fprsqrta.

R: L1, L2, or L3 cache hit cycle time (see Tables 4-3 and 13-2), or longer from main memory.

F: L2 or L3 cache hit cycle time (see Tables 4-3 and 13-2), or longer from main memory.

A: ec and lc 2 cycles; fpsr and cr 10-12 cycles; others 2-35 cycles, depending on the application or control register being addressed.

I: 6-35 cycles, depending on the register set being addressed.

[*] Adapted from Intel Itanium 2 Processor Reference Manual for Software Development and Optimization and Intel Itanium Processor Reference Manual for Software Optimization.

The best-case L1-D cache delay for loading integers (R) has dropped from two cycles to one cycle. The best-case L2 cache delay for loading floating-point data (F) has dropped from nine cycles to six cycles; at least one additional cycle is always needed to convert floating-point data from the 32- or 64-bit memory representation to the 82-bit register representation. Noncyclic instruction sequences in an Itanium 2 processor thus obtain their data much more quickly, while the software pipelines for cyclic situations can be tightened up considerably.
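
The symbolic entries R and F in Table 13-5 become concrete latencies once a cache hit time is substituted. The sketch below uses only the best-case values quoted in the preceding paragraph and two table entries read at face value (an integer load feeding an ALU consumer is R; a floating-point load feeding the last column is F+1); the function name load_to_use is hypothetical.

```python
# Best-case hit times in cycles, as quoted in the text.
R = {"Itanium": 2, "Itanium 2": 1}  # integer load, L1-D hit
F = {"Itanium": 9, "Itanium 2": 6}  # floating-point load, L2 hit


def load_to_use(hit_cycles, extra_cycles):
    """Cycles from a load to a dependent consumer: hit time plus table offset."""
    return hit_cycles + extra_cycles


for cpu in ("Itanium", "Itanium 2"):
    integer_to_alu = load_to_use(R[cpu], 0)  # table entry "R"
    fp_to_fmac = load_to_use(F[cpu], 1)      # table entry "F+1"
    print(f"{cpu}: integer load to ALU {integer_to_alu}, "
          f"FP load to Fmac {fp_to_fmac}")
```

A compiler scheduling a software pipeline would use figures of this kind to decide how many cycles must separate a load from its first consumer.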

13.3.6 Branch Prediction

The great cost of mispredicted branches motivates the computer industry to invent techniques for avoiding branches altogether, through mechanisms such as conditional moves and predication, as well as techniques for conveying branch hints and branch history information to the execution pipelines in contemporary processors.

The Itanium architecture not only provides branch hint completers on branch instructions themselves (Section 5.3.1), but also includes a branch predict instruction that could help attain optimal performance with the original Itanium processor:

 brp.ipwh.ih        target25,tag13   // IP-relative form
 brp.indwh.ih       b2,tag13         // Indirect form
 brp.ret.indwh.ih   b2,tag13         // Return form

where the tag13 operand is encoded by the assembler from the symbolic address of the branch instruction itself and where the target25 is encoded by the assembler to match the symbolic destination address represented in the branch instruction itself.

The indirect forms are used when the target address has been placed in a branch register (b2), possibly in one of several predicated ways. The return form indicates that the branch will be a return.

There are four values for ipwh (the IP-relative predict whether hint). A completer sptk or dptk specifies that the branch should be predicted statically taken or dynamically taken, respectively. The completer loop specifies that the branch will be one of br.cloop, br.ctop, or br.wtop. The completer exit specifies that the branch will be either br.cexit or br.wexit.

There are two values for indwh (the indirect predict whether hint): A completer sptk or dptk specifies that the branch should be predicted statically taken or dynamically taken, respectively.

There are two possibilities for ih (the importance hint): none at all indicates that the branch is relatively unimportant, while imp indicates that the branch is one of a small number of very important branches (e.g., a branch in an inner loop).
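
The completer rules just described determine a small, fixed set of brp mnemonics. As a quick illustration, the sketch below enumerates them from the lists given in the text; the helper name mnemonics is hypothetical, and an empty string stands for the absence of an importance completer.

```python
from itertools import product

IPWH = ["sptk", "loop", "exit", "dptk"]  # IP-relative predict whether hints
INDWH = ["sptk", "dptk"]                 # indirect predict whether hints
IH = ["", "imp"]                         # "" means no importance completer


def mnemonics(prefix, whether_hints):
    """Build every mnemonic for one brp form, skipping empty completers."""
    return [".".join(part for part in (prefix, wh, ih) if part)
            for wh, ih in product(whether_hints, IH)]


ip_relative = mnemonics("brp", IPWH)
indirect = mnemonics("brp", INDWH)
returns = mnemonics("brp.ret", INDWH)

print(len(ip_relative), "IP-relative:", ", ".join(ip_relative))
print(len(indirect), "indirect:   ", ", ".join(indirect))
print(len(returns), "return:     ", ", ".join(returns))
```

The IP-relative and indirect forms share some mnemonics, such as brp.sptk; the assembler distinguishes them by whether the first operand is a target address or a branch register.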

An implementation of the Itanium architecture may provide various associated hardware support elements for these implied capabilities for branch prediction. The brp instruction operates as a nop instruction except for its performance effects, such as prefetching instructions at the new predicted branch destination into the cache structures. The Itanium 2 processor exploits different and more powerful technologies for hardware branch prediction, but does use information from a brp instruction to improve I-cache performance.

13.3.7 Other Differences and Features

On the initial Itanium processor implementation, which lacked hardware support for the brl instruction, an exception would occur and system software was expected to provide a fault handler with emulation to complete the jump. There could be a severe time penalty for that fault recovery, thus illustrating how implementation-specific details can affect the performance of code. Likewise, the implementation of chk instructions was not as complete, and operating system assistance was needed in order to branch to the recovery code.

We have not attempted an exhaustive review of features where the Itanium 2 processor has improved upon its predecessor. In all, the Itanium 2 processor refines the implementation of the architecture in a manner that an engineer or physicist might call good impedance matching of its many powerful but complex component parts.



Itanium® Architecture for Programmers. Understanding 64-Bit Processors and EPIC Principles
Year: 2003
Pages: 223
