10.3 Explicit Parallelism in the Itanium Processors

Separating Itanium instructions into independent groups and bundling three instructions with a template represent departures from the programming model for most other computer architectures. Here we expand upon the preliminary discussion given in Chapter 5.

10.3.1 Instruction Templates

The 128-bit Itanium instruction word includes three 41-bit slots for instructions and a 5-bit field for the template that guides those three instructions into particular instruction units. With a 5-bit field, up to 32 different templates are possible, but only three-fourths of those have been defined, as shown in Table 10-2.

Table 10-2. Itanium Instruction Templates
Code	Slot	Unit	Code	Slot	Unit	Code	Slot	Unit	Code	Slot	Unit
0x0	0 1 2	M I I	0x8	0 1 2	M M I	0x10	0 1 2	M I B	0x18	0 1 2	M M B
0x1	0 1 2	M I I;;	0x9	0 1 2	M M I;;	0x11	0 1 2	M I B;;	0x19	0 1 2	M M B;;
0x2	0 1 2	M I;; I	0xa	0 1 2	M;; M I	0x12	0 1 2	M B B	0x1a	0 1 2	^[*]
0x3	0 1 2	M I;; I;;	0xb	0 1 2	M;; M I;;	0x13	0 1 2	M B B;;	0x1b	0 1 2	^[*]
0x4	0 1 2	M L X	0xc	0 1 2	M F I	0x14	0 1 2	^[*]	0x1c	0 1 2	M F B
0x5	0 1 2	M L X;;	0xd	0 1 2	M F I;;	0x15	0 1 2	^[*]	0x1d	0 1 2	M F B;;
0x6	0 1 2	^[*]	0xe	0 1 2	M M F	0x16	0 1 2	B B B	0x1e	0 1 2	^[*]
0x7	0 1 2	^[*]	0xf	0 1 2	M M F;;	0x17	0 1 2	B B B;;	0x1f	0 1 2	^[*]

^[*] Reserved for future extensions to the architecture.
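The layout just described can be made concrete with a short sketch that splits a 128-bit bundle into its template code and three slots. This is only an illustration: the names `TEMPLATES`, `decode_bundle`, and `encode_bundle` are hypothetical, and the bit positions (template in the five least significant bits, slot 0 immediately above it) are an assumption not stated in the text above. The template names are copied from Table 10-2.

```python
# Template codes transcribed from Table 10-2; reserved codes are simply absent.
TEMPLATES = {
    0x00: "MII",  0x01: "MII;;",  0x02: "MI;;I",  0x03: "MI;;I;;",
    0x04: "MLX",  0x05: "MLX;;",
    0x08: "MMI",  0x09: "MMI;;",  0x0A: "M;;MI",  0x0B: "M;;MI;;",
    0x0C: "MFI",  0x0D: "MFI;;",  0x0E: "MMF",    0x0F: "MMF;;",
    0x10: "MIB",  0x11: "MIB;;",  0x12: "MBB",    0x13: "MBB;;",
    0x16: "BBB",  0x17: "BBB;;",
    0x18: "MMB",  0x19: "MMB;;",
    0x1C: "MFB",  0x1D: "MFB;;",
}

SLOT_MASK = (1 << 41) - 1   # each instruction slot is 41 bits wide


def decode_bundle(bundle):
    """Return (template name, [slot0, slot1, slot2]) for a 128-bit integer.

    Assumed layout: bits 0-4 template, then three consecutive 41-bit slots.
    """
    if not 0 <= bundle < (1 << 128):
        raise ValueError("bundle must fit in 128 bits")
    code = bundle & 0x1F
    slots = [(bundle >> (5 + 41 * i)) & SLOT_MASK for i in range(3)]
    return TEMPLATES.get(code, "reserved"), slots


def encode_bundle(code, slots):
    """Inverse of decode_bundle, under the same assumed layout."""
    value = code & 0x1F
    for i, slot in enumerate(slots):
        value |= (slot & SLOT_MASK) << (5 + 41 * i)
    return value
```

Counting the dictionary entries confirms the three-fourths figure: 24 of the 32 possible codes are defined.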

The programmer, assembler, or compiler has substantial flexibility in placing typical sequences of instructions into bundles, because type A instructions can be set to execute on M- or I-units.

Manually assigned templates

The assemblers recognize a special directive for the explicit assignment of instructions to templates:

 {.mmi                           // use an MMI template for this bundle
    type M or type A instruction
    possible stop                // if M;;MI or M;;MI;;
    type M or type A instruction
    type I instruction
    possible stop                // if MMI;; or M;;MI;;
 }

where the .mmi directive can be used for any of four templates (see Table 10-2) with zero, one, or two stops in particular positions.
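As a rough illustration of how one directive maps onto four templates, the sketch below selects the template code from the two possible stop positions. The function name `select_mmi_template` is hypothetical; the codes and names are taken from Table 10-2.

```python
# The four MMI-family templates from Table 10-2, keyed by
# (stop after slot 0, stop at end of bundle).
MMI_TEMPLATES = {
    (False, False): (0x08, "MMI"),
    (False, True):  (0x09, "MMI;;"),
    (True,  False): (0x0A, "M;;MI"),
    (True,  True):  (0x0B, "M;;MI;;"),
}


def select_mmi_template(stop_after_slot0, stop_at_end):
    """Return (code, name) of the template a .mmi bundle would need."""
    return MMI_TEMPLATES[(bool(stop_after_slot0), bool(stop_at_end))]
```

For example, a bundle with a stop only after its first instruction would need code 0xa (M;;MI).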

Assembler-selected templates

In order to demonstrate that the assignment of templates is not fully determined by a programmer's code sequence, we used several assemblers to investigate the segment of the SCANFILE program (Figure 9-5) between the loop and eof labels. In Table 10-3, we compare the results from Intel's ecc for Linux, gcc for Linux, and Hewlett-Packard's cc_bundled for HP-UX.

Table 10-3. Assembler-Selected Templates for a Program Segment
Source Code	No-op Instructions and Templates Selected
	ecc (Linux)		gcc (Linux)		cc_bundled (HP-UX)		^[*]
`loop:`
`mov out0 = loc3`		`//M`		`//M`		`//M`	1
`add out1 = @gprel(IFORM),gp`		`//I`		`//I`		`//M`	1
`mov out2=loc6`		`//I`		`//I`		`//I`	1
	`nop.m`	`//M`^[]	`nop.m`	`//M`
	`nop.i`	`//I`	`nop.f`	`//F`
`br.call.sptk.many b0 = fscanf`		`//B`^[]		`//B`^[]		`//B`	1
					`nop.b`	`//B`
					`nop.b`	`//B`^[]
`mov gp = loc2`		`//M`		`//M`		`//M`	2
`cmp4.ne p6,p0 = 1,ret0`		`//I`		`//I`		`//M`	2
`(p6) br.cond.sptk.few eof;;`		`//B;;`		`//B;;`		`//B;;`	2
`mov out0 = loc4`		`//M`		`//M`		`//M`	3
`add out1 = @gprel(OFORM),gp`		`//I`		`//I`		`//M`	3
`mov out2 = loc6`		`//I`		`//I`		`//I`	3
	`nop.m`	`//M`^[]	`nop.m`	`//M`
	`nop.i`	`//I`	`nop.f`	`//F`
`br.call.sptk.many b0 = fprintf`		`//B`^[]		`//B`^[]		`//B`	3
					`nop.b`	`//B`
					`nop.b`	`//B`^[]
`mov gp = loc2`		`//M`		`//M`		`//M`	4
`cmp4.lt p6,p0 = ret0,r0`		`//I`		`//I`		`//M`	4
`(p6) br.cond.sptk.few stop7`		`//B`^[]		`//B`^[]		`//B`^[]	4
	`nop.m`	`//M`
`add loc5 = 1,loc5`		`//I`		`//M`		`//M`	5
			`nop.f`	`//F`
`br.cond.sptk.few loop`		`//B`^[]		`//B`^[]		`//B`	5
					`nop.b`	`//B`^[]
`eof:`

^[*] Machine cycles on the Itanium 2 processor; the original Itanium processor would require a total of 7 cycles (ecc) or 5 cycles (gcc or cc_bundled).

^[] The original Itanium processor would encounter split issue here.

^[] Both the original Itanium processor and the Itanium 2 processor would encounter split issue here.

In the Linux programming environment, gdb directly shows the templates (e.g., MII) and shows double semicolons when the disas command is used. In the HP-UX programming environment, we inspected the instruction bundles as hexadecimal values in order to deduce the template codes (Table 10-2).

This exercise shows some differences, both in the way bundles are made up and in the way the type A instructions are assigned to templates. All three sets of bundles conform to rules of the Itanium architecture, but may differ in efficiency of execution on any particular implementation, particularly with the earliest Itanium processor.

Since explicit targeting of instructions is rather new to the industry, we should not be surprised by such differences. The software technology of Itanium assemblers and compilers is expected to become more capable as the product line matures.

Relative efficiency of fitting instructions into bundles

Each assembler placed the 16 instructions, along with five nop instructions, into seven bundles, for an equivalent packing efficiency of 16/21 (about 76%), which is less than 6/7. This sample code fragment cannot be made any more compact because a nonbranch instruction that follows a branch instruction has to begin a new bundle (see the patterns evident in Table 10-2).
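The arithmetic behind this packing figure can be checked with a few lines. The helper name `packing_efficiency` is hypothetical; the counts (16 instructions, five nop instructions) come from Table 10-3.

```python
from fractions import Fraction


def packing_efficiency(useful, nops):
    """Fraction of bundle slots holding useful (non-nop) instructions."""
    total = useful + nops
    if total % 3:
        raise ValueError("slots must fill whole bundles of three")
    return Fraction(useful, total)


useful, nops = 16, 5                  # counts from Table 10-3
bundles = (useful + nops) // 3        # three slots per bundle
efficiency = packing_efficiency(useful, nops)
print(bundles, efficiency, efficiency < Fraction(6, 7))
```

The same counts hold for all three assemblers; only the kinds and positions of the nop instructions differ.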

Inspection for split issues

The limits of parallel execution of instructions occur when a stop (;;) is reached, when two full bundles of three instructions apiece are issued, or when there is a shortage of one type of execution unit for a particular Itanium implementation. This latter condition is called split issue.

The Itanium 2 processor improves upon the performance of the initial Itanium processor implementation: the Itanium 2 processor can redirect a type A instruction to an otherwise idle M-unit even if the template had indicated the use of an I-unit, whereas the initial implementation cannot. For this reason, the assignment of type A instructions to MII templates by the ecc assembler introduces more instances of split issue for an initial Itanium processor than the strategies of the other assemblers for this particular program segment.
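A deliberately simplified model of this unit shortage is sketched below. It asks only whether the slots of up to two bundles can all find an execution unit in one cycle; it ignores stops, branch-related rules, and asymmetries among units. The unit counts for the original Itanium processor (two M-units, two I-units) and the F- and B-unit counts for both processors are assumptions not given in the text above, and all names are hypothetical.

```python
from collections import Counter

# Assumed unit counts; the text above states only that the Itanium 2
# processor offers four M-units and two I-units.
ITANIUM = {"units": {"M": 2, "I": 2, "F": 2, "B": 3}, "redirect_a": False}
ITANIUM_2 = {"units": {"M": 4, "I": 2, "F": 2, "B": 3}, "redirect_a": True}


def fits_in_one_cycle(slots, cpu):
    """slots: (template unit, is_type_a) pairs for at most two bundles.

    Returns False when a shortage of some unit type would force split issue.
    """
    units, fixed, flexible = cpu["units"], Counter(), 0
    for unit, is_type_a in slots:
        if unit == "I" and is_type_a and cpu["redirect_a"]:
            flexible += 1            # may run on a spare I-unit or M-unit
        else:
            fixed[unit] += 1         # must run on the unit the template names
    if any(fixed[u] > units.get(u, 0) for u in fixed):
        return False
    spare = (units["I"] - fixed["I"]) + (units["M"] - fixed["M"])
    return flexible <= spare


# First two ecc bundles from Table 10-3: MII (mov, add, mov), MIB (nop.m, nop.i, br.call)
ecc_first_two = [("M", True), ("I", True), ("I", True),
                 ("M", False), ("I", False), ("B", False)]
print(fits_in_one_cycle(ecc_first_two, ITANIUM))
print(fits_in_one_cycle(ecc_first_two, ITANIUM_2))
```

Under these assumptions the pair of bundles needs three I-units, which the original processor lacks, while the Itanium 2 processor can move the two type A instructions onto idle M-units.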

Asymmetry among execution units

Additional situations, including the intrinsic limitations of some execution units, may cause split issue. The MIB, MFB, and MMB templates target unit B2 ordinarily, but target unit B0 for nop.b or brp instructions; moreover, they always cause split issue after the branch in early Itanium processor implementations, except in the case of a nop.b instruction. In addition, an Itanium processor implementation may be able to execute certain instructions only on a specific execution unit. Early Itanium implementations contain only one bit-shifter associated with execution unit I0, and thus some type I instructions cannot be targeted to unit I1. As another illustration of asymmetry, the Itanium 2 processor performs store operations on two of its M-units and load operations on the other two M-units.

Template and nop strategy

This single brief example shows the potential impact of various strategies for introducing nop instructions and choosing templates. Note that there is a form of nop instruction for each type of execution unit. The Hewlett-Packard software appears to emphasize assignment of type A instructions to the M-units (in the absence of any load or store operations here). The gcc software appears to select nop.f instructions (in the absence of any floating-point operations here). The Intel software appears to assign the first type A instruction to an M-unit and then any others preferentially to I-units.

This is but one isolated code example. With other code sequences, the various assemblers might show different relative efficiencies.

Diagram of instruction issue according to templates

In order to indicate how the templates direct the issuance of instructions to the execution units, we present in Figure 10-2 the split issue effects that come in the first few cycles with the version of the code fragment from SCANFILE that was produced by the ecc software.

Figure 10-2. Execution of an instruction sequence with split issue

In cycle 1, an Itanium 2 processor can execute the six instructions from the first two bundles because four M-units and two I-units are available, all of which can execute type A instructions. In cycle 2, only the three instructions in the bundle with an MIB template can execute, since that template produces split issue on early Itanium implementations. The next bundle with an MII template must wait until the next cycle. In cycle 3, six instructions from two bundles of this code fragment can again be issued to execute in parallel on an Itanium 2 processor.

10.3.2 Data Dependency and Speculation

We have already discussed data dependency using the producer-consumer terminology. More generally, four cases of potential dependency within an instruction group are relevant to the Itanium architecture and its implementations:

RAW (read-after-write) dependencies are not allowed for registers, with a few exceptions (branching on predicate results determined within a single machine cycle by non-floating-point operations, writing into a stacked register brought into scope by an alloc instruction, and other special circumstances). By contrast, RAW dependency is allowed for memory; i.e., a load from a recently written memory location will retrieve the data from the most recent store.
WAW (write-after-write) dependencies are generally not allowed for registers, but several compares with .and/.andcm or .or/.orcm completers may coexist within an instruction group. Otherwise, the Itanium architecture permits any given register (except predicate register Pr₀) to occur as a destination only once within an instruction group, even if that group extends over multiple machine cycles. WAW dependency should be avoided for memory, as a first write may stall a subsequent store operation that needs to write to the same information unit.
WAR (write-after-read) dependencies are generally permitted for both registers and memory. In the case of registers, this rule permits such common operations as incrementing the value in a register.
RAR (read-after-read) dependencies are permitted for both registers and memory. In the case of registers, this rule permits multiple instructions to draw upon the same source information.

The programmer, assembler, or compiler must select templates that insert an explicit stop (;;) for every instance of RAW or WAW dependency involving data from a source register.
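The register rules above can be sketched as a small routine that scans a straight-line sequence and marks a stop wherever an instruction would read or rewrite a register already written in the current instruction group. This is a minimal illustration, not an assembler algorithm: the name `place_stops` is hypothetical, each instruction is described by hand-supplied destination and source lists, and the exceptions noted above (predicates feeding branches, parallel compares, alloc) are ignored.

```python
def place_stops(instructions):
    """instructions: list of (text, destination registers, source registers).

    Returns the instruction texts with ';;' appended wherever a stop must
    end the current instruction group (register RAW or WAW dependency).
    WAR and RAR cases need no stop, so registers read are not tracked.
    """
    out = []
    written = set()                   # registers written in the current group
    for text, dests, srcs in instructions:
        raw = written & set(srcs)     # would read a value written in this group
        waw = written & set(dests)    # would write the same register again
        if out and (raw or waw):
            out[-1] += ";;"           # close the group before this instruction
            written = set()
        written |= set(dests)
        out.append(text)
    return out


program = [
    ("ld8 r20 = [r15]", ["r20"], ["r15"]),
    ("add r20 = 1,r20", ["r20"], ["r20"]),
    ("st8 [r16] = r20", [], ["r16", "r20"]),
]
for line in place_stops(program):
    print(line)
```

An isolated increment such as add loc5 = 1,loc5 needs no stop of its own, since reading and then writing the same register is a WAR case.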

The memory hierarchy requires an indeterminate time to deliver data, possibly many tens of machine cycles. Execution pipelines may stall badly if data from a load instruction cannot be obtained quickly. Optimizing compilers for RISC-like systems generally attack this problem by rearranging the programmer's code to begin the load request as early as possible without altering the programmer's intended logic. The Itanium EPIC architecture provides hardware support for another powerful technique called data speculation.

An Itanium CPU includes an internal structure called the advanced load address table (ALAT). Entries in the ALAT are tagged with a register identifier and a memory address. Every Itanium store operation queries the ALAT and must invalidate all entries that overlap with addresses of any bytes in the memory hierarchy that will be affected by this store operation. Speculations associated with invalidated ALAT entries will subsequently fail.

An Itanium compiler may move a load instruction ahead of neighboring store operations, gambling that those stores do not overlap the storage region from which the moved load obtains data. A recovery routine must be present to handle the contingency where the speculation has failed. The code consists of an advanced load, any number of instructions speculatively executed using the value from that load, a check instruction, and appropriate recovery. There are simple and more complex cases.

When only the load instruction is speculative

The simplest case of speculative code rearrangement involves an advanced load and a check load that functions as the contingent recovery:

 no data speculation       using data speculation
                           ld8.a     r20 = [r15];;  // advanced
 <sequence A>              <sequence A>
 st8   [r14] = r24         st8       [r14] = r24
 ld8   r20 = [r15];;       ld8.c.clr r20 = [r15];;  // check
 add   r20 = 1,r20         add       r20 = 1,r20
 <sequence B>              <sequence B>
 st8   [r16] = r20         st8       [r16] = r20

The compiler may not know whether registers r14 and r15 will dynamically contain pointers to overlapping or nonoverlapping storage; hence moving the load instruction is a speculative decision instead of a guaranteed situation.

The advanced load (.a as the load type completer) inserts the address in register r15 as an ALAT entry. As with other cache-like structures of limited capacity, this operation might displace some other ALAT entry that is still relevant; if that happens, some other speculation will fail.

The check load (.c.clr as the load type completer) searches the ALAT for the r20 register identifier. On a hit, the matching ALAT entry is cleared and no new load operation is needed because register r20 already contains valid data. On a miss, the check load becomes a regular load operation that refreshes the value in register r20. The .c.nc completer may be substituted if the ALAT entry should be retained.

The speculation will succeed if no store conflict has arisen and few enough other speculative requests have been made that the ALAT still holds the pertinent entry. The effect is to hide some or all of the load latency by permitting other useful instructions (sequence A) to execute while the memory hierarchy takes time to respond to the load request.
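To make the interplay of advanced load, store, and check load concrete, the following toy model may help. It is a sketch under stated assumptions rather than a description of the hardware: the class `ALAT` and function `run_simple_case` are hypothetical names, full addresses are compared (an implementation may compare only part of an address), memory is a dictionary of aligned 8-byte words, and evicting the oldest entry is an assumed replacement policy.

```python
class ALAT:
    """Toy advanced load address table: register name -> (address, size)."""

    def __init__(self, capacity=32):
        self.capacity = capacity
        self.entries = {}

    def advanced_load(self, reg, addr, size=8):
        self.entries.pop(reg, None)
        if len(self.entries) >= self.capacity:
            # Assumed policy: displace the oldest entry, whose speculation fails.
            del self.entries[next(iter(self.entries))]
        self.entries[reg] = (addr, size)

    def store(self, addr, size=8):
        # Every store invalidates entries whose bytes it overlaps.
        for reg, (a, s) in list(self.entries.items()):
            if addr < a + s and a < addr + size:
                del self.entries[reg]

    def check(self, reg, clear=True):
        hit = reg in self.entries
        if hit and clear:             # .clr releases the entry on a hit
            del self.entries[reg]
        return hit


def run_simple_case(mem, r14, r15, r24):
    """Model of: ld8.a / st8 / ld8.c.clr / add from the listing above."""
    alat = ALAT()
    r20 = mem[r15]                    # ld8.a r20 = [r15]
    alat.advanced_load("r20", r15)
    mem[r14] = r24                    # st8 [r14] = r24
    alat.store(r14)
    if not alat.check("r20"):         # ld8.c.clr r20 = [r15]
        r20 = mem[r15]                # miss: behaves as a regular load
    return r20 + 1                    # add r20 = 1,r20
```

When r14 and r15 hold different addresses the check load hits and the early value is used; when they coincide, the store removes the entry and the check load fetches the newly stored value.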

When more instructions than the load are speculative

A compiler could treat the above code fragment more aggressively. Suppose that sequence A is too brief to hide the anticipated load latency, but the potentially interfering store would be of low probability. The compiler might then proceed directly to the addition and sequence B instructions that depend on the value loaded into register r20:

 no data speculation       using aggressive data speculation
                           ld8.a     r20 = [r15];; // advanced
 <sequence A>              <sequence A>
                           add       r20 = 1,r20
                           <sequence B>
 st8   [r14] = r24         st8       [r14] = r24
 ld8   r20 = [r15];;       chk.a.clr r20,recover   // check
 add   r20 = 1,r20
 <sequence B>            back:
 st8   [r16] = r20         st8       [r16] = r20
                           <continue main sequence>
                         recover:    // at some other address
                           ld8       r20 = [r15];; // reload
                           add       r20 = 1,r20
                           <sequence B>
                           br        back

where the recovery code must then repeat all of the instructions (add and sequence B) that had been speculatively executed using incorrect data in register r20. The recovery code may repeat the invalidated load operation, either with a regular load instruction as shown or with an advanced load if there is some reason for reallocating an ALAT entry.

The Itanium chk.a instruction can interrogate the ALAT and then conditionally branch to recovery code. As with the check form of a load instruction, the chk.a instruction requires the further completer .clr or .nc to control its effect on the ALAT entry. The branch range of the chk.a instruction is the same as that of other IP-relative branch instructions (Section 5.3.5), but it executes on an M-unit, not a B-unit.

This aggressive strategy produces larger code segments. The compiler should arrange the out-of-sequence recovery routine to favor the most probable dynamic circumstances (e.g., all but the first or last time through a loop). Effective use of the ALAT and data speculation represents a challenging opportunity for improved heuristics in Itanium compilers.
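The essential property of the aggressive form is that it must leave memory in the same state as the unspeculated code, whether or not the store interferes. The sketch below checks that property for a toy model in which sequence B is empty, memory is a dictionary of aligned 8-byte words, and a plain dictionary stands in for the ALAT; the function names `baseline` and `aggressive` are hypothetical.

```python
def baseline(mem, r14, r15, r16, r24):
    """Unspeculated order: store, load, add, store."""
    mem[r14] = r24                    # st8 [r14] = r24
    r20 = mem[r15]                    # ld8 r20 = [r15]
    r20 = r20 + 1                     # add r20 = 1,r20
    mem[r16] = r20                    # st8 [r16] = r20


def aggressive(mem, r14, r15, r16, r24):
    """Speculated order with chk.a recovery; returns True if recovery ran."""
    alat = {}                         # register -> address (ALAT stand-in)
    r20 = mem[r15]                    # ld8.a r20 = [r15]
    alat["r20"] = r15
    r20 = r20 + 1                     # add, executed speculatively
    mem[r14] = r24                    # st8 [r14] = r24
    alat = {r: a for r, a in alat.items() if a != r14}   # store invalidates
    failed = "r20" not in alat        # chk.a.clr r20,recover
    alat.pop("r20", None)
    if failed:                        # recover:
        r20 = mem[r15]                #   reload
        r20 = r20 + 1                 #   repeat the speculated add
    mem[r16] = r20                    # back: st8 [r16] = r20
    return failed
```

Running both versions on copies of the same memory image, once with distinct addresses in r14 and r15 and once with identical addresses, gives matching final contents in each case; only the second case takes the recovery path.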

Releasing ALAT entries

Because of the small capacity of the ALAT (32 entries for the Itanium 2 processor), programs should invalidate any entries that are no longer useful. The .clr completer on ld.c and chk.a instructions provides a simple way to do this, but the Itanium ISA includes a standalone instruction also:

 invala           // invalidate all ALAT entries
 invala.e  r1     // only entry for integer register
 invala.e  f1     // only entry for floating-point register

where the selective register forms do nothing if there is no matching ALAT entry.

10.3.3 Control Dependency and Speculation

Control dependency refers to situations where speculative execution is contingent on the logical flow of a program containing conditional branches or explicit predication. The potential snag is that exceptions should not be raised or reported if instruction(s) speculatively executed would not, in fact, have been encountered in normal nonspeculative program flow. The 65th bit (NaT) associated with every Itanium general register (Appendix D.2) or the special NaTVal in a floating-point register (Section 8.2.2) holds the exception in abeyance.

These special indicators can propagate through sequences of ALU operations until they are tested. Compare instructions generally set both predicates false when one or more NaT bits are detected for source registers in order to prevent predicated pathways from executing inappropriately.
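The propagation rule can be illustrated with a small model of a general register that carries a NaT flag alongside its value. This is a sketch only: the names `GR`, `add`, `cmp_lt`, and `tnat_nz` are hypothetical, values are unbounded Python integers rather than 64-bit quantities, and the placeholder value 0 stored with a set NaT bit is an arbitrary choice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GR:
    """A general register: a value plus its NaT (Not a Thing) bit."""
    value: int = 0
    nat: bool = False


def add(a, b):
    """ALU operation: the result is NaT if either source is NaT."""
    nat = a.nat or b.nat
    return GR(0 if nat else a.value + b.value, nat)


def cmp_lt(a, b):
    """Compare: returns (pt, pf); both are False if any source is NaT."""
    if a.nat or b.nat:
        return (False, False)
    lt = a.value < b.value
    return (lt, not lt)


def tnat_nz(r):
    """Model of tnat.nz pt,pf = r: pt is True when the NaT bit is set."""
    return (r.nat, not r.nat)


bad = GR(nat=True)                     # e.g., left by a failed speculative load
total = add(add(bad, GR(1)), GR(2))    # NaT survives a chain of additions
print(total.nat, cmp_lt(total, GR(10)), tnat_nz(total))
```

Because both predicates come back false, neither arm of a predicated alternative guarded by that compare would execute on invalid data.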

The Itanium tnat instruction can be used to test the NaT bit associated with a particular general register:

 tnat.trel.ctype pt,pf =  r3

where trel (the test relationship completer) can be z (zero) or nz (nonzero), as for the test bit instruction (Section 6.1.4), and where ctype (the compare type completer) can be any of those for compare instructions (Sections 5.2.1 and 6.1.5).

The fclass instruction (Section 8.6.2) provides a similar capability to test for NaTVal in a floating-point register and set a predicate pair accordingly.

Note that spill and fill variants of store and load instructions are provided for the purpose of moving potentially invalid data between memory and processor registers without raising exceptions. Conversions of data between the general and floating-point registers also pass along the token of invalidity in the appropriate form for the destination.

An example of control speculation

Suppose that a load with potentially long latency is executed only when control falls through a predicated branch. If the compiler believes that falling through is indeed the most likely situation, then it may opt to produce a speculated code sequence in order to do useful work while the memory hierarchy responds to the load request:

 no speculation        using control speculation
                            ld8.s r20 = [r15];;  // speculate
      <sequence A>          <sequence A>
 (px) br.cond notdo    (px) br.cond notdo
      ld8 r20 = [r15];;     chk.s r20,recover    // check
                      back:
      add r20 = 1,r20       add    r20 = 1,r20
      <sequence B>          <sequence B>
notdo:               notdo:
                     recover:      // at some other address
                            ld8     r20 = [r15]   // reload
                            br      back;;

where the recovery code repeats the failed load instruction.

The Itanium architecture provides the chk.s instruction to branch conditionally if the NaT bit associated with a particular register is set. The branch range of the chk.s instruction is the same as that of other IP-relative branch instructions (Section 5.3.5), but it is actually a pseudo-op for specific chk.s.m and chk.s.i instructions that can be directed by a bundle template to execute on an M- or I-unit.
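The deferral of exceptions in this pattern can be modeled in a few lines. In the sketch below, which is illustrative only, a missing dictionary key stands in for an address that would fault, `Fault` stands in for the exception, and the names `load`, `load_s`, and `speculated` are hypothetical; sequences A and B are omitted.

```python
class Fault(Exception):
    """Stand-in for an exception raised by a non-speculative load."""


def load(mem, addr):
    """Model of ld8: faults immediately on a bad address."""
    if addr not in mem:
        raise Fault(hex(addr))
    return mem[addr]


def load_s(mem, addr):
    """Model of ld8.s: returns (value, nat); a fault is deferred as NaT."""
    try:
        return load(mem, addr), False
    except Fault:
        return 0, True


def speculated(mem, r15, px):
    r20, nat = load_s(mem, r15)       # ld8.s r20 = [r15], hoisted above branch
    if px:                            # (px) br.cond notdo
        return None                   # result unused, so no exception appears
    if nat:                           # chk.s r20,recover
        r20 = load(mem, r15)          # recover: reload, may now fault for real
    return r20 + 1                    # back: add r20 = 1,r20
```

When the branch is taken, a bad address in r15 causes no exception at all, just as in the unspeculated code; when the branch falls through, the recovery load raises the exception at the point where the program would have encountered it anyway.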

Another example of control speculation

Consider the schematic situation where two potentially time-consuming load operations and related calculations depend on a simple logical alternative:

 <code that determines a, b>
 if    a < b
 then  load 1, etc.
 else  load 2, etc.
 endif

This situation is amenable to a programming strategy using control speculation, along the following lines:

 ld8.s 1
 ld8.s 2
 <code that determines a, b>
 if    a < b
 then  chk.s 1, etc.
 else  chk.s 2, etc.
 endif

In this way, the latency of the two load instructions can overlap with the useful work leading up to the logical comparison.

10.3.4 Combined Control and Data Speculation

The Itanium architecture also provides a speculative advanced load with ld.sa forms of load instructions, which combine the capabilities of advanced and speculative loading. That is, a load instruction with the .sa completer uses only the ALAT to track success or failure, but any exceptions are deferred (like ld.s and unlike ld.a).

Neither advanced nor speculative loads are entirely free goods. The ALAT entries are few in number and are vulnerable to nullification by circumstances unrelated to the particular load. Moving code around extends the scope where certain registers are needed to hold valuable data, thus possibly restricting the ability of a compiler to optimize register allocation. Finally, code size is increased, both from the check instructions themselves and from the required recovery code.