Separating Itanium instructions into independent groups and bundling three instructions with a template are departures from the programming model of most other computer architectures. Here we expand upon the preliminary discussion given in Chapter 5.

10.3.1 Instruction Templates

The 128-bit Itanium instruction word includes three 41-bit slots for instructions and a 5-bit field for the template that guides those three instructions into particular execution units. With a 5-bit field, up to 32 different templates are possible, but only three-fourths of those (24) have been defined, as shown in Table 10-2.
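The bundle layout just described can be sketched in a few lines of Python. This is an illustrative model, not an assembler: the bit positions (template in the low 5 bits, followed by slots 0, 1, and 2) and the template codes are transcribed from the published bundle format as we recall it and should be checked against Table 10-2. The names `decode_bundle` and `encode_bundle` are hypothetical.

```python
# Assumed field layout: template in bits 0-4, then three 41-bit slots
# (slot 0 in bits 5-45, slot 1 in bits 46-86, slot 2 in bits 87-127).
TEMPLATE_BITS = 5
SLOT_BITS = 41

# Template codes; ";;" marks a stop. Transcribed for illustration only;
# verify against Table 10-2 before relying on any entry.
TEMPLATES = {
    0x00: "MII", 0x01: "MII;;", 0x02: "MI;;I", 0x03: "MI;;I;;",
    0x04: "MLX", 0x05: "MLX;;",
    0x08: "MMI", 0x09: "MMI;;", 0x0A: "M;;MI", 0x0B: "M;;MI;;",
    0x0C: "MFI", 0x0D: "MFI;;", 0x0E: "MMF", 0x0F: "MMF;;",
    0x10: "MIB", 0x11: "MIB;;", 0x12: "MBB", 0x13: "MBB;;",
    0x16: "BBB", 0x17: "BBB;;",
    0x18: "MMB", 0x19: "MMB;;",
    0x1C: "MFB", 0x1D: "MFB;;",
}


def encode_bundle(template, slots):
    """Pack a 5-bit template code and three 41-bit slot values into 128 bits."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    if len(slots) != 3:
        raise ValueError("a bundle holds exactly three slots")
    bundle = template
    for i, slot in enumerate(slots):
        if not 0 <= slot < (1 << SLOT_BITS):
            raise ValueError("each slot must fit in 41 bits")
        bundle |= slot << (TEMPLATE_BITS + SLOT_BITS * i)
    return bundle


def decode_bundle(bundle):
    """Split a 128-bit bundle into (template name, (slot0, slot1, slot2))."""
    if not 0 <= bundle < (1 << 128):
        raise ValueError("bundle must fit in 128 bits")
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slot_mask = (1 << SLOT_BITS) - 1
    slots = tuple(
        (bundle >> (TEMPLATE_BITS + SLOT_BITS * i)) & slot_mask
        for i in range(3)
    )
    return TEMPLATES.get(template, "reserved"), slots
```

The arithmetic in the text falls out directly: 5 + 3 × 41 = 128 bits, and the dictionary holds 24 of the 32 possible codes.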
The programmer, assembler, or compiler has substantial flexibility in placing typical sequences of instructions into bundles, because type A instructions can be set to execute on M- or I-units.

Manually assigned templates

The assemblers recognize a special directive for the explicit assignment of instructions to templates:

    {.mmi                               // use an MMI template for this bundle
        type M or type A instruction
        possible stop                   // if M;;MI or M;;MI;;
        type M or type A instruction
        type I instruction
        possible stop                   // if MMI;; or M;;MI;;
    }

where the .mmi directive can be used for any of four templates (see Table 10-2) with zero, one, or two stops in particular positions.

Assembler-selected templates

In order to demonstrate that the assignment of templates is not fully determined by a programmer's code sequence, we used several assemblers to investigate the segment of the SCANFILE program (Figure 9-5) between the loop and eof labels. In Table 10-3, we compare the results from Intel's ecc for Linux, gcc for Linux, and Hewlett-Packard's cc_bundled for HP-UX.
In the Linux programming environment, the disas command of gdb directly shows the templates (e.g., MII) and the double semicolons that mark stops. In the HP-UX programming environment, we inspected the instruction bundles as hexadecimal values in order to deduce the template codes (Table 10-2). This exercise shows some differences, both in the way bundles are made up and in the way the type A instructions are assigned to templates. All three sets of bundles conform to the rules of the Itanium architecture, but may differ in efficiency of execution on any particular implementation, particularly on the earliest Itanium processor. Since explicit targeting of instructions is rather new to the industry, we should not be surprised by such differences. The software technology of Itanium assemblers and compilers is expected to become more capable as the product line matures.

Relative efficiency of fitting instructions into bundles

Each assembler placed the 16 instructions, along with five nop instructions, into seven bundles, an equivalent packing efficiency of 16/21 (about 76%), which is less than 6/7. This sample code fragment cannot be made any more compact because a nonbranch instruction that follows a branch instruction has to begin a new bundle (see the patterns evident in Table 10-2).

Inspection for split issues

The limits of parallel execution of instructions occur when a stop (;;) is reached, when two full bundles of three instructions apiece are issued, or when there is a shortage of one type of execution unit for a particular Itanium implementation. This last condition is called split issue. The Itanium 2 processor improves upon the initial Itanium implementation in this respect: unlike its predecessor, it can redirect a type A instruction to an otherwise idle M-unit even if the template had indicated the use of an I-unit.
For this reason, the assignment of type A instructions to MII templates by the ecc assembler introduces more instances of split issue for an initial Itanium processor than the strategies of the other assemblers for this particular program segment.

Asymmetry among execution units

Additional situations, including the intrinsic limitations of some execution units, may cause split issue. The MIB, MFB, and MMB templates target unit B2 ordinarily, but target unit B0 for nop.b or brp instructions; moreover, they always cause split issue after the branch in early Itanium processor implementations, except in the case of a nop.b instruction. In addition, an Itanium processor implementation may be able to execute certain instructions only on a specific execution unit. Early Itanium implementations contain only one bit-shifter associated with execution unit I0, and thus some type I instructions cannot be targeted to unit I1. As another illustration of asymmetry, the Itanium 2 processor performs store operations on two of its M-units and load operations on the other two M-units.

Template and nop strategy

This single brief example shows the potential impact of various strategies for introducing nop instructions and choosing templates. Note that there is a form of nop instruction for each type of execution unit. The Hewlett-Packard software appears to emphasize assignment of type A instructions to the M-units (in the absence of any load or store operations here). The gcc software appears to select nop.f instructions (in the absence of any floating-point operations here). The Intel software appears to assign the first type A instruction to an M-unit and then any others preferentially to I-units. This is but one isolated code example. With other code sequences, the various assemblers might show different relative efficiencies.
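The interaction between template choice and split issue can be explored with a deliberately crude Python model. It counts only M- and I-unit demand for the instructions offered in one cycle and ignores stops, branch templates, and the unit asymmetries discussed above. The Itanium 2 unit counts (four M-units, two I-units) and its ability to redirect type A instructions are taken from this section; the counts of two M-units and two I-units for the initial Itanium processor, and the function name `issues_together`, are assumptions made for this sketch.

```python
from collections import Counter

# Unit counts per implementation. The "itanium" row is an assumption of this
# sketch; the "itanium2" row follows the figures quoted in the text.
UNITS = {
    "itanium":  {"M": 2, "I": 2, "redirect_type_a": False},
    "itanium2": {"M": 4, "I": 2, "redirect_type_a": True},
}


def issues_together(slots, cpu):
    """Return True if all slots can issue in one cycle under this toy model.

    slots is a list of (unit_letter, is_type_a) pairs, one per instruction,
    where unit_letter is the unit named by the template ("M" or "I").
    """
    cfg = UNITS[cpu]
    demand = Counter(unit for unit, _ in slots)
    if demand["M"] > cfg["M"]:
        return False                      # shortage of M-units: split issue
    overflow_i = demand["I"] - cfg["I"]
    if overflow_i <= 0:
        return True                       # enough I-units for every I slot
    if not cfg["redirect_type_a"]:
        return False                      # shortage of I-units: split issue
    # Type A instructions in I slots may borrow otherwise idle M-units.
    spare_m = cfg["M"] - demand["M"]
    movable = sum(1 for unit, is_type_a in slots if unit == "I" and is_type_a)
    return overflow_i <= min(spare_m, movable)


# Two MII bundles filled entirely with type A instructions.
MII_ALL_TYPE_A = [("M", True), ("I", True), ("I", True)]
two_bundles = MII_ALL_TYPE_A * 2
```

Under these assumptions, `issues_together(two_bundles, "itanium")` reports a split issue while `issues_together(two_bundles, "itanium2")` does not, which mirrors the remark about MII templates chosen by ecc.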
Diagram of instruction issue according to templates

In order to indicate how the templates direct the issuance of instructions to the execution units, we present in Figure 10-2 the split issue effects that come in the first few cycles with the version of the code fragment from SCANFILE that was produced by the ecc software.

Figure 10-2. Execution of an instruction sequence with split issue

In cycle 1, an Itanium 2 processor can execute the six instructions from the first two bundles because four M-units and two I-units are available, all of which can execute type A instructions. In cycle 2, only the three instructions in the bundle with an MIB template can execute, since that template produces split issue on early Itanium implementations. The next bundle with an MII template must wait until the next cycle. In cycle 3, six instructions from two bundles of this code fragment can again be issued to execute in parallel on an Itanium 2 processor.

10.3.2 Data Dependency and Speculation

We have already discussed data dependency using the producer-consumer terminology. More generally, four cases of potential dependency within an instruction group can be identified that have relevance to the Itanium architecture and its implementations:
The programmer, assembler, or compiler must select templates that insert an explicit stop (;;) for every instance of RAW or WAW dependency involving data from a source register.

The memory hierarchy requires an indeterminate time to deliver data, possibly many tens of machine cycles. Execution pipelines may stall badly if data from a load instruction cannot be obtained quickly. Optimizing compilers for RISC-like systems generally attack this problem by rearranging the programmer's code to begin the load request as early as possible without altering the programmer's intended logic.

The Itanium EPIC architecture provides hardware support for another powerful technique called data speculation. An Itanium CPU includes an internal structure called the advanced load address table (ALAT). Entries in the ALAT are tagged with a register identifier and a memory address. Every Itanium store operation queries the ALAT and must invalidate all entries that overlap with addresses of any bytes in the memory hierarchy that will be affected by this store operation. Speculations associated with invalidated ALAT entries will subsequently fail.

An Itanium compiler may gamble that any neighboring store operations do not overlap the storage region from which the moved load instruction obtains data. A recovery routine must be present to handle the contingency where the speculation has failed. The code consists of an advanced load, any number of instructions speculatively executed using the value from that load, a check instruction, and appropriate recovery. There are simple and more complex cases.
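The ALAT bookkeeping described above can be mimicked with a small Python class. This is a behavioral sketch only, not a description of the hardware: the class and method names are hypothetical, the oldest-first replacement policy is an assumption, and the default capacity of 32 simply matches the Itanium 2 figure quoted later in this section.

```python
from dataclasses import dataclass


@dataclass
class AlatEntry:
    addr: int   # first byte covered by the advanced load
    size: int   # number of bytes loaded


class Alat:
    """Toy advanced load address table keyed by register name."""

    def __init__(self, capacity=32):
        self.capacity = capacity
        self.entries = {}   # register -> AlatEntry, kept in insertion order

    def advanced_load(self, reg, addr, size):
        """Record an advanced load, displacing the oldest entry if full."""
        self.entries.pop(reg, None)
        if len(self.entries) >= self.capacity:
            oldest = next(iter(self.entries))   # assumed replacement policy
            del self.entries[oldest]
        self.entries[reg] = AlatEntry(addr, size)

    def store(self, addr, size):
        """Invalidate every entry whose bytes overlap the stored bytes."""
        hit = [
            reg for reg, e in self.entries.items()
            if addr < e.addr + e.size and e.addr < addr + size
        ]
        for reg in hit:
            del self.entries[reg]

    def check(self, reg, clear=True):
        """Return True if the speculation for reg is still valid."""
        valid = reg in self.entries
        if valid and clear:
            del self.entries[reg]
        return valid
```

A speculation survives a store to unrelated bytes, fails after a store that overlaps the loaded bytes, and can also fail merely because a later advanced load displaced its entry.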
When only the load instruction is speculative

The simplest case of speculative code rearrangement involves an advanced load, and a check load that functions as the contingent recovery:

    no data speculation          using data speculation

                                 ld8.a      r20 = [r15];;   // advanced
    <sequence A>                 <sequence A>
    st8   [r14] = r24            st8        [r14] = r24
    ld8   r20 = [r15];;          ld8.c.clr  r20 = [r15];;   // check
    add   r20 = 1,r20            add        r20 = 1,r20
    <sequence B>                 <sequence B>
    st8   [r16] = r20            st8        [r16] = r20

The compiler may not know whether registers r14 and r15 will dynamically contain pointers to overlapping or nonoverlapping storage; hence moving the load instruction is a speculative decision instead of a guaranteed situation.

The advanced load (.a as the load type completer) inserts the address in register r15 as an ALAT entry. As with other cache-like structures of limited capacity, this operation might displace some other ALAT entry that is still relevant; if that happens, some other speculation will fail.

The check load (.c.clr as the load type completer) searches the ALAT for the r20 register identifier. On a hit, the ALAT entry is cleared and no new load operation is needed because register r20 already contains valid data. On a miss, the check load becomes a regular load operation that refreshes the value in register r20. The .c.nc completer may be substituted if the ALAT entry should be retained.

The speculation will succeed if no store conflict has arisen and few enough other speculative requests have been made that the ALAT still holds the pertinent entry. The effect is to hide some or all of the load latency by permitting other useful instructions (sequence A) to execute while the memory hierarchy takes time to respond to the load request.

When more instructions than the load are speculative

A compiler could treat the above code fragment more aggressively. Suppose that sequence A is too brief to hide the anticipated load latency, but the potentially interfering store would be of low probability.
The compiler might then proceed directly to the addition and sequence B instructions that depend on the value loaded into register r20:

    no data speculation          using aggressive data speculation

                                 ld8.a      r20 = [r15];;   // advanced
    <sequence A>                 <sequence A>
                                 add        r20 = 1,r20
                                 <sequence B>
    st8   [r14] = r24            st8        [r14] = r24
    ld8   r20 = [r15];;          chk.a.clr  r20,recover     // check
    add   r20 = 1,r20
    <sequence B>           back:
    st8   [r16] = r20            st8        [r16] = r20
                                 <continue main sequence>

                        recover:                            // at some other address
                                 ld8        r20 = [r15];;   // reload
                                 add        r20 = 1,r20
                                 <sequence B>
                                 br         back

where the recovery code must then repeat all of the instructions (add and sequence B) that had been speculatively executed using incorrect data in register r20. The recovery code may repeat the invalidated load operation, either with a regular load instruction as shown or with an advanced load if there is some reason for reallocating an ALAT entry.

The Itanium chk.a instruction can interrogate the ALAT and then conditionally branch to recovery code. As with the check form of a load instruction, the chk.a instruction requires the further completer .clr or .nc to control its effect on the ALAT entry. The branch range of the chk.a instruction is the same as that of other IP-relative branch instructions (Section 5.3.5), but it executes on an M-unit, not a B-unit.

This aggressive strategy produces larger code segments. The compiler should arrange the out-of-sequence recovery routine to favor the most probable dynamic circumstances (e.g., all but the first or last time through a loop). Effective use of the ALAT and data speculation represents a challenging opportunity for improved heuristics in Itanium compilers.

Releasing ALAT entries

Because of the small capacity of the ALAT (32 entries for the Itanium 2 processor), programs should invalidate any entries that are no longer useful.
The .clr completer on ld.c and chk.a instructions provides a simple way to do this, but the Itanium ISA also includes a standalone instruction:

    invala                  // invalidate all ALAT entries
    invala.e   r1           // only entry for integer register
    invala.e   f1           // only entry for floating-point register

where the selective register forms do nothing if there is no matching ALAT entry.

10.3.3 Control Dependency and Speculation

Control dependency refers to situations where speculative execution is contingent on the logical flow of a program containing conditional branches or explicit predication. The potential snag is that exceptions should not be raised or reported if instruction(s) speculatively executed would not, in fact, have been encountered in normal nonspeculative program flow.

The 65th bit (NaT) associated with every Itanium general register (Appendix D.2) or the special NaTVal in a floating-point register (Section 8.2.2) holds the exception in abeyance. These special indicators can propagate through sequences of ALU operations until they are tested. Compare instructions generally set both predicates false when one or more NaT bits are detected for source registers in order to prevent predicated pathways from executing inappropriately.

The Itanium tnat instruction can be used to test the NaT bit associated with a particular general register:

    tnat.trel.ctype   pt,pf = r3

where trel (the test relationship completer) can be z (zero) or nz (nonzero), as for the test bit instruction (Section 6.1.4), and where ctype (the compare type completer) can be any of those for compare instructions (Sections 5.2.1 and 6.1.5). The fclass instruction (Section 8.6.2) provides a similar capability to test for NaTVal in a floating-point register and set a predicate pair accordingly.

Note that spill and fill variants of store and load instructions are provided for the purpose of moving potentially invalid data between memory and processor registers without raising exceptions.
Conversions of data between the general and floating-point registers also pass along the token of invalidity in the appropriate form for the destination.

An example of control speculation

Suppose that a load with potentially long latency depends on falling through a predicated branch. If the compiler believes that falling through is indeed the most likely situation, then it may opt to produce a speculated code sequence in order to do useful work while the memory hierarchy responds to the load request:

    no speculation               using control speculation

                                 ld8.s   r20 = [r15];;      // speculate
    <sequence A>                 <sequence A>
    (px) br.cond notdo           (px) br.cond notdo
    ld8   r20 = [r15];;          chk.s   r20,recover        // check
                           back:
    add   r20 = 1,r20            add     r20 = 1,r20
    <sequence B>                 <sequence B>
    notdo:                notdo:

                        recover:                            // at some other address
                                 ld8     r20 = [r15]        // reload
                                 br      back;;

where the recovery code repeats the failed load instruction.

The Itanium architecture provides the chk.s instruction to branch conditionally if the NaT bit associated with a particular register is set. The branch range of the chk.s instruction is the same as that of other IP-relative branch instructions (Section 5.3.5), but it is actually a pseudo-op for specific chk.s.m and chk.s.i instructions that can be directed by a bundle template to execute on an M- or I-unit.

Another example of control speculation

Consider the schematic situation where two potentially time-consuming load operations and related calculations depend on a simple logical alternative:

    <code that determines a, b>
    if a < b then
        load 1, etc.
    else
        load 2, etc.
    endif

This situation is amenable to a programming strategy using control speculation, along the following lines:

    ld8.s 1
    ld8.s 2
    <code that determines a, b>
    if a < b then
        chk.s 1, etc.
    else
        chk.s 2, etc.
    endif

In this way, the latency of the two load instructions can overlap with the useful work leading up to the logical comparison.
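The deferral of exceptions through the NaT bit can be imitated in Python. The sketch below is a loose analogy rather than a simulation: a missing dictionary key stands in for any faulting memory access, KeyError stands in for the exception that a nonspeculative load would raise, and every name (`Reg`, `ld_s`, `chk_s`, `speculated`) is hypothetical. The function `speculated` follows the shape of the first control speculation listing above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reg:
    value: int = 0
    nat: bool = False   # models the 65th (NaT) bit of a general register


def ld(memory, addr):
    """Nonspeculative load: a bad address raises immediately (KeyError)."""
    return Reg(memory[addr])


def ld_s(memory, addr):
    """Speculative load: a bad address sets NaT instead of raising."""
    if addr not in memory:
        return Reg(0, True)
    return Reg(memory[addr])


def add(a, b):
    """ALU operation: NaT on either source propagates to the result."""
    if a.nat or b.nat:
        return Reg(0, True)
    return Reg(a.value + b.value)


def chk_s(reg):
    """Return True when control should branch to the recovery code."""
    return reg.nat


def speculated(memory, addr, px):
    r20 = ld_s(memory, addr)        # ld8.s hoisted above the branch
    if px:                          # (px) br.cond notdo
        return None                 # the load result is simply never used
    if chk_s(r20):                  # chk.s r20,recover
        r20 = ld(memory, addr)      # recovery: repeat the load for real
    r20 = add(r20, Reg(1))          # add r20 = 1,r20
    return r20.value
```

When the branch is taken, a bad address causes no exception at all; when control falls through, the recovery load raises the exception at the point where the nonspeculative program would have raised it.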
10.3.4 Combined Control and Data Speculation

The Itanium architecture also provides a speculative advanced load with ld.sa forms of load instructions, which combine the capabilities of advanced and speculative loading. That is, a load instruction with the .sa completer uses only the ALAT to track success or failure, but any exceptions are deferred (like ld.s and unlike ld.a).

Neither advanced nor speculative loads are entirely free goods. The ALAT entries are few in number and are vulnerable to nullification by circumstances unrelated to the particular load. Moving code around extends the scope where certain registers are needed to hold valuable data, thus possibly restricting the ability of a compiler to optimize register allocation. Finally, code size is increased, both from the check instructions themselves and from the required recovery code.