5.5 Stops, Instruction Groups, and Performance

Using the disas command of gdb, we noticed earlier that the GNU assembler fills out instruction bundles with nop instructions. With all optimizations turned off, compilers and assemblers do not change the sequence of instructions expressed by the programmer.

An instruction group consists of an arbitrarily long sequence of mutually independent Itanium instructions. Consider the consequences of instruction grouping for these implementations of the Itanium architecture:

  1. A hypothetical implementation with an arbitrarily large degree of internal parallelism might carry out all instructions in the group simultaneously, using different execution units.

  2. The Itanium 2 processor, in one clock cycle, can carry out up to six instructions (two bundles of three) simultaneously, using different execution units. When there is a shortage of execution units of the right types to match the instructions in sequence, fewer than six instructions can execute in that clock cycle and the rest will be deferred. When a group contains more than six instructions, more than one clock cycle will elapse before the entire group has executed.

  3. The Ski simulator (see Appendix B) has no choice but to emulate the execution of Itanium instructions strictly in sequential order. This virtual machine has no parallelism and operates much like a traditional scalar architecture executing one instruction per clock cycle.

Overall computational throughput is limited either by the number of instructions per group or by the degree of parallelism of the machine. Nevertheless, each implementation can ensure that every new value placed into a register or memory is reliably correct for the subsequent instruction groups.

5.5.1 Study of Stops and Groups in DOTLOOP

The previous argument suggests that the counted number of stops in a particular segment of a program should impose a lower limit for the relative execution time of that sequence of instructions on all processor implementations, since those stops will always be restrictions.

Before giving any rules for stops, we will demonstrate the use of disassembly as a technique for achieving a better intuitive understanding of the role of the stops between groups of instructions in the DOTLOOP program. When we use the disas command of gdb to examine the instructions from top to done, we get the following result:

 <top>:       [MMI]       ld2 r21=[r14],2;;   // load, bump <top+1>:                 ld2 r22=[r15],2     // load, bump <top+2>:                 nop.i 0x0;; <top+16>:    [MII]       nop.m 0x0 <top+17>:                pmpy2.r r21=r21,r22;; <top+18>:                sxt4 r21=r21;; <top+32>:    [MMI]       add r20=r21,r20;; <top+33>:                adds r17=-1,r17     // decrement <top+34>:                nop.i 0x0;; <top+48>:    [MFB]       cmp.lt p6,p0=r0,r17 // test <top+49>:                nop.f 0x0 <top+50>:          (p06) br.cond.sptk.few <top>;; 

where we have pruned away the explicit hexadecimal addresses for <top> etc. The body of the loop assembled into four instruction bundles. There are seven stops (;;) and thus seven groups of instructions to be executed during each traversal of the loop.

Each traversal will take at least seven clock cycles. Many more cycles would be required if data values were not found in L1 cache. Moreover, any particular implementation of the architecture may not be able to accomplish the transitions between certain pairs of instructions within one clock cycle. Thus, as a figure of merit for this particular algorithmic coding, seven clock cycles is only a lower limit.

Can we see how to do better? Every nop in an instruction slot is a wasted opportunity for some useful instruction. There seem to be four nop instructions. Since this loop has only one entrance and one exit, we should be able to rearrange the instructions internally while maintaining the same algorithm.

We have inserted a few shortened comments as hints. Perhaps you will have realized that putting the decrement of register r17 immediately above the related cmp instruction is algorithmically arbitrary, though natural to do because of the functional relationship.

Successful advancement through the loop, in distinguishing each traversal from the next, actually involves three adjustments: decrementing register r17 for loop control and incrementing registers r14 and r15 to point to the next data elements. We might expect that the decrement, as an instruction of type A (Appendix C), could potentially execute in any slot where we presently see a nonproductive nop.i or nop.m instruction.

This line of thought leads us to create a new version of the DOTLOOP program, with the decrement now positioned just below the two ld2 instructions. The modified program still gives correct results (you should always test for that!). The body of the loop assembles as follows:

 <top>:       [MMI]       ld2 r21=[r14],2;;   // load, bump <top+1>:                 ld2 r22=[r15],2     // load, bump <top+2>:                 nop.i 0x0;; <top+16>:    [MMI]       adds r17=-1,r17;;   // decrement <top+17>:                nop.m 0x0 <top+18>:                pmpy2.r r21=r21,r22;; <top+32>:    [MII]       nop.m 0x0 <top+33>:                sxt4 r21=r21;; <top+34>:                add r20=r21,r20;; <top+48>:    [MFB]       cmp.lt p6,p0=r0,r17 // test <top+49>:                nop.f 0x0 <top+50>:          (p06) br.cond.sptk.few <top>;; 

where we still see seven stops (;;) partitioning the instructions into seven groups. This expression of the loop body should execute in the same amount of time as the original version.

We now remove three explicit stops from the source: the stop after the first ld2 instruction, the stop after adds, and the stop after add. The assembler now produces:

 <top>:       [MMI]       ld2 r21=[r14],2     // load, bump <top+1>:                 ld2 r22=[r15],2     // load, bump <top+2>:                 nop.i 0x0;; <top+16>:    [MII]       adds r17=-1,r17     // decrement <top+17>:                pmpy2.r r21=r21,r22;; <top+18>:                sxt4 r21=r21;; <top+32>:    [MIB]       add r20=r21,r20 <top+33>:                cmp.lt p6,p0=r0,r17 // test <top+34>:          (p06) br.cond.sptk.few <top>;; 

Wow! The assembler has now condensed the body of our loop into three instruction bundles, rather than four, and there are only four stops (;;), partitioning the instructions into four groups. This final expression of the loop body might execute in 4/7 of the time needed for the loop body in the original DOTLOOP. Actually, since the instructions in some groups do take extra cycles, the ratio would be better expressed as (4+)/(7+).

Before reading ahead, you might want to repeat these steps yourself. You may observe some differences using another assembler. Then try to remove any one of the remaining interior stops from the body of this loop. Can you get away with it?

5.5.2 Simplified Rules for Data Dependency

We doubt that you were able to remove any of the remaining three stops from the body of the loop in the DOTLOOP program. Why not? Section 5.3.2 explained that not needing a stop between a compare and a branch is a special Itanium architectural feature. In other terms, we can call this a special case of zero latency between the compare and branch instructions because they are permitted to execute concurrently, thus allowing removal of certain stops.

The compare sets a 0 or 1 value into a predicate register that the branch inspects. The branch has a data dependency on the outcome of the compare. Such a dependency can also be described as a producer consumer relationship involving the compare (producer) and the branch (consumer), with the predicate value as the product involved.

The Itanium architectural specification allows implementations to take more time to resolve data dependencies with the exception of the optimized case of a compare followed by a branch. This optimization lies at the heart of EPIC principles, and the Itanium could not beat without it.

Two instructions that are interrelated as producer and consumer of a value in some register must be placed in different groups. For favorable situations involving simple instructions, the Itanium architectural design expects unit latency (one clock cycle).

For all situations of nonzero latency, a stop must be explicitly present in the template field of an instruction bundle. Many other architectures with superscalar features rely on hardware complexity to detect data dependency and force the consumer instruction to wait for the producer to finish. In an EPIC design, the complexity is relocated from hardware to compiler software. Simplified interdependency among the execution units can permit them to be faster and smaller, leaving room on the microprocessor chip for more execution units, more cache, or some other desired characteristics.

Instructions that load data from memory into CPU registers also are producers relative to instructions that perform subsequent calculations on the data. The time delay, however, is not determinate a priori. The data may emerge quickly from the innermost level of the cache subsystem (Section 4.5.1), but sometimes many more clock cycles could be required. For that reason, the Itanium design only requires that the software ensure the presence of a stop (;;) between a load instruction and any other instruction that depends on the target register of the load instruction. The hardware makes that instruction wait until its source data arrive.

In DOTLOOP, the sign extension of the new 32-bit product cannot begin until that product has been computed. By architectural design, the software must schedule the sxt4 instruction (consumer) to execute later than the pmpy2 instruction (producer). The stop (;;) after the pmpy2 instruction signals that some delay will occur. How much delay?

Latency, measured in clock cycles, can vary from one implementation to another, as explained in Intel Itanium 2 Processor Reference Manual for Software Development and Optimization. Table 5-1 gives some data pertinent to the discussion about DOTLOOP.

The pmpy2 instruction is part of the multimedia subset of instructions, for which the Itanium 2 processor implementation requires a latency of three cycles. Table 4-3 indicates that the latency associated with ld2 instructions is at least one clock cycle (L1-D cache), or five cycles (L2), or 12 cycles (L3), or greater (main memory or even virtual memory). In DOTLOOP, one additional cycle will be associated with loading data for the pmpy2.r instruction.

We can now modify our estimated lower bound on performance of the loop body in the DOTLOOP program before and after modification, using the latency information from Table 5-1, as follows:

                               (Figure 5-1)  (Section 5.5.1) stops (unit latency)                7              4 additional latency for load         1              1 additional latency for multiply     2              2                                    --             --                                    10              7 

The rewritten loop body might run in as little as 70% of the original loop's time.

No program ever performs as well as this sort of ideal lower bound on execution time would imply. Other activity on the system even background activity such as minding the time of day clock or evaluating inbound network signals makes the contents of both the I- and D-cache structures vulnerable to displacement. Still, compact code stands a better chance than bloated code of being retained in I-cache, and data stored near related data may survive longer in D-cache.

Table 5-1. Simplified Summary of Instruction Latency

Between

And

Stop

Latency

Integer instruction, in general

Integer instruction having no source registers in common with the prior instruction's destination registers

No

0

Integer compare instruction (producer)

Predicated branch instruction (consumer)

No

0

Integer instruction other than load, shift, or multimedia (producer)

Integer instruction (consumer)

Yes

1

Integer shift or multimedia (producer)

Integer instruction (consumer)

Yes

3[*]

Integer load instruction (producer)

Integer instruction (consumer)

Yes

>= 1[*] []

[*] For the Itanium 2 processor implementation.

[] Additionally, one more cycle is required between a load and a multimedia instruction (consumer).

5.5.3 How Itanium Assemblers Handle Stops

Itanium assemblers and compilers must produce correctly formed instruction bundles, including no-ops and architecturally required stops. When you write a program in assembly language, you do not have to manage the use of stops entirely by yourself, as the assemblers that we have illustrated offer some assistance, but in different ways.

Both the GNU assembler invoked by the gcc command and the assembler associated with the base-level cc_bundled compiler software for HP-UX warn about a dependency violation where a stop should be inserted. Although an executable file may be produced, do not trust it to produce valid results!

The Intel assembler for Linux invoked by the ecc command operates by default in an automatic mode, inserting essential stops and removing unnecessary stops.

5.5.4 Local Labels for Loops

Long programs can require huge numbers of labels, primarily as branch targets. Since good programming practice emphasizes locality of such transfers of control, a branch target seldom needs to be "visible" at great distances. Moreover, it becomes quite tedious to concoct numerous different but largely synonymous labels (such as top, next, onward, again, loop) that would be unique throughout the whole module.

An assembler may offer a special type of temporary local label. For the GNU assembler, a temporary local label is a small numeric value (1 255), which can be used in a branch instruction by putting the letter f (forward) or the letter b (backward) immediately after the final digit. The assembler looks for the first such label corresponding to the specified number in the forward or backward direction, respectively. When temporary local labels are used, a greater burden falls onto the comment field to explain the algorithm and its logical flow

Not all assemblers support temporary labels, and their use can muddy an otherwise clear algorithm. Thus temporary labels are avoided in this book.

5.5.5 Loops, Branches, and Overall Performance

Every instruction inside a loop exacts a cost equal to the execution time of the instruction multiplied by the average number of iterations through that particular loop. Slicing just one superfluous instruction out of a loop may improve performance more than shortening sequences of instructions that execute only once. When loops are nested, the instructions located within the innermost loop have greater impact than those in outer loops.

Branch instructions typically reduce overall performance. When a branch is taken, the need to fetch from out-of-sequence addresses not only disrupts orderly execution in a high-performance implementation (i.e., a pipelined implementation), but also lowers the likelihood that a copy of the new instruction is already in I-cache. Eliminating branch instructions from a loop can thus have a greater effect than eliminating other instruction types.



ItaniumR Architecture for Programmers. Understanding 64-Bit Processors and EPIC Principles
ItaniumR Architecture for Programmers. Understanding 64-Bit Processors and EPIC Principles
ISBN: N/A
EAN: N/A
Year: 2003
Pages: 223

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net