10.6 Program Optimization Factors

If we were writing this book some decades ago, we would be drawing your attention to such factors as instruction size, instruction complexity or power, instruction timing, and addressing modes. Those factors influence overall efficiency, especially in machines with CISC architectures, where diversified instruction sets offer many ways to perform given tasks.

Such factors become somewhat subtler for machines with RISC-like architectures. We now consider whether certain software aspects matter very much today, especially for the Itanium architecture.

10.6.1 Instruction Size

CISC architectures (e.g., IA-32 or VAX) offered several alternative ways to accomplish many common tasks in a program. The instruction occupying the fewest bytes typically performed a given task most efficiently, because the costs of fetching and decoding an instruction grew with its length.

The Itanium and RISC-like architectures offer fewer alternative ways to accomplish any given task in a program, and instructions are of uniform size. With proper alignment, one instruction can never spill across two pages of virtual memory and cause a page fault, which the memory subsystem and/or operating system would need to resolve.

10.6.2 Addressing Mode

CISC architectures are characterized (or sometimes even caricatured) by their offering numerous addressing modes, some of which we outlined in Chapter 4. While those modes may differ in the number of instruction bytes needed to express them, the number of time-sequential memory references affects performance more significantly than instruction size in this instance. Register direct addressing is fastest, register indirect addressing is slower, and any further stages of indirection are slower still.

The Itanium architecture, following RISC principles, offers only immediate, register direct, and register indirect addressing. A program should keep its most frequently needed quantities, whether data or addresses, in processor registers rather than in memory. The performance difference in favor of the register is only lessened somewhat, not eliminated, if a cache location holds a copy of the value from the information unit in memory. For example, even with a 90% cache hit ratio and a cache fivefold faster than main memory, the effective memory access time is still 0.3 tmain (where tmain is the access time of main memory).

10.6.3 Instruction Power

CISC architectures were often thought to derive power both from their large number of distinct instruction types and from the amount of data manipulation that a single instruction could perform (i.e., the power of the instruction). For example, three operands in the MOVC3 instruction of the VAX architecture specified the source address of string data, the byte length, and a destination address where the string should be copied. This single instruction was equivalent to several simpler VAX instructions in a tight loop.

RISC-like architectures have fewer types of instructions, but the instructions that remain are not necessarily constrained to have reduced capabilities. The intricacies of architectural design for a processor will inevitably endow some instructions with greater relative power. Examples in the Itanium architecture include postmodification of the pointer register with memory access instructions (Sections 4.5.2 and 4.5.3) and the shladd instruction (Section 4.2.3). Exploiting versatile instructions can have remarkable overall leverage when a program is being "ported" to a new architecture.

10.6.4 Program Size

The evolution of microelectronics has long since reduced concerns about mere program size, because the cost of memory storage has decreased faster than the bulkiness of operating systems and application programs has increased. Much of the bulk consists of graphic elements, data structures, handlers for error contingencies or rare cases, and the like.

Computer scientists still rigorously analyze the design of algorithms in order to determine the irreducible number of unit operations required in relation to the size or dimensionality of a problem. If a problem is dimensionally 10 by 10, a compiler may best replicate one instruction sequence 10 times even though the total program size increases because of loop unrolling. Such unrolling is not guaranteed to yield improvement, however, because the lengthier unrolled instruction sequence may not benefit as much from cache retention as could the shorter sequence in the body of a loop.

Loop unrolling is generally beneficial on RISC-like systems because the cost of a branch instruction is relatively high. Although compilers for the Itanium architecture offer loop unrolling as one of several possible types of optimization, register rotation (Section 10.4) is often better.

10.6.5 Prefetching Lines into Cache

Compilers can sometimes enhance performance of a program by prefetching data into cache. In traditional architectures, this may only be possible using regular load instructions that specify a target register, which thus needs to be free for this purpose. The Itanium architecture provides many ways to accomplish prefetching, not only through regular, advanced, and speculative loads but also through a purpose-built line prefetch instruction that has the following forms:

 lfetch.lftype.lfhint       [r3]        // mem[r3]
 lfetch.lftype.lfhint       [r3],r2     // mem[r3]
                                        // r3 <- r3 + r2
 lfetch.lftype.lfhint       [r3],imm9   // mem[r3]
                                        // r3 <- r3 + sext(imm9)
 lfetch.lftype.excl.lfhint  [r3]        // mem[r3]
 lfetch.lftype.excl.lfhint  [r3],r2     // mem[r3]
                                        // r3 <- r3 + r2
 lfetch.lftype.excl.lfhint  [r3],imm9   // mem[r3]
                                        // r3 <- r3 + sext(imm9)

where a line containing the address in register r3 is brought into the cache structures, and then register r3 is optionally postincremented by the value in register r2 or by the sign-extended value specified by the immediate constant in the instruction.

The exclusive forms of the line prefetch instruction, containing the excl completer, allow the resulting cache line to be marked in an exclusive state. This is appropriate when the program expects rather soon to modify an information unit located in that line.

There are two choices for the line prefetch type lftype. Omitting the completer altogether specifies that all faults be ignored, while the value fault raises the faults normally associated with a regular load operation.

There are four choices for the line prefetch hint lfhint: none at all, nt1, nt2, and nta. Omitting the completer signifies that the program expects the prefetched line to have temporal locality in the L1 cache. At the other extreme, nta provides the hint that the program considers the value prefetched to have nontemporal locality at all levels of cache. Between those extremes, nt1 or nt2 provides the hint that the program considers the value loaded to have nontemporal locality at level 1 or 2 of cache, respectively.

For instance, a compiler might use lfetch.fault.nt2 for a floating-point source and lfetch.fault.excl.nt1 for a floating-point destination or an integer store.

10.6.6 Use of Inline Functions

Some compilers have the capability to insert the body of certain routines inline where otherwise a function call would be positioned. We saw in Chapter 7 that procedure calls involve several sources of overhead: the br.call and br.ret instructions, the alloc instruction, the need to save and restore registers, and the need to handle arguments very precisely. Inline expansion of functions can reduce such overhead, with the chief cost being larger program size.

10.6.7 Instruction Reordering

Another software-based optimization is instruction reordering, optionally performed by Itanium compilers but not by Itanium assemblers. Within an instruction group, instructions are mutually independent and can be reordered freely to achieve compact and efficient instruction bundles that contain fewer no-ops and use available execution units productively. Speculative and advanced load instructions are also specific illustrations of powerful kinds of instruction reordering for the Itanium architecture.

10.6.8 Recursion and Related Factors

We turn next to recursion, a very standard topic in computer science. Recursion requires some of the overhead associated with procedure and function calls. A nonrecursive algorithm, if one can be devised to solve a problem equivalently, may run appreciably faster. The nonrecursive program implementation may or may not seem inferior in terms of design, adaptability, and maintainability. For transparency, we choose two simple examples of applications from finite number theory: the Fibonacci numbers (in the next section) and the factorial function (in an exercise at the end of this chapter).

Local variables in recursion

Recursion requires the preservation of essential context at every level of calling. Physically distinct storage must be provided anew at each level, most practically on a stack, to hold many kinds of items associated with procedure calling, as well as to allocate space for local variables. The Itanium stacked registers provide conveniently for such needs.

Each level of recursion has its own call frame. Any variables whose values may change at each level of recursion must be copied at every level where their current values must be remembered. (Why?) The general structure of a recursive routine is

 Entry point
 Early portion          // Using original parameters from caller
 Recursive call point
 Later portion          // May still need original parameters
 Exit point

Each recursive invocation of the routine has similar needs. Anything that may still be needed below the recursive call point must be preserved as a local variable on a stack or stored in a register that would be guaranteed to be saved and restored at the next level of recursion.



Itanium® Architecture for Programmers: Understanding 64-Bit Processors and EPIC Principles
Year: 2003