11.1 Compilers for RISC-like Systems

The very earliest computers, such as the von Neumann machine, bore some resemblance to what are now called RISC architectures. The von Neumann machine even fetched two instructions at once, though it did not have sufficient internal parallelism for true dual issue. As computing evolved, however, most systems developed into what came to be called CISC systems. The IBM 801 minicomputer (dating from 1975) is generally credited with being the first actual RISC architecture, although some earlier machines, such as the CDC 6600, had RISC-like features. The main examples of RISC systems began to emerge from the research community after 1980, driven in part by difficulties in compiler design posed by the sheer complexity of instruction sets and the challenge of keeping pipelines for ever-faster CISC implementations from having too many bubbles.

Compiler technology had advanced in its theoretical foundations and its commercial software engineering by the time the first RISC systems were designed. It is as though the software and hardware realms were synchronously poised to advance to a new sort of computer system. A case might be made that RISC ventures could have failed, absent improvements in compilers that made their pipelines perform adequately in spite of the timing imbalances between slower load/store instructions and faster register-to-register instructions. Otherwise, it has been argued, the relatively greater "power" of CISC instructions combined with some pipelining possibilities would have continued to hold sway, since more of the simpler RISC instructions are needed to solve comparable application problems. The longevity of the IA-32 architecture seems to reinforce this latter observation.

MIPS, in particular, became known as much for its compiler technology as for its processor architectures. MIPS pursued an approach to compiler systems involving language-specific "front ends" that convert programs into one common intermediate encoded form. A common "back end" then analyzes and optimizes the intermediate expression of the program and generates actual machine instructions. A compiler system composed of such front and back ends can be modified readily as languages evolve, when another language must be supported, or when new hardware implementations require different optimizations.

Digital Equipment Corporation developed the well-respected GEM compiler technology at a time when its line of VAX systems (CISC) was complemented by a line of MIPS-based systems (RISC), before the Alpha architecture (RISC) arrived. This GEM technology made it possible to offer compatible language compilers for both VAX and Alpha systems, thus facilitating a migration of customer applications from 32- to 64-bit systems, especially those for the OpenVMS programming environment.

Compilers usually provide control over the types of optimizations that they can perform. Those optimizations may include not only generally applicable techniques, such as unrolling loops, but also the deliberate use of special instructions that are implemented in hardware on some systems or in software in others. The dilemma in the case of commercial software is whether to distribute a "one size fits all" version, many versions, or one version optimized for a particular implementation (i.e., model) of a computer system.

Quite obviously, our previous consideration of the Itanium architecture suggests that it presents new challenges for compiler writers. Research at Hewlett-Packard Laboratories and elsewhere has expanded the established base of compiler theory in new directions to meet the requirements of VLIW and EPIC designs. Going into any detail on that theory would stretch well beyond the intended scope of our book. We will instead outline an investigative approach with which you can probe what available compilers actually do.

11.1.1 Optimization Levels for Open-Source Compilers

By custom, compilers for Linux and for versions of Unix have used the command-line option -Ox to provide compile-time control over the optimizations that the compiler would perform, where the digit x denotes the desired level of optimization. Table 11-1 lists the principal options for the Linux-based Itanium gcc (C/C++) and g77 (FORTRAN) compilers.

Using and Porting the GNU Compiler Collection (GCC) gives information about the machine-independent (-f) and machine-dependent (-m) flag options, which generally apply to both gcc (for C/C++) and g77 (for FORTRAN). The machine-independent optimizations permit selection of many types of optimizations on an individual basis.

The machine-dependent optimizations permit tailoring the compiled program for a specific implementation of the architecture, or selecting from among alternative strategies. For example, since the Itanium architecture does not provide an integer divide instruction, the compiler offers two choices of inline software routine:

 -minline-divide-min-latency             // minimum latency
 -minline-divide-max-throughput          // maximum throughput

which have been optimized either for minimum latency (often best for a single occurrence of division) or for maximum throughput (often best within loop structures).

Table 11-1. Command-Line Options for Optimization for gcc and g77 (Linux)

Option       In brief              In more detail
-O0          Do not optimize       This is the default for gcc and g77.
none         (same as -O0)         Operates in its fastest mode. Debugging should produce the expected results.[*]
-O, -O1      Optimize some         Reduces code size and execution time. Avoids doing difficult optimizations.
-O2          Optimize more         Performs all optional optimizations except for loop unrolling, function inlining, and register renaming.
-O3          Optimize the most     -O2 plus loop unrolling and function inlining.
-Os          Optimize for size     Enables those -O2 optimizations not expected to increase code size; also reduces code size with further optimizations.
-fflag       Do something          Asserts a machine-independent flag.
-fno-flag    Don't do something    Deasserts a machine-independent flag.
-moption     Machine option        Specifies a machine-dependent flag.

[*] For higher levels of optimization, it may be difficult to relate what the debugger shows to the source code.

11.1.2 Optimization Levels for Intel Compilers

The Intel ecc (C/C++) and efc (FORTRAN) compilers for Itanium processors in a Linux environment share characteristics and elements of documentation with Intel's software for IA-32 processors in both Linux and Windows environments. Table 11-2 lists the principal command-line options for these compilers on Linux for Itanium systems.

The Intel compilers can perform interprocedural optimization on individual files (using the -ip option), multiple files (-ipo), or the whole program (-wp_ipo). Further capabilities include profile-guided optimization and multithreading (see Intel manuals).

Table 11-2. Command-Line Options for Optimization for ecc and efc (Linux)

Option          In brief                          In more detail
-O0             Do not optimize
none            Optimize more                     Level -O2 is on by default.
-O1             Optimize less for minimum size    Performs optimizations generally the same as -O2, but disables inline expansion of library functions and software pipelining of loops.
-O, -O2         Optimize more                     Optimizes at the level of individual procedures; may expand library functions inline; may unroll loops.
-O3             Optimize maximally                Enables high-level optimization for speed, which may include prefetching, scalar replacement, and loop transformations. This option may result in longer compilation times.
-nolib_inline   Produce compact but slower code   Disables inline expansion of library functions.

11.1.3 Optimization Levels for HP-UX Compilers

The Hewlett-Packard aCC (C/C++), cc (C), and f90 (FORTRAN) compilers for Itanium processors in an HP-UX programming environment share characteristics and elements of documentation in their approaches to optimization. Table 11-3 lists the principal options for these compilers at the command line. (The cc_bundled compiler distributed with some HP-UX systems does not offer control over optimizations.)

Table 11-3. Command-Line Options for Optimization for aCC, cc, and f90 (HP-UX)

Option        In brief                  In more detail
-O0, +O0      Do not optimize           Fastest compile time.
none          (same as +O1)
+O1           Optimize some (default)   Performs branch optimization, dead code elimination, faster register allocation, instruction scheduling, and peephole optimization.
-O, +O2       Optimize more             Performs optimizations over entire functions in a single file, including software pipelining, efficient expression evaluation, and much more.
+O3           Optimize yet more         Performs full optimization across all subprograms within a single file; may put small subprograms inline; may hinder debugging.
+O4           Optimize globally         Performs full optimization across the entire application program (at link time, not at compile time); this option requires a large amount of virtual memory.
+Ofast        Optimize with risk        Performs optimizations for better performance (in conjunction with +O2, +O3, or +O4) that may change the behavior of the program.
+Ofaster      Optimize maximally        Performs all feasible optimizations for best performance (in conjunction with +O2, +O3, or +O4) that may change the behavior of the program.
+O[no]limit   Take lots of time         Optimizes using [un]restricted compile time.
+O[no]size    Make compact code         Enables [disables] code-expanding optimizations.

The Hewlett-Packard compilers also offer profile-guided optimization, whereby a large application can be iteratively tuned through feedback of its own runtime behavior into a subsequent recompilation. That topic lies beyond our intended scope of investigation (see Hewlett-Packard publications).

11.1.4 Additional Optimization Possibilities

The compilers for HP-UX offer many additional options that give fine-grained control over individual optimization techniques (see the respective manuals). These compilers have long traditions of development to support the particular needs of different computer architectures. The current Hewlett-Packard compilers permit source programs to be compiled for either PA-RISC or Itanium systems.

Similarly, the gcc development suite has evolved to support porting of the Linux operating system to a very wide range of computer architectures. With each such porting effort, contributors have introduced further optimization techniques into the common code base, generally with an option to enable or disable each feature when necessary. Over time, sensible groupings of those individual techniques have emerged under the hierarchical rubric of the various -Ox levels.

The Intel compilers have also accumulated numerous special options that permit tuning of application code for different implementations of the IA-32 architecture (e.g., absence or presence of MMX instructions), and now for the Itanium architecture.

Our purpose in this chapter is to broaden your perspective of compilers by showing the effects of major levels of optimization. Attention to every possible form of optimization lies beyond the scope we have set for this book.



Itanium® Architecture for Programmers: Understanding 64-Bit Processors and EPIC Principles. 2003.
