Compilers have a well-deserved reputation for presenting their users with a bewildering array of options, none of which make particular sense. The truth is that very few of these options are really necessary to tune, but picking correct compiler settings can make a huge difference in application performance. Furthermore, you don't need to completely understand how the compiler works in order to properly optimize your applications. [9]
[9] However, if you are interested in learning about how, exactly, a compiler works, read Compilers: Principles, Techniques, and Tools by A. Aho, R. Sethi, and J. Ullman (Addison Wesley). This is commonly known as "the dragon book" and is one of the best textbooks on the theory behind compiler design.
Two compiler suites are in widespread use in the Sun and Linux worlds. The first is the Forte Developer suite, which is sold by Sun as the follow-on to the now-obsolete Sun WorkShop compilers. More information, including a trial download, is available at http://www.sun.com/forte/. The second is the free GNU compiler, gcc. The chief advantages of gcc are that it is free and has been ported to a wide variety of platforms; it is almost always available, even for esoteric systems. In my opinion, the Forte Developer compilers generate significantly better code, and are much better documented and easier to work with. We will focus here primarily on options for the Forte compilers. No matter what, keep in mind that compilers are constantly being optimized and refined. It pays to have the most recent version of the compilers, as it's not unusual to see speedups in the 10 to 15% range with each new major compiler release.
There are two caveats about compiler flags. The first is that order is important: later flags override earlier ones. This is particularly a problem with -fast; specifying -fast -xO4 with the Forte Developer 6 compilers results in a lower optimization level than -fast alone, because the later -xO4 overrides the level 5 optimization that -fast selects. The second is more of a heads-up: not every compiler flag is preceded with -x.
The most important compiler option to know for Sun systems is -fast. This macro expands to different things depending on what language you are compiling and what version of the compiler you're using. For a summary, see Table 8-4.
Compiler | C | Fortran 77 | Fortran 90 |
---|---|---|---|
Sun WorkShop 5.0 | -xO4 -single -fns -fsimple=1 -ftrap=%none -libmil -native | -xO4 -dalign -depend -fns -fsimple=1 -ftrap=%none -libmil -native | -xO3 -dalign -fns -ftrap=common -f -native -xlibmopt |
Forte Developer 6 | -O5 -single -xmemalign=8s -fns -fsimple=2 -ftrap=%none -xlibmil -native -xprefetch=no -xvector=no | -O5 -dalign -depend -xpad=local -fns -fsimple=2 -ftrap=%none -xlibmil -native -xlibmopt -xvector=yes | -O5 -dalign -depend -xpad=local -fns -fsimple=2 -ftrap=common -f -xlibmil -native -xlibmopt -xvector=yes |
Forte Developer 6 Update 1 | -O5 -single -xmemalign=8s -fns -fsimple=2 -ftrap=%none -xalias_level=basic -xbuiltin=%all -xlibmil -native -xprefetch=no -xvector=no | -O5 -dalign -depend -xprefetch -xpad=local -fns -fsimple=2 -ftrap=%none -xlibmil -native -xlibmopt -xvector=yes | -O5 -dalign -depend -xprefetch -xpad=local -fns -fsimple=2 -ftrap=common -f -xlibmil -native -xlibmopt -xvector=yes |
For the most part, we won't discuss what these flags actually mean. Suffice it to say that they represent a good mix of basic optimizations.
The most commonly tuned optimization flag is the global optimization control -xOn, where n specifies the optimization level (from 1 through 5, inclusive). The default optimization level is -xO2 for C programs and -xO3 for Fortran programs in the Forte Developer 6 compilers. Table 8-5 is a summary of what each optimization level does.
Level | Description |
---|---|
-xO1 | Basic local optimizations only. |
-xO2 | Default for C. Level 1 plus global optimizations: algebraic simplification, subexpression elimination, optimized register allocation, dead code elimination, constant propagation, tail-call elimination. [10] |
-xO3 | Default for Fortran. Level 2 plus loop optimizations (unrolling and fusion). Software pipelining. |
-xO4 | Level 3 plus function inlining and more aggressive global optimization. |
-xO5 | The highest level of optimization. It is most likely to perform well when combined with profile feedback (see Section 8.4.11 later in this chapter). |
[10] Tail calls are not eliminated until -xO4 if -g is specified, to ease debugging.
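To make the tail-call elimination mentioned in Table 8-5 concrete, here is a minimal sketch. The function names are hypothetical, and the loop version only approximates what an optimizer might produce; the actual generated code depends on the compiler and its flags.

```c
/* Tail-recursive sum of 1..n. The recursive call is the very last thing
 * the function does, so a compiler that eliminates tail calls can reuse
 * the current stack frame instead of pushing a new one. */
unsigned long sum_to_tail(unsigned long n, unsigned long acc)
{
    if (n == 0)
        return acc;
    return sum_to_tail(n - 1, acc + n);   /* tail call */
}

/* Roughly the shape of the code after tail-call elimination:
 * the call has become a jump back to the top of a loop. */
unsigned long sum_to_loop(unsigned long n)
{
    unsigned long acc = 0;

    while (n != 0) {
        acc += n;
        n--;
    }
    return acc;
}
```

Both functions compute the same value; the difference is that the first may consume one stack frame per call if the optimization is not applied, which is also why a debugger shows a deeper call stack in that case.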
gcc, the free GNU compiler, performs no optimization by default. It has three levels of optimization, which provide varying returns summarized in Table 8-6.
Level | Description |
---|---|
-O1 | Basic local optimization only. |
-O2 | Any optimization that doesn't involve a size-for-speed tradeoff; notably no loop unrolling or inlining. (Enabling loop unrolling is done via -funroll-loops.) |
-O3 | Level 2 optimizations plus function inlining. |
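Loop unrolling is the classic size-for-speed tradeoff referred to in Table 8-6. The sketch below unrolls a loop by hand to show the idea; the function names and the unroll factor of four are illustrative assumptions, not what any particular compiler will choose.

```c
#include <stddef.h>

/* Straightforward loop: one addition and one loop test per element. */
long sum_simple(const int *a, size_t n)
{
    long total = 0;
    size_t i;

    for (i = 0; i < n; i++)
        total += a[i];
    return total;
}

/* The same loop unrolled by four: more code, but only one loop test
 * and one branch for every four additions. */
long sum_unrolled4(const int *a, size_t n)
{
    long total = 0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        total += a[i];
        total += a[i + 1];
        total += a[i + 2];
        total += a[i + 3];
    }
    for (; i < n; i++)      /* leftover elements when n is not a multiple of 4 */
        total += a[i];
    return total;
}
```

The unrolled version is larger, which is why the optimization is kept out of the levels that avoid trading code size for speed.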
The -xarch flag specifies what instruction set architecture the program should be built with. For SPARC platforms, there are nine choices, as summarized in Table 8-7.
Value | Instruction set | Bitness | Restrictions |
---|---|---|---|
v7 | SPARC V7 (no fsmuld, integer mul, or integer div instructions) | 32-bit | Any SPARC machine |
v8a | SPARC V8 (no fsmuld instruction) | 32-bit | Any microSPARC-I or later machine |
v8 | SPARC V8 | 32-bit | Any SuperSPARC or later machine |
v8plus | SPARC V9 (no VIS) | 32-bit | Any UltraSPARC machine |
v8plusa | SPARC V9 | 32-bit | Any UltraSPARC machine |
v8plusb | SPARC V9 (with UltraSPARC-III extensions) | 32-bit | UltraSPARC-III machines |
v9 | SPARC V9 (no VIS) | 64-bit | Any UltraSPARC machine running in 64-bit mode |
v9a | SPARC V9 | 64-bit | Any UltraSPARC machine running in 64-bit mode |
v9b | SPARC V9 (with UltraSPARC-III extensions) | 64-bit | UltraSPARC-III machines running in 64-bit mode |
In general, the best performance of a 32-bit application on UltraSPARC-class processors is obtained with -xarch=v8plusa (or v8plusb on UltraSPARC-III systems). Specifying an early architecture here can significantly impede performance of your applications; pick the latest architecture possible while still supporting all your systems. It may be worthwhile to compile multiple versions of a program that particularly benefits from an optimization only available on certain platforms.
If you are compiling applications with -fast on an UltraSPARC platform and either don't specify an -xarch value or specify -xarch=native, the compiler assumes you want to use v8plusa or v8plusb. This generates code that won't run on pre-UltraSPARC machines, and the compiler prints a warning message:
cc: Warning: -xarch=native has been explicitly specified, or implicitly specified by a macro option, -xarch=native on this architecture implies -xarch=v8plusa which generates code that does not run on pre-UltraSPARC processors
This is just a warning, and won't interfere with anything. If you'd like it to go away, just explicitly specify -xarch=v8plusa (or any value you like, except for the native values).
Restricting the type of processor that will be used to run the application provides the compiler with a great deal of information. This information is mostly used to schedule instructions and handle branches in the optimal way; for example, the number of cycles of delay between loading a data value and using it in a subsequent computation (called the load-use delay) is very processor-dependent. Tuning this parameter can have big payoffs. There are thirteen options for SPARC systems, which are laid out in Table 8-8. [11]
[11] There are also options for Solaris on Intel systems; the most useful of these is pentium_pro .
Value | Architecture |
---|---|
old | Very old (pre-SuperSPARC) processors |
super | SuperSPARC chips (any SuperSPARC chip slower than 60 MHz: SM61 or slower) |
super2 | SuperSPARC-II chips (any SuperSPARC chip faster than 75 MHz: SM71 or better) |
micro | MicroSPARC-I chips (SPARCclassic, LX) |
micro2 | MicroSPARC-II chips (Voyager, SPARCstation 4, SPARCstation 5) |
hyper | HyperSPARC-I chips |
hyper2 | HyperSPARC-II chips |
ultra | UltraSPARC-I chips (all UltraSPARC-I processors are slower than 200 MHz) |
ultra2 | UltraSPARC-II chips |
ultra2i | UltraSPARC-IIi chips (Ultra 5, Ultra 10, etc.) |
ultra3 | UltraSPARC-III chips |
native | The current architecture (what is being used to compile), assuming a 32-bit environment |
native64 | The current architecture (what is being used to compile), assuming a 64-bit environment |
Function inlining is the process of substituting the body of a function directly into the function that calls it. It eliminates some of the overhead of jumping to another location in memory and back, and gives the compiler a larger stretch of code in which to find parallel scheduling opportunities.
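As an illustration, the sketch below shows a small helper and a hand-inlined version of its caller. The names are hypothetical, and the second function only approximates the effect of inlining; what the compiler actually emits depends on the optimization level.

```c
/* A small helper: a natural candidate for inlining. */
int square(int x)
{
    return x * x;
}

/* As written: one call to square() per loop iteration. */
long sum_of_squares_call(int n)
{
    long total = 0;
    int i;

    for (i = 1; i <= n; i++)
        total += square(i);
    return total;
}

/* Roughly what the caller looks like once square() has been inlined:
 * the call overhead is gone, and the multiply is visible to the
 * optimizer as part of the loop body. */
long sum_of_squares_inlined(int n)
{
    long total = 0;
    int i;

    for (i = 1; i <= n; i++)
        total += i * i;
    return total;
}
```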
In gcc, inlining optimizations are turned on by -finline-functions.
It's a bad idea to turn on inlining explicitly when the optimization level is -xO4 or higher, as those levels of optimization already inline functions. Specifying the inlining option twice in this fashion can cause performance slowdowns. However, specifying -xcrossfile is a good idea: it lets you inline functions that reside in separate source code files. Use this with -xO4 or higher levels of optimization for the best results.
Part of the compiler's job is to analyze data dependencies within loops, and restructure the loops if necessary. This restructuring may give the compiler more opportunities to unroll or otherwise optimize the loop, which can improve performance. The compiler will also attempt to cache block. [12]
[12] Cache blocking is a technique wherein the computation is divided up so that the data accessed in the divided parts fits in the processor cache. This improves cache efficiency.
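Cache blocking is easier to see in code than in prose. The following is a minimal hand-written sketch using a matrix transpose; the function names and the block parameter are assumptions chosen for illustration, and a compiler performing this transformation will pick its own loop structure and tile size.

```c
#include <stddef.h>

/* Naive transpose of an n-by-n matrix stored in row-major order.
 * Reads of src are sequential, but writes to dst jump n elements at a
 * time, so for large n nearly every write touches a new cache line. */
void transpose_naive(const double *src, double *dst, size_t n)
{
    size_t i, j;

    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            dst[j * n + i] = src[i * n + j];
}

/* Blocked transpose: the matrix is processed one block-by-block tile
 * at a time, so the parts of src and dst being worked on can stay in
 * the cache until the tile is finished. */
void transpose_blocked(const double *src, double *dst, size_t n, size_t block)
{
    size_t ii, jj, i, j;

    for (ii = 0; ii < n; ii += block)
        for (jj = 0; jj < n; jj += block)
            for (i = ii; i < ii + block && i < n; i++)
                for (j = jj; j < jj + block && j < n; j++)
                    dst[j * n + i] = src[i * n + j];
}
```

Both functions produce identical results; only the order in which memory is touched differs.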
-xdepend requires that you specify -xO3 or higher, and it is set by default in -fast .
The -xvector flag tells the compiler to use the optimized vector math library, which can be significantly faster than its scalar counterpart. This flag has the most effect when the application repeatedly calls math library intrinsics, such as log, sin, and exp.
-xvector also enables -xdepend .
By default, the Forte C compiler treats floating-point constants as doubles unless they are explicitly declared as floats. As a result, many extra conversion instructions are often required to transform double-precision floating-point constants into single-precision ones. This is a particularly big problem for codes that perform a large number of division or square root operations, which require almost twice as many cycles when working on double-precision variables. The -xsfpconst flag forces the compiler to treat floating-point constants as single-precision.
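The difference comes down to a single suffix character in the source. In the sketch below (function names are hypothetical), the first function uses an unsuffixed constant, which standard C treats as a double; the second uses the f suffix to keep the whole expression in single precision, which is the treatment the flag applies to unsuffixed constants.

```c
/* 0.5 is a double constant, so x is widened to double, the multiply
 * is done in double precision, and the result is converted back to
 * float on return. */
float half_double_const(float x)
{
    return x * 0.5;
}

/* 0.5f is a float constant, so the multiply stays in single precision
 * and no conversions are needed. */
float half_float_const(float x)
{
    return x * 0.5f;
}
```

If you control the source, adding the f suffix where single precision is intended achieves the same effect without relying on a compiler flag.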
Data prefetching is a technique that allows the processor to overlap executing instructions with fetching data from memory. This is particularly helpful for latency-bound applications with a repeated, regular access pattern (e.g., many large loops) running on high-latency hardware. The -xprefetch flag tells the compiler to insert explicit prefetch instructions into the program to facilitate this behavior. Because this is very architecturally dependent, you get the best results when this option is used in concert with the -xchip and -xtarget options.
Note that UltraSPARC-I processors accept the prefetch instruction but don't act on it. Therefore, code compiled with -xprefetch will run on all UltraSPARC-based systems, but it will have no effect on UltraSPARC-I systems, may provide a significant improvement on UltraSPARC-II systems, and may provide a large improvement on UltraSPARC-III systems, because of their on-chip caches dedicated to prefetching.
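To show what inserting a prefetch into a loop looks like, here is a hand-written sketch. It uses __builtin_prefetch, which is a GCC extension and not a Forte feature; the macro name, the function name, and the prefetch distance of 16 elements are all assumptions for illustration. With -xprefetch, the compiler makes these decisions for you.

```c
#include <stddef.h>

#define PREFETCH_AHEAD 16   /* hypothetical distance, in elements */

/* __builtin_prefetch is a GCC extension; on other compilers this
 * sketch simply compiles the prefetch away. */
#if defined(__GNUC__)
#define PREFETCH_READ(p) __builtin_prefetch((p), 0)
#else
#define PREFETCH_READ(p) ((void)0)
#endif

/* Sums an array while asking the processor to start loading the
 * element PREFETCH_AHEAD iterations ahead, so that the memory fetch
 * overlaps with the additions on the current elements. */
double sum_with_prefetch(const double *a, size_t n)
{
    double total = 0.0;
    size_t i;

    for (i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n)
            PREFETCH_READ(&a[i + PREFETCH_AHEAD]);
        total += a[i];
    }
    return total;
}
```

A prefetch is only a hint: it never changes the result of the program, only (potentially) how long the loads take. The right distance depends heavily on the memory latency of the target system, so treat the value here as a placeholder to be measured, not a recommendation.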
This is a quick and dirty reference of compiler flags for specific sorts of applications. As always, be sure to test; they may not be ideal for your particular program. These flags all assume you are using an UltraSPARC-based system:
For applications that require strict floating-point behavior (IEEE 754), you should try -fast -xarch=v8plus -fsimple=0.
For C applications where pointer arguments to functions don't alias each other, try -fast -xarch=v8plus -xrestrict; if they follow the ISO C 1999 pointer dereferencing rules, also try specifying -xalias_level=std.
Fortran applications should use -stackvar, which forces the compiler to allocate local variables on the program's stack; try -fast -xarch=v8plus -stackvar.
For applications running on UltraSPARC-III systems, use -xprefetch to enable prefetching; try -fast -xdepend -xchip=ultra3 -xprefetch.
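The aliasing recommendation above is worth a short illustration. ISO C 1999 provides the restrict qualifier, which lets you make the same no-aliasing promise for individual pointers in the source. The sketch below uses hypothetical function names and assumes a C99-capable compiler; it illustrates the concept and is not a description of how -xrestrict is implemented.

```c
#include <stddef.h>

/* Without restrict, the compiler must assume that a store to dst[i]
 * might change a[] or b[], so it has to be conservative about
 * reordering the loads and stores in this loop. */
void add_may_alias(float *dst, const float *a, const float *b, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}

/* With restrict, the programmer promises that the three arrays do not
 * overlap, which frees the compiler to schedule loads well ahead of
 * stores. Calling this with overlapping arrays is undefined behavior. */
void add_no_alias(float *restrict dst, const float *restrict a,
                  const float *restrict b, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}
```

The promise cuts both ways: if the pointers actually do alias, the optimized program may silently compute wrong answers, so only make it when you are sure.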
The compiler has a built-in mechanism for targeting aggressive optimization on the most frequently run portions of a program. This scheme is called profile feedback. It uses runtime execution frequency data from the application to direct further optimizations on a subsequent compiling run.
To use profile feedback optimization, you must first build the application with the -xprofile=collect:name flag, where name is the name of the executable. A subsequent run of the application will generate a name.profile directory, which will contain the runtime data. This training run will take more time than the application normally would. Finally, rebuild the application with -xprofile=use:name; the compiler will use the data gathered during the training run to improve optimization.
Example 8-6 gives some source code that we can use to demonstrate the utility of profiling feedback.
    /* profiling.c */
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int i, n = 512, sum = 0;

        for (i = 0; i < 100000000; i++) {
            if (i > n) {
                sum++;
            } else {
                sum--;
            }
        }
        printf ("sum: %d\n", sum);
    }
We then build profiling.c and use profile feedback to improve its performance, as shown in Example 8-7.
    % /opt/SUNWspro/bin/cc -fast -xarch=v8plusa profiling.c -o profiling \
        -xprofile=collect:profiling
    % ./profiling
    sum: 99998974
    % timex ./profiling
    sum: 99998974
    real        3.39
    user        3.38
    sys         0.01
    % /opt/SUNWspro/bin/cc -fast -xarch=v8plusa profiling.c -o profiling \
        -xprofile=use:profiling
    % timex ./profiling
    sum: 99998974
    real        0.69
    user        0.67
    sys         0.01
This is a remarkable improvement! Not all applications will improve this drastically, of course.
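It helps to see why this particular program is such a good candidate. The branch in the loop is extremely lopsided, and that is exactly the kind of fact that runtime frequency data reveals. The sketch below (the function name is hypothetical) counts how often each side of the branch runs and reproduces the program's output in closed form.

```c
/* Closed-form result of the loop in profiling.c.
 * The else branch runs for i = 0 .. n, which is n + 1 iterations;
 * the if branch runs for every remaining iteration. */
long expected_sum(long iterations, long n)
{
    long else_count = n + 1;
    long if_count = iterations - else_count;

    return if_count - else_count;
}
```

With 100,000,000 iterations and n = 512, the else branch is taken only 513 times and the if branch 99,999,487 times, giving 99,999,487 - 513 = 99,998,974, which matches the sum printed above. A program whose branches are closer to evenly split gives the compiler much less to work with.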