8.4 Interacting with Compilers | System Performance Tuning2002

Compilers have a well-deserved reputation for presenting their users with a bewildering array of options, none of which make particular sense. The truth is that very few of these options are really necessary to tune, but picking correct compiler settings can make a huge difference in application performance. Furthermore, you don't need to completely understand how the compiler works in order to properly optimize your applications. ^[9]

^[9] However, if you are interested in learning about how, exactly, a compiler works, read Compilers: Principles, Techniques, and Tools by A. Aho, R. Sethi, and J. Ullman (Addison Wesley). This is commonly known as "the dragon book" and is one of the best textbooks on the theory behind compiler design.

Two compilers are in widespread use in the Sun and Linux worlds. The first is the Forte Developer compiler suite, sold by Sun as the follow-on to the now-obsolete Sun WorkShop compilers. More information, including a trial download, is available at http://www.sun.com/forte/. The second is the free GNU compiler, gcc. The chief advantages of gcc are that it is free and has been ported to a wide variety of platforms; it is almost always available, even for esoteric systems. In my opinion, the Forte Developer compilers generate significantly better code, and are much better documented and easier to work with. We will focus here primarily on options for the Forte compilers. No matter what, keep in mind that compilers are constantly being optimized and refined. It pays to have the most recent version of the compilers, as it's not unusual to see speedups in the 10 to 15% range with each new major compiler release.

There are two caveats about compiler flags. The first is that order is important: later flags override earlier ones. This is particularly a problem with -fast; specifying -fast -xO4 with the Forte Developer 6 compilers results in a lower optimization level than -fast alone, because the trailing -xO4 overrides the -O5 that -fast expands to (see Table 8-4). The second is more of a heads-up: not every compiler flag is preceded by -x.

8.4.1 Typical Optimizations: -fast

The most important compiler option to know for Sun systems is -fast. This macro expands to different things depending on what language you are compiling and what version of the compiler you're using. For a summary, see Table 8-4.

Table 8-4. Expansions of the -fast macro

Compiler	C	Fortran 77	Fortran 90
Sun WorkShop 5.0	-xO4 -single -fns -fsimple=1 -ftrap=%none -libmil -native	-xO4 -dalign -depend -fns -fsimple=1 -ftrap=%none -libmil -native	-xO3 -dalign -fns -ftrap=common -f -native -xlibmopt
Forte Developer 6	-O5 -single -xmemalign=8s -fns -fsimple=2 -ftrap=%none -xlibmil -native -xprefetch=no -xvector=no	-O5 -dalign -depend -xpad=local -fns -fsimple=2 -ftrap=%none -xlibmil -native -xlibmopt -xvector=yes	-O5 -dalign -depend -xpad=local -fns -fsimple=2 -ftrap=common -f -xlibmil -native -xlibmopt -xvector=yes
Forte Developer 6 Update 1	-O5 -single -xmemalign=8s -fns -fsimple=2 -ftrap=%none -xalias_level=basic -xbuiltin=%all -xlibmil -native -xprefetch=no -xvector=no	-O5 -dalign -depend -xprefetch -xpad=local -fns -fsimple=2 -ftrap=%none -xlibmil -native -xlibmopt -xvector=yes	-O5 -dalign -depend -xprefetch -xpad=local -fns -fsimple=2 -ftrap=common -f -xlibmil -native -xlibmopt -xvector=yes

For the most part, we won't discuss what these flags actually mean. Suffice it to say that they represent a good mix of basic optimizations.

8.4.2 Optimization Level: -xO

The most commonly tuned optimization flag is the global optimization control -xOn, where n specifies the optimization level (from 1 through 5, inclusive). The default optimization level is -xO2 for C programs and -xO3 for Fortran programs in the Forte Developer 6 compilers. Table 8-5 is a summary of what each optimization level does.

Table 8-5. Optimization level summary for Forte compilers

Level	Description
-xO1	Basic local optimizations only.
-xO2	Default for C. Level 1 plus global optimizations: algebraic simplification, subexpression elimination, optimized register allocation, eliminating dead code, propagation of constants, tail-call elimination. ^[10]
-xO3	Default for Fortran. Level 2 plus loop optimizations (unrolling and fusion). Software pipelining.
-xO4	Level 3 plus function inline and more aggressive global optimization.
-xO5	The highest level of optimization. It is most likely to perform well when combined with profile feedback (see Section 8.4.11 later in this chapter).

^[10] Tail calls are not eliminated until -xO4 if -g is specified, to ease debugging.
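
To make the loop unrolling listed under -xO3 concrete, here is a hand-written sketch of what unrolling a reduction loop by four looks like. The function names are hypothetical and the unroll factor is an arbitrary choice for illustration; what the compiler actually generates depends on the target processor and will not necessarily look like this.

```c
#include <assert.h>
#include <stddef.h>

/* Straightforward loop: one add and one loop test per element. */
long sum_simple(const int *a, size_t n)
{
    long sum = 0;
    size_t i;

    for (i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Hand-unrolled by four: one loop test per four elements, and four
 * independent partial sums that the processor can overlap. */
long sum_unrolled(const int *a, size_t n)
{
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;

    assert(a != NULL || n == 0);

    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    /* Cleanup loop for the zero to three leftover elements. */
    for (; i < n; i++)
        s0 += a[i];

    return s0 + s1 + s2 + s3;
}
```

Both functions return the same result; the unrolled version trades larger code for fewer branches, which is the size-for-speed tradeoff that higher optimization levels make on your behalf.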

gcc, the free GNU compiler, performs no optimization by default. It has three levels of optimization, which provide varying returns, summarized in Table 8-6.

Table 8-6. Optimization level summary for gcc

Level	Description
-O1	Basic local optimization only.
-O2	Any optimization that doesn't involve a size-for-speed tradeoff; notably no loop unrolling or inlining. (Enabling loop unrolling is done via -funroll-loops.)
-O3	Level 2 optimizations plus function inlining.

8.4.3 Specifying Instruction Set Architecture: -xarch

The -xarch flag specifies what instruction set architecture the program should be built with. For SPARC platforms, there are nine choices, as summarized in Table 8-7.

Table 8-7. Valid -xarch values

Value	Instruction set	Bitness	Restrictions
v7	SPARC V7 (no fsmuld, integer mul, or integer div instructions)	32-bit	Any SPARC machine
v8a	SPARC V8 (no fsmuld instruction)	32-bit	Any microSPARC-I or later machine
v8	SPARC V8	32-bit	Any SuperSPARC or later machine
v8plus	SPARC V9 (no VIS)	32-bit	Any UltraSPARC machine
v8plusa	SPARC V9	32-bit	Any UltraSPARC machine
v8plusb	SPARC V9 (with UltraSPARC-III extensions)	32-bit	UltraSPARC-III machines
v9	SPARC V9 (no VIS)	64-bit	Any UltraSPARC machine running in 64-bit mode
v9a	SPARC V9	64-bit	Any UltraSPARC machine running in 64-bit mode
v9b	SPARC V9 (with UltraSPARC-III extensions)	64-bit	UltraSPARC-III machines running in 64-bit mode

In general, the best performance of a 32-bit application on UltraSPARC-class processors is obtained with -xarch=v8plusa (or v8plusb on UltraSPARC-III systems). Specifying an earlier architecture here can significantly impede the performance of your applications; pick the latest architecture possible while still supporting all your systems. It may be worthwhile to compile multiple versions of a program that particularly benefits from an optimization only available on certain platforms.

If you are compiling applications with -fast on an UltraSPARC platform and either don't specify an -xarch value or specify -xarch=native, the compiler will assume you want v8plusa or v8plusb. This generates code that won't run on pre-UltraSPARC machines, and the compiler prints a warning:

 cc: Warning: -xarch=native has been explicitly specified, or implicitly specified by  a macro option, -xarch=native on this architecture implies -xarch=v8plusa which generates code that does not run on pre-UltraSPARC processors

This is just a warning, and won't interfere with anything. If you'd like it to go away, just explicitly specify -xarch=v8plusa (or any value you like, except for the native values).

8.4.4 Specifying Processor Architecture: -xchip

Restricting the type of processor that will be used to run the application provides the compiler with a great deal of information. This information is mostly used to schedule instructions and handle branches in the optimal way; for example, the number of cycles of delay between loading a data value and using it in a subsequent computation (called the load-use delay) is very processor-dependent. Tuning this parameter can have big payoffs. There are thirteen options for SPARC systems, which are laid out in Table 8-8. ^[11]

^[11] There are also options for Solaris on Intel systems; the most useful of these is pentium_pro.

Table 8-8. Valid -xchip values

Value	Architecture
old	Very old (pre-SuperSPARC) processors
super	SuperSPARC chips (any SuperSPARC chip slower than 60 MHz: SM61 or slower)
super2	SuperSPARC-II chips (any SuperSPARC chip faster than 75 MHz: SM71 or better)
micro	MicroSPARC-I chips (SPARCclassic, LX)
micro2	MicroSPARC-II chips (Voyager, SPARCstation 4, SPARCstation 5)
hyper	HyperSPARC-I chips
hyper2	HyperSPARC-II chips
ultra	UltraSPARC-I chips (all UltraSPARC-I processors are slower than 200 MHz)
ultra2	UltraSPARC-II chips
ultra2i	UltraSPARC-IIi chips (Ultra 5, Ultra 10, etc.)
ultra3	UltraSPARC-III chips
native	The current architecture (what is being used to compile), assuming a 32-bit environment
native64	The current architecture (what is being used to compile), assuming a 64-bit environment

8.4.5 Function Inlining: -xinlining and -xcrossfile

Function inlining is the process of substituting the body of a function at the point where it is called. It eliminates the overhead of the call itself (the jump to another location in memory and back), and it gives the compiler a larger stretch of code in which to find instruction scheduling opportunities.

In gcc, inlining optimizations are turned on by -finline-functions.

It's a bad idea to turn on inlining explicitly when the optimization level is -xO4 or higher, as those levels already inline functions. Specifying inlining twice in this fashion can cause performance slowdowns. However, specifying -xcrossfile is a good idea: it lets the compiler inline functions that reside in separate source files. Use it with -xO4 or higher levels of optimization for the best results.
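
As a rough illustration of what inlining does, the sketch below shows a loop that calls a small helper, followed by the same loop with the helper's body substituted by hand. All of the function names are hypothetical, and the second version is only an approximation of what an inlining compiler would produce.

```c
#include <assert.h>

/* Small helper: a typical inlining candidate. */
static int square(int x)
{
    return x * x;
}

/* As written: one call to square() per iteration. */
int sum_of_squares_call(int n)
{
    int i, sum = 0;

    assert(n >= 0);
    for (i = 1; i <= n; i++)
        sum += square(i);
    return sum;
}

/* After inlining (done by hand here): the call is replaced by the
 * body of square(), so the loop contains no call overhead and the
 * multiply can be scheduled together with the rest of the loop. */
int sum_of_squares_inlined(int n)
{
    int i, sum = 0;

    assert(n >= 0);
    for (i = 1; i <= n; i++)
        sum += i * i;
    return sum;
}
```

If square() lived in a different source file from its caller, the compiler would need to see both files at once to perform this substitution, which is the situation -xcrossfile addresses.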

8.4.6 Data Dependency Analysis: -xdepend

Part of the compiler's job is to analyze data dependencies within loops, and restructure the loops if necessary. This restructuring may give the compiler more opportunities to unroll or otherwise optimize the loop, which can improve performance. The compiler will also attempt cache blocking. ^[12]

^[12] Cache blocking is a technique wherein the computation is divided up so that the data accessed in the divided parts fits in the processor cache. This improves cache efficiency.

-xdepend requires that you specify -xO3 or higher, and it is set by default in -fast .
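
The cache blocking described in the footnote can be sketched by hand with a matrix transpose. The function names and the tile size BLOCK are assumptions made for illustration only; the compiler chooses its own restructuring, and a good tile size depends on the cache of the target processor.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK 32  /* hypothetical tile edge; tune it to the cache size */

/* Naive transpose: dst is written with a stride of n elements, so for
 * large n nearly every store touches a different cache line. */
void transpose_naive(const double *src, double *dst, size_t n)
{
    size_t i, j;

    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            dst[j * n + i] = src[i * n + j];
}

/* Blocked transpose: process one BLOCK x BLOCK tile at a time, so the
 * parts of src and dst being touched fit in the cache together. */
void transpose_blocked(const double *src, double *dst, size_t n)
{
    size_t ii, jj, i, j;

    assert(src != dst);  /* not an in-place transpose */

    for (ii = 0; ii < n; ii += BLOCK)
        for (jj = 0; jj < n; jj += BLOCK)
            for (i = ii; i < ii + BLOCK && i < n; i++)
                for (j = jj; j < jj + BLOCK && j < n; j++)
                    dst[j * n + i] = src[i * n + j];
}
```

Both functions produce the same output; only the order in which memory is visited differs.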

8.4.7 Vector Operations: -xvector

The -xvector flag tells the compiler to use the optimized vector math library, which can be significantly faster than its scalar counterpart. This flag has the most effect when the application repeatedly calls math library intrinsics, such as log, sin, and exp.

-xvector also enables -xdepend .

8.4.8 Default Floating Point Constant Size: -xsfpconst

By default, the Forte C compiler treats floating-point constants as doubles unless explicitly declared as floats. As a result, many extra conversion instructions are often required to transform double-precision floating-point constants into single-precision ones. This is a particularly big problem for codes that perform a large number of division or square root operations, which require almost twice as many cycles when working on double-precision variables. The -xsfpconst flag forces the compiler to treat floating-point constants as single-precision.
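
The difference is visible in source code. In the sketch below (the function names are hypothetical), the first function uses an unsuffixed constant, which standard C treats as a double; the second uses the f suffix to declare the constant as a float explicitly, which is roughly the effect -xsfpconst applies to every unsuffixed constant.

```c
#include <assert.h>
#include <stddef.h>

/* 0.5 is a double constant: each x[i] is converted to double,
 * multiplied in double precision, and converted back to float. */
void scale_double_const(float *x, size_t n)
{
    size_t i;

    assert(x != NULL || n == 0);
    for (i = 0; i < n; i++)
        x[i] = x[i] * 0.5;
}

/* 0.5f is a float constant: the multiplication stays in single
 * precision and no conversions are needed. */
void scale_float_const(float *x, size_t n)
{
    size_t i;

    assert(x != NULL || n == 0);
    for (i = 0; i < n; i++)
        x[i] = x[i] * 0.5f;
}
```

Because 0.5 is exactly representable in both precisions, these two functions give identical results. For a constant such as 0.1, which is not, the single-precision version can differ in the last bit, so check that your application tolerates this before relying on the flag.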

8.4.9 Data Prefetching: -xprefetch

Data prefetching is a technique that allows the processor to overlap executing instructions with fetching data from memory. This is particularly helpful for latency-bound applications with a regular, repeated access pattern (e.g., many large loops) running on high-latency hardware. The -xprefetch flag tells the compiler to insert explicit prefetch instructions into the program to facilitate this behavior. Because prefetching is very architecture-dependent, you get the best results when this option is used in concert with the -xchip and -xtarget options.

Note that UltraSPARC-I processors accept the prefetch instruction but treat it as a no-op. Therefore, code compiled with -xprefetch will run on all UltraSPARC-based systems: it has no effect on UltraSPARC-I systems, can provide a significant improvement on UltraSPARC-II systems, and can provide a large improvement on UltraSPARC-III systems because of their on-chip caches dedicated to prefetching.
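
To show the kind of code that prefetching produces, here is a hand-written sketch. It uses __builtin_prefetch, which is a gcc extension and not something the Forte compilers provide; it stands in here for the prefetch instructions that -xprefetch inserts automatically. The function name and the prefetch distance are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define PREFETCH_AHEAD 16  /* hypothetical distance, in elements */

/* Sum an array while asking the processor to start loading data that
 * will be needed PREFETCH_AHEAD iterations from now. */
double sum_with_prefetch(const double *a, size_t n)
{
    double sum = 0.0;
    size_t i;

    assert(a != NULL || n == 0);

    for (i = 0; i < n; i++) {
        /* gcc extension: second argument 0 means prefetch for
         * reading, third argument 0 means low temporal locality. */
        if (i + PREFETCH_AHEAD < n)
            __builtin_prefetch(&a[i + PREFETCH_AHEAD], 0, 0);
        sum += a[i];
    }
    return sum;
}
```

A prefetch is only a hint, so the result is the same with or without it. Whether it helps depends on the memory latency of the machine and on choosing a suitable distance.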

8.4.10 Quick and Dirty Compiler Flags

This is a quick and dirty reference of compiler flags for specific sorts of applications. As always, be sure to test; they may not be ideal for your particular program. These flags all assume you are using an UltraSPARC-based system:

For applications that require strict floating-point behavior (IEEE 754), you should try -fast -xarch=v8plus -fsimple=0.
For C applications where pointer arguments to functions don't alias each other, try -fast -xarch=v8plus -xrestrict; if they follow the ISO C 1999 pointer dereferencing rules, also try specifying -xalias_level=std.
Fortran applications should use -stackvar, which forces the compiler to allocate local variables on the program's stack; try -fast -xarch=v8plus -stackvar.
For applications running on UltraSPARC-III systems, use -xprefetch to enable prefetching; try -fast -xdepend -xchip=ultra3 -xprefetch.
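
The no-aliasing condition mentioned for -xrestrict can also be stated in the source, one parameter at a time, with the ISO C 1999 restrict qualifier. The sketch below uses hypothetical function names and assumes a C99-capable compiler.

```c
#include <assert.h>
#include <stddef.h>

/* Without restrict: dst and src may overlap, so the compiler must
 * assume that a store to dst[i] can change a later src element, and
 * this limits how freely it can reorder loads and stores. */
void add_arrays(float *dst, const float *src, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        dst[i] += src[i];
}

/* With restrict: the caller promises that the two arrays do not
 * overlap, so the loop can be unrolled and pipelined more freely.
 * Breaking that promise is undefined behavior. */
void add_arrays_restrict(float *restrict dst, const float *restrict src,
                         size_t n)
{
    size_t i;

    assert(dst != src);  /* catches only the most obvious overlap */
    for (i = 0; i < n; i++)
        dst[i] += src[i];
}
```

A compiler flag makes this promise for a whole compilation at once, so it is only safe when every function being compiled really does receive non-overlapping pointers.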

8.4.11 Profiling Feedback

The compiler has a built-in mechanism for targeting aggressive optimization on the most frequently run portions of a program. This scheme is called profile feedback. It uses runtime execution frequency data from the application to direct further optimizations on a subsequent compiling run.

To use profile feedback optimization, you must first build the application with the -xprofile=collect:name flag, where name is the name of the executable. A subsequent run of the application will generate a name.profile directory, which will contain the runtime data. This training run will take more time than the application normally would. Finally, rebuild the application with -xprofile=use:name; the compiler will use the data gathered during the training run to improve optimization.

Example 8-6 gives some source code that we can use to demonstrate the utility of profiling feedback.

Example 8-6. profiling.c

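    /* profiling.c */
    #include <stdio.h>

    int main(int argc, char **argv)
    {
            int i, n = 512, sum = 0;

            /* The branch is taken the same way on all but the first
               513 iterations, which is what profile feedback exploits. */
            for (i = 0; i < 100000000; i++) {
                    if (i > n) {
                            sum++;
                    } else {
                            sum--;
                    }
            }
            printf("sum: %d\n", sum);
            return 0;
    }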

We then build profiling.c and use profile feedback to improve its performance, as shown in Example 8-7.

Example 8-7. Using profiling feedback to improve application performance

    % /opt/SUNWspro/bin/cc -fast -xarch=v8plusa profiling.c -o profiling \
        -xprofile=collect:profiling
    % ./profiling
    sum: 99998974
    % timex ./profiling
    sum: 99998974
    real        3.39
    user        3.38
    sys         0.01
    % /opt/SUNWspro/bin/cc -fast -xarch=v8plusa profiling.c -o profiling \
        -xprofile=use:profiling
    % timex ./profiling
    sum: 99998974
    real        0.69
    user        0.67
    sys         0.01

This is a remarkable improvement: the run time drops from 3.39 seconds to 0.69 seconds, nearly a factor of five. Not all applications will improve this drastically, of course.