8.3 Optimization Patterns

We now turn our attention to patterns for optimization. The techniques presented here serve two purposes. The first is to explain how some of the most common optimizations work, so that you can judge whether a specific compiler optimization flag is likely to help performance. The second is to enable you, in some cases, to apply these optimizations by hand. This is sometimes necessary when the compiler cannot determine whether an optimization is safe. [8] This does not mean that you should apply these optimizations everywhere! Compilers are usually quite reliable. Careful use of timing techniques on critical code paths will help you identify the cases where the compiler could use some help.

[8] By default, compilers are pessimistic about whether certain optimizations are permissible. Humans are often able to look at a piece of code and determine, by visual inspection and a thorough understanding of the code flow, whether a given optimization is safe.

We cannot present every optimization, so we'll focus on the most common and important ones. Generally, these optimizations fall into a handful of categories: optimizations on arithmetic, on loops, and on string operations.

These optimizations will be presented in C. If you are interested in Java tuning, please consult Java Performance Tuning by Jack Shirazi (O'Reilly). IBM has written an excellent summary text on this topic, called the Optimization and Tuning Guide for Fortran, C, and C++.

8.3.1 Arithmetic

One of the most fundamental arithmetic optimizations is to replace a sequence of operations with an equivalent, less expensive sequence. For example, the compiler might replace a line of code like:

 x = pow(y, 5); 

with a line like:

 x = y * y * y * y * y; 

because in many cases repeated multiplication is less expensive than exponentiation. This is called strength reduction. It is by far the most common arithmetic optimization.
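When the exponent is larger, the multiplication chain itself can often be shortened further by repeated squaring. Here is a minimal sketch, assuming floating-point operands (the function name is illustrative):

 /* Computing y^5 with three multiplications instead of four,
    via repeated squaring. */
 double pow5(double y)
 {
     double y2 = y * y;      /* y^2: one multiplication  */
     double y4 = y2 * y2;    /* y^4: two multiplications */
     return y4 * y;          /* y^5: three in total      */
 }

Note that reassociating the multiplication chain can change the rounding of floating-point results, which is one reason a compiler may apply this kind of rewrite only under relaxed floating-point settings.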

Another common example of strength reduction is replacing a division with a multiplication-by-inverse. For example, you could replace this line of code:

 x = y / z; 

with a line like:

 x = y * (1.0 / z); 

and see a performance increase simply because, on many platforms, inverting and multiplying is much faster than dividing.
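This substitution pays off most clearly when the same divisor is reused many times, because the reciprocal can be computed once and hoisted out of the loop. A minimal sketch (the function name is illustrative):

 /* Scale an array by 1/z: one division replaces n divisions. */
 void scale(double *a, int n, double z)
 {
     double inv = 1.0 / z;      /* the single division, hoisted */
     int i;
     for (i = 0; i < n; i++) {
          a[i] = a[i] * inv;    /* multiplication is cheaper than division */
     }
 }

Because the reciprocal-and-multiply result can round differently from a direct division, most compilers will not make this substitution for floating-point code unless you explicitly relax floating-point semantics; this is exactly the kind of safety judgment you can sometimes make by hand.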

Another good idea in mathematics-intensive codes is to link against your vendor's high-performance mathematics library, if one is available. This step can produce fantastic performance increases by calling highly tuned assembly-language routines for basic math functions.

8.3.2 Loops

Please note that all of these loop optimizations are performed automatically by a modern optimizing compiler. However, compilers are required to be somewhat conservative (the emphasis is on generating correct code rather than fast code); applying some of these techniques manually can help the compiler find opportunities for other, more complex optimizations.

Loop fusion is the process of combining the statements inside several loops into a single loop.

Untuned:

 for (i = 0; i <= 1024; i++) {
      x = x * a[i] * b[i];
 }
 for (j = 0; j <= 1024; j++) {
      y = y * a[j] * b[j];
 }

Tuned:

 for (i = 0; i <= 1024; i++) {
      x = x * a[i] * b[i];
      y = y * a[i] * b[i];
 }

This technique not only halves the loop overhead, it also creates better opportunities for the compiler to overlap instructions and exploit parallelism in the underlying microprocessor. You must exercise care, however: this optimization can decrease performance by increasing the cache miss rate (for example, when each loop accesses a significant portion of the data cache). It can also be complicated by data dependence; you must be sure that each iteration of the second loop relies only on results the first loop has already produced by the corresponding point in the fused loop.
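As an illustration of the data-dependence problem, consider a sketch like the following (the arrays are hypothetical). These two loops must not be fused, because iteration j of the second loop reads a[j + 1], a value a fused loop would not have computed yet:

 for (i = 0; i < 1024; i++) {
      a[i] = b[i] + 1.0;           /* writes all of a[] */
 }
 for (j = 0; j < 1023; j++) {
      c[j] = a[j] + a[j + 1];      /* reads a[j + 1], which a fused loop
                                      would not yet have written */
 }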

Floating invariant flow control moves flow control statements that do not change between iterations of a loop outside of the loop.

Untuned:

 for (i = 0; i <= 1024; i++) {
      for (j = 0; j <= 1024; j++) {
           if (a[i] >= 64) {
                b[i] = -a[i];
           }
           x = x + a[j] - b[j];
      }
 }

Tuned:

 for (i = 0; i <= 1024; i++) {
      if (a[i] >= 64) {
           b[i] = -a[i];
      }
      for (j = 0; j <= 1024; j++) {
           x = x + a[j] - b[j];
      }
 }

Most compilers do this automatically, but they may be unable to when the loop contains calls to other procedures, or in complex loops where the invariance cannot easily be determined.
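For example, if the test involves a call the compiler cannot see into, you may be able to hoist it by hand. A minimal sketch, where threshold() is a hypothetical external routine that you happen to know returns the same value throughout the loop:

 /* Untuned: the compiler cannot prove threshold() is invariant,
    so the call is made on every iteration. */
 for (i = 0; i <= 1024; i++) {
      if (a[i] >= threshold()) {
           b[i] = -a[i];
      }
 }

 /* Tuned: the known-invariant call is hoisted by hand. */
 t = threshold();
 for (i = 0; i <= 1024; i++) {
      if (a[i] >= t) {
           b[i] = -a[i];
      }
 }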

Loop unrolling replicates the body of the loop while reducing the iteration count proportionally.

Untuned:

 for (i = 0; i <= 1024; i++) {
      for (j = 0; j < 4; j++) {
           a[i] = b[i][j];
      }
 }

Tuned:

 for (i = 0; i <= 1024; i++) {
      a[i] = b[i][0];
      a[i] = b[i][1];
      a[i] = b[i][2];
      a[i] = b[i][3];
 }

This is primarily useful because it gives the compiler and the microprocessor more opportunities to schedule instructions in parallel.
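Unrolling is especially effective on reductions if you also split the accumulator, so that the unrolled operations do not form a single dependence chain. A minimal sketch, assuming n is a multiple of 4:

 sum0 = sum1 = sum2 = sum3 = 0.0;
 for (i = 0; i < n; i += 4) {
      sum0 += a[i];          /* four independent partial sums can be  */
      sum1 += a[i + 1];      /* computed in parallel by the processor */
      sum2 += a[i + 2];
      sum3 += a[i + 3];
 }
 sum = sum0 + sum1 + sum2 + sum3;

(For floating-point data this reassociates the additions, so the result can round differently; as before, that is a judgment the compiler will not make on its own by default.)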

Stride refers to the relationship between the layout of an array's elements in memory and the order in which they are processed. For example, a C array of 32-bit integers is accessed at stride 1 (consecutive memory locations) if the elements are accessed by monotonically increasing the rightmost subscript, because C stores arrays in row-major order.

Untuned:

 for (i = 0; i <= 1024; i++) {
      for (j = 0; j <= 16; j++) {
           a[j] = b[j][i];
      }
 }

Tuned:

 for (j = 0; j <= 16; j++) {
      for (i = 0; i <= 1024; i++) {
           a[j] = b[j][i];
      }
 }

The lower the stride, the better the software performs, because the cache hierarchy is used more effectively.

8.3.3 Strings

The runtime of the standard C library string functions is proportional to the lengths of their operands. It's quite common for loops over calls to these functions to consume a significant amount of CPU time. There are a few simple optimizations you can apply to reduce this effect:

  • Instead of calling strlen() in a loop involving the string itself, set a temporary variable equal to strlen() once before the loop, and then update that variable as you add and remove characters (see the sketch after this list).

  • If you are testing whether a string is empty, strlen() is probably not the fastest way to do it: if the string is of substantial size, strlen() will check every character until it finds the terminator. You can replace it with *s == '\0', which checks only the first character and saves a function call. Similarly, if you want to clear a string, *s = '\0' is much faster than strcpy(s, "").

  • strcat() scans the full length of the destination string on each call. If you have been keeping track of the string's length as you go along (as described for strlen() previously), you have a ready-made index to the end of the string, and can use strcpy() or memcpy() from there (again, see the sketch after this list).

  • You might be able to save a little bit of time on a strcmp() between two strings containing natural language by checking whether the first characters of the strings are equal before making the call to strcmp(). Because of the statistically nonuniform distribution of letters in natural languages, the chances of a match are a bit better than would otherwise be expected. However, this optimization can be counterproductive, so be aware of the environment you're working in (e.g., if you are comparing adjacent strings in sorted output, this will not buy you anything).
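To make these points concrete, here is a minimal sketch of length tracking and the cheap tests described above. The struct, its fixed capacity, and all of the names are illustrative assumptions, not part of the standard library:

 #include <string.h>

 struct tracked_string {
     char   buf[4096];     /* assumed fixed capacity, for brevity    */
     size_t len;           /* kept equal to strlen(buf) at all times */
 };

 /* Append src without rescanning the destination, as strcat()
    would; assumes the result fits in buf. */
 void ts_append(struct tracked_string *s, const char *src)
 {
     size_t n = strlen(src);                /* scan only the new text  */
     memcpy(s->buf + s->len, src, n + 1);   /* copy including the '\0' */
     s->len += n;
 }

 /* Empty test and clear, without strlen() or strcpy(). */
 #define TS_IS_EMPTY(s)  ((s)->buf[0] == '\0')
 #define TS_CLEAR(s)     ((s)->buf[0] = '\0', (s)->len = 0)

 /* First-character check before strcmp(), per the last point above. */
 #define STR_EQ(a, b)    ((a)[0] == (b)[0] && strcmp((a), (b)) == 0)

Because the append routine scans only the new text, building a long string from many pieces costs time proportional to the total length, rather than to its square as with repeated strcat() calls.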


