17.8 Optimization | Practical C Programming, 3rd Edition

I l @ ve RuBoard

And now a word on optimization: don't. Most programs do not need to be optimized. They run fast enough. Who cares whether an interactive program takes 0.5 seconds to start up instead of 0.2?

To be fair, there are a lot of slow programs out there that can be sped up. This is usually done not by the simple optimization steps shown in this chapter, but by replacing poorly designed core algorithms with more efficient ones.

For a well-written program, the simplest way to get your program to run faster is to get a faster computer. Many times it is cheaper to buy a more powerful machine than it is to optimize a program, because you may introduce new errors into your code. Don't expect miracles from optimization. Usually most programs can be sped up only 10 percent to 20 percent.

17.8.1 Profiling

In general you'll find that your program spends 90% of its time in 10% of your code. That means that optimizing that code will give the greatest results. Most compilers come with a profile that lets you instrument your code and identify which sections are taking up the most time. Unfortunately each tool is different and a discussion of all of the is not possible in this book.

17.8.2 Analyzing and Optimizing code

Example 17-7 initializes a matrix (two-dimensional array). Let's take a look at this code and see if there's any way of making it faster

Example 17-7. matrix/matrix1.cpp

 #include <assert.h> const int X_SIZE = 60; const int Y_SIZE = 30; int matrix[X_SIZE][Y_SIZE]; void init_matrix(  ) {     int x,y;    // current element to initialize      for (x = 0; x < X_SIZE; ++x) {         for (y = 0; y < Y_SIZE; ++y) {             assert((x >= 0) && (x < X_SIZE));             assert((y >= 0) && (y < Y_SIZE));             matrix[x][y] = -1;         }     } }

17.8.3 Register Declarations

How can this function be optimized? First we notice we are using two local variables . By using the qualifier register on these variables, we tell the compiler that they are frequently used and should be placed in fast registers instead of relatively slow main memory. The number of registers varies from computer to computer. Slow machines like the PC have 2, most Unix systems have about 11, and supercomputers can have as many as 128. It is possible to declare more register variables than you have registers. C++ will put the extra variables in main memory.

The register form of optimization has been overtaken by compiler technology. Most compilers do a better job of register allocation than you can by manually adding register hints, and they ignore any user register modifiers. However, this technique is still valid for older compilers.

The program now looks like Example 17-8.

Example 17-8. matrix/matrix2.cpp

 #include <assert.h> const int X_SIZE = 60; const int Y_SIZE = 30; int matrix[X_SIZE][Y_SIZE]; void init_matrix(  ) {     register int x,y;    // current element to initialize      for (x = 0; x < X_SIZE; ++x) {         for (y = 0; y < Y_SIZE; ++y) {             assert((x >= 0) && (x < X_SIZE));             assert((y >= 0) && (y < Y_SIZE));             matrix[x][y] = -1;         }     } }

17.8.3.1 Loop ordering

The outer loop is executed 60 times. This means the overhead associated with starting the inner loop is executed 60 times. If we reverse the order of the loops , we will have to deal with the inner loop only 30 times.

In general, loops should be ordered so the innermost loop is the most complex and the outermost loop is the simplest. Example 17-9 contains the init_matrix function with the loops reordered.

Example 17-9. matrix/matrix3.cpp

 #include <assert.h> const int X_SIZE = 60; const int Y_SIZE = 30; int matrix[X_SIZE][Y_SIZE]; void init_matrix(  ) {     register int x,y;    // current element to initialize      for (y = 0; y < Y_SIZE; ++y) {         for (x = 0; x < X_SIZE; ++x) {             assert((x >= 0) && (x < X_SIZE));             assert((y >= 0) && (y < Y_SIZE));             matrix[x][y] = -1;         }     } }

17.8.3.2 The power of powers of 2

Indexing an array requires a multiplication operation. For example, to execute the line:

 matrix[x][y] = -1;

the program must compute the location where we want to put the -1. To do this, the program must perform the following steps:

Get the address of the matrix .
Compute x * Y_SIZE .
Compute y .
Add up all three parts to form the address. In C++ this code looks like:

 *(matrix + (x * Y_SIZE) + y) = -1;

However, you typically won't write matrix accesses this way because C++ handles the details. But being aware of the details can help you generate more efficient code.

Almost all C++ compilers will convert multiplication by a power of 2 (2, 4, 8, ...) into shifts, thus taking an expensive operation (multiply) and changing it into an inexpensive operation (shift).

For example:

 i = 32 * j;

is compiled as:

 i = j << 5; /* 2**5 == 32 */

In Example 17-9 we define Y_SIZE as 30, which is not a power of 2. By increasing Y_SIZE to 32, we waste some memory but get a faster program.

Example 17-10 shows how we can take advantage of a power of 2.

Example 17-10. matrix/matrix4.cpp

 #include <assert.h> const int X_SIZE = 60; const int Y_SIZE = 32; int matrix[X_SIZE][Y_SIZE]; void init_matrix(  ) {     register int x,y;    // current element to initialize      for (y = 0; y < Y_SIZE; ++y) {         for (x = 0; x < X_SIZE; ++x) {             assert((x >= 0) && (x < X_SIZE));             assert((y >= 0) && (y < Y_SIZE));             matrix[x][y] = -1;         }     } }

17.8.3.3 Making use of pointers

Since we are initializing consecutive memory locations, we can optimize the program even further. We can initialize the matrix by starting at the first location and storing a -1 in the next X_SIZE * Y_SIZE elements. Using this method, we cut the number of loops down to one. The indexing of the matrix has changed from a standard index ( matrix[x][y] ), requiring a shift and add, into a pointer dereference ( *matrix_ptr ) and an increment (++ matrix_ptr ). In Example 17-11, we've turned our arrays into pointers.

Example 17-11. matrix/matrix5.cpp

 const int X_SIZE = 60; const int Y_SIZE = 30; int matrix[X_SIZE][Y_SIZE]; void init_matrix(  ) {     register int index;         // element counter      register int *matrix_ptr;   // Current element     matrix_ptr = &matrix[0][0];     for (index = 0; index < X_SIZE * Y_SIZE; ++index) {         *matrix_ptr = -1;         ++matrix_ptr;     } }

But why have both a loop counter and a matrix_ptr ? Couldn't we combine the two? In fact we can. In Example 17-12 we've successfully eliminated the loop counter by combining it with the array pointer.

Example 17-12. matrix/matrix6.cpp

 const int X_SIZE = 60; const int Y_SIZE = 30; int matrix[X_SIZE][Y_SIZE]; void init_matrix(  ) {     register int *matrix_ptr;   // Current element     for (matrix_ptr = &matrix[0][0];               matrix_ptr <= &matrix[X_SIZE-1][Y_SIZE-1];              ++matrix_ptr) {         *matrix_ptr = -1;     } }

The function is now well optimized. The only way we could make it better is to manually code it into assembly language. This might make it faster; however, assembly language is highly nonportable and very error-prone . But in this case, someone else has written a highly optimized assembly-language (usually) function that we can use to do the job, and it's a part of the standard C++ library.

17.8.4 Using the System Library

The library routine memset can be used to fill a matrix or array with a single character value. We can use it to initialize the matrix in this program. Frequently used library subroutines such as memset are often coded into assembly language and may make use of special processor-dependent tricks to do the job faster than could be done in C++. In Example 17-13 we let the function memset do the work.

Example 17-13. matrix/matrix7.cpp

 #include <cstring> const int X_SIZE = 60; const int Y_SIZE = 30; int matrix[X_SIZE][Y_SIZE]; void init_matrix(  ) {     std::memset(matrix, -1, sizeof(matrix)); }

Now our function consists of only a single function call. It seems a shame to have to call a function just to call another function. We have to pay for the overhead of two function calls. It would be better if we called memset from the main function. Why don't we rewrite all the functions that call init_matrix so they use memset ? Because it has several hundred init_matrix calls, and we don't want to do all that editing.

So how do we get rid of the overhead of a function call? By making the function inline. Our final version of the function uses inline to eliminate all the call overhead. It can be seen in Example 17-14.

Example 17-14. matrix/matrix8.cpp

 #include <cstring> const int X_SIZE = 60; const int Y_SIZE = 30; int matrix[X_SIZE][Y_SIZE]; inline void init_matrix(  ) {     std::memset(matrix, -1, sizeof(matrix)); }

Why does memset successfully initialize the matrix to -1, but when we try to use it to set every element to 1, we fail?

 #include <cstring> const int X_SIZE = 60; const int Y_SIZE = 30; int matrix[X_SIZE][Y_SIZE]; inline void init_matrix(  ) {     memset(matrix, 1, sizeof(matrix)); }

I l @ ve RuBoard