7.2 Performance Tuning


Writing efficient code is something of an art. Efficiency is not about rewriting your application in assembly code or anything that drastic. Efficiency is about writing software that meets whatever design criteria you set for it. If a design specification states that an application should "run as fast as possible," you need to rewrite the specification. It is far better to say that particular operations need to execute in a specific amount of time (on a specific platform). For many applications, especially image processing ones, performance is a very important issue. However, it is surprising how many design documents do not address performance in a concrete way.

Let's look at it another way. If you want to train to become a sprinter, you know that you need to run very fast for a relatively short period of time. If you were to write a list of goals for yourself, would you include a statement that says you should "run as fast as possible"? Of course you wouldn't. You would probably say that you need to be able to run a certain distance in a certain amount of time. And based on this goal, you can set up a training program to meet it.

Writing software is no different. You need to have plausible goals and then you can design a plan to reach them. Your goals can be difficult, but not impossible, to achieve. Sometimes a goal may seem impossible, but further investigation reveals it is actually possible, though extremely challenging. Having a goal that is well defined is absolutely essential for getting the performance you need from your application. See [Bulka99] for useful information on writing efficient C++ code.

7.2.1 General Guidelines

It is not always easy to decide when you need to worry about performance. Our recommendation is to assume that a piece of software does not have any special performance criteria unless you know this statement to be false. Avoid writing a highly optimized piece of software to solve a timing problem that may not even be a problem. It is better to design a reasonable solution first, and then discover that it must run faster. This iterative approach to design helps you reduce development time and wasted effort. It is true that you can have cases where your product does not meet the expectations of its testers on the first iteration, but this is what an iterative design approach is all about. The product design specification needs to be as clear as possible regarding the desired functionality and performance of the application.

For example, let us look at how a customer interacts with a graphical user interface (GUI). We see that the overhead of the framework and the way you write your code often has little effect on performance. The customer communicates with the software by making various requests in the form of mouse or other pointer manipulation or keyboard input. For complicated interfaces, these requests occur at no more than one request per second. The steps that the software takes to process such a request can be listed as a series of events:

  1. Receive the event.

  2. Invoke the event handler responsible for the event.

  3. Process the event.

  4. Update the user interface.

If the customer generates events at the rate of one request per second, then this sequence of events, including updating the user interface, must happen in no more than half that time. Where did this number, .5 seconds, come from? It is simply a guess based upon our perception of how customers operate such systems.

Now, without worrying about specific operating systems or GUI implementations, we can make some assumptions about how long it takes to handle each of the above steps. Receiving and invoking an event handler is a fast operation, even when written in a general purpose C++ framework. This step comprises a number of table lookups, as well as one or more virtual function calls to locate the owner of an event. It certainly does not consume a lot of time. Processing the event is strictly an application-specific task, as is updating the user interface. The amount of overhead that can be tolerated in this example is fairly large. The customer will have a very hard time distinguishing between one millisecond and 50 milliseconds.

As a contrasting example, let's look at a real-time system that has hard performance requirements. As opposed to the previous example, where there is a person waiting to see results updated on the screen, these results are needed in a certain period of time, or else the information is useless. The framework's overhead, as well as how the processing is written, is important. However, it is not so obvious how much overhead is acceptable.

We have found that dealing with percentages makes it easier to gauge how much overhead can be tolerated. If your processing function does not spend 98 to 99 percent of its time doing actual work, you should examine your design more closely. For example, for very fast processing cycles, say five milliseconds, the framework overhead should be kept to less than 100 microseconds. For slower real-time systems that require about 50 milliseconds to execute, overhead should be less than one millisecond. The design of each of these systems will be very different.

To measure overhead for an image processing system, or other system that performs a repeated calculation on a number of samples, it is customary to compute the overhead on a per-row or per-pixel basis. Let us assume that we must perform one or more image processing steps on a 512x512 pixel image. If we want to keep the total overhead to one millisecond, the per-row overhead can be no more than two microseconds. Two microseconds is quite a bit of time for modern processors, and this permits numerous pointer manipulations or iterator updates. We are not so lucky if we have a goal of only two-tenths of a microsecond per row. In this case, you should consider optimizing your code from the outset of the design.

If you find this calculation too simplistic for your taste, you can write some simple prototypes to measure the actual performance on your platform and decide if any code optimization is needed. Since many image processing functions are easy to write, you can get a sense for how much time the operation takes. Our unit test framework is a convenient starting point, since the framework computes and reports the execution time of each unit test function. To get started, you would need to write at least two functions. The first function would contain the complete operation you want to test. The second function would contain just the overhead components. The execution time of the first test function tells us how long the image processing takes, while the second tells us if the overhead itself is significant. If our unit test framework was more complicated than it is, we would also need a third function to measure the overhead of the unit test framework itself. But, since the framework is so simple, its overhead is negligible.

7.2.2 Thirteen Ways to Improve Performance

Without getting specific about the dynamics of your application, we offer thirteen techniques that you can use to improve the overall performance of your software.

Figure 7.3. Performance Checklist


Each of these techniques is described in more detail in this section.

  • Measure performance with release builds, not debugging builds. Debug builds usually have more overhead because compiler optimizations are disabled, debugging symbols are maintained, and code intended only for debugging is included. The difference in execution time between release and debug builds can be enormous. For compilers such as Microsoft Developer Studio, switching between build types is easy. For UNIX makefiles, make sure that debugging symbols are excluded and that optimizations are enabled.

  • Compute only those items that you need, especially in functions that take a long time to execute or are called frequently by your application. It is very common to write a function with the desire to be complete. If you find that certain functions are returning quantities that are never used, you should investigate whether these quantities are needed at all.

    For example, suppose you have a function that computes the mean and standard deviation of a number of samples. If you always find yourself throwing out the standard deviation value, then you have to ask yourself why you are computing it. There is no reason why you cannot have two similar functions: one that computes just the mean of a sample set, and another that computes the mean and standard deviation.
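    The split might look something like this (a minimal sketch; the function names and use of std::vector are our own, not part of the book's framework):

     #include <cassert>
     #include <cmath>
     #include <vector>
    
     // Callers that only need the mean pay only for the mean.
     double mean (const std::vector<double>& samples)
     {
       double sum = 0.0;
       for (double s : samples)
         sum += s;
       return sum / samples.size ();
     }
    
     // A second variant for callers that really need both quantities.
     void meanAndStdDev (const std::vector<double>& samples,
                         double& m, double& sd)
     {
       m = mean (samples);
       double sumSq = 0.0;
       for (double s : samples)
         sumSq += (s - m) * (s - m);
       sd = std::sqrt (sumSq / samples.size ());
     }

    Callers who previously discarded the standard deviation now call mean() and skip the second pass over the data entirely.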

  • Compute or fetch items only once. It is better to explicitly use a local variable in your function than to assume your compiler can divine which quantities are invariant. For example:

     for (unsigned int y=0; y<image.height(); y++)
       for (unsigned int x=0; x<image.width(); x++)
         // ...

    In this example, the height() member is called once for every row in the image. width() is called for every pixel in the image. In this particular example, height() is a simple inline method, but this is not always the case. If you are concerned about performance, you can rewrite this loop as:

     unsigned int h = image.height ();
     unsigned int w = image.width ();
     for (unsigned int y=0; y<h; y++)
       for (unsigned int x=0; x<w; x++)
         // ...

    In this loop, we call width() and height() only once and save their value in a variable. Most of our performance improvement comes from saving the width variable, but it is good programming practice to write symmetrical code and save both. You may be surprised at how a series of these savings can affect the overall performance.

  • Use integers instead of floating point when possible. This used to be much more important than it is now, but many microprocessors still have more integer registers than floating-point ones, and integer registers can be used in more ways. On embedded platforms, you might also consider using the smallest integer format possible.

    For example, if you need to sum all the pixels in an image row, you might be able to use an unsigned short instead of an unsigned int. In our own development projects, we usually consider the size of most variables and use one of a sufficient size, but no larger. As an aside, we ignore this rule for counter variables and just use an int or unsigned int.
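    A sketch of sizing the accumulator (Pel8 is redefined locally here for illustration, and the overflow bound in the comment is our own arithmetic, not from the book):

     #include <cassert>
     #include <cstdint>
    
     typedef unsigned char Pel8;  // 8-bit pixel, as in apImage<Pel8>
    
     // A 16-bit accumulator is only safe while width * 255 <= 65535,
     // i.e. width <= 257. For wider rows, use a 32-bit accumulator.
     uint32_t sumRow (const Pel8* p, unsigned int width)
     {
       uint32_t sum = 0;   // sufficient for any realistic row width
       for (unsigned int x = 0; x < width; x++)
         sum += *p++;
       return sum;
     }

    Checking the worst-case value a variable can hold, as the comment does here, is the step that makes "use the smallest type possible" safe rather than risky.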

  • Know and understand the objects you are using in time-critical code. For example, you don't want a copy constructor or assignment operator to take the bulk of the processing time. If objects are anything other than very simple, try using references (const if possible) or pointers to avoid copies. This is not a big concern with our apImage<> object because it takes advantage of reference counting.
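    For instance (std::string stands in here for any non-trivial object; the function names are our own):

     #include <cassert>
     #include <string>
    
     // Pass by value: the whole object is copied on every call.
     std::size_t lengthByValue (std::string s)      { return s.size (); }
    
     // Pass by const reference: no copy, and the callee cannot modify it.
     std::size_t lengthByRef (const std::string& s) { return s.size (); }

    Both return the same answer, but in a tight loop the by-value version pays for a constructor and destructor on every call.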

  • Remove all debugging and log generation code from time-critical code, once you have it debugged. Otherwise, make sure you use #if to compile this code only for debug builds.
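    One common way to do this (the TRACE macro is our own illustration; NDEBUG is the standard macro defined in release builds):

     #include <cassert>
     #include <iostream>
    
     #ifndef NDEBUG
     #define TRACE(msg) (std::cerr << "trace: " << (msg) << std::endl)
     #else
     #define TRACE(msg) ((void)0)   // expands to nothing in release builds
     #endif
    
     int addPixels (int a, int b)
     {
       TRACE ("addPixels called");  // costs nothing in a release build
       return a + b;
     }

    In a release build the trace statement is removed by the preprocessor, so there is no run-time test and no string formatting left in the time-critical path.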

  • Deliver release versions of your application internally as early as possible. Internal users can provide valuable feedback to you regarding performance of which you may not be aware. There's nothing like a real user banging on the system to bring performance issues to light.

  • Avoid excessive operations with heavyweight objects, like the standard library string class. Many operations, like searching a large string for a substring, should be called as few times as possible. For example, if you are parsing through a large string, keep track of the last index where you found a match, so that you only need to search from that point in the string.
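    Using the position argument of std::string::find keeps each search from restarting at the beginning (the findAll helper is our own sketch):

     #include <cassert>
     #include <string>
     #include <vector>
    
     // Find every occurrence of pattern, resuming each search just
     // past the previous match instead of rescanning from index 0.
     std::vector<std::size_t> findAll (const std::string& text,
                                       const std::string& pattern)
     {
       std::vector<std::size_t> matches;
       std::size_t pos = text.find (pattern);      // first search
       while (pos != std::string::npos) {
         matches.push_back (pos);
         pos = text.find (pattern, pos + 1);       // resume after last match
       }
       return matches;
     }
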

  • Precompute quantities to speed up run-time calculations. This can be as simple as precomputing quantities to improve loop performance, to caching values that remain invariant for longer periods of time. We'll use a very obvious example to demonstrate this technique. Image rotation is a common image processing function that can be implemented in a variety of ways. If you use a brute force method, the processing loop to rotate the image on a pixel-by-pixel basis looks something like this:

     for (unsigned int y=0; y<image.height(); y++)
       for (unsigned int x=0; x<image.width(); x++) {
         double xprime = cos(angle) * x - sin(angle) * y;
         double yprime = sin(angle) * x + cos(angle) * y;
         // ...
       }

    We can easily compute the constant values outside of the loop and considerably improve run-time performance with code similar to this:

     double sine   = sin(angle);
     double cosine = cos(angle);
     double xprime, yprime;
     for (unsigned int y=0; y<image.height(); y++)
       for (unsigned int x=0; x<image.width(); x++) {
         xprime = cosine * x - sine   * y;
         yprime = sine   * x + cosine * y;
         // ...
       }
  • Avoid system calls and heap allocations in time-critical sections of code. Wherever possible, pre-allocate the memory needed by your routine, since heap allocation is a highly variable process. Defer calls to third-party and operating system functions if they are such that they may take a long time to execute.
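    With a std::vector, for example, a single reserve() call up front moves the heap allocation outside the time-critical loop (a minimal sketch; the function name is our own):

     #include <cassert>
     #include <vector>
    
     std::vector<int> collectResults (unsigned int count)
     {
       std::vector<int> results;
       results.reserve (count);           // one allocation, up front
       for (unsigned int i = 0; i < count; i++)
         results.push_back (int(i) * 2);  // no reallocation inside the loop
       return results;
     }

    Without the reserve() call, push_back may reallocate and copy the buffer several times as the vector grows, and each reallocation takes a highly variable amount of time.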

  • Minimize locks in time-critical sections of code. The process of locking is fast only if no one currently owns the lock. A section of code can potentially wait forever before it acquires the lock.
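    One way to keep a critical section short is to do the expensive work first, unlocked, and hold the mutex only long enough to publish the result (a sketch using the standard std::mutex; the names are our own):

     #include <cassert>
     #include <mutex>
     #include <vector>
    
     std::mutex gMutex;
     std::vector<int> gShared;
    
     void publishResult (int input)
     {
       int result = input * input;              // long computation: unlocked
       std::lock_guard<std::mutex> lock (gMutex);
       gShared.push_back (result);              // critical section: short
     }

    The less time each thread holds the lock, the less likely any other thread is to block on it.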

  • Develop an efficient algorithm that works correctly before you start fine tuning your code. Use unit tests to ensure that the tuning hasn't broken anything. There are many ways to solve the same problem. You may find that optimizing a poorly designed, well-implemented algorithm will never be as efficient as a well-designed, poorly implemented algorithm.

  • Use a profiler to find out where the biggest performance problems are before doing a lot of small optimizations. Often 90 percent of the execution time for an operation takes place in only 10 percent of the code. Make sure you are spending your time optimizing the right section of code.

7.2.3 Image-Specific Improvements

Image processing algorithms can be optimized in many ways because of their symmetry. To demonstrate this, we'll start with a simple function that computes the sum of all pixels in an image and optimize it in a number of iterations.

SLOW

The easiest way to write this, and probably the slowest, is:

 UTFUNC(slow)
 {
   setDescription ("Slow version");
   apRect boundary (0, 0, 512, 512);
   apImage<Pel8> byteImage (boundary);
   byteImage.set (1);

   unsigned int sum = 0;
   for (unsigned int y=0; y<byteImage.height(); y++) {
     for (unsigned int x=0; x<byteImage.width(); x++) {
       sum += byteImage.getPixel (x, y);
     }
   }
   VERIFY (true);  // So this test will pass
 }

We can rewrite this to take advantage of our iterators. Because this version runs so quickly, we have to run it multiple times to get reported execution times greater than zero.

ORIGINAL

The rewritten function is as follows:

 UTFUNC(original)
 {
   setDescription ("Original version, sum all pixels");
   apRect boundary (0, 0, 512, 512);
   apImage<Pel8> byteImage (boundary);
   byteImage.set (1);

   // Run many times to get an accurate measurement
   for (int iterations=0; iterations<1000; iterations++) {
     unsigned int sum = 0;
     apImage<Pel8>::row_iterator i;
     for (i=byteImage.row_begin(); i != byteImage.row_end(); i++) {
       const Pel8* p = i->p;
       for (unsigned int x=0; x<byteImage.width(); x++) {
         sum += *p++;
       }
     }
   }
   VERIFY (true);  // So this test will pass
 }

OVERHEAD

Now, we rewrite the original function so that it has the same setup, but performs no image processing, in order to measure overhead. Note that we perform one trivial operation per row, to make sure that the compiler doesn't optimize the loop away.

 UTFUNC(overhead)
 {
   setDescription ("Overhead version, sum all pixels");
   apRect boundary (0, 0, 512, 512);
   apImage<Pel8> byteImage (boundary);
   byteImage.set (1);

   for (int iterations=0; iterations<1000; iterations++) {
     unsigned int dummy = 0;
     apImage<Pel8>::row_iterator i;
     for (i=byteImage.row_begin(); i != byteImage.row_end(); i++) {
       const Pel8* p = i->p;
       dummy += *p++;  // Prevents compiler optimization of loop
     }
   }
   VERIFY (true);  // So this test will pass
 }

The execution time of original includes the execution time of the unit test framework, as well as the actual image processing operation. The execution time of overhead also includes the unit test framework overhead, but only includes the overhead portion of the image processing operation, and not the actual time required for the operation. If the overhead of the unit test framework is significant, then we need a third function that measures the overhead of the unit test framework. Because our framework has very low overhead, we can skip this third test.

ORIGINAL_WIDTH

Let's further optimize the function by removing width() from the loop and placing it into a local variable, as shown.

 UTFUNC(original_width)
 {
   setDescription ("Original version with width()");
   apRect boundary (0, 0, 512, 512);
   apImage<Pel8> byteImage (boundary);
   byteImage.set (1);

   for (int iterations=0; iterations<1000; iterations++) {
     unsigned int sum = 0;
     unsigned int w = byteImage.width();
     apImage<Pel8>::row_iterator i;
     for (i=byteImage.row_begin(); i!=byteImage.row_end(); i++) {
       Pel8* p = i->p;
       for (unsigned int x=0; x<w; x++) {
         sum += *p++;
       }
     }
   }
   VERIFY (true);  // So this test will pass
 }

Let's do one more optimization.

ORIGINAL_UNROLLED

We can do some basic loop unrolling by expanding the inner loop, so that multiple pixels are handled in each iteration, as shown.

 UTFUNC(original_unrolled)
 {
   setDescription ("Original version with unrolling optimization");
   apRect boundary (0, 0, 512, 512);
   apImage<Pel8> byteImage (boundary);
   byteImage.set (1);

   for (int iterations=0; iterations<1000; iterations++) {
     unsigned int sum = 0;
     unsigned int w = byteImage.width() / 8;
     apImage<Pel8>::row_iterator i;
     for (i=byteImage.row_begin(); i!=byteImage.row_end(); i++) {
       Pel8* p = i->p;
       for (unsigned int x=0; x<w; x++) {
         sum += *p++;
         sum += *p++;
         sum += *p++;
         sum += *p++;
         sum += *p++;
         sum += *p++;
         sum += *p++;
         sum += *p++;
       }
     }
   }
   VERIFY (true);  // So this test will pass
 }

As written, this version works for images whose widths are multiples of eight pixels. Extending this to handle an arbitrary width would not be difficult.
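One such extension is to add a short cleanup loop after the unrolled loop to handle the remaining 0 to 7 pixels (a standalone sketch; Pel8 is redefined locally and the function operates on a raw row pointer rather than the book's iterators):

 #include <cassert>

 typedef unsigned char Pel8;

 unsigned int sumRowUnrolled (const Pel8* p, unsigned int width)
 {
   unsigned int sum = 0;
   unsigned int blocks    = width / 8;   // full groups of eight pixels
   unsigned int remainder = width % 8;   // leftover pixels, 0..7

   for (unsigned int x = 0; x < blocks; x++) {
     sum += *p++; sum += *p++; sum += *p++; sum += *p++;
     sum += *p++; sum += *p++; sum += *p++; sum += *p++;
   }
   for (unsigned int x = 0; x < remainder; x++)
     sum += *p++;                        // cleanup loop

   return sum;
 }

The cleanup loop runs at most seven times per row, so it adds negligible time while making the function correct for any width.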

Execution Results

We ran these tests using Microsoft Visual Studio 7.0 on an Intel Pentium 4 processor-based machine, running at 2.0 GHz. Table 7.1 shows the times for a single iteration. The actual times are lower than this, because we shut off optimization to prevent the compiler from optimizing the loops.

Table 7.1. Performance Test Results

  Test                Execution Time
  ----                --------------
  slow                156 milliseconds
  original            1.33 milliseconds
  original_width      1.23 milliseconds
  original_unrolled   1.03 milliseconds

Comparing the original version to original_unrolled shows over a 20 percent decrease in execution time. This is a significant gain for such a simple function. However, the inner loop for most image processing functions is more complicated. We measured the overhead of executing the loops and found it to be 0.03 milliseconds, or about 3 percent of our most optimized example.

Let's look at some of the execution times for our existing image processing functions. Table 7.2 shows the execution times for a Laplacian filter running on a 1024x1024 8-bit monochrome image.

Table 7.2. Image Processing Test Results

  Test             Execution Time
  ----             --------------
  convolve()       89 milliseconds
  laplacian3x3()   51 milliseconds
  Intel IPP        4 milliseconds

convolve() is our general-purpose convolution routine that contains a number of nested loops and fetches the kernel values from an array, resulting in the longest execution time. laplacian3x3() only has two loops, because it uses hard-coded statements to compute the output value for each pixel in the image, and this reduces the execution time significantly. The Intel IPP routine directly uses the power of the Intel Pentium 4 processor to achieve the fastest execution time.

7.2.4 A Note About Timing Your Code

Measuring the execution time of critical sections of your code is not always a simple task, especially if you want to establish the worst-case execution time. In our previous examples, we measured the average time, because we ran a number of iterations, and then divided the total execution time by the number of iterations. Here are some of the reasons why this value will not be the worst case:

  • You really need to measure the slowest code path in your code to get the worst case time. Our filter example is very simple, yet it has several possible code paths. For example, you should let the filter function determine the size and allocate the destination image. Likewise, if you are using a clamped pixel type, you should test your filter so that every pixel in the image causes an overflow.

  • Multithreaded and multiprocess designs "steal" processor time from the code you are trying to measure. If these threads or processes are not part of the actual application, you should test your function in isolation. If you are utilizing other threads or processes in your design, you won't be able to get actual worst-case values until you simulate these conditions in your unit test. You will also need to compute the execution time of each iteration and store the worst case value.

  • Interrupt service routines should also be considered if you are working with an embedded system that has more than simple interrupt routines. As we mentioned above, you will need to measure the worst-case execution time by simulating the actual conditions.
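Timing each iteration separately and keeping the slowest one, rather than averaging, can be sketched like this (our own portable illustration using std::chrono, not the book's timer class):

 #include <cassert>
 #include <chrono>

 // Run f repeatedly; return the slowest single-iteration time in
 // milliseconds. steady_clock is monotonic, so elapsed times are
 // never negative even if the wall clock is adjusted.
 template <typename Func>
 double worstCaseMsec (Func f, int iterations)
 {
   double worst = 0.0;
   for (int i = 0; i < iterations; i++) {
     auto start = std::chrono::steady_clock::now ();
     f ();
     auto stop = std::chrono::steady_clock::now ();
     double msec =
       std::chrono::duration<double, std::milli>(stop - start).count ();
     if (msec > worst)
       worst = msec;          // keep the slowest iteration, not the average
   }
   return worst;
 }

Note that on a multitasking system the reported worst case includes any time stolen by other threads or processes, which is exactly the effect described above.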

The timing routines defined in time.h provide standard, but very coarse, measurements of elapsed time. Most platforms include higher resolution facilities to measure time, and if you are designing for a single platform, we encourage you to take advantage of them. For example, under Microsoft Windows, you can take advantage of timers that provide sub-microsecond resolution. Instead of using this functionality directly ( QueryPerformanceCounter ), you should create a generic wrapper object and reuse the interface on other platforms, as shown.

 class apHiResElapsedTime
 {
 public:
   apHiResElapsedTime ();    // Record the current time
   double usec () const;     // Return elapsed time, in microseconds
   double msec () const;     // Return elapsed time, in milliseconds
   double sec  () const;     // Return elapsed time, in seconds
   void reset ();            // Reset the current time
 private:
   LONGLONG starting_;       // Starting time
 };

 apHiResElapsedTime::apHiResElapsedTime () : starting_ (0)
 { reset ();}

 double apHiResElapsedTime::sec () const
 {
   LARGE_INTEGER t, freq;
   QueryPerformanceCounter (&t);
   QueryPerformanceFrequency (&freq);
   return (double(t.QuadPart - starting_)) / freq.QuadPart;
 }

 double apHiResElapsedTime::msec () const { return sec () * 1.e3; }
 double apHiResElapsedTime::usec () const { return sec () * 1.e6; }

 void apHiResElapsedTime::reset ()
 {
   LARGE_INTEGER t;
   QueryPerformanceCounter (&t);
   starting_ = t.QuadPart;
 }


Applied C++: Practical Techniques for Building Better Software
ISBN: 0321108949
Year: 2003