Symptoms

Premature optimization pervades computer science like a plague. You usually do not have to look very far to find an example of this problem. One of the most important steps in preventing premature optimization is learning to recognize when it is being done. In this section, we will examine the varied symptoms that point to this illness. First, we present the more general indicators that apply under all circumstances. Then we examine several primary instances in which premature optimization can occur. Finally, we look at reasons for optimizing that are often false and can lead to premature optimization. Only when you know how to diagnose the problem can you move on to curing current problems and preventing future ones.

start sidebar
Why?

So, if premature optimization has so many bad consequences, why do people still make this same mistake repeatedly? There are a wide variety of reasons that people give for performing optimization too early, and it is very important to know these reasons and why they are flawed. This knowledge will protect you from following these lines of reasoning and allow you to understand, and discuss with other programmers, why they might want to reconsider an early optimization. As we explore the various symptoms of premature optimization, some of the major reasons for premature optimization will also be explored.

end sidebar

General Symptoms

There are several general indicators that premature optimization is occurring. Because of their broad nature, however, many of these signs only indicate that more investigation is necessary. Watch for these symptoms and, when they appear, delve further into their cause.

Readability

One of the most common indicators of premature optimization is poor code readability. The majority of optimizations are more complicated solutions and therefore harder to read and understand than a simpler but less optimal solution. This does not mean that all difficult to read code has been optimized. An unfortunately all too common reason for poor readability is simply poor code. It is necessary to examine the purpose of the code to determine if it is performing useful work that could be an optimization, or if it is simply hiding a much less complicated algorithm in highly obfuscated code.

The poor readability introduced by premature optimizations makes the code more difficult to understand and modify. Since every extra day a programmer spends modifying code costs money, you must balance the benefits of the performance gain against the loss in development time and the increased cost of programming labor. It must be emphasized that there are few optimizations that do not hurt readability or robustness, so you should assume that you will be making this tradeoff when making a decision. It should be obvious from the link between optimization and longer development time why it is best to wait until the latter stages of development to perform optimizations.

 CD-ROM  Take for example the following sorting algorithm, called selection sort, written in C++ and included in Source/Examples/Chapter1/readability.cpp on the companion CD-ROM:

    void selection_sort(int *io_array, unsigned int i_size)
    {
        for(unsigned int l_indexToSwap = 0;
            l_indexToSwap < i_size; ++l_indexToSwap) {
            unsigned int l_indexOfMinimumValue = l_indexToSwap;
            for(unsigned int l_indexToTest = l_indexToSwap + 1;
                l_indexToTest < i_size; ++l_indexToTest) {
                if(io_array[l_indexToTest] <
                   io_array[l_indexOfMinimumValue]) {
                    l_indexOfMinimumValue = l_indexToTest;
                }
            }
            int l_minimumValue = io_array[l_indexOfMinimumValue];
            io_array[l_indexOfMinimumValue] = io_array[l_indexToSwap];
            io_array[l_indexToSwap] = l_minimumValue;
        }
    }

 CD-ROM  This straightforward algorithm is easy to read and understand. However, a much more efficient sorting algorithm is the heap sort algorithm, also included in Source/Examples/Chapter1/readability.cpp on the companion CD-ROM:

    void sift_down(int *io_array, unsigned int i_size,
                   int i_value, unsigned int i_index1,
                   unsigned int i_index2)
    {
        while(i_index2 <= i_size - 1) {
            if((i_index2 < i_size - 1) &&
               (io_array[i_index2] < io_array[i_index2 + 1])) {
                ++i_index2;
            }
            if(i_value < io_array[i_index2]) {
                io_array[i_index1] = io_array[i_index2];
                i_index1 = i_index2;
                i_index2 *= 2;
            } else {
                break;
            }
        }
        io_array[i_index1] = i_value;
    }

    void heap_sort(int *io_array, unsigned int i_size)
    {
        if(i_size < 2) {
            return;
        }
        for(unsigned int l_hire = i_size / 2;
            l_hire > 0; --l_hire) {
            sift_down(io_array, i_size,
                      io_array[l_hire - 1],
                      l_hire - 1,
                      (2 * l_hire) - 1);
        }
        for(unsigned int l_retire = i_size - 1;
            l_retire > 1; --l_retire) {
            int l_value = io_array[l_retire];
            io_array[l_retire] = io_array[0];
            sift_down(io_array, l_retire,
                      l_value, 0, 1);
        }
        int l_swap = io_array[1];
        io_array[1] = io_array[0];
        io_array[0] = l_swap;
    }

While still not the most difficult to understand code, it will certainly take more time to parse than the simpler selection sort algorithm. This decrease in readability is only desirable if the performance gained is necessary.

One parting note, which we will discuss in more detail when we talk about NIH (Not-Invented-Here) Syndrome, is to avoid writing your own sorting algorithms in most cases. The sorting algorithms presented here were simply for example purposes; there is usually no reason to rewrite work that is likely to be freely available for almost every language.
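As a minimal sketch of that advice in C++, the standard library already provides a well-tested sort; the container and values here are chosen purely for illustration:

    #include <algorithm>
    #include <vector>

    int main()
    {
        std::vector<int> l_values;
        l_values.push_back(42);
        l_values.push_back(7);
        l_values.push_back(19);
        l_values.push_back(3);

        // std::sort is well tested and typically far better optimized
        // than a hand-rolled selection sort or heap sort.
        std::sort(l_values.begin(), l_values.end());

        return 0;
    }

Unless profiling later shows that sorting itself is a bottleneck, reusing the library call keeps the code short and readable.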

Dependency

Optimization must sometimes cross encapsulation boundaries, leading to greater dependency between classes or modules that can reduce the robustness of the code. These dependencies make changes more difficult by causing each change to affect more of the code base than would otherwise be necessary. In some languages, such as C++, these dependencies might also cause other unwanted problems such as longer compile times. As with readability, code with too many dependencies does not necessarily indicate optimized code; it can also result from poor coding practices.

Look for two or more units, such as classes or modules, that are overly dependent on each other. Does changing one of the code units require a change to the other code unit? If so, chances are that these units are tightly coupled, possibly for the purpose of optimization. There are several things to consider at this point. First, should these units actually be separate, or can they be incorporated into one cohesive unit? Depending on the design, having them separate might have caused encapsulated data to become exposed. This is one of the least desirable results of such a dependency and should be the first to be remedied, as we discuss in the “Cures” section. Second, if upon consideration it makes sense to have them as two separate units, does the dependency serve a useful purpose? This is where a mistake is often made; optimization alone is rarely a sufficient justification. Very rarely does the performance increase outweigh the increase in dependency and the resulting brittleness of the code.

To help illustrate such dependencies, let us provide a simple example using pseudo-code. The goal of the code is to call a launch function that will launch a missile at a target when the target is in range. This is a common occurrence in many games, including jet fighter simulations. The launch function takes a unit vector pointing in the direction the missile should launch, which would be defined something like this:

    launch(vector unit_direction); 

Now we can create the function that checks the range and calls launch when we are in range. This function requires the origin of the missile launcher, the target destination, and the range of the missile. The range check can be done by taking the dot product of the vector from the origin to the destination with itself, because the dot product of a vector with itself is the vector’s length squared. The range must also be squared to make the comparison valid. If the launch is a go, the vector from the origin to the destination must be normalized, or made into a unit vector, before it is passed to the launch function. This is accomplished by dividing the vector by its length, or the square root of the dot product of the vector with itself. This results in the following function:

    launch_when_in_range(vector origin,
                         vector destination,
                         real range)
    {
        if(dot_product(destination - origin,
                       destination - origin)
           is less than (range ^ 2)) {
            launch(normalize(destination - origin));
        }
    }

Because we have not optimized the function yet, it is easy to see that we can reuse the range check by placing it in a separate function:

    bool is_in_range(vector origin,
                     vector destination,
                     real range)
    {
        return whether
            dot_product(destination - origin,
                        destination - origin)
            is less than (range ^ 2);
    }

    void launch_when_in_range(vector origin,
                              vector destination,
                              real range)
    {
        if(is_in_range(origin, destination, range)) {
            launch(normalize(destination - origin));
        }
    }

Perhaps it turns out that this function is called often, so when we get to the optimization phase we decide to improve its performance. A simple observation indicates that we are computing the same values multiple times, so to improve the performance, we place them in temporary variables that can be reused throughout the function:

    void launch_when_in_range(vector origin,
                              vector destination,
                              real range)
    {
        vector origin_to_destination = destination - origin;
        real origin_to_destination_length_squared =
            dot_product(origin_to_destination,
                        origin_to_destination);
        if(origin_to_destination_length_squared
           is less than (range ^ 2)) {
            launch(origin_to_destination / square_root(
                origin_to_destination_length_squared));
        }
    }

This function has a much tighter coupling between its various parts, making it difficult to separate any part out for reuse. It is also more difficult to read than the two separate functions into which we factored the original version. If this version had been written early in development, it is likely that it would not have been refactored into the separate functions. This would have led to a duplication of the range check elsewhere, leading to code that is more difficult to maintain and modify. This example illustrates dependencies at the function level, but they can also occur at other levels, such as class or module interdependencies. Look carefully at all interdependencies to see if they represent premature optimization.

Constraints

Another indicator of premature optimization is overly constrained data input, leading to an inflexibility that will not allow changes later in development. If a more general solution is evident, then it is likely that the code was written to be optimized rather than robust. As with other signs of optimization, it is possible to choose an algorithm that exhibits too many constraints without a performance gain. However, this is generally less common with this particular symptom. Worse than simply rejecting unexpected inputs outright, the usual result is an algorithm that accepts the inputs but performs worse than the more generic algorithm that does not constrain the data.

Let us illustrate this with a simple example involving hash tables. A hash table is an indexed array where the index is determined by applying a hash algorithm to a key from a large range to obtain an index within the range of the array. The key focus of this example will be the hash algorithm, but to see its effect on performance, you must understand the other parts of the algorithm. One of the results of taking the keys from a larger range and converting them to the smaller range of the indices is the possibility of a conflict occurring. There are several methods of resolving these conflicts. For this example, we will assume that the data set is smaller than the array, so we can simply perform a linear search from the initial hash position until we find a valid spot. For insertion, a valid spot is an empty spot; for searching, a valid spot is one that contains an exact match for the search key. Here are the insert and search functions:

    insert(string key, data)
    {
        base_index = index = hash(key);
        while array[index] is not empty
        {
            index = (index + 1) modulus array_size;
            // Assume data fits in array, if the index
            // wraps around to the original index the
            // array is filled and our assumption has
            // been violated.
            assert(index is not base_index);
        }
        array[index] = data;
    }

    search(string key)
    {
        base_index = index = hash(key);
        while array[index].key does not match key
        {
            if array[index] is empty
                return item_not_found;
            index = (index + 1) modulus array_size;
            if index matches base_index
                return item_not_found;
        }
        return array[index];
    }

Notice that the more keys that hash to the same location, the longer insertion and searching will take for those keys. Thus, it is important to ensure that the keys in the input data set hash to different locations as often as possible. Take the following set of keys:

    Ada Hawkins
    Brian Hawkins
    Chester Hawkins
    Debra Hawkins
    Kathleen Hawkins
    Loreena Hawkins
    Robert Hawkins
    Trip Hawkins

Given this set of keys, you might be tempted to write a hash algorithm that performs well with the given input:

    hash(string key)
    {
        return key[0] modulus array_size;
    }

This hash will perform well with the given set of keys, which follows the constraint that most or all of the keys start with different letters. Now imagine the set of keys changed to the following:

    Hawkins, Ada
    Hawkins, Brian
    Hawkins, Chester
    Hawkins, Debra
    Hawkins, Kathleen
    Hawkins, Loreena
    Hawkins, Robert
    Hawkins, Trip

Our hash algorithm will now generate the same index for every key, eliminating the advantage of using a hash table and reverting it to a simple linear search. The worst part about this type of premature optimization is that changes to external data after the application is complete can radically influence the application's performance. If instead the following hash algorithm [Sedgewick90] were used, the performance would not be optimal for the initial set of keys:

    integer hash(string key, integer array_size)
    {
        hash = 0;
        for each character in key
            hash = ((64 * hash) + character) modulus array_size;
        return hash;
    }

However, any change to the keys would result in little change in performance, making the algorithm's performance more predictable and higher on average than many of the other algorithms. If the performance improvements do become necessary toward the end of development during the optimization stage, be sure to provide a warning or error for data sets that cause considerable performance loss. For example, the insert function might be rewritten like this:

    insert(string key, data)
    {
        count = 0;
        base_index = index = hash(key, array_size);
        while array[index] is not empty
        {
            index = (index + 1) modulus array_size;
            count = count + 1;
            if count is too large
                warn hash_algorithm_performance_reduced;
            // Assume data fits in array, if the index
            // wraps around to the original index the
            // array is filled and our assumption has
            // been violated.
            assert(index is not base_index);
        }
        array[index] = data;
    }

Performance Tradeoffs

At this point you might be thinking, “Well, sure these optimizations lead to code that is harder to read and modify, but aren’t the performance improvements worth the tradeoff?” The answer is “not usually.” The fastest way to understand why is to profile the entire application and find out what percentage of time is spent in each of the different algorithms and modules. On average, 80 percent of the time should be spent in 20 percent of the code, and conversely, 80 percent of the code is only called 20 percent of the time. For simplicity’s sake, imagine that every optimization took the same amount of development time, and for every optimization, you lose one feature. You should already notice that you would probably not want to make every optimization, but concentrate on the ones that affect the 20 percent of the code that is running 80 percent of the time. This is even before we consider when the optimization is made. Suppose also that for every month after an optimization is made, but before you ship, you lose one additional feature because of the difficulties caused by poor readability and code fragility. Obviously, you would want to wait until later in the project and make only those optimizations that are necessary to meet your performance goals. While we have simplified our description to make it easier to visualize, this is exactly the type of tradeoff that must be considered on real projects.

What all this really means is that you should consider the real performance benefits of an optimization against all the disadvantages of that same optimization before deciding to go ahead with it. Another of the considerations, and a reason to watch for optimizations that are made too early, is whether the optimization will even be there when the project is finished. The earlier you make an optimization, the more likely it will not be there by the end. A good way to see exactly how many real optimizations you are making is to keep a record of each optimization made. Then, at the end of the project, review the code to see how many of them are still in place. If a substantial number of these optimizations are no longer there, then you need to adjust your optimization policies for the next project.

start sidebar
Why Later Is Too Late

So, you have found this situation where it would be much more difficult to perform the optimization later, and you are sure that this is not just habit speaking. These situations do occur, but take a very careful look at the reasoning before going ahead with the optimization. Valid scenarios for early optimization do not come along often, so be sure there is plenty of evidence backing this decision or you will regret it later. The more likely reason for difficulty in making optimizations later is poor programming practices. Spaghetti code, or code where every part is dependent on every other part, is one of the most common problems that makes optimizations difficult. However, there are plenty of other reasons.

The other reason to be suspicious of this claim stems from the turbulent nature of software development. Change is common during programming, and even if the assumption that a portion of the code will be difficult to optimize later is correct, that code might be called only occasionally or not at all by the end of the project. If this is the case and the optimization was made anyway, programmer time was lost on a needless task that could have been avoided. It is often better to trade the risk of having to do the optimization later for the guaranteed loss of time and code complexity of making the optimization up front.

end sidebar

False Optimizations: The Optimization that Wasn’t

Programmers are almost always very poor judges of what optimizations will actually make the application run substantially faster. This has caught most programmers repeatedly. However, there is something even worse than wasting time on small optimizations; some supposed optimizations actually reduce the performance of the program. If you profile and the application is running slowly during an algorithm that is theoretically fast, then it is time to look at some practical issues that can cause this. A very important warning, however, is to save this process for as near the end of development as possible in order to maintain readability throughout development. In addition, small changes can easily affect these performance times, so early work might be wasted.

Among the more common causes of false optimizations are cache misses and page faults. The cache is a small piece of very fast memory located directly on the main processor (Figure 1.1). A cache miss occurs when the algorithm either generates a sequence of machine instructions that cannot be held in the instruction cache in its entirety, or accesses data that cannot all be held in the data cache. These cache misses can take longer than the extra processing required for a theoretically slower algorithm, resulting in better real performance from the slower algorithm. It is difficult to predict when this will occur, and small changes can easily produce very different results. This makes it important to save any attempt at this type of optimization until the very end of development.

Figure 1.1: Data and instruction caches are located on the main processor, but must access memory when a cache miss occurs.
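To make the idea of data cache misses more concrete, here is a minimal C++ sketch; the array dimensions are arbitrary, and actual timings depend entirely on the hardware, but the two functions do identical work while touching memory in very different orders:

    #include <cstddef>

    const std::size_t k_rows = 1024;
    const std::size_t k_columns = 1024;
    static double s_grid[k_rows][k_columns];

    double sum_row_major()
    {
        double l_total = 0.0;
        // Visits elements in the order they are laid out in memory,
        // so each cache line fetched is fully used before moving on.
        for(std::size_t l_row = 0; l_row < k_rows; ++l_row)
            for(std::size_t l_column = 0; l_column < k_columns; ++l_column)
                l_total += s_grid[l_row][l_column];
        return l_total;
    }

    double sum_column_major()
    {
        double l_total = 0.0;
        // Jumps a full row ahead on every access, which can evict
        // cache lines before their remaining elements are used.
        for(std::size_t l_column = 0; l_column < k_columns; ++l_column)
            for(std::size_t l_row = 0; l_row < k_rows; ++l_row)
                l_total += s_grid[l_row][l_column];
        return l_total;
    }

On typical hardware the second loop generates far more cache misses, even though both loops perform exactly the same arithmetic; only profiling on the target platform can tell you how much this matters for your application.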

Page faults result from a similar set of circumstances. With most computers using virtual memory these days, data is often paged to disk when it does not all fit in physical memory at once. What this means is that an algorithm that uses a substantially larger amount of memory for performance improvements could be hit by slow I/O times (Figure 1.2). Again, this results in a theoretically faster algorithm achieving much slower performance in real applications.

Figure 1.2: Data directly on the main processor is accessed the fastest, while data accessed from memory because of a cache miss is slightly slower, and data accessed from a hard disk or similar storage medium because of a page fault is substantially slower.

Another programming technology that is the cause of many false optimizations, and that continues to gain in popularity, is threading. Threading allows a process to execute multiple tasks at once, but complications arise when these tasks need to share data. This data sharing is one of the main differences between multitasking, running multiple processes simultaneously, and multithreading. Among other reasons, the number of multiprocessor systems continues to increase, which will further increase the use of multithreading. It is therefore very important to understand the impact of this technology on performance and optimization.

Multithreaded applications are particularly sensitive to timing issues and other hard-to-predict interactions between the various threads, making the behavior of the application much more difficult to understand and predict. Because of this, it is even more important to avoid complex algorithms and methods in places where threads interact. Look for excessive data sharing and method calls between threads to find these problem areas. In addition, look for common single-threaded optimizations that are made without much thought, as these can often have the opposite result on multithreaded applications. A common example of this is the copy-on-write (COW) optimization for the standard template library string class that is popular with many implementers (Figure 1.3).

Figure 1.3: The left side shows a sample sequence using a copy-on-write (COW) string, while the right side shows the same sample sequence using a string that always performs a copy. Because the COW string shares data between threads, a lock must be performed for every copy, including local copies. This results in a longer running time for the overall application using the COW string.
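The following is a minimal sketch of the situation in Figure 1.3, not the code of any actual STL implementation; the class and member names are invented, and the mutex simply stands in for whatever synchronization a shared reference count requires:

    #include <mutex>
    #include <string>

    // COW-style string: copies share one buffer and a reference count,
    // so even a purely local copy must take the shared lock.
    class cow_string
    {
    public:
        explicit cow_string(const std::string &i_text) :
            m_shared(new shared_data(i_text)) {}

        cow_string(const cow_string &i_other) :
            m_shared(i_other.m_shared)
        {
            std::lock_guard<std::mutex> l_guard(m_shared->m_lock);
            ++m_shared->m_references;
        }

        // ... destructor and copy-before-write operations omitted;
        // they must also synchronize on the shared lock.

    private:
        struct shared_data
        {
            explicit shared_data(const std::string &i_text) :
                m_text(i_text), m_references(1) {}
            std::string m_text;
            int         m_references;
            std::mutex  m_lock;
        };
        shared_data *m_shared;
    };

    // Always-copy string: each object owns its own buffer, so copying
    // requires no synchronization at all.
    class plain_string
    {
    public:
        explicit plain_string(const std::string &i_text) :
            m_text(i_text) {}
        // The compiler-generated copy constructor performs a deep copy.
    private:
        std::string m_text;
    };

When copies are frequent and buffers are small, the synchronization paid on every copy of the shared version can easily cost more than the copies it was meant to avoid.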

Several more false optimizations occur because of different system architectures and modern optimized compilers. These should only be dealt with toward the end of a project, but it is important to recognize areas where they might occur in order to take preventative measures that will assist these later optimizations. We will go into more detail about how to prepare your code when we talk about prevention.

start sidebar
Why That Has to Be Slow

One of the most common fallacies that programmers routinely entertain is that we know what will be optimal. Repeatedly, the same guesses about what is slow and how it could be fixed are made. Very few of these guesses have been correct. When it comes to the topic of optimization, intuition usually fails. Despite the fact that we are working with machines that are supposed to be deterministic, the best word to describe them is quirky. Computers often behave in ways we don’t expect, and the large software programs and complex operating systems of the modern age interact to form emergent behavior that is often unexpected. It does not help that hardware designers use all types of unusual tricks to improve performance. Even worse are the hardware design decisions that are made to improve benchmark performance, which then become hindrances to more practical applications.

This belief also leads us to one of the other major illnesses, NIH Syndrome, when the programmer falls into two incorrect assumptions. The first is that only you can possibly know what is optimal for your current situation, and that code written by someone else could not possibly fit your needs. The other is that all the code must be optimal. In fact, once you are convinced that the second assumption is false, the first should be of no concern.

One particular project decided to avoid the use of the Standard Template Library (STL) in part due to performance concerns. Although we will go into more detail about the problems that arose from this decision later in Chapter 2, “NIH Syndrome,” it is relevant to mention its adverse effect on performance in this section. By eschewing the use of the STL, no standard containers were available for use in storing collections of objects. An in-house list class was created to fill this need, but it supported only a small subset of the functionality available in the STL. Because of this missing support, most uses of the list class simply iterated through the entire list. In the end, an attempt to avoid performance problems from a well-tested standard library led to even more performance problems along with a host of other disadvantages.

end sidebar

Caching

 CD-ROM  One common optimization that is often misunderstood and, consequently, overused is caching of values that can be recomputed from already stored data. Do not fall for the temptation of caching on the first pass, but wait for the optimization phase of the project to determine which values would benefit from caching. In order to appreciate why this is important, let us examine an example of value caching and the consequences of this technique. Code for this example is also included in Source/Examples/Chapter1/cache_opt.h on the companion CD-ROM.

Starting with a simple rectangle C++ class:

    class rectangle
    {
    public:
        rectangle(double i_length, double i_width) :
            m_length(i_length),
            m_width(i_width)
        {
        }

        void set_length(double i_length)
        { m_length = i_length; }
        void set_width(double i_width)
        { m_width = i_width; }

        double get_length() { return(m_length); }
        double get_width() { return(m_width); }

    private:
        double m_length, m_width;
    };

We then want to add a member function to obtain the area of this rectangle:

    class rectangle
    {
    public:
        // ...
        double get_area()
        {
            return(m_length * m_width);
        }
        // ...
    };

Now, to avoid making the calculation for area every time the area is requested, we can store the area in the class. There are two common ways this can be done, the first of which involves computing the area any time the length or width of the rectangle changes:

    class rectangle
    {
    public:
        rectangle(double i_length, double i_width) :
            m_length(i_length),
            m_width(i_width)
        {
            update_area();
        }

        void set_length(double i_length)
        {
            m_length = i_length;
            update_area();
        }

        void set_width(double i_width)
        {
            m_width = i_width;
            update_area();
        }

        // ...
        double get_area() { return(m_area); }

    private:
        void update_area()
        {
            m_area = m_length * m_width;
        }

        double m_area;
        // ...
    };

Another familiar method is lazy evaluation, or waiting for the area to be requested, which can be done with some minor modifications to the previous class:

    class rectangle
    {
    public:
        rectangle(double i_length, double i_width) :
            m_length(i_length),
            m_width(i_width)
        {
            area_needs_update();
        }

        void set_length(double i_length)
        {
            m_length = i_length;
            area_needs_update();
        }

        void set_width(double i_width)
        {
            m_width = i_width;
            area_needs_update();
        }

        // ...
        double get_area()
        {
            if(m_area_needs_update) {
                update_area();
            }
            return(m_area);
        }

    private:
        void area_needs_update()
        {
            m_area_needs_update = true;
        }

        void update_area()
        {
            m_area = m_length * m_width;
            m_area_needs_update = false;
        }

        bool m_area_needs_update;
        // ...
    };

Lazy evaluation is not very effective in this case for a number of reasons, but it is shown because it is common and can often be useful in many other more complex cases. The primary reason for this ineffectiveness is the simplicity of the calculation versus the overhead in doing lazy evaluation. Trading off computation time in one form for another form that appears to be faster but is not is a common mistake.

The benefits of caching are obvious in this example; the get_area() function need only return a member value with no further calculations. What might be less obvious are the drawbacks to this method, which lead us to the assertion that it should only be used during the optimization stage. The most evident drawback is the added complexity to the rectangle class, making it less readable and therefore harder to modify. This alone should be argument enough for using the simpler form until near the end of development.

If that is not convincing enough, then consider the other consequences of caching data, primary among which is the extra memory requirements. The computed value must be stored, and if lazy evaluation is used, an additional flag must be stored to indicate when the stored value requires updating. This is particularly important on small systems that have memory limitations, but can even have an effect on larger systems due to a variety of possible slowdowns involving processor caches, memory transfers, boundary problems, and other unexpected problems.

While the problems with memory could lead to false optimizations, using caching can guarantee a false optimization if the usage pattern of the data or class is not taken into account. To illustrate what this means, let us imagine using the rectangle class in a real-world application for editing the layout of rooms in a house. The user can change the size of any room interactively by dragging the sides or corners, and the other rooms adjust by shrinking or expanding. While this is happening, the set_length() and set_width() functions are being called multiple times for each user movement. Once the user is satisfied, he can print out a report with important information such as the area of each room. During this phase, get_area() is called once for each room. In this scenario, caching will cause extra work during the interactive phase and will only slightly speed up the report generation. Overall, this will lead to lower performance and an unsatisfied user. Consider the usage pattern before caching data even during the optimization phase, and always profile to determine the effects.

At this point, you might notice that lazy evaluation could help with this problem, since it would only introduce one extra instruction for each set call. As with all optimizations, this should still be avoided until the optimization stage. After all, those extra instructions are still not necessary in the case given, and the overhead of the extra if branch instruction can be large on many architectures. This does lead to an interesting point about developing general-purpose libraries, which is the one case in which it might be useful to perform optimizations that are not tailored to a specific usage. There is no way to know the exact usage of the library in future applications, so optimizations must be carried out here as well. However, this should still be done near the end of the project, which in this case would be library deployment. Later, we will discuss how to approach this type of optimization in a slightly different manner than ordinary applications.

Since caching represents a very common symptom of premature optimization, it is important to always be on the lookout for values that can be computed instead of stored. A similar type of value that can appear in object-oriented languages is the data member used as a temporary. This is done to eliminate the need to pass the value between functions within the class or through other classes. It should be avoided when possible, as it leads to many of the same problems as caching, which in essence it is.

start sidebar
Why Bad Habits Are Hard to Break

Back when memory was scarce and CPU speeds were measured in hertz instead of megahertz, it was necessary to ensure that every bit of code was efficient and small. Things have changed, but many of the lessons learned are still ingrained in programmers and in the teachings of the current day. They are like habits that we cannot seem to break, and these habits are no longer merely obsolete; they are detrimental to modern software development. Even when we try to break free of these habits, we are often scared back when some young upstart programmer makes a mess of a project while trying to use fancy new techniques. Unfortunately, there are still a lot of programmers out there blindly following this technique or that technique rather than determining which one applies to the current situation.

Watch out for colleagues and mentors espousing the idea that this method or that method will solve all the problems. Object-oriented programming will not make development simple. Functional programming will not make development simpler. Thinking is what makes development work, and considering all the factors in a situation is necessary to making things easier and faster. However, remember that mistakes are inevitable. The real task is to continue to improve development, and minimize the impact of mistakes and changes.

Speaking of mentors, they are very important to progressing in your abilities as a programmer. A good mentor can help you advance in both your abilities and your career at a rapid pace. You should choose your mentor wisely, looking for a senior programmer who is both a successful practical programmer and communicates well. One other important trait to look for in a mentor is a willingness to continue learning and discuss new ideas. A note to both mentor and student, always listen and learn from the other. Mentoring should be a two-way street, with both sides learning from each other.

Before you can help other programmers break their bad habits, you must first break yours. A useful start occurs just after you choose an algorithm for a programming task. Before you start coding, step back and consider your reasons for choosing a particular algorithm. The algorithm should meet the goals of the task while remaining simple to implement and modify. If you instead chose the algorithm with the highest performance, step back and take another look at the algorithms available. Is there one that is simpler to implement, or one that would make changes easier? Often, these will be the same algorithm, but either way you decide, always prefer the simpler choices to the algorithm with the highest performance until later during the optimization phase. Do not try to guess which algorithms should be optimized from the start; chances are you will be wrong and wasting your time.

end sidebar

Types of Optimization

Once beyond the general symptoms, it is important to understand the different varieties of optimizations. Any of these types can be performed before they are required, and learning to spot each type of optimization is essential to going from a general symptom to a determination that Premature Optimization is at work. Table 1.1 shows a comparison of the different optimization types. Let us look at each of these types in more detail.

Table 1.1: Comparison of Different Optimization Types

Type of        Description                                   Common       Platform   Language
Optimization                                                 Premature    Specific   Specific
                                                             Optimization

Low-level      Takes into account the assembly and machine   Yes          Yes        Yes
               code generated.

Algorithm      Choosing a more efficient algorithm,          Yes          No         No
               generally based on big O notation.

Memory         Deals with memory limitations of the target   No           Yes        No
               platform, including size, caching, and page
               faults.

Design         Changes that affect the overall application,  No           No         No
               including the end user's experience, but
               that improve performance.

Source code    Minimizes the amount of text used to write    Yes          No         Yes
               the source code.

Low-Level

Low-level optimizations are usually localized and are language and system specific. Another important trait of this type of optimization is its susceptibility to small changes. Such things as cache size and memory alignment, as well as other esoteric hardware and operating system specific factors, influence low-level optimizations. These types of optimizations are the most likely to be unintuitive and fragile and should usually be saved until all other optimizations have been performed.

An example of such a low-level optimization is removal of the loop invariant. Take this small snippet of C/C++ code:

    for(int index = 0; index < total; ++index)
    {
        volume[index] = length * width * height[index];
    }

Notice that length and width are not dependent on the loop variable. We can therefore remove the length and width multiplication from within the loop to avoid performing the computation on every iteration:

    float area = length * width;
    for(int index = 0; index < total; ++index)
    {
        volume[index] = area * height[index];
    }

This type of optimization gives large performance gains only if the code is called a noticeable percentage of the time. In addition, this type of optimization lends itself well to automation. Modern compilers already perform many low-level optimizations automatically, potentially making any work done in the source code unnecessary or even detrimental.

Algorithm

Algorithmic optimizations are the bread and butter of optimizations. They easily give you the most gain for your effort when it comes to performance improvements, and any programmer who is good at optimizations will find himself performing these types of optimizations first. Why waste time doing painful optimizations on some complex set of equations when you could greatly reduce or eliminate the calls to these equations by changing the higher-level algorithm used?

A simple illustration of an algorithmic optimization is a binary search versus a straightforward linear search. First, we look at the simple linear search that might be used initially:

    find(search_criteria)
    {
        current_item = item_at_head_of_list;
        while current_item is in list
        {
            if current_item matches search_criteria
                return current_item;
            current_item = item after current_item;
        }
        return item_not_found;
    }

Now let us look at the higher performance binary search algorithm:

    find(search_criteria)
    {
        left_index = 0;
        right_index = number_of_items;
        while left_index is not right_index
        {
            middle_index = (left_index + right_index) / 2;
            middle_item = item[middle_index];
            if middle_item matches search_criteria
                return item[middle_index];
            if middle_item is less than search_criteria
                left_index = middle_index + 1;
            if middle_item is greater than search_criteria
                right_index = middle_index;
        }
        return item_not_found;
    }

Since many algorithms are language independent, this example is presented in pseudo-code to reinforce the cross-platform nature of this type of optimization. There are several other important points that stem from these examples, related to the symptoms discussed earlier. Let us take them one at a time.

First, the second algorithm is longer and less readable than the first. In this case, it is only slightly longer, but the increase in the number of conditionals makes the logic harder to follow. Further optimizations to search algorithms, such as hash tables, can be even less readable.

Second, the second algorithm adds a number of constraints to the item container that is being searched. The container in the linear search could be unsorted, while the binary search requires that the items in the container be sorted. Additionally, the items in the linear search need only be accessible sequentially, whereas the binary search requires random access through an index. Even if the current context of this algorithm meets these requirements, this limits the types of changes that can be made to the usage of the algorithm in the future.

Finally, we must consider the context of the algorithm’s use. Notice that the binary search is longer, and therefore translates to more machine instructions than the linear search. If the number of items to search is small, this can lead to a false optimization that results in longer search times for the supposedly faster algorithm. For other algorithms, a similar phenomenon can occur with particular input sets.
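Since the crossover point between the two algorithms depends on the data set and the platform, the only reliable answer comes from measurement. The following hedged C++ sketch, with arbitrary sizes and iteration counts, times the standard library's linear and binary searches on a small sorted container:

    #include <algorithm>
    #include <cstdio>
    #include <ctime>
    #include <vector>

    int main()
    {
        std::vector<int> l_items;
        for(int i = 0; i < 8; ++i) {
            l_items.push_back(2 * i + 1);   // small, sorted data set
        }
        const int k_iterations = 1000000;
        bool l_found = false;

        // Note: a real measurement needs care to keep the compiler
        // from removing or hoisting these loops entirely.
        std::clock_t l_start = std::clock();
        for(int i = 0; i < k_iterations; ++i) {
            l_found = std::find(l_items.begin(), l_items.end(), 13)
                      != l_items.end();
        }
        std::clock_t l_linear = std::clock() - l_start;

        l_start = std::clock();
        for(int i = 0; i < k_iterations; ++i) {
            l_found = std::binary_search(l_items.begin(),
                                         l_items.end(), 13);
        }
        std::clock_t l_binary = std::clock() - l_start;

        std::printf("linear: %ld  binary: %ld\n",
                    (long)l_linear, (long)l_binary);
        return l_found ? 0 : 1;
    }

Whichever search wins on your data, the point is that the decision should be made from a measurement like this rather than from intuition about which algorithm is theoretically faster.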

start sidebar
Why That Nifty New Algorithm

Programmers must constantly solve difficult problems every day, so it is not surprising that the people most likely to take up the profession are the ones who enjoy finding solutions to complex problems. However, this can lead to a problem when faced with situations in which there is only a simple problem or no problem at all to solve. Faced with this, it is often tempting to invent problems that do not exist. This phenomenon is common and problematic enough to deserve its own minor illness, but it is introduced here because one of the common results is a desire to find a more optimal algorithm for code that is called less than 1 percent of the time. This brings along with it all the disadvantages of optimization with no noticeable performance gain.

Exacerbating this problem is the desire to try out a new algorithm or technique that the programmer has just heard about. This can have even worse effects than simply looking for the most optimal algorithm, because the programmer will often try to force an algorithm into a situation where it is not appropriate, thus resulting in a false optimization. Always consider carefully whether the algorithm fits all the requirements of a given situation, especially future changes. As a programmer, you must always be learning new techniques and algorithms, but be particularly vigilant when you go to use an algorithm you have just learned.

 CD-ROM  One such nifty algorithm is ternary string search [Bentley98], which provides fast insertion and search times for looking up values based on a string key. With the proper data set, it also has a small memory footprint. Upon encountering this algorithm, one project decided to use it because they expected to need the performance benefits offered. Since only insertion and search were needed initially, a C++ implementation, included in Source/Examples/Chapter1/nifty.h on the companion CD-ROM and similar to the following, was used:

    template <typename t_DataType> class t_StringMap
    {
    private:
        class t_MapNode
        {
        public:
            t_MapNode() : m_character(0),
                m_left(0), m_middle(0),
                m_right(0) {}

            char m_character;
            t_MapNode *m_left;
            union {
                t_MapNode *m_middle;
                t_DataType *m_data;
            };
            t_MapNode *m_right;
        };

    public:
        t_StringMap() : m_root(0) { }

        bool insert(const char *i_string,
                    t_DataType *i_data)
        {
            t_MapNode *l_node = 0;
            t_MapNode **l_nodePointer = &m_root;

            // Search through until we find an
            // empty spot.  The empty spot will
            // be stored in l_nodePointer so it
            // can be filled in the subsequent
            // section.
            while((l_node = *l_nodePointer) != 0)
            {
                if(*i_string < l_node->m_character)
                {
                    l_nodePointer = &(l_node->m_left);
                } else
                if(*i_string > l_node->m_character)
                {
                    l_nodePointer = &(l_node->m_right);
                } else {
                    // If we reach the end of the
                    // string, then this string
                    // is already in the list.
                    if(*i_string++ == 0) {
                        l_node->m_data = i_data;
                        return(true);
                    }
                    l_nodePointer = &(l_node->m_middle);
                }
            }

            // Insert the new characters,
            // starting at the pointer we
            // found and continuing down
            // the middle.
            for(;;) {
                *l_nodePointer = l_node = new t_MapNode;
                if(!l_node) {
                    return(false);
                }
                l_node->m_character = *i_string;
                // If this is the end of the
                // string, set the data and
                // break out of our loop.
                if(*i_string++ == 0) {
                    l_node->m_data = i_data;
                    break;
                }
                l_nodePointer = &(l_node->m_middle);
            }
            return(true);
        }

        t_DataType *find(const char *i_string) const
        {
            t_MapNode *l_node = m_root;

            // Continue until we run out of
            // nodes to search or we find
            // what we are looking for.
            while(l_node) {
                if(*i_string < l_node->m_character)
                {
                    l_node = l_node->m_left;
                } else
                if(*i_string > l_node->m_character)
                {
                    l_node = l_node->m_right;
                } else {
                    if(*i_string++ == 0) {
                        return(l_node->m_data);
                    }
                    l_node = l_node->m_middle;
                }
            }

            // Match not found.
            return(0);
        }

    private:
        t_MapNode *m_root;
    };

Already this is a little complicated to read and understand, but it did perform well and for a long time no changes were needed. However, later during development it became necessary to perform other operations on the key strings that required parsing the entire set. To do this required either the use of recursion or a stack, as this example function shows:

    template <typename t_DataType> class t_StringMap
    {
        // ...
        void print(t_MapNode *i_node = 0,
                   char *i_string = 0,
                   unsigned int i_index = 0)
        {
            char l_string[256];

            if(!i_node) {
                if(!m_root) {
                    std::cout << "t_StringMap: ***  Empty  ***"
                              << std::endl;
                    return;
                }
                std::cout << "t_StringMap: --- Strings ---"
                          << std::endl;
                l_string[0] = 0;
                i_node = m_root;
                i_string = l_string;
            }

            if(i_node->m_left) {
                print(i_node->m_left, i_string, i_index);
            }

            if(i_node->m_character) {
                i_string[i_index] = i_node->m_character;
                i_string[++i_index] = 0;
                print(i_node->m_middle, i_string, i_index);
                i_string[--i_index] = 0;
            } else {
                std::cout << i_string << std::endl;
            }

            if(i_node->m_right) {
                print(i_node->m_right, i_string, i_index);
            }
        }
        // ...
    };

In this case, the desire to prematurely optimize made later changes more difficult. This used up precious programmer time that could have been spent on problems that were more important. The proper solution in this case would have been to use the containers in the standard template library, avoiding both Premature Optimization and NIH Syndrome, which we will discuss later.
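As a hedged sketch of that alternative, std::map already provides insertion, lookup, and ordered traversal for string keys, so the later requirement to walk every key becomes a simple loop; the data type and keys below are purely illustrative:

    #include <iostream>
    #include <map>
    #include <string>

    struct t_Data { int m_value; };

    int main()
    {
        std::map<std::string, t_Data *> l_map;
        t_Data l_first = { 1 };
        t_Data l_second = { 2 };

        // Insertion and lookup come for free with the standard container.
        l_map["Hawkins, Ada"] = &l_first;
        l_map["Hawkins, Brian"] = &l_second;
        t_Data *l_found = l_map["Hawkins, Ada"];

        // The later "print every key" requirement is an ordinary
        // iteration rather than a recursive tree walk.
        for(std::map<std::string, t_Data *>::iterator l_it = l_map.begin();
            l_it != l_map.end(); ++l_it) {
            std::cout << l_it->first << std::endl;
        }
        return l_found ? 0 : 1;
    }

If profiling later showed that std::map was genuinely too slow, a specialized structure such as a ternary search tree could still be swapped in at that point, with the usage patterns known.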

end sidebar

Memory

On architectures with low amounts of memory, such as organizers and hand-held games, it might also be necessary to optimize the memory footprint of the application. Memory optimizations might also be necessary due to other hardware architecture quirks, such as alignment or transfer size requirements. As with other optimizations, these should not be performed until later in the development cycle if possible. However, there is a greater chance on small memory platforms that the application might exceed the memory requirements before the project is near completion. Just as you would optimize the largest CPU consumers when dealing with performance, address the largest memory users first until there is sufficient room to continue development. Do not overdo the number of optimizations performed; save the remainder for later in development. Furthermore, if the memory improvements are purely for performance optimizations and not because the application is running out of memory, save them for near the end of development in the optimization phase.

While some memory optimizations can actually improve performance, most tend to involve a tradeoff between memory and performance. For example, take the decision between using bit-fields or booleans in C++ to represent boolean values in a class or structure. A bit-field uses one bit per value, although the minimum storage allocated is usually the bit size of an integer. Thus, the number of bits wasted is only the integer bit size minus the total number of bits used. Booleans, on the other hand, usually use a full integer for each value, wasting the number of values multiplied by the integer bit size minus one. If there is more than one boolean value to represent, the bit-fields will use less memory; more values mean more memory saved. On most machines, setting and clearing bit-field flags and boolean flags amount to the same performance, requiring one operation each. However, on most architectures, testing a bit-field requires two operations, whereas testing a boolean requires only one (Figure 1.4, A and B). The reason is that the bit-field must be extracted from the rest of the bit-fields before it can be tested to see if it is true or false. Since a boolean value's primary purpose is to be tested for its value, this leads to poorer performance when using bit-fields.

Figure 1.4: A) Example of the number of extra bits that go unused if a bit-field is not used. B) Example of the extra computation required when testing a bit-field.
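A minimal sketch of this tradeoff follows; the flag names are invented for illustration, and the exact instruction counts depend on the compiler and architecture, but the packed version generally requires a mask or shift before a flag can be tested:

    // Packed: all three flags share one machine word.
    struct packed_flags
    {
        unsigned int m_visible  : 1;
        unsigned int m_selected : 1;
        unsigned int m_dirty    : 1;
    };

    // Unpacked: each flag has its own member, trading space for
    // direct access.
    struct plain_flags
    {
        bool m_visible;
        bool m_selected;
        bool m_dirty;
    };

    bool test_packed(const packed_flags &i_flags)
    {
        // The compiler must extract the bit (mask and possibly shift)
        // before the test can be made.
        return i_flags.m_selected != 0;
    }

    bool test_plain(const plain_flags &i_flags)
    {
        // The value can be loaded and tested directly.
        return i_flags.m_selected;
    }

As always, whether the memory saved by the packed version is worth the extra work per test is a question for profiling on the target platform, not something to decide up front.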

This is a good place to point out that there are other types of optimization outside of performance and memory, such as network transfer speed. Although performance and memory are the most common types, these other types are important in certain cases and might require similar techniques and tradeoffs to those of performance optimization. For example, memory differences in the data that is sent over the network typically have a much greater impact than in other cases. Storing boolean values as bit-fields would make sense if the full set of boolean values was to be transferred over the network on a regular basis. However, as with many exceptions, there is an exception to the exception. If the bit-fields are only sent occasionally, it might be better to interact with them locally as C++ booleans and convert them only when network transfer is needed. This complexity in the decision process about performance further emphasizes the need to wait until the code is solid at the end of development before optimizing, and to always test the optimizations for resulting performance changes.

Design

Changes in the application's design to improve performance can often be difficult to decide upon, but once the decision is reached it is usually easy to implement them, since it often involves cutting or not implementing features. Unlike the rest of the optimization types, this optimization is rarely performed prematurely. Sometimes this optimization is not even performed when it is necessary, leading to missed deadlines, extra expense, and in some cases project cancellation. Therefore, you will rarely see this type as a premature optimization, and instead should look for the opposite case of resistance to changing or removing features when approaching the end of development.

start sidebar
Why Past Mistakes

One common occurrence that leads to our assumption that we know what is slow and how to take care of it early is our attempt to learn from past mistakes. Learning from failures is a very important part of the learning process, but this requires that we be able to distinguish the mistakes from events that are unpleasant but also unavoidable. Too often, after several projects where a large amount of optimization work had to be done at the end and required overtime and missed milestones, a programmer will decide that the appropriate way to handle this is to perform the optimizations as early as possible. This is the wrong way to handle this situation.

The problem was not that the optimizations were done at the end of the project, but in the method with which these optimizations were handled. Later in the “Prevention” section, we will talk about how to prepare for later optimizations, but not to make them until necessary. Otherwise, optimizing will actually take more time rather than less. The other important step that should have been taken is to schedule time for these optimizations. This is the proper way to avoid falling behind on milestones and consequently the project finish date.

While this might all seem counterintuitive, first-hand experience has shown otherwise. A consultant once joined a project in its early stages. On all his previous projects, he had been brought in during the later stages of the project to assist in optimization and debugging. Convinced that one of the main reasons optimization was so difficult on those projects was that it had been put off until the end, rather than poor coding practices, he decided to lay the groundwork for optimizing early and often in this new project. The result was a body of code that was difficult to modify but easy to break, and an extra programmer was brought in at the end to help with optimization anyway.

end sidebar

Source Code

Source code optimizations do not improve application performance, but in theory save the programmer some typing. This type of optimization should never be done, but unfortunately, it is encountered all too often. There was once a day when this type of optimization was necessary because of the extremely limited memory and poor tools available on early computers. However, unless this book has been sent back through time, you should not be performing these optimizations anymore.

An extreme example of this type of optimization follows:

    void hs(int *a, unsigned int n)
    {
        typedef unsigned int ui;
        if(n<2)return;
        for(ui h=n/2-1,r=n;;){
        int v=r<n?a[r]:a[h];a[r<n?r:0]=a[0];
        ui i=h,j=2*h+1;while(j<r){
        if(j<r-1&&a[j]<a[j+1])++j;
        if(v<a[j]){a[i]=a[j];i=(j*=2)/2;}
        else break;}
        a[i]=v;if((h?--h,r:--r)==1)break;}
        a[0]^=a[1]^=a[0]^=a[1];
    }

While you will never see anything this bad in real applications, it illustrates the difficulty this type of optimization can cause for readability and the ability to modify code. This function performs approximately the same operation as the heap sort function given earlier in the discussion of readability, only now it is difficult to read not because it is a performance optimization, but because the source text has been minimized.

There is one use for this form of optimization: obfuscation contests. Most major languages have some form of this competition, and it can be quite an interesting learning experience to write one of these obfuscated contest entries. To achieve the necessary artistry in your entry, you must explore the details of the language and push it in directions in which it would normally not be used. There are even lessons to be learned that can be applied to real applications; just do not bring the obfuscated coding style along with them.


