8.1 The Two Critical Approaches


There are two exceptionally valuable approaches to improving the performance of your applications. The first is to write refactored code. Refactoring is a practice that originated in the Smalltalk and academic programming communities; it emphasizes clear, concise programming backed by automated testing. When code has been thoroughly refactored, it is easy to understand and much easier to modify for future uses. Unfortunately, refactoring requires a great deal of discipline on the part of the software developer. The definitive reference on this technique is Refactoring: Improving the Design of Existing Code by Martin Fowler et al. (Addison-Wesley).

The second approach is one of algorithmic choice. Although a thorough discussion of algorithmic analysis is well beyond the scope of this text, it's important to realize that there are many different approaches to the same problem, and some are much faster than others. As an example, we'll look at different approaches to searching a string for a specific pattern.

Typically, algorithmic replacement leads to the greatest speedups in application code. Writing clear, refactored code will make the process of finding bottlenecks and implementing algorithmic changes simpler.

You could say these two basic principles boil down to "write good code using fast algorithms." This is true, but my experiences have shown that people often write bad code using slow algorithms. There are a variety of reasons why this occurs:

  • Writing bad code is much faster than writing good code. Good software engineering practices often trade an increase in initial development time for a decrease in the time required to maintain the software by eliminating bugs in the design phase, or for a decrease in application runtime by integrating performance analysis into the development process. When faced with a deadline and irate management, it's often easier to just get it done, rather than do it well.

  • Implementing good algorithms is technically challenging. As we'll see in our string searching example, implementing a naive search algorithm is much easier than implementing a more complex system. When faced with management's desire for a product yesterday and the time required to implement such an algorithm (and, indeed, to design one if there is not one readily available), often the good algorithms lose out.

  • Good software engineers are difficult to find. The traits that make someone a "good" software engineer include, but aren't limited to, the traits that make someone good at writing source code. Software engineering involves architecture development, soliciting support from other development organizations, formal code review, ongoing documentation, version control, and typically a few well thought-out processes to hold the engineering work together. Concepts like pair programming are just beginning to move from the academic world into the mainstream, and they aren't heavily practiced. Truly excellent software engineers tend to command very high salaries, so it's hard to get a lot of them together.

  • It's quite rare that a group of engineers sit down to develop a piece of software from scratch. More likely, they're adding onto an existing framework, or developing the next version of the current product. The mistakes have already been made, and it's not time- or cost-effective to throw everything away and start over.

There are basically two kinds of performance improvements that come from modifying application source code: evolutionary and revolutionary. An evolutionary change is one that is accomplished without significantly altering the algorithms that form the foundation of the application. For example, changing a multithreaded application's locking strategies to relieve contention on a few key locks is typically an evolutionary change. Revolutionary change comes about from changing the underlying algorithm. The difference is that evolutionary changes tend to give you speedups of at most a few times (exactly how much depends on how the original application was written), whereas revolutionary changes can improve application performance by orders of magnitude. Revolutionary changes also tend to be much harder to implement.

Clearly, algorithmic change has a huge potential for performance improvement. So, how does refactoring fit into the sketch we've drawn of software development? The basic idea here is whenever you write a line of code, it should be refactored. If you are rewriting pieces of legacy code, those should be refactored as well. Refactoring takes a lot of time and understanding at first, but it rapidly becomes second nature. As the code becomes more highly refactored, it becomes much easier to identify and implement performance improvements. Because the code is clear, compact, and easy to understand, interpreting profiling output becomes straightforward. Implementing algorithmic change becomes vastly easier and much faster. I can't overemphasize the importance of refactoring in application development and tuning. With that philosophy out of the way, then, let's move on to a discussion of string searching as an illustration of the impact of algorithmic selection.

8.1.1 String Searching Algorithms

Let's say that we have a string of text, perhaps "Here is a piece of text which we wish to search." We want to find the word "wish" in this string. Henceforth, we'll call the string we are searching S, or the source string, and the string we are searching for T, or the target string. How can we solve this problem? Let's look at a few different approaches.

8.1.1.1 Algorithm 1: naive searching

As a first approach, let's look for T in S by comparing the first character in T to the first character in S, and then following these rules:

  1. We find a mismatch. This means there is no match given the current alignment of the strings (i.e., where we first started comparing). Move T one space further along in S, and start comparing again. [1]

    [1] This is actually a slightly improved variation of the "truly naive" search algorithm. In the truly naive algorithm, we would continue until we had run out of characters in T, regardless of whether we'd found a mismatch halfway through. This process of aborting as soon as we find a mismatch is called short-circuiting. Actual implementations of the naive algorithm almost always incorporate short-circuiting.

  2. We find a match. This is a potential match, so we need to keep searching; compare the next characters in T and S.

  3. We have used up all the characters in T . We have found a match!

  4. We have used up all the characters in S . No match exists in the string.

Computational Complexity and Big O Notation

One approach to measuring the complexity of an algorithm is to use what is called the asymptotic complexity measure. This means that we do not count the exact number of steps the algorithm requires, but rather consider how that number grows with the size of the inputs to the algorithm. This asymptotic complexity is written using big O notation (the "O" comes from "order").

For example, let's consider this very simple line of code:

 a = 3; b = b * 5; c = c + 1;

This piece of code takes a constant amount of time, which we write as O(1). In terms of actual instruction time, it may be difficult to say exactly how long it takes, but whatever it is, it should be the same whenever this code is executed.

Let's consider a simple loop structure:

 for (i = 0; i < n; i++) {
      j[i] = j[i] + 1;
 }

This loop will run precisely n times, and because the inside of the loop takes a constant amount of time, the total running time is linearly proportional to n. We write it as O(n). The exact running time might be something like 9n nanoseconds; perhaps it's even 9n + 6 nanoseconds, because the loop needs some time to start up. The big O notation allows for both a multiplication factor (like 9) and an additive factor (like 6). As long as the running time is linearly proportional to n, the correct notation is O(n).

Let's look at one more loop, a more complex one:

 for (i = 0; i < n; i++) {
      for (j = 0; j < n; j++) {
           k[i][j] = k[i][j] + 1;
      }
 }

The outer loop executes n times, and the inner loop executes n times for each iteration of the outer loop, so the assignment in the innermost part of the loop structure runs n^2 times in total. Because that assignment takes constant time, the complexity of this loop is O(n^2). This notation is widely used in algorithmic analysis, so we'll use it in the following discussions.

Repeat these steps until either condition 3 or 4 is true. For example, let's say we want to find the string "cad" (T) in the string "cabcadchad" (S). The progress of our algorithm follows what's shown in Table 8-1, where brackets mark the characters of S and T compared at the current alignment.

Table 8-1. Naive search algorithm

 Step   Char. in S   Char. in T   Rule   State of search
   1        c            c          2    S: [c]abcadchad    T: [c]ad
   2        a            a          2    S: [ca]bcadchad    T: [ca]d
   3        b            d          1    S: [cab]cadchad    T: [cad]
   4        a            c          1    S: c[a]bcadchad    T: [c]ad
   5        b            c          1    S: ca[b]cadchad    T: [c]ad
   6        c            c          2    S: cab[c]adchad    T: [c]ad
   7        a            a          2    S: cab[ca]dchad    T: [ca]d
   8        d            d          2    S: cab[cad]chad    T: [cad]
   9        -            -          3    S: cab[cad]chad    T: [cad]

In the worst case, this naive searching algorithm will require a number of comparisons equal to:

length of pattern x number of possible start points = t x (s - t)

For example, using a 48-character string as S and a 4-character string as T, we would require 4 x (48 - 4) = 176 comparisons in the worst case. This gives the naive search algorithm a complexity of O(st), where s is the length of S and t is the length of T. In our efforts to improve performance, we think carefully about this method of searching and decide we'd like a more efficient one.
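
To make this concrete, here is a minimal sketch of the short-circuiting naive search in C. The function name naive_search and its convention of returning the zero-based index of the first match (or -1 when there is none) are our own choices for illustration, not part of the algorithm itself.

    #include <string.h>

    /* Short-circuiting naive search: try every alignment of the target t
     * against the source s, abandoning an alignment on the first mismatch.
     * Returns the zero-based index of the first match, or -1 if none. */
    int naive_search(const char *s, const char *t)
    {
        size_t slen = strlen(s), tlen = strlen(t);

        if (tlen > slen)
            return -1;                       /* T longer than S: no match possible */

        for (size_t start = 0; start <= slen - tlen; start++) {
            size_t i;
            for (i = 0; i < tlen; i++) {
                if (s[start + i] != t[i])    /* rule 1: mismatch, shift T */
                    break;
            }
            if (i == tlen)                   /* rule 3: all of T matched */
                return (int)start;
        }
        return -1;                           /* rule 4: we ran out of S */
    }

Calling naive_search("cabcadchad", "cad") returns 3, the zero-based index of the alignment found in step 9 of Table 8-1.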

8.1.1.2 Algorithm 2: Knuth-Morris-Pratt searching

If the naive algorithm fails to find a match for a character, it advances the starting point forward one character. Unfortunately, this throws away all the information gathered by the matching done so far. For example, let's say we have "abababcd" as T and "ababababc..." as the source string S (it's not important what comes later in S, as long as it eventually causes a mismatch). At the first alignment, the first six characters of T match S, and then the comparison fails. The naive algorithm would have us advance one character in S and start checking again from the beginning of T. But because we know exactly which six characters we just matched, we also know that advancing by one cannot possibly work; shifting the pattern by two, however, gives us another possible match. Better still, we know that the first four characters of T already match at this new alignment, so we can continue comparing from the same place in S. This means we never have to look at a character in the source text more than once. The Knuth-Morris-Pratt (KMP) algorithm formalizes this bookkeeping and has complexity O(s + t).

In general, KMP is fast exactly where the naive algorithm is slow: when S and T both contain repeated patterns of characters. It is particularly effective when the alphabet is small, as with genetic sequences (which use only the four characters G, C, A, and T) or bit patterns.
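
To show the bookkeeping this requires, here is a minimal KMP sketch in C. The names kmp_search and build_fail, and the choice to return the zero-based index of the first match (or -1), are ours; production implementations differ in detail.

    #include <stdlib.h>
    #include <string.h>

    /* Build the "fail array": fail[i] is the length of the longest proper
     * prefix of t[0..i] that is also a suffix of t[0..i]. */
    static void build_fail(const char *t, size_t tlen, size_t *fail)
    {
        size_t k = 0;

        fail[0] = 0;
        for (size_t i = 1; i < tlen; i++) {
            while (k > 0 && t[i] != t[k])
                k = fail[k - 1];
            if (t[i] == t[k])
                k++;
            fail[i] = k;
        }
    }

    /* Return the zero-based index of the first occurrence of t in s, or -1.
     * We only ever move forward through s, which is what gives KMP its
     * O(s + t) behavior. */
    int kmp_search(const char *s, const char *t)
    {
        size_t slen = strlen(s), tlen = strlen(t);
        size_t k = 0;                        /* characters of t matched so far */

        if (tlen == 0)
            return 0;

        size_t *fail = malloc(tlen * sizeof *fail);
        if (fail == NULL)
            return -1;
        build_fail(t, tlen, fail);

        for (size_t i = 0; i < slen; i++) {
            while (k > 0 && s[i] != t[k])
                k = fail[k - 1];             /* slide t without backing up in s */
            if (s[i] == t[k])
                k++;
            if (k == tlen) {                 /* full match ending at position i */
                free(fail);
                return (int)(i - tlen + 1);
            }
        }
        free(fail);
        return -1;
    }

For the pattern "abababcd", the fail array is {0, 0, 1, 2, 3, 4, 0, 0}; when the seventh character fails after six matches, k falls back from 6 to fail[5] = 4, which is exactly the "shift by two and keep the first four matched characters" step described above.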

8.1.1.3 Algorithm 3: Boyer-Moore searching

The last algorithm we come to in our search for fast, simple string matching is the Boyer-Moore algorithm, which was first published in 1977. It uses a different approach, together with two techniques to improve the amount by which the pattern is shifted in the event of a failed match, to provide much better performance.

The change of approach is to attempt to match T from right to left, rather than left to right, while still moving T left to right across S. For example, if T is "wish" and S is "dish," we would successfully match "h," "s," and "i" before failing when "w" is compared against "d."

The first technique for improving pattern shifting in the event of a mismatch is to move the pattern along so that, if possible, a match is made with the character in S that we are inspecting. For example, let's return to the piece of text "here is a piece of text which we wish to search." We want to find the word "wish" in this string:

 S:   here is a piece of text which we wish to search
 T:   wish

"H" does not match "e," and no "e" appears in T , so no match can contain the "e." Therefore, we can move the pattern along past the "e" -- up to four places, in fact, the width of T . Now we have:

 S:   here is a piece of text which we wish to search
 T:       wish

The space does not match anything in T, so we can advance four characters again:

 S:   here is a piece of text which we wish to search
 T:           wish

The match has failed again, but this time against an "i," which occurs in the pattern. We shift T so that the "i" in S and the "i" in T are aligned:

 S:   here is a piece of text which we wish to search
 T:             wish

There is no "c" in T , so shift T over by 4. Eventually, by repeating this algorithm, we will find the occurence of T in S . This match occurs at the 34th position in S , but we require only 15 comparisons. Either the naive method or the KMP algorithm would have required at least 37 comparisons -- a huge performance benefit. In general, the longer the pattern, the fewer the comparisons.

The second mechanism used by the Boyer-Moore algorithm is essentially the "fail array" technique used by KMP to decide how far to slide T along S, adapted to the right-to-left search order. In general, the Boyer-Moore algorithm provides exceedingly fast searching compared to the naive algorithm or the KMP algorithm, particularly for long patterns in a text with a large alphabet. In the best case, its complexity is O(s/t).
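
The full Boyer-Moore algorithm is more than we can reproduce here, but the first shifting technique alone already yields a useful searcher. The sketch below is the simplified Boyer-Moore-Horspool variant, which keeps the right-to-left comparison and the bad-character shift but omits the KMP-style second mechanism; the name bmh_search and the return convention are ours.

    #include <limits.h>
    #include <string.h>

    /* Boyer-Moore-Horspool: compare right to left within each alignment,
     * and use a bad-character table to decide how far to slide the pattern.
     * Returns the zero-based index of the first match, or -1 if none. */
    int bmh_search(const char *s, const char *t)
    {
        size_t slen = strlen(s), tlen = strlen(t);
        size_t shift[UCHAR_MAX + 1];

        if (tlen == 0)
            return 0;
        if (tlen > slen)
            return -1;

        /* Characters that do not appear in the pattern allow a shift of the
         * full pattern length; others shift by their distance from the end. */
        for (size_t c = 0; c <= UCHAR_MAX; c++)
            shift[c] = tlen;
        for (size_t i = 0; i + 1 < tlen; i++)
            shift[(unsigned char)t[i]] = tlen - 1 - i;

        for (size_t pos = 0; pos <= slen - tlen; ) {
            size_t i = tlen;
            while (i > 0 && s[pos + i - 1] == t[i - 1])
                i--;                         /* match right to left */
            if (i == 0)
                return (int)pos;             /* every character matched */
            pos += shift[(unsigned char)s[pos + tlen - 1]];
        }
        return -1;
    }

For the pattern "wish", the table gives a shift of 4 for any character that does not appear in the pattern and a shift of 2 for "i", which reproduces the jumps in the example above.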

8.1.2 Caveats in Optimization

There are three pitfalls to watch for in code tuning: diminishing returns, the development-to-use ratio, and functional equivalence. The pitfall of diminishing returns is illustrated by a program with the timing characteristics shown in Table 8-2.

Table 8-2. Timing characteristics of a sample program

 Function   Fraction of time
 main              8%
 foo              14%
 bar              33%
 baz              45%

If we improve the performance of baz by 20%, the runtime of the program decreases by 9%, because baz accounts for 45% of the total running time. If we improve the performance of foo by the same 20%, however, the runtime of the program improves by only 2.8%. Tuning efforts should therefore be concentrated where the improvements will have the greatest effect. This usually means resolving bottlenecks in critical sections of code; fixing bottlenecks in noncritical code sections is not likely to help overall performance much.
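
The arithmetic behind these figures is simply that the overall saving equals the routine's share of the runtime multiplied by the fraction shaved off it. A minimal sketch, with the function name overall_saving being our own:

    #include <stdio.h>

    /* Overall fraction of runtime saved when a routine that accounts for
     * `fraction` of the total time is made `improvement` faster
     * (both expressed as fractions, e.g., 0.45 and 0.20). */
    static double overall_saving(double fraction, double improvement)
    {
        return fraction * improvement;
    }

    int main(void)
    {
        printf("tuning baz: %.1f%%\n", 100.0 * overall_saving(0.45, 0.20));  /* 9.0% */
        printf("tuning foo: %.1f%%\n", 100.0 * overall_saving(0.14, 0.20));  /* 2.8% */
        return 0;
    }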

Another thing to watch out for is the development time to runtime ratio. In many fields, a program may be so specialized as to be run only a few times. In these cases, the time spent by developers in profiling, tuning, and testing the code may be so great that it exceeds the improvements gained by optimization.

The last major caveat of code optimization is functional equivalence. Not all optimizations guarantee bitwise-identical results. For example, instead of dividing by a number, you can multiply by its inverse (reciprocal). This can be a much faster operation, but in some cases the results will not be identical. One instance where this is true is when the divisor is extremely small. [2] When the reciprocal of this very small number is computed, it will overflow and produce an infinity. You must be careful to check the results of your program after each optimization, using representative input sets, to avoid these sorts of errors.

[2] Specifically, it must be denormalized; the number is so small that the exponent field of the floating-point representation cannot represent its exponent, so the leading bits of the fractional part are set to zero to reduce the number's magnitude. For example, the number 6.0 x 10^-324 would have the first fifty-one bits of its fractional part set to zero, leaving only one significant fractional bit. Denormalization typically occurs once a number is smaller than about 2.3 x 10^-308.
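
A toy demonstration of this hazard, assuming IEEE 754 double-precision arithmetic; the variable names and constants are chosen purely for illustration:

    #include <stdio.h>

    int main(void)
    {
        double x = 1e-300;
        double d = 1e-310;     /* denormalized: smaller than about 2.3e-308 */

        double by_division   = x / d;           /* about 1e10: a finite result */
        double by_reciprocal = x * (1.0 / d);   /* 1.0 / d overflows to infinity,
                                                   so the product is infinite too */

        printf("x / d       = %g\n", by_division);
        printf("x * (1 / d) = %g\n", by_reciprocal);
        return 0;
    }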


