This book is dedicated to Susan Patricia Caffee Heller, the light of my life. Without her, this book would not be what it is; even more important, I would not be what I am: a happy man.
I'd like to thank all those readers who have provided feedback on the first two editions of this book, especially those who have posted reviews on Amazon.com; their contributions have made this a better book.
I'd also like to thank Jeff Pepper, my editor at Prentice-Hall, for his support and encouragement. Without him, this third edition would never have been published.
Finally, I would like to express my appreciation to John P. Linderman at AT Labs Research for his help with the code in the chapter on sorting immense files.
Introduction to Optimization
What is optimization anyway? Clearly, we have to know this before we can discuss how and why we should optimize programs.
Optimization is the art and science of modifying a working computer program so that it makes more efficient use of one or more scarce resources, primarily memory, disk space, or time. This definition has a sometimes overlooked but very important corollary (The First Law of Optimization): The speed of a nonworking program is irrelevant.
Radix40 Data Representation, Lookup Tables1
Deciding Whether to Optimize
Suppose you have written a program to calculate mortgage payments; the yearly run takes ten minutes. Should you spend two hours to double its speed? Probably not, since it will take twenty-four years to pay back the original investment of time at five minutes per year.2 On the other hand, if you run a program for three hours every working day, even spending thirty hours to double its speed will pay for itself in only twenty working days, or about a month. Obviously the latter is a much better candidate for optimization. Usually, of course, the situation is not nearly so unambiguous: even if your system is overloaded, it may not be immediately apparent which program is responsible.3 My general rule is not to optimize a program that performs satisfactorily. If you (or the intended users) don't become impatient while waiting for it to finish, don't bother. Of course, if you just feel like indulging in some recreational optimization, that's another matter.
Why Optimization Is Necessary
Assuming that our programs are too big, or too slow, why don't we just add more memory or a faster processor? If that isn't possible today, then the next generation of processors should be powerful enough to spare us such concerns.
Let's examine this rather widely held theory. Although the past is not an infallible guide to the future, it is certainly one source of information about what happens when technology changes. A good place to start is to compare the computers of the late 1970's with those of the late 1990's.
The first diskette-based computer I ever owned was a Radio Shack TRS-80 Model IIITM, purchased in 1979.4 It had a 4 MHz Z80TM processor, 48 Kbytes of memory, and BasicTM in ROM. The diskettes held about 140 Kbytes apiece. Among the programs that were available for this machine were word processors, assemblers, debuggers, data bases, and games. While none of these were as advanced as the ones that are available today on 80x86 or 680x0 machines, most of the basic functions were there.
The Pentium IITM machines of today have at least 1000 times as much memory and 20000 times as much disk storage and are probably 1000 times as fast. Therefore, according to this theory, we should no longer need to worry about efficiency.
Recently, however, several of the major microcomputer software companies have had serious performance problems with new software releases of both application programs and system programs, as they attempted to add more and more features to those available in the previous versions.5 This illustrates what I call the Iron Law of Programming: whenever more resources are made available, a more ambitious project is attempted. This means that optimal use of these resources is important no matter how fast or capacious the machine.
Why Optimization Is Often Neglected
In view of this situation, why has optimization of application programs not gained more attention? I suspect that a major reason is the tremendous and continuing change in the relative costs of programming time and computer hardware. To illustrate this, let us examine two situations where the same efficiency improvement is gained, but the second example occurs after twenty years of technological improvement.
In the early 1970's a programmer's starting salary was about $3 per hour, and an hour of timesharing connect time cost about $15. Therefore, if a program originally took one hour to run every day, and the programmer spent a week to reduce this time by 20%, in about eight weeks the increased speed would have paid for the programmer's time.
In the late 1990's, the starting salary is in the vicinity of $15 per hour, and if we assume that the desktop computer costs $2500 and is amortized over three years, the weekly cost of running the unoptimized program is about $2.00. Holding the other assumptions constant, the optimization payback time is about 30 years!6 This long-term trend seems to favor hardware solutions for performance problems over the investment of programming effort. However, the appropriate strategy for performance improvement of a program depends on how much control you have over the hardware on which the program will run. The less control you have over the hardware, the greater the advantage of software solutions to performance problems, as illustrated by the situations below.
Considering a Hardware Solution
While every optimization problem is different, here are some general guidelines that can help you decide whether adding hardware to help you with your performance problems makes sense.
1. If you are creating a system that includes both software and hardware, and you can improve the functioning of the program, ease maintenance, or speed up development at a small additional expense for extra hardware resources, you should almost certainly do so.
A few years ago, I was involved in a classic example of this situation. The project was to create a point-of-sale system, including specialized hardware and programs, that tied a number of terminals into a common database. Since the part of the database that had to be accessed rapidly could be limited to about 1 1/2 megabytes, I suggested that we put enough memory in the database server machine to hold the entire database. That way, the response time would be faster than even a very good indexing system could provide, and the programming effort would be greatly reduced. The additional expense came to about $150 per system, which was less than 3% of the price of the system. Of course, just adding hardware doesn't always mean that no software optimization is needed. In this case, as you will see below, adding the hardware was only the beginning of the optimization effort.
2. If you are writing programs for your own use on one computer, and you can afford to buy a machine powerful enough to perform adequately, then you might very well purchase such a machine rather than optimizing the program.7
3. If you are writing a program that will be running on more than one computer, even if it is only for internal use at your company, the expense and annoyance of requiring the other users to upgrade their computers may outweigh the difficulty of optimization.
4. If your program is to run on many computers at your company, it is almost always a good idea to try to optimize it rather than to require all those users to upgrade their hardware.
5. If your program is to be sold in the open market to run on standard hardware, you must try to optimize it. Otherwise, the users are likely to reject the program.
An excellent example of what happens when you try to make a number of users upgrade their computers is the relatively disappointing sales record (as of this writing) of the Windows NTTM operating system.
In order to get any reasonable performance under version 4.0 of Windows NT (the latest extant as of this writing), you need at least 64 megabytes of memory and a Pentium II/233 processor. Since almost the only people who have such machines are professional software developers, the sales of this program to other users have been much smaller than expected.
Categories of Optimization
There are two broad categories of optimization: using a better algorithm, and improving the implementation of the algorithm we are already using. Generally, we should replace an algorithm by a better one as soon as possible, as this usually does not hinder understanding or later modification of the program. An example is the use of a distribution counting sort (see Chapter mail.htm) in preference to Quicksort. Of course, if we employ these efficient algorithms from the start of our programming effort, then we are not optimizing in the strict sense of our definition (changing a working program to make it more efficient). However, the end result is still a better program than we would have obtained otherwise.8 The second category of optimization is the modification of the implementation of an existing algorithm to take advantage of peculiarities of the environment in which it is running or of the characteristics of the data to which it is applied. This type of modification often has the unfortunate side effect of making the algorithm much harder to understand or modify in the future; it also impairs portability among different hardware and software architectures. Therefore, such an optimization should be postponed until the last possible moment, in order to reduce its negative effects on the development and maintenance of the program. This is an application of the First Law of Optimization: don't try to optimize a program (in the strict sense of modifying an existing program) until it is working correctly.
Finding the Critical Resource
It may seem obvious that, before you can optimize a program, you have to know what is making it inefficient. Of course, if you run out of memory or disk space while executing the program, this determination becomes much simpler.
Depending on which language and machine you are using, there may be "profiling" tools available which allow you to determine where your program is spending most of its time. These, of course, are most useful when the problem is CPU time, but even if that is not the problem, you may still be able to find out that, e.g., your program is spending 95% of its time in the disk reading and/or writing routines. This is a valuable clue to where the problem lies.
However, even if no profiler is available for your system, it isn't hard to gather some useful information yourself. One way to do this is to insert a call to a system timer routine (such as
clock() in ANSI C) at the beginning of the segment to be timed and another call at the end of the segment and subtract the two times. Depending on the resolution of the timer, the length of the routine, and the speed of your processor, you may have to execute the segment more than once to gather any useful information. This is illustrated in the real-life example below.
Determining How Much Optimization Is Needed
Sometimes your task is simply to make the program run as fast (or take as little memory) as possible. In this case, you must use the most effective optimization available, regardless of the effort involved. However, you often have (or can get) a specific target, such as a memory budget of 1.4 megabytes for your data. If you can achieve this goal by relatively simple means, it would be a waste of your time to try to squeeze the last few kilobytes out of the data by a fancier compression algorithm. In fact, it may be worse than a waste of time; the simpler algorithm may also have other desirable characteristics (other than its raw performance).
A good example of such a simple algorithm is the
Radix40 data compression method (see Chapter superm.htm, Figures radix40.00 through radix40.03), which is, on average, considerably less effective at reducing the size of a data file than a number of other data compression algorithms. On the other hand, it is quite fast, requires very little storage, and always produces the same amount of output for the same amount of input, which means that its compression efficiency can be calculated exactly in advance. The (statistically) more effective routines such as those using arithmetic coding take more memory and more time, and they generally produce different amounts of output for a given amount of input (depending on what has gone before), so that their compression efficiency cannot be predicted exactly. (In fact, in rare circumstances, they produce output that is larger than the input.) This context dependence also means that they are more difficult to use in applications where random access to compressed data is needed.
The moral is that you should use the simplest algorithm that will meet your needs. If you can define your needs precisely, you probably won't have to implement as sophisticated an algorithm to solve your problem, which will leave you more time to work on other areas that need improvement.
A Real-Life Example
The point-of-sale database program I mentioned earlier is an excellent example of the fact that optimization rarely follows a straight-line path. The first problem was the speed of access to the database by multiple users. Since the software is supplied with specialized hardware, it was reasonable to solve this problem by adding enough memory to hold the portion of the database that requires rapid access. My employer determined that 15,000 invoice records and 5000 customer records would be sufficient, resulting in a memory requirement of about 1.25 megabytes. The expense of this amount of memory was within reason.
Unfortunately, that part of memory (conventional memory) which allows normal program access is limited to 640 kilobytes on IBM-compatible systems running the MS-DOS operating system, the environment in which this program operated. While our actual hardware allowed for more memory, it could be referenced only as expanded memory, which cannot be allocated as simply as conventional memory.
Luckily, the problem of using expanded memory for data storage has been addressed by libraries of routines which allow any particular 16 Kbyte "page" of expanded memory to be loaded when the data in it are required. Storing the records in expanded memory solved the speed problem by eliminating excessive disk I/O.
However, the amount of expanded memory required was very close to the total available. In the very likely event of adding even a single field to the record definition, there would not be enough room for all the records. Therefore, I had to consider ways to reduce the space taken by these records.
The makeup of the database is important to the solution of this new problem. It consists almost entirely of 15,000 invoice records of approximately 35 bytes each and 5000 customer records of approximately 145 bytes each. Fortunately, the majority of the fields in the customer record contained only uppercase alphabetic characters, numeric digits, a few special characters (".", ",", and "-"), and spaces. This limited character set allows the use of
Radix40 compression, which packs three characters into two bytes. (See Chapter superm.htm for more details on this algorithm).
However, the time required to convert these fields from ASCII to
Radix40 representation seemed excessive. Some testing disclosed that converting 5000 records containing 6 fields of 12 characters each from ASCII to
Radix40 on a 33 MHz i386 took about 40 seconds!9 Although the most common operations in this system do not require wholesale conversion, it is required in such cases as importing an old-style database into this new system, and such inefficiency was unacceptable. So my space problem had become a speed problem again.
I resolved to improve the speed of this conversion (from 1300 microseconds/12 character string) as much as was practical, before proposing further hardware upgrades to the system. The first problem was to determine which operation was consuming the most CPU time. Examination of the code (Figure ascrad1.cpp), disclosed that the
toupper function was being called for every character in the string, every time the character was being examined). This seemed an obvious place to start.
First version of
ascii_to_Radix40 routine (from intro\ascrad1.cpp)
The purpose of writing the loop in this way was to avoid making changes to the input string; after all, it was an input variable. However, a more efficient way to leave the input string unaltered was to make a copy of the input string and convert the copy to uppercase, as indicated in Figure ascrad2.cpp. This reduced the time to 650 microseconds/12 character string, but I suspected that more savings were possible.
Second version of
ascii_to_Radix40 routine (from intro\ascrad2.cpp)
Another possible area of improvement was to reduce the use of dynamic string allocation to get storage for the copy of the string to be converted to uppercase. In my application, most of the strings would be less than 100 characters, so I decided to allocate room for a string of 99 characters (plus the required null at the end) on the stack and to call the dynamic allocation routine only if the string was larger than that. However, this change didn't affect the time significantly, so I removed it.
I couldn't see any obvious way to increase the speed of this routine further, until I noticed that if the data had about the same number of occurrences of each character, the loop to figure out the code for a single character would be executed an average of 20 times per character! Could this be dispensed with?
Yes, by allocating 256 bytes for a table of conversion values.10 Then I could index into the table rather than searching the string of legal values (see Figure ascrad4.cpp). Timing this version revealed an impressive improvement: 93 microseconds/12 character string. This final version is 14 times the speed of the original.11
Fourth version of
ascii_to_Radix40 routine (from intro\ascrad4.cpp)
The use of a profiler would have reduced the effort needed to determine the major causes of the inefficiency. Even without such an aid, attention to which lines were being executed most frequently enabled me to remove the major bottlenecks in the conversion to
Radix40 representation. It is no longer a significant part of the time needed to access a record.
In this chapter, I have given some guidelines and examples of how to determine whether optimization is required and how to apply your optimization effort effectively. In the next chapter we will start to examine the algorithms and other solutions that you can apply once you have determined where your program needs improvement.
- If you don't have the time to read this book in its entirety, you can turn to Figures ioopt-processoropt in Chapter artopt.htm to find the algorithms best suited to your problem.
- Actually, you will never be ahead; five minutes saved 23 years from now is not as valuable as five minutes spent now. This is analogous to the lottery in which you win a million dollars, but the prize is paid as one dollar a year for a million years!
- This is especially true on a multiuser system.
- My previous computer was also a Radio Shack computer, but it had only a cassette recorder/player for "mass storage"!
- Microsoft is the most prominent example at the moment; the resource consumption of Windows NTTM is still a matter of concern to many programmers, even though the machines that these programmers own have increased in power at a tremendous rate.
- This example is actually quite conservative. The program that took one hour to run on a timesharing terminal would probably take much less than that on a current desktop computer; we are also neglecting the time value of the savings, as noted above.
- Of course, if your old machine is more than two or three years old, you might want to replace it anyway, just to get the benefit of the improved technology available today.
- Perhaps this could be referred to as optimizing the design.
- This is worse than it may sound; the actual hardware on which the system runs is much slower than the i386 development machine I was using at the time.
- This table of conversion values can be found in Figure radix40.00.
- I also changed the method of clearing the result array to use
memsetrather than a loop.