25.3. Kinds of Fat and Molasses


In code tuning you find the parts of a program that are as slow as molasses in winter and as big as Godzilla and change them so that they are as fast as greased lightning and so skinny they can hide in the cracks between the other bytes in RAM. You always have to profile the program to know with any confidence which parts are slow and fat, but some operations have a long history of laziness and obesity, and you can start by investigating them.

Common Sources of Inefficiency

Here are several common sources of inefficiency:

Input/output operations: One of the most significant sources of inefficiency is unnecessary input/output (I/O). If you have a choice of working with data in memory versus on disk, in a database, or across a network, use an in-memory data structure unless space is critical.

Here's a performance comparison between code that accesses random elements in a 100-element in-memory array and code that accesses random elements of the same size in a 100-record disk file:

Language | External File Time | In-Memory Data Time | Time Savings | Performance Ratio
-------- | ------------------ | ------------------- | ------------ | -----------------
C++      | 6.04               | 0.000               | 100%         | n/a
C#       | 12.8               | 0.010               | 100%         | 1000:1

According to this data, in-memory access is on the order of 1000 times faster than accessing data in an external file. Indeed, with the C++ compiler I used, the time required for in-memory access wasn't measurable.

The performance comparison for a similar test of sequential access times is similar:

Language | External File Time | In-Memory Data Time | Time Savings | Performance Ratio
-------- | ------------------ | ------------------- | ------------ | -----------------
C++      | 3.29               | 0.021               | 99%          | 150:1
C#       | 2.60               | 0.030               | 99%          | 85:1

Note: The tests for sequential access were run with 13 times the data volume of the tests for random access, so the results are not comparable across the two types of tests.


If the test had used a slower medium for external access (for example, a hard disk across a network connection), the difference would have been even greater. The performance looks like this when a similar random-access test is performed on a network location instead of on the local machine:

Language | Local File Time | Network File Time | Time Savings
-------- | --------------- | ----------------- | ------------
C++      | 6.04            | 6.64              | -10%
C#       | 12.8            | 14.1              | -10%

Of course, these results can vary dramatically depending on the speed of your network, network loading, distance of the local machine from the networked disk drive, speed of the networked disk drive compared to the speed of the local drive, current phase of the moon, and other factors.

Overall, the effect of in-memory access is significant enough to make you think twice about having I/O in a speed-critical part of a program.
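
If you want to check this effect on your own system, a minimal Java sketch along the following lines is enough to show it. This is not the benchmark code used for the tables above; the file name, record size, and access count are illustrative assumptions.

import java.io.RandomAccessFile;
import java.util.Random;

public class RandomAccessComparison {
   public static void main( String[] args ) throws Exception {
      final int RECORD_COUNT = 100;
      final int ACCESS_COUNT = 1000000;
      Random random = new Random( 1 );
      long checksum = 0;

      // In-memory version: read random elements of a 100-element array.
      int[] table = new int[ RECORD_COUNT ];
      long start = System.nanoTime();
      for ( int i = 0; i < ACCESS_COUNT; i++ ) {
         checksum += table[ random.nextInt( RECORD_COUNT ) ];
      }
      long memoryTime = System.nanoTime() - start;

      // External-file version: seek to a random 4-byte record and read it.
      RandomAccessFile file = new RandomAccessFile( "records.dat", "rw" );
      file.setLength( RECORD_COUNT * 4L );
      start = System.nanoTime();
      for ( int i = 0; i < ACCESS_COUNT; i++ ) {
         file.seek( random.nextInt( RECORD_COUNT ) * 4L );
         checksum += file.readInt();
      }
      long fileTime = System.nanoTime() - start;
      file.close();

      System.out.println( "In-memory access:     " + memoryTime / 1000000 + " ms" );
      System.out.println( "External-file access: " + fileTime / 1000000 + " ms" );
      System.out.println( "(checksum " + checksum + ")" );
   }
}

The exact ratio you measure will depend on your operating system's file caching, but the gap between the two loops should still be dramatic.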

Paging: An operation that causes the operating system to swap pages of memory is much slower than an operation that works on only one page of memory. Sometimes a simple change makes a huge difference. In the next example, one programmer wrote an initialization loop that produced many page faults on a system that used 4K pages.

Java Example of an Initialization Loop That Causes Many Page Faults
for ( column = 0; column < MAX_COLUMNS; column++ ) {
   for ( row = 0; row < MAX_ROWS; row++ ) {
      table[ row ][ column ] = BlankTableElement();
   }
}

This is a nicely formatted loop with good variable names, so what's the problem? The problem is that each element of table is about 4000 bytes long. If table has too many rows, every time the program accesses a different row, the operating system will have to switch memory pages. The way the loop is structured, every single array access switches rows, which means that every single array access causes paging to disk.

The programmer restructured the loop this way:

Java Example of an Initialization Loop That Causes Few Page Faults
for ( row = 0; row < MAX_ROWS; row++ ) {
   for ( column = 0; column < MAX_COLUMNS; column++ ) {
      table[ row ][ column ] = BlankTableElement();
   }
}

This code still causes a page fault every time it switches rows, but it switches rows only MAX_ROWS times instead of MAX_ROWS * MAX_COLUMNS times.

The specific performance penalty varies significantly. On a machine with limited memory, I measured the second code sample to be about 1000 times faster than the first code sample. On machines with more memory, I've measured the difference to be as small as a factor of 2, and it doesn't show up at all except for very large values of MAX_ROWS and MAX_COLUMNS.
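
On machines with ample RAM, the same loop-order change tends to show up as a cache-locality effect rather than as paging, but the way you measure it is the same. Here is a minimal timing sketch, not the measurement code behind the numbers above; the array dimensions and element type are arbitrary assumptions.

public class TraversalOrderTiming {
   private static final int MAX_ROWS = 4000;
   private static final int MAX_COLUMNS = 4000;
   private static final int[][] table = new int[ MAX_ROWS ][ MAX_COLUMNS ];

   public static void main( String[] args ) {
      // Column-major order: every assignment touches a different row.
      long start = System.nanoTime();
      for ( int column = 0; column < MAX_COLUMNS; column++ ) {
         for ( int row = 0; row < MAX_ROWS; row++ ) {
            table[ row ][ column ] = 0;
         }
      }
      System.out.println( "Column-major: " + ( System.nanoTime() - start ) / 1000000 + " ms" );

      // Row-major order: each row is finished before moving to the next one.
      start = System.nanoTime();
      for ( int row = 0; row < MAX_ROWS; row++ ) {
         for ( int column = 0; column < MAX_COLUMNS; column++ ) {
            table[ row ][ column ] = 0;
         }
      }
      System.out.println( "Row-major: " + ( System.nanoTime() - start ) / 1000000 + " ms" );
   }
}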

System calls: Calls to system routines are often expensive. They often involve a context switch: saving the program's state, recovering the kernel's state, and the reverse. System routines include input/output operations to disk, keyboard, screen, printer, or other device; memory-management routines; and certain utility routines. If performance is an issue, find out how expensive your system calls are. If they're expensive, consider these options:

  • Write your own services. Sometimes you need only a small part of the functionality offered by a system routine and can build your own from lower-level system routines. Writing your own replacement can give you something that's faster, smaller, and better suited to your needs.

  • Avoid going to the system.

  • Work with the system vendor to make the call faster. Most vendors want to improve their products and are glad to learn about parts of their systems with weak performance. (They might seem a little grouchy about it at first, but they really are interested.)

In the code-tuning effort I described in "When to Tune" in Section 25.2, the program used an AppTime class that was derived from a commercially available BaseTime class. (These names have been changed to protect the guilty.) The AppTime object was the most common object in this application, and we instantiated tens of thousands of AppTime objects. After several months, we discovered that BaseTime was initializing itself to the system time in its constructor. For our purposes, the system time was irrelevant, which meant we were needlessly generating thousands of system-level calls. Simply overriding BaseTime's constructor and initializing the time field to 0 instead of to the system time gave us about as much performance improvement as all the other changes we made put together.
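
A sketch of the general shape of that fix looks something like the following. The field name and the extra constructor are assumptions for illustration only; the real BaseTime was a commercial class whose internals aren't reproduced here.

// Stand-in for the commercial base class; its internals here are assumptions.
class BaseTime {
   protected long timeValue;

   public BaseTime() {
      // Default behavior: one system call per object created.
      timeValue = System.currentTimeMillis();
   }

   protected BaseTime( long initialValue ) {
      // Alternative constructor that skips the system call.
      timeValue = initialValue;
   }
}

// Application class: initialize the time field to 0 instead of to the system time.
class AppTime extends BaseTime {
   public AppTime() {
      super( 0 );
   }
}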

Interpreted languages: Interpreted languages tend to exact significant performance penalties because they must process each programming-language instruction before creating and executing machine code. In the performance benchmarking I performed for this chapter and Chapter 26, I observed the approximate relationships in performance among different languages that are described in Table 25-1.

Table 25-1. Relative Execution Time of Programming Languages

Language     | Type of Language | Execution Time Relative to C++
------------ | ---------------- | ------------------------------
C++          | Compiled         | 1:1
Visual Basic | Compiled         | 1:1
C#           | Compiled         | 1:1
Java         | Byte code        | 1.5:1
PHP          | Interpreted      | >100:1
Python       | Interpreted      | >100:1

As you can see, C++, Visual Basic, and C# are all comparable. Java is close but tends to be slower than the other languages. PHP and Python are interpreted languages, and code in those languages tended to run a factor of 100 or more slower than code in C++, Visual Basic, C#, and Java. The general numbers presented in this table must be viewed cautiously. For any particular piece of code, C++, Visual Basic, C#, or Java might be twice as fast or half as fast as the other languages. (You can see this for yourself in the detailed examples in Chapter 26.)

Errors: A final source of performance problems is errors in the code. Errors can include leaving debugging code turned on (such as logging trace information to a file), forgetting to deallocate memory, improperly designing database tables, polling nonexistent devices until they time out, and so on.

A version 1.0 application I worked on had a particular operation that was much slower than other similar operations. A great deal of project mythology grew up to explain the slowness of this operation. We released version 1.0 without ever fully understanding why this particular operation was so slow. While working on the version 1.1 release, however, I discovered that the database table used by the operation wasn't indexed! Simply indexing the table improved performance by a factor of 30 for some operations. Defining an index on a commonly used table is not optimization; it's just good programming practice.
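
For what such a fix looks like in practice, here is a minimal JDBC sketch. The table, column, and index names and the in-memory database URL are illustrative assumptions (not the schema from that project), and the sketch assumes an HSQLDB driver is on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class AddMissingIndex {
   public static void main( String[] args ) throws Exception {
      // Any JDBC-accessible database works; an in-memory database is used here
      // only so that the sketch is self-contained.
      Connection connection = DriverManager.getConnection( "jdbc:hsqldb:mem:exampledb", "sa", "" );
      Statement statement = connection.createStatement();
      statement.execute( "CREATE TABLE Orders ( OrderId INT PRIMARY KEY, CustomerId INT, Total DECIMAL(10,2) )" );

      // The fix: index the column that the slow operation searches on.
      statement.execute( "CREATE INDEX idxOrdersCustomerId ON Orders ( CustomerId )" );

      statement.close();
      connection.close();
   }
}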

Relative Performance Costs of Common Operations

Although you can't count on some operations being more expensive than others without measuring them, certain operations tend to be more expensive. When you look for the molasses in your program, use Table 25-2 to help make some initial guesses about the sticky parts of your program.

Table 25-2. Costs of Common Operations
  

Operation | Example | Relative Time: C++ | Relative Time: Java
--------- | ------- | ------------------ | -------------------
Baseline (integer assignment) | i = j | 1 | 1
Routine Calls | | |
Call routine with no parameters | foo() | 1 | n/a
Call private routine with no parameters | this.foo() | 1 | 0.5
Call private routine with 1 parameter | this.foo( i ) | 1.5 | 0.5
Call private routine with 2 parameters | this.foo( i, j ) | 2 | 0.5
Object routine call | bar.foo() | 2 | 1
Derived routine call | derivedBar.foo() | 2 | 1
Polymorphic routine call | abstractBar.foo() | 2.5 | 2
Object References | | |
Level 1 object dereference | i = obj.num | 1 | 1
Level 2 object dereference | i = obj1.obj2.num | 1 | 1
Each additional dereference | i = obj1.obj2.obj3 | not measurable | not measurable
Integer Operations | | |
Integer assignment (local) | i = j | 1 | 1
Integer assignment (inherited) | i = j | 1 | 1
Integer addition | i = j + k | 1 | 1
Integer subtraction | i = j - k | 1 | 1
Integer multiplication | i = j * k | 1 | 1
Integer division | i = j \ k | 5 | 1.5
Floating-Point Operations | | |
Floating-point assignment | x = y | 1 | 1
Floating-point addition | x = y + z | 1 | 1
Floating-point subtraction | x = y - z | 1 | 1
Floating-point multiplication | x = y * z | 1 | 1
Floating-point division | x = y / z | 4 | 1
Transcendental Functions | | |
Floating-point square root | x = sqrt( y ) | 15 | 4
Floating-point sine | x = sin( y ) | 25 | 20
Floating-point logarithm | x = log( y ) | 25 | 20
Floating-point e^y | x = exp( y ) | 50 | 20
Arrays | | |
Access integer array with constant subscript | i = a[ 5 ] | 1 | 1
Access integer array with variable subscript | i = a[ j ] | 1 | 1
Access two-dimensional integer array with constant subscripts | i = a[ 3, 5 ] | 1 | 1
Access two-dimensional integer array with variable subscripts | i = a[ j, k ] | 1 | 1
Access floating-point array with constant subscript | x = z[ 5 ] | 1 | 1
Access floating-point array with integer-variable subscript | x = z[ j ] | 1 | 1
Access two-dimensional, floating-point array with constant subscripts | x = z[ 3, 5 ] | 1 | 1
Access two-dimensional, floating-point array with integer-variable subscripts | x = z[ j, k ] | 1 | 1

Note: Measurements in this table are highly sensitive to local machine environment, compiler optimizations, and code generated by specific compilers. Measurements between C++ and Java are not directly comparable.


The relative performance of these operations has changed significantly since the first edition of Code Complete, so if you're approaching code tuning with 10-year-old ideas about performance, you might need to update your thinking.

Most of the common operations are about the same price: routine calls, assignments, integer arithmetic, and floating-point arithmetic are all roughly equal. Transcendental math functions are extremely expensive. Polymorphic routine calls are a bit more expensive than other kinds of routine calls.

Table 25-2, or a similar one that you make, is the key that unlocks all the speed improvements described in Chapter 26. In every case, improving speed comes from replacing an expensive operation with a cheaper one. Chapter 26 provides examples of how to do so.
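
As one concrete illustration of that principle (a sketch of the general idea, not an example taken from Chapter 26): per Table 25-2, a transcendental function such as sine costs roughly 20 to 25 times as much as an array access, so code that repeatedly needs sine values at coarse precision can trade the expensive call for a cheap lookup.

public class SineLookup {
   private static final int DEGREES_IN_CIRCLE = 360;
   private static final double[] sineTable = new double[ DEGREES_IN_CIRCLE ];

   static {
      // Pay for the expensive sin() calls once, at start-up.
      for ( int degrees = 0; degrees < DEGREES_IN_CIRCLE; degrees++ ) {
         sineTable[ degrees ] = Math.sin( Math.toRadians( degrees ) );
      }
   }

   // Cheap replacement for callers that only need whole-degree precision.
   public static double sineOfDegrees( int degrees ) {
      int normalized = ( ( degrees % DEGREES_IN_CIRCLE ) + DEGREES_IN_CIRCLE ) % DEGREES_IN_CIRCLE;
      return sineTable[ normalized ];
   }
}

Whether such a replacement is worthwhile depends on the precision the callers actually need and on measurement in your own environment.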


