General Optimization Techniques | Core Techniques and Algorithms in Game Programming2003

We know our code is slow. We have identified the portion that is dragging performance down. Now, let's try to improve its performance by optimizing it. But first, a clarification is needed. Some years ago, all you needed to optimize game code was a thorough knowledge of assembly language. Compilers were not that good, and thus transforming the offending piece of code into hand-written assembly language magically boosted performance. Today's games are much more complex, and optimization generally comes from careful algorithmic analysis and not as much from down-to-the-metal coding. On the other hand, compilers have improved performance spectacularly, and thus there's less potential in hand coding your critical sections. Many times you will spend long hours figuring out the best assembly implementation, only to end up with more or less the same performance. Thus, let's start by reviewing some general techniques that should help you regardless of the nature of your code. In later sections, we will move on to optimization recommendations for each of the four phases.

Choose Your Enemy

Focus on small portions of code that take lots of resources. It makes no sense to spend time working on a linear piece of code. Evil hides in loops, especially when lots of iterations and complex processing goes on inside. So, begin by carefully selecting your optimization target. Optimizing is a time-consuming, mind-bending process, and it better pay off in the end. Recursion is also a great hiding place for slow code. Take a look at those tree building routines to make sure each recursion takes as little time as possible to execute.

Precompute Everything

Study your code carefully and detect values that are computed in complex arithmetic functions. Does your code use trigonometric functions, random numbers, or any other kind of complex mathematical routine? Try to tabulate those so the only processing that takes place inside the loop is the table lookup. Code such as the following example is going to consume more CPU cycles than you might expect:

 float acc=0; for (i=0;i<1000;i++)    acc=acc+i*sin(x*i);

Substituting the sin call with a table lookup (and precomputing the sine function into that table) produces a five times speed increase on a regular PC. Hopefully, this simple example will show you the importance of avoiding expensive computations inside tight loops.

Simplify Your Math

Not all mathematical operators are equally fast. Traditionally, additions (or subtractions) were faster to compute than multiplies, which in turn were faster than divides. Today, performance varies greatly from one platform to another. On a regular PC, a divide costs one and a half times a multiply, so it is a good idea to check for performance on your target platform. The following two snippets perform the same task, but the second version only has 60 percent of the timing cost of the divide version:

 float acc=1000; for (long i=0;i<100;i++)    acc=acc/2.0; float acc=1000; for (long i=0;i<100;i++)    acc=acc*0.5;

Some operations that should be avoided at all costs are square roots, logarithmic operators, and so on. Try to simplify equations and, if you really need to use complex math routines, tabulate them beforehand.

Also, be especially careful with operations that require type conversion. Adding integers to floats and so on, means the CPU must perform type conversions internally, which takes time to compute. Try to plan your code so type changes are kept to a minimum.

Additionally, rearrange your math operations so you take advantage of the less-costly CPU instructions. Here are two simple examples:

 a*b + a*c  a*(b+c); Gets rid of one multiply b/a + c/a = (1/a)*(b+c); changes one division for one multiply, which is usually cheaper.

The distributive, associative, and commutative laws of arithmetic can help you find the optimal representation for any expression, so the least computations need to be performed.

Store Data Efficiently

Using unneeded precision will significantly increase your memory footprint and will also have a clear impact on performance as well. Try to store data efficiently, using the minimum bits possible. Color interpolation is a good example. Too often, floating-point values are used for color representation, and in the end, colors end up being displayed on a CRT screen that actually does not support that much precision. Using bytes should suffice and decreases memory footprint by a factor of four. It also allows the CPU to use cheaper, byte-based operations or even things like Intel's MultiMedia eXtensions (MMX). As an example, imagine that you need to interpolate two vectors of color values. The original, float-based routine, would be the following, which requires two float multiplies and one float addition per cycle:

 void interpolate_vector_float(float *v1,float *v2,float *v3,int num,float alpha) { int i; float alphainv=1-alpha; for (i=0;i<num;i++)    {    v3[i]=v2[i]*alpha+v1[i]*alphainv;    } }

Note how we have precomputed alphainv before entering the loop to maximize performance. Then, a byte-based approach would be as follows:

 void interpolate_vector_byte(unsigned char *v1,unsigned char *v2,unsigned char *v3,int num,float alpha) { int i; unsigned char balpha=(unsigned char)(alpha*255); unsigned char balphainv=255-balpha; for (i=0;i<num;i++)    {    unsigned int tmp=v2[i]*balpha+v1[i]*balphainv;    tmp=(tmp>>8);    v3[i]=tmp;    } }

Notice how we prepare both alpha and alphainv in bytes, so the interpolation is actually multiplied by 256. Hence, the right-shift we perform keeps data in bytes. As a result, the byte version is twice as fast as the float version, and the data set occupies one fourth of the original version. Sometimes the extra precision will be needed, but getting used to smaller data types improves the performance of your code.

Forget Malloc() and Free()

Memory allocation routines should be avoided at all costs inside game loops, which need to be optimal. Malloc() and free() incur a significant performance hit, which you simply cannot afford inside a tight loop. At load time, place all your memory allocation outside critical sections. Then, use static data or a homegrown memory manager to handle your memory needs during rendering. As a short example, take a look at the following innocent code:

 for (long k=0;k<1000000;k++)    int i=1;

Very simple, right? Now, take a look at this almost identical version:

 for (long k=0;k<1000000;k++)    {    int *i=new int;    int i=1;    delete i;    }

The first version took three milliseconds to run. The dynamic memory version took one and a half seconds. It is understandable because of the large number of calls, but still, keeping malloc and free (or new and delete, if you prefer) outside of critical loops is definitely part of any recipe for success.

Be Linear

Most modern CPUs come with built-in caches that make sequential access to memory extremely efficient. Every time an element is requested, the element is copied to cache memory along with all its memory neighborhood. This means subsequent linear accesses do not need to reach out for main memory, but instead can get the data from the fast cache memory. It takes a bit of coding discipline to traverse structures linearly, but it generally pays off in the long run: Less cache misses mean higher performance.

Watch Out for Pointers

This rule is a corollary of the previous rule. Often, large programs end up with layered data sets, where we can access attributes deep inside the structure by using pointer indirections (as shown in Figure A.3).

Figure A.3. Data structure showing the risk in highly branching structures and linear accesses.

graphics/afig03.gif

In the figure, we have an array of pointers to objects called obj. Each object, in turn, has an attribute that stores a pointer to the object's bounding box. The bounding box has a pointer to the minimal-value point, and in addition, a point has a pointer to a float that stores its X-value.

Now, performing such an access can seem highly sophisticated, but it is not. Each indirection means the CPU must scan a completely different area in memory, so sequential accesses are destroyed. This means each indirection is a cache miss, which can stall the CPU and completely bog down performance.

How do we get rid of pointer indirections? Well, many times we can't. We simply need to jump around like a monkey on a tree. On the other hand, sometimes many indirections can be precomputed outside the loop. For example:

 for( int i = 0; i < numPixels; i++ )    {    rendering_context->back_buffer->surface->bits[i] = some_value;    }

can be rewritten as:

 unsigned char *back_surface_bits = rendering_context->back_buffer->surface->bits; for( int i = 0; i < numPixels; i++ )    back_surface_bits[i] = some_value;