Faster Sine and Square Root

Sine (along with the other trig functions) and square root are ripe for replacement. These functions are relatively costly to execute and often don’t need to be exact; in fact, approximations work well in many game uses. Here we are going to create several new versions of existing math functions using two different techniques. For float sine and cosine, we will create lookup table-based functions. For float square root and inverse square root, we will use faster approximation functions. Both techniques trade accuracy for speed, but applied in appropriate ways, they give great performance gains at a negligible cost in accuracy.

Sine and Cosine

This text is not a pure mathematics text, and fully defining and explaining sine and cosine (and any other trigonometric functions) is beyond its scope, but here is a limited review.

Sine and cosine come from trigonometry, the branch of mathematics that deals with the relationships between the sides and the angles of triangles and the calculations based on them.

Sine can be defined in different ways, but in this book, we define it in terms of triangles and circles.

The triangle-based definition states that the sine of an angle in a right triangle equals the opposite side length divided by the hypotenuse length.

The circle-based definition states that the sine of an angle can be defined as the y-coordinate of the point on the circle located at that angle from the origin (see Figure 8.1).

Figure 8.1: Trigonometry Functions.

Moreover, sine and cosine are used in all kinds of rotational operations from graphics to collisions to AI decision making.

Sine Lookup Tables

Trigonometric functions are much more difficult to compute than typical polynomials. So much effort went into computing them in earlier eras that tables of their values were published. Even with tables, using trig functions takes time, because any use of a trig function involves at least one multiplication or division. We will create straightforward sine and cosine lookup table functions that trade size for accuracy but have a low, constant cost for angle values from 0 to 360 degrees.

The technique of using a sine lookup table is quite simple. Build a table of sine values for all angles needed (usually the full circle of 2 PI radians or 360.0 degrees) at some regular interval. For example, a very small (and inaccurate) table would be as follows:

index   angle   sine
0         0      0
1        45      0.70710678...
2        90      1
3       135      0.70710678...
4       180      0
5       225     -0.70710678...
6       270     -1
7       315     -0.70710678...
8       360     -0
To find sin(angle), simply divide the angle by the table entry interval (here, 45 degrees), use the integer part of the result as the index, and return the table value at that index.

For example:

sin(180);

Dividing the angle of 180 passed in by this table’s interval of 45 yields 4.

The table index 4 returns 0.0, which is, indeed, the sine of 180.

It should be obvious that this table and lookup routine are terribly inaccurate. Not only would 0 be returned for 180 but also for 181, 182, up to 225, depending on whether we round up or not for the index value.
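In code, the naive lookup for this nine-entry table might look like the following sketch (naiveSin, table, and INTERVAL are names invented here for illustration, not part of the final implementation):

// Sine values at 45-degree intervals from 0 to 360 degrees.
private static final float INTERVAL = 45.0f;
private static final float[] table =
    { 0f, 0.70710678f, 1f, 0.70710678f, 0f, -0.70710678f, -1f, -0.70710678f, 0f };

static float naiveSin(float angleDegrees)
{
    // the integer part of (angle / interval) selects the table entry
    int index = (int)(angleDegrees / INTERVAL);
    return table[index];
}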

Extra Improvements

Several elements of this table must be improved to bring accuracy to a more usable state.

First, more table entries can be added between the existing values. There will be a direct tradeoff between memory and accuracy, where more memory is required for a larger table with smaller intervals between entries, but the larger table will produce less error, or more accuracy, on average.

Second, the range of table values can be improved. Sine and cosine have cycling values. In fact, the values repeat four times in a table going from 0 to 360, with only the signs changing in certain quadrants. We can use this fact to easily shrink the table’s range to one-quarter of the original, ultimately allowing for the same accuracy in less space or, alternatively, increase the accuracy in the same size table.

Last, we can interpolate between table values based on the angle passed in. This action will add some more processing to our function and, depending on the need for speed versus accuracy, may be skipped. We have implemented both functions, one with interpolation and one without, for comparison and testing.
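A rough sketch of the interpolated lookup, reusing the hypothetical table and INTERVAL names from the earlier sketch (again, not the book’s final implementation), looks like this:

static float sinLerp(float angleDegrees)
{
    float scaled = angleDegrees / INTERVAL;
    int index = (int)scaled;              // lower table entry
    if (index >= table.length - 1)        // angle of exactly 360 degrees
        return table[table.length - 1];
    float frac = scaled - index;          // distance toward the next entry, 0..1
    float lo = table[index];
    float hi = table[index + 1];
    return lo + frac * (hi - lo);         // linear blend of the two entries
}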

To write the lookup table function, the lookup table must first be built. Because we would like static math methods, similar to the existing Math class, there should be no constructor to call and no instance to hold in order to use them. But before the first lookup can be made, a table-building routine must run.

To do this, we will use a Java static initializer block. We will initialize and fill the lookup table in this block, which executes once when the class is loaded.

The following code builds a lookup table with 0.01-degree resolution using 9,000 entries. That may be a bit large (9,000 floats is roughly 36 KB) for some uses, depending on the target platform. A smaller table would be fine; it simply trades accuracy for size.

// number of table entries per degree
private static final int sinScale = 100;

// float array that will store the sine values
private static final float[] sin = new float[(90*sinScale)+1];

// static initializer block
// fill the sine look-up table
static
{
    // Since this table is in degrees but Math.sin() wants radians,
    // we need a conversion coefficient in the loop.
    // This coefficient also scales the angle down
    // as per the "entries per degree".
    // A table could be built to accept radians similarly.
    double toRadian = Math.PI/(180.0 * sinScale);
    for (int i = 0; i < sin.length; i++)
    {
        sin[i] = (float)Math.sin(((double)i) * toRadian);
    }
}

Now that the table is built, we need a new sin(angle) method that computes the lookup index and returns the value. Because the table we built is limited to 0–90 degrees, the sin method has to do a little extra work to find the right index and the correct sign for the sine value. Limiting the function to angles of 0–360 degrees, the mapping from angle range to table index and sign is as follows:

angle range   table index                           sign of result
0–90          index                                 +sine
90–180        (180 * entries per degree) - index    +sine
180–270       index - (180 * entries per degree)    -sine
270–360       (360 * entries per degree) - index    -sine

There is one extra “gotcha” that is likely to happen. The angle value must be limited to 0–360 before doing the table lookup processing. Here we use the modulus operator to divide out values greater than 360 and use only the remainder for our final angle lookup. This action is not free but has to be here because the lookup will fail outside of 0–360. Often we know our angles are limited before the sin() call. In that case a new lookup function could be made without the check. Later we will look at a technique to limit the range without the modulus for even faster lookups.

static final public float sin(float a)
{
    // Limit range if needed.
    if (a > 360)
        a %= 360;

    // compute the index
    int angleIndex = (int)(a*sinScale);
    if ( angleIndex < (180 * sinScale)+1 )
    {
        if ( angleIndex < (90 * sinScale)+1 )
        {
            return sin[angleIndex];
        }
        else
        {
            return sin[(180 * sinScale) - angleIndex];
        }
    }
    else
    {
        if ( angleIndex < (270 * sinScale)+1 )
        {
            return -sin[angleIndex - (180 * sinScale)];
        }
        else
        {
            return -sin[(360 * sinScale) - angleIndex];
        }
    }
}

It may seem that a lot of extra multiplications are going on here, but really there aren’t. All the (angle * sinScale) values should be computed at compile time because sinScale is final. The compiler knows the value cannot change at runtime and places the constant value in the code instead.

The decompiled code for the method sin(float f) for a particular sinScale value looks like the following lines:

public static final float sin(float f)
{
    if(f > 360F)
        f %= 360F;
    int i = (int)(f * 100F);
    if(i < 18001)
        if(i < 9001)
            return sin[i];
        else
            return sin[18000 - i];
    if(i < 27001)
        return -sin[i - 18000];
    else
        return -sin[36000 - i];
}

In the end, all the sinScale multiplications are gone, replaced with the proper constant values. We are left with a small set of operations to get at the sine value.

Array Access Is Not Free

There is a hidden cost to the sine lookup table technique in Java compared to C/C++. Any array access in Java involves a bounds check, as explained in Chapter 6, “Performance and the Java Virtual Machine.” This lookup function is so small and quick that the bounds check is not negligible relative to its overall execution time. It happens only once per call, but it is unavoidable because the VM cannot prove the index will stay in bounds, even though our own angle-limiting code guarantees it. In addition, if the table is quite large, it may not cache well on the host CPU, further increasing the method’s runtime cost.

Regardless, this final sine method is incredibly cheap. The cosine can be computed similarly using the same table, because the cosine of an angle is equal to the sine of that angle plus 90 degrees; with only a few modifications to the sin(angle) method (or simply by shifting the angle before the lookup, as sketched below), we can also get cos(angle). Unfortunately, not all costly math functions have a repeating cycle that can be folded into a table this way. Other functions, such as square root, don’t have this range-limited, repeating behavior and require different techniques.
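A minimal sketch of that cosine shortcut, assuming it lives in the same class as the sin() method above (the book’s actual cos() implementation may differ), is:

static final public float cos(float a)
{
    // cos(a) = sin(a + 90); the sin() lookup handles range reduction and sign
    return sin(a + 90.0f);
}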

Faster Square Root

In C/C++ game programming, a now-classic technique was developed for computing a fast square root approximation. From a primitive data perspective, it is a rather complex series of math operations and bit-twiddling steps that clean up into incredibly tight code. Because the technique manipulates the IEEE data encoding of a native floating-point number and Java uses the same native floating-point numbers, the technique can be ported directly to Java and produce the same results. Unfortunately, depending on the JVM version, it’s not always faster than the standard Java Math.sqrt() method. To see why this is, we must take a closer look at the method.

Implementing a custom square root function is typically done as an iterative approximation function: some part of the calculation is repeated to refine the result, with each iteration increasing the accuracy of the solution. The nice thing about this kind of solution is that the developer can easily trade speed for accuracy by simply iterating more times for better accuracy or fewer times for better speed.
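Before getting to the bit-level tricks, the iterative core by itself can be written in plain Java. The following is a hypothetical illustration (newtonSqrt is not part of the book’s MathUtils class); the update inside the loop is the same one used in the fastSqrt() methods below:

// Newton-Raphson refinement for the square root of x, starting from a rough guess.
// Each extra iteration roughly doubles the number of correct digits.
static double newtonSqrt(double x, double guess, int iterations)
{
    double root = guess;
    for (int i = 0; i < iterations; i++)
    {
        root = 0.5 * (x / root + root);
    }
    return root;
}

// For example, newtonSqrt(16.0, 4.5, 3) returns a value very close to 4.0.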

To make execution even faster, the function operates directly on the floating-point format. The fast square root function for IEEE-encoded floating-point numbers involves converting the floating-point number to its raw bit representation as an integer. This step is done so we can operate on the exponent and the mantissa (the fractional part) of the number separately, and then combine them back into the final floating-point number. This is fairly straightforward to implement in Java, as follows:

// Magic numbers used for computing
// best guess starting square root
static final float A = 0.417319242f;
static final float B = 0.590178532f;

static final public double fastSqrt(double fp)
{
    if( fp <= 0 )
        return 0;

    int expo;
    double root;

    // split into hi and lo bits
    long bitValue = Double.doubleToRawLongBits(fp);
    int lo = (int)(bitValue);
    int hi = (int)(bitValue >> 32);

    // pull out exponent and format
    expo = hi >> 20;
    expo -= 0x3fe;
    // clear exponent and set "normalized" bits
    hi &= 0x000fffff;
    hi += 0x3fe00000;

    // assemble back to double bits
    // (mask lo so its sign bit is not extended into the high word)
    bitValue = ((long)(hi) << 32) + (lo & 0xffffffffL);

    // turn bitValue back into a decimal double for more processing
    fp = Double.longBitsToDouble(bitValue);
    // best-guess starting square root for the mantissa using the magic numbers
    root = A + B * fp;

    // repeat for more accuracy, but a slower function
    root = 0.5 * (fp/root + root);
    root = 0.5 * (fp/root + root);
    root = 0.5 * (fp/root + root);
    //root = 0.5 * (fp/root + root);
    //root = 0.5 * (fp/root + root);
    fp = root;

    // find square root for the exponent
    // ('factor' is the odd-exponent constant, assumed defined elsewhere in the class)
    if ( ( expo & 1 ) != 0 )
    {
        fp *= factor;
        ++expo;
    }
    if ( expo < 0 )
        expo = (short)(expo / 2);
    else
        expo = (short)((expo + 1)/2);

    // format back for floating point
    expo += 0x3fe;
    // split back to hi and lo bits
    bitValue = Double.doubleToLongBits(fp);

    lo = (int)(bitValue);
    hi = (int)(bitValue >> 32);
    // put in exponent
    hi &= 0x000fffff;
    hi += expo << 20;

    // assemble back to double bits
    bitValue = ((long)(hi) << 32) + (lo & 0xffffffffL);

    fp = Double.longBitsToDouble(bitValue);
    return fp;
}

This function isn’t entirely optimized but instead left in this slightly expanded form for readability. Single-precision floating-point number square roots can be computed similarly.

static final public float fastSqrtF(float fp)
{
    if( fp <= 0 )
        return 0;

    int expo;
    float root;

    // convert float to bit representation
    int bitValue = Float.floatToRawIntBits(fp);

    // pull out exponent and format
    expo = (bitValue >> 23);
    // subtract exponent bias
    expo -= 126;
    // clear exponent in bitValue
    bitValue &= 0x7fffff;
    // set the exponent bits to 127,
    // which is really 1 with the bias of 126,
    // effectively making the number decimal only
    bitValue |= 0x3F800000; // ( 127 << 23 )

    // turn bitValue back into a decimal float for more processing
    fp = Float.intBitsToFloat(bitValue);
    // best-guess starting square root for the mantissa using the magic numbers
    // (Af and Bf are magic-number constants assumed defined elsewhere in the class)
    root = Af + Bf * fp;

    // repeat for more accuracy, but a slower function
    root = 0.5f * (fp/root + root);
    root = 0.5f * (fp/root + root);
    root = 0.5f * (fp/root + root);

    // find square root for the exponent
    // ('factor' is the same odd-exponent constant used in fastSqrt)
    if ( ( expo & 1 ) == 0 )
    {
        root *= factor;
    }
    expo++;
    expo >>= 1;

    // format back for floating point
    expo += 126;

    // once again turn back to bits for recombining
    bitValue = Float.floatToRawIntBits(root);
    // clear exponent in bitValue
    bitValue &= 0x7fffff;
    // put the exponent into bitValue
    bitValue += (expo << 23);

    return Float.intBitsToFloat(bitValue);
}

One divided by a square root shows up often in games, particularly in vector normalization. In fact, in many cases 1/square root is computed more often than the square root alone. Because of this, a fast version of the combined operation, called the inverse square root, has also been developed and used in game code. The Java version of this technique follows, constructed along the same lines as the previous methods.

static final public float fastInverseSqrt(float x)
{
    float xhalf = 0.5f*x;
    // convert float to bit representation
    int bitValue = Float.floatToRawIntBits(x);
    // initial best guess using the magic number
    bitValue = 0x5f3759df - (bitValue >> 1);

    // turn bitValue back into a decimal float for more processing
    x = Float.intBitsToFloat(bitValue);
    // repeat for more accuracy, but a slower function
    x = x*(1.5f - xhalf*x*x);
    //x = x*(1.5f - xhalf*x*x);
    return x;
}
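As a usage sketch, here is the kind of code that benefits from it. The normalize() helper and the float-array vector layout are hypothetical; the point is that a single fastInverseSqrt() call replaces both the library square root and the divide:

// Hypothetical helper: normalize a 3D vector stored as {x, y, z}.
// Assumes the vector is non-zero.
static void normalize(float[] v)
{
    float lengthSquared = v[0]*v[0] + v[1]*v[1] + v[2]*v[2];
    float inverseLength = fastInverseSqrt(lengthSquared);
    v[0] *= inverseLength;
    v[1] *= inverseLength;
    v[2] *= inverseLength;
}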

With working, although approximate, square root functions in hand, let’s run some performance tests against the standard Java Math class methods.

This task is going to be a micro-benchmark, and micro-benchmarks in Java are particularly error prone because the JVM and JIT compiler are doing their own work during execution. To make sure the JVM and JIT affect our benchmark as little as possible, we will perform what is known as a warm-up in our test: the test code is run thousands of times beforehand to trigger the JIT to compile it and to give the VM time to optimize it. In addition, we will use static data so that data-retrieval time is as small and consistent as possible.

This performance test creates two arrays of random floats and doubles, runs our different math functions over them, and compares the timing results both as absolute values and as performance ratios.
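The full MathTestsSquareRoot source is not reproduced here, but its overall shape is roughly the following sketch. The array size, loop counts, and method name are illustrative only; System.currentTimeMillis() is used because the 1.4.2 JDK tested here predates System.nanoTime():

// Rough shape of the micro-benchmark: static data, a warm-up pass, then a timed loop.
static float[] data = new float[100000];

static
{
    for (int i = 0; i < data.length; i++)
        data[i] = (float)(Math.random() * 1000.0);
}

static long timeFastSqrtF(int passes)
{
    // warm-up: run the code enough times to trigger JIT compilation
    for (int i = 0; i < 10000; i++)
        MathUtils.fastSqrtF(data[i % data.length]);

    long start = System.currentTimeMillis();
    float sink = 0;
    for (int p = 0; p < passes; p++)
        for (int i = 0; i < data.length; i++)
            sink += MathUtils.fastSqrtF(data[i]);
    long elapsed = System.currentTimeMillis() - start;

    // use the accumulated value so the JIT cannot discard the loop entirely
    System.out.println("value = " + sink);
    return elapsed;
}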

A quick test on JDK 1.4.2 with the default command-line settings produces the following results:

C:\j2sdk1.4.2_02\bin\java MathTestsSquareRoot
Creating random data...
Performing warm-up for square roots test...
Square Root Tests
Base Time Cost = 2336
Double Math.sqrt(double)               Elapsed time 4256    value = 15.915696943287958
Double MathUtils.fastSqrt(double)      Elapsed time 55064   value = 15.915685478989822
Double MathUtils.fasterSqrt(double)    Elapsed time 39836   value = 15.91568548271219
Perf Ratios: std Math = 1.0  fast 1 Math = 12.93796992481203
                             fast 2 Math = 9.359962406015038
Base Time Cost = 1303
Float (float)Math.sqrt(float)          Elapsed time 5214    value = 15.915697
Float MathUtils.fastSqrtF(float)       Elapsed time 44256   value = 15.915696
Float MathUtils.fasterSqrtF(float)     Elapsed time 37781   value = 15.9161
Perf Ratios: std Math = 1.0  fast 1 Math = 8.487917146144994
                             fast 2 Math = 7.246068277713848
Inverse Square Root Tests
Double 1/Math.sqrt(double)             Elapsed time 7725    value = 0.06283105311462497
Float  1/(float)Math.sqrt(float)       Elapsed time 8696    value = 0.06283105
Float  MathUtils.fastInverseSqrt(float) Elapsed time 15203   value = 0.062725544
Perf Ratios: std Math = 1.0  fast 1 Math = 1.125695792880259
                             fast 2 Math = 1.9680258899676375

Any number greater than 1.0 in the Perf (performance) Ratios means that the method is slower than the standard Math equivalent. In every case, the alternate methods were more expensive, and more than 10 times as expensive in some cases. What is happening—these new methods were supposed to be faster!

Just about every operation in the function executes lightning fast, and the total should be quicker than the standard full-precision square root functions. Remember, we are trading accuracy for speed. In Java, however, there is one weakness in this method: the XXXToRawLongBits() and XXXBitsToDouble/Float() method calls. The technique requires converting the floating-point number to a bits-only representation. In a systems language such as C, this is done with an unsafe cast and no copy. In Java, for security reasons, the conversion requires a method call and a copy; this is the only way to perform this C-like cast in Java. On a good VM, the method call can be optimized at runtime, but the copy still happens, which slows things down yet can’t quite account for the difference.

In addition, it turns out that Sun has done quite a bit of work improving the notoriously slow Math methods in Java. The 1.4.2 JVM and JIT compiler are now aware of the standard Math API and make heavy optimizations to get those methods as close to pure C speed as possible, so much so that a regular Java method that does even fewer computations can be significantly slower than java.lang.Math. This is great news for anyone using Java, because it means that simply upgrading your VM version gets you some of the best Java math possible.

However, if you cannot upgrade, or if you are using some exotic command-line options, these improved functions can still help, even on 1.4.2. A rerun of the previous test follows, but with the -Xcompile option on 1.4.2. This option forces JIT compilation at class load time, which seems to prevent the JIT from making the same math optimizations as before.

C:\j2sdk1.4.2_02\bin\java -Xcompile MathTestsSquareRoot
Creating random data...
Performing warm-up for square roots test...
Square Root Tests
Base Time Cost = 2072
Double Math.sqrt(double)               Elapsed time 219858  value = 15.915696943287958
Double MathUtils.fastSqrt(double)      Elapsed time 55419   value = 15.915685478989822
Double MathUtils.fasterSqrt(double)    Elapsed time 40011   value = 15.91568548271219
Perf Ratios: std Math = 1.0  fast 1 Math = 0.2520672434025598
                             fast 2 Math = 0.18198564528013536
Base Time Cost = 1135
Float (float)Math.sqrt(float)          Elapsed time 223856  value = 15.915697
Float MathUtils.fastSqrtF(float)       Elapsed time 44282   value = 15.915696
Float MathUtils.fasterSqrtF(float)     Elapsed time 37629   value = 15.9161
Perf Ratios: std Math = 1.0  fast 1 Math = 0.19781466657136731
                             fast 2 Math = 0.1680946680008577
Inverse Square Root Tests
Double 1/Math.sqrt(double)             Elapsed time 224095  value = 0.06283105311462497
Float  1/(float)Math.sqrt(float)       Elapsed time 221134  value = 0.06283105
Float  MathUtils.fastInverseSqrt(float) Elapsed time 14766   value = 0.062725544
Perf Ratios: std Math = 1.0  fast 1 Math = 0.9867868537896874
                             fast 2 Math = 0.06589169771748589

As was originally expected, the new approximate square root methods are faster—in the floating point cases, executing in less than 1⁄5 the time on this test machine and with pretty good accuracy.

Further test combinations are summarized in the graphs in Figure 8.2 and Figure 8.3.

Figure 8.2: Square Root Tests, slower than standard math.

Figure 8.3: Square Root Tests, faster than standard math.

Both graphs sum up the time cost of the double square root, float square root, and inverse square root functions as compared to the standard math API across the JVM 1.4.2 and 1.5 beta with default, -server, and -Xcompile command-line options. The first graph shows the four fastest VMs and settings; the second graph shows the remaining four slowest.

A few interesting things come to light in these tests.

First, in many cases (float)Math.sqrt(float) is actually slightly more expensive than double Math.sqrt(double). This result is unfortunate but makes sense, because the function actually processes doubles and has the added overhead of casting to and from float.

Second, the alternative square root methods are slower than the standard API methods under the default JVM settings but faster than the standard API methods under -Xcompile or -server, with the exception of the 1.4.2 -Xcompile -server combination. In fact, the second graph shows that those options make the standard math methods many times more expensive, which is why the alternate methods come out cheaper. The alternate methods also show less variance than the standard math API across different JVMs and options.


