14.2 Bytecode Optimization Techniques

Besides getting a faster JVM implementation, you can also make a program run faster by having the compiler generate bytecodes that run more efficiently on the same JVM. Redundant operations can be eliminated, and slow operations can be replaced with fast ones.

One reason for implementing bytecode optimization is that you can take advantage of the semantics of the original language. For example, some of the techniques discussed in this section produce incorrect results in a multithreaded environment. If you are writing a compiler for a language that does not support multithreading, then you can perform optimizations not available to languages that must properly support threads.

Similarly, the Java language makes it difficult to perform inlining operations between classes, due to the binary compatibility requirements in The Java Language Specification. A compiler for a language other than Java does not have to respect these binary compatibility requirements.

Many of the techniques discussed in this section are based on assembly language optimization techniques.

14.2.1 Warning

The optimizations described in this section can also be performed automatically by the JVM implementation itself. Generally, the more effort you go through to perform these optimizations yourself, the harder it will be for the JVM's optimizer to perform additional optimizations.

Professor Donald Knuth said, "Premature optimization is the root of all evil." Think carefully before spending a lot of time optimizing your programs at the bytecode level. Your work may actually make performance worse instead of better.

14.2.2 Constant Propagation

Constant propagation involves replacing variables with constants, if the optimizer proves that the variable does not change. For example, consider this snippet of JVM code:

 .method static subtract32([I)V
 .var 0 is array [I
 .var 1 is n I
 .var 2 is i I
     bipush 32           ; Push the constant 32
     istore_1            ; Store it in n
     iconst_0            ; Initialize i to 0
     istore_2
 loop:
     aload_0             ; Push the array
     iload_2             ; Push i
     aload_0             ; Push the array again
     iload_2             ; Push i again
     iaload              ; Push array[i]
     iload_1             ; Push n
     isub                ; Compute array[i]-n
     iastore             ; Store the result in array[i]
     iinc 2 1            ; Increment i
     iload_2             ; Loop if i < array.length
     aload_0
     arraylength
     if_icmplt loop
     return
 .end method

This code subtracts the value of n (variable 1) from each element of the array. If you examine the code, you will discover that the value of n never changes; it is always 32. Therefore, you can replace the instruction

 iload_1           ; Push n 

with the more efficient

 bipush 32         ; Push 32 

The bipush instruction is more efficient than the iload instruction because iload requires fetching a value from memory, which is usually much slower than operating on values the CPU already has. Loading a constant requires no separate memory fetch, since the value is part of the instruction itself.

Constant propagation is made difficult in many systems by pointer shadowing, more commonly called pointer aliasing. Pointer shadowing occurs when two different pointers refer to the same location in memory. It makes it hard to predict which parts of the program may cause which memory to change. If the optimizer can prove that a location will not change, it can treat values loaded from that location as constants.

In the JVM instruction set, there is no way to "shadow" a local variable. That is, there is no way to have multiple pointers to the same local variable. This makes it easier to detect whether or not a particular local variable will change during the course of a loop. In the constant propagation example code, it is obvious that there is no way to change the value of n because only the currently executing method has access to it.
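
The detection described above can be sketched as a small pass over a method's instructions. The string-based instruction encoding and the helper below are hypothetical, purely for illustration; a real optimizer would work on parsed bytecodes. A local variable is treated as a constant when it is stored exactly once, and that store is fed by a constant push:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ConstantPropagation {
    // Replaces "iload_N" with "bipush K" when local N is stored exactly
    // once, and that single store is immediately preceded by "bipush K".
    static List<String> propagate(List<String> code) {
        // Count stores to each local and remember the constant stored.
        Map<String, Integer> storeCount = new HashMap<>();
        Map<String, String> constant = new HashMap<>();
        for (int i = 0; i < code.size(); i++) {
            String insn = code.get(i);
            if (insn.startsWith("istore_")) {
                String var = insn.substring("istore_".length());
                storeCount.merge(var, 1, Integer::sum);
                if (i > 0 && code.get(i - 1).startsWith("bipush "))
                    constant.put(var, code.get(i - 1));
            }
        }
        // Rewrite loads of provably constant locals.
        List<String> out = new ArrayList<>();
        for (String insn : code) {
            if (insn.startsWith("iload_")) {
                String var = insn.substring("iload_".length());
                if (storeCount.getOrDefault(var, 0) == 1
                        && constant.containsKey(var)) {
                    out.add(constant.get(var));   // e.g. iload_1 -> bipush 32
                    continue;
                }
            }
            out.add(insn);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(propagate(List.of(
            "bipush 32", "istore_1",      // n = 32, never changed again
            "iload_2", "iload_1", "isub")));
        // [bipush 32, istore_1, iload_2, bipush 32, isub]
    }
}
```

A production pass would also leave the dead bipush/istore pair to a later dead-code step; the sketch only performs the substitution itself.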

A similar optimization can be performed with fields. This optimization is more difficult than that for local variables because it is possible to have multiple references to the same object. By carefully examining local variables, it is sometimes possible to prove that there is only one reference to an object. For example, if the object is created within the body of the method and there are no instructions that store into the field, then you can treat the field as a constant.

14.2.3 Strength Reduction

In strength reduction, an expensive operation is replaced by an operation that does the same job but is less costly to run. This is a more general form of constant propagation. To do this, you have to know which operations are likely to be more expensive than others, and that varies from one JVM implementation to another.

For this reason, strength reduction is best performed by the JVM implementation itself. Some JIT compilers perform strength reduction when compiling the Java bytecodes into native instructions. Generally, the JIT can do a better job than you can of making these decisions, since the JIT is optimized for the particular computer it runs on. The JIT developer knows exactly how fast any given instruction can run.

One operation that is typically slow is multiplication. For example,

 x * 32 

is equivalent to

 x << 5 

Generally, performing the left shift operation (<<) is faster than doing a multiply (*) for the same reason people find multiplying by 1,000 easier than multiplying by 317.

Even if the multiplier isn't exactly a power of 2, it may be faster to rephrase multiplication in terms of shifts and additions. For example,

 x * 34 

is equivalent to

 (x << 5) + (x << 1) 
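
These identities are easy to check in Java itself. This is only a sketch of the arithmetic, not of any optimizer; note the parentheses, since << binds less tightly than +:

```java
public class StrengthReduction {
    // x*32 is a single shift; x*34 = x*32 + x*2 decomposes into two
    // shifts and an add. The identities hold for every int, including
    // negative values, because shifts and adds agree with
    // multiplication modulo 2^32.
    static int times32(int x) { return x << 5; }
    static int times34(int x) { return (x << 5) + (x << 1); }

    public static void main(String[] args) {
        for (int x : new int[] { 0, 1, -1, 7, 1000, -12345 }) {
            if (x * 32 != times32(x) || x * 34 != times34(x))
                throw new AssertionError("mismatch at " + x);
        }
        System.out.println("all identities hold");
    }
}
```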

Sometimes you can compute the results of an operation even before it is executed and replace the code with a simple push of the results. Consider this Java code:

 int seconds_in_a_day = 60*60*24; 

A naïve compiler might compile this to

 bipush 60
 bipush 60
 imul
 bipush 24
 imul

A better compiler would generate

 ldc 86400 

This single instruction is likely to execute significantly faster than the original five. Although the difference may seem trivial, when the code is executed a million times the difference adds up.
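
Folding like this can be sketched as a rewrite over the instruction stream. The generic "push" mnemonic and the string encoding here are hypothetical stand-ins for whichever of bipush, sipush, or ldc the constant's size requires:

```java
import java.util.ArrayList;
import java.util.List;

public class ConstantFolder {
    // Folds "push K1; push K2; imul" into "push (K1*K2)", repeating
    // until no more folding is possible.
    static List<String> fold(List<String> code) {
        List<String> out = new ArrayList<>(code);
        boolean changed = true;
        while (changed) {
            changed = false;
            for (int i = 0; i + 2 < out.size(); i++) {
                if (out.get(i).startsWith("push ")
                        && out.get(i + 1).startsWith("push ")
                        && out.get(i + 2).equals("imul")) {
                    int a = Integer.parseInt(out.get(i).substring(5));
                    int b = Integer.parseInt(out.get(i + 1).substring(5));
                    out.subList(i, i + 3).clear();   // remove the triple
                    out.add(i, "push " + (a * b));   // insert the product
                    changed = true;
                    break;
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(fold(List.of(
            "push 60", "push 60", "imul", "push 24", "imul")));
        // [push 86400]
    }
}
```

The pass runs to a fixed point, so 60*60*24 folds in two steps: first to 3600*24, then to 86400.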

14.2.4 Loop Unrolling

In some loops, the time it takes to execute the body of the loop may not be much larger than the time spent on the loop test itself. Consider, for example, this code to count the number of 1 bits in the int m:

 for(int i = 0; i < 32; i++)
     n += m >> i & 1;

A naïve Java compiler might generate this code:

 loop:
     iload_1                    ; Compare i to 32
     bipush 32
     if_icmpge break            ; Break when i >= 32
     iload_3                    ; Push n
     iload_2                    ; Push m
     iload_1                    ; Push i
     ishr                       ; Compute m >> i
     iconst_1
     iand                       ; Compute m >> i & 1
     iadd                       ; Add that value to n
     istore_3                   ; Store the sum back into n
     iinc 1 1                   ; Increment i
     goto loop                  ; Loop again
 break:

This code increments i 32 times and performs the comparison 33 times, even though the outcome of each comparison is completely predictable: the test i >= 32 fails the first 32 times and succeeds only on the last. You can eliminate the tests by "unrolling" the loop, like this:

     iload_3      ; Push n
     iload_2      ; Push m
     iconst_1     ; Push 1
     iand         ; Compute m & 1 (the first bit)
     iadd         ; Add it to n
     iload_2      ; Push m
     iconst_1     ; Push 1
     ishr         ; Compute m >> 1
     iconst_1     ; Push 1
     iand         ; Compute m >> 1 & 1 (the second bit)
     iadd         ; Add it to the running total
     ;; Repeat this pattern 29 more times
     iload_2      ; Push m
     bipush 31    ; Push 31
     ishr         ; Compute m >> 31 (the leftmost bit)
     iconst_1     ; Push 1
     iand         ; Compute m >> 31 & 1 (the last bit)
     iadd         ; Add the last bit to the total
     istore_3     ; Store the result in n

Although this greatly expands the size of the resulting code, the total number of instructions executed is reduced. It also allows us to eliminate local variable 1 entirely, which had been used as the loop counter. This results in a speedup of 200% to 400%, depending on the virtual machine implementation used.
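
The same idea can be tried at the Java source level. Fully unrolling writes out all 32 steps; the sketch below unrolls by a factor of four, which already cuts the number of loop tests from 32 to 8. Integer.bitCount is used only to check the results:

```java
public class Unroll {
    // Rolled form: one test and one increment per bit.
    static int bitsRolled(int m) {
        int n = 0;
        for (int i = 0; i < 32; i++)
            n += m >> i & 1;
        return n;
    }

    // Partially unrolled by 4: the body handles four bits per pass,
    // so the test and increment run only 8 times instead of 32.
    static int bitsUnrolled(int m) {
        int n = 0;
        for (int i = 0; i < 32; i += 4) {
            n += m >> i & 1;
            n += m >> (i + 1) & 1;
            n += m >> (i + 2) & 1;
            n += m >> (i + 3) & 1;
        }
        return n;
    }

    public static void main(String[] args) {
        for (int m : new int[] { 0, 1, -1, 0xF0F0F0F0, 12345678 }) {
            if (bitsRolled(m) != Integer.bitCount(m)
                    || bitsUnrolled(m) != Integer.bitCount(m))
                throw new AssertionError("mismatch at " + m);
        }
        System.out.println("rolled and unrolled agree");
    }
}
```

Whether the partially or fully unrolled version is actually faster depends on the JVM implementation, as the warning at the start of this section cautions.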

14.2.5 Peephole Optimization

Compilers often generate code that has redundant instructions. A peephole optimizer looks at the resulting bytecodes to eliminate some of the redundant instructions. This involves looking at the bytecodes a few instructions at a time, rather than doing wholesale reorganization of the code. The name "peephole optimization" comes from the tiny "window" used to view the code.

A peephole optimizer is frequently applied as the last stage of a compiler before code is written to disk. Sometimes it is applied earlier, before other optimizations are done, and then applied again afterwards.

For example, consider this statement:

 if(a == b)              // This is if #1
     foo();              //    True case of if #1
 else                    //    False case of if #1
     if(a == c)          //    This is if #2
         bar();          //       True case of if #2
     else
         baz();          //       False case of if #2

A Java compiler might generate this code:

    iload_1
    iload_2
    if_icmpeq label1                  ; This is if #1
        iload_1                       ;    False case of if #1
        iload_3
        if_icmpeq label2              ;    This is if #2
            invokestatic cls/baz()V   ;    False case of if #2
            goto label3               ;    Skip true case of if #2
label2:
            invokestatic cls/bar()V   ;    True case of if #2
label3:
        goto label4                   ;    Skip true case of if #1
label1:
        invokestatic cls/foo()V       ;    True case of if #1
label4:

The line

 goto label3 

just goes to another goto, which goes to label4. It would be more efficient to replace this line directly with

 goto label4 
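
A peephole pass can make this replacement mechanically by following chains of gotos, a transformation usually called jump threading. The sketch below works on a hypothetical string encoding of the code, with each label written as its own "name:" line:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class JumpThreading {
    // Rewrites "goto L" so that it jumps to the final destination
    // whenever the instruction at L is itself an unconditional goto.
    static List<String> thread(List<String> code) {
        // Map each label to the instruction that follows it.
        Map<String, String> after = new HashMap<>();
        for (int i = 0; i < code.size() - 1; i++) {
            String line = code.get(i);
            if (line.endsWith(":"))
                after.put(line.substring(0, line.length() - 1), code.get(i + 1));
        }
        List<String> out = new ArrayList<>();
        for (String insn : code) {
            // Follow at most a few hops, so a cycle of gotos
            // cannot send the optimizer into an infinite loop.
            for (int hops = 0; hops < 8 && insn.startsWith("goto "); hops++) {
                String next = after.get(insn.substring(5));
                if (next == null || !next.startsWith("goto "))
                    break;
                insn = next;
            }
            out.add(insn);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> code = List.of(
            "goto label3",
            "label2:", "invokestatic cls/bar()V",
            "label3:", "goto label4",
            "label4:", "return");
        System.out.println(thread(code).get(0));   // goto label4
    }
}
```

Once every goto points at its final destination, a later dead-code pass can remove any goto that is no longer reachable.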

Another common set of code produced by Java compilers occurs when assignments are used as statements. Since an assignment is an expression, it should leave a value on the stack; that value is the value assigned. However, when the expression is used as a statement, that value must be removed, since statements leave no values on the stack.

For example, the Java code

 x = y = z; 

might compile to

 .var 0 is x I
 .var 1 is y I
 .var 2 is z I
 iload_2                         ; Push z
 dup                             ; Duplicate the value, so that the
                                 ; result is the value assigned
 istore_1                        ; Store it in y. This leaves the
                                 ; result of y=z on the stack.
 dup                             ; Repeat the pattern: duplicate the value
 istore_0                        ; Store it in x
 pop                             ; Remove the value

The first dup is appropriate, since we want to leave the value of z on top of the stack to be assigned to x. However, when the pattern is naively repeated, the second dup is unnecessary. A peephole optimizer might replace it with

 iload_2                         ; Push z
 dup                             ; Copy it for the second assignment
 istore_1                        ; Store it in y
 istore_0                        ; Store it in x

This eliminates both the second dup and the pop, but the resulting code has the same effect.

A peephole optimizer yields some of the same results as strength reduction. If a peephole optimizer sees this code

 bipush   60
 bipush   60
 imul
 bipush   24
 imul

it can recognize that it is equivalent to

 sipush   3600
 bipush   24
 imul

which can be further reduced to

 ldc 86400 

Table 14.1 lists some other rules for a peephole optimizer. The variables x and y stand for any field or local variable. The rules given for aload and astore apply equally to the corresponding int, float, long, and double load and store instructions.

Table 14.1. Peephole optimization rules

Replace: dup; swap
With:    dup
Reason:  The result of a dup is two copies of the same element; swapping
         them produces an identical result.

Replace: swap; swap
With:    (nothing)
Reason:  A double swap accomplishes nothing.

Replace: iinc 1 1; iinc 1 1
With:    iinc 1 2
Reason:  Since it takes just as long to increment by 2 as to increment
         by 1 (on most systems), you might as well do the increment
         just once.

Replace: nop
With:    (nothing)
Reason:  A nop instruction has no effect, so you might as well
         eliminate it.

Replace: getfield x; getfield x
With:    getfield x; dup
Reason:  Two getfield instructions, one immediately after the other,
         leave no opportunity for the field value to change, so you can
         replace the second one with a dup. Exception: if the field is
         marked volatile, its value may be changed by another thread,
         and this optimization would be in error.

Replace: aload x; aload x
With:    aload x; dup
Reason:  Some systems have special fast operations for duplicating the
         top element of the stack.

Replace: astore x; aload x
With:    dup; astore x
Reason:  Some systems have special fast operations for duplicating the
         top element of the stack.

Replace: aload x; astore x
With:    (nothing)
Reason:  These instructions cancel each other out.

Replace: astore x; astore x
With:    pop; astore x
Reason:  Storing into the same variable twice is unnecessary; only the
         second store has a lasting effect, and it stores the value
         beneath the top of the stack.

Replace: aload x; pop
With:    (nothing)
Reason:  These instructions cancel each other out.

Replace: aload x; aload y; areturn
With:    aload y; areturn
Reason:  The value of x is buried beneath y when the method returns, so
         it is never used and there is no reason to push it. This
         optimization applies to any value-pushing instruction except a
         method invocation, branch instruction, or storing into a field.

It may not be correct to apply peephole optimization techniques when the peephole includes an instruction that is the destination of a goto. For example, in this code

          iinc 1 1                ; Increment variable i
loop:     iinc 1 1                ; Increment i again
          ;; Use variable i
          goto loop

it would be an error to combine the first two lines into

 iinc 1 2                ; Increment i by 2 

because the label loop, the destination of the goto, would disappear. The goto at the bottom branches to the second iinc; after the combination there is no instruction for it to branch to, and branching to the combined iinc would incorrectly increment i by 2 on every iteration.
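
A sketch of the iinc rule with this hazard handled (hypothetical string encoding, labels kept as their own lines): because a label line sits between the two iinc instructions, the pair is never adjacent, and the rule correctly refuses to fire:

```java
import java.util.ArrayList;
import java.util.List;

public class Peephole {
    // Combines adjacent "iinc V A; iinc V B" into "iinc V (A+B)".
    // Labels are kept as their own lines (e.g. "loop:"), so a label
    // between the two increments -- a possible branch destination --
    // keeps them from being adjacent.
    static List<String> combineIinc(List<String> code) {
        List<String> out = new ArrayList<>(code);
        for (int i = 0; i + 1 < out.size(); ) {
            String[] a = out.get(i).split(" ");
            String[] b = out.get(i + 1).split(" ");
            if (a.length == 3 && b.length == 3
                    && a[0].equals("iinc") && b[0].equals("iinc")
                    && a[1].equals(b[1])) {
                int sum = Integer.parseInt(a[2]) + Integer.parseInt(b[2]);
                out.set(i, "iinc " + a[1] + " " + sum);
                out.remove(i + 1);
            } else {
                i++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Safe to combine: no label between the increments.
        System.out.println(combineIinc(List.of(
            "iinc 1 1", "iinc 1 1", "goto loop")));
        // [iinc 1 2, goto loop]

        // Not combined: "loop:" may be a branch destination.
        System.out.println(combineIinc(List.of(
            "iinc 1 1", "loop:", "iinc 1 1", "goto loop")));
        // [iinc 1 1, loop:, iinc 1 1, goto loop]
    }
}
```

A real peephole optimizer applies the same discipline to every rule: any window that contains a branch destination is left alone.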

Exercise 14.1

Consider this Java code:

 i = i + i; 

Give the Oolong code that a simple Java compiler would generate. Then optimize the code by hand.



Programming for the Java Virtual Machine
ISBN: 0201309726
Year: 1998
Pages: 158
Authors: Joshua Engel
