Besides getting a faster JVM implementation, it is also possible for one compiler to generate bytecodes that run more efficiently on the same JVM. Redundant operations can be eliminated. Slow operations can be replaced with fast ones.
One reason for implementing bytecode optimization is that you can take advantage of the semantics of the original language. For example, some of the techniques discussed in this section produce incorrect results in a multithreaded environment. If you are writing a compiler for a language that does not support multithreading, then you can perform optimizations not available to languages that must properly support threads.
Similarly, the Java language makes it difficult to perform inlining operations between classes, due to the binary compatibility requirements in The Java Language Specification. A compiler for a language other than Java does not have to respect these binary compatibility requirements.
Many of the techniques discussed in this section are based on assembly language optimization techniques.
The optimizations described in this section can also be performed automatically by the JVM implementation itself. Generally, the more effort you go through to perform these optimizations yourself, the harder it will be for the JVM's optimizer to perform additional optimizations.
Professor Donald Knuth said, "Premature optimization is the root of all evil." Think carefully before spending a lot of time optimizing your programs at the bytecode level. Your work may actually make performance worse instead of better.
14.2.2 Constant Propagation
Constant propagation involves replacing variables with constants, if the optimizer proves that the variable does not change. For example, consider this snippet of JVM code:
.method static subtract32([I)V .var 0 is array [I .var 1 is n I .var 2 is i I bipush 32 ; Push the constant 32 istore_1 ; Store it in n iconst_0 ; Initialize i to 0 istore_2 loop: aload_0 aload_0 ; Push the array iload_2 ; Push i iaload ; Push array[i] iload_1 ; Push n isub ; Compute array[i]-n iload_2 ; Store the result in array[i] iastore iinc 2 1 ; Increment i aload_0 ; Loop if i <= array.length arraylength if_icmple loop
This code subtracts the value of n (variable 1) from each element of the array. If you examine the code, you will discover that the value of n never changes; it is always 32. Therefore, you can replace the instruction
iload_1 ; Push n
with the more efficient
bipush 32 ; Push 32
The bipush instruction is more efficient that the iload instruction because iload requires fetching a value from memory, which is usually much slower than the CPU itself. Loading a constant can be performed without fetching anything from memory, since the value is part of the instruction itself.
Constant propagation is made difficult in many systems by pointer shadowing. Pointer shadowing occurs when two different pointers refer to the same space in memory. Pointer shadowing makes it hard to predict what parts of the program will cause which memory to change. If the optimizer knows that a piece of memory will not change, it can eliminate code that cannot affect the outcome of the computation.
In the JVM instruction set, there is no way to "shadow" a local variable. That is, there is no way to have multiple pointers to the same local variable. This makes it easier to detect whether or not a particular local variable will change during the course of a loop. In the constant propagation example code, it is obvious that there is no way to change the value of n because only the currently executing method has access to it.
A similar optimization can be performed with fields. This optimization is more difficult than that for local variables because it is possible to have multiple references to the same object. By carefully examining local variables, it is sometimes possible to prove that there is only one reference to an object. For example, if the object is created within the body of the method and there are no instructions that store into the field, then you can treat the field as a constant.
14.2.3 Strength Reduction
In strength reduction, an expensive operation is replaced by an operation that does the same job but is less costly to run. This is a more general form of constant propagation. To do this, you have to know which operations are likely to be more expensive than others. This differs from JVM implementation to JVM implementation.
For this reason, strength reduction is best performed by the JVM implementation itself. Some JIT compilers perform strength reduction when compiling the Java bytecodes into native instructions. Generally, the JIT can do a better job than you can of making these decisions, since the JIT is optimized for the particular computer it runs on. The JIT developer knows exactly how fast any given instruction can run.
One operation that is typically slow is multiplication. For example,
x * 32
is equivalent to
x << 5
Generally, performing the left shift operation (<<) is faster than doing a multiply (*) for the same reason people find multiplying by 1,000 easier than multiplying by 317.
Even if the value isn't exactly a power of 2, it may be faster to rephrase multiplication in terms of shifts and addition. This way,
x * 34
is equivalent to
x << 5 + x << 1
Sometimes you can compute the results of an operation even before it is executed and replace the code with a simple push of the results. Consider this Java code:
int seconds_in_a_day = 60*60*24;
A naïve compiler might compile this to
bipush 60 bipush 60 imul bipush 24 imul
A better compiler would generate
This single instruction is likely to execute significantly faster than the five instructions. Although the difference may seem trivial, when it is executed a million times the differences add up.
14.2.4 Loop Unrolling
In some loops, the time it takes to execute the body of the loop may not be all that much larger than the loop tests. Consider, for example, this code to count the number of 1 bits in the int m:
for(int i = 0; i < 32; i++) n += m >> i & 1;
A naïve Java compiler might generate this code:
loop: iload_1 ; Compare i to 32 bipush 32 if_icmpge break ; Break when i >= 32 iload_3 ; Push n iload_2 ; Push m iload_1 ; Push i ishr ; Compute m >> i iconst_1 iand ; Compute m >> i & 1 iadd ; Add that value to n iinc 1 1 ; Increment i goto loop ; Loop again
This code will increment i 32 times and test if i is less than 32 the same number of times. It's obvious that the test will fail the first 31 times and succeed the last time. You can eliminate the tests by "unrolling" the loop, like this:
iload_2 ; Push m iconst_0 ; Push 0 ishr ; Compute m >> 0 iconst_1 ; Push 1 iand ; Compute m >> 0 & 1 (the first bit) iload_3 ; Push n iadd ; Add the result to n iload_2 ; Push m iconst_1 ; Push 1 ishr ; Compute m >> 1 iconst_1 ; Push 1 iload_3 ; Add n iand ; Compute m >> 1 & 1 (the second bit) ;; Repeat this pattern 29 more times iload_2 ; Push m bipush 31 ; Push 31 ishr ; Compute m >> 31 (the leftmost bit) iconst_1 ; Push 1 iload_3 ; Push n iadd ; Add the last bit to n
Although this greatly expands the size of the resulting code, the total number of instructions executed is reduced. It also allowed us to completely eliminate local variable 1, which had been used as the loop counter. This results in a speedup of 200% to 400%, depending on the virtual machine implementation used.
14.2.5 Peephole Optimization
Compilers often generate code that has redundant instructions. A peephole optimizer looks at the resulting bytecodes to eliminate some of the redundant instructions. This involves looking at the bytecodes a few instructions at a time, rather than doing wholesale reorganization of the code. The name "peephole optimization" comes from the tiny "window" used to view the code.
A peephole optimizer is frequently applied as the last stage of a compiler before code is written to disk. Sometimes it is applied earlier, before other optimizations are done, and then applied again afterwards.
For example, consider this statement:
if(a == b) // This is if #1 foo(); // True case of if #1 else // False case of if #1 if(a == c) // This is if #2 bar(); // True case of if #2 else baz(); // False case of if #2
A Java compiler might generate this code:
iload_1 iload_2 if_icmpeq label1 ; This is if #1 iload_1 ; False case of if #1 iload_3 if_icmpeq label2 ; This is if #2 invokestatic cls/baz()V ; False case of if #2 goto label3 ; Skip true case of if #2 label2: invokestatic cls/bar()V ; True case of if #2 label3: goto label4 ; Skip true case of if #1 label1: invokestatic cls/foo()V ; True case of if #1 label4:
just goes to another goto, which goes to label4. It would be more efficient to replace this line directly with
Another common set of code produced by Java compilers occurs when assignments are used as statements. Since an assignment is an expression, it should leave a value on the stack; that value is the value assigned. However, when the expression is used as a statement, that value must be removed, since statements leave no values on the stack.
For example, the Java code
x = y = z;
might compile to
.var 0 is x I .var 1 is y I .var 2 is z I iload_2 ; Push z dup ; Duplicate the value, so that the ; result is the value assigned istore_1 ; Store it in y. This leaves the ; result of y=z on the stack. dup ; Repeat the pattern: duplicate the value istore_0 ; Store it in x pop ; Remove the value
The first dup is appropriate, since we want to leave the value of z on top of the stack to be assigned to x. However, when the pattern is naively repeated, the second dup is unnecessary. A peephole optimizer might replace it with
iload_0 ; Push z dup ; Copy it for the second assignment istore_1 ; Store it in y istore_2 ; Store it in x
This eliminates both the dup and the pop, but the resulting code has the same effect.
A peephole optimizer yields some of the same results as strength reduction. If a peephole optimizer sees this code
bipush 60 bipush 60 imul bipush 24 imul
it can recognize that it is equivalent to
sipush 3600 bipush 24 imul
which can be further reduced to
Table 14.1 lists some other rules for a peephole optimizer. The variables x and y stand for any field or local variable. For the operations aload and astore, you may perform the same operations on the corresponding float and int instructions.
It may not be correct to apply peephole optimization techniques when the peephole includes an instruction that is the destination of a goto. For example, in this code
iinc 1 1 ; Increment variable i loop: iinc 1 1 ; Increment i again ;; Use variable i goto loop
it would be an error to combine the first two lines into
iinc 1 2 ; Increment i 2
because the destination loop would suddenly disappear, and the resulting code might be meaningless.