Runtime Execution Optimizations

Interpreted languages generally have a significant disadvantage when it comes to execution performance. This shortcoming exists because the code is executed by a VM instead of running directly on the physical machine. The VM is an emulator that allows code written for a stack machine to run on specific hardware. This also means that Java bytecodes do not take advantage of the capabilities of the underlying machine, which is, of course, what allows the code to be platform independent. One way to speed up the execution of Java bytecodes is to build actual hardware that understands Java opcodes and can execute the bytecodes directly. The disadvantage would be that the code would run fast only on such machines and would still have to be emulated on all other hardware.

Another option is to compile the bytecodes to instructions that are native to the underlying machine so that the bytecodes do not have to be interpreted by the VM. This approach raises the following question: when should the bytecodes be converted to native CPU instructions, before runtime or during runtime? Compiling Java source code or bytecodes to the language of the underlying machine ahead of time would make the resulting code platform dependent. On the other hand, compiling bytecodes to machine code during runtime allows the code to remain platform independent while still being efficient. This is exactly the job of a just-in-time (JIT) compiler. The JIT compiler is obviously platform dependent because it has to understand the instructions of the underlying machine. This means that if a VM implemented for a specific platform comes with a JIT compiler, Java bytecodes can be executed as efficiently as code written specifically for that platform.

JIT compilers were shipped with JRE 1.2. The first versions of Sun VMs that came with a JIT compiler aggressively loaded and compiled the methods of all classes to, say, Intel 586 instructions. Before bytecodes can be compiled to native instructions, their symbolic references have to be resolved, and as mentioned in a previous section, resolving symbolic references requires the referred-to classes to be loaded first. Consequently, when the VM was launched, all classes were loaded and compiled, which resulted in significantly longer startup times. Besides the startup time needed to load all classes, additional memory was required to hold them, and even more memory was required to store the native CPU instructions generated from the compiled bytecodes. In other words, the JIT compilers of older VMs produced faster code at the cost of a significant amount of memory.

The HotSpot VMs are the latest generation of VMs. They are designed to allow for even better performance while reducing the amount of required memory. The HotSpot VMs are built on the observation that most of the execution time of a program is spent in a small segment of the code. The 90-10 rule proposes that 90 percent of the execution time is spent in 10 percent of the code. Instead of trying to load, optimize, and compile the entire code of an application, HotSpot VMs load classes only when needed, try to locate the hot spots of the application, and perform substantial optimizations on those segments. This strategy has proven to be very successful.

In addition to compiling bytecodes to native code, making sure that the produced native code is well optimized can result in significant performance gains. Many C/C++ compilers perform optimizations such as loop unrolling, inlining, and data rearrangement to produce more optimized code. The same holds for Java code. More optimized Java bytecodes can result in better performance when the code is interpreted, as well as better native code when the bytecodes are compiled to native code. Static Java compilers (such as javac.exe) can perform optimizations similar to those of optimizing C/C++ compilers. Older JDK compilers supported the -O option, which signaled the compiler to perform additional optimizations and generate optimized bytecodes. The optimizations included inlining and many other typical optimizations. The compiler that comes with JDK 1.4 does not perform any special optimizations; in fact, it does less than the older compilers. The few simple optimizations it does perform include constant folding and inlining of final fields. Constant folding means computing constant expressions at compile time, such as the following values:

final int MASK_A = 1<<1;
final int MASK_B = 1<<2;
final int MASK_DEFAULT = MASK_A | MASK_B;

Inlining of final values means that when code refers to these members, the actual value is embedded in the code instead of a symbolic reference.
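To make this concrete, here is a minimal sketch (the class names Masks and Client are invented for this example) showing what the compiler effectively does with compile-time constants:

class Masks {
    // Constant folding: the class file stores 2, 4, and 6 directly.
    static final int MASK_A = 1 << 1;
    static final int MASK_B = 1 << 2;
    static final int MASK_DEFAULT = MASK_A | MASK_B;
}

class Client {
    // Inlining of final values: this line is compiled as if it were
    // written "int flags = 6;", with no symbolic reference to
    // Masks.MASK_DEFAULT remaining in Client's bytecodes.
    int flags = Masks.MASK_DEFAULT;
}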

The more sophisticated optimizations are left for HotSpot, which can perform aggressive optimizations on the hot spots of the code. HotSpot keeps track of how many times each method is invoked. Once the counter exceeds a threshold, the method is considered hot and is optimized and compiled to native code. This approach has a problem: it does not promote a method that has a critical loop so that it can be compiled earlier than a typical method. This issue has been resolved by allowing loop iterations to increment the hotness of a method. Therefore, if a method that has a lengthy loop is called, the VM may decide to compile the method to native code even if the method is called only once.
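For example, in the following sketch (a hypothetical method written for illustration), each pass through the loop also counts toward the method's hotness, so the VM may compile it during its very first invocation:

static long sum(int n) {
    long total = 0;
    // Each loop iteration (back-edge) increments the method's hotness,
    // so this method can become hot, and be compiled to native code,
    // while still inside its first and only call.
    for (int i = 0; i < n; i++) {
        total += i;
    }
    return total;
}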

Do you think that it is okay for the VM to compile a hot method but wait until the method is invoked again before replacing it with the compiled version? The first version of HotSpot simply waited for the next invocation of the method before it could start using the compiled version. Note that if the method has a loop that runs indefinitely, the bytecodes may never be replaced by the compiled code. This issue can cause unexpected results when developers write micro-benchmarks to time specific operations. Benchmarking is discussed in a later section of this chapter.
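The following sketch (the class NaiveBenchmark is invented for this example) shows how the behavior just described can skew a naive micro-benchmark: main is entered exactly once, so even after the loop makes the method hot, the interpreted frame is never swapped for compiled code, and the measured time mostly reflects interpretation:

public class NaiveBenchmark {
    public static void main(String args[]) {
        long start = System.currentTimeMillis();
        long total = 0;
        // On a VM that only swaps in compiled code at the next invocation,
        // this loop runs interpreted from start to finish.
        for (int i = 0; i < 100000000; i++) {
            total += i;
        }
        long elapsed = System.currentTimeMillis() - start;
        System.out.println("total=" + total + ", elapsed=" + elapsed + "ms");
    }
}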

Newer versions of HotSpot perform on-stack replacement (OSR). OSR allows the VM to switch from interpreted bytecodes to native code in the middle of executing a method. This is by no means a trivial task, because a custom version of the compiled code must be used so that the execution state of the native code is equivalent to that of the interpreted code. Therefore, certain instructions must be added to, for example, load values already computed by the interpreted code into the registers of the machine.

Because the VM can collect information about the runtime behavior of an application, it has a great advantage over static compilers. First, it can apply traditional optimizations in a manner that typical optimizing static compilers cannot. Because the VM can locate the hot spots of the application, it can perform aggressive optimizations that are not worth applying to the entire code of an application. Note that many optimizations can result in lengthier code. For example, inlining and loop unrolling can cause a method to become much longer than its original version, which means that the code will take up more memory during runtime. HotSpot can avoid bloating the entire code by focusing on the more important parts. Second, HotSpot can perform optimizations that traditional static compilers cannot. For example, devirtualization is an optimization that cannot always be resolved at compile time. In addition, because the VM has the values of variables available at runtime, it can perform additional dead-code elimination that is not possible at compile time.

Devirtualization is one of the fundamental optimizations for object-oriented languages such as C++ and Java. Optimizations that allow virtual calls to be replaced by direct (or statically bound) calls are known as devirtualization. Let's take the time to go into the details of the process. Virtual methods are the basis of object-oriented languages, the idea being that the functionality of a method is chosen based on the actual type of the object. In Java, all nonstatic, nonprivate, and nonfinal methods are virtual. In other words, unlike in C++, methods are virtual by default. Note that even iterators and read methods are typically virtual. The following code segment shows the method print, which is defined in both ClassA and ClassB. Note that ClassB extends ClassA.

class ClassA {
    public void print() {
        System.out.println("ClassA.print()");
    }
}

class ClassB extends ClassA {
    public void print() {
        System.out.println("ClassB.print()");
    }
}

The following code segment should print out ClassA.print() and then ClassB.print(), despite the fact that both objects are referred to through references of type ClassA.

class Sample {
    public static void main(String args[]) {
        ClassA object1 = new ClassA();
        ClassA object2 = new ClassB();
        object1.print();
        object2.print();
    }
}

Because at compile time the actual type of a reference is not known (that is, whether a reference of type ClassA refers to an instance of ClassA or ClassB), some runtime check must determine which method should be called. Even though this check is not extremely expensive, the bad news is that if the compiler cannot tell which method needs to be called, it will not be able to inline it. Inlining is one of the most important optimization techniques and cannot be disregarded. Why is inlining so important? The main reasons are that method invocations are in general not cheap, and that when many small methods are merged into one, much better global optimizations can be performed.
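As a hypothetical sketch of why this matters, consider a small accessor method (the Point class below is invented for this example). If the VM can prove the actual type of the receiver, the virtual call can be devirtualized and then inlined down to a plain field read:

class Point {
    private int x;

    Point(int x) {
        this.x = x;
    }

    // A small virtual method: once devirtualized, it is an ideal inlining
    // candidate, collapsing an entire method invocation into a field read.
    int getX() {
        return x;
    }
}

// After devirtualization and inlining, a loop such as
//     for (int i = 0; i < points.length; i++) sum += points[i].getX();
// can be compiled as if it accessed points[i].x directly, and the merged
// code is then exposed to further global optimizations.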

You might have noticed that, technically, a compiler can figure out the actual type of the object in the code segment shown earlier. That is correct. A compiler can figure out that object2 points to an instance of ClassB, even though it is declared as a reference of type ClassA. By tracing back to where the object is created, the compiler can determine its real type. Even though this is possible, the segment provided earlier is one of the most straightforward scenarios. If a method, for example, receives a parameter of type Object, the compiler would have quite a challenge trying to trace back to figure out its real type. Besides, even the most sophisticated compilers cannot resolve the following:

ClassA object3;
String name = "ClassB";
try {
    object3 = (ClassA)Class.forName(name).newInstance();
} catch (Exception e) {
}

As a side note, if you look at the bytecodes that correspond to invoking a final method, you might be surprised: the bytecodes for invoking a virtual method and a final (nonvirtual) method are identical. Only constructors, private methods, methods invoked with the super keyword, and static methods are statically bound. This is because the Java language specification explicitly notes that removing the final keyword from a method should not break compatibility with existing binaries. That is, if ClassB invokes methodA in ClassA, and methodA is later changed so that it is no longer final, there should be no need to recompile ClassB. This rule, which is designed to promote binary compatibility, indirectly implies that final methods should not be resolved at compile time. This makes devirtualization during runtime an even more important optimization.
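You can confirm this by disassembling a class with the javap -c tool that ships with the JDK. Roughly, for the VMs of this era, the invocation opcodes break down as follows (the method names in this sketch are invented):

// object.someMethod()   -> invokevirtual  (virtual dispatch; the same opcode
//                          is emitted even if someMethod is declared final)
// this.somePrivate()    -> invokespecial  (statically bound)
// super.print()         -> invokespecial  (statically bound)
// new ClassA()          -> invokespecial  (constructor, statically bound)
// ClassA.someStatic()   -> invokestatic   (statically bound)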

Dead-code elimination is another optimization that static compilers cannot take beyond basic measures. Even though removing dead code does not sound like an important optimization, and basic forms of it are even required of Java compilers by the language specification, it can have great rewards when performed at runtime. Consider the case where a method is called and the value of a local variable is passed as one of its parameters. As far as that particular invocation is concerned, the passed-in parameter can be considered a constant. In other words, if the method is inlined, significant chunks of code as well as checks can potentially be removed. This can make the inlining of nontrivial methods a bit more appealing.
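For instance, in the following sketch (the methods draw and render are invented for illustration), once draw is inlined at a call site where its argument is known to be false, the debug branch becomes dead and can be removed entirely:

static void draw(boolean debug) {
    if (debug) {
        System.out.println("drawing...");  // dead once debug is known false
    }
    // ... the actual drawing work ...
}

static void render() {
    boolean debug = false;  // effectively a constant for this invocation
    draw(debug);            // after inlining, the entire if-block vanishes
}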

It is important to mention that some runtime optimizations can be performed only on the basis of certain assumptions. For example, devirtualization may assume that no other types are dynamically loaded, and dead-code elimination may assume that the value of certain variables will not change. Such optimistic or speculative assumptions can obviously prove to be wrong at a later point during runtime. If every optimization keeps track of its assumptions, the optimizations can be undone when an assumption no longer holds. This process is known as deoptimization, which requires a form of OSR: deoptimization is OSR performed in reverse, where native code is replaced with bytecodes.
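Reusing the classes from the earlier example, a sketch of such a speculative assumption might look like this:

// While ClassB has not been loaded, every reference of type ClassA must
// point to an actual ClassA instance, so the VM may devirtualize and
// inline print() at this call site.
ClassA shape = new ClassA();
shape.print();

// Loading a subclass later invalidates the "no overriding subclass exists"
// assumption; any code compiled under it must be deoptimized.
try {
    Class.forName("ClassB");
} catch (Exception e) {
}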

Some of the other optimizations include array-bounds-check elimination and null-check elimination. Note that the more extensive the optimizations performed by a compiler, the more CPU cycles it is likely to consume generating the optimized code. Optimizations that require little CPU time can be performed immediately; others can be performed in the background, with the code being swapped in when compilation completes.
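As a sketch of bounds-check elimination (the method total is invented for illustration), consider a loop whose index is provably within bounds on every iteration:

static int total(int[] data) {
    int sum = 0;
    for (int i = 0; i < data.length; i++) {
        // The VM can prove 0 <= i < data.length here, so the per-access
        // array-bounds check can be safely eliminated.
        sum += data[i];
    }
    return sum;
}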

It is also important to note that the Sun VM comes in two variations. The Client VM, which is the default, has a faster startup time and requires less memory, in part because it does not perform the more complex optimizations. The Server VM, on the other hand, assumes that the host machine has a reasonable amount of resources and that the application will run long enough to make the more complex optimizations worth performing.
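The variation is chosen when the VM is launched. Assuming the Server VM is installed alongside the default Client VM, the selection is made with a command-line flag, for example:

java -client Sample     (the default: fast startup, lighter optimizations)
java -server Sample     (long-running applications: heavier optimizations)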


