3.8. Java ProcessesThe following sections should shed some light on what your Java applications are doing. Topics such as profiling and tracing are discussed. 3.8.1. Process Stack on a Java Virtual Machine: pstackYou can use the C++ stack unmangler with Java virtual machine (JVM) targets to show the stacks for Java applications. The c++filt utility is provided with the Sun Workshop compiler tools. $ pstack 27494 |c++filt 27494: /usr/bin/java -client -verbose:gc -Xbatch -Xss256k -XX:+AggressiveHeap ----------------- lwp# 1 / thread# 1 -------------------- ff3409b4 pollsys (0, 0, ffbfe858, 0) ff2dcec8 poll (0, 0, 1d4c0, 10624c00, 0, 0) + 7c fed316d4 int os_sleep(long long,int) (0, 1d4c0, 1, ff3, 372c0, 0) + 148 fed2f6e4 int os::sleep(Thread*,long long,int) (372c0, 0, 1d4c0, 7, 4, ff14f934) + 284 fedc21e0 JVM_Sleep (2, ff14dd24, 0, 1d4c0, ff1470dc, 372c0) + 260 f8c0bc20 * java/lang/Thread.sleep(J)V+0 f8c0bbc4 * java/lang/Thread.sleep(J)V+0 f8c05764 * spec/jbb/JBButil.SecondsToSleep(J)V+11 (line 740) f8c05764 * spec/jbb/Company.displayResultTotals(ZZ)V+235 (line 651) f8c05764 * spec/jbb/JBBmain.DoARun(Lspec/jbb/Company;SSII)V+197 (line 277) f8c05764 * spec/jbb/JBBmain.DOIT(Lspec/jbb/infra/Factory/Container;)V+186 (line 732) f8c05764 * spec/jbb/JBBmain.main([Ljava/lang/String;)V+1220 (line 1019) f8c00218 * StubRoutines (1) fecd9f00 void JavaCalls::call_helper(JavaValue*,methodHandle*,JavaCallArgu- ments*,Thread*) (1, 372c0, ffbff018, ffbfef50, ffbff01c, 0) + 5b8 fedb8e84 jni_CallStaticVoidMethod (ff14dd24, ff1470dc, 3788c, 372c0, 0, 37488) + 514 000123b4 main (ff14a040, 576d1a, fed2a6d0, 2, 2, 1d8) + 1314 00011088 _start (0, 0, 0, 0, 0, 0) + 108 3.8.2. JVM ProfilingWhile the JVM has long included the -Xrunhprof profiling flag, the Java 2 Platform, Standard Edition (J2SE) 5.0 and later use the JVMTI for heap and CPU profiling. Usage information is obtained with the java -Xrunhprof command. This profiling flag includes a variety of options and returns a lot of data. As a result, using a large number of options can significantly impact application performance. To observe locks, use the command in the following example. Note that setting monitor=y specifies that locks should be observed. Setting msa=y turns on Solaris microstate accounting (see Section 3.2.2, and Section 2.10.3 in Solaris™ Internals), and depth=8 sets the depth of the stack displayed. # java -Xrunhprof:cpu=times,monitor=y,msa=y,depth=8,file=path_to_result_file app_name MONITOR DUMP BEGIN\ THREAD 200000, trace 302389, status: CW\ THREAD 200001, trace 300000, status: R\ THREAD 201044, trace 302505, status: R\ ..... MONITOR Ljava/lang/StringBuffer;\ owner: thread 200058, entry count: 1\ waiting to enter:\ waiting to be notified:\ MONITOR DUMP END\ MONITOR TIME BEGIN (total = 2442 ms) Sat Nov 5 11:51:04 2005\ rank self accum count trace monitor\ 1 64.51% 64.51% 364 302089 java.lang.Class (Java)\ 2 20.99% 85.50% 294 302094 java.lang.Class (Java)\ 3 9.94% 95.44% 128 302027 sun.misc.Launcher$AppClassLoader (Java)\ 4 4.17% 99.61% 164 302122 sun.misc.Launcher$AppClassLoader (Java)\ 5 0.30% 99.90% 46 302158 sun.misc.Launcher$AppClassLoader (Java)\ 6 0.05% 99.95% 14 302163 sun.misc.Launcher$AppClassLoader (Java)\ 7 0.03% 99.98% 10 302202 sun.misc.Launcher$AppClassLoader (Java)\ 8 0.02% 100.00% 4 302311 sun.misc.Launcher$AppClassLoader (Java)\ MONITOR TIME END\ This command returns verbose data, including all the call stacks in the Java process. Note two sections at the bottom of the output: the MONITOR DUMP and MONITOR TIME sections. The MONITOR DUMP section is a complete snapshot of all the monitors and threads in the system. MONITOR TIME is a profile of monitor contention obtained by measuring the time spent by a thread waiting to enter a monitor. Entries in this record are ranked by the percentage of total monitor contention time and a brief description of the monitor. In previous versions of the JVM, one option is to dump all the stacks on the running VM by sending a SIGQUIT (signal number 3) to the Java process with the kill command. This dumps the stacks for all VM threads to the standard error as shown below. # kill -3 <pid> Full thread dump Java HotSpot(TM) Client VM (1.4.1_06-b01 mixed mode): "Signal Dispatcher" daemon prio=10 tid=0xba6a8 nid=0x7 waiting on condition [0..0] "Finalizer" daemon prio=8 tid=0xb48b8 nid=0x4 in Object.wait() [f2b7f000..f2b7fc24] at java.lang.Object.wait(Native Method) - waiting on <f2c00490> (a java.lang.ref.ReferenceQueue$Lock) at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:111) - locked <f2c00490> (a java.lang.ref.ReferenceQueue$Lock) at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:127) at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159) "Reference Handler" daemon prio=10 tid=0xb2f88 nid=0x3 in Object.wait() [facff000..facffc24] at java.lang.Object.wait(Native Method) - waiting on <f2c00380> (a java.lang.ref.Reference$Lock) at java.lang.Object.wait(Object.java:426) at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:113) - locked <f2c00380> (a java.lang.ref.Reference$Lock) "main" prio=5 tid=0x2c240 nid=0x1 runnable [ffbfe000..ffbfe5fc] at testMain.doit2(testMain.java:12) at testMain.main(testMain.java:64) "VM Thread" prio=5 tid=0xb1b30 nid=0x2 runnable "VM Periodic Task Thread" prio=10 tid=0xb9408 nid=0x5 runnable "Suspend Checker Thread" prio=10 tid=0xb9d58 nid=0x6 runnable If the top of the stack for a number of threads terminates in a monitor call, this is the place to drill down and determine what resource is being contended. Sometimes removing a lock that protects a hot structure can require many architectural changes that are not possible. The lock might even be in a third-party library over which you have no control. In such cases, multiple instances of the application are probably the best way to achieve scaling. 3.8.3. Tuning Java Garbage CollectionTuning garbage collection (GC) is one of the most important performance tasks for Java applications. To achieve acceptable response times, you will often have to tune GC. Doing that requires you to know the following:
To obtain this data, add the -verbosegc, -XX:+PrintGCTimeStamps, and -XX:+PrintGCDetails flags to the regular JVM command line. 1953.954: [GC [PSYoungGen: 1413632K->37248K(1776640K)] 2782033K->1440033K(3316736K), 0.3666410 secs] 2018.424: [GC [PSYoungGen: 1477376K->37584K(1760640K)] 2880161K->1473633K(3300736K), 0.3825016 secs] 2018.806: [Full GC [PSYoungGen: 37584K->0K(1760640K)] [ParOldGen: 1436049K- >449978K(1540096K)] 147363 3K->449978K(3300736K) [PSPermGen: 4634K->4631K(16384K)], 5.3205801 secs] 2085.554: [GC [PSYoungGen: 1440128K->39968K(1808384K)] 1890106K->489946K(3348480K), 0.2442195 secs] The preceding example indicates that at 2018 seconds a Young Generation GC cleaned 3.3 Gbytes and took .38 seconds to complete. This was quickly followed by a Full GC that took 5.3 seconds to complete. On systems with many CPUs (or hardware threads), the increased throughput often generates significantly more garbage in the VM, and previous GC tuning may no longer be valid. Sometimes Full GC is generated where previously only Young Generation existed. Dump the GC details to a log file to confirm. Avoid full GC whenever you can because it severely affects response time. Full GC is usually an indication that the Java heap is too small. Increase the heap size by using the -Xmx and -Xms options until Full GCs are no longer triggered. It is best to preallocate the heap by setting -Xmx and -Xms to the same value. For example, to set the Java heap to 3.5 Gbytes, add the -Xmx3550m, -Xms3550m, -Xmn2g, and -Xss128k options. The J2SE 1.5.0_06 release also introduced parallelism into the old GCs. Add the -XX:+UseParallelOldGC option to the standard JVM flags to enable this feature. For Young Generation the number of parallel GC threads is the number of CPUs presented by the Solaris OS. On UltraSPARC T1 processor-based systems this equates to the number of threads. It may be necessary to scale back the number of threads involved in Young Generation GC to achieve response time constraints. To reduce the number of threads, you can set XX:ParallelGCThreads=number_of_threads. A good starting point is to set the GC threads to the number of cores on the system. Putting it all together yields the following flags. -Xmx3550m -Xms3550m -Xmn2g -Xss128k -XX:+UseParallelOldGC -XX:+UseParallelGC -XX:Paral- lelGCThreads=8 -XX:+PrintGCDetails -XX:+PrintGCTimestamps Older versions of the Java virtual machine, such as 1.3, do not have parallel GC. This can be an issue on CMT processors because GC can stall the entire VM. Parallel GC is available from 1.4.2 onward, so this is a good starting point for Java applications on multiprocessor-based systems. 3.8.4. Using DTrace on Java ApplicationsThe J2SE 6 (code-named Mustang) release introduces DTrace support within the Java HotSpot virtual machine. The providers and probes included in the Mustang release make it possible for DTrace to collect performance data for applications written in the Java programming language. The Mustang release contains two built-in DTrace providers: hotspot and hotspot_jni. All probes published by these providers are user-level statically defined tracing (USDT) probes, accessed by the PID of the Java HotSpot virtual machine process. The hotspot provider contains probes related to the following Java HotSpot virtual machine subsystems:
All hotspot probes originate in the VM library (libjvm.so), and as such, are also provided from programs that embed the VM. The hotspot_jni provider contains probes related to the Java Native Interface (JNI), located at the entry and return points of all JNI methods. In addition, the DTrace jstack() action prints mixed-mode stack traces including both Java method and native function names. As an example, the following D script (usestack.d) uses the DTrace jstack() action to print the stack trace. #!/usr/sbin/dtrace -s BEGIN { this->cnt = 0; } syscall::pollsys:entry /pid == $1 && tid == 1/ { this->cnt++; printf("\n\tTID: %d", tid); jstack(50); } syscall:::entry /this->cnt == 1/ { exit(0); } And the stack trace itself appears as follows. # ./usejstack.d 1344 | c++filt CPU ID FUNCTION:NAME 0 316 pollsys:entry TID: 1 libc.so.1`__pollsys+0xa libc.so.1`poll+0x52 libjvm.so`int os_sleep(long long,int)+0xb4 libjvm.so`int os::sleep(Thread*,long long,int)+0x1ce libjvm.so`JVM_Sleep+0x1bc java/lang/Thread.sleep dtest.method3 dtest.method2 dtest.method1 dtest.main StubRoutines (1) libjvm.so`void JavaCalls::call_helper(JavaValue*,methodHandle*,JavaCallArgu- ments*,Thread*)+0x1b5 libjvm.so`void os::os_exception_wrapper(void(*)(JavaValue*,methodHandle*,JavaCallAr- guments*,Thread*),JavaValue*,methodHandle*,Ja vaCallArguments*,Thread*)+0x18 libjvm.so`void JavaCalls::call(JavaValue*,methodHandle,JavaCallArgu- ments*,Thread*)+0x2d libjvm.so`void jni_invoke_static(JNIEnv_*,JavaValue*,_jobject*,JNICallType,_ jmethodID*,JNI_ArgumentPush er*,Thread*)+0x214 libjvm.so`jni_CallStaticVoidMethod+0x244 java`main+0x642 StubRoutines (1) The command line shows that the output from this script was piped to the c++filt utility, which demangles C++ mangled names, making the output easier to read. The DTrace header output shows that the CPU number is 0, the probe number is 316, the thread ID (TID) is 1, and the probe name is pollsys:entry, where pollsys is the name of the system call. The stack trace frames appear from top to bottom in the following order: two system call frames, three VM frames, five Java method frames, and VMframes in the remainder. For further information on using DTrace with Java applications, see Section 10.3. |