Performance Inspector



The Performance Inspector (PI) is a suite of tools that identifies performance problems and characteristics. PI is distributed as a kernel patch plus the source for the device driver and tools, along with procedures that automate patching the kernel and building and installing the tools. Support is provided for Intel 32- and 64-bit platforms, PowerPC 32- and 64-bit systems, S/390 32- and 64-bit systems, and the AMD Hammer processor.

The tools included in all versions of Performance Inspector include the following:

  • Above Idle. Shows how idle, process, and interrupt times are distributed over the processor(s) on the running system. Above Idle is a phase 1 tool that identifies hot spots with respect to processor time and interrupt time spent in the system. This tool can also be used on SMP systems to see how well the multiple active processes are being spread over multiple CPUs.

  • Per-Thread Time. Hooks into the process dispatch and interrupt code to maintain information on the total amount of time spent within a process, the time spent handling interrupts, and the amount of idle time. Summary information is provided on a per-CPU basis and for the system as a whole. APIs are provided to allow a developer to include calls within an application to measure functions and code snippets.

  • System Trace Data Collection. Trace hooks are added to the kernel to collect data on what is happening within the system. Hooks are provided for process dispatches, process timeslices, interrupt entry and exits, process forks and clones, process startup execution, system timer interrupts, and memory mapping of code segments. It is also possible for an application to write its own trace records.

    Command files, libraries, and post processors are included to allow automatic tracing functions and reports. The most significant of these functions allow the measurement and reporting of jitted methods within Java applications. This function is very useful in identifying what methods are running and how much time they are using.

  • JPROF. Performs Java execution profiling, including the capability to obtain detailed information on jitted methods. This tool requires a version of the Java SDK that includes JVMPI support.

  • Java Lock Monitor (JLM). Reports hold-time accounting and contention statistics for Java applications. The IBM Java 1.4 SDK is required to use this tool.

In addition to these common measurement functions, the following functionality is provided on Intel 32-bit systems:

  • Performance counter support. Supports starting and stopping the counters, reporting counter contents, and displaying counter settings. The Trace Data Collection and Per-Thread Time functions can base their measurements on the performance counters instead of the system clock. Instead of seeing how much time is spent within a process, you see how many instructions are executed or how many branches or jumps are performed.

  • Instruction tracing. Instruction tracing records all the branches taken while the measurement is active. Branches include calls, jumps, and any other execution path change. A post-processor tool is provided to report an instruction trace, including the number of instructions executed and where in the code the execution occurs. This reporting can also be done for Java jitted methods.

  • Dynamic kernel patch. For certain releases of Red Hat and SUSE distributions, a version of the tools is provided that does not require a patch to the Linux kernel. Instead, the device driver dynamically patches the required hooks into the running system and runs the Performance Inspector tool suite.

The Performance Inspector installation requires you to apply a kernel patch and rebuild the kernel. After that step is complete, the various tools are built. Because some of the tools are sensitive to the version of Java that is being used, the tool build must be redone whenever a new version of Java is adopted. This requirement does not affect the kernel, however; the kernel needs to be patched and built only once.

An exception to the installation procedure is provided for the Intel 32-bit support. A special version of the Performance Inspector is provided that does not require the kernel to be patched as long as the kernel being used is the default kernel shipped with supported versions of Red Hat and SUSE. The kernel source needs to be installed and the tools built as before, but the kernel does not have to be rebuilt. The PI driver dynamically patches the kernel at the appropriate places. The kernel patches are removed when the driver is uninstalled.

When none of the PI performance probes in the kernel are active, there is minimal impact on system performance. At most, a compare and short branch instruction are added to the code path where the probes are located. For the dynamic version of the IA32 PI, no extra overhead is added when the driver is not loaded.

When performing a trace, there is usually a 2% to 3% overhead on the system. Instruction tracing adds considerably more overhead, but it is intended for use in a debug environment. Normal tracing can be done in a working production environment without any major impact on performance or throughput.

Most of the PI functions can be controlled from APIs issued from the user's program as well as from the PI's own command files. When using the API interface, either from C code or Java, you can fine-tune when, where, and what information is collected. All of the source is provided and can be used as a coding sample.

Refer to http://www-124.ibm.com/developerworks/oss/pi/index.html for more information and to obtain a copy of the Performance Inspector.

The remainder of this chapter examines in detail the various features of the Performance Inspector.

Above Idle

Above Idle works by hooking into the process timeslice logic and the interrupt handler within the kernel. It keeps track of the amount of time each processor spends busy, idle, or handling interrupts, making it useful for identifying processor overload, high interrupt activity, and poor distribution of work within a multiprocessor system. When active, Above Idle gathers information over a user-specified measurement period; the default is 1 second. For each period, Above Idle determines the amount of time the system spent idle, active, and processing interrupts. The percentages of idle, active, and interrupt processing time are then calculated and displayed, per processor when running on a multiprocessor system. Parameters define how many measurements to make and the time interval between measurements.

To start the Above Idle measurement, enter swtrace ai, which uses the defaults to report system usage every second until the measurement is manually stopped.

The following example shows a sample Above Idle output.

 # swtrace ai
 Above Idle instrumentation enabled
   CPU 0  IDLE= 48.83, BUSY= 49.74, INTR=  1.44
   CPU 1  IDLE=  0.00, BUSY= 99.81, INTR=  0.19
   CPU 2  IDLE= 96.10, BUSY=  3.70, INTR=  0.20
   CPU 3  IDLE= 22.10, BUSY= 77.71, INTR=  0.19
   CPU 4  IDLE= 16.50, BUSY= 83.30, INTR=  0.20
   CPU 5  IDLE= 16.66, BUSY= 83.10, INTR=  0.25
   CPU 6  IDLE= 16.83, BUSY= 82.96, INTR=  0.21
   CPU 7  IDLE= 60.72, BUSY= 39.08, INTR=  0.20
   CPU 0  IDLE= 98.54, BUSY=  0.00, INTR=  1.46
   CPU 1  IDLE= 86.35, BUSY= 13.48, INTR=  0.17
   CPU 2  IDLE= 99.78, BUSY=  0.00, INTR=  0.22
   CPU 3  IDLE= 99.77, BUSY=  0.00, INTR=  0.23
   CPU 4  IDLE= 18.91, BUSY= 80.87, INTR=  0.22
   CPU 5  IDLE= 10.75, BUSY= 89.05, INTR=  0.20
   CPU 6  IDLE= 99.77, BUSY=  0.00, INTR=  0.23
   CPU 7  IDLE= 99.73, BUSY=  0.00, INTR=  0.27
   CPU 0  IDLE= 98.59, BUSY=  0.00, INTR=  1.41
   CPU 1  IDLE= 99.80, BUSY=  0.00, INTR=  0.20
   CPU 2  IDLE= 99.81, BUSY=  0.00, INTR=  0.19
   CPU 3  IDLE= 99.79, BUSY=  0.00, INTR=  0.21
   CPU 4  IDLE=  1.19, BUSY= 98.65, INTR=  0.16
   CPU 5  IDLE=  0.00, BUSY= 99.77, INTR=  0.23
   CPU 6  IDLE= 99.78, BUSY=  0.00, INTR=  0.22
   CPU 7  IDLE= 99.85, BUSY=  0.00, INTR=  0.15

The example was run on an eight-way SMP machine and shows data collected over 3 seconds.

Per-Thread Time

When active, per-thread time (PTT) accumulates the amount of time spent by each process within the system, as well as idle time and interrupt time. If you are running on an Intel IA-32 system, you can use a performance event counter instead of the real-time clock. For example, a measurement can be made that identifies the number of instructions executed by each process, the idle process, and the interrupts. The totals for process time, IRQ time, and idle time are provided for each processor in the system, as well as a system total of all processors.

PTT is activated via an API call, PTTInit, or an external program called ptt. In both cases, you specify whether the measurement medium is the system timer or a performance counter. After PTT is activated, you can run a program called pttstats to display the current values for idle, process, and interrupt times; pttstats can be run any number of times. A terminate command stops the measurement. The last counts before the terminate command was issued remain stored in the buffers until another init command is received. The following shows a sample PTT report:

 Tools version 2040024
 PTT_STATISTICS_SIZE = 263096
 IBM_NR_CPUS = 16

 Per Thread Time Summary
 Total Dispatches:       3763
 Total IRQs:             334
 Total Thread Time:      151469140
 Total IRQ Time:         721564
 Total Idle Time:        2158895609
 Total tool time:        81755
 Total Measurement:      2311086313
 Total Period:           0

 Per Thread Time Detail
 Details by CPU
                IRQ      Dispatch
   CPU          Count    Count        Times
 ----------------------------------------------------
     1            334      3763
 Thread Time:                          151469140
 IRQ Time:                             721564
 Idle Time:                            2158895609
 Tool Time:                            81755
 Total Time:                           5311168068

 Details by Process
             Dispatch    IRQ         Process
    PID      Count       Count       Time
 --------------------------------------------------
      0        708         303       2158895609
      1          5           0            34407
      2         83           0           139154
      3          1           0             7162
      7         15           0           168819
     11          5           0             3466
    324          1           0             6799
    409         11           0            30641
    671       1281          24        113156655
    722          3           0            30128
    723          2           0             8944
    727         12           0            35688
    728          2           0            11927
    834         23           0            62077
    869         11           0           170978
    963         98           3         15182854
    971        148           1           971766
    992        137           1          1821399
    995         48           0          2676763
    999        393           0          3993366
  11580          3           0           135687
  11581         53           0          1156413
  12281        544           1          6954629
  12282         36           0          1006592
  12297        140           1          3702826
 Total Process Time:             2310364749
 End of reports

APIs are provided to allow applications to start this measurement and to obtain the total amount of time spent in the process being measured. APIs allow an application to be instrumented to measure the amount of time spent between specific operations or positions within the application. On Intel 32-bit applications, measurements such as instructions retired can be used as a performance counter measurement.

Trace Profiling

Trace profiling works by hooking the kernel timer interrupt. When the timer interrupt is processed, the profiling code creates a trace record that contains the address where the processor was executing when the timer interrupt occurred. This trace record is then written into a trace buffer. When the profiling is complete, these trace records can be dumped out to a file where a post-processing program produces various reports.

You can run a trace profile to identify hot spots within the system by identifying the applications that consume the most time and the functions within those applications that account for that time. This includes jitted methods within Java applications.

Use the run.tprof command to perform the trace and produce the default reports.

By looking at the run.tprof command file, you can see the specific steps you need to follow to take a profile trace and produce the reports.

The following example shows a sample trace report for a small C application.

 # run.tprof
 ProcessorSpeed 398000000
 TraceCycles 25382010763
 TraceTime 63.774(sec)
 TOTAL TICKS 25383 (Clipping Level : 0.1 % 25 Ticks)
 ================================
  TPROF Report Summary
  )) Process
  )) Process_Module
  )) Process_Module_Symbol
  )) Process_Thread
  )) Process_Thread_Module
  )) Process_Thread_Module_Symbol
  )) Module
  )) Module_Symbol
 ================================
 ================================
  )) Process
 ================================
 LAB   TKS   %%%  NAMES
 PID 19094 75.22  SystemProcess_0000
 PID  6284 24.76  /space/piperf/bin/WatchedPot_057c
 ================================
  )) Process_Module
 ================================
 LAB   TKS   %%%  NAMES
 PID 19094 75.22  SystemProcess_0000
  MOD 19094 75.22  vmlinux
 PID  6284 24.76  /space/piperf/bin/WatchedPot_057c
  MOD  6284 24.76  /space/piperf/bin/WatchedPot
 ================================
  )) Process_Module_Symbol
 ================================
 LAB   TKS   %%%  NAMES
 PID 19094 75.22  SystemProcess_0000
  MOD 19094 75.22  vmlinux
   SYM 19094 75.22  default_idle
 PID  6284 24.76  /space/piperf/bin/WatchedPot_057c
  MOD  6284 24.76  /space/piperf/bin/WatchedPot
   SYM  3154 12.43  ThirtyTwo
   SYM  1615  6.36  Sixteen
   SYM   808  3.18  Eight
   SYM   404  1.59  Four
   SYM   202  0.80  Two
   SYM   101  0.40  One
 ================================
  )) Process_Thread
 ================================
 LAB   TKS   %%%  NAMES
 PID 19094 75.22  SystemProcess_0000
  TID 19094 75.22  tid_0000
 PID  6284 24.76  /space/piperf/bin/WatchedPot_057c
  TID  6284 24.76  tid_057c
 ================================
  )) Process_Thread_Module
 ================================
 LAB   TKS   %%%  NAMES
 PID 19094 75.22  SystemProcess_0000
  TID 19094 75.22  tid_0000
   MOD 19094 75.22  vmlinux
 PID  6284 24.76  /space/piperf/bin/WatchedPot_057c
  TID  6284 24.76  tid_057c
   MOD  6284 24.76  /space/piperf/bin/WatchedPot
 ================================
  )) Process_Thread_Module_Symbol
 ================================
 LAB   TKS   %%%  NAMES
 PID 19094 75.22  SystemProcess_0000
  TID 19094 75.22  tid_0000
   MOD 19094 75.22  vmlinux
    SYM 19094 75.22  default_idle
 PID  6284 24.76  /space/piperf/bin/WatchedPot_057c
  TID  6284 24.76  tid_057c
   MOD  6284 24.76  /space/piperf/bin/WatchedPot
    SYM  3154 12.43  ThirtyTwo
    SYM  1615  6.36  Sixteen
    SYM   808  3.18  Eight
    SYM   404  1.59  Four
    SYM   202  0.80  Two
    SYM   101  0.40  One
 ================================
  )) Module
 ================================
 LAB   TKS   %%%  NAMES
 MOD 19096 75.23  vmlinux
 MOD  6284 24.76  /space/piperf/bin/WatchedPot
 ================================
  )) Module_Symbol
 ================================
 LAB   TKS   %%%  NAMES
 MOD 19096 75.23  vmlinux
  SYM 19094 75.22  default_idle
 MOD  6284 24.76  /space/piperf/bin/WatchedPot
  SYM  3154 12.43  ThirtyTwo
  SYM  1615  6.36  Sixteen
  SYM   808  3.18  Eight
  SYM   404  1.59  Four
  SYM   202  0.80  Two
  SYM   101  0.40  One

When profiling applications, the level of optimization used when the application is compiled can affect the trace output. With optimization on, functions that execute the same base code are all rolled up into the application instead of being reported separately by symbol name. Therefore, it is a good idea to make the first few profiling runs on applications that have not been compiled with optimization turned on.

Instruction Tracing

A special version of trace profiling that sets up the system to use the hardware's instruction trace capabilities is available on Intel IA32 systems. Instruction tracing arms the processor's branch trace support and then creates a trace record every time a code branch, jump, or call is performed. Instruction tracing provides a detailed trace of the execution path in the system without the need to have a debugger installed or code compiled with the debug option enabled.

Use instruction tracing only for very short, controlled measurement periods, because it can produce a significant number of tracing records. Instruction tracing is primarily useful for determining the exact execution path through the entire system for the function being executed.

A small command file, run.itrace, is provided to simplify the taking of an instruction trace profile.

The following example shows a small output of run.itrace.

 # run.itrace
 -----------------------------
 ss: 0x8049a5c 1 push ebp
 ss: 0x8049a5d 2 mov ebp, esp
 ss: 0x8049a5f 3 sub esp, 0x8
 ss: 0x8049a62 5 call 0x420628d8
 0 1 1 4 @ 0 func1:/piperf/bin/itracec 8a1_0_/piperf/bin/itracec_ CALL
 ss: 0x8049a6a 1 push ebp
 ss: 0x8049a6b 2 mov ebp, esp
 ss: 0x8049a6d 3 sub esp, 0x8
 ss: 0x8049a70 5 call 0x420628d8
 0 1 1 4 @ 0 func2:/piperf/bin/itracec 8a1_0_/piperf/bin/itracec_ CALL
 ss: 0x8049a78 1 push ebp
 ss: 0x8049a79 2 mov ebp, esp
 ss: 0x8049a7b 3 sub esp, 0x8
 ss: 0x8049a7e 5 call 0x420628d8
 0 1 1 4 @ 0 func3:/piperf/bin/itracec 8a1_0_/piperf/bin/itracec_ CALL
 ss: 0x8049a86 1 push ebp
 ss: 0x8049a87 2 mov ebp, esp
 ss: 0x8049a89 3 sub esp, 0x8
 ss: 0x8049a8c 5 call 0x420628d8
 0 1 1 4 @ 0 func4:/piperf/bin/itracec 8a1_0_/piperf/bin/itracec_ CALL
 ss: 0x8049a94 1 push ebp
 ss: 0x8049a95 2 mov ebp, esp
 ss: 0x8049a97 3 sub esp, 0x8
 ss: 0x8049a9a 5 call 0x420628d8
 0 1 1 4 @ 0 func5:/piperf/bin/itracec 8a1_0_/piperf/bin/itracec_ CALL
 ----------------------------------

This report shows the assembler instructions that were executed between the function entry and the branch instruction, giving a detailed view of what is going on in the system. To relate this back to the actual C code, you would have to compile the C code with the assembler listing option and calculate the offsets to the assembler instructions to find them in the listing.

If you add the -c option to the post-processing report generator, the output would look similar to the following.

 # run.itrace -c
 cpu ring instruction_count offset symbolname_modulename pid_tid_pidname To_instruction
 0 1 2 4 @ 32 printf:/lib/i686/libc-2.2.93.so 8a1_0_/piperf/bin/itracec_ RETURN
 0 1 1 5 @ 54 main:/piperf/bin/itracec 8a1_0_/piperf/bin/itracec_ CALL
 0 1 1 4 @ 0 func1:/piperf/bin/itracec 8a1_0_/piperf/bin/itracec_ CALL
 0 1 1 4 @ 0 func2:/piperf/bin/itracec 8a1_0_/piperf/bin/itracec_ CALL
 0 1 1 4 @ 0 func3:/piperf/bin/itracec 8a1_0_/piperf/bin/itracec_ CALL
 0 1 1 4 @ 0 func4:/piperf/bin/itracec 8a1_0_/piperf/bin/itracec_ CALL
 0 1 1 4 @ 0 func5:/piperf/bin/itracec 8a1_0_/piperf/bin/itracec_ CALL
 0 1 1 6 @ 0 func6:/piperf/bin/itracec 8a1_0_/piperf/bin/itracec_ CALL
 0 1 3 1 @ 50 <plt>:/piperf/bin/itracec 8a1_0_/piperf/bin/itracec_ JUMP
 0 1 1 6 @ 0 printf:/lib/i686/libc-2.2.93.so 8a1_0_/piperf/bin/itracec_ CALL
 0 1 2 2 @ 0 __i686.get_pc_thunk.bx:/lib/i686/libc-2.2.93.so 8a1_0_/piperf/bin/itracec_ RETURN
 0 1 1 8 @ 11 printf:/lib/i686/libc-2.2.93.so 8a1_0_/piperf/bin/itracec_ CALL
 0 1 1 7 @ 0 _IO_vfprintf_internal:/lib/i686/libc-2.2.93.so 8a1_0_/piperf/bin/itracec_ CALL
 0 1 2 2 @ 0 __i686.get_pc_thunk.bx:/lib/i686/libc-2.2.93.so 8a1_0_/piperf/bin/itracec_ RETURN
 0 1 1 6 @ d _IO_vfprintf_internal:/lib/i686/libc-2.2.93.so 8a1_0_/piperf/bin/itracec_ CALL
 0 1 3 1 @ 470 <plt>:/lib/i686/libc-2.2.93.so 8a1_0_/piperf/bin/itracec_ JUMP
 0 1 1 4 @ 0 __errno_location:/lib/i686/libc-2.2.93.so 8a1_0_/piperf/bin/itracec_ CALL
 0 1 2 2 @ 0 __i686.get_pc_thunk.bx:/lib/i686/libc-2.2.93.so 8a1_0_/piperf/bin/itracec_ RETURN
 0 1 2 5 @ 9 __errno_location:/lib/i686/libc-2.2.93.so 8a1_0_/piperf/bin/itracec_ RETURN
 ----------------------------------

Java Profiler

JPROF is a Java profiling agent that dynamically responds to JVMPI events based on options passed at Java invocation. The profiler is generally referred to as JPROF but is built as a shared library called libjprof.so. JPROF provides JIT address-to-name resolution to support tprof and instruction-trace data reduction. Other functions include the Java Lock Monitor and Java HeapDump. The profiler has been implemented for IBM JDK 1.2.2 and later and is based on the JDK's support for the JVMPI interface.

The following example shows a report generated by the trace profiler on a Java application.

 Tprof Reports
 ProcessorSpeed  688000000
 TraceCycles     10949059046
 TraceTime       15.914(sec)
 TOTAL TICKS     125160  (Clipping Level : 0.1 %  125 Ticks)
 ================================
  TPROF Report Summary
  )) Process
  )) Process_Module
  )) Process_Module_Symbol
  )) Process_Thread
  )) Process_Thread_Module
  )) Process_Thread_Module_Symbol
  )) Module
  )) Module_Symbol
 ================================
 ================================
  )) Process
 ================================
 LAB    TKS   %%%   NAMES
 PID  15645 12.50   SystemProcess_7
 PID  15645 12.50   SystemProcess_1
 PID  15645 12.50   SystemProcess_0
 PID  14388 11.50   SystemProcess_2
 PID  14074 11.24   SystemProcess_4
 PID  12806 10.23   SystemProcess_3
 PID  10935  8.74   SystemProcess_6
 PID  10462  8.36   SystemProcess_5
 PID   4710  3.76   java_6
 PID   4302  3.44   sshd_5
 PID   2823  2.26   sshd_3
 PID   1566  1.25   java_4
 PID   1257  1.00   java_2
 PID    881  0.70   java_5
 ================================
  )) Process_Module
 ================================
 LAB    TKS   %%%   NAMES
 PID  15645 12.50   SystemProcess_7
  MOD  15645 12.50   vmlinux
 PID  15645 12.50   SystemProcess_1
  MOD  15645 12.50   vmlinux
 PID  15645 12.50   SystemProcess_0
  MOD  15645 12.50   vmlinux
 PID  14388 11.50   SystemProcess_2
  MOD  14388 11.50   vmlinux
 PID  14074 11.24   SystemProcess_4
  MOD  14074 11.24   vmlinux
 PID  12806 10.23   SystemProcess_3
  MOD  12806 10.23   vmlinux
 PID  10935  8.74   SystemProcess_6
  MOD  10935  8.74   vmlinux
 PID  10462  8.36   SystemProcess_5
  MOD  10462  8.36   vmlinux
 PID   4710  3.76   java_6
  MOD   2109  1.69   vmlinux
  MOD   1767  1.41   JITCODE
  MOD    268  0.21   /opt/IBMJava2-141/jre/bin/libhpi.so
  MOD    243  0.19   /opt/IBMJava2-141/jre/bin/classic/libjvm.so
  MOD    175  0.14   /lib/tls/libpthread-0.29.so
 PID   4302  3.44   sshd_5
  MOD   1966  1.57   vmlinux
  MOD   1349  1.08   /lib/libcrypto.so.0.9.7a
  MOD    480  0.38   /usr/sbin/sshd
  MOD    462  0.37   /lib/tls/libc-2.3.2.so
 PID   2823  2.26   sshd_3
  MOD   1369  1.09   vmlinux
  MOD    844  0.67   /lib/libcrypto.so.0.9.7a
  MOD    302  0.24   /usr/sbin/sshd
  MOD    280  0.22   /lib/tls/libc-2.3.2.so
 PID   1566  1.25   java_4
  MOD    767  0.61   vmlinux
  MOD    532  0.43   JITCODE
 PID   1257  1.00   java_2
  MOD    601  0.48   vmlinux
  MOD    434  0.35   JITCODE
 PID    881  0.70   java_5
  MOD    444  0.35   vmlinux
  MOD    299  0.24   JITCODE
 ================================
  )) Process_Module_Symbol
 ================================
 LAB    TKS   %%%   NAMES
 PID  15645 12.50   SystemProcess_7
  MOD  15645 12.50   vmlinux
   SYM  15605 12.47   default_idle
 PID  15645 12.50   SystemProcess_1
  MOD  15645 12.50   vmlinux
   SYM  15606 12.47   default_idle
 PID  15645 12.50   SystemProcess_0
  MOD  15645 12.50   vmlinux
   SYM  15608 12.47   default_idle
 PID  14388 11.50   SystemProcess_2
  MOD  14388 11.50   vmlinux
   SYM  14341 11.46   default_idle
 PID  14074 11.24   SystemProcess_4
  MOD  14074 11.24   vmlinux
   SYM  14043 11.22   default_idle
 PID  12806 10.23   SystemProcess_3
  MOD  12806 10.23   vmlinux
   SYM  12675 10.13   default_idle
 PID  10935  8.74   SystemProcess_6
  MOD  10935  8.74   vmlinux
   SYM  10913  8.72   default_idle
 PID  10462  8.36   SystemProcess_5
  MOD  10462  8.36   vmlinux
   SYM  10387  8.30   default_idle
 PID   4710  3.76   java_6
  MOD   2109  1.69   vmlinux
   SYM    404  0.32   n_tty_receive_buf
   SYM    144  0.12   opost_block
   SYM    141  0.11   sigprocmask
   SYM    140  0.11   .text.lock.tty_io
   SYM    136  0.11   tty_write
   SYM    130  0.10   n_tty_receive_room
  MOD   1767  1.41   JITCODE
   SYM    621  0.50   hellop.main([Ljava/lang/String;)V
   SYM    143  0.11   java/io/OutputStreamWriter.write([CII)V
  MOD    268  0.21   /opt/IBMJava2-141/jre/bin/libhpi.so
  MOD    243  0.19   /opt/IBMJava2-141/jre/bin/classic/libjvm.so
  MOD    175  0.14   /lib/tls/libpthread-0.29.so
 PID   4302  3.44   sshd_5
  MOD   1966  1.57   vmlinux
   SYM    236  0.19   read_chan
   SYM    125  0.10   n_tty_chars_in_buffer
  MOD   1349  1.08   /lib/libcrypto.so.0.9.7a
   SYM   1345  1.07   NoSymbols
  MOD    480  0.38   /usr/sbin/sshd
   SYM    474  0.38   NoSymbols
  MOD    462  0.37   /lib/tls/libc-2.3.2.so
 PID   2823  2.26   sshd_3
  MOD   1369  1.09   vmlinux
   SYM    175  0.14   read_chan
  MOD    844  0.67   /lib/libcrypto.so.0.9.7a
   SYM    842  0.67   NoSymbols
  MOD    302  0.24   /usr/sbin/sshd
   SYM    298  0.24   NoSymbols
  MOD    280  0.22   /lib/tls/libc-2.3.2.so
 PID   1566  1.25   java_4
  MOD    767  0.61   vmlinux
   SYM    151  0.12   n_tty_receive_buf
  MOD    532  0.43   JITCODE
   SYM    161  0.13   hellop.main([Ljava/lang/String;)V
 PID   1257  1.00   java_2
  MOD    601  0.48   vmlinux
   SYM    132  0.11   n_tty_receive_buf
  MOD    434  0.35   JITCODE
   SYM    126  0.10   hellop.main([Ljava/lang/String;)V
 PID    881  0.70   java_
  MOD    444  0.35   vmlinux
  MOD    299  0.24   JITCODE
 ================================
  )) Process_Thread
 ================================
 LAB    TKS   %%%   NAMES
 PID  15645 12.50   SystemProcess_7
  TID  15645 12.50   tid_0000
 PID  15645 12.50   SystemProcess_1
  TID  15645 12.50   tid_0000
 PID  15645 12.50   SystemProcess_0
  TID  15645 12.50   tid_0000
 PID  14388 11.50   SystemProcess_2
  TID  14388 11.50   tid_0000
 PID  14074 11.24   SystemProcess_4
  TID  14074 11.24   tid_0000
 PID  12806 10.23   SystemProcess_3
  TID  12806 10.23   tid_0000
 PID  10935  8.74   SystemProcess_6
  TID  10935  8.74   tid_0000
 PID  10462  8.36   SystemProcess_5
  TID  10462  8.36   tid_0000
 PID   4710  3.76   java_6
  TID   4668  3.73   tid_main
 PID   4302  3.44   sshd_5
  TID   4298  3.43   tid_02c5
 PID   2823  2.26   sshd_3
  TID   2823  2.26   tid_02c5
 PID   1566  1.25   java_4
  TID   1527  1.22   tid_main
 PID   1257  1.00   java_2
  TID   1222  0.98   tid_main
 PID    881  0.70   java_5
  TID    832  0.66   tid_main
 ================================
  )) Process_Thread_Module
 ================================
 LAB    TKS   %%%   NAMES
 PID  15645 12.50   SystemProcess_7
  TID  15645 12.50   tid_0000
   MOD  15645 12.50   vmlinux
 PID  15645 12.50   SystemProcess_1
  TID  15645 12.50   tid_0000
   MOD  15645 12.50   vmlinux
 PID  15645 12.50   SystemProcess_0
  TID  15645 12.50   tid_0000
   MOD  15645 12.50   vmlinux
 PID  14388 11.50   SystemProcess_2
  TID  14388 11.50   tid_0000
   MOD  14388 11.50   vmlinux
 PID  14074 11.24   SystemProcess_4
  TID  14074 11.24   tid_0000
   MOD  14074 11.24   vmlinux
 PID  12806 10.23   SystemProcess_3
  TID  12806 10.23   tid_0000
   MOD  12806 10.23   vmlinux
 PID  10935  8.74   SystemProcess_6
  TID  10935  8.74   tid_0000
   MOD  10935  8.74   vmlinux
 PID  10462  8.36   SystemProcess_5
  TID  10462  8.36   tid_0000
   MOD  10462  8.36   vmlinux
 PID   4710  3.76   java_6
  TID   4668  3.73   tid_main
   MOD   2075  1.66   vmlinux
   MOD   1767  1.41   JITCODE
   MOD    268  0.21   /opt/IBMJava2-141/jre/bin/libhpi.so
   MOD    236  0.19   /opt/IBMJava2-141/jre/bin/classic/libjvm.so
   MOD    174  0.14   /lib/tls/libpthread-0.29.so
 PID   4302  3.44   sshd_5
  TID   4298  3.43   tid_02c5
   MOD   1965  1.57   vmlinux
   MOD   1349  1.08   /lib/libcrypto.so.0.9.7a
   MOD    478  0.38   /usr/sbin/sshd
   MOD    461  0.37   /lib/tls/libc-2.3.2.so
 PID   2823  2.26   sshd_3
  TID   2823  2.26   tid_02c5
   MOD   1369  1.09   vmlinux
   MOD    844  0.67   /lib/libcrypto.so.0.9.7a
   MOD    302  0.24   /usr/sbin/sshd
   MOD    280  0.22   /lib/tls/libc-2.3.2.so
 PID   1566  1.25   java_4
  TID   1527  1.22   tid_main
   MOD    734  0.59   vmlinux
   MOD    532  0.43   JITCODE
 PID   1257  1.00   java_2
  TID   1222  0.98   tid_main
   MOD    580  0.46   vmlinux
   MOD    434  0.35   JITCODE
 PID    881  0.70   java_5
  TID    832  0.66   tid_main
   MOD    410  0.33   vmlinux
   MOD    299  0.24   JITCODE
 ================================
  )) Process_Thread_Module_Symbol
 ================================
 LAB    TKS   %%%   NAMES
 PID  15645 12.50   SystemProcess_7
  TID  15645 12.50   tid_0000
   MOD  15645 12.50   vmlinux
    SYM  15605 12.47   default_idle
 PID  15645 12.50   SystemProcess_1
  TID  15645 12.50   tid_0000
   MOD  15645 12.50   vmlinux
    SYM  15606 12.47   default_idle
 PID  15645 12.50   SystemProcess_0
  TID  15645 12.50   tid_0000
   MOD  15645 12.50   vmlinux
    SYM  15608 12.47   default_idle
 PID  14388 11.50   SystemProcess_2
  TID  14388 11.50   tid_0000
   MOD  14388 11.50   vmlinux
    SYM  14341 11.46   default_idle
 PID  14074 11.24   SystemProcess_4
  TID  14074 11.24   tid_0000
   MOD  14074 11.24   vmlinux
    SYM  14043 11.22   default_idle
 PID  12806 10.23   SystemProcess_3
  TID  12806 10.23   tid_0000
   MOD  12806 10.23   vmlinux
    SYM  12675 10.13   default_idle
 PID  10935  8.74   SystemProcess_6
  TID  10935  8.74   tid_0000
   MOD  10935  8.74   vmlinux
    SYM  10913  8.72   default_idle
 PID  10462  8.36   SystemProcess_5
  TID  10462  8.36   tid_0000
   MOD  10462  8.36   vmlinux
    SYM  10387  8.30   default_idle
 PID   4710  3.76   java_6
  TID   4668  3.73   tid_main
   MOD   2075  1.66   vmlinux
    SYM    404  0.32   n_tty_receive_buf
    SYM    144  0.12   opost_block
    SYM    141  0.11   sigprocmask
    SYM    140  0.11   .text.lock.tty_io
    SYM    136  0.11   tty_write
    SYM    130  0.10   n_tty_receive_room
   MOD   1767  1.41   JITCODE
    SYM    621  0.50   hellop.main([Ljava/lang/String;)V
    SYM    143  0.11   java/io/OutputStreamWriter.write([CII)V
   MOD    268  0.21   /opt/IBMJava2-141/jre/bin/libhpi.so
   MOD    236  0.19   /opt/IBMJava2-141/jre/bin/classic/libjvm.so
   MOD    174  0.14   /lib/tls/libpthread-0.29.so
 PID   4302  3.44   sshd_5
  TID   4298  3.43   tid_02c5
   MOD   1965  1.57   vmlinux
    SYM    236  0.19   read_chan
    SYM    125  0.10   n_tty_chars_in_buffer
   MOD   1349  1.08   /lib/libcrypto.so.0.9.7a
    SYM   1345  1.07   NoSymbols
   MOD    478  0.38   /usr/sbin/sshd
    SYM    472  0.38   NoSymbols
   MOD    461  0.37   /lib/tls/libc-2.3.2.so
 PID   2823  2.26   sshd_3
  TID   2823  2.26   tid_02c5
   MOD   1369  1.09   vmlinux
    SYM    175  0.14   read_chan
   MOD    844  0.67   /lib/libcrypto.so.0.9.7a
    SYM    842  0.67   NoSymbols
   MOD    302  0.24   /usr/sbin/sshd
    SYM    298  0.24   NoSymbols
   MOD    280  0.22   /lib/tls/libc-2.3.2.so
 PID   1566  1.25   java_4
  TID   1527  1.22   tid_main
   MOD    734  0.59   vmlinux
    SYM    151  0.12   n_tty_receive_buf
   MOD    532  0.43   JITCODE
    SYM    161  0.13   hellop.main([Ljava/lang/String;)V
 PID   1257  1.00   java_2
  TID   1222  0.98   tid_main
   MOD    580  0.46   vmlinux
    SYM    132  0.11   n_tty_receive_buf
   MOD    434  0.35   JITCODE
    SYM    126  0.10   hellop.main([Ljava/lang/String;)V
 PID    881  0.70   java_5
  TID    832  0.66   tid_main
   MOD    410  0.33   vmlinux
   MOD    299  0.24   JITCODE
 ================================
  )) Module
 ================================
 LAB    TKS   %%%   NAMES
 MOD 116871 93.38   vmlinux
 MOD   3032  2.42   JITCODE
 MOD   2193  1.75   /lib/libcrypto.so.0.9.7a
 MOD    782  0.62   /usr/sbin/sshd
 MOD    743  0.59   /lib/tls/libc-2.3.2.so
 MOD    443  0.35   /opt/IBMJava2-141/jre/bin/libhpi.so
 MOD    441  0.35   /opt/IBMJava2-141/jre/bin/classic/libjvm.so
 MOD    311  0.25   /lib/tls/libpthread-0.29.so
 MOD    208  0.17   NoModule
 MOD    133  0.11   /opt/IBMJava2-141/jre/bin/libjava.so
 ================================
  )) Module_Symbol
 ================================
 LAB    TKS   %%%   NAMES
 MOD 116871 93.38   vmlinux
  SYM 109178 87.23   default_idle
  SYM    783  0.63   n_tty_receive_buf
  SYM    413  0.33   .text.lock.tty_io
  SYM    411  0.33   read_chan
  SYM    378  0.30   sigprocmask
  SYM    331  0.26   __wake_up
  SYM    310  0.25   sysenter_past_esp
  SYM    299  0.24   schedule
  SYM    271  0.22   tty_write
  SYM    256  0.20   opost_block
  SYM    246  0.20   add_wait_queue
  SYM    229  0.18   n_tty_receive_room
  SYM    229  0.18   n_tty_chars_in_buffer
  SYM    197  0.16   __copy_from_user_ll
  SYM    191  0.15   sys_rt_sigprocmask
  SYM    177  0.14   do_select
  SYM    174  0.14   __copy_to_user_ll
  SYM    170  0.14   remove_wait_queue
  SYM    154  0.12   write_chan
  SYM    153  0.12   pty_write
  SYM    139  0.11   sys_select
  SYM    125  0.10     kill_fasync    MOD   3032  2.42    JITCODE     SYM   1011  0.81     hellop.main([Ljava/lang/String;)V     SYM    254  0.20     java/io/OutputStreamWriter.write([CII)V     SYM    198  0.16     sun/nio/cs/StreamEncoder.flushBuffer()V     SYM    168  0.13     java/lang/String.<init>(Ljava/lang/String;I)V     SYM    168  0.13     java/io/PrintStream.write([BII)V     SYM    132  0.11     java/io/PrintStream.write(Ljava/lang/String;)V    MOD   2193  1.75    /lib/libcrypto.so.0.9.7a     SYM   2187  1.75     NoSymbols    MOD    782  0.62    /usr/sbin/sshd     SYM    772  0.62     NoSymbols    MOD    743  0.59    /lib/tls/libc-2.3.2.so     SYM    152  0.12     memset     SYM    125  0.10     __memchr    MOD    443  0.35    /opt/IBMJava2-141/jre/bin/libhpi.so    MOD    441  0.35    /opt/IBMJava2-141/jre/bin/classic/libjvm.so     SYM    141  0.11     jitCacheAlloc    MOD    311  0.25    /lib/tls/libpthread-0.29.so    MOD    208  0.17    NoModule     SYM    208  0.17     NoSymbols    MOD    133  0.11    /opt/IBMJava2-141/jre/bin/libjava.so  Offset Report    ================================     )) Module_Symbol_Offset    ================================    LAB    TKS   %%%     NAMES    MOD 116871 93.38    vmlinux     SYM 109178 87.23     default_idle      OFF 109174 87.23      0x2d     SYM    783  0.63     n_tty_receive_buf      OFF    440  0.35      0x157     SYM    413  0.33     .text.lock.tty_io      OFF    204  0.16      0xb1     SYM    411  0.33     read_chan     SYM    378  0.30     sigprocmask      OFF    321  0.26      0x77     SYM    331  0.26     __wake_up      OFF    326  0.26      0x37     SYM    310  0.25     sysenter_past_esp      OFF    127  0.10      0x3     SYM    299  0.24     schedule      OFF    277  0.22      0x22d     SYM    271  0.22     tty_write     SYM    256  0.20     opost_block     SYM    246  0.20     add_wait_queue      OFF    212  0.17      0x37     SYM    229  0.18     n_tty_receive_room     SYM    229  0.18     
n_tty_chars_in_buffer      OFF    212  0.17      0x33     SYM    197  0.16     __copy_from_user_ll     SYM    191  0.15     sys_rt_sigprocmask     SYM    177  0.14     do_select     SYM    174  0.14     __copy_to_user_ll     SYM    170  0.14     remove_wait_queue      OFF    147  0.12      0x1d     SYM    154  0.12     write_chan     SYM    153  0.12     pty_write     SYM    139  0.11     sys_select     SYM    125  0.10     kill_fasync    MOD   3032  2.42    JITCODE     SYM   1011  0.81     hellop.main([Ljava/lang/String;)V      OFF    405  0.32      0x1e0      OFF    257  0.21      0x1e3      OFF    192  0.15      0x1e4     SYM    254  0.20     java/io/OutputStreamWriter.write([CII)V     SYM    198  0.16     sun/nio/cs/StreamEncoder.flushBuffer()V     SYM    168  0.13     java/lang/String.<init>(Ljava/lang/String;I)V     SYM    168  0.13     java/io/PrintStream.write([BII)V     SYM    132  0.11     java/io/PrintStream.write(Ljava/lang/String;)V    MOD   2193  1.75    /lib/libcrypto.so.0.9.7a     SYM   2187  1.75     NoSymbols      OFF    141  0.11      0x2bcf3    MOD    782  0.62    /usr/sbin/sshd     SYM    772  0.62     NoSymbols    MOD    743  0.59    /lib/tls/libc-2.3.2.so     SYM    152  0.12     memset     SYM    125  0.10     __memchr    MOD    443  0.35    /opt/IBMJava2-141/jre/bin/libhpi.so    MOD    441  0.35    /opt/IBMJava2-141/jre/bin/classic/libjvm.so     SYM    141  0.11     jitCacheAlloc    MOD    311  0.25    /lib/tls/libpthread-0.29.so    MOD    208  0.17    NoModule     SYM    208  0.17     NoSymbols      OFF    134  0.11      0xffffe410    MOD    133  0.11    /opt/IBMJava2-141/jre/bin/libjava.so </pre> 

Java Lock Monitor

Java Lock Monitor (JLM) support is provided with version 1.4.0 of the IBM JDK. JLM provides monitor hold time accounting and contention statistics on monitors used in Java applications and the JVM itself. JLM provides support for the following counters.

Counters associated with contended locks:

  • Total number of successful acquires.

  • Recursive acquires. Number of times the monitor was requested and was already owned by the requesting thread.

  • Number of times the requesting thread was blocked waiting on the monitor because the monitor was already owned by another thread.

  • Cumulative time the monitor was held.

The following statistics are also collected on platforms that support 3-Tier Spin Locking (x86 SMP):

  • Number of times the requesting thread went through the inner (spin) loop while attempting to acquire the monitor.

  • Number of times the requesting thread went through the outer (thread yield) loop while attempting to acquire the monitor.

Garbage collection (GC) time is removed from hold times for all monitors held across a GC cycle.

A monitor can be acquired either recursively, when the requesting thread already owns it, or nonrecursively, when the requesting thread does not already own it. Non-recursive acquires can be further divided into fast and slow. Fast is when the requested monitor is not already owned and the requesting thread gains ownership immediately. On platforms that implement 3-Tier Spin Locking, any monitor acquired while spinning is considered a fast acquire, regardless of the number of iterations in each tier. Slow is when the requested monitor is already owned by another thread and the requesting thread is blocked.
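The three acquisition paths can be seen with an ordinary Java monitor. This is an illustrative sketch only — JLM itself instruments the JVM's monitor implementation, and the class and field names here are invented:

```java
// Hypothetical illustration (not part of PI): the acquisition paths JLM
// distinguishes, reproduced with a plain synchronized block.
public class AcquireKinds {
    static final Object monitor = new Object();
    static volatile boolean holderReleased = false;
    static volatile boolean slowAcquireSawRelease = false;

    public static void main(String[] args) throws InterruptedException {
        // FAST: uncontended nonrecursive acquire -- ownership gained immediately.
        synchronized (monitor) {
            // REC: the thread already owns the monitor, so this acquire is recursive.
            synchronized (monitor) { }
        }

        // SLOW: another thread owns the monitor, so the requester must block.
        Thread holder = new Thread(() -> {
            synchronized (monitor) {
                try { Thread.sleep(100); } catch (InterruptedException ignored) { }
                holderReleased = true;   // set just before the monitor is released
            }
        });
        holder.start();
        Thread.sleep(20);                // let the holder take the monitor first
        synchronized (monitor) {         // blocks until the holder releases: a SLOW get
            slowAcquireSawRelease = holderReleased;
        }
        holder.join();
        System.out.println("blocked until release: " + slowAcquireSawRelease);
    }
}
```

Because Java monitors are reentrant, the nested synchronized block never deadlocks; JLM simply counts it under REC rather than NONREC.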

The JLM is controlled through a sockets interface, which allows it to be controlled from a remote site. An rtdriver program is provided to issue the proper commands: jlmstart starts the measurement, jlmdump writes the current results to a file, and jlmstop stops the measurement. You can issue the jlmdump command multiple times; each time, a new file is produced, allowing information to be collected at critical points in the application's execution.

The following example shows a sample Java Lock Monitor report.


Java Lock Monitor Report Version_4.26 (05.01.2002)
Built : ( Wed May 1 14:44:28 CDT 2002 )
JLM_Interval_Time 34021158156

System (Registered) Monitors
%MISS   GETS  NONREC   SLOW    REC   TIER2  TIER3  %UTIL  AVER-HTM  MON-NAME
   87   5273    5273   4572      0  710708  18487      1     95408  ITC Global_Compile lock
    9   6870    6869    631      1  113420   2976      0     11807  Heap lock
    5   1123    1123     51      0   11098    286      1    248385  Binclass lock
    0   1153    1147      5      6    1307     33      0     47974  Monitor Cache lock
    0  46149   45877    134    272   36961    877      1      6558  ITC CHA lock
    0  33734   23483     19  10251    6544    150      1     17083  Thread queue lock
    0      5       5      0      0       0      0      0   9309689  JNI Global Reference lock
    0      5       5      0      0       0      0      0   9283000  JNI Pinning lock
    0      5       5      0      0       0      0      0   9442968  Sleep lock
    0      1       1      0      0       0      0      0         0  Monitor Registry lock
    0      0       0      0      0       0      0      0         0  Evacuation Region lock
    0      0       0      0      0       0      0      0         0  Method trace lock
    0      0       0      0      0       0      0      0         0  Classloader lock
    0      0       0      0      0       0      0      0         0  Heap Promotion lock

Java (Inflated) Monitors
%MISS   GETS  NONREC   SLOW    REC   TIER2  TIER3  %UTIL  AVER-HTM  MON-NAME
   15     68      68     10      0    2204     56      2  11936405  test.lock.testlock1@A09410/A09418
    2     42      42      1      0     186      5      0    300478  test.lock.testlock2@D31358/D31360
    0     70      70      0      0      41      1      0      7617  java.lang.ref.ReferenceQueue$Lock@920628/920630

LEGEND:
%MISS    : 100 * SLOW / NONREC
GETS     : Lock Entries
NONREC   : Non Recursive Gets
SLOW     : Non Recursives that Wait
REC      : Recursive Gets
TIER2    : SMP Wait Hierarchy
TIER3    : SMP Wait Hierarchy
%UTIL    : 100 * Hold-Time / Total-Time
AVER-HTM : Hold-Time / NONREC

Descriptions of the report's fields are as follows:

  • JLM_Interval_Time. Time interval between the start and end of measurement. Time is expressed in the units appropriate for the hardware platform: cycles for x86, IA64, and S390, and time-based ticks for PPC.

  • %MISS. Percentage of the total GETS (acquires) where the requesting thread was blocked waiting on the monitor.

     %MISS = (SLOW / NONREC) * 100 

  • GETS. Total number of successful acquires.

     GETS = FAST + SLOW + REC 

  • NONREC. Total number of nonrecursive acquires. This number includes SLOW gets.

  • SLOW. Total number of nonrecursive acquires which caused the requesting thread to block waiting for the monitor to no longer be owned. This number is included in NONREC.

    To calculate the number of nonrecursive acquires in which the requesting thread obtained ownership immediately (FAST), subtract SLOW from NONREC. On platforms that support 3-Tier Spin Locking, monitors acquired while spinning are considered FAST acquires.

  • REC. Total number of recursive acquires. A recursive acquire is one where the requesting thread already owns the monitor.

  • TIER2. Total number of Tier 2 (inner spin loop) iterations on platforms that support 3-Tier Spin Locking.

  • TIER3. Total number of Tier 3 (outer thread yield loop) iterations on platforms that support 3-Tier Spin Locking.

  • %UTIL. Monitor hold time divided by JLM_Interval_Time. Hold time accounting must be turned on.

  • AVER-HTM. Average amount of time the monitor was held. Recursive acquires are not included because the monitor is already owned when acquired recursively.

     AVER-HTM = Total hold time / NONREC 

  • MON-NAME. Monitor name or NULL (blank) if the name is not known.
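As a quick check of how these fields relate, the following sketch recomputes the derived values for the Heap lock row of the sample report (GETS 6870, NONREC 6869, SLOW 631, REC 1), using the formulas from the legend:

```java
// Recompute derived JLM fields for one row of the sample report above.
public class JlmDerived {
    public static void main(String[] args) {
        long gets = 6870, nonrec = 6869, slow = 631, rec = 1;   // Heap lock row

        long fast = nonrec - slow;            // FAST: ownership gained immediately
        long missPct = 100 * slow / nonrec;   // %MISS = 100 * SLOW / NONREC (truncated)

        System.out.println("FAST  = " + fast);
        System.out.println("%MISS = " + missPct);
        // Sanity check against the definition GETS = FAST + SLOW + REC
        System.out.println("GETS consistent: " + (fast + slow + rec == gets));
    }
}
```

The computed %MISS of 9 matches the value printed in the report for the Heap lock.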

Performance Inspector Executable Tools

The following tools, which are shipped with the Performance Inspector, provide support for the PI functions.

swtrace

swtrace is a software tracing mechanism that runs on Linux. swtrace is normally run from a command prompt by issuing the swtrace command with the appropriate arguments.

swtrace uses software trace hooks to collect data. Trace hooks are identified by both a major code and a minor code. Trace data is collected to a trace buffer that is allocated when swtrace is initialized or turned on. The size of the trace buffer can be set when swtrace is initialized. The swtrace command allows the user to select which major codes are traced, when tracing starts, when tracing stops, when data is transferred from the trace buffer to disk, and formatting of the trace data.
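The buffer-and-hook scheme can be pictured with a small sketch. This is a conceptual illustration only, not PI's actual implementation; the class, method names, and major codes below are all invented:

```java
import java.util.ArrayList;
import java.util.List;

// Conceptual sketch: trace hooks tagged with a major and minor code,
// filtered by an enabled-majors mask, and appended to a fixed-size buffer --
// mirroring how swtrace selects which hooks are collected.
public class TraceBuffer {
    record Hook(int major, int minor, String data) { }

    private final List<Hook> buffer = new ArrayList<>();
    private final int capacity;                       // set at init time
    private final boolean[] majorEnabled = new boolean[256];

    TraceBuffer(int capacity) { this.capacity = capacity; }

    void enable(int major)  { majorEnabled[major] = true; }   // like "swtrace enable"
    void disable(int major) { majorEnabled[major] = false; }  // like "swtrace disable"

    void hook(int major, int minor, String data) {
        // Only enabled majors reach the buffer; a full buffer drops the record.
        if (majorEnabled[major] && buffer.size() < capacity) {
            buffer.add(new Hook(major, minor, data));
        }
    }

    int recorded() { return buffer.size(); }

    public static void main(String[] args) {
        TraceBuffer tb = new TraceBuffer(1024);   // buffer size fixed at "init"
        tb.enable(0x10);                          // hypothetical dispatch major code
        tb.hook(0x10, 0x01, "dispatch pid 4710");
        tb.hook(0x20, 0x01, "interrupt entry");   // major 0x20 not enabled: dropped
        System.out.println("records = " + tb.recorded());
    }
}
```

The real facility writes binary records into a kernel buffer and dumps them to disk on demand, but the filtering model is the same: enable/disable operates on major codes, and the buffer size is fixed when tracing is initialized.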

The major parameters supported by swtrace are as follows:

  • init. Tells the trace profiler to allocate the trace buffers and initialize the system for tracing. With init, the size of the trace buffer to be allocated and the performance counter to use for taking the trace can be optionally specified.

  • enable. Enables the trace hooks within the Linux kernel. This controls what information is placed in the trace buffer.

  • disable. Keeps the specified trace hooks from being measured.

  • on. Starts the trace. The trace information is gathered until the swtrace off command is given.

  • get. Dumps the contents of the trace buffers to a file for processing by the report generator program.

  • it_install. Initializes the instruction trace facility.

  • it_remove. Resets the instruction trace functionality.

Other parameters display information about the Performance Inspector and control the rate of profiling. The command file run.tprof is generated when PI is installed in the system. This command file contains all the steps necessary to take a profiling trace and produce a report.

post

post produces various reports based on the trace profiling data. When a trace profiling report is produced, it is written to a file called tprof.out. When you install the Performance Inspector, you identify the directory where this file is saved. If you want to keep the current tprof.out file, rename it before running run.tprof again.

One option supported by the post command is -show, which creates a file called post.show, a readable dump of all the trace records. With all trace hooks enabled, post.show gives a detailed look at the sequence of events that occurred in the system, from execs to dispatches to interrupts.

pipcntr

pipcntr controls and displays the performance counters when PI is running on Intel IA32 platforms. This program can start and stop counters and display the contents of the counter registers and the counter control register settings. You can use this utility to start a performance counter and then run the per-thread time utility using this counter. The same holds true for the trace profiling function.

ptt

ptt starts and stops the per-thread time measurement. When starting, you can also specify what metric to use to perform the measurement.

pttstats

pttstats displays the per-thread time of every process in the system. On Intel IA32 systems, the measurement metric can be either the real-time clock or a performance monitor counter.

    Performance Tuning for Linux Servers
    ISBN: 0137136285
    Year: 2006
    Pages: 254