7.8. Performance Monitoring Tools

Both application performance and application behavior on the Linux platform need to be checked before the newly ported application is made available. Familiarize yourself with performance monitoring tools available on Linux to find problems or performance bottlenecks. Some of these tools are already available with the standard Linux distributions; others are open-source tools that you can download from the Internet.

We approach application tuning from two views: internal and external. As the names imply, the internal view deals with the functions internal to the application; the external view deals with the environment the application runs on. Both have a direct effect on a customer's perception of the application; a fast-performing application will have more contented users than a slow-performing application. We discuss both views next.

7.8.1. Internal View

The first view to application tuning is the internal view. The internal view is where we get to look inside the application. The application is best viewed internally with a profiling tool. A profiling tool tells you where your program spent its time and which functions called which functions. This information can show you which functions of your program are slower than you expected. You can then analyze the code for slow functions to determine whether they can be optimized.

7.8.1.1. gprof

Linux provides a profiling tool called gprof.[14] We run gprof against the program in Example 7-7 and show sample output.

[14] www.gnu.org/software/binutils/manual/gprof-2.9.1/html_mono/gprof.html

Example 7-7. Source Code for gprof_1.c

#include <pthread.h>
#include <stdio.h>
#include <errno.h>
#include <stdlib.h>

static unsigned int v = 40;
static int count = 0;
pthread_mutex_t count_lck;

/*
  A Fibonacci sequence is a sequence of numbers {f0, f1, f2 ....}
  in which the first two values are equal to 1 and each succeeding
  number is the sum of the previous two numbers.
  Here are the first few numbers of the Fibonacci sequence:
  1 1 2 3 5 8 13 21 ...
*/

/* function to return the kth Fibonacci number */
unsigned long fibonacci(unsigned long k)
{
  unsigned long x, y, z, n;
  x = 1;
  if (k > 1)
  {
    y = z = 1;
    for (n = 2; n <= k; n++)
    {
      x = y + z;
      y = z;
      z = x;
    }
  }
  /* return the kth Fibonacci number */
  return x;
}

unsigned long rec_fibonacci(unsigned long k)
{
  unsigned long x;
  x = 1;
  if (k > 1)
    x = rec_fibonacci(k-1) + rec_fibonacci(k-2);
  /* return the kth Fibonacci number */
  return x;
}

void *compute(void *arg)
{
  pthread_mutex_lock(&count_lck);
  if (count == 0)
  {
    count++;
    pthread_mutex_unlock(&count_lck);
    printf("recursive = %ld\n", rec_fibonacci(v));
  }
  else
  {
    count++;
    pthread_mutex_unlock(&count_lck);
    printf("non recursive = %ld\n", fibonacci(v));
  }
  return NULL;
}

int main()
{
  int rc;
  int i;
  pthread_t tid[2];
  pthread_mutex_init(&count_lck, NULL);
  for (i = 0; i < 2; i++)
  {
    if ((rc = pthread_create(&tid[i], NULL, compute, NULL)) != 0)
    {
      printf("pthread_create failed: %d\n", rc);
      exit(1);
    }
  }
  for (i = 0; i < 2; i++)
  {
    pthread_join(tid[i], NULL);
  }
  pthread_mutex_destroy(&count_lck);
  return 0;
}

Compile with the -pg flag to turn on profiling:

$ gcc gprof_1.c -o gprof_1 -lpthread -D_REENTRANT -pg


Run the program:

$ ./gprof_1
non recursive = 165580141
recursive = 165580141


The program gprof_1 produces a file named gmon.out. gprof interprets gmon.out[15] and prints the profile information in readable form.

[15] Note that if gmon.out does not exist, gprof will complain that it cannot find the file gmon.out.

Run gprof against the application:

$ gprof ./gprof_1 > outfile
$ cat outfile
--- some information cut ---

                     Call graph (explanation follows)

granularity: each sample hit covers 4 byte(s) no time propagated

index % time    self  children    called     name
                0.00    0.00       1/1           compute [10]
[1]      0.0    0.00    0.00       1         fibonacci [1]
-----------------------------------------------
                              331160279             rec_fibonacci [2]
                0.00    0.00       1/1           compute [10]
[2]      0.0    0.00    0.00       1+331160279 rec_fibonacci [2]
                              331160279             rec_fibonacci [2]
--- some information cut ---


Notice that this output indicates the number of times the function rec_fibonacci() was called, but not how much time was spent on those calls: the self column contains no figures. On some operating system kernels (such as Linux), gprof measures only the main thread. gprof relies on the internal ITIMER_PROF timer, which makes the kernel deliver a signal to the application whenever the timer expires, and this timer setting needs to be passed on to all spawned threads. Appendix E, "gprof helper," includes a module that wraps pthread_create() to pass on the needed timer data. Compile the module to produce a shared library as follows:

$ gcc -shared -fPIC gprof_helper.c -o gprof-helper.so -lpthread -ldl


This produces the library gprof-helper.so. Run the example gprof_1 application again, this time with the following command:

$ LD_PRELOAD=./gprof-helper.so ./gprof_1
pthreads: using profiling hooks for gprof
non recursive = 165580141
recursive = 165580141


This produces a new gmon.out file on which we can run gprof:

$ gprof ./gprof_1
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
100.00    135.95   135.95        1   135.95   135.95  rec_fibonacci

 %         the percentage of the total running time of the
time       program used by this function.

cumulative a running sum of the number of seconds accounted
 seconds   for by this function and those listed above it.

 self      the number of seconds accounted for by this
seconds    function alone.  This is the major sort for this
           listing.

calls      the number of times this function was invoked, if
           this function is profiled, else blank.

 self      the average number of milliseconds spent in this
ms/call    function per call, if this function is profiled,
           else blank.

 total     the average number of milliseconds spent in this
ms/call    function and its descendents per call, if this
           function is profiled, else blank.

name       the name of the function.  This is the minor sort
           for this listing.  The index shows the location of
           the function in the gprof listing.  If the index is
           in parenthesis it shows where it would appear in
           the gprof listing if it were to be printed.

                     Call graph (explanation follows)

granularity: each sample hit covers 4 byte(s) for 0.01% of 135.95 seconds

index % time    self  children    called     name
                              331160280             rec_fibonacci [1]
              135.95    0.00       1/1           compute [2]
[1]    100.0  135.95    0.00       1+331160280 rec_fibonacci [1]
                              331160280             rec_fibonacci [1]
-----------------------------------------------
                                                 <spontaneous>
[2]    100.0    0.00  135.95                 compute [2]
              135.95    0.00       1/1           rec_fibonacci [1]
-----------------------------------------------
--- rest of the output cut ---


Notice that the self column has now been filled in. Because the time spent executing the function fibonacci() was negligible compared to the time spent executing rec_fibonacci(), gprof did not print a line for it. Refer to the GNU gprof manual[16] for more details about using gprof and interpreting its output.

[16] www.gnu.org/software/binutils/manual/gprof-2.9.1/html_mono/gprof.html

In our simple example, you just saw how slow a recursive function can be. In real applications, gprof data can show which functions are called and by whom and how fast these functions perform. This data can help you decide whether there is a need to rewrite some of these functions.

7.8.1.2. OProfile

OProfile (http://oprofile.sf.net) is a system-wide profiler capable of profiling all running code, including hardware and software interrupt handlers, kernel modules, kernels, shared libraries, and user-space applications. Unlike gprof, OProfile does not require code to be compiled with special profiling options or even the -g option (unless you want samples to be mapped back to the source code). Therefore, you can profile applications running in production environments using OProfile. In addition, OProfile collects samples on all processes and threads running on the system and enables the user to examine data from processes that are still running.

OProfile can be configured to take samples periodically (time-based samples) to identify which parts of code consumed computing resources. On many architectures, OProfile provides access to the performance counter registers, allowing samples to be collected based on other events such as cache misses, TLB misses, memory references, and instructions retired. Users can also request reports that have the source trees annotated with the profile information. The generated profiling report enables developers to determine whether performance problems exist in the code and to modify the code accordingly.

Here we show you how to profile the code in Example 7-7 using OProfile. In this example, we use only the time-based event. As mentioned previously, you do not need to compile your code with any special option:

# gcc -o oprof1 gprof_1.c -lpthread -D_REENTRANT


Execute the following command as root to clear out previous samples:

# opcontrol --reset


Next, configure and start OProfile by running the following two commands:

# opcontrol --setup --no-vmlinux --separate=thread \
  --event=CYCLES:1000
# opcontrol --start
Using 2.6+ OProfile kernel interface.
Using log file /var/lib/oprofile/oprofiled.log
Daemon started.
Profiler running.


The --no-vmlinux option indicates that we do not want samples for the kernel. The --separate=thread option gives separation for each thread. The --event option specifies measuring the time-based CYCLES event, with a sample recorded for every 1,000 events. You can issue the command opcontrol --list-events to list all supported events on your platform. The --start option starts OProfile's data collection.

Now we can start the program:

# ./oprof1
non recursive = 165580141
recursive = 165580141


To ensure that all the current profiling data is flushed to the sample files before each analysis of profiling data, we issue the following command:

# opcontrol --dump


Shut down OProfile with the following command:

# opcontrol --shutdown
Stopping profiling.
Killing daemon.


Analysis can be performed on the collected data as a normal user with the opreport and opannotate commands:

# opreport -l ./oprof1
CPU: ppc64 POWER5, speed 1650.35 MHz (estimated)
Counted CYCLES events (Processor cycles) with a unit mask of 0x00
(No unit mask) count 1000
Processes with a thread ID of 15047
Processes with a thread ID of 15048
samples  %        samples  %        symbol name
1232110  100.000  0        0        rec_fibonacci
0        0        2        100.000  fibonacci


The output shows a separate report for each thread. As expected, the function rec_fibonacci consumes most of the CPU time (1,232,110 samples, 100 percent) within its thread. The function fibonacci received two samples, accounting for 100 percent of the other thread's total. If you want to relate samples back to the source code, you need to recompile with the -g option. Following is a portion of the output from the opannotate --source ./oprof1 command:

...
                     :unsigned long rec_fibonacci(unsigned long k)
0  0  298 27.9288   :{ /* rec_fibonacci total: 0 0 1067 100.000 */
                     :  unsigned long x;
                     :
0  0    1  0.0937   :  x = 1;
0  0   44  4.1237   :  if (k > 1)
0  0  364 34.1143   :    x = rec_fibonacci(k-1) + rec_fibonacci(k-2);
                     :
                     :  /* return the kth Fibonacci number */
0  0   62  5.8107   :  return x;
0  0  298 27.9288   :}
...


The first nonzero number on each line is the number of samples; the second nonzero number is the relative percentage of total samples. Refer to the OProfile manual[17] for more detailed information about using OProfile and interpreting its output.

[17] http://oprofile.sourceforge.net/docs/

Here are some white papers that you may find useful to learn more about OProfile:

  • Smashing Performance with OProfile (www-128.ibm.com/developerworks/linux/library/l-oprof.html)

  • Red Hat: OProfile instruction (www.centos.org/docs/4/html/rhel-sag-en-4/s1-oprofile-configuring.html)

  • OProfile for Linux on POWER (www-128.ibm.com/developerworks/library/l-pow-oprofile/)

  • Five easy-to-use performance tools for Linux on PowerPC (www-128.ibm.com/developerworks/eserver/library/es-PerformanceInspectoronpLinux.html)

  • Tuning Programs with OProfile (http://people.redhat.com/wcohen)

Both gprof and OProfile have their own niche in application profiling. Depending on the application's profiling requirements, porting engineers and software developers may choose to favor one tool over another. Table 7-2 compares the characteristics of gprof and OProfile.

Table 7-2. Comparison Between gprof and OProfile

gprof                                         OProfile[18]

Does not require root permission              Requires root permission

Cannot measure hardware effects such as       Can measure hardware effects such as
cache misses                                  cache misses

Only profiles at the application level        Can capture performance behavior of
                                              the entire system

Need to recompile source code with -pg        No need to recompile source code with
                                              extra flags

Can profile interpreted code (such as         Cannot profile interpreted code[20]
Java)[19]

Can profile application and shared            Can profile application and shared
libraries (only if shared libraries were      libraries (no need to recompile shared
compiled with -pg)                            libraries with profiling flags)


[18] See http://oprofile.sourceforge.net/doc/introduction.html#applications for more information.

[19] Compile your Java code using GNU compiler for Java (gcj).

[20] We did not try profiling Java code that was compiled with gcj.

In summary, profiling application code is a way to get an internal view of the application. The internal view gives us an idea about what parts of the code are frequently used and which functions take up the most CPU time. You can use this view to identify hot spots within the application that can be optimized to produce a better-performing application.

7.8.2. External View

The second view to application tuning is the external view. This is where we view the application from a systems point of view. When an application runs, it uses system resources. Its resource usage is small at startup and gradually increases as more users use the application or more of its functions are exercised. We call the resources used by the application its resource signature. The resource signature tells us how much memory, CPU, disk space, and other system resources must be available for the application to function at optimal levels. This view is most useful for system architects, who need to plan for resource capacity, and it is generally obtained by performing an application benchmark.

During application benchmarking, an environment is set up to simulate application use. Usually, this is done through scripts that simulate users utilizing the application to perform different operations. System tools that monitor memory, CPU, disk I/O, and network traffic are then used to observe the overall resource utilization. Information gathered through these tools helps systems architects plan which resources the application will need. We discuss some of these tools next.

7.8.2.1. vmstat

vmstat displays real-time information about processes, memory, paging, block I/O, traps, and CPU activity. The first line provides the average system activity since the last reboot, which can usually be ignored. Each additional line gives the information for one sampling period of length delay (a user-specified parameter). The output from vmstat can be used to identify unusual system behaviors such as high page-fault rates or excessive context switches. A sample output of vmstat follows:

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd     free   buff   cache   si   so    bi    bo   in   cs us sy  id wa
 0  0      0 14936464 221040  467720    0    0     0     1    1    9  0  0 100  0
 0  0      0 14936528 221040  467720    0    0     0     6    5   39  0  0 100  0
 0  0      0 14936528 221040  467720    0    0     0     0    3   37  0  0 100  0
 0  0      0 14936528 221040  467720    0    0     0     2    4   40  0  0 100  0
 0  0      0 14936528 221040  467720    0    0     0     0    3   37  0  0 100  0
 0  0      0 14936464 221040  467720    0    0     0     0    3   37  0  0 100  0


The output shows the id column at 100 (100 percent idle), which indicates a fairly quiet system. If this were a busy system, we would see the id column at less than 100 and significantly higher numbers in the us (user), sy (system), and wa (I/O wait) columns. The us, sy, and id columns indicate the overall usage of the system, and the wa column indicates the percentage of time the CPU spent waiting for I/O to complete.

Consider Example 7-8. In this example, we create four threads that execute the rec_fibonacci function. Because we already know that one thread executing the rec_fibonacci function consumes a significant amount of CPU time, with four threads we can expect the application to put an even higher load on the CPUs. Let's find out whether that is the case.

Example 7-8. Source Code for vmstat_1.c

#include <pthread.h>
#include <stdio.h>
#include <errno.h>
#include <stdlib.h>

static unsigned int v = 40;

/*
  A Fibonacci sequence is a sequence of numbers {f0, f1, f2 ....}
  in which the first two values are equal to 1 and each
  succeeding number is the sum of the previous two numbers.
  Here are the first few numbers of the Fibonacci sequence:
  1 1 2 3 5 8 13 21 ...
*/

unsigned long rec_fibonacci(unsigned long k)
{
  unsigned long x;
  x = 1;
  if (k > 1)
    x = rec_fibonacci(k-1) + rec_fibonacci(k-2);
  /* return the kth Fibonacci number */
  return x;
}

void *compute(void *arg)
{
  printf("recursive = %ld\n", rec_fibonacci(v));
  return NULL;
}

int main()
{
  int rc;
  int i;
  pthread_t tid[4];   /* one slot per thread */
  for (i = 0; i < 4; i++)
  {
    if ((rc = pthread_create(&tid[i], NULL, compute, NULL)) != 0)
    {
      printf("pthread_create failed: %d\n", rc);
      exit(1);
    }
  }
  for (i = 0; i < 4; i++)
  {
    pthread_join(tid[i], NULL);
  }
  return 0;
}

Compile the example:

$ gcc vmstat_1.c -o vmstat_1 -lpthread -D_REENTRANT 


Run the example:

$ ./vmstat_1 


Run vmstat:

$ vmstat 2
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs  us sy  id wa
 0  0    116  23168  69832 522428    0    0     0     0    0    2   0  0 100  0
 0  0    116  23168  69832 522428    0    0     0     4    2   96   0  0 100  0
 0  0    116  22848  69832 522428    0    0     0     0    4  115   0  0 100  0
 7  0    116  22704  69832 522428    0    0     0    18    5  107  19  0  81  1
 6  0    116  22704  69832 522428    0    0     0     0    2   80 100  0   0  0
 4  0    116  22704  69832 522428    0    0     0     0    3   80 100  0   0  0
 4  0    116  22704  69832 522428    0    0     0    14    2   82 100  0   0  0
 4  0    116  22704  69832 522428    0    0     0     0    1   75 100  0   0  0
 6  0    116  22704  69832 522428    0    0     0     0    1   82 100  0   0  0
 0  0    116  22704  69832 522428    0    0     0     0    3  100  42  0  59  0
 0  0    116  22736  69832 522428    0    0     0     0    1   92   0  0 100  0
 0  0    116  22736  69832 522428    0    0     0     0    2  102   0  0 100  0
 0  0    116  22768  69832 522428    0    0     0     0    2   92   0  0 100  0


The vmstat output shows that the system got really busy when the application was started and went back to an idle state after the application exited. In this case, the user (us) column went up to 100, indicating a busy system in which all the CPU cycles (100 percent) were spent running the application code in the user space. For more information about vmstat, consult the vmstat(8) man pages.

7.8.2.2. iostat

iostat generates a report about the system I/O activities. If no display interval is given, iostat displays I/O information since the last reboot. If a display interval is given, the first set of output represents total activity since boot time, and the subsequent outputs show only the delta activities. The report is divided into two sections: CPU utilization and device utilization. On multiprocessor systems, CPU statistics are calculated system-wide as averages among all processors:

avg-cpu:  %user   %nice    %sys %iowait   %idle
           0.21    0.00    0.80    2.07   96.92

Device:    tps   Blk_read/s   Blk_wrtn/s    Blk_read     Blk_wrtn
sda       2.49         0.05      1443.46        2778      9552392
sdb       4.94         0.10      2871.73        5322    158268008
sdc       4.95         0.10      2860.91        5330    157671720
sdd      30.20      1518.55         0.42    83690898        23288
sde      60.25      2902.76         0.92   159978258        50896
sdf       0.00         0.01         0.00         378           24
sdg      59.49      2883.87         0.90   158937034        49520


iostat indicates how busy disk devices are during I/O operations. Depending on where the files used by the application are placed, some disks can have more I/O operations than others. A disk that gets too busy can lead to long I/O waits, resulting in poor application performance. iostat information can help point out where the disk I/O bottlenecks are. This information can then be used by the application architect to redesign the filesystem layout policies.[21]

[21] Filesystems can be designed to be striped across disks to increase throughput.

In our example printout, we can see that devices sdd, sde, and sdg have done the bulk of the read operations, whereas devices sda, sdb, and sdc have done mostly write operations. This example also shows the tps (transfers per second) column for both the read and write operations. Comparing the tps rates of all the disks, one might wonder why the mostly-write disks show low tps rates compared to the disks performing read operations. This question can prompt us to investigate how[22] files on these disks are accessed by the application to give us more clues as to why this is happening. This is just one example of how to use iostat to view the profile of the I/O operations of the application. For more information about iostat, refer to the iostat(1) man pages.

[22] Perhaps the block transfer rates differ.

7.8.2.3. top

top provides a continuously updated view of a running system. It displays real-time information on CPU utilization, process statistics, and memory utilization, as shown here:

top - 16:23:43 up 27 days, 7:07,  2 users,  load average: 0.00, 0.00, 0.00
Tasks: 108 total,   1 running, 107 sleeping,   0 stopped,   0 zombie
Cpu0 :  0.0% us,  0.0% sy,  0.0% ni, 100.0% id,  0.0% wa,  0.0% hi,  0.0% si
Cpu1 :  0.1% us,  0.1% sy,  0.0% ni,  99.6% id,  0.1% wa,  0.0% hi,  0.0% si
Cpu2 :  0.0% us,  0.0% sy,  0.0% ni, 100.0% id,  0.0% wa,  0.0% hi,  0.0% si
Cpu3 :  0.1% us,  0.0% sy,  0.0% ni,  99.9% id,  0.0% wa,  0.0% hi,  0.0% si
Mem:  16190476k total,  2390180k used, 13800296k free,   223624k buffers
Swap: 37760724k total,        0k used, 37760724k free,   912316k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 5775 root      15   0 31764  12m 2296 S  0.2  0.1 148:53.89 X
21463 root      16   0  2472 1520 1108 R  0.1  0.0   0:00.37 top
    1 root      16   0   628  284  248 S  0.0  0.0   0:04.34 init
    2 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/0
    3 root      34  19     0    0    0 S  0.0  0.0   0:00.00 ksoftirqd/0
    4 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/1
    5 root      34  19     0    0    0 S  0.0  0.0   0:00.00 ksoftirqd/1
    6 root      RT   0     0    0    0 S  0.0  0.0   0:00.01 migration/2
    7 root      34  19     0    0    0 S  0.0  0.0   0:00.00 ksoftirqd/2


top's output is divided into two parts. The first part contains useful information related to overall system status such as CPU status, process counts, uptime, load average, and utilization statistics for both memory and swap space. The second part shows process-level statistics.

7.8.2.4. ps

The ps command displays information about active processes. Unlike top, the output is a single snapshot in time. You can view the information about how the memory and CPU are used by a particular process by using the ps command:

$ ps aux
USER   PID    %CPU %MEM    VSZ    RSS TTY    STAT START  TIME COMMAND
root   10712  99.9  0.7 119196 118528 pts/4  R+    8:53  2:24 ./ofreqde
root   10720   0.0  0.0   2612   1064 pts/0  R+    8:55  0:00 ps aux
root   10721   0.0  0.0   2172    892 pts/0  S+    8:55  0:00 more


The output of the ps aux command shows not only the total percentage of system memory that each process uses, but also its virtual memory footprint (VSZ) and the amount of physical memory that the process is currently using (RSS).

7.8.2.5. nmon

Although primarily used to monitor system performance when benchmarking an application, you can also use nmon to monitor the application's use of system resources. nmon is a free performance tool for Linux running on different supported platforms. You can use it to monitor or gather performance data, including CPU utilization, memory usage, disk I/O rates, network I/O rates, free space on filesystems, and more. Unlike top, nmon can record data to a file that can be analyzed at a later time. A companion tool called nmon analyzer is also available. The nmon analyzer is designed to take input files produced by nmon and turn them into spreadsheets containing high-quality graphs.

At the time of this writing, the nmon tool runs on the following:

  • Linux SUSE SLES 9, Red Hat EL 3 and 4, Debian on pSeries p5, and OpenPower

  • Linux SUSE, Red Hat, and many recent distributions on x86 (Intel and AMD in 32-bit mode)

  • Linux SUSE and Red Hat on zSeries or mainframe

  • AIX 4.1.5, 4.2.0, 4.3.2, and 4.3.3 (nmon version 9a: This version is functionally established and will not be developed further.)

  • AIX 5.1, 5.2, and 5.3 (nmon version 10: This version now supports AIX 5.3 and POWER5 processor-based machines, with SMT and shared CPU micropartitions.)

The nmon tool is available at www-128.ibm.com/developerworks/eserver/articles/analyze_aix/.[23]

[23] Or at www.ibm.com/developer (search for nmon)

We just discussed some of the most commonly used tools that will help the porting engineer gather information about the application's resource signature. For more information about performance tuning on Linux, refer to the following:

  • Performance Tuning for Linux Servers, by Sandra K. Johnson et al. (IBM Press, 2005)

  • Optimizing Linux Performance: A Hands-on Guide to Linux Performance Tools, by Phillip G. Ezolt (Prentice Hall, 2005)

  • Tuning IBM eServer xSeries Servers for Performance, by David Watts et al. (www.redbooks.ibm.com/Redbooks.nsf/RedbookAbstracts/sg245287.html?Open)

With more and more companies relying on their software applications for increased productivity, application performance will play a larger and larger role in contributing to that productivity. Both the internal and external views to application tuning will give us the necessary information we need to help us tune the application to perform to its maximum potential on the Linux platform.

7.8.3. Other Tools

Other open-source tools that help with profiling and system performance analysis on Linux are worth mentioning: Performance Inspector, SystemTap, and Kprobes. Performance Inspector helps with application profiling; the other two tools deal more with the Linux kernel.

7.8.3.1. Performance Inspector[24]

[24] http://perfinsp.sourceforge.net/

Performance Inspector (http://sourceforge.net/projects/perfinsp) is a suite of tools that enables you to identify performance problems in your applications and shows you how your application interacts with the Linux kernel. It consists of the following tools:

  • TProf is a CPU profiling tool. TProf interrupts the system periodically, based either on time or on the hardware performance monitor counters, and then determines the address of the interrupted code along with its process ID and thread ID. The sampling information is then processed and used to report hot spots in your code.

  • PTT collects per-thread statistics, such as the number of CPU cycles, number of interrupts, and number of times the thread was dispatched.

  • JLM provides statistics on locks based on the Java 2 technology.

  • JProf is a shared library that interfaces with the Java Virtual Machine Profiler Interface (JVMPI).

  • Hdump is used to analyze the live objects in a Java heap. Hdump provides a live object heap usage summary by object class.

  • ITrace for PPC64 is a software tracing mechanism that enables you to trace through both application and kernel code. ITrace proves most useful in situations where you have located a specific performance hot spot in your code and would like to optimize that hot spot at the assembly instruction level.

The majority of code within Performance Inspector is released under the GNU General Public License (GPL). Some shared libraries are under the GNU LGPL. At the time of this writing, there are three different packages for the different set of tools and methods for installation, as follows:

  • PerfInsp. This package uses kernel patches and therefore requires the kernel rebuild. It is for the 2.4 kernel-based distributions.

  • Dpiperf.dynamic. This package is primarily intended for the 2.6 kernel-based distributions. In this package, TProf and the Java tools can be used without rebuilding the kernel. Other tools such as PTT and AI depend on Kprobes; they therefore require a kernel patch until Kprobes is included in the distributions.

  • PerfInsp.ITrace.PPC64. This package includes the version of ITrace for PPC64 that utilizes the Kprobes kernel feature. The package contains the kernel patch for Kprobes, and therefore requires a kernel rebuild. This package has been tested on SLES9 and RHEL4.

7.8.3.2. SystemTap and Kprobes

SystemTap (http://sources.redhat.com/systemtap) is a dynamic instrumentation system for Linux. SystemTap is built on top of Kprobes, a facility that provides insight into the operation of the Linux kernel without recompiling or rebooting, because the instrumentation is loaded as a kernel module. Kprobes allows locations in the kernel to be instrumented with code that is executed when the processor reaches that probe point. After the instrumentation code completes execution, the kernel resumes normal operation. Kprobes is now included in the mainline 2.6[25] Linux kernel.

[25] Linux 2.6.12, to be exact

Here are some other publications that will help you learn more about SystemTap and Kprobe:

  • Kernel debugging with Kprobes (www-106.ibm.com/developerworks/library/l-kprobes.html?ca=dgr-lnxw42Kprobe)

  • Gaining insight into the Linux kernel with Kprobes (www.redhat.com/magazine/005mar05/features/kprobes/)

  • Architecture of systemtap: a Linux trace/probe tool (http://sourceware.org/systemtap/archpaper-0505.pdf)

  • Locating System Problems Using Dynamic Instrumentation, by Vara Prasad et al. Proceeding of Linux Symposium. July 2005, Ottawa, Canada.

SystemTap is safe and lightweight enough to use with live production systems. It is able to instrument both kernel and user-space programs even in the absence of source code. SystemTap's probe language is designed to be easy to use, and users can reuse general scripts written by others.




UNIX to Linux Porting: A Comprehensive Reference
ISBN: 0131871099
Year: 2004
Pages: 175