Upon execution of the profiler-instrumented image, the gmon.out file is generated. This file is used in conjunction with the original image for gprof to generate human-readable statistics information. Let s look at a simple example and the data products that result. First, we invoke the image and then generate the gprof summary:
$ ./sort $ gprof sort gmon.out > sort.gprof
The gprof utility writes its human-readable output to standard-out, so the user must redirect this to a file to save it. The first element of the gprof output is what s called the flat profile. This provides the basic timing summary of the executable, as shown in Listing 8.2. Note that this is not the complete output of gprof . We ll look at other data products shortly.
$ gprof sort gmon.out more Flat profile: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 71.66 3.11 3.11 1 3.11 3.11 bubbleSort 28.11 4.33 1.22 1 1.22 1.22 insertSort 0.23 4.34 0.01 2 0.01 0.01 init_list ...
As we saw in Listing 8.1, our application is made up of three functions. Each function is represented here with a variety of timing data. The first column represents the percentage of time spent in each function in relation to the whole. What s interesting to note from this column is that the bubble sort algorithm requires 2.5 times as much execution time to sort the identical list as the insert sort. The next column, cumulative seconds , is a running sum of the number of seconds, and the next, self seconds , is the number of seconds taken by this function alone. Note that the table itself is sorted by this column in descending order. The column entitled calls represents the total number of times that this function was called (if the function itself was profiled, otherwise the element will be blank. The self s/call represents the average number of seconds spent in this function (including functions that it calls), while the total s/call represents the total number of seconds spent in the function (including functions that it calls). Finally, the name of the function is provided.
The next element provided by gprof is the call graph. This summary (shown in Listing 8.3) shows each function and the calls that are made by it, including the time spent within each function. It illustrates the timing as a hierarchy, which is useful in understanding timing for individual functions down the chain and their effect on higher layers of the call chain.
index % time self children called name <spontaneous> [1] 100.0 0.00 2.11 main [1] 1.44 0.00 1/1 bubbleSort [2] 0.67 0.00 1/1 insertSort [3] 0.00 0.00 2/2 init_list [4] - 1.44 0.00 1/1 main [1] [2] 68.2 1.44 0.00 1 bubbleSort [2] - 0.67 0.00 1/1 main [1] [3] 31.8 0.67 0.00 1 insertSort [3] - 0.00 0.00 2/2 main [1] [4] 0.0 0.00 0.00 2 init_list [4] -
The index column is a unique number given to each element. The column marked % time represents the percentage of the total amount of time that was spent in the given function and its children calls. The self column is the amount of time spent in the function, with children as the amount of time spent in the children s functions. Note that in the first row, children is 2.11 (a sum of the two sort functions 1.44 + 0.67). This illustrates that the children functions took all of the time, and no meaningful time was spent in the main function itself. The called field identifies how many times the function was called by the parent. The first number is the number of times the particular parent called the function, and the second number is the total number of times that the child function was called altogether. Finally, the name column represents the names of the functions. Note that in the first row (after the row headings), the name < spontaneous > is used. This simply means that the parent could not be determined (very likely the C-start initialization).
Now that we have a performance baseline of our application, let s rebuild our application to see how we can improve it. In this example, we ll build using -O2 optimization to see how well it improves this very simple application:
$ gcc -o sort sort.c -pg -O2 $ ./sort $ gprof sort gmon.out more
From Listing 8.4, we see that the performance of our application improved significantly (a five times improvement for the insert-sort ) from Listing 8.2 (pre-optimization). The function bubbleSort saw only a modest four times improvement.
Flat profile: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls ms/call ms/call name 76.47 0.78 0.78 1 780.00 780.00 bubbleSort 23.53 1.02 0.24 1 240.00 240.00 insertSort 0.00 1.02 0.00 2 0.00 0.00 init_list
This clearly illustrates the utility of the gprof utility (and the gcc optimizer). We can see how the -O2 optimization level improves the source.
Now that we ve covered the basic uses of gprof , we ll look at the variety of options that are provided. While gprof provides a large number of options, we ll discuss some of the more useful ones here. For a complete list of options, see the gprof help.
The gprof utility can be used to annotate the source with frequency of execution. The image must be built for this purpose. The -g and -a options must be specified along with -pg as:
gcc -o sort sort.c -g -pg
This results in an image with not only the profile information (via the -pg option), but also debugging information (through the -g option). Upon executing the image and checking the resulting output, as
gprof -A -l sort gmon.out
#include <stdio.h> #define MAX_ELEMENTS 10000 void insertSort(int list[], int size) 1 -> { int i, j, temp; for (i = 1 ; i <= size-1 ; i++) { temp = list[i]; for (j = i-1 ; j >= 0 && (list[j] > temp) ; j) { list[j+1] = list[j]; } list[j+1] = temp; } }
The -x option can also be used with gprof to extend execution counts to all source lines.
For private functions that are statically declared, we can ignore the statistics for these through the use of the -a option (or -no-static ). This attributes time spent in the private function to the caller, with the private function never appearing in the flat profile or the call graph.
The gprof utility can recommend a function ordering that can improve cache and translation lookaside buffer (TLB) performance on systems that require functions to be contiguous in memory. For systems that include multiway caches, this may not provide much improvement.
To recommend a function ordering (via ”function-ordering ), the following command sequences can be used:
$ gcc -o sort sort.c -pg -g $ ./sort $ gprof sort gmon.out function-ordering
A list of functions is then suggested with specific ordering (using the source from Listing 8.1). The result is shown in Listing 8.6.
insertSort bubbleSort init_list _init _start __gmon_start__ . . .
Note the grouping of the sorting and init functions, which are all called in proximity of one another. In some cases, reordering functions can be done as simply as coexisting contiguous functions within a single source file. The linker can also provide this capability.
The file ordering option ( ”file-ordering ) provides a similar capability by recommending the order in which objects should be linked to the target image.
The -b option is useful to minimize the amount of superfluous description data that is emitted . Using -b removes the field discussion in the output, as shown in Listing 8.7.
Flat profile: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 71.59 3.10 3.10 1 3.10 3.10 bubbleSort 28.41 4.33 1.23 1 1.23 1.23 insertSort 0.00 4.33 0.00 2 0.00 0.00 init_list Call graph granularity: each sample hit covers 4 byte(s) for 0.23% of 4.33 seconds index % time self children called name <spontaneous> [1] 100.0 0.00 4.33 main [1] 3.10 0.00 1/1 bubbleSort [2] 1.23 0.00 1/1 insertSort [3] 0.00 0.00 2/2 init_list [4] - 3.10 0.00 1/1 main [1] [2] 71.6 3.10 0.00 1 bubbleSort [2] - 1.23 0.00 1/1 main [1] [3] 28.4 1.23 0.00 1 insertSort [3] - 0.00 0.00 2/2 main [1] [4] 0.0 0.00 0.00 2 init_list [4] - Index by function name [2] bubbleSort (sort.c) [4] init_list (sort.c) [3] insertSort (sort.c)
To further minimize the output, the ”no-flat-profile can be used if the flat profile is not needed (the default is ”flat-profile ). The call graph can be disabled using the ”no-graph option (default is ”graph ).
The gprof utility can identify functions that are not called in a given run. The ”display-unused-functions is used in conjunction with the ”static-call-graph option to list those functions, as:
gprof sort gmon.out display-unused-functions static-call-graph
In some cases, a program may differ in timing or may represent such a small timing sample that its accuracy is left in question. To increase the accuracy of applications profiling, the application can be performed numerous times and then averaged. A sample bash script that provides this capability is shown in Listing 8.8 [GNU gprof ]:
Listing 8.8: Analyzing Multiple Invocations of an Application (on the CD-ROM at ./source/ch8/script )#!/bin/bash for i in