8.2 Techniques for Code Analysis | System Performance Tuning2002

There are many techniques for analyzing an applications' performance. Broadly, they can be classed as probe-based or profiler-based . Probe-based analysis relies upon the programmer inserting pieces of code into the application; these code fragments provide timestamping and other meaningful data that can be analyzed later. Profiler-based analysis depends on the application being compiled with profiling support. This causes the program to generate information about its performance as it runs, which can then be analyzed via an external profiling application.

Before we can begin an analysis of what components are responsible for the time consumed, it is good to know exactly how much time is being spent! For this, we have a set of relatively coarse-grained timing tools: time , timex , and (in Solaris) ptime .

8.2.1 Application Timing: time, timex, and ptime

The most simple question we can ask about a program's speed is "How long does it take to run?" Rather than sitting in front of a computer with a stopwatch, some common tools can be used to measure program execution time.

8.2.1.1 time

The first of these tools is the time command, which is built into the csh and ksh shells (it also may exist as a separate command in /bin ). At the end of the program, time presents a breakdown of how the running time of the application was spent:

The elapsed time , which is the amount of "wall clock" time that the program took to run.
The amount of time spent in user mode , which is also called user time (i.e., executing instructions that the compiler generated from your source code, in addition to time spent in library routines).
The amount of time spent in kernel mode , also called system time (i.e., time spent in system calls, issuing I/O requests , etc.).

In addition to these figures, the csh built-in time command provides other useful information. Let's look at an example:

 %  time sieve > /dev/null  6.490u 0.530s 0:16.74 41.9%     0+0k 0+0io 111pf+0w

The first, second, and third fields correspond to the seconds of user time, the seconds of system time, and the elapsed time for the process, respectively. In general, the user time should far exceed the system time. It is probably a sign that the program is doing something inefficient if the system time is disproportionately large: the application might be generating lots of exception conditions or using system calls inefficiently. Note, however, that "I/O wait time" (the time spent rewinding a tape, seeking the disk read/write heads, etc.) is not considered CPU time.

The fourth field refers to the percentage utilization, which is the ratio of elapsed time to CPU time (the sum of the user and system times). There are many reasons why the CPU time is not in the neighborhood of the elapsed time; after all, the system may have a significant number of other processes, the application in question may induce a lot of I/O (either directly or through exhausting physical memory and incurring paging activity), or the system may not be able to retrieve data to and from memory fast enough.

The fifth field represents the average memory utilization statistics. The first of these is the average shared-memory space , which defines how much memory is consumed by the application (this memory may be shared between multiple invocations of the same process, which is why it is called "shared memory"). The second figure represents the average unshared-memory space , which is used for storing data structures. These figures represent actual physical memory usage only. Also, these measurements are not supported on all platforms, so are sometimes reported as 0+0k .

The sixth field represents the amount of block input and output I/O operations, respectively. This field is also not supported on all implementations of the time command, so it is sometimes reported as 0+0io .

The seventh and last field describes the number of page faults and swaps incurred by the application. If the number of page faults is unusually high, it indicates that the process was probably choking on memory requirements. However, every program incurs some page faults (simply by virtue of how the page allocation mechanism works; see Section 4.2.5 in Chapter 4).

Note that there is another version of the time command, one which is not built into the shell:

 #  /bin/time catman -w  real       36.3 user       11.9 sys         3.5

In Linux, the time command is located in /usr/bin , and presents different output:

 %  /usr/bin/time sieve > /dev/null  5.55user 0.25system 0:05.95elapsed 97%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (112major+9778minor)pagefaults 0swaps

This output is comparable to the shell built-in time function.

8.2.1.2 timex

A slightly higher-resolution way of getting a program's runtime is the timex command. It is similar in syntax to the other timing commands we've discussed; it just provides higher accuracy:

 #  timex catman -w  real       12.78 user       11.71 sys         2.67

(Note that the runtime of the catman -w process is significantly decreased compared to the last time we ran it. This is because of filesystem caching; timex itself is not introducing the problem.)

timex has three options worth noting. If process accounting is installed, specifying the -o flag causes timex to report information on the number of blocks read and written. The -s flag reports the total system activity that occured during the process' runtime. The -p flag gives the process accounting record for the command.

8.2.1.3 ptime

Solaris offers a high-resolution method for timing application runtimes invoked via ptime . ptime has the advantage of being simple to run:

 #  ptime catman -w  real       14.455 user        0.005 sys         0.009

This example also nicely illustrates that the sum of system time and user time need not add up to the real (elapsed) time: catman -w , which rebuilds the windex database used by man -f , man -k , and apropos , is fairly disk- intensive . The fact that disk activity is not accounted for in either user time or system time goes a long way towards explaining this huge discrepancy.

On parallel systems, the user time can be larger than the elapsed time because the combined user time across all processors is reported.

8.2.1.4 Mechanisms of timing

It's worth taking a moment to examine how these commands work. The /bin/time and timex commands rely on sampling the state of the processor at the clock interrupt. This gives an accuracy of about 1/100th of a second (the frequency of the clock interrupt, by default).

ptime uses an altogether different mechanism. When the kernel suffers a state change, system call, page fault, or scheduler change, it logs a high-resolution timestamp. This is known as microstate accounting . It significantly improves accuracy but induces some overhead. You can also turn on microstate accounting programmatically.

8.2.2 Timing-Specific Code Sections

There are many techniques for obtaining timing information about specific sections of code. I present two of them here: gethrtime() , a function call that returns the current time in nanoseconds, and directly accessing the TICK register.

8.2.2.1 Timing via gethrtime

It's recommended that you enable microstate accounting before using the gethrtime() system call. There are two ways of doing this: enabling microstate accounting in the code, or by using ptime to measure the elapsed time of the program. I present an example of both in Example 8-1 and Example 8-2.

Example 8-1. gethrtime_timing.c (without enabling microstate accounting)

 /* gethrtime_timing.c */ #include <stdio.h> #include <sys/time.h> #include <sys/types.h> #include <unistd.h> int main(int argc, char **argv) { hrtime_t timeStart, timeEnd; int i, j = 250000; timeStart = gethrtime(  ); for (i = 0; i < j; i++) { /* The function being measured goes here */ } timeEnd = gethrtime(  ); printf ("Average time: %lld ns\n", (timeEnd - timeStart) / j); return 0; }

Example 8-2. gethrtime_ustate_timing.c (enabling microstate accounting)

 /* gethrtime_ustate_timing.c */ #include <errno.h> #include <fcntl.h> #include <stdio.h> #include <stdlib.h> #include <sys/stat.h> #include <sys/time.h> #define _STRUCTURED_PROC 1 #include <sys/procfs.h> /* The init_micro_acct(  ) function is drawn from the file timing.c, which is    included in the Forte Developer 6 release: look in    /opt/SUNWspro/WS6/examples/analyzer/omptest/timing.c */ void init_micro_acct(  ) { /* SunOS 5.5 or higher */ int    ctfld; long   ctl[2]; char   procname[1024]; sprintf(procname, "/proc/%d/ctl", getpid(  )); ctlfd = open(procname, O_WRONLY); if(ctlfd < 0) { fprintf(stderr, "open %s failed, errno = %d\n",                               procname, errno); } ctl[0] = PCSET; ctl[1] = PR_MSACCT; if (write(ctlfd, ctl, 2*sizeof(long)) < 0) { fprintf(stderr, "write failed, errno = %d\n", errno); } close(ctlfd); printf("Enabling microstate accounting.\n"); return; } int main(int argc, char **argv) { hrtime_t timeStart, timeEnd; int i, j = 250000; /* Start microstate accounting */ init_micro_acct(  ); timeStart = gethrtime(  ); for (i = 0; i < j; i++) { /* The function being measured goes here */ } timeEnd = gethrtime(  ); printf ("Average time: %lld ns\n", (timeEnd - timeStart) / j); return 0; }

I built these applications on an Sun Ultra 10 (300 MHz UltraSPARC-II processor) with the Solaris 8 system and the Forte Developer 6 update 1 compiler, specifying the getpid() system call as the function to be timed. I also increased the number of iterations ( j ) to 5 million. Example 8-3 illustrates the differences between running gethrtime_timing alone, running gethrtime_timing with microstate accounting turned on via ptime , and running gethrtime_ustate_timing .

Example 8-3. Differences in execution between three timing programs

 %  /opt/SUNWspro/bin/cc gethrtime_timing.c -o gethrtime_timing  %  /opt/SUNWspro/bin/cc gethrtime_ustate_timing.c -o gethrtime_ustate_timing  %  gethrtime_timing  Average time: 1714 ns %  ptime gethrtime_timing  Average time: 2708 ns real       13.550 user        9.785 sys         3.620 %  gethrtime_ustate_timing  Enabling microstate accounting. Average time: 2648 ns

As you can see, the gethrtime_ustate_timing and the gethrtime_timing figures are comparable, whereas the gethrtime_timing figure is significantly lower (without microstate accounting). This is precisely what we'd expect: it sharply illustrates the overhead incurred during microstate accounting. The advantage of microstate accounting is improved resolution in monitoring.

8.2.2.2 Timing via the TICK register

The SPARC architecture specifies a TICK register, which provides an accurate, low-overhead way of measuring elapsed time in terms of processor cycles. Here, we'll present a way of accessing the TICK register directly from user programs that has much less overhead than the other timing mechanisms we've discussed: gethrtime() has an overhead of about 75 CPU cycles, whereas the TICK register can be accessed in about six cycles. The TICK register can be accessed in user space in Solaris 8, but not in earlier releases. ^[3]

^[3] Accessing the TICK register when the nonprivileged trap (NPT) bit is set in the processor causes a trap, and usually the code will abort. This bit is disabled in Solaris 8, so nonprivileged user code can read the TICK register.

The most convenient way to access the TICK register is through an in-lined assembly template, as illustrated in Example 8-4 and Example 8-5.

Example 8-4. readtickreg.il

 .inline readtickreg,1 rd      %tick, %o1 stx     %o1, [%o0] .end

Example 8-5. gettick_timing.c

 /* gettick_timing.c */ #include <math.h> #include <stdio.h> #include <stdlib.h> #include <sys/types.h> extern void readtickreg(void *); #pragma no_side_effect(readtickreg); int main(int argc, char **argv) { uint64_t c1=0, c2=0, overhead=0, total=0, cost; int i, j=1, *x; x = &j; for (i = 0; i < 10; i++) { readtickreg ((void *)&c1); readtickreg ((void *)&c2); overhead += c2 - c1; } overhead = floor (overhead * 1.0 / 10); for (i = 0; i < 134217728; i++) { readtickreg ((void *)&c1); /* The function being measured goes here: we will use    *x += i; as an example */ *x += i; readtickreg ((void *)&c2); total += c2 - c1; } cost = floor (total * 1.0 / 134217728 - overhead); printf ("Overhead of readtickreg: %lld\n", overhead); printf ("Cost in processor cycles: %lld\n", cost); return 0; }

134217728 is an arbitrary number.

We can then build and test this code:

 %  /opt/SUNWspro/bin/cc -xarch=v8plusa gettick_timing.c -o gettick_timing readtickreg.il -lm  %  gettick_timing  Overhead of readtickreg: 6 Cost in processor cycles: 7

So the overhead of the call to the TICK register is six cycles, whereas the cost in processor cycles for the assignment is seven cycles.

I replaced the test function with getpid() , and I found that it costs on average 495 cycles. Because this test was run on a 300 MHz Ultra 10, like the previous tests, we can then find out that one cycle is 3.33 nanoseconds (1 second/300,000,000 cycles is 0.000000000333 seconds, or 3.33 nanoseconds). This means the cost of a getpid() call is just about 1,650 nanoseconds, which is pretty close to what we measured using gethrtime() .

8.2.3 Probe-Based Analysis: Solaris TNF

The Solaris trace normal form (TNF) probe functionality allows the detailed monitoring of specific code flows. TNF probing involves inserting specific macro calls into your application's code, and then analyzing the output that they generate. TNF probes are analyzed by means of the prex , tnfdump , and tnfxtract tools, which we originally described for monitoring probes in the Solaris kernel itself. For a quick summary on how to use these tools, see Section 5.5.6 in Chapter 5.

8.2.3.1 Inserting probes

Probes themselves are declared by a series of macros. These probes can be placed anywhere in C and C++ programs, including multi-threaded code, drivers, loadable kernel modules, and shared objects.

The general format for the probe is as follows :

 TNF_PROBE_2 (name, keys, detail,              argument_type_1, argument_name_1, argument_value_1,              argument_type_2, argument_type_2, argument_value_2);

Let's look at this piece by piece.

The macros in general are named TNF_PROBE_ N , where N is a number ranging from zero to five; this represents the number of "data points" (argument type/name/value triplets) recorded by the macro. If zero is specified, no data points are logged. ^[4] The macro name can also be suffixed with _DEBUG , in which case the probe will not be inserted unless the code is compiled with the -DTNF_DEBUG preprocessor flag set.

^[4] However, the probe is still marked by name; it's often useful for measuring simple entry and exit paths through a function, or confirming that a certain code path is being followed.
The first argument is the name of the probe. The use of the name declares it, so no other declaration is necessary; it also is a block-scope declaration, so it does not otherwise affect the namespace of the program.

The convention is to name probes after their functions, and to suffix the first probe in a function with _start and the last with _end . For example, a function named foo might have probes named foo_start and foo_end .
The second argument is a list of keys. This is a list of space-separated keywords that specify the groups that the probe belongs to: semicolons, single quotation marks, and the equals character are not permitted. These keywords are most useful in conjunction with using prex to select a specific set of functions. For example, all the probes in a function bar might have "bar" specified in their keys fields; this makes it very easy to turn on all the probes in that function.
The third argument is a list of details. This string is made up of attribute-value pairs, where the attribute and value are separated by a space, and each pair is separated by a semicolon. The detail field is used for two things. The first is supply an attribute that can be selected on via prex . For example, if one attribute is shape , then prex can select probes on the basis of their shape value. The second utility is to annotate a probe with a string that is written to a trace file only once. Attributes should be named with the vendor stock ticker symbol, or some other fairly unique symbol first, followed by the percent sign (e.g., ORA%coverart ). Some attributes are predefined, including name (the name of the probe), keys (the keys associated with the probe), file (what file the probe is in), line (what line the probe is on), slots (how many slots/data points the probe writes to the trace file), object (the executable or shared object that contains the probe), and debug (where a probe is debug-only or production).

The next three arguments act as a triplet that define a piece of information to be logged.

argument_type_ n defines the type of the n th argument. Some predefined types exist, and are summarized in Table 8-3.

Table 8-3. TNF argument types

Type	Associated C type
`tnf_int`	Int
`tnf_uint`	Unsigned int
`tnf_long`	Long
`tnf_ulong`	Unsigned long
`tnf_longlong`	Long long(only on 64-bit systems)
`tnf_ulonglong`	Unsigned long long(only on 64-bit systems)
`tnf_float`	Float
`tnf_double`	Double
`tnf_string`	Char *
`tnf_opaque`	Void *

There is a documented interface for declaring a new TNF type; please consult the TNF_DECLARE_RECORD manpage .

argument_name_ n defines the name of the n th argument. And quotation marks are not necessary.
argument_value_ n defines the value of the n th argument. This is evaluated to yield a value that is included in the trace file; in a multi-threaded program, it is the user's requirement to place locks around the TNF probe if argument_value_ n contains a variable that should be read-protected.

Any source code that uses the TNF probing macros must include the TNF header file, <tnf/probe.h> . The code must also be compiled against the TNF library ( -ltnfprobe ; note that if libtnfprobe is explicitly linked into the program, it must be before libthread ). You can turn off all the probes by specifying -DNPROBE as a preprocessor option.

8.2.3.2 Caveats

The single biggest caveat when using this trace method is that TNF probes may have a significant impact on application performance. You will need to experiment carefully to see precisely what the effect on your code is. However, the utility is very useful in many cases; don't be scared away just by the concern of inducing some experimental error.

8.2.4 Profiler-Based Analysis: gprof

Profiling provides information about where a program spent its execution time and what functions called what other functions; the data is known as the profile and the call graph , respectively. This technique works particularly well for getting an overview of what a program is doing, particularly a large or complex application for which explicit instrumentation would be unreasonable (e.g., via TNF probes). However, one caveat is that the program must be run in a real-world setting: what the program does affects the information that is produced by the profiling software.

In general, profiling an application takes three steps:

Compile and link your program with profiling turned on.
Execute your application.
Run a profile analyzer, such as GNU's gprof , to produce human-readable results.

Before we work through an example of using profiling, let's see how profiling is implemented.

8.2.4.1 Implementing profiling

Profiling works by modifying, at the time of compilation, how every function is called. This modification causes some information to be stored about where it was called from, ^[5] which allows the profile analyzer to build the call graph. Profiling also periodically finds out which function is currently being executed. Typically, this is done at the rate of about 100 times per second, but it varies from system to system. There are two ways to implement this periodic sampling: via a system call or by telling the kernel to waken and deliver a profiling signal to the process.

^[5] This is usually implemented by examining the stack frame to find both the address of the child and the return address in the parent function; because it is very architecture-dependent , this is typically done in assembler.

Some operating systems implement a profil( ) system call, which creates an array in the kernel's memory space in which each entry accounts for a few bytes (usually between 2 and 8) of the program's address space. Every time the system's clock ticks during the lifetime of the program, the program counter is inspected and the appropriate slot in the kernel memory structure is incremented. This typically imposes low overhead, because the kernel has to wake up to process the clock interrupt anyway.

However, other operating systems don't implement profil( ) or anything similar. In these cases, the kernel (by some mechanism) is made to periodically deliver a signal to the process, which then performs the same examine-and-increment procedure described. Since this method requires the kernel to wake up and transmit a signal into user space, it imposes much higher overhead.

8.2.4.2 Compiling with profiling

For the next examples, I'll use the following piece of code, which is a very simple implementation of the Sieve of Eratosthenes in C: ^[6]

^[6] The exact means of functionality for this (inefficient) piece of code isn't particularly important. However, the idea behind the Sieve of Eratosthenes is to find all the prime numbers up to some upper bound by crossing out multiples of small primes. For example, no multiple of 2 can be prime, since 2 is a prime number, so we cross out all multiples of 2 in a list of integers. A famous theorem in number theory, long since proved, is that to determine if a number a is prime, you need only attempt to divide it by numbers up to the square root of a , which cuts down substantially on the amount of work to be done. There are still much faster methods of finding prime numbers.

 /* sieve.c */ #include <stdio.h> #include <math.h> #define  UPPERBOUND    10000000 int SievedIntegers[UPPERBOUND]; void init(  ) { int i; for (i = 0; i <= UPPERBOUND; i++) { SievedIntegers[i] = 1; } } void printPrimes(  ) { int i; for (i = 2; i <= UPPERBOUND; i++) { if (SievedIntegers[i] == 1) { printf ("%d\n\r", i); } } } void cancel (int p) { int q = 0; for (q = 2 * p; q <= UPPERBOUND; q = q + p) { SievedIntegers[q] = 0; } } int isPrime(int p) { int q; int r = 1; for (q = 2; q <= sqrt(p); q++) { if ((p % q) == 0) { r = 0; break; } } return r; } int main(int argc, char **argv) { int i; init(  ); for (i = 2; i <= sqrt(UPPERBOUND); i++) { if (isPrime(i)) { cancel(i); } } printPrimes(  ); return 0; }

The first step is to compile this code, for which I'll use gcc :

 %  gcc -g -Wall -o sieve sieve.c -pg -lm

Some flags here merit explanation. The -g flag specifies the inclusion of debugging symbols into the binary, which allows gprof to produce a line-by-line list of how many times each line of code was executed, while the -pg flag indicates that I wish to gather profiling data.

8.2.4.3 Execution with profiling

The executable is runnable just as any executable normally would be. When the program exits normally, ^[7] a file named gmon.out will be written to the program's current working directory at exit time. As a result, if your program calls chdir( ) , the gmon.out file will be written in the directory the program last chdir( ) 'd to. This file contains all the profiling data for that invocation of the program.

^[7] That is, by means of main( ) calling return( ) or calling exit( ) .

8.2.4.4 Profile analysis

Now that data has been gathered, we can look at it via gprof executable . Here, I have specified the -b flag to gprof , which causes it to suppress detailed explanations of each field, for the sake of brevity. Let's take a look at the output, piece by piece:

 %  gprof -b sieve  Flat profile: Each sample counts as 0.01 seconds.   %   cumulative   self              self     total             time   seconds   seconds    calls  us/call  us/call  name      72.78      4.01     4.01      446  8991.03  8991.03  cancel  15.06      4.84     0.83        1 830000.00 830000.00  init  12.16      5.51     0.67        1 670000.00 670000.00  printPrimes   0.00      5.51     0.00     3161     0.00     0.00  isPrime ...

This lists all the functions that were sampled. The time field represents the percentage of the total program execution time that was spent in that process; they should sum to 100%. The cumulative seconds field shows the amount of time spent executing a function, plus all the functions above it. The self seconds field is the amount of time associated with this function alone. Functions that have a self seconds value of 0.00 appeared in the call graph, but not in the histogram samples. Therefore, they were executed but never sampled in the random sampling effort.

The number of times the function was called is represented by the calls field. The self us/call value represents the average time, in microseconds, spent in this function per call, whereas the total ms/call field shows the average time, again in microseconds, spent in this function and in all functions above it. The total us/call field expresses the total number of microseconds spent in this function and its descendants per call.

The second part of the gprof output is the call graph:

 ...                         Call graph granularity: each sample hit covers 4 byte(s) for 0.18% of 5.51 seconds index % time    self  children    called     name                                                  <spontaneous> [1]    100.0    0.00    5.51                 main [1]                 4.01    0.00     446/446         cancel [2]                 0.83    0.00       1/1           init [3]                 0.67    0.00       1/1           printPrimes [4]                 0.00    0.00    3161/3161        isPrime [5] -----------------------------------------------                 4.01    0.00     446/446         main [1] [2]     72.8    4.01    0.00     446         cancel [2] -----------------------------------------------                 0.83    0.00       1/1           main [1] [3]     15.1    0.83    0.00       1         init [3] -----------------------------------------------                 0.67    0.00       1/1           main [1] [4]     12.2    0.67    0.00       1         printPrimes [4] -----------------------------------------------                 0.00    0.00    3161/3161        main [1] [5]      0.0    0.00    0.00    3161         isPrime [5] ----------------------------------------------- Index by function name    [2] cancel                  [5] isPrime    [3] init                    [4] printPrimes

The call graph diagrams how much time was spent in each function and its children. Each set of dashed lines divides the table into entries, one for each function. In each entry, the primary line is the one that starts with the index number in square brackets. This line indicates which function the entry describes, the preceding lines indicate the function's callers , and the following lines describe the function's subroutines (also called children in this context).

Each primary line specifies four important fields: % time , which gives the percentage of the total time that was spent in this function and its children; self , which specifies the amount of time spent in this function alone; children , which is the total amount of time spent in the subroutine calls made by this function; and called , which is the number of times the function was called. If the function called itself (exhibited recursion), there will be two numbers specified for the called field, separated by a + character. The first number counts nonrecursive calls, and the second counts recursive calls.

The other lines in an entry are relatively analogous to the primary line, with one exception. The called field for nonprimary lines contain two numbers, which represent the number of times the function was called by this function, followed by the total number of times it was called by all callers.

If the identity of a function's callers cannot be ascertained, a dummy caller line is created, which is named <spontaneous> . This happens sometimes with signal handlers, and is almost always represented as main( ) 's caller.

We have only covered the highlights of using gprof . For more information, consult the manpage.

8.2.4.5 Caveats

There are some definite caveats to using application profiling.

The sampling mechanism takes samples at fixed intervals during the program's runtime. If the program is not running at the time the sampling mechanism runs, no data is gathered. For example, let's say a process generates a working set that is much larger than physical memory. Much of the system's time will be spent handling moving pages from disk to memory and vice versa; this is not considered program runtime and will not be reported. It can also be a benefit: profiling results should be quite similar regardless of the load on the system at the time the profiling data was gathered.

The sampling process itself is subject to statistical inaccuracies. If a function runs for a very short period of time, it is possible that the function will be "missed." If the function can be expected to be seen once, it's possible that it won't show up at all, or that it will be caught running several times. In contrast, the call graph and the counts of the number of times a function is called are derived by explicit counting, so they will be completely accurate.

One general rule of thumb is that a runtime figure is accurate if it is considerably bigger than the sampling period (at least two orders of magnitude).

Some of the call graph information is estimated; there is no direct information concerning them. The assumption is that the time spent in a function is not associated with who called that function. For example, if a function WriteBook( ) used a total of sixty seconds and was called ten times, with three of those calls originating from CallFromEditor( ) , WriteBook( ) contributes eighteen seconds to CallFromEditor( ) 's children time. This is not always a valid assumption. Say that another function, OnTrain( ) , calls WriteBook( ) in all other cases, but in this case WriteBook( ) returns very quickly (it's hard to get much writing done on a train, after all). In this case, almost all the time spent in WriteBook( ) should be attributed to CallFromEditor( ) , but gprof has no way of knowing this, so it blindly -- and quite incorrectly -- associates the forty-two seconds of WriteBook( ) 's runtime to OnTrain( ) .