There are many techniques for analyzing an applications' performance. Broadly, they can be classed as probe-based or profiler-based . Probe-based analysis relies upon the programmer inserting pieces of code into the application; these code fragments provide timestamping and other meaningful data that can be analyzed later. Profiler-based analysis depends on the application being compiled with profiling support. This causes the program to generate information about its performance as it runs, which can then be analyzed via an external profiling application. Before we can begin an analysis of what components are responsible for the time consumed, it is good to know exactly how much time is being spent! For this, we have a set of relatively coarse-grained timing tools: time , timex , and (in Solaris) ptime . 8.2.1 Application Timing: time, timex, and ptimeThe most simple question we can ask about a program's speed is "How long does it take to run?" Rather than sitting in front of a computer with a stopwatch, some common tools can be used to measure program execution time. 8.2.1.1 timeThe first of these tools is the time command, which is built into the csh and ksh shells (it also may exist as a separate command in /bin ). At the end of the program, time presents a breakdown of how the running time of the application was spent:
In addition to these figures, the csh built-in time command provides other useful information. Let's look at an example: % time sieve > /dev/null 6.490u 0.530s 0:16.74 41.9% 0+0k 0+0io 111pf+0w The first, second, and third fields correspond to the seconds of user time, the seconds of system time, and the elapsed time for the process, respectively. In general, the user time should far exceed the system time. It is probably a sign that the program is doing something inefficient if the system time is disproportionately large: the application might be generating lots of exception conditions or using system calls inefficiently. Note, however, that "I/O wait time" (the time spent rewinding a tape, seeking the disk read/write heads, etc.) is not considered CPU time. The fourth field refers to the percentage utilization, which is the ratio of elapsed time to CPU time (the sum of the user and system times). There are many reasons why the CPU time is not in the neighborhood of the elapsed time; after all, the system may have a significant number of other processes, the application in question may induce a lot of I/O (either directly or through exhausting physical memory and incurring paging activity), or the system may not be able to retrieve data to and from memory fast enough. The fifth field represents the average memory utilization statistics. The first of these is the average shared-memory space , which defines how much memory is consumed by the application (this memory may be shared between multiple invocations of the same process, which is why it is called "shared memory"). The second figure represents the average unshared-memory space , which is used for storing data structures. These figures represent actual physical memory usage only. Also, these measurements are not supported on all platforms, so are sometimes reported as 0+0k . The sixth field represents the amount of block input and output I/O operations, respectively. This field is also not supported on all implementations of the time command, so it is sometimes reported as 0+0io . The seventh and last field describes the number of page faults and swaps incurred by the application. If the number of page faults is unusually high, it indicates that the process was probably choking on memory requirements. However, every program incurs some page faults (simply by virtue of how the page allocation mechanism works; see Section 4.2.5 in Chapter 4). Note that there is another version of the time command, one which is not built into the shell: # /bin/time catman -w real 36.3 user 11.9 sys 3.5 In Linux, the time command is located in /usr/bin , and presents different output: % /usr/bin/time sieve > /dev/null 5.55user 0.25system 0:05.95elapsed 97%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (112major+9778minor)pagefaults 0swaps This output is comparable to the shell built-in time function. 8.2.1.2 timexA slightly higher-resolution way of getting a program's runtime is the timex command. It is similar in syntax to the other timing commands we've discussed; it just provides higher accuracy: # timex catman -w real 12.78 user 11.71 sys 2.67 (Note that the runtime of the catman -w process is significantly decreased compared to the last time we ran it. This is because of filesystem caching; timex itself is not introducing the problem.) timex has three options worth noting. If process accounting is installed, specifying the -o flag causes timex to report information on the number of blocks read and written. The -s flag reports the total system activity that occured during the process' runtime. The -p flag gives the process accounting record for the command. 8.2.1.3 ptimeSolaris offers a high-resolution method for timing application runtimes invoked via ptime . ptime has the advantage of being simple to run: # ptime catman -w real 14.455 user 0.005 sys 0.009 This example also nicely illustrates that the sum of system time and user time need not add up to the real (elapsed) time: catman -w , which rebuilds the windex database used by man -f , man -k , and apropos , is fairly disk- intensive . The fact that disk activity is not accounted for in either user time or system time goes a long way towards explaining this huge discrepancy. On parallel systems, the user time can be larger than the elapsed time because the combined user time across all processors is reported. 8.2.1.4 Mechanisms of timingIt's worth taking a moment to examine how these commands work. The /bin/time and timex commands rely on sampling the state of the processor at the clock interrupt. This gives an accuracy of about 1/100th of a second (the frequency of the clock interrupt, by default). ptime uses an altogether different mechanism. When the kernel suffers a state change, system call, page fault, or scheduler change, it logs a high-resolution timestamp. This is known as microstate accounting . It significantly improves accuracy but induces some overhead. You can also turn on microstate accounting programmatically. 8.2.2 Timing-Specific Code SectionsThere are many techniques for obtaining timing information about specific sections of code. I present two of them here: gethrtime() , a function call that returns the current time in nanoseconds, and directly accessing the TICK register. 8.2.2.1 Timing via gethrtimeIt's recommended that you enable microstate accounting before using the gethrtime() system call. There are two ways of doing this: enabling microstate accounting in the code, or by using ptime to measure the elapsed time of the program. I present an example of both in Example 8-1 and Example 8-2. Example 8-1. gethrtime_timing.c (without enabling microstate accounting)/* gethrtime_timing.c */ #include <stdio.h> #include <sys/time.h> #include <sys/types.h> #include <unistd.h> int main(int argc, char **argv) { hrtime_t timeStart, timeEnd; int i, j = 250000; timeStart = gethrtime( ); for (i = 0; i < j; i++) { /* The function being measured goes here */ } timeEnd = gethrtime( ); printf ("Average time: %lld ns\n", (timeEnd - timeStart) / j); return 0; } Example 8-2. gethrtime_ustate_timing.c (enabling microstate accounting)/* gethrtime_ustate_timing.c */ #include <errno.h> #include <fcntl.h> #include <stdio.h> #include <stdlib.h> #include <sys/stat.h> #include <sys/time.h> #define _STRUCTURED_PROC 1 #include <sys/procfs.h> /* The init_micro_acct( ) function is drawn from the file timing.c, which is included in the Forte Developer 6 release: look in /opt/SUNWspro/WS6/examples/analyzer/omptest/timing.c */ void init_micro_acct( ) { /* SunOS 5.5 or higher */ int ctfld; long ctl[2]; char procname[1024]; sprintf(procname, "/proc/%d/ctl", getpid( )); ctlfd = open(procname, O_WRONLY); if(ctlfd < 0) { fprintf(stderr, "open %s failed, errno = %d\n", procname, errno); } ctl[0] = PCSET; ctl[1] = PR_MSACCT; if (write(ctlfd, ctl, 2*sizeof(long)) < 0) { fprintf(stderr, "write failed, errno = %d\n", errno); } close(ctlfd); printf("Enabling microstate accounting.\n"); return; } int main(int argc, char **argv) { hrtime_t timeStart, timeEnd; int i, j = 250000; /* Start microstate accounting */ init_micro_acct( ); timeStart = gethrtime( ); for (i = 0; i < j; i++) { /* The function being measured goes here */ } timeEnd = gethrtime( ); printf ("Average time: %lld ns\n", (timeEnd - timeStart) / j); return 0; } I built these applications on an Sun Ultra 10 (300 MHz UltraSPARC-II processor) with the Solaris 8 system and the Forte Developer 6 update 1 compiler, specifying the getpid() system call as the function to be timed. I also increased the number of iterations ( j ) to 5 million. Example 8-3 illustrates the differences between running gethrtime_timing alone, running gethrtime_timing with microstate accounting turned on via ptime , and running gethrtime_ustate_timing . Example 8-3. Differences in execution between three timing programs% /opt/SUNWspro/bin/cc gethrtime_timing.c -o gethrtime_timing % /opt/SUNWspro/bin/cc gethrtime_ustate_timing.c -o gethrtime_ustate_timing % gethrtime_timing Average time: 1714 ns % ptime gethrtime_timing Average time: 2708 ns real 13.550 user 9.785 sys 3.620 % gethrtime_ustate_timing Enabling microstate accounting. Average time: 2648 ns As you can see, the gethrtime_ustate_timing and the gethrtime_timing figures are comparable, whereas the gethrtime_timing figure is significantly lower (without microstate accounting). This is precisely what we'd expect: it sharply illustrates the overhead incurred during microstate accounting. The advantage of microstate accounting is improved resolution in monitoring. 8.2.2.2 Timing via the TICK registerThe SPARC architecture specifies a TICK register, which provides an accurate, low-overhead way of measuring elapsed time in terms of processor cycles. Here, we'll present a way of accessing the TICK register directly from user programs that has much less overhead than the other timing mechanisms we've discussed: gethrtime() has an overhead of about 75 CPU cycles, whereas the TICK register can be accessed in about six cycles. The TICK register can be accessed in user space in Solaris 8, but not in earlier releases. [3]
The most convenient way to access the TICK register is through an in-lined assembly template, as illustrated in Example 8-4 and Example 8-5. Example 8-4. readtickreg.il.inline readtickreg,1 rd %tick, %o1 stx %o1, [%o0] .end Example 8-5. gettick_timing.c/* gettick_timing.c */ #include <math.h> #include <stdio.h> #include <stdlib.h> #include <sys/types.h> extern void readtickreg(void *); #pragma no_side_effect(readtickreg); int main(int argc, char **argv) { uint64_t c1=0, c2=0, overhead=0, total=0, cost; int i, j=1, *x; x = &j; for (i = 0; i < 10; i++) { readtickreg ((void *)&c1); readtickreg ((void *)&c2); overhead += c2 - c1; } overhead = floor (overhead * 1.0 / 10); for (i = 0; i < 134217728; i++) { readtickreg ((void *)&c1); /* The function being measured goes here: we will use *x += i; as an example */ *x += i; readtickreg ((void *)&c2); total += c2 - c1; } cost = floor (total * 1.0 / 134217728 - overhead); printf ("Overhead of readtickreg: %lld\n", overhead); printf ("Cost in processor cycles: %lld\n", cost); return 0; } 134217728 is an arbitrary number. We can then build and test this code: % /opt/SUNWspro/bin/cc -xarch=v8plusa gettick_timing.c -o gettick_timing readtickreg.il -lm % gettick_timing Overhead of readtickreg: 6 Cost in processor cycles: 7 So the overhead of the call to the TICK register is six cycles, whereas the cost in processor cycles for the assignment is seven cycles. I replaced the test function with getpid() , and I found that it costs on average 495 cycles. Because this test was run on a 300 MHz Ultra 10, like the previous tests, we can then find out that one cycle is 3.33 nanoseconds (1 second/300,000,000 cycles is 0.000000000333 seconds, or 3.33 nanoseconds). This means the cost of a getpid() call is just about 1,650 nanoseconds, which is pretty close to what we measured using gethrtime() . 8.2.3 Probe-Based Analysis: Solaris TNFThe Solaris trace normal form (TNF) probe functionality allows the detailed monitoring of specific code flows. TNF probing involves inserting specific macro calls into your application's code, and then analyzing the output that they generate. TNF probes are analyzed by means of the prex , tnfdump , and tnfxtract tools, which we originally described for monitoring probes in the Solaris kernel itself. For a quick summary on how to use these tools, see Section 5.5.6 in Chapter 5. 8.2.3.1 Inserting probesProbes themselves are declared by a series of macros. These probes can be placed anywhere in C and C++ programs, including multi-threaded code, drivers, loadable kernel modules, and shared objects. The general format for the probe is as follows : TNF_PROBE_2 (name, keys, detail, argument_type_1, argument_name_1, argument_value_1, argument_type_2, argument_type_2, argument_value_2); Let's look at this piece by piece.
The next three arguments act as a triplet that define a piece of information to be logged.
Any source code that uses the TNF probing macros must include the TNF header file, <tnf/probe.h> . The code must also be compiled against the TNF library ( -ltnfprobe ; note that if libtnfprobe is explicitly linked into the program, it must be before libthread ). You can turn off all the probes by specifying -DNPROBE as a preprocessor option. 8.2.3.2 CaveatsThe single biggest caveat when using this trace method is that TNF probes may have a significant impact on application performance. You will need to experiment carefully to see precisely what the effect on your code is. However, the utility is very useful in many cases; don't be scared away just by the concern of inducing some experimental error. 8.2.4 Profiler-Based Analysis: gprofProfiling provides information about where a program spent its execution time and what functions called what other functions; the data is known as the profile and the call graph , respectively. This technique works particularly well for getting an overview of what a program is doing, particularly a large or complex application for which explicit instrumentation would be unreasonable (e.g., via TNF probes). However, one caveat is that the program must be run in a real-world setting: what the program does affects the information that is produced by the profiling software. In general, profiling an application takes three steps:
Before we work through an example of using profiling, let's see how profiling is implemented. 8.2.4.1 Implementing profilingProfiling works by modifying, at the time of compilation, how every function is called. This modification causes some information to be stored about where it was called from, [5] which allows the profile analyzer to build the call graph. Profiling also periodically finds out which function is currently being executed. Typically, this is done at the rate of about 100 times per second, but it varies from system to system. There are two ways to implement this periodic sampling: via a system call or by telling the kernel to waken and deliver a profiling signal to the process.
Some operating systems implement a profil( ) system call, which creates an array in the kernel's memory space in which each entry accounts for a few bytes (usually between 2 and 8) of the program's address space. Every time the system's clock ticks during the lifetime of the program, the program counter is inspected and the appropriate slot in the kernel memory structure is incremented. This typically imposes low overhead, because the kernel has to wake up to process the clock interrupt anyway. However, other operating systems don't implement profil( ) or anything similar. In these cases, the kernel (by some mechanism) is made to periodically deliver a signal to the process, which then performs the same examine-and-increment procedure described. Since this method requires the kernel to wake up and transmit a signal into user space, it imposes much higher overhead. 8.2.4.2 Compiling with profilingFor the next examples, I'll use the following piece of code, which is a very simple implementation of the Sieve of Eratosthenes in C: [6]
/* sieve.c */ #include <stdio.h> #include <math.h> #define UPPERBOUND 10000000 int SievedIntegers[UPPERBOUND]; void init( ) { int i; for (i = 0; i <= UPPERBOUND; i++) { SievedIntegers[i] = 1; } } void printPrimes( ) { int i; for (i = 2; i <= UPPERBOUND; i++) { if (SievedIntegers[i] == 1) { printf ("%d\n\r", i); } } } void cancel (int p) { int q = 0; for (q = 2 * p; q <= UPPERBOUND; q = q + p) { SievedIntegers[q] = 0; } } int isPrime(int p) { int q; int r = 1; for (q = 2; q <= sqrt(p); q++) { if ((p % q) == 0) { r = 0; break; } } return r; } int main(int argc, char **argv) { int i; init( ); for (i = 2; i <= sqrt(UPPERBOUND); i++) { if (isPrime(i)) { cancel(i); } } printPrimes( ); return 0; } The first step is to compile this code, for which I'll use gcc : % gcc -g -Wall -o sieve sieve.c -pg -lm Some flags here merit explanation. The -g flag specifies the inclusion of debugging symbols into the binary, which allows gprof to produce a line-by-line list of how many times each line of code was executed, while the -pg flag indicates that I wish to gather profiling data. 8.2.4.3 Execution with profilingThe executable is runnable just as any executable normally would be. When the program exits normally, [7] a file named gmon.out will be written to the program's current working directory at exit time. As a result, if your program calls chdir( ) , the gmon.out file will be written in the directory the program last chdir( ) 'd to. This file contains all the profiling data for that invocation of the program.
8.2.4.4 Profile analysisNow that data has been gathered, we can look at it via gprof executable . Here, I have specified the -b flag to gprof , which causes it to suppress detailed explanations of each field, for the sake of brevity. Let's take a look at the output, piece by piece: % gprof -b sieve Flat profile: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls us/call us/call name 72.78 4.01 4.01 446 8991.03 8991.03 cancel 15.06 4.84 0.83 1 830000.00 830000.00 init 12.16 5.51 0.67 1 670000.00 670000.00 printPrimes 0.00 5.51 0.00 3161 0.00 0.00 isPrime ... This lists all the functions that were sampled. The time field represents the percentage of the total program execution time that was spent in that process; they should sum to 100%. The cumulative seconds field shows the amount of time spent executing a function, plus all the functions above it. The self seconds field is the amount of time associated with this function alone. Functions that have a self seconds value of 0.00 appeared in the call graph, but not in the histogram samples. Therefore, they were executed but never sampled in the random sampling effort. The number of times the function was called is represented by the calls field. The self us/call value represents the average time, in microseconds, spent in this function per call, whereas the total ms/call field shows the average time, again in microseconds, spent in this function and in all functions above it. The total us/call field expresses the total number of microseconds spent in this function and its descendants per call. The second part of the gprof output is the call graph: ... Call graph granularity: each sample hit covers 4 byte(s) for 0.18% of 5.51 seconds index % time self children called name <spontaneous> [1] 100.0 0.00 5.51 main [1] 4.01 0.00 446/446 cancel [2] 0.83 0.00 1/1 init [3] 0.67 0.00 1/1 printPrimes [4] 0.00 0.00 3161/3161 isPrime [5] ----------------------------------------------- 4.01 0.00 446/446 main [1] [2] 72.8 4.01 0.00 446 cancel [2] ----------------------------------------------- 0.83 0.00 1/1 main [1] [3] 15.1 0.83 0.00 1 init [3] ----------------------------------------------- 0.67 0.00 1/1 main [1] [4] 12.2 0.67 0.00 1 printPrimes [4] ----------------------------------------------- 0.00 0.00 3161/3161 main [1] [5] 0.0 0.00 0.00 3161 isPrime [5] ----------------------------------------------- Index by function name [2] cancel [5] isPrime [3] init [4] printPrimes The call graph diagrams how much time was spent in each function and its children. Each set of dashed lines divides the table into entries, one for each function. In each entry, the primary line is the one that starts with the index number in square brackets. This line indicates which function the entry describes, the preceding lines indicate the function's callers , and the following lines describe the function's subroutines (also called children in this context). Each primary line specifies four important fields: % time , which gives the percentage of the total time that was spent in this function and its children; self , which specifies the amount of time spent in this function alone; children , which is the total amount of time spent in the subroutine calls made by this function; and called , which is the number of times the function was called. If the function called itself (exhibited recursion), there will be two numbers specified for the called field, separated by a + character. The first number counts nonrecursive calls, and the second counts recursive calls. The other lines in an entry are relatively analogous to the primary line, with one exception. The called field for nonprimary lines contain two numbers, which represent the number of times the function was called by this function, followed by the total number of times it was called by all callers. If the identity of a function's callers cannot be ascertained, a dummy caller line is created, which is named <spontaneous> . This happens sometimes with signal handlers, and is almost always represented as main( ) 's caller. We have only covered the highlights of using gprof . For more information, consult the manpage. 8.2.4.5 CaveatsThere are some definite caveats to using application profiling. The sampling mechanism takes samples at fixed intervals during the program's runtime. If the program is not running at the time the sampling mechanism runs, no data is gathered. For example, let's say a process generates a working set that is much larger than physical memory. Much of the system's time will be spent handling moving pages from disk to memory and vice versa; this is not considered program runtime and will not be reported. It can also be a benefit: profiling results should be quite similar regardless of the load on the system at the time the profiling data was gathered. The sampling process itself is subject to statistical inaccuracies. If a function runs for a very short period of time, it is possible that the function will be "missed." If the function can be expected to be seen once, it's possible that it won't show up at all, or that it will be caught running several times. In contrast, the call graph and the counts of the number of times a function is called are derived by explicit counting, so they will be completely accurate. One general rule of thumb is that a runtime figure is accurate if it is considerably bigger than the sampling period (at least two orders of magnitude). Some of the call graph information is estimated; there is no direct information concerning them. The assumption is that the time spent in a function is not associated with who called that function. For example, if a function WriteBook( ) used a total of sixty seconds and was called ten times, with three of those calls originating from CallFromEditor( ) , WriteBook( ) contributes eighteen seconds to CallFromEditor( ) 's children time. This is not always a valid assumption. Say that another function, OnTrain( ) , calls WriteBook( ) in all other cases, but in this case WriteBook( ) returns very quickly (it's hard to get much writing done on a train, after all). In this case, almost all the time spent in WriteBook( ) should be attributed to CallFromEditor( ) , but gprof has no way of knowing this, so it blindly -- and quite incorrectly -- associates the forty-two seconds of WriteBook( ) 's runtime to OnTrain( ) . |