10.2. The BasicsAs an introduction to DTrace and the D language, let's start with a simple example. The truss(1) utility, a widely used observational tool, provides a powerful means to observe system and library call activity. However, it has many drawbacks: It operates on one process at a time, with no systemwide capability; it is verbose with fixed-output format; and it offers its users a limited choice of questions. Moreover, because of the way it works, TRuss can reduce application performance. Every time a thread in a process makes a system call, TRuss stops the thread through procfs, records the arguments for the system call, and then restarts the thread. When the system call returns, truss again stops the thread, records the return code, and then restarts it. It's not hard to see how this can have quite an impact on performance. DTrace, however, operates completely in the kernel, collecting relevant data at the source. Because the application is no longer controlled through procfs, the impact on the application is greatly minimized. With DTrace we can surpass the power of truss with our first script, which in itself is almost the simplest script that can be written. Here's a D script, truss.d, that lets us observe all global system call activity. #!/usr/sbin/dtrace -s syscall:::entry { } There are a few important things to note from the above example. The first line of the program is as follows: #!/usr/sbin/dtrace -s This specifies that the dtrace(1M) program is to be used as the interpreter, and the -s argument tells dtrace that what follows is a D program that it should execute. Note: The interpreter line for all the examples in this chapter is omitted for the sake of brevity, but it is still very much required. Next follows a description of the events we are interested in looking at. Here we are interested in what happens every time a system call is made. syscall:::entry This is an example of a probe description. In DTrace, a probe is a place in the system where we want to ask a question and record some pertinent data. Such data might include function arguments, stack traces, timestamps, file names, function names, and the like. The braces that follow the probe specification contain the actions that are to be executed when the associated probe is encountered. Actions are generally focused on recording items of data; we'll see examples of these shortly. This example contains no actions, so the default behavior is to just print the name of the probe that has been hit (or fired in tracing parlance) as well as the CPU it executed on and a numerical ID for the probe. Let's run our simple script. sol10# ./truss.d dtrace: script './truss.d' matched 225 probes CPU ID FUNCTION:NAME 0 13 write:entry 0 103 ioctl:entry 0 317 pollsys:entry 0 13 write:entry 0 103 ioctl:entry 0 317 pollsys:entry ^C As you can see from the preceding output, the syscall:::entry probe description enabled 225 different probes in this instance; this is the number of system calls currently available on this system. We don't go into the details now of exactly what this means, but be aware that, when the script is executed, the kernel is instrumented according to our script. When we stop the script, the instrumentation is removed and the system acts in the same way as a system without DTrace installed. The final thing to note here is that the execution of the script was terminated with a Control-C sequence (as shown with the ^C in the above output). A script can itself issue an explicit exit() call to terminate; in the absence of this, the user will have to type Control-C. The preceding script gives a global view of all system call activity. To focus our attention on a single process, we can modify the script to use a predicate. A predicate is associated with a probe description and is a set of conditions placed between forward slashes ("/"). For example: #pragma D option quiet syscall:::entry /pid == 660/ { printf("%-15s: %8x %8x %8x %8x %8x %8x\n", probefunc, arg0, arg1, arg2, arg3, arg4, arg5); } If the expressions within the predicate evaluate to true, then we are interested in recording some data and the associated actions are executed. However, if they evaluate to false, then we choose not to record anything and return. In this case, we want to execute the actions only if the thread making the system call belongs to pid 660. We made a couple of additions to the D script. The #pragma just tells DTrace not to print anything unless it's explicitly asked to do so (the -q option to dtrace(1M) does the same thing). Second, we added some output formatting to printf() to display the name of the system call that was made and its first six arguments, whether the system call has them or not. We look more at output formatting and arguments later. Here is some example output from our script. s10## ./truss.d write : 16 841b548 8 0 831de790 8 read : 16 f942dcc0 20 0 831de790 20 write : 16 841b548 8 0 831de790 8 read : 16 f942dcc0 20 0 831de790 20 pollsys : f942dce0 1 f942dbf0 0 831de790 f942dbf0 write : 5 feab36b1 1 e 81b31250 1 pollsys : 8046ef0 2 8046f88 0 81b31250 8046f88 With a few lines of D we have created the functional equivalent of truss -p. Now that we've seen a simple example, let's look at some of the basic building blocks of DTrace. 10.2.1. D Program StructureD is a block-structured language similar in layout to awk. A program consists of one or more clauses that take the following form: probe / optional predicates / { optional action statements; } Each clause describes one or more probes to enable, an optional predicate, and any actions to associate with the probe specification. When a D program contains several clauses that enable the same probe, the clauses are executed in the order in which they appear in the program. For example: syscall::read:entry { printf("A"); } syscall::read:entry { printf(" B"); exit(1); } The above script contains two clauses; each clause enables the read(2) system call entry probe. When this script is executed, the system is modified dynamically to insert our tracing actions into the read() system call. When any application next makes a read() call, the first clause is executed, causing the character "A" to be displayed. The next clause is executed immediately after the first, and the sequence "B" is also displayed. The exit(1) call terminates the tracing session, an action that in turn causes the enabled probes and their actions to be removed. The system then returns to its default state. Executing the script we see this: sol10# ./read.d A B The preceding explanation is a huge simplification of what actually happens when we execute a D script. The important thing to note here is the dynamic nature of the modifications that are made when a D script is executed. The modifications made to the system (the "instrumentation") exist just for the lifetime of the script. When no DTrace scripts are running, the system acts just as if DTrace were not installed. 10.2.2. Providers and ProbesBy default, DTrace provides tens of thousands of probes that you can enable to gain unparalleled insight into the behavior of a system (use dtrace -l to list them all). Each probe can be referred to by a unique numerical ID or by a more commonly used human-readable one that consists of four colon-separated fields. These are defined as follows: provider:module:function:name
Note two key facts about probe specifications:
Table 10.1 lists examples of valid probe descriptions.
Although it isn't necessary to specify all the fields in a probe, the examples in this book do so in order to remove any ambiguity about which probes are being enabled. Also note that a comma-separated list of probes can be used to associate multiple probes with the same predicate and actions. In previous examples we saw the syscall provider being used to ask questions concerning system call usage. Exactly what is a provider and what is its relationship to a probe? A provider creates the probes that are essentially the individual system points at which we ask questions. There are a number of providers, each able to instrument a different part of the system. The following providers are of special interest to us:
The following providers leverage the sdt provider to grant powerful observability into key Solaris functional areas:
10.2.3. AggregationsThe syscall example used earlier is simple and powerful. However, the output quickly becomes voluminous and overwhelming with thousands of lines generated in seconds. It rapidly becomes difficult to discern patterns of activity in the data, such as might be perceived in a view of all system calls sorted by count. Historically, we would have generated our data and post-processed this by using tools such as awk(1) or perl(1), but that approach is laborious and time wasting. DTrace enables us to succinctly specify how to group vast amounts of data so that we can easily observe such patterns. The mechanism that does this is termed an aggregation. We use aggregations to refine our initial script. syscall:::entry { @sys[probefunc] = count(); } And here is the output now. sol10# ./truss.d dtrace: script './truss.d' matched 225 probes ^C <output elided> fcntl xstat 1113 lwp_park 2767 setcontext 4593 lwp_sigmask 4599 write 7429 setitimer 8234 writev 8444 ioctl 17718 pollsys 135603 read 141379 Instead of seeing every system call as it is made, we are now presented with a table of system calls sorted by count: over 330, 000 system calls presented in several lines! The concept of an aggregation is simple. We want to associate the value of a function with an arbitrary element in an array. In our example, every time a system call probe is fired, the name of the system call is used (using the probefunc built-in variable) to index an associative array. The result of the count() function is then stored in this element of the array (this simply adds 1 to an internal variable for the index in the array and so effectively keeps a running total of the number of times this system call has been entered). In that way, we do not focus on data at individual probe sites but succinctly collate large volumes of data. An aggregation can be split into two basic components: on the left side, a named associative array that is preceded by the @ symbol; on the right side, an aggregating function. @name [ keys ] = function(); An aggregating function has the special property that it produces the same result when applied to a set of data as when applied to subsets of that data and then again to that set of results. A simple example of this is finding the minimum value of the set [5, 12, 4, 7, 18]. Applying the min() function to the whole set gives the result of 4. Equally, computing the minimum value of two subsets [5, 12] and [4, 7, 18] produces 5 and 4. Applying min() again to [5, 4] yields 4. Several aggregating functions in DTrace and their results are listed below.
The following example stores in the appropriate bucket the size of the memory requested in the call to malloc(). pid$1:libc:malloc:entry { @["malloc sizes"] = quantize(arg0); } sol10# ./malloc.d 658 dtrace: script './malloc.d' matched 1 probe ^C malloc sizes value ------------- Distribution ------------- count 2 | 0 4 |@@ 405 8 |@@@@ 886 16 |@@@@@@@ 1673 32 |@ 205 64 |@@@@@@ 1262 128 |@@@@@@@ 1600 256 |@@@@@@@ 1632 512 | 3 1024 | 5 2048 |@@@@ 866 4096 | 10 8192 |@@@ 586 16384 | 0 The example shows that 1673 memory allocations between the size of 16 and 31 bytes were requested. The @ character indicates the relative size of each bucket.
pid$1:libc:malloc:entry { @["malloc sizes"] = lquantize(arg0,4,8,1); } sol10# ./lmalloc.d 658 dtrace: script './lmalloc.d' matched 1 probe ^C malloc sizes value ------------- Distribution ------------- count < 4 | 6 4 |@ 423 5 | 0 6 |@ 400 7 | 0 >= 8 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 18452 10.2.4. VariablesHaving looked at aggregations, we now come to the two basic data types provided by D: associative arrays and scalar variables. An associative array stores data elements that can be accessed with an arbitrary name, known as a key or an index. This differs from normal, fixed-size arrays in a number of different ways:
Associative arrays in D commonly keep a history of events that have occurred in the past to use in controlling flow in scripts. The following example uses an associative array, arr, to keep track of the largest writes made by applications. syscall::write:entry /arr[execname] < arg2/ { printf("%d byte write: %s\n", arg2, execname); arr[execname] = arg2; } The actions of the clause are executed if the write size, stored in arg2, is larger than that stored in the associative array arr for a given application. If the predicate evaluates to TRue, then this is the largest write seen for this application. The actions record this by first printing the size of the write and then by updating the element in the array with the new maximum write size. D is similar to languages such as C in its implementation of scalar variables, but a few differences need to be highlighted. The first thing to note is that in the D language, variables do not have to be declared in advance of their use, much the same as in awk(1) or perl(1). A variable comes into existence when it first has a value assigned to it; its type is inferred from the assigned value (you are allowed to declare variables in advance but doing so isn't necessary). There is no explicit memory management in D, much as in the Java programming language. The storage for a variable is allocated when the variable is declared, and deallocated when the value of 0 is assigned to the variable. The D language provides three types of variable scope: global, thread-local, and clause-local. Thread-local variables provide separate storage for each thread for a given variable and are referenced with the self-> prefix. fbt:ufs:ufs_write:entry { self->in = timestamp; } In the clause above, every different thread that executes the ufs_write() function has its own copy of a variable named in. Its type is the same as the timestamp built-in variable, and it holds the value that the timestamp built-in variable had when the thread started executing the actions in the clause. This is a nanosecond value since an arbitrary time in the past. A common use of thread-local variables is to highlight a sequence of interest for a given thread and also to associate data with a thread during that sequence. The following example uses the sched provider to record, by application, all the time that a specified user (UID 1003) spent executing. sched:::on-cpu /uid == 1003/ { self->ts = timestamp; } sched:::off-cpu /self->ts/ { @time[execname] = sum(timestamp - self->ts); self->ts = 0; } The above D script contains two clauses. The first one uses the sched:::oncpu probe to enable a probe at every point in the kernel where a thread can be placed onto a processor and run. The predicate attached to this probe specifies that the actions are only to be executed if the uid of the thread is 1003. The action merely stores the current timestamp in nanoseconds by assigning the timestamp built-in variable to a thread-local variable, self->ts. The second clause uses the sched:::off-cpu probe to enable a probe at every location in the kernel where a thread can be taken off the CPU. The self->ts variable in the predicate ensures that only threads owned by uid 1003 that have already been through the sched:::on-cpu probe shall execute the following actions. Why couldn't we just predicate on uid == 1003 as in the first clause? Well, we want to ensure that any thread executing the following actions has already been through the first clause so that its self->ts variable is set. If it hasn't been set, we will end up storing a huge value in the @time aggregation because self->ts will be 0! Using a thread-local variable in predicates like this to control flow in a D script is a common technique that we frequently use in this book. sol10# ./sched.d dtrace: script './sched.d' matched 6 probes ^C cat 1247134 xterm 1830446 ksh 3909045 stdlibfilt 5499630 make 60092218 sh 158049977 sed 162507340 CC 304925644 ir2hf 678855289 sched 2600916929 ube_ipa 2851647754 ccfe 5879225939 ube 7942433397 The preceding example can be enhanced with the profile provider to produce output at a given periodic rate. To produce output every 5 seconds, we can just add the following clause: profile:::tick-5s { printa(@time); trunc(@time); } The profile provider sets up a probe that fires every 5 seconds on a single CPU. The two actions used here are commonly used when periodically displaying aggregation data:
10.2.5. Probe ArgumentsIn DTrace, probe arguments are made available through one or two mechanisms, depending on which provider is responsible for the probe:
For an example of argument usage, let's look at a script based on the fbt provider. The Solaris kernel, like any other program, is made up of many functions that offer well-defined interfaces to perform specific operations. We often want to ask pertinent questions upon entry to a function, such as, What was the value of its third argument? or upon exit from a function, What was the return value? For example: fbt:ufs:ufs_read:entry /uid == 1003/ { self->path = stringof(args[0]->v_path); self->ts = timestamp; } fbt:ufs:ufs_read:return /self->path != NULL/ { @[self->path] = max(timestamp - self->ts); self->path = 0; self->ts = 0; } This example looks at all the reads performed through ufs file systems by a particular user (UID 1003) and, for each file, records the maximum time taken to carry out the read call. A few new things require further explanation. The name of the file being read from is stored in the thread-local variable, self->path, with the following statement: self->path = stringof(args[0]->v_path); The main point to note here is the use of the args[] array to reference the first argument (args[0]) of the ufs_read function. Using MDB, we can inspect the arguments of ufs_read: sol10# mdb -k Loading modules: [ unix krtld genunix dtrace specfs ufs ip sctp usba uhci fctl s1394 nca lofs audiosup nfs random sppp crypto ptm ipc ] > ufs_read::nm -f ctype C Type int (*)(struct vnode *, struct uio *, int, struct cred *, struct caller_context *) The first argument to ufs_read() is a pointer to a vnode structure (struct vnode *). The path name of the file that is represented by that vnode is stored in the v_path member of the vnode structure and can be accessed through args[0]->v_path. Using MDB again, we inspect the type of the v_path member variable. > ::print -t struct vnode v_path char *v_path The v_path member is a character pointer and needs to be converted to DTrace's native string type. In DTrace a string is a built-in data type. The stringof() action is one of many features that allow easy manipulation of strings. It converts the char * representation of v_path into the DTrace string type. If the arg0 built-in variable had been used, a cast would be required and would be written as this: self->path = stringof(((struct vnode *)arg0)->v_path); The predicate associated with the ufs_read:return probe ensures that its actions are only executed for files with a non-NULL path name. The action then uses the path name stored in the self->path variable to index an aggregation, and the max() aggregating function tracks the maximum time taken for reads against this particular file. For example: sol10# ./ufs.d dtrace: script './ufs.d' matched 2 probes ^C /lib/ld.so.1 3840 /usr/share/lib/zoneinfo/GB 5523 <output elided> /usr/share/man/man1/ls.1 2599836 /./usr/bin/more 3941344 /./usr/bin/tbl 3988087 /usr/share/lib/pub/eqnchar 4397573 /usr/share/lib/tmac/an 5054675 /./usr/bin/nroff 7004599 /./usr/bin/neqn 7021088 /./usr/bin/col 9989462 /usr/share/man/windex 13742938 /./usr/bin/man 17179129 Now let's look at a syscall-based example of return probe use. The following script exploits the fact that syscall probes have their return value stored in arg0 and the error code for the call stored in the errno built-in variable. syscall::open*:entry { self->path = arg0; } syscall::open*:return /self->path != NULL && (int)arg0 == -1 && errno == EACCES/ { printf("UID %d permission denied to open %s\n", uid, copyinstr(self->path)); self->path = 0; } The first clause enables probes for the open(2) and open64(2) system calls. It then stores the address of the buffer, which contains the file name to open, in the thread-local variable self->path. The second clause enables the corresponding syscall return probes. The conditions of interest are laid out in the predicate:
If the above conditions are all true, then a message is printed specifying the UID that induced the condition and the file for which permissions were lacking . sol10# ./open.d UID 39079 permission denied to open /etc/shadow Finally, a note regarding the copyinstr() action used in the second clause above: All probes, predicates, and associated actions are executed in the kernel, and therefore any data that originates in userland must be copied into the kernel to be used. The buffer that contains the file name to be opened in our example is a buffer that resides in a userland application. For the contents to be printed, the buffer must be copied to the kernel address space and converted into a DTrace string type; this is what copyinstr() does. 10.2.6. Mixing ProvidersDTrace gives us the freedom to observe interactions across many different subsystems. The following slightly larger script demonstrates how we can follow all the work done in userland and the kernel by a given application function. We can use dtrace -p to attach to and instrument a running process. For example, we can use a script that looks at the function getgr_lookup() in the name services cache daemon. The getgr_lookup() function is called to translate group IDs and group names. Note that here we are interested in the principle of examining a particular function; the actual program and function chosen here are irrelevant. #pragma D option flowindent pid$target:a.out:getgr_lookup:entry { self->in = 1; } pid$target:::entry, pid$target:::return /self->in/ { printf("(pid)\n"); } fbt:::entry, fbt:::return /self->in/ { printf("(fbt)\n"); } pid$target:a.out:getgr_lookup:return /self->in/ { self->in = 0; exit(0); } The #pragma flowindent directive at the start of the script means that indentation will be increased on entry to a function and reduced on the same function's return. Showing function calls in a nested manner like this makes the output much more readable. The pid provider instruments userland applications. The process to be instrumented is specified with the $target macro argument, which always expands to the PID of the process being traced when we attach to the process by using the -p option to dtrace(1M). The second clause enables all the entry and return probes in the nscd process, and the third clause enables every entry and return probe in the kernel. The predicate in both of these clauses specifies that we are only interested in executing the actions if the thread-local self->in variable is set. This variable is set to 1 when nscd's getgr_lookup() function is entered and set to 0 on exit from this function (that is, when the return probe is fired). For example: sol10# dtrace -s ./nscd.d -p 'pgrep nscd` dtrace: script './nscd.d' matched 43924 probes CPU FUNCTION 0 -> getgr_lookup (pid) 0 -> mutex_lock (pid) 0 -> mutex_lock_impl (pid) 0 <- mutex_lock_impl (pid) 0 <- mutex_lock (pid) 0 -> _xstat (pid) 0 -> copyout (fbt) 0 <- kcopy (fbt) 0 -> syscall_mstate (fbt) 0 -> gethrtime_unscaled (fbt) 0 <- gethrtime_unscaled (fbt) 0 <- syscall_mstate (fbt) 0 -> syscall_entry (fbt) <output elided> 0 <- syscall_exit (fbt) 0 -> syscall_mstate (fbt) 0 -> gethrtime_unscaled (fbt) 0 <- gethrtime_unscaled (fbt) 0 <- syscall_mstate (fbt) 0 <- _xstat (pid) 0 -> get_hash (pid) 0 -> abs (pid) 0 -> copyout (fbt) 0 <- kcopy (fbt) 0 <- abs (pid) 0 <- get_hash (pid) 0 -> memcpy (pid) 0 -> copyout (fbt) 0 <- kcopy (fbt) 0 <- memcpy (pid) 0 -> mutex_unlock (pid) 0 <- mutex_unlock (pid) 0 | getgr_lookup:return (pid) 0 <- getgr_lookup 10.2.7. Accessing Global Kernel DataDTrace provides a very useful feature by which we can access symbols defined in the Solaris kernel from within a D script. We can use the backquote character (`) to refer to kernel symbols, and this information can be used to great advantage when we are exploring the behavior of a Solaris kernel. For example, a variable named mpid is declared in the Solaris kernel source to keep track of the last PID that was allocated. It is declared in uts/common/os/pid.c as follows: static pid_t mpid; The following script uses this variable to calculate the rate of process creation on the system and to output a message if it exceeds a given amount (10 processes per second in this case): dtrace:::BEGIN { cnt = 'mpid; } profile:::tick-1s /'mpid < cnt+10/ { cnt = 'mpid; } profile:::tick-1s /'mpid >= cnt+10/ { printf("High process creation rate: %d/sec\n", 'mpid - cnt); cnt = 'mpid; } The first clause uses the BEGIN probe from the dtrace provider to initialize a global variable (cnt) to the current value of the mpid kernel variable. The BEGIN, END, and ERROR probes are special probes that belong to the dtrace provider. These probes are essentially virtual probes in that they aren't associated with any code location or timer source. The BEGIN probe fires before any other probes when we start the tracing session and allows us to perform tasks such as data initialization. The END probe is called when the tracing session is terminated either with a Control-C or an explicit call to the exit() action. Its main function is to print data collected during the execution of the script. The ERROR probe is less commonly used; it is called upon abnormal termination of the script. Both of the next two clauses in the previous example enable the profile:::tick-1s probe. The probe fires every second, and the two clauses are executed in the order specified in the script. The important thing to note is that the predicates in the two clauses contain mutually exclusive logic, which ensures that only one of them will be true at any one timeeither ten processes have been created in the last second or they haven't! The predicate in the first profile:::tick-1s clause specifies that its actions should only be executed if fewer than ten processes have been created (the 'mpid variable is within ten of its value one second ago as stored in the cnt variable). If fewer than ten processes have been created in the last second, the cnt variable is updated with the current value of mpid. The actions in the second clause are executed when more than ten processes have been created. If cnt has already been updated in the first clause, then the predicate will be false and the actions are not executed (a message is then printed with the growth rate, and the cnt variable is updated). For example: sol10# ./scope.d High process creation rate: 30/sec High process creation rate: 31/sec High process creation rate: 35/sec High process creation rate: 35/sec High process creation rate: 44/sec High process creation rate: 44/sec High process creation rate: 20/sec 10.2.8. Assorted Actions of InterestDTrace defines numerous actions, only a small percentage of which are used in this book. Actions that you may see used include normalize(), stack(), and ustack().
The find(1) application is at the top of the list here. The walk() routine is listed multiple times because it is recursively called to walk a file tree. |