Section 10.2. The Basics

10.2. The Basics

As an introduction to DTrace and the D language, let's start with a simple example.

The truss(1) utility, a widely used observational tool, provides a powerful means to observe system and library call activity. However, it has many drawbacks: It operates on one process at a time, with no systemwide capability; it is verbose with fixed-output format; and it offers its users a limited choice of questions. Moreover, because of the way it works, TRuss can reduce application performance. Every time a thread in a process makes a system call, TRuss stops the thread through procfs, records the arguments for the system call, and then restarts the thread. When the system call returns, truss again stops the thread, records the return code, and then restarts it. It's not hard to see how this can have quite an impact on performance. DTrace, however, operates completely in the kernel, collecting relevant data at the source. Because the application is no longer controlled through procfs, the impact on the application is greatly minimized.

With DTrace we can surpass the power of truss with our first script, which in itself is almost the simplest script that can be written. Here's a D script, truss.d, that lets us observe all global system call activity.

#!/usr/sbin/dtrace -s syscall:::entry { }

There are a few important things to note from the above example. The first line of the program is as follows:

#!/usr/sbin/dtrace -s

This specifies that the dtrace(1M) program is to be used as the interpreter, and the -s argument tells dtrace that what follows is a D program that it should execute. Note: The interpreter line for all the examples in this chapter is omitted for the sake of brevity, but it is still very much required.

Next follows a description of the events we are interested in looking at. Here we are interested in what happens every time a system call is made.

syscall:::entry

This is an example of a probe description. In DTrace, a probe is a place in the system where we want to ask a question and record some pertinent data. Such data might include function arguments, stack traces, timestamps, file names, function names, and the like.

The braces that follow the probe specification contain the actions that are to be executed when the associated probe is encountered. Actions are generally focused on recording items of data; we'll see examples of these shortly. This example contains no actions, so the default behavior is to just print the name of the probe that has been hit (or fired in tracing parlance) as well as the CPU it executed on and a numerical ID for the probe.

Let's run our simple script.

sol10# ./truss.d dtrace: script './truss.d' matched 225 probes CPU     ID                    FUNCTION:NAME   0     13                      write:entry   0    103                      ioctl:entry   0    317                    pollsys:entry   0     13                      write:entry   0    103                      ioctl:entry   0    317                    pollsys:entry ^C

As you can see from the preceding output, the syscall:::entry probe description enabled 225 different probes in this instance; this is the number of system calls currently available on this system. We don't go into the details now of exactly what this means, but be aware that, when the script is executed, the kernel is instrumented according to our script. When we stop the script, the instrumentation is removed and the system acts in the same way as a system without DTrace installed.

The final thing to note here is that the execution of the script was terminated with a Control-C sequence (as shown with the ^C in the above output). A script can itself issue an explicit exit() call to terminate; in the absence of this, the user will have to type Control-C.

The preceding script gives a global view of all system call activity. To focus our attention on a single process, we can modify the script to use a predicate. A predicate is associated with a probe description and is a set of conditions placed between forward slashes ("/"). For example:

#pragma D option quiet syscall:::entry /pid == 660/ {         printf("%-15s: %8x %8x %8x %8x %8x %8x\n",             probefunc, arg0, arg1, arg2, arg3, arg4, arg5); }

If the expressions within the predicate evaluate to true, then we are interested in recording some data and the associated actions are executed. However, if they evaluate to false, then we choose not to record anything and return. In this case, we want to execute the actions only if the thread making the system call belongs to pid 660.

We made a couple of additions to the D script. The #pragma just tells DTrace not to print anything unless it's explicitly asked to do so (the -q option to dtrace(1M) does the same thing). Second, we added some output formatting to printf() to display the name of the system call that was made and its first six arguments, whether the system call has them or not. We look more at output formatting and arguments later. Here is some example output from our script.

s10## ./truss.d write          :       16  841b548        8        0 831de790         8 read           :       16 f942dcc0       20        0 831de790        20 write          :       16  841b548        8        0 831de790         8 read           :       16 f942dcc0       20        0 831de790        20 pollsys        : f942dce0        1 f942dbf0        0 831de790  f942dbf0 write          :        5 feab36b1        1        e 81b31250         1 pollsys        :  8046ef0        2  8046f88        0 81b31250   8046f88

With a few lines of D we have created the functional equivalent of truss -p.

Now that we've seen a simple example, let's look at some of the basic building blocks of DTrace.

10.2.1. D Program Structure

D is a block-structured language similar in layout to awk. A program consists of one or more clauses that take the following form:

probe / optional predicates / {         optional action statements; }

Each clause describes one or more probes to enable, an optional predicate, and any actions to associate with the probe specification. When a D program contains several clauses that enable the same probe, the clauses are executed in the order in which they appear in the program. For example:

syscall::read:entry {         printf("A"); } syscall::read:entry {         printf(" B");         exit(1); }

The above script contains two clauses; each clause enables the read(2) system call entry probe. When this script is executed, the system is modified dynamically to insert our tracing actions into the read() system call. When any application next makes a read() call, the first clause is executed, causing the character "A" to be displayed. The next clause is executed immediately after the first, and the sequence "B" is also displayed. The exit(1) call terminates the tracing session, an action that in turn causes the enabled probes and their actions to be removed. The system then returns to its default state. Executing the script we see this:

sol10# ./read.d A B

The preceding explanation is a huge simplification of what actually happens when we execute a D script. The important thing to note here is the dynamic nature of the modifications that are made when a D script is executed. The modifications made to the system (the "instrumentation") exist just for the lifetime of the script. When no DTrace scripts are running, the system acts just as if DTrace were not installed.

10.2.2. Providers and Probes

By default, DTrace provides tens of thousands of probes that you can enable to gain unparalleled insight into the behavior of a system (use dtrace -l to list them all). Each probe can be referred to by a unique numerical ID or by a more commonly used human-readable one that consists of four colon-separated fields. These are defined as follows:

provider:module:function:name

Provider. The name of the DTrace provider that created this probe. A provider is essentially a kernel module that creates groups of probes that are related in some way (for example, kernel functions, an application's functions, system calls, timers).
Module. The name of the module to which this probe belongs if the probe is associated with a program location. For kernel probes, it is the name of the module (for example, ufs); for applications, it is a library name (for example, libc.so).
Function. The name of the function that this probe is associated with if it belongs to a program location. Kernel examples are ufs_write() and clock(); a userland (a program running in user-mode) example is the printf() function of libc.
Name. The name component of the probe. It generally gives an idea of its meaning. Examples include entry or return for kernel function calls, start for an I/O probe, and on-cpu for a scheduling probe.

Note two key facts about probe specifications:

If any field in a probe specification is empty, that field matches any value (that is, it acts like a wildcard).
sh(1)-like pattern matching is supported.

Table 10.1 lists examples of valid probe descriptions.

Table 10.1. Examples of DTrace Probe Descriptions
Probe Description	Meaning
`fbt:ufs:ufs_write:entry`	The `ufs_write()` kernel function's entry point
`fbt:nfs::`	All the probes in the kernel `nfs` module
`syscall::write:entry`	The `write()` system call entry point
`syscall::read:entry`	All the matches of `read`, `readlink`, `readv`, `pread`, and `pread64` system calls
`syscall:::`	All system call entry and return probes
`io:::start`	All the places in the kernel from which a physical I/O can occur
`sched:::off-cpu`	All the places in kernel where a currently executing thread is taken off the CPU

Although it isn't necessary to specify all the fields in a probe, the examples in this book do so in order to remove any ambiguity about which probes are being enabled. Also note that a comma-separated list of probes can be used to associate multiple probes with the same predicate and actions.

In previous examples we saw the syscall provider being used to ask questions concerning system call usage. Exactly what is a provider and what is its relationship to a probe? A provider creates the probes that are essentially the individual system points at which we ask questions. There are a number of providers, each able to instrument a different part of the system.

The following providers are of special interest to us:

fbt. The Function Boundary Tracing provider places probes at the entry and return point of virtually every kernel function. This provider illuminates the operation of the Solaris kernel and is used extensively in this book. Its full power is realized when it is used in conjunction with the Solaris source code.
pid. This provider probes for userland processes at function entry, function return, and even down to the instruction level.
syscall. This provider probes at the entry and return point of every system call.
profile. This provider gives us timer-driven probes. The timers can be specified at any resolution from nanoseconds to days and can interrupt all CPUs or just one.
sdt. The Statically Defined Tracing provider enables programmers to place probes at arbitrary locations in their code and to choose probe names that convey specific meaning. (For example, a probe named transmit-start means more to most observers than the function name in which it sits.)

The following providers leverage the sdt provider to grant powerful observability into key Solaris functional areas:

sched. This provider affords a group of probes for scheduling-related events. Such events include a thread being placed on the CPU, taken off the CPU, put to sleep, or woken up.
io. This provider probes for I/O-related events. Such events include I/O starts, I/O completion, and I/O waits.
proc. The probes of the proc provider examine process creation and life cycle events. Such events include fork, exec, thread creation, and signal send and receive.
vminfo. The vminfo provider is layered on top of the kstat updates to the vm kstat. Every time an update is made to a member of the vm kstat, a probe is fired.
sysinfo. The sysinfo provider is also layered on top of the kstat updates, in this case, to the sys kstat. Every time an update is made to a member of the sys kstat, a probe is fired.

10.2.3. Aggregations

The syscall example used earlier is simple and powerful. However, the output quickly becomes voluminous and overwhelming with thousands of lines generated in seconds. It rapidly becomes difficult to discern patterns of activity in the data, such as might be perceived in a view of all system calls sorted by count. Historically, we would have generated our data and post-processed this by using tools such as awk(1) or perl(1), but that approach is laborious and time wasting. DTrace enables us to succinctly specify how to group vast amounts of data so that we can easily observe such patterns. The mechanism that does this is termed an aggregation. We use aggregations to refine our initial script.

syscall:::entry {         @sys[probefunc] = count(); }

And here is the output now.

sol10# ./truss.d dtrace: script './truss.d' matched 225 probes ^C   <output elided>   fcntl   xstat                                                           1113   lwp_park                                                        2767   setcontext                                                      4593   lwp_sigmask                                                     4599   write                                                           7429   setitimer                                                       8234   writev                                                          8444   ioctl                                                          17718   pollsys                                                       135603   read                                                          141379

Instead of seeing every system call as it is made, we are now presented with a table of system calls sorted by count: over 330, 000 system calls presented in several lines!

The concept of an aggregation is simple. We want to associate the value of a function with an arbitrary element in an array. In our example, every time a system call probe is fired, the name of the system call is used (using the probefunc built-in variable) to index an associative array. The result of the count() function is then stored in this element of the array (this simply adds 1 to an internal variable for the index in the array and so effectively keeps a running total of the number of times this system call has been entered). In that way, we do not focus on data at individual probe sites but succinctly collate large volumes of data.

An aggregation can be split into two basic components: on the left side, a named associative array that is preceded by the @ symbol; on the right side, an aggregating function.

@name [ keys ] = function();

An aggregating function has the special property that it produces the same result when applied to a set of data as when applied to subsets of that data and then again to that set of results. A simple example of this is finding the minimum value of the set [5, 12, 4, 7, 18]. Applying the min() function to the whole set gives the result of 4. Equally, computing the minimum value of two subsets [5, 12] and [4, 7, 18] produces 5 and 4. Applying min() again to [5, 4] yields 4.

Several aggregating functions in DTrace and their results are listed below.

count. Returns the number of times called.

avg. Returns the mean of its arguments. The following example displays the average write size that each process makes. The third argument to the write(2) system call is the size of the write being made. Since arguments are indexed from 0, arg2 is therefore the size of the write.

syscall::write:entry {         @sys[execname] = avg(arg2); } sol10# ./avg.d dtrace: script './avg.d' matched 1 probe ^C   ls                                                               101   egrep                                                            162   gnome-panel                                                      169   gnome-terminal                                                   290   soffice.bin                                                      309   metacity                                                         334   battstat-applet-                                                 366   init                                                             412   mozilla-bin                                                     1612   gconfd-2                                                       27763

sum. Returns the total value of its arguments.
max. Returns the maximum value of its arguments.
min. Returns the minimum value of its arguments.
quantize. Stores the specified argument in the appropriate bucket in a power-of-2 series.

The following example stores in the appropriate bucket the size of the memory requested in the call to malloc().

pid$1:libc:malloc:entry {         @["malloc sizes"] = quantize(arg0); } sol10# ./malloc.d 658 dtrace: script './malloc.d' matched 1 probe ^C   malloc sizes            value  ------------- Distribution ------------- count                2 |                                         0                4 |@@                                       405                8 |@@@@                                     886               16 |@@@@@@@                                  1673               32 |@                                        205               64 |@@@@@@                                   1262              128 |@@@@@@@                                  1600              256 |@@@@@@@                                  1632              512 |                                         3             1024 |                                         5             2048 |@@@@                                     866             4096 |                                         10             8192 |@@@                                      586            16384 |                                         0

The example shows that 1673 memory allocations between the size of 16 and 31 bytes were requested. The @ character indicates the relative size of each bucket.

lquantize. Linear quantizations are frequently used to drill down on buckets of interest when the quantize() function has previously been used. This time we use a linear range of buckets that goes between two sizes with a specified step size. The example below specifies that calls to malloc() between 4 and 7 bytes in size go in their own bucket.

pid$1:libc:malloc:entry {         @["malloc sizes"] = lquantize(arg0,4,8,1); } sol10# ./lmalloc.d 658 dtrace: script './lmalloc.d' matched 1 probe ^C   malloc sizes            value  ------------- Distribution -------------  count              < 4 |                                          6                4 |@                                         423                5 |                                          0                6 |@                                         400                7 |                                          0             >= 8 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@    18452

10.2.4. Variables

Having looked at aggregations, we now come to the two basic data types provided by D: associative arrays and scalar variables. An associative array stores data elements that can be accessed with an arbitrary name, known as a key or an index. This differs from normal, fixed-size arrays in a number of different ways:

There are no predefined limits on the number of elements in the array.
The elements can be indexed with an arbitrary key and not just with integer keys.
The storage for the array is not preallocated or contained in consecutive storage locations.

Associative arrays in D commonly keep a history of events that have occurred in the past to use in controlling flow in scripts. The following example uses an associative array, arr, to keep track of the largest writes made by applications.

syscall::write:entry /arr[execname] < arg2/ {         printf("%d byte write: %s\n", arg2, execname);         arr[execname] = arg2; }

The actions of the clause are executed if the write size, stored in arg2, is larger than that stored in the associative array arr for a given application. If the predicate evaluates to TRue, then this is the largest write seen for this application. The actions record this by first printing the size of the write and then by updating the element in the array with the new maximum write size.

D is similar to languages such as C in its implementation of scalar variables, but a few differences need to be highlighted. The first thing to note is that in the D language, variables do not have to be declared in advance of their use, much the same as in awk(1) or perl(1). A variable comes into existence when it first has a value assigned to it; its type is inferred from the assigned value (you are allowed to declare variables in advance but doing so isn't necessary). There is no explicit memory management in D, much as in the Java programming language. The storage for a variable is allocated when the variable is declared, and deallocated when the value of 0 is assigned to the variable.

The D language provides three types of variable scope: global, thread-local, and clause-local. Thread-local variables provide separate storage for each thread for a given variable and are referenced with the self-> prefix.

fbt:ufs:ufs_write:entry {         self->in = timestamp; }

In the clause above, every different thread that executes the ufs_write() function has its own copy of a variable named in. Its type is the same as the timestamp built-in variable, and it holds the value that the timestamp built-in variable had when the thread started executing the actions in the clause. This is a nanosecond value since an arbitrary time in the past.

A common use of thread-local variables is to highlight a sequence of interest for a given thread and also to associate data with a thread during that sequence. The following example uses the sched provider to record, by application, all the time that a specified user (UID 1003) spent executing.

sched:::on-cpu /uid == 1003/ {         self->ts = timestamp; } sched:::off-cpu /self->ts/ {         @time[execname] = sum(timestamp - self->ts);         self->ts = 0; }

The above D script contains two clauses. The first one uses the sched:::oncpu probe to enable a probe at every point in the kernel where a thread can be placed onto a processor and run. The predicate attached to this probe specifies that the actions are only to be executed if the uid of the thread is 1003. The action merely stores the current timestamp in nanoseconds by assigning the timestamp built-in variable to a thread-local variable, self->ts.

The second clause uses the sched:::off-cpu probe to enable a probe at every location in the kernel where a thread can be taken off the CPU. The self->ts variable in the predicate ensures that only threads owned by uid 1003 that have already been through the sched:::on-cpu probe shall execute the following actions. Why couldn't we just predicate on uid == 1003 as in the first clause? Well, we want to ensure that any thread executing the following actions has already been through the first clause so that its self->ts variable is set. If it hasn't been set, we will end up storing a huge value in the @time aggregation because self->ts will be 0! Using a thread-local variable in predicates like this to control flow in a D script is a common technique that we frequently use in this book.

sol10# ./sched.d dtrace: script './sched.d' matched 6 probes ^C   cat                                                          1247134   xterm                                                        1830446   ksh                                                          3909045   stdlibfilt                                                   5499630   make                                                        60092218   sh                                                         158049977   sed                                                        162507340   CC                                                         304925644   ir2hf                                                      678855289   sched                                                     2600916929   ube_ipa                                                   2851647754   ccfe                                                      5879225939   ube                                                       7942433397

The preceding example can be enhanced with the profile provider to produce output at a given periodic rate. To produce output every 5 seconds, we can just add the following clause:

profile:::tick-5s {               printa(@time);               trunc(@time); }

The profile provider sets up a probe that fires every 5 seconds on a single CPU. The two actions used here are commonly used when periodically displaying aggregation data:

printa(). This function prints aggregation data. This example uses the default formatting, but we can control output by using modifiers in much the same way as with printf(). Note that we refer to the aggregation result (that is, the value returned from the aggregation function) by using the @ formatting character with the appropriate modifier. The above printa() could be rewritten with
```
printa("%-56s %@10d", @time); 
```
trunc(). This function truncates an aggregation or removes its current contents altogether. The trunc() action deletes all the keys and the aggregation results if no second, optional, value is given. Specifying a second argument, n, removes all the keys and the aggregation values in the aggregation apart from the top n values.

10.2.5. Probe Arguments

In DTrace, probe arguments are made available through one or two mechanisms, depending on which provider is responsible for the probe:

args[]. The args[] array presents a typed array of arguments for the current probe. args[0] is the first argument, args[1] the second, and so on. The providers whose probe arguments are presented through the args[] array include fbt, sched, io, and proc.
arg0... arg9. The argn built-in variables are accessible by all probes. They are raw 64-bit integer quantities and, as such, must be cast to the appropriate type.

For an example of argument usage, let's look at a script based on the fbt provider. The Solaris kernel, like any other program, is made up of many functions that offer well-defined interfaces to perform specific operations. We often want to ask pertinent questions upon entry to a function, such as, What was the value of its third argument? or upon exit from a function, What was the return value? For example:

fbt:ufs:ufs_read:entry /uid == 1003/ {         self->path = stringof(args[0]->v_path);         self->ts = timestamp; } fbt:ufs:ufs_read:return /self->path != NULL/ {         @[self->path] = max(timestamp - self->ts);         self->path = 0;         self->ts = 0; }

This example looks at all the reads performed through ufs file systems by a particular user (UID 1003) and, for each file, records the maximum time taken to carry out the read call. A few new things require further explanation.

The name of the file being read from is stored in the thread-local variable, self->path, with the following statement:

self->path = stringof(args[0]->v_path);

The main point to note here is the use of the args[] array to reference the first argument (args[0]) of the ufs_read function. Using MDB, we can inspect the arguments of ufs_read:

sol10# mdb -k Loading modules: [ unix krtld genunix dtrace specfs ufs ip sctp usba uhci fctl s1394 nca lofs audiosup nfs random sppp crypto ptm ipc ] > ufs_read::nm -f ctype C Type int (*)(struct vnode *, struct uio *, int, struct cred *, struct caller_context *)

The first argument to ufs_read() is a pointer to a vnode structure (struct vnode *). The path name of the file that is represented by that vnode is stored in the v_path member of the vnode structure and can be accessed through args[0]->v_path. Using MDB again, we inspect the type of the v_path member variable.

> ::print -t struct vnode v_path char *v_path

The v_path member is a character pointer and needs to be converted to DTrace's native string type. In DTrace a string is a built-in data type. The stringof() action is one of many features that allow easy manipulation of strings. It converts the char * representation of v_path into the DTrace string type.

If the arg0 built-in variable had been used, a cast would be required and would be written as this:

self->path = stringof(((struct vnode *)arg0)->v_path);

The predicate associated with the ufs_read:return probe ensures that its actions are only executed for files with a non-NULL path name. The action then uses the path name stored in the self->path variable to index an aggregation, and the max() aggregating function tracks the maximum time taken for reads against this particular file. For example:

sol10# ./ufs.d dtrace: script './ufs.d' matched 2 probes ^C   /lib/ld.so.1                                                    3840   /usr/share/lib/zoneinfo/GB                                      5523   <output elided>   /usr/share/man/man1/ls.1                                     2599836   /./usr/bin/more                                              3941344   /./usr/bin/tbl                                               3988087   /usr/share/lib/pub/eqnchar                                   4397573   /usr/share/lib/tmac/an                                       5054675   /./usr/bin/nroff                                             7004599   /./usr/bin/neqn                                              7021088   /./usr/bin/col                                               9989462   /usr/share/man/windex                                       13742938   /./usr/bin/man                                              17179129

Now let's look at a syscall-based example of return probe use. The following script exploits the fact that syscall probes have their return value stored in arg0 and the error code for the call stored in the errno built-in variable.

syscall::open*:entry {         self->path = arg0; } syscall::open*:return /self->path != NULL && (int)arg0 == -1 && errno == EACCES/ {         printf("UID %d permission denied to open %s\n",             uid, copyinstr(self->path));         self->path = 0; }

The first clause enables probes for the open(2) and open64(2) system calls. It then stores the address of the buffer, which contains the file name to open, in the thread-local variable self->path.

The second clause enables the corresponding syscall return probes. The conditions of interest are laid out in the predicate:

The stored file name buffer isn't a NULL pointer (self->path != NULL).
The open failed (arg0 == - 1).
The open failed owing to insufficient permissions (errno == EACCES).

If the above conditions are all true, then a message is printed specifying the UID that induced the condition and the file for which permissions were lacking .

sol10# ./open.d UID 39079 permission denied to open /etc/shadow

Finally, a note regarding the copyinstr() action used in the second clause above: All probes, predicates, and associated actions are executed in the kernel, and therefore any data that originates in userland must be copied into the kernel to be used. The buffer that contains the file name to be opened in our example is a buffer that resides in a userland application. For the contents to be printed, the buffer must be copied to the kernel address space and converted into a DTrace string type; this is what copyinstr() does.

10.2.6. Mixing Providers

DTrace gives us the freedom to observe interactions across many different subsystems. The following slightly larger script demonstrates how we can follow all the work done in userland and the kernel by a given application function. We can use dtrace -p to attach to and instrument a running process. For example, we can use a script that looks at the function getgr_lookup() in the name services cache daemon. The getgr_lookup() function is called to translate group IDs and group names. Note that here we are interested in the principle of examining a particular function; the actual program and function chosen here are irrelevant.

#pragma D option flowindent pid$target:a.out:getgr_lookup:entry {         self->in = 1; } pid$target:::entry, pid$target:::return /self->in/ {         printf("(pid)\n"); } fbt:::entry, fbt:::return /self->in/ {         printf("(fbt)\n"); } pid$target:a.out:getgr_lookup:return /self->in/ {         self->in = 0;         exit(0); }

The #pragma flowindent directive at the start of the script means that indentation will be increased on entry to a function and reduced on the same function's return. Showing function calls in a nested manner like this makes the output much more readable.

The pid provider instruments userland applications. The process to be instrumented is specified with the $target macro argument, which always expands to the PID of the process being traced when we attach to the process by using the -p option to dtrace(1M).

The second clause enables all the entry and return probes in the nscd process, and the third clause enables every entry and return probe in the kernel. The predicate in both of these clauses specifies that we are only interested in executing the actions if the thread-local self->in variable is set. This variable is set to 1 when nscd's getgr_lookup() function is entered and set to 0 on exit from this function (that is, when the return probe is fired). For example:

sol10# dtrace -s ./nscd.d -p 'pgrep nscd` dtrace: script './nscd.d' matched 43924 probes CPU FUNCTION   0  -> getgr_lookup                           (pid)   0    -> mutex_lock                           (pid)   0      -> mutex_lock_impl                    (pid)   0      <- mutex_lock_impl                    (pid)   0    <- mutex_lock                           (pid)   0    -> _xstat                               (pid)   0      -> copyout                            (fbt)   0      <- kcopy                              (fbt)   0      -> syscall_mstate                     (fbt)   0        -> gethrtime_unscaled               (fbt)   0        <- gethrtime_unscaled               (fbt)   0      <- syscall_mstate                     (fbt)   0      -> syscall_entry                      (fbt)  <output elided>   0      <- syscall_exit                       (fbt)   0      -> syscall_mstate                     (fbt)   0        -> gethrtime_unscaled               (fbt)   0        <- gethrtime_unscaled               (fbt)   0      <- syscall_mstate                     (fbt)   0    <- _xstat                               (pid)   0    -> get_hash                             (pid)   0      -> abs                                (pid)   0        -> copyout                          (fbt)   0        <- kcopy                            (fbt)   0      <- abs                                (pid)   0    <- get_hash                             (pid)   0    -> memcpy                               (pid)   0      -> copyout                            (fbt)   0      <- kcopy                              (fbt)   0    <- memcpy                               (pid)   0    -> mutex_unlock                         (pid)   0    <- mutex_unlock                         (pid)   0   | getgr_lookup:return                    (pid)   0  <- getgr_lookup

10.2.7. Accessing Global Kernel Data

DTrace provides a very useful feature by which we can access symbols defined in the Solaris kernel from within a D script. We can use the backquote character (`) to refer to kernel symbols, and this information can be used to great advantage when we are exploring the behavior of a Solaris kernel. For example, a variable named mpid is declared in the Solaris kernel source to keep track of the last PID that was allocated. It is declared in uts/common/os/pid.c as follows:

static pid_t mpid;

The following script uses this variable to calculate the rate of process creation on the system and to output a message if it exceeds a given amount (10 processes per second in this case):

dtrace:::BEGIN {         cnt = 'mpid; } profile:::tick-1s /'mpid < cnt+10/ {         cnt = 'mpid; } profile:::tick-1s /'mpid >= cnt+10/ {         printf("High process creation rate: %d/sec\n", 'mpid - cnt);         cnt = 'mpid; }

The first clause uses the BEGIN probe from the dtrace provider to initialize a global variable (cnt) to the current value of the mpid kernel variable.

The BEGIN, END, and ERROR probes are special probes that belong to the dtrace provider. These probes are essentially virtual probes in that they aren't associated with any code location or timer source. The BEGIN probe fires before any other probes when we start the tracing session and allows us to perform tasks such as data initialization. The END probe is called when the tracing session is terminated either with a Control-C or an explicit call to the exit() action. Its main function is to print data collected during the execution of the script. The ERROR probe is less commonly used; it is called upon abnormal termination of the script.

Both of the next two clauses in the previous example enable the profile:::tick-1s probe. The probe fires every second, and the two clauses are executed in the order specified in the script. The important thing to note is that the predicates in the two clauses contain mutually exclusive logic, which ensures that only one of them will be true at any one timeeither ten processes have been created in the last second or they haven't!

The predicate in the first profile:::tick-1s clause specifies that its actions should only be executed if fewer than ten processes have been created (the 'mpid variable is within ten of its value one second ago as stored in the cnt variable). If fewer than ten processes have been created in the last second, the cnt variable is updated with the current value of mpid.

The actions in the second clause are executed when more than ten processes have been created. If cnt has already been updated in the first clause, then the predicate will be false and the actions are not executed (a message is then printed with the growth rate, and the cnt variable is updated). For example:

sol10# ./scope.d High process creation rate: 30/sec High process creation rate: 31/sec High process creation rate: 35/sec High process creation rate: 35/sec High process creation rate: 44/sec High process creation rate: 44/sec High process creation rate: 20/sec

10.2.8. Assorted Actions of Interest

DTrace defines numerous actions, only a small percentage of which are used in this book. Actions that you may see used include normalize(), stack(), and ustack().

normalize(). This action effectively divides the values in the aggregation by a supplied normalization factor. A simple example is the use of a tick-5s probe to display data that you want displayed as a per-second rate:
```
syscall::read:entry {         @reads[probefunc] = count(); } tick-5s {         printa("%s (non normalized) %50@d\n\n", @reads);         normalize(@reads, 5);         printa("%s (  normalized  ) %50@d\n", @reads);         trunc(@reads); } 
```
The above example uses a single aggregation, @reads, to store the number of read system calls made. Every 5 seconds the contents of the aggregation are displayed by printa() and then divided by 5 to give a per-second value with the normalize() action. The normalized aggregation is then printed and its contents are deleted with the trunc() action. For example,
```
sol10# ./norm.d read (non normalized)                                                5012 read (  normalized  )                                                1002 
```

stack(). This action produces the stack trace of the kernel thread at the time of execution. It commonly indexes aggregations to determine the most common callstacks when at a given probe. It can also be an invaluable tool for learning how the code flow in the kernel works because it gives a ready view of the call sequence up to a given point. The following script and output show the most frequently executed call sequence through the ufs file system over a given trace period.

fbt:ufs::entry {         @ufs[stack()] = count(); } END {         trunc(@ufs, 1); } sol10# ./stack.d dtrace: script './stack.d' matched 419 probes ^C CPU     ID                    FUNCTION:NAME   0      2                             :END               genunix`fop_lookup+0x18               genunix`lookuppnvp+0x371               genunix`lookuppnat+0x11a               genunix`lookupnameat+0x8b               genunix`cstatat_getvp+0x16d               genunix`cstatat64_32+0x48               genunix`lstat64_32+0x25               unix`sys_syscall32+0x101             18650

ustack(). This action is the equivalent of the stack function for userland applications. The following script and output display the stack trace of the userland application that is generating most of the work in the ufs code.

#!/usr/sbin/dtrace -s fbt:ufs::entry {         @ufs[ustack()] = count(); } END {         trunc(@ufs, 1); } sol10# ./ustack.d dtrace: script './stack.d' matched 419 probes ^C CPU     ID                    FUNCTION:NAME   0      2                             :END               libc.so.1`lstat64+0x7               libc.so.1`walk+0x44b               libc.so.1`walk+0x44b               libc.so.1`walk+0x44b               libc.so.1`walk+0x44b               libc.so.1`walk+0x44b               libc.so.1`nftw64+0x185               find`main+0x2b6               find`0x80513d2            231489

The find(1) application is at the top of the list here. The walk() routine is listed multiple times because it is recursively called to walk a file tree.