Concretely detecting whether a system is short of processor time is often more complicated than it first seems. Regular indications of CPU shortage can be attributed to many things: disk configuration problems, peripheral devices being bound to a single processor for interrupt management, processor starvation due to a lack of memory, etc. In a sense, the best way to tell when you're legitimately processor-bound is that you can exclude all other possibilities.
There is also a philosophical issue involved. Some people like seeing a low load average on their systems, or more than 50% of the processor idle. This makes some sense in terms of the system accommodating bursts of load activity. Similarly, some people like seeing a load average that approaches the number of processors, meaning the system is running full-tilt. This also makes some sense, because you are using everything you have paid for in terms of processor performance. My personal philosophy varies between the following: on a computational server, I would aim for full utilization of all processors; on a development system or a database server, I would aim for a degree of overcapacity to accommodate spikes in load.
There are quite a few tools for monitoring processor performance: the load average, vmstat, mpstat, prstat and/or top, and lockstat. There are also two tools in Solaris for managing processors in multiprocessor systems: psrinfo and psrset. We'll cover all of these in some detail.
The system understands a concept of the load average, which is expressed in the uptime command:
% uptime
  4:02pm  up 19 day(s),  9:28,  119 users,  load average: 1.22, 1.21, 1.25
The load average is represented as the average number of runnable jobs (the sum of the run queue length and the number of jobs currently running) over the last one, five, and fifteen minutes. This gives a quick estimate of how heavily loaded the system is, but it is not exclusively a measurement of processor load; a system that has a memory shortage will usually have a high load average.
The load average statistic is only meaningful when looked at in the context of the number of installed processors. As a general rule, if the load average is less than the number of installed processors, the system is not lacking processor power. For example, if the load average is 2.5 on a four-processor system, it is not an indication of a processor shortage.
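This rule of thumb is easy to sketch in code. The function name and the report format below are my own; the threshold is the one stated above (load average below the processor count means no shortage):

```python
def assess_load(load_avg, ncpus):
    """Apply the rule of thumb from the text: a load average below the
    number of installed processors does not indicate a processor shortage."""
    per_cpu = load_avg / ncpus
    if load_avg < ncpus:
        return "no processor shortage (%.2f runnable jobs per CPU)" % per_cpu
    return "possible processor shortage (%.2f runnable jobs per CPU)" % per_cpu

# The example from the text: load average 2.5 on a four-processor system.
print(assess_load(2.5, 4))
```

On Linux, the three load averages can be read directly from /proc/loadavg; on Solaris, uptime reports the same figures.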
One very powerful tool in our performance tuning arsenal is vmstat . I discuss it in pieces as we go along. Let's take a look at some sample vmstat output from a Solaris system (recall that vmstat takes as its first argument the time interval in seconds between displaying data, and that you must always discard the first line of output):
% vmstat 30
 procs     memory            page            disk          faults      cpu
 r b w    swap  free  re   mf pi po fr   de sr s0 s1 -- --  in   sy  cs us sy id
 0 0 0    5928  1920   0  328 45 25 37 1360  2  4  8  0  0 306 1682 299 10  5 85
 0 0 0 1202648 17152   0  904  8  0  0 1200  0  7 36  0  0 547 5632 683 18 15 67
 0 0 0 1202744 19600   0 1136  0  0  0 1040  0  2 26  0  0 575 4542 752 33 12 55
The first column (highlighted) provides the state of the system's three process queues. In order, they are the run queue (r), the runnable-but-blocked queue (b), and the runnable-but-swapped queue (w). We'll discuss the swap queue's relevance, or lack thereof, in Section 4.5.2.1.
The run queue counts the number of processes that are runnable but not currently running. In this example, the system's processor is not overworked. As a general rule, if the run queue is consistently four or more times the number of configured processors, you should probably consider adding processors.
The runnable-but-blocked queue counts the number of processes that are otherwise runnable, but cannot run because they are waiting for a block device transfer (generally a disk I/O operation).
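The queue checks described above reduce to looking at the first three fields of each vmstat data line. Here's a minimal sketch; the sample line and the processor count are hypothetical:

```python
ncpus = 4

# One data line of Solaris vmstat output; the first three fields are the
# r, b, and w queue lengths (hypothetical sample values).
line = "17 2 0 1202648 17152 0 904 8 0 0 1200 0 7 36 0 0 547 5632 683 18 15 67"
r, b, w = (int(f) for f in line.split()[:3])

# Rule of thumb from the text: a run queue consistently at four or more
# times the processor count suggests adding processors.
if r >= 4 * ncpus:
    print("run queue %d >= %d: consider adding processors" % (r, 4 * ncpus))
if b > 0:
    print("%d process(es) blocked waiting on block-device I/O" % b)
```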
The Linux vmstat looks slightly different, but displays the same information:
% vmstat 30
 procs                  memory    swap          io     system      cpu
 r b w  swpd  free  buff  cache  si so   bi  bo   in   cs  us sy id
 0 0 0 20068  9568 26524  50828   0  0    2   7    8    3   3  1  8
 0 0 0 20068  9568 26524  50828   0  0    0   0  115   22   8  4 89
 0 0 0 20068  9568 26524  50828   0  0    1   0  115   22   7  4 89
 0 0 0 20068  9568 26524  50828   0  0    0   0  114   20   7  4 89
Processor time is organized into four timed modes: system time , user time , I/O wait time , and idle time . [23] The idle time consists of what's left over when all other portions have had their fill. A program's normal operating state is user mode, but as it runs, it may generate requests for services that the kernel provides, such as I/O. These requests require the attention of the operating system, so the program switches into system mode, then returns to user mode when the request is complete. The time spent in these two modes is tabulated independently to give the user time and system time values, respectively. These two figures account for the majority of a process's execution time.
[23] Note that vmstat reports only the user time, system time, and idle time (wait time is summed in with idle time). In order to get separated values for wait time and idle time, use mpstat .
When a process waits for a block device data request to complete, it incurs I/O wait time. This brings up an important fact: when a process is blocked in this fashion, all idle time becomes wait time. If your idle time is zero, as reported by vmstat , the first thing you should check is if your system has I/O throughput problems.
Another good diagnostic is to look at system time relative to user time. On NFS servers, which we will discuss in Section 7.5, it is normal for the system time to be much higher than the user time, because NFS service is contained entirely inside the kernel.
You can obtain this sort of data on a per-process basis by means of the Solaris ptime command. Here is an example:
% /usr/proc/bin/ptime tar -cf miscellany.tar miscellany

real        7.402
user        0.044
sys         0.287
There are other commands, /bin/time and timex, which provide varying resolution. We'll discuss these commands in greater detail in Section 8.2.1.
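A quick calculation on the ptime figures from the tar run above shows how little of the elapsed time was actually spent on the CPU, and how the system time dwarfs the user time for this I/O-heavy command:

```python
# Figures reported by ptime for the tar example above.
real, user, sys_t = 7.402, 0.044, 0.287

cpu = user + sys_t
print("CPU time: %.3fs of %.3fs wall clock (%.1f%%)" % (cpu, real, 100 * cpu / real))
print("system/user ratio: %.1f" % (sys_t / user))
```

The remainder of the wall-clock time is almost entirely spent waiting on disk I/O.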
There is a fair bit of system time incurred here because of the volume of I/O requests that must be made through kernel facilities. There are many ways to get this sort of data on a systemwide level: vmstat and mpstat (which we'll discuss in more detail throughout this book) both provide it. The highlighted portion of the vmstat output is illustrative:
% vmstat 30
 procs     memory            page            disk          faults      cpu
 r b w    swap   free  re   mf pi po fr de sr s0 s1 -- --  in   sy  cs us sy id
 0 0 0    3296   2344   0  365 44 22 33  0  1  4  8  0  0 304  700 318 11  6 84
 0 0 0 1221816  99992   0  450  0  0  0  0  0  0  2  0  0 269 1791 465 19  6 76
 0 0 0 1222568 100688   0   83  0  0  0  0  0  0  8  0  0 254  591 389  2  1 98
This is a system under light load: the idle percentage is fairly high, and user time is greater than system time.
The tools we've discussed present valid data for multiprocessor systems, but in order to get detailed per-processor statistics under Solaris, we need to turn to a more powerful tool: mpstat .
This is some sample mpstat output from a two-processor system:
% mpstat 30
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0  182   0  353   269   54  161   24   13    5    0   940   11   5   5  78
  1  182   1  531   134  116  157   23   13    6    0   931   10   6   5  79
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0 2856   228   28   72    0    6    0    0   454    2   2   4  92
  1    4   0    0    39   39   77    0    7    0    0    50    0   0   4  96
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0   80   0  515   489  288  108    2   11    1    0   275    1   6  37  56
  1   61   0   55    46   46  113    2    9    2    0   270    0   1  37  62
Here are some of the more interesting statistics:
intr is the number of interrupts handled by that processor; ithr is the number of interrupts handled as threads.
migr is the number of thread migrations. A thread migration occurs when a thread switches from running on one processor to running on some other processor.
smtx is the number of spins on mutexes; in practical terms, this column represents the number of locks that were not acquired on the first attempt. We'll discuss this later in this chapter in Section 3.6.6.
syscl is the number of system calls.
usr , sys , wt , and idl are the user, system, wait, and idle percentages, on a per-processor basis. Note that vmstat provides one systemwide average only.
A popular tool for monitoring systemwide process activity is top, which provides a list of the most CPU-intensive processes running on a system, as well as some interesting statistics such as the load average, the number of processes, and the amount of memory and paging space in use. top is a very common and generally easy-to-understand application, so we won't discuss it further here. In the interest of completeness, this is what top output looks like:
% top
 10:27pm  up 24 days, 21:24,  9 users,  load average: 0.05, 0.06, 0.01
125 processes: 124 sleeping, 1 running, 0 zombie, 0 stopped
CPU states:  8.0% user,  9.1% system,  0.0% nice, 82.7% idle
Mem:  255596K av, 251056K used,   4540K free,     0K shrd,  3200K buff
Swap: 112444K av,      8K used, 112436K free                99112K cached

  PID USER     PRI  NI  SIZE  RSS SHARE STAT  LIB %CPU %MEM   TIME COMMAND
 5752 gdm       17   0  1296 1296  1052 R       0 13.8  0.5   0:01 top
 3829 root       9   0  1252 1252  1072 S       0  1.1  0.4   0:06 sshd
 5753 root      10   0  1492 1492  1292 S       0  1.1  0.5   0:00 sendmail
    1 root       8   0   520  520   452 S       0  0.0  0.2   0:48 init
Unfortunately, the tradeoff for top's updating process list and other nice features is that it tends to consume quite a bit of processor time. [24] This has been known to substantially skew the behaviour of the system under observation, so Solaris now ships with a tool called prstat that provides similar output. While it doesn't provide some of the nice information that top does about memory usage, it consumes very little processor time and doesn't skew the system's behaviour nearly as much:
[24] Rumor has it that this is where top got its name -- it often appears at the top of the CPU consumers list.
% prstat
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
  4585 gdm      1280K 1096K cpu0    48    0   0:00.00 0.4% prstat/1
  4554 gdm      1424K 1024K sleep   58    0   0:00.00 0.0% csh/1
   337 nobody     23M 6856K sleep   59    0   0:00.07 0.0% ns-slapd/15
  4552 root     1784K 1256K sleep   38    0   0:00.00 0.0% in.telnetd/1
 ...
Total: 33 processes, 143 lwps, load averages: 0.00, 0.00, 0.01
The upshot of this is that top is very nice, but you need to be careful about it skewing the system under observation because of its high overhead. On Solaris systems, prstat provides similar output at a much reduced cost in terms of processor time.
In addition to using mpstat for processor state information, we can also use it to provide meaningful data on interrupts and mutexes. The key measurement is the smtx column, which represents the number of times per interval that the kernel failed to immediately obtain a mutex when it asked for one. If this number begins to exceed about 250 times the number of configured processors, system time will begin to increase noticeably. Faster processors (e.g., UltraSPARC III or Pentium III Xeon) can cope with higher smtx values.
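A sketch of that rule of thumb, using per-CPU smtx figures like those in the mpstat output (the sample values and variable names are illustrative, not from any particular run):

```python
ncpus = 2
threshold = 250 * ncpus  # rule of thumb from the text

# smtx values for each CPU over one mpstat interval (illustrative values)
smtx = [1289, 1099]

total = sum(smtx)
if total > threshold:
    print("mutex contention likely: %d mutex spins vs. threshold %d"
          % (total, threshold))
```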
% mpstat 30
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0  182   0  353   269   54  161   24   13    5    0   940   11   5   5  78
  1  182   1  531   134  116  157   23   13    6    0   930   10   6   5  79
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0  112   0  365   259   55  141    6   12    2    0   486    6   3  10  81
  1  157   0  337    73   67  147    7   11    3    0   594    6   4   8  82
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0   97   0  598   229   26  135    6   11    2    0   389    5   3   5  87
  1  107   0   81    67   62  130    5   11    2    0   423    4   3   4  89
The best way to cope with this problem is to check with the vendor for any operating system upgrades that may have solved mutex contention issues, and then use the lockstat command to find out what mutexes are suffering. If you are consistently seeing very high processor utilization and the smtx values are high in your SMP system, replace your current processors with faster ones, rather than adding more!
To find suffering mutexes, lockstat uses a kernel lock monitoring facility. It must be invoked as root. The kernel can dynamically change its locks to collect information, then return to the normal, optimized locking code when the data collection is complete. lockstat can be invoked on a per-process basis, or on a systemwide basis by calling it on the sleep command. I went through some grief trying to find a simple way to cause a large number of mutex spins for illustration, and finally did so: [25]
[25] I set this up by running a number of infinitely looping C shell scripts equal to the number of configured processors on the system. I hope this is not actually what your system is doing.
% mpstat 10
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    1   0    2   104    0   19    0    0    0    0     26    0   0   0 100
  1    1   0    1     8    6   19    0    0    0    0     37    0   0   0 100
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0    0   109    2   40   22    2  956    0 106954   47  53   0   0
  1    0   0    0    23    7   43   15    2 1117    0 107760   48  52   0   0
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0    0   116    0   57   26    2 1289    0 107204   47  53   0   0
  1    0   0    0    24    3   45   21    2 1099    0 107199   47  53   0   0
It should be obvious that this system is having serious mutex problems. We can use lockstat to examine what the system is doing:
# lockstat sleep 10

Adaptive mutex spin: 19132 events

Count indv cuml rcnt     spin Lock          Caller
-------------------------------------------------------------------------------
 2716  14%  14% 1.00        1 flock_lock    cleanlocks+0x1c
 2388  12%  27% 1.00        3 0xe0452ab0    dev_get_dev_info+0x3d
 1905  10%  37% 1.00        3 0xe0452ab0    cdev_size+0x48
 1554   8%  45% 1.00        1 0xe07c9f30    vn_rele+0x22
 1307   7%  52% 1.00        5 stable_lock   specvp+0x8f
 1050   5%  57% 1.00        2 0xe0452ab0    mod_rele_dev_by_major+0x21
  995   5%  62% 1.00        2 0xe0452ab0    ddi_hold_installed_driver+0x36
  ...
-------------------------------------------------------------------------------
...
Unfortunately, lockstat is quite verbose by default. At first glance, however, it appears that the system is spending most of its time working with the flock_lock lock, which is filesystem-related. From this point, it's a matter of using truss to see which system calls in our application are generating these contentions.
If you have a suspicion as to what lock might be the one causing the problem, lockstat has the capability to search for a particular lock. This is accomplished with the -l lockname switch:
# lockstat -l flock_lock sleep 10

Adaptive mutex spin: 2594 events

Count indv cuml rcnt     spin Lock          Caller
-------------------------------------------------------------------------------
 2594 100% 100% 1.00        1 flock_lock    cleanlocks+0x1c
-------------------------------------------------------------------------------

Adaptive mutex block: 9 events

Count indv cuml rcnt     nsec Lock          Caller
-------------------------------------------------------------------------------
    9 100% 100% 1.00 64490487 flock_lock    cleanlocks+0x1c
-------------------------------------------------------------------------------
Solaris 2.6 and later versions support commands to help the system administrator manage processors. They are of somewhat limited use in most performance tuning applications, but they are good to know about.
The psrinfo command displays information concerning the configured processors. It provides particularly useful information when the -v switch is applied.
% /usr/sbin/psrinfo -v
Status of processor 0 as of: 09/20/99 16:24:34
  Processor has been on-line since 09/01/99 06:33:49.
  The sparc processor operates at 200 MHz,
        and has a sparc floating point processor.
Status of processor 1 as of: 09/20/99 16:24:34
  Processor has been on-line since 09/01/99 06:33:52.
  The sparc processor operates at 200 MHz,
        and has a sparc floating point processor.
This utility has proved useful when dealing with clients who were not sure what sort of hardware they had, but wanted to have it tuned. Similar information can be obtained by examining the contents of /proc/cpuinfo in Linux.
You can change the operational status of processors with the psradm command. A processor can have the following states:
An on-line processor does computation [26] and may be interrupted by the system's I/O devices.
[26] That is, an on-line processor processes processes.
An off-line processor does not do any computation and is not usually interruptible. In some cases, it may not be possible to disable interrupts for an off-line processor.
A no-intr processor performs computation, but is not interruptible.
At least one processor must be able to perform computation, and at least one processor must be able to be interrupted.
It is sometimes useful to create processor sets through psrset. A processor set allows you to bind a process to a particular group of processors, rather than just one. Processor sets are sometimes created by the system, in which case they will run both system processes and user processes; they can also be created by psrset, in which case they run only processes that have been bound to that set. Let's walk through an example of using psrset on a Sun Ultra Enterprise 3500 system, and then binding a process to that processor set.
The system I use for this example has eight 250 MHz UltraSPARC-II processors:
# psrinfo
 6      on-line   since 11/15/01 16:47:33
 7      on-line   since 11/15/01 16:47:40
 10     on-line   since 11/15/01 16:47:40
 11     on-line   since 11/15/01 16:47:40
 14     on-line   since 11/15/01 16:47:40
 15     on-line   since 11/15/01 16:47:40
 18     on-line   since 11/15/01 16:47:40
 19     on-line   since 11/15/01 16:47:40
You can use psrset -i to see what the defined processor sets on a system are (for a listing by processor instead of by set, you can use psrset -p ). By default, there are none -- all processes are in a general execution pool. Let's create a processor set with four processors (those numbered 14, 15, 18, and 19) using psrset -c :
# psrset -i
# psrset -c 14 15 18 19
created processor set 1
processor 14: was not assigned, now 1
processor 15: was not assigned, now 1
processor 18: was not assigned, now 1
processor 19: was not assigned, now 1
# psrset -i
user processor set 1: processors 14 15 18 19
# psrset -p
processor 14: 1
processor 15: 1
processor 18: 1
processor 19: 1
If you want to add more processors to this set, you can use psrset -a set-id processor-list ; similarly, you can remove them with psrset -r processor-list .
# psrset -i
user processor set 1: processors 14 15 18 19
# psrset -a 1 10 11
processor 10: was not assigned, now 1
processor 11: was not assigned, now 1
# psrset -i
user processor set 1: processors 14 15 18 19 10 11
# psrset -r 18 19
processor 18: was 1, now not assigned
processor 19: was 1, now not assigned
# psrset -i
user processor set 1: processors 14 15 10 11
You can completely remove a processor set with psrset -d set-id :
# psrset -i
user processor set 1: processors 14 15 10 11
# psrset -d 1
removed processor set 1
# psrset -i
#
If you are configuring a mixed-use system -- for example, you might have to support both software development and production computational codes on the same system -- you can use processor sets to segment your workload. For example, on an eight-processor system, you might create one two-processor set dedicated for compiles, one four-processor set for computational jobs, and leave the other two processors unassigned (and therefore available for interactive use). One thing that you might want to do in that case is to disallow the processors that are dedicated to compilation and computation from handling interrupts (e.g., for disk or network I/O). This can be done by running psrset -f set-id ; alternatively, you might want to set aside a few processors just for handling interrupts, in which case you can turn on interrupt handling for a set by using psrset -n set-id . By default, all processors are available to handle interrupts. For example, let's create a four-CPU processor set and disable interrupt handling on it:
# psrinfo
 6      on-line   since 11/15/01 16:47:33
 7      on-line   since 11/15/01 16:47:40
 10     on-line   since 11/15/01 16:47:40
 11     on-line   since 11/15/01 16:47:40
 14     on-line   since 11/15/01 18:26:23
 15     on-line   since 11/15/01 18:26:23
 18     on-line   since 11/15/01 18:26:23
 19     on-line   since 11/15/01 18:26:23
# psrset -i
# psrset -c 14 15 18 19
created processor set 1
processor 14: was not assigned, now 1
processor 15: was not assigned, now 1
processor 18: was not assigned, now 1
processor 19: was not assigned, now 1
# psrset -f 1
# psrinfo
 6      on-line   since 11/15/01 16:47:33
 7      on-line   since 11/15/01 16:47:40
 10     on-line   since 11/15/01 16:47:40
 11     on-line   since 11/15/01 16:47:40
 14     no-intr   since 11/15/01 18:27:09
 15     no-intr   since 11/15/01 18:27:09
 18     no-intr   since 11/15/01 18:27:09
 19     no-intr   since 11/15/01 18:27:09
Note that deleting a processor set that had interrupts disabled will not have the side effect of reenabling interrupts on its member processors. You can always reenable interrupts on a per-CPU basis with psradm -n processor-list .
Now that we've created a processor set, how can we run processes on it? There are essentially two ways: we can use psrset -e set-id command to run a command directly on the processor set, or we can use psrset -b pid-list to bind processes to a set. [27] For example, let's run a multithreaded application on processor set 1:
[27] Actually, it binds the underlying lightweight processes to the set (for more information on lightweight processes, see Section 3.3.2.1 earlier in this chapter).
# psrset -i
user processor set 1: processors 14 15 18 19
# psrset -e 1 mtsieve &
[1] 3701
# mpstat 3
...
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  6    0   0  747   401  301    0    0    0    0    0     0    0   0   0 100
  7    0   0    0   104  104   45    0    0    0    0    27    0   0   0 100
 10    0   0    0   100  100   13    0    0    0    0     9    0   0   0 100
 11    0   0    0   100  100   11    0    0    0    0     3    0   0   0 100
 14    0   0    0   111  100   11   11    0    0    0     0  100   0   0   0
 15    0   0    0   111  100   11   11    0    0    0     0  100   0   0   0
 18    0   0    0   111  100   11   11    0    0    0     0  100   0   0   0
 19    0   0    0   111  100   11   11    0    0    0     0  100   0   0   0
...
We can clearly see from the mpstat output that the CPUs in the processor set are completely saturated, while the other processors are totally idle.
There are very few tools to monitor the performance of peripheral device interconnects. By far the best is the Solaris tool busstat , which lets you monitor the state of certain performance counters that have been embedded in Sun hardware. Because of the high degree of hardware support that is required, the capabilities of busstat may change over time, and from platform to platform. These examples were gathered on a Sun Ultra Enterprise 3500 running Solaris 8 7/01.
The best way to start using busstat is to see what devices are available for monitoring. This can be done via busstat -l :
# busstat -l
Busstat Device(s): sbus0 sbus1 ac0 ac1 ac2 ac3 ac4
There are two sbus devices, which correspond to the two SBuses on this system's single SBus I/O card (see Section 3.5.1.4 earlier in this chapter). The ac devices correspond to the Address Controllers, which are on each installed board in the system. Each device has two performance instrumentation counters (PICs) that can count specific events. In order to get a list of the available events for each device, use busstat -e device :
# busstat -e sbus0
pic0
    dvma_stream_rd
    dvma_stream_wr
    dvma_const_rd
    dvma_const_wr
    dvma_tlb_misses
    dvma_stream_buf_mis
    dvma_cycles
    dvma_bytes_xfr
    interrupts
    upa_inter_nack
    pio_reads
    pio_writes
    sbus_reruns
    pio_cycles
pic1
    ...
Table 3-10 contains a description of the most important of these parameters.
| Event | Description |
|---|---|
| dvma_stream_rd | The number of DVMA streaming reads issued |
| dvma_stream_wr | The number of DVMA streaming writes issued |
| dvma_const_rd | The number of DVMA consistent reads issued |
| dvma_const_wr | The number of DVMA consistent writes issued |
| dvma_tlb_misses | The number of translation lookaside buffer (TLB) misses incurred during DVMA operations |
| dvma_stream_buf_mis | The number of DVMA streaming buffer read misses |
| dvma_cycles | The number of SBus cycles granted to DVMA operations |
| dvma_bytes_xfr | The number of bytes transferred across the SBus |
| interrupts | The number of interrupts generated by the SBus |
| upa_inter_nack | The number of nonacknowledged (on UPA) interrupts generated by the SBus |
| pio_reads | The number of programmed I/O reads issued |
| pio_writes | The number of programmed I/O writes issued |
| sbus_reruns | The number of SBus reruns |
| pio_cycles | The number of SBus cycles granted to PIO operations |
We can specify the events to be counted by using busstat -w device,pic0=event,pic1=event. These counters remain programmed until they are reset, and only the superuser can program them.
These are obviously extremely detailed events, and may not be of great interest to anyone except driver developers or hardware engineers. However, one interesting thing to look at is the number of bytes transferred across the SBus. For example, we can use busstat to watch the activity on an SBus containing a network interface that is attached to a SunRay network appliance: [28]
[28] The SunRay is completely responsible for transmitting screen redrawing instructions on the host system. It functions somewhat like an X terminal.
# busstat -w sbus1,pic0=dvma_cycles,pic1=dvma_bytes_xfr 1
time dev    event0          pic0      event1            pic1
1    sbus1  dvma_cycles     1449      dvma_bytes_xfr    1940
2    sbus1  dvma_cycles     1361      dvma_bytes_xfr    1800
...
12   sbus1  dvma_cycles     1328      dvma_bytes_xfr    1800
13   sbus1  dvma_cycles     3469      dvma_bytes_xfr    6424
Up until this point, the system is simply sitting at the dtlogin screen. The small amount of data transfer is probably due to the "blinking" of the cursor. We then log in by typing a valid username and password:
14   sbus1  dvma_cycles     7282      dvma_bytes_xfr    19812
15   sbus1  dvma_cycles     7620      dvma_bytes_xfr    21172
16   sbus1  dvma_cycles     1399      dvma_bytes_xfr    2352
At this point we have successfully logged in, and the CDE "Welcome to Solaris" screen appears, followed by the user's desktop environment ( twm and a few xterm processes):
17   sbus1  dvma_cycles     182874    dvma_bytes_xfr    590212
18   sbus1  dvma_cycles     1361      dvma_bytes_xfr    2004
19   sbus1  dvma_cycles     247957    dvma_bytes_xfr    814420
20   sbus1  dvma_cycles     2691      dvma_bytes_xfr    3688
21   sbus1  dvma_cycles     13542     dvma_bytes_xfr    25392
22   sbus1  dvma_cycles     7931      dvma_bytes_xfr    14248
After a little while, the SBus activity returns to near-zero levels, as there is only minimal graphical update activity occurring on the network:
31   sbus1  dvma_cycles     225       dvma_bytes_xfr    536
32   sbus1  dvma_cycles     597       dvma_bytes_xfr    760
Although this is a very limited example, it does show some of the powerful analysis made possible by busstat .
The cpustat and cputrack tools in Solaris can be used to obtain extremely detailed statistics about processor performance. cpustat generates system-wide statistics, whereas cputrack lets you gather data about a specific process.
You can obtain the list of available events via cpustat -h . Some events are only available on specific PICs. Table 3-11 summarizes the various events, along with what PICs they are available on.
| Event | Available PICs | Description |
|---|---|---|
| Cycle_cnt | 0, 1 | The number of accumulated clock cycles |
| Instr_cnt | 0, 1 | The number of completed instructions |
| Dispatch0_IC_Miss | 0 | The number of instruction cache misses |
| Dispatch0_mispred | 1 | The number of mispredicted branches (note that branch misprediction typically causes instructions in flight to be killed, so the number of pipeline bubbles is about twice as big as this count) |
| Dispatch0_storeBuf | 0 | The number of times that a store instruction was issued, but the store buffer was full and could not hold additional writes to memory |
| Dispatch0_FP_use | 1 | The number of times that an instruction depended on a floating-point result that was not yet available for reasons other than loading (that is, Dispatch0_FP_use and Load_use are mutually exclusive) |
| Load_use | 0 | The number of times that an instruction waiting to be executed required a load from memory that hadn't yet completed; stalls all instructions in the pipeline |
| Load_use_RAW | 1 | The number of times that a load-use hazard exists, but there is a read-after-write hazard on the oldest outstanding load; this indicates that load data is being delayed until an earlier store can be completed |
| IC_ref | 0 | The number of instruction cache references |
| IC_hit | 1 | The number of instruction cache hits |
| DC_rd | 0 | The number of data cache read references |
| DC_rd_hit | 1 | The number of data cache read hits |
| DC_wr | 0 | The number of data cache writes |
| DC_wr_hit | 1 | The number of data cache write hits |
| EC_ref | 0 | The number of E-Cache (L2 cache) references |
| EC_hit | 1 | The number of E-Cache hits |
| EC_write_hit_RDO | 0 | The number of E-Cache write hits that result in ownership being transferred |
| EC_wb | 1 | The number of E-Cache misses that incur writebacks to main memory |
| EC_snoop_inv | 0 | The number of E-Cache invalidations driven by bus snooping |
| EC_snoop_cb | 1 | The number of E-Cache copybacks driven by bus snooping |
| EC_rd_hit | 0 | The number of E-Cache read hits after a data cache miss |
| EC_ic_hit | 1 | The number of E-Cache read hits after an instruction cache miss |
Again, some of these counters refer to very low-level performance structures, and will primarily be of use to people who are intimately familiar with the architecture of their processor. For more information about the UltraSPARC processor family architecture, consult the user guides published by Sun Microelectronics at http://www.sun.com/microelectronics/manuals/index.html.
One thing that is often interesting to compute is the number of cycles per instruction (CPI). For this example, I draw upon the simple Sieve of Eratosthenes application source code (see Section 8.2.4.2), with the minor modification that it computes its results without displaying them. I ran an experiment to compare the number of cycles per instruction as a function of the compiler (Forte 6 Update 1) optimization level, gathering cputrack data for each process like this:
% cputrack -c pic0=Cycle_cnt,pic1=Instr_cnt sieve.o0
   time lwp      event        pic0        pic1
  1.046   1       tick   170075841   128955109
  2.056   1       tick   221955530   162910824
  3.056   1       tick   240994274    93274214
  4.056   1       tick   231824966    52868531
  5.056   1       tick   198733670    53220280
  6.056   1       tick    92304003    51971543
  6.522   1       exit  1192528514   564103611
We then compute, using the last line (the final statistics), the average number of cycles per instruction by dividing the pic0 entry (the number of cycles) by the pic1 entry (the number of instructions), giving a CPI of 2.114. As we increase the optimization level, we get the following results:
| Compiler flags | Total instructions | Cycles per instruction | Total runtime (seconds) |
|---|---|---|---|
| -xO0 | 564,103,381 | 2.114 | 6.507 |
| -xO2 | 168,642,699 | 5.022 | 5.480 |
| -xO3 | 137,109,359 | 5.929 | 5.275 |
| -fast -xO5 | 138,521,627 | 5.684 | 5.213 |
It's dangerous to overgeneralize these performance results, but it's interesting to watch the tremendous CPI jump as you move from no optimization to some optimization. Clearly, further optimization gives diminishing returns.
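The CPI arithmetic is just a division of the two counters on the final "exit" line; here it is spelled out using the figures from the -xO0 run above:

```python
# Final "exit" line of the cputrack run at -xO0:
# pic0 counted cycles, pic1 counted completed instructions.
cycles, instructions = 1192528514, 564103611

cpi = cycles / instructions
print("cycles per instruction: %.3f" % cpi)
```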
As we continue to further our understanding, we can use cputrack to try and answer some questions we might have about this application's runtime behavior. For example, we can compute the instruction cache hit rate for the unoptimized application:
% ptime cputrack -c pic0=IC_ref,pic1=IC_hit sieve.o0
   time lwp      event        pic0        pic1
...
  6.475   1       exit   179350516   179325008
Therefore, we can compute that we have an instruction cache hit rate of 99.986%, which is incredibly high. We can also determine the data cache read hit rate and data cache write hit rate:
% cputrack -c pic0=DC_rd,pic1=DC_rd_hit sieve.o0
   time lwp      event        pic0        pic1
...
  6.469   1       exit   128490011   128460760

real        6.509
user        0.038
sys         0.063
% cputrack -c pic0=DC_wr,pic1=DC_wr_hit sieve.o0
   time lwp      event        pic0        pic1
...
  6.471   1       exit    69216839    33654774
Finally, we end up with a data cache read hit rate of 99.977% and a write hit rate of 48.622%. At this point, things start to get interesting. We've been analyzing the unoptimized application, and our data cache write hit rate seems a bit low. In fact, we might wonder whether it goes up as the optimization level increases. When we run that experiment, we find that the most-optimized (and fastest) version has a write cache hit rate of only 0.031%. This seems confusing at first, but upon some consideration it makes sense: the application performs at peak rates when it does not write often to the same memory location. This is a good illustration of how confusing these low-level counters can be without a great deal of knowledge about the application under analysis.
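The hit-rate percentages quoted above come straight from the pic0/pic1 pairs; a small helper makes the calculation explicit:

```python
def hit_rate(refs, hits):
    """Percentage of references that hit, from a (pic0, pic1) counter pair."""
    return 100.0 * hits / refs

# Figures from the cputrack runs on the unoptimized sieve.
print("I-cache hit rate:       %.3f%%" % hit_rate(179350516, 179325008))
print("D-cache read hit rate:  %.3f%%" % hit_rate(128490011, 128460760))
print("D-cache write hit rate: %.3f%%" % hit_rate(69216839, 33654774))
```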