Concretely detecting whether a system is short of processor time is often more complicated than it first seems. Regular indications of CPU shortage can be attributed to many things: disk configuration problems, peripheral devices being bound to a single processor for interrupt management, processor starvation due to a lack of memory, etc. In a sense, the best way to tell when you're legitimately processor-bound is that you can exclude all other possibilities.
There is also a philosophical issue involved. Some people like seeing a low load average on their systems, or more than 50% of the processor idle. This makes some sense in terms of the system accommodating bursts of load activity. Similarly, some people like seeing a load average that approaches the number of processors, meaning the system is running full-tilt. This also makes some sense, because you are using everything you have paid for in terms of processor performance. My personal philosophy varies between the following: on a computational server, I would aim for full utilization of all processors; on a development system or a database server, I would aim for a degree of overcapacity to accommodate spikes in load.
There are quite a few tools for monitoring processor performance: the load average, vmstat, mpstat, prstat and/or top, and lockstat. There are also two tools in Solaris for managing processors in multiprocessor systems: psrinfo and psrset. We'll cover all of these in some detail.
The system understands a concept of the load average, which is expressed in the uptime command:
% uptime
  4:02pm  up 19 day(s),  9:28,  119 users,  load average: 1.22, 1.21, 1.25
The load average is represented as the average number of runnable jobs (the sum of the run queue length and the number of jobs currently running) over the last one, five, and fifteen minutes. This gives a quick estimate of how heavily loaded the system is, but it is not exclusively a measurement of processor load; a system that has a memory shortage will usually have a high load average.
The load average statistic is only meaningful when looked at in the context of the number of installed processors. As a general rule, if the load average is less than the number of installed processors, the system is not lacking processor power. For example, if the load average is 2.5 on a four-processor system, it is not an indication of a processor shortage.
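This rule of thumb is easy to sketch in code. The function name and the report format below are my own; the threshold is the one stated above (load average below the processor count means no shortage):

```python
def assess_load(load_avg, ncpus):
    """Apply the rule of thumb from the text: a load average below the
    number of installed processors does not indicate a processor shortage."""
    per_cpu = load_avg / ncpus
    if load_avg < ncpus:
        return "no processor shortage (%.2f runnable jobs per CPU)" % per_cpu
    return "possible processor shortage (%.2f runnable jobs per CPU)" % per_cpu

# The example from the text: load average 2.5 on a four-processor system.
print(assess_load(2.5, 4))
```

On Linux, the three load averages can be read directly from /proc/loadavg; on Solaris, uptime reports the same figures.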
One very powerful tool in our performance tuning arsenal is vmstat . I discuss it in pieces as we go along. Let's take a look at some sample vmstat output from a Solaris system (recall that vmstat takes as its first argument the time interval in seconds between displaying data, and that you must always discard the first line of output):
% vmstat 30
 procs     memory            page            disk          faults      cpu
 r b w    swap  free  re   mf pi po fr   de sr s0 s1 -- --  in   sy  cs us sy id
 0 0 0    5928  1920   0  328 45 25 37 1360  2  4  8  0  0 306 1682 299 10  5 85
 0 0 0 1202648 17152   0  904  8  0  0 1200  0  7 36  0  0 547 5632 683 18 15 67
 0 0 0 1202744 19600   0 1136  0  0  0 1040  0  2 26  0  0 575 4542 752 33 12 55
The first column (highlighted) provides the state of the system's three process queues. In order, they are the run queue (r), the runnable-but-blocked queue (b), and the runnable-but-swapped queue (w). We'll discuss the swap queue's relevance, or lack thereof, in Section 4.5.2.1.
The run queue counts the number of processes that are runnable but not currently running. In this example, the system's processor is not overworked. As a general rule, if the run queue is consistently four or more times the number of configured processors, you should probably consider adding processors.
The runnable-but-blocked queue counts the number of processes that are otherwise runnable, but cannot run because they are waiting for a block device transfer (generally a disk I/O operation).
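The queue checks described above reduce to looking at the first three fields of each vmstat data line. Here's a minimal sketch; the sample line and the processor count are hypothetical:

```python
ncpus = 4

# One data line of Solaris vmstat output; the first three fields are the
# r, b, and w queue lengths (hypothetical sample values).
line = "17 2 0 1202648 17152 0 904 8 0 0 1200 0 7 36 0 0 547 5632 683 18 15 67"
r, b, w = (int(f) for f in line.split()[:3])

# Rule of thumb from the text: a run queue consistently at four or more
# times the processor count suggests adding processors.
if r >= 4 * ncpus:
    print("run queue %d >= %d: consider adding processors" % (r, 4 * ncpus))
if b > 0:
    print("%d process(es) blocked waiting on block-device I/O" % b)
```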
The Linux vmstat looks slightly different, but displays the same information:
% vmstat 30
 procs                  memory    swap          io     system      cpu
 r b w  swpd  free  buff  cache  si so   bi  bo   in   cs  us sy id
 0 0 0 20068  9568 26524  50828   0  0    2   7    8    3   3  1  8
 0 0 0 20068  9568 26524  50828   0  0    0   0  115   22   8  4 89
 0 0 0 20068  9568 26524  50828   0  0    1   0  115   22   7  4 89
 0 0 0 20068  9568 26524  50828   0  0    0   0  114   20   7  4 89
Processor time is organized into four timed modes: system time , user time , I/O wait time , and idle time . [23] The idle time consists of what's left over when all other portions have had their fill. A program's normal operating state is user mode, but as it runs, it may generate requests for services that the kernel provides, such as I/O. These requests require the attention of the operating system, so the program switches into system mode, then returns to user mode when the request is complete. The time spent in these two modes is tabulated independently to give the user time and system time values, respectively. These two figures account for the majority of a process's execution time.
[23] Note that vmstat reports only the user time, system time, and idle time (wait time is summed in with idle time). In order to get separated values for wait time and idle time, use mpstat .
When a process waits for a block device data request to complete, it incurs I/O wait time. This brings up an important fact: when a process is blocked in this fashion, all idle time becomes wait time. If your idle time is zero, as reported by vmstat , the first thing you should check is if your system has I/O throughput problems.
Another good diagnostic is to look at system time relative to user time. On NFS servers, which we will discuss in Section 7.5, it is normal for the system time to be much higher than the user time, because NFS service is contained entirely inside the kernel.
You can obtain this sort of data on a per-process basis by means of the Solaris ptime command. Here is an example:
% /usr/proc/bin/ptime tar -cf miscellany.tar miscellany

real        7.402
user        0.044
sys         0.287
There are other commands, /bin/time and timex, which provide varying resolution. We'll discuss these commands in greater detail in Section 8.2.1.
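A quick calculation on the ptime figures from the tar run above shows how little of the elapsed time was actually spent on the CPU, and how the system time dwarfs the user time for this I/O-heavy command:

```python
# Figures reported by ptime for the tar example above.
real, user, sys_t = 7.402, 0.044, 0.287

cpu = user + sys_t
print("CPU time: %.3fs of %.3fs wall clock (%.1f%%)" % (cpu, real, 100 * cpu / real))
print("system/user ratio: %.1f" % (sys_t / user))
```

The remainder of the wall-clock time is almost entirely spent waiting on disk I/O.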
There is a fair bit of system time incurred here because of the volume of I/O requests that must be made through kernel facilities. There are many ways to get this sort of data on a systemwide level: vmstat and mpstat (which we'll discuss in more detail throughout this book) both provide it. The highlighted portion of the vmstat output is illustrative:
% vmstat 30
 procs     memory            page            disk          faults      cpu
 r b w    swap   free  re   mf pi po fr de sr s0 s1 -- --  in   sy  cs us sy id
 0 0 0    3296   2344   0  365 44 22 33  0  1  4  8  0  0 304  700 318 11  6 84
 0 0 0 1221816  99992   0  450  0  0  0  0  0  0  2  0  0 269 1791 465 19  6 76
 0 0 0 1222568 100688   0   83  0  0  0  0  0  0  8  0  0 254  591 389  2  1 98
This is a system under light load: the idle percentage is fairly high, and user time is greater than system time.
The tools we've discussed present valid data for multiprocessor systems, but in order to get detailed per-processor statistics under Solaris, we need to turn to a more powerful tool: mpstat .
This is some sample mpstat output from a two-processor system:
% mpstat 30
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0  182   0  353   269   54  161   24   13    5    0   940   11   5   5  78
  1  182   1  531   134  116  157   23   13    6    0   931   10   6   5  79
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0 2856   228   28   72    0    6    0    0   454    2   2   4  92
  1    4   0    0    39   39   77    0    7    0    0    50    0   0   4  96
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0   80   0  515   489  288  108    2   11    1    0   275    1   6  37  56
  1   61   0   55    46   46  113    2    9    2    0   270    0   1  37  62
Here are some of the more interesting statistics:
intr is the number of interrupts handled by that processor; ithr is the number of interrupts handled as threads.
migr is the number of thread migrations. A thread migration occurs when a thread switches from running on one processor to running on some other processor.
smtx is the number of spins on mutexes; in practical terms, this column represents the number of locks that were not acquired on the first attempt. We'll discuss this later in this chapter in Section 3.6.6.
syscl is the number of system calls.
usr , sys , wt , and idl are the user, system, wait, and idle percentages, on a per-processor basis. Note that vmstat provides one systemwide average only.
A popular tool for monitoring systemwide process activity is top, which provides a list of the most CPU-intensive processes running on a system, as well as some interesting statistics such as the load average, the number of processes, and the amount of memory and paging space in use. top is a very common and generally easy-to-understand application, so we won't discuss it further here. In the interest of completeness, this is what top output looks like:
% top
 10:27pm  up 24 days, 21:24,  9 users,  load average: 0.05, 0.06, 0.01
125 processes: 124 sleeping, 1 running, 0 zombie, 0 stopped
CPU states:  8.0% user,  9.1% system,  0.0% nice, 82.7% idle
Mem:  255596K av, 251056K used,   4540K free,     0K shrd,  3200K buff
Swap: 112444K av,      8K used, 112436K free                99112K cached

  PID USER     PRI  NI  SIZE  RSS SHARE STAT  LIB %CPU %MEM   TIME COMMAND
 5752 gdm       17   0  1296 1296  1052 R       0 13.8  0.5   0:01 top
 3829 root       9   0  1252 1252  1072 S       0  1.1  0.4   0:06 sshd
 5753 root      10   0  1492 1492  1292 S       0  1.1  0.5   0:00 sendmail
    1 root       8   0   520  520   452 S       0  0.0  0.2   0:48 init
Unfortunately, the tradeoff for top's updating process list and other nice features is that it tends to consume quite a bit of processor time. [24] This has been known to substantially skew the behaviour of the system under observation, so Solaris now ships with a tool called prstat that provides similar output. While it doesn't provide some of the nice information that top does about memory usage, it consumes very little processor time and doesn't skew the system's behaviour nearly as much:
[24] Rumor has it that this is where top got its name -- it often appears at the top of the CPU consumers list.
% prstat
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
  4585 gdm      1280K 1096K cpu0    48    0   0:00.00 0.4% prstat/1
  4554 gdm      1424K 1024K sleep   58    0   0:00.00 0.0% csh/1
   337 nobody     23M 6856K sleep   59    0   0:00.07 0.0% ns-slapd/15
  4552 root     1784K 1256K sleep   38    0   0:00.00 0.0% in.telnetd/1
 ...
Total: 33 processes, 143 lwps, load averages: 0.00, 0.00, 0.01
The upshot of this is that top is very nice, but you need to be careful about it skewing the system under observation because of its high overhead. On Solaris systems, prstat provides similar output at a much reduced cost in terms of processor time.
In addition to using mpstat for processor state information, we can also use it to provide meaningful data on interrupts and mutexes. The key measurement is the smtx column, which represents the number of times per interval that the kernel failed to immediately obtain a mutex when it asked for one. If this number begins to exceed about 250 times the number of configured processors, system time will begin to increase noticeably. Faster processors (e.g., UltraSPARC III or Pentium III Xeon) can cope with higher smtx values.
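A sketch of that rule of thumb, using per-CPU smtx figures like those in the mpstat output (the sample values and variable names are illustrative, not from any particular run):

```python
ncpus = 2
threshold = 250 * ncpus  # rule of thumb from the text

# smtx values for each CPU over one mpstat interval (illustrative values)
smtx = [1289, 1099]

total = sum(smtx)
if total > threshold:
    print("mutex contention likely: %d mutex spins vs. threshold %d"
          % (total, threshold))
```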
% mpstat 30
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0  182   0  353   269   54  161   24   13    5    0   940   11   5   5  78
  1  182   1  531   134  116  157   23   13    6    0   930   10   6   5  79
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0  112   0  365   259   55  141    6   12    2    0   486    6   3  10  81
  1  157   0  337    73   67  147    7   11    3    0   594    6   4   8  82
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0   97   0  598   229   26  135    6   11    2    0   389    5   3   5  87
  1  107   0   81    67   62  130    5   11    2    0   423    4   3   4  89
The best way to cope with this problem is to check with the vendor for any operating system upgrades that may have solved mutex contention issues, and then use the lockstat command to find out what mutexes are suffering. If you are consistently seeing very high processor utilization and the smtx values are high in your SMP system, replace your current processors with faster ones, rather than adding more!
To find suffering mutexes, lockstat uses a kernel lock monitoring facility. It must be invoked as root. The kernel can dynamically change its locks to collect information, then return to the normal, optimized locking code when the data collection is complete. lockstat can be invoked on a per-process basis, or on a systemwide basis by calling it on the sleep command. I went through some grief trying to find a simple way to cause a large number of mutex spins for illustration, and finally did so: [25]
[25] I set this up by running a number of infinitely looping C shell scripts equal to the number of configured processors on the system. I hope this is not actually what your system is doing.
% mpstat 10
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    1   0    2   104    0   19    0    0    0    0     26    0   0   0 100
  1    1   0    1     8    6   19    0    0    0    0     37    0   0   0 100
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0    0   109    2   40   22    2  956    0 106954   47  53   0   0
  1    0   0    0    23    7   43   15    2 1117    0 107760   48  52   0   0
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0    0   116    0   57   26    2 1289    0 107204   47  53   0   0
  1    0   0    0    24    3   45   21    2 1099    0 107199   47  53   0   0
It should be obvious that this system is having serious mutex problems. We can use lockstat to examine what the system is doing:
# lockstat sleep 10

Adaptive mutex spin: 19132 events

Count indv cuml rcnt     spin Lock          Caller
-------------------------------------------------------------------------------
 2716  14%  14% 1.00        1 flock_lock    cleanlocks+0x1c
 2388  12%  27% 1.00        3 0xe0452ab0    dev_get_dev_info+0x3d
 1905  10%  37% 1.00        3 0xe0452ab0    cdev_size+0x48
 1554   8%  45% 1.00        1 0xe07c9f30    vn_rele+0x22
 1307   7%  52% 1.00        5 stable_lock   specvp+0x8f
 1050   5%  57% 1.00        2 0xe0452ab0    mod_rele_dev_by_major+0x21
  995   5%  62% 1.00        2 0xe0452ab0    ddi_hold_installed_driver+0x36
  ...
-------------------------------------------------------------------------------
...
Unfortunately, lockstat is quite verbose by default. At first glance, however, it appears that the system is spending most of its time working with the flock_lock lock, which is filesystem-related. From this point, it's a matter of using truss to see which system calls in our application are generating these contentions.
If you have a suspicion as to what lock might be the one causing the problem, lockstat has the capability to search for a particular lock. This is accomplished with the -l lockname switch:
# lockstat -l flock_lock sleep 10

Adaptive mutex spin: 2594 events

Count indv cuml rcnt     spin Lock          Caller
-------------------------------------------------------------------------------
 2594 100% 100% 1.00        1 flock_lock    cleanlocks+0x1c
-------------------------------------------------------------------------------

Adaptive mutex block: 9 events

Count indv cuml rcnt     nsec Lock          Caller
-------------------------------------------------------------------------------
    9 100% 100% 1.00 64490487 flock_lock    cleanlocks+0x1c
-------------------------------------------------------------------------------
Solaris 2.6 and later versions support commands to help the system administrator manage processors. They are of somewhat limited use in most performance tuning applications, but they are good to know about.
The psrinfo command displays information concerning the configured processors. It provides particularly useful information when the -v switch is applied.
% /usr/sbin/psrinfo -v
Status of processor 0 as of: 09/20/99 16:24:34
  Processor has been on-line since 09/01/99 06:33:49.
  The sparc processor operates at 200 MHz,
        and has a sparc floating point processor.
Status of processor 1 as of: 09/20/99 16:24:34
  Processor has been on-line since 09/01/99 06:33:52.
  The sparc processor operates at 200 MHz,
        and has a sparc floating point processor.
This utility has proved useful when dealing with clients who were not sure what sort of hardware they had, but wanted to have it tuned. Similar information can be obtained by examining the contents of /proc/cpuinfo in Linux.
You can change the operational status of processors with the psradm command. A processor can have the following states:
An on-line processor does computation [26] and may be interrupted by the system's I/O devices.
[26] That is, an on-line processor processes processes.
An off-line processor does not do any computation and is not usually interruptible. In some cases, it may not be possible to disable interrupts for an off-line processor.
A no-intr processor performs computation, but is not interruptible.
At least one processor must be able to perform computation, and at least one processor must be able to be interrupted.
It is sometimes useful to create processor sets through psrset. A processor set allows you to bind a process to a particular group of processors, rather than just one. Processor sets are sometimes created by the system, in which case they will run both system processes and user processes; they can also be created by psrset, in which case they run only processes that have been bound to that set. Let's walk through an example of using psrset on a Sun Ultra Enterprise 3500 system, and then binding a process to that processor set.
The system I use for this example has eight 250 MHz UltraSPARC-II processors:
# psrinfo
 6      on-line   since 11/15/01 16:47:33
 7      on-line   since 11/15/01 16:47:40
 10     on-line   since 11/15/01 16:47:40
 11     on-line   since 11/15/01 16:47:40
 14     on-line   since 11/15/01 16:47:40
 15     on-line   since 11/15/01 16:47:40
 18     on-line   since 11/15/01 16:47:40
 19     on-line   since 11/15/01 16:47:40
You can use psrset -i to see what the defined processor sets on a system are (for a listing by processor instead of by set, you can use psrset -p ). By default, there are none -- all processes are in a general execution pool. Let's create a processor set with four processors (those numbered 14, 15, 18, and 19) using psrset -c :
# psrset -i
# psrset -c 14 15 18 19
created processor set 1
processor 14: was not assigned, now 1
processor 15: was not assigned, now 1
processor 18: was not assigned, now 1
processor 19: was not assigned, now 1
# psrset -i
user processor set 1: processors 14 15 18 19
# psrset -p
processor 14: 1
processor 15: 1
processor 18: 1
processor 19: 1
If you want to add more processors to this set, you can use psrset -a set-id processor-list ; similarly, you can remove them with psrset -r processor-list .
# psrset -i
user processor set 1: processors 14 15 18 19
# psrset -a 1 10 11
processor 10: was not assigned, now 1
processor 11: was not assigned, now 1
# psrset -i
user processor set 1: processors 14 15 18 19 10 11
# psrset -r 18 19
processor 18: was 1, now not assigned
processor 19: was 1, now not assigned
# psrset -i
user processor set 1: processors 14 15 10 11
You can completely remove a processor set with psrset -d set-id :
# psrset -i
user processor set 1: processors 14 15 10 11
# psrset -d 1
removed processor set 1
# psrset -i
#
If you are configuring a mixed-use system -- for example, you might have to support both software development and production computational codes on the same system -- you can use processor sets to segment your workload. For example, on an eight-processor system, you might create one two-processor set dedicated for compiles, one four-processor set for computational jobs, and leave the other two processors unassigned (and therefore available for interactive use). One thing that you might want to do in that case is to disallow the processors that are dedicated to compilation and computation from handling interrupts (e.g., for disk or network I/O). This can be done by running psrset -f set-id ; alternatively, you might want to set aside a few processors just for handling interrupts, in which case you can turn on interrupt handling for a set by using psrset -n set-id . By default, all processors are available to handle interrupts. For example, let's create a four-CPU processor set and disable interrupt handling on it:
# psrinfo
 6      on-line   since 11/15/01 16:47:33
 7      on-line   since 11/15/01 16:47:40
 10     on-line   since 11/15/01 16:47:40
 11     on-line   since 11/15/01 16:47:40
 14     on-line   since 11/15/01 18:26:23
 15     on-line   since 11/15/01 18:26:23
 18     on-line   since 11/15/01 18:26:23
 19     on-line   since 11/15/01 18:26:23
# psrset -i
# psrset -c 14 15 18 19
created processor set 1
processor 14: was not assigned, now 1
processor 15: was not assigned, now 1
processor 18: was not assigned, now 1
processor 19: was not assigned, now 1
# psrset -f 1
# psrinfo
 6      on-line   since 11/15/01 16:47:33
 7      on-line   since 11/15/01 16:47:40
 10     on-line   since 11/15/01 16:47:40
 11     on-line   since 11/15/01 16:47:40
 14     no-intr   since 11/15/01 18:27:09
 15     no-intr   since 11/15/01 18:27:09
 18     no-intr   since 11/15/01 18:27:09
 19     no-intr   since 11/15/01 18:27:09
Note that deleting a processor set that had interrupts disabled will not have the side effect of reenabling interrupts on its member processors. You can always reenable interrupts on a per-CPU basis with psradm -n processor-list .
Now that we've created a processor set, how can we run processes on it? There are essentially two ways: we can use psrset -e set-id command to run a command directly on the processor set, or we can use psrset -b pid-list to bind processes to a set. [27] For example, let's run a multithreaded application on processor set 1:
[27] Actually, it binds the underlying lightweight processes to the set (for more information on lightweight processes, see Section 3.3.2.1 earlier in this chapter).
# psrset -i
user processor set 1: processors 14 15 18 19
# psrset -e 1 mtsieve &
[1] 3701
# mpstat 3
...
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  6    0   0  747   401  301    0    0    0    0    0     0    0   0   0 100
  7    0   0    0   104  104   45    0    0    0    0    27    0   0   0 100
 10    0   0    0   100  100   13    0    0    0    0     9    0   0   0 100
 11    0   0    0   100  100   11    0    0    0    0     3    0   0   0 100
 14    0   0    0   111  100   11   11    0    0    0     0  100   0   0   0
 15    0   0    0   111  100   11   11    0    0    0     0  100   0   0   0
 18    0   0    0   111  100   11   11    0    0    0     0  100   0   0   0
 19    0   0    0   111  100   11   11    0    0    0     0  100   0   0   0
...
We can clearly see from the mpstat output that the CPUs in the processor set are completely saturated, while the other processors are totally idle.
There are very few tools to monitor the performance of peripheral device interconnects. By far the best is the Solaris tool busstat , which lets you monitor the state of certain performance counters that have been embedded in Sun hardware. Because of the high degree of hardware support that is required, the capabilities of busstat may change over time, and from platform to platform. These examples were gathered on a Sun Ultra Enterprise 3500 running Solaris 8 7/01.
The best way to start using busstat is to see what devices are available for monitoring. This can be done via busstat -l :
# busstat -l
Busstat Device(s): sbus0 sbus1 ac0 ac1 ac2 ac3 ac4
There are two sbus devices, which correspond to the two SBuses on this system's single SBus I/O card (see Section 3.5.1.4 earlier in this chapter). The ac devices correspond to the Address Controllers, which are on each installed board in the system. Each device has two performance instrumentation counters (PICs) that can count specific events. In order to get a list of the available events for each device, use busstat -e device :
# busstat -e sbus0
pic0
    dvma_stream_rd
    dvma_stream_wr
    dvma_const_rd
    dvma_const_wr
    dvma_tlb_misses
    dvma_stream_buf_mis
    dvma_cycles
    dvma_bytes_xfr
    interrupts
    upa_inter_nack
    pio_reads
    pio_writes
    sbus_reruns
    pio_cycles
pic1
    ...
Table 3-10 contains a description of the most important of these parameters.
| Event | Description |
|---|---|
| dvma_stream_rd | The number of DVMA streaming reads issued |
| dvma_stream_wr | The number of DVMA streaming writes issued |
| dvma_const_rd | The number of DVMA consistent reads issued |
| dvma_const_wr | The number of DVMA consistent writes issued |
| dvma_tlb_misses | The number of translation lookaside buffer (TLB) misses incurred during DVMA operations |
| dvma_stream_buf_mis | The number of DVMA streaming buffer read misses |
| dvma_cycles | The number of SBus cycles granted to DVMA operations |
| dvma_bytes_xfr | The number of bytes transferred across the SBus |
| interrupts | The number of interrupts generated by the SBus |
| upa_inter_nack | The number of nonacknowledged (on UPA) interrupts generated by the SBus |
| pio_reads | The number of programmed I/O reads issued |
| pio_writes | The number of programmed I/O writes issued |
| sbus_reruns | The number of SBus reruns |
| pio_cycles | The number of SBus cycles granted to PIO operations |
We can specify the events to be counted by using busstat -w device,pic0=event,pic1=event. These counters remain programmed until they are reset, and only the superuser can program them.
These are obviously extremely detailed events, and may not be of great interest to anyone except driver developers or hardware engineers. However, one interesting thing to look at is the number of bytes transferred across the SBus. For example, we can use busstat to watch the activity on an SBus containing a network interface that is attached to a SunRay network appliance: [28]
[28] The SunRay is completely responsible for transmitting screen redrawing instructions on the host system. It functions somewhat like an X terminal.
# busstat -w sbus1,pic0=dvma_cycles,pic1=dvma_bytes_xfr 1
time dev    event0          pic0      event1            pic1
1    sbus1  dvma_cycles     1449      dvma_bytes_xfr    1940
2    sbus1  dvma_cycles     1361      dvma_bytes_xfr    1800
...
12   sbus1  dvma_cycles     1328      dvma_bytes_xfr    1800
13   sbus1  dvma_cycles     3469      dvma_bytes_xfr    6424
Up until this point, the system is simply sitting at the dtlogin screen. The small amount of data transfer is probably due to the "blinking" of the cursor. We then log in by typing a valid username and password:
14   sbus1  dvma_cycles     7282      dvma_bytes_xfr    19812
15   sbus1  dvma_cycles     7620      dvma_bytes_xfr    21172
16   sbus1  dvma_cycles     1399      dvma_bytes_xfr    2352
At this point we have successfully logged in, and the CDE "Welcome to Solaris" screen appears, followed by the user's desktop environment ( twm and a few xterm processes):
17   sbus1  dvma_cycles     182874    dvma_bytes_xfr    590212
18   sbus1  dvma_cycles     1361      dvma_bytes_xfr    2004
19   sbus1  dvma_cycles     247957    dvma_bytes_xfr    814420
20   sbus1  dvma_cycles     2691      dvma_bytes_xfr    3688
21   sbus1  dvma_cycles     13542     dvma_bytes_xfr    25392
22   sbus1  dvma_cycles     7931      dvma_bytes_xfr    14248
After a little while, the SBus activity returns to near-zero levels, as there is only minimal graphical update activity occurring on the network:
31   sbus1  dvma_cycles     225       dvma_bytes_xfr    536
32   sbus1  dvma_cycles     597       dvma_bytes_xfr    760
Although this is a very limited example, it does show some of the powerful analysis made possible by busstat .
The cpustat and cputrack tools in Solaris can be used to obtain extremely detailed statistics about processor performance. cpustat generates system-wide statistics, whereas cputrack lets you gather data about a specific process.
You can obtain the list of available events via cpustat -h . Some events are only available on specific PICs. Table 3-11 summarizes the various events, along with what PICs they are available on.
| Event | Available PICs | Description |
|---|---|---|
| Cycle_cnt | 0, 1 | The number of accumulated clock cycles |
| Instr_cnt | 0, 1 | The number of completed instructions |
| Dispatch0_IC_Miss | 0 | The number of instruction cache misses |
| Dispatch0_mispred | 1 | The number of mispredicted branches (note that branch misprediction typically causes instructions in flight to be killed, so the number of pipeline bubbles is about twice as big as this count) |
| Dispatch0_storeBuf | 0 | The number of times that a store instruction was issued, but the store buffer was full and could not hold additional writes to memory |
| Dispatch0_FP_use | 1 | The number of times that an instruction depended on a floating-point result that was not yet available for reasons other than loading (that is, Dispatch0_FP_use and Load_use are mutually exclusive) |
| Load_use | 0 | The number of times that an instruction waiting to be executed required a load from memory that hadn't yet completed; stalls all instructions in the pipeline |
| Load_use_RAW | 1 | The number of times that a load-use hazard exists, but there is a read-after-write hazard on the oldest outstanding load; this indicates that load data is being delayed until an earlier store can be completed |
| IC_ref | 0 | The number of instruction cache references |
| IC_hit | 1 | The number of instruction cache hits |
| DC_rd | 0 | The number of data cache read references |
| DC_rd_hit | 1 | The number of data cache read hits |
| DC_wr | 0 | The number of data cache writes |
| DC_wr_hit | 1 | The number of data cache write hits |
| EC_ref | 0 | The number of E-Cache (L2 cache) references |
| EC_hit | 1 | The number of E-Cache hits |
| EC_write_hit_RDO | 0 | The number of E-Cache write hits that result in ownership being transferred |
| EC_wb | 1 | The number of E-Cache misses that incur writebacks to main memory |
| EC_snoop_inv | 0 | The number of E-Cache invalidations driven by bus snooping |
| EC_snoop_cb | 1 | The number of E-Cache copybacks driven by bus snooping |
| EC_rd_hit | 0 | The number of E-Cache read hits after a data cache miss |
| EC_ic_hit | 1 | The number of E-Cache read hits after an instruction cache miss |
Again, some of these counters refer to very low-level performance structures, and will primarily be of use to people who are intimately familiar with the architecture of their processor. For more information about the UltraSPARC processor family architecture, consult the user guides published by Sun Microelectronics at http://www.sun.com/microelectronics/manuals/index.html.
One thing that is often interesting to compute is the number of cycles per instruction (CPI). For this example, I draw upon the simple Sieve of Eratosthenes application source code (see Section 8.2.4.2), with the minor modification that it computes its results without displaying them. I ran an experiment to compare the number of cycles per instruction as a function of the compiler (Forte 6 Update 1) optimization level, gathering cputrack data for each process like this:
% cputrack -c pic0=Cycle_cnt,pic1=Instr_cnt sieve.o0
   time lwp      event        pic0        pic1
  1.046   1       tick   170075841   128955109
  2.056   1       tick   221955530   162910824
  3.056   1       tick   240994274    93274214
  4.056   1       tick   231824966    52868531
  5.056   1       tick   198733670    53220280
  6.056   1       tick    92304003    51971543
  6.522   1       exit  1192528514   564103611
We then compute, using the last line (the final statistics), the average number of cycles per instruction by dividing the pic0 entry (the number of cycles) by the pic1 entry (the number of instructions), giving a CPI of 2.114. As we increase the optimization level, we get the following results:
| Compiler flags | Total instructions | Cycles per instruction | Total runtime (seconds) |
|---|---|---|---|
| -xO0 | 564,103,381 | 2.114 | 6.507 |
| -xO2 | 168,642,699 | 5.022 | 5.480 |
| -xO3 | 137,109,359 | 5.929 | 5.275 |
| -fast -xO5 | 138,521,627 | 5.684 | 5.213 |
It's dangerous to overgeneralize these performance results, but it's interesting to watch the tremendous CPI jump as you move from no optimization to some optimization. Clearly, further optimization gives diminishing returns.
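The CPI arithmetic is just a division of the two counters on the final "exit" line; here it is spelled out using the figures from the -xO0 run above:

```python
# Final "exit" line of the cputrack run at -xO0:
# pic0 counted cycles, pic1 counted completed instructions.
cycles, instructions = 1192528514, 564103611

cpi = cycles / instructions
print("cycles per instruction: %.3f" % cpi)
```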
As we continue to further our understanding, we can use cputrack to try and answer some questions we might have about this application's runtime behavior. For example, we can compute the instruction cache hit rate for the unoptimized application:
% ptime cputrack -c pic0=IC_ref,pic1=IC_hit sieve.o0
   time lwp      event        pic0        pic1
...
  6.475   1       exit   179350516   179325008
Therefore, we can compute that we have an instruction cache hit rate of 99.986%, which is incredibly high. We can also determine the data cache read hit rate and data cache write hit rate:
% cputrack -c pic0=DC_rd,pic1=DC_rd_hit sieve.o0
   time lwp      event        pic0        pic1
...
  6.469   1       exit   128490011   128460760

real        6.509
user        0.038
sys         0.063
% cputrack -c pic0=DC_wr,pic1=DC_wr_hit sieve.o0
   time lwp      event        pic0        pic1
...
  6.471   1       exit    69216839    33654774
Finally, we end up with a data cache read hit rate of 99.977% and a write hit rate of 48.622%. At this point, things start to get interesting. We've been analyzing the unoptimized application, and our data cache write hit rate seems a bit low. In fact, we might wonder whether it goes up as the optimization level increases. When we run that experiment, we find that the most-optimized (and fastest) version has a write cache hit rate of only 0.031%. This seems confusing at first, but upon some consideration it makes sense: the application performs at peak rates when it does not write often to the same memory location. This is a good illustration of how confusing these low-level counters can be without a great deal of knowledge about the application under analysis.
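The hit-rate percentages quoted above come straight from the pic0/pic1 pairs; a small helper makes the calculation explicit:

```python
def hit_rate(refs, hits):
    """Percentage of references that hit, from a (pic0, pic1) counter pair."""
    return 100.0 * hits / refs

# Figures from the cputrack runs on the unoptimized sieve.
print("I-cache hit rate:       %.3f%%" % hit_rate(179350516, 179325008))
print("D-cache read hit rate:  %.3f%%" % hit_rate(128490011, 128460760))
print("D-cache write hit rate: %.3f%%" % hit_rate(69216839, 33654774))
```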