Section 2.11. CPU Run Queue Latency | Solaris Performance and Tools: DTrace and MDB Techniques for Solaris 10 and OpenSolaris

2.11. CPU Run Queue Latency

The microstate accounting system maintains accurate time counters for threads as well as CPUs. Thread-based microstate accounting tracks several meaningful states per thread in addition to user and system time, which include trap time, lock time, sleep time, and latency time. The process statistics tool, prstat, reports the per-thread microstates for user processes.

$ prstat -mL    PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID  25644 rmc       98 1.5 0.0 0.0 0.0 0.0 0.0 0.1   0  36 693   0 filebench/2  25660 rmc       98 1.7 0.1 0.0 0.0 0.0 0.0 0.1   2  44 693   0 filebench/2  25650 rmc       98 1.4 0.1 0.0 0.0 0.0 0.0 0.1   0  45 699   0 filebench/2  25655 rmc       98 1.4 0.1 0.0 0.0 0.0 0.0 0.2   0  46 693   0 filebench/2  25636 rmc       98 1.6 0.1 0.0 0.0 0.0 0.0 0.2   1  50 693   0 filebench/2  25651 rmc       98 1.6 0.1 0.0 0.0 0.0 0.0 0.2   0  54 693   0 filebench/2  25656 rmc       98 1.5 0.1 0.0 0.0 0.0 0.0 0.2   0  60 693   0 filebench/2  25639 rmc       98 1.5 0.1 0.0 0.0 0.0 0.0 0.2   1  61 693   0 filebench/2  25634 rmc       98 1.3 0.1 0.0 0.0 0.0 0.0 0.4   0  63 693   0 filebench/2  25654 rmc       98 1.3 0.1 0.0 0.0 0.0 0.0 0.4   0  67 693   0 filebench/2  25659 rmc       98 1.7 0.1 0.0 0.0 0.0 0.0 0.4   1  68 693   0 filebench/2  25647 rmc       98 1.5 0.1 0.0 0.0 0.0 0.0 0.4   0  73 693   0 filebench/2  25648 rmc       98 1.6 0.1 0.0 0.0 0.0 0.3 0.2   2  48 693   0 filebench/2  25643 rmc       98 1.6 0.1 0.0 0.0 0.0 0.0 0.5   0  75 693   0 filebench/2  25642 rmc       98 1.4 0.1 0.0 0.0 0.0 0.0 0.5   0  80 693   0 filebench/2  25638 rmc       98 1.4 0.1 0.0 0.0 0.0 0.0 0.6   0  76 693   0 filebench/2  25657 rmc       97 1.8 0.1 0.0 0.0 0.0 0.4 0.3   6  64 693   0 filebench/2  25646 rmc       97 1.7 0.1 0.0 0.0 0.0 0.0 0.6   6  83 660   0 filebench/2  25645 rmc       97 1.6 0.1 0.0 0.0 0.0 0.0 0.9   0  55 693   0 filebench/2  25652 rmc       97 1.7 0.2 0.0 0.0 0.0 0.0 0.9   2 106 693   0 filebench/2  25658 rmc       97 1.5 0.1 0.0 0.0 0.0 0.0 1.0   0  72 693   0 filebench/2  25637 rmc       97 1.7 0.1 0.0 0.0 0.0 0.3 0.6   4  95 693   0 filebench/2 Total: 91 processes, 510 lwps, load averages: 28.94, 28.66, 24.39

By specifying the -m (show microstates) and -L (show per-thread) options, we can observe the per-thread microstates. These microstates represent a time-based summary broken into percentages for each thread. The columns USR tHRough LAT sum to 100% of the time spent for each thread during the prstat sample. The important microstates for CPU utilization are USR, SYS, and LAT. The USR and SYS columns are the user and system time that this thread spent running on the CPU. The LAT (latency) column is the amount of time spent waiting for CPU. A non-zero number means there was some queuing for CPU resources. This is an extremely useful metricwe can use it to estimate the potential speedup for a thread if more CPU resources are added, assuming no other bottlenecks obstruct the way. In our example, we can see that on average the filebench tHReads are waiting for CPU about 0.2% of the time, so we can conclude that CPU resources for this system are not constrained.

Another example shows what we would observe when the system is CPU-resource constrained. In this example, we can see that on average each thread is waiting for CPU resource about 80% of the time.

$ prstat -mL    PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID  25765 rmc       22 0.3 0.1 0.0 0.0 0.0 0.0  77   0  42 165   0 filebench/2  25833 rmc       22 0.3 0.3 0.0 0.0 0.0 0.0  77   0 208 165   0 filebench/2  25712 rmc       20 0.2 0.1 0.0 0.0 0.0 0.0  80   0  53 132   0 filebench/2  25758 rmc       20 0.3 0.1 0.0 0.0 0.0 0.0  80   0  84 148   0 filebench/2  25715 rmc       20 0.3 0.1 0.0 0.0 0.0 0.0  80   0  56 132   0 filebench/2  25812 rmc       19 0.2 0.1 0.0 0.0 0.0 0.0  81   0  50 132   0 filebench/2  25874 rmc       19 0.2 0.0 0.0 0.0 0.0 0.0  81   0  22 132   0 filebench/2  25842 rmc       19 0.2 0.2 0.0 0.0 0.0 0.0  81   1  92 132   0 filebench/2  25732 rmc       19 0.2 0.1 0.0 0.0 0.0 0.0  81   0  54  99   0 filebench/2  25714 rmc       18 0.3 0.1 0.0 0.0 0.0 0.0  81   0  84 165   0 filebench/2  25793 rmc       18 0.3 0.1 0.0 0.0 0.0 0.0  81   0  30 132   0 filebench/2  25739 rmc       18 0.3 0.3 0.0 0.0 0.0 0.0  81   0 150 115   0 filebench/2  25849 rmc       18 0.3 0.0 0.0 0.0 0.0 0.0  81   1  19 132   0 filebench/2  25788 rmc       18 0.2 0.1 0.0 0.0 0.0 0.0  81   0  77  99   0 filebench/2  25760 rmc       18 0.2 0.0 0.0 0.0 0.0 0.0  82   0  26 132   0 filebench/2  25748 rmc       18 0.3 0.1 0.0 0.0 0.0 0.0  82   0  58 132   0 filebench/2  25835 rmc       18 0.3 0.1 0.0 0.0 0.0 0.0  82   0  65 132   0 filebench/2  25851 rmc       18 0.2 0.1 0.0 0.0 0.0 0.0  82   0  87  99   0 filebench/2  25811 rmc       18 0.3 0.2 0.0 0.0 0.0 0.0  82   0 129 132   0 filebench/2  25767 rmc       18 0.2 0.1 0.0 0.0 0.0 0.0  82   1  25 132   0 filebench/2  25740 rmc       18 0.3 0.2 0.0 0.0 0.0 0.0  82   0 118 132   0 filebench/2  25770 rmc       18 0.2 0.1 0.0 0.0 0.0 0.0  82   0  68 132   0 filebench/2 Total: 263 processes, 842 lwps, load averages: 201.45, 192.26, 136.16

We can further investigate which threads are consuming CPU within each process by directing prstat to examine a specific process.

$ prstat -Lm -p 25691    PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID  25691 rmc       17 0.2 0.2 0.0 0.0 0.0 0.0  83   0  74  99   0 filebench/2  25691 rmc      0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0 filebench/1

The example shows us that thread number two in the target process is using the most CPU, and spending 83% of its time waiting for CPU. We can further look at information about thread number two with the pstack command.

$ pstack 25691/2 25691:  shadow -a shadow -i 2 -s ffffffff10000000 -m /var/tmp/fbenchJDailY -----------------  lwp# 2 / thread# 2  --------------------  000000010001ae90 flowoplib_hog (30d40, ffffffff6518dc60, 30d40, ffffffff37352c88, 1, 2570d) + 68  00000001000194a4 flowop_start (ffffffff37352c88, 0, 1, 0, 1, 1000888b0) + 408  ffffffff7e7ccea0 _lwp_start (0, 0, 0, 0, 0, 0)

In this example, we've taken a snapshot of the stack of thread number two of our target process. At the time the snapshot was taken, we can see that the function flowop_start was calling flowoplib_hog. It's sometimes worth taking several snapshots to see if a pattern is exhibited. DTrace can analyze this further.