Section 8.2. cpustat Command

8.2. `cpustat` Command

The cpustat command monitors the CPU Performance Counters (CPCs), which provide performance details for the CPU hardware caches. These types of hardware counters are known as Performance Instrumentation Counters, or PICs, which also exist on other devices. The PICs are programmable and record statistics for different events (event is a deliberate term). For example, they can be programmed to track statistics for CPU cache events.

A typical UltraSPARC system might provide two PICs, each of which can be programmed to monitor one event from a list of around twenty. An example of an event is an E-cache hit, the number of which could be counted by a PIC.

Which CPU caches can be measured depends on the type of CPU. Different CPU types not only can have different caches but also can have different available events that the PICs can monitor. It is possible that a CPU could contain a cache with no events associated with itleaving us with no way to measure cache performance.

The following example demonstrates the use of cpustat to measure E-cache (Level 2 cache) events on an UltraSPARC IIi CPU.

# cpustat -c pic0=EC_ref,pic1=EC_hit 1 5    time cpu event      pic0      pic1   1.005   0  tick     66931     52598   2.005   0  tick     67871     52569   3.005   0  tick     65003     50907   4.005   0  tick     64793     50958   5.005   0  tick     64574     50904   5.005   1 total    329172    257936

The cpustat command has a -c eventspec option to configure which events the PICs should monitor. We set pic0 to monitor EC_ref, which is E-cache references; and we set pic1 to monitor EC_hit, which is E-cache hits.

8.2.1. Cache Hit Ratio, Cache Misses

If both the cache references and hits are available, as with the UltraSPARC IIi CPU in the previous example, you can calculate the cache hit ratio. For that calculation you could also use cache misses and hits, which some CPU types provide. The calculations are fairly straightforward:

cache hit ratio = cache hits / cache references

cache hit ratio = cache hits / (cache hits + cache misses)

A higher cache hit ratio improves the performance of applications because the latency incurred when main memory is accessed through memory buses is obviated. The cache hit ratio may also indicate the pattern of activity; a low cache hit ratio may indicate a hot spotwhere frequently accessed memory locations map to the same cache location, causing frequently used data to be flushed.

Since satisfying each cache miss incurs a certain time cost, the volume of cache misses may be of more interest than the cache hit ratio. The number of misses can more directly affect application performance than does changing percent hit ratios since the number of misses is proportional to the total time penalty.

Both cache hit ratios and cache misses can be calculated with a little awk, as the following script, called ecache, demonstrates.^[1]

^[1] This script is based on E-cache from the freeware CacheKit (Brendan Gregg). See the Cache-Kit for scripts that support other CPU types and scripts that measure I- and D-cache activity.

#!/usr/bin/sh # # ecache - print E$ misses and hit ratio for UltraSPARC IIi CPUs. # # USAGE: ecache [interval [count]]      # by default, interval is 1 sec cpustat -c pic0=EC_ref,pic1=EC_hit ${1-1} $2 | awk '         BEGIN { pagesize = 20; lines = pagesize }         lines >= pagesize {            lines = 0            printf("%8s %3s %5s %9s %9s %9s %7s\n",\               "E$  time", "cpu", "event", "total", "hits", "miss", "%hit")         }         $1 !~ /time/ {            total = $4            hits = $5            miss = total - hits            ratio = 100 * hits / total            printf("%8s %3s %5s %9s %9s %9s %7.2f\n",\               $1, $2, $3, total, hits, miss, ratio)            lines++         } '

This script is verbose to illustrate the calculations performed, in particular, using extra named variables.^[2] nawk or perl would also be suitable for postprocessing the output of cpustat, which itself reads the PICs by using the libcpc library, and binding a thread to each CPU.

^[2] A one-liner version to add just the %hit column is as follows:

 # cpustat -nc pic0=EC_ref,pic1=EC_hit 1 5 | awk '{ printf "%s %.2f\n",$0,$5*100/$4 }'

The following example demonstrates the extra columns that ecache prints.

# ecache 1 5 E$  time cpu event     total      hits      miss    %hit    1.013   0  tick     65856     51684     14172   78.48    2.013   0  tick     71511     55793     15718   78.02    3.013   0  tick     69051     54203     14848   78.50    4.013   0  tick     69878     55082     14796   78.83    5.013   0  tick     68665     53873     14792   78.46    5.013   1 total    344961    270635     74326   78.45

This tool measures the volume of cache misses (miss) and the cache hit ratio (%hit) achieved for UltraSPARC II CPUs.

8.2.2. Listing PICs and Events

The -h option to cpustat lists the available events for a CPU type and the PICs that can monitor them.

# cpustat -h Usage:         cpustat [-c events] [-p period] [-nstD] [interval [count]]         -c events specify processor events to be monitored         -n        suppress titles         -p period cycle through event list periodically         -s        run user soaker thread for system-only events         -t        include %tick register         -D        enable debug mode         -h        print extended usage information         Use cputrack(1) to monitor per-process statistics.         CPU performance counter interface:  UltraSPARC I&II         event specification syntax:         [picn=]<eventn>[,attr[n][=<val>]][,[picn=]<eventn>[,attr[n][=<val>]],...]         event0:  Cycle_cnt Instr_cnt Dispatch0_IC_miss IC_ref DC_rd DC_wr                  EC_ref EC_snoop_inv Dispatch0_storeBuf Load_use                  EC_write_hit_RDO EC_rd_hit         event1:  Cycle_cnt Instr_cnt Dispatch0_mispred EC_wb EC_snoop_cb                  Dispatch0_FP_use IC_hit DC_rd_hit DC_wr_hit Load_use_RAW                  EC_hit EC_ic_hit         attributes: nouser sys         See the "UltraSPARC I/II User's Manual" (Part No. 802-7220-02) for         descriptions of these events. Documentation for Sun processors can         be found at: http://www.sun.com/processors/manuals

The -h output lists the events that can be monitored and finishes by referring to the reference manual for this CPU. These invaluable manuals discuss the CPU caches in detail and explain what the events really mean.

In this example of cpustat -h, the event specification syntax shows that you can set picn to measure events from eventn. For example, you can set pic0 to IC_ref and pic1 to IC_hit; but not the other way around. The output also indicates that this CPU type provides only two PICs and so can measure only two events at the same time.

8.2.3. PIC Examples: UltraSPARC IIi

We chose the UltraSPARC IIi CPU for the preceding examples because it provides a small collection of fairly straightforward PICs. Understanding this CPU type is a good starting point before we move on to more difficult CPUs. For a full reference for this CPU type, see Appendix B of the UltraSPARC I/II User's Manual.^[3]

^[3] This manual is available at http://www.sun.com/processors/manuals/805-0087.pdf.

The UltraSPARC IIi provides two 32-bit PICs, which are joined as a 64-bit register. The 32-bit counters could wrap around, especially for longer sample intervals. The 64-bit Performance Control Register (PCR) configures those events (statistics) the two PICs will contain. Only one invocation of cpustat (or cputrack) at a time is possible, since there is only one set of PICs to share.

The available events for measuring CPU cache activity are listed in Table 8.1. This is from the User's Manual, where you can find a listing for all events.

Table 8.1. UltraSPARC IIi CPU Cache Events
Event	PICs	Description
`IC_ref`	PIC0	I-cache references; I-cache references are fetches of up to four instructions from an aligned block of eight instructions. I-cache references are generally prefetches and do not correspond exactly to the instructions executed.
`IC_hit`	PIC1	I-cache hits.
`DC_rd`	PIC0	D-cache read references (including accesses that subsequently trap); non-D-cacheable accesses are not counted. Atomic, block load, "internal" and "external" bad ASIs, quad precision LDD, and MEMBAR instructions also fall into this class.
`DC_rd_hit`	PIC1	D-cache read hits are counted in one of two places: When they access the D-cache tags and do not enter the load buffer (because it is already empty) When they exit the load buffer (because of a D-cache miss or a nonempty load buffer)
`DC_wr`	PIC0	D-cache write references (including accesses that subsequently trap); non-D-cacheable accesses are not counted.
`DC_wr_hit`	PIC1	D-cache write hits.
`EC_ref`	PIC0	Total E-cache references; noncacheable accesses are not counted.
`EC_hit`	PIC1	total E-cache hits.
`EC_write_hit_RDO`	PIC0	E-cache hits that do a read for ownership of a UPA transaction.
`EC_wb`	PIC1	E-cache misses that do writebacks.
`EC_snoop_inv`	PIC0	E-cache invalidates from the following UPA transactions: S_INV_REQ, S_CPI_REQ.
`EC_snoop_cb`	PIC1	E-cache snoop copybacks from the following UPA transactions: S_CPB_REQ, S_CPI_REQ, S_CPD_REQ, S_CPB_MSI_REQ.
`EC_rd_hit`	PIC0	E-cache read hits from D-cache misses.
`EC_ic_hit`	PIC1	E-cache read hits from I-cache misses.

Reading through the descriptions will reveal many subtleties you need to consider to understand these events. For example, some activity is not cacheable and so does not show up in event statistics for that cache. This includes block loads and block stores, which are not sent to the E-cache since it is likely that this data will be touched only once. You should consider such a point if an application experienced memory latency not explained by the E-cache miss statistics alone.

8.2.4. PIC Examples: The UltraSPARC T1 Processor

Each of the 32 UltraSPARC T1 strands has a set of hardware performance counters that can be monitored using the cpustat(1M) command. cpustat can collect two counters in parallel, the second always being the instruction count. For example, to collect iTLB misses and instruction counts for every strand on the chip, type the following:

# /usr/sbin/cpustat -c pic0=ITLB_miss,pic1=Instr_cnt,sys 1 10 time cpu event pic0 pic1 2.019 0 tick 6 186595695 # pic0=ITLB_miss,sys,pic1=Instr_cnt,sys 2.089 1 tick 7 192407038 # pic0=ITLB_miss,sys,pic1=Instr_cnt,sys 2.039 2 tick 49 192237411 # pic0=ITLB_miss,sys,pic1=Instr_cnt,sys 2.049 3 tick 15 190609811 # pic0=ITLB_miss,sys,pic1=Instr_cnt,sys ......

Both a pic0 and pic1 register must be specified. ITLB_miss is used in the preceding example, although instruction counts are only of interest in this instance.

The performance counters indicate that each strand is executing about 190 million instructions per second. To determine how many instructions are executing per core, aggregate counts from four strands. Strands zero, one, two, and three are in the first core, strands four, five, six, and seven are in the second core, and so on. The preceding example indicates that the system is executing about 760 million instructions per core per second. If the processor is executing at 1.2 Gigahertz, each core can execute a maximum of 1200 million instructions per second, yielding an efficiency rating of 0.63. To achieve maximum throughput, maximize the number of instructions per second on each core and ultimately on the chip.

Other useful cpustat counters for assessing performance on an UltraSPARC T1 processor-based system are detailed in Table 8.2. All counters are per second, per thread. Rather than deal with raw misses, accumulate the counters and express them as a percentage miss rate of instructions. For example, if the system executes 200 million instructions per second on a strand and IC_miss indicates 14 million instruction cache misses per second, then the instruction cache miss rate is seven percent.

Table 8.2. UltraSPARC-T1 Performance Counters
Events	Description	High Value	Impact	Potential Remedy
`IC_miss`	Number of instruction cache misses	>7%	Small impact as latency can be hidden by strands	Compiler flag options to compact the binary. See compiler section.
`DC_miss`	Number of data cache misses	>11%	Small impact as latency can be hidden by strands	Compact data structures to align on 64-byte boundaries.
`ITLB_miss`	Number of instruction TLB misses	>.001%	Potentially severe impact from TLB thrashing	Make sure text on large pages. See TLB section.
`DTLB_miss`	Number of data TLB misses	>.005%	Potentially severe impact from TLB thrashing	Make sure data segments are on large pages. See TLB section.
`L1_imiss`	Instruction cache misses that also miss L2	>2%	Medium impact potential for all threads to stall	Reduce conflict with data cache misses if possible.
`L1_dmiss_ld`	Data case misses that also miss L2	>2%	Medium impact potential for all threads to stall	Potential alignment issues. Offset data structures.

8.2.5. Event Multiplexing

Since some CPUs have only two PICs, only two events can be measured at the same time. If you are looking at a specific CPU component like the I-cache, this situation may be fine. However, sometimes you want to monitor more events than just the PIC count. In that case, you can use the -c option more than once, and the cpustat command will alternate between them. For example,

# cpustat -c pic0=IC_ref,pic1=IC_hit -c pic0=DC_rd,pic1=DC_rd_hit -c \ pic0=DC_wr,pic1=DC_wr_hit -c pic0=EC_ref,pic1=EC_hit -p 1 0.25 5    time cpu event      pic0      pic1   0.267   0  tick    221423    197095  # pic0=IC_ref,pic1=IC_hit   0.513   0  tick       105        65  # pic0=DC_rd,pic1=DC_rd_hit   0.763   0  tick        37        21  # pic0=DC_wr,pic1=DC_wr_hit   1.013   0  tick       282       148  # pic0=EC_ref,pic1=EC_hit   1.267   0  tick    213558    190520  # pic0=IC_ref,pic1=IC_hit   1.513   0  tick       109        62  # pic0=DC_rd,pic1=DC_rd_hit   1.763   0  tick        37        21  # pic0=DC_wr,pic1=DC_wr_hit   2.013   0  tick       276       149  # pic0=EC_ref,pic1=EC_hit   2.264   0  tick    217713    194040  # pic0=IC_ref,pic1=IC_hit ...

We specified four different PIC configurations (-c eventspec), and cpustat cycled between sampling each of them. We set the interval to 0.25 seconds and set a period (-p) to 1 second so that the final value of 5 is a cycle count, not a sample count. An extra commented field lists the events the columns represent, which helps a postprocessing script such as awk to identify what the values represent.

Some CPU types provide many PICs (more than eight), usually removing the need for event multiplexing as used in the previous example.

8.2.6. Using `cpustat` with Multiple CPUs

Each example output of cpustat has contained a column for the CPU ID (cpu). Each CPU has its own PIC, so when cpustat runs on a multi-CPU system, it must collect PIC values from every CPU. cpustat does this by creating a thread for each CPU and binding it onto that CPU. Each sample then produces a line for each CPU and prints it in the order received. Thus, some slight shuffling of the output lines occurs.

The following example demonstrates cpustat on a server with four Ultra-SPARC IV CPUs, each of which has two cores.

# cpustat -c pic0=DC_rd,pic1=DC_rd_miss 5 1    time cpu event      pic0      pic1   5.008 513  tick    355670     25132   5.008   3  tick   8824184     34366   5.008 512  tick        11         1   5.008   2  tick      1127       123   5.008 514  tick     55337      3908   5.008   0  tick        10         3   5.008   1  tick     19833       854   5.008 515  tick   7360753     36567   5.008   8 total  16616925     100954

The cpu column prints the total CPU count for the last line (total).

8.2.7. Cycles per Instruction

The CPC events can monitor more than just the CPU caches. The following example demonstrates the use of the cycle count and instruction count on an Ultra-SPARC IIi to calculate the average number of cycles per instruction, printed last.

# cpustat -nc pic0=Cycle_cnt,pic1=Instr_cnt 10 1 | \ awk '{ printf "%s %.2f cpi\n",$0,$4/$5; }'  10.034   0  tick 3554903403 3279712368  1.08 cpi  10.034   1 total 3554903403 3279712368  1.08 cpi

This single 10-second sample averaged 1.08 cycles per instruction. During this test, the CPU was busy running an infinite loop program. Since the same simple instructions are run over and over, the instructions and data are found in the Level-1 cache, resulting in fast instructions.

Now the same test is performed while the CPU is busy with heavy random memory access:

# cpustat -nc pic0=Cycle_cnt,pic1=Instr_cnt 10 1 | \ awk '{ printf "%s %.2f cpi\n",$0,$4/$5; }'  10.036   0  tick 205607856  34023849  6.04 cpi  10.036   1 total 205607856  34023849  6.04 cpi

Since accessing main memory is much slower, the cycles per instruction have increased to an average of 6.04.

8.2.8. PIC Examples: UltraSPARC IV

The UltraSPARC IV processor provides a greater number of events that can be monitored. The following example is the output from cpustat -h, which lists these events.

# cpustat -h ... Use cputrack(1) to monitor per-process statistics.         CPU performance counter interface:  UltraSPARC III+ & IV         events  pic0=<event0>,pic1=<event1>[,sys][,nouser]         event0: Cycle_cnt Instr_cnt Dispatch0_IC_miss IC_ref DC_rd DC_wr                 EC_ref EC_snoop_inv Dispatch0_br_target Dispatch0_2nd_br                 Rstall_storeQ Rstall_IU_use EC_write_hit_RTO EC_rd_miss                 PC_port0_rd SI_snoop SI_ciq_flow SI_owned SW_count_0                 IU_Stat_Br_miss_taken IU_Stat_Br_count_taken                 Dispatch_rs_mispred FA_pipe_completion MC_reads_0                 MC_reads_1 MC_reads_2 MC_reads_3 MC_stalls_0 MC_stalls_2                 EC_wb_remote EC_miss_local EC_miss_mtag_remote         event1: Cycle_cnt Instr_cnt Dispatch0_mispred EC_wb EC_snoop_cb                 IC_miss_cancelled Re_FPU_bypass Re_DC_miss Re_EC_miss                 IC_miss DC_rd_miss DC_wr_miss Rstall_FP_use EC_misses                 EC_ic_miss Re_PC_miss ITLB_miss DTLB_miss WC_miss                 WC_snoop_cb WC_scrubbed WC_wb_wo_read PC_soft_hit                 PC_snoop_inv PC_hard_hit PC_port1_rd SW_count_1                 IU_Stat_Br_miss_untaken IU_Stat_Br_count_untaken                 PC_MS_misses Re_RAW_miss FM_pipe_completion MC_writes_0                 MC_writes_1 MC_writes_2 MC_writes_3 MC_stalls_1 MC_stalls_3                 Re_DC_missovhd EC_miss_mtag_remote EC_miss_remote         See the "SPARC V9 JPS1 Implementation Supplement: Sun         UltraSPARC-III+"

Some of these are similar to the UltraSPARC IIi CPU, but many are additional. The extra events allow memory controller and pipeline activity to be measured.

8.2. cpustat Command