8.2. cpustat Command

The cpustat command monitors the CPU Performance Counters (CPCs), which provide performance details for the CPU hardware caches. These types of hardware counters are known as Performance Instrumentation Counters, or PICs, which also exist on other devices. The PICs are programmable and record statistics for different events (event is a deliberate term). For example, they can be programmed to track statistics for CPU cache events.

A typical UltraSPARC system might provide two PICs, each of which can be programmed to monitor one event from a list of around twenty. An example of an event is an E-cache hit, the number of which could be counted by a PIC.

Which CPU caches can be measured depends on the type of CPU. Different CPU types not only can have different caches but also can have different available events that the PICs can monitor. It is possible that a CPU could contain a cache with no events associated with it, leaving us with no way to measure cache performance.

The following example demonstrates the use of cpustat to measure E-cache (Level 2 cache) events on an UltraSPARC IIi CPU.

    # cpustat -c pic0=EC_ref,pic1=EC_hit 1 5
       time cpu event      pic0      pic1
      1.005   0  tick     66931     52598
      2.005   0  tick     67871     52569
      3.005   0  tick     65003     50907
      4.005   0  tick     64793     50958
      5.005   0  tick     64574     50904
      5.005   1 total    329172    257936

The cpustat command has a -c eventspec option to configure which events the PICs should monitor. We set pic0 to monitor EC_ref, which is E-cache references; and we set pic1 to monitor EC_hit, which is E-cache hits.

8.2.1. Cache Hit Ratio, Cache Misses

If both the cache references and hits are available, as with the UltraSPARC IIi CPU in the previous example, you can calculate the cache hit ratio. For that calculation you could also use cache misses and hits, which some CPU types provide. The calculations are fairly straightforward:

    cache hit ratio = cache hits / cache references
    cache misses    = cache references - cache hits
    cache hit ratio = cache hits / (cache hits + cache misses)
A higher cache hit ratio improves the performance of applications because the latency incurred when main memory is accessed through memory buses is obviated. The cache hit ratio may also indicate the pattern of activity; a low cache hit ratio may indicate a hot spot, where frequently accessed memory locations map to the same cache location, causing frequently used data to be flushed.

Since satisfying each cache miss incurs a certain time cost, the volume of cache misses may be of more interest than the cache hit ratio. The number of misses affects application performance more directly than a change in the percentage hit ratio does, since the number of misses is proportional to the total time penalty.

Both cache hit ratios and cache misses can be calculated with a little awk, as the following script, called ecache, demonstrates.[1]
    #!/usr/bin/sh
    #
    # ecache - print E$ misses and hit ratio for UltraSPARC IIi CPUs.
    #
    # USAGE: ecache [interval [count]]     # by default, interval is 1 sec

    cpustat -c pic0=EC_ref,pic1=EC_hit ${1-1} $2 | awk '
            BEGIN { pagesize = 20; lines = pagesize }
            lines >= pagesize {
                    lines = 0
                    printf("%8s %3s %5s %9s %9s %9s %7s\n",\
                        "E$ time", "cpu", "event", "total", "hits", "miss", "%hit")
            }
            $1 !~ /time/ {
                    total = $4
                    hits = $5
                    miss = total - hits
                    ratio = 100 * hits / total
                    printf("%8s %3s %5s %9s %9s %9s %7.2f\n",\
                        $1, $2, $3, total, hits, miss, ratio)
                    lines++
            }
    '

This script is deliberately verbose, using extra named variables, to illustrate the calculations performed.[2] nawk or perl would also be suitable for postprocessing the output of cpustat, which itself reads the PICs by using the libcpc library and binding a thread to each CPU.
The following example demonstrates the extra columns that ecache prints.

    # ecache 1 5
     E$ time cpu event     total      hits      miss    %hit
       1.013   0  tick     65856     51684     14172   78.48
       2.013   0  tick     71511     55793     15718   78.02
       3.013   0  tick     69051     54203     14848   78.50
       4.013   0  tick     69878     55082     14796   78.83
       5.013   0  tick     68665     53873     14792   78.46
       5.013   1 total    344961    270635     74326   78.45

This tool measures the volume of cache misses (miss) and the cache hit ratio (%hit) achieved for UltraSPARC IIi CPUs.

8.2.2. Listing PICs and Events

The -h option to cpustat lists the available events for a CPU type and the PICs that can monitor them.

    # cpustat -h
    Usage:
            cpustat [-c events] [-p period] [-nstD] [interval [count]]

            -c events specify processor events to be monitored
            -n        suppress titles
            -p period cycle through event list periodically
            -s        run user soaker thread for system-only events
            -t        include %tick register
            -D        enable debug mode
            -h        print extended usage information

            Use cputrack(1) to monitor per-process statistics.

            CPU performance counter interface: UltraSPARC I&II

            event specification syntax:
            [picn=]<eventn>[,attr[n][=<val>]][,[picn=]<eventn>[,attr[n][=<val>]],...]

            event0:  Cycle_cnt Instr_cnt Dispatch0_IC_miss IC_ref DC_rd DC_wr
                     EC_ref EC_snoop_inv Dispatch0_storeBuf Load_use
                     EC_write_hit_RDO EC_rd_hit

            event1:  Cycle_cnt Instr_cnt Dispatch0_mispred EC_wb EC_snoop_cb
                     Dispatch0_FP_use IC_hit DC_rd_hit DC_wr_hit Load_use_RAW
                     EC_hit EC_ic_hit

            attributes: nouser sys

            See the "UltraSPARC I/II User's Manual" (Part No. 802-7220-02) for
            descriptions of these events. Documentation for Sun processors
            can be found at: http://www.sun.com/processors/manuals

The -h output lists the events that can be monitored and finishes by referring to the reference manual for this CPU. These invaluable manuals discuss the CPU caches in detail and explain what the events really mean.

In this example of cpustat -h, the event specification syntax shows that you can set picn to measure events from eventn.
For example, you can set pic0 to IC_ref and pic1 to IC_hit, but not the other way around. The output also indicates that this CPU type provides only two PICs and so can measure only two events at the same time.

8.2.3. PIC Examples: UltraSPARC IIi

We chose the UltraSPARC IIi CPU for the preceding examples because it provides a small collection of fairly straightforward PICs. Understanding this CPU type is a good starting point before we move on to more difficult CPUs. For a full reference for this CPU type, see Appendix B of the UltraSPARC I/II User's Manual.[3]
The UltraSPARC IIi provides two 32-bit PICs, which are joined as a 64-bit register. The 32-bit counters could wrap around, especially for longer sample intervals. The 64-bit Performance Control Register (PCR) configures which events (statistics) the two PICs will count. Only one invocation of cpustat (or cputrack) can run at a time, since there is only one set of PICs to share.

The available events for measuring CPU cache activity are listed in Table 8.1. The table is taken from the User's Manual, where you can find a listing of all events.
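To put the wrap-around caveat above in numbers, the following is a rough sketch only. It assumes a counter that starts at zero and increments at a steady rate; the function name wrap_seconds is hypothetical. The sample rate is taken from the Cycle_cnt example in Section 8.2.7, where about 3,554,903,403 cycles were counted in a 10-second sample.

```shell
#!/bin/sh
# wrap_seconds RATE
# Estimate how many seconds a 32-bit counter lasts before it wraps,
# given a steady event rate in events per second (an assumption).
wrap_seconds() {
    awk -v rate="$1" 'BEGIN {
        if (rate <= 0) { print "n/a"; exit }
        # 4294967296 is 2^32, the number of values a 32-bit counter holds
        printf("%.1f\n", 4294967296 / rate)
    }'
}

# About 355 million cycles per second, as in the Cycle_cnt sample
wrap_seconds 355490340
```

At that rate the estimate is only slightly longer than the 10-second interval used in that sample, which suggests keeping intervals short when counting high-frequency events such as Cycle_cnt.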
Reading through the descriptions will reveal many subtleties you need to consider to understand these events. For example, some activity is not cacheable and so does not show up in event statistics for that cache. This includes block loads and block stores, which are not sent to the E-cache since it is likely that this data will be touched only once. You should consider such a point if an application experiences memory latency not explained by the E-cache miss statistics alone.

8.2.4. PIC Examples: The UltraSPARC T1 Processor

Each of the 32 UltraSPARC T1 strands has a set of hardware performance counters that can be monitored using the cpustat(1M) command. cpustat can collect two counters in parallel, the second always being the instruction count. For example, to collect iTLB misses and instruction counts for every strand on the chip, type the following:

    # /usr/sbin/cpustat -c pic0=ITLB_miss,pic1=Instr_cnt,sys 1 10
       time cpu event      pic0      pic1
      2.019   0  tick         6 186595695 # pic0=ITLB_miss,sys,pic1=Instr_cnt,sys
      2.089   1  tick         7 192407038 # pic0=ITLB_miss,sys,pic1=Instr_cnt,sys
      2.039   2  tick        49 192237411 # pic0=ITLB_miss,sys,pic1=Instr_cnt,sys
      2.049   3  tick        15 190609811 # pic0=ITLB_miss,sys,pic1=Instr_cnt,sys
    ......

Both a pic0 and pic1 register must be specified. ITLB_miss is used in the preceding example, although only the instruction counts are of interest in this instance.

The performance counters indicate that each strand is executing about 190 million instructions per second. To determine how many instructions are executing per core, aggregate counts from four strands. Strands zero, one, two, and three are in the first core, strands four, five, six, and seven are in the second core, and so on. The preceding example indicates that the system is executing about 760 million instructions per core per second. If the processor is executing at 1.2 gigahertz, each core can execute a maximum of 1200 million instructions per second, yielding an efficiency rating of 0.63 (760 / 1200).
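The per-core aggregation described above can be sketched with awk. This is an illustration under stated assumptions, not a supported tool: the function name core_ips is hypothetical, it assumes four consecutively numbered strands per core as described in the text, it assumes pic1 holds Instr_cnt, and it expects the tick lines of a single one-second interval on standard input. The default clock rate of 1200000000 Hz is the 1.2 gigahertz figure used above.

```shell
#!/bin/sh
# core_ips [CLOCK_HZ]
# Sum pic1 (assumed to be Instr_cnt) over the strands of each core and
# print: core number, instructions counted, fraction of the clock rate.
core_ips() {
    awk -v hz="${1:-1200000000}" '
    $3 == "tick" {
        core = int($2 / 4)        # assumes strands 0-3 are core 0, 4-7 core 1, ...
        instr[core] += $5
        if (core > max) max = core
    }
    END {
        for (c = 0; c <= max; c++)
            if (c in instr)
                printf("core %d %d %.2f\n", c, instr[c], instr[c] / hz)
    }'
}
```

A possible invocation, reusing the event specification from the example above with a single sample, would be cpustat -c pic0=ITLB_miss,pic1=Instr_cnt,sys 1 1 | core_ips. Feeding it more than one interval would add the intervals together, so the ratio would no longer be per second.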
To achieve maximum throughput, maximize the number of instructions per second on each core and ultimately on the chip. Other useful cpustat counters for assessing performance on an UltraSPARC T1 processor-based system are detailed in Table 8.2. All counters are per second, per thread. Rather than deal with raw misses, accumulate the counters and express them as a percentage miss rate of instructions. For example, if the system executes 200 million instructions per second on a strand and IC_miss indicates 14 million instruction cache misses per second, then the instruction cache miss rate is seven percent.
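The miss-rate calculation above can be applied to each sample line with a small awk filter. This is a minimal sketch: the function name miss_pct is hypothetical, and it assumes pic0 holds a miss counter (such as IC_miss) and pic1 holds Instr_cnt.

```shell
#!/bin/sh
# miss_pct
# Read cpustat output on stdin and print, for each tick line,
# the time, the CPU ID, and pic0 as a percentage of pic1.
miss_pct() {
    awk '$3 == "tick" && $5 > 0 {
        printf("%s %s %.2f\n", $1, $2, 100 * $4 / $5)
    }'
}

# The figures from the text: 14 million misses, 200 million instructions
printf '1.000 0 tick 14000000 200000000\n' | miss_pct
```

On a live system, something like cpustat -c pic0=IC_miss,pic1=Instr_cnt 1 | miss_pct might be used, assuming IC_miss is accepted as a pic0 event on the target CPU; cpustat -h shows what is available.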
8.2.5. Event Multiplexing

Since some CPUs have only two PICs, only two events can be measured at the same time. If you are looking at a specific CPU component like the I-cache, this situation may be fine. However, sometimes you want to monitor more events than there are PICs. In that case, you can use the -c option more than once, and the cpustat command will alternate between the event specifications. For example,

    # cpustat -c pic0=IC_ref,pic1=IC_hit -c pic0=DC_rd,pic1=DC_rd_hit -c \
    pic0=DC_wr,pic1=DC_wr_hit -c pic0=EC_ref,pic1=EC_hit -p 1 0.25 5
       time cpu event      pic0      pic1
      0.267   0  tick    221423    197095  # pic0=IC_ref,pic1=IC_hit
      0.513   0  tick       105        65  # pic0=DC_rd,pic1=DC_rd_hit
      0.763   0  tick        37        21  # pic0=DC_wr,pic1=DC_wr_hit
      1.013   0  tick       282       148  # pic0=EC_ref,pic1=EC_hit
      1.267   0  tick    213558    190520  # pic0=IC_ref,pic1=IC_hit
      1.513   0  tick       109        62  # pic0=DC_rd,pic1=DC_rd_hit
      1.763   0  tick        37        21  # pic0=DC_wr,pic1=DC_wr_hit
      2.013   0  tick       276       149  # pic0=EC_ref,pic1=EC_hit
      2.264   0  tick    217713    194040  # pic0=IC_ref,pic1=IC_hit
    ...

We specified four different PIC configurations (-c eventspec), and cpustat cycled between sampling each of them. We set the interval to 0.25 seconds and set a period (-p) to 1 second so that the final value of 5 is a cycle count, not a sample count. An extra commented field lists the events the columns represent, which helps a postprocessing tool such as awk identify what the values represent.

Some CPU types provide many PICs (more than eight), usually removing the need for event multiplexing as used in the previous example.

8.2.6. Using cpustat with Multiple CPUs

Each example output of cpustat has contained a column for the CPU ID (cpu). Each CPU has its own PICs, so when cpustat runs on a multi-CPU system, it must collect PIC values from every CPU. cpustat does this by creating a thread for each CPU and binding it onto that CPU. Each sample then produces a line for each CPU and prints it in the order received. Thus, some slight shuffling of the output lines occurs.
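If the shuffled line order gets in the way of postprocessing, the sample lines can be reordered by time and then by CPU ID. The following is a minimal sketch; the function name order_by_cpu is hypothetical. It keeps only the tick lines, so the header and total lines are dropped, and it collapses the column padding to single spaces so that sort sees clean fields.

```shell
#!/bin/sh
# order_by_cpu
# Read cpustat output on stdin, keep the per-CPU tick lines, and
# print them ordered by sample time and then by CPU ID.
order_by_cpu() {
    # $1 = $1 forces awk to rebuild the line with single-space separators
    awk '$3 == "tick" { $1 = $1; print }' | sort -k1,1n -k2,2n
}
```

Because sort must read all of its input before it prints anything, this suits a finite run (one with a count argument) rather than a continuous one.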
The following example demonstrates cpustat on a server with four UltraSPARC IV CPUs, each of which has two cores.

    # cpustat -c pic0=DC_rd,pic1=DC_rd_miss 5 1
       time cpu event      pic0      pic1
      5.008 513  tick    355670     25132
      5.008   3  tick   8824184     34366
      5.008 512  tick        11         1
      5.008   2  tick      1127       123
      5.008 514  tick     55337      3908
      5.008   0  tick        10         3
      5.008   1  tick     19833       854
      5.008 515  tick   7360753     36567
      5.008   8 total  16616925    100954

On the last line (total), the cpu column prints the total CPU count.

8.2.7. Cycles per Instruction

The CPC events can monitor more than just the CPU caches. The following example demonstrates the use of the cycle count and instruction count on an UltraSPARC IIi to calculate the average number of cycles per instruction, printed last.

    # cpustat -nc pic0=Cycle_cnt,pic1=Instr_cnt 10 1 | \
    awk '{ printf "%s %.2f cpi\n",$0,$4/$5; }'
     10.034   0  tick 3554903403 3279712368 1.08 cpi
     10.034   1 total 3554903403 3279712368 1.08 cpi

This single 10-second sample averaged 1.08 cycles per instruction. During this test, the CPU was busy running an infinite loop program. Since the same simple instructions are run over and over, the instructions and data are found in the Level-1 cache, resulting in fast instructions.

Now the same test is performed while the CPU is busy with heavy random memory access:

    # cpustat -nc pic0=Cycle_cnt,pic1=Instr_cnt 10 1 | \
    awk '{ printf "%s %.2f cpi\n",$0,$4/$5; }'
     10.036   0  tick 205607856 34023849 6.04 cpi
     10.036   1 total 205607856 34023849 6.04 cpi

Since accessing main memory is much slower, the cycles per instruction have increased to an average of 6.04.

8.2.8. PIC Examples: UltraSPARC IV

The UltraSPARC IV processor provides a greater number of events that can be monitored. The following example is the output from cpustat -h, which lists these events.

    # cpustat -h
    ...
            Use cputrack(1) to monitor per-process statistics.
            CPU performance counter interface: UltraSPARC III+ & IV

            events  pic0=<event0>,pic1=<event1>[,sys][,nouser]

            event0:  Cycle_cnt Instr_cnt Dispatch0_IC_miss IC_ref DC_rd DC_wr
                     EC_ref EC_snoop_inv Dispatch0_br_target Dispatch0_2nd_br
                     Rstall_storeQ Rstall_IU_use EC_write_hit_RTO EC_rd_miss
                     PC_port0_rd SI_snoop SI_ciq_flow SI_owned SW_count_0
                     IU_Stat_Br_miss_taken IU_Stat_Br_count_taken
                     Dispatch_rs_mispred FA_pipe_completion MC_reads_0
                     MC_reads_1 MC_reads_2 MC_reads_3 MC_stalls_0 MC_stalls_2
                     EC_wb_remote EC_miss_local EC_miss_mtag_remote

            event1:  Cycle_cnt Instr_cnt Dispatch0_mispred EC_wb EC_snoop_cb
                     IC_miss_cancelled Re_FPU_bypass Re_DC_miss Re_EC_miss
                     IC_miss DC_rd_miss DC_wr_miss Rstall_FP_use EC_misses
                     EC_ic_miss Re_PC_miss ITLB_miss DTLB_miss WC_miss
                     WC_snoop_cb WC_scrubbed WC_wb_wo_read PC_soft_hit
                     PC_snoop_inv PC_hard_hit PC_port1_rd SW_count_1
                     IU_Stat_Br_miss_untaken IU_Stat_Br_count_untaken
                     PC_MS_misses Re_RAW_miss FM_pipe_completion MC_writes_0
                     MC_writes_1 MC_writes_2 MC_writes_3 MC_stalls_1 MC_stalls_3
                     Re_DC_missovhd EC_miss_mtag_remote EC_miss_remote

            See the "SPARC V9 JPS1 Implementation Supplement: Sun UltraSPARC-III+"

Some of these events are similar to those of the UltraSPARC IIi CPU, but many are additional. The extra events allow memory controller and pipeline activity to be measured.
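The event1 list above offers miss events such as DC_rd_miss and EC_misses rather than the hit events used by the ecache script, so the hit count has to be derived from references and misses. The following is a minimal sketch along the lines of ecache, assuming pic0=DC_rd and pic1=DC_rd_miss as in the cpustat example in Section 8.2.6; the function name dcache_hits is hypothetical.

```shell
#!/bin/sh
# dcache_hits
# Read cpustat output on stdin where pic0 is a reference count and
# pic1 is a miss count. Print time, cpu, event, hits, misses, and %hit.
dcache_hits() {
    awk '($3 == "tick" || $3 == "total") && $4 > 0 {
        refs = $4
        miss = $5
        hits = refs - miss      # hits are derived, not measured
        printf("%s %s %s %d %d %.2f\n", $1, $2, $3, hits, miss, 100 * hits / refs)
    }'
}

# The total line from the four-CPU UltraSPARC IV example
printf '  5.008   8 total  16616925    100954\n' | dcache_hits
```

A possible live invocation, reusing the event specification from that example, would be cpustat -c pic0=DC_rd,pic1=DC_rd_miss 5 1 | dcache_hits. Lines with zero references are skipped to avoid a division by zero.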