4.5 Tools for Memory Performance Analysis


Tools for memory performance analysis address three basic questions: how fast is the memory subsystem, how constrained is memory in a given system, and how much memory does a specific process consume? In this section, I examine some tools for approaching each of these questions.

4.5.1 Memory Benchmarking

In general, monitoring memory performance is a function of monitoring memory constraints. Tools do exist for benchmarking how fast the memory subsystem is, but they are largely of academic interest, since very little tuning can increase these numbers. The one exception to this rule is interleaving, which users can carefully tune as appropriate. Most systems handle interleaving by purely physical means, so you may have to purchase additional memory: consult your system hardware manual for more information. [10] Nonetheless, it is often important to be aware of relative memory subsystem performance in order to make meaningful comparisons.

[10] Keep in mind that not having enough slightly faster memory is probably worse than having enough slightly slower memory. Be careful.

4.5.1.1 STREAM

The STREAM tool is simple; it measures the time required to perform a few simple operations (copies and basic arithmetic) on large arrays in memory. This measures "real-world" sustainable bandwidth, not the theoretical "peak bandwidth" that most computer vendors provide. It was developed by John McCalpin while he was a professor at the University of Delaware.

The benchmark itself is easy to run in single-processor mode (the multiprocessor mode is quite a bit more complex; consult the benchmark documentation for current details). Here's an example from an Ultra 2 Model 2200:

$ ./stream
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 1000000, Offset = 0
Total memory required = 22.9 MB.
Each test is run 10 times, but only the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 40803 microseconds.
   (= 40803 clock ticks)
Increase the size of the arrays if this shows that you are not
getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   RMS time     Min time     Max time
Copy:         226.1804       0.0709       0.0707       0.0716
Scale:        227.6123       0.0704       0.0703       0.0705
Add:          276.5741       0.0869       0.0868       0.0871
Triad:        239.6189       0.1003       0.1002       0.1007

The benchmarks obtained correspond to the summary in Table 4-2.

Table 4-2. STREAM benchmark types

Benchmark   Operation                Bytes per iteration
Copy        a[i] = b[i]              16
Scale       a[i] = q * b[i]          16
Add         a[i] = b[i] + c[i]       24
Triad       a[i] = b[i] + q * c[i]   24
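If you want a concrete sense of what these kernels look like, they are nothing more elaborate than the loops below. This is a minimal sketch of my own, not the actual STREAM source (which adds multiple trials, result verification, and more careful timing); the array size, the use of gettimeofday() for timing, and the single pass per kernel are all simplifying assumptions.

#include <stdio.h>
#include <sys/time.h>

#define N 1000000              /* assumed array size; keep it much larger than your caches */

static double a[N], b[N], c[N];

/* Wall-clock seconds; rough, but adequate for a sketch. */
static double seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(void)
{
    const double q = 3.0;
    double t;
    long i;

    for (i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.5; }

    t = seconds();                                  /* Copy:  16 bytes per iteration */
    for (i = 0; i < N; i++) a[i] = b[i];
    printf("Copy:  %8.1f MB/s\n", 16.0 * N / (seconds() - t) / 1e6);

    t = seconds();                                  /* Scale: 16 bytes per iteration */
    for (i = 0; i < N; i++) a[i] = q * b[i];
    printf("Scale: %8.1f MB/s\n", 16.0 * N / (seconds() - t) / 1e6);

    t = seconds();                                  /* Add:   24 bytes per iteration */
    for (i = 0; i < N; i++) a[i] = b[i] + c[i];
    printf("Add:   %8.1f MB/s\n", 24.0 * N / (seconds() - t) / 1e6);

    t = seconds();                                  /* Triad: 24 bytes per iteration */
    for (i = 0; i < N; i++) a[i] = b[i] + q * c[i];
    printf("Triad: %8.1f MB/s\n", 24.0 * N / (seconds() - t) / 1e6);
    return 0;
}

The bytes-per-iteration figures in the comments are the ones from Table 4-2; they are what turn a measured loop time into a bandwidth.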

It is also interesting to note that there are at least three ways, all in common use, of counting how much data is transferred in a single operation:

Hardware method

Counts how many bytes are physically transferred, since the hardware may move a different number of bytes than the user specified. This is a consequence of cache behavior: when a store operation misses in the cache, many systems perform a write allocate, which means that the cache line containing that memory location is first brought into the processor cache before the store completes. [11]

[11] This is done in order to ensure coherency between main memory and the caches.

Bcopy method

Counts how many bytes get moved from one location in memory to another. If it takes your machine one second to read a certain number of bytes at one location and one more second to write the same number of bytes to another location, the bcopy method counts that number of bytes moved over the two seconds, so the reported bandwidth is half that number of bytes per second.

STREAM method

Counts how many bytes the user asked to be read and how many bytes the user asked to be written. For the simple "copy" test, this is precisely twice the number obtained by the bcopy method. This is done because some of the tests perform arithmetic; it makes sense to count both the data read into and the data written back from the CPU.

One of the nice things about STREAM is that it always uses the same method of counting bytes, so it's safe to make comparisons between results.
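To see how much the counting convention matters, take the Copy result from the sample run above: 1,000,000 iterations with a best time of 0.0707 seconds. Counting the same run three ways gives three different "bandwidths" (the hardware figure assumes a write-allocate cache moving roughly 24 bytes per iteration, so it is only an approximation):

    STREAM method:    16 bytes x 1,000,000 / 0.0707 s  =  about 226 MB/s  (the figure STREAM reports)
    bcopy method:      8 bytes x 1,000,000 / 0.0707 s  =  about 113 MB/s
    Hardware method:  24 bytes x 1,000,000 / 0.0707 s  =  about 339 MB/s

None of these numbers is wrong; they simply answer different questions, which is why you should never compare bandwidth figures without knowing how the bytes were counted.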

STREAM is distributed as source code, so it can be easily compiled. A list of benchmark results from other systems is also available for comparison. The STREAM home page is located at http://www.streambench.org.

4.5.1.2 lmbench

Another tool for measuring memory performance is lmbench. While lmbench is capable of many sorts of measurements, we'll focus on four specific ones. The first three measure bandwidth: the speed of memory reads, the speed of memory writes, and the speed of memory copies (via the bcopy method described previously). The last measures memory read latency. Let's briefly discuss what each of these benchmarks measures:

Memory copy benchmark

Allocates a large chunk of memory, fills it with zeroes, and then measures the time required to copy the first half of the chunk of memory into the second half. The results are reported in megabytes moved per second.

Memory read benchmark

Allocates a chunk of memory, fills it with zeroes, and then measures the time required to read that memory as a series of integer loads and adds; each 4-byte integer is loaded and added to an accumulator variable.

Memory write benchmark

Allocates a chunk of memory, fills it with zeroes, and then measures the time needed to write that memory as a series of 4-byte integer stores and increments.

Memory latency benchmark

Measures the time required to read a byte from memory, reported in nanoseconds per load. The entire data memory hierarchy [12] is exercised, including the latency of the primary caches, the secondary caches, and main memory, plus the latency effects of a TLB miss. To derive the most information from the lmbench memory latency benchmark, try plotting the measured latency as a function of the array size used for the test.

[12] Most notably, any separate instruction caches are not measured.
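The latency measurement works by chaining loads together so that each load's address comes from the result of the previous one; no amount of pipelining or out-of-order execution can then hide the memory access time. The following is only a rough sketch of that pointer-chasing idea, under assumptions of my own (an 8 MB region, a 128-byte stride, and a fixed number of loads); it is not lmbench's source, which varies the array size automatically and takes additional steps to defeat hardware prefetching.

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#define SIZE   (8 * 1024 * 1024)    /* assumed region size, in bytes          */
#define STRIDE 128                  /* assumed distance between chained loads */
#define LOADS  10000000L            /* number of dependent loads to time      */

int main(void)
{
    char  *buf = malloc(SIZE);
    char **p;
    long   i, n = SIZE / STRIDE;
    struct timeval t0, t1;
    double ns;

    if (buf == NULL)
        return 1;

    /* Build a cyclic chain: each entry holds the address of the entry
     * STRIDE bytes before it, so the chase walks backward through the
     * buffer and every load depends on the one before it.             */
    for (i = 0; i < n; i++) {
        char **entry = (char **)(buf + i * STRIDE);
        *entry = buf + ((i + n - 1) % n) * STRIDE;
    }

    p = (char **)buf;
    gettimeofday(&t0, NULL);
    for (i = 0; i < LOADS; i++)
        p = (char **)*p;            /* the dependent load being measured */
    gettimeofday(&t1, NULL);

    ns = ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_usec - t0.tv_usec) * 1e3) / LOADS;
    printf("%.1f ns per load (final p = %p)\n", ns, (void *)p);
    free(buf);
    return 0;
}

Run this with a region smaller than your primary cache and the answer approaches the cache's load latency; run it with a region far larger than your secondary cache and it approaches main-memory (and TLB miss) latency. Sweeping the region size and plotting the results gives exactly the latency-versus-array-size curve described above.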

When you compile and run the benchmark, it will ask you a series of questions regarding what tests you would like to run. The suite is fairly well documented. The lmbench home page, which contains source code and more details about how the benchmark works, is located at http://www.bitmover.com/lm/lmbench/.

4.5.2 Examining Memory Usage System-Wide

It is important to understand how memory is being used across the system as a whole, and the only way to get that understanding is through regular monitoring of memory statistics.

4.5.2.1 vmstat

vmstat is one of the most ubiquitous performance tools. There is one cardinal rule about vmstat, which you must learn and never forget: vmstat presents an average since boot on its first line. That first line of vmstat output is utter garbage and must be discarded.

With that in mind, Example 4-1 shows output under Solaris.

Example 4-1. vmstat output on Solaris
# vmstat 5
 procs     memory             page            disk         faults       cpu
 r b w   swap  free  re  mf pi po fr de sr s0 s1 s2 --   in   sy   cs us sy  id
 0 0 0  43248 49592   0   1  5  0  0  0  0  0  1  0  0  116  106   30  1  1  99
 0 0 0 275144 56936   0   1  0  0  0  0  0  2  0  0  0  120    5   19  0  1  99
 0 0 0 275144 56936   0   0  0  0  0  0  0  0  0  0  0  104    8   19  0  0 100
 0 0 0 275144 56936   0   0  0  0  0  0  0  0  0  0  0  103    9   20  0  0 100

The r, b, and w columns represent the number of processes in the run queue, blocked waiting for I/O resources (including paging), and runnable but swapped out, respectively. If you ever see a system with a nonzero number in the w field, all you can infer is that the system was, at some point in the past, low enough on memory to force swapping to occur. Here's the most important data you can glean from vmstat output:

swap

The amount of available swap space, in KB.

free

The amount of free memory (that is, the size of the free list) in KB. In Solaris 8, this number includes the amount of memory used for the filesystem cache; in prior releases, it does not, and therefore will be very small.

re

The number of pages reclaimed from the free list. Such a page was stolen from a process, but then reclaimed by that process before it could be recycled and given to another process.

mf

The number of minor page faults. These are fast as long as a page is available from the free list, because they can be resolved without performing a page-in.

pi , po

Page-ins and page-outs, respectively, in KB per second.

de

The short-term memory deficit. If the value is nonzero, memory has been vanishing fast lately, and extra free memory will be reclaimed with the expectation that it'll be needed soon.

sr

The scan rate activity of the pagedaemon, in pages per second. This is the most critical memory-shortage indicator. In Solaris releases prior to 8, if this stays above 200 pages per second for a reasonably long period of time, you need more memory. If you are running Solaris 8 and this number is nonzero, you are short of memory.
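Applying these definitions to Example 4-1 (ignoring the first line, as always): sr is zero in every interval and free holds steady at roughly 56 MB, so nothing in this sample suggests a memory shortage.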

On Linux, the output is a little bit different:

% vmstat 5
   procs                  memory    swap        io    system         cpu
 r b w  swpd  free  buff cache  si  so   bi   bo   in   cs  us  sy  id
 1 0 0 18372  8088 21828 56704   0   0    1    7   13    6   2   1   6
 0 0 0 18368  7900 21828 56708   1   0    0    8  119   42   6   3  91
 1 0 0 18368  7880 21828 56708   0   0    0   14  122   44   6   3  91
 0 0 0 18368  7880 21828 56708   0   0    0    5  113   24   2   2  96
 0 0 0 18368  7876 21828 56708   0   0    0    4  110   27   2   2  97

Note that while the w field is calculated, Linux never desperation-swaps. The swpd, free, buff, and cache columns represent the amount of virtual memory used, the amount of idle memory, the amount of memory used for buffers, and the amount used as cache, respectively. There is generally very little to be gathered from Linux vmstat output. The most important columns to watch are probably si and so, which report the rate of swap-ins and swap-outs. If these grow large, you probably need to increase the kswapd swap_cluster kernel variable to buy yourself some additional bandwidth to and from swap, or purchase more physical memory (see Section 4.2.5.1 earlier in this chapter).

4.5.2.2 sar

sar (for system activity reporter) is, like vmstat, an almost ubiquitous performance-monitoring tool. It is particularly useful in that it can be set up to gather data on its own for later perusal, as well as to do more focused, short-term data collection. In general, its output is comparable to vmstat's, although labeled differently.

The syntax for invoking sar is sar -flags interval number, which gathers number data points, one every interval seconds. When looking at memory statistics, the most important flags are -g, -p, and -r. Here's an example of the output generated:

$ sar -gpr 5 100

SunOS islington.london-below.net 5.8 Generic_108528-03 sun4u    02/19/01

11:28:10  pgout/s ppgout/s pgfree/s pgscan/s %ufs_ipf
           atch/s  pgin/s ppgin/s  pflt/s  vflt/s slock/s
          freemem freeswap
...
11:28:50     0.00     0.00     0.00     0.00     0.00
            251.60    5.00    5.20 1148.20 2634.60    0.00
             86319  3304835

The most important output fields are summarized in Table 4-3.

Table 4-3. sar memory statistics fields

Flag   Field       Meaning
-g     pgout/s     Page-out requests per second
       ppgout/s    Pages paged out per second
       pgfree/s    Pages placed on the free list per second by the page scanner
       pgscan/s    Pages scanned per second by the page scanner
       %ufs_ipf    The percentage of cached filesystem pages taken off the free list
                   while they still contained valid data; these pages are flushed and
                   cannot be reclaimed (see Section 5.4.2.7)
-p     atch/s      Page faults per second that are satisfied by reclaiming a page from
                   the free list (this is sometimes called an attach)
       pgin/s      The number of page-in requests per second
       ppgin/s     The number of pages paged in per second
       pflt/s      The number of page faults caused by protection errors (illegal access
                   to a page, copy-on-write faults) per second
-r     freemem     The average amount of free memory, in pages
       freeswap    The number of disk blocks available in paging space
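Reading the sample sar output above in the same spirit: at 11:28:50, pgout/s, ppgout/s, pgfree/s, and pgscan/s are all zero and freemem is large, so the page scanner is idle and the system is not under memory pressure. The nonzero atch/s and pgin/s figures simply reflect ordinary page-fault and page-in activity, which is not by itself a sign of trouble.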

4.5.2.3 memstat

Starting in the Solaris 7 kernel, Sun implemented new memory statistics to help evaluate system behavior. To access these statistics, you need a tool called memstat . While currently unsupported, it will hopefully be rolled into a Solaris release soon. This tool provides some wonderful functionality, and is available at the time of this writing from http://www.sun.com/sun-on-net/performance.html. Example 4-2 shows the sort of information it gives you.

Example 4-2. memstat
# memstat 5
memory --------- paging ---------- -executable-  -anonymous-  --filesys--  --- cpu ---
  free  re  mf  pi  po  fr  de  sr  epi epo epf   api apo apf  fpi fpo fpf  us sy wt  id
 49584   0   1   5   0   0   0   0    0   0   0     0   0   0    5   0   0   1  1  1  98
 56944   0   1   0   0   0   0   0    0   0   0     0   0   0    0   0   0   0  0  0 100

Much like vmstat, the first line of output is worthless and should be discarded. memstat breaks page-in, page-out, and page-free activity into three categories: executable pages, anonymous memory, and filesystem operations. Systems with plenty of memory should see very little activity in the epf and apo fields; consistent activity in these fields indicates a memory shortage.

If you are running Solaris 7 or earlier and do not have priority paging enabled, however, executable and anonymous pages will be paged out by even the smallest amount of filesystem I/O once free memory falls to or below the lotsfree level.

4.5.3 Examining Memory Usage of Processes

It's nice to know how much memory a specific process is consuming. Each process's address space, which is made up of many segments, can be measured in the following ways:

  • The total size of the process address space (represented by SZ or SIZE)

  • The resident size (the size of the part of the address space that's held in memory) of the process address space (RSS)

  • The total shared address space

  • The total private address space

There are many good tools for examining memory usage and estimating the amount of memory required for a certain process.

4.5.3.1 Solaris tools

One common tool is /usr/ucb/ps uax. Here's an example of the sort of data it gives:

% /usr/ucb/ps uax
USER       PID %CPU %MEM   SZ  RSS TT       S    START  TIME COMMAND
root     16755  0.1  1.0 1448 1208 pts/0    O 17:33:35  0:00 /usr/ucb/ps uax
root         3  0.1  0.0    0    0 ?        S   May 24  6:19 fsflush
root         1  0.1  0.6 2232  680 ?        S   May 24  3:10 /etc/init -
root       167  0.1  1.3 3288 1536 ?        S   May 24  1:04 /usr/sbin/syslogd
root         0  0.0  0.0    0    0 ?        T   May 24  0:16 sched
root         2  0.0  0.0    0    0 ?        S   May 24  0:00 pageout
gdm      14485  0.0  0.9 1424 1088 pts/0    S 16:17:57  0:00 -csh

Note that kernel daemons like fsflush, sched, and pageout report SZ and RSS values of zero; they run entirely in kernel space, and don't consume memory that could be used for running other applications. However, ps doesn't tell you anything about the amount of private and shared memory required by the process, which is what you really want to know.

The newer and much more informative way to determine a process's memory use under Solaris is to use /usr/proc/bin/pmap -x process-id. (In Solaris 8, this command moved to /usr/bin.) Continuing from the previous example, here's a csh process:

% pmap -x 14485
14485:  -csh
Address   Kbytes Resident Shared Private Permissions       Mapped File
00010000     144     144       8     136 read/exec         csh
00042000      16      16       -      16 read/write/exec   csh
00046000     136     112       -     112 read/write/exec   [ heap ]
FF200000     648     608     536      72 read/exec         libc.so.1
FF2B0000      40      40       -      40 read/write/exec   libc.so.1
FF300000      16      16      16       - read/exec         libc_psr.so.1
FF320000       8       8       -       8 read/exec         libmapmalloc.so.1
FF330000       8       8       -       8 read/write/exec   libmapmalloc.so.1
FF340000     168     136       -     136 read/exec         libcurses.so.1
FF378000      40      40       -      40 read/write/exec   libcurses.so.1
FF382000       8       -       -       - read/write/exec   [ anon ]
FF390000       8       8       8       - read/exec         libdl.so.1
FF3A0000       8       8       -       8 read/write/exec   [ anon ]
FF3B0000     120     120     120       - read/exec         ld.so.1
FF3DC000       8       8       -       8 read/write/exec   ld.so.1
FFBE4000      48      48       -      48 read/write        [ stack ]
--------  ------  ------  ------  ------
total Kb    1424    1320     688     632

This gives a breakdown of each segment, as well as the process totals.
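The Shared and Private columns are what make pmap so useful when estimating memory requirements. For the csh above, 632 KB of the 1,320 KB resident is private to this process, while 688 KB (mostly the csh text and shared library text) is shared with every other process that maps the same files. As a rough estimate, ten more copies of this shell would therefore add only about 10 x 632 KB, or a little over 6 MB, rather than 10 x 1,320 KB, because the shared portions are already resident; this approximation ignores per-process kernel overhead and assumes the new shells touch the same pages.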

4.5.3.2 Linux tools

The ps command under Linux works essentially the same way it does under Solaris. Here's an example (I've added the header line for ease of identifying the individual columns):

% ps uax | grep gdm
USER       PID %CPU %MEM  SIZE   RSS TTY STAT START   TIME COMMAND
gdm      12329  0.0  0.7  1548   984  p3 S   18:18   0:00 -csh
gdm      13406  0.0  0.3   856   512  p3 R   18:37   0:00 ps uax
gdm      13407  0.0  0.2   844   344  p3 S   18:37   0:00 grep gdm

While ps is ubiquitous, it's not very informative. For detail comparable to what Solaris's pmap provides, you need to turn to the /proc filesystem.

One of the nicest features of Linux is the ability to work with the /proc filesystem directly, which gives you good data on the memory use of a specific process. Every process has a directory named after its process ID in the /proc filesystem; inside that directory is a file called status. Here's an example of what it contains:

% cat /proc/12329/status
Name:   csh
State:  S (sleeping)
Pid:    12329
PPid:   12327
Uid:    563     563     563     563
Gid:    538     538     538     538
Groups: 538 100
VmSize:     1548 kB
VmLck:         0 kB
VmRSS:       984 kB
VmData:      296 kB
VmStk:        40 kB
VmExe:       244 kB
VmLib:       744 kB
SigPnd: 0000000000000000
SigBlk: 0000000000000002
SigIgn: 0000000000384004
SigCgt: 0000000009812003
CapInh: 00000000fffffeff
CapPrm: 0000000000000000
CapEff: 0000000000000000

As you can see, this is a lot of data. Here's a quick Perl script to siphon out the most useful parts:

#!/usr/bin/perl

$pid = $ARGV[0];

@statusArray = split (/\s+/, `grep Vm /proc/$pid/status`);

print "Status of process $pid\n\r";
print "Process total size:\t$statusArray[1]\tKB\n\r";
print "            Locked:\t$statusArray[4]\tKB\n\r";
print "      Resident set:\t$statusArray[7]\tKB\n\r";
print "              Data:\t$statusArray[10]\tKB\n\r";
print "             Stack:\t$statusArray[13]\tKB\n\r";
print " Executable (text):\t$statusArray[16]\tKB\n\r";
print "  Shared libraries:\t$statusArray[19]\tKB\n\r";
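If you save this script as, say, memusage.pl (the name is arbitrary), then perl memusage.pl 12329 prints the seven Vm fields for the shell shown earlier in labeled form. Note that the fixed indices into @statusArray assume the Vm lines appear in the order shown in the example above.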

