Tools for memory performance analysis address three basic questions: how fast is memory, how constrained is memory in a given system, and how much memory does a specific process consume? In this section, I examine some tools for approaching each of these questions.

4.5.1 Memory Benchmarking

In general, monitoring memory performance is a function of monitoring memory constraints. Tools for benchmarking how fast the memory subsystem is do exist, but they are largely of academic interest, as it is unlikely that much tuning will be able to increase these numbers. The one exception to this rule is that users can carefully tune interleaving as appropriate. Most systems handle interleaving by purely physical means, so you may have to purchase additional memory; consult your system hardware manual for more information. [10] Nonetheless, it is often important to be aware of relative memory subsystem performance in order to make meaningful comparisons.
4.5.1.1 STREAM

The STREAM tool is simple: it measures the time required to copy regions of memory. This captures "real-world" sustainable bandwidth, not the theoretical "peak bandwidth" that most computer vendors provide. It was developed by John McCalpin while he was a professor at the University of Delaware. The benchmark itself is easy to run in single-processor mode (the multiprocessor mode is quite a bit more complex; consult the benchmark documentation for current details). Here's an example from an Ultra 2 Model 2200:

$ ./stream
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 1000000, Offset = 0
Total memory required = 22.9 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 40803 microseconds.
   (= 40803 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function     Rate (MB/s)   RMS time     Min time     Max time
Copy:          226.1804      0.0709       0.0707       0.0716
Scale:         227.6123      0.0704       0.0703       0.0705
Add:           276.5741      0.0869       0.0868       0.0871
Triad:         239.6189      0.1003       0.1002       0.1007

The benchmarks obtained correspond to the summary in Table 4-2.

Table 4-2. STREAM benchmark types
It is also interesting to note that there are at least three ways, all in common use, of counting how much data is transferred in a single operation:
One of the nice things about STREAM is that it uses the same method of counting bytes all the time, so it's safe to make comparisons between runs. STREAM is distributed as source code, so it can be easily compiled. A list of reported benchmark results is also available. The STREAM home page is located at http://www.streambench.org.

4.5.1.2 lmbench

Another tool for measuring memory performance is lmbench. While lmbench is capable of many sorts of measurements, we'll focus on four specific ones. The first three measure bandwidth: the speed of memory reads, the speed of memory writes, and the speed of memory copies (via the bcopy method described previously). The last one is a measure of memory read latency. Let's briefly discuss what each of these benchmarks measures:
When you compile and run the benchmark, it will ask you a series of questions regarding which tests you would like to run. The suite is fairly well documented. The lmbench home page, which contains source code and more details about how the benchmark works, is located at http://www.bitmover.com/lm/lmbench/.

4.5.2 Examining Memory Usage System-Wide

It is important to understand a system's performance, and the only way to get that understanding is through regular monitoring of data.

4.5.2.1 vmstat

vmstat is one of the most ubiquitous performance tools. There is one cardinal rule about vmstat, which you must learn and never forget: vmstat tries to present an average since boot on the first line. The first line of vmstat output is utter garbage and must be discarded. With that in mind, Example 4-1 shows output under Solaris.

Example 4-1. vmstat output in Solaris

# vmstat 5
 procs     memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr s0 s1 s2 --   in   sy  cs us sy id
 0 0 0  43248 49592   0   1  5  0  0  0  0  0  1  0  0  116  106  30  1  1 99
 0 0 0 275144 56936   0   1  0  0  0  0  0  2  0  0  0  120    5  19  0  1 99
 0 0 0 275144 56936   0   0  0  0  0  0  0  0  0  0  0  104    8  19  0  0 100
 0 0 0 275144 56936   0   0  0  0  0  0  0  0  0  0  0  103    9  20  0  0 100

The r, b, and w columns represent the number of processes that have been in the run queue, blocked for I/O resources (including paging), and runnable but swapped out, respectively. If you ever see a system with a nonzero number in the w field, all that you can infer is that the system was, at some point in the past, low enough on memory to force swapping to occur. Here's the most important data you can glean from vmstat output:
On Linux, the output is a little bit different:

% vmstat 5
 procs                  memory     swap         io    system          cpu
 r b w  swpd  free  buff  cache  si  so  bi  bo   in  cs  us sy id
 1 0 0 18372  8088 21828  56704   0   0   1   7   13   6   2  1  6
 0 0 0 18368  7900 21828  56708   1   0   0   8  119  42   6  3 91
 1 0 0 18368  7880 21828  56708   0   0   0  14  122  44   6  3 91
 0 0 0 18368  7880 21828  56708   0   0   0   5  113  24   2  2 96
 0 0 0 18368  7876 21828  56708   0   0   0   4  110  27   2  2 97

Note that while the w field is calculated, Linux never desperation-swaps. The swpd, free, buff, and cache fields represent the amount of virtual memory used, the amount of idle memory, the amount of memory used as buffers, and the amount used as cache, respectively. There is generally very little to be gathered from Linux vmstat output. The most important columns to watch are probably si and so, which indicate the rate of swap-ins and swap-outs. If these grow large, you probably need to increase the kswapd swap_cluster kernel variable to buy yourself some additional bandwidth to and from swap, or purchase more physical memory (see Section 4.2.5.1 earlier in this chapter).

4.5.2.2 sar

sar (for system activity reporter) is, like vmstat, an almost ubiquitous performance-monitoring tool. It is particularly useful in that it can be set up to gather data on its own for later perusal, as well as to do more focused, short-term data collection. In general, its output is comparable to vmstat's, although labeled differently. The syntax for invoking sar is sar -flags interval number, which causes a specific number of data points to be gathered every interval seconds. When looking at memory statistics, the most important flags are -g, -p, and -r. Here's an example of the output generated:

$ sar -gpr 5 100

SunOS islington.london-below.net 5.8 Generic_108528-03 sun4u    02/19/01

11:28:10  pgout/s ppgout/s pgfree/s pgscan/s %ufs_ipf
           atch/s   pgin/s  ppgin/s   pflt/s   vflt/s  slock/s
          freemem freeswap
...
11:28:50     0.00     0.00     0.00     0.00     0.00
           251.60     5.00     5.20  1148.20  2634.60     0.00
            86319  3304835

The most important output fields are summarized in Table 4-3.

Table 4-3. sar memory statistics fields
4.5.2.3 memstat

Starting in the Solaris 7 kernel, Sun implemented new memory statistics to help evaluate system behavior. To access these statistics, you need a tool called memstat. While currently unsupported, it will hopefully be rolled into a Solaris release soon. This tool provides some wonderful functionality, and is available at the time of this writing from http://www.sun.com/sun-on-net/performance.html. Example 4-2 shows the sort of information it gives you.

Example 4-2. memstat

# memstat 5
 memory --------- paging ---------- -executable- -anonymous- --filesys-- ---cpu---
   free  re  mf pi po fr de sr  epi epo epf  api apo apf  fpi fpo fpf  us sy wt id
  49584   0   1  5  0  0  0  0    0   0   0    0   0   0    5   0   0   1  1  1 98
  56944   0   1  0  0  0  0  0    0   0   0    0   0   0    0   0   0   0  0  0 100

Much like vmstat's, the first line of output is worthless and should be discarded. memstat breaks page-in, page-out, and page-free activity into three categories: executable, anonymous, and file operations. Systems with plenty of memory should see very low activity in the epf and apo fields; consistent activity in these fields indicates a memory shortage. If you are running Solaris 7 or earlier and do not have priority paging enabled, however, executables and anonymous memory will be paged out with even the smallest amount of filesystem I/O once free memory falls to or beneath the lotsfree level.

4.5.3 Examining Memory Usage of Processes

It's nice to know how much memory a specific process is consuming. Each process's address space, which is made up of many segments, can be measured in the following ways:
There are many good tools for examining memory usage and estimating the amount of memory required for a certain process.

4.5.3.1 Solaris tools

One common tool is /usr/ucb/ps uax. Here's an example of the sort of data it gives:

% /usr/ucb/ps uax
USER       PID %CPU %MEM   SZ  RSS TT       S    START  TIME COMMAND
root     16755  0.1  1.0 1448 1208 pts/0    O 17:33:35  0:00 /usr/ucb/ps uax
root         3  0.1  0.0    0    0 ?        S   May 24  6:19 fsflush
root         1  0.1  0.6 2232  680 ?        S   May 24  3:10 /etc/init -
root       167  0.1  1.3 3288 1536 ?        S   May 24  1:04 /usr/sbin/syslogd
root         0  0.0  0.0    0    0 ?        T   May 24  0:16 sched
root         2  0.0  0.0    0    0 ?        S   May 24  0:00 pageout
gdm      14485  0.0  0.9 1424 1088 pts/0    S 16:17:57  0:00 -csh

Note that kernel daemons like fsflush, sched, and pageout report SZ and RSS values of zero; they run entirely in kernel space, and don't consume memory that could be used for running other applications. However, ps doesn't tell you anything about the amount of private and shared memory required by the process, which is what you really want to know. The newer and much more informative way to determine a process's memory use under Solaris is to use /usr/proc/bin/pmap -x process-id. (In Solaris 8, this command moved to /usr/bin.)
Continuing from the previous example, here's a csh process:

% pmap -x 14485
14485:  -csh
 Address  Kbytes Resident Shared Private Permissions        Mapped File
00010000     144     144      8     136 read/exec          csh
00042000      16      16      -      16 read/write/exec    csh
00046000     136     112      -     112 read/write/exec    [ heap ]
FF200000     648     608    536      72 read/exec          libc.so.1
FF2B0000      40      40      -      40 read/write/exec    libc.so.1
FF300000      16      16     16       - read/exec          libc_psr.so.1
FF320000       8       8      -       8 read/exec          libmapmalloc.so.1
FF330000       8       8      -       8 read/write/exec    libmapmalloc.so.1
FF340000     168     136      -     136 read/exec          libcurses.so.1
FF378000      40      40      -      40 read/write/exec    libcurses.so.1
FF382000       8       -      -       - read/write/exec    [ anon ]
FF390000       8       8      8       - read/exec          libdl.so.1
FF3A0000       8       8      -       8 read/write/exec    [ anon ]
FF3B0000     120     120    120       - read/exec          ld.so.1
FF3DC000       8       8      -       8 read/write/exec    ld.so.1
FFBE4000      48      48      -      48 read/write         [ stack ]
--------  ------  ------ ------  ------
total Kb    1424    1320    688     632

This gives a breakdown of each segment, as well as the process totals.

4.5.4 Linux Tools

The ps command under Linux works essentially the same way it does under Solaris. Here's an example (I've added the header line for ease of identifying the individual columns):

% ps uax | grep gdm
USER       PID %CPU %MEM SIZE  RSS TTY STAT START   TIME COMMAND
gdm      12329  0.0  0.7 1548  984  p3 S    18:18   0:00 -csh
gdm      13406  0.0  0.3  856  512  p3 R    18:37   0:00 ps uax
gdm      13407  0.0  0.2  844  344  p3 S    18:37   0:00 grep gdm

While ps is ubiquitous, it's not very informative. One of the nicest features of Linux is the ability to work with the /proc filesystem directly; this access gives you good data on the memory use of a specific process, akin to pmap under Solaris. Every process has a directory named after its process ID in the /proc filesystem; inside that directory is a file called status.
Here's an example of what it contains:

% cat /proc/12329/status
Name:   csh
State:  S (sleeping)
Pid:    12329
PPid:   12327
Uid:    563     563     563     563
Gid:    538     538     538     538
Groups: 538 100
VmSize:     1548 kB
VmLck:         0 kB
VmRSS:       984 kB
VmData:      296 kB
VmStk:        40 kB
VmExe:       244 kB
VmLib:       744 kB
SigPnd: 0000000000000000
SigBlk: 0000000000000002
SigIgn: 0000000000384004
SigCgt: 0000000009812003
CapInh: 00000000fffffeff
CapPrm: 0000000000000000
CapEff: 0000000000000000

As you can see, this is a lot of data. Here's a quick Perl script to siphon out the most useful parts:

#!/usr/bin/perl
$pid = $ARGV[0];
@statusArray = split(/\s+/, `grep Vm /proc/$pid/status`);
print "Status of process $pid\n";
print "Process total size:\t$statusArray[1]\tKB\n";
print "    Locked:\t$statusArray[4]\tKB\n";
print "    Resident set:\t$statusArray[7]\tKB\n";
print "    Data:\t$statusArray[10]\tKB\n";
print "    Stack:\t$statusArray[13]\tKB\n";
print "    Executable (text):\t$statusArray[16]\tKB\n";
print "    Shared libraries:\t$statusArray[19]\tKB\n";