4.10 Monitoring System Performance

Most Linux systems behave well under a distribution's default settings, and you can spend hours and days trying to tune your machine's performance without attaining any meaningful results. Sometimes performance can be improved, though, and this section concentrates primarily on memory and processor performance, and looks at how you can find out if a hardware upgrade might be worthwhile.

The two most important things that you should know about performance on your system are the load average and the system's swap/page fault behavior. The load average is the average number of processes currently ready to run. The uptime command tells you three load averages in addition to how long the kernel has been running:

 ... up 91 days, ... load average:  0.08, 0.03, 0.01  

The three numbers here are the load averages for the past 1, 5, and 15 minutes. As you can see, this system isn't very busy: the last number shows that, over the past 15 minutes, the processor has averaged only about 0.01 processes ready to run. Most systems exhibit a load average of 0 when you're doing anything except compiling a program or playing a game. A load average of 0 is usually a good sign, because it means that your processor isn't even being challenged, and you're saving power by not running a silly screensaver.
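If you'd rather read these numbers from a script than parse uptime's output, the kernel exposes them directly in /proc/loadavg. A minimal check; the two fields after the load averages (runnable processes over total processes, and the most recently assigned process ID) are shown here with illustrative values:

 $ cat /proc/loadavg
 0.08 0.03 0.01 1/189 12648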

If the load average is around 1, it's not necessarily a bad thing; it simply means that the processor has something to do all of the time, and you can find the process currently using most of the CPU time with the top command. If the load average goes up near 2 or above, multiple processes are probably starting to interfere with each other, and the kernel is trying to divide the CPU resources evenly. (Unless, of course, you have two processors, in which case a load average of 2 means that both processors have just enough to do all of the time.)
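If you'd like a snapshot of the busiest processes that you can save to a file, top also runs noninteractively. Here is one way to invoke it, assuming the procps top that most Linux distributions ship:

 $ top -b -n 1 | head -15     # one batch-mode iteration; keep the header and busiest processes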

A high load average does not necessarily mean that your system is having trouble. A system with enough memory and I/O resources can easily handle many running processes, so don't panic if your load average is high and your system still responds well. The processes must compete with each other for the CPU, and therefore take longer to finish their computations than they would if each had a CPU to itself, but there's nothing to worry about.

However, if you sense that the system is slow and the load average is high, you're probably running into memory problems. A high load average can result from the kernel thrashing, or rapidly swapping processes to and from swap space on the disk. Check the free command or /proc/meminfo to see how much of the real memory is being used for buffers. If there isn't much buffer memory (and the rest of the real memory is taken), then you need more memory.
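Here is what that check might look like in practice, assuming the usual free command and /proc filesystem; free summarizes real memory, buffers, and swap, and /proc/meminfo is where the kernel publishes the underlying figures:

 $ free                               # totals for real memory, buffers, and swap
 $ grep -i buffers /proc/meminfo      # the buffer figure, straight from the kernel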

A situation where the kernel does not have a piece of memory ready when a program wants to use it is called a page fault (the kernel breaks memory into small chunks called pages). When you get a lot of page faults that force the kernel to read pages back from the disk, the system bogs down, because the kernel has to work to provide the pages, robbing normal processes of their chance to run.
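To see fault counts for a particular program, one option is the standalone GNU time utility (/usr/bin/time, not the shell built-in), which reports minor and major page fault totals when given -v; the procps version of ps can show the same cumulative counters for a running process. A sketch, with process ID 1 chosen only for illustration:

 $ /usr/bin/time -v ls /usr > /dev/null    # fault counts appear in the report printed to stderr
 $ ps -o min_flt,maj_flt,cmd -p 1          # cumulative minor and major faults for one process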

The vmstat command tells you how much the kernel is swapping pages in and out, how busy the CPU is, and a number of other things. It is probably the single most powerful memory-performance monitoring tool out there, but you have to know what the output means. Here is some output from vmstat 2, which reports statistics every two seconds:

    procs                      memory    swap          io     system        cpu
  r  b  w    swpd   free  buff  cache  si  so   bi  bo   in   cs  us  sy  id
  2  0  0  145680   3336  5812  89496   1   1    2   2    2    2   2   0   0
  0  0  0  145680   3328  5812  89496   0   0    0   0  102   38   0   0  99
  0  0  0  145680   3328  5812  89496   0   0    0  42  111   44   1   0  99
  0  0  0  145680   3328  5812  89496   0   0    0   0  101   35   0   1  99
  ...

The output falls into categories: procs for processes, memory for memory usage, swap for the pages pulled in and out of swap, io for disk usage, system for interrupts and context switches (the number of times the kernel switches into kernel code), and cpu for the time used by different parts of the system.

The preceding output is typical for a system that isn't doing much. Ignore the first line of numbers: it reports averages since the system booted, not current activity. Here, the system has 145680KB of memory swapped out to the disk (swpd) and 3328KB of real memory free, but it doesn't seem to matter, because the si and so columns report that the kernel is not swapping anything in or out from the disk. (In this example, Mozilla is taking up extensive residence on the swap partition; Mozilla has a habit of loading a lot of stuff into memory when it starts and then never bothering to actually use it.) The buff column indicates the amount of memory that the kernel is using for disk buffers.

On the far right, you see the distribution of CPU time under us, sy, and id. These are the percentages of time that the CPU is spending on user tasks, system (kernel) tasks, and idle time. In the preceding example, there aren't many user processes running (they use at most 1 percent of the CPU), the kernel is doing practically nothing, and the CPU sits idle 99 percent of the time.

Now, watch what happens when a big, nasty program starts up sometime later (the first two lines are right before the program runs):

    procs                      memory    swap            io     system         cpu
  r  b  w    swpd   free  buff  cache  si    so    bi    bo   in    cs  us  sy   id
  1  0  0  140988   4668  7312  58980   0     0     0     0  118   148   0   0  100
  0  0  0  140988   4668  7312  58980   0     0     0     0  101    31   0   0   99
  1  0  0  140988   3056  6780  58496   6    14  1506    14  174  2159  23   5   72
  1  1  0  140988   3304  5900  58364  32   268   100   268  195  2215  41   5   53
  0  0  0  140988   3496  5648  58160  88     0   250     0  186   573   4   6   89
  1  0  0  140988   3300  5648  58248   0     0    38     0  188  1792  11   6   83
  2  3  0  140988   3056  4208  59588  42    14  2062    14  249  1395  20   6   74
  2  1  0  140464   3100  2608  65416  16    96  2398   176  437   713   3  11   85
  4  0  0  140392   3180  2416  69780   0    14  4490    14  481   704   1   5   94
  2  2  0  140392   3056  2428  73076 106     0  7184    68  549  1173   0   9   90
  2  1  0  141176   4072  2112  81544  28   220  7314   252  514  1748  20  19   61
  1  2  0  141636   3056  1892  87012   0  1960  3532  1974  504  1054   2   7   91
  3  0  0  145960   3056  1876  89864   0  3044  1458  3056  490   876   1   6   92
  ...

The CPU starts to see some usage for an extended period, especially from user processes. The amount of buffer space starts to deplete as the new program requires new pages, and the kernel starts to kick other pages out to swap space (the so column) to make room. Unfortunately, as the kernel does this, it has to wait for the disk to do its work, and the CPU sits idle while processes block (the b column), waiting for the memory pages that they need. You can see the effects of swapping all over the output here; more memory would have fixed this particular problem.
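To confirm that the swap area itself is filling while this happens, you can look at /proc/swaps, which lists each swap area along with its current usage; watching it next to vmstat is a simple way to correlate the two (a spot check, not a complete diagnosis):

 $ cat /proc/swaps     # size and current usage of each swap area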

This section hasn't explained all of the vmstat output columns. You can read about them in the vmstat(8) manual page, but you may have to learn more about operating systems first, from a class or a book like Operating System Concepts [Silberschatz]. However, try to avoid getting obsessed with performance: if you're trying to squeeze 3 percent more speed out of something and you're not working on a huge computation, you're probably wasting your time.



