One of the primary indicators of system performance is the responsiveness of the system to its users. Although most user reports of performance tend to be subjective, there are several more quantitative measures that can be applied to end-user perceptions. For instance, humans generally perceive any response within 50 milliseconds as virtually instantaneous. Interactive applications, especially those that involve text entry, appear slow when that 50-millisecond threshold is exceeded. Users of complex systems may have become acclimated to system response times of several seconds or, for occasional highly complex activities, possibly even minutes. Each workload has its own response times, often established with the user through days, weeks, months, or even years of interaction with the system.

When the user's expectations are no longer satisfied, the user tends to call the support desk to report that response times have degraded, possibly catastrophically, to the point where tasks can no longer be completed in a reasonable amount of time. These calls serve as an unplanned opportunity to employ performance analysis tools of various types. Unfortunately, reactive application of these tools without known baselines can present quite a challenge to the uninitiated. Having an opportunity to monitor the application before a crisis, as well as creating and archiving various performance statistics, typically helps expedite a resolution to any complaints to the support desk. Also, some general experience in recognizing the signs of system performance problems is invaluable in understanding performance problems with a workload. Systematic and methodical application of the tools described here, typically coupled with general knowledge of the workload, should make it possible to narrow down any performance bottleneck.
And with knowledge of the bottleneck from a system perspective, a system administrator or performance analyst should be able to employ system tuning, system reconfiguration, workload reconfiguration, or detailed analysis of specific workload components to further identify and ultimately resolve any performance bottleneck.

In a methodical analysis of the system, often one of the first and most basic tools is a simple measure of the system's CPU utilization. In particular, Linux and most UNIX-based operating systems provide a command that displays the system's load average:

$ uptime
 17:37:30 up 3 days, 17:06,  7 users,  load average: 1.13, 1.23, 1.15

The load average represents the average number of runnable tasks over periods of 1, 5, and 15 minutes. Runnable tasks are those that either are currently running or can run but are waiting for a processor to become available. In this case, the system had only one CPU, which can be determined by looking at the contents of /proc/cpuinfo:

$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 8
model name      : Pentium III (Coppermine)
stepping        : 6
cpu MHz         : 647.195
cache size      : 256 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 mmx fxsr sse
bogomips        : 1291.05

There is only an entry for a single processor, so on average there was slightly more work for the processor to do than it could keep up with. At a high level, this implies that the machine has more work to do than it can handle.

Note: If uptime showed a load average of something less than 2.00 on a two-CPU machine, that would indicate that the processors still have additional free cycles. The same would be true on a four-CPU machine with a load average of less than 4.00, and so on. However, the load average alone does not tell the entire story.
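The comparison of load average to processor count can be scripted directly from /proc. The following is a minimal sketch, assuming a Linux /proc file system and standard awk and grep; it is not part of the original text:

```shell
# Compare the 1-minute load average against the number of processors.
# Assumes a Linux /proc file system.
load=$(awk '{ print $1 }' /proc/loadavg)
ncpu=$(grep -c '^processor' /proc/cpuinfo)
echo "1-minute load average: $load on $ncpu CPU(s)"
awk -v l="$load" -v n="$ncpu" 'BEGIN {
    if (l > n)
        print "runnable tasks exceed processors: CPU is saturated"
    else
        print "processors still have free cycles"
}'
```

On the single-CPU laptop in the example above, a load of 1.13 against 1 CPU would report saturation.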
In this case, the machine in question was a laptop. The workload included an email client, word processors, web browsers, some custom applications, and chat clients, all running simultaneously. However, the response time of the various workloads was deemed acceptable by the user. This implies that the operating system scheduler was appropriately prioritizing the most interactive tasks. No significant server components were running on the machine, so the high load average had no impact on other clients or machines. So, although the tool indicates that the CPU was fully utilized, it does not indicate what the system was doing or why it was so busy. And if the response time for the users of the system was acceptable, there may not be any reason to probe more deeply into the system's operation. However, this chapter would be uninteresting if we stopped there.

Quite often, simple tools like uptime are a user's quick way to attempt to account for any perceived slowness in an application's response time. If the load average for the system indicates that the response time may be caused by an overloaded processor (or processors), a variety of other tools can be used to narrow down the cause of the load.

To drill down further into how the processors are being used, we'll look at three tools that provide a variety of insights into CPU utilization: vmstat, iostat, and top. Each of these commands has a different focus on system monitoring, but each can be used to get a different perspective on how the processor is being utilized. In particular, the next step is to understand whether the processor is primarily spending time in the operating system (often referred to as kernel space), in the application (often referred to as user space), or sitting idle. Further, if the processor is idle, understanding why it is idle is key to any further performance analysis.
A processor can be idle for a number of reasons. For instance, the most obvious reason is that a process cannot run. This may sound too obvious, but if a component of the workload, such as a particular process or task, is not running, performance may be impacted. In some cases, caching components or fallback mechanisms allow some applications to continue to run, although with degraded throughput. As an example, the Internet domain name service is often configured to consult a daemon called named or to consult a service off-host. If one name service provider, such as the first one listed with a nameserver line in /etc/resolv.conf, is not running, there may be a timeout period before another information provider is consulted. To the user, this may appear as sporadic delays in the application. To someone monitoring the system with uptime, the load average may not appear very high. However, in this case, the output from vmstat may help narrow down the problem. In particular, many of the tools that we will look at in this chapter and the next two will help you understand what the machine's CPU is doing, from which a performance analyst can form hypotheses about the system performance and then attempt to validate them.

vmstat

vmstat is a real-time performance monitoring tool. The vmstat command provides data that can be used to help find unusual system activity, such as high page faults or excessive context switches, which can degrade system performance. The frequency of data display is a user-specified parameter.
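The resolver's fallback order can be inspected directly. This sketch prints each nameserver entry from /etc/resolv.conf in the order the resolver tries them; the "primary"/"fallback" labels are illustrative, not resolver terminology:

```shell
# Print configured nameservers in lookup order.
# The first entry is tried before any fallback, so a dead first server
# shows up as a timeout delay on every uncached lookup.
awk '/^nameserver/ { printf "%s %s\n", (seen++ ? "fallback:" : "primary:"), $2 }' /etc/resolv.conf
```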
A sample of the vmstat output is as follows:

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd    free   buff  cache   si   so     bi   bo   in    cs us sy id wa
18  8      0 5626196  3008 122788    0    0 330403  454 2575  4090 91  8  1  0
18 15      0 5625132  3008 122828    0    0 328767  322 2544  4264 91  8  0  0
17 12      0 5622004  3008 122828    0    0 327956  130 2406  3998 92  8  0  0
22  2      0 5621644  3008 122828    0    0 327892  689 2445  4077 92  8  0  0
23  5      0 5621616  3008 122868    0    0 323171  407 2339  4037 92  8  1  0
21 14      0 5621868  3008 122868    0    0 323663   23 2418  4160 91  9  0  0
22 10      0 5625216  3008 122868    0    0 328828  153 2934  4518 90  9  1  0

vmstat produces the following information:

r: the number of runnable processes (running or waiting for a processor)
b: the number of processes in uninterruptible sleep, typically blocked on I/O
swpd: the amount of virtual memory used, in kilobytes
free: the amount of idle memory, in kilobytes
buff, cache: the amount of memory used as buffers and as page cache, in kilobytes
si, so: the amount of memory swapped in from and out to disk, per second
bi, bo: blocks received from and sent to block devices, per second
in: the number of interrupts per second
cs: the number of context switches per second
us, sy, id, wa: the percentages of CPU time spent in user space, in the kernel, idle, and waiting on I/O
For I/O-intensive workloads, you can monitor bi and bo for the transfer rate and in for the interrupt rate. You can monitor swpd, si, and so to see whether the system is swapping and, if so, check the swapping rate. Perhaps the most common use is monitoring CPU utilization via us, sy, id, and wa. If wa is large, you need to examine the I/O subsystem; you might conclude, for example, that more I/O controllers and disks are needed to reduce the I/O wait time.

Like uptime, when vmstat is run without options, it reports a single snapshot of the system. If you run uptime followed by vmstat, you can get a quick snapshot of how busy the system is and some indication of what the processor is doing, with the percentage breakdown of user, system, idle, and waiting-on-I/O times. In addition, vmstat provides an instantaneous count of the number of runnable processes, whereas uptime provides the average number of runnable processes across three time periods: 1 minute, 5 minutes, and 15 minutes. So, if the load average from uptime remains above 1 for any period of time, the number of runnable tasks reported by vmstat should also be near 1.

Because vmstat can also provide information at regular, repeated intervals, you can get a more dynamic view of the system with the following command:

$ vmstat 5 10

This command outputs vmstat information every 5 seconds for 10 iterations. Again, the output should generally show about one runnable task in each line of output if the load average has been 1 for the past 1, 5, and 15 minutes per the output of uptime. Do not be surprised to see occasional peaks of 5, 7, or even 20 in the output of vmstat! Remember that the load average is a calculated average, as opposed to an instantaneous snapshot. Both views have their benefits in system performance analysis.
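The runnable-task count in vmstat's first column can be extracted mechanically for comparison against the load average. A sketch, assuming procps vmstat and its two-line header:

```shell
# Extract the runnable-task count (the "r" column) from periodic vmstat
# output, skipping the two header lines.
vmstat 5 10 | awk 'NR > 2 { print "runnable:", $1 }'
```

Averaging those values over time should roughly track the 1-minute load average reported by uptime.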
vmstat also provides a rough view of how much time the processor spends in the operating system, how much is spent in user application space (including libraries provided as part of the application or operating system), how much is idle, and how much is spent blocked waiting on I/O. Different workloads have different utilization patterns, so there are no definite rules as to what values are good or bad. It is therefore good to have developed a baseline for your workload to help you recognize how the workload has changed as the response time or system load increases. vmstat helps identify which tools can drill down into any workload degradations or provide clues as to where tuning or reconfiguration can improve workload performance. The remainder of this section explores a few examples.

Imagine a scenario where users are reporting poor response times with a workload. Examining the load average via uptime shows it to be very low, possibly even well below the baseline. Further, vmstat shows that the number of runnable jobs is very low and that the system is relatively idle, judging from the percentage of CPU idle time. An analyst might interpret these results as a sign that some key process had exited or was blocked waiting on some event that wasn't completing. For instance, some applications use a form of semaphore technique to dispatch work and wait for completion. Perhaps the work was dispatched to a back-end server or other application, and that application has for some reason stopped processing all activity. As a result, the application closest to the user is blocked, not runnable, waiting on a semaphore to notify it of completion before it can return information to the user. This might cause the administrator to focus attention on the server application to see why it is unable to complete the requests being queued for it.
In another scenario, suppose that the load average shows a load in excess of 1, possibly even a full point higher on average than established baselines. In addition, vmstat shows one or two processes always runnable, and the percentage of user time is nearly 100% over an extended period. Another tool might be necessary to find out what process or processes are using up the CPU time; for instance, ps(1) or top(1). ps(1) provides a listing of all processes that currently exist, or some selected subset of processes based on its options. top(1) (or gtop(1)) provides a constantly updating view of the most active processes, where "most active" can be defined as those processes using the most processor time. This data might help identify a runaway process that is doing nothing useful on the system.

If vmstat(1) had reported that the processes were mostly running in user space, the administrator might want to connect a debugger such as gdb(1) to the process and use breakpoints, tracing, or other debugging means to understand what the application was doing. If vmstat had reported that most of the time was being consumed as "system" time, other tools such as strace(1) might be used to find out what system calls were being made. If vmstat(1) had reported that a large percentage of time was being spent waiting for I/O completion, tools such as sar(1) could be used to see what devices were being used and to provide some possible insights into which applications or file systems were in use, whether the system was swapping or paging, and so on.

vmstat(1) provides some simple insights into the current state of the system. In this section, we looked at the areas that primarily relate to CPU utilization. However, vmstat(1) also provides some insight into memory utilization, basic swapping activity, and I/O activity. Later sections in this chapter look at these areas in more detail.
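Finding the busiest process can also be done non-interactively with ps. A sketch using the procps options for user-defined output and sorting (--sort is a GNU/Linux extension):

```shell
# List the five busiest processes by CPU percentage.
# -e selects all processes; -o picks the columns; --sort=-pcpu sorts
# descending by CPU usage (procps ps on Linux).
ps -eo pid,pcpu,comm --sort=-pcpu | head -6
```

The process at the top of this list is the first candidate for attaching gdb(1) or strace(1), depending on whether the time is going to user or system space.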
top and gtop

top and gtop are very useful tools for understanding the tasks and processes that contribute to the high-level information provided by vmstat or uptime. They can show which processes are active and which ones are consuming the most processing time or memory over time. The top command provides an updating overview of all running processes and the system load. top provides information on CPU load, memory usage, and usage per process, as detailed in the snapshot that follows. Note that it also provides load-average snapshots much like uptime(1) does; however, top also provides a breakout of the number of processes that have been created but are currently sleeping, and the number of processes that are running. "Sleeping" tasks are those that are blocked waiting on some activity, such as a key press from a user at a keyboard, data from a pipe or socket, or requests from another host (such as a web server waiting for someone to request content). top(1) also shows the CPU state percentages for each processor independently, which can help identify any imbalances in scheduling tasks. By default, the output of top is refreshed frequently, and tasks are sorted by percentage of CPU consumption. Other sorting options are possible, such as cumulative CPU consumption or percentage of memory consumption.
  4:52pm  up  5:08,  3 users,  load average: 2.77, 5.81, 3.15
37 processes: 36 sleeping, 1 running, 0 zombie, 0 stopped
CPU0 states:  5.1% user, 53.1% system,  0.0% nice, 41.1% idle
CPU1 states:  5.0% user, 52.4% system,  0.0% nice, 41.4% idle
Mem:  511480K av,  43036K used, 468444K free,      0K shrd,   2196K buff
Swap: 263992K av,      0K used, 263992K free                  21432K cached

  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME COMMAND
 1026 root       2   0   488  488   372 S     1.1  0.0   0:00 chat_s
 7490 root      11   0  1012 1012   816 R     0.5  0.1   0:00 top
    3 root      19  19     0    0     0 SWN   0.3  0.0   0:00 ksoftirqd_C
    4 root      19  19     0    0     0 SWN   0.1  0.0   0:00 ksoftirqd_C
    1 root       9   0   536  536   468 S     0.0  0.1   0:04 init
    2 root       9   0     0    0     0 SW    0.0  0.0   0:00 keventd
    5 root       9   0     0    0     0 SW    0.0  0.0   0:00 kswapd
    6 root       9   0     0    0     0 SW    0.0  0.0   0:00 bdflush
    7 root       9   0     0    0     0 SW    0.0  0.0   0:00 kupdated
    9 root       9   0     0    0     0 SW    0.0  0.0   0:00 scsi_eh_0
   10 root       9   0     0    0     0 SW    0.0  0.0   0:00 khubd
  331 root       9   0   844  844   712 S     0.0  0.1   0:00 syslogd
  341 root       9   0  1236 1236   464 S     0.0  0.2   0:00 klogd
  356 rpc        9   0   628  628   536 S     0.0  0.1   0:00 portmap

top includes the following information in its output:

PID: the process ID
USER: the user name of the task's owner
PRI: the task's priority
NI: the task's nice value
SIZE: the size of the task's code, data, and stack space, in kilobytes
RSS: the amount of physical memory used by the task, in kilobytes
SHARE: the amount of shared memory used by the task, in kilobytes
STAT: the task's state (S is sleeping, R is running, W indicates no resident pages, N indicates a positive nice value)
%CPU: the task's share of CPU time since the last screen update
%MEM: the task's share of physical memory
TIME: the total CPU time the task has used since it started
COMMAND: the task's command name
top also has a dynamic mode to change the reported information. Activate it by pressing the f key. By further pressing the j key, you can add a new column showing the CPU last used by each executing process. This additional information is particularly useful for understanding process behavior on an SMP system. This section barely scratches the surface of the types of information that top and gtop provide. For more information on top, see the corresponding man page; for more information on gtop, see its built-in help.

sar

sar is part of the sysstat package. sar collects and reports a wide range of system activity in the operating system, including CPU utilization, the rate of context switches and interrupts, the rate of paging in and out, shared memory usage, buffer usage, and network usage. sar(1) is useful because it constantly collects and logs system activity information in a set of log files, which makes it possible to evaluate performance problems both before and after a performance regression is reported. sar can often be used to pinpoint the time of the event and can also be used to identify specific changes in the system's behavior. sar can also output information at a shorter interval or for a fixed number of intervals, much like vmstat. Based on the values of the count and interval parameters, the sar tool writes information the specified number of times, spaced at the specified interval in seconds. In addition, sar can provide averages for the data points it collects.
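For example, sar -u 5 12 reports overall CPU utilization every 5 seconds for 12 samples. The trailing Average: line that sar prints is an arithmetic mean over the samples; the following sketch reproduces that reduction with awk over a canned column of sample %iowait values (the values are illustrative):

```shell
# sar's "Average:" line is an arithmetic mean of the collected samples;
# the same reduction over a column of %iowait values:
printf '52.45\n47.30\n16.40\n' |
    awk '{ sum += $1; n++ } END { printf "Average: %.2f\n", sum / n }'
# Prints: Average: 38.72
```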
The following example provides statistics on a four-way SMP system by collecting data every 5 seconds:

11:09:13       CPU     %user     %nice   %system   %iowait     %idle
11:09:18       all      0.00      0.00      4.70     52.45     42.85
11:09:18         0      0.00      0.00      5.80     57.00     37.20
11:09:18         1      0.00      0.00      4.80     49.40     45.80
11:09:18         2      0.00      0.00      6.00     62.20     31.80
11:09:18         3      0.00      0.00      2.40     41.12     56.49
11:09:23       all      0.00      0.00      3.75     47.30     48.95
11:09:23         0      0.00      0.00      5.39     37.33     57.29
11:09:23         1      0.00      0.00      2.80     41.80     55.40
11:09:23         2      0.00      0.00      5.40     41.60     53.00
11:09:23         3      0.00      0.00      1.40     68.60     30.00
. . .
Average:       all      0.00      0.00      4.22     16.40     79.38
Average:         0      0.00      0.00      8.32     24.33     67.35
Average:         1      0.00      0.00      2.12     14.35     83.53
Average:         2      0.01      0.00      4.16     12.07     83.76
Average:         3      0.00      0.00      2.29     14.85     82.86

One component of CPU consumption by the system is the networking and disk servicing routines. As the operating system generates I/O, the corresponding device subsystems respond by signaling the completion of those I/O requests with hardware interrupts. The operating system counts each of these interrupts; the output can help you visualize the rate of networking and disk I/O activity. sar(1) provides this insight. With baselines, it is possible to track the rate of system interrupts, which can be another source of system overhead or an indicator of possible changes to system performance. The -I SUM option generates the following information, including the total number of interrupts per second. The -I ALL option provides similar information for each interrupt source (not shown).

Interrupt Rate

10:53:53      INTR    intr/s
10:53:58       sum   4477.60
10:54:03       sum   6422.80
10:54:08       sum   6407.20
10:54:13       sum   6111.40
10:54:18       sum   6095.40
10:54:23       sum   6104.81
10:54:28       sum   6149.80
. . .
Average:       sum   4416.53

A per-CPU view of the interrupt distribution on an SMP machine is available through the sar -A command (the following example is an excerpt from the full output). Note that the IRQ values of the system are 0, 1, 2, 9, 12, 14, 17, 18, 21, 23, 24, and 25.
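The total interrupt rate reported by sar -I SUM can be approximated without sysstat by differencing the kernel's cumulative interrupt counter. A sketch, assuming a Linux /proc file system:

```shell
# Approximate intr/s by differencing /proc/stat's cumulative "intr"
# counter over one second (the first number after "intr" is the
# running total of all interrupts since boot).
i1=$(awk '/^intr/ { print $2 }' /proc/stat)
sleep 1
i2=$(awk '/^intr/ { print $2 }' /proc/stat)
echo "intr/s: $(( i2 - i1 ))"
```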
Due to the limited width of the page, interrupts 9, 12, 14, and 17 have been clipped away.

Interrupt Distribution

10:53:53  CPU  i000/s  i001/s  i002/s ...  i018/s  i021/s  i023/s  i024/s  i025/s
10:53:58    0 1000.20    0.00    0.00 ...    0.40    0.00    0.00    3.00    0.00
10:53:58    1    0.00    0.00    0.00 ...    0.00    0.00    0.00    0.00 2320.00
10:53:58    2    0.00    0.00    0.00 ...    0.00 1156.00    0.00    0.00    0.00
10:53:58    3    0.00    0.00    0.00 ...    0.00    0.00    0.00    0.00    0.00
Average:    0  999.94    0.00    0.00 ...    1.20  590.99    0.00    3.73    0.00
Average:    1    0.00    0.00    0.00 ...    0.00    0.00    0.00    0.00  926.61
Average:    2    0.00    0.00    0.00 ...    0.00  466.51    0.00    0.00 1427.48
Average:    3    0.00    0.00    0.00 ...    0.00    0.00    0.00    0.00    0.00

The study of interrupt distribution might reveal an imbalance in interrupt processing, in which case the next step should be an examination of the scheduler. One way to tackle the problem is to bind IRQ processing to a specific processor or set of processors by setting an affinity for a particular device's interrupt (or IRQ) to a particular CPU or set of CPUs. For example, if 0x0001 is echoed to /proc/irq/ID/smp_affinity, where ID corresponds to a device's interrupt number, only CPU 0 will process IRQs for that device. If 0x000f is echoed to /proc/irq/ID/smp_affinity, CPU 0 through CPU 3 will be used to process IRQs for that device. For some workloads, this technique can reduce contention on certain heavily used processors. It allows I/O interrupts to be processed more efficiently, and I/O performance should increase accordingly.
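The affinity values above are simply CPU bitmasks: bit N set means CPU N may service the interrupt. The following sketch builds the 0x000f mask for CPUs 0 through 3; IRQ=24 is a hypothetical interrupt number, and the commented write must be run as root:

```shell
# Build a CPU-affinity bitmask: bit N set allows CPU N to service the IRQ.
# IRQ=24 is hypothetical; substitute your device's interrupt number.
IRQ=24
mask=0
for cpu in 0 1 2 3; do
    mask=$(( mask | (1 << cpu) ))
done
printf 'mask for CPUs 0-3: 0x%04x\n' "$mask"   # prints 0x000f
# As root, apply it:
# printf '%x' "$mask" > /proc/irq/$IRQ/smp_affinity
```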