One of the primary indicators of system performance is the responsiveness of the system to its users. Although most user reports of performance tend to be subjective, there are several more quantitative measures that can be applied to end-user perceptions. For instance, humans generally perceive any response within 50 milliseconds as virtually instantaneous. Interactive applications, especially those that involve text entry, appear slow when that 50-millisecond threshold is exceeded. Users of complex systems may have become acclimated to system response times of several seconds or, for occasional highly complex activities, possibly even minutes. Each workload has its own response times, often established with the user through days, weeks, months, or even years of interaction with the system.

When the user's expectations are no longer satisfied, the user tends to call the support desk to report that response times have degraded, possibly catastrophically, to the point where tasks can no longer be completed in a reasonable amount of time. These calls serve as an unplanned opportunity to employ performance analysis tools of various types. Unfortunately, reactive application of these tools without known baselines can present quite a challenge to the uninitiated. Having an opportunity to monitor the application before a crisis, as well as creating and archiving various performance statistics, typically helps expedite a resolution to any complaints to the support desk. Also, some general experience in recognizing the signs of system performance problems is invaluable in understanding performance problems with a workload. Systematic and methodical application of the tools described here, typically coupled with general knowledge of the workload, should make it possible to narrow down any performance bottleneck.
And with knowledge of the bottleneck from a system perspective, a system administrator or performance analyst should be able to employ system tuning, system reconfiguration, workload reconfiguration, or detailed analysis of specific workload components to further identify and ultimately resolve any performance bottleneck.

In a methodical analysis of the system, often one of the first and most basic tools is a simple measure of the system's CPU utilization. In particular, Linux and most UNIX-based operating systems provide a command that displays the system's load average:

$ uptime
 17:37:30 up 3 days, 17:06,  7 users,  load average: 1.13, 1.23, 1.15

The load average represents the average number of runnable tasks over periods of 1, 5, and 15 minutes. Runnable tasks are those that either are currently running or can run but are waiting for a processor to become available. In this case, the system had only one CPU, which can be determined by looking at the contents of /proc/cpuinfo:

$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 8
model name      : Pentium III (Coppermine)
stepping        : 6
cpu MHz         : 647.195
cache size      : 256 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 mmx fxsr sse
bogomips        : 1291.05

There is only an entry for a single processor, so on average there was slightly more work for the processor to do than it could keep up with. At a high level, this implies that the machine has more work to do than it can handle.

Note: If uptime showed a load average of something less than 2.00 on a two-CPU machine, that would indicate that the processors still have additional free cycles. The same would be true on a four-CPU machine with a load average of less than 4.00, and so on. However, the load average alone does not tell the entire story.
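The comparison of load average to processor count can be scripted directly from /proc. The following is a minimal sketch, assuming a Linux /proc file system and standard awk and grep; it is not part of the original text:

```shell
# Compare the 1-minute load average against the number of processors.
# Assumes a Linux /proc file system.
load=$(awk '{ print $1 }' /proc/loadavg)
ncpu=$(grep -c '^processor' /proc/cpuinfo)
echo "1-minute load average: $load on $ncpu CPU(s)"
awk -v l="$load" -v n="$ncpu" 'BEGIN {
    if (l > n)
        print "runnable tasks exceed processors: CPU is saturated"
    else
        print "processors still have free cycles"
}'
```

On the single-CPU laptop in the example above, a load of 1.13 against 1 CPU would report saturation.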
In this case, the machine in question was a laptop. The workload included an email client, word processors, web browsers, some custom applications, and chat clients, all running simultaneously. However, the response time of the various workloads was deemed acceptable by the user. This implies that the operating system scheduler was appropriately prioritizing the most interactive tasks. No significant server components were running on the machine, so the high load average had no impact on other clients or machines. So, although the tool indicates that the CPU was fully utilized, it does not indicate what the system was doing or why it was so busy. And if the response time for the users of the system was acceptable, there may not be any reason to probe more deeply into the system's operation. However, this chapter would be uninteresting if we stopped there.

Quite often, simple tools like uptime are a user's quick way to attempt to account for any perceived slowness in an application's response time. If the load average for the system indicates that the response time may be caused by an overloaded processor (or processors), a variety of other tools can be used to narrow down the cause of the load.

To drill down further into how the processors are being used, we'll look at three tools that provide a variety of insights into CPU utilization: vmstat, iostat, and top. Each of these commands has a different focus on system monitoring, but each can be used to get a different perspective on how the processor is being utilized. In particular, the next step is to understand whether the processor is primarily spending time in the operating system (often referred to as kernel space), in the application (often referred to as user space), or sitting idle. Further, if the processor is idle, understanding why it is idle is key to any further performance analysis.
A processor can be idle for a number of reasons. For instance, the most obvious reason is that a process cannot run. This may sound too obvious, but if a component of the workload, such as a particular process or task, is not running, performance may be impacted. In some cases, caching components or fallback mechanisms allow some applications to continue to run, although with degraded throughput. As an example, the Internet domain name service is often configured to consult a daemon called named or to consult a service off-host. If one name service provider, such as the first one listed with a nameserver line in /etc/resolv.conf, is not running, there may be a timeout period before another information provider is consulted. To the user, this may appear as sporadic delays in the application. To someone monitoring the system with uptime, the load average may not appear very high. However, in this case, the output from vmstat may help narrow down the problem. In particular, many of the tools that we will look at in this chapter and the next two will help you understand what the machine's CPU is doing, from which a performance analyst can form hypotheses about the system performance and then attempt to validate them.

vmstat

vmstat is a real-time performance monitoring tool. The vmstat command provides data that can be used to help find unusual system activity, such as high page faults or excessive context switches, which can degrade system performance. The frequency of data display is a user-specified parameter.
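The resolver's fallback order can be inspected directly. This sketch prints each nameserver entry from /etc/resolv.conf in the order the resolver tries them; the "primary"/"fallback" labels are illustrative, not resolver terminology:

```shell
# Print configured nameservers in lookup order.
# The first entry is tried before any fallback, so a dead first server
# shows up as a timeout delay on every uncached lookup.
awk '/^nameserver/ { printf "%s %s\n", (seen++ ? "fallback:" : "primary:"), $2 }' /etc/resolv.conf
```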
A sample of the vmstat output is as follows:

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd    free   buff  cache   si   so     bi   bo   in    cs us sy id wa
18  8      0 5626196  3008 122788    0    0 330403  454 2575  4090 91  8  1  0
18 15      0 5625132  3008 122828    0    0 328767  322 2544  4264 91  8  0  0
17 12      0 5622004  3008 122828    0    0 327956  130 2406  3998 92  8  0  0
22  2      0 5621644  3008 122828    0    0 327892  689 2445  4077 92  8  0  0
23  5      0 5621616  3008 122868    0    0 323171  407 2339  4037 92  8  1  0
21 14      0 5621868  3008 122868    0    0 323663   23 2418  4160 91  9  0  0
22 10      0 5625216  3008 122868    0    0 328828  153 2934  4518 90  9  1  0

vmstat produces the following information:

r: the number of runnable processes (running or waiting for a processor)
b: the number of processes in uninterruptible sleep, typically blocked on I/O
swpd: the amount of virtual memory used, in kilobytes
free: the amount of idle memory, in kilobytes
buff, cache: the amount of memory used as buffers and as page cache, in kilobytes
si, so: the amount of memory swapped in from and out to disk, per second
bi, bo: blocks received from and sent to block devices, per second
in: the number of interrupts per second
cs: the number of context switches per second
us, sy, id, wa: the percentages of CPU time spent in user space, in the kernel, idle, and waiting on I/O
For I/O-intensive workloads, you can monitor bi and bo for the transfer rate and in for the interrupt rate. You can monitor swpd, si, and so to see whether the system is swapping and, if so, check the swapping rate. Perhaps the most common use is monitoring CPU utilization via us, sy, id, and wa. If wa is large, you need to examine the I/O subsystem; you might conclude, for example, that more I/O controllers and disks are needed to reduce the I/O wait time.

Like uptime, when vmstat is run without options, it reports a single snapshot of the system. If you run uptime followed by vmstat, you can get a quick snapshot of how busy the system is and some indication of what the processor is doing, with the percentage breakdown of user, system, idle, and waiting-on-I/O times. In addition, vmstat provides an instantaneous count of the number of runnable processes, whereas uptime provides the average number of runnable processes across three time periods: 1 minute, 5 minutes, and 15 minutes. So, if the load average from uptime remains above 1 for any period of time, the number of runnable tasks reported by vmstat should also be near 1.

Because vmstat can also provide information at regular, repeated intervals, you can get a more dynamic view of the system with the following command:

$ vmstat 5 10

This command outputs vmstat information every 5 seconds for 10 iterations. Again, the output should generally show about one runnable task in each line of output if the load average has been 1 for the past 1, 5, and 15 minutes per the output of uptime. Do not be surprised to see occasional peaks of 5, 7, or even 20 in the output of vmstat! Remember that the load average is a calculated average, as opposed to an instantaneous snapshot. Both views have their benefits in system performance analysis.
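The runnable-task count in vmstat's first column can be extracted mechanically for comparison against the load average. A sketch, assuming procps vmstat and its two-line header:

```shell
# Extract the runnable-task count (the "r" column) from periodic vmstat
# output, skipping the two header lines.
vmstat 5 10 | awk 'NR > 2 { print "runnable:", $1 }'
```

Averaging those values over time should roughly track the 1-minute load average reported by uptime.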
vmstat also provides a rough view of how much time the processor spends in the operating system, how much is spent in user application space (including libraries provided as part of the application or operating system), how much is idle, and how much is spent blocked waiting on I/O. Different workloads have different utilization patterns, so there are no definite rules as to what values are good or bad. It is therefore good to have developed a baseline for your workload to help you recognize how the workload has changed as the response time or system load increases. vmstat helps identify which tools can drill down into any workload degradations or provide clues as to where tuning or reconfiguration can improve workload performance. The remainder of this section explores a few examples.

Imagine a scenario where users are reporting poor response times with a workload. Examining the load average via uptime shows it to be very low, possibly even well below the baseline. Further, vmstat shows that the number of runnable jobs is very low and that the system is relatively idle, judging from the percentage of CPU idle time. An analyst might interpret these results as a sign that some key process had exited or was blocked waiting on some event that wasn't completing. For instance, some applications use a form of semaphore technique to dispatch work and wait for completion. Perhaps the work was dispatched to a back-end server or other application, and that application has for some reason stopped processing all activity. As a result, the application closest to the user is blocked, not runnable, waiting on a semaphore to notify it of completion before it can return information to the user. This might cause the administrator to focus attention on the server application to see why it is unable to complete the requests being queued for it.
In another scenario, suppose that the load average shows a load in excess of 1, possibly even a full point higher on average than established baselines. In addition, vmstat shows one or two processes always runnable, and the percentage of user time is nearly 100% over an extended period. Another tool might be necessary to find out what process or processes are using up the CPU time; for instance, ps(1) or top(1). ps(1) provides a listing of all processes that currently exist, or some selected subset of processes based on its options. top(1) (or gtop(1)) provides a constantly updating view of the most active processes, where "most active" can be defined as those processes using the most processor time. This data might help identify a runaway process that is doing nothing useful on the system.

If vmstat(1) had reported that the processes were mostly running in user space, the administrator might want to connect a debugger such as gdb(1) to the process and use breakpoints, tracing, or other debugging means to understand what the application was doing. If vmstat had reported that most of the time was being consumed as "system" time, other tools such as strace(1) might be used to find out what system calls were being made. If vmstat(1) had reported that a large percentage of time was being spent waiting for I/O completion, tools such as sar(1) could be used to see what devices were being used and to provide some possible insights into which applications or file systems were in use, whether the system was swapping or paging, and so on.

vmstat(1) provides some simple insights into the current state of the system. In this section, we looked at the areas that primarily relate to CPU utilization. However, vmstat(1) also provides some insight into memory utilization, basic swapping activity, and I/O activity. Later sections in this chapter look at these areas in more detail.
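Finding the busiest process can also be done non-interactively with ps. A sketch using the procps options for user-defined output and sorting (--sort is a GNU/Linux extension):

```shell
# List the five busiest processes by CPU percentage.
# -e selects all processes; -o picks the columns; --sort=-pcpu sorts
# descending by CPU usage (procps ps on Linux).
ps -eo pid,pcpu,comm --sort=-pcpu | head -6
```

The process at the top of this list is the first candidate for attaching gdb(1) or strace(1), depending on whether the time is going to user or system space.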
top and gtop

top and gtop are very useful tools for understanding the tasks and processes that contribute to the high-level information provided by vmstat or uptime. They can show which processes are active and which ones are consuming the most processing time or memory over time. The top command provides an updating overview of all running processes and the system load. top provides information on CPU load, memory usage, and usage per process, as detailed in the snapshot that follows. Note that it also provides load-average snapshots much like uptime(1) does; however, top also provides a breakout of the number of processes that have been created but are currently sleeping, and the number of processes that are running. "Sleeping" tasks are those that are blocked waiting on some activity, such as a key press from a user at a keyboard, data from a pipe or socket, or requests from another host (such as a web server waiting for someone to request content). top(1) also shows the CPU state percentages for each processor independently, which can help identify any imbalances in scheduling tasks. By default, the output of top is refreshed frequently, and tasks are sorted by percentage of CPU consumption. Other sorting options are possible, such as cumulative CPU consumption or percentage of memory consumption.
  4:52pm  up  5:08,  3 users,  load average: 2.77, 5.81, 3.15
37 processes: 36 sleeping, 1 running, 0 zombie, 0 stopped
CPU0 states:  5.1% user, 53.1% system,  0.0% nice, 41.1% idle
CPU1 states:  5.0% user, 52.4% system,  0.0% nice, 41.4% idle
Mem:  511480K av,  43036K used, 468444K free,      0K shrd,   2196K buff
Swap: 263992K av,      0K used, 263992K free                  21432K cached

  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME COMMAND
 1026 root       2   0   488  488   372 S     1.1  0.0   0:00 chat_s
 7490 root      11   0  1012 1012   816 R     0.5  0.1   0:00 top
    3 root      19  19     0    0     0 SWN   0.3  0.0   0:00 ksoftirqd_C
    4 root      19  19     0    0     0 SWN   0.1  0.0   0:00 ksoftirqd_C
    1 root       9   0   536  536   468 S     0.0  0.1   0:04 init
    2 root       9   0     0    0     0 SW    0.0  0.0   0:00 keventd
    5 root       9   0     0    0     0 SW    0.0  0.0   0:00 kswapd
    6 root       9   0     0    0     0 SW    0.0  0.0   0:00 bdflush
    7 root       9   0     0    0     0 SW    0.0  0.0   0:00 kupdated
    9 root       9   0     0    0     0 SW    0.0  0.0   0:00 scsi_eh_0
   10 root       9   0     0    0     0 SW    0.0  0.0   0:00 khubd
  331 root       9   0   844  844   712 S     0.0  0.1   0:00 syslogd
  341 root       9   0  1236 1236   464 S     0.0  0.2   0:00 klogd
  356 rpc        9   0   628  628   536 S     0.0  0.1   0:00 portmap

top includes the following information in its output:

PID: the process ID
USER: the user name of the task's owner
PRI: the task's priority
NI: the task's nice value
SIZE: the size of the task's code, data, and stack space, in kilobytes
RSS: the amount of physical memory used by the task, in kilobytes
SHARE: the amount of shared memory used by the task, in kilobytes
STAT: the task's state (S is sleeping, R is running, W indicates no resident pages, N indicates a positive nice value)
%CPU: the task's share of CPU time since the last screen update
%MEM: the task's share of physical memory
TIME: the total CPU time the task has used since it started
COMMAND: the task's command name
top also has a dynamic mode to change the reported information. Activate it by pressing the f key. By further pressing the j key, you can add a new column showing the CPU last used by each executing process. This additional information is particularly useful for understanding process behavior on an SMP system. This section barely scratches the surface of the types of information that top and gtop provide. For more information on top, see the corresponding man page; for more information on gtop, see its built-in help.

sar

sar is part of the sysstat package. sar collects and reports a wide range of system activity in the operating system, including CPU utilization, the rate of context switches and interrupts, the rate of paging in and out, shared memory usage, buffer usage, and network usage. sar(1) is useful because it constantly collects and logs system activity information in a set of log files, which makes it possible to evaluate performance problems both before and after a performance regression is reported. sar can often be used to pinpoint the time of the event and can also be used to identify specific changes in the system's behavior. sar can also output information at a shorter interval or for a fixed number of intervals, much like vmstat. Based on the values of the count and interval parameters, the sar tool writes information the specified number of times, spaced at the specified interval in seconds. In addition, sar can provide averages for the data points it collects.
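For example, sar -u 5 12 reports overall CPU utilization every 5 seconds for 12 samples. The trailing Average: line that sar prints is an arithmetic mean over the samples; the following sketch reproduces that reduction with awk over a canned column of sample %iowait values (the values are illustrative):

```shell
# sar's "Average:" line is an arithmetic mean of the collected samples;
# the same reduction over a column of %iowait values:
printf '52.45\n47.30\n16.40\n' |
    awk '{ sum += $1; n++ } END { printf "Average: %.2f\n", sum / n }'
# Prints: Average: 38.72
```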
The following example provides statistics on a four-way SMP system by collecting data every 5 seconds:

11:09:13       CPU     %user     %nice   %system   %iowait     %idle
11:09:18       all      0.00      0.00      4.70     52.45     42.85
11:09:18         0      0.00      0.00      5.80     57.00     37.20
11:09:18         1      0.00      0.00      4.80     49.40     45.80
11:09:18         2      0.00      0.00      6.00     62.20     31.80
11:09:18         3      0.00      0.00      2.40     41.12     56.49
11:09:23       all      0.00      0.00      3.75     47.30     48.95
11:09:23         0      0.00      0.00      5.39     37.33     57.29
11:09:23         1      0.00      0.00      2.80     41.80     55.40
11:09:23         2      0.00      0.00      5.40     41.60     53.00
11:09:23         3      0.00      0.00      1.40     68.60     30.00
. . .
Average:       all      0.00      0.00      4.22     16.40     79.38
Average:         0      0.00      0.00      8.32     24.33     67.35
Average:         1      0.00      0.00      2.12     14.35     83.53
Average:         2      0.01      0.00      4.16     12.07     83.76
Average:         3      0.00      0.00      2.29     14.85     82.86

One component of CPU consumption by the system is the networking and disk servicing routines. As the operating system generates I/O, the corresponding device subsystems respond by signaling the completion of those I/O requests with hardware interrupts. The operating system counts each of these interrupts; the output can help you visualize the rate of networking and disk I/O activity. sar(1) provides this insight. With baselines, it is possible to track the rate of system interrupts, which can be another source of system overhead or an indicator of possible changes to system performance. The -I SUM option generates the following information, including the total number of interrupts per second. The -I ALL option provides similar information for each interrupt source (not shown).

Interrupt Rate

10:53:53      INTR    intr/s
10:53:58       sum   4477.60
10:54:03       sum   6422.80
10:54:08       sum   6407.20
10:54:13       sum   6111.40
10:54:18       sum   6095.40
10:54:23       sum   6104.81
10:54:28       sum   6149.80
. . .
Average:       sum   4416.53

A per-CPU view of the interrupt distribution on an SMP machine is available through the sar -A command (the following example is an excerpt from the full output). Note that the IRQ values of the system are 0, 1, 2, 9, 12, 14, 17, 18, 21, 23, 24, and 25.
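The total interrupt rate reported by sar -I SUM can be approximated without sysstat by differencing the kernel's cumulative interrupt counter. A sketch, assuming a Linux /proc file system:

```shell
# Approximate intr/s by differencing /proc/stat's cumulative "intr"
# counter over one second (the first number after "intr" is the
# running total of all interrupts since boot).
i1=$(awk '/^intr/ { print $2 }' /proc/stat)
sleep 1
i2=$(awk '/^intr/ { print $2 }' /proc/stat)
echo "intr/s: $(( i2 - i1 ))"
```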
Due to the limited width of the page, interrupts 9, 12, 14, and 17 have been clipped away.

Interrupt Distribution

10:53:53  CPU  i000/s  i001/s  i002/s ...  i018/s  i021/s  i023/s  i024/s  i025/s
10:53:58    0 1000.20    0.00    0.00 ...    0.40    0.00    0.00    3.00    0.00
10:53:58    1    0.00    0.00    0.00 ...    0.00    0.00    0.00    0.00 2320.00
10:53:58    2    0.00    0.00    0.00 ...    0.00 1156.00    0.00    0.00    0.00
10:53:58    3    0.00    0.00    0.00 ...    0.00    0.00    0.00    0.00    0.00
Average:    0  999.94    0.00    0.00 ...    1.20  590.99    0.00    3.73    0.00
Average:    1    0.00    0.00    0.00 ...    0.00    0.00    0.00    0.00  926.61
Average:    2    0.00    0.00    0.00 ...    0.00  466.51    0.00    0.00 1427.48
Average:    3    0.00    0.00    0.00 ...    0.00    0.00    0.00    0.00    0.00

The study of interrupt distribution might reveal an imbalance in interrupt processing, in which case the next step should be an examination of the scheduler. One way to tackle the problem is to bind IRQ processing to a specific processor or set of processors by setting an affinity for a particular device's interrupt (or IRQ) to a particular CPU or set of CPUs. For example, if 0x0001 is echoed to /proc/irq/ID/smp_affinity, where ID corresponds to a device's interrupt number, only CPU 0 will process IRQs for that device. If 0x000f is echoed to /proc/irq/ID/smp_affinity, CPU 0 through CPU 3 will be used to process IRQs for that device. For some workloads, this technique can reduce contention on certain heavily used processors. It allows I/O interrupts to be processed more efficiently, and I/O performance should increase accordingly.
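The affinity values above are simply CPU bitmasks: bit N set means CPU N may service the interrupt. The following sketch builds the 0x000f mask for CPUs 0 through 3; IRQ=24 is a hypothetical interrupt number, and the commented write must be run as root:

```shell
# Build a CPU-affinity bitmask: bit N set allows CPU N to service the IRQ.
# IRQ=24 is hypothetical; substitute your device's interrupt number.
IRQ=24
mask=0
for cpu in 0 1 2 3; do
    mask=$(( mask | (1 << cpu) ))
done
printf 'mask for CPUs 0-3: 0x%04x\n' "$mask"   # prints 0x000f
# As root, apply it:
# printf '%x' "$mask" > /proc/irq/$IRQ/smp_affinity
```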