Performance Monitoring

   


The system manager is responsible not only for the provision of an IT service to customers, but also for the maintenance of an acceptable level of performance. To this end, the system manager must ensure that the systems under his control are not suffering from poor performance or substandard response times.

Performance monitoring is normally carried out by the system administrator, but the system manager needs to have a good appreciation of what is involved and the tools that are available to obtain the required information. This is especially important because the system manager will be compiling the periodic reports that undoubtedly will form part of an ongoing SLA.

Regular monitoring of the systems builds a crucial library of information, highlighting trends in both usage and performance. For example, the data gathered could provide an early warning of increased usage or load on the systems, perhaps necessitating a hardware upgrade or even a replacement system. The important point is that, armed with this information, the system manager should not face any surprises; any additional processing power or other resource that might be needed can be planned for and implemented before it becomes a performance issue.

The other side of performance monitoring is to investigate problems as they occur: to try to identify bottlenecks that could be affecting the overall performance of the system, or to gather evidence that the loading is not evenly balanced (this is especially visible with disks). Frequently, a problem may manifest itself in one way but actually be caused by something else. This is one reason why the activity of monitoring is so useful: the process of elimination helps to narrow down the true cause of the problem.

The next subsections describe the kinds of problems likely to be encountered, as well as the tools provided with the standard Solaris release. These sections also discuss some utilities and tools that are available both commercially and in the public domain.

What Are You Looking For?

Monitoring the performance of the system generally involves looking for areas of weakness or contention. Bottlenecks can slow down the rest of the system; for example, a badly balanced disk setup that is overloaded with requests can cause the processor to wait regularly for data. The monitoring process also can provide evidence that the current resources are starting to struggle with the load being placed upon them. The system manager can use this information as evidence that an upgrade is required (possibly due to increased usage by the customer) and to justify why the customer department should pay for it in order to maintain the required level of service.

Solaris Utilities

Solaris provides a comprehensive set of tools for monitoring the system's performance. These are discussed in the following paragraphs to show the type of information that can be obtained and to explain what to look for when ascertaining whether there is a performance problem. The final utility in this section, top, originated in the public domain and is now included with the standard release of the Solaris operating environment (as of Solaris 8). It is extremely useful for seeing which processes are using most of the system resources.

vmstat

The vmstat utility is listed as a tool for reporting virtual memory statistics, but it does much more than that. It also gives a good overall picture of how the system is performing. Listing 9.3 shows the output from the vmstat command.

Listing 9.3 Output from the vmstat Command Using an Interval Period of 3 Seconds
 #vmstat 3
 procs     memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr s0 s6 s7 s8   in   sy   cs us sy id
 0 0 0   1184 96992   0   6 11  9 12  0  0  2  0  0  2  260 1253  429 64  8 29
 6 0 0 2873928 33128  0   0  5 18 18  0  0 23  0  0 66 1754 4880 1806 81 19  0
 3 0 0 2873928 33136  0   0  0  2  2  0  0 26  0  0 33 1281 5937 1523 83 17  0
 3 0 0 2873928 33128  0   0  2  2  2  0  0 15  0  0 15  988 6416 1523 83 16  0
 1 0 0 2873928 33128  1   0  8 18 18  0  0 16  0  0 23 1229 5428 1755 77 22  0
 2 0 0 2873928 33128  0   0  2  2  2  0  0 28  0  0  6  966 6366 1342 84 16  0
 2 1 0 2873928 33136  0   0  5  8  8  0  0 29  0  0 30 1405 6175 1925 81 18  1
 2 0 0 2873928 33128 15   0 525 312 312 0  0 31  0  0 35 1452 3331 2005 76 20  4
 1 2 0 2873928 32744 15   1  8 594 594 0  0  9  0  0 77 1838 3356 2047 74 23  3
 2 0 0 2873928 33128  4   0  8 45 45  0  0 24  0  0 19 1178 3077 1616 80 20  1
 1 0 0 2873928 33120 14   1  8 128 128 0  0 28  0  0 14  975 3099 1439 70 23  6
 1 1 0 2873928 33120  5   0 16 66 66  0  0 27  0  0 32 1386 3764 1953 74 22  4
 2 1 0 2873928 33128  3   0  2 90 90  0  0 34  0  0 81 1984 2961 1820 78 18  3
 2 0 0 2873928 33128  2   0  2 24 24  0  0  2  0  0 26 1255 3518 1944 78 21  0
 2 1 0 2873928 33128  3   0  5 58 58  0  0 30  0  0 74 1804 2810 1656 75 23  2
 2 0 0 2873928 33184 10   0 592 77 77 0  0  8  0  0 32 1129 2972 1490 80 20  1
 3 0 0 2873928 33176  1   0 202 10 10 0  0 31  0  0 26 1305 3315 1825 71 26  3
 1 0 0 2873928 33128  1   0  2 13 13  0  0 27  0  0 14  952 3063 1366 68 24  8
 2 0 0 2873928 33128  0   0  2  8  8  0  0 12  0  0 40 1529 3514 2060 73 24  4
 2 2 0 2873928 33120  5   0 13 261 261 0  0 30  0  0 83 2260 3466 2180 71 18 11

Ignore the first entry here because it is a summary of all activity since the system was last rebooted. The information provided in this entry is meaningless.

This system is heavily loaded and could benefit from additional processor power, as shown by the number of runnable processes in the "r" column combined with the fact that the CPU is running almost constantly at near 100% busy, with very little idle time. Additionally, the disk balancing is not ideal: s0 and s8 bear most of the work, while s7 shows no activity at all. There are also some blocked processes, identified by the "b" column; this indicates that those processes had to wait for I/O, probably from disk. The output in this listing also shows that there is no shortage of physical memory; otherwise, the scan rate (sr) column would be showing high numbers (more than 200 to 300).

The page-in and page-out columns (pi and po) would be high (maybe several thousand) if the system were being used as an NFS server. These show both file system I/O activity and virtual memory activity.
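The rules of thumb above can be sketched in a few lines of code. This is an illustrative check only: the thresholds (runnable queue length, idle time under 5%, scan rate over 200) are the rough figures suggested in the text, not fixed limits, and the column positions follow the header shown in Listing 9.3.

```python
def check_vmstat_line(line):
    """Flag the warning signs described above in one interval line of vmstat."""
    f = line.split()
    # Columns per Listing 9.3: r is field 0, sr is field 11, id is the last field.
    r, sr, idle = int(f[0]), int(f[11]), int(f[-1])
    warnings = []
    if r > 1 and idle < 5:
        warnings.append("CPU-bound: runnable processes queued with almost no idle time")
    if sr > 200:
        warnings.append("possible memory shortage: scan rate (sr) is high")
    return warnings

# Second interval line from Listing 9.3: six runnable processes, 0% idle.
print(check_vmstat_line("6 0 0 2873928 33128 0 0 5 18 18 0 0 23 0 0 66 1754 4880 1806 81 19 0"))
# → ['CPU-bound: runnable processes queued with almost no idle time']
```

In practice such a check would be fed from `vmstat` output line by line, skipping the first (summary) sample.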

The vmstat command shows at a glance the overall status of the various components. For fuller details of processor usage on systems containing more than one processor, use mpstat. Detailed disk information can be found using the iostat command; further statistics about the swap space can be displayed with the swap command. These commands are discussed next.

mpstat

The mpstat command produces a tabular report for each processor in a multiprocessor environment, providing useful information on how the CPUs are performing, how busy they are, whether the load is evenly balanced among them, and also how much time is spent waiting for other resources. Listing 9.4 displays a sample output from running the mpstat command.

Listing 9.4 Output from the mpstat Command Using an Interval Period of 5 Seconds
 #mpstat 5
 CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
   0    2   0    8    39    3  159   55    6    2    0   450   67   7   1  26
   1    2   0   14   275   52  170   56    6    2    0   522   58   8   2  32
 CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
   0   23   0    3   306   49 1264  553   29    7    0  2204   86  13   0   1
   1   51   0   16   543   26 1208  581   29    7    0  2036   82  18   0   1
 CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
   0    0   0    2    88   32  637  274   13    4    0  1791   83  17   0   0
   1    0   0    6   279   17  193   91   13    5    0   388   92   8   0   0
 CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
   0    0   0    1    77    1  269  105   32    5    0   506   89   4   0   7
   1    0   0    5   297   58  573  226   33    4    0  1123   67   5   0  28
 CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
   0    0   0    1    85    2  213   95   24    2    0   778   73  19   0   8
   1    0   0    7   267   36  907  404   23    3    0  2003   89   9   1   2
 CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
   0    0   0    4    74    3  622  258   26    3    0  1253   94   4   0   2
   1    0   0    6   291   27  314  112   27    2    0   611   53  15   0  32
 CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
   0    0   0    1    16    2   23   14    3    0    0    27   99   1   0   0
   1    0   0    7   217   16   28    7    2    0    0   611   34  15   0  52
 CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
   0    0   0    5    16    6   68    9    7    0    0    81   60   2   1  37
   1    0   0    7   233   30   53    8    5    0    0   385   53   8   1  39
 CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
   0    0   0    1    53    2  158   51    5    0    0   161   79   2   0  19
   1    0   0    8   262   33  170   37    6    0    0   517   41  12   0  47
 CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
   0    0   0    2    16    1   29   15    4    0    0    25   91   2   0   8
   1    0   0    6   223   22   25    8    3    0    0   680   43  17   0  40
 CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
   0    2   0    6    38    3  213   71    7    0    0   307   61   4   0  35
   1    0   0    7   253   18  237   84    5    0    0   366   49   6   0  46
 CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
   0    0   0    1    18    1   27    6    2    0    0    23   99   1   0   0
   1    0   0    5   212   11   21    8    4    0    0   979   52  27   0  21
 CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
   0    0   0    6    16    6   48   10    4    0    0    59   83   0   1  15
   1    0   0    7   238   36   76    2    5    0    0    96   11   3   1  85
 CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
   0    0   0    1    75    1  142   73    6    1    0   177   97   3   0   0
   1    0   0    7   271   18  138   64    5    1    0   875   52  17   0  31

As with vmstat , the first entry is a summary of all activity since the system was last rebooted. The information in this entry is meaningless and should be ignored.

The columns of real interest are described here:

  • intr The number of interrupts on each of the processors. An unbalanced number of interrupts can be caused by the placement of the SBUS cards on the system board. If the differences are high (by more than, say, 2000), consider redistributing the SBUS cards.

  • usr The amount of processor time being given to the user process. A normally functioning system would expect this to be around 85% of the usage. The example output in Listing 9.4 shows that the processors are functioning correctly.

  • wt How much time the processor(s) had to wait while I/O was carried out, normally a read from disk. If this figure is high, maybe 40 to 50, then the system is definitely I/O-bound. The example output shows only an occasional wait for I/O, which poses no threat to performance.
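A minimal sketch of these two checks, applying the 2000-interrupt difference and the 40-50% wait figures quoted above. The per-CPU samples are represented here as simple dictionaries for clarity; a real script would parse them from mpstat's columns.

```python
def check_mpstat(rows):
    """Apply the intr-balance and wt rules of thumb to one interval of mpstat rows."""
    warnings = []
    intrs = [r["intr"] for r in rows]
    if max(intrs) - min(intrs) > 2000:
        warnings.append("interrupt load unbalanced: consider redistributing the SBUS cards")
    for r in rows:
        if r["wt"] >= 40:
            warnings.append("CPU %d waiting on I/O %d%% of the time: possibly I/O-bound"
                            % (r["cpu"], r["wt"]))
    return warnings

# Second interval from Listing 9.4: interrupts reasonably balanced, no I/O wait.
interval = [{"cpu": 0, "intr": 306, "wt": 0}, {"cpu": 1, "intr": 543, "wt": 0}]
print(check_mpstat(interval))   # → []
```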

iostat

The iostat command reports statistics on terminal, disk, and tape I/O activity (as well as CPU utilization), although its main use is to monitor disk performance. It can be used to identify a badly balanced disk configuration, and it provides sufficient options to examine the performance of a single disk partition. Listing 9.5 displays the iostat command with extended statistics for each disk volume.

Listing 9.5 Sample Output from the iostat Command Using an Interval Period of 5 Seconds and the Flags -xn
 #iostat -xn 5
                   extended device statistics
    r/s  w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0  4.0    0.0   30.7  0.0  0.0    0.0    9.7   0   4 c0t0d0
    0.0  0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t6d0
    0.0  0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t8d0
    1.8 49.7   16.0  397.6  0.0  0.4    0.0    8.5   0  34 c0t9d0
    0.0 74.5    0.0  595.7  0.0  0.5    0.0    6.3   0  44 c0t10d0
    2.0 66.3   22.4  530.2  0.0  0.7    0.0   11.0   0  56 c0t11d0
    0.0  0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 rmt/0
                   extended device statistics
    r/s  w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0 14.2    0.0   67.4  0.0  0.1    0.0    6.8   0  10 c0t0d0
    0.0  0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t6d0
    0.0  0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t8d0
    0.6 32.0    4.8  256.0  0.0  0.2    0.0    6.7   0  20 c0t9d0
    0.0 73.8    0.0  590.4  0.0  0.4    0.0    5.9   0  42 c0t10d0
    0.6 68.2    4.8  545.6  0.0  0.9    0.0   13.2   0  64 c0t11d0
    0.0  0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 rmt/0
                   extended device statistics
    r/s  w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0 23.6    0.0  111.8  0.0  0.2    0.0    7.0   0  16 c0t0d0
    0.0  0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t6d0
    0.0  0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t8d0
    1.8 52.2   14.4  417.5  0.0  0.4    0.0    6.9   0  34 c0t9d0
    0.0 34.6    0.0  276.7  0.0  0.2    0.0    5.9   0  20 c0t10d0
    1.6 89.2   22.4  713.4  0.0  1.5    0.0   16.1   0  85 c0t11d0
    0.0  0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 rmt/0

As with vmstat and mpstat , the first entry is a summary of all activity since the system was last rebooted. The information in this entry is meaningless and should be ignored.

The information provided by iostat clearly shows disk volumes that are being used excessively and those that are not being used at all. The columns of interest are described here:

  • wait The number of requests waiting in the O/S to get to the disk. If this figure is greater than 0, it should be treated as a warning that there might be a performance issue, particularly if the disk is in a disk array.

  • actv The number of requests pending in the disk volume itself. There is likely a performance problem if this figure is high (15 to 20) and the disk is already very busy.

  • %b How busy the disk volume is, displayed as a percentage. Normally this figure should be less than 65%; disk device c0t11d0 has a value of 85%, which constitutes excessive usage, especially when compared to the other disk volumes.

  • svc_t The time taken to service a request, in milliseconds. The value shown in this column includes any time that a request might spend waiting in the queue. When this value is high (greater than 100), it can point to a performance problem if the disk is already busy.
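These thresholds can be collected into one small screening function. Again the cutoff values (65% busy, 15 outstanding requests, 100 ms service time) are the illustrative figures from the bullets above, not hard limits.

```python
def check_iostat(device, wait, actv, busy, svc_t):
    """Apply the iostat warning thresholds described above to one device's figures."""
    warnings = []
    if wait > 0:
        warnings.append(f"{device}: requests queuing in the O/S (wait={wait})")
    if actv >= 15 and busy > 65:
        warnings.append(f"{device}: deep device queue on a busy disk (actv={actv})")
    if busy > 65:
        warnings.append(f"{device}: excessively busy ({busy}%)")
    if svc_t > 100 and busy > 65:
        warnings.append(f"{device}: slow service time ({svc_t} ms)")
    return warnings

# c0t11d0 from the third interval of Listing 9.5: 85% busy, nothing else amiss.
print(check_iostat("c0t11d0", wait=0.0, actv=1.5, busy=85, svc_t=16.1))
# → ['c0t11d0: excessively busy (85%)']
```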

Listing 9.6 shows a more detailed picture for disk volumes, displaying the usage information for each partition of each disk.

Listing 9.6 Sample Output from the iostat Command Using an Interval Period of 5 Seconds and the Flags -xnp to Show Greater Detail for Each Physical Disk Volume
 #iostat -xnp 5
                   extended device statistics
    r/s  w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    3.5 14.4   29.1   92.1  0.0  0.2    0.0   12.9   0  10 c0t0d0
    0.0  0.1    0.2    0.3  0.0  0.0    0.0   42.2   0   0 c0t0d0s0
    0.1  1.9    0.4   10.3  0.0  0.0    0.1   17.8   0   2 c0t0d0s1
    0.0  0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t0d0s2
    0.0  0.0    0.0    0.1  0.0  0.0    0.0    9.0   0   0 c0t0d0s3
    0.3  8.9    2.5   46.1  0.0  0.1    0.0   14.3   0   6 c0t0d0s5
    0.4  0.0    2.5    0.1  0.0  0.0    0.0    5.4   0   0 c0t0d0s6
    2.6  3.5   23.5   35.1  0.0  0.1    0.0    9.3   0   4 c0t0d0s7
    0.0  0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t6d0
    1.6  2.2   17.0   18.4  0.0  0.0    0.0    5.8   0   2 c0t8d0
    0.5  0.3    6.9    2.3  0.0  0.0    0.0    5.2   0   0 c0t8d0s0
    0.0  0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t8d0s2
    0.6  1.9    6.4   15.8  0.0  0.0    0.0    6.1   0   1 c0t8d0s5
    0.0  0.0    0.0    0.0  0.0  0.0    0.0    5.2   0   0 c0t8d0s6
    0.5  0.0    3.7    0.3  0.0  0.0    0.0    5.2   0   0 c0t8d0s7
    0.0  0.1    0.9    0.9  0.0  0.0    0.0    5.5   0   0 c0t9d0
    0.0  0.1    0.5    0.5  0.0  0.0    0.0    5.5   0   0 c0t9d0s0
    0.0  0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t9d0s1
    0.0  0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t9d0s2
    0.0  0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t9d0s6
    0.3  1.9   14.7   14.8  0.0  0.0    0.0    5.5   0   1 c0t9d0s7
    0.0  0.1    0.9    0.9  0.0  0.0    0.0    5.5   0   0 c0t10d0
    0.0  0.1    0.5    0.5  0.0  0.0    0.0    5.5   0   0 c0t10d0s0
    0.0  0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t10d0s1
    0.0  0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t10d0s2
    0.0  0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t10d0s6
    0.3  2.0   15.7   15.9  0.0  0.0    0.0    5.5   0   1 c0t10d0s7
    0.0  0.1    0.9    0.9  0.0  0.0    0.0    5.5   0   0 c0t11d0
    0.0  0.1    0.5    0.5  0.0  0.0    0.0    5.5   0   0 c0t11d0s0
    0.0  0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t11d0s1
    0.0  0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t11d0s2
    0.0  0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t11d0s6
    0.3  2.0   16.1   16.2  0.0  0.0    0.0    5.5   0   1 c0t11d0s7
    0.0  0.1    0.9    0.9  0.0  0.0    0.0    5.5   0   0 c0t12d0
    0.0  0.1    0.5    0.5  0.0  0.0    0.0    5.5   0   0 c0t12d0s0
    0.0  0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t12d0s1
    0.0  0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t12d0s2
    0.0  0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t12d0s6
    0.3  2.1   16.4   16.6  0.0  0.0    0.0    5.5   0   1 c0t12d0s7
    0.0  0.6    0.1   37.4  0.0  0.0    0.0   35.1   0   2 rmt/0

The initial information displayed in Listing 9.5 could identify a disk volume that might be the cause of a performance problem, but the information in Listing 9.6 can isolate the problem to a particular disk partition or file system being used to excess. The result is usually that the disk is too slow and needs to be replaced with a faster one, or that the file systems resident on the disk need to be split across multiple disks to achieve a performance gain. This problem would not normally point to a disk controller being at fault unless all disk volumes and partitions being serviced by a particular controller displayed the same symptoms.
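The controller-versus-disk reasoning above can be sketched as a simple rule: if every device behind one controller shows the same poor service times, suspect the controller; if only one does, suspect that disk or its file-system layout. Device names and the 100 ms threshold here are illustrative, keyed to the svc_t guideline given earlier.

```python
def suspect(devices, svc_limit=100):
    """Given {device: svc_t in ms} for one controller, name the likely culprit.

    Returns "controller" if every device is slow, a list of slow devices if
    only some are, or "none" if all are within the limit.
    """
    slow = [d for d, svc in devices.items() if svc > svc_limit]
    if slow and len(slow) == len(devices):
        return "controller"
    return slow or "none"

# Hypothetical figures: one disk on controller c0 is far slower than its peers.
c0 = {"c0t9d0": 5.5, "c0t10d0": 5.5, "c0t11d0": 130.0}
print(suspect(c0))   # → ['c0t11d0']
```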

sar

sar, which stands for "System Activity Report," is used to collect cumulative activity information and optionally to gather the information into files for subsequent analysis.

By default, automatic collection of this information is disabled. To enable it, carry out the following steps:

  1. Uncomment the crontab entries for user sys. This is the cron user to use for performance collection activities.

  2. Uncomment the lines in the file /etc/init.d/perf.

  3. Run the script /etc/init.d/perf manually, or wait until the next system reboot to activate the performance-collection process.

The data is collected and stored in daily files in the directory /var/adm/sa. The data is stored in files named saxx, where xx is a number representing the day of the month.
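The daily file name can therefore be derived directly from the date: sar's collection scripts append the two-digit day of the month to "sa". A small sketch of that naming scheme:

```python
import datetime
import os.path

def sa_file(day=None, sadir="/var/adm/sa"):
    """Return the path of the sar daily data file for the given day of the month
    (defaults to today), following the saxx naming convention described above."""
    day = day or datetime.date.today().day
    return os.path.join(sadir, "sa%02d" % day)

print(sa_file(day=20))   # → /var/adm/sa/sa20
```

A historical-analysis script can loop over these paths and feed each existing file to `sar -f`.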

An example of sar output monitoring CPU activity is shown in Listing 9.7.

Listing 9.7 Sample Output from the sar Command Using the Flag -u to Show CPU Activity with a Time Interval of 5 Seconds
 SunOS systemA 5.7 Generic_106541-12 sun4u    11/20/00

 15:29:02    %usr    %sys    %wio   %idle
 15:29:07      80      20       0       0
 15:29:12      78      22       0       0
 15:29:17      79      21       0       0
 15:29:22      80      20       0       0
 15:29:27      83      17       0       0

 Average       80      20       0       0

The sar command provides information on many system resources and, because the data is stored in daily files, can be used to good effect for historical analysis. Consult the sar manual page for a full description of the facilities available with this command.
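The "Average" row that sar prints is simply the arithmetic mean of the interval samples, which makes it easy to reproduce (or recompute over a different window) when post-processing the daily files. A one-line check against the %usr column of Listing 9.7:

```python
# %usr samples from the five intervals shown in Listing 9.7.
samples = [80, 78, 79, 80, 83]

# sar's "Average" row is the mean of the samples, rounded to a whole percent.
average = round(sum(samples) / len(samples))
print(average)   # → 80
```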

perfmeter

The performance meter, or perfmeter, is a graphical display of system performance. It allows monitoring of performance on remote hosts as well as the local system, but it provides less detail than the commands already discussed in this section. The data can be displayed either as a strip chart or as multiple dials. Figures 9.1 and 9.2 show an example of each type of display, with all the available options selected.

Figure 9.1. The strip chart displays cumulative results and is extremely useful in identifying peaks of activity over a given period of time.


Figure 9.2. The hour hand of the dials represents the average figure over a 20-second period, while the minute hand shows the average over a 2-second period.


top

This utility is bundled with Solaris from version 8 onward and is also freely available in the public domain. It displays information about the processes currently consuming the most system resources. The display is updated regularly, at an interval that is configurable by the user.

The top command can be downloaded from a number of sites on the World Wide Web, including http://www.sunfreeware.com. It is easily installed and comes either as a precompiled delivery or as source code, along with supporting documentation. It is most useful for analyzing performance problems as they happen, because its real-time display shows exactly which processes are being executed and how much of the CPU is currently allocated to each. The top command also shows summary information, such as the current system load, the amount of swap space and physical memory in use (and how much is free), how many processes are on the system and what their respective states are, and how CPU utilization is being divided.

Listing 9.8 A Snapshot from the top Command Showing the Processes Using the Most Resources
 load averages:  0.63,  0.61,  0.64                                     18:58:46
 103 processes: 95 sleeping, 1 running, 4 zombie, 3 on cpu
 CPU states: 43.4% idle, 25.0% user, 27.2% kernel,  4.4% iowait,  0.0% swap
 Memory: 2048M real, 317M free, 517M swap in use, 3198M swap free

   PID USERNAME THR PRI NICE  SIZE   RES STATE   TIME    CPU COMMAND
 15132 nobody     1  58    0   49M   40M sleep   0:09  7.34% oracle
  6811 oracle     1   1    0   48M   37M sleep   0:21  1.88% oracle
 29672 john       1  11    0   19M   12M cpu1    0:03  1.56% sqlplus
 29673 john       1  11    0   48M   37M cpu1    0:02  1.47% oracle
 15128 oracle     1  58    0   20M   12M sleep   0:01  0.84% svrmgrl
  6658 nobody     7  12    0   24M   21M sleep   0:15  0.80% jre
 15879 john       1   0    0 2208K 1592K cpu0    0:00  0.72% top
     1 root       1  38    0  752K  304K sleep   5:36  0.42% init
 15970 root       1   0    0 1080K  840K sleep   0:00  0.13% mkbb.sh
  2949 oracle     1  58    0   60M   52M sleep   3:57  0.11% oracle
 16104 root       1   0    0 1080K  664K sleep   0:00  0.11% mkbb.sh
  1395 oracle    60  58    0   50M   36M sleep   0:48  0.09% oracle
 28718 nobody     1  58    0   49M   40M sleep   0:10  0.08% oracle

   


Solaris System Management
Solaris System Management (New Riders Professional Library)
ISBN: 073571018X
Year: 2001
Pages: 101
Authors: John Philcox
