Performance experts are, in some sense, the gurus atop mountains of the computer world. [2] Every time some humble supplicant comes to them with a problem, the guru pokes around at things for a while, then sends the supplicant off to come up with more information. Characterization is the process of trying to gather as much information about the system as possible, so that trends and patterns can be determined. These patterns will prove to be vitally important the first time that performance falls through the floor; we will be able to piece together what the perturbations in the patterns are, and from that figure out what caused them -- rather like studying the wake of a passing ship to see what kind of vessel it was and where it's headed.
[2] To extend this analogy, the database performance people sometimes seem to be in outer space. I have three main points of evidence: it is absurdly expensive to talk to them, either you or they usually have to travel vast distances for an audience, and they speak in strange languages that I don't understand.
An analogy has been drawn between workload management and financial transactions. The first time I read of this was in Adrian Cockcroft's excellent book Sun Performance and Tuning (Prentice Hall). The essential idea is that workload management on computer systems is analogous to a department with a capital budget, staff, etc., performing a task. There are basically three possible outcomes:
If there is no plan and no effective controls on the staff, then the staff will run wild, grabbing as much budget as they can to ensure that their own projects will be well-funded. Some staff will end up with no funding whatsoever, while other staff take up "fact-finding missions" to Maui. The project as a whole ends up a complete mess. (This is known as the "startup model.")
Management is overzealous, and creates a huge bureaucratic staff to plan, assess the plans, and replan. The entire budget is consumed by this bureaucratic middle layer, which micromanages those responsible for actually doing the work, often by demanding daily status reports. The administrative overheads involved make it very difficult to spend any money. The work ends up a mess, because by the time it was supposed to be finished, it's barely been started. (This is known as the "government model.")
In the ideal case, management balances and controls funding so that staff are constrained, but everyone has enough to get their work done. The bureaucratic overhead is kept to a minimum, and status reports are infrequent. The work is done on-time and within its budget. (This is known as the "dreamland model.")
The analogy to performance management is fairly straightforward. Instead of a capital budget, we have the currency of computing, if you will: processor cycles, disk I/O rates, network capacity, etc. We have to have some management over these resources, or we end up following the startup model, where actually getting any resources turns into a grabbing contest. However, if we manage things too much, we'll end up like the government model, where nobody can get any work done because we all spend too much time filling out paperwork. So, ideally, we should aim for a balanced approach: the dreamland model. Unfortunately, the dreamland model is aptly named. It's very hard to achieve, much less maintain, such a model in practice.
The first step in performance management has to be developing some processes to guide our decisions. This will involve setting concrete goals, identifying reproducibly measurable performance criteria, and determining the current status. This information then lets us go to management and negotiate for resources to improve the user's experience.
As a discipline, performance management started decades ago in the mainframe world, where everything is stunningly expensive. The payback from making tuning decisions based on real understanding was direct; the actual cost savings in the computing budget was often quite large. Now that the costs of computing have plummeted, it's getting harder to justify the time and effort spent in understanding and tuning. You have to focus more on the indirect advantages to analysis; haphazardly applying system upgrades might not accomplish anything for several months, whereas a careful study of the system and a reasoned upgrade might improve performance significantly. The value of the increased throughput over those several months may be much greater than the cost of the analysis.
There are three important tools we can use to improve our understanding of system behavior: the simple performance measurement commands that we are all fairly familiar with, process accounting, and sar's automated data collection facilities. Later in this chapter, I'll also briefly touch on network pattern analysis.
We have all used simple commands like iostat, vmstat, and mpstat. These tools are fundamental to performance analysis, and can provide a great deal of valuable information about what is going on in the system. One quick way to gather data about the runtime behavior of a system is to set the interval on these commands to some fairly large amount of time (say, a few minutes), and then redirect the output of the command to a file. One problem with this is that it becomes very hard to keep track of the time at which a specific data point was collected. It's straightforward to write some Perl to take care of this for you:
#!/usr/bin/perl

while (<>) {
    print localtime() . ": $_";
}
Example 2-1 shows the script in action on a Linux system.
# vmstat 5 | chrononome.pl
Sat Jun 30 00:37:28 2001:    procs                      memory    swap          io     system         cpu
Sat Jun 30 00:37:28 2001:  r  b  w   swpd   free   buff    cache  si  so   bi   bo   in    cs  us  sy  id
Sat Jun 30 00:37:28 2001:  1  0  0   5472  26680   8420   177908   0   0   10   37   63    37   5   6  14
Sat Jun 30 00:37:33 2001:  0  0  0   5472  26576   8420   177908   0   0    0   20  163    37   1   0  99
Sat Jun 30 00:37:38 2001:  0  1  0   5472  26540   8420   177916   0   0    0   12  148    41   1   1  98
Sat Jun 30 00:37:43 2001:  0  0  0   5472  26904   8420   177920   0   0    2    5  141    43   1   2  98
Sat Jun 30 00:37:48 2001:  0  0  0   5472  26904   8420   177920   0   0    0    1  129    39   2   1  98
Sat Jun 30 00:37:53 2001:  0  0  0   5472  26904   8420   177920   0   0    0    0  124    36   0   1  99
Sat Jun 30 00:37:58 2001:  0  0  0   5472  26904   8420   177920   0   0    0   10  143    35   1   0  99
Some people run these commands with a very short interval via a crontab entry. This is a valid approach, but I am not personally fond of it. It depends on what you want to measure: if I run vmstat with a 1800-second interval, I expect each line to be the average virtual memory activity over the last half hour. However, if I run vmstat with a 2-second interval every half hour out of cron, I get a two-second snapshot of activity right at the half-hour mark. Both can be useful, but I would rather have longer-interval data, at least to start with. If it turns out that I am missing spikes in data, I can always start up another copy of the monitoring application and set the interval to something sufficient to catch the peaks and valleys. Long-term performance trends are not really captured well by increasing the collection interval; mostly you'll just have to do a lot of smoothing on a lot of data.
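The same timestamping idea can be expressed as a single awk stage, which any long-running collector can be piped through. This is a sketch: the intended producer is something like vmstat 1800, with echo standing in here so the example is self-contained, and the run-date-per-line trick is less efficient than the Perl version.

```shell
# Timestamp each line as it arrives; "echo" stands in for the real
# long-interval producer (e.g., "vmstat 1800").
echo "0 0 0 5472 26904 8420" |
awk '{ "date" | getline d; close("date"); print d ": " $0; fflush() }'
```

The fflush() call matters when the producer emits lines slowly, since it keeps the timestamped output from sitting in a stdio buffer.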
Process accounting is a means by which the system gathers information on every process as it runs. This information consists of CPU utilization, disk I/O activity, memory consumption, and other useful tidbits. Perhaps the most useful part of process accounting is that it comes with mechanisms for determining patterns in usage, by means of the audit reporting system.
Some system administrators are scared off of process accounting because of fears of high overhead. Collecting the accounting data has essentially no impact on the system: the kernel always collects the accounting data, so the only extra overhead is the writing of a 40-byte record to the accounting logs. The log summary scripts, however, can take a significant amount of time to run, so they are best scheduled outside of peak hours.
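To put the logging overhead in perspective, here is a back-of-the-envelope calculation; the figure of 50,000 process completions per day is a hypothetical workload chosen for illustration, not a measurement.

```shell
# Accounting log growth: one 40-byte record per process exit.
# 50,000 exits/day is a made-up workload for illustration.
awk 'BEGIN { printf "%.1f MB/day\n", 50000 * 40 / (1024 * 1024) }'
```

Even a busy machine generates accounting data at a rate that is trivial next to modern disk sizes; the real cost is in the summary runs.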
Starting system accounting is very simple. In Solaris, be sure that you have installed the optional packages that contain the process accounting functionality: they are SUNWaccu and SUNWaccr . As of Solaris 8, these packages are located on the second Solaris installation CD.
The first step is to link the startup and shutdown scripts:
# ln /etc/init.d/acct /etc/rc0.d/K22acct
# ln /etc/init.d/acct /etc/rc2.d/S22acct
You can then reboot the system, or start accounting immediately by running /etc/init.d/acct start . The second step is to add cron entries to the adm user for the summary reporting commands. Example 2-2 shows what to add.
# min hour day month wkday command
0 * * * * /usr/lib/acct/ckpacct
30 2 * * * /usr/lib/acct/runacct 2> /var/adm/acct/nite/fd2log
30 9 * * 5 /usr/lib/acct/monacct
The ckpacct program checks the size of the accounting file /var/adm/pacct. The runacct command generates accounting information from the data, and monacct generates "fiscal" reports for each user. These reports are stored in /var/adm/acct.
The most useful tool for reviewing accounting data is acctcom, which shows the immediate accounting data. It can be run in several modes: specifying -a will give you average statistics on each process that has been run, -t will provide the system/user time breakdowns for each process, -u user will show all the processes executed by a given user, and -C time will show all the processes that consumed more than time seconds of processor time. There are many other useful options to acctcom that may vary from system to system, so consult your manual pages for more information.
The system accounting programs also generate some files that can be reviewed at your leisure. These reports are generated daily and at the end of every accounting period (that is, whenever monacct runs).
We will periodically talk about sar as a means of gathering performance data. sar can also be used to automatically collect data and store it for later review.
Enabling automated data collection with sar is quite straightforward, and entails two steps: uncomment the relevant lines (namely, the last 13) in /etc/init.d/perf , and set up the system crontab file to support automated data recording.
Changing the system cron entries (located in /var/spool/cron/crontabs/sys) entails making some decisions about when you'd like data to be recorded (via the sa1 command), and when you'd like data to be reported (by the sa2 command). By default, these are the entries:
# min hour day month wkday command
# 0 * * * 0-6 /usr/lib/sa/sa1
# 20,40 8-17 * * 1-5 /usr/lib/sa/sa1
# 5 18 * * 1-5 /usr/lib/sa/sa2 -s 8:00 -e 18:01 -i 1200 -A
The first entry will write a sar record every hour on the hour, seven days a week. The second entry will write a sar record twice an hour, at 20 minutes and 40 minutes past the hour, during peak working hours of 8 A.M. to 5 P.M., Monday through Friday.
The third record is more complex. At five minutes past six o'clock, Monday through Friday, it will report on the data gathered between 8 A.M. ( -s time ) and one minute past six o'clock ( -e time ), at an interval of 1200 seconds, or 20 minutes ( -i seconds ), and it will report all data ( -A option to sar ).
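As a quick sanity check on this schedule, the number of sa1 records written on a weekday works out as the 24 hourly records plus the two extra records per hour during the ten working hours (8 through 17):

```shell
# 24 hourly sa1 records, plus 2 extra per hour for hours 8-17
awk 'BEGIN { printf "%d records per weekday\n", 24 + 2 * 10 }'
```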
Retrieving data from the sar records is remarkably simple. You simply run sar, specifying the data you'd like to see by means of the normal option switches, and leave off the interval for the present day's data. You can use the -s starting-time and -e ending-time flags to control the time range of interest, and you can specify the day of interest by using -f /var/adm/sa/sadd (where dd is the day of the month).
Adrian Cockcroft is a distinguished engineer at Sun in the field of performance analysis. He and Rich Pettit developed a software suite for data collection and analysis on Solaris systems, which is described in great detail in their book, Sun Performance and Tuning (Prentice Hall). The book is very much worth reading, although it is slightly out of date, and the toolkit is also excellent. The package is called the SE Toolkit, and is located at http://www.setoolkit.com.
Historically, [3] networks have not generally been a real-world performance limitation. Consequently, a great deal of effort has been placed on understanding and improving other aspects of the system, and the tools for network performance analysis tend not to be quite as refined as those in other areas. Because of the explosion of internetworked applications, however, the network layer is increasingly becoming a limiting factor. How, then, can we approach the problem of understanding the network layer in our environment?
[3] That is, in the pre-web world.
A word of warning before we proceed: this section assumes some familiarity with the networking concepts discussed in Chapter 7.
I open this section by discussing three common traffic patterns observed in environments where applications communicate over TCP/IP; then, I briefly discuss some parameters of network traffic, such as "How large is a packet, on average?" Finally, we'll discuss some of the tools that you can use to determine the patterns occurring on your own network.
A Brief Note on Terminology

I use a few terms in our discussion of network traffic that bear definition.
The first pattern is that of request-response. The classic real-world examples of this are HTTP, email retrieval protocols such as POP3, and outbound SMTP transactions (delivered to a remote host). Figure 2-1 shows Pattern 1 traffic.
The request-response pattern is characterized by generally high connection rates, although fairly low connection rates are sometimes observed, particularly in modern web environments where HTTP session keepalives are used extensively. The inbound packets are generally small; the outbound packets are medium-to-large, but there are usually only a few (fewer than twenty) of them.
A variation on the first pattern is inverse request-response; it is typically seen in inbound SMTP transactions. The interesting part is that the inbound and outbound roles are reversed: the client initiates the majority of the data transfer. Figure 2-2 illustrates Pattern 1B traffic.
Connection rates are often moderate, and sometimes exhibit bursty behavior. Inbound packets are generally small in the initial part of the connection, but then become large. The outbound packets are small or payload-free.
The second pattern is typical of large amounts of data being transferred. This is most commonly seen in ftp traffic and file transfers during networked backups (see Figure 2-3).
In general, the connection rate is very low. The inbound packet stream is essentially all payload-free packets, but the outbound packet stream consists almost entirely of full packets.
The third and final pattern is that of message passing. This is most commonly seen in character-driven applications such as telnet, rlogin, and ssh, but it is also sometimes seen in database transaction schemes such as SQLnet. Some parallelized high-performance computing codes also exhibit this sort of pattern. Figure 2-4 illustrates Pattern 3.
The connection rate is variable. It is generally low, but can be quite high -- it is very application-dependent. The characteristic to look for is that there are large numbers of small packets being pushed around. This sort of traffic tends to be very hard on a system, because the amount of work required to process a small packet is essentially the same as the amount of work required to process a large packet: you are doing the same amount of work for less payload.
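A quick calculation makes this cost concrete; the 64-byte message size here is a hypothetical Pattern 3 payload, not a measured one.

```shell
# Packets needed to move 1 MB as small messages vs. full segments;
# per-packet processing cost is roughly constant either way, so the
# small-message case does ~23x the work for the same payload.
awk 'BEGIN { bytes = 1024 * 1024;
  printf "64-byte messages:   %d packets\n", bytes / 64;
  printf "1460-byte segments: %d packets\n", int((bytes + 1459) / 1460) }'
```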
Tuning systems for Pattern 3 workloads is very difficult. If you are faced with a Pattern 3 workload, are sure that the network is the limiting factor in performance, and need a performance improvement, you have two real choices: either induce algorithmic change in the application so that it stops behaving like Pattern 3 and starts behaving more like Pattern 1, or invest in very low-latency network hardware. [4]
[4] Myrinet has a reputation for fast, low-latency network hardware. It is not widely deployed outside of specialty markets.
Gathering traces (via snoop or tcpdump, for example) of network activity can give you some very interesting information on what sort of patterns are occurring on your network. One of the simpler questions to ask is "If we draw the graph of packet size versus packet count, what does it look like?" In some of the work I've done on "real-world" systems on the Internet, I found that, remarkably, this graph tends to be trimodal (it has three distinct peaks). These systems are typically web and email servers.
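A minimal sketch of building such a histogram, assuming the trace has already been reduced to one packet length per line; the printf input here is hypothetical sample data, and extracting the length field from snoop or tcpdump output is a separate step whose details depend on the tool's output format.

```shell
# Count packets at each size; peaks in this histogram are the "modes".
# The six sample lengths are made up for illustration.
printf '60\n1540\n60\n540\n1540\n60\n' |
sort -n | uniq -c |
awk '{ printf "%5d bytes: %d packets\n", $2, $1 }'
```

On a real trace, bucketing sizes into ranges (say, 100-byte bins) before counting makes the peaks easier to see.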
The first peak is largely due to inbound traffic, and consists almost entirely of packets about 60 bytes in size. These are TCP acknowledgments flowing "backwards" from browsers towards the web server, acknowledging the receipt of chunks of data.
The second peak is almost entirely due to outbound traffic, and occurs at 1540 bytes on the wire. This represents full packets of data flowing away from the server towards the browsers.
The third peak took me a little while to figure out: it occurs at about 540 bytes, and is strongly biased towards outbound traffic. It turns out that this is due to certain Windows TCP/IP implementations, which set the "maximum segment size" for a TCP connection to 536 bytes. As a result, the other end cannot transmit any packet larger than 536 bytes back to the client.
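The 536-byte figure itself is not arbitrary: it is the classic TCP default maximum segment size, derived from the 576-byte minimum IP datagram that every host must accept, minus 20 bytes each of IP and TCP header (RFC 879).

```shell
# Default MSS = minimum IP reassembly size - IP header - TCP header
awk 'BEGIN { print 576 - 20 - 20 }'
```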
If you're in a position where you can strongly influence the traffic on your network, this sort of information can drive great change. Properly tuning application algorithms and TCP stacks can eliminate things like the "middle peak" at 540 bytes, with a corresponding increase in efficiency.