Performance experts are, in some sense, the gurus atop mountains of the computer world. [2] Every time some humble supplicant comes to them with a problem, the guru pokes around at things for a while, then sends the supplicant off to come up with more information. Characterization is the process of trying to gather as much information about the system as possible, so that trends and patterns can be determined. These patterns will prove to be vitally important the first time that performance falls through the floor; we will be able to piece together what the perturbations in the patterns are, and from that figure out what caused them -- rather like studying the wake of a passing ship to see what kind of vessel it was and where it's headed.
[2] To extend this analogy, the database performance people sometimes seem to be in outer space. I have three main points of evidence: it is absurdly expensive to talk to them, either you or they usually have to travel vast distances for an audience, and they speak in strange languages that I don't understand.
An analogy has been drawn between workload management and financial transactions. The first time I read of this was in Adrian Cockcroft's excellent book Sun Performance and Tuning (Prentice Hall). The essential idea is that workload management on computer systems is analogous to a department with a capital budget, staff, etc., performing a task. There are basically three possible outcomes:
If there is no plan and no effective controls on the staff, then the staff will run wild, grabbing as much budget as they can to ensure that their own projects will be well-funded. Some staff will end up with no funding whatsoever, while other staff take up "fact-finding missions" to Maui. The project as a whole ends up a complete mess. (This is known as the "startup model.")
Management is overzealous, and creates a huge bureaucratic staff to plan, assess the plans, and replan. The entire budget is consumed by this bureaucratic middle layer, which micromanages those responsible for actually doing the work, often by demanding daily status reports. The administrative overheads involved make it very difficult to spend any money. The work ends up a mess, because by the time it was supposed to be finished, it's barely been started. (This is known as the "government model.")
In the ideal case, management balances and controls funding so that staff are constrained, but everyone has enough to get their work done. The bureaucratic overhead is kept to a minimum, and status reports are infrequent. The work is done on-time and within its budget. (This is known as the "dreamland model.")
The analogy to performance management is fairly straightforward. Instead of a capital budget, we have the currency of computing, if you will: processor cycles, disk I/O rates, network capacity, etc. We have to have some management over these resources, or we end up following the startup model, where actually getting any resources turns into a grabbing contest. However, if we manage things too much, we'll end up like the government model, where nobody can get any work done because we all spend too much time filling out paperwork. So, ideally, we should aim for a balanced approach: the dreamland model. Unfortunately, the dreamland model is aptly named. It's very hard to achieve, much less maintain, such a model in practice.
The first step in performance management has to be developing some processes to guide our decisions. This will involve setting concrete goals, identifying reproducibly measurable performance criteria, and determining the current status. This information then lets us go to management and negotiate for resources to improve the user's experience.
As a discipline, performance management started decades ago in the mainframe world, where everything is stunningly expensive. The payback from making tuning decisions based on real understanding was direct; the actual cost savings in the computing budget was often quite large. Now that the costs of computing have plummeted, it's getting harder to justify the time and effort spent in understanding and tuning. You have to focus more on the indirect advantages to analysis; haphazardly applying system upgrades might not accomplish anything for several months, whereas a careful study of the system and a reasoned upgrade might improve performance significantly. The value of the increased throughput over those several months may be much greater than the cost of the analysis.
There are three important tools we can use to improve our understanding of system behavior: the simple performance measurement commands that we are all fairly familiar with, process accounting, and sar's automated data collection facilities. Later in this chapter, I'll also briefly touch on network pattern analysis.
We have all used simple commands like iostat, vmstat, and mpstat. These tools are fundamental to performance analysis, and can provide a great deal of valuable information about what is going on in the system. One quick way to gather data about the runtime behavior of a system is to set the interval on these commands to some fairly large amount of time (say, a few minutes), and then redirect the output of the command to a file. One problem with this is that it becomes very hard to keep track of the time at which a specific data point was collected. It's straightforward to write some Perl to take care of this for you:
#!/usr/bin/perl

while (<>) {
    print localtime() . ": $_";
}
Example 2-1 shows the script in action on a Linux system.
# vmstat 5 | chrononome.pl
Sat Jun 30 00:37:28 2001:    procs                      memory    swap          io     system         cpu
Sat Jun 30 00:37:28 2001:  r  b  w   swpd   free   buff    cache  si  so   bi   bo   in    cs  us  sy  id
Sat Jun 30 00:37:28 2001:  1  0  0   5472  26680   8420   177908   0   0   10   37   63    37   5   6  14
Sat Jun 30 00:37:33 2001:  0  0  0   5472  26576   8420   177908   0   0    0   20  163    37   1   0  99
Sat Jun 30 00:37:38 2001:  0  1  0   5472  26540   8420   177916   0   0    0   12  148    41   1   1  98
Sat Jun 30 00:37:43 2001:  0  0  0   5472  26904   8420   177920   0   0    2    5  141    43   1   2  98
Sat Jun 30 00:37:48 2001:  0  0  0   5472  26904   8420   177920   0   0    0    1  129    39   2   1  98
Sat Jun 30 00:37:53 2001:  0  0  0   5472  26904   8420   177920   0   0    0    0  124    36   0   1  99
Sat Jun 30 00:37:58 2001:  0  0  0   5472  26904   8420   177920   0   0    0   10  143    35   1   0  99
Some people run these commands with a very short interval via a crontab entry. This is a valid approach, but I am not personally fond of it. It depends on what you want to measure: if I run vmstat with a 1800-second interval, I expect each line to be the average virtual memory activity over the last half hour. However, if I run vmstat with a 2-second interval every half hour out of cron, I get a two-second snapshot of activity right at the half-hour mark. Both can be useful, but I would rather have longer-interval data, at least to start with. If it turns out that I am missing spikes in data, I can always start up another copy of the monitoring application and set the interval to something sufficient to catch the peaks and valleys. Long-term performance trends are not really captured well by increasing the collection interval; mostly you'll just have to do a lot of smoothing on a lot of data.
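The same timestamping idea can be expressed as a single awk stage, which any long-running collector can be piped through. This is a sketch: the intended producer is something like vmstat 1800, with echo standing in here so the example is self-contained, and the run-date-per-line trick is less efficient than the Perl version.

```shell
# Timestamp each line as it arrives; "echo" stands in for the real
# long-interval producer (e.g., "vmstat 1800").
echo "0 0 0 5472 26904 8420" |
awk '{ "date" | getline d; close("date"); print d ": " $0; fflush() }'
```

The fflush() call matters when the producer emits lines slowly, since it keeps the timestamped output from sitting in a stdio buffer.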
Process accounting is a means by which the system gathers information on every process as it runs. This information consists of CPU utilization, disk I/O activity, memory consumption, and other useful tidbits. Perhaps the most useful part of process accounting is that it comes with mechanisms for determining patterns in usage, by means of the audit reporting system.
Some system administrators are scared off of process accounting because of fears of high overhead. Collecting the accounting data has essentially no impact on the system: the kernel always collects the accounting data, so the only extra overhead is the writing of a 40-byte record to the accounting logs. The log summary scripts, however, can take a significant amount of time to run, so they are best scheduled outside of peak hours.
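To put the logging overhead in perspective, here is a back-of-the-envelope calculation; the figure of 50,000 process completions per day is a hypothetical workload chosen for illustration, not a measurement.

```shell
# Accounting log growth: one 40-byte record per process exit.
# 50,000 exits/day is a made-up workload for illustration.
awk 'BEGIN { printf "%.1f MB/day\n", 50000 * 40 / (1024 * 1024) }'
```

Even a busy machine generates accounting data at a rate that is trivial next to modern disk sizes; the real cost is in the summary runs.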
Starting system accounting is very simple. In Solaris, be sure that you have installed the optional packages that contain the process accounting functionality: they are SUNWaccu and SUNWaccr . As of Solaris 8, these packages are located on the second Solaris installation CD.
The first step is to link the startup and shutdown scripts:
# ln /etc/init.d/acct /etc/rc0.d/K22acct
# ln /etc/init.d/acct /etc/rc2.d/S22acct
You can then reboot the system, or start accounting immediately by running /etc/init.d/acct start . The second step is to add cron entries to the adm user for the summary reporting commands. Example 2-2 shows what to add.
# min hour day month wkday command
0 * * * * /usr/lib/acct/ckpacct
30 2 * * * /usr/lib/acct/runacct 2> /var/adm/acct/nite/fd2log
30 9 * * 5 /usr/lib/acct/monacct
The ckpacct program checks the size of the accounting file /var/adm/pacct. The runacct command generates accounting information from the data, and monacct generates "fiscal" reports for each user. These reports are stored in /var/adm/acct.
The most useful tool for reviewing accounting data is acctcom, which shows the immediate accounting data. It can be run in several modes: specifying -a will give you average statistics on each process that has been run, -t will provide the system/user time breakdowns for each process, -u user will show all the processes executed by a given user, and -C time will show all the processes that consumed more than time seconds of processor time. There are many other useful options to acctcom that may vary from system to system, so consult your manual pages for more information.
The system accounting programs also generate some files that can be reviewed at your leisure. These reports are generated daily and at the end of every accounting period (that is, whenever monacct runs).
We will periodically talk about sar as a means of gathering performance data. sar can also be used to automatically collect data and store it for later review.
Enabling automated data collection with sar is quite straightforward, and entails two steps: uncomment the relevant lines (namely, the last 13) in /etc/init.d/perf , and set up the system crontab file to support automated data recording.
Changing the system cron entries (located in /var/spool/cron/crontabs/sys) entails making some decisions about when you'd like data to be recorded (via the sa1 command), and when you'd like data to be reported (by the sa2 command). By default, these are the entries:
# min hour day month wkday command
# 0 * * * 0-6 /usr/lib/sa/sa1
# 20,40 8-17 * * 1-5 /usr/lib/sa/sa1
# 5 18 * * 1-5 /usr/lib/sa/sa2 -s 8:00 -e 18:01 -i 1200 -A
The first entry will write a sar record every hour on the hour, seven days a week. The second entry will write a sar record twice an hour, at 20 minutes and 40 minutes past the hour, during peak working hours of 8 A.M. to 5 P.M., Monday through Friday.
The third record is more complex. At five minutes past six o'clock, Monday through Friday, it will report on the data gathered between 8 A.M. ( -s time ) and one minute past six o'clock ( -e time ), at an interval of 1200 seconds, or 20 minutes ( -i seconds ), and it will report all data ( -A option to sar ).
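As a quick sanity check on this schedule, the number of sa1 records written on a weekday works out as the 24 hourly records plus the two extra records per hour during the ten working hours (8 through 17):

```shell
# 24 hourly sa1 records, plus 2 extra per hour for hours 8-17
awk 'BEGIN { printf "%d records per weekday\n", 24 + 2 * 10 }'
```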
Retrieving data from the sar records is remarkably simple. You simply run sar, specifying the data you'd like to see by means of the normal option switches, and leave off the interval for the present day's data. You can use the -s starting-time and -e ending-time flags to control the time range of interest, and you can specify the day of interest by using -f /var/adm/sa/sadd (where dd is the day of the month).
Adrian Cockcroft is a distinguished engineer at Sun in the field of performance analysis. He and Rich Pettit developed a software suite for data collection and analysis on Solaris systems, which is described in great detail in their book, Sun Performance and Tuning (Prentice Hall). The book is very much worth reading, although it is slightly out of date, and the toolkit is also excellent. The package is called the SE Toolkit, and is located at http://www.setoolkit.com.
Historically, [3] networks have not generally been a real-world performance limitation. Consequently, a great deal of effort has been placed on understanding and improving other aspects of the system, and the tools for network performance analysis tend not to be quite as refined as those in other areas. Because of the explosion of internetworked applications, however, the network layer is increasingly becoming a limiting factor. How, then, can we approach the problem of understanding the network layer in our environment?
[3] That is, in the pre-web world.
A word of warning before we proceed: this section assumes some familiarity with the networking concepts discussed in Chapter 7.
I open this section by discussing three common traffic patterns observed in environments where applications communicate over TCP/IP; then, I briefly discuss some parameters of network traffic, such as "How large is a packet, on average?" Finally, we'll discuss some of the tools that you can use to determine the patterns occurring on your own network.
A Brief Note on Terminology

I use a few terms in our discussion of network traffic that bear definition.
The first pattern is that of request-response. The classic real-world examples of this are HTTP, email retrieval protocols such as POP3, and outbound SMTP transactions (delivered to a remote host). Figure 2-1 shows Pattern 1 traffic.
The request-response pattern is characterized by generally high connection rates, although fairly low connection rates are sometimes observed, particularly in modern web environments where HTTP session keepalives are used extensively. The inbound packets are generally small; the outbound packets are medium-to-large, but there are usually only a few (fewer than twenty) of them.
A variation on the first pattern is inverse request-response; it is typically seen in inbound SMTP transactions. The interesting part is that the inbound and outbound roles are reversed: the client initiates the majority of the data transfer. Figure 2-2 illustrates Pattern 1B traffic.
Connection rates are often moderate, and sometimes exhibit bursty behavior. Inbound packets are generally small in the initial part of the connection, but then become large. The outbound packets are small or payload-free.
The second pattern is typical of large amounts of data being transferred. This is most commonly seen in ftp traffic and file transfers during networked backups (see Figure 2-3).
In general, the connection rate is very low. The inbound packet stream is essentially all payload-free packets, but the outbound packet stream consists almost entirely of full packets.
The third and final pattern is that of message passing. This is most commonly seen in character-driven applications such as telnet, rlogin, and ssh, but it is also sometimes seen in database transaction schemes such as SQLnet. Some parallelized high-performance computing codes also exhibit this sort of pattern. Figure 2-4 illustrates Pattern 3.
The connection rate is variable. It is generally low, but can be quite high -- it is very application-dependent. The characteristic to look for is that there are large numbers of small packets being pushed around. This sort of traffic tends to be very hard on a system, because the amount of work required to process a small packet is essentially the same as the amount of work required to process a large packet: you are doing the same amount of work for less payload.
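A quick calculation makes this cost concrete; the 64-byte message size here is a hypothetical Pattern 3 payload, not a measured one.

```shell
# Packets needed to move 1 MB as small messages vs. full segments;
# per-packet processing cost is roughly constant either way, so the
# small-message case does ~23x the work for the same payload.
awk 'BEGIN { bytes = 1024 * 1024;
  printf "64-byte messages:   %d packets\n", bytes / 64;
  printf "1460-byte segments: %d packets\n", int((bytes + 1459) / 1460) }'
```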
Tuning systems for Pattern 3 workloads is very difficult. If you are faced with a Pattern 3 workload, are sure that the network is the limiting factor in performance, and need a performance improvement, you have two real choices: either induce algorithmic change in the application so that it stops behaving like Pattern 3 and starts behaving more like Pattern 1, or invest in very low-latency network hardware. [4]
[4] Myrinet has a reputation for fast, low-latency network hardware. It is not widely deployed outside of specialty markets.
Gathering traces (via snoop or tcpdump, for example) of network activity can give you some very interesting information on what sort of patterns are occurring on your network. One of the simpler questions to ask is "If we draw the graph of packet size versus packet count, what does it look like?" In some of the work I've done on "real-world" systems on the Internet, I found that, remarkably, this graph tends to be trimodal (it has three distinct peaks). These systems are typically web and email servers.
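A minimal sketch of building such a histogram, assuming the trace has already been reduced to one packet length per line; the printf input here is hypothetical sample data, and extracting the length field from snoop or tcpdump output is a separate step whose details depend on the tool's output format.

```shell
# Count packets at each size; peaks in this histogram are the "modes".
# The six sample lengths are made up for illustration.
printf '60\n1540\n60\n540\n1540\n60\n' |
sort -n | uniq -c |
awk '{ printf "%5d bytes: %d packets\n", $2, $1 }'
```

On a real trace, bucketing sizes into ranges (say, 100-byte bins) before counting makes the peaks easier to see.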
The first peak is largely due to inbound traffic, and consists almost entirely of packets about 60 bytes in size. These are TCP acknowledgments flowing "backwards" from browsers towards the web server, acknowledging the receipt of chunks of data.
The second peak is almost entirely due to outbound traffic, and occurs at 1540 bytes on the wire. This represents full packets of data flowing away from the server towards the browsers.
The third peak took me a little while to figure out: it occurs at about 540 bytes, and is strongly biased towards outbound traffic. It turns out that this is due to certain Windows TCP/IP implementations, which set the "maximum segment size" for a TCP connection to 536 bytes. As a result, the other end cannot transmit any packet larger than 536 bytes back to the client.
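The 536-byte figure itself is not arbitrary: it is the classic TCP default maximum segment size, derived from the 576-byte minimum IP datagram that every host must accept, minus 20 bytes each of IP and TCP header (RFC 879).

```shell
# Default MSS = minimum IP reassembly size - IP header - TCP header
awk 'BEGIN { print 576 - 20 - 20 }'
```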
If you're in a position where you can strongly influence the traffic on your network, this sort of information can drive great change. Properly tuning application algorithms and TCP stacks can eliminate things like the "middle peak" at 540 bytes, with a corresponding increase in efficiency.