15.1 Thinking About System Performance

"Why is the system so slow?" is probably second on any system administrator's things-I-least-want-to-hear list (right after "Why did the system crash again?!"). Like system reliability, system performance is a topic that comes up only when there is a problem. Unfortunately, no one is likely to compliment or thank you for getting the most out of the system's resources.

Performance-related complaints can take a variety of forms, ranging from sluggish interactive response time to a job that takes too long to complete or cannot run at all because of insufficient resources.

In general, system performance depends on how efficiently a system's resources are applied to the current demand for them by various jobs in the system. The most important system resources from a performance perspective are CPU, memory, and disk and network I/O, although sometimes other device I/O can also be relevant. How well a system performs at any given moment is the result of both the total demand for the various system resources and how well the competition among processes^[1] for them is being managed. Accordingly, performance problems can arise from a number of causes, including both a lack of needed resources and ineffective control over them. Addressing a performance problem involves identifying what these resources are and figuring out how to manage them more effectively.

^[1] On many modern systems, processes have been replaced by threads as the fundamental execution entity. However, in uniprocessor environments at least, threads and processes are conceptually similar at a system administration level, so I will continue to speak of "processes" throughout this chapter.

NOTE

As with most of life, performance tuning is much harder when you have to guess what normal is. If you don't know what the various system performance metrics usually show when performance is acceptable, it will be very hard to figure out what is wrong when performance degrades. Accordingly, it is essential to do routine system monitoring and to maintain records of performance-related statistics over time.
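
One way to make "normal" concrete is to keep historical samples of a metric and flag readings that fall well outside their usual range. The following is a minimal sketch in Python, not a prescribed method: the function name `is_abnormal`, the three-standard-deviation threshold, and the sample values are all illustrative assumptions.

```python
import statistics

def is_abnormal(history, current, threshold=3.0):
    """Return True if `current` lies more than `threshold` standard
    deviations away from the mean of the recorded `history`.

    `history` is a list of past readings of one metric (for example,
    the 1-minute load average sampled every hour), taken while
    performance was acceptable.
    """
    if len(history) < 2:
        raise ValueError("need at least two historical samples")
    mean = statistics.mean(history)
    spread = statistics.stdev(history)
    if spread == 0:
        # All historical readings were identical; any change is notable.
        return current != mean
    return abs(current - mean) / spread > threshold

# Hypothetical load-average readings recorded during normal operation.
baseline = [0.8, 1.1, 0.9, 1.0, 1.2, 0.7, 1.0]
```

A call such as `is_abnormal(baseline, 6.0)` would then flag a reading far above the recorded range. A real baseline would normally cover many more samples and several metrics.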

When the lack of a critical resource is the source of a performance problem, there are a limited number of approaches to improving the situation. Put simply, when you don't have enough of something, there are only a few options: get more, use less, eliminate inefficiency and waste to make the most of what you have, or ration what you have. In the case of a system resource, this can mean obtaining more of it (if that is possible), reducing job or system requirements so that less of it is needed, having its various consumers share the amount that is available by dividing it among them, having them take turns using it, or otherwise changing the way it is allocated or controlled.

For example, if your system is short of CPU resources, your options for improving things may include some or all of the following:

Adding more CPU capacity by upgrading the processor.
Adding additional processors to allow different parts of the work load to proceed in parallel.
Taking advantage of currently unused CPU capacity by scheduling some jobs to run during times when the CPU is lightly loaded or even idle.
Reducing demands for CPU cycles by eliminating some of the jobs that are contending for them (or moving them to another computer).
Using process priorities to allocate CPU time explicitly among processes that want it, favoring some over the others.
Employing a batch system to ensure that only a reasonable number of jobs run at the same time, making others wait.
Changing the behavior of the operating system's job scheduler to affect how the CPU is divided among multiple jobs.

Naturally, not all potential solutions will necessarily be possible on any given computer system or within any given operating system.
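
The batch-system option in the list above amounts to capping how many jobs run at once. Here is a rough illustration of that idea in Python. It assumes jobs are plain argument lists and uses a thread pool as a stand-in for a real batch queue; the name `run_batch` and the limit of two are hypothetical.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

def run_batch(commands, max_running=2):
    """Run each command (an argv list), never more than `max_running`
    at a time; the rest wait their turn. Returns exit codes in order."""
    def run_one(argv):
        return subprocess.run(argv).returncode

    with ThreadPoolExecutor(max_workers=max_running) as pool:
        # map() preserves the order of `commands` in its results.
        return list(pool.map(run_one, commands))

# Stand-in jobs: three trivial Python processes with known exit codes.
jobs = [[sys.executable, "-c", f"raise SystemExit({code})"]
        for code in (0, 0, 3)]
```

Calling `run_batch(jobs)` starts the first two jobs immediately and holds the third until one of them finishes.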

It is often necessary to distinguish between raw system resources like CPU and memory and the control mechanisms by which they are accessed and allocated. For example, in the case of the system's CPU, you don't have the ability to allocate or control this resource as such (unless you count taking the system down). Rather, you must use features like nice numbers and scheduler parameters to control usage.
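
As a rough sketch of what controlling CPU usage through nice numbers can look like in practice, the Python fragment below starts a child process at a lower priority. It assumes a Unix-like system; the helper name `run_niced` and the default increment of 10 are illustrative choices.

```python
import os
import subprocess
import sys

def run_niced(argv, increment=10):
    """Start `argv` with its nice number raised by `increment`
    (a higher nice number means a lower scheduling priority)."""
    return subprocess.run(
        argv,
        # Runs in the child just before exec, so only the child is affected.
        preexec_fn=lambda: os.nice(increment),
        capture_output=True,
        text=True,
    )

# Demo job: a child that reports its own nice number.
# os.nice(0) changes nothing and returns the current value.
report = [sys.executable, "-c", "import os; print(os.nice(0))"]
```

Running `run_niced(report)` returns a result whose `stdout` holds the child's nice number, which should be the parent's value plus the increment (up to the system maximum).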

Table 15-1 lists the most important control mechanisms associated with CPU, memory, and disk and network I/O performance.

Table 15-1. System resource control mechanisms
Resource	Control mechanisms
CPU	Nice numbers; process priorities; batch queues; scheduler parameters
Memory	Process resource limits; memory management-related parameters; paging (swap) space
Disk I/O	Filesystem organization across physical disks and controllers; file placement on disk; I/O-related parameters
Network I/O	Network memory buffers; network-related parameters; network infrastructure

15.1.1 The Tuning Process

The following process offers the most effective approach to addressing system performance issues.

15.1.1.1 Define the problem in as much detail as you can.

The more specific you can be about what is wrong (or less than optimal) with the way things are currently, the more likely it is you can find ways to improve them. Ideally, you'd like to move from an initial problem description like this one:

System response time is slow.

to one like this:

Interactive users running X experience significant delays opening new windows and switching between windows.

A good description of the current performance issues will also implicitly state your performance goals. For example, in this case, the performance goal is clearly to improve interactive response time for users running under X. It is important to understand such goals clearly, even if it is not always possible to reach them (in which case, they are really wishes more than goals).

15.1.1.2 Determine what's causing the problem.

To do so, you'll need to answer questions like these:

What is running on the system (or, when the performance of a single job or process is the issue, what else is running)? You may also need to consider the sources of the other processes (for example, local users, remote users, the cron subsystem, and so on).
When or under what conditions does the problem occur? For example, does it only occur at certain, predictable times of the day or when remote NFS mounts of local disks have reached a certain level? Are all users affected or only some or even one of them?
Has anything about the system changed that could have introduced or exacerbated the problem?
What is the critical resource that is adversely affecting performance? Answering this question will involve finding the performance bottleneck for the job(s) in which you are interested (or for this type of system workload). Later sections of this chapter will discuss tools and utilities that enable you to determine this.

For example, if we examined the system with the X performance problems, we might find that the response-time problems occur only when more than one simulation job and/or large compilation job is running. By watching what happens when a user tries to switch windows under those conditions, we could also figure out that the critical resource is system memory and that the system is paging (we'll have more to say about this later in this chapter).
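
On Linux, one hedged way to check a suspicion like this is to compare two snapshots of the swap counters in /proc/vmstat. The sketch below assumes the `pswpin` and `pswpout` fields (pages swapped in and out since boot); the function names and the sample text are made up for illustration.

```python
def parse_vmstat(text):
    """Turn the 'name value' lines of /proc/vmstat into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters

def swap_rate(before, after, seconds):
    """Pages swapped in and out per second between two snapshots."""
    return (
        (after["pswpin"] - before["pswpin"]) / seconds,
        (after["pswpout"] - before["pswpout"]) / seconds,
    )

# Made-up snapshots taken 10 seconds apart, for illustration only.
first = parse_vmstat("pgfault 5000\npswpin 100\npswpout 400\n")
second = parse_vmstat("pgfault 9000\npswpin 600\npswpout 1400\n")
```

On a live system the text would come from reading /proc/vmstat twice. Sustained nonzero rates while users switch windows would be consistent with the memory shortage described above.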

15.1.1.3 Formulate explicit performance improvement goals.

This step involves transforming the implicit goals (wishes) that were part of the problem description into concrete, measurable goals. Again, being as precise and detailed as possible will make your job easier.
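
To illustrate what "concrete and measurable" might mean, a goal can be written down as a metric, a baseline, and a target that later measurements are checked against. This is only a sketch; the `Goal` class, the metric name, and the numbers are invented for the example.

```python
from dataclasses import dataclass

@dataclass
class Goal:
    metric: str          # what is measured
    baseline: float      # value before tuning
    target: float        # value that counts as success
    lower_is_better: bool = True

    def met_by(self, measured):
        """True if `measured` reaches the target."""
        if self.lower_is_better:
            return measured <= self.target
        return measured >= self.target

# Hypothetical goal: cut window-switch delay from 4 s to 1 s.
switch_delay = Goal("window switch delay (s)", baseline=4.0, target=1.0)
```

Writing the goal this way forces a decision about what will be measured and what value counts as success before any changes are made.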

In many cases, tuning goals will need to be developed in conjunction with the users affected by the performance problems, and possibly with other users and management personnel as well. System performance is almost always a matter of compromises and tradeoffs, because it inevitably involves deciding how to apply and apportion the finite available resources. Tuning is easiest and most successful where there is a clear agreement about the relative priority and importance of the various competing activities on the system.

To continue with our example, setting achievable tuning goals will be difficult unless it is decided whose performance is more important. In other words, it is probably necessary to choose between snappy interactive response time for X users and fast completion times for simulation and compilation jobs (remember that the status quo has already been demonstrated not to work). Decided one way, the tuning goal could become something like this:

Improve interactive response time for X users as much as possible without making simulation jobs take any longer to complete. Compilations can be delayed somewhat in order to keep the system from paging.

Not all performance goals that can be formulated can be met. You often must choose between the alternatives that are actually possible. Thus, in the preceding example, you will not be able to satisfy the requirements of all three workloads (interactive X sessions, simulations, and compilations) simultaneously on the current system.

15.1.1.4 Design and implement modifications to the system and applications to achieve those goals.

Figuring out what to do is, of course, the trickiest part of tuning a system. We'll look at what the options are for various types of problems in the upcoming sections of this chapter.

It is important to tune the system as a whole. Focusing only on part of the system workload will give you a distorted picture of the problem, because system performance is ultimately the result of the interactions among everything on the system.

15.1.1.5 Monitor the system to determine how well the changes worked.

The purpose here is to evaluate the system status after the change is made and determine whether or not the change has improved things as expected or desired. The most successful tuning method introduces small changes to the system, one at a time, allowing you to thoroughly test each one and judge its effectiveness and to back it out again if it makes things worse instead of better.
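
The change-measure-revert cycle described here can be sketched as a small helper. This is an illustration under simplifying assumptions (a single numeric metric and changes that can be undone cleanly); `try_change` and the toy `settings` dictionary are hypothetical stand-ins for a real system.

```python
def try_change(apply, revert, measure, lower_is_better=True):
    """Apply one change, re-measure, and back it out if things got worse.

    `apply` and `revert` are callables that make and undo the change;
    `measure` returns the current value of the metric being tuned.
    Returns (kept, before, after).
    """
    before = measure()
    apply()
    after = measure()
    improved = after < before if lower_is_better else after > before
    if not improved:
        revert()
    return improved, before, after

# Toy stand-in for a system: response time depends on one setting.
settings = {"buffers": 8}

def response_time():
    return 100 / settings["buffers"]
```

In practice each measurement would need to run long enough, under a representative workload, for the comparison to be meaningful.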

15.1.1.6 Return to the first step and begin again.

System performance tuning is inevitably an iterative process, because even a successful change will often reveal new interactions to understand and new problems to address. Similarly, once the bottleneck caused by one system resource is relieved, a new one centered around a different resource may very well arise. In fact, the initial performance problem can often be just a secondary symptom of the real, more serious underlying problem (e.g., a CPU shortage can be a symptom of serious memory shortfalls).

NOTE

Not all problems in life can be solved with money, but many performance issues can. If you have definitively identified the resource that is in short supply and you can afford to buy more of it (or upgrade it), do so. This approach is often the best and fastest way to address a performance problem. On the other hand, buying hardware merely in the hope that it will alleviate a performance problem is likely to be both wasteful and frustrating.

Most operating systems provide specialized tools for performance tuning. These are the primary tuning tools and procedures for each of the various operating systems we are considering:

AIX	`schedtune`, `vmtune`, `no`
FreeBSD	`sysctl`, `/etc/sysctl.conf`
HP-UX	`ndd`, `kmtune`
Linux	files under `/proc/sys`
Solaris	`dispadmin`, `ndd`, `/etc/system`
Tru64	`sysconfig`, `/etc/sysconfigtab`, `dxkerneltuner`

We'll discuss using these tools at the appropriate points within this chapter.

Some systems also provide additional performance monitoring and tuning tools as add-on packages.
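
As a small example of the Linux entry in the list above, kernel parameters appear as files under /proc/sys, with each dot in a parameter name corresponding to a directory level. The following Python sketch reads one; the helper names are hypothetical, `vm.swappiness` is used only as a familiar example, and names whose components themselves contain dots are not handled.

```python
from pathlib import Path

def sysctl_path(name, root="/proc/sys"):
    """Map a dotted parameter name such as 'vm.swappiness' to its file
    under /proc/sys (dots become directory separators)."""
    return Path(root).joinpath(*name.split("."))

def read_param(name, root="/proc/sys"):
    """Return the current value of a kernel parameter as a string."""
    return sysctl_path(name, root).read_text().strip()
```

On a Linux system, `read_param("vm.swappiness")` would return the current setting. Changing a value means writing to the same file as root, which is best left until the parameter's effect is understood.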

15.1.2 Some Tuning Caveats

I'll close this section with two important notes about system performance tuning.

First, be aware of the experimenter effect. The term refers to the realization that merely watching something happen can change the thing that is happening in significant ways. In anthropology, this means that a researcher observing the customs and behaviors of another culture inevitably has an effect on what is observed; people behave differently when they know they are being watched, especially by outsiders. For performance monitoring, running the monitoring tools can also have an effect on the system, and this fact needs to be taken into account when interpreting the data they collect. Ideally, performance data collection should be decoupled from data analysis (and the latter can take place on a different system).
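
One possible way to decouple collection from analysis is sketched below: the collector only appends raw, timestamped samples to a file, and a separate function summarizes them later. It assumes a Unix-like system where `os.getloadavg` is available; the function names and the CSV layout are illustrative choices.

```python
import csv
import os
import time

def collect(path, sample=os.getloadavg, clock=time.time):
    """Append one raw, timestamped sample to `path` and do nothing else,
    keeping the collector's own footprint on the system small."""
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([clock(), *sample()])

def analyze(path):
    """Run later (possibly on another machine): mean of the first
    column after the timestamp, i.e. the 1-minute load average."""
    with open(path, newline="") as f:
        values = [float(row[1]) for row in csv.reader(f)]
    return sum(values) / len(values)
```

The collector could be run periodically (for example from cron), and the resulting file copied elsewhere so that the analysis adds no load to the system being studied.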

Second, consider this advice from IBM's AIX Versions 3.2 and 4.1 Performance Tuning Guide:

The analyst must resist the temptation to tune what is measurable rather than what is important.

Its overly formal language aside, this maxim reminds us that the tools Unix provides for observing system behavior offer one way of looking at the system, but not the only way. What is actually important to watch and tune on your system may or may not be trivially accessible to either monitoring or modification.

At the same time, it is also necessary to keep this important corollary in mind:

Resist the temptation to tune something just because it is tunable.

This is, of course, really just another way of saying:

If it ain't broke, don't fix it.
