CPU usage is usually the first factor that I consider when I am tracking down a performance problem or just trying to assess the current system state in general.
15.3.1 Nice Numbers and Process Priorities
Most Unix systems use a priority-based round-robin scheduling algorithm to distribute CPU resources among multiple competing processes. All processes have an execution priority assigned to them, an integer value that is dynamically computed and updated on the basis of several factors. Whenever the CPU is free, the scheduler selects the most favored process to begin or resume executing; this usually corresponds to the process with the lowest priority number, because lower numbers are defined as more favored than higher ones in typical implementations.
Although there may be a multitude of processes simultaneously present on the system, only one process actually uses the CPU processor at any given time (assuming the system has only a single CPU). Once a process begins running, it continues to execute until it needs to wait for an I/O operation to complete, receives an interrupt from the kernel, or otherwise gives up control of the CPU, or until it exhausts the maximum execution time slice (or quantum) defined on that system (10 milliseconds is a common value). Once the current process stops executing, the scheduler again selects the most favored process on the system and starts or resumes it. The process of changing the current running process is called a context switch.
Multiple runnable processes at the same priority level are placed into the run queue for that priority level. Whenever the CPU is free, the scheduler starts the processes at the head of the lowest-numbered, nonempty run queue. When the process at the top of a run queue stops executing, it goes to the end of the line, and the next process moves to the front.
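The run-queue mechanism just described can be sketched in a few lines of Python. This is a toy model for illustration only; the process names and priority levels are invented, and real schedulers are far more elaborate:

```python
from collections import deque

# One FIFO run queue per priority level; lower numbers are more favored.
run_queues = {
    0: deque(["pagedaemon"]),
    1: deque(["editor", "shell"]),
    2: deque(["bigjob"]),
}

def pick_next():
    """Return (level, process) from the lowest-numbered, nonempty run queue."""
    for level in sorted(run_queues):
        if run_queues[level]:
            return level, run_queues[level][0]
    return None

def quantum_expired(level):
    """The running process goes to the end of its queue; the next moves up."""
    run_queues[level].rotate(-1)

pick_next()                # → (0, 'pagedaemon'): the most favored process runs
run_queues[0].popleft()    # suppose it then blocks waiting for I/O
pick_next()                # → (1, 'editor')
quantum_expired(1)         # its time slice expires...
pick_next()                # → (1, 'shell'): processes at one level take turns
```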
A Unix process has two priority numbers associated with it:
Under BSD, nice numbers range from -20 to 20, with -20 the most favored priority (the default priority is 0); under System V, nice numbers range from 0 to 39 (the default is 20), with lower numbers again indicating higher priority and more rapid execution. For Unix, less is truly more. Interactive shells usually run at the default level (0 for BSD and 20 for System V). Only the superuser can specify nice numbers lower than the default.
Many systems provide a special nice number that can be assigned to processes that you want to run only when nothing else wants the CPU. On Solaris systems, this number is 19, and on Tru64, HP-UX, and Linux systems, it is 20.
On AIX systems, a similar effect can be accomplished by setting a process to the fixed priority level 121 (using the setpri system call; see Section 14.4 for a sample program calling this function). Because varying priorities always remain at or below 120, a job in the priority range of 121 to 126 will run only when no lower-priority process wants the CPU. Note that once you have assigned a process a fixed priority, it cannot return to having a varying priority.
Any user can be nice and decrease the priority of a process he owns by increasing its nice number. Only the superuser can decrease the nice number of a process. This prevents users from increasing their own priorities and thereby using more than their share of the system's resources.
There are several ways to specify a job's execution priority. First, there are two commands to initiate a process at lowered priority: the nice command, built into some shells, and the general Unix command nice, usually stored in /bin or /usr/bin. These commands both work the same way, but have slightly different syntaxes:
% nice [+|-n] command
$ /bin/nice -[[-]n] command
In the built-in C shell version of nice, if an explicitly signed number is given as its first argument, it specifies the amount the command's priority will differ from the default nice number; if no number is specified, the default offset is +4. With /bin/nice, the offset from the default nice number is specified as its argument and so is preceded by a hyphen; the default offset is +10, and positive numbers need not include a plus sign. Thus, the following commands are equivalent, despite looking very different:
% nice +6 bigjob
$ /bin/nice -6 bigjob
Both commands result in bigjob having a nice number of 6 under BSD and 26 under System V. Similarly, the following commands both raise bigjob's priority five steps above the default level (to -5 under BSD and 15 under System V):
# nice -5 bigjob
# /bin/nice --5 bigjob
Thus, BSD and System V nice numbers always differ by 20, but identical commands have equivalent effects on the two systems.
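This relationship between the two numbering schemes can be captured in a short Python sketch. The function name and constants are mine, not part of any Unix interface; they simply restate the defaults given above:

```python
# BSD nice numbers run from -20 to 20 (default 0); System V's run from
# 0 to 39 (default 20).  Applying the same offset to each default always
# leaves the two resulting numbers exactly 20 apart.
BSD_DEFAULT = 0
SYSV_DEFAULT = 20

def resulting_nice(offset):
    """Nice numbers a job gets on each system from the same nice command."""
    return BSD_DEFAULT + offset, SYSV_DEFAULT + offset

resulting_nice(6)    # the bigjob example: 6 under BSD, 26 under System V
resulting_nice(-5)   # the raised-priority example: -5 under BSD, 15 under System V
```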
The -l option to ps (in either format; the output varies only slightly) may be used to display a process's nice number and current execution priority. Here is some example output from a Linux system:
% ps l
   F   UID  PID  PPID PRI NI  VSZ  RSS WCHAN  ...  COMMAND
8201   371 8390  8219   1  0 3233  672 wait4  ...  rlogin iago
8201   371 8391  8219   3  4 3487 1196 do_sig ...  big_cmd
8201     0 8394     1  15 -5 2134 1400 -      ...  imp_cmd
The column headed NI displays each process's nice number. The column to its immediate left, labeled PRI, shows the process's current actual execution priority.
Processes inherit the priority of their parent when they are created. However, changing the priority of the parent process does not change the priorities of its children. Therefore, increasing a process's priority number may have no effect if this process has created one or more subprocesses. Accordingly, if the parent process spends most of its time waiting for its children, changing the parent's priority will have little or no effect on the system's performance.
15.3.2 Monitoring CPU Usage
There are many ways of obtaining a quick snapshot of current overall CPU activity. For example, the vmstat command includes CPU activity among the many system statistics that it displays. Its most useful mode uses this syntax:
$ vmstat interval [count]
where interval is the number of seconds between reports, and count is the total number of reports to generate. If count is omitted, vmstat runs until you terminate it.
Here is an example of the output from vmstat:
$ vmstat 5 4
 procs     memory           page              disk        faults      cpu
 r b w   avm   fre  re at pi po fr de sr d0 d1 d2 d3  in  sy  cs us sy id
 1 0 0 61312  9280   0  0 24  1  2  0  0  4  1  1 12  35  66  16 63 11 26
 3 2 0 71936  3616   3  0 96  0  0  0  2 18  0  0  0  23  89  34 72 28  0
 5 1 0 76320  3424   0  0  0  0  0  0  0 26  0  0  0  24  92  39 63 37  0
 4 1 0 63616  3008   1  1  0  0  0  0  0 21  0  0  0  23  80  33 78 22  0
The first line of every vmstat report displays average values for each statistic since boot time; it should be ignored. If you forget this, you can be misled by vmstat's output. At the moment, we are interested in these columns of the report:
During the period covered by the vmstat report, this system's CPU was used to capacity: there was no idle time at all. You'll need to use ps in conjunction with vmstat to determine the specific jobs that are consuming the system's CPU resources. Under AIX, the tprof command can also be used for this purpose.
15.3.2.1 Recognizing a CPU shortage
High levels of CPU usage are not a bad thing in themselves (quite the contrary, in fact: they might mean that the system is accomplishing a lot of useful work). However, if you are tracking down a system performance problem, and you see such levels of CPU use consistently over a significant period of time, a shortage of CPU cycles is one factor that might be contributing to that problem (it may not be the total problem, however, as we'll see).
In general, one or more of the following symptoms may suggest a shortage of CPU resources when they appear regularly and/or persist for a significant period of time:
We'll look at each of these options in turn in the remainder of this section.
15.3.3 Changing a Process's Nice Number
When the system's load is high, you may wish to force CPU-intensive processes to run at a lower priority. This reduces the CPU contention for interactive jobs such as editing, and generally keeps users happy. Alternatively, you may wish to devote most of the system's time to a few critical processes, letting others finish when they will. The renice command may be used to change the priority of running processes. Introduced in BSD, renice is now also supported on most System V systems. Only root may use renice to increase the priority of a process (i.e., lower its nice number).
renice's traditional syntax is:
# renice new-nice-number pid
new-nice-number is a valid nice number, and pid is a process identification number. For example, the following command sets the nice number of process 8201 to 5, lowering its priority by five steps.
# renice 5 8201
15.3.3.1 renice under AIX, HP-UX, and Tru64
AIX and HP-UX use a modified form of the renice command. This form requires that the -n option precede the new nice number, as in this example:
$ renice -n 12 8201
Tru64 supports both forms of the renice command.
Note that AIX uses the System V-style priority system, running from 0 (high) to 40 (low). For renice under AIX, the new nice number is still specified on a scale from -20 to 20; it is translated internally into the 0-40 scheme actually used. This can make for some slightly strange output at times:
# renice -n 10 3769
3769: old priority 0, new priority 10
# ps -l -p 3769
       F S  UID  PID PPID  C PRI NI ADDR SZ WCHAN   TTY   TIME CMD
  200801 S  371 3769 8570  0  70 30 2aca 84 1d79098 pts/1 0:00 c12
The renice command reports its action in terms of BSD nice numbers, but the ps display shows the real nice number.
15.3.3.2 Changing process priorities under Solaris
System V.4 changed the standard System V priority scheme as part of its support for real-time processes. By default, V.4, and hence Solaris, internally use time-sharing priority numbers ranging from -20 to 20, with 20 as the highest priority (the default is 0). V.4 also supports the BSD renice command, mapping BSD nice numbers to the corresponding time-sharing priority number; similarly, the ps command continues to display nice numbers in the V.3 format. Solaris has incorporated this scheme as part of its V.4 base.
Solaris also uses another command to modify process priorities (again, primarily intended for real-time processes): priocntl. The priocntl form to change the priority for a single process is:
# priocntl -s -p new-pri -i pid proc-id
where new-pri is the new priority for the process and proc-id is the process ID of the desired process. For example, the following command sets the priority level for process 8733 to -5:
# priocntl -s -p -5 -i pid 8733
The following form may be used to set the priority (nice number) for every process created by a given parent process:
# priocntl -s -p -5 -i ppid 8720
This command sets the priority of process 8720 and all of its children.
The priocntl command has many other capabilities and uses, as we'll see in the course of this chapter (you may also want to consult its manual page).
15.3.3.3 Setting a user's default nice numbers under Tru64
Tru64 allows you to specify the default nice number for a user's login shell (which will be inherited by all processes that she subsequently creates), via the u_priority field in the user's protected password database entry. This field takes a numeric value and defaults to 0 (the usual default nice number). For example, setting u_priority to 5 gives that user's login shell, and hence her subsequent processes, a default nice value of 5.
A systemwide default nice value may also be set in /etc/auth/system/default.
15.3.4 Configuring the System Scheduler
AIX and Solaris provide substantial facilities for configuring the functioning of the system scheduler. Tru64 also offers a few relevant kernel parameters. We'll consider these facilities in this section. The other operating systems offer little of practical use for CPU performance tuning.
15.3.4.1 The AIX scheduler
On AIX systems, dynamic process priorities range from 0 to 127, with lower numbers more favorable. The current value for each process appears in the column labeled PRI in ps -l displays. Normally, process execution priorities change over time (in contrast to nice numbers), according to the following formula:

priority = min + nice + (frac × recent)

min is the minimum process priority level, normally 40; nice is the process's nice number; recent is a number indicating how much CPU time the process has received recently (it is displayed in the column labeled C in ps -l output). By default, the parameter frac is 0.5; it specifies how much of the recent CPU usage is taken into account (how large the penalty for recent CPU cycles is).
For a new process, recent starts out at 0; it can reach a maximum value of 120. By default, at the end of each 10-millisecond time slice (equivalent to one clock tick), the scheduler increases recent by one for the process currently in control of the CPU. In addition, once a second, the scheduler reduces recent for all processes, multiplying it by a reduction factor that defaults to 0.5 (i.e., recent is divided by 2 by default). The effect of this procedure is to penalize processes that have received CPU resources most recently by increasing their execution priority value, and gradually lowering the execution priority value for processes that have had to wait, to the minimum level arising from their nice number.
The result of this scheduling policy is that CPU resources are more or less evenly divided among (compute-bound) jobs at the same nice level. When there are jobs ready to run at both normal and raised nice levels, the normal-priority jobs will get more time than the others, but even the niced jobs will get some CPU time. For long-running processes, the distinction between normal priority and niced processes eventually becomes quite blurred because normal priority processes that have gotten a significant amount of CPU time can easily rise in priority above that of waiting niced processes.
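This recalculation scheme is easy to model. The following Python sketch restates the description above (min = 40, frac = r/32, once-a-second decay factor d/32, recent capped at 120); it is a toy model of the arithmetic, not AIX code:

```python
MIN_PRI = 40        # minimum process priority level
MAX_RECENT = 120    # ceiling on the recent-CPU-usage figure

def priority(nice, recent, r=16):
    """Execution priority: min + nice + frac*recent, with frac = r/32."""
    return MIN_PRI + nice + recent * r / 32

def tick(recent):
    """Charge one 10 ms time slice to the running process."""
    return min(recent + 1, MAX_RECENT)

def decay(recent, d=16):
    """Once a second, scale back every process's recent CPU figure by d/32."""
    return int(recent * d / 32)

priority(nice=20, recent=MAX_RECENT)        # worst case for this nice level: 120.0
priority(nice=20, recent=MAX_RECENT, r=4)   # schedtune -r 4: recent adds at most 15
priority(nice=20, recent=50, r=0)           # schedtune -r 0: nice alone decides
```

With the default decay factor, decay(100) halves recent to 50; with schedtune -d 32, decay leaves it unchanged, so recent simply accumulates as the text describes.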
The schedtune utility is used to modify scheduler and other operating system parameters. The schedtune executable is provided in /usr/samples/kernel.
For normal processes, you can alter two scheduler parameters with this utility: the fraction of the short-term CPU usage value used in computing the current execution priority (-r) and how much the short-term CPU usage number is reduced at the end of each one second interval (-d). Each value is divided by 32 to compute the actual multiplier that is used (e.g., frac in the preceding equation is equal to -r/32). Both values default to 16, resulting in factors of one half in both cases.
For example, the following command makes slight alterations to these two parameters:
# schedtune -r 15 -d 15
The -r option determines how quickly recent CPU usage raises a process's execution priority (lowering its likelihood of resumed execution). For example, giving -r a value of 10 causes the respective priorities of normal and niced processes to equalize more slowly than under the default conditions, allocating a large fraction of the total CPU capacity to the more favored jobs.
Decreasing the value even more intensifies this effect; if the option is set to 4, for example, only one eighth of the recent CPU usage number will be used in calculating the execution priority (instead of one half). This means that this component will never contribute more than 15 to the execution priority (120 * 4/32), so a process that has a nice number greater than 15 will never interfere with the running of a normal process.
Setting -r to 0 makes nice numbers the sole determinant of process execution priorities by removing recent CPU usage from the calculation (literally, multiplying it by 0). Under these conditions, process execution priorities will remain static over time for all processes (unless they are explicitly reniced by hand).
Setting the -d option to a value other than 16 changes what constitutes recent CPU usage. A smaller value means that CPU usage affects the execution priority less than under the default conditions, effectively making the definition of "recent" shorter. On the other hand, a larger value causes CPU usage to affect execution priorities for a longer period of time. In the extreme case, -d 32 means that CPU usage simply accumulates (the divisor every second is 1), so long-running processes will always be less favored than ones that have used less CPU time, because every process's recent CPU usage number will eventually rise to the maximum value of 120 and stay there (provided they run long enough). Newer processes will always be favored over those that have already received at least 120 time slices. Relative nice numbers determine the execution order for all processes over this threshold, and ones at the same nice level take turns via the usual run queue mechanism.
schedtune's -t option may be used to change the length of the maximum time slice allotted to a process. This option takes the number of 10-millisecond clock ticks by which to increase the length of the default time slice as its argument. For example, this command doubles the length of the time slice, setting it to 20 milliseconds:
# schedtune -t 1
Note that this change applies only to fixed-priority processes (the priority must have been set with the setpri system call). Such processes' priorities do not change over time (in contrast to the scheme described above) but instead remain fixed for their entire lifetimes.
schedtune's modifications to the scheduling parameters remain in effect only until the system is rebooted; you'll need to place the appropriate command in one of the system initialization scripts or in /etc/inittab if you decide that a permanent change is desirable. schedtune -D may be used to restore the default values for all parameters managed by this utility at any point (including ones unrelated to the system scheduler). Executing the command without any options will display the current values of all tunable parameters, and the -? option will display a manual page for the command (use a backslash before the question mark in the C shell).
15.3.4.2 The Solaris scheduler
System V.4 also introduced administrator-configurable process scheduling, which is now part of Solaris. One purpose of this facility is to support real-time processes: processes designed to work in application areas where nearly immediate responses to events are required (say, processing raw radar data in a vehicle in motion, controlling a manufacturing process making extensive use of robotics, or starting up the backup cooling system on a nuclear reactor). Operating systems handle such needs by defining a class of processes as real-time processes, giving them virtually complete access to all system resources when they are running. Under such circumstances, normal time-sharing processes will receive little or no CPU time. Solaris allows a system to be configured to run both normal time-sharing and real-time processes (although actual real-time systems have seldom mixed the two in practice). Alternatively, a system may be configured without real-time processes.
This section serves as an introductory overview to this facility. Obviously, the process scheduler facility is something to play with on a test system first, not something to try on your main production system three days before an important deadline.
Solaris defines several process classes: real-time, time-sharing, interactive, system, and interrupt. The system class is used for kernel processes (such as the paging daemon). For scheduling table definition purposes, each process class has its own set of priority numbers. For example, real-time process priorities run from 0 to 59 (higher is better). Time-sharing processes use priority numbers from 0 to 59 by default. However, these priority number sets are all mapped to a single set of internal priority numbers running from 0 to 169, as defined in Table 15-4.
As the table indicates, a real-time process will always run before either a system or time-sharing process, because real-time global priorities (which are what the process scheduler actually uses) are all greater than system and time-sharing global priorities. The definitions of each real-time and time-sharing global priority level are stored in the kernel and, if they have been customized, are usually loaded by one of the system initialization scripts at boot time. The current definitions may be retrieved with the dispadmin -g command. Here is an example:
$ dispadmin -g -c TS
# Time Sharing Dispatcher Configuration
RES=1000

# ts_quantum  ts_tqexp  ts_slpret  ts_maxwait  ts_lwait     PRIORITY LEVEL
     1000         0        10          5          10        #     0
     1000         0        11          5          11        #     1
     1000         1        12          5          12        #     2
     1000         1        13          5          13        #     3
      ...
      100        47        58          5          58        #    57
      100        48        59          5          59        #    58
      100        49        59          5          59        #    59
Each line of the table defines the characteristics of a different priority level, numbered consecutively from 0. The RES= line defines the time units used in the table. It says how many parts each second is divided into; each defined fraction of a second becomes one unit. Thus, in this file, the time units are milliseconds.
The fields have the following meanings:

ts_quantum
    The length of the time slice allotted to processes at this priority level.
ts_tqexp
    The new priority level given to a process that consumes its entire time slice.
ts_slpret
    The priority level given to a process when it returns from a sleep.
ts_maxwait
    How long a runnable process can wait without receiving any CPU time before its priority is raised.
ts_lwait
    The new priority level given to a process that has waited ts_maxwait without running.
All text after number signs is ignored. Thus, the PRIORITY LEVEL columns are really comments designed to make the table easier to read.
From the preceding example, it is evident how process priorities would change under various circumstances. For example, consider a level 57 process (2 steps short of the most favored priority). If a process at this level runs for its full 100 milliseconds, it will then drop down to priority level 47, giving up the CPU to any higher priority processes. If, on the other hand, it waits for 5 milliseconds after being ready to run, its priority level is raised to 58, making it more likely to be executed sooner.
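These transitions can be read mechanically out of the dispatch table. Here is a small Python sketch using just the rows quoted above (the dictionary layout and function names are mine, for illustration):

```python
# Each level maps to (ts_quantum, ts_tqexp, ts_slpret, ts_maxwait, ts_lwait),
# copied from the dispadmin output above; only a few levels are filled in.
ts_table = {
    0:  (1000, 0, 10, 5, 10),
    57: (100, 47, 58, 5, 58),
    58: (100, 48, 59, 5, 59),
    59: (100, 49, 59, 5, 59),
}

def after_full_quantum(level):
    """New priority when a process uses up its whole time slice."""
    return ts_table[level][1]           # ts_tqexp

def after_waiting(level):
    """New priority when a runnable process waits ts_maxwait without running."""
    return ts_table[level][4]           # ts_lwait

after_full_quantum(57)   # → 47: the CPU hog at level 57 drops
after_waiting(57)        # → 58: the patient process at level 57 rises
```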
Here is a rather different time sharing scheduling table:
# Time Sharing Dispatcher Configuration
RES=1000

# ts_quantum  ts_tqexp  ts_slpret  ts_maxwait  ts_lwait     PRIORITY LEVEL
      200         0        59          0          50        #     0
      200         0        59          0          50        #     1
      200         0        59          0          50        #     2
      200         0        59          0          50        #     3
      ...
      160         0        59          0          51        #    10
      160         1        59          0          51        #    11
      ...
      120        10        59          0          52        #    20
      120        11        59          0          52        #    21
      ...
       80        20        59          0          53        #    30
       80        21        59          0          53        #    31
      ...
       40        30        59          0          55        #    40
      ...
       40        47        59          0          59        #    57
       40        48        59          0          59        #    58
       40        49        59          0          59        #    59
This table has the effect of conflating the large number of priority levels down to a few distinct values when processes have to wait to gain access to the CPU. Because ts_maxwait is always 0 and ts_lwait ranges only between 50 and 59, any runnable process that has to wait gets its priority changed to a value in this range. In addition, when a process returns from a sleep, its priority is set to 59, the highest available value. Note also that processes with high priorities get short time slices compared to the previous table (as little as 40 milliseconds).
You can dynamically install a new scheduler table with the dispadmin command's -s option. For example, this command installs the table contained in the file /etc/ts_sched.new into memory:
# dispadmin -c TS -s /etc/ts_sched.new
The table format in the specified file must be the same as that displayed by dispadmin -g, and it must contain the same number of priority levels as the one currently in use. Permanent changes may be made by running such a command at boot time or by creating a loadable module with a new scheduler table (see the ts_dptbl manual page for the latter procedure).
The priocntl command allows a priority level ceiling to be imposed upon a time-sharing process, which specifies the maximum priority level it can attain. This prevents a low priority process from becoming runnable and eventually marching up to the top priority level (as would happen under the first scheduler table we looked at) when you really want that process to run only when nothing else is around. Setting a limit can keep it below the range of normal processes. For example, the following command sets the maximum priority for process 27163 to -5:
# priocntl -s -m -5 27163
Note that the command uses external priority numbers (not the scheduler table values).
15.3.4.3 Tru64 kernel parameters

Tru64 provides many kernel parameters that control various aspects of the kernel's functioning. On Tru64 systems, kernel parameters may be altered using the sysconfig and dxkerneltuner utilities (text-based and GUI, respectively), although most values are alterable only at boot time.
sysconfig can also be used to display the current and configured values of kernel variables. For example, the following commands display information about the autonice_penalty parameter:
# sysconfig -m proc                         Is the proc subsystem static or dynamic?
proc: static
# sysconfig -q proc autonice_penalty        Display current value.
proc: autonice_penalty = 4
# sysconfig -Q proc autonice_penalty        Display parameter attributes.
proc: autonice_penalty - type=INT op=CQ min_val=0 max_val=20
The command takes a subsystem name and (optionally) a parameter name as its arguments.
The following command form will modify a current value:
# sysconfig -r proc autonice_penalty=6
Another useful sysconfig argument is -d; it displays the values set in the kernel initialization file, /etc/sysconfigtab, which are set at boot time. The majority of this file specifies device configuration; local modifications to standard kernel parameter values come at the end.
Here are some sample entries from this file:
generic:                                    General settings.
    memberid = 0
    new_vers_high = 1445655480385976064
    new_vers_low = 51480
ipc:
    shm_max = 67108864                      Max. shared memory (default: 4 MB).
    shm_mni = 1024                          Max. shared regions (default: 128).
    shm_seg = 256                           Max. regions/process (default: 32).
proc:                                       CPU-related settings.
    max_per_proc_stack_size = 41943040
    autonice = 1
    autonice_penalty = 10
Each stanza is introduced by the subsystem name. In this example, we configure the generic (general), ipc shared memory and proc (CPU/process) subsystems.
The proc subsystem is the most relevant to CPU performance. The following parameters may be useful in some circumstances:

autonice
    Whether processes are automatically niced after accumulating significant CPU time (0 = no, 1 = yes).
autonice_time
    The amount of CPU time a process must accumulate before it is automatically niced.
autonice_penalty
    The amount added to the nice number of a process that is automatically niced (0 to 20).
15.3.5 Unix Batch-Processing Facilities
Manually monitoring and altering processes' execution priorities is a crude way to handle CPU time allocation, but unfortunately it's the only method that standard Unix offers. It is adequate for the conditions under which Unix was developed: systems with lots of small interactive jobs. But if a system runs some large jobs as well, it quickly breaks down.
Another way of dividing the available CPU resources on a busy system among multiple competing processes is to run jobs at different times, including some at times when the system would otherwise be idle. Standard Unix has a limited facility for doing so via the at and batch commands. Under the default configuration, at allows a command to be executed at a specified time, and batch provides a queue from which jobs may be run sequentially in a batch-like mode. For example, if all large jobs are run via batch from its default queue, it can ensure that only one is ever running at a time (provided users cooperate, of course).
In most implementations, system administrators may define additional queues in the queuedefs file, found in various locations on different systems:
This file defines queues whose names consist of a single letter (either case is valid). Conventionally, queue a is used for at, queue b is used for batch, and on many newer systems, queue c is used by cron. Tru64 and AIX define queues e and f for at jobs using the Korn shell and C shell, respectively (submitted using the at command's -k and -c options).
Queues are defined by lines in this format:
q.xjynzw
q is a letter, x indicates the number of simultaneous jobs that may run from that queue, y specifies the nice value for processes started from that queue, and z says how long to wait before trying to start a new job when the maximum number of jobs for that queue or for the facility as a whole is already running. The default values are 100 jobs, a nice value of 2 (where 0 is the default nice number), and 60 seconds.
The first two of the following queuedefs entries show typical definitions for the at and batch queues. The third entry defines the h queue, which can run one or two simultaneous jobs, niced to level 10, and waits for five minutes between job initiation attempts after starting one has failed:
a.4j1n b.2j2n90w h.2j10n300w
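Decoding entries like these is mechanical. Here is a small Python parser for the format, applying the defaults mentioned above (100 jobs, nice 2, 60-second wait); it is an illustration of the syntax, not part of any standard toolset:

```python
import re

def parse_queuedefs(entry):
    """Split a queuedefs entry like 'b.2j2n90w' into its fields."""
    queue, spec = entry.split(".")
    fields = {letter: int(num)
              for num, letter in re.findall(r"(\d+)([jnw])", spec)}
    return {"queue": queue,
            "njobs": fields.get("j", 100),   # simultaneous jobs
            "nice":  fields.get("n", 2),     # nice value for queue jobs
            "wait":  fields.get("w", 60)}    # retry interval in seconds

parse_queuedefs("a.4j1n")       # → njobs=4, nice=1, wait=60 (default)
parse_queuedefs("h.2j10n300w")  # → njobs=2, nice=10, wait=300 (five minutes)
```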
The desired queue is selected with the -q option to the at command. Jobs waiting in the facility's queues may be listed and removed from a queue using the -l and -r options, respectively.
If simple batch-processing facilities like these are sufficient for your system's needs, at and batch may be of some use, but if any sort of queue priority features are required, these commands will probably prove insufficient. The manual page for at found on many Linux systems is the most honest about these deficiencies, admitting that at and batch as presently implemented are not suitable when users are competing for resources.
A true batch system supports multiple queues; queues that receive jobs from and send jobs to a configurable set of network hosts, including the ability to select hosts based on load-leveling criteria; administrator-set in-queue priorities (for ordering pending jobs within a queue); queue execution priorities and resource limits (the priority and limits automatically assigned to jobs started from that queue); queue permissions (which users can submit jobs to each queue); and other parameters on a queue-by-queue basis.

AIX has adapted its print-spooling subsystem to provide a very simple batch system (see Section 13.3), allowing for different job priorities within a queue and multiple batch queues, but it is still missing most important features of a modern batch system. Some vendors offer batch-processing features as an optional facility at additional cost.
There are also a variety of open source queueing systems, including: