CPU usage is usually the first factor that I consider when I am tracking down a performance problem or just trying to assess the current system state in general.
15.3.1 Nice Numbers and Process Priorities
Most Unix systems use a priority-based round-robin scheduling algorithm to distribute CPU resources among multiple competing processes. All processes have an execution priority assigned to them, an integer value that is dynamically computed and updated on the basis of several factors. Whenever the CPU is free, the scheduler selects the most favored process to begin or resume executing; this usually corresponds to the process with the lowest priority number, because lower numbers are defined as more favored than higher ones in typical implementations.
Although there may be a multitude of processes simultaneously present on the system, only one process actually uses the CPU processor at any given time (assuming the system has only a single CPU). Once a process begins running, it continues to execute until it needs to wait for an I/O operation to complete, receives an interrupt from the kernel, or otherwise gives up control of the CPU, or until it exhausts the maximum execution time slice (or quantum) defined on that system (10 milliseconds is a common value). Once the current process stops executing, the scheduler again selects the most favored process on the system and starts or resumes it. The process of changing the current running process is called a context switch.
Multiple runnable processes at the same priority level are placed into the run queue for that priority level. Whenever the CPU is free, the scheduler starts the processes at the head of the lowest-numbered, nonempty run queue. When the process at the top of a run queue stops executing, it goes to the end of the line, and the next process moves to the front.
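The run-queue mechanism just described can be sketched in a few lines of Python. This is a toy model for illustration only; the process names and priority levels are invented, and real schedulers are far more elaborate:

```python
from collections import deque

# One FIFO run queue per priority level; lower numbers are more favored.
run_queues = {
    0: deque(["pagedaemon"]),
    1: deque(["editor", "shell"]),
    2: deque(["bigjob"]),
}

def pick_next():
    """Return (level, process) from the lowest-numbered, nonempty run queue."""
    for level in sorted(run_queues):
        if run_queues[level]:
            return level, run_queues[level][0]
    return None

def quantum_expired(level):
    """The running process goes to the end of its queue; the next moves up."""
    run_queues[level].rotate(-1)

pick_next()                # → (0, 'pagedaemon'): the most favored process runs
run_queues[0].popleft()    # suppose it then blocks waiting for I/O
pick_next()                # → (1, 'editor')
quantum_expired(1)         # its time slice expires...
pick_next()                # → (1, 'shell'): processes at one level take turns
```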
A Unix process has two priority numbers associated with it:
Under BSD, nice numbers range from -20 to 20, with -20 the most favored priority (the default priority is 0); under System V, nice numbers range from 0 to 39 (the default is 20), with lower numbers again indicating higher priority and more rapid execution. For Unix, less is truly more. Interactive shells usually run at the default level (0 for BSD and 20 for System V). Only the superuser can specify nice numbers lower than the default.
Many systems provide a special nice number that can be assigned to processes that you want to run only when nothing else wants the CPU. On Solaris systems, this number is 19, and on Tru64, HP-UX, and Linux systems, it is 20.
On AIX systems, a similar effect can be accomplished by setting a process to the fixed priority level 121 (using the setpri system call; see Section 14.4 for a sample program calling this function). Because varying priorities always remain at or below 120, a job in the priority range of 121 to 126 will run only when no lower-priority process wants the CPU. Note that once you have assigned a process a fixed priority, it cannot return to having a varying priority.
Any user can be nice and decrease the priority of a process he owns by increasing its nice number. Only the superuser can decrease the nice number of a process. This prevents users from increasing their own priorities and thereby using more than their share of the system's resources.
There are several ways to specify a job's execution priority. First, there are two commands to initiate a process at lowered priority: the nice command, built into some shells, and the general Unix command nice, usually stored in /bin or /usr/bin. These commands both work the same way, but have slightly different syntaxes:
% nice [+|-n] command
$ /bin/nice -[[-]n] command
In the built-in C shell version of nice, if an explicitly signed number is given as its first argument, it specifies the amount the command's priority will differ from the default nice number; if no number is specified, the default offset is +4. With /bin/nice, the offset from the default nice number is specified as its argument and so is preceded by a hyphen; the default offset is +10, and positive numbers need not include a plus sign. Thus, the following commands are equivalent, despite looking very different:
% nice +6 bigjob
$ /bin/nice -6 bigjob
Both commands result in bigjob having a nice number of 6 under BSD and 26 under System V. Similarly, the following commands both raise bigjob's priority five steps above the default level (to -5 under BSD and 15 under System V):
# nice -5 bigjob
# /bin/nice --5 bigjob
Thus, BSD and System V nice numbers always differ by 20, but identical commands have equivalent effects on the two systems.
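This relationship between the two numbering schemes can be captured in a short Python sketch. The function name and constants are mine, not part of any Unix interface; they simply restate the defaults given above:

```python
# BSD nice numbers run from -20 to 20 (default 0); System V's run from
# 0 to 39 (default 20).  Applying the same offset to each default always
# leaves the two resulting numbers exactly 20 apart.
BSD_DEFAULT = 0
SYSV_DEFAULT = 20

def resulting_nice(offset):
    """Nice numbers a job gets on each system from the same nice command."""
    return BSD_DEFAULT + offset, SYSV_DEFAULT + offset

resulting_nice(6)    # the bigjob example: 6 under BSD, 26 under System V
resulting_nice(-5)   # the raised-priority example: -5 under BSD, 15 under System V
```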
The -l option to ps (in either format; the output varies only slightly) may be used to display a process's nice number and current execution priority. Here is some example output from a Linux system:
% ps l
   F   UID  PID  PPID PRI NI  VSZ  RSS WCHAN  ...  COMMAND
8201   371 8390  8219   1  0 3233  672 wait4  ...  rlogin iago
8201   371 8391  8219   3  4 3487 1196 do_sig ...  big_cmd
8201     0 8394     1  15 -5 2134 1400 -      ...  imp_cmd
The column headed NI displays each process's nice number. The column to its immediate left, labeled PRI, shows the process's current actual execution priority.
Processes inherit the priority of their parent when they are created. However, changing the priority of the parent process does not change the priorities of its children. Therefore, increasing a process's priority number may have no effect if this process has created one or more subprocesses. Accordingly, if the parent process spends most of its time waiting for its children, changing the parent's priority will have little or no effect on the system's performance.
15.3.2 Monitoring CPU Usage
There are many ways of obtaining a quick snapshot of current overall CPU activity. For example, the vmstat command includes CPU activity among the many system statistics that it displays. Its most useful mode uses this syntax:
$ vmstat interval [count]
where interval is the number of seconds between reports, and count is the total number of reports to generate. If count is omitted, vmstat runs until you terminate it.
Here is an example of the output from vmstat:
$ vmstat 5 4
 procs     memory           page              disk        faults      cpu
 r b w   avm   fre  re at pi po fr de sr d0 d1 d2 d3  in  sy  cs us sy id
 1 0 0 61312  9280   0  0 24  1  2  0  0  4  1  1 12  35  66  16 63 11 26
 3 2 0 71936  3616   3  0 96  0  0  0  2 18  0  0  0  23  89  34 72 28  0
 5 1 0 76320  3424   0  0  0  0  0  0  0 26  0  0  0  24  92  39 63 37  0
 4 1 0 63616  3008   1  1  0  0  0  0  0 21  0  0  0  23  80  33 78 22  0
The first line of every vmstat report displays average values for each statistic since boot time; it should be ignored. If you forget this, you can be misled by vmstat's output. At the moment, we are interested in these columns of the report:
During the period covered by the vmstat report, this system's CPU was used to capacity: there was no idle time at all. You'll need to use ps in conjunction with vmstat to determine the specific jobs that are consuming the system's CPU resources. Under AIX, the tprof command can also be used for this purpose.
15.3.2.1 Recognizing a CPU shortage
High levels of CPU usage are not a bad thing in themselves (quite the contrary, in fact: they might mean that the system is accomplishing a lot of useful work). However, if you are tracking down a system performance problem, and you see such levels of CPU use consistently over a significant period of time, a shortage of CPU cycles is one factor that might be contributing to that problem (it may not be the total problem, however, as we'll see).
In general, one or more of the following symptoms may suggest a shortage of CPU resources when they appear regularly and/or persist for a significant period of time:
We'll look at each of these options in turn in the remainder of this section.
15.3.3 Changing a Process's Nice Number
When the system's load is high, you may wish to force CPU-intensive processes to run at a lower priority. This reduces the CPU contention for interactive jobs such as editing, and generally keeps users happy. Alternatively, you may wish to devote most of the system's time to a few critical processes, letting others finish when they will. The renice command may be used to change the priority of running processes. Introduced in BSD, renice is now also supported on most System V systems. Only root may use renice to increase the priority of a process (i.e., lower its nice number).
renice's traditional syntax is:
# renice new-nice-number pid
new-nice-number is a valid nice number, and pid is a process identification number. For example, the following command sets the nice number of process 8201 to 5, lowering its priority by five steps.
# renice 5 8201
15.3.3.1 renice under AIX, HP-UX, and Tru64
AIX and HP-UX use a modified form of the renice command. This form requires that the -n option precede the new nice number, as in this example:
$ renice -n 12 8201
Tru64 supports both forms of the renice command.
Note that AIX uses the System V-style priority system, running from 0 (high) to 40 (low). For renice under AIX, the new nice number is still specified on a scale from -20 to 20; it is translated internally into the 0-40 scheme actually used. This can make for some slightly strange output at times:
# renice -n 10 3769
3769: old priority 0, new priority 10
# ps -l -p 3769
       F S  UID  PID PPID  C PRI NI ADDR SZ WCHAN   TTY   TIME CMD
  200801 S  371 3769 8570  0  70 30 2aca 84 1d79098 pts/1 0:00 c12
The renice command reports its action in terms of BSD nice numbers, but the ps display shows the real nice number.
15.3.3.2 Changing process priorities under Solaris
System V.4 changed the standard System V priority scheme as part of its support for real-time processes. By default, V.4, and hence Solaris, internally use time-sharing priority numbers ranging from -20 to 20, with 20 as the highest priority (the default is 0). V.4 also supports the BSD renice command, mapping BSD nice numbers to the corresponding time-sharing priority number; similarly, the ps command continues to display nice numbers in the V.3 format. Solaris has incorporated this scheme as part of its V.4 base.
Solaris also uses another command to modify process priorities (again, primarily intended for real-time processes): priocntl. The priocntl form to change the priority for a single process is:
# priocntl -s -p new-pri -i pid proc-id
where new-pri is the new priority for the process and proc-id is the process ID of the desired process. For example, the following command sets the priority level for process 8733 to -5:
# priocntl -s -p -5 -i pid 8733
The following form may be used to set the priority (nice number) for every process created by a given parent process:
# priocntl -s -p -5 -i ppid 8720
This command sets the priority of process 8720 and all of its children.
The priocntl command has many other capabilities and uses, as we'll see in the course of this chapter (you may also want to consult its manual page).
15.3.3.3 Setting a user's default nice numbers under Tru64
Tru64 allows you to specify the default nice number for a user's login shell (which will be inherited by all processes that she subsequently creates), via the u_priority field in the user's protected password database entry. This field takes a numeric value and defaults to 0 (the usual default nice number). For example, setting u_priority to 5 gives that user's login shell, and hence her subsequent processes, a default nice value of 5.
A systemwide default nice value may also be set in /etc/auth/system/default.
15.3.4 Configuring the System Scheduler
AIX and Solaris provide substantial facilities for configuring the functioning of the system scheduler. Tru64 also offers a few relevant kernel parameters. We'll consider these facilities in this section. The other operating systems offer little of practical use for CPU performance tuning.
15.3.4.1 The AIX scheduler
On AIX systems, dynamic process priorities range from 0 to 127, with lower numbers more favorable. The current value for each process appears in the column labeled PRI in ps -l displays. Normally, process execution priorities change over time (in contrast to nice numbers), according to the following formula:

priority = min + nice + (frac × recent)

min is the minimum process priority level, normally 40; nice is the process's nice number; recent is a number indicating how much CPU time the process has received recently (it is displayed in the column labeled C in ps -l output). By default, the parameter frac is 0.5; it specifies how much of the recent CPU usage is taken into account (how large the penalty for recent CPU cycles is).
For a new process, recent starts out at 0; it can reach a maximum value of 120. By default, at the end of each 10-millisecond time slice (equivalent to one clock tick), the scheduler increases recent by one for the process currently in control of the CPU. In addition, once a second, the scheduler reduces recent for all processes, multiplying it by a reduction factor that defaults to 0.5 (i.e., recent is divided by 2 by default). The effect of this procedure is to penalize processes that have received CPU resources most recently by increasing their execution priority value, and gradually lowering the execution priority value for processes that have had to wait, to the minimum level arising from their nice number.
The result of this scheduling policy is that CPU resources are more or less evenly divided among (compute-bound) jobs at the same nice level. When there are jobs ready to run at both normal and raised nice levels, the normal-priority jobs will get more time than the others, but even the niced jobs will get some CPU time. For long-running processes, the distinction between normal priority and niced processes eventually becomes quite blurred because normal priority processes that have gotten a significant amount of CPU time can easily rise in priority above that of waiting niced processes.
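This recalculation scheme is easy to model. The following Python sketch restates the description above (min = 40, frac = r/32, once-a-second decay factor d/32, recent capped at 120); it is a toy model of the arithmetic, not AIX code:

```python
MIN_PRI = 40        # minimum process priority level
MAX_RECENT = 120    # ceiling on the recent-CPU-usage figure

def priority(nice, recent, r=16):
    """Execution priority: min + nice + frac*recent, with frac = r/32."""
    return MIN_PRI + nice + recent * r / 32

def tick(recent):
    """Charge one 10 ms time slice to the running process."""
    return min(recent + 1, MAX_RECENT)

def decay(recent, d=16):
    """Once a second, scale back every process's recent CPU figure by d/32."""
    return int(recent * d / 32)

priority(nice=20, recent=MAX_RECENT)        # worst case for this nice level: 120.0
priority(nice=20, recent=MAX_RECENT, r=4)   # schedtune -r 4: recent adds at most 15
priority(nice=20, recent=50, r=0)           # schedtune -r 0: nice alone decides
```

With the default decay factor, decay(100) halves recent to 50; with schedtune -d 32, decay leaves it unchanged, so recent simply accumulates as the text describes.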
The schedtune utility is used to modify scheduler and other operating system parameters. The schedtune executable is provided in /usr/samples/kernel.
For normal processes, you can alter two scheduler parameters with this utility: the fraction of the short-term CPU usage value used in computing the current execution priority (-r) and how much the short-term CPU usage number is reduced at the end of each one second interval (-d). Each value is divided by 32 to compute the actual multiplier that is used (e.g., frac in the preceding equation is equal to -r/32). Both values default to 16, resulting in factors of one half in both cases.
For example, the following command makes slight alterations to these two parameters:
# schedtune -r 15 -d 15
The -r option determines how quickly recent CPU usage raises a process's execution priority (lowering its likelihood of resumed execution). For example, giving -r a value of 10 causes the respective priorities of normal and niced processes to equalize more slowly than under the default conditions, allocating a large fraction of the total CPU capacity to the more favored jobs.
Decreasing the value even more intensifies this effect; if the option is set to 4, for example, only one eighth of the recent CPU usage number will be used in calculating the execution priority (instead of one half). This means that this component will never contribute more than 15 to the execution priority (120 * 4/32), so a process that has a nice number greater than 15 will never interfere with the running of a normal process.
Setting -r to 0 makes nice numbers the sole determinant of process execution priorities by removing recent CPU usage from the calculation (literally, multiplying it by 0). Under these conditions, process execution priorities will remain static over time for all processes (unless they are explicitly reniced by hand).
Setting the -d option to a value other than 16 changes what constitutes recent CPU usage. A smaller value means that CPU usage affects the execution priority less than under the default conditions, effectively making the definition of "recent" shorter. On the other hand, a larger value causes CPU usage to affect execution priorities for a longer period of time. In the extreme case, -d 32 means that CPU usage simply accumulates (the divisor every second is 1), so long-running processes will always be less favored than ones that have used less CPU time, because every process's recent CPU usage number will eventually rise to the maximum value of 120 and stay there (provided they run long enough). Newer processes will always be favored over those that have already received at least 120 time slices. Relative nice numbers determine the execution order for all processes over this threshold, and ones at the same nice level take turns via the usual run queue mechanism.
schedtune's -t option may be used to change the length of the maximum time slice allotted to a process. This option takes the number of 10-millisecond clock ticks by which to increase the length of the default time slice as its argument. For example, this command doubles the length of the time slice, setting it to 20 milliseconds:
# schedtune -t 1
Note that this change applies only to fixed-priority processes (the priority must have been set with the setpri system call). Such processes' priorities do not change over time (in contrast to the scheme described above) but instead remain fixed for their entire lifetimes.
schedtune's modifications to the scheduling parameters remain in effect only until the system is rebooted; you'll need to place the appropriate command in one of the system initialization scripts or in /etc/inittab if you decide that a permanent change is desirable. schedtune -D may be used to restore the default values for all parameters managed by this utility at any point (including ones unrelated to the system scheduler). Executing the command without any options will display the current values of all tunable parameters, and the -? option will display a manual page for the command (use a backslash before the question mark in the C shell).
15.3.4.2 The Solaris scheduler
System V.4 also introduced administrator-configurable process scheduling, which is now part of Solaris. One purpose of this facility is to support real-time processes: processes designed to work in application areas where nearly immediate responses to events are required (say, processing raw radar data in a vehicle in motion, controlling a manufacturing process making extensive use of robotics, or starting up the backup cooling system on a nuclear reactor). Operating systems handle such needs by defining a class of processes as real-time processes, giving them virtually complete access to all system resources when they are running. Under such circumstances, normal time-sharing processes will receive little or no CPU time. Solaris allows a system to be configured to run both normal time-sharing and real-time processes (although actual real-time systems have seldom mixed the two in practice). Alternatively, a system may be configured without real-time processes.
This section serves as an introductory overview to this facility. Obviously, the process scheduler facility is something to play with on a test system first, not something to try on your main production system three days before an important deadline.
Solaris defines several process classes: real-time, time-sharing, interactive, system, and interrupt. The system class is used for kernel processes (such as the paging daemon). For scheduling table definition purposes, each process class has its own set of priority numbers. For example, real-time process priorities run from 0 to 59 (higher is better). Time-sharing processes use priority numbers from 0 to 59 by default. However, these priority number sets are all mapped to a single set of internal priority numbers running from 0 to 169, as defined in Table 15-4.
As the table indicates, a real-time process will always run before either a system or time-sharing process, because real-time global priorities (which are what the process scheduler actually uses) are all greater than system and time-sharing global priorities. The definitions of each real-time and time-sharing global priority level are stored in the kernel and, if they have been customized, are usually loaded by one of the system initialization scripts at boot time. The current definitions may be retrieved with the dispadmin -g command. Here is an example:
$ dispadmin -g -c TS
# Time Sharing Dispatcher Configuration
RES=1000

# ts_quantum  ts_tqexp  ts_slpret  ts_maxwait  ts_lwait     PRIORITY LEVEL
     1000         0        10          5          10        #     0
     1000         0        11          5          11        #     1
     1000         1        12          5          12        #     2
     1000         1        13          5          13        #     3
      ...
      100        47        58          5          58        #    57
      100        48        59          5          59        #    58
      100        49        59          5          59        #    59
Each line of the table defines the characteristics of a different priority level, numbered consecutively from 0. The RES= line defines the time units used in the table. It says how many parts each second is divided into; each defined fraction of a second becomes one unit. Thus, in this file, the time units are milliseconds.
The fields have the following meanings:

ts_quantum
    The length of the time slice allotted to processes at this priority level.
ts_tqexp
    The new priority level given to a process that consumes its entire time slice.
ts_slpret
    The priority level given to a process when it returns from a sleep.
ts_maxwait
    How long a runnable process can wait without receiving any CPU time before its priority is raised.
ts_lwait
    The new priority level given to a process that has waited ts_maxwait without running.
All text after number signs is ignored. Thus, the PRIORITY LEVEL columns are really comments designed to make the table easier to read.
From the preceding example, it is evident how process priorities would change under various circumstances. For example, consider a level 57 process (2 steps short of the most favored priority). If a process at this level runs for its full 100 milliseconds, it will then drop down to priority level 47, giving up the CPU to any higher priority processes. If, on the other hand, it waits for 5 milliseconds after being ready to run, its priority level is raised to 58, making it more likely to be executed sooner.
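These transitions can be read mechanically out of the dispatch table. Here is a small Python sketch using just the rows quoted above (the dictionary layout and function names are mine, for illustration):

```python
# Each level maps to (ts_quantum, ts_tqexp, ts_slpret, ts_maxwait, ts_lwait),
# copied from the dispadmin output above; only a few levels are filled in.
ts_table = {
    0:  (1000, 0, 10, 5, 10),
    57: (100, 47, 58, 5, 58),
    58: (100, 48, 59, 5, 59),
    59: (100, 49, 59, 5, 59),
}

def after_full_quantum(level):
    """New priority when a process uses up its whole time slice."""
    return ts_table[level][1]           # ts_tqexp

def after_waiting(level):
    """New priority when a runnable process waits ts_maxwait without running."""
    return ts_table[level][4]           # ts_lwait

after_full_quantum(57)   # → 47: the CPU hog at level 57 drops
after_waiting(57)        # → 58: the patient process at level 57 rises
```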
Here is a rather different time sharing scheduling table:
# Time Sharing Dispatcher Configuration
RES=1000

# ts_quantum  ts_tqexp  ts_slpret  ts_maxwait  ts_lwait     PRIORITY LEVEL
      200         0        59          0          50        #     0
      200         0        59          0          50        #     1
      200         0        59          0          50        #     2
      200         0        59          0          50        #     3
      ...
      160         0        59          0          51        #    10
      160         1        59          0          51        #    11
      ...
      120        10        59          0          52        #    20
      120        11        59          0          52        #    21
      ...
       80        20        59          0          53        #    30
       80        21        59          0          53        #    31
      ...
       40        30        59          0          55        #    40
      ...
       40        47        59          0          59        #    57
       40        48        59          0          59        #    58
       40        49        59          0          59        #    59
This table has the effect of conflating the large number of priority levels down to a few distinct values when processes have to wait to gain access to the CPU. Because ts_maxwait is always 0 and ts_lwait ranges only between 50 and 59, any runnable process that has to wait gets its priority changed to a value in this range. In addition, when a process returns from a sleep, its priority is set to 59, the highest available value. Note also that processes with high priorities get short time slices compared to the previous table (as little as 40 milliseconds).
You can dynamically install a new scheduler table with the dispadmin command's -s option. For example, this command installs the table contained in the file /etc/ts_sched.new into memory:
# dispadmin -c TS -s /etc/ts_sched.new
The table format in the specified file must be the same as that displayed by dispadmin -g, and it must contain the same number of priority levels as the one currently in use. Permanent changes may be made by running such a command at boot time or by creating a loadable module with a new scheduler table (see the ts_dptbl manual page for the latter procedure).
The priocntl command allows a priority level ceiling to be imposed upon a time-sharing process, which specifies the maximum priority level it can attain. This prevents a low priority process from becoming runnable and eventually marching up to the top priority level (as would happen under the first scheduler table we looked at) when you really want that process to run only when nothing else is around. Setting a limit can keep it below the range of normal processes. For example, the following command sets the maximum priority for process 27163 to -5:
# priocntl -s -m -5 27163
Note that the command uses external priority numbers (not the scheduler table values).
15.3.4.3 Tru64 kernel parameters

Tru64 provides many kernel parameters that control various aspects of the kernel's functioning. On Tru64 systems, kernel parameters may be altered using the sysconfig and dxkerneltuner utilities (text-based and GUI, respectively), although most values are alterable only at boot time.
sysconfig can also be used to display the current and configured values of kernel variables. For example, the following commands display information about the autonice_penalty parameter:
# sysconfig -m proc                         Is the proc subsystem static or dynamic?
proc: static
# sysconfig -q proc autonice_penalty        Display current value.
proc: autonice_penalty = 4
# sysconfig -Q proc autonice_penalty        Display parameter attributes.
proc: autonice_penalty - type=INT op=CQ min_val=0 max_val=20
The command takes a subsystem name and (optionally) a parameter name as its arguments.
The following command form will modify a current value:
# sysconfig -r proc autonice_penalty=6
Another useful sysconfig argument is -d; it displays the values set in the kernel initialization file, /etc/sysconfigtab, which are set at boot time. The majority of this file specifies device configuration; local modifications to standard kernel parameter values come at the end.
Here are some sample entries from this file:
generic:                                    General settings.
    memberid = 0
    new_vers_high = 1445655480385976064
    new_vers_low = 51480
ipc:
    shm_max = 67108864                      Max. shared memory (default: 4 MB).
    shm_mni = 1024                          Max. shared regions (default: 128).
    shm_seg = 256                           Max. regions/process (default: 32).
proc:                                       CPU-related settings.
    max_per_proc_stack_size = 41943040
    autonice = 1
    autonice_penalty = 10
Each stanza is introduced by the subsystem name. In this example, we configure the generic (general), ipc shared memory and proc (CPU/process) subsystems.
The proc subsystem is the most relevant to CPU performance. The following parameters may be useful in some circumstances:

autonice
    Whether processes are automatically niced after accumulating significant CPU time (0 = no, 1 = yes).
autonice_time
    The amount of CPU time a process must accumulate before it is automatically niced.
autonice_penalty
    The amount added to the nice number of a process that is automatically niced (0 to 20).
15.3.5 Unix Batch-Processing Facilities
Manually monitoring and altering processes' execution priorities is a crude way to handle CPU time allocation, but unfortunately it's the only method that standard Unix offers. It is adequate for the conditions under which Unix was developed: systems with lots of small interactive jobs. But if a system runs some large jobs as well, it quickly breaks down.
Another way of dividing the available CPU resources on a busy system among multiple competing processes is to run jobs at different times, including some at times when the system would otherwise be idle. Standard Unix has a limited facility for doing so via the at and batch commands. Under the default configuration, at allows a command to be executed at a specified time, and batch provides a queue from which jobs may be run sequentially in a batch-like mode. For example, if all large jobs are run via batch from its default queue, it can ensure that only one is ever running at a time (provided users cooperate, of course).
In most implementations, system administrators may define additional queues in the queuedefs file, found in various locations on different systems:
This file defines queues whose names consist of a single letter (either case is valid). Conventionally, queue a is used for at, queue b is used for batch, and on many newer systems, queue c is used by cron. Tru64 and AIX define queues e and f for at jobs using the Korn shell and C shell, respectively (submitted using the at command's -k and -c options).
Queues are defined by lines in this format:
q.xjynzw
q is a letter, x indicates the number of simultaneous jobs that may run from that queue, y specifies the nice value for processes started from that queue, and z says how long to wait before trying to start a new job when the maximum number of jobs for that queue or for the facility as a whole is already running. The default values are 100 jobs, a nice value of 2 (where 0 is the default nice number), and 60 seconds.
The first two of the following queuedefs entries show typical definitions for the at and batch queues. The third entry defines the h queue, which can run one or two simultaneous jobs, niced to level 10, and waits for five minutes between job initiation attempts after starting one has failed:
a.4j1n b.2j2n90w h.2j10n300w
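Decoding entries like these is mechanical. Here is a small Python parser for the format, applying the defaults mentioned above (100 jobs, nice 2, 60-second wait); it is an illustration of the syntax, not part of any standard toolset:

```python
import re

def parse_queuedefs(entry):
    """Split a queuedefs entry like 'b.2j2n90w' into its fields."""
    queue, spec = entry.split(".")
    fields = {letter: int(num)
              for num, letter in re.findall(r"(\d+)([jnw])", spec)}
    return {"queue": queue,
            "njobs": fields.get("j", 100),   # simultaneous jobs
            "nice":  fields.get("n", 2),     # nice value for queue jobs
            "wait":  fields.get("w", 60)}    # retry interval in seconds

parse_queuedefs("a.4j1n")       # → njobs=4, nice=1, wait=60 (default)
parse_queuedefs("h.2j10n300w")  # → njobs=2, nice=10, wait=300 (five minutes)
```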
The desired queue is selected with the -q option to the at command. Jobs waiting in the facility's queues may be listed and removed from a queue using the -l and -r options, respectively.
If simple batch-processing facilities like these are sufficient for your system's needs, at and batch may be of some use, but if any sort of queue priority features are required, these commands will probably prove insufficient. The manual page for at found on many Linux systems is the most honest about these deficiencies, admitting that at and batch as presently implemented are not suitable when users are competing for resources.
A true batch system supports multiple queues; queues that receive jobs from and send jobs to a configurable set of network hosts, including the ability to select hosts based on load-leveling criteria; administrator-set in-queue priorities (for ordering pending jobs within a queue); queue execution priorities and resource limits (the priority and limits automatically assigned to jobs started from that queue); queue permissions (which users can submit jobs to each queue); and other parameters on a queue-by-queue basis.

AIX has adapted its print-spooling subsystem to provide a very simple batch system (see Section 13.3), allowing for different job priorities within a queue and multiple batch queues, but it is still missing most important features of a modern batch system. Some vendors offer batch-processing features as an optional facility at additional cost.
There are also a variety of open source queueing systems, including: