7.5. Resource Controls

This section describes the extensible resource control framework, first introduced in Solaris 9. It leverages the project database introduced in Section 7.1, allowing the implementation of a network-wide static resource policy. Several new interfaces are introduced to enumerate, set, and get controls on processes, tasks, and projects; these interfaces could potentially be applied to other system entities. Additional interfaces simplify interaction with active projects.

We discuss the facility for resource controls, including the ability to control the number of LWPs in a task. The policy source for the various controls of a given process, task, or project shall reside in the project database introduced by Projects and Tasks, although a new API allows subsequent modification of those controls on the current OS instance. (The primordial policy source, in the absence of attribute definitions in the project database, is the set of operating system defaults.)

7.5.1. Introduction to Resource Controls

The resource controls framework permits applications to set the values of resource controls advertised by various kernel subsystems. Invalid control settings are ignored, with an error code returned. The controls framework, called rctl, is designed to be flexible. It is applied here to the existing process rlimits. It can also encapsulate the apparently defunct POSIX draft standard for process and session limits, should that draft be reactivated. (As the task is a Solaris-specific extension, no existing task limit standards are known or anticipated.)

Initially, a small set of resource controls is implemented:

CPU seconds per task
number of LWPs per task

The existing process limits are made available through the rctl interface:

CPU seconds per process
Maximum file size
Data size for this process
Stack size for this process
Core file size for this process
Number of file descriptors per process
Maximum mapped memory per process

Many other Solaris facilities leverage the resource control infrastructure, including these:

The fair-share scheduler (FSS), which also uses a set of project-based resource controls associating some number of CPU shares with a project
The System V message parameters
Event ports
The Cryptographic framework

7.5.2. What Is an `rctl`?

The resource control framework generalizes the rlimit-process relationship to a resource control-entity relationship. In this sense, a resource control is some amount of information associated with the entity pertinent to resource management operations. The kernel subsystem publishing the resource control can associate default actions to be taken at various thresholds of the resource usage; these actions can be modified or new actions (on new thresholds) can be introduced by the user process or by the administrator.

Resource controls are an extension of the basic rlimit concept. Infinite values are supported by a separate flag bit and do not contaminate the numeric values available for a resource. The hard/soft limit approach is also extended, to distinguish an administratively established maximum from an absolute constraint imposed by the operating system. And, as noted earlier, resource controls could potentially be applied to abstractions other than the process; in this document, resource limits are attached to tasks and projects.

Figure 7.2. Process Collectives and Resource Control Sets

Resource controls are flexible, in that controls for which the subsystem associates no actions act as subsystem-specific attributes. The subsystem has then associated one or more named integers with the entity and offers no operations other than set or get operations on the value of the integer. An example resource control used primarily as an attribute would be the number of CPU shares associated with a project, or a priority associated with a task (for arbitration between two tasks contesting over a finite resource). Because of implementation restrictions, certain resource controls may not support all possible actions offered by the framework. In particular, the legacy rlimits may have restrictions on available actions. These limitations are usually expressed through the global flag values on a resource control.

Multiple resource controls can exist on a resource, one at each container level in the process model. For instance, CPU time limits are enforced on both the process and the task. The general statement of enforcement, in the case that resource controls on the same resource would be active with a resource allocation, is that the smallest container's control is enforced first. Thus, process.max-cpu-time's action would be taken before task.max-cpu-time's action if they were simultaneously encountered. (In the case of signals, this behavior may not be realized at the recipient process if the signal's implementation reorders the delivery of the resultant signals.)
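
The ordering rule above can be sketched in portable C. This is only a model under stated assumptions: the `entity_t`, `tripped_t`, and `order_actions` names are hypothetical and are not part of the rctl interfaces. The sketch sorts the controls that tripped on one allocation so that the smallest container comes first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Container levels from smallest to largest (hypothetical model names). */
typedef enum { ENT_PROCESS, ENT_TASK, ENT_PROJECT } entity_t;

/* One control that tripped during a single resource allocation. */
typedef struct {
    entity_t    entity;
    const char *name;
} tripped_t;

/* qsort comparator: smaller containers sort first. */
static int by_container(const void *a, const void *b)
{
    const tripped_t *x = a;
    const tripped_t *y = b;
    return (int)x->entity - (int)y->entity;
}

/*
 * Order the tripped controls so that the smallest container's action
 * is taken first. Controls on the same level keep no particular order.
 */
void order_actions(tripped_t *t, size_t n)
{
    if (n == 0)
        return;
    assert(t != NULL);
    qsort(t, n, sizeof *t, by_container);
}
```

Given an array holding task.max-cpu-time followed by process.max-cpu-time, `order_actions` moves the process control to the front, matching the example in the text.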

7.5.3. Numeric Values of Resource Controls

Although resource controls represent an opportunity to associate arbitrary data with one or more kernel entities, we strongly restrict that to 64-bit unsigned integer data (in the form of an rctl_qty_t). In particular, we exclude floating-point data out of hand and discourage use of percentages as values within the resource management framework. Infinite values (those not being enforced) are marked by a flag bit, keeping the infinity concept separate from the range of valid values.

7.5.4. Resource Control Definitions

The resource controls facility provides a reasonably rich interface to describe controlled quantities. Implementation restrictions prevent all controls from having identical properties and, as a result, we require descriptive constants to distinguish the various restrictions and properties on controls and control values. Table 7.3 summarizes the privilege levels; more complete specifications are given in the manual pages and in relevant parts of the technical discussion. The remaining constants defined for the resource control facility are summarized in Tables 7.4 through 7.6.

Table 7.3. Privilege Levels of Resource Control Values
Privilege Name	Description
`RCPRIV_BASIC`	Can be modified by the owner of the calling process.
`RCPRIV_PRIVILEGED`	Value requires privilege to modify; value can be lowered by nonprivileged callers if the global flag `RCTL_GLOBAL_LOWERABLE` is set.
`RCPRIV_SYSTEM`	Value cannot be modified.

Table 7.4. Operational Flags for `setrctl(2)` and `getrctl(2)`
Operational Flag	Description
`RCTL_DELETE`	Passed to `setrctl(2)` to remove given resource control value.
`RCTL_FIRST`	Passed to `getrctl(2)` to retrieve first defined value on given resource control.
`RCTL_INSERT`	Passed to `setrctl(2)` to insert given resource control value in value sequence.
`RCTL_NEXT`	Passed to `getrctl(2)` to retrieve next defined value in value sequence following the given resource control value.
`RCTL_REPLACE`	Passed to `setrctl(2)` to replace the first given resource value with the second given resource control value.
`RCTL_USAGE`	Passed to `getrctl(2)` to get current usage of the specific resource control by the calling process.
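
The value-sequence operations in Table 7.4 can be modeled with a small portable sketch. It does not call `setrctl(2)` or `getrctl(2)`; the `seq_t` type and the `seq_*` functions are hypothetical, and keeping the sequence in ascending order is an assumption of the model rather than something stated in the table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SEQ_MAX 8                   /* arbitrary capacity for the model */

typedef struct {
    uint64_t vals[SEQ_MAX];         /* kept in ascending order */
    size_t   n;
} seq_t;

/* Models RCTL_INSERT: place v in order. Returns 0, or -1 if full. */
int seq_insert(seq_t *s, uint64_t v)
{
    assert(s != NULL);
    if (s->n == SEQ_MAX)
        return -1;
    size_t i = s->n;
    while (i > 0 && s->vals[i - 1] > v) {
        s->vals[i] = s->vals[i - 1];
        i--;
    }
    s->vals[i] = v;
    s->n++;
    return 0;
}

/* Models RCTL_DELETE: remove the first value equal to v. */
int seq_delete(seq_t *s, uint64_t v)
{
    assert(s != NULL);
    for (size_t i = 0; i < s->n; i++) {
        if (s->vals[i] == v) {
            for (size_t j = i + 1; j < s->n; j++)
                s->vals[j - 1] = s->vals[j];
            s->n--;
            return 0;
        }
    }
    return -1;
}

/* Models RCTL_FIRST: fetch the first value in the sequence. */
int seq_first(const seq_t *s, uint64_t *out)
{
    assert(s != NULL && out != NULL);
    if (s->n == 0)
        return -1;
    *out = s->vals[0];
    return 0;
}

/* Models RCTL_NEXT: fetch the value following `cur`, -1 at the end. */
int seq_next(const seq_t *s, uint64_t cur, uint64_t *out)
{
    assert(s != NULL && out != NULL);
    for (size_t i = 0; i + 1 < s->n; i++) {
        if (s->vals[i] == cur) {
            *out = s->vals[i + 1];
            return 0;
        }
    }
    return -1;
}
```

A caller walks the whole sequence by calling `seq_first` once and then `seq_next` until it returns -1, which mirrors the RCTL_FIRST and RCTL_NEXT pattern described in the table.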

Table 7.5. Global Resource Control Properties and Actions
Global Resource Control	Description
`RCTL_GLOBAL_DENY_ALWAYS`	The action taken when a control value is exceeded on this control will always include denial of the resource.
`RCTL_GLOBAL_DENY_NEVER`	The action taken when a control value is exceeded on this control will always exclude denial of the resource; the resource will always be granted, although other actions may also be taken.
`RCTL_GLOBAL_CPU_TIME`	The valid signals available as local actions include the `SIGXCPU` signal.
`RCTL_GLOBAL_FILE_SIZE`	The valid signals available as local actions include the `SIGXFSZ` signal.
`RCTL_GLOBAL_INFINITE`	This resource control supports the concept of an unlimited value; generally true only of accumulation-oriented resources such as CPU time.
`RCTL_GLOBAL_LOWERABLE`	Nonprivileged callers are able to lower the value of privileged resource control values on this control.
`RCTL_GLOBAL_NOACTION`	No global action will be taken when a resource control value is exceeded on this control.
`RCTL_GLOBAL_NOBASIC`	No values with the `RCPRIV_BASIC` privilege are permitted on this control.
`RCTL_GLOBAL_NOLOCALACTION`	No local actions (deny or signals, presently) are permitted on this control.
`RCTL_GLOBAL_SYSLOG`	The defined message will be logged by the `syslog` facility when any resource control value on a sequence associated with this control is exceeded.
`RCTL_GLOBAL_UNOBSERVABLE`	The resource control (generally on a task- or project-related control) does not support observational control values: an `RCPRIV_BASIC` privileged control value placed by a process on the task or process will generate an action only if the value is exceeded by that process.

Table 7.6. Local (Value-Specific) Resource Control Properties and Actions
Local Resource Control	Description
`RCTL_LOCAL_DENY`	When this resource control value is encountered, the request for the resource will be denied. Set on all values if `RCTL_GLOBAL_DENY_ALWAYS` is set for this control; cleared on all values if `RCTL_GLOBAL_DENY_NEVER` is set for this control.
`RCTL_LOCAL_MAXIMAL`	This resource control value represents a request for the maximal amount of resource for this control; in the case that `RCTL_GLOBAL_INFINITE` is set for this resource control, then `RCTL_LOCAL_MAXIMAL` indicates an unlimited resource control value, one that will never be exceeded.
`RCTL_LOCAL_NOACTION`	No local action will be taken when this resource control value is exceeded.
`RCTL_LOCAL_SIGNAL`	The specified signal, set with `rctlblk_set_local_action(3C)`, will be sent to the process that placed this resource control value in the value sequence.
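
The interaction between the global deny flags in Table 7.5 and `RCTL_LOCAL_DENY` can be sketched as follows. The `G_*` and `L_*` constants are hypothetical bit values that only stand in for the real flags, and `effective_local_flags` is an illustrative helper, not a framework function.

```c
#include <assert.h>

/* Hypothetical bit values; the real constants live in the rctl headers. */
#define G_DENY_ALWAYS 0x1u          /* models RCTL_GLOBAL_DENY_ALWAYS */
#define G_DENY_NEVER  0x2u          /* models RCTL_GLOBAL_DENY_NEVER */
#define L_DENY        0x1u          /* models RCTL_LOCAL_DENY */
#define L_SIGNAL      0x2u          /* models RCTL_LOCAL_SIGNAL */

/*
 * Apply the global deny policy to one value's local flags:
 * DENY_ALWAYS forces deny on, DENY_NEVER forces it off.
 */
unsigned effective_local_flags(unsigned global, unsigned local)
{
    /* The two global deny flags are treated as mutually exclusive here. */
    assert(!((global & G_DENY_ALWAYS) && (global & G_DENY_NEVER)));
    if (global & G_DENY_ALWAYS)
        local |= L_DENY;
    if (global & G_DENY_NEVER)
        local &= ~L_DENY;
    return local;
}
```

Other local flags, such as the signal action, pass through unchanged; only the deny bit is governed by the global property.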

7.5.5. Policy

The resource controls facility enables one aspect of simple resource management policy: static controls on process model abstractions. Furthermore, it is a natural mechanism for placing importance or priority attributes on these abstractions; these attributes can then be used to implement more dynamic resource management policies (which can usually be viewed as scheduling algorithms). These forms of policies can be contrasted with those involving system-level abstractions, such as processor set sizes or interrupt binding assignments (which can also be made statically or dynamically). Aspects of this facility help provide an interface between the process model abstractions and those of the system.

The policy simplification that this framework affords is that administrative concentration of some resource management constraints can now be seated in a networkwide name service. Furthermore, the rctl facility enables both systemwide and entity-specific monitoring of events related to increasing resource usage, which is meant to assist the administrator in estimating capacity requirements and workload sizes.

As with the CPU second variance noted in Section 7.5.3, machine-specific attributes probably should not be resource controls nor should they appear in a name-service database akin to the project, since these databases provide replies indiscriminate of the machine making the request. There is a further need for a name service capable of distinct replies to distinguishable machines; the grammar for the project attributes does not preclude a node-specific enhancement if a standard name service of this kind does not emerge.

7.5.6. Consequences of Exceeding an `rctl`

The use of a given resource by a given entity can be evaluated at the time of a request for additional units of that resource (a synchronous test) or at an arbitrary time (an asynchronous test). In the case of a synchronous test, the resource request can be denied or permitted; in both cases, a signal can be sent to the violating process (or a process that is a member of the violating entity) or a system message can be logged.

That is, a resource control violation can result in the following actions:

A signal sent to the triggering or monitoring process, set with the setrctl(2) system call
A message logged through syslogd(1M), activated or deactivated with the rctladm(1M) command

A self-monitoring process can use the delivered signal to trigger garbage collection, or some other release of resources, and then reattempt the potentially failed operation as necessary. External daemons can then monitor syslog output or express explicit interest in individual tasks.

A control with the RCTL_LOCAL_DENY flag set, in addition to any action taken, will refuse the resource to the requesting process when the implementation permits it. (For instance, in the current implementation, a CPU second cannot be denied, in that its use is evaluated after the period of its grant.) Thus, a resource control value with RCTL_LOCAL_DENY set will be activated each time a request for the controlled resource is made.

Although local and global actions are currently simply signals or syslog messages, respectively, as the system event interfaces mature, we expect to support standard system events on both a local and global basis for exceeded rctls. System events, with their well-defined structure, forthcoming subscription model, and support for multiple channels, will provide a much richer event vocabulary to management applications using resource controls for workload monitoring.

7.5.7. Signal and `siginfo` Semantics for Exceeded Controls

Many of the signals in the UNIX environment have precise semantics; we cannot offer a framework that allows any signal to be sent on the occasion of a resource control being exceeded. Certain signals are reserved very specifically to one behavior; the behavior associated with others is more open to interpretation. We simplify by allowing only the signals listed below.

SIGABRT
SIGHUP
SIGSTOP
SIGTERM
SIGKILL

The new signal SIGXRES is used specifically within the resource control framework, with the clear semantic that "a resource control value has been exceeded." Additional information will be available from the siginfo code and accompanying structure, as discussed below.

Two global flags are defined to enable certain controls to use the historical resource limit signals SIGXFSZ and SIGXCPU. In particular, task.max-cpu-time and process.max-cpu-time will have RCTL_GLOBAL_CPU_TIME set, while process.max-file-size will have the global flag RCTL_GLOBAL_FILE_SIZE set.
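
The signal restrictions in this section can be summarized by a small predicate. This is a sketch only: `signal_allowed` and the `G_*` bit values are hypothetical, and `SIGXRES` is guarded by a preprocessor test because it exists only on systems that define it.

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>

#define G_CPU_TIME  0x1u            /* models RCTL_GLOBAL_CPU_TIME */
#define G_FILE_SIZE 0x2u            /* models RCTL_GLOBAL_FILE_SIZE */

/* Return true when `signo` may be used as a local action on a control. */
bool signal_allowed(unsigned global, int signo)
{
    assert(signo > 0);
    switch (signo) {
    case SIGABRT:
    case SIGHUP:
    case SIGSTOP:
    case SIGTERM:
    case SIGKILL:
#ifdef SIGXRES
    case SIGXRES:                   /* resource-control signal, where defined */
#endif
        return true;
    case SIGXCPU:                   /* only on CPU-time controls */
        return (global & G_CPU_TIME) != 0;
    case SIGXFSZ:                   /* only on file-size controls */
        return (global & G_FILE_SIZE) != 0;
    default:
        return false;
    }
}
```

Under this model, a control such as process.max-file-size would pass `G_FILE_SIZE` and so accept `SIGXFSZ`, while rejecting `SIGXCPU`.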

The second issue is that we cannot easily distinguish between two resource control values sending identical signals. We provide the rctlblk_get_firing_time(3C) function, which allows the ultimate discrimination of all values across the resource controls associated with the process and those of the task and project of which the process is a member. However, we can simplify this determination considerably if we can identify whether the triggering value was associated with the process or with its containers. For this purpose, a new siginfo code, SI_RCTL, and a new union within the siginfo make this identification directly. A member of this union, si_entity, contains the container type holding this resource control as an rctl_entity_t, as defined below.

    typedef enum {
            RCENTITY_PROCESS,
            RCENTITY_TASK,
            RCENTITY_PROJECT,
            RCENTITY_ZONE
    } rctl_entity_t;

With the firing time identifying the most recent tripped value, the received signal, and the entity type, resolution of which exact control and control value were triggered is straightforward.
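
The resolution step can be sketched as a filter followed by a maximum. The enum repeats the definition given above; `candidate_t` and `resolve_fired` are hypothetical names, the `firing_time` field stands in for the result of rctlblk_get_firing_time(3C), and treating 0 as "never fired" is an assumption of this model.

```c
#include <assert.h>
#include <signal.h>
#include <stddef.h>

typedef enum {
    RCENTITY_PROCESS,
    RCENTITY_TASK,
    RCENTITY_PROJECT,
    RCENTITY_ZONE
} rctl_entity_t;

/* One control value the handler knows about (hypothetical record). */
typedef struct {
    const char   *name;
    rctl_entity_t entity;
    int           signo;            /* signal configured as local action */
    long long     firing_time;      /* 0 means never fired (model only) */
} candidate_t;

/*
 * Among the values matching the reported entity type and the received
 * signal, return the one that fired most recently, or NULL if none did.
 */
const candidate_t *resolve_fired(const candidate_t *c, size_t n,
                                 rctl_entity_t entity, int signo)
{
    const candidate_t *best = NULL;
    assert(c != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (c[i].entity != entity || c[i].signo != signo)
            continue;
        if (c[i].firing_time == 0)
            continue;
        if (best == NULL || c[i].firing_time > best->firing_time)
            best = &c[i];
    }
    return best;
}
```

A signal handler would pass the entity type taken from si_entity and the signal number it received; the entity filter is what narrows the search before firing times are compared.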

7.5.8. Generalizing Hard and Soft Limits

The hard/soft rlimit interface implies a simple capability model on rlimits: a non-privileged user can modify the soft limit up to the value defined by the hard limit and can lower the hard limit. Only a root-privileged user can raise the hard limit.

The capability matrix defined by rlimits is not complete: not all hard limit settings are actually sustainable by the operating system. (For instance, the current file descriptor limit cannot be set to INT64_MAX.) Thus, one further extension, to allow the description of an operating system limit, is needed. This framework includes three classes of limit:

Basic. Writable by an unprivileged application
Privileged. Writable only by a privileged application
System. Established by the operating system instance and not writable

The quantity for a given system value can be a calculated value for the given system size, but is more likely to be a theoretical maximum from the underlying implementation or the defining type. System limits cannot be changed. Because multiple basic limits can be placed on rctls associated with process collectives (by monitoring processes), the only ordering statement that can be made is that all limits on an rctl must be less than the system limit on that rctl.
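
The three classes and the ordering statement can be sketched as two small checks. The `priv_t` enum and both function names are hypothetical. The range check follows the wording above literally ("less than"); whether a value equal to the system limit is accepted is not stated here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Models the basic, privileged, and system classes of limit. */
typedef enum { PRIV_BASIC, PRIV_PRIVILEGED, PRIV_SYSTEM } priv_t;

/* May a caller write a value of the given class? */
bool may_write(priv_t level, bool caller_privileged)
{
    assert(level == PRIV_BASIC || level == PRIV_PRIVILEGED ||
           level == PRIV_SYSTEM);
    switch (level) {
    case PRIV_BASIC:
        return true;                /* writable by unprivileged callers */
    case PRIV_PRIVILEGED:
        return caller_privileged;   /* writable only with privilege */
    default:
        return false;               /* system values are never writable */
    }
}

/* The only ordering guarantee: each limit lies below the system limit. */
bool below_system_limit(uint64_t value, uint64_t system_limit)
{
    return value < system_limit;
}
```

No ordering is checked between basic and privileged values, since the text makes no such guarantee once monitoring processes place their own basic limits.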

7.5.9. Resource Controls and the Task

Resource controls provide enforceable limits on each task. These task limits are defined in the project database, although they can also be set explicitly. In general, task limits are available for fewer attributes than process limits, since not all process attributes are sensible over the entire task. (Limiting aggregate stack size, for instance, does not seem particularly useful.) The enforced limit for each resource is the smallest of the set of rctl values not yet encountered on that resource, although signal delivery is only to the observing process in the case of controls on collectives.

The current behavior of resource control values being exceeded on the task or on any process collective is that the resource control values on the task of type RCPRIV_PRIVILEGED or RCPRIV_SYSTEM deliver their local action to the violating process, whereas RCPRIV_BASIC type values deliver their local action to the process placing the resource control value on the task.
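
The delivery rule just described reduces to a single branch. In this sketch the `priv_t` enum and `action_recipient` are hypothetical names, and process IDs are plain `long` values rather than `pid_t` to keep the example portable.

```c
#include <assert.h>

/* Models RCPRIV_BASIC, RCPRIV_PRIVILEGED, and RCPRIV_SYSTEM. */
typedef enum { PRIV_BASIC, PRIV_PRIVILEGED, PRIV_SYSTEM } priv_t;

/*
 * Choose which process receives the local action of a value placed on
 * a task (or other collective) when that value is exceeded.
 */
long action_recipient(priv_t level, long violating_pid, long placing_pid)
{
    assert(violating_pid > 0 && placing_pid > 0);
    if (level == PRIV_BASIC)
        return placing_pid;         /* the observer that placed the value */
    return violating_pid;           /* privileged and system values */
}
```

This is what lets an unprivileged monitoring process observe a task: its basic value signals the monitor itself rather than the process that consumed the resource.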

7.5.10. Visibility through `/proc`; Privileges and Ownership

The principle behind the representation of the process model via /proc is that the entirety of the process state is retrievable by a debugger. This principle can be satisfied for the process's local state, such as heap data or library static data, by accessing the appropriate parts of the process's address space in the as file. For kernel state, you can obtain the state visible to the process by forcing the victim process to execute the appropriate system call or calls through the agent LWP. With these two mechanisms, /proc and libproc present the entire environment of the process to a debugger.

The decision to provide a file entry in /proc is generally made for performance reasons: the series of calls into libproc to make the victim yield the desired information is measured to be too expensive for a specific, frequent operation.

Thus, and analogous to rlimits, the resource controls do not explicitly appear in a separate file in /proc. Furthermore, since we are trying to keep the properties of the resource control abstract, a /proc realization would present a conflicting technical requirement (in that the file would contain structural representations).

Three privilege levels are available for resource controls: RCPRIV_BASIC, RCPRIV_PRIVILEGED, and RCPRIV_SYSTEM. RCPRIV_BASIC, the properties of which are modifiable by any calling process, is analogous to the soft limit of resource limits. The hard limit is encapsulated by the RCPRIV_PRIVILEGED, which requires root privilege to insert, delete, or modify. The final privilege level for a resource control is RCPRIV_SYSTEM, which is not changeable by any process, privileged or otherwise.
