6.10. Resource Management and Observability

Most of the discussion in this chapter has described the ways in which zones can be used to isolate applications in terms of configuration, namespace, security, and administration. Another important aspect of isolation is ensuring that each application receives an appropriate share of the system resources: CPU, memory, and swap space. Without such a capability, one application can, intentionally or unintentionally, starve other applications of resources. There may also be reasons to prioritize some applications over others, or to adjust resource allocations in response to changing conditions. For example, a financial company might want to give a stock-trading application high priority while the trading floor is open, even if it means taking resources away from an application analyzing overall market trends.

The zones facility is tightly integrated with the existing resource management controls in Solaris. These controls come in three flavors: entitlements, which ensure a minimum level of service; limits, which bound resource consumption; and partitions, which dedicate physical resources exclusively to specific consumers. Each of these types of controls can be applied to zones. For example, the fair-share CPU scheduler can be configured to guarantee a zone a certain share of CPU capacity. In addition, an administrator within a zone can configure CPU shares for individual applications running within that zone; those shares determine how the zone's CPU allocation is carved up. Likewise, resource limits can be established either per zone (limiting the consumption of the entire zone) or at a finer granularity (individual applications or users within the zone). In each case, the global zone administrator configures per-zone resource controls and limits, while the administrator of a particular non-global zone configures resource controls within that zone.
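As a minimal sketch of how a global administrator might grant a per-zone CPU entitlement (the zone name myzone and the share count of 20 are illustrative, not from the text), a zone-wide resource control can be added to the zone's configuration:

```
# zonecfg -z myzone
zonecfg:myzone> add rctl
zonecfg:myzone:rctl> set name=zone.cpu-shares
zonecfg:myzone:rctl> add value (priv=privileged,limit=20,action=none)
zonecfg:myzone:rctl> end
zonecfg:myzone> commit
zonecfg:myzone> exit
```

The new value takes effect the next time the zone boots; the shares are meaningful only when the fair-share scheduler is in use.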
Figure 6.3 shows how the fair-share CPU scheduler can divide CPU resources among zones. In the figure, the system is divided into four zones, each of which is assigned a certain number of CPU shares. If all four zones contain processes that are actively using the CPU, then the CPU is divided according to the shares; that is, the red zone receives 1/7 of the CPU (since a total of 7 shares is outstanding), the neutral zone receives 2/7, and so on. In addition, the lisa zone has been further subdivided into five projects, each of which represents a workload running within that zone. The 2/7 of the CPU assigned to the lisa zone (based on the per-zone shares) is further subdivided among the projects within that zone according to the specified shares.

Figure 6.3. Zones and the Fair-Share Scheduler
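The two-level division can be sketched numerically. Assuming, for illustration, that each of lisa's five projects holds one share (the text does not give per-project share counts), a single busy project's overall entitlement under full contention is (2/7) x (1/5):

```shell
# Two-level FSS entitlement: zone shares are applied first, then project
# shares within the zone. lisa holds 2 of 7 outstanding zone shares (from
# Figure 6.3); the equal per-project split below is an assumption.
awk 'BEGIN {
    zone_shares = 2; total_zone_shares = 7;    # lisa vs. all zones
    proj_shares = 1; total_proj_shares = 5;    # one project vs. all in lisa
    frac = (zone_shares / total_zone_shares) * (proj_shares / total_proj_shares);
    printf "project CPU entitlement under full contention: %.1f%%\n", frac * 100;
}'
```

This prints an entitlement of 5.7%, i.e., 2/35 of the machine; the same arithmetic applies to any zone/project share combination.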
Resource partitioning is supported through a mechanism called resource pools, which allows an administrator to specify a collection of resources to be used exclusively by some set of processes. Although CPUs are the only resource initially supported, the facility is planned to encompass other system resources, such as physical memory and swap space, later. A zone can be "bound" to a resource pool, which means that the zone runs only on the resources associated with the pool. Unlike the resource entitlements and limits described above, this approach allows applications in different zones to be completely isolated in terms of resource usage; the activity within one zone has no effect on other zones. This isolation is furthered by restricting resource visibility: applications or users running within a zone bound to a pool see only the resources associated with that pool. For example, a command that lists the processors on the system lists only the ones belonging to the pool to which the zone is bound. Note that the mapping of zones to pools can be one-to-one or many-to-one; in the latter case, multiple zones share the resources of the pool, and features like the fair-share scheduler can control the manner in which they are shared. Figure 6.4 shows the use of the resource pool facility to partition CPUs among zones. Note that processes in the global zone can actually be bound to more than one pool; as a special case, this allows resource pools to partition workloads even without zones. Non-global zones, however, can be bound to only one pool (that is, all processes within a non-global zone must be bound to the same pool).

Figure 6.4. Zones and Resource Pools
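As a hedged sketch of this partitioning (the pool name web-pool, processor-set name web-pset, CPU counts, and zone name are all illustrative), a global zone administrator might dedicate two CPUs to a zone:

```
# pooladm -e                         (enable the pools facility)
# poolcfg -c 'create pset web-pset (uint pset.min = 2; uint pset.max = 2)'
# poolcfg -c 'create pool web-pool'
# poolcfg -c 'associate pool web-pool (pset web-pset)'
# pooladm -c                         (commit the configuration to the kernel)
# zonecfg -z myzone set pool=web-pool
```

Once the zone is booted with this configuration, its processes run only on the two CPUs in web-pset, and processor-listing commands inside the zone show only those CPUs.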
6.10.1. Performance

One of the advantages that operating-system-level virtualization technologies such as zones hold over traditional virtual machine implementations is minimal performance overhead. To substantiate this, we measured the performance of a variety of workloads running in a non-global zone and compared it with that of the same workloads running without zones (or in the global zone). This data is shown in Table 6.3 (in each case, higher numbers represent a faster run).
The final column shows the percentage degradation (or improvement) of the zone run versus the run in the global zone. As can be seen, the impact of running an application in a zone is minimal. The 4% degradation in the time-sharing workload is primarily due to the overhead of accessing commands and libraries through lofs. We also measured the performance of running multiple applications on the system at the same time in different zones, partitioning CPUs with either resource pools or the fair-share scheduler. In each case, the performance with zones was equivalent to, and in some cases better than, the performance when each application ran on a separate system.

Since all zones on a system are part of the same operating system instance, processes in different zones can actually share virtual memory pages. This is particularly true for text pages, which are rarely modified. For example, although each zone has its own init process, all of those processes can share a single copy of the text for the executable, its libraries, and so on. This can result in substantial memory savings for commonly used executables and libraries such as libc. Similarly, other parts of the operating system infrastructure, such as the directory name lookup cache (DNLC), can be shared among zones to minimize overhead.

6.10.2. Solaris Resource Management Interactions

The resource management features in Solaris and their interactions with zones are summarized below.

6.10.2.1. Accounting

The traditional accounting system uses a fixed record size and cannot be extended to differentiate between a process running in the global zone and one in a non-global zone. We have modified the system so that accounting records generated in any zone (including the global zone) contain only records pertinent to the zone in which the process executed.
The extended accounting subsystem is virtualized to permit different accounting settings and files on a per-zone basis for process- and task-based accounting. Since exacct records are extensible, they can be tagged with a zone name (EXD_PROC_ZONENAME and EXD_TASK_ZONENAME, for processes and tasks, respectively), allowing the global administrator to determine resource consumption per zone. Accounting records are written to the global zone's accounting files as well as to their per-zone counterparts. The EXD_TASK_HOSTNAME, EXD_PROC_HOSTNAME, and EXD_HOSTNAME records contain the uname -n value for the zone in which the process/task executed,[7] rather than the global zone's node name.
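As a brief sketch of per-zone accounting settings (the output file path shown is conventional, not required), extended process accounting could be enabled from within a zone with acctadm(1M):

```
# acctadm -e extended -f /var/adm/exacct/proc process
# acctadm process          (display the current process-accounting state)
```

Because the subsystem is virtualized, this setting affects only the zone in which it is issued; the global zone's accounting configuration is independent.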
6.10.2.2. Projects, Resource Controls, and the Fair-Share Scheduler

Projects are abstracted such that different zones may use separate project(4) databases, each with its own set of defined projects and resource controls. Projects running in different zones with the same project ID are considered distinct by the kernel, eliminating the possibility of cross-zone interference and allowing projects in different zones (with the same project ID) to have different resource controls. To prevent processes in a zone from monopolizing the system, zone-wide limits for applicable resources cap the total resource usage of all process entities within a zone (regardless of project). The global administrator can specify these limits in the zonecfg configuration file.[8] Privileged zone-wide rctls can be set only by the superuser in the global zone.
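As an illustrative sketch (the zone name and share value are assumed), the global zone's superuser could adjust such a privileged zone-wide rctl on a running zone with prctl(1), without editing the persistent zonecfg configuration:

```
# prctl -n zone.cpu-shares -v 30 -r -i zone myzone
# prctl -n zone.cpu-shares -i zone myzone      (verify the new value)
```

Attempting the same operation from within the non-global zone fails, since privileged zone-wide rctls are settable only from the global zone.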
Among the zone-wide rctls is zone.cpu-shares, the top-level number of FSS shares allocated to the zone. CPU shares are thus first allocated to zones and then further subdivided among the projects within each zone (based on the zone's project(4) database). In other words, project.cpu-shares is now relative to the zone's zone.cpu-shares. Projects in a zone multiplex the shares allocated to the zone; in this way, FSS share allocation with zones can be thought of as a two-level hierarchy.

6.10.2.3. Resource Pools

Resource pools are controlled by an attribute in the zone configuration, zone.pool, similar to project.pool. The zone as a whole is bound to this pool upon creation; the kernel enforces this binding and also limits the visibility of resources outside the pool. Non-global zones must be bound to resource pools in their entirety; that is, attempting to bind individual processes, tasks, or projects in a non-global zone to a resource pool will fail. This allows pools themselves to be virtualized: the pool device reports information only about the particular pool to which the zone is bound, and poolstat(1M) thus reveals information only about the resources associated with the zone's pool.

6.10.3. Kstats

The kstat framework is used extensively by applications that monitor the system's performance, including mpstat(1M), vmstat(1M), iostat(1M), and kstat(1M). Unlike on a traditional Solaris system, the values reported by a particular kstat may not be relevant (or accurate) in a non-global zone. When executed in a zone, and if the pools facility is active, mpstat(1M) provides information only for those processors that are members of the processor set of the pool to which the zone is bound. Commands in the vmstat and mpstat family otherwise show activity for the whole system; only when pools and processor sets are in use for the zones do these kstat-based commands provide some zone-specific information.
This is because kstat isn't zone aware yet, but it is CPU aware. Statistics fall into one of the following categories as far as zones are concerned:
The kstat framework has been extended with new interfaces to specify (at creation time) which of the above classes a particular statistic belongs to.

6.10.3.1. Zone Observability via prstat -Z

The -Z option of prstat provides a per-zone summary. If run from the global zone, a summary of all zones is visible.

$ prstat -Z
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
 21132 root     2952K 2692K cpu0    49    0   0:00:00 0.1% prstat/1
 21109 root     7856K 2052K sleep   59    0   0:00:00 0.0% sshd/1
  2179 root     4952K 2480K sleep   59    0   0:00:21 0.0% automountd/3
 21111 root     1200K  952K sleep   49    0   0:00:00 0.0% ksh/1
  2236 root     4852K 2368K sleep   59    0   0:00:06 0.0% automountd/3
  2028 root     4912K 2428K sleep   59    0   0:00:10 0.0% automountd/3
   118 root     3372K 2372K sleep   59    0   0:00:06 0.0% nscd/24
ZONEID    NPROC  SIZE   RSS MEMORY      TIME  CPU ZONE
     0       47  177M  104M    11%   0:00:31 0.1% global
     5       33  302M  244M    25%   0:01:12 0.0% gallery
     3       40  161M   91M   9.2%   0:00:40 0.0% nakos
     4       43  171M   94M   9.5%   0:00:44 0.0% mcdougallfamily
     2       30   96M   56M   5.6%   0:00:23 0.0% shared
     1       32  113M   60M   6.0%   0:00:45 0.0% packer
     7       43  203M   87M   8.7%   0:00:55 0.0% si
Total: 336 processes, 1202 lwps, load averages: 0.02, 0.01, 0.01

6.10.3.2. DTrace

DTrace has special variables that can provide zone context. In the following example, we can easily discover which zone is causing the most page faults to occur.
# dtrace -n 'vminfo:::as_fault {@[zonename]=count()}'
dtrace: description 'vminfo:::as_fault' matched 1 probe
^C
  global      4303
  lisa       29867

global# dtrace -qn 'proc:::exec-success { printf("%-16s %s\n", zonename, curpsinfo->pr_psargs); }'
global           zlogin myzone init 0
myzone           /usr/bin/su root -c init 0
myzone           sh -c init 0
myzone           init 0
myzone           sh -c /usr/sbin/audit -t
myzone           /usr/sbin/audit -t
myzone           /sbin/sh /sbin/rc0 stop
myzone           /usr/bin/who -r
myzone           /usr/bin/uname -a
myzone           /lib/svc/bin/lsvcrun /etc/rc0.d/K03samba stop
myzone           /bin/sh /etc/rc0.d/K03samba stop
myzone           pkill smbd
myzone           pkill nmbd
myzone           /lib/svc/bin/lsvcrun /etc/rc0.d/K05appserv stop
myzone           /bin/sh /etc/rc0.d/K05appserv stop
myzone           /lib/svc/bin/lsvcrun /etc/rc0.d/K05volmgt stop
myzone           /bin/sh /etc/rc0.d/K05volmgt stop
myzone           /sbin/zonename
...

For an example of how the zonename variable can be used, see zvmstat from the DTraceToolkit (see Section 6.3.3 in Solaris™ Performance and Tools).