Virtualized Subsystems


I plan to take advantage of UML's virtualization in one more way: to use it to provide customizable levels of confinement for processes. For example, you may wish to control the CPU consumption of a set of processes without affecting their access to other machine resources. Or you may wish to make a container for some processes that restricts their memory usage and filesystem access but lets them consume as much CPU time as they like.

I'm going to use UML to implement such a flexible container system by breaking it into separate subsystems, such as the scheduler and the virtual memory system. Since UML is a virtual kernel, breaking it into separate pieces yields virtual subsystems. These can run on or within their nonvirtual counterparts, and in fact they require a nonvirtual counterpart to host them in order to function at all.

For example, my virtualized scheduler runs a guest scheduler inside a host process. The host process provides the context needed for the guest scheduler to function. The guest scheduler requires some CPU cycles to allocate among the processes in its care. These cycles come from its host process, which is competing with other processes for CPU cycles from the host scheduler.

Similarly, a guest virtual memory system would allocate memory to its processes from a pool of memory provided by the host virtual memory system.

You would construct a container by building a set of virtualized subsystems, such as the scheduler and virtual memory system, loading it into the host kernel, and then loading processes into it. Those processes, in the scheduler and virtual memory system case, would get their CPU cycles from the guest scheduler and their memory from the guest virtual memory system. In turn, these would have some allocation of cycles and memory from the host.

Let's take the guest scheduler as an example since it has been implemented. A new guest scheduler is created by a process opening a special file in /proc:

host% cat /proc/schedulers/guest_o1_scheduler &
Created sched_group 290 ('guest_o1_scheduler')


The contents of /proc/schedulers are the schedulers available for use. In this case, there is only one, guest_o1_scheduler. This creates a sched_group, which is the set of processes controlled by this scheduler. When the system boots, all processes are in sched_group 0, which is the host scheduler. sched_group 290 is the group controlled by the cat process we ran, which had process ID 290.
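To make the /proc layout concrete, here is roughly what listing these directories would show at this point, assuming the prototype presents each available scheduler as an entry in /proc/schedulers and each sched_group as a directory under /proc/sched-groups (the output is illustrative rather than captured from a real run):

host% ls /proc/schedulers
guest_o1_scheduler
host% ls /proc/sched-groups
0  290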

Once we have a guest scheduler, the next step is to give it some processes to control. This is done by literally moving processes from sched_group 0 to sched_group 290. Let's create three infinite shell loops and move two of them into the new scheduler:

host% bash -c 'while true; do true; done' &
[2] 292
host% bash -c 'while true; do true; done' &
[3] 293
host% bash -c 'while true; do true; done' &
[4] 294
host% mv /proc/sched-groups/0/293 /proc/sched-groups/290/
host% mv /proc/sched-groups/0/294 /proc/sched-groups/290/
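If group membership really is exposed as directory entries, as the mv commands above imply, the result of the moves could be checked with another listing (hypothetical output):

host% ls /proc/sched-groups/290/
293  294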


Now process 290, which is the host representative of the guest scheduler, is competing with the other host processes, including the busy loop with process ID 292, for CPU time. Since those are the only two active processes on the host scheduler, they will each get half of the CPU. The guest scheduler, inside process 290, is going to take its half of the CPU and split it between the two processes under its control. Thus, processes 293 and 294 will each get half of that, or a quarter of the CPU each:

host% ps uax
...
root       292 49.1  0.7  2324  996 tty0  R  21:51  14:40 bash -c
root       293 24.7  0.7  2324  996 tty0  R  21:51   7:23 bash -c
root       294 24.7  0.7  2324  996 tty0  R  21:51   7:23 bash -c
...


The guest scheduler forms a CPU compartment: it gets a fixed amount of CPU time from the host and divides it among its processes. If it has many processes, it gets no more CPU time than if it had only a few. This is useful for enforcing equal access to the CPU for different users or workloads, regardless of how many processes they have running.
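This fixed-share property is easy to see as a continuation of the session above. If we start a fourth busy loop and move it into the compartment as well, the host scheduler still sees only two competitors, process 292 and the guest scheduler's host process, so each still gets half of the CPU; the guest simply splits its half three ways, roughly 16 percent per loop, while process 292 keeps its 50 percent. (The job number and process ID below are hypothetical, not taken from a real run.)

host% bash -c 'while true; do true; done' &
[5] 295
host% mv /proc/sched-groups/0/295 /proc/sched-groups/290/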

By loading each group of processes, whether a user, an application, a workload, or an arbitrary set of processes, into one of these compartments, you guarantee that the groups as a whole get treated equally by the scheduler.

I've described the guest schedulers as being loaded into the host kernel, but I can also see a role for userspace guest schedulers. Most obviously, by making the scheduler of a UML instance visible to the host as a guest scheduler, its processes become visible to the host in the same way as the host's own processes. Their names and resource usage become visible on the host. They also become controllable in the same way: a signal can be sent to one from the host and the corresponding UML process will receive it.

Making a UML scheduler visible as a host guest scheduler requires an interface for a process to register itself as a guest scheduler. This interface would be the mechanism for telling the host about the guest's processes and their data. Once we have an interface like this, there's no reason that UML has to be the only user of it.

A number of other applications have internal process-like components and could use this interface. Anything with internal threads could make them visible on the host in the same way that UML processes would be. They would be controllable in the same way, and the attributes the host sees would be provided by the application.

A Web server could make requests or sessions visible as host processes. Similarly, a mail server could make incoming or outgoing e-mail messages look like processes. The ability to monitor and control these servers with this level of granularity would make them much more manageable.

The same ideas would apply to any other sort of compartment. A memory compartment would be assigned some amount of memory when created, just as a CPU compartment has a call on a certain amount of CPU time. Processes would be loaded into it and would then have their memory allocations satisfied from within that pool of memory.

If a compartment runs out of memory, it has to start swapping. It is required to operate on the fixed amount of memory it was provided and can't allocate more from the host when it runs short. It has to swap even if there is plenty of memory free on the rest of the system. In this way, the memory demands of the different groups of processes on the host are isolated from each other. One group can't adversely affect the performance of another by allocating all of the memory on the system.
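No such memory compartment has been implemented, but to make the analogy concrete, imagine an interface that mirrors the scheduler prototype: a /proc entry whose opening creates a memory group, a setting that fixes the size of its pool, and group directories into which processes are moved. Every path, name, and number below is purely hypothetical:

host% cat /proc/memory-managers/guest_vm &
Created mem_group 301 ('guest_vm')
host% echo 64M > /proc/mem-groups/301/size
host% mv /proc/mem-groups/0/293 /proc/mem-groups/301/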

Compartmentalization is an old subject, and there are many ways to do it, including some that are currently being implemented on Linux, principally CKRM (Class-based Kernel Resource Management). These projects add resource allocation and control infrastructure to the kernel, along with interfaces that allow users to control the resulting compartments.

These approaches necessarily involve modifying the basic algorithms in the kernel, such as the scheduler and the virtual memory system. This adds some overhead to these algorithms even when compartments aren't being used, which is likely to be the common case. There is a bias in the Linux kernel development community against making common things more expensive in order to make uncommon things cheap. Compartmentalization performed in these ways conflicts with that ethos.

More importantly, these algorithms have been carefully tuned under a wide range of workloads. Any perturbations to them could throw off this tuning and require repeating all this work.

In contrast, the approach of gaining compartmentalization through virtualization requires no changes to these core algorithms. Instead of modifying an algorithm to accommodate compartments, a new copy of the same algorithm is used to implement them. Thus, there is no performance impact and no behavior perturbation when compartments are not being used.

The price of having two complete implementations of an algorithm instead of a single modified one is that the compartments themselves will tend to be more expensive. That is the trade-off for leaving performance and behavior untouched when compartments are not being used.


