Future Enhancements


A number of host kernel enhancements for improving UML performance and reducing host resource consumption are in the works. Some are working and ready to be merged into the mainline kernel, and some are experimental and won't be in the mainline kernel for a while.

sysemu

Starting with the mature ones, the sysemu patch adds a ptrace option that allows host system calls to be intercepted and nullified with one call to ptrace, rather than two. Without this patch, in order to intercept system calls, a process must intercept them both at system call entry and exit. A tool like strace needs to make the two calls to ptrace on each system call because the tool needs to print the system call when it starts, and it needs to print the return value when it exits. For something like UML, which nullifies the system call and doesn't need to see both the system call entry and exit, this is one ptrace call and two context switches too many.

This patch noticeably speeds up UML system calls, and even workloads that aren't particularly system call intensive benefit. A getpid() loop is faster by about 40%. A kernel build, which is a somewhat more representative workload than a getpid() loop, is faster by around 3%.

This improvement is not related to the skas3 patch at all. It is purely to speed up system call nullification, which UML has to do no matter what mode it is running in.

The sysemu patch went into the mainline kernel in 2.6.14, so if your host is running 2.6.14 or later, you already have this enhancement.

PTRACE_FAULTINFO

PTRACE_FAULTINFO is another patch that has been around for a long time. It is part of the skas3 patch but will likely be split out since it's less objectionable than other parts of skas3, such as /proc/mm. PTRACE_FAULTINFO is used by UML in either skas mode in order to extract page fault information from a UML process. skas0 mode has a less efficient way to do this but will detect the presence of PTRACE_FAULTINFO and use it if present on the host.

MADV_TRUNCATE

This is a relatively new patch from Badari Pulavarty of IBM. It allows a process to throw out modified data from a tmpfs file it has mapped. Rather than being a performance improvement like the previous patches, MADV_TRUNCATE reduces the consumption of host memory by its UML instances.

The problem this solves is that memory-mapped files, such as those used by UML for its physical memory, preserve their contents. This is normally a good thing. If you put some data in a file and it later just disappeared, you would be rather upset. However, UML sometimes doesn't care if its data disappears. When a page of memory is freed within the UML kernel, the contents of that page no longer matter. So, it would be perfectly all right if the host were to free that page and use it for something else. When that page of UML physical memory was later allocated and reused, the host would have to provide a page of its own memory, but it would have had an extra page of free memory in the meantime.

I made an earlier attempt at creating a solution, which involved a device driver, /dev/anon, rather than an madvise extension. The driver allowed a process to map memory from it. This memory had the property that, when it was unmapped, it would be freed. /dev/anon mostly worked, but it was never entirely debugged.

Both /dev/anon and MADV_TRUNCATE are trying to do the same thing: poke a hole in a file. A third proposed interface, a system call for doing this, may still come into existence at some point.

The main benefit of these solutions is that they provide a mechanism for implementing hot-plug memory. The basic idea of hot-plug memory on virtual machines is that the guest contains a driver that communicates with the host. When the host is short of memory and wants to take some away from a guest, it tells the driver to remove some of the guest's memory. The guest does this simply by allocating memory and freeing it on the host. If the guest doesn't have enough free memory, it will start swapping out data until it does.

When the host wants to give memory back to a guest, it simply tells the driver to free some of its allocated memory back to the UML kernel.

This gives us what we need to avoid the pathological interaction between the host and guest virtual memory systems I described in Chapter 2. To recap, suppose that both the host and the guest are short of memory and are about to start swapping memory. They will both look for pages of memory that haven't been recently used to swap out. They will both likely find some of the same pages. If the host manages to write one of these out before the guest does, it will be on disk, and its page of memory will be freed. When the guest decides to write it out to its swap, the host will have to read it back in from swap, and the guest will immediately write it out to its own swap device.

So, that page of memory has made three trips between memory and disk when only one was necessary. This increased the I/O load on the host when it was likely already under I/O pressure. Reading the page back in for the benefit of the guest caused the host to allocate memory to hold it, again when it was already under memory pressure.

To make matters even worse, to the host, that page of memory is now recently accessed. It won't be a candidate for swapping from the host, even though the guest has no need for the data.

Hot-pluggable memory allows us to avoid this by ensuring that either the host or the UML instances swap, but not both. If the UML instances are capable of swapping (that is, the host administrator gave them swap devices), we should manage the host's memory to minimize its swapping. This can be done by using a daemon on the host that monitors the memory pressure in the UML instances and the host. When the host is under memory pressure and on the verge of swapping, the daemon can unplug some memory from an idle UML instance and release it to the host.

Hot-plug memory also allows the UML instances to make better use of the host's memory. Unplugging some memory from an idle UML instance and plugging the same amount into a busy one effectively transfers the memory from one to the other. Since some UML instances will typically be idle at any given time, this allows more of them to run on the host without consuming more host memory. When an idle UML instance wakes up and becomes busy again, it will receive some memory from an instance that is now idle.

Since the MADV_TRUNCATE patch is new, it is uncertain when it will be merged into the mainline kernel and what the interface to it will be when it is. Whatever the interface ends up being, UML will use it in its hot-plug memory code. If MADV_TRUNCATE is not available in a mainline kernel, it will be available as a separate patch.

The interface to plug and unplug UML physical memory likely will remain as it is, regardless of the host interface. This uses the MConsole protocol to treat physical memory as a device that can be reconfigured dynamically. Removing some memory is done like this:

host% uml_mconsole debian config mem=-64M


This removes 64MB of memory from the specified UML instance.

The relevant memory statistics inside the UML (freshly booted, with 192MB of memory) before the removal look like this:

UML# grep Mem /proc/meminfo
MemTotal:       191024 kB
MemFree:        117892 kB


Afterward, they look like this:

UML# grep Mem /proc/meminfo
MemTotal:       191024 kB
MemFree:         52172 kB


Just about what we would expect. The memory can be plugged back in the same way with:

host% uml_mconsole debian config mem=+64M


That brings us basically back to where we started:

UML# grep Mem /proc/meminfo
MemTotal:       191024 kB
MemFree:        117396 kB


The main limitation to this currently is that you can't plug arbitrary amounts of memory into a UML instance. It can't end up with more than it had when it was booted because a kernel data structure that is sized according to the physical memory size at boot can't be changed later. It is possible to work around this by assigning UML instances a very large physical memory at boot and immediately unplugging a lot of it.

This limitation may not exist for long. People who want Linux to run on very large systems are doing work that would make this data structure much more flexible, with the effect for UML that it could add memory beyond what it had been booted with.

Since this capability is brand new, the UML management implications of it aren't clear at this point. It is apparent that there will be a daemon on the host monitoring the memory usage of the host and the UML instances and shuffling memory around in order to optimize its use. What isn't clear is exactly what this daemon will measure and exactly how it will implement its decisions. It may occasionally plug and unplug large amounts of memory, or it may constantly make small adjustments.

Memory hot-plugging can also be used to implement policy. One UML instance may be considered more important than another (possibly because its owner paid the hosting company some extra money) and will have preferential access to the host's memory as a result. The daemon will be slower to pull memory from this instance and quicker to give it back.

All of this is in the future since this capability is so new. It will be used to implement both functionality and policy. I can't give recommendations as to how to use this capability because no one has any experience with it yet.

remap_file_pages

Ingo Molnar spent some time looking at UML performance and at ways to increase it. One of his observations was that the large number of virtual memory areas in the host kernel hurt UML performance. If you look in /proc/<pid>/maps for the host process corresponding to a UML process, you will see that it contains a very large number of entries. Each of these entries is a virtual memory area, and each is typically a page long. If you look at the corresponding maps for the same process inside the UML instance, you will see basically the same areas of virtual memory, except that there will be far fewer of them, and they will be much larger.

This page-by-page mapping of host memory creates data structures in the host kernel and slows down the process of searching, adding, and deleting these mappings. This, in turn, hurts UML performance.

Ingo's solution to this was to create a new system call, remap_file_pages, that allows pages within one of these virtual memory areas to be rearranged. Thus, whenever a page is mapped into a UML process address space, it is moved around beneath the virtual memory area rather than creating a new one. So, there will be only one such area on the host for a UML process rather than hundreds and sometimes thousands.

This patch has a noticeable impact on UML performance. It has been around for a while, and Paolo Giarrusso has recently resurrected it, making it work and splitting it into pieces for easier review by the kernel development team. It is a candidate for inclusion into Andrew Morton's kernel tree. It was sent once but dropped because of clashes with another patch. However, Andrew did encourage Paolo to keep it maintained and resubmit it again.

VCPU

VCPU is another of Ingo's patches. This deals with the inefficiency of the ptrace interface for intercepting system calls. The idea, which had come up several times before, is to have a single process with a "privileged" context and an "unprivileged" context. The process starts in the privileged context and eventually makes a system call that puts it in the unprivileged context. When it receives a signal or makes a system call, it returns through the original system call back to the privileged context. Then it decides what to do with the signal or system call.

In this case, the UML kernel would be the privileged context and its processes would be unprivileged contexts. The behavior of regaining control when another process makes a system call or receives a signal is exactly what ptrace is used for. In this case, the change of control would be a system call return rather than a context switch, reducing the overhead of doing system call and signal interception.

User Mode Linux
ISBN: 0131865056
Authors: Jeff Dike