20.6. Task Queue Implementation Notes

20.6.1. Use of Kmem Caches

The kmem subsystem (see Section 11.2) affords a convenient building block for task queues. A special kmem cache manages threads instead of memory. Threads are created in the cache constructor and destroyed in the cache destructor. The cache entries themselves hold the thread pointer and some information needed for internal housekeeping (flags, locks, and condition variables). Threads execute scheduled tasks and then call kmem_cache_free() on their own entries. The kmem subsystem uses distributed locks internally, so the whole thing scales well over many CPUs.
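The following is a minimal sketch of this model, assuming the standard kmem and thread interfaces; the taskq_ent_t layout and the names taskq_ent_constructor(), taskq_ent_destructor(), taskq_ent_thread(), and taskq_ent_cache_init() are illustrative, not the actual Solaris code.

    #include <sys/types.h>
    #include <sys/kmem.h>
    #include <sys/ksynch.h>
    #include <sys/thread.h>
    #include <sys/proc.h>
    #include <sys/disp.h>
    #include <sys/taskq.h>

    typedef struct taskq_ent {
            kmutex_t        tqe_lock;
            kcondvar_t      tqe_cv;
            kthread_t       *tqe_thread;    /* servicing thread */
            task_func_t     *tqe_func;      /* scheduled task, NULL when idle */
            void            *tqe_arg;
    } taskq_ent_t;

    static void taskq_ent_thread(void *);   /* runs tasks, frees its own entry */

    /* Constructor: runs when the cache grows; creates the servicing thread. */
    static int
    taskq_ent_constructor(void *buf, void *cdrarg, int kmflags)
    {
            taskq_ent_t *tqe = buf;

            mutex_init(&tqe->tqe_lock, NULL, MUTEX_DEFAULT, NULL);
            cv_init(&tqe->tqe_cv, NULL, CV_DEFAULT, NULL);
            tqe->tqe_func = NULL;
            tqe->tqe_thread = thread_create(NULL, 0, taskq_ent_thread, tqe,
                0, &p0, TS_RUN, minclsyspri);
            return (0);
    }

    /* Destructor: runs under memory pressure; retires the servicing thread. */
    static void
    taskq_ent_destructor(void *buf, void *cdrarg)
    {
            taskq_ent_t *tqe = buf;

            /* ask the thread to exit and wait for it to do so (elided) */
            cv_destroy(&tqe->tqe_cv);
            mutex_destroy(&tqe->tqe_lock);
    }

    static kmem_cache_t *taskq_ent_cache;

    void
    taskq_ent_cache_init(void)
    {
            taskq_ent_cache = kmem_cache_create("taskq_ent_cache",
                sizeof (taskq_ent_t), 0, taskq_ent_constructor,
                taskq_ent_destructor, NULL, NULL, NULL, 0);
    }

Dispatching then amounts to a kmem_cache_alloc() of an entry, filling in tqe_func and tqe_arg, and signaling tqe_cv; the servicing thread calls kmem_cache_free() on the entry when the task completes.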

This model is elegant, efficient, and simple (the whole design fit on a standard restaurant paper napkin, and the implementation took one evening), but it had a serious problem: there is no way to control the total number of allocated entries. Cache size is controlled by memory pressure and may grow quite large before memory pressure starts destroying free entries. This may sometimes create a huge number of idle threads that waste system resources. And thus a new model appeared.

20.6.2. Use of Vmem Arenas

To address the problem of limiting the total number of threads created, we use a close sibling of kmem caches, vmem arenas, which can allocate plain integer numbers. A task pool with a maximum of N threads then becomes an array of N entries, and each entry is addressed by an integer index allocated from a vmem arena covering the interval (0, N]. The vmem subsystem provides the needed scalability if the arena is backed by a kmem cache; in that case, the vmem subsystem uses this kmem cache internally for all allocations.

When a task is scheduled, we use vmem_alloc() to allocate an index in the array. Use of vmem_alloc() is close to the use of kmem caches in the previous model: it is scalable and guarantees that nothing can access the allocated array entry until we free it. vmem_alloc() also tends to allocate entries that were freed recently.
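A minimal sketch of this index-allocation scheme follows, reusing the taskq_ent_t structure from the sketch above; the arena name, TASK_POOL_MAX, and the helper task_pool_dispatch() are assumed names, and per-entry locking is elided.

    #include <sys/vmem.h>

    #define TASK_POOL_MAX   128             /* N: maximum number of threads */

    static vmem_t *task_arena;
    static taskq_ent_t task_entries[TASK_POOL_MAX + 1];    /* indexed 1..N */

    void
    task_pool_init(void)
    {
            /*
             * An arena of plain integers covering (0, N].  A nonzero
             * qcache_max argument would back small allocations with kmem
             * caches, which is the scalable variant described in the text.
             */
            task_arena = vmem_create("task_pool", (void *)1, TASK_POOL_MAX,
                1, NULL, NULL, NULL, 0, VM_SLEEP);
    }

    /* Dispatch path: allocate an index, fill in the entry, wake its thread. */
    int
    task_pool_dispatch(task_func_t *func, void *arg)
    {
            uintptr_t idx = (uintptr_t)vmem_alloc(task_arena, 1, VM_NOSLEEP);

            if (idx == 0)
                    return (0);             /* all N entries are in use */
            task_entries[idx].tqe_func = func;
            task_entries[idx].tqe_arg = arg;
            /* wake the entry's thread, or create one if tqe_thread is NULL */
            return (1);
    }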

The allocated control structure may have a running thread assigned to it. In this case, we simply wake up the thread, which executes the task and then goes back to sleep. If there is no thread, we create one. After a thread executes a job, it sleeps for some time, waiting for a new task to arrive. If it does not get a new job for a while, it wakes up and destroys itself. This strategy allows us to keep just enough working threads to handle the current workload.
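The servicing-thread lifecycle might look roughly like the sketch below, which builds on the structures above; the idle timeout and the cleanup details are illustrative assumptions.

    #define TASK_IDLE_TICKS (5 * hz)        /* assumed idle timeout */

    static void
    taskq_ent_thread(void *arg)
    {
            taskq_ent_t *tqe = arg;

            mutex_enter(&tqe->tqe_lock);
            for (;;) {
                    while (tqe->tqe_func == NULL) {
                            /* wait for a task, or time out after idling */
                            if (cv_timedwait(&tqe->tqe_cv, &tqe->tqe_lock,
                                ddi_get_lbolt() + TASK_IDLE_TICKS) == -1 &&
                                tqe->tqe_func == NULL) {
                                    mutex_exit(&tqe->tqe_lock);
                                    /* free the entry/index (elided), then exit */
                                    thread_exit();
                            }
                    }
                    mutex_exit(&tqe->tqe_lock);
                    tqe->tqe_func(tqe->tqe_arg);    /* run the task */
                    mutex_enter(&tqe->tqe_lock);
                    tqe->tqe_func = NULL;           /* back to idle */
            }
    }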

Unfortunately, it turned out that vmem arenas backed by kmem caches have interesting implementation properties that make them difficult to use:

  • Arena size should be large enough to accommodate the kmem subsystem entries, which are allocated in full per-CPU slabs.

  • A recently freed entry is not allocated by the next vmem_alloc(), so the total number of used entries tends to be high, increasing the number of idle threads.

  • The allocator's behavior again depends on memory pressure.

To address this problem we need to drop the use of backing kmem caches and use vmem arenas directly. Then we need to address the scalability issue somehow. And so another new model appeared.

20.6.3. Hashed Vmem Arenas

Ordinary vmem arenas use a single per-arena lock to protect internal state. To avoid lock and resource contention on multiple-CPU systems, we introduced an array of vmem arenas, each controlling its own array of task entries. If we have M arenas, each of size N, then the maximum number of allocated entries is M x N. A good choice for M is the number of CPUs in the system, but we could use any number, because it does not matter which arena we actually use for a given allocation. When we cannot allocate an entry from a specific arena, we try the others until some arena has free entries or until no arenas are left.
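A rough sketch of this hashed-arena lookup is shown below; NARENAS stands in for M, and the helper name task_index_alloc() is purely illustrative.

    #include <sys/cpuvar.h>

    #define NARENAS         16      /* M: for example, the number of CPUs */

    static vmem_t *task_arenas[NARENAS];    /* each arena covers (0, N] */

    static uintptr_t
    task_index_alloc(uint_t *bucketp)
    {
            uint_t start = CPU->cpu_seqid % NARENAS;
            uint_t i;

            /* start with the arena hashed from this CPU, then try the rest */
            for (i = 0; i < NARENAS; i++) {
                    uint_t b = (start + i) % NARENAS;
                    uintptr_t idx = (uintptr_t)vmem_alloc(task_arenas[b], 1,
                        VM_NOSLEEP);

                    if (idx != 0) {
                            *bucketp = b;   /* entry lives in bucket b's array */
                            return (idx);
                    }
            }
            return (0);             /* all M x N entries are in use */
    }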

This model has all the nice properties we need, but it has some drawbacks also:

  • The whole set of M x N entries is allocated permanently and independently of the actual system load, and much of that space may be unused. On modern hardware, the potential number of CPUs can be quite high (it can reach 512 on some platforms), and we usually need several dozen entries in each bucket. With a typical entry size of around 64 bytes, the whole table may consume several megabytes of precious kernel memory (for example, 512 buckets x 64 entries x 64 bytes is already 2 Mbytes).

  • Use of vmem arenas looks like overkill when all we need is a simple LIFO-type allocator that returns recently used entries first. A simple list of free entries seems sufficient for that purpose.

So, in the next cycle we get rid of vmem arenas and static per-bucket arrays and introduce a list of free entries. And that is how task pools are currently implemented.

20.6.4. Cached List of Entries

To keep the implementation scalable, we continue to use a per-CPU hash of entry buckets. Each bucket has a list of free entries that were used recently, and each entry has a thread servicing it. When a task is scheduled, we first try to find an idle entry in the free list of the current CPU's bucket. If that free list is empty, we go to the next bucket, and so on. If a free entry is found, we just put the task information there and wake up the sleeping thread. If there are no free entries, we create a new one in the original bucket and populate it with a thread. When a thread has waited long enough without getting a new job, it wakes up, removes itself from the free list, and destroys itself. All thread synchronization uses per-bucket locks and condition variables; no global locks are used.
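A rough sketch of the dispatch path in this model follows; the task_bucket_t layout, NBUCKETS, and the helpers task_pool_dispatch() and task_bucket_extend() (sketched after the next paragraph) are illustrative assumptions, and it presumes taskq_ent_t has gained a list_node_t field for the free list.

    #include <sys/list.h>

    #define NBUCKETS        16                      /* size of the bucket hash */

    typedef struct task_bucket {
            kmutex_t        tb_lock;                /* per-bucket lock */
            list_t          tb_freelist;            /* idle entries, LIFO order */
    } task_bucket_t;

    static task_bucket_t task_buckets[NBUCKETS];    /* list_create() calls elided */

    static int task_bucket_extend(task_bucket_t *, uint_t);

    int
    task_pool_dispatch(task_func_t *func, void *arg, uint_t flags)
    {
            uint_t home = CPU->cpu_seqid % NBUCKETS;
            uint_t i;

            for (i = 0; i < NBUCKETS; i++) {
                    task_bucket_t *tb = &task_buckets[(home + i) % NBUCKETS];
                    taskq_ent_t *tqe;

                    mutex_enter(&tb->tb_lock);
                    tqe = list_remove_head(&tb->tb_freelist);
                    mutex_exit(&tb->tb_lock);
                    if (tqe != NULL) {
                            /* idle entry found: hand over the task, wake it */
                            mutex_enter(&tqe->tqe_lock);
                            tqe->tqe_func = func;
                            tqe->tqe_arg = arg;
                            cv_signal(&tqe->tqe_cv);
                            mutex_exit(&tqe->tqe_lock);
                            return (1);
                    }
            }
            /*
             * No idle entries anywhere: try to add a new entry (and thread)
             * to the home bucket; the task itself is handed over on a
             * subsequent pass (elided).
             */
            return (task_bucket_extend(&task_buckets[home], flags));
    }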

There is one problem, though. The thread_create() call may sleep waiting for memory, so we cannot use it in NOSLEEP dispatches. In this case, instead of allocating an entry with a thread, we schedule a background job on a backing task queue, which allocates an entry and creates a new servicing thread. The original dispatch operation fails, but the next one is likely to use the newly created entry. This means that even on a system with plenty of memory, the taskq_dispatch() call may fail during the warm-up period.
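Continuing the sketch above, the pool-growing helper might defer thread creation for NOSLEEP dispatches roughly as follows; task_backing_tq and task_bucket_grow() are assumed names.

    static taskq_t *task_backing_tq;        /* ordinary (non-dynamic) task queue */

    static void task_bucket_grow(void *);   /* allocates an entry, creates thread */

    static int
    task_bucket_extend(task_bucket_t *tb, uint_t flags)
    {
            if (flags & TQ_NOSLEEP) {
                    /*
                     * thread_create() may sleep, so defer the work to the
                     * backing queue; this dispatch fails, but the next one
                     * will likely find the newly created entry.
                     */
                    (void) taskq_dispatch(task_backing_tq, task_bucket_grow,
                        tb, TQ_NOSLEEP);
                    return (0);
            }
            /* TQ_SLEEP: safe to sleep in thread_create(), so grow synchronously */
            task_bucket_grow(tb);
            return (1);
    }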

When the system is low on memory, all SLEEP allocations begin to block waiting for memory, so we do not create new entries and new threads when we detect that memory is low; the memory shortage may actually limit the number of created threads. Under rare circumstances it is possible that all existing threads in the pool will die from inactivity and no new threads will be created because of the memory shortage. In this case, all dispatch operations will fail, and all dispatches will revert to their backup versions (which will probably be some variation of single-threaded task queues).

Such behavior has an interesting property: It slows down task execution and increases latencies when the system is low on memory. It may well be a good property under such severe circumstances.

Currently, there are no provisions for changing the priority of the servicing thread, which would be needed to support priority dispatching correctly. Adding them should be easy in the future, however, since we know the exact state of the servicing thread at dispatch time and can properly update its priority. At present, there is no kernel API for changing the priority of a running thread, but we can easily implement priority scheduling once such an API comes into existence.

20.6.5. Problems with Task Pool Implementation

Several problems remain in the current implementation that could impact its performance under heavy load.

Although the initial bucket hashing is done with the CPU number, the actual threads are not bound to any CPU and may migrate from one CPU to another, so it is quite possible that a task will be scheduled on one CPU and executed on another. This behavior increases system cache pollution and may be especially harmful on non-uniform memory architectures such as ccNUMA. Binding threads to CPUs might solve this problem but could have other bad impacts on the system:

  • Each dispatch-execution cycle has a context switch from the dispatching thread to the executing thread. Regular task queues may have fewer context switches, since the servicing thread may execute several tasks before it goes to sleep.

  • We may create more threads than there are spare CPUs on the system, so these extra threads will have no CPUs to execute them. Ideally, we would like to know the state of each CPU and create new threads only when some CPUs are idle.

Overall, our experiments with this implementation show that we achieve significant benefits on multiple-CPU systems without hurting performance on smaller systems.

20.6.6. Use of Dynamic Task Pools in STREAMS

The original motivation for designing and implementing the new dynamic task queue extensions came from our attempts to develop a good replacement for the current background-job scheduling in the Solaris STREAMS subsystem. The STREAMS scheduler is documented in more detail elsewhere; here is a quick summary.

Solaris STREAMS has five different queues for background job processing.

  • Background queue scheduling (STREAMS scheduler), which calls module and driver service procedures; the entries are placed in the queue by the qenable() function, usually called from putq().

  • Background syncq scheduling through the sqenable() function.

  • Background scheduling of qwriter(9F) callbacks, entered through the queue_writer() function.

  • Asynchronous freeing of memory obtained through esballoc().

  • Handling of bufcalls that are used when the system is low on memory.

Queue scheduling turns out to be the hottest of these task queues. It is heavily used throughout the networking code, and its implementation was extremely inefficient, so we focused on fixing just queue scheduling. The prototype results showed that we could achieve significant performance gains by distributing the locks and optimizing the scheduler. So the next logical step was to design a generic task scheduler that can meet high-stress networking demands and that can replace most of the STREAMS ad hoc scheduling implementations.

As a result, we reimplemented all the above schedulers, except bufcalls, to take advantage of the task queues. It turned out that the code became cleaner, simpler, and more efficient.



