Section 17.7. Turnstiles and Priority Inheritance | Solaris Internals: Solaris 10 and OpenSolaris Kernel Architecture (2nd Edition)

17.7. Turnstiles and Priority Inheritance

A turnstile is a data abstraction that encapsulates sleep queues and priority inheritance information associated with mutex locks and reader/writer locks. The mutex and RW lock code use a turnstile when a kernel thread needs to block on a requested lock. The sleep queues implemented for other resource waits do not provide an elegant method of dealing with the priority inversion problem through priority inheritance. Turnstiles were created to address that problem.

Priority inversion describes a scenario in which a higher-priority thread is unable to run because a lower-priority thread is holding a resource it needs, such as a lock. The Solaris kernel addresses the priority inversion problem in its turn-stile implementation, providing a priority inheritance mechanism, where the higher-priority thread can will its priority to the lower-priority thread holding the resource it requires. The beneficiary of the inheritance, the thread holding the resource, will now have a higher scheduling priority and thus get scheduled to run sooner so it can finish its work and release the resource, at which point the original priority is returned to the thread.

In this section, we assume you have some level of knowledge of kernel thread priorities, which are covered in Section 3.7. Because turnstiles and priority inheritance are an integral part of the implementation of mutex and RW locks, we thought it best to discuss them here rather than later. For this discussion, it is important to be aware of these points:

The Solaris kernel assigns a global priority to kernel threads, based on the scheduling class they belong to.
Kernel threads in the timeshare and interactive scheduling classes will have their priorities adjusted over time, based on three things: the amount of time the threads spend running on a processor, sleep time (blocking), and the case when they are preempted. Threads in the real-time class are fixed priority; the priorities are never changed regardless of runtime or sleep time unless explicitly changed through programming interfaces or commands.

The Solaris kernel implements sleep queues for the placement of kernel threads blocking on (waiting for) a resource or event. For most resource waits, such as those for a disk or network I/O, sleep queues, in conjunction with condition variables, manage the systemwide queue of sleeping threads. These sleep queues are covered in Section 3.10. This set of sleep queues is separate and distinct from turn-stile sleep queues.

17.7.1. Turnstiles Implementation

Figure 17.7 illustrates the Solaris 10 turnstiles. Turnstiles are maintained in a systemwide hash table, turnstile_table[], which is an array of turnstile_chain structures; each entry in the array (each turnstile_chain structure) is the beginning of a linked list of turnstiles. The array is indexed via a hash function on the address of the synchronization object (the mutex or reader/writer lock), so locks that hash to the same array location will have a turnstile on the same linked list. The turnstile_table[] array is statically initialized at boot time.

Figure 17.7. Turnstiles

typedef struct turnstile_chain {         turnstile_t     *tc_first;      /* first turnstile on hash chain */         disp_lock_t     tc_lock;        /* lock for this hash chain */ } turnstile_chain_t; turnstile_chain_t       turnstile_table[2 * TURNSTILE_HASH_SIZE];                                                               See common/os/turnstile.c

Each entry in the chain has its own lock, tc_lock, so chains can be traversed concurrently. The turnstile itself has a different lock; each chain has an active list (ts_next) and a free list (ts_free). There are also a count of threads waiting on the sync object (waiters), a pointer to the synchronization object (ts_sobj), a thread pointer linking to a kernel thread that had a priority boost through priority inheritance, and the sleep queues. Each turnstile has two sleep queues, one for readers and one for writers (threads blocking on a read/write lock are maintained on separate sleep queues). The priority inheritance data is integrated into the turnstile.

#define TS_WRITER_Q     0       /* writer sleepq (exclusive access to sobj) */ #define TS_READER_Q     1       /* reader sleepq (shared access to sobj) */ #define TS_NUM_Q        2       /* number of sleep queues per turnstile */ typedef struct turnstile turnstile_t; struct _sobj_ops; struct turnstile {         turnstile_t     *ts_next;        /* next on hash chain */         turnstile_t     *ts_free;        /* next on freelist */         void            *ts_sobj;        /* s-object threads are blocking on */         int             ts_waiters;      /* number of blocked threads */         pri_t           ts_epri;         /* max priority of blocked threads */         struct _kthread *ts_inheritor;   /* thread inheriting priority */         turnstile_t     *ts_prioinv;     /* next in inheritor's t_prioinv list */         sleepq_t        ts_sleepq[TS_NUM_Q]; /* read/write sleep queues */ };                                                                     See sys/turnstile.h

Every kernel thread is born with an attached turnstile. That is, when a kernel thread is created (by the kernel thread_create() routine), a turnstile is allocated for the kthread and linked to kthread's t_ts pointer. A kthread can block on only one lock at a time, so one turnstile is sufficient.

We know from the previous sections on mutex and RW locks that a turnstile is required if a thread needs to block on a synchronization object. It calls turnstile_lookup() to look up the turnstile for the synchronization object in the turnstile_table[]. Since we index the array by hashing on the address of the lock, if a turnstile already exists (there are already waiters), then we get the correct turnstile. If no kthreads are currently waiting for the lock, turnstile_lookup() simply returns a null value. If the blocking code must be called (recall from the previous sections that subsequent tests are made on lock availability before it is determined that the kthread must block), then turnstile_block() is entered to place the kernel thread on a sleep queue associated with the turnstile for the lock.

Kernel threads lend their attached turnstile to the lock when a kthread becomes the first to block (the lock acquisition attempt fails, and there are no waiters). The thread's turnstile is added to the appropriate turnstile chain, based on the result of a hashing function on the address of the lock. The lock now has a turnstile, so subsequent threads that block on the same lock will donate their turnstiles to the free list on the chain (the ts_free link off the active turnstile).

In turnstile_block(), the pointers are set up as determined by the return from turnstile_lookup(). If the turnstile pointer is null, we link up to the turnstile pointed to by the kernel thread's t_ts pointer. If the pointer returned from the lookup is not null, there's already at least one kthread waiting on the lock, so the code sets up the pointer links appropriately and places the kthread's turnstile on the free list.

The thread is then put into a sleep state through the scheduling-class-specific sleep routine (for example, ts_sleep()). The ts_waiters field in the turnstile is incremented, the threads t_wchan is set to the address of the lock, and t_sobj_ops in the thread is set to the address of the lock's operations vectors: the owner, unsleep, and change_priority functions. The kernel sleepq_insert() function actually places the thread on the sleep queue associated with the turnstile.

The code does the priority inversion check (now called out of the turnstile_block() code), builds the priority inversion links and applies the necessary priority changes. The priority inheritance rules apply; that is, if the priority of the lock holder is less (worse) than the priority of the requesting thread, the requesting thread's priority is "willed" to the holder. The holder's t_epri field is set to the new priority, and the inheritor pointer in the turnstile is linked to the kernel thread. All the threads on the blocking chain are potential inheritors, based on their priority relative to the calling thread.

At this point, the dispatcher is entered through a call to swtch(), and another kernel thread is removed from a dispatch queue and context-switched onto a processor.

The wakeup mechanics are initiated as previously described, where a call to the lock exit routine results in a turnstile_wakeup() call if threads are blocking on the lock. turnstile_wakeup() does essentially the reverse of turnstile_block(); threads that inherited a better priority have that priority waived, and the thread is removed from the sleep queue and given a turnstile from the chain's free list. Recall that a thread donated its turnstile to the free list if it was not the first thread placed on the blocking chain for the lock; coming off the turnstile, threads get a turnstile back. Once the thread is unlinked from the sleep queue, the scheduling class wakeup code is entered, and the thread is put back on a processor's dispatch queue.