Other Kernel-Mode Synchronization Primitives
The Windows XP kernel offers some additional methods for synchronizing execution between threads and for guarding access to shared objects. In this section, I'll discuss the fast mutex, a mutual exclusion object that offers faster performance than a kernel mutex because it's optimized for the case in which no contention is actually occurring. I'll also describe the category of support functions that include the word Interlocked somewhere in their name. These functions carry out certain common operations, such as incrementing or decrementing an integer or inserting or removing an entry in a linked list, in an atomic way that prevents multitasking or multiprocessing interference.
Fast Mutex Objects
An executive fast mutex provides an alternative to a kernel mutex for protecting a critical section of code. Table 4-6 summarizes the service functions you use to work with this kind of object.
Service Function | Description
ExAcquireFastMutex | Acquires ownership of mutex, waiting if necessary |
ExAcquireFastMutexUnsafe | Acquires ownership of mutex, waiting if necessary, in circumstances in which the caller has already disabled receipt of APCs
ExInitializeFastMutex | Initializes mutex object |
ExReleaseFastMutex | Releases mutex |
ExReleaseFastMutexUnsafe | Releases mutex without reenabling APC delivery |
ExTryToAcquireFastMutex | Acquires mutex if possible to do so without waiting |
Compared with kernel mutexes, fast mutexes have the strengths and weaknesses summarized in Table 4-7. On the plus side, a fast mutex is much faster to acquire and release if there's no actual contention for it. On the minus side, a thread that acquires a fast mutex will not be able to receive certain types of asynchronous procedure call, depending on exactly which functions you call, and this constrains how you send IRPs to other drivers.
Kernel Mutex | Fast Mutex
Can be acquired recursively by a single thread (system maintains a claim counter) | Cannot be acquired recursively |
Relatively slower | Relatively faster |
Owner will receive only special kernel APCs | Owner won't receive any APCs unless you use the XxxUnsafe functions
Can be part of a multiple-object wait | Cannot be used as an argument to KeWaitForMultipleObjects |
Incidentally, the DDK documentation about kernel mutex objects has long said that the kernel gives a priority boost to a thread that claims a mutex. I'm reliably informed that this hasn't actually been true since 1992 (the year, that is, not the Windows build number). The documentation has also long said that a thread holding a mutex can't be removed from the balance set (that is, subjected to having all of its pages moved out of physical memory). This was true when Windows NT was young but hasn't been true for a long time.
To create a fast mutex, you must first allocate a FAST_MUTEX data structure in nonpaged memory. Then you initialize the object by calling ExInitializeFastMutex, which is really a macro in WDM.H:
ASSERT(KeGetCurrentIrql() <= DISPATCH_LEVEL);
ExInitializeFastMutex(FastMutex);
where FastMutex is the address of your FAST_MUTEX object. The mutex begins life in the unowned state. To acquire ownership later on, call one of these functions:
ASSERT(KeGetCurrentIrql() < DISPATCH_LEVEL);
ExAcquireFastMutex(FastMutex);
or
ASSERT(KeGetCurrentIrql() < DISPATCH_LEVEL);
ExAcquireFastMutexUnsafe(FastMutex);
The first of these functions waits for the mutex to become available, assigns ownership to the calling thread, and then raises the current processor IRQL to APC_LEVEL. Raising the IRQL has the effect of blocking delivery of all APCs. The second of these functions doesn't change the IRQL.
You need to think about potential deadlocks if you use the unsafe function to acquire a fast mutex. A situation to avoid is allowing user-mode code to suspend a thread in which you hold a mutex. That would deadlock other threads that need the mutex. For this reason, the DDK recommends (and the Driver Verifier requires) that you forestall the delivery of user-mode and normal kernel-mode APCs either by raising the IRQL to APC_LEVEL or by calling KeEnterCriticalRegion before ExAcquireFastMutexUnsafe. (Thread suspension involves an APC, so user-mode code can't suspend your thread if you disallow user-mode APCs. Yes, I know the reasoning here is a bit of a stretch!)
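Here's a minimal sketch of the recommended pattern; MyFastMutex and the guarded work are hypothetical:

// Block user-mode and normal kernel-mode APCs so that nobody can
// suspend this thread while it owns the mutex:
KeEnterCriticalRegion();
ExAcquireFastMutexUnsafe(&MyFastMutex);

// ... access the shared data here ...

ExReleaseFastMutexUnsafe(&MyFastMutex);
KeLeaveCriticalRegion();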
Another possible deadlock can arise with a driver in the paging path, that is, a driver that gets called to help the memory manager process a page fault. Suppose you simply call KeEnterCriticalRegion and then ExAcquireFastMutexUnsafe. Now suppose the system tries to execute a special kernel-mode APC in the same thread, which is possible because KeEnterCriticalRegion doesn't forestall special kernel APCs. The APC routine might page fault, which might then lead to you being reentered and deadlocking on a second attempt to claim the same mutex. You avoid this situation by raising IRQL to APC_LEVEL before acquiring the mutex in the first place or, more simply, by using ExAcquireFastMutex instead of ExAcquireFastMutexUnsafe. The same problem can arise if you use a regular KMUTEX or a synchronization event, of course.
IMPORTANT
If you use ExAcquireFastMutex, you will be at APC_LEVEL. This means you can't create any synchronous IRPs. (The routines that do this must be called at PASSIVE_LEVEL.) Furthermore, you'll deadlock if you try to wait for a synchronous IRP to complete (because completion requires executing an APC, which can't happen because of the IRQL). In Chapter 5, I'll discuss how to use asynchronous IRPs to work around this problem.
If you don't want to wait when the mutex isn't immediately available, use the try-to-acquire function:
ASSERT(KeGetCurrentIrql() < DISPATCH_LEVEL);
BOOLEAN acquired = ExTryToAcquireFastMutex(FastMutex);
If the return value is TRUE, you now own the mutex. If it's FALSE, someone else owns the mutex and has prevented you from acquiring it.
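For instance, a routine that merely updates a statistic might prefer to skip the update rather than stall. A sketch, with invented mutex and counter names:

if (ExTryToAcquireFastMutex(&MyFastMutex))
	{						// we own the mutex now
	++SharedStatistic;		// hypothetical guarded data
	ExReleaseFastMutex(&MyFastMutex);
	}
// else: someone else owns the mutex; defer or skip the update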
To release control of a fast mutex and allow some other thread to claim it, call the release function corresponding to the way you acquired the fast mutex:
ASSERT(KeGetCurrentIrql() < DISPATCH_LEVEL);
ExReleaseFastMutex(FastMutex);
or
ASSERT(KeGetCurrentIrql() < DISPATCH_LEVEL);
ExReleaseFastMutexUnsafe(FastMutex);
A fast mutex is fast because the acquisition and release steps are optimized for the usual case in which there's no contention for the mutex. The critical step in acquiring the mutex is to atomically decrement and test an integer counter that indicates how many threads either own or are waiting for the mutex. If the test indicates that no other thread owns the mutex, no additional work is required. If the test indicates that another thread does own the mutex, the current thread blocks on a synchronization event that's part of the FAST_MUTEX object. Releasing the mutex entails atomically incrementing and testing the counter. If the test indicates that no thread is currently waiting, no additional work is required. If another thread is waiting, however, the owner calls KeSetEvent to release one of the waiters.
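In outline, the two paths look something like the following paraphrase. This is my own sketch of the idea, not the actual system source, and SKETCH_MUTEX is not the real FAST_MUTEX layout:

typedef struct _SKETCH_MUTEX {
	LONG Count;				// 1 = free; 0 = owned; negative = owned with waiters
	KEVENT Event;			// initialized as a SynchronizationEvent
	} SKETCH_MUTEX;

VOID SketchAcquire(SKETCH_MUTEX* m)
	{						// fast path: decrementing 1 -> 0 means we now own it
	if (InterlockedDecrement(&m->Count) != 0)
		KeWaitForSingleObject(&m->Event, Executive, KernelMode, FALSE, NULL);
	}

VOID SketchRelease(SKETCH_MUTEX* m)
	{						// a result <= 0 means at least one thread is waiting
	if (InterlockedIncrement(&m->Count) <= 0)
		KeSetEvent(&m->Event, 0, FALSE);	// release exactly one waiter
	}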
Note on Deadlock Prevention
Whenever you use synchronization objects such as spin locks, fast mutexes, and so on, in a driver, you should be on the lookout for potential deadlocks. We've already talked about two deadlock issues: trying to acquire a spin lock that you already hold and trying to claim a fast mutex or synchronization event with APCs enabled in the thread. This sidebar concerns a more insidious potential deadlock that can arise when your driver uses more than one synchronization object.

Suppose there are two synchronization objects, A and B. It doesn't matter what types of objects these are, and they needn't even be the same type. Now suppose we have two subroutines, which I'll call Fred and Barney just so I have names to work with. Subroutine Fred claims object A followed by object B. Subroutine Barney claims B followed by A. This sets up a potential deadlock if Fred and Barney can be simultaneously active or if a thread running one of those routines can be preempted by a thread running the other routine.
The deadlock arises, as you probably remember from studying this sort of thing in school, when two threads manage to execute Fred and Barney at about the same time. The Fred thread gets object A, while the Barney thread gets object B. Fred now tries to get object B, but can't have it (Barney has it). Barney, on the other hand, now tries to get object A, but can't have it (Fred has it). Both threads are now deadlocked, waiting for the other one to release the object each needs.
The easiest way to prevent this kind of deadlock is to always acquire objects such as A and B in the same order, everywhere. The order in which you decide to acquire a set of resources is called the locking hierarchy. There are other schemes, which involve conditional attempts to acquire resources combined with back-out loops, but these are much harder to implement.
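Expressed in code, the fix is mechanical: pick an order and follow it everywhere. Here's a sketch using this sidebar's hypothetical names, with A and B shown as fast mutexes for concreteness (the rule applies to any lock types):

FAST_MUTEX A, B;			// hypothetical shared objects

VOID Fred()
	{
	ExAcquireFastMutex(&A);	// always A first...
	ExAcquireFastMutex(&B);	// ...then B
	// ... work with both shared objects ...
	ExReleaseFastMutex(&B);
	ExReleaseFastMutex(&A);
	}

VOID Barney()
	{
	ExAcquireFastMutex(&A);	// same order as Fred, so no deadlock is possible
	ExAcquireFastMutex(&B);
	// ... work with both shared objects ...
	ExReleaseFastMutex(&B);
	ExReleaseFastMutex(&A);
	}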
If you engage the Deadlock Detection option, the Driver Verifier will look for potential deadlocks resulting from locking hierarchy violations involving spin locks, kernel mutexes, and executive fast mutexes.
The DDK documents another synchronization primitive that I didn't discuss in this chapter: an ERESOURCE. File system drivers use ERESOURCE objects extensively because they allow for shared and exclusive ownership. Because file system drivers often have to use complex locking logic, the Driver Verifier doesn't check the locking hierarchy for an ERESOURCE.
Interlocked Arithmetic
You can call several service functions in a WDM driver to perform arithmetic in a way that's thread-safe and multiprocessor-safe. (See Table 4-8.) These routines come in two flavors. The first type of routine has a name beginning with Interlocked and performs an atomic operation in such a way that no other thread or CPU can interfere. The other flavor has a name beginning with ExInterlocked and uses a spin lock.
Service Function | Description
InterlockedCompareExchange | Compares and conditionally exchanges |
InterlockedDecrement | Subtracts 1 from an integer |
InterlockedExchange | Exchanges two values |
InterlockedExchangeAdd | Adds two values and returns sum |
InterlockedIncrement | Adds 1 to an integer |
InterlockedOr | ORs bits into an integer |
InterlockedAnd | ANDs bits into an integer |
InterlockedXor | Exclusive-ORs bits into an integer |
ExInterlockedAddLargeInteger | Adds value to 64-bit integer |
ExInterlockedAddLargeStatistic | Adds value to ULONG |
ExInterlockedAddUlong | Adds value to ULONG and returns initial value |
ExInterlockedCompareExchange64 | Compares and conditionally exchanges 64-bit values
The InterlockedXxx functions can be called at any IRQL; they can also handle pageable data at PASSIVE_LEVEL because they don't require a spin lock. Although the ExInterlockedXxx routines can be called at any IRQL, they operate on the target data at or above DISPATCH_LEVEL and therefore require a nonpaged argument. The only reason to use an ExInterlockedXxx function is if you have a data variable that you sometimes need to increment or decrement and sometimes need to access throughout some series of instructions. You would explicitly claim the spin lock around the multi-instruction accesses and use the ExInterlockedXxx function to perform the simple increments or decrements.
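For example, a driver might let ExInterlockedAddUlong handle simple bumps of a counter while explicitly claiming the same spin lock around a longer read-modify-write sequence. A sketch with invented names:

ULONG SharedCounter;		// hypothetical statistic in nonpaged memory
KSPIN_LOCK CounterLock;		// guards SharedCounter; initialize with KeInitializeSpinLock

VOID SimpleBump()
	{						// simple increment: the support routine takes the lock briefly
	ExInterlockedAddUlong(&SharedCounter, 1, &CounterLock);
	}

VOID ComplexUpdate()
	{						// multi-instruction access: hold the same lock ourselves
	KIRQL oldirql;			// (must be running at or below DISPATCH_LEVEL)
	KeAcquireSpinLock(&CounterLock, &oldirql);
	if (SharedCounter > 100)
		SharedCounter = 0;
	KeReleaseSpinLock(&CounterLock, oldirql);
	}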
InterlockedXxx Functions
InterlockedIncrement adds 1 to a long integer in memory and returns the postincrement value to you:
LONG result = InterlockedIncrement(pLong);
where pLong is the address of a variable typed as a LONG (that is, a long integer). Conceptually, the operation of the function is equivalent to the statement return ++*pLong in C, but the implementation differs from that simple statement in order to provide thread safety and multiprocessor safety. InterlockedIncrement guarantees that the integer is successfully incremented even if code on other CPUs or in other eligible threads on the same CPU is simultaneously trying to alter the same variable. In the nature of the operation, InterlockedIncrement cannot guarantee that the value it returns is still the value of the variable even one machine cycle later because other threads or CPUs will be able to modify the variable as soon as the atomic increment operation completes.
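One practical consequence of that caveat: always use the function's return value rather than rereading the variable afterward. Here's a minimal sketch; g_NextSequence is a hypothetical counter:

LONG g_NextSequence;		// hypothetical shared counter

LONG GetSequenceNumber()
	{
	// Correct: the return value is our own unique post-increment result.
	return InterlockedIncrement(&g_NextSequence);
	// Wrong would be to increment and then return g_NextSequence --
	// another thread might increment it between the two statements.
	}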
InterlockedDecrement is similar to InterlockedIncrement, but it subtracts 1 from the target variable and returns the postdecrement value, just like the C statement return --*pLong but with thread safety and multiprocessor safety.
LONG result = InterlockedDecrement(pLong);
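The classic application of InterlockedDecrement is reference counting: the thread whose decrement drives the count to 0 performs the cleanup. A sketch, with hypothetical names:

typedef struct _MYOBJECT {
	LONG RefCount;			// manipulated only with InterlockedXxx calls
	// ... other fields ...
	} MYOBJECT, *PMYOBJECT;

VOID DereferenceMyObject(PMYOBJECT Object)
	{
	// Only one thread can see the decrement reach exactly 0,
	// so only one thread frees the object.
	if (InterlockedDecrement(&Object->RefCount) == 0)
		ExFreePool(Object);
	}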
You call InterlockedCompareExchange like this:
LONG target;
LONG result = InterlockedCompareExchange(&target, newval, oldval);
Here target is a long integer used both as input and output to the function, oldval is your guess about the current contents of the target, and newval is the new value that you want installed in the target if your guess is correct. The function performs an operation similar to that indicated in the following C code but does so via an atomic operation that's both thread-safe and multiprocessor-safe:
LONG CompareExchange(PLONG ptarget, LONG newval, LONG oldval)
	{
	LONG value = *ptarget;
	if (value == oldval)
		*ptarget = newval;
	return value;
	}
In other words, the function always returns the previous value of the target variable to you. In addition, if that previous value equals oldval, it sets the target equal to the newval you specify. The function uses an atomic operation to do the compare and exchange so that the replacement happens only if you're correct in your guess about the previous contents.
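Compare-and-exchange is also the building block for atomic updates that are more elaborate than a simple add. The standard pattern is a retry loop, sketched here for a hypothetical record-the-running-maximum operation:

VOID InterlockedSetIfGreater(PLONG pTarget, LONG NewValue)
	{
	LONG oldval;
	do
		{
		oldval = *pTarget;	// guess the current value
		if (NewValue <= oldval)
			return;			// nothing to do -- current value is already bigger
		}
	while (InterlockedCompareExchange(pTarget, NewValue, oldval) != oldval);
	}						// a mismatch means someone changed the target; retry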
You can also call the InterlockedCompareExchangePointer function to perform a similar sort of compare-and-exchange operation with a pointer. This function is defined either as a compiler intrinsic (that is, a function for which the compiler supplies an inline implementation) or a real function call, depending on how wide pointers are on the platform for which you're compiling and on the ability of the compiler to generate inline code.
The last function in this class is InterlockedExchange, which simply uses an atomic operation to replace the value of an integer variable and to return the previous value:
LONG value;
LONG oldval = InterlockedExchange(&value, newval);
As you might have guessed, there's also an InterlockedExchangePointer that exchanges a pointer value (64-bit or 32-bit, depending on the platform). Be sure to cast the target of the exchange operation to avoid a compiler error when building 64-bit drivers:
PIRP Irp = (PIRP) InterlockedExchangePointer((PVOID*) &foo, NULL);
InterlockedOr, InterlockedAnd, and InterlockedXor are new with the XP DDK. You can use them in drivers that will run on earlier Windows versions because they're actually implemented as compiler-intrinsic functions.
Interlocked Fetches and Stores?
A frequently asked question is how to do simple fetch-and-store operations on data that's otherwise being accessed by InterlockedXxx functions. You don't have to do anything special to fetch a self-consistent value from a variable that other people are modifying with interlocked operations so long as the data in question is aligned on a natural address boundary. Data so aligned cannot cross a memory cache boundary, and the memory controller will always update a cache-sized memory block atomically. Thus, if someone is updating a variable at about the same time you're trying to read it, you'll get either the preupdate or the postupdate value but never anything in between.
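In other words, an ordinary aligned read suffices; casting through volatile merely keeps the compiler from caching a stale copy in a register. A one-line sketch, where SharedCounter is a hypothetical LONG that other code updates with InterlockedXxx calls:

LONG snapshot = *(volatile LONG*) &SharedCounter;	// atomic, given natural alignment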
For store operations, however, the answer is more complex. Suppose you write the following sort of code to guard access to some shared data:
if (InterlockedExchange(&lock, 42) == 0)
	{
	sharedthing++;
	lock = 0;				// <== don't do this
	}
This code will work fine on an Intel x86 computer, where every CPU sees memory writes in the same order. On another type of CPU, though, there could be a problem. One CPU might actually change the memory variable lock to 0 before updating memory for the increment statement. That behavior could allow two CPUs to simultaneously access sharedthing. This problem could happen because of the way the CPU performs operations in parallel or because of quirks in the memory controller. Consequently, you should rework the code to use an interlocked operation for both changes to lock:
if (InterlockedExchange(&lock, 42) == 0)
	{
	sharedthing++;
	InterlockedExchange(&lock, 0);
	}
ExInterlockedXxx Functions
Each of the ExInterlockedXxx functions requires that you create and initialize a spin lock before you call it. Note that the operands of these functions must all be in nonpaged memory because the functions operate on the data at elevated IRQL.
ExInterlockedAddLargeInteger adds two 64-bit integers and returns the previous value of the target:
LARGE_INTEGER value, increment;
KSPIN_LOCK spinlock;
LARGE_INTEGER prev = ExInterlockedAddLargeInteger(&value, increment, &spinlock);
Value is the target of the addition and one of the operands. Increment is an integer operand that s added to the target. Spinlock is a spin lock that you previously initialized. The return value is the target s value before the addition. In other words, the operation of this function is similar to the following function except that it occurs under protection of the spin lock:
__int64 AddLargeInteger(__int64* pvalue, __int64 increment)
	{
	__int64 prev = *pvalue;
	*pvalue += increment;
	return prev;
	}
Note that the return value is the preaddition value, which contrasts with the postincrement return from InterlockedIncrement and similar functions. (Also, not all compilers support the __int64 integer data type, and not all computers can perform a 64-bit addition operation using atomic instructions.)
ExInterlockedAddUlong is analogous to ExInterlockedAddLargeInteger except that it works with 32-bit unsigned integers:
ULONG value, increment;
KSPIN_LOCK spinlock;
ULONG prev = ExInterlockedAddUlong(&value, increment, &spinlock);
This function likewise returns the preaddition value of the target of the operation.
ExInterlockedAddLargeStatistic is similar to ExInterlockedAddUlong except that it adds a 32-bit value to a 64-bit Addend:
VOID ExInterlockedAddLargeStatistic(PLARGE_INTEGER Addend, ULONG Increment);
This new function is faster than ExInterlockedAddUlong because it doesn't need to return the preincrement value of the Addend variable. It therefore doesn't need to employ a spin lock for synchronization. The atomicity provided by this function is, however, only with respect to other callers of the same function. In other words, if you had code on one CPU calling ExInterlockedAddLargeStatistic at the same time as code on another CPU was accessing the Addend variable for either reading or writing, you could get inconsistent results. I can explain why this is so by showing you this paraphrase of the Intel x86 implementation of the function (not the actual source code):
mov  eax, Addend
mov  ecx, Increment
lock add [eax], ecx
lock adc [eax+4], 0
This code works correctly for purposes of incrementing the Addend because the lock prefixes guarantee atomicity of each addition operation and because no carries from the low-order 32 bits can ever get lost. The instantaneous value of the 64-bit Addend isn't always consistent, however, because an incrementer might be poised between the ADD and the ADC just at the instant someone makes a copy of the complete 64-bit value. Therefore, even a caller of ExInterlockedCompareExchange64 on another CPU could obtain an inconsistent value.
Interlocked List Access
The Windows NT executive offers three sets of support functions for dealing with linked lists in a thread-safe and multiprocessor-safe way. These functions support doubly-linked lists, singly-linked lists, and a special kind of singly-linked list called an S-List. I discussed noninterlocked doubly-linked and singly-linked lists in the preceding chapter. To close this chapter on synchronization within WDM drivers, I'll explain how to use these interlocked accessing primitives.
If you need the functionality of a FIFO queue, you should use a doubly-linked list. If you need the functionality of a thread-safe and multiprocessor-safe pushdown stack, you should use an S-List. In both cases, to achieve thread safety and multiprocessor safety, you will allocate and initialize a spin lock. The S-List might not actually use the spin lock, however, because the presence of a sequence number might allow the kernel to implement it using just atomic compare-exchange sorts of operations.
The support functions for performing interlocked access to list objects are similar, so I've organized this section along functional lines. I'll explain how to initialize all three kinds of lists. Then I'll explain how to insert an item into all three kinds. After that, I'll explain how to remove items.
Initialization
You can initialize these lists as shown here:
LIST_ENTRY DoubleHead;
SINGLE_LIST_ENTRY SingleHead;
SLIST_HEADER SListHead;

InitializeListHead(&DoubleHead);
SingleHead.Next = NULL;
ExInitializeSListHead(&SListHead);
Don't forget that you must also allocate and initialize a spin lock for each list. Furthermore, the storage for the list heads and all the items you put into the lists must come from nonpaged memory because the support routines perform their accesses at elevated IRQL. Note that the spin lock isn't used during initialization of the list head because it doesn't make any sense to allow contention for list access before the list has been initialized.
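To make this concrete, here's a sketch of how a driver might embed an interlocked list in its device extension; the structure and field names are invented for illustration:

typedef struct _DEVICE_EXTENSION {
	LIST_ENTRY PendingList;	// FIFO of pending work items (hypothetical)
	KSPIN_LOCK PendingLock;	// guards PendingList
	// ... other fields ...
	} DEVICE_EXTENSION, *PDEVICE_EXTENSION;

// In AddDevice or another PASSIVE_LEVEL initialization routine,
// where pdx points to the device extension (in nonpaged memory):
InitializeListHead(&pdx->PendingList);
KeInitializeSpinLock(&pdx->PendingLock);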
Inserting Items
You can insert items at the head and tail of a doubly-linked list and at the head (only) of a singly-linked list or an S-List:
PLIST_ENTRY pdElement, pdPrevHead, pdPrevTail;
PSINGLE_LIST_ENTRY psElement, psPrevHead;
PKSPIN_LOCK spinlock;

pdPrevHead = ExInterlockedInsertHeadList(&DoubleHead, pdElement, spinlock);
pdPrevTail = ExInterlockedInsertTailList(&DoubleHead, pdElement, spinlock);
psPrevHead = ExInterlockedPushEntryList(&SingleHead, psElement, spinlock);
psPrevHead = ExInterlockedPushEntrySList(&SListHead, psElement, spinlock);
The return values are the addresses of the elements previously at the head (or tail) of the list in question. Note that the element addresses you use with these functions are the addresses of list entry structures that are usually embedded in larger structures of some kind, and you'll need to use the CONTAINING_RECORD macro to recover the address of the surrounding structure.
Removing Items
You can remove items from the head of any of these lists:
pdElement = ExInterlockedRemoveHeadList(&DoubleHead, spinlock);
psElement = ExInterlockedPopEntryList(&SingleHead, spinlock);
psElement = ExInterlockedPopEntrySList(&SListHead, spinlock);
The return values are NULL if the respective lists are empty. Be sure to test the return value for NULL before applying the CONTAINING_RECORD macro to recover a containing structure pointer.
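Putting the pieces together, here's a sketch of draining one item from the list initialized earlier; WORKITEM and its fields are hypothetical:

typedef struct _WORKITEM {
	LIST_ENTRY ListEntry;	// embedded linkage field
	ULONG Opcode;			// hypothetical payload
	} WORKITEM, *PWORKITEM;

PLIST_ENTRY entry = ExInterlockedRemoveHeadList(&pdx->PendingList, &pdx->PendingLock);
if (entry)
	{						// the list wasn't empty
	PWORKITEM item = CONTAINING_RECORD(entry, WORKITEM, ListEntry);
	// ... process item->Opcode, then free or recycle the item ...
	}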
IRQL Restrictions
You can call the S-List functions only while running at or below DISPATCH_LEVEL. The ExInterlockedXxx functions for accessing doubly-linked or singly-linked lists can be called at any IRQL so long as all references to the list use an ExInterlockedXxx call. The reason for no IRQL restrictions is that the implementations of these functions disable interrupts, which is tantamount to raising IRQL to the highest possible level. Once interrupts are disabled, these functions then acquire the spin lock you've specified. Since no other code can gain control on the same CPU, and since no code on another CPU can acquire the spin lock, your lists are protected.
NOTE
The DDK documentation states this rule in an overly restrictive way for at least some of the ExInterlockedXxx functions. It says that all callers must be running at some single IRQL less than or equal to the DIRQL of your interrupt object. There is, in fact, no requirement that all callers be at the same IRQL because you can call the functions at any IRQL. Likewise, no <= DIRQL restriction exists either, but there's also no reason for the code you and I write to raise IRQL higher than that.
It's perfectly OK for you to use ExInterlockedXxx calls to access a singly-linked or doubly-linked list (but not an S-List) in some parts of your code and to use the noninterlocked functions (InsertHeadList and so on) in other parts of your code if you follow a simple rule. Before using a noninterlocked primitive, acquire the same spin lock that your interlocked calls use. Furthermore, restrict list access to code running at or below DISPATCH_LEVEL. For example:
// Access list using noninterlocked calls:
VOID Function1()
	{
	ASSERT(KeGetCurrentIrql() <= DISPATCH_LEVEL);
	KIRQL oldirql;
	KeAcquireSpinLock(spinlock, &oldirql);
	InsertHeadList(...);
	RemoveTailList(...);
	KeReleaseSpinLock(spinlock, oldirql);
	}

// Access list using interlocked calls:
VOID Function2()
	{
	ASSERT(KeGetCurrentIrql() <= DISPATCH_LEVEL);
	ExInterlockedInsertTailList(..., spinlock);
	}
The first function must be running at or below DISPATCH_LEVEL because that's a requirement of calling KeAcquireSpinLock. The reason for the IRQL restriction on the interlocked calls in the second function is as follows: Suppose Function1 acquires the spin lock in preparation for performing some list accesses. Acquiring the spin lock raises IRQL to DISPATCH_LEVEL. Now suppose an interrupt occurs on the same CPU at a higher IRQL and Function2 gains control to use one of the ExInterlockedXxx routines. The kernel will now attempt to acquire the same spin lock, and the CPU will deadlock. This problem arises from allowing code running at two different IRQLs to use the same spin lock: Function1 is at DISPATCH_LEVEL, and Function2 is (practically speaking, anyway) at HIGH_LEVEL when it tries to recursively acquire the lock.