Kernel Dispatcher Objects

The kernel provides five types of synchronization objects that you can use to control the flow of nonarbitrary threads. See Table 4-1 for a summary of these kernel dispatcher object types and their uses. At any moment, each of these objects is in one of two states: signaled or not-signaled. At times when it's permissible for you to block the thread in whose context you're running, you can wait for one or more objects to reach the signaled state by calling KeWaitForSingleObject or KeWaitForMultipleObjects. The kernel also provides routines for initializing and controlling the state of each of these objects.

Table 4-1. Kernel Dispatcher Objects

Object      Data Type    Description
Event       KEVENT       Blocks a thread until some other thread detects that an event has occurred
Semaphore   KSEMAPHORE   Used instead of an event when an arbitrary number of wait calls can be satisfied
Mutex       KMUTEX       Excludes other threads from executing a particular section of code
Timer       KTIMER       Delays execution of a thread for a given period of time
Thread      KTHREAD      Blocks one thread until another thread terminates

In the next few sections, I'll describe how to use the kernel dispatcher objects. I'll start by explaining when you can block a thread by calling one of the wait primitives, and then I'll discuss the support routines that you use with each of the object types. I'll finish this section by discussing the related concepts of thread alerts and asynchronous procedure call delivery.

How and When You Can Block

To understand when and how it's permissible for a WDM driver to block a thread on a kernel dispatcher object, you have to recall some of the basic facts about threads from Chapter 2. In general, whatever thread was executing at the time of a software or hardware interrupt continues to be the current thread while the kernel processes the interrupt. We speak of executing kernel-mode code in the context of this current thread. In response to interrupts of various kinds, the scheduler might decide to switch threads, of course, in which case a new thread becomes current.

We use the terms arbitrary thread context and nonarbitrary thread context to describe the precision with which we can know the thread in whose context we're currently operating in a driver subroutine. If we know that we're in the context of the thread that initiated an I/O request, the context is not arbitrary. Much of the time, however, a WDM driver can't know this fact because chance usually controls which thread is active when the interrupt occurs that results in the driver being called. When applications issue I/O requests, they cause a transition from user mode to kernel mode. The I/O Manager routines that create an IRP and send it to a driver dispatch routine continue to operate in this nonarbitrary thread context, as does the first dispatch routine to see the IRP. We use the term highest-level driver to describe the driver whose dispatch routine first receives the IRP.

As a general rule, only a highest-level driver can know for sure that it's operating in a nonarbitrary thread context. Let's suppose you are a dispatch routine in a lower-level driver, and you're wondering whether you're getting called in an arbitrary thread. If the highest-level driver just sent you an IRP directly from its dispatch routine, you'd be in the original, nonarbitrary thread. But suppose that driver had put an IRP on a queue and then returned to the application. That driver would have removed the IRP from the queue in an arbitrary thread and then sent it or another IRP to you. Unless you know that didn't happen, you should assume you're in an arbitrary thread if you're not the highest-level driver.

Notwithstanding what I just said, in many situations you can be sure of the thread context. Your DriverEntry and AddDevice routines are called in a system thread that you can block if you need to. You won't often need to explicitly block inside these routines, but you could if you wanted to. You receive IRP_MJ_PNP requests in a system thread too. In many cases, you must block that thread to correctly process the request. Finally, you'll sometimes receive I/O requests directly from an application, in which case you'll know you're in a thread belonging to the application.

NOTE
Microsoft uses the term highest-level driver primarily to distinguish between file system drivers and the storage device drivers they call to do actual I/O. The file system driver is highest level, while the storage driver is not. It would be easy to confuse this concept with the layering of WDM drivers, but it's not the same. The way I think of things is that all the WDM drivers for a given piece of hardware, including all the filter drivers, the function driver, and the bus driver, are collectively either highest level or not. A filter driver has no business queuing an IRP that, but for the intervention of the filter, would have flowed down the stack in the original thread context. So if the thread context was nonarbitrary when the IRP got to the topmost filter device object (FiDO), it should still be nonarbitrary in every lower dispatch routine.

Also recall from the discussion earlier in this chapter that you must not block a thread if you're executing at or above DISPATCH_LEVEL.

Having recalled these facts about thread context and IRQL, we can state a simple rule about when it's OK to block a thread:

Block only the thread that originated the request you re working on, and only when executing at IRQL strictly less than DISPATCH_LEVEL.

Several of the dispatcher objects, and the so-called Executive Fast Mutex I'll discuss later in this chapter, offer mutual exclusion functionality. That is, they permit one thread to access a given shared resource without interference from other threads. This is pretty much what a spin lock does, so you might wonder how to choose between synchronization methods. In general, I think you should prefer to synchronize below DISPATCH_LEVEL if you can because that strategy allows a thread that owns a mutual exclusion lock to cause page faults and to be preempted by other threads if it continues to hold the lock for a long time. In addition, this strategy allows each CPU to continue doing useful work even when a thread blocks on it to acquire the lock, whereas a CPU spinning on a spin lock accomplishes nothing until the lock comes free. If any of the code that accesses a shared resource can run at DISPATCH_LEVEL, though, you must use a spin lock because the DISPATCH_LEVEL code might interrupt code running at lower IRQL.

Waiting on a Single Dispatcher Object

You call KeWaitForSingleObject as illustrated in the following example:

ASSERT(KeGetCurrentIrql() <= DISPATCH_LEVEL);
LARGE_INTEGER timeout;
NTSTATUS status = KeWaitForSingleObject(object, WaitReason, WaitMode,
    Alertable, &timeout);

As suggested by the ASSERT, you must be executing at or below DISPATCH_LEVEL to even call this service routine.

In this call, object points to the object you want to wait on. Although this argument is typed as a PVOID, it should be a pointer to one of the dispatcher objects listed in Table 4-1. The object must be in nonpaged memory: for example, in a device extension structure or other data area allocated from the nonpaged pool. For most purposes, the execution stack can be considered nonpaged.

WaitReason is a purely advisory value chosen from the KWAIT_REASON enumeration. No code in the kernel actually cares what value you supply here, so long as you don't specify WrQueue. (Internally, scheduler code bases some decisions on whether a thread is currently blocked for this reason.) The reason a thread is blocked is saved in an opaque data structure, though. If you knew more about that data structure and were trying to debug a deadlock of some kind, you could perhaps gain clues from the reason code. The bottom line: always specify Executive for this parameter; there's no reason to say anything else.

WaitMode is one of the two values of the MODE enumeration: KernelMode or UserMode. Alertable is a simple Boolean value. Unlike WaitReason, these parameters do make a difference in the way the system behaves by controlling whether the wait can be terminated early in order to deliver asynchronous procedure calls of various kinds. I'll explain these interactions in more detail in Thread Alerts and APCs later in this chapter. Waiting in user mode also authorizes the Memory Manager to swap your thread's kernel-mode stack out. You'll see examples in this book and elsewhere in which drivers create event objects, for instance, as automatic variables on that stack. A bug check would result if some other thread were to call KeSetEvent at elevated IRQL at a time when such an event object was absent from memory. The bottom line: you should probably always wait in KernelMode and specify FALSE for the Alertable parameter.

The last parameter to KeWaitForSingleObject is the address of a 64-bit timeout value, expressed in 100-nanosecond units. A positive number for the timeout is an absolute timestamp relative to the January 1, 1601, epoch of the system clock. You can determine the current time by calling KeQuerySystemTime, and you can add a constant to that value. A negative number is an interval relative to the current time. If you specify an absolute time, a subsequent change to the system clock alters the duration of the timeout you might experience. That is, the timeout doesn't expire until the system clock equals or exceeds whatever absolute value you specify. In contrast, if you specify a relative timeout, the duration of the timeout you experience is unaffected by changes in the system clock.

Why January 1, 1601?

Years ago, when I was first learning the Win32 API, I was bemused by the choice of January 1, 1601, as the origin for the timestamps in Windows NT. I understood the reason for this choice when I had occasion to write a set of conversion routines. Everyone knows that years divisible by 4 are leap years. Many people know that century years (such as 1900) are exceptions: they're not leap years even though they're divisible by 4. A few people know that every fourth century year (such as 1600 and 2000) is an exception to the exception: they are leap years. January 1, 1601, was the start of a 400-year cycle that ends in a leap year. If you base timestamps on this origin, it's possible to write programs that convert a Windows NT timestamp to a conventional representation of the date (and vice versa) without scattering special cases throughout the arithmetic.

Specifying a zero timeout causes KeWaitForSingleObject to return immediately with a status code indicating whether the object is in the signaled state. If you're executing at DISPATCH_LEVEL, you must specify a zero timeout because blocking is not allowed. Each kernel dispatcher object offers a KeReadStateXxx service function that allows you to determine the state of the object. Reading the state isn't completely equivalent to waiting for zero time, however: when KeWaitForSingleObject discovers that the wait is satisfied, it performs the side effects that the particular object requires. In contrast, reading the state of the object doesn't perform those side effects, even if the object is already signaled and a wait would be satisfied if it were requested right now.

Specifying a NULL pointer for the timeout parameter is OK and indicates an infinite wait.

The return value indicates one of several possible results. STATUS_SUCCESS is the result you expect and indicates that the wait was satisfied. That is, either the object was in the signaled state when you made the call to KeWaitForSingleObject or else the object was in the not-signaled state and later became signaled. When the wait is satisfied in this way, side effects might need to be performed on the object. The nature of these side effects depends on the type of the object, and I'll explain them later in this chapter in connection with each type of object. (For example, a synchronization type of event will be reset after your wait is satisfied.)

The return value STATUS_TIMEOUT indicates that the specified timeout occurred without the object reaching the signaled state. If you specify a zero timeout, KeWaitForSingleObject returns immediately with either this code (indicating that the object is not-signaled) or STATUS_SUCCESS (indicating that the object is signaled). This return value isn t possible if you specify a NULL timeout parameter pointer because you thereby request an infinite wait.

Two other return values are possible. STATUS_ALERTED and STATUS_USER_APC mean that the wait has terminated without the object having been signaled because the thread has received an alert or a user-mode APC, respectively. I'll discuss these concepts a bit further on in Thread Alerts and APCs.

Note that STATUS_TIMEOUT, STATUS_ALERTED, and STATUS_USER_APC all pass the NT_SUCCESS test. Therefore, don't simply use NT_SUCCESS on the return code from KeWaitForSingleObject in the expectation that it will distinguish between cases in which the object was signaled and cases in which the object was not signaled.

Windows 98/Me Compatibility Note

KeWaitForSingleObject and KeWaitForMultipleObjects have a horrible bug in Windows 98 and Millennium in that they can return the undocumented and nonsensical value 0xFFFFFFFF in two situations. One situation occurs when a thread terminates while blocked on a WDM object. The wait returns early with this bogus code. This return code should never appear (it's undocumented), and the wait shouldn't terminate early unless you specify TRUE for the Alertable parameter. You can work around this problem by just reissuing the wait.

The other circumstance in which you can get the bogus return occurs if the thread you're trying to block is already blocked. How, you might well ask, could you be executing in the context of a thread that's really blocked? This situation happens in Windows 98/Me when someone blocks on a VxD-level object with the BLOCK_SVC_INTS flag and the system later calls a function in your driver at what's called event time. You can nominally be in the context of the blocked thread, and you simply cannot block a second time on a WDM object. In fact, I've even seen KeWaitForSingleObject return with the IRQL raised to DISPATCH_LEVEL in this circumstance. As far as I know, there's no workaround for the problem. Thankfully, it seems to occur only with drivers for serial devices, in which there's a crossover between VxD and WDM code.

Waiting on Multiple Dispatcher Objects

KeWaitForMultipleObjects is a companion function to KeWaitForSingleObject that you use when you want to wait for one or all of several dispatcher objects simultaneously. Call this function as in this example:

ASSERT(KeGetCurrentIrql() <= DISPATCH_LEVEL);
LARGE_INTEGER timeout;
NTSTATUS status = KeWaitForMultipleObjects(count, objects, WaitType,
    WaitReason, WaitMode, Alertable, &timeout, waitblocks);

Here objects is the address of an array of pointers to dispatcher objects, and count is the number of pointers in the array. The count must be less than or equal to the value MAXIMUM_WAIT_OBJECTS, which currently equals 64. The array, as well as each of the objects to which the elements of the array point, must be in nonpaged memory. WaitType is one of the enumeration values WaitAll or WaitAny and specifies whether you want to wait until all of the objects are simultaneously in the signaled state or whether, instead, you want to wait until any one of the objects is signaled.

The waitblocks argument points to an array of KWAIT_BLOCK structures that the kernel will use to administer the wait operation. You don't need to initialize these structures in any way; the kernel just needs to know where the storage is for the group of wait blocks that it will use to record the status of each of the objects during the wait. If you're waiting for a small number of objects (specifically, a number no bigger than THREAD_WAIT_OBJECTS, which currently equals 3), you can supply NULL for this parameter. If you supply NULL, KeWaitForMultipleObjects uses a preallocated array of wait blocks that lives in the thread object. If you're waiting for more objects than this, you must provide nonpaged memory that's at least count * sizeof(KWAIT_BLOCK) bytes in length.

The remaining arguments to KeWaitForMultipleObjects are the same as the corresponding arguments to KeWaitForSingleObject, and most return codes have the same meaning.

If you specify WaitAll, the return value STATUS_SUCCESS indicates that all the objects managed to reach the signaled state simultaneously. If you specify WaitAny, the return value is numerically equal to the objects array index of the single object that satisfied the wait. If more than one of the objects happens to be signaled, you'll be told about just one of them: maybe the lowest-numbered of all the ones that are signaled at that moment, but maybe some other one. You can think of this value as being STATUS_WAIT_0 plus the array index. You can't simply perform the usual NT_SUCCESS test of the returned status before extracting the array index from the status code, though, because other possible return codes (including STATUS_TIMEOUT, STATUS_ALERTED, and STATUS_USER_APC) would also pass the test. Use code like this:

NTSTATUS status = KeWaitForMultipleObjects(...);
if ((ULONG) status < count)
  {
  ULONG iSignaled = (ULONG) status - (ULONG) STATUS_WAIT_0;
  ...
  }

When KeWaitForMultipleObjects returns a status code equal to an object's array index in a WaitAny case, it also performs the operations required by that object. If more than one object is signaled and you specified WaitAny, the operations are performed only for the one that's deemed to satisfy the wait and whose index is returned. That object isn't necessarily the first one in your array that happens to be signaled.

Kernel Events

You use the service functions listed in Table 4-2 to work with kernel event objects. To initialize an event object, first reserve nonpaged storage for an object of type KEVENT and then call KeInitializeEvent:

ASSERT(KeGetCurrentIrql() == PASSIVE_LEVEL);
KeInitializeEvent(event, EventType, initialstate);

Event is the address of the event object. EventType is one of the enumeration values NotificationEvent and SynchronizationEvent. A notification event has the characteristic that, when it is set to the signaled state, it stays signaled until it's explicitly reset to the not-signaled state. Furthermore, all threads that wait on a notification event are released when the event is signaled. This is like a manual-reset event in user mode. A synchronization event, on the other hand, gets reset to the not-signaled state as soon as a single thread gets released. This is what happens in user mode when someone calls SetEvent on an auto-reset event object. The only operation performed on an event object by KeWaitXxx is to reset a synchronization event to not-signaled. Finally, initialstate is TRUE to specify that the initial state of the event is to be signaled and FALSE to specify that the initial state is to be not-signaled.

Table 4-2. Service Functions for Use with Kernel Event Objects

Service Function     Description
KeClearEvent         Sets event to not-signaled; doesn't report previous state
KeInitializeEvent    Initializes event object
KeReadStateEvent     Determines current state of event (Windows XP and Windows 2000 only)
KeResetEvent         Sets event to not-signaled; returns previous state
KeSetEvent           Sets event to signaled; returns previous state

NOTE
In this series of sections on synchronization primitives, I'm repeating the IRQL restrictions that the DDK documentation describes. In the current release of Microsoft Windows XP, the DDK is sometimes more restrictive than the operating system actually is. For example, KeClearEvent can be called at any IRQL, not just at or below DISPATCH_LEVEL. KeInitializeEvent can be called at any IRQL, not just at PASSIVE_LEVEL. However, you should regard the statements in the DDK as being tantamount to saying that Microsoft might someday impose the documented restriction, which is why I haven't tried to report the true state of affairs.

You can call KeSetEvent to place an event in the signaled state:

ASSERT(KeGetCurrentIrql() <= DISPATCH_LEVEL);
LONG wassignaled = KeSetEvent(event, boost, wait);

As implied by the ASSERT, you must be running at or below DISPATCH_LEVEL to call this function. The event argument is a pointer to the event object in question, and boost is a value to be added to a waiting thread's priority if setting the event results in satisfying someone's wait. See the sidebar That Pesky Third Argument to KeSetEvent for an explanation of the Boolean wait argument, which a WDM driver would almost never want to specify as TRUE. The return value is nonzero if the event was already in the signaled state before the call and 0 if the event was in the not-signaled state.

A multitasking scheduler needs to artificially boost the priority of a thread that waits for I/O operations or synchronization objects in order to avoid starving threads that spend lots of time waiting. This is because a thread that blocks for some reason generally relinquishes its time slice and won t regain the CPU until either it has a relatively higher priority than other eligible threads or other threads that have the same priority finish their time slices. A thread that never blocks, however, gets to complete its time slices. Unless a boost is applied to the thread that repeatedly blocks, therefore, it will spend a lot of time waiting for CPU-bound threads to finish their time slices.

You and I won't always have a good idea of what value to use for a priority boost. A good rule of thumb is to specify IO_NO_INCREMENT unless you have a good reason not to. If setting the event is going to wake up a thread that's dealing with a time-sensitive data flow (such as a sound driver), supply the boost that's appropriate to that kind of device (such as IO_SOUND_INCREMENT). The important thing is not to boost the waiter for a silly reason. For example, if you're trying to handle an IRP_MJ_PNP request synchronously (see Chapter 6), you'll be waiting for lower-level drivers to handle the IRP before you proceed, and your completion routine will be calling KeSetEvent. Since Plug and Play requests have no special claim on the processor and occur only infrequently, specify IO_NO_INCREMENT, even for a sound card.

That Pesky Third Argument to KeSetEvent

The purpose of the wait argument to KeSetEvent is to allow internal code to hand off control from one thread to another very quickly. System components other than device drivers can, for example, create paired event objects that are used by client and server threads to gate their communication. When the server wants to wake up its paired client, it will call KeSetEvent with the wait argument set to TRUE and then immediately call KeWaitXxx to put itself to sleep. The use of wait allows these two operations to be done atomically so that no other thread can be awakened in between and possibly wrest control from the client and the server.

The DDK has always sort of described what happens internally, but I've found the explanation confusing. I'll try to explain it in a different way so that you can see why you should always say FALSE for this parameter. Internally, the kernel uses a dispatcher database lock to guard operations related to thread blocking, waking, and scheduling. KeSetEvent needs to acquire this lock, and so do the KeWaitXxx routines. If you say TRUE for the wait argument, KeSetEvent sets a flag so that KeWaitXxx will know you did so, and it returns to you without releasing this lock. When you turn around and (immediately, please: you're running at a higher IRQL than every hardware device, and you own a spin lock that's very frequently in contention) call KeWaitXxx, it needn't acquire the lock all over again. The net effect is that you'll wake up the waiting thread and put yourself to sleep without giving any other thread a chance to start running.

You can see, first of all, that a function that calls KeSetEvent with wait set to TRUE has to be in nonpaged memory because it will execute briefly above DISPATCH_LEVEL. But it's hard to imagine why an ordinary device driver would even need to use this mechanism because it would almost never know better than the kernel which thread ought to be scheduled next. The bottom line: always say FALSE for this parameter. In fact, it's not clear why the parameter has even been exposed to tempt us.

You can determine the current state of an event (at any IRQL) by calling KeReadStateEvent:

LONG signaled = KeReadStateEvent(event);

The return value is nonzero if the event is signaled, 0 if it's not-signaled.

NOTE
KeReadStateEvent isn't supported in Microsoft Windows 98/Me, even though the other KeReadStateXxx functions described here are. The absence of support has to do with how events and other synchronization primitives are implemented in Windows 98/Me.

You can determine the current state of an event and, immediately thereafter, place it in the not-signaled state by calling the KeResetEvent function (at or below DISPATCH_LEVEL):

ASSERT(KeGetCurrentIrql() <= DISPATCH_LEVEL);
LONG signaled = KeResetEvent(event);

If you're not interested in the previous state of the event, you can save a little time by calling KeClearEvent instead:

ASSERT(KeGetCurrentIrql() <= DISPATCH_LEVEL);
KeClearEvent(event);

KeClearEvent is faster because it doesn't need to capture the current state of the event before setting it to not-signaled. But beware of calling KeClearEvent when another thread might be using the same event, since there's no good way to control the races between your clearing the event and some other thread setting it or waiting on it.

Using a Synchronization Event for Mutual Exclusion

I'll tell you later in this chapter about two types of mutual exclusion objects (a kernel mutex and an executive fast mutex) that you can use to limit access to shared data in situations in which a spin lock is inappropriate for some reason. Sometimes you can simply use a synchronization event for this purpose. First define the event in nonpaged memory, as follows:

typedef struct _DEVICE_EXTENSION {
  ...
  KEVENT lock;
  } DEVICE_EXTENSION, *PDEVICE_EXTENSION;

Initialize it as a synchronization event in the signaled state:

KeInitializeEvent(&pdx->lock, SynchronizationEvent, TRUE);

Enter your lightweight critical section by waiting on the event. Leave by setting the event.

KeWaitForSingleObject(&pdx->lock, Executive, KernelMode, FALSE, NULL);
...
KeSetEvent(&pdx->lock, EVENT_INCREMENT, FALSE);

Use this trick only in a system thread, though, to prevent a user-mode call to NtSuspendThread from creating a deadlock. (This deadlock can easily happen if a user-mode debugger is running on the same process.) If you're running in a user thread, you should prefer to use an executive fast mutex. Don't use this trick at all for code that executes in the paging path, as explained later in connection with the unsafe way of acquiring an executive fast mutex.

Kernel Semaphores

A kernel semaphore is an integer counter with associated synchronization semantics. The semaphore is considered signaled when the counter is positive and not-signaled when the counter is 0. The counter cannot take on a negative value. Releasing a semaphore increases the counter, whereas successfully waiting on a semaphore decrements the counter. If the decrement makes the count 0, the semaphore is then considered not-signaled, with the consequence that other KeWaitXxx callers who insist on finding it signaled will block. Note that if more threads are waiting for a semaphore than the value of the counter, not all of the waiting threads will be unblocked.

The kernel provides three service functions to control the state of a semaphore object. (See Table 4-3.) You initialize a semaphore by making the following function call at PASSIVE_LEVEL:

ASSERT(KeGetCurrentIrql() == PASSIVE_LEVEL);
KeInitializeSemaphore(semaphore, count, limit);

In this call, semaphore points to a KSEMAPHORE object in nonpaged memory. The count argument is the initial value of the counter, and limit is the maximum value that the counter will be allowed to take on, which must be at least as large as the initial count.

Table 4-3. Service Functions for Use with Kernel Semaphore Objects

Service Function        Description
KeInitializeSemaphore   Initializes semaphore object
KeReadStateSemaphore    Determines current state of semaphore
KeReleaseSemaphore      Sets semaphore object to the signaled state

If you create a semaphore with a limit of 1, the object is somewhat similar to a mutex in that only one thread at a time will be able to claim it. A kernel mutex has some features that a semaphore lacks, however, to help prevent deadlocks. Accordingly, there's almost no point in creating a semaphore with a limit of 1.

If you create a semaphore with a limit bigger than 1, you have an object that allows multiple threads to access a given resource. A familiar theorem in queuing theory dictates that providing a single queue for multiple servers is more fair (that is, results in less variation in waiting times) than providing a separate queue for each of several servers. The average waiting time is the same in both cases, but the variation in waiting times is smaller with the single queue. (This is why queues in stores are increasingly organized so that customers wait in a single line for the next available clerk.) This kind of semaphore allows you to organize a set of software or hardware servers to take advantage of that theorem.

The owner (or one of the owners) of a semaphore releases its claim to the semaphore by calling KeReleaseSemaphore:

ASSERT(KeGetCurrentIrql() <= DISPATCH_LEVEL);
LONG wassignaled = KeReleaseSemaphore(semaphore, boost, delta, wait);

This operation adds delta, which must be positive, to the counter associated with semaphore, thereby putting the semaphore in the signaled state and allowing other threads to be released. In most cases, you'll specify 1 for this parameter to indicate that one claimant of the semaphore is releasing its claim. The boost and wait parameters have the same import as the corresponding parameters to KeSetEvent, discussed earlier. The return value is 0 if the previous state of the semaphore was not-signaled and nonzero if the previous state was signaled.

KeReleaseSemaphore doesn't allow you to increase the counter beyond the limit specified when you initialized the semaphore. If you try, it doesn't adjust the counter at all, and it raises an exception with the code STATUS_SEMAPHORE_LIMIT_EXCEEDED. Unless someone has a structured exception handler to trap the exception, a bug check will eventuate.

You can also interrogate the current state of a semaphore with this call:

ASSERT(KeGetCurrentIrql() <= DISPATCH_LEVEL);
LONG signaled = KeReadStateSemaphore(semaphore);

The return value is nonzero if the semaphore is signaled and 0 if the semaphore is not-signaled. You shouldn't assume that the return value is the current value of the counter; it could be any nonzero value if the counter is positive.

Having told you all this about how to use kernel semaphores, I feel I ought to tell you that I've never seen a driver that uses one of them.

Kernel Mutexes

The word mutex is a contraction of mutual exclusion. A kernel mutex object provides one method (and not necessarily the best one) to serialize access by competing threads to a given shared resource. The mutex is considered signaled if no thread owns it and not-signaled if a thread currently does own it. When a thread gains control of a mutex after calling one of the KeWaitXxx routines, the kernel also prevents delivery of any but special kernel APCs to help avoid possible deadlocks. This is the operation referred to in the earlier discussion of KeWaitForSingleObject (in the section Waiting on a Single Dispatcher Object ).

It's generally better to use an executive fast mutex rather than a kernel mutex, as I'll explain in more detail later in "Fast Mutex Objects." The main difference between the two is that acquiring a fast mutex raises the IRQL to APC_LEVEL, whereas acquiring a kernel mutex doesn't change the IRQL. Among the reasons you care about this fact is that completion of so-called synchronous IRPs requires delivery of a special kernel-mode APC, which cannot occur if the IRQL is higher than PASSIVE_LEVEL. Thus, you can create and use synchronous IRPs while owning a kernel mutex but not while owning an executive fast mutex. Another reason for caring arises for drivers that execute in the paging path, as elaborated later on in connection with the unsafe way of acquiring an executive fast mutex.

Another, less important, difference between the two kinds of mutex object is that a kernel mutex can be acquired recursively, whereas an executive fast mutex cannot. That is, the owner of a kernel mutex can make a subsequent call to KeWaitXxx specifying the same mutex and have the wait immediately satisfied. A thread that does this must release the mutex an equal number of times before the mutex will be considered free.
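To illustrate the recursion rule, the following sketch shows a thread acquiring the same kernel mutex twice; every wait must be balanced by a release before the mutex returns to the signaled state:

```
// First acquisition: mutex becomes not-signaled (owned by this thread)
KeWaitForSingleObject(mutex, Executive, KernelMode, FALSE, NULL);
// Recursive acquisition by the same thread: satisfied immediately
KeWaitForSingleObject(mutex, Executive, KernelMode, FALSE, NULL);
KeReleaseMutex(mutex, FALSE);   // still owned; one more release needed
KeReleaseMutex(mutex, FALSE);   // now unowned (signaled)
```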

Table 4-4 lists the service functions you use with mutex objects.

Table 4-4. Service Functions for Use with Kernel Mutex Objects

Service Function

Description

KeInitializeMutex

Initializes mutex object

KeReadStateMutex

Determines current state of mutex

KeReleaseMutex

Sets mutex object to the signaled state

To create a mutex, you reserve nonpaged memory for a KMUTEX object and make the following initialization call:

ASSERT(KeGetCurrentIrql() == PASSIVE_LEVEL);
KeInitializeMutex(mutex, level);

where mutex is the address of the KMUTEX object, and level is a parameter originally intended to help avoid deadlocks when your own code uses more than one mutex. Since the kernel currently ignores the level parameter, I'm not going to attempt to describe what it used to mean.

The mutex begins life in the signaled (that is, unowned) state. An immediate call to KeWaitXxx would take control of the mutex and put it in the not-signaled state.

You can interrogate the current state of a mutex with this function call:

ASSERT(KeGetCurrentIrql() <= DISPATCH_LEVEL);
LONG signaled = KeReadStateMutex(mutex);

The return value is 0 if the mutex is currently owned, nonzero if it's currently unowned.

The thread that owns a mutex can release ownership and return the mutex to the signaled state with this function call:

ASSERT(KeGetCurrentIrql() <= DISPATCH_LEVEL);
LONG wassignaled = KeReleaseMutex(mutex, wait);

The wait parameter means the same thing as the corresponding argument to KeSetEvent. The return value is always 0 to indicate that the mutex was previously owned because, if this were not the case, KeReleaseMutex would have bugchecked (it being an error for anyone but the owner to release a mutex).
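Putting these calls together, a typical pattern for serializing access to shared data looks something like the following sketch. The KMUTEX member in a device extension (pdx->Mutex) is an illustrative assumption, not something the DDK defines:

```
// Once, at PASSIVE_LEVEL (for example, in your AddDevice routine):
KeInitializeMutex(&pdx->Mutex, 0);

// Later, in a thread context where blocking is permissible:
KeWaitForSingleObject(&pdx->Mutex, Executive, KernelMode, FALSE, NULL);
// ... touch the shared data guarded by the mutex ...
KeReleaseMutex(&pdx->Mutex, FALSE);
```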

Just for the sake of completeness, I want to mention a macro in the DDK named KeWaitForMutexObject. (See WDM.H.) It's defined simply as follows:

#define KeWaitForMutexObject KeWaitForSingleObject

Using this special name offers no benefit at all. You don't even get the benefit of having the compiler insist that the first argument be a pointer to a KMUTEX instead of any random pointer type.

Kernel Timers

The kernel provides a timer object that functions something like an event that automatically signals itself at a specified absolute time or after a specified interval. It's also possible to create a timer that signals itself repeatedly and to arrange for a DPC callback following the expiration of the timer. Table 4-5 lists the service functions you use with timer objects.

Table 4-5. Service Functions for Use with Kernel Timer Objects

Service Function

Description

KeCancelTimer

Cancels an active timer

KeInitializeTimer

Initializes a one-time notification timer

KeInitializeTimerEx

Initializes a one-time or repetitive notification or synchronization timer

KeReadStateTimer

Determines current state of a timer

KeSetTimer

(Re)specifies expiration time for a notification timer

KeSetTimerEx

(Re)specifies expiration time and other properties of a timer

There are several usage scenarios for timers, which I'll describe in the next few sections:

  • Timer used like a self-signaling event

  • Timer with a DPC routine to be called when a timer expires

  • Periodic timer used to call a DPC routine over and over again

Notification Timers Used like Events

In this scenario, we'll create a notification timer object and wait until it expires. First allocate a KTIMER object in nonpaged memory. Then, running at or below DISPATCH_LEVEL, initialize the timer object, as shown here:

PKTIMER timer; // <== someone gives you this
ASSERT(KeGetCurrentIrql() <= DISPATCH_LEVEL);
KeInitializeTimer(timer);

At this point, the timer is in the not-signaled state and isn't counting down; a wait on the timer would never be satisfied. To start the timer counting, call KeSetTimer as follows:

ASSERT(KeGetCurrentIrql() <= DISPATCH_LEVEL);
LARGE_INTEGER duetime;
BOOLEAN wascounting = KeSetTimer(timer, duetime, NULL);

The duetime value is a 64-bit time value expressed in 100-nanosecond units. If the value is positive, it's an absolute time relative to the same January 1, 1601, epoch used for the system timer. If the value is negative, it's an interval relative to the current time. If you specify an absolute time, a subsequent change to the system clock alters the duration of the timeout you experience. That is, the timer doesn't expire until the system clock equals or exceeds whatever absolute value you specify. In contrast, if you specify a relative timeout, the duration of the timeout you experience is unaffected by changes in the system clock. These are the same rules that apply to the timeout parameter to KeWaitXxx.
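For instance, to request a relative five-second timeout, you would build a negative due time in 100-nanosecond units, something like this sketch:

```
LARGE_INTEGER duetime;
duetime.QuadPart = -5 * 10000000LL;   // 5 seconds, expressed as a
                                      // negative (relative) 100-ns count
BOOLEAN wascounting = KeSetTimer(timer, duetime, NULL);
```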

The return value from KeSetTimer, if TRUE, indicates that the timer was already counting down (in which case, our call to KeSetTimer would have canceled it and started the count all over again).

At any time, you can determine the current state of a timer:

ASSERT(KeGetCurrentIrql() <= DISPATCH_LEVEL);
BOOLEAN counting = KeReadStateTimer(timer);

KeInitializeTimer and KeSetTimer are actually older service functions that have been superseded by newer functions. We could have initialized the timer with this call:

ASSERT(KeGetCurrentIrql() <= DISPATCH_LEVEL);
KeInitializeTimerEx(timer, NotificationTimer);

We could also have used the extended version of the set timer function, KeSetTimerEx:

ASSERT(KeGetCurrentIrql() <= DISPATCH_LEVEL);
LARGE_INTEGER duetime;
BOOLEAN wascounting = KeSetTimerEx(timer, duetime, 0, NULL);

I'll explain a bit further on in this chapter the purpose of the extra parameters in these extended versions of the service functions.

Once the timer is counting down, it's still considered to be not-signaled until the specified due time arrives. At that point, the object becomes signaled, and all waiting threads are released. The system guarantees only that the expiration of the timer will be noticed no sooner than the due time you specify. If you specify a due time with a precision finer than the granularity of the system timer (which you can't control), the timeout will be noticed later than the exact instant you specify. You can call KeQueryTimeIncrement to determine the granularity of the system clock.

Notification Timers Used with a DPC

In this scenario, we want expiration of the timer to trigger a DPC. You would choose this method of operation if you wanted to be sure that you could service the timeout no matter what priority level your thread had. (Since you can wait only below DISPATCH_LEVEL, regaining control of the CPU after the timer expires is subject to the normal vagaries of thread scheduling. The DPC, however, executes at elevated IRQL and thereby effectively preempts all threads.)

We initialize the timer object in the same way. We also have to initialize a KDPC object for which we allocate nonpaged memory. For example:

PKDPC dpc; // <== points to KDPC you've allocated
ASSERT(KeGetCurrentIrql() == PASSIVE_LEVEL);
KeInitializeTimer(timer);
KeInitializeDpc(dpc, DpcRoutine, context);

You can initialize the timer object by using either KeInitializeTimer or KeInitializeTimerEx, as you please. DpcRoutine is the address of a deferred procedure call routine, which must be in nonpaged memory. The context parameter is an arbitrary 32-bit value (typed as a PVOID) that will be passed as an argument to the DPC routine. The dpc argument is a pointer to a KDPC object for which you provide nonpaged storage. (It might be in your device extension, for example.)

When we want to start the timer counting down, we specify the DPC object as one of the arguments to KeSetTimer or KeSetTimerEx:

ASSERT(KeGetCurrentIrql() <= DISPATCH_LEVEL);
LARGE_INTEGER duetime;
BOOLEAN wascounting = KeSetTimer(timer, duetime, dpc);

You could also use the extended form KeSetTimerEx if you wanted to. The only difference between this call and the one we examined in the preceding section is that we've specified the DPC object address as an argument. When the timer expires, the system will queue the DPC for execution as soon as conditions permit. This would be at least as soon as you'd be able to wake up from a wait. Your DPC routine would have the following skeletal appearance:

VOID DpcRoutine(PKDPC dpc, PVOID context, PVOID junk1, PVOID junk2)
    {
    ...
    }

For what it's worth, even when you supply a DPC argument to KeSetTimer or KeSetTimerEx, you can still call KeWaitXxx to wait at PASSIVE_LEVEL or APC_LEVEL if you want. On a single-CPU system, the DPC would occur before the wait could finish because it executes at a higher IRQL.
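Pulling these pieces together, a timer-plus-DPC arrangement might look like the following sketch. The device extension members (Timer, TimerDpc, TimeoutEvent) and the idea of signaling an event from the DPC routine are illustrative assumptions:

```
VOID DpcRoutine(PKDPC dpc, PVOID context, PVOID junk1, PVOID junk2)
    {                                   // runs at DISPATCH_LEVEL
    PDEVICE_EXTENSION pdx = (PDEVICE_EXTENSION) context;
    KeSetEvent(&pdx->TimeoutEvent, IO_NO_INCREMENT, FALSE);
    }

// Setup, at PASSIVE_LEVEL:
KeInitializeTimer(&pdx->Timer);
KeInitializeDpc(&pdx->TimerDpc, DpcRoutine, pdx);

LARGE_INTEGER duetime;
duetime.QuadPart = -10000000;           // 1 second, relative
KeSetTimer(&pdx->Timer, duetime, &pdx->TimerDpc);
```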

Synchronization Timers

Like event objects, timer objects come in both notification and synchronization flavors. A notification timer allows any number of waiting threads to proceed once it expires. A synchronization timer, by contrast, allows only a single thread to proceed. Once a thread's wait is satisfied, the timer switches to the not-signaled state. To create a synchronization timer, you must use the extended form of the initialization service function:

ASSERT(KeGetCurrentIrql() <= DISPATCH_LEVEL);
KeInitializeTimerEx(timer, SynchronizationTimer);

SynchronizationTimer is one of the values of the TIMER_TYPE enumeration. The other value is NotificationTimer.

If you use a DPC with a synchronization timer, think of queuing the DPC as being an extra thing that happens when the timer expires. That is, expiration puts the timer in the signaled state and queues a DPC. One thread can be released as a result of the timer being signaled.

The only use I've ever found for a synchronization timer is when you want a periodic timer (see the next section).

Periodic Timers

So far, I've discussed only timers that expire exactly once. By using the extended set timer function, you can also request a periodic timeout:

ASSERT(KeGetCurrentIrql() <= DISPATCH_LEVEL);
LARGE_INTEGER duetime;
BOOLEAN wascounting = KeSetTimerEx(timer, duetime, period, dpc);

Here period is a periodic timeout, expressed in milliseconds (ms), and dpc is an optional pointer to a KDPC object. A timer of this kind expires once at the due time and periodically thereafter. To achieve exact periodic expiration, specify the same relative due time as the interval. Specifying a zero due time causes the timer to expire immediately, whereupon the periodic behavior takes over. It often makes sense to start a periodic timer in conjunction with a DPC object, by the way, because doing so allows you to be notified without having to wait repeatedly for the timeout.

Be sure to call KeCancelTimer to cancel a periodic timer before the KTIMER object or the DPC routine disappears from memory. It's quite embarrassing to let the system unload your driver and, 10 nanoseconds later, call your nonexistent DPC routine. Not only that, but it causes a bug check. These problems are so hard to debug that the Driver Verifier makes a special check for releasing memory that contains an active KTIMER.

An Example

One use for kernel timers is to conduct a polling loop in a system thread dedicated to the task of repeatedly checking a device for activity. Not many devices nowadays need to be served by a polling loop, but yours may be one of the few exceptions. I'll discuss this subject in Chapter 14, and the companion content includes a sample driver (POLLING) that illustrates all of the concepts involved. Part of that sample is the following loop that polls the device at fixed intervals. The logic of the driver is such that the loop can be broken by setting a kill event. Consequently, the driver uses KeWaitForMultipleObjects. The code is actually a bit more complicated than the following fragment, which I've edited to concentrate on the part related to the timer:

VOID PollingThreadRoutine(PDEVICE_EXTENSION pdx)
    {
    NTSTATUS status;
    KTIMER timer;

    KeInitializeTimerEx(&timer, SynchronizationTimer);          // 1

    PVOID pollevents[] = {                                      // 2
        (PVOID) &pdx->evKill,
        (PVOID) &timer,
        };
    C_ASSERT(arraysize(pollevents) <= THREAD_WAIT_OBJECTS);

    LARGE_INTEGER duetime = {0};
    #define POLLING_INTERVAL 500
    KeSetTimerEx(&timer, duetime, POLLING_INTERVAL, NULL);      // 3

    while (TRUE)
        {
        status = KeWaitForMultipleObjects(arraysize(pollevents),// 4
            pollevents, WaitAny, Executive, KernelMode,
            FALSE, NULL, NULL);
        if (status == STATUS_WAIT_0)
            break;
        if (<device needs attention>)                           // 5
            <do something>;
        }

    KeCancelTimer(&timer);
    PsTerminateSystemThread(STATUS_SUCCESS);
    }

  1. Here we initialize a kernel timer. You must specify a SynchronizationTimer here, because a NotificationTimer stays in the signaled state after the first expiration.

  2. We'll need to supply an array of dispatcher object pointers as one of the arguments to KeWaitForMultipleObjects, and this is where we set that up. The first element of the array is the kill event that some other part of the driver might set when it's time for this system thread to exit. The second element is the timer object. The C_ASSERT statement that follows this array verifies that we have few enough objects in our array that we can implicitly use the default array of wait blocks in our thread object.

  3. The KeSetTimerEx statement starts a periodic timer running. The duetime is 0, so the timer goes immediately into the signaled state. It will expire every 500 ms thereafter.

  4. Within our polling loop, we wait for the timer to expire or for the kill event to be set. If the wait terminates because of the kill event, we leave the loop, clean up, and exit this system thread. If the wait terminates because the timer has expired, we go on to the next step.

  5. This is where our device driver would do something related to our hardware.

Alternatives to Kernel Timers

Rather than use a kernel timer object, you can use two other timing functions that might be more appropriate. First of all, you can call KeDelayExecutionThread to wait at PASSIVE_LEVEL for a given interval. This function is obviously less cumbersome than creating, initializing, setting, and awaiting a timer by using separate function calls.

ASSERT(KeGetCurrentIrql() == PASSIVE_LEVEL);
LARGE_INTEGER duetime;
NTSTATUS status = KeDelayExecutionThread(WaitMode, Alertable, &duetime);

Here WaitMode, Alertable, and the returned status code have the same meaning as the corresponding parameters to KeWaitXxx, and duetime is the same kind of timestamp that I discussed previously in connection with kernel timers. Note that this function requires a pointer to a large integer for the timeout parameter, whereas other functions related to timers require the large integer itself.
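As an illustration, a small helper (the name SleepMilliseconds is hypothetical) that blocks the current thread for a given number of milliseconds might look like this:

```
VOID SleepMilliseconds(ULONG ms)
    {
    ASSERT(KeGetCurrentIrql() == PASSIVE_LEVEL);
    LARGE_INTEGER duetime;
    duetime.QuadPart = -(LONGLONG) ms * 10000;  // relative, 100-ns units
    KeDelayExecutionThread(KernelMode, FALSE, &duetime);
    }
```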

If your requirement is to delay for a very brief period of time (less than 50 microseconds), you can call KeStallExecutionProcessor at any IRQL:

KeStallExecutionProcessor(nMicroSeconds);

The purpose of this delay is to allow your hardware time to prepare for its next operation before your program continues executing. The delay might end up being significantly longer than you request because KeStallExecutionProcessor can be preempted by activities that occur at a higher IRQL than that which the caller is using.

Using Threads for Synchronization

The Process Structure component of the operating system provides a few routines that WDM drivers can use for creating and controlling system threads. I'll be discussing these routines later on in Chapter 14 from the perspective of how you can use these functions to help you manage a device that requires periodic polling. For the sake of thoroughness, I want to mention here that you can use a pointer to a kernel thread object in a call to KeWaitXxx to wait for the thread to complete. The thread terminates itself by calling PsTerminateSystemThread.

Before you can wait for a thread to terminate, you need to first obtain a pointer to the opaque KTHREAD object that internally represents that thread, which poses a bit of a problem. While running in the context of a thread, you can determine your own KTHREAD easily:

ASSERT(KeGetCurrentIrql() <= DISPATCH_LEVEL);
PKTHREAD thread = KeGetCurrentThread();

Unfortunately, when you call PsCreateSystemThread to create a new thread, you get back only an opaque HANDLE for the thread. To get the KTHREAD pointer, you use an Object Manager service function:

HANDLE hthread;
PKTHREAD thread;
PsCreateSystemThread(&hthread, ...);
ObReferenceObjectByHandle(hthread, THREAD_ALL_ACCESS, NULL, KernelMode,
    (PVOID*) &thread, NULL);
ZwClose(hthread);

ObReferenceObjectByHandle converts your handle to a pointer to the underlying kernel object. Once you have the pointer, you can discard the handle by calling ZwClose. At some point, you need to release your reference to the thread object by making a call to ObDereferenceObject:

ObDereferenceObject(thread);
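The whole lifecycle, from creation to waiting for termination, thus looks something like the following sketch (ThreadProc and pdx are placeholders for your own thread routine and context argument):

```
HANDLE hthread;
PKTHREAD thread;

// Create the thread and exchange the handle for an object pointer:
PsCreateSystemThread(&hthread, THREAD_ALL_ACCESS, NULL, NULL, NULL,
    ThreadProc, pdx);
ObReferenceObjectByHandle(hthread, THREAD_ALL_ACCESS, NULL, KernelMode,
    (PVOID*) &thread, NULL);
ZwClose(hthread);

// ... later, wait for ThreadProc to call PsTerminateSystemThread:
KeWaitForSingleObject(thread, Executive, KernelMode, FALSE, NULL);
ObDereferenceObject(thread);
```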

Thread Alerts and APCs

Internally, the Windows NT kernel uses thread alerts as a way of waking threads. It uses an asynchronous procedure call as a way of waking a thread to execute some particular subroutine in that thread's context. The support routines that generate alerts or APCs aren't exposed for use by WDM driver writers. But since the DDK documentation and header files contain a great many references to these concepts, I want to finish this discussion of kernel dispatcher objects by explaining them.

I'll start by describing the plumbing: how these two mechanisms work. When someone blocks a thread by calling one of the KeWaitXxx routines, they specify by means of a Boolean argument whether the wait is to be alertable. An alertable wait might finish early, that is, without any of the wait conditions or the timeout being satisfied, because of a thread alert. Thread alerts originate in user mode when someone calls the native API function NtAlertThread. The kernel returns the special status value STATUS_ALERTED when a wait terminates early because of an alert.

An APC is a mechanism whereby the operating system can execute a function in the context of a particular thread. The asynchronous part of an APC stems from the fact that the system effectively interrupts the target thread to execute an out-of-line subroutine.

APCs come in three flavors: user mode, kernel mode, and special kernel mode. User-mode code requests a user-mode APC by calling the Win32 API QueueUserAPC. Kernel-mode code requests an APC by calling an undocumented function for which the DDK headers have no prototype. Diligent reverse engineers probably already know the name of this routine and something about how to call it, but it's really just for internal use and I'm not going to say any more about it. The system queues APCs to a specific thread until appropriate execution conditions exist. Appropriate execution conditions depend on the type of APC, as follows:

  • Special kernel APCs execute as soon as possible, that is, as soon as an activity at APC_LEVEL can be scheduled in the thread. A special kernel APC can even temporarily awaken a blocked thread in many circumstances.

  • Normal kernel APCs execute after all special APCs have been executed but only when the target thread is running and no other kernel-mode APC is executing in this thread. Delivery of normal kernel and user-mode APCs can be blocked by calling KeEnterCriticalRegion.

  • User-mode APCs execute after both flavors of kernel-mode APC for the target thread have been executed but only if the thread has previously been in an alertable wait in user mode. Execution actually occurs the next time the thread is dispatched for execution in user mode.

If the system awakens a thread to deliver a user-mode APC, the wait primitive on which the thread was previously blocked returns with one of the special status values STATUS_KERNEL_APC or STATUS_USER_APC.

The Strange Role of APC_LEVEL

The IRQL named APC_LEVEL works in a way that I found to be unexpected. You're allowed to block a thread running at APC_LEVEL (or at PASSIVE_LEVEL, but we're concerned only with APC_LEVEL right now). An APC_LEVEL thread can also be interrupted by any hardware device, following which a higher-priority thread might become eligible to run. In either situation, the thread scheduler can then give control of the CPU to another thread, which might be running at PASSIVE_LEVEL or APC_LEVEL. In effect, the IRQL levels PASSIVE_LEVEL and APC_LEVEL pertain to a thread, whereas the higher IRQLs pertain to a CPU.

How APCs Work with I/O Requests

The kernel uses the APC concept for several purposes. We're concerned in this book just with writing device drivers, though, so I'm only going to explain how APCs relate to the process of performing an I/O operation. In one of many possible scenarios, when a user-mode program performs a synchronous ReadFile operation on a handle, the Win32 subsystem calls a kernel-mode routine named NtReadFile. NtReadFile creates and submits an IRP to the appropriate device driver, which often returns STATUS_PENDING to indicate that it hasn't finished the operation. NtReadFile returns this status code to ReadFile, which thereupon calls NtWaitForSingleObject to wait on the file object to which the user-mode handle points. NtWaitForSingleObject, in turn, calls KeWaitForSingleObject to perform a nonalertable user-mode wait on an event object within the file object.

When the device driver eventually finishes the read operation, it calls IoCompleteRequest, which queues a special kernel-mode APC. The APC routine calls KeSetEvent to signal the file object, thereby releasing the application to continue execution. Some sort of APC is required because some of the tasks that need to be performed when an I/O request is completed (such as buffer copying) must occur in the address context of the requesting thread. A kernel-mode APC is required because the thread in question is not in an alertable wait state. A special APC is required because the thread is actually ineligible to run at the time we need to deliver the APC. In fact, the APC routine is the mechanism for awakening the thread.

Kernel-mode routines can call ZwReadFile, which turns into a call to NtReadFile. If you obey the injunctions in the DDK documentation when you call ZwReadFile, your call to NtReadFile will look almost like a user-mode call and will be processed in almost the same way, with just two differences. The first, which is quite minor, is that any waiting will be done in kernel mode. The other difference is that if you specified in your call to ZwCreateFile that you wanted to do synchronous operations, the I/O Manager will automatically wait for your read to finish. The wait will be alertable or not, depending on the exact option you specify to ZwCreateFile.

How to Specify Alertable and WaitMode Parameters

Now you have enough background to understand the ramifications of the Alertable and WaitMode parameters in the calls to the various wait primitives. As a general rule, you'll never be writing code that responds synchronously to requests from user mode. You could do so for, say, certain I/O control requests. Generally speaking, however, it's better to pend any operations that take a long time to finish (by returning STATUS_PENDING from your dispatch routine) and to finish them asynchronously. So, to continue speaking generally, you don't often call a wait primitive in the first place. Thread blocking is appropriate in a device driver in only a few scenarios, which I'll describe in the following sections.

Kernel Threads

Sometimes you'll create your own kernel-mode thread (when your device needs to be polled periodically, for example). In this scenario, any waits performed will be in kernel mode because the thread runs exclusively in kernel mode.

Handling Plug and Play Requests

I'll show you in Chapter 6 how to handle the I/O requests that the PnP Manager sends your way. Several such requests require synchronous handling on your part. In other words, you pass them down the driver stack to lower levels and wait for them to complete. You'll be calling KeWaitForSingleObject to wait in kernel mode because the PnP Manager calls you within the context of a kernel-mode thread. In addition, if you need to perform subsidiary requests as part of handling a PnP request (for example, to talk to a universal serial bus (USB) device), you'll be waiting in kernel mode.

Handling Other I/O Requests

When you're handling other sorts of I/O requests and you know that you're running in the context of a nonarbitrary thread that must get the results of your deliberations before proceeding, it might conceivably be appropriate to block that thread by calling a wait primitive. In such a case, you want to wait in the same processor mode as the entity that called you. Most of the time, you can simply rely on the RequestorMode in the IRP you're currently processing. If you gained control by means other than an IRP, you could call ExGetPreviousMode to determine the previous processor mode. If you're going to wait for a long time, it would be well to use the result of these tests as the WaitMode argument in your KeWaitXxx call, and it would also be well to specify TRUE for the Alertable argument.

NOTE
The bottom line: perform nonalertable waits unless you know you shouldn't.



Programming the Microsoft Windows Driver Model
ISBN: 0735618038
Year: 2003
Authors: Walter Oney