Error Handling | Programming the Microsoft Windows Driver Model

Error Handling

To err is human; to recover is part of software engineering. Exceptional conditions are always arising in programs. Some of them start with program bugs, either in our own code or in the user-mode applications that invoke our code. Some of them relate to system load or the instantaneous state of hardware. Whatever the cause, unusual circumstances demand a flexible response from our code. In this section, I'll describe three aspects of error handling: status codes, structured exception handling, and bug checks. In general, kernel-mode support routines report unexpected errors by returning a status code, whereas they report expected variations in normal flow by returning a Boolean or numeric value other than a formal status code. Structured exception handling offers a standardized way to clean up after really unexpected events, such as dereferencing an invalid user-mode pointer, or to avoid the system crash that normally ensues after such events. A bug check is the internal name for a catastrophic failure for which a system shutdown is the only cure.

Status Codes

Kernel-mode support routines (and your code too, for that matter) indicate success or failure by returning a status code to their caller. An NTSTATUS value is a 32-bit integer composed of several subfields, as illustrated in Figure 3-2. The high-order 2 bits denote the severity of the condition being reported: success, information, warning, or error. I'll explain the impact of the customer flag shortly. The facility code indicates which system component originated the message and basically serves to decouple development groups from each other when it comes to assigning numbers to codes. The remainder of the status code (16 bits' worth) indicates the exact condition being reported.

Figure 3-2. Format of an NTSTATUS code.
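As a rough illustration of the layout in Figure 3-2, the following standalone sketch splits a 32-bit status value into its subfields. It is a user-mode model, not driver code: the StatusFields type and decode_status function are hypothetical names, and the bit positions (severity in bits 31-30, customer flag in bit 29, facility in bits 27-16, code in bits 15-0) are taken from the layout comment in ntstatus.h rather than from the text above.

```c
#include <stdint.h>

/* Hypothetical helper type: one member per subfield of an NTSTATUS value. */
typedef struct {
    unsigned severity;  /* bits 31-30: 0 success, 1 information, 2 warning, 3 error */
    unsigned customer;  /* bit 29: customer flag */
    unsigned facility;  /* bits 27-16: facility code */
    unsigned code;      /* bits 15-0: exact condition being reported */
} StatusFields;

/* Split a raw 32-bit status value into its subfields. */
StatusFields decode_status(uint32_t status)
{
    StatusFields f;
    f.severity = (status >> 30) & 0x3u;
    f.customer = (status >> 29) & 0x1u;
    f.facility = (status >> 16) & 0xFFFu;
    f.code     =  status        & 0xFFFFu;
    return f;
}
```

For example, decoding 0xC0000023 (the value of STATUS_BUFFER_TOO_SMALL) yields severity 3 (error), customer flag 0, facility 0, and code 0x23.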

You should always check the status returns from routines that provide them. I'm going to break this rule frequently in some of the code fragments I show you because including all the necessary error handling code often obscures the expository purpose of the fragment. But don't you emulate this sloppy practice!

If the high-order bit of a status code is 0, any number of the remaining bits could be set and the code would still indicate success. Consequently, never just compare status codes with 0 to see whether you're dealing with success; instead, use the NT_SUCCESS macro:

NTSTATUS status = SomeFunction(...);
if (!NT_SUCCESS(status))
{
  // handle error
}
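To see why a comparison with 0 is not enough, here is a small standalone sketch that contrasts the two tests. The MODEL_ names and the two helper functions are hypothetical stand-ins so that the fragment compiles outside the DDK; the numeric values are the ones ntstatus.h assigns to the corresponding STATUS_ constants, and MODEL_NT_SUCCESS simply tests the high-order bit, which is what the rule above amounts to.

```c
#include <stdint.h>

/* Stand-in for NT_SUCCESS: success means the high-order bit is clear. */
#define MODEL_NT_SUCCESS(s) ((((uint32_t)(s)) & 0x80000000u) == 0)

#define MODEL_STATUS_SUCCESS         0x00000000u
#define MODEL_STATUS_PENDING         0x00000103u  /* success severity, nonzero */
#define MODEL_STATUS_BUFFER_OVERFLOW 0x80000005u  /* warning severity */
#define MODEL_STATUS_UNSUCCESSFUL    0xC0000001u  /* error severity */

/* The tempting but wrong test: only the exact value 0 counts as success. */
int is_success_wrong(uint32_t status) { return status == 0; }

/* The test the text recommends. */
int is_success_right(uint32_t status) { return MODEL_NT_SUCCESS(status); }
```

With these definitions, is_success_wrong(MODEL_STATUS_PENDING) is 0 even though the status carries success severity, while is_success_right reports success for it and failure for the warning and error values.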

Not only do you want to test the status codes you receive from routines you call, but you also want to return status codes to the routines that call you. In the preceding chapter, I dealt with two driver subroutines, DriverEntry and AddDevice, that are both defined as returning NTSTATUS codes. As I discussed, you want to return STATUS_SUCCESS as the success indicator from these routines. If something goes wrong, you often want to return an appropriate status code, which is sometimes the same value that a routine returned to you.

As an example, here are some initial steps in the AddDevice function, with all the error checking left in:

NTSTATUS AddDevice(PDRIVER_OBJECT DriverObject, PDEVICE_OBJECT pdo)
{
  NTSTATUS status;
  PDEVICE_OBJECT fdo;

  status = IoCreateDevice(DriverObject, sizeof(DEVICE_EXTENSION), NULL,
    FILE_DEVICE_UNKNOWN, FILE_DEVICE_SECURE_OPEN, FALSE, &fdo);
  if (!NT_SUCCESS(status))
  {
    KdPrint(("IoCreateDevice failed - %X\n", status));
    return status;
  }

  PDEVICE_EXTENSION pdx = (PDEVICE_EXTENSION) fdo->DeviceExtension;
  pdx->DeviceObject = fdo;
  pdx->Pdo = pdo;
  pdx->state = STOPPED;

  IoInitializeRemoveLock(&pdx->RemoveLock, 0, 0, 0);

  status = IoRegisterDeviceInterface(pdo, &GUID_SIMPLE, NULL, &pdx->ifname);
  if (!NT_SUCCESS(status))
  {
    KdPrint(("IoRegisterDeviceInterface failed - %X\n", status));
    IoDeleteDevice(fdo);
    return status;
  }

  // (remaining steps of AddDevice not shown)
}

If IoCreateDevice fails, we'll simply return the same status code it gave us. Note the use of the NT_SUCCESS macro as described in the text.
It's sometimes a good idea, especially while debugging a driver, to print any error status you discover. I'll discuss the exact usage of KdPrint later in this chapter (in the "Making Debugging Easier" section).
IoInitializeRemoveLock, discussed in Chapter 6, cannot fail. Consequently, there's no need to check a status code. Generally speaking, most functions declared with type VOID are in the same "cannot fail" category. A few VOID functions can fail by raising an exception, but the DDK documents that behavior very clearly.
Should IoRegisterDeviceInterface fail, we have some cleanup to do before we return to our caller; namely, we must call IoDeleteDevice to destroy the device object we just created.

You don't always have to fail calls that lead to errors in the routines you call, of course. Sometimes you can ignore an error. For example, in Chapter 8, I'll tell you about a power management I/O request with the subtype IRP_MN_POWER_SEQUENCE that you can use as an optimization to avoid unnecessary state restoration during a power-up operation. Not only is it optional whether you use this request, but it's also optional for the bus driver to implement it. Therefore, if that request should fail, you should just go about your business. Similarly, you can ignore an error from IoAllocateErrorLogEntry because the inability to add an entry to the error log isn't at all critical.

Completing an IRP with an error status (driver programmers call this failing the IRP) usually leads to a failure indication in the return from a Win32 API function in an application. The application can call GetLastError to determine the cause of the failure. If you fail the IRP with a status code containing the customer flag, GetLastError will return exactly that status code. If you fail the IRP with a status code in which the customer flag is 0 (which is the case for every standard status code defined by Microsoft), GetLastError returns a value drawn from WINERROR.H in the Platform SDK. Knowledge Base article Q113996, "Mapping NT Status Error Codes to Win32 Error Codes," documents the correspondence between GetLastError return values and kernel status codes. Table 3-1 shows the correspondence for the most important status codes.

Table 3-1. Correspondence Between Common Kernel-Mode and User-Mode Status Codes
Kernel-Mode Status Code	User-Mode Error Code
STATUS_SUCCESS	NO_ERROR (0)
STATUS_INVALID_PARAMETER	ERROR_INVALID_PARAMETER
STATUS_NO_SUCH_FILE	ERROR_FILE_NOT_FOUND
STATUS_ACCESS_DENIED	ERROR_ACCESS_DENIED
STATUS_INVALID_DEVICE_REQUEST	ERROR_INVALID_FUNCTION
STATUS_BUFFER_TOO_SMALL	ERROR_INSUFFICIENT_BUFFER
STATUS_DATA_ERROR	ERROR_CRC
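The rule described before the table can be sketched as a small lookup. This is only a user-mode model of the behavior: model_last_error, model_table, and MODEL_CUSTOMER_FLAG are hypothetical names, the table holds just the rows of Table 3-1 (with the numeric values that ntstatus.h and WINERROR.H assign to those names), and the position of the customer flag (bit 29) is an assumption taken from the ntstatus.h layout. The real translation is done by the system and covers far more codes.

```c
#include <stddef.h>
#include <stdint.h>

#define MODEL_CUSTOMER_FLAG 0x20000000u   /* bit 29 of an NTSTATUS value */

/* Rows of Table 3-1: kernel status value and the Win32 error it maps to. */
static const struct { uint32_t status; uint32_t win32; } model_table[] = {
    { 0x00000000u,   0u },  /* STATUS_SUCCESS                -> NO_ERROR */
    { 0xC000000Du,  87u },  /* STATUS_INVALID_PARAMETER      -> ERROR_INVALID_PARAMETER */
    { 0xC000000Fu,   2u },  /* STATUS_NO_SUCH_FILE           -> ERROR_FILE_NOT_FOUND */
    { 0xC0000022u,   5u },  /* STATUS_ACCESS_DENIED          -> ERROR_ACCESS_DENIED */
    { 0xC0000010u,   1u },  /* STATUS_INVALID_DEVICE_REQUEST -> ERROR_INVALID_FUNCTION */
    { 0xC0000023u, 122u },  /* STATUS_BUFFER_TOO_SMALL       -> ERROR_INSUFFICIENT_BUFFER */
    { 0xC000003Eu,  23u },  /* STATUS_DATA_ERROR             -> ERROR_CRC */
};

/* Stores the modeled GetLastError value in *win32 and returns 1,
   or returns 0 when the status is not one of the rows above. */
int model_last_error(uint32_t status, uint32_t *win32)
{
    size_t i;

    /* A status carrying the customer flag is passed through unchanged. */
    if (status & MODEL_CUSTOMER_FLAG) {
        *win32 = status;
        return 1;
    }
    for (i = 0; i < sizeof model_table / sizeof model_table[0]; ++i) {
        if (model_table[i].status == status) {
            *win32 = model_table[i].win32;
            return 1;
        }
    }
    return 0;   /* not covered by this sketch */
}
```

Passing 0xC0000010 stores 1 (ERROR_INVALID_FUNCTION), whereas a customer-defined value such as 0xE0000001 comes back unchanged.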

The difference between an error and a warning can be significant. For example, failing a METHOD_BUFFERED control operation (see Chapter 9) with STATUS_BUFFER_OVERFLOW (a warning) causes the I/O Manager to copy data to the user-mode buffer. Failing the same operation with STATUS_BUFFER_TOO_SMALL (an error) causes the I/O Manager to not copy any data.
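One way to picture the distinction is as a test on the severity field alone. The sketch below is a standalone model, not the I/O Manager's code: model_copies_output is a hypothetical name, and the assumption that only error severity (the value 3 in the two high-order bits) suppresses the copy is inferred from the two cases just described.

```c
#include <stdint.h>

/* Returns 1 when the modeled I/O Manager would copy buffered output back
   to the caller, 0 when it would not. Assumption: only error severity
   (both high-order bits set) suppresses the copy. */
int model_copies_output(uint32_t status)
{
    return ((status >> 30) & 0x3u) != 3u;
}
```

Under this model, 0x80000005 (STATUS_BUFFER_OVERFLOW) results in a copy and 0xC0000023 (STATUS_BUFFER_TOO_SMALL) does not.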

Structured Exception Handling

The Windows family of operating systems provides a method of handling exceptional conditions that helps you avoid potential system crashes. Closely integrated with the compiler's code generator, structured exception handling lets you easily place a guard on sections of your code and invoke exception handlers when something goes wrong in the guarded section. Structured exception handling also lets you easily provide cleanup statements that you can be sure will always execute no matter how control leaves a guarded section of code.

Very few of my seminar students have been familiar with structured exceptions, so I'm going to explain some of the basics here. You can write better, more bulletproof code if you use these facilities. In many situations, the parameters that you receive in a WDM driver have been thoroughly vetted by other code and won't cause you to generate inadvertent exceptions. Good taste may, therefore, be the only impetus for you to use the stuff I'm describing in this section. As a general rule, though, you always want to protect direct references to user-mode virtual memory with a structured exception frame. Such references occur when you directly reference memory and when you call MmProbeAndLockPages, ProbeForRead, and ProbeForWrite, and perhaps at other times.

Sample Code
The SEHTEST sample driver illustrates the mechanics of structured exceptions in a WDM driver.

Which Exceptions Can Be Trapped

Gary Nebbett researched the question of which exceptions can be trapped with the structured exception mechanism and reported his results in a newsgroup post several years ago. The SEHTEST sample incorporates what he learned. In summary, the following exceptions will be caught when they occur at IRQL less than or equal to DISPATCH_LEVEL (note that some of these are specific to the Intel x86 processor):

Anything signaled by ExRaiseStatus and related functions
Attempt to dereference invalid pointer to user-mode memory
Debug or breakpoint exception
Integer overflow (INTO instruction)
Invalid opcode

Note that a reference to an invalid kernel-mode pointer leads directly to a bug check and can't be trapped. Likewise, a divide-by-zero exception or a BOUND instruction exception leads to a bug check.

Kernel-mode programs use structured exceptions by establishing exception frames on the same stack that's used for argument passing, subroutine calling, and automatic variables. A dedicated processor register points to the current exception frame. Each frame points to the preceding frame. Whenever an exception occurs, the kernel searches the list of exception frames for an exception handler. It will always find one because there is an exception frame at the very top of the stack that will handle any otherwise unhandled exception. Once the kernel locates an exception handler, it unwinds the execution and exception frame stacks in parallel, calling cleanup handlers along the way. Then it gives control to the exception handler.
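The search-then-unwind sequence just described can be imitated in portable C with setjmp and longjmp. The following is a toy model only, meant to show the order of events; it is not how the kernel or the compiler implements exception frames. Every name in it (Frame, g_top, model_raise, model_demo, and so on) is hypothetical, a global pointer stands in for the dedicated processor register, and abort stands in for the default handler at the top of the stack.

```c
#include <setjmp.h>
#include <stddef.h>
#include <stdlib.h>

#define EXECUTE_HANDLER 1   /* same value as EXCEPTION_EXECUTE_HANDLER */
#define CONTINUE_SEARCH 0   /* same value as EXCEPTION_CONTINUE_SEARCH */

typedef struct Frame {
    struct Frame *prev;             /* each frame points to the preceding one */
    int  (*filter)(unsigned code);  /* non-NULL for a try-except style frame */
    void (*cleanup)(void);          /* non-NULL for a try-finally style frame */
    jmp_buf handler;                /* where a try-except handler resumes */
} Frame;

static Frame *g_top = NULL;   /* stands in for the dedicated register */
static int g_cleanups = 0;    /* counts the cleanup handlers that ran */

static void model_raise(unsigned code)
{
    Frame *target = g_top;

    /* Pass 1: search the chain for a frame whose filter accepts. */
    while (target != NULL &&
           (target->filter == NULL || target->filter(code) != EXECUTE_HANDLER))
        target = target->prev;
    if (target == NULL)
        abort();              /* stand-in for the top-of-stack default handler */

    /* Pass 2: unwind, running cleanup handlers of the frames being popped. */
    while (g_top != target) {
        if (g_top->cleanup != NULL)
            g_top->cleanup();
        g_top = g_top->prev;
    }
    g_top = target->prev;     /* pop the handling frame itself */
    longjmp(target->handler, 1);
}

static int  accept_all(unsigned code) { (void)code; return EXECUTE_HANDLER; }
static void count_cleanup(void)       { ++g_cleanups; }

static void inner(void)
{
    Frame f;                  /* plays the role of a __try/__finally frame */
    f.prev = g_top;
    f.filter = NULL;
    f.cleanup = count_cleanup;
    g_top = &f;
    model_raise(0xC0000005u); /* does not return to this point */
}

/* Returns 1 if the outer handler ran after exactly one cleanup, else 0. */
int model_demo(void)
{
    Frame f;                  /* plays the role of a __try/__except frame */
    f.prev = g_top;
    f.filter = accept_all;
    f.cleanup = NULL;
    g_cleanups = 0;
    g_top = &f;
    if (setjmp(f.handler) == 0) {
        inner();
        g_top = f.prev;       /* normal exit would pop the frame here */
        return 0;
    }
    return g_cleanups == 1;   /* handler path */
}
```

Calling model_demo exercises one inner cleanup frame and one outer handler frame: the search finds the outer filter first, the inner cleanup runs during the unwind, and only then does control reach the handler branch.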

When you use the Microsoft compiler, you can use Microsoft extensions to the C/C++ language that hide some of the complexities of working with the raw operating system primitives. You use the __try statement to designate a compound statement as the guarded body for an exception frame, and you use either the __finally statement to establish a termination handler or the __except statement to establish an exception handler.

NOTE
It's better to always spell the words __try, __finally, and __except with leading underscores. In C compilation units, the DDK header file WARNING.H defines macros spelled try, finally, and except to be the words with underscores. DDK sample programs use those macro names rather than the underscored names. The problem this can create for you is that in a C++ compilation unit, try is a statement verb that pairs with catch to invoke a completely different exception mechanism that's part of the C++ language. C++ exceptions don't work in a driver unless you manage to duplicate some infrastructure from the run-time library. Microsoft would prefer you not do that because of the increased size of your driver and the memory pool overhead associated with handling the throw verb.

Try-Finally Blocks

It's easiest to begin explaining structured exception handling by describing the try-finally block, which you can use to provide cleanup code:

__try
{
  <guarded body>
}
__finally
{
  <termination handler>
}

In this fragment of pseudocode, the guarded body is a series of statements and subroutine calls that expresses some main idea in your program. In general, these statements have side effects. If there are no side effects, there's no particular point to using a try-finally block because there's nothing to clean up. The termination handler contains statements that undo some or all of the side effects that the guarded body might leave behind.

Semantically, the try-finally block works as follows: First the computer executes the guarded body. When control leaves the guarded body for any reason, the computer executes the termination handler. See Figure 3-3.

Figure 3-3. Flow of control in a try-finally block.

Here's one simple illustration:

LONG counter = 0;
__try
{
  ++counter;
}
__finally
{
  --counter;
}
KdPrint(("%d\n", counter));

First the guarded body executes and increments the counter variable from 0 to 1. When control drops through the right brace at the end of the guarded body, the termination handler executes and decrements counter back to 0. The value printed will therefore be 0.

Here's a slightly more complicated variation:

VOID RandomFunction(PLONG pcounter)
{
  __try
  {
    ++*pcounter;
    return;
  }
  __finally
  {
    --*pcounter;
  }
}

The net result of this function is no change to the integer at the end of the pcounter pointer: whenever control leaves the guarded body for any reason, including a return statement or a goto, the termination handler executes. Here the guarded body increments the counter and performs a return. Next the cleanup code executes and decrements the counter. Then the subroutine actually returns.

One final example should cement the idea of a try-finally block:

static LONG counter = 0;
__try
{
  ++counter;
  BadActor();
}
__finally
{
  --counter;
}

Here I'm supposing that we call a function, BadActor, that will raise some sort of exception that triggers a stack unwind. As part of the process of unwinding the execution and exception stacks, the operating system will invoke our cleanup code to restore the counter to its previous value. The system then continues unwinding the stack, so whatever code we have after the __finally block won't get executed.

Try-Except Blocks

The other way to use structured exception handling involves a try-except block:

__try
{
  <guarded body>
}
__except(<filter expression>)
{
  <exception handler>
}

The guarded body in a try-except block is code that might fail by generating an exception. Perhaps you're going to call a kernel-mode service function such as MmProbeAndLockPages that uses pointers derived from user mode without explicit validity checking. Perhaps you have other reasons. In any case, if you manage to get all the way through the guarded body without an error, control continues after the exception handler code. You'll think of this case as being the normal one. If an exception arises in your code or in any of the subroutines you call, however, the operating system will unwind the execution stack, evaluating the filter expressions in __except statements. These expressions yield one of the following values:

EXCEPTION_EXECUTE_HANDLER is numerically equal to 1 and tells the operating system to transfer control to your exception handler. If your handler falls through the ending right brace, control continues within your program at the statement immediately following that right brace. (I've seen Platform SDK documentation to the effect that control returns to the point of the exception, but that's not correct.)
EXCEPTION_CONTINUE_SEARCH is numerically equal to 0 and tells the operating system that you can't handle the exception. The system keeps scanning up the stack looking for another handler. If no one has provided a handler for the exception, a system crash will occur.
EXCEPTION_CONTINUE_EXECUTION is numerically equal to -1 and tells the operating system to return to the point where the exception was raised. I'll have a bit more to say about this expression value a little further on.

Take a look at Figure 3-4 for the possible control paths within and around a try-except block.

Figure 3-4. Flow of control in a try-except block.

For example, you can protect yourself from receiving an invalid pointer by using code like the following. (See the SEHTEST sample in the companion content.)

PVOID p = (PVOID) 1;
__try
{
  KdPrint(("About to generate exception\n"));
  ProbeForWrite(p, 4, 4);
  KdPrint(("You shouldn't see this message\n"));
}
__except(EXCEPTION_EXECUTE_HANDLER)
{
  KdPrint(("Exception was caught\n"));
}
KdPrint(("Program kept control after exception\n"));

ProbeForWrite tests a data area for validity. In this example, it will raise an exception because the pointer argument we supply isn't aligned to a 4-byte boundary. The exception handler gains control. Control then flows to the next statement after the exception handler and continues within your program.

In the preceding example, had you returned the value EXCEPTION_CONTINUE_SEARCH, the operating system would have continued unwinding the stack looking for an exception handler. Neither your exception handler code nor the code following it would have been executed: either the system would have crashed or some higher-level handler would have taken over.

You should not return EXCEPTION_CONTINUE_EXECUTION in kernel mode because you have no way to alter the conditions that caused the exception in order to allow a retry to occur.

Note that you cannot trap arithmetic exceptions, or page faults due to referencing an invalid kernel-mode pointer, by using structured exceptions. You just have to write your code so as not to generate such exceptions. It's pretty obvious how to avoid dividing by 0: just check, as in this example:

ULONG numerator, denominator; // <== numbers someone gives you
ULONG quotient;
if (!denominator)
  <handle error>
else
  quotient = numerator / denominator;

But what about a pointer that comes to you from some other part of the kernel? There is no function that you can use to check the validity of a kernel-mode pointer. You just need to follow this rule:

Usually, trust values that a kernel-mode component gives you.

I don't mean by this that you shouldn't liberally sprinkle your code with ASSERT statements (you should, because you may not initially understand all the ins and outs of how other kernel components work). I just mean that you don't need to burden your own driver with excessive defenses against mistakes in other, well-tested, parts of the system unless you need to work around a bug.

More About NULL Pointers

While we're on the subject of invalid pointers, note that a NULL pointer is (a) an invalid user-mode pointer in Windows XP and (b) a perfectly valid pointer in Windows 98/Me. If you use a NULL pointer directly, as in *p, or indirectly, as in p->StructureMember, you'll be trying to reference something in the first few bytes of virtual memory. Doing so in Windows XP will cause a trappable access violation.

Dereferencing a NULL pointer in Windows 98/Me will not, of itself, cause any immediately observable problem. I once spent several days tracking down a bug that resulted from overstoring location 0x0000000C in a Windows 95 system. That location is the real-mode vector for the breakpoint (INT 3) interrupt. The wild store didn't show up until some infrequently used application did an INT 3 that wasn't caught by a debugger. The system reflected the interrupt to real mode. The invalid interrupt vector pointed to memory containing a bunch of technically valid but nonsensical instructions followed by an invalid one. The system halted with an invalid operation exception. As you can see, the eventual symptom was very far removed in space and time from the wild store.

To debug a different problem in Windows 98, I once installed a debugging driver to catch alterations to the first 16 bytes of virtual memory. I had to remove it because so many VxD drivers (including some belonging to Microsoft) were getting caught.

The moral of these anecdotes is that you should always test pointers for NULL before using them if there is any possibility that the pointer could be NULL. To learn whether the possibility exists, read documentation and specifications very carefully.

Exception Filter Expressions

You might be wondering how to perform any sort of involved error detection or correction when all you're allowed to do is evaluate an expression that yields one of three integer values. You could use the C/C++ comma operator to string expressions together:

__except(expr-1, ... EXCEPTION_CONTINUE_SEARCH){}

The comma operator basically discards whatever value is on its left side and evaluates its right side. The value that's left over after this computational game of musical chairs (with just one chair!) is the value of the expression.

You could use the C/C++ conditional operator to perform a more involved calculation:

__except(<some-expr> ? EXCEPTION_EXECUTE_HANDLER : EXCEPTION_CONTINUE_SEARCH)

If the some-expr expression is TRUE, you execute your own handler. Otherwise, you tell the operating system to keep looking for another handler above you in the stack.

Finally, it should be obvious that you could just write a subroutine whose return value is one of the EXCEPTION_Xxx values:

LONG EvaluateException()
{
  if (<some-expr>)
    return EXCEPTION_EXECUTE_HANDLER;
  else
    return EXCEPTION_CONTINUE_SEARCH;
}

__except(EvaluateException())

For any of these expression formats to do you any good, you need access to more information about the exception. You can call two functions when evaluating an __except expression that will supply the information you need. Both functions actually have intrinsic implementations in the Microsoft compiler and can be used only at the specific times indicated:

GetExceptionCode() returns the numeric code for the current exception. This value is an NTSTATUS value that you can compare with manifest constants in ntstatus.h if you want to. This function is available in an __except expression and within the exception handler code that follows the __except clause.
GetExceptionInformation() returns the address of an EXCEPTION_POINTERS structure that, in turn, allows you to learn all the details about the exception, such as where it occurred, what the machine registers contained at the time, and so on. This function is available only within an __except expression.

NOTE
The scope rules for names that appear in try-except and try-finally blocks are the same as elsewhere in the C/C++ language. In particular, if you declare variables within the scope of the compound statement that follows __try, those names aren't visible in a filter expression, an exception handler, or a termination handler. Documentation to the contrary that you might have seen in the Platform SDK or on MSDN is incorrect. For what it's worth, the stack frame containing any local variables declared within the scope of the guarded body still exists at the time the filter expression is evaluated. So if you had a pointer (presumably declared at some outer scope) to a variable declared within the guarded body, you could safely dereference it in a filter expression.

Because of the restrictions on how you can use these two expressions in your program, you'll probably want to use them in a function call to some filter function, like this:

LONG EvaluateException(NTSTATUS status, PEXCEPTION_POINTERS xp)
{
  // examine status and xp here, then return one of the EXCEPTION_Xxx values
}

__except(EvaluateException(GetExceptionCode(), GetExceptionInformation()))
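As one possible shape for the decision logic inside such a filter function, the standalone sketch below accepts only the two status codes that the probe functions raise and passes everything else along. It is written as plain C so that it can be compiled outside a driver: evaluate_exception_code and the MODEL_ constants are hypothetical names, the status values are the ones ntstatus.h assigns to STATUS_ACCESS_VIOLATION and STATUS_DATATYPE_MISALIGNMENT, and the choice of which codes to handle is purely illustrative.

```c
#include <stdint.h>

#define MODEL_EXECUTE_HANDLER 1   /* same value as EXCEPTION_EXECUTE_HANDLER */
#define MODEL_CONTINUE_SEARCH 0   /* same value as EXCEPTION_CONTINUE_SEARCH */

#define MODEL_STATUS_ACCESS_VIOLATION      0xC0000005u
#define MODEL_STATUS_DATATYPE_MISALIGNMENT 0x80000002u

/* Decide, from the exception code alone, whether this frame's handler
   should run or the search should continue up the stack. */
long evaluate_exception_code(uint32_t code)
{
    switch (code) {
    case MODEL_STATUS_ACCESS_VIOLATION:
    case MODEL_STATUS_DATATYPE_MISALIGNMENT:
        return MODEL_EXECUTE_HANDLER;
    default:
        return MODEL_CONTINUE_SEARCH;
    }
}
```

In a driver, the same switch would sit in the body of EvaluateException and be driven by the value of GetExceptionCode(); a filter that handles only the codes it expects leaves genuinely surprising exceptions for a higher-level handler.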

Raising Exceptions

Program bugs are one way you can (inadvertently) raise exceptions that invoke the structured exception handling mechanism. Application programmers are familiar with the Win32 API function RaiseException, which allows you to generate an arbitrary exception on your own. In WDM drivers, you can call the routines listed in Table 3-2. I'm not going to give you a specific example of calling these functions because of the following rule:

Raise an exception only in a nonarbitrary thread context, when you know there's an exception handler above you, and when you really know what you're doing.

In particular, raising exceptions is not a good way to tell your callers information that you discover in the ordinary course of executing. It's far better to return a status code, even though that leads to code that appears less readable. You should avoid exceptions because the stack-unwinding mechanism is very expensive. Even the cost of establishing exception frames is significant and something to avoid when you can.

Table 3-2. Service Functions for Raising Exceptions
Service Function	Description
ExRaiseStatus	Raise exception with specified status code
ExRaiseAccessViolation	Raise STATUS_ACCESS_VIOLATION
ExRaiseDatatypeMisalignment	Raise STATUS_DATATYPE_MISALIGNMENT

Real-World Examples

Notwithstanding the expense of setting up and tearing down exception frames, you have to use structured exception syntax in an ordinary driver in particular situations.

One of the times you must set up an exception handler is when you call MmProbeAndLockPages to lock the pages for a memory descriptor list (MDL) you've created:

PMDL mdl = MmCreateMdl(...);
__try
{
  MmProbeAndLockPages(mdl, ...);
}
__except(EXCEPTION_EXECUTE_HANDLER)
{
  NTSTATUS status = GetExceptionCode();
  IoFreeMdl(mdl);
  return CompleteRequest(Irp, status, 0);
}

(CompleteRequest is a helper function I use to handle the mechanics of completing I/O requests. Chapter 5 explains all about I/O requests and what it means to complete one.)

Another time to use an exception handler is when you want to access user-mode memory using a pointer from an untrusted source. In the following example, suppose you obtained the pointer p from a user-mode program and believe it points to an integer:

PLONG p; // from user-mode
__try
{
  ProbeForRead(p, 4, 4);
  LONG x = *p;
}
__except(EXCEPTION_EXECUTE_HANDLER)
{
  NTSTATUS status = GetExceptionCode();
}

Bug Checks

Unrecoverable errors in kernel mode can manifest themselves in the so-called blue screen of death (BSOD) that's all too familiar to driver programmers. Figure 3-5 is an example (hand-painted because no screen capture software is running when one of these occurs!). Internally, these errors are called bug checks, after the service function you use to diagnose their occurrence: KeBugCheckEx. The main feature of a bug check is that the system shuts itself down in as orderly a way as possible and presents the BSOD. Once the BSOD appears, the system is dead and must be rebooted.

Figure 3-5. The blue screen of death.

You call KeBugCheckEx like this:

KeBugCheckEx(bugcode, info1, info2, info3, info4);

where bugcode is a numeric value identifying the cause of the error and info1, info2, and so on are integer parameters that will appear in the BSOD display to help a programmer understand the details of the error. This function does not return (!).

As a developer, you don't get much information from the Blue Screen. If you're lucky, the information will include the offset of an instruction within your driver. Later on, you can examine this location in a kernel debugger and, perhaps, deduce a possible cause for the bug check. Microsoft's own bug-check codes appear in bugcodes.h (one of the DDK headers); a fuller explanation of the codes and their various parameters can be found in Knowledge Base article Q103059, "Descriptions of Bug Codes for Windows NT," which is available on MSDN, among other places.

Sample Code
The BUGCHECK sample driver illustrates how to call KeBugCheckEx. I used it to generate the screen shot for Figure 3-5.

You can certainly create your own bug-check codes if you want. The Microsoft values are simple integers beginning with 1 (APC_INDEX_MISMATCH) and (currently) extending through 0xF6 (PCI_VERIFIER_DETECTED_VIOLATION), along with a few others. To create your own bug-check code, define an integer constant as if it were a STATUS_SEVERITY_SUCCESS status code, but supply either the customer flag or a nonzero facility code. For example:

#define MY_BUGCHECK_CODE 0x002A0001

KeBugCheckEx(MY_BUGCHECK_CODE, 0, 0, 0, 0);

You use a nonzero facility code (42 in this example) or the customer flag (which I left 0 in this example) so that you can tell your own codes from the ones Microsoft uses.
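To make the arithmetic behind 0x002A0001 explicit, the following standalone sketch composes a bug-check code from its parts. MAKE_BUGCHECK_CODE and example_bugcheck_code are hypothetical names, not DDK definitions, and the field positions (customer flag in bit 29, facility in bits 27-16, code in bits 15-0) are assumed from the NTSTATUS layout in ntstatus.h. The severity bits are left at 0, as they would be for a success-severity status code.

```c
#include <stdint.h>

/* Compose a private bug-check code laid out like a success-severity
   status code: customer flag in bit 29, facility in bits 27-16,
   code in bits 15-0. */
#define MAKE_BUGCHECK_CODE(customer, facility, code)   \
    ((((uint32_t)(customer) & 0x1u)   << 29) |         \
     (((uint32_t)(facility) & 0xFFFu) << 16) |         \
      ((uint32_t)(code)     & 0xFFFFu))

/* Facility 42, code 1, customer flag clear: the value used in the text. */
uint32_t example_bugcheck_code(void)
{
    return MAKE_BUGCHECK_CODE(0, 42, 1);
}
```

Facility 42 is 0x2A, which lands in the third and fourth hexadecimal digits, so the result is 0x002A0001; setting the customer flag instead of (or in addition to) the facility would add 0x20000000.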

Now that I've told you how to generate your own BSOD, let me tell you when to do it: never. Or at most, in the checked build of your driver for use during your own internal debugging. You and I are unlikely to write a driver that will discover an error so serious that taking down the system is the only solution. It would be far better to log the error (using the error-logging facilities I'll describe in Chapter 14) and return a status code.

Note that the end user can configure the behavior of KeBugCheckEx in the advanced settings for My Computer. The user can choose to automatically restart the machine or to generate the BSOD. The end user can likewise choose several levels of detail (including none) for a dump file and whether to log an event in the system event log.