Why Drivers Fail | The Windows 2000 Device Driver Book: A Guide for Programmers (2nd Edition)


While testing uncovers the presence of bugs, the more serious challenge is to analyze, isolate, and correct the source of the bug. The goal of this section is to provide the basis for the analysis of driver bugs. Drivers fail for specific reasons and some up-front thought about the ways in which they fail can start the elusive search for a bug in an orderly fashion.

Categories of Driver Errors

Drivers can fail in any number of interesting ways. Although it is not possible to give a complete list, the following sections describe some of the more common types of driver pathology.

HARDWARE PROBLEMS

It goes without saying (to a software developer, anyway) that there is always an even chance that the hardware itself is the source of a problem. In fact, when developing drivers for new, undeployed, untested hardware, the chances of a hardware problem rise significantly. Symptoms of hardware problems include the following:

Errors occur during data transfer.
Device status codes indicate an error (when a device reports an internal error, it is the equivalent of a confession).
Interrupts do not arrive or they arrive spuriously.
The device does not respond properly to commands.

The cause might be as simple as undocumented behavior in the device, and hardware designers have been known to alter documentation after witnessing the results of their work. There might be a restriction on command timing or sequencing. The firmware on the device might be faulty. There could be a bus protocol problem resulting in sporadic failures as other devices on the bus engage. Then again, the device might just be broken.

Because attempting to debug a problem whose ultimate source is hardware is so frustrating, it is best to eliminate (within reason) this category of fault before proceeding.

The best way to confirm a hardware problem is to use a logic analyzer or hardware emulator. On unstable platforms, hardware and software designers should work closely together until the fault has been isolated to one side or the other.

SYSTEM CRASHES

Because driver code operates in kernel mode, it can easily bring down the entire system. While many driver logic errors produce a crash, the most common crashes stem from access violations (e.g., referencing a logical address that has no physical memory behind it) caused by a bad C pointer. Among the more difficult-to-trace scenarios in this category is loading incorrect DMA addresses into the mapping registers. The device then scribbles into random memory, and the resulting crash appears to be caused by an entirely different subsystem.

A later section in this chapter deals with analyzing system crashes to determine the ultimate source.

RESOURCE LEAKS

Because kernel-mode driver code is trusted code, the system does not perform any tracking or recovery of system resources on behalf of a driver. When a driver unloads, it is responsible for releasing whatever it may have allocated. This includes both memory from the pool areas and any hardware the driver manages.

Even while a driver is running, it can leak memory if it regularly allocates pool space for temporary use and then fails to release it. Higher-level drivers can leak IRPs by failing to free those they fabricate and pass to lower levels. Resource leaks cause sluggish system performance and, ultimately, a system crash.

Windows 2000 allows memory allocated by kernel-mode code to be tagged with an ID. When analyzing system memory after a crash, the tags help determine where the blocks are located, their size, and most importantly, which subsystem allocated them. Tools such as GFLAGS, supplied with the Platform SDK, enable the pool tagging feature globally for the system.
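The idea behind pool tagging can be modeled in user mode. The following is only a sketch and not the kernel API: tagged_alloc, tagged_free, tag_outstanding, and make_tag are hypothetical helpers built on malloc. They mimic how a four-character tag attached to each allocation lets outstanding blocks be attributed to their owner. In a real driver, the tag is supplied to ExAllocatePoolWithTag and the system does the bookkeeping.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MAX_TAGS 16

/* Per-tag bookkeeping: blocks and bytes that have not been freed yet. */
typedef struct {
    uint32_t tag;
    size_t   blocks;
    size_t   bytes;
} TagStats;

/* Header stored in front of every block so the free path can find the tag.
   The extra union members only exist to keep the user data aligned. */
typedef union {
    struct { uint32_t tag; size_t size; } info;
    long double align_ld;
    long long   align_ll;
    void       *align_ptr;
} BlockHeader;

static TagStats g_stats[MAX_TAGS];
static size_t   g_tag_count;

/* Packs four characters into a 32-bit tag (hypothetical helper). */
uint32_t make_tag(const char *four_chars)
{
    uint32_t tag = 0;
    memcpy(&tag, four_chars, 4);
    return tag;
}

static TagStats *find_stats(uint32_t tag, int create)
{
    size_t i;
    for (i = 0; i < g_tag_count; i++)
        if (g_stats[i].tag == tag)
            return &g_stats[i];
    if (!create || g_tag_count == MAX_TAGS)
        return NULL;
    g_stats[g_tag_count].tag = tag;
    g_stats[g_tag_count].blocks = 0;
    g_stats[g_tag_count].bytes = 0;
    return &g_stats[g_tag_count++];
}

/* Allocates size bytes and charges them to tag. Returns NULL on failure. */
void *tagged_alloc(size_t size, uint32_t tag)
{
    TagStats *stats = find_stats(tag, 1);
    BlockHeader *hdr;
    if (stats == NULL)
        return NULL;
    hdr = malloc(sizeof *hdr + size);
    if (hdr == NULL)
        return NULL;
    hdr->info.tag = tag;
    hdr->info.size = size;
    stats->blocks++;
    stats->bytes += size;
    return hdr + 1;
}

/* Frees a block from tagged_alloc and credits its tag. */
void tagged_free(void *ptr)
{
    BlockHeader *hdr;
    TagStats *stats;
    if (ptr == NULL)
        return;
    hdr = (BlockHeader *)ptr - 1;
    stats = find_stats(hdr->info.tag, 0);
    if (stats != NULL) {
        stats->blocks--;
        stats->bytes -= hdr->info.size;
    }
    free(hdr);
}

/* Number of blocks still allocated under tag; nonzero at unload is a leak. */
size_t tag_outstanding(uint32_t tag)
{
    TagStats *stats = find_stats(tag, 0);
    return stats ? stats->blocks : 0;
}
```

A driver that uses a distinct tag per subsystem or per allocation site narrows a leak to a handful of code paths before any debugging begins.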

Resource leaks in progress can sometimes be determined by careful monitoring of system objects. Tools such as WINOBJ (supplied with the Platform SDK or by www.sysinternals.com) assist in this monitoring.

Tracking resource leakage can be an arduous process. Considerable patience must be exercised when analyzing and isolating such problems.

THREAD HANGS

Another failure mode is caused by synchronous I/O requests that never return. The user-mode thread issuing the request remains blocked in its wait state forever. This type of behavior can result from several causes.

First, the driver might simply never call IoCompleteRequest, so the IRP is never returned to the I/O Manager. Less obvious is the need to call IoStartNextPacket. Even if there are no pending requests to be processed, a driver must call this function because it marks the Device object as idle. Without this call, all new IRPs are placed in the pending queue and never reach the Start I/O routine.
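The busy-flag behavior just described can be illustrated with a toy user-mode model. This is a sketch under stated assumptions, not the I/O Manager's implementation: ModelDevice, model_start_packet, model_start_next_packet, and model_start_io are hypothetical names, and requests are plain integers rather than IRPs.

```c
#include <stddef.h>

#define QUEUE_DEPTH 8

typedef struct {
    int    busy;                 /* nonzero while a request owns the device */
    int    pending[QUEUE_DEPTH]; /* request IDs waiting for Start I/O */
    size_t pending_count;
    int    started;              /* how many requests reached Start I/O */
} ModelDevice;

/* Stand-in for the driver's Start I/O routine. */
static void model_start_io(ModelDevice *dev, int request_id)
{
    (void)request_id;
    dev->started++;
}

/* Models IoStartPacket: run the request now if the device is idle,
   otherwise queue it. Returns 0 on success, -1 if the queue is full. */
int model_start_packet(ModelDevice *dev, int request_id)
{
    if (dev->busy) {
        if (dev->pending_count == QUEUE_DEPTH)
            return -1;
        dev->pending[dev->pending_count++] = request_id;
        return 0;
    }
    dev->busy = 1;
    model_start_io(dev, request_id);
    return 0;
}

/* Models IoStartNextPacket: start the next queued request, or mark the
   device idle when nothing is waiting. */
void model_start_next_packet(ModelDevice *dev)
{
    size_t i;
    int next;
    if (dev->pending_count == 0) {
        dev->busy = 0;   /* the step a buggy driver skips */
        return;
    }
    next = dev->pending[0];
    for (i = 1; i < dev->pending_count; i++)
        dev->pending[i - 1] = dev->pending[i];
    dev->pending_count--;
    model_start_io(dev, next);
}
```

In this model, a driver that finishes a request without calling model_start_next_packet leaves busy set, so every later request lands in pending and started never advances.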

Second, a logic error can hang a thread in a Dispatch routine. Perhaps the driver is attempting recursively to acquire a Fast Mutex or an Executive resource. Perhaps another code path has acquired a mutex but failed to release it. Subsequent requests for the mutex hang indefinitely.

Similarly, DMA drivers can hang while awaiting ownership of the Adapter object or its mapping registers. The IRP is therefore never processed, and all further IRPs queue up behind it. For slave DMA devices, the offending driver might cause other drivers using the same DMA channel to freeze.

Drivers that manage multiunit controllers can cause similar problems by not releasing the Controller object. New IRPs sent to any Device object using the locked Controller object queue indefinitely.

Unfortunately, there is no convenient way to see who currently owns Adapter or Controller objects, Mutexes, or Executive resources. It is sometimes helpful to maintain a resource management structure for tracking purposes. Each owner of a synchronization object should register its use within the structure, clearing it when the object is released. Of course, this technique requires a manual coding effort; the act of adding the code often reveals the source of the problem.
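One possible shape for such a tracking structure is sketched below as ordinary user-mode C. All names (OwnerRecord, track_acquire, track_release, track_owner) are hypothetical, and the table is deliberately not thread-safe; a real driver would have to guard it with its own lock and call the helpers around every acquire and release.

```c
#include <stddef.h>

#define MAX_TRACKED 32

typedef struct {
    const void *object;   /* address of the synchronization object */
    const char *owner;    /* label of the code path that holds it */
} OwnerRecord;

static OwnerRecord g_owners[MAX_TRACKED];

/* Records that owner now holds object.
   Returns 0 on success, -1 if the object is already owned, the pointer
   is NULL, or the table is full. */
int track_acquire(const void *object, const char *owner)
{
    OwnerRecord *free_slot = NULL;
    size_t i;
    if (object == NULL)
        return -1;
    for (i = 0; i < MAX_TRACKED; i++) {
        if (g_owners[i].object == object)
            return -1;               /* double acquire: likely the bug */
        if (g_owners[i].object == NULL && free_slot == NULL)
            free_slot = &g_owners[i];
    }
    if (free_slot == NULL)
        return -1;
    free_slot->object = object;
    free_slot->owner = owner;
    return 0;
}

/* Clears the record for object. Returns -1 if it was not tracked. */
int track_release(const void *object)
{
    size_t i;
    if (object == NULL)
        return -1;
    for (i = 0; i < MAX_TRACKED; i++) {
        if (g_owners[i].object == object) {
            g_owners[i].object = NULL;
            g_owners[i].owner = NULL;
            return 0;
        }
    }
    return -1;
}

/* Returns the label of the current owner, or NULL if nobody holds it. */
const char *track_owner(const void *object)
{
    size_t i;
    if (object == NULL)
        return NULL;
    for (i = 0; i < MAX_TRACKED; i++)
        if (g_owners[i].object == object)
            return g_owners[i].owner;
    return NULL;
}
```

When a thread hangs, dumping this table from the debugger shows which code path last claimed each object and never gave it back.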

Another hit-or-miss approach to isolating thread hangs is to use the checked build of the Windows 2000 kernel. The checked build reports the use of system synchronization objects through DbgPrint statements that appear on an attached debugger.

SYSTEM HANGS

Occasionally, a driver error causes the entire system to lock up. For example, a deadly embrace involving multiple spin locks (or attempts to acquire the same spin lock multiple times on a single CPU) can freeze system operation. Endless loops in a driver's Interrupt Service Routine or DPC routine cause a similar failure.

Once this kind of system collapse occurs, it is difficult, if not impossible, to regain control of the system. The best approach is usually to debug the driver interactively, using WinDbg, and attempt to duplicate the failure.
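A common defense against the deadly embrace described above is to give every lock a fixed rank and always acquire locks in ascending rank order. The following user-mode sketch checks that rule in debug code. It is an illustration only: LockOrderState, order_check_acquire, and order_check_release are hypothetical names, the state models a single CPU or thread, and no real spin lock is touched.

```c
#include <stddef.h>

#define MAX_HELD 8

enum {
    ORDER_OK = 0,
    ORDER_RECURSIVE,   /* the same lock is already held */
    ORDER_INVERTED,    /* a higher-ranked lock is already held */
    ORDER_OVERFLOW,    /* too many locks held for this model */
    ORDER_NOT_HELD     /* release of a lock that was never acquired */
};

/* Ranks of the locks currently held, in acquisition order. */
typedef struct {
    int    held_ranks[MAX_HELD];
    size_t held_count;
} LockOrderState;

/* Call before acquiring the lock with the given rank. The rank is
   recorded only when the acquisition is legal. */
int order_check_acquire(LockOrderState *s, int rank)
{
    size_t i;
    for (i = 0; i < s->held_count; i++)
        if (s->held_ranks[i] == rank)
            return ORDER_RECURSIVE;
    if (s->held_count > 0 && rank < s->held_ranks[s->held_count - 1])
        return ORDER_INVERTED;
    if (s->held_count == MAX_HELD)
        return ORDER_OVERFLOW;
    s->held_ranks[s->held_count++] = rank;
    return ORDER_OK;
}

/* Call after releasing the lock with the given rank. */
int order_check_release(LockOrderState *s, int rank)
{
    size_t i, j;
    for (i = 0; i < s->held_count; i++) {
        if (s->held_ranks[i] == rank) {
            for (j = i + 1; j < s->held_count; j++)
                s->held_ranks[j - 1] = s->held_ranks[j];
            s->held_count--;
            return ORDER_OK;
        }
    }
    return ORDER_NOT_HELD;
}
```

Because every legal acquisition raises the rank, the last entry is always the highest, and a single comparison is enough to detect an inversion. A debug build can assert on any result other than ORDER_OK, turning a silent system hang into an immediate, traceable failure.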

Reproducing Driver Errors

One key to isolating a driver bug is the ability to reproduce the problem. Intermittent errors are the bane of a driver author's existence. Meticulously recording the exact sequence of events leading up to the failure improves the odds of reproducing it. The causes of intermittent failures are numerous.

TIME DEPENDENCIES

Some problems occur only when a driver is running at full speed (or worse, at some exact slower speed). They may surface only under an unusually high I/O request rate or data transfer rate. Stress testing is usually a good way to attempt reproduction of this type of failure.

MULTIPROCESSOR DEPENDENCIES

If the driver is certified for multiprocessor operation, it must be tested on a multiprocessor platform. Numerous timing conditions present themselves only within the MP environment. For example, ISR, DPC, and I/O Timer routines can run simultaneously on an SMP machine. One warning: SMP debugging is very painful, so it is best to start with a single-processor environment.

MULTITHREADING DEPENDENCIES

If a driver manages sharable Device objects, the test strategy must access a single Device object from multiple threads. IRPs that flow from multiple threads often provoke unintended results.

OTHER CAUSES

A computer system involves many components. Sometimes behavior appears non-deterministic due to system load conditions, combinations of installed hardware and drivers, or other configuration differences. A detailed log is perhaps the best tool to assist in identifying this category of problem.
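One inexpensive way to keep such a log inside the driver itself is a fixed-size circular buffer of recent events that can be inspected after a failure. The sketch below is plain user-mode C with hypothetical names (EventLog, log_event, log_snapshot); a real driver would place the buffer in nonpaged memory and protect it against concurrent writers.

```c
#include <stddef.h>
#include <string.h>

#define LOG_SLOTS 4    /* small on purpose so wraparound is easy to see */
#define LOG_TEXT  32

typedef struct {
    unsigned long sequence;      /* ever-increasing event number */
    char          text[LOG_TEXT];
} LogEntry;

typedef struct {
    LogEntry      entries[LOG_SLOTS];
    unsigned long next_sequence; /* total number of events recorded */
} EventLog;

/* Records one event, overwriting the oldest entry once the log is full. */
void log_event(EventLog *events, const char *text)
{
    LogEntry *e = &events->entries[events->next_sequence % LOG_SLOTS];
    e->sequence = events->next_sequence++;
    strncpy(e->text, text, LOG_TEXT - 1);
    e->text[LOG_TEXT - 1] = '\0';
}

/* Copies the surviving entries, oldest first, into out.
   Returns the number of entries copied (at most max). */
size_t log_snapshot(const EventLog *events, LogEntry *out, size_t max)
{
    unsigned long total = events->next_sequence;
    unsigned long first = total > LOG_SLOTS ? total - LOG_SLOTS : 0;
    unsigned long seq;
    size_t n = 0;
    for (seq = first; seq < total && n < max; seq++)
        out[n++] = events->entries[seq % LOG_SLOTS];
    return n;
}
```

The sequence numbers make gaps visible: if the oldest surviving entry is not number 0, earlier events were overwritten and the buffer should be enlarged for the next attempt.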

Defensive Coding Strategies

Any good software design anticipates problems. To facilitate the detection and isolation of failures, several coding techniques should be employed.

Maximize the generation of intermediate output within the driver code. Intermediate output, also known as trace output, should be sprinkled liberally within the driver code. Using the function DbgPrint (described later in this chapter), intermediate output is directed at a connected interactive debugger such as WinDbg. Sysinternals also produces a utility, DebugView, that captures this trace output on a single system.
Use assertions (described later in this chapter) liberally to validate internal consistency within driver code.
Leave debug code in the driver source, properly bracketed by #ifdef and #endif statements. When needed, a debug version of the driver can be built to track particularly elusive bugs.
Faithfully maintain version information for each driver shipped. Often, bugs follow a particular version of a driver, yielding major clues to its cause.
Use version control software throughout the development process. This allows code changes to be easily backed out to test for failure modes on older code bases.
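The first and third techniques above can be combined in a trace macro that compiles away in release builds. The following is a user-mode sketch: trace_emit stands in for DbgPrint, and DRIVER_DEBUG, TRACE, g_last_trace, and sample_transfer are hypothetical names chosen for illustration. The double parentheses follow the style of the DDK's KdPrint macro, which lets a variable argument list pass through a single macro parameter.

```c
#include <stdarg.h>
#include <stdio.h>

/* Assumption: a debug build normally defines this on the command line. */
#ifndef DRIVER_DEBUG
#define DRIVER_DEBUG 1
#endif

/* Most recent trace line, kept so it can be examined after the fact. */
char g_last_trace[128];

/* User-mode stand-in for DbgPrint: formats the message, remembers it,
   and writes it to stderr. */
void trace_emit(const char *fmt, ...)
{
    va_list args;
    va_start(args, fmt);
    vsnprintf(g_last_trace, sizeof g_last_trace, fmt, args);
    va_end(args);
    fputs(g_last_trace, stderr);
}

#if DRIVER_DEBUG
#define TRACE(args) trace_emit args
#else
#define TRACE(args) ((void)0)   /* no code is generated in release builds */
#endif

/* Example routine showing trace output at entry. */
int sample_transfer(int bytes_requested)
{
    TRACE(("sample_transfer: %d bytes requested\n", bytes_requested));
    return bytes_requested;
}
```

Because the release definition of TRACE expands to nothing, the trace statements can stay in the source permanently without affecting the shipped driver.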

Keeping Track of Driver Bugs

Research has shown that bugs are not evenly distributed throughout code. Rather, they tend to cluster in a few specific routines, proportional to the routine's complexity. A carefully maintained bug log identifies the routines that deserve special attention.

A good bug log makes it easier to spot patterns that point to configuration-related failures. It can also expose holes in the testing design and strategy itself.

Good failure logs should contain at least the following:

An exact description of the failure.
As much detail as possible about the prevailing conditions at the time of the failure. For example, the OS version and service pack, the drivers installed and their versions, and so on.
The exact configuration of the system at the time of failure.
Bug severity (from showstopper to cosmetic).
Current status of the bug.
