While testing uncovers the presence of bugs, the more serious challenge is to analyze, isolate, and correct the source of the bug. The goal of this section is to provide the basis for the analysis of driver bugs. Drivers fail for specific reasons and some up-front thought about the ways in which they fail can start the elusive search for a bug in an orderly fashion.

Categories of Driver Errors

Drivers can fail in any number of interesting ways. Although it is not possible to give a complete list, the following sections describe some of the more common types of driver pathology.

HARDWARE PROBLEMS

It goes without saying (to a software developer, anyway) that there is always an even chance that the hardware itself is the source of a problem. In fact, when developing drivers for new, undeployed, untested hardware, the chances of a hardware problem rise significantly. Symptoms of hardware problems include
The cause might be as simple as undocumented behavior in the device, and hardware designers have been known to alter documentation after witnessing the results of their work. There might be a restriction on command timing or sequencing. The firmware on the device might be faulty. There could be a bus protocol problem resulting in sporadic failures as other devices on the bus engage. Then again, the device might just be broken.

Because attempting to debug a problem whose ultimate source is hardware is so frustrating, it is best to eliminate (within reason) this category of fault before proceeding. The best way to confirm a hardware problem is to employ a logic analyzer or hardware emulator. Hardware and software designers should work closely on unstable platforms until the first level of fault isolation can be determined.

SYSTEM CRASHES

Because driver code operates in kernel mode, it is straightforward for such code to kill the entire system. While many driver logic errors produce a crash, the most common crashes stem from access violations (e.g., referencing a logical address that has no physical memory behind it) through use of a bad C pointer. Among the more difficult-to-trace scenarios within this category is setting DMA addresses incorrectly into the mapping registers. The device scribbles into random memory, and the resulting crash appears to be caused by an entirely different subsystem. A later section in this chapter deals with analyzing system crashes to determine the ultimate source.

RESOURCE LEAKS

Because kernel-mode driver code is trusted code, the system does not perform any tracking or recovery of system resources on behalf of a driver. When a driver unloads, it is responsible for releasing whatever it may have allocated. This includes both memory from the pool areas and any hardware the driver manages. Even while a driver is running, it can leak memory if it regularly allocates pool space for temporary use and then fails to release it.
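As a minimal sketch of how such a leak can be made visible during development, the following user-mode C fragment counts outstanding allocations. The names TrackedAlloc, TrackedFree, OutstandingAllocations, and the two ProcessRequest variants are hypothetical and not part of any Windows interface; malloc and free merely stand in for the driver's pool allocation and release calls, and the counter is not protected against concurrent access.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical user-mode stand-ins: a real driver would wrap its pool
   allocation and release calls instead of malloc/free, and would have
   to protect this counter against concurrent access. */
static long g_outstanding = 0;   /* blocks allocated but not yet freed */

void *TrackedAlloc(size_t bytes)
{
    void *p = malloc(bytes);
    if (p != NULL)
        g_outstanding++;
    return p;
}

void TrackedFree(void *p)
{
    if (p != NULL) {
        assert(g_outstanding > 0);   /* catches a release with nothing outstanding */
        g_outstanding--;
        free(p);
    }
}

long OutstandingAllocations(void)
{
    return g_outstanding;
}

/* Buggy: the error path returns without releasing the temporary buffer. */
int ProcessRequestLeaky(int deviceError)
{
    char *temp = TrackedAlloc(64);
    if (temp == NULL)
        return -1;
    if (deviceError)
        return -1;                   /* leak: temp is never freed */
    TrackedFree(temp);
    return 0;
}

/* Fixed: a single exit point releases the buffer on every path. */
int ProcessRequestFixed(int deviceError)
{
    int status = 0;
    char *temp = TrackedAlloc(64);
    if (temp == NULL)
        return -1;
    if (deviceError)
        status = -1;
    TrackedFree(temp);
    return status;
}
```

Calling ProcessRequestLeaky(1) leaves OutstandingAllocations() one higher than before, whereas ProcessRequestFixed(1) leaves it unchanged. A driver could check a similar counter when it unloads to confirm that everything it allocated has been released.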
Higher-level drivers can leak IRPs by failing to free those they fabricate and pass to lower levels. Resource leaks cause sluggish system performance and, ultimately, a system crash.

Windows 2000 allows memory allocated by kernel-mode code to be tagged with an ID. When analyzing system memory after a crash, the tags help determine where the blocks are located, their size, and most importantly, which subsystem allocated them. Tools such as GFLAGS, supplied with the Platform SDK, enable the pool tagging feature globally for the system.

Resource leaks in progress can sometimes be detected by careful monitoring of system objects. Tools such as WINOBJ (supplied with the Platform SDK or by www.sysinternals.com) assist in this monitoring. Tracking resource leakage can be an arduous process. Considerable patience must be exercised when analyzing and isolating such problems.

THREAD HANGS

Another failure mode is caused by synchronous I/O requests that never return. The user-mode thread issuing the request blocks and remains in its wait state forever. This type of behavior can result from several causes.

First, an explicit bug might be failing to ever call IoCompleteRequest, thus never sending the IRP back to the I/O Manager. Not so obvious is the need to call IoStartNextPacket. Even if there are no pending requests to be processed, a driver must call this function because it marks the Device object as idle. Without this call, all new IRPs are placed in the pending queue, never arriving at the Start I/O routine.

Second, a logic error can hang a thread in a Dispatch routine. Perhaps the driver is attempting recursively to acquire a Fast Mutex or an Executive resource. Perhaps another code path has acquired a mutex but failed to release it. Subsequent requests for the mutex hang indefinitely.

Similarly, DMA drivers can hang while awaiting ownership of the Adapter object or its mapping registers.
The IRP is therefore never processed, which in turn queues all further IRPs. For slave DMA devices, the offending driver might cause other drivers using the same DMA channel to freeze.

Drivers that manage multiunit controllers can cause similar problems by not releasing the Controller object. New IRPs sent to any Device object using the locked Controller object queue indefinitely.

Unfortunately, there is no convenient way to see who currently owns Adapter or Controller objects, Mutexes, or Executive resources. It is sometimes helpful to maintain a resource management structure for tracking purposes. Each owner of a synchronization object should register its use within the structure, clearing it when the object is released. Of course, this technique requires a manual coding effort; the act of adding the code often reveals the source of the problem.

Another hit-or-miss way to isolate thread hang problems is to use the checked build of the Windows 2000 kernel. The checked build reports the use of system synchronization objects through DbgPrint statements that appear on an attached debugger.

SYSTEM HANGS

Occasionally, a driver error causes the entire system to lock up. For example, a deadly embrace involving multiple spin locks (or attempts to acquire the same spin lock multiple times on a single CPU) can freeze system operation. Endless loops in a driver's Interrupt Service Routine or DPC routine cause a similar failure.

Once this kind of system collapse occurs, it is difficult, if not impossible, to regain control of the system. The best approach is usually to debug the driver interactively, using WinDbg, and attempt to duplicate the failure.

Reproducing Driver Errors

One key to isolating a driver bug is the ability to reproduce the problem. Intermittent errors are the bane of a driver author's existence. Meticulously recording the exact sequence of events leading up to the failure increases the chance of reproducing it.
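One way to support this kind of record keeping inside the driver itself is a small in-memory trace of recent events that can be dumped after a failure. The following is a user-mode C sketch under stated assumptions: TraceLog, TraceRecord, and the other names are hypothetical, the capacity of 8 entries is arbitrary, and no locking is shown, although a real driver would have to guard the buffer against concurrent access from its ISR, DPC, and Dispatch routines.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define TRACE_ENTRIES 8              /* assumed capacity; oldest entries are overwritten */

typedef struct {
    unsigned long sequence;          /* position of the event in the overall history */
    const char   *event;             /* static string naming the event */
    unsigned long detail;            /* event-specific value, e.g. a status code */
} TraceEntry;

typedef struct {
    TraceEntry    entries[TRACE_ENTRIES];
    unsigned long next;              /* total number of events recorded so far */
} TraceLog;

void TraceInit(TraceLog *log)
{
    memset(log, 0, sizeof(*log));
}

/* Records one event, overwriting the oldest entry once the buffer is full. */
void TraceRecord(TraceLog *log, const char *event, unsigned long detail)
{
    TraceEntry *e = &log->entries[log->next % TRACE_ENTRIES];
    e->sequence = log->next;
    e->event = event;
    e->detail = detail;
    log->next++;
}

/* Number of entries currently retained (at most TRACE_ENTRIES). */
unsigned long TraceCount(const TraceLog *log)
{
    return log->next < TRACE_ENTRIES ? log->next : TRACE_ENTRIES;
}

/* Returns the i-th oldest retained entry, where 0 <= i < TraceCount(log). */
const TraceEntry *TraceGet(const TraceLog *log, unsigned long i)
{
    unsigned long first = log->next - TraceCount(log);
    assert(i < TraceCount(log));
    return &log->entries[(first + i) % TRACE_ENTRIES];
}

/* Writes the retained events, oldest first, to the given stream. */
void TraceDump(const TraceLog *log, FILE *out)
{
    unsigned long i;
    for (i = 0; i < TraceCount(log); i++) {
        const TraceEntry *e = TraceGet(log, i);
        fprintf(out, "%lu %s 0x%lx\n", e->sequence, e->event, e->detail);
    }
}
```

A caller would invoke TraceRecord at the points of interest (for example, on entry to each routine) and TraceDump once the failure has been observed. A fixed-size buffer keeps the cost of each record small and bounded, at the price of losing all but the most recent events.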
The causes of intermittent failures are numerous.

TIME DEPENDENCIES

Some problems occur only when a driver is running at full speed (or worse, at some exact slower speed). Such a problem may surface only under an unusually high I/O request rate or data transfer rate. Stress testing is usually a good way to attempt reproduction of this type of failure.

MULTIPROCESSOR DEPENDENCIES

If the driver is certified for multiprocessor operation, it must be tested on a multiprocessor platform. Numerous timing conditions present themselves only within the MP environment. For example, ISR, DPC, and I/O Timer routines can run simultaneously on an SMP machine. One warning: SMP debugging is very painful, so it is best to start with a single-processor environment.

MULTITHREADING DEPENDENCIES

If a driver manages sharable Device objects, the test strategy must access a single Device object from multiple threads. IRPs that flow from multiple threads often provoke unintended results.

OTHER CAUSES

A computer system involves many components. Sometimes behavior appears non-deterministic due to system load conditions, combinations of installed hardware and drivers, or other configuration differences. A detailed log is perhaps the best tool to assist in identifying this category of problem.

Defensive Coding Strategies

Any good software design anticipates problems. To facilitate the detection and isolation of failures, several coding techniques should be employed.
Keeping Track of Driver Bugs

Research has shown that bugs are not evenly distributed throughout code. Rather, they tend to cluster in a few specific routines, proportional to the routine's complexity. A carefully maintained bug log identifies the routines that deserve special attention. A good bug log also makes it possible to spot patterns that point to configuration-related failures. It can also highlight holes within the testing design and strategy itself. Good failure logs should contain at least the following: