Multithreading Tips and Tricks


As I've been emphasizing throughout this book, one of the keys to debugging is up-front planning. With multithreaded programming, up-front planning is the only way you can avoid the dreaded deadlocks. I break down the necessary planning for multithreaded applications into the following categories:

  • Don't do it.

  • Don't overdo it.

  • Multithread only small, discrete pieces.

  • Synchronize at the lowest level.

  • Spin your critical sections.

  • Don't use CreateThread/ExitThread.

  • The default memory manager might kill you.

  • Get the dump in the field.

  • Review the code—and review the code again.

  • Test on multiprocessor machines.

Don't Do It

This first tip might seem a little facetious, but I'm absolutely serious. Make sure there's no other way you can structure your program before you decide to incorporate multithreading into your application. When you include multithreading in your application, you're easily adding a minimum of an extra month of development and testing to your schedule.

If you're coding thick client applications and you need your program to do some lightweight background processing, check to see whether the work can be handled either through the Microsoft Foundation Class (MFC) library OnIdle processing or through a background periodic timer event. With a little creative thinking, you can probably find a way to avoid multithreading and the headaches that go with it.

Don't Overdo It

When it comes to server-based applications, you also have to be extremely careful not to create too many threads. One common mistake we've all seen is the server application in which each connection runs on its own thread. The average development team is doing well to get 10 concurrent connections during their heaviest testing, so the code looks like it works fine. It might even work fine when first deployed, but as soon as business picks up, the server starts bogging down because it's not scalable.

When working on server applications, you'll definitely want to take advantage of the excellent support Microsoft Windows 2000, Windows XP, and Windows Server 2003 have for thread pooling through the QueueUserWorkItem family of functions. That way you can fine-tune the tradeoff between the number of threads and the amount of work you want to get done. Developers are used to having Microsoft Internet Information Services (IIS) and COM+ handle thread pooling for them, but developing your own thread pooling system is not something most developers have much experience with, so make sure you spend extra time prototyping your particular situation. For instance, it's far easier than you might imagine to deadlock with a misused thread pool.

Multithread Only Small, Discrete Pieces

If you must multithread, try to keep it to small, discrete pieces. With thick client applications, you should stick to small pieces of work that are generally devoid of any user interface (UI) elements. For example, printing in the background is a smart use of multithreading because your application's UI will be able to accept input while data is printing.

In server applications, it's slightly different in that you need to judge whether the overhead of thread creation and work will actually speed up your application. Although threads are much more lightweight than processes, they still take quite a bit of overhead. Consequently, you'll want to make sure that the benefit of cranking up that thread will be worth the effort. For example, many server applications have to transfer data back and forth between some type of database. The cost of waiting for the write to that database can potentially be high. If you have a situation in which you don't need to do transactional recording, you can plop parts of the database write into a thread pool object and let it complete on its own time, and thus continue your processing. That way you'll be more responsive to the calling process and get more work done.

Synchronize at the Lowest Level

Since writing the first edition of this book, I have seen this particular multithreading rule broken more than any other. You have to keep your synchronization at the lowest level possible in your code. This might sound like common sense, but the mistake I see made over and over is that developers use fancy C++ wrapper classes that acquire the synchronization object in the constructor and release it in the destructor. The following code shows an example class you'll find in CRITICALSECTION.H in this book's source code:

class CUseCriticalSection ;

class CCriticalSection
{
public   :
    CCriticalSection ( DWORD dwSpinCount = 4000 )
    {
        InitializeCriticalSectionAndSpinCount ( &m_CritSec  ,
                                                dwSpinCount  ) ;
    }

    ~CCriticalSection ( )
    {
        DeleteCriticalSection ( &m_CritSec ) ;
    }

    friend class CUseCriticalSection ;

private  :
    CRITICAL_SECTION m_CritSec ;
} ;

class CUseCriticalSection
{
public   :
    CUseCriticalSection ( const CCriticalSection & cs )
    {
        m_cs = &cs ;
        EnterCriticalSection ( (LPCRITICAL_SECTION)&(m_cs->m_CritSec) ) ;
    }

    ~CUseCriticalSection ( )
    {
        LeaveCriticalSection ( (LPCRITICAL_SECTION)&(m_cs->m_CritSec) ) ;
        m_cs = NULL ;
    }

private  :
    CUseCriticalSection ( void )
    {
        m_cs = NULL ;
    }

    const CCriticalSection * m_cs ;
} ;

These classes look great from an object-oriented standpoint, but the implementation issues absolutely kill your performance. The constructor for a wrapper class such as CUseCriticalSection runs at the top of the scope where the object is declared, and the destructor doesn't run until that scope ends. Nearly everyone uses the synchronization class as shown in the following code:

void DoSomethingMultithreaded ( )
{
    CUseCriticalSection lock ( g_lpCS ) ;

    for ( . . . )
    {
        CallSomeOtherFunction ( . . . ) ;
    }

    // Here's the only piece of data really needing protection.
    m_xFoo = z ;

    YetAnotherCallHere ( . . . ) ;
}

The constructor grabs the critical section at the top curly brace, right after the prolog, yet the destructor is not called until the bottom curly brace, right before the epilog. That means you hold the critical section for the life of the function, even though DoSomethingMultithreaded is probably calling functions that have no need of it. All you're succeeding in doing is killing performance.

As you look at DoSomethingMultithreaded, you're probably thinking, "How expensive can acquiring a synchronization object really be?" If there's no contention for the synchronization object, the cost is very small. However, with multiple threads, the instant a thread can't acquire a synchronization object, you start paying a potentially astronomical cost!

Let's start by taking a look at what happens when you call WaitForSingleObject to acquire a synchronization object. Since you are an assembly language demigod from reading Chapter 7, you might want to follow along in the Disassembly window as it will show you exactly what I'm about to discuss. Note that I'm doing the work on Windows XP; the Windows 2000 version of WaitForSingleObject might be slightly different. WaitForSingleObject itself is simply a wrapper around WaitForSingleObjectEx, which does about 40 lines or so of assembly language and calls two functions to set up some data. Down toward the bottom of WaitForSingleObjectEx is a call to NtWaitForSingleObject from NTDLL.DLL. So the WaitForSingleObject function is a call to a wrapper of a wrapper. If you disassemble the address where NtWaitForSingleObject is in memory (use {,,ntdll}_NtWaitForSingleObject@12 in the Address field of the Disassembly window), you'll see that it's really a call to some weird function, ZwWaitForSingleObject, which is also out of NTDLL.DLL. (On Windows 2000, you'll stop at NtWaitForSingleObject.) As you look at the disassembly for ZwWaitForSingleObject, you'll see that it looks something like the following:

_ZwWaitForSingleObject@12:
77F7F4A3  mov         eax,10Fh
77F7F4A8  mov         edx,7FFE0300h
77F7F4AD  call        edx
77F7F4AF  ret         0Ch
77F7F4B2  nop

The real action is at that address, 0x7FFE0300. If you dump what's at that address, you'll see the following:

7FFE0300  mov         edx,esp
7FFE0302  sysenter
7FFE0304  ret

The middle line in the preceding code, showing the SYSENTER, is the magical assembly language instruction. It's one you'll see only in this context and never in code generated from your own source, so I didn't cover it in Chapter 7. Just from the name you can probably guess what it does: it's the instruction that transitions you from user mode to kernel mode. On Windows 2000, the INT 2E call does the same thing as the SYSENTER instruction. Why did I go through this long, drawn-out discussion to show you that WaitForSingleObject eventually calls SYSENTER and transitions to kernel mode? Simply because I wanted to show the overhead that call incurs: kernel mode has to figure out what you're waiting on and do all the rest of the work necessary for thread coordination. And if you actually have to wait on the kernel object you passed to WaitForSingleObject, thousands of instructions execute just to pull your thread out of the active thread queue and place it in the waiting thread queue.

Some eagle-eyed readers are thinking that if you call WaitForSingleObject when waiting on a kernel handle, you're going to hit that cost no matter what. That's true, because kernel handles used for cross-process synchronization give you no choice. For that very reason, most people doing internal synchronization that doesn't need to cross processes use the trusty standby, a critical section, as I showed earlier in the CUseCriticalSection class. As most of us have read at one time or another, critical sections are great because you can acquire them without going to kernel mode. That's exactly correct, but most people forget one crucial detail. What happens if you can't acquire that critical section? There obviously has to be some sort of synchronization backing it up. There is: it's a Microsoft Win32 semaphore handle.

I went through this long discussion because I wanted to fully explain the problem of holding onto those synchronization objects too long. I've seen applications in which we were able to track down the critical contention issues, remove the wrapper classes, and gain a significant performance boost. It's much better to explicitly call the synchronization acquire and release functions only around the actual data accesses, even when that means making those calls two, three, or more times per function. With critical-section synchronization in particular, the increase in speed is considerable. Keeping the synchronization tight around the actual data accesses is also one of your best defenses against inadvertent deadlocks.

I just want to reiterate that wrapper classes like CUseCriticalSection are not evil in themselves; it's the improper use that's the issue. What I've seen done that's perfectly acceptable is code like the following:

void DoSomeGoodMultithreaded ( )
{
    for ( . . . )
    {
        CallSomeOtherFunction ( . . . ) ;
    }

    // Protect the data access, but don't hold the lock too long.
    {
        CUseCriticalSection lock ( g_lpCS ) ;
        m_xFoo = z ;
    }

    YetAnotherCallHere ( . . . ) ;
}

The CUseCriticalSection helper class is still present, but by declaring it inside a standalone pair of curly braces, you give it a scope of its own, so the critical section is acquired and released just around the one spot that needs protection and is never held too long.

Spin Your Critical Sections

As I mentioned in the previous section, critical sections are the preferred method of synchronization when you are only synchronizing inside a process. However, you can get a considerable performance boost using critical sections if you remember to spin!

Years ago, some folks at Microsoft were wondering about multithreaded application performance, so they came up with several testing scenarios to find out more. After lots of study, they found something quite counterintuitive, though not unheard of in computer science. They found that in certain cases it was much faster to poll than to actually perform an operation. We've all been told since we were wee programmers never to poll, but in the case of critical sections, that's exactly what you want to do.

The vast majority of critical-section use protects small pieces of data. As I described in the last section, a critical section is backed by a semaphore, and making the call into kernel mode to acquire it is extremely expensive. The original implementation of EnterCriticalSection simply looked to see whether the critical section could be acquired. If it couldn't, EnterCriticalSection went right into kernel mode. In most cases, by the time the thread got into kernel mode and back down, the other thread had released the critical section a million years ago in computer time. The counterintuitive idea the Microsoft researchers came up with for multiple-CPU systems was to check whether the critical section was available and, if it wasn't, spin the CPU and then check again. If the critical section still wasn't available after the second check, only then transition to kernel mode. (On single-CPU systems, the spin count is ignored.) The idea was that keeping the thread in user mode, even though it was spinning on nothing, was tremendously faster than transitioning to kernel mode.

Two functions allow you to set the critical-section spin count. The first is InitializeCriticalSectionAndSpinCount, which you should use in place of InitializeCriticalSection. The second, SetCriticalSectionSpinCount, lets you change the value you originally started with, or set a spin count for library code that uses only InitializeCriticalSection; that assumes, of course, that you can get at the CRITICAL_SECTION pointer in that code.

Determining your spin count can be problematic. If you work in an environment in which you have the two to three weeks to run through all the scenarios, grab all those interns sitting around and have fun. However, most of us aren't that lucky. I always use the value 4,000 for my spin count. That's what Microsoft uses for the operating system heaps, and I always figured that my code was probably less intensive than those, so that number should be big enough to keep my code in user mode almost all the time.

Don't Use CreateThread/ExitThread

One of the more insidious mistakes people make in multithreaded development is using CreateThread. Of course, that raises the question: if you can't use CreateThread to start a thread, how can you get any threads cranked up? Instead of CreateThread, you should always use _beginthreadex, the C run-time function, to start your threads. As you'd expect, just as ExitThread is paired with CreateThread to end a thread, _beginthreadex has its own matching exit function, _endthreadex, that you'll need to use instead as well.

You might be using CreateThread in your application right now and not be experiencing any problems whatsoever. Unfortunately, some very subtle bugs can occur because the C run time is not initialized for threads created with CreateThread. The C run time keeps per-thread data for certain standard functions that were designed before multithreaded applications were the norm; for example, strtok holds the string it is parsing in per-thread storage. Using _beginthreadex ensures that the per-thread data is set up, along with everything else the C run time needs. For proper cleanup when you must exit a thread prematurely, use _endthreadex, which frees the C run-time resources for that thread.

The _beginthreadex function works the same way and takes the same types of parameters as CreateThread. To end your thread, simply return from the thread function; if you need to leave early, call the _endthreadex C run-time function. As with the CreateThread API function, _beginthreadex returns the thread handle, which you must pass to CloseHandle to avoid a handle leak.

If you look up _beginthreadex, you'll also see a C run-time function named _beginthread. You'll want to avoid that function like the plague because its default behavior is a bug, in my opinion. The C run time automatically closes the handle returned by _beginthread when the thread ends, so if the thread finishes quickly, the handle you got back might already be invalid, or even recycled for another spawned thread. In fact, the documentation on _beginthread itself indicates that it's safer to use _beginthreadex. When reviewing your code, make sure to note calls to _beginthread and _endthread so that you can change them to _beginthreadex and _endthreadex, respectively.

The Default Memory Manager Might Kill You

A client of ours wanted to make the fastest possible server application. When they brought us in to tune it, they had found that adding threads to the application, which according to their design should have scaled the processing power, was having no effect. One of the first things I did when their server application was warmed up and cooking along was to stop it in the debugger and, from the Threads window, look at where each thread was sitting.

This application made use of quite a bit of the Standard Template Library (STL), which, as I pointed out when discussing WinDBG in the "Trace and Watch Data" section of Chapter 8, can be a performance killer all its own because it allocates tons of memory behind your back. By stopping the server application, I was looking to see which threads were in the C run-time memory management system. We all have the memory management code (you install the C run-time source code every time you install Microsoft Visual Studio, right?), and I'd seen that a single critical section protects the whole memory management system. That always scared me because I thought it could lead to performance issues. When I looked at our client's application, I was horrified to see that 38 out of 50 threads were blocked on the C run-time memory management critical section! More than half of their application was waiting and doing nothing! Needless to say, they were not thrilled to find this out.

For most applications, the Microsoft-supplied C run time is perfectly fine and won't cause you memory problems. However, in larger, high-volume server applications, that single critical section can eat you up. My first recommendation is always to think long and hard before using STL, and if you insist on using it, take a look at the STLport version I discussed back in Chapter 2. As I've discussed in numerous places earlier in this book, the Microsoft-supplied STL can become a real bottleneck in high-volume, multithreaded applications.

The bigger problem is what to do about the C run time's single critical section. What we really need is to give each thread its own heap instead of one global heap shared by all threads. That way a thread would never block, and never make that kernel-mode transition, just to allocate or deallocate memory. Of course, the solution is not as simple as creating one heap per thread, because you have to handle the case in which memory is allocated in one thread and deallocated in another. Fortunately, there are three solutions to the conundrum.

The first solution is the commercial memory management systems on the market, which handle the per-thread heap code for you. Unfortunately, the pricing models for those products are borderline extortion, and your manager will never buy anything that expensive. The second solution takes advantage of one of the big performance boosts in Windows 2000: the major improvements Microsoft made to the operating system heaps (those created with HeapCreate and accessed with HeapAlloc and HeapFree). To take advantage of the operating system heap, you can easily replace your malloc/free allocations with the appropriate Heap* functions. For the C++ new and delete operators, you'll need to provide replacement global functions. The third and last solution, if you'll be running on multiprocessor machines, is to use Emery Berger's excellent Hoard multiprocessor memory management code (http://www.hoard.org). It's a drop-in replacement for the C and C++ memory routines and is very fast on multiprocessor machines. If you have trouble getting it to link because of duplicate symbols, you'll have to use the /FORCE:MULTIPLE command-line option to LINK.EXE. Keep in mind that Hoard is for multiprocessor machines and can actually run slower than the default allocators on single-processor systems.

Get the Dump in the Field

One of the most frustrating experiences is when your program appears to be deadlocking in the field and, no matter how hard you try, you can't duplicate the deadlock. However, with the latest improvements in DBGHELP.DLL, you need never be in that situation again. The new minidump functions allow you to take a snapshot of the deadlock so that you can debug it at your leisure. In Chapter 13, I discussed the particulars of the minidump function and my improved wrapper around it, SnapCurrentProcessMiniDump, in BUGSLAYERUTIL.DLL.

To get the dump in the field, you'll want to create a background thread that simply creates and waits on an event. When that event is signaled, the thread calls SnapCurrentProcessMiniDump and snaps the dump to disk. The following pseudocode snippet shows the function. To tickle the event, provide a separate executable that the user runs to set it.

DWORD WINAPI DumperThread ( LPVOID )
{
    HANDLE hEvents[ 2 ] ;
    // Auto-reset, so each SetEvent from outside produces exactly one dump.
    hEvents[ 0 ] = CreateEvent ( NULL                  ,
                                 FALSE                 ,
                                 FALSE                 ,
                                 _T ( "DumperThread" )  ) ;
    hEvents[ 1 ] = CreateEvent ( NULL                      ,
                                 TRUE                      ,
                                 FALSE                     ,
                                 _T ( "KillDumperThread" ) ) ;

    DWORD dwRet = WaitForMultipleObjects ( 2 , hEvents , FALSE , INFINITE ) ;
    while ( WAIT_OBJECT_0 == dwRet )
    {
        // You might want to create a unique filename each time.
        SnapCurrentProcessMiniDump ( MiniDumpWithFullMemory ,
                                     _T ( "Program.DMP" )   ) ;
        dwRet = WaitForMultipleObjects ( 2 , hEvents , FALSE , INFINITE ) ;
    }
    VERIFY ( CloseHandle ( hEvents[ 0 ] ) ) ;
    VERIFY ( CloseHandle ( hEvents[ 1 ] ) ) ;
    return ( TRUE ) ;
}

Review the Code—And Review the Code Again

If you really do need to multithread your application, you must allow plenty of time to walk through your multithreaded code in full code reviews. The trick is to assign one person to each thread in your code and one person to each synchronization object. In many ways, the code review in multithreaded programming is really a "multithreaded" review.

When you review the code, pretend that each thread is running at real-time priority on its own dedicated CPU and that the thread is never interrupted. Each "thread person" walks through the code, paying attention only to the particular code that his thread is supposed to be executing. When the "thread person" is ready to acquire a synchronization object, the "object person" literally moves behind the "thread person." When the "thread person" releases a synchronization object, the "object person" goes to a neutral corner of the room. In addition to the thread and object representatives, you should have some developers who are monitoring the overall thread activity so that they can assess the program's flow and help determine the points at which different threads deadlock.

As you're working through the code review, keep in mind that the operating system has its own synchronization objects that it applies to your process and that those objects can cause deadlocks as well. The process critical section, explained in the next section's Debugging War Story "The Deadlock Makes No Sense," and the infamous Microsoft Windows 9x/Me Win16 mutex are both synchronization objects that the operating system uses in your process. Be sure to pay attention to anything that could possibly cause any sort of contention in your application.

Test on Multiprocessor Machines

As I mentioned, a multithreaded application requires a much higher level of testing than a single-threaded one. The most important tip I have for testing your multithreaded application is to test it thoroughly on multiprocessor machines. And I don't mean simply running your application through a few paces; I mean continually testing your program in all possible scenarios. Even if your application runs perfectly on single-processor machines, a multiprocessor machine will turn up deadlocks you never thought possible.

The best approach to this kind of testing is to have the team's developers running the application on multiprocessor machines every day. If you're a manager and you don't have any multiprocessor machines in your shop, stop reading right now and immediately equip half your developers and QA testers with multiprocessor machines! If you're a developer without a multiprocessor machine, show this chapter to your manager and demand the proper equipment to do your job! Several people have written me and mentioned that showing this chapter really did help them get a multiprocessor machine, so don't hesitate to tell your manager that John Robbins said the company owed you one.

start sidebar
Debugging War Story: Saving Some Jobs

The Battle

When a vice president of development called and said he wanted me to work on his team's deadlock now, I knew this job was going to be tough. The guy was abrupt and not too happy that he had to bring a consultant in to help his team. He called his two key developers together and the four of us got into a conference call. The VP was ranting and raving that the developers had slipped too long because of this deadlock bug, and he was not happy. I could just imagine the two engineers cringing as their dirty laundry got aired to some guy on the phone they didn't know. They were porting the application from, in the VP's words, "a real operating system" (UNIX) to "this (censored) toy operating system called Windows," and that just "ruined his year." Of course, when I asked him why, he did have to admit that "it was to stay in business." I had to smile on my end of the phone quite a bit about that!

The engineers mailed me the code and we started going through it with the VP stalking around the conference room. As I got my bearings in the code and got toward the area where the deadlock was, I immediately broke out in a huge flop sweat and felt my heart pounding. I knew that if I'd said that all they had to do was backspace over a D, an N, an E, and an S, and type a P, an O, an S, and a T, the VP was going to blow a fuse and probably fire those two engineers.

The Outcome

I didn't say anything for quite a bit until I collected my thoughts. After a long exhale and saying "Whew, this looks really, really tough," I told the development manager that it was going to take a few hours to sort this out. It was best if the engineers and I worked on it alone because I'm sure he had much more important things to do than to listen to three engineers read hexadecimal numbers to each other over the phone. Fortunately, he bought it and I told the engineers I would call them back in their computer lab.

When I called the engineers, I told them they had made a very common mistake among UNIX developers coming to Windows: in some versions of UNIX, the function that sends a message from one thread to another returns immediately, but in Windows, SendMessage doesn't return until the message is processed. I could see in the code that the thread they were sending to was already blocked on the synchronization object, so that SendMessage was the reason for the deadlock. They felt a little bad when I told them that to fix their problem they just needed to change SendMessage to PostMessage. I told them it was perfectly understandable that they misunderstood what was going on. We spent the rest of the day going over other things they were running into, such as DLL relocations and building their applications with full debug symbols. When we got back on the phone with the VP, I just told him it was one of the toughest bugs I'd worked on, but his engineers really went the extra mile to help make it right. In the end, everyone was happy. The VP got his bug fixed, the engineers learned a bunch of hints to help them develop better, and I didn't get anyone fired!

The Lesson

If you've got multiple threads and you want to use message communications between them, think long and hard about how those synchronization objects and messages will interact. If you're in that situation, try always to use PostMessage. Of course, if you're using messages to pass pointers or anything larger than the 32-bit WPARAM and LPARAM values, PostMessage won't work, because the data the parameters point to can be gone by the time the other thread processes the message. In that case, use SendMessageTimeout so that you'll at least return at some point and can then check whether the other thread is deadlocked or simply couldn't process the message.

end sidebar

start sidebar
Debugging War Story: The Deadlock Makes No Sense

The Battle

A team was developing an application and ran into a nasty deadlock that made no sense. After struggling with the deadlock for a couple of days—an ordeal that brought development to a standstill—the team asked me to come help them figure out the bug.

The product they were working on had an interesting architecture and was heavily multithreaded. The deadlock they were running into occurred only at a certain time, and it always happened in the middle of a series of DLL loads. The program deadlocked when WaitForSingleObject was called to check whether a thread was able to create some shared objects.

The team was good and had already double-checked and triple-checked their code for potential deadlocks—but they remained completely stumped. I asked if they had walked through the code to check for deadlocks, and they assured me that they had.

The Outcome

I remember this situation fondly because it was one of the few times I got to look like a hero within 5 minutes of starting the debugger. Once the team duplicated the deadlock, I took a quick look at the Call Stack window and noticed that the program was waiting on a thread handle inside DllMain. As part of their architecture, when a certain DLL loads, that DLL's DllMain starts another thread. It then immediately calls WaitForSingleObject on an acknowledge event object to ensure that the spawned thread was able to properly initialize some important shared objects before continuing with the rest of the DllMain processing.

What the team didn't know is that each process has something named a process critical section that the operating system uses to synchronize various actions happening behind the scenes in a process. One situation in which the process critical section is used is to serialize the execution of DllMain for the four cases in which DllMain is called: DLL_PROCESS_ATTACH, DLL_THREAD_ATTACH, DLL_THREAD_DETACH, and DLL_PROCESS_DETACH. The second parameter to DllMain indicates the reason the call to DllMain occurred.

In the team's application, the call to LoadLibrary caused the operating system to grab the process critical section so that the operating system could call the DLL's DllMain for the DLL_PROCESS_ATTACH case. The DLL's DllMain function then spawned a second thread. Whenever a process spawns a new thread, the operating system grabs the process critical section so that it can call the DllMain function of each loaded DLL for the DLL_THREAD_ATTACH case. In this particular program, the second thread blocked because the first thread was holding the process critical section. Unfortunately, the first thread then called WaitForSingleObject to ensure that the second thread was able to properly initialize some shared objects. Because the second thread was blocked on the process critical section, held by the first thread, and the first thread blocked while waiting on the second thread, the result was the usual deadlock.
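The anti-pattern boils down to something like the following sketch (the names are mine, not the team's code). Loading a DLL built from it while it starts a second thread reproduces exactly the deadlock described, so treat it as a cautionary example, not something to run outside a debugger:

```cpp
#include <windows.h>
#include <process.h>

static HANDLE g_hInitDone ;

static unsigned __stdcall InitThread ( void * )
{
    // This thread never even gets here: its DLL_THREAD_ATTACH call into
    // every loaded DLL's DllMain blocks on the process critical section,
    // which the first thread holds for the duration of DLL_PROCESS_ATTACH.
    SetEvent ( g_hInitDone ) ;
    return 0 ;
}

BOOL WINAPI DllMain ( HINSTANCE , DWORD dwReason , LPVOID )
{
    if ( DLL_PROCESS_ATTACH == dwReason )
    {
        g_hInitDone = CreateEvent ( NULL , TRUE , FALSE , NULL ) ;
        HANDLE hThread = (HANDLE)_beginthreadex ( NULL , 0 , InitThread ,
                                                  NULL , 0 , NULL ) ;
        // Deadlock: waiting on a thread that is itself blocked behind
        // the process critical section this thread is holding.
        WaitForSingleObject ( g_hInitDone , INFINITE ) ;
        CloseHandle ( hThread ) ;
    }
    return TRUE ;
}
```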

The Lesson

The obvious lesson is to avoid any Wait* or EnterCriticalSection calls inside DllMain, because while your thread holds the process critical section, any other thread that needs to run DllMain will block. As you can see, even experienced developers can get bitten by multithreaded bugs, and as I mentioned earlier, this kind of bug is often in the place you least expect it.

end sidebar




Debugging Applications for Microsoft® .NET and Microsoft Windows® (Pro-Developer)
ISBN: 0735615365
Year: 2003
Pages: 177
Authors: John Robbins
