Implementing SWS | Debugging Applications for MicrosoftВ® .NET and Microsoft WindowsВ® (Pro-Developer)

As happens with many projects, when I first considered implementing SWS, I thought it was going to be a pretty tall task, but as I got into the coding, implementing the tool turned out to be a somewhat simpler operation. My plan was to tackle the _penter function, the binary file and symbol enumeration, and the run time and the tuning algorithm, in that order.

The _penter Function

I have to admit that before I settled on using the _penter function through the /Gh compiler switch, I looked at attempting SWS so that it would run on an unmodified binary. While it's certainly doable, it's quite complicated—I'd have to write a debugger to ensure my hooking code was injected before anything else, and I'd have to insert 5-byte jumps over every function I found in the symbol table. Having implemented similar code for commercial products, I knew how much hard work that would be. In the end, I felt it was perfectly acceptable to require a special build, especially when we have tools like SettingsMaster, which make the SWS build trivial to add to your configuration.

With the function-hooking design mercifully out of the way, I started looking at the actions that had to take place each time the _penter function was called. Listing 19-1 shows my _penter function and its supporting code. As you can see, the code is naked, and I generate my own prolog and epilog to handle function setup and teardown. I go naked because the _penter documentation requires it and it makes getting the return address of the function easier. Fortunately, when the compiler says it will generate _penter before anything else, it means it! The following disassembly shows the effects of the /Gh switch. As you can see, the call to _penter comes even before the PUSH EBP standard function prolog.

Bar: 00401050: E8B7000000           call      _penter 00401055: 55                   push      ebp 00401056: 8BEC                 mov       ebp,esp 00401058: E8A8FFFFFF           call      ILT+0(?Foo 0040105D: 3BEC                 cmp       ebp,esp 0040105F: E8AE000000           call      _chkesp 00401064: 5D                   pop       ebp 00401065: C3                   ret

If you daydream a little bit, you can see that the /Gh switch might allow some other interesting utilities. The first one that pops into my mind is a performance tool. Since Microsoft now offers the /GH switch (Enable _pexit Hook Function), using such a tool should be much easier since you'll know when the function ends. I encourage you to do some thinking about the power of _penter and _pexit.

Common Debugging Question: Why do some function calls start with "ILT" in the Disassembly window?

In the discussion of the code generated by the /Gh switch, one of the function calls is CALL ILT+0(?Foo, which has caused confusion for many developers. What you're seeing in action is the magic of incremental linking. ILT stands for Incremental Link Table.

When creating a debug build, the linker wants to link the binary as fast as possible. Consequently, each function is given quite a bit of padding so that if the function changes, the linker only has to overwrite that function in place; it doesn't have to move things around in the binary. By the way, the padding around those functions comprises INT 3 instructions. However, a function can grow larger than the space remaining. In that case, the linker has to move the function in the binary.

If each function that called the now-moved function called the real function address, every time a function moved because of a new link, the linker would have to grind through all the CALL instructions and update each one. To save that processing time, the linker builds the special Incremental Link Table, which it then uses for each call. Instead of updating a bunch of locations when a function changes, the linker now has to update only a single spot in the binary. Listing 19-1 shows an ILT table.

@ILT+0(_wmain): 00401005  jmp         wmain (401070h)  @ILT+5(??_GCResString@@UAEPAXI@Z): 0040100A  jmp         CResString::`scalar deleting destructor' (401B40h)  @ILT+10(?ParseCommandLine@@YAHHQAPAGAAUtag_CMDOPTS@@@Z): 0040100F  jmp         ParseCommandLine (401439h)  @ILT+15(?ShowHelp@@YAXXZ): 00401014  jmp         ShowHelp (401644h)  @ILT+20(??1CResString@@UAE@XZ): 00401019  jmp         CResString::~CResString (401A00h)  @ILT+25(?LoadStringW@CResString@@QAEPBGI@Z): 0040101E  jmp         CResString::LoadStringW (401A30h)  @ILT+30(??2@YAPAXI@Z): 00401023  jmp         operator new (401B90h)  @ILT+35(??_GCResString@@UAEPAXI@Z): 00401028  jmp         CResString::`scalar deleting destructor' (401B40h)  @ILT+40(??0CResString@@QAE@PAUHINSTANCE__@@@Z): 0040102D  jmp         CResString::CResString (401990h)  @ILT+45(??BCResString@@QBEPBGXZ): 00401032  jmp         CResString::operator unsigned short const * (401B20h)

Therefore, when an instruction is CALL @ILT+15(?ShowHelp@@YAXXZ), the call is the @ILT+15(?ShowHelp@@YAXXZ) label, which is a jump to the real instruction.

Listing 19-1: PENTERHOOK.CPP

 /*---------------------------------------------------------------------- Debugging Applications for Microsoft .NET and Microsoft Windows Copyright   1997-2003 John Robbins -- All rights reserved. ----------------------------------------------------------------------*/ /*//////////////////////////////////////////////////////////////////////                               The Includes //////////////////////////////////////////////////////////////////////*/ #include "stdafx.h" #include "SWSDLL.h" #include "SymbolEngine.h" #include "VerboseMacros.h" #include "ModuleItemArray.h"     /*//////////////////////////////////////////////////////////////////////                      Defines, Constants, & Typedefs //////////////////////////////////////////////////////////////////////*/ // The event handle the _penter function checks to see if the processing  // is disabled. static HANDLE g_hStartStopEvent = NULL ; // The simple automatic class I used to ensure various things are // initialized. class CAutoMatic { public  :     CAutoMatic ( void )     {         g_hStartStopEvent = ::CreateEvent ( NULL                ,                                             TRUE                ,                                             FALSE               ,                                             SWS_STOPSTART_EVENT  ) ;         ASSERT ( NULL != g_hStartStopEvent ) ;     }     ~CAutoMatic ( void )     {         VERIFY ( ::CloseHandle ( g_hStartStopEvent ) ) ;     } } ;     /*//////////////////////////////////////////////////////////////////////                            File Scope Globals //////////////////////////////////////////////////////////////////////*/ // The automatic class. static CAutoMatic g_cAuto ;     // The module array. static CModuleItemArray g_cModArray ;     /*//////////////////////////////////////////////////////////////////////                           Function Prototypes //////////////////////////////////////////////////////////////////////*/     /*////////////////////////////////////////////////////////////////////// //////////////////////////////////////////////////////////////////////*/     extern "C" void SWSDLL_DLLINTERFACE __declspec(naked) _penter ( void ) {     DWORD_PTR dwCallerFunc ;         // The function prolog.     __asm     {         PUSH EBP                    // Set up the standard stack frame.         MOV  EBP , ESP             PUSH EAX                    // Save off EAX as I need to use it                                     // before saving all registers.         MOV  EAX , ESP              // Get the current stack value into                                     //  EAX.             SUB  ESP , __LOCAL_SIZE     // Save off the space needed by the                                     // local variables.             PUSHAD                      // Save off all general register                                     // values.             // Now I can calculate the return address.         ADD  EAX , 04h + 04h        // Account for the PUSH EBP and the                                     // PUSH EAX.         MOV  EAX , [EAX]            // Grab the return address.         SUB  EAX , 5                // Take off the 5 byte jump used to                                     // call _penter to get to the start                                     // of the function.         MOV  [dwCallerFunc] , EAX   // Save the caller function.         }         // If the start/stop event is signaled, don't do any processing.     if ( WAIT_TIMEOUT == WaitForSingleObject ( g_hStartStopEvent , 0 ))     {         // Do the work.         g_cModArray.IncrementFunctionEntry ( dwCallerFunc ) ;     }     // The function epilog.     __asm     {         POPAD                       // Restore all general purpose                                     // values.             ADD ESP , __LOCAL_SIZE      // Remove space needed for locals.             POP EAX                     // Restore EAX             MOV ESP , EBP               // Restore the standard stack frame.         POP EBP         RET                         // Return to caller.     } }

The .SWS File Format and Symbol Enumeration

As you can see from Listing 19-1, the _penter function isn't very exciting. Things get more interesting when you start to think about how to organize the address data. Since I need to associate an address with a function name, I use my good friend, the DBGHELP.DLL symbol engine, somewhere in the program. However, looking up symbols in the symbol engine is not the fastest of operations, so I had to find a way to keep the data access small and fast because it will be touched by each function call.

When I thought about it, what I wanted for the data arrangement was just a sorted array of all the function addresses with their associated execution counts. That way, when I had the return address in the _penter function, I could just do a quick binary search for the address. This solution seemed relatively straightforward because all I had to do was enumerate the modules' symbols and sort the array of functions. I had all the necessary data to do this.

I thought that SWS, like the WST, should keep each module's execution counts for a run in a separate data file. I favor that approach because if you don't want a particular application run, you can delete it from the overall merged data set. The WST uses the <module name>.<run number> naming format, but I wanted SWS to use a <module name>.<run number>.SWS scheme so that I could eventually write a graphical program to make combining all the runs easier.

After I decided how I wanted to approach the run-time data processing, I considered how to generate the actual order file. As I already mentioned, I needed to concatenate the individual runs. However, when I thought about producing the actual order file, I realized that I hadn't brought forward all the necessary data. The order file needs the names of the functions as well as their sizes, but all I had planned to store were the function addresses. Although I could have used the symbol engine again when generating the actual order file, the only way to get a symbol's size is to enumerate all the module's symbols. Since I was already handling full symbol enumeration in the initial data generation phases, I thought I should just go ahead and add the function sizes to the file. I didn't need to store the function names because I could always look them up by loading the PDB file for the binary.

If you did find Russ Blake's Optimizing Windows NT and you read the chapter "Tuning the Working Set of Your Application," you might wonder why I didn't mention anything about bit sets and time intervals. The Windows NT performance team implemented WST using a scheme in which a bit in a bit set represents each function. Every so many milliseconds, WST writes out the bit set to record which functions executed in that time interval. I've often wondered why they implemented WST that way. At first glance, it seems like you do save some memory with the bit set, but you have to remember that there must be some way of mapping a function address to a bit. I don't think the bit set scheme saves much more memory than my scheme. I believe the Windows NT performance team used the bit set scheme because they were tuning the operating system with WST. In contrast, I'm looking at individual binaries, so it's an issue of scale.

When designing my data structures, I was concerned about one point: I wanted to simply increment a value whenever a function executed. In multithreaded programs, I need to protect that value so that only one thread at a time is manipulating it. My goal was to make the SWS run time as fast as possible, so the best choice was to increment the function count with the InterlockedIncrement API function. Since that function uses the hardware locking mechanism (the LOCK prefix on an instruction), I can guarantee data coherency in multithreaded applications. However, in Microsoft Win32, the largest data size that you can pass to InterlockedIncrement is a 32-bit value. Therefore, it's possible to wrap after 4,294,967,295 function calls. Four billion is a large number, but some message loops could execute that much if you run your application long enough.

The WST program avoided this problem at run time by recording only when the function executed and then calculating the total executions after the fact, when it's easier to deal with the possible wrapping. When tuning the operating system, the odds are quite good that you could execute some functions more than four billion times. However, with SWS, I'm willing to take the chance during my initial design and perform some testing to determine whether function count wrap is a problem. Function count wrap might be less likely with my design because I left the hooks in so that you can start and stop the data collection at will. Thus, each time you run the application, you can generate data only in the cases in which you need it.

In the extremely rare case in which you do execute a function more than 4 billion times, you have two options. The first is to implement the data counting variable as a 64-bit integer and have a per-module synchronization object to protect all the bits. The other option is drastic in that I will need to implement the SWS run time to work like WST. Of course, the wrapping problem happens only on Win32, so another option would be to implement SWS for Microsoft Win64 only, because there's no way you'll ever hit 18,446,744,073,709,551,615 function executions in your lifetime.

You can see how simple the file format actually is by looking at Listing 19-2. I quickly realized that I needed a common base class to handle all the file manipulation, which is CSWSFile defined in SWSFILE.H and SWSFILE.CPP. The class is really nothing more than a wrapper over your normal Win32 file manipulation.

Listing 19-2: FILEFORMAT.H

/*---------------------------------------------------------------------- Debugging Applications for Microsoft .NET and Microsoft Windows Copyright   1997-2003 John Robbins -- All rights reserved. ----------------------------------------------------------------------*/ #ifndef _FILEFORMAT_H #define _FILEFORMAT_H     /*//////////////////////////////////////////////////////////////////////                          Defines and Structures //////////////////////////////////////////////////////////////////////*/ // The SWS file signature (SWS2). #define SIG_SWSFILE '2SWS' #define EXT_SWSFILE _T ( ".SWS" )     /*////////////////////////////////////////////////////////////////////// The SWS file header. //////////////////////////////////////////////////////////////////////*/ typedef struct tag_SWSFILEHEADER {     // The file signature. See the SIG_* defines above.      DWORD   dwSignature ;     // The link time for the binary associated with this file.     DWORD   dwLinkTime ;     // The load address for the binary.     DWORD64 dwLoadAddr ;     // The size of the image.     DWORD   dwImageSize ;     // The number of entries in this file.     DWORD   dwEntryCount ;     // The flags field.     DWORD   dwFlags ;     // The modulename for this file.     TCHAR   szModuleName[ MAX_PATH ] ;     DWORD   dwPadding ; } SWSFILEHEADER , * LPSWSFILEHEADER ;     /*////////////////////////////////////////////////////////////////////// The entry type for SWS files. //////////////////////////////////////////////////////////////////////*/ typedef struct tag_SWSENTRY {     // The function address.     DWORD64 dwFnAddr ;     // The function size.     DWORD   dwSize ;     // The execution count.     DWORD   dwExecCount ; } SWSENTRY , * LPSWSENTRY ;     #endif  // _FILEFORMAT_H

One key issue I want to point out in the file format is that I store the binary files' load address in the SWS file. Originally I stored only the addresses for the functions, but I remembered the case in which the binary might be relocated in memory. In that situation, the SWSDLL.DLL run time would be called with one address, but I wouldn't have any record of that address in any of the module's loaded SWS files. Even though we should always rebase our DLLs, sometimes we forget, and I wanted to make sure SWS didn't crater when that happened.

One area that did give me a little trouble was generating the symbols for the initial SWS module. Because of the way programs are linked and symbols are generated, many of the symbols reported in a module are not those that have _penter calls inserted in them. For example, if you link against the static C run time, your module will have all sorts of C run-time functions added. Since the address lookup would be faster in the SWS run time if there were fewer symbols, I looked at a few ways to minimize the numbers.

Listing 19-3 shows the symbol enumeration callback and how I started limiting the number of symbols. The first step I took was to check whether the symbol had corresponding line information with it. Because I assumed that functions that have _penter calls were properly compiled using the steps I specified earlier, I safely got rid of many extraneous symbols. My next step in eliminating symbols was to check whether specific strings are part of the symbols. For example, any symbols that start with _imp__ are imported functions from other DLLs. There are two other checks that I didn't implement but instead left as exercises for you dear readers. The first is that you should be able to flag symbols from specific files, which SWS should ignore. The main reason for implementing this feature is so that you can add all the C run-time source files to that list. The last symbol elimination trick ensures that the address in question comes only from a code section in the module. You might not need this last check, but it would ensure that only true code symbols are used.

Listing 19-3: SWS symbol enumeration

 /*---------------------------------------------------------------------- FUNCTION        :   SymEnumSyms DISCUSSION      :     The symbol enumeration callback. The only thing done here is that the data is added to the SWS files. PARAMETERS      :     szSymbolName    - The symbol name.     ulSymbolAddress - The symbol address.     ulSymbolSize    - The size, in bytes, of the symbol.     pUserContext    - The SWS file.      RETURNS         :     TRUE  - Everything is cool.     FALSE - There was a problem adding data to the file. ----------------------------------------------------------------------*/ BOOL CALLBACK SymEnumSyms ( PSTR    szSymbolName    ,                             DWORD64 ulSymbolAddress ,                             ULONG   ulSymbolSize    ,                             PVOID   pUserContext     ) {     LPENUMSYMCTX pCTX = (LPENUMSYMCTX)pUserContext ;         CImageHlp_Line cLine ;     DWORD dwDisp ;     if ( FALSE == g_cSym.SymGetLineFromAddr ( ulSymbolAddress ,                                               &dwDisp         ,                                               &cLine            ) )     {         // If no source and line was found for this symbol, ignore         // it.         return ( TRUE ) ;     }         // Future enhancements to ignoring particular symbols:     // 1. Add an ignore file list to search here.     // 2. Check if this address is in the modules code section. This     // will avoid putting IAT symbols into the output files.          // Is this symbol in the ignore containing list?     for ( int i = 0 ; i < IGNORE_CONTAINING_COUNT ; i++ )     {         if ( NULL != strstr ( szSymbolName              ,                               g_szIgnoreContaining[ i ]  ) )         {             // Drop out now.             return ( TRUE ) ;         }     }          if ( NULL != pCTX->pfnVerboseOutput )     { #ifdef _WIN64         pCTX->pfnVerboseOutput(_T("  Adding Symbol : 0x%016I64X %S\n"), #else         pCTX->pfnVerboseOutput(_T("  Adding Symbol : 0x%08X %S\n" ) , #endif                                 (DWORD_PTR)ulSymbolAddress          ,                                 szSymbolName                          );     }     if ( FALSE == pCTX->pSWSFile->AddData ( ulSymbolAddress ,                                             ulSymbolSize    ,                                             0                ) )     {         ASSERT ( !"Adding to SWS file failed!" ) ;         return ( FALSE ) ;     }     pCTX->iAddedCount++ ;     return ( TRUE ) ; }

The Run Time and Tuning

One symbol problem I had at run time happened because the symbol engine doesn't return static functions. Being Mr. Contentious, when I didn't find an address that came from a module, I popped up my usual six or seven assertion message boxes. At first I was a little confused that I was seeing the assertions, because one of my test programs didn't have anything declared as static. When I looked at the stack in the debugger, I found I was looking at a symbol named something like $E127. There was a call to _penter in the function and everything looked good. It finally dawned on me that I was looking at a compiler-generated function, such as a copy constructor. Although I really would have liked to keep the error checking in the code, I noticed that there were quite a few of those compiler-generated functions in some programs, so all I could do was report the problem with a TRACE statement in debug builds.

The last interesting aspect of SWS is the tuning of a module. The code for the TuneModule function is large, so Listing 19-4 shows only the algorithm. As you can see, I work to ensure that I pack each code page with as many functions as possible to eliminate padding. The interesting part is where I hunt down the best-fitting function. I decided to try to pack as many functions with execution counts as possible into the pages. If I couldn't find a function with an execution count that fit, I used a function that had no execution counts. My initial algorithm for fitting everything together worked great. However, it started crashing when tuning certain modules.

A little exploration revealed that I was getting into a situation in which I had a page almost filled but also had a function whose size was bigger than the page. That's right—a function size reported by the symbol engine was bigger than a memory page. When I looked more closely, I noticed that those huge functions appeared only when they were the last symbols in the code section. Evidently, the symbol engine treats everything after certain symbols as part of the symbol, so the size is wrong. In the tuning algorithm, you can see that when I get a symbol larger than the page size, the only thing I can do is punt and drop the symbol into the order file. That might not be the best solution, but it's a boundary condition that you shouldn't run into too often.

Listing 19-4: The SWS tuning algorithm

 // The TuneModule algorithm. BOOL TuneModule ( LPCTSTR szModule ) {     Build the output .SWS filename.         Copy the base .SWS file to a temporary file       Open the temporary .SWS file         for each szModule.#.SWS in the directory     {         Verify this #.SWS link time matches the            temporary .SWS file             for each address in this #.SWS         {             Add this address's execution count to                the same address in the temporary file         }     }         Get the page size for this computer          while not done     {         Look for the first entry in the temporary            .SWS file that has an address             If I looped through each address         {             done = TRUE              break          }                  If this entry has an execution count of zero         {             done = TRUE              break          }                  if this entry's size is less than the remaining           page size         {             Output the entry's function name to the PRF             Zero out the address so I don't use this one               again             Subtract this entry size from the remaining              page remaining size         }         else if this entry's size is greater than a            page size         {             Punt and just write out the address to the PRF                as there's nothing I can do             Reset the page amount remaining to the page size         }         else         {             // This entry is too big to fill into the page             // so find the best fitting function to pack             // the page.             for each item in the temporary .SWS file             {                 if the address is not zero'd out                 {                     if a best fit has not been found                     {                         Set this entry as the best fit overall                         Set this entry as the best exec count fit                     }                     else                     {                         if this entry size is > the best fit size                         {                             Set this entry as the best fit overall                         }                         if this item has an execution count                         {                             if this entry size is > the best exec count                                size                             {                                 Set this entry as the best exec count fit                             }                         }                     }                                      }             }             if no best exec count fit was found             {                 Set the best exec count fit to the best overall fit             }             Output the best exec count function name to the PRF             Reset the page amount remaining to the page size         }              }     Close any temporary files }