|
As happens with many projects, when I first considered implementing SWS, I thought it was going to be a pretty tall task, but as I got into the coding, implementing the tool turned out to be a somewhat simpler operation. My plan was to tackle the _penter function, the binary file and symbol enumeration, and the run time and the tuning algorithm, in that order.
I have to admit that before I settled on using the _penter function through the /Gh compiler switch, I looked at attempting SWS so that it would run on an unmodified binary. While it's certainly doable, it's quite complicated—I'd have to write a debugger to ensure my hooking code was injected before anything else, and I'd have to insert 5-byte jumps over every function I found in the symbol table. Having implemented similar code for commercial products, I knew how much hard work that would be. In the end, I felt it was perfectly acceptable to require a special build, especially when we have tools like SettingsMaster, which make the SWS build trivial to add to your configuration.
With the function-hooking design mercifully out of the way, I started looking at the actions that had to take place each time the _penter function was called. Listing 19-1 shows my _penter function and its supporting code. As you can see, the code is naked, and I generate my own prolog and epilog to handle function setup and teardown. I go naked because the _penter documentation requires it and it makes getting the return address of the function easier. Fortunately, when the compiler says it will generate _penter before anything else, it means it! The following disassembly shows the effects of the /Gh switch. As you can see, the call to _penter comes even before the PUSH EBP standard function prolog.
Bar: 00401050: E8B7000000 call _penter 00401055: 55 push ebp 00401056: 8BEC mov ebp,esp 00401058: E8A8FFFFFF call ILT+0(?Foo 0040105D: 3BEC cmp ebp,esp 0040105F: E8AE000000 call _chkesp 00401064: 5D pop ebp 00401065: C3 ret
If you daydream a little bit, you can see that the /Gh switch might allow some other interesting utilities. The first one that pops into my mind is a performance tool. Since Microsoft now offers the /GH switch (Enable _pexit Hook Function), using such a tool should be much easier since you'll know when the function ends. I encourage you to do some thinking about the power of _penter and _pexit.
In the discussion of the code generated by the /Gh switch, one of the function calls is CALL ILT+0(?Foo, which has caused confusion for many developers. What you're seeing in action is the magic of incremental linking. ILT stands for Incremental Link Table.
When creating a debug build, the linker wants to link the binary as fast as possible. Consequently, each function is given quite a bit of padding so that if the function changes, the linker only has to overwrite that function in place; it doesn't have to move things around in the binary. By the way, the padding around those functions comprises INT 3 instructions. However, a function can grow larger than the space remaining. In that case, the linker has to move the function in the binary.
If each function that called the now-moved function called the real function address, every time a function moved because of a new link, the linker would have to grind through all the CALL instructions and update each one. To save that processing time, the linker builds the special Incremental Link Table, which it then uses for each call. Instead of updating a bunch of locations when a function changes, the linker now has to update only a single spot in the binary. Listing 19-1 shows an ILT table.
@ILT+0(_wmain): 00401005 jmp wmain (401070h) @ILT+5(??_GCResString@@UAEPAXI@Z): 0040100A jmp CResString::`scalar deleting destructor' (401B40h) @ILT+10(?ParseCommandLine@@YAHHQAPAGAAUtag_CMDOPTS@@@Z): 0040100F jmp ParseCommandLine (401439h) @ILT+15(?ShowHelp@@YAXXZ): 00401014 jmp ShowHelp (401644h) @ILT+20(??1CResString@@UAE@XZ): 00401019 jmp CResString::~CResString (401A00h) @ILT+25(?LoadStringW@CResString@@QAEPBGI@Z): 0040101E jmp CResString::LoadStringW (401A30h) @ILT+30(??2@YAPAXI@Z): 00401023 jmp operator new (401B90h) @ILT+35(??_GCResString@@UAEPAXI@Z): 00401028 jmp CResString::`scalar deleting destructor' (401B40h) @ILT+40(??0CResString@@QAE@PAUHINSTANCE__@@@Z): 0040102D jmp CResString::CResString (401990h) @ILT+45(??BCResString@@QBEPBGXZ): 00401032 jmp CResString::operator unsigned short const * (401B20h)
Therefore, when an instruction is CALL @ILT+15(?ShowHelp@@YAXXZ), the call is the @ILT+15(?ShowHelp@@YAXXZ) label, which is a jump to the real instruction.
Listing 19-1: PENTERHOOK.CPP
/*---------------------------------------------------------------------- Debugging Applications for Microsoft .NET and Microsoft Windows Copyright 1997-2003 John Robbins -- All rights reserved. ----------------------------------------------------------------------*/ /*////////////////////////////////////////////////////////////////////// The Includes //////////////////////////////////////////////////////////////////////*/ #include "stdafx.h" #include "SWSDLL.h" #include "SymbolEngine.h" #include "VerboseMacros.h" #include "ModuleItemArray.h" /*////////////////////////////////////////////////////////////////////// Defines, Constants, & Typedefs //////////////////////////////////////////////////////////////////////*/ // The event handle the _penter function checks to see if the processing // is disabled. static HANDLE g_hStartStopEvent = NULL ; // The simple automatic class I used to ensure various things are // initialized. class CAutoMatic { public : CAutoMatic ( void ) { g_hStartStopEvent = ::CreateEvent ( NULL , TRUE , FALSE , SWS_STOPSTART_EVENT ) ; ASSERT ( NULL != g_hStartStopEvent ) ; } ~CAutoMatic ( void ) { VERIFY ( ::CloseHandle ( g_hStartStopEvent ) ) ; } } ; /*////////////////////////////////////////////////////////////////////// File Scope Globals //////////////////////////////////////////////////////////////////////*/ // The automatic class. static CAutoMatic g_cAuto ; // The module array. static CModuleItemArray g_cModArray ; /*////////////////////////////////////////////////////////////////////// Function Prototypes //////////////////////////////////////////////////////////////////////*/ /*////////////////////////////////////////////////////////////////////// //////////////////////////////////////////////////////////////////////*/ extern "C" void SWSDLL_DLLINTERFACE __declspec(naked) _penter ( void ) { DWORD_PTR dwCallerFunc ; // The function prolog. __asm { PUSH EBP // Set up the standard stack frame. MOV EBP , ESP PUSH EAX // Save off EAX as I need to use it // before saving all registers. MOV EAX , ESP // Get the current stack value into // EAX. SUB ESP , __LOCAL_SIZE // Save off the space needed by the // local variables. PUSHAD // Save off all general register // values. // Now I can calculate the return address. ADD EAX , 04h + 04h // Account for the PUSH EBP and the // PUSH EAX. MOV EAX , [EAX] // Grab the return address. SUB EAX , 5 // Take off the 5 byte jump used to // call _penter to get to the start // of the function. MOV [dwCallerFunc] , EAX // Save the caller function. } // If the start/stop event is signaled, don't do any processing. if ( WAIT_TIMEOUT == WaitForSingleObject ( g_hStartStopEvent , 0 )) { // Do the work. g_cModArray.IncrementFunctionEntry ( dwCallerFunc ) ; } // The function epilog. __asm { POPAD // Restore all general purpose // values. ADD ESP , __LOCAL_SIZE // Remove space needed for locals. POP EAX // Restore EAX MOV ESP , EBP // Restore the standard stack frame. POP EBP RET // Return to caller. } }
As you can see from Listing 19-1, the _penter function isn't very exciting. Things get more interesting when you start to think about how to organize the address data. Since I need to associate an address with a function name, I use my good friend, the DBGHELP.DLL symbol engine, somewhere in the program. However, looking up symbols in the symbol engine is not the fastest of operations, so I had to find a way to keep the data access small and fast because it will be touched by each function call.
When I thought about it, what I wanted for the data arrangement was just a sorted array of all the function addresses with their associated execution counts. That way, when I had the return address in the _penter function, I could just do a quick binary search for the address. This solution seemed relatively straightforward because all I had to do was enumerate the modules' symbols and sort the array of functions. I had all the necessary data to do this.
I thought that SWS, like the WST, should keep each module's execution counts for a run in a separate data file. I favor that approach because if you don't want a particular application run, you can delete it from the overall merged data set. The WST uses the <module name>.<run number> naming format, but I wanted SWS to use a <module name>.<run number>.SWS scheme so that I could eventually write a graphical program to make combining all the runs easier.
After I decided how I wanted to approach the run-time data processing, I considered how to generate the actual order file. As I already mentioned, I needed to concatenate the individual runs. However, when I thought about producing the actual order file, I realized that I hadn't brought forward all the necessary data. The order file needs the names of the functions as well as their sizes, but all I had planned to store were the function addresses. Although I could have used the symbol engine again when generating the actual order file, the only way to get a symbol's size is to enumerate all the module's symbols. Since I was already handling full symbol enumeration in the initial data generation phases, I thought I should just go ahead and add the function sizes to the file. I didn't need to store the function names because I could always look them up by loading the PDB file for the binary.
If you did find Russ Blake's Optimizing Windows NT and you read the chapter "Tuning the Working Set of Your Application," you might wonder why I didn't mention anything about bit sets and time intervals. The Windows NT performance team implemented WST using a scheme in which a bit in a bit set represents each function. Every so many milliseconds, WST writes out the bit set to record which functions executed in that time interval. I've often wondered why they implemented WST that way. At first glance, it seems like you do save some memory with the bit set, but you have to remember that there must be some way of mapping a function address to a bit. I don't think the bit set scheme saves much more memory than my scheme. I believe the Windows NT performance team used the bit set scheme because they were tuning the operating system with WST. In contrast, I'm looking at individual binaries, so it's an issue of scale.
When designing my data structures, I was concerned about one point: I wanted to simply increment a value whenever a function executed. In multithreaded programs, I need to protect that value so that only one thread at a time is manipulating it. My goal was to make the SWS run time as fast as possible, so the best choice was to increment the function count with the InterlockedIncrement API function. Since that function uses the hardware locking mechanism (the LOCK prefix on an instruction), I can guarantee data coherency in multithreaded applications. However, in Microsoft Win32, the largest data size that you can pass to InterlockedIncrement is a 32-bit value. Therefore, it's possible to wrap after 4,294,967,295 function calls. Four billion is a large number, but some message loops could execute that much if you run your application long enough.
The WST program avoided this problem at run time by recording only when the function executed and then calculating the total executions after the fact, when it's easier to deal with the possible wrapping. When tuning the operating system, the odds are quite good that you could execute some functions more than four billion times. However, with SWS, I'm willing to take the chance during my initial design and perform some testing to determine whether function count wrap is a problem. Function count wrap might be less likely with my design because I left the hooks in so that you can start and stop the data collection at will. Thus, each time you run the application, you can generate data only in the cases in which you need it.
In the extremely rare case in which you do execute a function more than 4 billion times, you have two options. The first is to implement the data counting variable as a 64-bit integer and have a per-module synchronization object to protect all the bits. The other option is drastic in that I will need to implement the SWS run time to work like WST. Of course, the wrapping problem happens only on Win32, so another option would be to implement SWS for Microsoft Win64 only, because there's no way you'll ever hit 18,446,744,073,709,551,615 function executions in your lifetime.
You can see how simple the file format actually is by looking at Listing 19-2. I quickly realized that I needed a common base class to handle all the file manipulation, which is CSWSFile defined in SWSFILE.H and SWSFILE.CPP. The class is really nothing more than a wrapper over your normal Win32 file manipulation.
Listing 19-2: FILEFORMAT.H
/*---------------------------------------------------------------------- Debugging Applications for Microsoft .NET and Microsoft Windows Copyright 1997-2003 John Robbins -- All rights reserved. ----------------------------------------------------------------------*/ #ifndef _FILEFORMAT_H #define _FILEFORMAT_H /*////////////////////////////////////////////////////////////////////// Defines and Structures //////////////////////////////////////////////////////////////////////*/ // The SWS file signature (SWS2). #define SIG_SWSFILE '2SWS' #define EXT_SWSFILE _T ( ".SWS" ) /*////////////////////////////////////////////////////////////////////// The SWS file header. //////////////////////////////////////////////////////////////////////*/ typedef struct tag_SWSFILEHEADER { // The file signature. See the SIG_* defines above. DWORD dwSignature ; // The link time for the binary associated with this file. DWORD dwLinkTime ; // The load address for the binary. DWORD64 dwLoadAddr ; // The size of the image. DWORD dwImageSize ; // The number of entries in this file. DWORD dwEntryCount ; // The flags field. DWORD dwFlags ; // The modulename for this file. TCHAR szModuleName[ MAX_PATH ] ; DWORD dwPadding ; } SWSFILEHEADER , * LPSWSFILEHEADER ; /*////////////////////////////////////////////////////////////////////// The entry type for SWS files. //////////////////////////////////////////////////////////////////////*/ typedef struct tag_SWSENTRY { // The function address. DWORD64 dwFnAddr ; // The function size. DWORD dwSize ; // The execution count. DWORD dwExecCount ; } SWSENTRY , * LPSWSENTRY ; #endif // _FILEFORMAT_H
One key issue I want to point out in the file format is that I store the binary files' load address in the SWS file. Originally I stored only the addresses for the functions, but I remembered the case in which the binary might be relocated in memory. In that situation, the SWSDLL.DLL run time would be called with one address, but I wouldn't have any record of that address in any of the module's loaded SWS files. Even though we should always rebase our DLLs, sometimes we forget, and I wanted to make sure SWS didn't crater when that happened.
One area that did give me a little trouble was generating the symbols for the initial SWS module. Because of the way programs are linked and symbols are generated, many of the symbols reported in a module are not those that have _penter calls inserted in them. For example, if you link against the static C run time, your module will have all sorts of C run-time functions added. Since the address lookup would be faster in the SWS run time if there were fewer symbols, I looked at a few ways to minimize the numbers.
Listing 19-3 shows the symbol enumeration callback and how I started limiting the number of symbols. The first step I took was to check whether the symbol had corresponding line information with it. Because I assumed that functions that have _penter calls were properly compiled using the steps I specified earlier, I safely got rid of many extraneous symbols. My next step in eliminating symbols was to check whether specific strings are part of the symbols. For example, any symbols that start with _imp__ are imported functions from other DLLs. There are two other checks that I didn't implement but instead left as exercises for you dear readers. The first is that you should be able to flag symbols from specific files, which SWS should ignore. The main reason for implementing this feature is so that you can add all the C run-time source files to that list. The last symbol elimination trick ensures that the address in question comes only from a code section in the module. You might not need this last check, but it would ensure that only true code symbols are used.
Listing 19-3: SWS symbol enumeration
/*---------------------------------------------------------------------- FUNCTION : SymEnumSyms DISCUSSION : The symbol enumeration callback. The only thing done here is that the data is added to the SWS files. PARAMETERS : szSymbolName - The symbol name. ulSymbolAddress - The symbol address. ulSymbolSize - The size, in bytes, of the symbol. pUserContext - The SWS file. RETURNS : TRUE - Everything is cool. FALSE - There was a problem adding data to the file. ----------------------------------------------------------------------*/ BOOL CALLBACK SymEnumSyms ( PSTR szSymbolName , DWORD64 ulSymbolAddress , ULONG ulSymbolSize , PVOID pUserContext ) { LPENUMSYMCTX pCTX = (LPENUMSYMCTX)pUserContext ; CImageHlp_Line cLine ; DWORD dwDisp ; if ( FALSE == g_cSym.SymGetLineFromAddr ( ulSymbolAddress , &dwDisp , &cLine ) ) { // If no source and line was found for this symbol, ignore // it. return ( TRUE ) ; } // Future enhancements to ignoring particular symbols: // 1. Add an ignore file list to search here. // 2. Check if this address is in the modules code section. This // will avoid putting IAT symbols into the output files. // Is this symbol in the ignore containing list? for ( int i = 0 ; i < IGNORE_CONTAINING_COUNT ; i++ ) { if ( NULL != strstr ( szSymbolName , g_szIgnoreContaining[ i ] ) ) { // Drop out now. return ( TRUE ) ; } } if ( NULL != pCTX->pfnVerboseOutput ) { #ifdef _WIN64 pCTX->pfnVerboseOutput(_T(" Adding Symbol : 0x%016I64X %S\n"), #else pCTX->pfnVerboseOutput(_T(" Adding Symbol : 0x%08X %S\n" ) , #endif (DWORD_PTR)ulSymbolAddress , szSymbolName ); } if ( FALSE == pCTX->pSWSFile->AddData ( ulSymbolAddress , ulSymbolSize , 0 ) ) { ASSERT ( !"Adding to SWS file failed!" ) ; return ( FALSE ) ; } pCTX->iAddedCount++ ; return ( TRUE ) ; }
One symbol problem I had at run time happened because the symbol engine doesn't return static functions. Being Mr. Contentious, when I didn't find an address that came from a module, I popped up my usual six or seven assertion message boxes. At first I was a little confused that I was seeing the assertions, because one of my test programs didn't have anything declared as static. When I looked at the stack in the debugger, I found I was looking at a symbol named something like $E127. There was a call to _penter in the function and everything looked good. It finally dawned on me that I was looking at a compiler-generated function, such as a copy constructor. Although I really would have liked to keep the error checking in the code, I noticed that there were quite a few of those compiler-generated functions in some programs, so all I could do was report the problem with a TRACE statement in debug builds.
The last interesting aspect of SWS is the tuning of a module. The code for the TuneModule function is large, so Listing 19-4 shows only the algorithm. As you can see, I work to ensure that I pack each code page with as many functions as possible to eliminate padding. The interesting part is where I hunt down the best-fitting function. I decided to try to pack as many functions with execution counts as possible into the pages. If I couldn't find a function with an execution count that fit, I used a function that had no execution counts. My initial algorithm for fitting everything together worked great. However, it started crashing when tuning certain modules.
A little exploration revealed that I was getting into a situation in which I had a page almost filled but also had a function whose size was bigger than the page. That's right—a function size reported by the symbol engine was bigger than a memory page. When I looked more closely, I noticed that those huge functions appeared only when they were the last symbols in the code section. Evidently, the symbol engine treats everything after certain symbols as part of the symbol, so the size is wrong. In the tuning algorithm, you can see that when I get a symbol larger than the page size, the only thing I can do is punt and drop the symbol into the order file. That might not be the best solution, but it's a boundary condition that you shouldn't run into too often.
Listing 19-4: The SWS tuning algorithm
// The TuneModule algorithm. BOOL TuneModule ( LPCTSTR szModule ) { Build the output .SWS filename. Copy the base .SWS file to a temporary file Open the temporary .SWS file for each szModule.#.SWS in the directory { Verify this #.SWS link time matches the temporary .SWS file for each address in this #.SWS { Add this address's execution count to the same address in the temporary file } } Get the page size for this computer while not done { Look for the first entry in the temporary .SWS file that has an address If I looped through each address { done = TRUE break } If this entry has an execution count of zero { done = TRUE break } if this entry's size is less than the remaining page size { Output the entry's function name to the PRF Zero out the address so I don't use this one again Subtract this entry size from the remaining page remaining size } else if this entry's size is greater than a page size { Punt and just write out the address to the PRF as there's nothing I can do Reset the page amount remaining to the page size } else { // This entry is too big to fill into the page // so find the best fitting function to pack // the page. for each item in the temporary .SWS file { if the address is not zero'd out { if a best fit has not been found { Set this entry as the best fit overall Set this entry as the best exec count fit } else { if this entry size is > the best fit size { Set this entry as the best fit overall } if this item has an execution count { if this entry size is > the best exec count size { Set this entry as the best exec count fit } } } } } if no best exec count fit was found { Set the best exec count fit to the best overall fit } Output the best exec count function name to the PRF Reset the page amount remaining to the page size } } Close any temporary files }
|