Profiling | Advanced .NET Programming

Although using a profiler can be just as important as using performance counters in order to identify ways that you can improve performance of an application, I'm not going to spend nearly as much time discussing profiling. This is partly because of the lack of Microsoft-supplied profilers - at the time of writing there are none for managed code, other than a couple of samples in the Framework SDK - and partly because there are relatively few new underlying principles to understand concerning the use of profilers. However, I will use one third-party profiler, the Compuware DevPartner profiler, to demonstrate the general principles of profiling, and will briefly discuss the profiling API. I'll also examine the allocation profiler, a profiler supplied by Microsoft that focuses particularly on memory usage.

Choosing a Profiler

Since this is a technology book rather than a sales book, it's not really the purpose of the book to start recommending particular non-Microsoft products, nor indeed any particular products! In general, if you believe that your software will benefit from a detailed analysis of the time spent in various methods, you will be better off searching the Web for suitable profiling products. Companies that I am aware of which do currently offer profilers for managed code include Compuware, AutomatedQA, and Rational. However, there will certainly be others - you will need to evaluate which is the most suitable product based on your needs and budget. In some cases, companies sell the full profiler, but offer a more basic version free of charge.

It is worth mentioning that if you don't wish to purchase a commercial profiler, there are a couple of alternative techniques:

Use the MS sample profilers. The Framework SDK contains a couple of sample profilers, one designed for general profiling, the other specifically designed to look for methods that consume too much time. The source code for these profilers is available in the Tool Developers Guide/Samples subfolder in the .NET Framework SDK. These profilers are both quite basic ones, and you will need to compile them, but they do have the advantage that you have the source code available for them.
Add your own profiling code. Due to the complexity of the profiling API, using the profiling API is not normally a realistic solution. However, there is an alternative technique that will allow you to obtain quite good quality information about the execution times of sections of your code: just take the difference between the times before and after the relevant code is completed. With a little bit of work, you can achieve some quite sophisticated and reasonably automated generation of profiling information based on this technique. If you do use this technique, however, you should be aware of inaccuracies due to any time during which threads are sleeping. You will be measuring wall-clock time, not CPU time. We'll examine how to implement your own code to time other code next.

Writing Your Own Profiling Timer Code

The basic principle of writing a profiling timer is relatively simple: you check what the time is before some operation is carried out, perform the operation, then check the time again and take the difference. That tells you how long the operation took. In this section I'll show you how to write such an application. The disadvantage of this approach compared to using a commercial profiler is that you have to explicitly add calls to your timer code at the points in your profiled code where you want to measure times, whereas a commercial profiler will normally retrieve this information for all methods automatically. On the other hand, you get complete freedom to choose exactly which code you wish to profile. For example, it doesn't have to be a complete method. We'll use the sample we develop here to measure how long it takes to execute a particular for loop.

In order to use this technique, we need a high-resolution timer which can accurately measure very small intervals of time. Unfortunately, the .NET Framework class library doesn't help us here, so we are going to have to turn to unmanaged code, and a native API method, QueryPerformanceCounter().

In this section I use the term timer to indicate some code that measures an interval of time for profiling purposes. Don't confuse that with the more usual meaning of timer: a component which raises events at regular intervals of time.

There are two properties of managed classes that look at first sight like possible candidates for a timer, but I wouldn't recommend you use either of them to measure small times (though they will both be fine if you are measuring larger times, say, greater than about a tenth of a second):

System.Environment.TickCount measures the time in milliseconds since the computer was started. Unfortunately, one millisecond isn't really short enough for many operations. For example, you might want to profile a method that takes a tenth of a millisecond to execute, but which is invoked so many times that it might be significantly affecting performance.
System.DateTime.Now.Ticks is documented as returning the number of 100-nanosecond intervals that have elapsed since 12:00AM, January 1st, year 1 AD. This sounds more hopeful, since 100 nanoseconds is just one tenth of a millionth of a second. Unfortunately the accuracy is still not good, since this property appears to increment in large blocks. System.DateTime is just not intended to measure such small intervals, and my own tests suggest that System.DateTime is unable in practice to distinguish time intervals of less than about 1/100 of a second.

Details of the implementations of these properties is not documented - they both defer internally to IL internalcall methods, indicating that their implementations is supplied internally by the CLR.

So having rejected DateTime.Ticks and Environment.TickCount, let's check out our preferred solution. QueryPerformanceCounter() is an API function which is documented as measuring the tick count of a certain performance counter known as the high-resolution performance counter. This counter simply increments a tick count at a given frequency. The frequency itself is undocumented since it is dependent on your hardware, so you need to call another method, QueryPerformanceFrequency() to obtain that value. To give you some idea of the resolution you can expect, on my 1GHz machine, calling QueryPerformanceFrequency() indicates a resolution of one three-millionth of a second - that should be enough for any realistic profiling scenario.

I'm going to illustrate the use of these methods by developing a small example, called IntervalTimer. The example contains a class, also called IntervalTimer, which acts like a stopwatch. You call its Start() method to start measuring an interval of time, then you call its Stop() method to stop the measurement, after which you can examine how long the interval takes. For this, I've overridden ToString() to display the timing information. Because IntervalTimer uses the high-resolution performance counter, it is extremely accurate even for very small intervals.

I'll start off with the test harness that illustrates how to use the IntervalTimer class. The following code measures how long it takes to execute a simple for loop.

 static void Main() {    int quantity = 0;    IntervalTimer timer = new IntervalTimer();    for (int numTests = 0 ; numTests < 3 ; numTests++)    {       timer.Start();       timer.Stop();       Console.WriteLine("Just starting and stopping timer: ");       Console.WriteLine(" " + timer.ToString());       timer.Start();       for (int i=0 ; i<1000 ; i++)          quantity += i;       timer.Stop();       Console.WriteLine("counting to 1000: ");       Console.WriteLine("     " + timer.ToString());    }    Console.WriteLine("\nquantity is " + quantity); }

The outer for loop in this sample means that we will run the test three times. Within each test, we first measure how much CPU time is used just to start and stop the timer and display the results. Then we perform the real test - we use the IntervalTimer to measure how long it takes to execute a for loop that performs 1000 additions. At each iteration of the loop, we add the loop index to a variable called quantity. We also display the value of quantity at the end of all the tests (just to make sure that the variable is actually used and therefore can't be optimized away by an intelligent JIT compiler).

Running a release build of this code gives these results on my machine:

 Just starting and stopping timer:    Interval: 0.000006 seconds (23 ticks) counting to 1000:    Interval: 0.000009 seconds (34 ticks) Just starting and stopping timer:    Interval: 0.000006 seconds (20 ticks) counting to 1000:    Interval: 0.000009 seconds (32 ticks) Just starting and stopping timer:    Interval: 0.000006 seconds (21 ticks) counting to 1000:    Interval: 0.000009 seconds (32 ticks) quantity is 1498500

I've implemented IntervalTimer.ToString() so that it displays the total time in seconds as well as the number of tick counts this represents. For small times, the number of tick counts may be important because if - say - a time was measured to be 20 tick counts then we know that it is only accurate to within 5% (because you can't measure times of less than one tick). The results above show that the numbers of tick counts is low so our times won't be especially precise. However, just look at how tiny the times we've managed to measure are: starting and stopping the timer takes about 6 millionths of a second (6 microseconds): starting and stopping the timer and running the loop takes 9 microseconds. From this we can deduce that it took approximately 3 microseconds to execute the loop. A more sophisticated timer would probably do this calculation for us and just show us the 3 microseconds, but I don't want the example to get too complicated. You might want to compare this sample with the similar PerfTest sample I presented in Chapter 6. In that chapter, I used System.DateTime to measure an interval - but that was possible only because the interval was very large - about 1 second.

Now let's look at the code for the IntervalTimer class. First we need to define the P/Invoke wrappers for the unmanaged API functions we will be using:

 public class IntervalTimer {    [DllImport("kernel32.dll")]       static extern private int QueryPerformanceCounter(out long count);     [DllImport("kerne132.dll")]       static extern private int QueryPerformanceFrequency(out long count);

Next, an enum that indicates whether the timer is started or stopped, and some member fields.

 public enum TimerState {NotStarted, Stopped, Started} private TimerState state; private long ticksAtStart;    // tick count when Start() called private long intervalTicks;   // no. of ticks elapsed until Stop() called private static long frequency; private static int decimalPlaces; private static string formatString; private static bool initialized = false;

The meanings of the instance fields should be obvious. The static fields are to do with storing information about the timer frequency - recall that the frequency has to be determined by calling QueryPerformanceFrequency(). frequency is the frequency (number of ticks per second). decimalPlaces is used when displaying timing information - clearly, the frequency will determine how many decimal places we can display for the time. Suppose, for example, the timer was a really slow one that only fired every tenth of a second. Then it would be silly to claim a time of 5.6343442 seconds - the best we could say is that the time is about 5.6 seconds - or use 1 decimal place. That's what decimalPlaces stores. formatString contains the formatting string used in IntervalTimer.ToString(). Since the format string will depend on how many decimal places can be shown, we can't hardcode it into the program.

All this information is gathered together the first time an IntervalTimer object is instantiated:

 public IntervalTimer() {    if (!initialized)    {       QueryPerformanceFrequency(out frequency);       decimalPlaces = (int)Math.Log10(frequency);       formatString = String.Format("Interval: {{0:F{0}}} seconds ({{1}} ticks)",                                                                  decimalPlaces);       initialized = true;    }    state = TimerState.NotStarted; }

You might wonder why I haven't used a static constructor to perform this work. The reason is that using a static constructor appears to distort the timing information the first time that IntervalTimer is used. The reason for this is not clear. It is, however, difficult to predict the precise timing and order of execution of statements when a static constructor is involved - there may be some multithreading at work with static constructors, and this is likely to have something to do with the problem. To be safe, I've performed the static initialization in an instance constructor.

The implementations of the Start() and Stop() methods are relatively simple:

 public void Start() {    state = TimerState.Started;    ticksAtStart = CurrentTicks; } public void Stop() {    intervalTicks = CurrentTicks - ticksAtStart;    state = TimerState.Stopped; }

The following is the method which works out how many seconds a given tick count between calling Start() and calling Stop() corresponds to.

 public float GetSeconds() {    if (state != TimerState.Stopped)       throw new TimerNotStoppedException();    return (float)intervalTicks/(float)frequency; }

Note that calling GetSeconds() only makes sense if the timer has been started and stopped - this is checked and if there's a problem the method throws an exception, TimerNotStoppedException, which we'll define soon.

This is how the timing information is displayed:

 public override string ToString() {    if (state != TimerState.Stopped)       return "Interval timer, state: " + state.ToString();    return String.Format(formatString, GetSeconds(), intervalTicks); }

And finally, here's the custom exception:

 public class TimerNotStoppedException : ApplicationException {    public TimerNotStoppedException()       : base("Timer is either still running or has not been started")    {    } }

Now we've seen how to implement our own profiling timer, we'll have a look at one commercial profiler, to get a feel for what a full profiler application can achieve.

Demonstration: The Compuware Profiler

In order to demonstrate profiling in action, I'm going to use the Compuware DevPartner profiler, which is typical of the features that you can expect in a typical profiler. The profiler is available from http://www.compuware.com, and once installed exists as an add-in to VS.NET. In order to profile a program, you click on the Tools menu, and select the DevPartner Profiler option to enable the profiler, before running the program normally in VS.NET. After running the program, a new window will open in VS.NET containing profiling information. You can also stop the application at any time using menu options, in order to review the profiling information collected so far.

For this demonstration we will run the UseResources sample in VS.NET (as a release build, of course). Running on my computer, and clicking on the 50MB button, then freeing the memory and closing the form produced these results:

click to expand

As you can see, the profiling information is supplied via an extra window in VS.NET. The window is divided into two panes, with a treeview to select what data you want to display in the listview. The listview is a property sheet with two tab pages.

The Method List tab details the percentage of execution time spent in each method, while the Session Summary tab gives overall statistics as well as indicating the time spent executing code in any unmanaged libraries that have been invoked.

The results here for the Method List are not surprising - the time spent allocating memory (in particular the MegaByteClass constructor) was a significant drain on computer time. Besides the % Time in Method, there is also a % Time with Children column. The former column indicates only the time spent executing code in that method, while the latter indicates the time spent while that method was in the stack frame. For example, if the computer is executing method A and method A calls method B, which in turn calls method C, then the time spent in B and C will show up in A's % with Children figure, but not in A's % in Method figure. This means that the percentages in the first column will add up to 100% (minus any time spent executing unmanaged code), but the percentages in the second column won't. When you are using a profiler to look at performance, you'll need to take care to distinguish between the figures. The third column is also important - this gives the number of times that the method was called. Pay attention to this column: if you see that a particular method is taking a large proportion of CPU time, this may be because that method takes a long time to execute, or it may be because it is being invoked a large number of times (which may imply that it is one of the other methods further up the call stack that needs optimizing). The fourth column measures the average amount of time spent executing the code in a method (excluding time spent in child methods).

In the case of the UseResources application, there's not much we can do to speed it up, other than not allocating so much memory, which would defeat the purpose of this particular example.

The profiling information for a program reveals some quite interesting details about the internal workings of .NET programs, besides telling us which methods need improvement if the application is to be speeded up. Firstly, notice just how many methods for Windows Forms applications are actually are buried inside .NET (look at the size of that scroll bar in the screenshot). In fact, the Session Summary tab indicates that 2125 distinct methods were called when running this sample. If we click on the Called header to sort the items in the listview by the number of times each method was called, the most called method is revealed to be PeekMessageW, a method concerned with the internal operation of the Windows message loop, but despite being invoked over 44,000 times this method consumed very little CPU time. In Chapter 11 we'll see how the underlying message loop in a Windows Forms application works:

click to expand

The information presented here is clouded by the amount of information relating to method calls internal to the CLR. That's very nice for investigating the CLR under the hood, but not so good for trying to find out which of our own methods is taking the CPU time. With the Compuware profiler, this is easy to solve by expanding the treeview. This will reveal a series of nodes that show information based on individual assemblies:

click to expand

This view makes it very clear which methods in our own code are using the bulk of the CPU time.

Although these screenshots are for the Compuware profiler, you will find that most commercial profilers give you very similar features, albeit with slightly different user interfaces.

The Allocation Profiler

The allocation profiler has been supplied by Microsoft. It works on similar principles to other commercial profilers to the extent that it uses the profiling API to monitor what a program is doing. However, unlike most profilers, the allocation profiler is intended not to profile performance but to profile the allocation of variables in memory and the operation of the garbage collector. It can display full information about what objects have been allocated or freed at what times, and even what addresses in memory have been used. The cost of this is that running a program from the allocation profiler will slow it down considerably, so there's no point trying to obtain timing information from it. At the present time the allocation profiler is unsupported and is available in both compiled and source code versions on the www.gotdotnet.com website (go to the user samples area and search for Allocation Profiler). There are, however, some hints that Microsoft is intending to incorporate it in the Framework SDK at some point in the future.

When you run the profiler, you will be presented with a dialog box indicating the profiling options:

If you click on the Start Application button, you'll be presented with a standard open file dialog allowing you to browse and select an assembly to run. At any time while the application is running, you can click on the Show Heap now button to get a view of the managed heap. The following screenshot shows the situation after I've allocated 11MB with the UseResources example:

click to expand

The screenshot shows that the profiler is able to break up the classes into component fields, and has identified, for example, that the memory occupied at the most nested level consists of the int[] arrays contained in the MegaByteClass class. The allocation profiler offers various other views that show, for example, the addresses and ages of objects on the managed heap.

The Profiling API

For performance counters, I went into some detail concerning how to access the counters programmatically from your code. In the case of performance counters, that's worth doing because programmatic access is so easy using the performance counter classes. Getting a program to use these classes to monitor its own resource usage is a relatively simple operation. For the profiling API that's not the case. The profiling API is entirely an unmanaged, COM-based API. It is highly complex, and it would certainly not normally be worthwhile using it in order to get code to monitor itself in the way you would with performance counters. Not only that, but because of the way the profiling API hooks into the CLR, it's quite easy to crash the entire CLR if there are bugs in your profiling code. This is because for performance reasons the CLR's profiling support does the absolute minimum of error checking. For example, if the CLR is notifying another application whenever the JIT-compiled code invokes a new method, then it is imperative that the notification process uses only a few machine cycles, otherwise the profiling process will destroy the timings that it is supposedly measuring. In general, the only reason for using the profiling API is if you are writing a fully-fledged profiler application.

If for any reason you do wish to use the profiling API, however, you can find the documentation and samples for it in the Tool Developers Guide in the .NET Framework SDK. You'll need to have a good understanding of COM, and you'll need to do the coding in unmanaged C++. Essentially the procedure involves implementing a COM component that implements the interface ICorProfilerCallback (this interface is defined in the header files corprof.h and corprof.idl, also supplied with the .NET Framework SDK). You then need to temporarily set up a couple of environment variables that indicate that your component is a profiler. As long as these environment variables are set, any managed application that is launched will automatically be hooked up to the indicated profiler.