The Life and Times of a .NET Web Application

Now that we've introduced JIT compilation, we will explore some of the other ways CLR influences the performance of an application over the course of its execution. Bear in mind that as far as CLR is concerned, it does not matter which high-level programming languages the application components were written in. By the time CLR encounters them, they are either managed assemblies compiled to MSIL, or they are unmanaged code to be run outside of CLR.

Load Time AppDomains

When CLR loads a new application, it is placed in a special memory area set aside for it, called an AppDomain. Because CLR provides memory type safety, it is possible for multiple applications to safely cohabit within the same AppDomain. Applications in the same AppDomain function as a group in the sense that they can share data quickly and efficiently, and if the AppDomain is unloaded, all applications and assemblies loaded into that domain are unloaded together.

Run Time Interoperability

As a .NET application runs, it may make calls into unmanaged code, such as COM components or standard Windows DLLs. Whenever execution of a thread passes between managed code and unmanaged code, a transition is said to occur. These transitions carry certain costs.

One cost of making a transition is that the arguments and return values being passed between the caller and callee must be marshaled. Marshaling is the process of arranging the objects in memory according to the expectations of the code that will process them. Naturally, data types such as strings and complex structures are more expensive to marshal than simple types like integers.

NOTE
In the case of strings, it is often necessary to convert them between formats such as ANSI and Unicode. This is an example of an expensive marshaling operation.
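To make the cost concrete, here is a minimal C# sketch of a transition into unmanaged code via P/Invoke. The Win32 GetSystemDirectory API is just a convenient example: the string buffer argument must be marshaled (and converted between ANSI and Unicode as needed) on every call, while the integer argument costs almost nothing.

```csharp
using System;
using System.Text;
using System.Runtime.InteropServices;

class InteropExample
{
    // Declaration for an unmanaged Win32 function in kernel32.dll.
    // CharSet.Auto lets the runtime pick the ANSI or Unicode entry point,
    // converting the StringBuilder buffer as required -- this conversion is
    // part of the marshaling cost of the transition.
    [DllImport("kernel32.dll", CharSet = CharSet.Auto)]
    static extern uint GetSystemDirectory(StringBuilder lpBuffer, uint uSize);

    static void Main()
    {
        StringBuilder buffer = new StringBuilder(260);

        // Each call here is one managed-to-unmanaged transition.
        uint length = GetSystemDirectory(buffer, (uint)buffer.Capacity);
        Console.WriteLine(buffer.ToString(0, (int)length));
    }
}
```

A loop that made this call once per item, rather than batching work on one side of the boundary, is exactly the pattern the # of marshalling counter helps you find.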

Another cost of transitioning concerns CLR's memory manager, known as the garbage collector. (The garbage collector will be discussed in more detail later in the chapter.) Whenever a transition into unmanaged code occurs, CLR must identify all the objects referenced by the call to unmanaged code, to ensure the garbage collector does not move them and thereby disrupt the unmanaged thread. Objects that have been identified as possibly in use by unmanaged code are said to be pinned.

NOTE
Obviously, the most desirable behavior for an application is to minimize the number of transitions needed to do a given amount of work. When testing, use the # of marshalling counter in the .NET CLR Interop performance object to locate areas where application threads are repeatedly transitioning between modes and doing only a small amount of work before transitioning back.

Run Time Garbage Collection

One of CLR's most prominent features is automatic memory management, better known as garbage collection. Rather than requiring developers to implement their own memory management, CLR automatically allocates memory for objects when they are created, and periodically checks to see which objects the application is done using. Those objects that are no longer in use are marked as garbage and collected, meaning that the memory they occupy is made available for use by new objects.

Generations and Promotion

Naturally, garbage collection needs to be fast, since time spent managing memory comes at the expense of time spent letting the application do its job.

One assumption about memory management that has withstood considerable scrutiny can be summarized by simply saying that the vast majority of objects are usually needed for only a short amount of time. Microsoft's garbage collector (GC) makes the most of this by sorting objects into three categories, or generations, numbered 0, 1, and 2. Each generation has a heap size, which refers to the total number of bytes that can be occupied by all objects in that generation. These heap sizes change over the course of an application's execution, but their initial sizes are usually around 256 KB for generation 0, 2 MB for generation 1, and 10 MB for generation 2.

Objects in generation 0 are the youngest. Any time an application creates a new object, the object is placed in generation 0. If there is not enough room on the generation 0 heap to accommodate the new object, then a generation 0 garbage collection occurs. During a collection, every object in the generation is examined to see if it is still in use. Those still in use are said to survive the collection, and are promoted to generation 1. Those no longer in use are de-allocated. You will notice that the generation 0 heap is always empty immediately after it is collected, so there is always room to allocate a new object (unless, that is, the system is out of memory, as we discuss below).

NOTE
You may wonder what happens if a new object is so large that its size alone exceeds the space available on the generation 0 heap. Objects larger than approximately 85 KB (85,000 bytes) are allocated on a special heap all their own, known as the large object heap. You'll find performance counters to track the large object heap size in the .NET CLR Memory performance object.

In the course of promoting objects from generation 0 to generation 1, the GC must check to see if there is room to store the promoted objects in generation 1. If there is enough room on the generation 1 heap to accommodate the objects promoted from generation 0, the GC terminates, having collected only generation 0. If, on the other hand, the capacity of the generation 1 heap would be exceeded by promoting objects into it from generation 0, then generation 1 is collected as well. Just as before, objects that are no longer in use are de-allocated, while all surviving objects are promoted, this time to generation 2. You'll notice that after generation 1 is collected, its heap is occupied only by those objects newly promoted from generation 0.

Just as generation 1 must sometimes be collected to make room for new objects, so must generation 2. Just as before, unused objects in generation 2 are de-allocated, but the survivors remain in generation 2. Immediately after a collection of generation 2, its heap is occupied by surviving as well as newly promoted objects.

Immediately following a collection, a heap's surviving contents are rearranged so as to be adjacent to each other in memory, and the heap is said to be compacted.

Notice that any time generation 1 is collected, so is generation 0, and whenever generation 2 is collected, the GC is said to be making a full pass because all three generations are collected.
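The promotion path described above can be observed directly in code with the GC class. Here is a small C# sketch; the generation numbers in the comments are the typical case for the workstation GC, and can vary with runtime version and GC mode.

```csharp
using System;

class GenerationDemo
{
    static void Main()
    {
        object obj = new object();

        // New objects are allocated in generation 0.
        Console.WriteLine(GC.GetGeneration(obj));   // typically 0

        // Force a collection. obj is still referenced, so it survives
        // and is promoted to the next generation.
        GC.Collect();
        Console.WriteLine(GC.GetGeneration(obj));   // typically 1

        // Survive another collection: promoted again, into generation 2,
        // where survivors remain for the rest of their lifetime.
        GC.Collect();
        Console.WriteLine(GC.GetGeneration(obj));   // typically 2
    }
}
```

Calling GC.Collect() by hand like this is only for demonstration; in production the self-tuning collector decides when to run.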

As long as only a few objects need to be promoted during a collection, the garbage collector is operating efficiently, making the most memory available with the least amount of work. To optimize the likelihood that the garbage collector will operate efficiently, it is also self-tuning, adjusting its heap sizes over time according to the rate at which objects are promoted. If too many objects are being promoted from one heap to another, the GC increases the size of the younger heap to reduce the frequency at which it will need to collect that heap. If, on the other hand, objects are almost never promoted out of a heap, this is a sign that the GC can reduce the size of the heap and improve performance by reducing the application's working set. The exception here is generation 2: since objects are never promoted out of generation 2, the GC's only choice is to increase the size of the generation 2 heap when it starts getting full. If your application's generation 2 heap grows too steadily for too long, this is probably a sign that the application should be reviewed for opportunities to reduce the lifetime of objects. When generation 2 can no longer accommodate promoted objects, the garbage collector cannot allocate space for new objects, and attempts to create new objects will cause a System.OutOfMemoryException.

The GC also attempts to keep the size of the generation 0 heap within the size of the system s L2 cache. This keeps memory I/O costs to a minimum during the most frequent collections. When monitoring your application, it may be helpful to see if it allows the GC to take advantage of this optimization.

Pinned Objects

As mentioned earlier, pinned objects are those that have been marked as possibly in use by threads executing unmanaged code. When the GC runs, it must ignore pinned objects, because changing an object's address in memory (when compacting or promoting it) would cause severe problems for the unmanaged thread. Objects therefore survive any collection that occurs while they are pinned.

When monitoring application performance, pinned objects indicate memory that cannot be managed or reclaimed by the garbage collector. Pinned objects are usually found in places where the application is using significant amounts of unmanaged code.
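Pinning can also be done explicitly from managed code with GCHandle; this is a sketch of the same mechanism the interop layer applies implicitly when it pins arguments for a call into unmanaged code. The buffer size here is arbitrary.

```csharp
using System;
using System.Runtime.InteropServices;

class PinningExample
{
    static void Main()
    {
        byte[] buffer = new byte[256];

        // Pin the buffer so the GC cannot move it while unmanaged code
        // holds a raw pointer to it.
        GCHandle handle = GCHandle.Alloc(buffer, GCHandleType.Pinned);
        try
        {
            IntPtr address = handle.AddrOfPinnedObject();
            // ... pass 'address' to unmanaged code here ...
            Console.WriteLine(address != IntPtr.Zero);
        }
        finally
        {
            // Unpin as soon as possible: while pinned, the object survives
            // every collection and the GC must compact around it.
            handle.Free();
        }
    }
}
```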

Finalization

Some objects might store references to unmanaged resources such as network sockets or mutexes. Since de-allocating such an object would result in loss of the reference to the unmanaged resource, developers might specify that the GC must cause the object to clean up after itself before it can be de-allocated, in a process called finalization.

Finalization carries several performance costs. For example, objects awaiting finalization cannot be de-allocated by the garbage collector until they are finalized. Moreover, if an object pending finalization references other objects, then those objects are considered to be in use, even if they are otherwise unused. In contrast to the garbage collector, the programmer has no way to directly control the finalization process. Since there are no guarantees as to when finalization will occur, it is possible for large amounts of memory to become tied up at the mercy of the finalization queue.
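A common way to limit these costs is to pair the finalizer with an explicit cleanup method. The sketch below is illustrative (the SocketWrapper name and stand-in handle are our own, not from the text): well-behaved callers release the unmanaged resource deterministically via Dispose, which also removes the object from the finalization queue so it can be collected normally.

```csharp
using System;

// A wrapper around an unmanaged resource that uses a finalizer only as a
// safety net for callers who forget to call Dispose.
class SocketWrapper : IDisposable
{
    IntPtr handle = new IntPtr(42);   // stands in for a real unmanaged handle

    public void Dispose()
    {
        ReleaseHandle();
        // Tell the GC this object no longer needs finalization, so it can be
        // de-allocated normally instead of surviving an extra collection.
        GC.SuppressFinalize(this);
    }

    // The finalizer: runs only if Dispose was never called.
    ~SocketWrapper()
    {
        ReleaseHandle();
    }

    void ReleaseHandle()
    {
        if (handle != IntPtr.Zero)
        {
            // ... release the unmanaged resource here ...
            handle = IntPtr.Zero;
        }
    }
}
```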

When a garbage collection occurs, objects pending finalization are promoted instead of collected, and tracked by the Finalization Survivors counter in the .NET CLR Memory performance object. Objects referenced by finalization survivors are also promoted, and tracked by the Promoted Finalization counters in the .NET CLR Memory performance object.

When monitoring an application that uses objects that require finalization, it is important to watch out for excessive use of memory by objects that are pending finalization directly or otherwise.

Differences Between Workstation and Server GC

Whenever a collection occurs, the GC must suspend execution of those threads that access objects whose locations in memory will change as they are promoted or compacted. Choosing the best behavior for the GC depends on the type of application.

Desktop applications that interact directly with individual users tend to allocate fewer memory objects than Web-based applications that serve hundreds or even thousands of users, and so minimizing the latency involved in a garbage collection is a higher priority than optimizing the rate at which memory is reclaimed.

Therefore, Microsoft implements the GC in two different modes. Note that the best GC is not chosen automatically - CLR will use the Workstation GC (mscorwks.dll) unless the developer specifies that the application requires the Server GC (mscorsvr.dll) instead.

NOTE
In our experience, with most Web application scenarios, we have found that the Server GC outperforms the Workstation GC.

Run Time Exceptions

Whenever a method encounters a situation it can't deal with in the normal course of execution, it creates an exception object that describes the unexpected condition (such as out of memory or access denied). The exception is then thrown, meaning the thread signals CLR that it is in a state of distress, and cannot continue executing until the exception has been handled.

When an exception is thrown, the manner of its disposal will depend on whether or not the application has code to handle the exception. Either CLR will halt the application because it cannot handle the exception gracefully, or CLR will execute the appropriate exception handler within the application, after which the application may continue execution. (An application could be designed to terminate gracefully after handling certain exceptions; in that case we would say that the application continues, if only to terminate as intended.)

Suppose method main() calls method foo(), which in turn calls method bar(), and bar() throws a System.FileNotFoundException. The CLR suspends execution of the thread while it looks for an exception filter that matches the thrown exception. Method bar() might have an exception handler whose filter specifies System.DivideByZeroException. The FileNotFoundException would not match this filter, and so CLR would continue in search of a matching exception filter. If none of the exception filters specified by function bar() matched the exception, the system would recurse up the call stack from bar() to the function that called it, in this case, foo(). Now, suppose foo() has an exception handler that specifies System.FileNotFoundException. The exception handler in foo() will execute, thereby catching the exception.

When we speak of throw-to-catch depth, we refer to the number of layers up the call stack CLR had to traverse to find an appropriate exception handler. In our hypothetical example, the throw-to-catch depth was 1. If bar() had caught its own exception, the depth would have been 0. And if CLR had needed to recurse all the way up to main(), the depth would have been 2.

Once an exception has been caught, execution of the application resumes inside a block of code called a finally block. The purpose of a finally block is to clean up after whatever operations might have been interrupted by the exception. Finally blocks are optional, but every finally block that exists between the method that threw the exception and the method that caught it will be executed before the application resumes regular execution.

Therefore, in our example above, if functions foo() and bar() each implement a finally block, both will execute before program flow returns to normal. If the developer chose not to write a finally block for bar(), but did write one for foo(), the finally block in foo() would still execute.
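The hypothetical main()/foo()/bar() scenario can be written out in C# as follows (method names are capitalized per C# convention); the messages simply trace the order in which the non-matching filter, the finally blocks, and the handler run.

```csharp
using System;
using System.IO;

class ExceptionExample
{
    static void Main()
    {
        Foo();
        Console.WriteLine("back to normal execution");
    }

    static void Foo()
    {
        try
        {
            Bar();
        }
        catch (FileNotFoundException)
        {
            // Matches the exception thrown one frame down in Bar(),
            // so the throw-to-catch depth is 1.
            Console.WriteLine("handled in Foo");
        }
        finally
        {
            Console.WriteLine("Foo's finally block");
        }
    }

    static void Bar()
    {
        try
        {
            throw new FileNotFoundException();
        }
        catch (DivideByZeroException)
        {
            // This filter does not match, so CLR keeps searching up the stack.
        }
        finally
        {
            // Runs during unwinding, before Foo's handler executes.
            Console.WriteLine("Bar's finally block");
        }
    }
}
```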

Exceptions in Unmanaged Code

When managed code calls unmanaged code, and that unmanaged code throws an exception which it does not catch, the exception is converted into a .NET exception, and CLR becomes involved in attempting to handle it. As with any other .NET exception, CLR will halt the application if it is not handled.

Unmanaged exceptions, which do not concern CLR, won't be tabulated by any of the .NET CLR performance counters. On the other hand, .NET exceptions which originated in unmanaged code will be tabulated by the # of Exceps Thrown counters once they are converted. When tabulating .NET exceptions converted from unmanaged code, the Throw to Catch Depth performance counter will only count stack frames within the .NET environment, causing the throw-to-catch depth to appear shorter than it actually is.

Exceptions and Performance

Exception handling is expensive. Execution of the involved thread is suspended while CLR recurses through the call stack in search of the right exception handler, and when it is found, the exception handler and some number of finally blocks must all have their chance to execute before regular processing can resume.

Exceptions are intended to be rare events, and it is assumed that the cost of handling them gracefully is worth the performance hit. When monitoring application performance, some people are tempted to hunt for the most expensive exceptions. But why tune an application for the case that isn't supposed to happen? An application that disposes of exceptions quickly is still just blazing through exceptions instead of doing real work. Therefore, we recommend that you work to identify the areas where exceptions most often occur, and let them take the time they need so that your application can continue running gracefully.

.NET Performance Counters

Now that you have been introduced to those aspects of the .NET Framework that have a direct impact on the performance of your Web application, we will discuss some of the new .NET performance counters that allow you to measure the performance of the .NET Framework and your managed code. This section is not intended to discuss all of the counters; doing so would require far more than a chapter of material. Instead, we set out to present those counters that would give you the most bang for your buck. The counters presented below, in our opinion, are the ones that can tell the most about your application in the shortest amount of time. Note that this subset of counters does not represent all of the requirements for monitoring the performance of your .NET Web application. Depending on your system architecture, you may find it necessary to monitor other .NET related counters along with counters not specific to .NET.

TIP
If you are interested in capturing performance counter data as part of an application that you are developing, you can use the PerformanceCounter class, found in the System.Diagnostics namespace, from any managed language.
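For example, here is a minimal sketch that samples one of the counters discussed later in this chapter. The "_Global_" instance name aggregates all managed processes on the machine; in practice you would usually substitute the name of your own process.

```csharp
using System;
using System.Diagnostics;

class CounterSample
{
    static void Main()
    {
        // Read the total number of managed exceptions thrown so far.
        using (PerformanceCounter counter = new PerformanceCounter(
            ".NET CLR Exceptions",    // performance object (category)
            "# of Exceps Thrown",     // counter name
            "_Global_",               // instance: all managed processes
            true))                    // open read-only
        {
            Console.WriteLine(counter.NextValue());
        }
    }
}
```

Rate counters (such as # of Exceps Thrown / sec) need two NextValue() calls separated by a sampling interval before they report a meaningful value.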

.NET CLR Memory Object

All of the counters found under this object relate to memory usage by the .NET Framework. Whether you are running a .NET Web application or a .NET desktop application, these counters will help you understand how the Framework is using the system's memory resources. It is important to note that if your application consists of both managed and unmanaged code, these counters will only draw a partial picture of memory usage, since they do not track memory use by unmanaged code even though it may be running as part of the same application.

# GC Handles Performance Counter

The # GC Handles performance counter displays the current number of garbage collection handles in use. Garbage collection handles are handles to resources outside of CLR and the managed environment. A single handle may only occupy a tiny amount of memory in the managed heap; however, the unmanaged resource it represents could actually be very expensive. For instance, if a particular user scenario requires the allocation of an unmanaged resource such as a network socket, then each time a user executes that scenario an object is created along with a corresponding GC handle. Under heavy load, particularly when this scenario is executed frequently, your Web site would create a large number of GC handles, possibly causing your application to become unstable.

# Gen 0 Collections

This and the following two counters are important for understanding how efficiently memory is being cleaned up. The # Gen 0 Collections counter displays the number of times generation 0 objects have been garbage collected since the start of your application. Each time an object that is still in use survives a generation 0 collection, it is promoted from generation 0 to generation 1. As we described earlier, one scenario in which generation 0 collections occur is when your Web application needs to create a new object whose memory requirements exceed the resources available at generation 0. In that case, objects remaining in use at the generation 0 level are promoted, freeing the resources needed for the newest object. The rate of Gen 0 collections will usually correspond with the rate at which the application allocates memory.

# Gen 1 Collections

This counter displays the number of times the Gen 1 heap has been collected since the start of the application. You should monitor this counter in the same fashion as the # Gen 0 Collections counter. If you see numerous collections at generation 1, it is an indication that there are not sufficient resources available at generation 1 for the objects being promoted from generation 0. Thus, objects will be promoted from generation 1 to generation 2, leading to high resource utilization at the generation 2 level.

# Gen 2 Collections

This counter displays the number of times generation 2 objects have been garbage collected since the start of the application. Of the three counters that report generation-level collection information (# Gen 0 Collections, # Gen 1 Collections, and # Gen 2 Collections), # Gen 2 Collections is the most important to monitor. With Web applications, if you are seeing high activity for this counter, the aspnet_wp process could be forced to restart. The restart can occur if the memory available to the process has been fully allocated to objects at the generation 2 level; restarting the aspnet_wp process releases that memory.

# Total Committed Bytes

This counter displays the amount of virtual memory committed by your application. It is obviously ideal for an application to require as little memory as possible, thereby reducing the amount of work required for the garbage collector to manage it.

% Time in GC

This counter indicates the amount of time spent by the garbage collector on behalf of an application to collect and compact memory. If your application is not optimized, you will see the garbage collector working constantly, promoting and deleting objects. This time spent by the garbage collector reflects its use of critical processor and memory resources.

Gen 0 heap size

The Gen 0 heap size counter displays the maximum bytes that can be allocated in generation 0. The generation 0 size is dynamically tuned by the garbage collector; therefore, the size will change during the execution of an application. A reduced heap size reflects that the application is economizing on memory resources, thereby allowing the GC to reduce the size of the application's working set.

Gen 0 Promoted Bytes/sec

This counter displays the number of bytes promoted per second from generation 0 to generation 1. Even though your application may exhibit a high number of promotions, you may not see a high number of promoted bytes per second if the objects being promoted are extremely small in size. You should monitor the Gen 1 heap size counter along with this counter in order to verify whether promotions are resulting in poor resource allocation at the generation 1 level.

Gen 1 heap size

This counter displays the current number of bytes in generation 1. Unlike its Gen 0 heap size counterpart, the Gen 1 heap size counter does not display the maximum size of generation 1. Instead, it displays the current amount of memory allocated to objects at the generation 1 level. When monitoring this counter, you will want to monitor the # Gen 0 Collections counter simultaneously. If you find a high number of generation 0 collections occurring, you will find the generation 1 heap size increasing along with them. Eventually, objects will need to be promoted to generation 2, leading to inefficient memory utilization.

Gen 1 Promoted Bytes/sec

Gen 1 Promoted Bytes/sec displays the number of bytes promoted per second from generation 1 to generation 2. Similar to the approach for the Gen 0 Promoted Bytes/sec counter, you should monitor the Gen 2 heap size counter when monitoring the Gen 1 Promoted Bytes/sec counter. The two counters will provide you with a good indication of how much memory is being allocated for objects being promoted from generation 1 to generation 2.

Gen 2 heap size

This counter displays the current number of bytes in generation 2. When monitoring an application that is experiencing a high number of promotions from generation 1 to generation 2, the generation 2 heap size will increase since objects cannot be further promoted.

.NET CLR Loading

The following counters, found under the .NET CLR Loading performance object, when used alongside other counters such as % Processor Time, allow you to gain a more detailed understanding of the effects that loading .NET applications, AppDomains, classes, and assemblies has on system resources.

Total AppDomains

This counter displays the peak number of AppDomains (application domains) loaded since the start of the application. As mentioned earlier, AppDomains are a secure and versatile unit of processing that CLR can use to provide isolation between applications running in the same process. AppDomains are particularly useful when you need to run multiple applications within the same process. In the case of a Web application, you may find yourself having to run multiple applications within the aspnet_wp process. From a performance standpoint, understanding the number of AppDomains currently running on the server is critical, because each time you create or destroy an AppDomain, system resources are taxed. Just as important is the need to understand the type of activity occurring between AppDomains. For example, if your applications must cross AppDomain boundaries during execution, this will result in context switches. Context switches (as discussed in Chapter 4) are expensive, particularly when a server is experiencing 15,000 context switches per second or more.

Total Assemblies

This counter displays the total number of assemblies loaded since the start of the application. Assemblies can be loaded as domain-neutral when their code can be shared by all AppDomains, or they can be loaded as domain-specific when their code is private to the AppDomain. If the assembly is loaded as domain-neutral from multiple AppDomains, then this counter is incremented only once. You should be aware of the total number of assemblies loaded on the server because of the resources needed to create and destroy them. Sometimes developers will load assemblies that aren't really required by the application. Alternatively, developers may not be aware of how many assemblies they are truly loading because they are making indirect references.

Total Classes Loaded

This counter displays the total number of classes loaded in all of the assemblies since the start of the application. Unless a loaded class is static, it has a constructor, and the developer must instantiate the class before using it, which is more resource intensive than creating an object once and calling its methods.

.NET CLR LocksAndThreads

When tracking down a bottleneck that could be related to thread or process contention, the .NET CLR LocksAndThreads performance object is the best place to start. Here, we describe those counters under the .NET CLR LocksAndThreads performance object that can help rule out possible contention issues quickly and efficiently.

Contention Rate/sec

This counter displays the number of times per second that threads in the runtime attempt to acquire a managed lock unsuccessfully. It should be noted that under conditions of heavy contention, threads are not guaranteed to obtain locks in the order they've requested them.

Total # of Contentions

This counter displays the total number of times threads in CLR have attempted to acquire a managed lock unsuccessfully.

Current Queue Length

This counter displays the total number of threads currently waiting to acquire a managed lock. If you see that the queue length continues to grow under constant application load, you may be dealing with an irresolvable lock rather than a resolvable one: an irresolvable lock results when an error in the application code's logic makes it impossible for the application to release a lock on an object.
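The kind of contention these counters measure can be reproduced with an ordinary C# lock statement. In this sketch, every time one thread must wait because the other holds the lock, the runtime records a contention (visible in Contention Rate / sec and Total # of Contentions); the iteration and spin counts are arbitrary.

```csharp
using System;
using System.Threading;

class ContentionExample
{
    static readonly object gate = new object();

    static void Main()
    {
        // Two threads repeatedly acquiring the same managed lock.
        Thread t1 = new Thread(Work);
        Thread t2 = new Thread(Work);
        t1.Start();
        t2.Start();
        t1.Join();
        t2.Join();
        Console.WriteLine("done");
    }

    static void Work()
    {
        for (int i = 0; i < 1000; i++)
        {
            lock (gate)
            {
                Thread.SpinWait(100);   // hold the lock briefly
            }
        }
    }
}
```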

.NET CLR Exceptions

Applications that throw excessive numbers of exceptions can be extremely resource intensive. Ideally, an application should not throw any exceptions. However, many times developers will intentionally throw exceptions as part of the error checking process. This exception-generating code should be cleaned up before taking an application into production. Here we have listed two counters found under the .NET CLR Exceptions object. If you choose to monitor only one of these, you should pay most attention to the # of Exceps Thrown / sec counter. If you see this counter exceed 100 exceptions per second, your application code warrants further investigation.

# of Exceps Thrown

This counter displays the total number of exceptions thrown since the start of the application. These include both .NET exceptions and unmanaged exceptions that are converted into .NET exceptions (for example, a null pointer reference exception in unmanaged code would get rethrown in managed code as a .NET System.NullReferenceException), but excludes exceptions which were thrown and caught entirely within unmanaged code. This counter includes both handled and unhandled exceptions. Exceptions that are rethrown will be counted again. This counter is an excellent resource when you are attempting to determine what portion of the code may be generating a high number of exceptions. You could do this by walking through the application while simultaneously monitoring this counter. When you find a sudden jump in the exception count, you can go back and review the code that was executed during that portion of the walkthrough in order to pin down where an excessive number of exceptions are thrown.

# of Exceps Thrown / sec

This counter displays the number of exceptions thrown per second. These include both .NET exceptions and unmanaged exceptions that get converted into .NET exceptions but excludes exceptions that were thrown and caught entirely within unmanaged code. This counter includes both handled and unhandled exceptions. As mentioned earlier, if you monitor a consistently high number of exceptions per second thrown (100 or more), you will need to review the source code in order to determine why and where these exceptions are being thrown.

.NET CLR Security

Depending on how much emphasis you place on the security of your Web application, you will find the following set of counters to be either extremely active or hardly used. These counters should be kept active when truly necessary. Conducting security checks of your application is critical even if there is an effect upon application performance. However, using the security features of the .NET Framework unwisely will not only create security holes in your application, but performance issues will emerge due to poor application design.

# Link Time Checks

Many times you will monitor a counter and see excessive activity for that counter. This activity can be deceiving unless you truly understand what is going on with the counter. The # Link Time Checks counter is just one example. The count displayed is not indicative of serious performance issues, but it is indicative of security system activity. This counter displays the total number of link-time Code Access Security (CAS) checks since the start of the application. A link-time CAS check occurs, for example, when a caller makes a call to a callee that demands a particular permission. The link-time check is performed once per caller and at only one level of the stack, making it less resource expensive than a stack walk.

% Time in RT checks

This counter displays the percentage of elapsed time spent performing runtime Code Access Security (CAS) checks since the last such check. CAS allows code to be trusted to varying degrees and enforces these varying levels of trust depending on code identity. This counter is updated at the end of a runtime security check; it represents the last observed value and is not an average. If this counter shows a high percentage, you will want to revisit what is being checked and how often. Your application may be executing unnecessarily deep stack walks (the Stack Walk Depth counter is discussed next). Another cause of a high percentage of time spent in runtime checks could be numerous link-time checks.

Stack Walk Depth

This counter displays the depth of the stack during the last runtime CAS check. A runtime CAS check is performed by walking the stack. As an example, suppose your application calls an object that has four methods (methods A through D). If your code calls method A, a stack walk of depth 1 occurs. However, if you were to call method D, which in turn calls methods C, B, and A, a stack walk of depth 4 would occur.

Total Runtime Checks

This counter displays the total number of runtime CAS checks performed since the start of the application. Runtime CAS checks are performed when a caller makes a call to a callee demanding a particular permission. The runtime check is made on every call by the caller, and the check is done by examining the current thread stack of the caller. Utilizing information from this counter and that of the Stack Walk Depth counter, you can gain a good idea of the performance penalty you are paying for executing security checks. A high number for the total runtime checks along with a high stack walk depth indicates performance overhead.
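A runtime check of this kind is triggered by a permission demand. Here is a sketch using the .NET Framework-era CAS API (the file path is purely illustrative, and note that CAS has since been deprecated and is absent from modern .NET):

```csharp
using System;
using System.Security.Permissions;

class CasExample
{
    static void Main()
    {
        // Demand read access to a file. The runtime satisfies the demand by
        // walking the call stack and verifying that every caller holds the
        // permission -- the work counted by Total Runtime Checks and
        // measured by Stack Walk Depth.
        FileIOPermission permission = new FileIOPermission(
            FileIOPermissionAccess.Read, @"C:\data\report.txt");
        permission.Demand();
        Console.WriteLine("all callers have read permission");
    }
}
```

The deeper the call stack at the point of the Demand, the more frames the walk must examine, which is why the two counters should be read together.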



Performance Testing Microsoft .NET Web Applications
ISBN: 596157134
Year: 2002