Before you look at how to use the tracing tools provided by .NET, you need to understand who will be using the information provided by the tracing code and how they will use that information. The four most common stakeholders are the application's end users, the application's support team, the application's testing team, and the application's development team.
An application's end users need to know about anything that could adversely affect their use of the application. They need status information that helps them to:
Identify application status and failures, especially during lengthy operations
Identify application mistakes that might affect business processes
Work around certain application problems
Some applications have the luxury of a dedicated support team, whereas others have to make do with the support available from the general help desk. However support is handled, a first-level support team has to respond directly to end users when they're having trouble using an application. This team needs status and debugging information that helps them to:
Identify the application's status and behaviors
Diagnose application errors and identify the causes
Fix or work around the straightforward problems
Pass diagnostic information to developers for problems that they can't solve themselves
Assist end users with performing certain application tasks
Improve an application's availability to its end users
An application development team has to identify the causes of application errors, fix an application when it goes wrong, and generally maintain and enhance the application's code. This team needs debugging, status, and metrics information that helps them to:
Diagnose application errors and identify the causes
Identify and understand the end-user and tester actions that cause application problems
Identify application performance bottlenecks and reliability "hot spots"
Improve an application's reliability and availability
Provide further tracing code to isolate and identify tenacious problems
As you can see from the preceding requirements lists, the four groups of people need similar information; the main difference is often the level of detail each group needs. Categorized, the required information falls into four main types of diagnostic data:
Failure data: Exceptions, errors, warnings, crashes, and assertion failures
Timing data: Performance, communications, queries, and lengthy operations
Statistical data: Traffic volumes and usage patterns
Debugging data: User actions, program state, and program data
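To make these four categories concrete in tracing code, each record can carry an explicit category tag. The Python sketch below is only an illustration and is not part of .NET's tracing API; the `DiagnosticCategory` enum, the `EVENT_CATEGORIES` mapping, and the choice to treat unknown events as debugging data are all hypothetical.

```python
from enum import Enum


class DiagnosticCategory(Enum):
    """The four broad types of diagnostic data."""
    FAILURE = "failure"          # exceptions, errors, warnings, crashes, assertion failures
    TIMING = "timing"            # performance, communications, queries, lengthy operations
    STATISTICAL = "statistical"  # traffic volumes and usage patterns
    DEBUGGING = "debugging"      # user actions, program state, program data


# Hypothetical mapping from event kinds to categories, for illustration only.
EVENT_CATEGORIES = {
    "exception": DiagnosticCategory.FAILURE,
    "assertion_failure": DiagnosticCategory.FAILURE,
    "crash": DiagnosticCategory.FAILURE,
    "db_query": DiagnosticCategory.TIMING,
    "lengthy_operation": DiagnosticCategory.TIMING,
    "traffic_volume": DiagnosticCategory.STATISTICAL,
    "usage_pattern": DiagnosticCategory.STATISTICAL,
    "user_action": DiagnosticCategory.DEBUGGING,
    "program_state": DiagnosticCategory.DEBUGGING,
}


def categorize(event_kind: str) -> DiagnosticCategory:
    """Return the category for an event kind (assumption: unknown kinds count as debugging data)."""
    return EVENT_CATEGORIES.get(event_kind, DiagnosticCategory.DEBUGGING)


if __name__ == "__main__":
    print(categorize("db_query").value)   # timing
    print(categorize("exception").value)  # failure
```

Tagging at the point of emission lets each stakeholder group filter the same stream by the category it cares about.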
It's the developer's job to include tracing code that emits all of this data, and identifying exactly which information will be useful for solving a wide variety of potential problems is quite an art. At the very least, I suggest that your tracing code be capable of recording each of the following situations:
Every program crash and the call stack that led to the crash
Every exception and the call stack that led to the exception
Every assertion failure (I discuss assertions in the next section)
Every "Page not found" (404) error
Every component or page time-out
Timing data for every lengthy software operation
Timing data for every business-critical software operation
Timing data for every database query
Timing data for any "real-time" communication with another application
Status data recording the results of application health checks
Status data reporting any communication problems between components
Records of all messages involving communication between components
Traffic volumes, especially any large increases in traffic volume
Usage patterns, especially any significant changes in usage
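Several items in this list come down to two recurring tasks: timing an operation and capturing the call stack of any exception it raises. Here is a minimal Python sketch of both, assuming a hypothetical in-memory `trace_records` list in place of a real tracing component and an arbitrary one-second threshold for "lengthy" operations; `traced_operation` is an illustrative name, not an existing API.

```python
import time
import traceback
from contextlib import contextmanager

# Hypothetical in-memory sink standing in for a central tracing component.
trace_records = []


@contextmanager
def traced_operation(name, threshold_seconds=1.0):
    """Record how long an operation takes, plus the call stack of any exception.

    threshold_seconds is an illustrative cut-off for flagging "lengthy" operations.
    """
    start = time.perf_counter()
    try:
        yield
    except Exception as exc:
        # Record the exception and the call stack that led to it, then re-raise
        # so the application's normal error handling still runs.
        trace_records.append({
            "kind": "exception",
            "operation": name,
            "error": repr(exc),
            "stack": traceback.format_exc(),
        })
        raise
    finally:
        # Timing data is recorded whether the operation succeeded or failed.
        elapsed = time.perf_counter() - start
        trace_records.append({
            "kind": "timing",
            "operation": name,
            "seconds": elapsed,
            "lengthy": elapsed >= threshold_seconds,
        })


if __name__ == "__main__":
    with traced_operation("load_customer_orders"):
        time.sleep(0.05)  # stands in for a database query
    try:
        with traced_operation("parse_order"):
            int("not a number")
    except ValueError:
        pass
    for record in trace_records:
        print(record["kind"], record["operation"])
```

Wrapping each lengthy, business-critical, or database operation this way yields both timing data and failure data from a single line at the call site.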
Note that having tracing code that records this information doesn't mean that all of the tracing code needs to be active all of the time. As you'll see later in the section titled "Step 5: Trace Control at Runtime," it's possible to use an application's configuration file at runtime to control, at a fine-grained level, which information is actually emitted.
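The idea of configuration-driven trace control can be sketched as a set of per-component switches. This Python illustration borrows the five levels of .NET's `System.Diagnostics.TraceLevel` enumeration (Off, Error, Warning, Info, Verbose), but the `TraceSwitches` class and its JSON layout are hypothetical and do not reflect the .NET configuration file format.

```python
import json
from enum import IntEnum


class TraceLevel(IntEnum):
    # Same ordering as .NET's System.Diagnostics.TraceLevel values.
    OFF = 0
    ERROR = 1
    WARNING = 2
    INFO = 3
    VERBOSE = 4


class TraceSwitches:
    """Per-component trace levels loaded from configuration text.

    The JSON layout ("default" plus a "components" map) is hypothetical.
    """

    def __init__(self, config_text: str):
        self.reload(config_text)

    def reload(self, config_text: str) -> None:
        # Re-parsing the configuration changes what is emitted without a redeploy.
        config = json.loads(config_text)
        self.default = TraceLevel[config.get("default", "error").upper()]
        self.overrides = {
            component: TraceLevel[level.upper()]
            for component, level in config.get("components", {}).items()
        }

    def enabled(self, component: str, level: TraceLevel) -> bool:
        # Emit only if the message's level is at or below the component's threshold.
        threshold = self.overrides.get(component, self.default)
        return level != TraceLevel.OFF and level <= threshold


if __name__ == "__main__":
    switches = TraceSwitches(
        '{"default": "error", "components": {"OrderService": "verbose", "Billing": "off"}}'
    )
    print(switches.enabled("OrderService", TraceLevel.INFO))  # True
    print(switches.enabled("Inventory", TraceLevel.WARNING))  # False
    print(switches.enabled("Billing", TraceLevel.ERROR))      # False
```

Because `reload` simply re-parses the configuration text, an application could call it whenever its configuration changes, turning detailed tracing on for one troublesome component while leaving the rest quiet.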
Of course, you also need to record information that pinpoints the exact location and type of each problem. This might include a standard header incorporating the following data:
Component name and path
Machine name, IP address, and geographic location, if applicable
Type of environment (e.g., dev, integration test, user test, live, or contingency)
Whether support is required and the support group to be contacted
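A standard header like this maps naturally onto an immutable record that is built once per component and attached to every trace message. In this Python sketch, `TraceHeader`, `build_header`, the field names, and the example component path are hypothetical; the machine name and IP address come from the standard `socket` module, and the IP falls back to "unknown" if name resolution fails.

```python
import socket
from dataclasses import dataclass, asdict
from typing import Optional

# Environment names taken from the header list.
ENVIRONMENTS = {"dev", "integration test", "user test", "live", "contingency"}


@dataclass(frozen=True)
class TraceHeader:
    """Standard header attached to every trace record (field names are hypothetical)."""
    component_name: str
    component_path: str
    machine_name: str
    ip_address: str
    environment: str
    support_required: bool
    support_group: Optional[str] = None
    location: Optional[str] = None  # geographic location, if applicable


def build_header(component_name: str, component_path: str, environment: str,
                 support_required: bool = False, support_group: Optional[str] = None,
                 location: Optional[str] = None) -> TraceHeader:
    if environment not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment!r}")
    if support_required and not support_group:
        raise ValueError("support_group must be set when support is required")
    machine = socket.gethostname()
    try:
        ip = socket.gethostbyname(machine)
    except OSError:
        ip = "unknown"  # name resolution can fail on some hosts
    return TraceHeader(component_name, component_path, machine, ip,
                       environment, support_required, support_group, location)


if __name__ == "__main__":
    header = build_header("OrderService", "orders/OrderService", "live",
                          support_required=True, support_group="Orders first-level support")
    print(asdict(header))
```

Validating the environment and support fields when the header is built means a misconfigured component fails at startup rather than producing trace records that nobody can route.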
The bottom line is that all of this diagnostic information helps you to improve your application's availability through better and faster understanding of problems. In a large distributed application, where problems can be especially hard to locate and diagnose, this information can make the difference between a viable application and one that isn't.
No matter what tracing tools you use, there are some design decisions that you need to make. Here are a few rules of thumb that have proven to be useful for many applications, even those built before .NET arrived:
Centralize tracing code: Create reusable tracing classes or components that standardize logging information and provide tracing facilities without bothering developers with unnecessary details.
Centralize diagnostic information: Place all diagnostic information in one central location (such as a database) to allow easier monitoring and statistical analysis.
Minimize performance degradation: Use trace control mechanisms to ensure that you don't emit diagnostic information unless it's needed and to postpone the gathering of expensive diagnostic information until it's really necessary.
Analyze failures and keep metrics: To improve the service offered to an application's end users, you should study its failures and then provide the support team with the root cause of each problem and a list of possible solutions. Providing metrics in the form of charts can also help the support people to identify reliability and availability trends.
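The first three rules of thumb can be combined in one small component: a single tracing entry point, a single store, and a switch that is checked before any expensive data is gathered. The fourth rule then becomes a query against that store. In this Python sketch, `CentralTracer`, its table layout, and the category names are hypothetical, and an in-memory SQLite database stands in for a shared diagnostic database.

```python
import sqlite3
import time
from typing import Callable, Dict, Iterable


class CentralTracer:
    """A single reusable tracing entry point that writes to one central store.

    An in-memory SQLite database stands in for the shared diagnostic database;
    the table layout and category names are hypothetical.
    """

    def __init__(self, enabled_categories: Iterable[str], db_path: str = ":memory:"):
        self.enabled_categories = set(enabled_categories)
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS trace "
            "(ts REAL, component TEXT, category TEXT, message TEXT)"
        )

    def trace(self, component: str, category: str,
              make_message: Callable[[], str]) -> bool:
        # Trace control: check the switch first, and only call make_message
        # (which may be expensive) when the category is enabled.
        if category not in self.enabled_categories:
            return False
        self.db.execute(
            "INSERT INTO trace VALUES (?, ?, ?, ?)",
            (time.time(), component, category, make_message()),
        )
        self.db.commit()
        return True

    def failure_counts(self) -> Dict[str, int]:
        # Simple metric for spotting reliability "hot spots": failures per component.
        rows = self.db.execute(
            "SELECT component, COUNT(*) FROM trace WHERE category = 'failure' "
            "GROUP BY component ORDER BY COUNT(*) DESC"
        ).fetchall()
        return dict(rows)


if __name__ == "__main__":
    def dump_program_state() -> str:
        print("expensive state dump running")
        return "...full program state..."

    tracer = CentralTracer(enabled_categories={"failure", "timing"})
    tracer.trace("OrderService", "debugging", dump_program_state)  # skipped; never runs
    tracer.trace("OrderService", "failure", lambda: "time-out calling Billing")
    tracer.trace("Billing", "failure", lambda: "database deadlock")
    tracer.trace("OrderService", "failure", lambda: "time-out calling Billing")
    print(tracer.failure_counts())
```

Passing a callable rather than a ready-made string is what keeps disabled tracing cheap: when a category is switched off, the caller pays only for a set lookup, and the expensive diagnostic data is never gathered.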