I was called in to deal with a sophisticated and highly functional trading system that was plagued with some nightmarish reliability problems. The application consisted of many megabytes of source code spread over dozens of components running on multiple machines. Its error handling and recovery either was nonexistent or actively suppressed errors, and its real-time interaction with external systems was frequently changing. In short, it was an application from one of the lower levels of Dante's Inferno.
This experience taught me more than I ever wanted to know about the reliability of complex systems and supplied some significant insights into how to approach reliability in the distributed world that VB .NET inhabits.
Data from studies published by the Gartner Group, Cahners Instat, and others in the late 1990s can be used to map the typical causes of application failure. Figure 1-1 shows this data in chart format. Although it is unclear whether the percentages shown in the chart will change as .NET allows you to build more complex and distributed applications, they are a reasonable starting point for discussion.
The various categories shown in Figure 1-1 deserve some explanation. Human error covers backup and restore problems: errors caused by a lack of rigorous operational procedures and failures associated with configuration problems. The hardware factor includes components such as disks, memory, and fans. The network factor covers failures in routers, switches, cabling, and network servers. System software consists of the operating system, device drivers, firewalls, Web servers, database servers, load balancers, and other sundry applications such as antivirus programs. The application software is, of course, the application that is being profiled for failure. Finally, the environment category covers power failures, cooling failures, storms, floods, and fires.
At least two interesting observations can be made about this information. The first is that the .NET Framework, here coming under the category of system software, could by itself make application reliability worse. The replacement of the relatively simple VB.Classic execution engine with a more complex class library and common language runtime (CLR) engine is unlikely to improve matters. The hope is that this extra layer is reliable and will help you to make your applications less defective than when using VB.Classic. In reality, the significant productivity boost provided by the .NET Framework may come at a high price in application reliability.
The second observation is that you need to look outside of your application for the majority of application failures. While beyond the scope of this book, it is worthwhile to understand that uninterruptible power supplies, reliable hardware, and a trustworthy network are critical in ensuring that your applications are reliable and consistently available. You also should not neglect the reliability of your Web and database servers, two very critical elements. This may involve building diagnostic utilities into your system aimed at monitoring factors that are not necessarily under your control.
This is all very interesting, but how do you measure the reliability of your own .NET application? The first step is to establish exactly what your application's reliability requirements are and then to record those requirements for later benchmarking. Setting aside availability for the moment (I discuss it in the next section), you need to consider at least two issues when constructing a reliability metric:
How well an application provides the required services
How well an application provides correct results
The first element, how well an application provides the required services, corresponds to a requirement something like "The system will prevent trade entry for no more than 1 hour per year." The second element, how well an application provides correct results, corresponds to a requirement something like "The system will correctly report 999 out of every 1,000 trades."
A reliability metric is often measured using something called mean time between failures (MTBF), which is made up of the following simple formula:
MTBF = Hours / Failures
This formula measures the average time that an application will run until a failure occurs. So if your application has six failures a year, its MTBF works out to 1,461 hours, or about 2 months. Notice that this is not necessarily the same as mean time between bugs (MTBB). If your application recovers properly from a bug, that bug might not even count as an application failure.
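The MTBF arithmetic above can be sketched in a few lines of VB .NET. This is purely illustrative; the function and variable names are mine, not part of any framework:

```vbnet
' Minimal sketch of the MTBF calculation described above.
Module MtbfDemo

    ' MTBF = hours of operation divided by number of failures.
    Function Mtbf(ByVal hoursOfOperation As Double, _
                  ByVal failures As Integer) As Double
        Return hoursOfOperation / failures
    End Function

    Sub Main()
        ' One year of continuous operation (8,766 hours) with six failures
        Dim result As Double = Mtbf(8766, 6)
        Console.WriteLine("MTBF = {0:F0} hours", result)  ' 1461 hours, about 2 months
    End Sub

End Module
```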
Unfortunately, reliability is not always as simple as this formula implies. One important element that it fails to consider is the type of error. To continue with the example application, if a trading system allows an option trade exposing the trader to literally millions of dollars of risk, it doesn't really matter how reliable the application is in all other respects. So now you need a more complex formula in order to take into account the severity of the bugs allowed:
MTBFw = Hours / (Failures * Severity Weighting)
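One plausible reading of this weighted formula (the specific weighting scheme here is my assumption, not the only possible one) is to give each failure a severity weight and divide the hours of operation by the weighted total, so that one catastrophic failure counts for far more than several trivial ones:

```vbnet
' Hypothetical weighted-MTBF sketch: each failure contributes its
' severity weight to the denominator, so severe failures drag the
' figure down much faster than trivial ones.
Module WeightedMtbfDemo

    Function WeightedMtbf(ByVal hours As Double, _
                          ByVal severityWeights() As Double) As Double
        Dim total As Double = 0
        Dim w As Double
        For Each w In severityWeights
            total += w
        Next
        Return hours / total
    End Function

    Sub Main()
        ' Six failures in a year: five minor (weight 1), one severe (weight 10)
        Dim weights() As Double = {1, 1, 1, 1, 1, 10}
        Console.WriteLine("Weighted MTBF = {0:F0} hours", _
                          WeightedMtbf(8766, weights))
    End Sub

End Module
```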
The type of application failure also needs to be considered, because certain bugs and errors are much more damaging than others. Ask yourself the following questions:
Is the failure permanent or transient?
Does the failure corrupt data or not?
The preferable type of failure is usually transient and doesn't damage data, whereas at the other end of the scale is a permanent failure that corrupts data. The transient failure that corrupts data can also be nasty because its occasional nature means that a problem may not be noticed for a while. This can mean corrupted data backups and other rather scary situations.
Another element that needs to be considered is the units of time used in the MTBF formula. Each type of application has its own domain time requirements:
Calendar time: Suitable for systems with regular usage patterns
Clock time: Suitable for systems with peak/trough usage patterns
Processor time: Suitable for nonstop systems
It can be difficult to specify a level of reliability that meets the business needs while also meeting budget and schedule requirements. This is especially true when the usage patterns and the number of users change over the lifetime of the application. There are always tradeoffs to be made in this area.
Here is a list of measures to improve the reliability of your VB .NET application during the design and construction processes. They are discussed in more detail in the text that follows.
Emphasize reliability as an explicit design goal
Recruit designers, developers, and testers who value reliability
Define specific reliability targets and add them to your requirements
Test that the reliability requirements have been met
Design monitoring and diagnostic facilities into your application
Use assertions to document and enforce assumptions and conventions
Design and build health checks directly into the application
Design redundancy into your application at critical failure points
Have a consistent error-handling and recovery scheme
Trap and record all application bugs and failures
Use the fail-fast principle in your designs
Use the excellent diagnostic tools within .NET
Emphasizing reliability as a design goal and making sure that the people working on your project have reliability as one of their primary goals are essential. Without people buying into the reliability goals, none of the other measures outlined in the preceding list will work very effectively. This means ensuring that the business sponsors and the IT managers are not allowed to be vague about reliability and quality targets. It may suit many project sponsors not to be explicit, because they are worried about the schedule and resource costs of clear targets. They can always fall back to the position that good developers would understand and provide reliability without having it stated explicitly. The reality is that providing good application reliability is difficult, and providing it without a clear target is almost impossible.
Defining specific reliability targets ensures that you know how reliable your application needs to be. Performing tests against these reliability targets means that you can offer hard figures to your end users and to the application support staff. If you're happy with the reliability testing, you might even be able to offer a service-level agreement (SLA). Creating an SLA means that there's no choice except to be explicit about the SLA requirements.
Design automated monitoring and diagnostic components into your application in order to perform ongoing application analysis and to identify application faults and failures early. As your application grows, adds more users, and supports increasingly complex links to other systems, this monitoring will allow you to identify trends and to understand or even predict new problems. Another benefit this gives you is the ability to identify invisible failures. These are errors that don't stop your application from running, but may cause problems whose adverse effects wouldn't otherwise be identified until later.
In this respect, it can be useful to treat the application maintenance team as users. If this team is able to request diagnostic and other facilities to be built into the application, you are likely to find that your end users will experience a more reliable application.
Use the Trace and Debug classes, which I discuss in Chapter 5, to enforce and document all of the assumptions and conventions that every programmer makes. For example, if two methods must only be called in a specific order, an assertion can enforce this convention. Or if a variable is only supposed to have one of three values, an assertion can check this assumption. These assertions can catch many bugs automatically during development and also serve as source code documentation of the thinking of the original developer.
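As a minimal sketch of the second assumption mentioned above, the real Debug.Assert method from System.Diagnostics can enforce a "one of three values" convention. The class, method name, and legal status values below are invented for illustration:

```vbnet
Imports System.Diagnostics

Public Class TradeStatusChecker

    ' Hypothetical convention: a trade status must be one of three values.
    ' The assertion both enforces the convention during development and
    ' documents the original developer's assumption in the source code.
    Public Sub RecordStatus(ByVal status As String)
        Debug.Assert(status = "Pending" OrElse _
                     status = "Booked" OrElse _
                     status = "Settled", _
                     "Unexpected trade status: " & status)
        ' ... normal processing continues here ...
    End Sub

End Class
```

Because Debug.Assert calls are stripped from release builds by default, they cost nothing in production while catching convention violations during development.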
Automated health checks can verify that your application is working properly. For instance, a script could ping each of the components in a distributed system and report any components that failed to respond or generated an incorrect response. Another script could perform a dummy customer interaction with a Web page and e-mail or page a support technician if no response was received within a certain time. This type of checking allows you to spot problems and failures within seconds or minutes rather than hours or even days.
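The Web page check described above could be sketched with the standard WebRequest class. The URL, the five-second timeout, and the notification step are all invented for illustration:

```vbnet
Imports System.Net

Module HealthCheck

    ' Returns True if the page responds within the timeout.
    Function PageIsAlive(ByVal url As String) As Boolean
        Try
            Dim request As WebRequest = WebRequest.Create(url)
            request.Timeout = 5000   ' five seconds, in milliseconds
            Dim response As WebResponse = request.GetResponse()
            response.Close()
            Return True
        Catch ex As WebException
            Return False             ' timed out or returned an error status
        End Try
    End Function

    Sub Main()
        If Not PageIsAlive("http://example.com/ping") Then
            ' In a real system, e-mail or page a support technician here.
            Console.WriteLine("Health check failed!")
        End If
    End Sub

End Module
```

Run from a scheduled task every few minutes, a check like this turns a failure that might go unnoticed for hours into one that is reported within minutes.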
You can make the mission-critical parts of your application more reliable by adding redundant software. For example, you could calculate a critically important value in two or three different ways, using common validation checks to ensure success. Alternatively, you can have two or three copies of the same component, so that if one fails, its identical companion takes over. Redundant hardware components and databases are, of course, very common.
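The dual-calculation idea can be illustrated with a toy example: compute a critical value two independent ways and cross-check the results before trusting either. The calculations, tolerance, and exception type here are my own choices, not a prescribed pattern:

```vbnet
Module RedundantCalc

    ' Compute a critical value two independent ways and validate agreement.
    Function AveragePrice(ByVal prices() As Double) As Double
        ' Method 1: sum first, then divide.
        Dim sum As Double = 0
        Dim p As Double
        For Each p In prices
            sum += p
        Next
        Dim result1 As Double = sum / prices.Length

        ' Method 2: incremental (running) mean, computed independently.
        Dim result2 As Double = 0
        Dim i As Integer
        For i = 0 To prices.Length - 1
            result2 += (prices(i) - result2) / (i + 1)
        Next

        ' Common validation check: the two answers must agree closely.
        If Math.Abs(result1 - result2) > 0.000001 Then
            Throw New ApplicationException("Redundant calculations disagree")
        End If
        Return result1
    End Function

End Module
```

A disagreement here signals a bug in one of the calculations (or corrupted input) before the wrong value can do any damage downstream.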
Proper error handling and recovery in distributed systems is very difficult. In some systems, as much as 80% of the code is devoted to error handling, as opposed to 20% for functionality. What makes this even harder is that much of this code goes unexercised during normal testing and in production, because it only runs when a bug or failure actually appears. A consistent pattern for building error handling and recovery is therefore invaluable. Once you've shown that the pattern works, it's relatively simple to ensure that the same pattern is implemented properly throughout your application. You can even design schemes that retry after failures, and in the worst case you'll at least have a reasonable postmortem trail to follow. Chapter 13 discusses ways of implementing this type of functionality.
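A bare-bones illustration of one such retry-after-failure pattern appears below. The retry count, method names, and logging details are mine; Chapter 13 covers real-world versions of this idea:

```vbnet
Imports System.Diagnostics

Module RetryPattern

    ' A consistent wrapper: retry a transient operation a fixed number
    ' of times, then record full details and rethrow, leaving a
    ' postmortem trail for later analysis.
    Sub ExecuteWithRetry(ByVal maxAttempts As Integer)
        Dim attempt As Integer
        For attempt = 1 To maxAttempts
            Try
                DoTransientWork()     ' hypothetical operation that may fail
                Return                ' success: stop retrying
            Catch ex As Exception
                If attempt = maxAttempts Then
                    ' Retries exhausted: log everything, then fail.
                    Trace.WriteLine("Failed after " & attempt & _
                                    " attempts: " & ex.ToString())
                    Throw
                End If
            Finally
                ' Release any per-attempt resources here.
            End Try
        Next
    End Sub

    Sub DoTransientWork()
        ' Placeholder for the real work.
    End Sub

End Module
```

Because the same wrapper is used everywhere, you can verify the recovery logic once and then trust it throughout the application.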
Linked to the error handling discussed previously is the need to trap and record the full details of all bugs and failures. An unhandled error will crash your component and possibly your application. Chapter 13 contains a comprehensive discussion about implementing error handling and recovery properly in VB .NET.
You should always use the fail-fast principle when designing your error-recovery routines. This consists of three ideas that have consistently been found to improve reliability after a bug or failure has occurred:
No answer is better than the wrong answer (to the end user).
Fault containment is better if the task stops (but not the application).
Recover back to a known safe state (using Try...Catch...Finally).
Finally, there are some excellent diagnostic tools available with the .NET Framework SDK and the .NET Framework. For an in-depth discussion of these tools, please see Chapter 5.
Not all failures and bugs are created equal. Therefore, removing the bugs with the worst consequences is the most important goal. In one particular study, removing 60% of the software faults within an application led to only a 3% reliability improvement. So focus your effort where it does the most good: mainly on the failures that have the worst effects on your end users.
Defining the defects with the worst consequences can sometimes be difficult. We all know of end users who insist on marking every bug as critical. I suggest you use the following criteria for assessing the relative importance of each problem:
Does the bug result in lost or corrupted data?
How many people does the bug affect?
Is there a reasonable workaround for the bug?
A second effective method of improving reliability is to fix the bugs that occur in the most frequently used parts of your software system. This will tend to give you a better result for your efforts, because fixing a single bug that occurs ten times in a day will usually be more beneficial than fixing multiple bugs that only rarely occur.
Ultimately, to build a reliable system requires that your application's requirements, its developers, and its testers all place a strong emphasis on reliability. The processes and measurements discussed in this section can help, but even with the new .NET technologies, the most important path to reliability is the people involved in the project.