Application Availability | Comprehensive VB .NET Debugging

Application availability depends to a large degree on the reliability issues discussed in the previous section, but the extra factor of failure recovery enters into the equation. I am defining application availability here as the system uptime required by end users in order to meet business needs.

Understanding Availability

Data from a Dataquest study conducted in November 1999 can be used to map the typical causes of unplanned application downtime. Figure 1-2 shows this data in chart form.

Figure 1-2: Causes of unplanned downtime

Once again, the categories shown in Figure 1-2 deserve some further explanation. Human error covers operational errors, backup problems, configuration issues, and so on. The hardware factor includes components such as disks, memory, and fans. The network factor covers failures in routers, switches, cabling, and network servers. System software consists of the operating system, device drivers, firewalls, Web servers, database servers, load balancers, and other sundry applications such as antivirus programs. The application software is, of course, the application that is being profiled for failure. Finally, the environment category covers factors such as power failures, cooling failures, flood, and fire.

Notice that the downtime profile shown in Figure 1-2 looks rather different from the failure profile shown in Figure 1-1, though it contains the same types of failure. This is because certain types of error (for example, system software failures) may occur more often than others (for example, hardware failures), yet are often either transient or much quicker to fix. Because time to recovery is an important part of application availability, the downtime profile is weighted toward the types of error that have more side effects and are therefore harder and slower to fix.

As with the reliability profile, you can see that you need to look outside of your immediate application for the majority of the application's downtime. Although operational procedures are beyond the scope of this book, it is worthwhile to understand that rigorous procedures are critical in ensuring that your applications are reliable and consistently available. The application designer should work with the likely operational procedures in mind, attempting to minimize the complexity and manual aspects of every procedure.

Measuring Availability

The most popular method of measuring availability is with a percentage figure. For example, your end users or business analysts might claim that your application must be 99.9% available, or perhaps they opt for the magical "five nines" (in other words, 99.999% availability). In real terms, 99.9% availability implies a maximum downtime of about 8.8 hours in a single year, 99.99% availability represents about 52 minutes of outage, and 99.999% means a downtime of not more than about 5 minutes in a year.
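The conversion from an availability percentage to a yearly downtime budget is simple arithmetic. The following Python sketch illustrates it; the function name and the choice of a 365-day year (8,760 hours) are assumptions made for this example.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years are ignored (assumption)


def max_downtime_hours(availability_percent: float) -> float:
    """Return the yearly downtime budget, in hours, for an availability target."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100")
    return HOURS_PER_YEAR * (1.0 - availability_percent / 100.0)


# Print the budget for the three targets mentioned in the text.
for target in (99.9, 99.99, 99.999):
    hours = max_downtime_hours(target)
    print(f"{target}% -> {hours:.2f} hours ({hours * 60:.1f} minutes) per year")
```

Under these assumptions the three targets work out to roughly 8.76 hours, 52.6 minutes, and 5.3 minutes of downtime per year.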

The extra factor when assessing availability as opposed to reliability is called mean time to recovery (MTTR). It is calculated with the following simple formula:

MTTR = Hours of Downtime / Failures

This formula measures the average time that an application takes between going down and coming back up again. So if your application has six failures a year and a total of 24 hours of downtime in the year, its MTTR works out as 4 hours.
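As a minimal sketch, the MTTR calculation can be applied to a log of individual outages. The outage durations below are invented for illustration; they were chosen only so that six failures add up to 24 hours of downtime, matching the example in the text.

```python
from typing import Sequence


def mean_time_to_recovery(outage_hours: Sequence[float]) -> float:
    """MTTR = hours of downtime / failures, with one list entry per failure."""
    if not outage_hours:
        raise ValueError("at least one failure is needed to compute MTTR")
    return sum(outage_hours) / len(outage_hours)


# Hypothetical outage log: six failures totalling 24 hours of downtime.
outages = [2.0, 6.0, 1.0, 8.0, 3.0, 4.0]
print(f"MTTR: {mean_time_to_recovery(outages):.1f} hours")
```

Note that the average hides the spread: in this made-up log one outage lasted 8 hours and another only 1 hour, which is worth knowing when deciding where to spend recovery effort.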

The formula to measure the percentage of application availability is therefore slightly more complex than the reliability formula discussed earlier. Mean time between failures is shown as MTBF, and mean time to recovery is shown as MTTR. Both figures are represented in hours:

Availability = (MTBF / (MTBF + MTTR)) × 100

As an example, if an application fails six times a year (MTBF = 1,461 hours) and the average recovery time is 1 hour (MTTR = 1), feeding these figures into the formula gives an availability percentage of 99.93. This in turn translates to a downtime of about 6 hours per year (six failures at 1 hour each).
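The worked example can be reproduced with a few lines of Python. This is only a sketch of the availability formula given above; the function and variable names are assumptions made for the example.

```python
def availability_percent(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = (MTBF / (MTBF + MTTR)) * 100, both figures in hours."""
    if mtbf_hours < 0 or mttr_hours < 0 or mtbf_hours + mttr_hours == 0:
        raise ValueError("MTBF and MTTR must be non-negative and not both zero")
    return mtbf_hours / (mtbf_hours + mttr_hours) * 100.0


failures_per_year = 6
mtbf = 1461.0  # hours between failures, as in the text
mttr = 1.0     # hours to recover from each failure

availability = availability_percent(mtbf, mttr)
downtime = failures_per_year * mttr  # six failures at one hour each

print(f"Availability: {availability:.2f}%")
print(f"Downtime: about {downtime:.0f} hours per year")
```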

It appears that most business organizations can live comfortably with 99.9% availability, which equates to roughly 8.8 hours of service outage per year. This figure is very achievable with motivated people, good software development processes, and rigorous operational procedures.

The VS .NET documentation provides an interesting table as a guideline for the availability requirements of different business categories. This table is reproduced in Table 1-1 for your convenience.

Table 1-1: Typical Business Availability Guidelines (VS .NET Documentation)
BUSINESS CATEGORY	FAILURES PER YEAR	AVERAGE TIME TO REPAIR	DOWNTIME PER YEAR	AVAILABILITY
Noncommercial	10	10 hours	88 hours	99.00%
Commercial	5	8.8 hours	44 hours	99.50%
Business-critical	4	2.25 hours	8.5 hours	99.90%
Mission-critical	4	0.25 hour	1 hour	99.99%
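As a quick sanity check, the availability column of Table 1-1 can be derived from the downtime column. The sketch below assumes a 365-day year and uses only the downtime figures from the table; the function name is hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years are ignored (assumption)


def availability_from_downtime(downtime_hours: float) -> float:
    """Return the availability percentage implied by a yearly downtime figure."""
    if not 0.0 <= downtime_hours <= HOURS_PER_YEAR:
        raise ValueError("downtime must be between 0 and one year of hours")
    return (1.0 - downtime_hours / HOURS_PER_YEAR) * 100.0


# Downtime per year, in hours, copied from Table 1-1.
table_downtime = {
    "Noncommercial": 88.0,
    "Commercial": 44.0,
    "Business-critical": 8.5,
    "Mission-critical": 1.0,
}

for category, downtime in table_downtime.items():
    print(f"{category}: {availability_from_downtime(downtime):.2f}%")
```

Rounded to two decimal places, the results agree with the availability figures in the table.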

Designing for Software Availability

The list of design recommendations for software availability has some similarity with the previous software reliability list, but with some added concepts:

Emphasize availability as an explicit design goal
Recruit designers, developers, and testers who value availability
Define specific availability targets and add them to your requirements
Test that the availability requirements have been met
Design monitoring and diagnostic facilities into your application
Design redundancy into your application at critical failure points
Isolate critical applications
Use queuing for component communication
Have a consistent error-handling and recovery scheme

The best rewards for your effort are likely to come from designing redundant software components (the "Design redundancy into your application at critical failure points" item in the preceding list). Doing important calculations in two or three different ways and then cross-validating the results is a very useful design concept. Having two or three copies of critical components so that one copy can take over from another in the event of failure is another important design tool.
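Both redundancy ideas can be sketched in a few lines. The Python below is an illustration only, not a production design: the helper names, the tolerance, and the sample data are all assumptions made for this example.

```python
import math
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def cross_validated(calculations: Sequence[Callable[[], float]],
                    tolerance: float = 1e-9) -> float:
    """Run every redundant calculation and fail loudly if the results disagree."""
    if not calculations:
        raise ValueError("at least one calculation is required")
    results = [calculate() for calculate in calculations]
    for other in results[1:]:
        if not math.isclose(results[0], other, rel_tol=tolerance, abs_tol=tolerance):
            raise ArithmeticError(f"redundant calculations disagree: {results}")
    return results[0]


def first_available(providers: Sequence[Callable[[], T]]) -> T:
    """Try each redundant copy in turn and return the first successful result."""
    last_error = None
    for provider in providers:
        try:
            return provider()
        except Exception as error:  # deliberately broad for this sketch
            last_error = error
    raise RuntimeError("all redundant providers failed") from last_error


# Cross-validation: total the same (made-up) amounts in two different ways.
amounts = [19.99, 5.01, 100.0]
total = cross_validated([lambda: sum(amounts), lambda: math.fsum(amounts)])


# Failover: a hypothetical primary component fails, so the replica takes over.
def primary() -> str:
    raise ConnectionError("primary component is down")


def replica() -> str:
    return "served by replica"


print(total, first_available([primary, replica]))
```

In a real system the redundant calculations would ideally be implemented independently, so that a single defect cannot corrupt every result in the same way.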

You should also take care to isolate your business-critical and mission-critical VB .NET applications from other applications. On the server side, this means preventing other applications from competing with your critical application for resources such as CPU time, memory, network bandwidth, and database usage. On the client side, this might mean preventing the use of applications that interfere or compete with your application. As you try to attain ever-higher levels of availability, you should try to either reduce or eliminate any interference from other sources.

Designing your application to use queuing for inter-component and inter-application communication can help that application's availability. This involves using middleware such as Microsoft Message Queuing (MSMQ) or TIBCO Rendezvous (TIBRV) to send and receive asynchronous messages. Queuing is useful for guaranteed message delivery, as the sender and recipient do not have to be connected to each other, and one or the other can even be offline without affecting the message delivery. This removes a potential failure point from your system. By increasing the number of routes for successful message delivery, your end users perceive that your application is available more consistently.
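The decoupling that queuing provides can be illustrated with a small Python sketch. It uses the standard library's in-process queue.Queue purely as a stand-in for real middleware: unlike a product such as MSMQ, it is not durable and does not cross process boundaries. The function names and messages are assumptions made for this example.

```python
import queue
from typing import List


def send(message_queue: queue.Queue, message: str) -> None:
    """Enqueue a message without needing the receiver to be running."""
    message_queue.put(message)


def receive_all(message_queue: queue.Queue) -> List[str]:
    """Drain every message that accumulated while the receiver was offline."""
    received = []
    while True:
        try:
            received.append(message_queue.get_nowait())
        except queue.Empty:
            return received


orders = queue.Queue()

# The sender keeps working while the receiver is "offline" (not yet started).
send(orders, "order 1")
send(orders, "order 2")

# Later, the receiver comes online and processes the backlog in arrival order.
for message in receive_all(orders):
    print("processing", message)
```

The sender never blocks on, or even knows about, the state of the receiver, which is the property that removes the direct connection as a failure point.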

Improving Software Availability

Notice that it is possible for an application to have quite a low MTBF while still having a high availability, and vice versa. If failures are corrected quickly enough, thus reducing the MTTR, the resulting downtime is relatively low. This is important because once you have fixed the majority of the reliability problems and you're starting to work toward the higher levels of availability, it usually turns out to be cheaper to expend effort on faster failure recovery times than it is to grasp for those elusive final percentage points of reliability.
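A short comparison makes the point concrete. The two applications below are hypothetical, and their failure counts and recovery times are invented for illustration; the sketch assumes a 365-day year and derives MTBF from the uptime remaining after downtime is subtracted.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years are ignored (assumption)


def availability_percent(failures_per_year: int, mttr_hours: float) -> float:
    """Derive MTBF from yearly uptime, then apply MTBF / (MTBF + MTTR) * 100."""
    if failures_per_year <= 0:
        raise ValueError("failures_per_year must be positive")
    uptime = HOURS_PER_YEAR - failures_per_year * mttr_hours
    if uptime < 0:
        raise ValueError("downtime exceeds the hours in a year")
    mtbf = uptime / failures_per_year
    return mtbf / (mtbf + mttr_hours) * 100.0


# Hypothetical application A: fails often but recovers in about 3 minutes.
frequent_but_fast = availability_percent(failures_per_year=50, mttr_hours=0.05)

# Hypothetical application B: fails rarely but takes 12 hours to recover.
rare_but_slow = availability_percent(failures_per_year=2, mttr_hours=12.0)

print(f"A (50 failures, 3 min each):   {frequent_but_fast:.3f}%")
print(f"B (2 failures, 12 hours each): {rare_but_slow:.3f}%")
```

With these made-up figures, application A clears the 99.9% mark despite its low MTBF, while application B falls short of it despite failing only twice a year.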

So the first step is to improve your application's reliability until the number of failures is down toward single figures per year. Then the majority of your effort should be directed at improving the recovery time from each failure. When you start to see diminishing returns from improving one factor, concentrate on the other, so that the two stay in balance.

Relying on your technical support staff to keep your application available is probably a mistake. Because the support department does not usually have the skills or budget to analyze failures properly, it tends to concentrate on the quick fix, trying to get your application up and running as quickly as possible. While this is necessary, it is not sufficient. You also need to assign skilled people to analyze the root causes of availability problems. Producing a steady flow of architectural recommendations and procedural improvements is essential to the improvement of reliability and availability.