9.5 Availability Requirement Pattern

Basic Details

Related patterns: None

Anticipated frequency: Usually no more than one requirement, though dozens of extra requirements might flow from it

Pattern classifications: None

Applicability

Use the availability requirement pattern to define when the system is available to users: the system's "normal opening times" (which could be "open all hours") plus how dependably the system (or a part of the system) is available when it should be. This requirement pattern is written to fit systems that appear to have a life of their own, such as server type systems that sit waiting with a range of services for users to call upon whenever they wish. It is not meaningful to specify availability in the same way for desktop-type applications (such as a diagram editor) that you start up when you want.

This requirement pattern has not been written to satisfy the demands of life-critical systems. It is for normal business systems, where the most disastrous outcome is commercial (financial).

Discussion

It's easy to say "the system shall be available 24x7," but even the most bullet-proof, fail-safe, over-engineered system won't roll on forever. Anyone who's aware of that should have qualms about such a blanket requirement. In any case, "24x7" has become a cliché: it's often not intended to be taken literally when used in speech, so you can't rely on it being taken seriously when stated as a requirement. Fortunately, there's an easy answer to that too: add a percentage to it. "The system shall be available to users for 24 hours a day, every day, 99 percent of the time." That's better. But where has this figure of 99 percent come from? It sounds suspiciously arbitrary. And if I'm a software developer, what am I to do when I encounter such a requirement? What should I do differently if it said 99.9 percent? If I'm the project manager, how much will it cost to achieve this 99 percent? If I'm a tester, how can I test that the system satisfies this requirement? If I run it for a week nonstop without incident, is that good enough? No, requirements like this are unhelpful to everyone. It's time to go back to the drawing board.

A Sense of Percents

Availability percentages are quoted with apparently careless abandon, so let's spell out what they actually imply. For a system that aims to be available 24 hours a day, the following table shows for how long the system must run before it can acceptably accumulate one hour of downtime (time when the system is unavailable to users when it should be available):

  Availability    Equates to 1 hour's downtime every    And downtime in a year of
  90%             10 hours                              5.2 weeks
  95%             20 hours                              2.6 weeks
  99%             4.2 days                              3.7 days
  99.9%           6 weeks                               9 hours
  99.99%          13.7 months                           53 minutes

Bear in mind that these reflect unavailability from all causes, both planned and unplanned. These figures let us picture how much havoc would be wrought on our availability target by, say, a 12-hour shutdown to upgrade the software or by a six-hour shutdown after an attack by a hacker.

Let's look at availability rates in a different way and venture subjective judgments on how hard a few availability levels are to achieve (though they'll vary according to the nature of the system and the technology you use):

  Downtime of         Equates to availability of    And is how hard to achieve?
  1 hour per day      95.833%                       No problem.
  1 hour per week     99.405%                       Beginning to get tight.
  1 hour per month    99.863%                       Much care needed!
  1 hour per year     99.989%                       Very tough indeed!

System failures are a matter of luck: they might never happen. But if you want high availability without taking the necessary trouble, the odds of achieving it are stacked heavily against you.
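If you want to reproduce the arithmetic behind these tables for your own targets, here is a minimal sketch (not part of the original pattern, and assuming a 24-hours-a-day availability window) that converts an availability percentage into the downtime it permits; the figures it prints match those above.

```python
# Convert an availability percentage into the downtime it permits,
# assuming the system is meant to be available 24 hours a day.

HOURS_PER_YEAR = 24 * 365  # 8,760 hours

def hours_between_one_hour_outages(availability_pct: float) -> float:
    """How long the system must run, on average, to 'earn' one hour of downtime."""
    downtime_fraction = 1 - availability_pct / 100
    return 1 / downtime_fraction

def downtime_hours_per_year(availability_pct: float) -> float:
    """Total downtime permitted in a year at the given availability level."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)

for pct in (90, 95, 99, 99.9, 99.99):
    print(f"{pct}%: 1 hour's downtime every {hours_between_one_hour_outages(pct):.1f} hours,"
          f" {downtime_hours_per_year(pct):.1f} hours per year")
```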


Let's start by recognizing that our revised easy requirement is conveying two things. First, what I'll call the availability window, which is the times during which we want the system to be available (for example, 24 hours every day, or stated business hours). And second, how dependably it should be available during those times (the hard part!). The availability window is easy to specify and says a lot about the nature of the system, so begin by writing a requirement for it. It can be 24x7 if necessary, but a narrower window reduces the pressure on developers. The converse unavailability window (scheduled downtime) gives time for the various housekeeping and other extracurricular activities every system must perform-for which 24x7 allows no dedicated time at all. Doing these things during the unavailability window makes it easier to provide high availability the rest of the time. Define the bounds of the availability window according to what you require, not what sounds attractive.

The remainder of this section discusses an overall approach to specifying availability. For clarity, it leaves out the details of how to carry out each step and how to specify the resulting requirements; they are covered in the "Extra Requirements" subsection.

Before going any further, we must recognize that our availability goals cover only components within the scope of our system. We cannot be held responsible for the availability of anything that's outside our control. It's essential to state this clearly and prominently in the requirements specification-or everyone will naturally attribute to the system any downtime due to external causes. If the discussion and setting of availability goals must include external factors, separate the goals for the system from external goals. For example, if the customers of a Web-based service are to perceive less than one hour's downtime per month, we could allocate ten minutes of that to Internet communication unavailability (outside our control, but for which figures can be obtained), five minutes to Web server unavailability (assuming it to be outside scope) and forty-five minutes to unavailability of our system. The latter could then be sub-allocated into five minutes for hardware and operating system and forty minutes for our own application software. These allocations can be adjusted. For example, by choosing high-quality, replicated hardware and high-quality third-party products, we can reduce their allocations, leaving as much as possible for our own software. But this takes us into the technical realm that the requirements stage should eschew as far as possible.
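Here is a minimal sketch of the kind of downtime budget allocation just described, using the hypothetical figures from the Web-based service example (sixty minutes of customer-perceived downtime per month); the component names and numbers are illustrative only.

```python
# Hypothetical monthly downtime budget (in minutes), allocated among components,
# mirroring the Web-based service example above. All figures are illustrative.
customer_target_minutes = 60  # total downtime customers should perceive per month

allocation = {
    "Internet communications (out of scope)": 10,
    "Web server (out of scope)": 5,
    "Our system: hardware and operating system": 5,
    "Our system: application software": 40,
}

assert sum(allocation.values()) == customer_target_minutes, "Allocations must add up to the overall target"

for component, minutes in allocation.items():
    print(f"{component}: {minutes} minutes per month")
```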

Separate downtime allocations can be given to different parts of the system. If so, statistics should be gathered once the system goes live to see how well each part is performing against its target. To do this, the duration of each failure must be assigned to the correct cause-system or external-which isn't always easy to do. It's also liable to be contentious if different managers are responsible for different parts of the system. It's questionable whether a system can dependably work out when it's been available, but if you want it to, define requirements for what you want it to record. Otherwise, gathering these statistics is a manual process.

Our availability conundrum can be summed up as follows:

  (a) We don't know what we'd get if we ignored availability.

  (b) We can't work out how much we need, or how much we're prepared to pay for it.

  (c) We don't know how much we could improve it by if we tried, nor how much it would cost.

To produce requirements for those features that our system needs in order to achieve the business' availability goals, we must satisfactorily unravel all three parts of the conundrum. That's a tall order; in fact, it is literally impossible, and the following paragraphs point out why. But that doesn't mean we should give up; we just have to set our sights a bit lower.

Taking (a) first, every system has what we can call a natural availability level, which is what you get if your developers build the system without paying any special attention to availability. (Notice that I say your developers, because if they're highly skilled your system will have a higher natural availability level than if they're mediocre.) The trouble with our system's natural availability level is that we can't possibly know what it is until well after it's been built. We might have a gut feel, but any attempt to quantify it would be just a wild guess. Nevertheless, it's a useful concept for discussion purposes: it helps us tell when we're on shaky ground.

For (b), the news is better: we can paint a reasonably clear picture of how important availability of this system is to the business-by quizzing key stakeholders about the damage the business would suffer in various situations, and asking them how much they'd be prepared to invest to reduce the chances of it happening. The results are still not strictly systematic-because of our inability to determine the chances of such failures happening, nor how much it would actually cost to do better-but they give the project team a sense of how far to go to improve availability. They also give stakeholders a good understanding of the issues.

Moving on to (c), achieving anything higher than the natural availability level is going to cost money. You'll also quickly reach a point of rapidly diminishing returns, where each incremental increase in availability costs noticeably more. The following graph demonstrates the cost of increasing availability-though I stress it is indicative only and not based on real figures. Cost 1 is that of the natural availability system, which-we discover eventually-has 95 percent availability. Building a system with 99.5 percent availability would cost roughly twice as much-double the whole system budget, that is. This demonstrates how vital it is to get availability requirements right: few other aspects of a system can have such a large impact on its cost.

Figure 9-2 shows the whole y-axis down to zero to point out that 95 percent availability is actually quite a high figure. Also, being prepared to accept reduced availability in an effort to reduce cost is usually a waste of time.

Figure 9-2: The relative cost of different availability levels

We run into further trouble when we try to specify ways to improve our system's availability: we can't know how much effect each possible precaution will have. If a system already exists, we can at least spot the most common failings and focus on them; but we can't do that for a system that's yet to be built. Hints can be found by looking at experiences with the organization's other systems, or systems previously built by the same development team, or similar systems.

The best way to achieve our availability goals is to specify requirements for a wide range of features that contribute. These requirements can be identified by investigating the three main causes of downtime (regular housekeeping, periodic upgrades and unexpected failure) and working out ways to reduce them. Each of these requirements can contain an estimate of its availability benefit. Give each one a low priority by default-though you can give a higher priority to any you feel deserves it (because many of these features will be worthwhile in their own right). Post-requirements planning can estimate the cost-in development effort and/or the financial cost of purchasing extra hardware or third-party products-of implementing each requirement. These benefit and cost estimates then let you make more informed choices of which of these requirements to implement: some requirements will emerge as more cost-effective than others.

The resulting requirements might not give stakeholders assurances in the terms they seek (or are used to seeing), but to do so would be misleading, because you couldn't guarantee the system will achieve them.

The steps to take are as follows (and the subsections referred to are within the "Extra Requirements" section later in this pattern):

  • Step 1: Write a requirement for the availability window, as per the template and example in this pattern. If different chunks of the system can have different availability windows, write a requirement for each one.

  • Step 2: Work out the seriousness of the impact on the business of downtime-as described in the section titled "The Business Impact of Downtime."

  • Step 3: Specify what is to happen when the system is unavailable or not working properly, as described in the "Partial Availability" section.

  • Step 4: Give a thought to surreptitious unavailability-which means bursts of poor response time when background work is being done by the system-if the unavailability window is small or nonexistent, as described in the "Surreptitious Unavailability" section.

  • Step 5: Specify requirements for features to improve availability-by investigating the causes of downtime and working out ways to reduce them, as described in the "Requirements for Reducing Downtime" section.

All the preceding steps are undertaken as part of the requirements specification process. Further steps, which follow, can be done later, after cost estimates have been made for implementing the system-including specific estimates for all the requirements that contribute to increasing availability:

  • Step 6: Estimate the cost of implementing each requirement for improving availability. This should be done as part of the project's main estimation process.

  • Step 7: Calculate the cost effectiveness of each requirement for improving availability, based on its estimated cost and its estimated effectiveness. A spreadsheet is perhaps the most convenient vehicle for doing this (see the sketch following this list). Use these cost effectiveness values to decide whether to implement any of these requirements immediately; adjust the priority of each requirement accordingly.
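As a rough illustration of Step 7, here is a sketch (all requirement names and figures are invented) that ranks availability-improving requirements by estimated downtime saved per unit of cost; it is the kind of calculation a spreadsheet would otherwise do.

```python
# Rank availability-improving requirements by estimated cost effectiveness.
# Benefit = estimated downtime saved per year (hours); cost = estimated implementation cost.
# All names and figures are invented for illustration.

requirements = [
    {"summary": "Database backups while system active", "hours_saved_per_year": 90, "cost": 15000},
    {"summary": "Replicate hardware",                    "hours_saved_per_year": 20, "cost": 40000},
    {"summary": "System monitor",                        "hours_saved_per_year": 12, "cost": 8000},
]

for req in requirements:
    req["hours_saved_per_1000_spent"] = 1000 * req["hours_saved_per_year"] / req["cost"]

# Most cost-effective first: candidates for implementing immediately.
for req in sorted(requirements, key=lambda r: r["hours_saved_per_1000_spent"], reverse=True):
    print(f'{req["summary"]}: {req["hours_saved_per_1000_spent"]:.1f} hours saved per 1,000 spent')
```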

Content

A requirement to specify the availability window needs to contain the following:

  1. Normal availability extent: The times during which the system is planned to be available. This could be "always" (24x7), or a start and end time each day-and perhaps which days of the week, too.

  2. Meaning of available: A definition of what is meant by available in the context of this requirement. This must not be stated in terms that depend on how the system is implemented (for example, the availability of an individual server machine). For a typical system, available means that users are able to log in and perform whatever functions they have access to. Assuming a system is either available or not is something of an over-simplification; see the "Partial Availability" subsection later in this pattern for a discussion of the possibilities in between.

  3. Tolerated downtime qualifier (optional): A caveat recognizing that perfect availability can't be guaranteed, and describing where more details are to be found about the amount of downtime that would be considered tolerable.

Template(s)

This template is for a requirement that defines the availability window of a system, with an optional clause for a tolerable level of unavailability (which needn't itself be in quantitative terms).

Summary: «Extent» availability

Definition: The system shall normally be available to users «Availability extent description» [, except in exceptional circumstances of a frequency and duration not to exceed «Tolerated downtime qualifier»]. "Normally available" shall be taken to mean «Availability meaning».

Example(s)

As for the template, these examples define the availability window. All other requirements related to availability are covered in the "Extra Requirements" section.

Summary: 7 a.m. to 7 p.m. availability

Definition: The system shall be available to all users from 7 a.m. to 7 p.m. on business days (that is, weekdays that are not public holidays), except in exceptional circumstances of a frequency and duration not to exceed those defined in other requirements. "Available" shall be taken to mean that all user functions are operational.

Summary: Availability of dynamic Web functions

Definition: The dynamic functions of the company's Web site shall be available to visitors 24 hours per day, every day of the year, except for unscheduled downtime not to exceed 1 hour per week (averaged over each calendar quarter) plus scheduled downtime not to exceed one outage per calendar month of a maximum of 4 hours to be carried out at the time of a week's lowest Web site activity.

"Dynamic functions" are those that require the active involvement of the Web shop system (for example, to place or inquire on orders).

Summary: Web site availability

Definition: The company's Web site shall be available to visitors 24 hours per day, every day of the year. "Available" shall be taken to mean that all static Web pages shall be viewable. In addition, if any dynamic function (as defined in the previous requirement) is unavailable, then a static page of explanation shall be presented in its place.

It is recognized that constant availability with no interruption at all cannot be guaranteed, but only outages resulting from extraordinary causes that could not reasonably be prevented will be regarded as tolerable.

Extra Requirements

The proper specifying of availability can involve numerous extra requirements of diverse kinds; they might include many features that developers find desirable but which do not normally appear justified to the business. Tying them directly to the availability goals of the business provides that justification.

This section is divided into four parts, in accordance with the approach described in the preceding "Discussion" section: the business impact of downtime, partial availability, surreptitious unavailability, and requirements for reducing downtime. The last of these is where the serious action is, and it is itself broken down into six separate areas, each covered in its own subsection.

The Business Impact of Downtime

The first questions to ask are: Just how vital is high availability? Why's it needed? What's it for? Does survival of the business depend on it-in which case you've got to go to enormous trouble and expense? Or is it just nice to have, like a company intranet outside office hours, where if it's down you'll try again later? Answers to these sorts of questions are your best guide to the most suitable way to frame availability goals.

You can work out the seriousness of the impact on the business of downtime (during the availability window) by presenting key stakeholders with a few scenarios. One might be: the system fails altogether at 9 a.m. on Monday morning. How much damage has the business suffered after half an hour of the system being down? After two hours? Six hours? Three days? Pick the point at which serious pain starts, and then ask: How much extra is the business prepared to invest to reduce the chances of suffering this much damage? (Recognize that there's a kind of backward connection here: longer failures do more damage but are easier to shorten, so it's necessary to find the shortest downtime period that hurts.)

Write up the results of these exercises as informal narrative in the requirements specification. Don't simply record every remark that was made: distill the salient conclusions into a few punchy points. Where possible, identify the source of the statement (if it's a senior executive, say) to give it added weight. The aim is to guide anyone involved in planning or developing the system-to give them a feel for the lengths they should go to. Here are a couple of examples:

  • "An outage of more than twenty-four hours would lead to a permanent loss of 25 percent of customers. (Source: marketing manager)."

  • "We're prepared to pay an extra «Amount of money» if it means we can be up and running again two hours after a major incident. (Source: CEO)."

These are targets only: it's impossible to guarantee they'll be achieved, because nothing's going to force the gremlins in the machine to abide by them. So stating them as requirements-things the system is required to satisfy-is problematic and actually reduces their credibility. Statements like these carry more weight when they're not requirements.

If you still feel the urge to state an availability percentage (or, equivalently, a tolerable amount of downtime per given time period), go ahead. If so, it's preferable for this to be an informal statement too-because it can't be guaranteed either.

Partial Availability

What should happen when the system isn't fully or properly available to users but is still alive enough to do something? See if there's some fallback position that lets you deliver a reduced service to users or at least inform them that something's wrong. When one part of a system fails, most of the time the rest keeps on running. So treating all failures as all-or-nothing gives an exaggerated picture of their effect on availability. Still, because availability is already too complicated to calculate, you can, if you wish, ignore the subtleties of partial availability when confronting quantitative availability levels.

It can be worthwhile to divide a system into two or three chunks for the purpose of availability goals. These chunks could be chosen according to their importance or the technology we know each uses. For example, if we're building a Web site and the system behind it, it would make sense to state higher availability goals for the static parts of the Web site than for the interactive parts (placing orders, say).

When the system is partially available, the working part might be able to adapt accordingly. For example, if the software behind our Web site fails (or is down for maintenance), we'd like to let our users know that certain functions are temporarily unavailable-perhaps by having fallback static Web pages to display in this situation. Here's an example requirement:

Summary: System unavailable page

Definition: When the system is unavailable to users, any attempt by a user to access the system shall result in the display of a page informing them that it is unavailable.

This response is not expected if those parts of the system needed to provide such a display are themselves not running-though all practical steps shall be taken to make clear to the user that something is wrong.

The requirements can't state the reaction to every type of failure (and mustn't attempt to), but they may address a small number of salient ones. It's also possible to specify requirements for steps that are to be taken when appropriate to improve error handling; these can act as guidelines for developers.
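As an indication of how the "system unavailable page" requirement above might be realized, here is a much-simplified sketch; the URL and wording are illustrative only, and a real deployment would more likely rely on a Web server's or load balancer's built-in error-page facility.

```python
# Minimal illustration: fall back to a static "unavailable" page when the
# back end doesn't respond. The URL and wording are illustrative only.
import urllib.request

UNAVAILABLE_PAGE = """<html><body>
<h1>Sorry - this service is temporarily unavailable</h1>
<p>Please try again in a few minutes.</p>
</body></html>"""

def render_page(backend_url: str) -> str:
    """Fetch the dynamic page from the back end, or return the static fallback page."""
    try:
        with urllib.request.urlopen(backend_url, timeout=2) as response:
            return response.read().decode("utf-8")
    except OSError:  # covers connection errors, timeouts, and HTTP errors
        return UNAVAILABLE_PAGE

# Example: html = render_page("http://webshop.internal/orders")  # hypothetical back-end URL
```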

Surreptitious Unavailability

If we don't give our system spare time to do its housekeeping (that is, no unavailability window), it is forced to do it while users are active. This can manifest itself as intermittent slow response time or possibly a disconcerting delay (say, 30 seconds or a minute) if it stops certain types of processing for users altogether. We can call this surreptitious unavailability, because the system is unavailable for this time but in such a way that the unavailability is difficult to notice.

If degraded performance is tolerable for a short while, adjust your performance requirements to allow it-and to say how much is acceptable-though at requirements time it's impossible to know how much time might be needed. You could also stipulate quiet times of day (or days of the week) to which such tasks are restricted. If degraded performance is unacceptable, make this clear, either in the relevant performance requirements themselves or in an additional requirement. Otherwise, developers are likely to argue that response time during housekeeping is a special case. Either way, bring this issue out into the open early.

Demanding both constant availability and consistently good response times is liable to create a squeeze that puts pressure on developers-and costs extra to deal with. (That this appears to happen rarely is perhaps due to surreptitious unavailability being ignored.) Relax this squeeze if you can: don't write onerous requirements unless there's a genuine need. If the goal is for the effect of background housekeeping not to be noticeable-which is usually perfectly acceptable-it's a good idea to permit a degrading of user response times small enough to fit the bill (say, by 10 percent).

Here are a few example requirements for alternative ways to prevent surreptitious unavailability from getting out of hand (though they can be used in combination if need be):

Summary: Housekeeping response time increase maximum 10%

Definition: The running of system housekeeping processes while users are active shall not cause a perceptible increase in response time for any function of more than 10% over that when no housekeeping process is running. (That is, an increase in response time of up to 10% during housekeeping is tolerable.)

Summary: No housekeeping between 5 a.m. and midnight

Definition: No system housekeeping processes shall be run between the hours of 5 a.m. and midnight.

Summary: No housekeeping to run for more than 2 minutes

Definition: No system housekeeping process may run for more than two minutes in any ten if it might cause a perceptible increase in response time for any user function of more than 10% over that when no housekeeping process is running.

Requirements for Reducing Downtime

We can increase a system's availability by examining the three main causes of downtime-maintenance, periodic upgrades, and unexpected failure-and working out ways to reduce them. Some of these ways will be more cost-effective than others. Decisions on which ones to implement (and when) can be taken later in the project: it usually makes business sense to defer some in order to deliver the system faster. The role of the requirements here is to provide the information on which these decisions can be based: to demonstrate what precautions can be taken, what effect each is likely to have, and some idea of their complexity.

Requirements specifying features introduced to help achieve availability targets can be diverse and numerous (so there is a lot to work through in this section!). They span the duplication of hardware and software components and features needed by the products you use (such as the database). They also cover functions for monitoring, startup and shutdown, error diagnosis, software installation, security, and potentially other things that don't have an obvious connection to availability. The system needs many of these functions anyway, but improving availability might demand that they be better: more powerful, faster, easier to use-in general, built with more care and attention.

In many systems, these behind-the-scenes functions are cobbled together as an afterthought, with little time allocated to developing them. Part of the reason is that most requirements specifications omit them altogether-because their connection to the system's business goals appears tenuous and normal users don't use them. Connecting functions directly with availability goals that are in turn attached to business goals gives us solid justification for including those functions and for treating them with the same seriousness as the rest of the system. This is an important point: well worth highlighting.

To identify extra requirements for functions that help us achieve our availability goals, we need to look at the reasons systems are unavailable. Anything that reduces any of these causes will increase our expected availability. Let's introduce a few straightforward formulae. The first one breaks the problem down into three constituent factors:

(Formula 1) Total downtime = Downtime due to maintenance + Downtime due to upgrades + Downtime due to failures

where

Total downtime is the amount of time during any given period for which the system is unavailable to users during the planned availability window.

 

Maintenance is regular housekeeping that needs to be performed to keep the system operating smoothly-such as database backups-or business processing that cannot be done while users are accessing the system and which must be performed during the availability window. The stopping-and-restarting of any component counts as maintenance if it must be done periodically.

 

Upgrades are the installing of new software or hardware, and all related tasks.

 

Failures are anything that goes wrong that renders the system unavailable.

Assign every cause of downtime to one of these three factors. The dividing lines between them aren't always clear-cut. For example, if new software has to be rushed into production to forestall a looming failure known in advance, should that count as a normal upgrade or a failure? You could introduce an extra category for "preemptive corrections" and perhaps more for other boundary regions. They could be useful for management purposes, but they would only cloud the following discussion so they're omitted here.

A second formula can be applied either to all outages collectively or to one of the three factors at a time:

(Formula 2) Downtime = Number of outages × Average duration of each outage

(We talk about average duration here because we're making estimates for the future, not dealing with actual outages in the past.)
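To make Formulas 1 and 2 concrete, here is a small worked sketch (the figures are invented) that applies Formula 2 to each of the three factors and then sums them per Formula 1 to estimate annual downtime.

```python
# Estimate annual downtime: Formula 2 (frequency x average duration) per factor,
# then Formula 1 (sum over the three factors). All figures are invented.

factors = {
    # factor: (outages per year, average duration of each outage in hours)
    "maintenance": (365, 0.25),  # 15 minutes of daily housekeeping
    "upgrades":    (4,   3.0),   # quarterly software upgrades
    "failures":    (6,   2.0),   # unexpected failures
}

total_downtime = 0.0
for name, (frequency, avg_duration) in factors.items():
    downtime = frequency * avg_duration   # Formula 2
    total_downtime += downtime            # Formula 1
    print(f"{name}: {downtime:.1f} hours per year")

availability = 100 * (1 - total_downtime / (24 * 365))
print(f"Total: {total_downtime:.1f} hours per year, i.e. roughly {availability:.2f}% availability")
```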

Maximizing availability is equivalent to minimizing frequency and duration for each of the three factors. Frequency needs to be treated separately from duration because reducing one involves different steps to reducing the other. Let's consider each of the three factors in turn and suggest ways to minimize their frequency and duration-giving us six directions from which to attack the problem and which are discussed in the following six subsections in the order indicated in this table:

                 Frequency    Duration
  Maintenance    1            2
  Upgrades       3            4
  Failures       5            6

But first observe that downtime for maintenance and upgrades will be zero if they can be undertaken wholly outside the availability window. Every serious system has various sorts of housekeeping tasks it must undertake. Many of these tasks are usually easier to do if nothing else is happening at the same time-especially real work by pesky users.

All of these six attack directions must be considered for everything that is within the scope of the system. This includes hardware-if that's in scope. It also includes all third-party products you need-if they're in scope. The implication is that the project team is free to choose all the products the system needs-in particular, to choose products that enable us to satisfy the availability requirements. A complication arises if product choices are forced on the project. If possible, treat these products as outside the system's scope and, accordingly, separate their availability goals from the availability goals of the system in scope. This isn't always possible, however, and you might have to accept responsibility for the availability of a product that's somewhat outside your control. There can also be cloudy areas. For example, if the choice of database has already been made, you might still be able to reconfigure it or purchase add-ons so as to increase its availability.

For each of the six attack directions, it's necessary to work through each of the types of constituent technical pieces in turn. The main ones are

  1. Hardware. Consider the following:

    • The computer(s) on which the system will run.

    • Computers running other software: database, Web server, firewall, and so on.

    • Users' desktop machines (conceivably!).

    • Communications hardware, including internal networking devices, cabling, and phone lines.

    • Power supplies, both the normal and emergency uninterruptible supplies.

  2. Third-party software products, such as database, Web server, and middleware.

  3. Our own software.

Draw up a list of all those pieces relevant to your environment that have a bearing on the availability of your system. For each of the six attack directions, consider each item on your list and specify requirements for it as appropriate (as described in the following six attack direction subsections).

Again, worry about only those pieces that are within the scope of your system, as defined in the requirements specification. If the list you draw up includes things you believe shouldn't be the project's concern, you might have set the system's scope too broadly and need to reduce it. Nevertheless, it's useful to put in the requirements specification a list of all the pieces outside scope whose failure can affect the availability of your system-because your stakeholders might want to check their dependability and perhaps improve some of them.

Each requirement created to improve any of these attack directions should include a statement estimating the extent to which it contributes. Occasionally, it's possible to state the extent categorically (not as a mere estimate). This statement can be omitted if the requirement's definition already makes the effect obvious. A suggested template for such statements is

  • "It is estimated that this requirement reduces «Factor» by/to «Extent».

    where

    «Factor» is one of the six attack directions

    and

    «Extent» is the average amount by which it improves the factor."

An average is usually worked out by estimating in what percentage of failures this requirement will help, and by how much it helps when it does. For example, something that helps in five percent of failures and typically saves 20 minutes when it does will save an average of one minute per failure. Here are some examples:

  • "It is estimated that this requirement reduces average duration of daily housekeeping by five minutes."

  • "This requirement reduces frequency of upgrades by three per year."

  • "It is estimated that this requirement reduces duration of each application software failure by 15 minutes on average."

Always stress when anything is just an estimate. At requirements time, we don't know how much the system is going to cost or what its "natural" availability will be. So we have little idea how much extra it'll cost to achieve stated availability goals. Even metrics determined from previous projects (if you have any) wouldn't tell us much. Too much depends on technology choices that (usually) have yet to be made. It is therefore impossible for the requirements to make concrete judgments on what features our software needs in order to deliver acceptable availability. And this is even before we begin to think about how much it would cost.

There is a risk that the downtime savings indicated in these requirements might add up to more than the total downtime we'd expect in the first place-through overselling the benefit of some of the preventive steps. To prevent this happening, you could extract all these figures and, say, put them in a spreadsheet.
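Here is a minimal sketch of that kind of spreadsheet check (all figures are invented): add up the downtime savings claimed by the individual requirements and compare them with a rough guess at the downtime the system would suffer anyway, to spot overselling.

```python
# Sanity-check the downtime savings claimed by availability requirements.
# All names and figures are invented for illustration.

claimed_savings_hours_per_year = {
    "Database backups while system active": 91,
    "Housekeeping while system active": 61,
    "Replicate hardware": 8,
    "System monitor": 10,
}

estimated_natural_downtime = 120  # rough guess at annual downtime with no special measures (hours)

total_claimed = sum(claimed_savings_hours_per_year.values())
print(f"Total claimed savings: {total_claimed} hours per year")
if total_claimed > estimated_natural_downtime:
    print("Warning: claimed savings exceed the downtime we expect in the first place;"
          " some estimates are probably oversold.")
```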

If availability demands aren't onerous or if you have confidence that the system's natural availability level will be good enough, you can leave out of the initial implementation all the requirements for increasing availability. They can be introduced selectively later once we see how good the system's actual availability is and the causes of any failures that do occur. It is, however, a good idea for developers to bear in mind all the availability-related requirements-to make provision for them so that it's straightforward to add them later. Also, choose third-party products that satisfy these requirements as far as possible: it would be disappointing to have to replace a third-party product later just because it's not reliable enough.

Now for the six attack directions themselves. Note that they're all written to cover hardware and third-party products as well as our own software.

Attack Direction 1: Frequency of Maintenance

Commercial systems historically do their regular maintenance once a day: the old end-of-day run. This is a convenient and natural cycle. There might be sound business or technical reasons for doing system maintenance several times a day-or less than once a day. These reasons should take precedence over the desire to reduce frequency of maintenance as part of our efforts to improve availability. But this section would be failing in its duty if it didn't point out that doing maintenance less often improves availability (if user access must be curtailed to perform maintenance). On the other hand, there might be a trade-off between maintenance frequency and maintenance duration: when doing it less often means it takes longer each time.

A system can have more than one type of maintenance-for example, daily and monthly. Moving some processing from a frequently run type to a less frequently run one would then reduce the total maintenance time-and improve availability. But it's rare that the requirements can effect changes like this; they are more a design matter.

Here's an example requirement for the record-but it does look rather old-fashioned:

Summary: Maintenance no more than daily

Definition: The system shall not be shut down for maintenance more than once per day.

Attack Direction 2: Duration of Maintenance

Duration of maintenance can be brought down to zero for most types of system through the use of suitable products (especially the database); it might take some extra development effort, too. Here are some sample requirements for a few things that contribute to reducing (or eliminating) maintenance time:

Summary: Database backups while system active

Definition: A database product shall be selected that permits backing up of the database while other database activities are going on.

It is estimated that this requirement reduces duration for which the system would be unavailable to users for maintenance by fifteen minutes each day.

Summary: Product restarts unnecessary

Definition: Each product used (both software and hardware) shall be chosen on the basis that it can be depended upon to run for an extended duration without needing to be restarted.

It is estimated that this requirement reduces duration for which the system would be unavailable to users for maintenance by thirty minutes each week.

Summary: Housekeeping while system active

Definition: All system housekeeping tasks that can be performed while the system is available to users (such as purging old data) shall be.

It is estimated that this requirement reduces duration for which the system would be unavailable to users for maintenance by ten minutes each day.

Note that these requirements contribute to reducing both maintenance duration and frequency, so the estimates of their impact must reflect both.

Attack Direction 3: Frequency of Upgrades

The frequency with which the various components of the system are upgraded is determined primarily by the forces that motivate the upgrade: to introduce new features or other software improvements, fix defects, add faster hardware, and so on. Those forces will usually strike a sensible balance with the forces that don't want the system interrupted. Nevertheless, if it's important for availability reasons to limit the frequency of system shutdowns for upgrades, the requirements are the place to say so. Each type of component in the system (hardware, third-party products, our own software, and so on) has its own upgrade considerations, and therefore needs to be treated separately in these requirements.

It's worth observing, though, that stable systems need upgrading less frequently-and because high-quality systems are stable, this implies that quality reduces frequency of upgrades. There's also a trend towards more iterative development methodologies (shorter development cycles with more frequent deliveries, that is), but it's not necessary to install every iteration live. Iterative approaches don't force us into more frequent upgrades.

Requirements that address upgrade frequency are more technical than requirements should be. But if you want to go further-to take steps to perform upgrades without interrupting user access-you'll have to get more technical still. It is possible to design systems in such a way that you can upgrade software components while the system is running, but it's very hard to do. Don't expect an average development team to be capable of tackling it. And expect it to be expensive, for both development and testing.

Here are a couple of example requirements:

Summary: Three-monthly software upgrades

Definition: An upgraded version of the system's application software shall ordinarily be installed no more frequently than once every three months.

(This requirement is present solely to help facilitate calculating the estimated system downtime.)

Summary: Machine shutdown without interrupting system

Definition: It shall be possible to shut down a machine that runs application software without interrupting user access to the system as a whole.

Attack Direction 4: Duration of Upgrades

A typical system upgrade in many organizations is poorly planned and is of a duration that cannot be predicted in advance. It can drag on to be a thirty-six hour marathon of frequent coffee breaks and late night pizzas. The people involved might be lauded for their stamina and dedication, but their heroics shouldn't be necessary, and they involve risks.

The duration of an upgrade can be reduced by preparation, which costs time and money. The shorter the duration, the more preparation must be done (to cram all the work into the smallest possible window). As you make the window smaller, the preparation effort grows exponentially. Possible steps:

  1. Rehearse what needs to be done. This can include trying out the upgrade on a test system (or more than one).

  2. Automate as much as possible. Write scripts, or more substantial software. Often these aren't regarded as "real software," but they are, and they should be treated just as seriously as any other software. After all, if they do something wrong, they can do just as much damage.

  3. Prepare instructions for the work that needs to be done. Arrange for as many tasks to be performed in parallel as possible.

  4. Do as much as possible beforehand. Spend your precious downtime on only those tasks that must be done while the system is down.

  5. Bring in as many people as necessary. If minimizing the time it takes is the top priority, bring in as many people as it takes.

Some organizations omit some or even all of these steps-often out of sheer ignorance. While requirements cannot involve themselves in the conduct of a particular upgrade, they can indicate the lengths to which people should go to reduce the duration of each one. They're useful even if they merely alert the people responsible for upgrades to the fact that it's possible to make preparations.

If you have multiple instances of the system to upgrade, preparation becomes more cost-effective because it's shared across all those instances. Software for automating upgrades is particularly important if you're specifying a product. In this case, you definitely need to specify proper requirements for the upgrade software.

Avoid setting a time limit for upgrades unless there is a genuine business reason. Even for one system, upgrade durations will vary. We just want each one to take as little time as possible. If we set a limit of three hours, we'd still want a two-hours-by-hand upgrade to be done more quickly if possible.

Here are a couple of example requirements:

Summary: Minimize software upgrade duration

Definition: All reasonable steps shall be taken to minimize the length of time for which the system must be shut down when upgrading its software. "Reasonable steps" shall be taken to mean up to two person days of effort for each hour of downtime saved.

It is estimated that this requirement reduces average software upgrade duration by two hours.

Summary: Upgrade instructions

Definition: Instructions shall be written for each system upgrade, to describe all the steps that must be taken to install it successfully.

Attack Direction 5: Frequency of Failures

What we're talking about here is reliability, in the sense of rarely going wrong. There are two types of failures: accidental (such as software defects, hardware breakages) and deliberate (primarily malicious attacks by someone either outside or inside the organization). We need to take steps to prevent both. Minimizing accidental failures is achieved by quality. For hardware and purchased software, this means buying reliable, high-quality products. For our own software, it means building with quality: primarily sound development to keep software defects to a minimum and good testing to find them. Hardware reliability can also be enhanced by replication: having more than one of everything (or some things).

Protecting against deliberate attempts to cause failures is a matter of security. It includes firewalls and antivirus software, as well as access control to prevent valid users doing things they shouldn't.

There's only one requirement here, because we don't have room to cover the other topics that contribute most to stopping failures: good development and testing practices, and security.

Summary: Replicate hardware

Definition: All hardware components of the system shall be replicated, such that failure of any one hardware component shall not render the system unavailable to users.

It is acceptable for system performance to be poorer than normal after the failure of a piece of hardware.

It is estimated that this requirement reduces the frequency of failures by two per year.

Attack Direction 6: Duration of Failures

A system that fails only once a year would appear to be of high quality. But that counts for nothing if it takes three weeks to recover from that one failure. Stopping failures from happening in the first place is usually given much higher priority than keeping the duration of each shutdown to a minimum. But according to our Formula 2 for total downtime, reducing their duration is just as important.

When a failure occurs, its duration is determined by the following formula:

(Formula 3) Outage time = Time to detect + Time to react + Time to fix

where

Outage time is the length of time from the moment the system became unavailable to users until it becomes available again.

 

Time to detect is the length of time it takes to detect the failure and to raise the alarm. It includes the time it takes to notify people.

 

Time to react is the length of time between people being notified until the first person can begin to work on the problem.

 

Time to fix is the length of time it takes to investigate and rectify the problem and make the system available to users.

Imagine your average system crashing at 2 a.m. The time to detect is the half-hour it took a dozing operator to spot that the usual messages are missing from the screen; the time to react is the hour and a half it took to phone, wake, and drag into the office the on-call programmer; the time to fix is the three hours the programmer spent looking for subtle clues among paltry evidence before finding the cause (the thirty seconds spent rectifying the silly fault hardly registers), plus the half hour it took to restart everything. Minimizing outage time involves minimizing all three factors: it's little use having lightning-fast system monitoring detect a problem in a millisecond if it still takes hours to fix. It also means that paying for a taxi to get the programmer to the office ten minutes faster is just as valuable as fixing the fault ten minutes quicker, which is something for expenses-conscious managers to bear in mind.

It's analogous to a house fire. The duration of the disruption is how long the fire burns before someone raises the alarm, plus the time it takes for the fire brigade to arrive and put out the fire, plus the time before you can repair the damage and move in again. The last point is worth noting: what we're interested in is how long it takes before everything's back to normal.

The preceding formula assumes human intervention is necessary, but it is possible for a system to deal with some types of problems automatically. For example, if a system monitor detects an expected process is not running, it could start it up. In such cases outage time equals time to detect plus time for automated reaction. It must be stressed, however, that it's hard to develop automated responses that properly rectify an identified problem, and such responses are possible for only a few kinds of fault. The system monitor in this example doesn't do that: it doesn't prevent the problem recurring, which it would do repeatedly if an error in the relevant process's software caused it to crash each time it starts up.

The remainder of this section deals with each of the three factors in Formula 3 in turn. It addresses only features that can be built into a system, although operational factors have an equally large (or larger) part to play. Requirements are not the place to deal with the details of operational matters.

Time to Detect

Minimizing the time it takes to detect a failure involves spotting any problem as fast as possible and then notifying whoever should be notified, also as rapidly as possible.

In an office full of people, if one person suddenly collapses, others would notice and come to their assistance. In contrast, if one machine in a network (or one process in a machine) collapses, the natural reaction of its colleague machines is to do nothing or at most to complain that it's not doing its job. If we want machines to feign a little concern, we have to tell them how. For this, requirements should be specified, covering three aspects:

  1. Any piece of software that detects a serious error must raise an alarm.

  2. Special system monitoring facilities are needed to check that all machines and processes that should be running are running and to raise an alarm if they're not. They need to run on more than one machine, if they are to detect the failure of a machine on which they run.

  3. A notification mechanism is needed-something to raise an alarm on request-to tell nominated human beings there's a problem that someone needs to fix. This might provide some way for the people to acknowledge being notified. And if no one acknowledges the notification, the mechanism might notify more people.

The first two are types of problem detection, and the third covers what to do when a problem is detected. All involve investigating the specific needs of your system and its environment: one-size-fits-all requirements won't work here. Here are some questions to ask-and when answering them keep in mind that the primary concern is reducing response time:

  • What constitutes a serious error? Don't attempt to identify them individually, but define criteria by which any type of error can be judged as serious or not.

  • Which people need to be notified when a serious error is detected? Does it depend on what kind of error? Does it change according to the time of day (especially outside normal office hours)?

  • By what means should we notify people? A message on a screen, email, pager, SMS, instant message, ring a loud bell, tell some other system? Do we need to notify one person by multiple means? Should we use different means for different people or at different times of the day? Should the means to use vary depending on how serious the error is?

  • Do we need acknowledgment that someone is taking responsibility for the problem? What if no one acknowledges doing so?

It's not necessary to ask what machines and processes need to be monitored, because that's too technical. It's possible to specify a requirement for this in general, technology-independent terms (as the "System monitor" example requirement that follows does).

The detecting of an error or the raising of an alarm could take other actions too, if we want-say, for exceptionally serious errors. For example, if an attack by hackers was detected, we might want to shut down the system completely. This is within the scope of the subject of availability only insofar as it prevents further damage, but it demonstrates that the features being discussed here can be beneficial in ways beyond just reducing the duration of failures.

Software that checks a system's availability can also be a basis for statistics on its availability. Up to a point, that is, because the checker could itself fail and can tell us nothing when it's not running. It also needs to be built to cater for deliberate system downtime and, ideally, to have a way of distinguishing the three types of downtime, which means that when shutting the system down the operator should be able to record why it is being shut down.
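To indicate the kind of facility these requirements call for, here is a much-simplified monitoring sketch; the service names, addresses, and the single notification means (email) are all invented, and a real monitor would run on more than one machine and support several notification means.

```python
# Much-simplified system monitor: checks that expected services respond and
# raises an alarm if one doesn't. Names, addresses, and notification details are invented.
import smtplib
import socket
import time
from email.message import EmailMessage

EXPECTED_SERVICES = {          # service name -> (host, port) it should be listening on
    "order service": ("app1.internal", 8080),
    "database": ("db1.internal", 5432),
}
OPERATORS = ["oncall@example.com"]   # "designated people" for this category of alarm

def is_up(host: str, port: int, timeout: float = 5.0) -> bool:
    """A service counts as up if something accepts a TCP connection on its port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def notify(message: str) -> None:
    """Notify designated people (here, by email only; pager and SMS are left out for brevity)."""
    mail = EmailMessage()
    mail["Subject"] = "ALARM: " + message
    mail["From"] = "monitor@example.com"
    mail["To"] = ", ".join(OPERATORS)
    mail.set_content(message)
    with smtplib.SMTP("mail.internal") as smtp:
        smtp.send_message(mail)

while True:                       # check every 30 seconds, per the example requirement below
    for name, (host, port) in EXPECTED_SERVICES.items():
        if not is_up(host, port):
            notify(f"{name} on {host}:{port} appears to be down")
    time.sleep(30)
```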

Here are some requirements covering the three aspects listed:

Summary: Serious software error raises alarm

Definition: Any software that detects a serious error shall raise an alarm, by invoking the notification mechanism specified in the next requirement.

A serious error for the purpose of this requirement is one that is deemed to require immediate human intervention.

It is estimated that this requirement reduces the average duration of a failure detectable within software by 30 minutes.

Summary: Notification mechanism

Definition: There shall be a mechanism to notify designated people using designated means when a message is passed to it. The following means shall be supported:

  • Email

  • SMS

  • Pager

Designated people means a list of users associated with the category to which the message belongs. Each user who wishes to be notified by pager must have a pager number set for them.

Designated means are all those means on a list of means associated with an individual user. There shall also be a default list of means to use if a user to be notified has no list of their own.

Summary: System monitor

Definition: There shall be a system monitor that is able to detect within 30 seconds the perceived failure of any of the machines and processes that are expected to be running at all times.

It is estimated that this requirement reduces the duration of each failure of a monitored machine or process by five minutes.

In practice, notification mechanisms deserve to be specified in more detail than in the second requirement here. You could even treat such a mechanism as an infrastructure in its own right. Various notification-related example requirements are given in the extendability requirement pattern in Chapter 10.

Time to React

Getting investigators working on a failure is largely an operational matter, which doesn't concern the requirements. Steps might include making provision for emergency access to the live system by people (mainly developers) who normally don't have it. It might be, though, that special features can be added to the system to enable quicker access by whoever is to investigate a failure. These could include

  1. Remote access facilities, if the system doesn't otherwise need them. The intent here is to allow someone to dial in from home, especially after hours, to work on the problem.

  2. Access control extensions, to allow an investigator to do more in an emergency situation than they would normally be allowed to.

Bear in mind, though, that some types of failure could affect the working of these features too, if hardware on which they depend has failed. Insisting on replication of components used in combating a failure is worthwhile: even if they're rarely needed, it is precisely at times of crisis that they will be called upon.

Observe, too, that these features give an investigator exceptional ability to do deliberate damage and they thereby constitute a risk (however small) of facilitating a worse incident. Lest you consider such a coincidence unlikely, a malicious developer could contrive a failure precisely to provide this opportunity.

Here are a couple of example requirements:

Summary: Emergency remote access

Definition: The system shall provide the ability for a personal computer to dial in and access it remotely. This facility shall ordinarily be disabled and enabled only in the event of a system failure that warrants immediate investigation.

It is estimated that this requirement reduces the duration of each failure that occurs outside office hours by one hour.

Summary: Emergency extended access

Definition: It shall be possible to grant extended access to a nominated person, to bypass normal access control restrictions.

This feature is intended to be used only when the person in question is investigating a system failure; extended access is to be revoked immediately afterwards. (It is recommended that for the duration of the emergency any person granted such access be closely supervised.)

It is estimated that this requirement reduces the duration of each failure by 15 minutes.

Time to Fix

Minimizing the time it takes to fix a failure involves providing investigators with as much information about the problem as possible and giving them the best tools for probing the state of the system. This is a subject that's often completely ignored when specifying and developing systems (beyond chronicling errors), but at the very least you should reflect on whether it deserves serious consideration. Another important way of getting a system up and running quickly is making provision for disaster recovery; this topic is discussed at the end of this section.

The information to help diagnose a failure needs to be gathered as a matter of course while the system is running normally, like an aircraft's black box flight recorder. Steps that can be taken to gather this information include:

  1. Record everything that happens in the system that might be of interest, especially errors (even those not serious enough to constitute a system failure).

  2. Insist that all error messages be clear, correct, and detailed. Considerable time can be wasted if a problem produces an error message that is uninformative or, worse, misleading: an investigator could be sent off on a wild goose chase.

Here's an example requirement for each of these two steps:

Summary: Record all errors

Definition: Every error detected by the system shall be recorded. At least the following shall be recorded for each error:

  • Error ID

  • Message text

  • Date and time at which the error occurred

  • Name of the machine on which the error occurred

For the purpose of this requirement, a minor exception condition that the software is designed to handle completely itself (such as invalid data entered by a user) does not constitute an error.

It is estimated that this requirement reduces the average duration of each failure by two minutes.

Summary: Clear, detailed error message

Definition: Each error message shall be clear and self-explanatory and shall contain items of variable information as appropriate to isolate the cause. The variable information might be, for example, the name of a machine, the amount of free space on a disk, or a customer ID.

It is estimated that this requirement reduces the average duration of each failure by two minutes.
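As a rough illustration of the first of these requirements, the following Python sketch records the listed items for every error. The error ID, the message text, and the name of the log file are hypothetical.

    import logging
    import socket
    from datetime import datetime

    # Assumed destination for the error record; in practice this would be whatever
    # logging infrastructure the system already uses.
    logging.basicConfig(filename="system_errors.log", level=logging.ERROR)

    def record_error(error_id, message_text):
        """Record an error with at least the items the requirement demands."""
        logging.error(
            "error_id=%s machine=%s time=%s message=%s",
            error_id,
            socket.gethostname(),        # name of the machine on which the error occurred
            datetime.now().isoformat(),  # date and time at which the error occurred
            message_text,
        )

    # Example of a clear message containing variable information (disk name and free space).
    record_error("ERR-1042", "Disk /data has only 120 MB free; nightly extract may fail")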

As for diagnostic tools to investigate a problem, steps to consider include:

  1. Identify a range of tools likely to be useful for investigation, and install them on the system.

    If it is unacceptable to have the investigative tools permanently installed (for valid security reasons: they do constitute a security risk), have them readily at hand. There are several ways to do this, such as having the software ready to install or having a separate machine containing the tools ready to connect to the live network.

  2. Document error messages: not necessarily all of them, but those about which something extra can usefully be said. These explanations can tell an investigator what an error really means, what causes it, and how to respond to it. The set of error message explanations must be made available to investigators. While this step might appear to relate to the gathering of information, the set of error messages actually constitutes a diagnostic tool.

  3. Develop special software to examine the integrity of the system, especially its data. Programmers sometimes create such software for their own use in testing, but it then usually languishes unknown and unappreciated, which is a waste. Treating such utilities as part of the mainstream system makes them available to help when a problem occurs.

Here's an example requirement for the second step:

Summary: Error message explanations

Definition: Each error message for which explanatory information is available (over and above its message text) shall be documented. The following information shall be provided for each such message:

  • Error ID

  • Message text

  • Explanation of each item of variable information: its origin and meaning

  • Extended explanation of error's meaning

  • Description of likely cause(s)

  • Description of suggested response(s)

It is estimated that this requirement reduces the average duration of each failure by one minute.
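One way to make such explanations available to investigators (and to diagnostic tools) is to hold them in a machine-readable catalogue keyed by error ID. The sketch below, in Python, is purely illustrative; the error entry and everything in it are invented.

    # Hypothetical catalogue of documented error messages, keyed by error ID.
    ERROR_CATALOGUE = {
        "ERR-1042": {
            "message_text": "Disk {disk} has only {free_mb} MB free; nightly extract may fail",
            "variables": {"disk": "the file system the extract writes to",
                          "free_mb": "free space remaining, in megabytes"},
            "explanation": "The nightly extract needs roughly 500 MB of working space.",
            "likely_causes": ["Old extract files not purged", "Unusually large volume of data"],
            "suggested_responses": ["Purge extracts older than 30 days", "Add disk capacity"],
        },
    }

    def explain(error_id):
        """Print whatever extra information has been documented for this error ID."""
        entry = ERROR_CATALOGUE.get(error_id)
        if entry is None:
            print(f"No extra explanation documented for {error_id}")
            return
        for item, value in entry.items():
            print(f"{item}: {value}")

    explain("ERR-1042")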

Another way to get a system up and running again quickly is to invest in a disaster recovery system: a duplicate hardware and software environment, preferably in a different physical location. As its name suggests, such a set-up lets you get up and running again when anything up to and including a disaster befalls the main site. But it doesn't help if the reason for the failure was a major software fault that affects the second system, too. Disaster recovery involves a lot more than setting up a second environment, and it must be investigated (and later tested) thoroughly. Bear in mind that any upgrades performed on the production system must also be performed on the disaster recovery system. Here are some simple example requirements that suggest a few aspects to worry about but that in practice deserve to be specified in much more detail:

Summary: Disaster recovery site

Definition: There shall be a disaster recovery site at a physically separate location from that of the main production system. It shall duplicate all the features of the main site.

It is acceptable for the disaster recovery system to have lower performance than the main system.

Summary: Disaster recovery data

Definition: There shall be a means of supplying the disaster recovery site with an up-to-date copy of all production data.

There shall be a similar means to supply data to the production system, to allow it to start running again when the fault has been fixed.

Summary: Disaster recovery communications

Definition: There shall be a means of directing all communications intended for the production site to the disaster recovery site instead, in the event of a disaster.

It shall also be possible to switch communications back to the production system when the fault has been fixed.

Summary: Disaster recovery procedures

Definition: There shall be written procedures to explain how to get the business operating from the disaster recovery site.
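The second of these requirements implies that the disaster recovery copy of the data must be kept acceptably fresh, which is worth checking automatically. The following Python sketch illustrates one such check; the snapshot location and the tolerable lag are assumptions made purely for illustration.

    import os
    import time

    # Hypothetical path where the most recent copy of production data lands at the DR site.
    DR_SNAPSHOT_PATH = "/dr/data/latest_snapshot"
    MAX_LAG_SECONDS = 15 * 60  # assume the business can tolerate 15 minutes of lost data

    def dr_data_is_fresh():
        """Return True if the disaster recovery copy is recent enough to be usable."""
        try:
            age = time.time() - os.path.getmtime(DR_SNAPSHOT_PATH)
        except OSError:
            return False  # snapshot missing altogether
        return age <= MAX_LAG_SECONDS

    if not dr_data_is_fresh():
        print("WARNING: disaster recovery data is stale or missing")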

Considerations for Development

If you're faced with an unfriendly availability requirement, question it: "What am I supposed to do with that?" Should that get you nowhere, work your way through the suggestions in the "Extra Requirements" subsection in this pattern to come up with a set of concrete steps to take. You can even formulate them as requirements if you like. Then implement them.

There are too many kinds of extra requirements related to availability to discuss them all here. But pay particular attention to the availability window specified for the system, because this affects whether housekeeping tasks must be performed while users are accessing the system.

Throughout the development process, document any problem you detect that could affect system availability and explain how to deal with it and recover when it occurs. Check with the testing team to see if they have discovered this problem already, as they might have valuable additional information and insights.

Considerations for Testing

A classic availability requirement (of the "24x7 availability 99.9 percent of the time" kind) is so hard to test that it's not even worth trying for a normal commercial system, which is the root of the argument against such requirements in the first place. The kind of starting-point availability requirement advocated by this pattern (that defines the availability window) is more practical to test: you need to simulate a small number of days' running (you might even consider that one day suffices) and check that there's nothing that prevents you running the system constantly during the specified hours. Whenever you encounter a primary availability requirement, first ask yourself how easy it is to test. Also be alert for the two nastiest kinds: those that are impossible to test, and those that are feasible but impractical to test.

No matter what form availability requirements take, testing should include running the system continuously for an extended period, which means for as long as you can but certainly for several days. Running for a month or more continuously is excellent. Keep a wary eye open for memory leaks: observe how much memory each process takes up, and check that it doesn't grow steadily the longer the process has been running. Any software that's expected to run for an extended period will surely lead to unhappiness if it has a memory leak. Pass it back to the development team smartly, but also let it keep on running to see what happens.
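One simple way to gather the raw figures for this memory check is to sample each long-running process's resident memory at intervals and watch for steady growth. The sketch below reads /proc, so it is Linux-specific; the process ID and the sampling interval are illustrative assumptions.

    import time

    def resident_memory_kb(pid):
        """Read a process's resident set size (VmRSS) from /proc (Linux only)."""
        with open(f"/proc/{pid}/status") as status_file:
            for line in status_file:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1])  # value is reported in kB
        return None

    def sample_memory(pid, interval_seconds=600, samples=144):
        """Log memory use every 10 minutes for a day; steady growth suggests a leak."""
        for _ in range(samples):
            print(time.strftime("%Y-%m-%d %H:%M:%S"), resident_memory_kb(pid), "kB")
            time.sleep(interval_seconds)

    # Example, with a hypothetical process ID:
    # sample_memory(12345)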

Requirements whose aim is to deliver availability (those covered in the "Extra Requirements" section earlier in this pattern) are too diverse to enumerate here. Treat each one individually on its own merits. If extra requirements of these kinds have not been formally specified, but developers have devised their own steps to achieve availability goals, you could find out what those steps are and test them as if they were requirements. It's impractical to prove that all the steps demanded to increase availability actually deliver a stated availability level. The best you can do is review any reasoning or calculations performed by the analyst or developers and ask yourself if their assumptions look reasonable.

Test for surreptitious unavailability. Find out if any housekeeping-type tasks are performed while users are active. If so, perform a range of user functions while this housekeeping is underway, and test its effect on response time.
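One way to quantify that effect is to time a representative user function repeatedly while the housekeeping task is running and compare the result against a quiet-period baseline. In this sketch, perform_user_function is a hypothetical stand-in for whichever user function the tester chooses to exercise.

    import statistics
    import time

    def mean_response_time(fn, repetitions=50):
        """Return the mean elapsed time, in seconds, of repeated calls to fn."""
        timings = []
        for _ in range(repetitions):
            start = time.perf_counter()
            fn()
            timings.append(time.perf_counter() - start)
        return statistics.mean(timings)

    # Hypothetical usage:
    # baseline = mean_response_time(perform_user_function)  # measured during a quiet period
    # busy = mean_response_time(perform_user_function)      # measured while housekeeping runs
    # print(f"Slowdown factor: {busy / baseline:.2f}")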

Nearly all systems are installed containing known defects. Document thoroughly each known defect that has the potential to affect the availability of the system: explain what causes it, how to diagnose whether it was the cause of a system failure, and how to respond when it happens. Make these explanations as easy to find as possible in the event of a failure. These steps can reduce the length of a system outage significantly.



