3.6 Developing RMA Requirements

Now we will expand on the performance requirements previously discussed, developing and quantifying requirements when possible. In this section, we will discuss two types of thresholds: general and environment specific. General thresholds are those that apply to most or all networks. They are rules of thumb that have been determined via experience to work in most environments. They are applied when there are no environment-specific thresholds to use. Environment-specific thresholds are determined for the environment of the current network project on which you are working. They are specific to that environment and typically are not applicable to other networks. These thresholds are useful in distinguishing between low and high performance for the network.

3.6.1 Reliability

Reliability is a statistical indicator of the frequency of failure of the network and its components and represents the unscheduled outages of service. It is commonly measured as mean time between mission-critical failures (MTBCF), usually expressed in hours. A related measure is mean time between failures (MTBF), which counts all failures, regardless of their significance at the time of failure, and is a conservative approximation that is useful in simple systems. MTBF can, however, confuse the designer of a complex system. As systems become more complex and resource limitations restrict the degree of redundancy or the purchase of higher-reliability components, MTBCF becomes more illuminating, although it requires more careful analysis, focusing on system performance during specific mission scenarios. MTBF is computed as the inverse of the failure rate, which is estimated through testing or analysis in terms of failures per hour of operation. Criticality is factored in by considering the failure rates only of components that are critical to the mission. On the surface, reliability arithmetic is a little confusing; remember that reliability calculations for a system are really performed by adding the failure rates (failures per hour) of its components and then inverting the sum.
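
To make this arithmetic concrete, the following is a minimal sketch in Python, assuming hypothetical MTBF values for three mission-critical components (the numbers are illustrative only, not vendor data):

    # Hypothetical MTBFs (hours) for the mission-critical components;
    # values are illustrative only.
    component_mtbf_hours = [50_000, 120_000, 200_000]

    # Convert each MTBF to a failure rate (failures per hour), add the
    # rates, then invert the sum to get the system-level MTBCF.
    failure_rates = [1.0 / mtbf for mtbf in component_mtbf_hours]
    system_mtbcf = 1.0 / sum(failure_rates)

    print(f"System MTBCF: {system_mtbcf:.0f} hours")  # roughly 30,000 hours

Because the components are effectively in series (no redundancy), the resulting MTBCF is lower than the MTBF of any single component.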

3.6.2 Maintainability

Maintainability is a statistical measure of the time to restore the system to fully operational status after it has experienced a fault. It is generally expressed as mean time to repair (MTTR). Repairing a system failure consists of several stages: detecting the failure; isolating it to a component that can be replaced; delivering the necessary parts to the location of the failed component (logistics time); and replacing the component, testing it, and restoring full service. MTTR usually assumes that logistics time is zero; this assumption is invalid when a component must be replaced to restore service but takes days to obtain.

3.6.3 Availability

Availability (also known as operational availability) is the relationship between the frequency of mission-critical failures and the time to restore service. This is defined as the MTBCF (or MTBF) divided by the sum of MTTR and MTBCF or MTBF. These relationships are shown in the following equation, where A is availability.

A = (MTBCF)/(MTBCF + MTTR) or A = (MTBF)/(MTBF + MTTR)
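
As a minimal sketch of this equation (the MTBCF and MTTR values below are illustrative assumptions, not measurements):

    def availability(mtbcf_hours: float, mttr_hours: float) -> float:
        """Operational availability: A = MTBCF / (MTBCF + MTTR)."""
        return mtbcf_hours / (mtbcf_hours + mttr_hours)

    print(availability(mtbcf_hours=4000, mttr_hours=4.0))  # ~0.9990 (99.9%)
    print(availability(mtbcf_hours=4000, mttr_hours=0.4))  # ~0.9999 (99.99%)

The second call shows why reducing MTTR (discussed below), for example by switching to a hot spare instead of waiting for a repair, raises availability even though the failure rate is unchanged.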

Some important considerations in evaluating availability are often overlooked. First, availability does not necessarily reflect the percentage of time that the system is operational: Scheduled maintenance is not taken into account in this calculation—only unscheduled maintenance is. This is significant because scheduled maintenance does not penalize the system since it can be performed when the components are not needed to perform the mission. The real concern is the surprise nature of system failures, which is reflected in availability.

One consequence is the ability to schedule preventive maintenance and to replace frequently failing components more often than their failure rate would require, thereby increasing the reliability of the overall system. Another way to improve availability is to reduce the MTTR, through either redundancy or accelerated replacement capabilities. One way to achieve redundancy is to install automated switch-over capabilities for critical components, so that operation switches to a redundant unit within milliseconds when a fault is detected; however, automatic switching is an expensive feature to add to these systems. If the pressure to restore service does not warrant this level of expense, rapid-replacement solutions can improve MTTR substantially over traditional troubleshooting and repair procedures: Hot spares that can be switched in manually vastly accelerate the restoration of service, as does prepositioning spare components near the critical component's location. Other measures of availability include uptime, downtime, and error and loss rates.

Uptime and Downtime

A common measure of availability is in terms of percentage of uptime or downtime. For example, a request for proposal from a potential customer may state a required uptime of 99.999% (commonly known as "five nines"), but what does that really mean? What do the terms uptime and downtime really mean?

When availability is represented as the percentage of uptime or downtime, it is measured per week, month, or year, based on the total amount of time in that period. Uptime is when the system (applications, devices, networks) is available to the user (in this case, the user may also be an application or device). Available, as we will see, can range in meaning from having basic connectivity to actually being able to run applications across the network. How this is described in the requirements analysis is important to how it will be measured and verified. Likewise, downtime can range from not being able to connect across the network to having connectivity but with loss rates high enough (or capacity low enough) that applications will not function properly. Figure 3.10 shows some commonly used availability percentages (as percentage of uptime), ranging from 99% to 99.999%.

% Uptime     Allowed Downtime per Time Period (h = hours, m = minutes, s = seconds)
             Yearly       Monthly      Weekly       Daily
99%          87.6 h       7.3 h        1.68 h       14.4 m
99.9%        8.76 h       44 m         10 m         1.4 m
99.99%       53 m         4.4 m        1 m          8.6 s
99.999%      5.3 m        26.3 s       6 s          0.86 s

Figure 3.10: Uptime measured over different time periods.

Another way to view availability is by how much downtime can be tolerated per time period. The range shown in Figure 3.10, 99% to 99.999%, covers the majority of requested uptime requirements. At the low end of this range, 99% allows the system to be down quite a bit of time (more than 87 hours per year). This may be acceptable for testbeds or system prototypes, but it is unacceptable for most operational systems. When commercial service providers offer uptime that is at the low end of this range, this will need to be factored into the overall availability for the network.
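
The entries in Figure 3.10 can be reproduced with a short calculation. The following sketch converts an uptime percentage into allowed downtime per measurement period; the period lengths are approximations (a month is taken as one-twelfth of a year, or 730 hours):

    # Approximate period lengths in hours.
    HOURS_PER_PERIOD = {"yearly": 8760, "monthly": 730, "weekly": 168, "daily": 24}

    def allowed_downtime_minutes(uptime_percent: float, period: str) -> float:
        downtime_fraction = 1.0 - uptime_percent / 100.0
        return downtime_fraction * HOURS_PER_PERIOD[period] * 60.0

    print(allowed_downtime_minutes(99.99, "weekly"))  # ~1 minute per week
    print(allowed_downtime_minutes(99.99, "yearly"))  # ~53 minutes per year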

An uptime level of 99.99% is closer to where most systems actually operate. At this level, 1 minute of downtime is allowed per week, which equates to a few transients per week, or one minor interruption per month (where a transient is a short-lived [on the order of seconds] network event, such as traffic being rerouted, or a period of congestion). Many networks that have had early uptime requirements below 99.99% are now relying on the network much more and are revising uptime requirements to 99.99% or greater.

At 99.999%, most systems begin to push their operational limits. This level of performance, which indicates that the network is highly relied on (e.g., for mission-critical applications), will have an impact on the network architecture and design in several areas, as will be discussed in later chapters.

Finally, an uptime beyond 99.999% approaches the current fringe of performance, where the effort and costs to support such a high degree of uptime can skyrocket. There are even some applications that cannot tolerate any downtime while in session. For these types of applications, uptime is 100% while in session. An example of such an application is the remote control of an event (e.g., piloting a vehicle, performing an operation), where downtime may result in loss of the vehicle or possible loss of life (one of the criteria for a mission-critical application). In cases like this, however, the times of high availability are often known well in advance (scheduled) and can be planned for.

I should note at this point that many system outages are very brief (transients) and may not affect users or their applications. In some cases the users may not even know that a problem existed, as the outage may manifest itself merely as a pause in an application. Yet such events are part of the reliability estimate, especially in networks that have strict limits on reliability. Therefore, although a 10-minute weekly downtime (or 99.9% uptime) would be noticeable (e.g., application sessions dropping) if it occurred all at once, it could actually be a distribution of several 15-second transients, each of which results in applications stalling for several seconds.

With this information and the previous table of uptime estimates, we can present a general threshold for uptime. The general threshold for uptime, based on practical experience, is 99.99%. When this general threshold is applied, uptime requirements of less than 99.99% are considered low performance and uptime requirements of 99.99% or greater are considered high performance. Remember that this general threshold is used in the absence of any environment-specific thresholds that may be developed for your network as part of the requirements analysis process.

Measuring Uptime

In the previous section, we asked, "What does that (99.999% uptime) really mean?" In Chapter 1, we noted that requirements for services should be configurable, measurable, and verifiable within the system. This is one of the biggest problems with a requirement for a percentage of uptime or downtime: how can it be configured, measured, or verified?

In terms of measuring uptime, this question may be asked in three parts: When should it be measured (frequency), where should it be measured, and how should it be measured (service metrics)? First, let's consider the frequency of measurement. A problem with stating uptime solely as a percentage is that it does not relate a time factor. Consider, for example, a requirement for 99.99% uptime. Without a time factor, that can mean a downtime of 53 minutes per year, 4.4 minutes per month, or 1 minute per week. Without stating a time factor, there can be a single outage of up to 53 minutes. As long as that is the only outage during the year, this availability requirement may be met. Some networks can tolerate a cumulative downtime of 53 minutes per year but may not be able to handle a single downtime of that magnitude. There is a big difference between one large outage and several smaller outages.

Stating a time factor (frequency) along with uptime makes that requirement more valid. For networks that cannot tolerate large outages but can tolerate several small outages, 99.99% uptime measured weekly may make more sense than just 99.99% uptime. By stating "measured weekly," you are forcing the requirement that outages can be no larger than 1 minute total per week.

Next let's consider where uptime should be measured. Stating where uptime is measured is as important as stating its frequency. If nothing is said about where it is measured, the assumption is that a downtime anywhere in the network counts against overall uptime. For some networks, this may be the case and as such should be explicitly stated as a requirement. For many networks, however, uptime is more effective when it is selectively applied to parts of the network. For example, uptime to servers or specialized devices may be more important than general uptime across the network. If this is the case, it should be included as part of the performance requirement.

Figure 3.11 shows an example where uptime would apply everywhere in the network.

Figure 3.11: Uptime measured everywhere.

The loss of service to any device in this situation would count against overall uptime. In Figure 3.12, uptime has been refined to apply only between each user LAN and the server LAN. In this example, if service is lost between the user LANs, it would not count against an overall uptime requirement.

Figure 3.12: Uptime measured selectively.

In the aforementioned examples, we see a trade-off between being able to apply uptime everywhere and having a more precise, more achievable application of uptime. It is possible, however, to have both apply to the same network: you can apply one standard everywhere in the network and have another standard that is applied selectively.

Often, uptime is measured end-to-end, either between user devices (generic computing devices) or between networks. For example, it may be measured end-to-end between user networks, at network monitoring stations, or at network interfaces. Measurement points include the LAN/WAN interface, router entry/exit points, and monitoring stations distributed throughout the network. Determining where to measure uptime is particularly important when service providers are embedded in the system or when services are demarcated between well-defined areas of the network.

Finally, how you measure uptime is also important. As we will see later in this section, it may be measured in terms of lack of connectivity or as a loss rate (bit error rate [BER], cell loss rate, frame loss rate, or packet loss rate). The method you use to measure uptime will have an impact on how it can be configured within the network devices, as well as how it will be verified.

When developing requirements for uptime, you should keep in mind that some downtime on the network needs to be scheduled to allow for changes to be made to the network, hardware and software to be upgraded, or tests to be run. Scheduled downtime for maintenance should not count against the overall uptime of the network, and the amount of planned downtime (frequency and duration) should be included in your performance requirement.

Indeed, given the ways that we specify where, when, and how uptime is measured, we can actually have multiple uptime requirements in the network. For example, there may be a general requirement across the entire network, measured everywhere, and another, higher performance requirement, measured only between vital devices in the network.

Thus, a requirement for uptime might look something like this:

Network Uptime (see Figures 3.11 and 3.12):

  1. The 99.99% uptime requirement is measured weekly at every router interface and user device in the network.

  2. The 99.999% uptime requirement is measured weekly for access to the server farm network, at the router interface to the server farm and at the server network interface cards (NICs). The application ping will also be used to test connectivity between each user LAN and the server LAN (a probe of this kind is sketched after this list).

  3. Note that these requirements do not apply to scheduled downtime periods for maintenance.
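
As an illustration of how requirement 2 might be verified, the following is a minimal sketch of a ping-based probe of the server LAN. It assumes a Unix-like host where ping -c and -W are available, uses placeholder addresses from the 192.0.2.0/24 documentation range, and counts an entire probe interval as downtime whenever any target is unreachable:

    import subprocess
    import time

    # Placeholder probe targets on the server LAN (documentation addresses).
    TARGETS = ["192.0.2.10", "192.0.2.11"]
    PROBE_INTERVAL_S = 15  # seconds between probes
    downtime_seconds = 0.0

    def reachable(host: str) -> bool:
        """One ICMP echo probe; assumes Linux-style 'ping -c 1 -W 2'."""
        result = subprocess.run(
            ["ping", "-c", "1", "-W", "2", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        return result.returncode == 0

    while True:
        if not all(reachable(host) for host in TARGETS):
            # Count the whole interval as downtime for the server LAN;
            # compare the weekly total against the 99.999% budget (~6 s).
            downtime_seconds += PROBE_INTERVAL_S
        time.sleep(PROBE_INTERVAL_S)

Note that 99.999% measured weekly allows only about 6 seconds of downtime, so the 15-second probe interval used here limits how finely downtime can be measured; an operational measurement system would probe far more often or rely on device-level fault notifications.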

3.6.4 Thresholds and Limits

RMA requirements may include descriptions of thresholds and/or limits. RMA requirements are gathered and/or derived for each application in the network from discussions with the users of that application, as well as from documentation on the application and from testing the application on the existing network or on a testbed network. From the requirements, you can often determine environment-specific thresholds (or limits) for each application by plotting application performance levels. From this, you may determine what constitutes low- and high-performance RMA for your network. In the absence of environment-specific thresholds, you can use the general thresholds presented in this chapter.

In addition, check RMA guarantees (if any) for services and/or technologies that exist in the current network or that are likely to be in the planned network. For example, Pacific Bell's switched multimegabit data service (SMDS) has a stated availability in terms of a mean time between service outages of 3500 hours or more, with an MTTR of 3.5 hours or less. On the other hand, the Bellcore specification for SMDS calls for an availability of 99.9%. From our earlier discussion, we can see that both of these specifications describe similar availability characteristics, with the MTBCF/MTTR specification being the more specific of the two.
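
As a quick check, substituting these figures into the availability equation from Section 3.6.3 gives

A = (3500)/(3500 + 3.5) = 0.999

which agrees with the Bellcore figure of 99.9%.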

Some of these estimation techniques require knowledge of which technologies and/or services exist or are planned for the system. At this point in the process we really should not know which technologies and/or services we will be using for the network (since those are determined in the architecture and design processes), so these techniques may be used on the technologies and/or services that are in the current network or on a set of candidate technologies and/or services for the planned network. Later in the architecture and design processes, when we have chosen technologies and/or services for the network, we can apply the information gathered here for each of those technologies and/or services.

An example of a general threshold for RMA is the one for uptime. Users normally expect the system to be operational as close to 100% of the time as possible. Uptime can get close to 100%, within a tenth, hundredth, or sometimes a thousandth of a percent, but with trade-offs in system complexity and cost. Earlier in this chapter, we described a general threshold for uptime, based on experience and observation, of 99.99%. In general, requirements for uptime that are 99.99% or greater are considered high performance, and those that are less than 99.99% are low performance. In addition, a general threshold of approximately 99.5% can be used to distinguish a low-performance requirement from that of prototypes and testbeds. These thresholds are shown in Figure 3.13.

Figure 3.13: Thresholds between testbed, low-performance, and high-performance uptime.
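
A minimal sketch of how these general thresholds might be applied when reviewing uptime requirements (the handling of values that fall exactly on a boundary is an assumption):

    def classify_uptime(uptime_percent: float) -> str:
        """Classify an uptime requirement against the general thresholds
        discussed in this chapter (99.5% and 99.99%)."""
        if uptime_percent < 99.5:
            return "testbed/prototype"
        if uptime_percent < 99.99:
            return "low performance"
        return "high performance"

    print(classify_uptime(99.9))    # low performance
    print(classify_uptime(99.999))  # high performance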

Note that any environment-specific thresholds that are developed for your network would supersede these general thresholds.

Error or loss rate can be used as a measurement of uptime. For example, for an uptime requirement of 99.99%, packet or cell loss can be used to measure this performance. For many applications, experience has shown that a 2% packet loss is sufficient to drop application sessions. This can be considered downtime and is in fact a more practical way to measure and verify uptime than attempting a purely mathematical representation.

Using a 2% loss threshold and measuring packet loss in the network, times when the packet loss is 2% or greater are counted as downtime. For 99.99% measured weekly, this means 1 minute per week during which the network can have 2% or greater packet loss (see Section 3.9.3).
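
As a sketch of this measurement (the per-minute loss samples below are invented for illustration; in practice they would come from network management polling), downtime can be accumulated from loss measurements and compared against the weekly budget:

    # One week of hypothetical per-minute packet-loss samples (fractions):
    # 10 bad minutes out of the 10,080 minutes in a week.
    loss_samples = [0.001] * 10_070 + [0.05] * 10

    LOSS_THRESHOLD = 0.02            # 2% or greater loss counts as downtime
    MINUTES_PER_WEEK = 7 * 24 * 60   # 10,080

    downtime_minutes = sum(1 for loss in loss_samples if loss >= LOSS_THRESHOLD)
    measured_uptime = 100.0 * (1 - downtime_minutes / MINUTES_PER_WEEK)

    print(f"Downtime this week: {downtime_minutes} minutes")
    print(f"Measured uptime: {measured_uptime:.3f}%")  # 99.901%, below 99.99%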

In addition to the aforementioned thresholds, performance and service guarantees will also be listed as application requirements. Determining uptime requirements is an iterative process. As users, devices, applications, and networks evolve, their requirements will need to be adjusted.



