Technical Quality Metrics for Transport Services


There are common low-level technical quality metrics that can be applied to network infrastructures to paint a picture of overall service quality. They are as follows:

  • Workload and required bandwidth

  • Availability and packet loss

  • One-way latency

  • Round-trip latency

  • Jitter

These were first introduced in Chapter 2, "Service Level Management," and they are expanded upon here.

Workload and Bandwidth

Some services require a guaranteed, unchanging amount of bandwidth to function properly; others simply require a certain amount of bandwidth averaged over a long interval. In either case, the bandwidth required is a function of the workload being applied to the transport system, and it is usually measured in bytes per second over specified intervals, with median and 95th percentile values reported. Even the simplest transport devices provide basic counts of the bytes into and out of each device interface; in most cases, packet or frame counts are provided as well.

Data transport providers sell bandwidth in terms of similar measures. Bandwidth is often provided in terms of a guaranteed rate along with a maximum burst rate above that guaranteed rate. For example, a Frame Relay circuit has a Committed Information Rate (CIR); data in excess of that rate is tagged as being discard eligible and may be discarded without notice by the network. Asynchronous Transfer Mode (ATM) has a sustainable cell rate, which is the average bandwidth over a long period, and a peak cell rate, which is the maximum bandwidth allowed over the period defined by the maximum burst size. ATM can provide both steady bandwidth guarantees (constant bit rate) and average bandwidth guarantees (for example, variable bit rate); other technologies have similar services.

ISPs usually bill either by the total number of bytes transmitted in a month or by a more complex formula that looks at peak usage.

When billing by the total number of bytes, the ISP uses the monthly byte count produced by the router connecting to the subscriber. If the count goes above the agreed-on number of gigabytes, there's an additional charge.

Billing by peak usage works as follows: At each five-minute interval in the entire month, the ISP measures both input and output bandwidth in bytes/second. The higher value is recorded as the peak usage for each five-minute interval. At the end of the month, the 95th percentile value of all of those measurements is used as the basis for billing. An effect of this is that the top five percent of the five-minute samples are ignored; you can therefore burst up to the maximum bandwidth of your access line for up to five percent of the month without any additional cost.
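
As a concrete illustration, the following sketch computes the billable 95th-percentile rate from a month of five-minute samples. The nearest-rank percentile method and the sample data here are assumptions for illustration; individual providers may use slightly different conventions.

def billable_rate(samples_bps):
    """Return the 95th-percentile value of the per-interval peak rates.

    samples_bps holds one value per five-minute interval, each already the
    higher of the input and output rates (bytes/second) for that interval.
    """
    ordered = sorted(samples_bps)
    # Discard the top 5 percent of samples; bill on the highest remaining one.
    index = int(len(ordered) * 0.95) - 1
    return ordered[max(index, 0)]

# A 30-day month has 30 * 24 * 12 = 8640 five-minute intervals, so the top
# 432 samples (5 percent) are ignored when the billable rate is computed.
if __name__ == "__main__":
    import random
    random.seed(1)
    month = [random.randint(100_000, 2_000_000) for _ in range(8640)]
    print(f"Billable rate: {billable_rate(month):,} bytes/second")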

In both cases (total number of bytes transmitted or peak usage), the workload measured is not precisely the same as the workload or bandwidth as seen by the application. If errors interfere with data packets, and those packets are therefore retransmitted, the low-level workload metrics usually count those packets again. The paradoxical result is that as link quality deteriorates, the byte count carried by that link rises!

Availability and Packet Loss

Availability on a communications link is usually represented as the percentage of time that the link is electrically operating and can carry traffic at better than a specified error rate. For example, a digital link that is powered and that provides a clocking signal is available to carry traffic, and the error rate over that link is described in terms such as error-free seconds. Because errors on digital links normally occur in severe bursts, the use of error-free seconds as a measure of link availability is understandable. During the period when errors are occurring, the link is probably unusable; at other times, there are probably no errors at all. In some cases, a supplier will specify that any period of unavailability less than, for example, one hour does not count in the availability metrics.

Note that a link with a moderate, steady error rate may interfere with so many data packets that the link's effective throughput is quite small. (Effective throughput is the throughput of a link after retransmissions are taken into account. It varies according to the link's error rate and the particular error-recovery protocols in use on that link.) A simple measure of availability or packet loss on the link would show only the moderate error rate, while a more sophisticated measure of the impact of that error rate on a particular application might show that the link was, for all practical purposes, unavailable.
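
As a rough back-of-the-envelope model (not from the text, and much simpler than real protocol behavior, especially TCP's), effective throughput can be approximated by discounting the nominal link rate by the fraction of blocks that must be retransmitted:

def effective_throughput(link_rate_bps, block_error_rate):
    """Approximate goodput when errored blocks are retransmitted until they
    succeed and the retransmissions consume the same link capacity."""
    return link_rate_bps * (1.0 - block_error_rate)

# Example: a 1.544 Mbps T1 on which 20 percent of blocks are errored delivers
# only about 1.235 Mbps of useful data under this simple model.
if __name__ == "__main__":
    print(f"{effective_throughput(1_544_000, 0.20):,.0f} bits/second")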

Multimedia streaming is designed to tolerate limited "noise" resulting from packet loss because an interruption for rebuffering is more disruptive to the viewer than the small anomaly introduced by a dropped packet in the audio/video stream. Because interruptions of audio are more objectionable to end users than short freezes in video, most multimedia streaming servers give the audio portion of the signal preference over the video portion if effective throughput is restricted.

Packet loss for transactions does not cause transaction failure because transaction protocols automatically retransmit as necessary to ensure error-free completion. Any packet losses merely add delay to the total transaction completion time.

To avoid the necessity of specifying the precise error-recovery protocols used on links and their impact on perceived packet loss and effective throughput, thereby creating a very complex measure, organizations can specify error rates in terms of block error rate (or packet error rate, or ATM's cell error ratio, and so on). After all, it doesn't matter if there are one or more errors within a particular data block; any error at all will require retransmission of the entire block if perfect transmission is required. The usual exception is streaming media, which can accept low error rates without major impact on the application; in those cases, simple bit error rates may be sufficient.
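
To make the relationship concrete, the sketch below converts a bit error rate into an approximate block (packet) error rate under the simplifying assumption of independent bit errors; as noted earlier, real errors tend to arrive in bursts, so treat the result only as a rough estimate.

def block_error_rate(bit_error_rate, block_size_bytes):
    """A block survives only if every one of its bits survives."""
    bits = block_size_bytes * 8
    return 1.0 - (1.0 - bit_error_rate) ** bits

# Example: a 1500-byte packet over a link with a 1e-6 bit error rate has
# roughly a 1.2 percent chance of requiring retransmission.
if __name__ == "__main__":
    print(f"Approximate packet error rate: {block_error_rate(1e-6, 1500):.2%}")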

Packet loss over an Internet connection may be defined very coarsely in terms of ping ratios. For that measure, short "ping" packets are transmitted in a burst and are immediately echoed back by the destination. The percentage of packets that return within a defined, short time window is taken as the success ratio. This is an extremely coarse approximation because Internet paths are variable, and error rates can fluctuate greatly because of intersecting traffic flows. The effect of errors on TCP communications is also quite complex, as TCP is very sensitive to the particular pattern of errors over time. Therefore, a brief burst of ping packets may not be representative of the effective performance of the connection as seen by TCP. Block-oriented error ratios for the underlying transport, such as ATM's cell error ratio or similar measures for Frame Relay, are therefore preferable. They're not perfect for TCP, but they're better than ping ratios.
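
A crude ping-ratio probe along these lines can be scripted directly against the system ping utility; the sketch below assumes a Linux-style ping (-c for count, -W for a timeout in seconds) and an illustrative target address.

import subprocess

def ping_success_ratio(host, count=20, timeout_s=1):
    """Send a burst of echo requests and return the fraction answered in time."""
    successes = 0
    for _ in range(count):
        # One echo request per invocation; exit code 0 means a reply arrived
        # within the timeout window.
        result = subprocess.run(
            ["ping", "-c", "1", "-W", str(timeout_s), host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode == 0:
            successes += 1
    return successes / count

if __name__ == "__main__":
    print(f"Success ratio: {ping_success_ratio('192.0.2.1'):.0%}")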

One-Way Latency

Services in the interactive classes may require one-way delay measurements. Routing protocols often select different paths in each direction between a pair of nodes, so the latency in the two directions can be considerably different. (Each ISP typically tries to hand a packet to another ISP as soon as possible, thereby decreasing the distance it must carry the packet.) See Figure 10-1.

Figure 10-1. Internet Latencies and Asymmetric Routes


The challenge is getting the time measurement between two unsynchronized hosts. The Network Time Protocol (NTP) can be used; there are commercial variants that offer similar functionality. A more expensive, and more accurate, approach is using Global Positioning System (GPS) receivers to measure time and synchronize the clocks at each site.
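
Assuming the two hosts' clocks have already been synchronized (by NTP or GPS, as described), a one-way delay probe reduces to embedding a send timestamp in each packet and subtracting it from the arrival time at the receiver. The sketch below is illustrative only; the port number and packet format are arbitrary choices.

import socket
import struct
import time

PROBE_PORT = 9000  # arbitrary port chosen for this example

def send_probe(dest_ip):
    """Sender side: stamp the packet with the local send time."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(struct.pack("!d", time.time()), (dest_ip, PROBE_PORT))

def receive_probe():
    """Receiver side: one-way delay is arrival time minus the embedded send
    time. The result is only meaningful if both clocks are synchronized."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", PROBE_PORT))
    data, _ = sock.recvfrom(64)
    (sent,) = struct.unpack("!d", data[:8])
    return time.time() - sent  # seconds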

Round-Trip Latency

Round-trip latency is the common metric for transactional and (some) interactive services. Synchronizing clocks is not an issue because the initiator gets a response and can easily determine the elapsed time, which is the metric of interest.

Round-trip latency on the Web is easily measured by using the time elapsed between the first two steps in the establishment of a TCP connection. (A "SYN" packet is sent out, and a "SYN ACK" packet is returned.) The turnaround time at the destination is minimal, as it does not involve the destination application. Most active measurement collectors provide that measurement, labeling it initial connection time or TCP connection time. It can also be obtained from "ping" packets.
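
One simple way to approximate that measurement is to time how long a TCP connect() call takes to complete; the handshake finishes before the destination application is involved, so the result is close to pure network round-trip latency. The target host and port below are illustrative.

import socket
import time

def tcp_connect_time(host, port=80, timeout=3.0):
    """Return the elapsed time, in seconds, for the connect() call, which
    spans the SYN / SYN-ACK exchange described above."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        elapsed = time.perf_counter() - start
    return elapsed

if __name__ == "__main__":
    print(f"TCP connection time: {tcp_connect_time('www.example.com') * 1000:.1f} ms")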

Jitter

Jitter is the variance in packet arrival times. It can have a serious impact on service quality, although buffering in the receiving host helps considerably. Anyone with a CD in his or her car appreciates the value of buffering and knows its limitations when there are too many bumps in the road. The real problem is the additional delay added by the dejitter buffer; it's usually one or two times the expected jitter. For interactive use of the network, such as voice communications, ITU-T's standard G.114 suggests a maximum one-way latency of 150 milliseconds (ms). Therefore, a dejitter buffer of, say, 50 ms would form a large part of the latency budget.

Jitter is measured by tracking the arrival time of each successive packet and calculating the variation in the intervals between them, assuming the transmitter is introducing traffic into the network at a constant rate. Some commercial tools enhance this measure with additional analytics; for example, the NetIQ product calculates the distribution of the jitter measures to provide more insight into the range of underlying performance.
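
A minimal version of that calculation is sketched below: given a list of arrival timestamps from a constant-rate sender, it reports the variance of the inter-arrival intervals. Commercial tools and standards differ in the exact statistic they use (RFC 3550, for instance, defines a smoothed inter-arrival jitter estimator for RTP), so this is only one reasonable choice.

import statistics

def interarrival_jitter(arrival_times):
    """Return the variance (seconds squared) of successive inter-arrival gaps."""
    gaps = [later - earlier for earlier, later in zip(arrival_times, arrival_times[1:])]
    return statistics.pvariance(gaps)

# Example: packets sent every 20 ms but arriving with some wobble.
if __name__ == "__main__":
    arrivals = [0.000, 0.021, 0.039, 0.062, 0.080, 0.104]
    print(f"Jitter (variance of gaps): {interarrival_jitter(arrivals):.6f} s^2")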



