8.1 Quality-of-service models

QoS is not a new concept; several models have been proposed and implemented over the years to offer quality of service, with varying degrees of success. Often these mechanisms have been tied to specific media or offered as proprietary solutions by vendors. What has been conspicuously lacking until recently is an overall architecture for handling QoS on an end-to-end basis for traffic moving from private to public networks over a variety of media types. One of the other key stumbling blocks in the past was how to provide a generic QoS when there are multiple protocols to consider. Now we are beginning to see that IP has become pervasive enough to simplify this task. We can broadly classify these alternative models into the following categories:

  • Service marking

  • Label switching

  • Integrated services/RSVP

  • Static per hop classification

  • Relative priority marking/differentiated services

These models are discussed briefly in this section. First we need to define how traffic is to be identified, since this is fundamental to the provision of any service level.

8.1.1 Traffic differentiation and flow handling

Over recent years there has been increasing emphasis on a more intelligent approach to dealing with IP datagram traffic. Datagrams are fine for delivery; however, to offer value-added services such as quality of service we need a higher-level view that can pick out individual packets and associate them with a particular session context. In reference [1] the concept of the packet train was introduced; a more widely used term for this association is flow.

Flows

So far we have mentioned the term "flow" without really defining it. A flow is an abstraction that is fundamental to several major technologies used in traffic engineering, switching, network security, and quality of service. It is useful to reflect on some of the definitions provided in the standards. Reference [2] defines a flow as follows: IP flow (or simply flow) is an IP packet stream from a source to a destination (unicast or multicast) with an associated Quality of Service (QoS) and higher-level demultiplexing information. The associated QoS could be best effort.

Reference [3] defines a microflow as a single instance of an application-to-application flow of packets that is identified by source address, source port, destination address, destination port, and protocol ID.

We define a flow as a distinguishable stream of related datagrams that results from a single user activity and requires the same quality of service. For example, a flow might comprise a transport connection or a video stream between a specific host pair (see Figure 8.1). We define a flow to be simplex. Thus, an N-way teleconference will generally require N flows, one originating at each site. In order to provide scalable traffic engineering solutions, flows with similar service requirements may be combined to form an aggregated flow.

Figure 8.1: Flow identification. Here we see two flows highlighted amidst background traffic.

In practice there are many flows to consider, depending upon the level of granularity required. We could, for example, single out an individual FTP connection between two nodes by isolating flows at the TCP or UDP port level. For some applications it may be appropriate to use higher-level fields to identify flows. However, as granularity increases it becomes increasingly difficult to achieve flow classification efficiently in real time, especially if the fields of interest are not at fixed offsets (this is one area for vendor differentiation).
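
As a concrete illustration of flow granularity, the following minimal sketch (in Python, with invented packet field names) groups packets either by the address pair alone or by the full 5-tuple of addresses, protocol, and ports. It is a simplified model of the idea, not a description of any particular product.

```python
# Hypothetical sketch: grouping packets into flows at two levels of granularity.
# Field names (src_ip, dst_ip, proto, src_port, dst_port) are illustrative only.
from collections import defaultdict

def flow_key(pkt, granularity="port"):
    """Return a tuple identifying the flow a packet belongs to.

    granularity="host" groups by address pair only (coarse);
    granularity="port" adds protocol and ports (a microflow).
    """
    if granularity == "host":
        return (pkt["src_ip"], pkt["dst_ip"])
    return (pkt["src_ip"], pkt["dst_ip"], pkt["proto"],
            pkt["src_port"], pkt["dst_port"])

flows = defaultdict(list)
packets = [
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.2", "proto": 6,  "src_port": 40001, "dst_port": 21},
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.2", "proto": 17, "src_port": 5004,  "dst_port": 5004},
]
for pkt in packets:
    flows[flow_key(pkt)].append(pkt)
print(f"{len(flows)} distinct microflows observed")   # 2
```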

Once flows are identified the packets they contain can be handled in a manner consistent with policy rules. Currently the Internet handles IP packet forwarding in a fairly democratic manner; all packets effectively receive the same quality of service, since packets are typically forwarded using a strict FIFO queuing discipline. For new QoS models, such as integrated services, a router must be capable of identifying and implementing an appropriate service quality for each flow. This requires that routers support flow-specific states and represents a fundamental change to the Internet model (the Internet architecture was founded on the concept that all flow-related states should be a function of end systems [4]).

As indicated, recognizing flows in real time is not without its problems; it depends upon the protocols used; the presence of addressing, port, or Service Access Point (SAP) fields; and whether or not the system performing flow identification is smart enough to be stateful. For example, in Figure 8.1 there are UDP flows within the background traffic (specifically a TFTP session on port 69). TFTP illustrates a particular problem for flow identification, since it uses ports dynamically (the well-known port 69 is not used beyond the connection phase, and a new port is allocated from the pool). In order to identify all packets that form part of the same TFTP flow, stateful monitoring of the session and recognition of events within the TFTP payload are required. When dynamic ports are used, it is necessary to track the session from the outset; otherwise, it may be impossible to determine the application type and the appropriate QoS. As we shall see later, flow states can be implemented as soft states in devices such as routers to provide transparency.
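
The soft-state idea mentioned above can be sketched as a flow table whose entries are refreshed by matching traffic and silently expire when idle, so no explicit teardown signaling is needed. The timeout value and data structure below are illustrative assumptions only.

```python
# Hypothetical sketch of a soft-state flow table: entries are kept alive by
# traffic and time out on their own, falling back to best-effort handling.
import time

class SoftStateFlowTable:
    def __init__(self, timeout=30.0):
        self.timeout = timeout
        self.entries = {}          # flow key -> (service class, last seen)

    def refresh(self, key, service_class):
        self.entries[key] = (service_class, time.monotonic())

    def lookup(self, key):
        item = self.entries.get(key)
        if item is None:
            return None
        service_class, last_seen = item
        if time.monotonic() - last_seen > self.timeout:
            del self.entries[key]  # state has expired; treat as best effort
            return None
        return service_class
```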

Flow handling

Within an intermediate system, such as a router, incoming traffic needs to be unambiguously sorted so that it can be handled appropriately. For example, if SAP traffic from a particular network is to be given dedicated bandwidth and assigned a certain priority, then this traffic needs to be separated from the mass of incoming packets and then queued and scheduled for release to meet the policy objectives. To achieve all this we need specialized system components, each with a clearly delineated responsibility, as follows:

  • Admission control

  • Packet classifier

  • Packet scheduler

Figure 8.2 illustrates how the various flow-handling components inter-work in a generic system.

Figure 8.2: Integrated services model for a host and a router.

Admission control

When a packet arrives at an interface (e.g., a router wide area port), it may need to convince admission control that it should be given special treatment. Admission control contains the decision algorithm that a device uses to determine if there are sufficient resources to accept the requested QoS for a new flow. If there are not enough free resources, accepting that flow could impact earlier guarantees, so the new flow must either be rejected (in the IS model) or handled on a best-effort basis (with some feedback mechanism to inform the sender). If the new flow is accepted, then the packet classifier and packet scheduler are configured to handle this new flow so that the requested QoS can be realized.
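
A minimal sketch of the admission-control decision described above, assuming that bandwidth on the outgoing link is the only resource being tracked. The class name and capacity figures are invented for illustration, not taken from any particular product.

```python
# Hedged sketch: accept a new flow only if its reservation cannot disturb
# guarantees already made to earlier flows.

class AdmissionControl:
    def __init__(self, link_capacity_bps):
        self.capacity = link_capacity_bps
        self.reserved = 0.0

    def request(self, flow_id, requested_bps):
        """Return True if the flow is admitted with its requested bandwidth."""
        if self.reserved + requested_bps <= self.capacity:
            self.reserved += requested_bps
            return True          # classifier and scheduler would now be configured
        return False             # reject (IS model) or fall back to best effort

ac = AdmissionControl(link_capacity_bps=2_000_000)   # e.g., a 2 Mbps WAN port
print(ac.request("video-1", 1_500_000))   # True
print(ac.request("video-2", 1_000_000))   # False: would exceed capacity
```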

Packet classifier

A classifier is a component of a system that defines how the system should sort incoming traffic (packets, frames, or cells) into appropriate groups in order that they can be scheduled with the appropriate service level. For example, a classifier could be used to differentiate real-time traffic from non-real-time traffic, multicast traffic from unicast traffic, TCP from UDP traffic, or traffic based on specific ToS settings. Classifiers can be broadly divided into two types, as follows:

  • Flow classifiers—are applicable to IP unicast and multicast traffic only, typically using criteria such as the IP protocol number, IP source-destination address/mask, and source-destination port number (or range of ports). On a system where multiple classifiers are implemented, there needs to be a method of determining which classifier has preference if the packet classification data are sufficiently ambiguous to be handled by more than one controlled classifier. For example, classifiers can be numbered so that the lower-numbered classifier takes preference if there is a conflict.

  • Nonflow classifiers—are applicable to both switched and routed traffic, and operate using criteria such as the protocol type (e.g., IP = 0x0800), a cast type (unicast, multicast, or broadcast), or IEEE 802.1p priority tag values (as described later).

For the purpose of this chapter we are primarily interested in flow classifiers. Since the data-networking world is now predominantly IP-oriented, considerable research and development effort is focused on IP packet classification. Flow classifiers are divided into different types, depending upon the service model implemented (e.g., in the IS model, classifiers are divided into multifield and behavior aggregate classifiers; these are described later in this chapter). Nonflow classifiers are appropriate for switched VLANs, multiprotocol LANs, and local area networks where protocol stacks do not support Layer 3 information (e.g., DEC LAT traffic).

The flow classifier identifies packets of an IP flow in hosts and routers that will receive a certain level of service. To realize effective traffic control, each incoming packet is mapped by the classifier into a specific traffic class. All packets that are classified in the same class get the same treatment from the packet scheduler. The choice of a class is based upon the source and destination IP address and port number in the existing packet header or an additional classification number, which must be added to each packet. A class can correspond to a broad category of flows. For example, all video flows from a videoconference with several participants can belong to one service class. But it is also possible that only one flow belongs to a specific service class.
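
The classification step just described can be sketched as an ordered set of rules in which the lowest-numbered matching rule assigns the traffic class, as suggested earlier for resolving classifier conflicts. The rule fields, numbers, and class names below are invented for illustration.

```python
# Hedged sketch of a flow classifier: ordered rules map header fields to a
# traffic class; the lowest-numbered matching rule wins.

RULES = [
    # (rule number, protocol, dst port, traffic class)
    (10, 17, 5004, "video"),        # UDP to port 5004 -> video class
    (20, 6,  80,   "web"),          # TCP to port 80   -> web class
    (99, None, None, "best-effort"),
]

def classify(pkt):
    for _, proto, dport, traffic_class in sorted(RULES):
        if proto is not None and pkt["proto"] != proto:
            continue
        if dport is not None and pkt["dst_port"] != dport:
            continue
        return traffic_class
    return "best-effort"

print(classify({"proto": 6, "dst_port": 80}))    # web
print(classify({"proto": 6, "dst_port": 25}))    # best-effort
```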

Packet scheduler

The packet scheduler manages the forwarding of different flows in hosts and routers, based on their service class, using queue management and various scheduling algorithms. The packet scheduler must ensure that the packet delivery corresponds to the QoS parameter for each flow. A scheduler can also police or shape the traffic to conform to a certain level of service. The packet scheduler must be implemented at the point where packets are queued. This is typically the output driver level of an operating system and corresponds to the link layer protocol.
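
As an illustration of the scheduler's role, the following sketch keeps one FIFO queue per traffic class and drains them with a simple weighted round-robin discipline. The weights and class names are assumptions, and real schedulers (WFQ, CBQ, and so on) are considerably more sophisticated.

```python
# Hedged sketch of a per-class packet scheduler using weighted round robin.
from collections import deque

class WrrScheduler:
    def __init__(self, weights):
        self.weights = weights                       # class -> packets per round
        self.queues = {c: deque() for c in weights}

    def enqueue(self, traffic_class, pkt):
        self.queues[traffic_class].append(pkt)

    def next_round(self):
        """Yield the packets forwarded in one scheduling round."""
        for traffic_class, weight in self.weights.items():
            q = self.queues[traffic_class]
            for _ in range(min(weight, len(q))):
                yield q.popleft()

sched = WrrScheduler({"video": 3, "web": 1, "best-effort": 1})
sched.enqueue("video", "v1")
sched.enqueue("web", "w1")
print(list(sched.next_round()))   # ['v1', 'w1']
```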

An early application of flows—IP switching

Ipsilon Networks, Inc. (now integrated into Nokia Telecommunications) raised a few eyebrows in the mid-1990s with an innovative technique, whereby IP sessions were mapped dynamically onto connection-oriented ATM VCs. This feature, called IP switching, enabled IP for the first time to benefit directly from ATM prioritization and QoS mechanisms. The basic idea was to route datagram IP traffic as normal, while monitoring traffic for any identifiable flows. Note that Ipsilon's definition of a flow is "a sequence of IP packets sent from a particular source to a particular destination sharing the same protocol type, type of service, and other characteristics as determined by information in the packet header."

Flows are classified as long-lived and short-lived. Table 8.1 illustrates some typical examples. Once a flow is identified, an ATM VC can be established dynamically, and this identified flow is then taken out of the routing path and switched directly over the ATM VC.

Table 8.1: Typical Flow-Oriented Traffic versus the Types of Data That Normally Do Not Qualify as a Flow

Long-Lived Flows          | Short-Lived Flows
FTP Data                  | DNS Query
Telnet Data               | SMTP Data
HTTP Data                 | NTP
Web Image Downloads       | POP
Multimedia (audio/video)  | SNMP

Ipsilon's implementation requires two hardware devices: an IP switch (incorporating an ATM fabric) and a switch controller (a fairly standard UNIX routing platform incorporating flow classification software). IP switching operates as follows:

  • At system startup, each IP node sets up a virtual channel on each of its ATM physical links to be used as the default-forwarding channel.

  • An ATM input port on the IP switch receives incoming traffic from the upstream device on the default channel and forwards it to the IP switch controller.

  • The IP switch controller forwards packets over the default-forwarding channel. It also performs flow classification to assess suitability for switching out of the slow routed path. Short-lived flows, or unidentified packet streams, are ignored and continue to be forwarded using hop-by-hop store-and-forward routing.

  • Once a long-lived flow is identified (e.g., an FTP session), the switch controller asks the upstream node to label that traffic using a new virtual channel.

  • If the upstream node concurs, the traffic starts to flow on the new virtual channel. Independently, the downstream node also can ask the IP switch controller to set up an outgoing virtual channel for the flow.

  • When the flow is isolated to a particular input and output channel, the IP switch controller instructs the switch to make the appropriate port mapping in hardware, bypassing the routing software and its associated processing overhead. Long duration flows are thus switched out of the slow routing path (using cut-through switching) and mapped directly onto ATM SVCs.

This design enables IP switches to forward packets at rates limited only by the aggregate throughput of the underlying switching engine. First-generation IP switches supported up to 5.3 million packets per second throughput, and since there is no need to reassemble ATM cells into IP packets at intermediate switches, throughput remains optimized throughout the IP network.

The technology introduced by Ipsilon is now obsolete (largely because it failed to gain mindshare and because the technique was ATM specific, rather than through any lack of respect for the technology itself). Cisco recognized the potential threat this technology posed to its market dominance and countered strongly with its own technology, tag switching. Tag switching [5] classifies traffic by inserting special tag fields into packets as they enter the ingress of the backbone. Tags specify QoS requirements such as minimum bandwidth or maximum latency. Once a packet is tagged it can be switched through the network quickly without being reclassified. At the egress point the tag is removed by another tag-switching router. Cisco's implementation is currently available only in its high-end product lines (such as the 7000 series routers and BPX switches), and to some extent this has slowed deployment, since these are not edge devices. As discussed later in this chapter, the IETF is developing a standards-based tagging scheme called MultiProtocol Label Switching (MPLS), a superset of tag switching and related technologies such as Ipsilon's IP switching. MPLS incorporates scalability features such as flow aggregation onto large trunks. Both tag switching and MPLS require routers and switches to maintain flow state information, and this requires massive amounts of switch memory for backbone links. Nevertheless, MPLS is a key technology for the future and is of particular interest for backbone scalability and virtual private networking.

Standard models for ensuring consistent behavior

The purpose of these components is to collectively enable an individual device (such as a router) to comply with policy based on the Service-Level Agreement (SLA) contracted between the user and the service provider. One piece that is missing here is how to coordinate multiple devices in a network so that they all behave consistently when dealing with incoming traffic. There are a number of fundamentally different approaches to this problem, including the following:

  • The original IP model uses simple service marking via the Type of Service (ToS) field to indicate how the flow should be handled. IPv6 implements relative priority marking.

  • The differentiated services model uses relative priority marking to enable recipients to clearly identify the class of traffic received.

  • The integrated services model uses an explicit signaling protocol to reserve resources across a network of routers in advance of sending data.

  • Label switching, as implemented by ATM and MPLS, is used to set up virtual circuits, where the QoS state is maintained for the duration of the circuit.

We now briefly review these various models.

8.1.2 Service marking

Simple service marking model

An example of a service marking model is IPv4 Type of Service (ToS), as defined in [6], and illustrated in Figure 8.3. With IPv4 ToS, an application could mark each packet with a request for a particular type of service, which could include requests to minimize delay, maximize throughput, maximize reliability, or minimize cost. Intermediate network nodes may select routing paths or forwarding behaviors that are engineered to satisfy this service request.

Figure 8.3: IPv4 and IPv6 QoS-related fields.

The precedence field in IPv4 is defined in [6] as follows:

111—Network Control
110—Internetwork Control
101—CRITIC/ECP
100—Flash Override
011—Flash
010—Immediate
001—Priority
000—Routine

The ToS bits in IPv4 are defined in [6] as follows:

Bit 3: 0 = Normal Delay, 1 = Low Delay

Bit 4: 0 = Normal Throughput, 1 = High Throughput

Bit 5: 0 = Normal Reliability, 1 = High Reliability

Bits 6–7: Reserved for Future Use

Reference [7] redefines the use of the ToS bits to single enumerated values rather than as a set of bits each with specific meaning. It also assigns bit 6 for ToS applications, as follows:

xxx1000: minimize delay

xxx0100: maximize throughput

xxx0010: maximize reliability

xxx0001: minimize monetary cost

xxx0000: normal service

This model is subtly different from the relative priority marking model offered by IPv6. The ToS markings defined in [8] are very generic and do not span the range of possible service semantics. Furthermore, the service request is associated with each individual packet, whereas some service semantics may depend on the aggregate forwarding behavior of a sequence of packets (a flow). The service-marking model does not easily accommodate growth in the number and range of future services (since the codepoint space is small) and involves configuration of the ToS forwarding behavior association in each core network node. Standardizing service markings implies standardizing service offerings, which is outside the scope of the IETF. Note that provisions are made in the allocation of the Differentiated Services (DS) codepoint to allow for locally significant codepoints, which may be used by a provider to support service-marking semantics.

IPv6 class of service model

IPv6 incorporates two major enhancements to support Class of Service (CoS): a traffic differentiation (traffic class) field and the flow label, as follows:

  • Traffic differentiation: IPv6 is able to differentiate two traffic types: congestion controlled and noncongestion controlled. Congestion-controlled traffic tolerates delays in the face of network congestion (e.g., File Transfer Protocol—FTP). Non-congestion-controlled traffic requires smooth delivery under any load (e.g., isochronous content such as streaming audio and video, where there are tight constraints on jitter).

  • Flow label: a field in the IPv6 header that allows a source to label sequences of packets belonging to the same flow, so that routers can identify the flow and apply consistent handling without having to parse higher-layer headers.

8.1.3 Label switching

Examples of the label-switching (or virtual circuit) model include Frame Relay [9], ATM [10], and MPLS. In this model path forwarding state and traffic management or QoS state is established for traffic streams on each hop along a network path. Traffic aggregates of varying granularity are associated with a label-switched path at an ingress node, and packets or cells within each label-switched path are marked with a forwarding label that is used to look up the next-hop node, the per hop forwarding behavior, and the replacement label at each hop. This model permits finer granularity resource allocation to traffic streams, since label values are not globally significant but are only significant on a single link. Resources can, therefore, be reserved for the aggregate of packets or cells received on a link with a particular label, and the label-switching semantics govern the next-hop selection, allowing a traffic stream to follow a specially engineered path through the network. This improved granularity comes at the cost of additional management and configuration requirements to establish and maintain the label-switched paths. In addition, the amount of forwarding state maintained at each node scales in proportion to the number of edge nodes of the network in the best case (assuming multipoint-to-point label-switched paths), and it scales in proportion with the square of the number of edge nodes in the worst case, when edge-edge label-switched paths with provisioned resources are employed.
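
The per-hop label operation described above can be sketched as a lookup keyed on the incoming link and incoming label, returning the outgoing link, the replacement label, and a per-hop behavior. The table contents and interface names are invented for illustration.

```python
# Hedged sketch of label swapping: keys are only locally significant, so the
# same label value can be reused on different links without ambiguity.

label_table = {
    # (in interface, in label): (out interface, out label, per-hop behavior)
    ("if0", 17): ("if2", 42, "expedited"),
    ("if1", 17): ("if3", 98, "best-effort"),   # same label, different link
}

def forward(in_if, in_label):
    out_if, out_label, phb = label_table[(in_if, in_label)]
    # the label is swapped at every hop, so values never need global meaning
    return out_if, out_label, phb

print(forward("if0", 17))   # ('if2', 42, 'expedited')
```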

8.1.4 Integrated services and RSVP

Integrated Services (IS) builds upon the Internet's best-effort model with support for real-time transmission and guaranteed bandwidth for specific traffic flows. A flow is an identifiable stream of datagrams from a unique sender to a unique receiver (e.g., a Telnet session over TCP or a TFTP session over UDP). In most applications two flows are necessary (one in each direction). Applications that initiate flows can specify the required QoS (bandwidth, maximum packet delay, etc.).

The integrated services model relies upon traditional datagram forwarding in the default case but allows senders and receivers to exchange signaling messages, which establish additional packet classification and forwarding state on each node along the path between them [11]. In the absence of state aggregation, the amount of state on each node scales in proportion to the number of concurrent reservations, which can be potentially large on high-speed links. This model also requires application support for the Resource Reservation Protocol (RSVP) signaling protocol. Differentiated services mechanisms can be utilized to aggregate integrated services state in the core of the network [12].

A variation of the integrated services model eliminates the requirement for hop-by-hop signaling by utilizing only static classification and forwarding policies that are implemented in each node along a network path. These policies are updated on administrative time scales and not in response to the instantaneous mix of microflows active in the network. The state requirements for this variation are potentially worse than those encountered when RSVP is used, especially in backbone nodes, since the number of static policies that might be applicable at a node over time may be larger than the number of active sender-receiver sessions that might have installed reservation state on a node. Although the support of large numbers of classifier rules and forwarding policies may be computationally feasible, the management burden associated with installing and maintaining these rules on each node within a backbone network that might be traversed by a traffic stream is substantial.

8.1.5 Relative priority marking and differentiated services

Examples of the relative priority-marking model include IPv4 precedence marking as defined in [6], IEEE 802.5 Token Ring priority [13], and the default interpretation of 802.1p traffic classes [14]. In this model the application, host, or proxy node selects a relative priority or precedence for a packet (e.g., delay or discard priority), and the network nodes along the transit path apply the appropriate priority forwarding behavior corresponding to the priority value within the packet header. The Differentiated Services (DiffServ, or DS) architecture can be thought of as a refinement to this model, since it more clearly specifies the role and importance of boundary nodes and traffic conditioners and since its per hop behavior model permits more general forwarding behaviors than relative delay or discard priority.

DiffServ is a much simpler and more scalable solution than the integrated services model, since it does not require an additional signaling protocol. With DiffServ, traffic entering a network is classified and possibly conditioned at the boundary, and then assigned to different behavior aggregates. DiffServ uses the existing Type of Service (ToS) field in the IPv4 header [8]. Bits in this field have been reallocated by DiffServ to indicate the service characteristics required (in terms of delay, throughput, and packet loss). DiffServ specifies how the network should treat each packet on a hop-by-hop basis, using the ToS field. The ToS field is marked at the edge (access point) of the network by Customer Premises Equipment (CPE), such as routers. Since DiffServ operates at Layer 3, it can work transparently over any Layer 2 transport (ATM, Frame Relay, ISDN, etc.), and since it relies on embedded ToS marking, it places no additional burden on the network. Equipment that supports DiffServ needs to be able to interpret and act upon ToS settings in real time, and this does require additional CPU power. DiffServ is currently viewed as the most flexible method of supporting Virtual Private Networks (VPNs), when used in combination with traffic engineering mechanisms such as MultiProtocol Label Switching (MPLS).
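
Since the DS field reuses the former ToS octet with the 6-bit codepoint in the upper bits, marking a packet amounts to writing the codepoint shifted left by two into that byte. The sketch below uses the standard expedited-forwarding codepoint (46); setting it through a socket option is illustrative and platform dependent.

```python
# Hedged sketch of DiffServ marking: DSCP occupies bits 7..2 of the DS field.
import socket

DSCP_EF = 46                     # expedited-forwarding codepoint
ds_field = DSCP_EF << 2          # 0xB8: DSCP shifted up, ECN bits left clear

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
if hasattr(socket, "IP_TOS"):                       # platform dependent
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, ds_field)
```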

8.1.6 Vendor approaches to QoS

Equipment manufacturers have taken a number of approaches to enable bandwidth policy management and service specifications at both the access point and the core. The QoS features, supported in hardware and/or software, can be broadly categorized as follows:

  • Guaranteed bandwidth: A feature that enables a high probability that bandwidth will be available at all times. Excess traffic is discarded, delayed, or counted and billed.

  • Committed information rate: A feature that allows a minimum amount of bandwidth to be guaranteed. Transmitters can send at a higher rate up to a specified maximum. Excess traffic is discarded, delayed, or counted and billed.

  • Rate shaping: A feature that regulates bandwidth for a flow so that it does not exceed a specified maximum. Excess traffic is delayed.

  • Expedited service: A feature that defines service classes. A customer's traffic is compared against the Service-Level Agreement (SLA). Excess traffic is discarded, delayed, or counted and billed.

  • Congestion management: A feature that allows traffic policies to be specified, based on maximizing throughput during congestion conditions. Typically there is a precedence upon which packets are dropped.

  • Resource reservation: A feature that enables end-to-end reservation of resources to guarantee service quality.

Table 8.2 illustrates the broad approaches supported by leading networking vendors in the area of QoS. Note that devices such as routers implement a range of queuing disciplines to support QoS requests, ranging from simple FIFO strategies through to more sophisticated techniques such as Weighted Fair Queuing (WFQ) and Class-Based Queuing (CBQ).

Table 8.2: QoS Features Supported by Leading Networking Vendors

Vendor                    | Guaranteed Bandwidth | Committed Info Rate | Rate Shaping | Expedited Service | Congestion Management | Resource Reservation
Cisco                     | yes | yes | yes | yes | yes | yes
Ascend                    | yes | yes | yes | yes | yes | yes
Lucent                    | yes | yes | yes | yes | yes | yes
Bay Networks              | yes | yes | yes | yes | no  | no
3Com                      | yes | yes | no  | yes | yes | yes
Nokia                     | yes | yes | yes | no  | no  | no
Fore Systems              | no  | no  | no  | no  | yes | yes
Packeteer                 | yes | yes | yes | no  | no  | no
Yago                      | yes | yes | no  | yes | no  | no
Xedia                     | yes | yes | yes | yes | no  | no
Structured Internetworks  | yes | yes | no  | no  | no  | yes
Resonate                  | no  | no  | no  | no  | yes | no
Foundry Networks          | yes | yes | no  | yes | yes | no
BlazeNet                  | yes | yes | no  | no  | no  | no
ArrowPoint                | yes | yes | yes | yes | no  | no
LanOptics                 | yes | yes | no  | no  | no  | no
InfoHighway               | no  | no  | yes | yes | yes | yes
Internet Devices          | yes | no  | no  | yes | no  | no

8.1.7 Service-Level Agreements (SLAs)

In order for an organization to expect better than best-effort service from its service provider, it must have a Service-Level Agreement (SLA) with that provider. The SLA basically specifies the service classes supported and the amount of traffic available to each class. It is the provider's responsibility to provision the network to cope with the service commitments made. At the ingress of the service provider's network (typically an access router), packets are classified, policed, and possibly shaped. The rules used by these processes and the amount of bandwidth and buffering space required are derived from the SLA. SLAs may be static or dynamic. Static SLAs are negotiated on a regular (e.g., monthly or yearly) basis. Dynamic SLAs are negotiated on demand, through the use of a signaling protocol such as RSVP.

The Internet Service Provider (ISP) market is extremely competitive, and there is increasing focus on both the specification and monitoring of SLAs (we will review some monitoring tools shortly). The most common SLA guarantees are round-trip latency, packet loss, and availability (uptime per month), as follows:

  • Round-trip latency—Internet Round-Trip Time (RTT) is fairly predictable (currently between 200 and 400 ms on average), and SLAs offered by service providers generally specify 80–150 ms depending upon the region (e.g., UUNET offers 85 ms U.S. RTT and 120 ms transatlantic RTT at the time of writing). One point to be aware of is the difference between average and peak latency. Peak latency is the delay imposed during a worst-case congestion period (i.e., this is the worst latency value we would ever expect to see). If an SLA specifies average latency, then in reality users could expect to see much worse delays at periods of high congestion.

  • Packet loss is also an important factor in maintaining good performance. As packet loss approaches 30 percent over the Internet, there are likely to be so many retransmissions taking place that the service is effectively unusable. A good rule of thumb is that 10 percent packet loss would severely impair performance, and greater than 15 percent makes the service effectively unavailable. Most service providers do not currently offer this feature.

The SLA comprises a series of traffic rules or mandates, expressed in sufficient detail to be measurable (a sketch of checking such a mandate against measured data follows this list), as follows:

  • Rule 1: 95 percent of in-profile traffic delivered at service level B will experience no more than 50 ms latency.

  • Rule 2: 99.9 percent of in-profile traffic delivered at service level C will be delivered.

  • Rule 3: 99 percent of in-profile traffic delivered at service level A will experience no more than 100 ms latency.
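
A minimal sketch of how a latency mandate such as Rule 1 might be verified against measured delay samples. The sample values are invented; a real check would operate on timestamped measurements for in-profile traffic only.

```python
# Hedged sketch: does the measured data satisfy "95 percent within 50 ms"?

def meets_latency_rule(samples_ms, max_latency_ms, required_fraction):
    within = sum(1 for s in samples_ms if s <= max_latency_ms)
    return within / len(samples_ms) >= required_fraction

samples = [23, 31, 48, 55, 12, 44, 47, 49, 18, 36]      # measured delays in ms
print(meets_latency_rule(samples, max_latency_ms=50, required_fraction=0.95))
# False: only 9 of the 10 samples are within 50 ms
```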

The SLA should also specify the level of technical support required. For example, you may require 24-hour, 7-day support, with a maximum 2-hour response time for a particularly mission-critical network. A more general specification would be 12-hour, 5-day support, with a 4- or 8-hour response time. The SLA should also include details of any compensation due should the provider not meet the agreement.

Service providers have to date concentrated on providing SLAs on backbone access. With the emergence of Virtual Private Networks (VPNs), it is likely that SLAs will need to be extended right into the customer premises; however, this raises the possibility that two or more service providers may be required to match guarantees, with the local area parts of the VPN network managed by private companies or outsourced. The technologies involved differ markedly between the local and wide area networks, making service prediction difficult (e.g., several T1 links fed into Gigabit Ethernet and then distributed as switched and shared Ethernet or Token Ring LANs). In the local area there are currently many different techniques for offering service guarantees; some LAN technologies offer no guarantees. Efforts are currently focusing on using a common tagging format to ensure that frames can be classified and handled uniformly regardless of the technology used, and the IETF is focusing on the DiffServ model as an overall architecture to provide end-to-end SLAs. We will discuss these technologies and issues in detail shortly. For the foreseeable future SLA homogenization between LANs and WANs may be simply unworkable without tight design and technology constraints; however, this is likely to be commonplace in the future as reliance on networks for business processes increases and service providers begin to invade the local area with managed service offerings.

Monitoring SLAs

One of the obvious problems with QoS in the real world is that a customer attached to a public network places significant trust in the service provider once the SLA is in place. For some organizations this honor system may be unacceptable, since the operation of a business may depend to a large degree on the availability of network links to customers and partners. These organizations may wish to monitor the SLA and may also expect restitution if significant loss of service or service quality occurs. There are a number of vendors offering SLA monitoring tools that compile and present activity reports for a range of WAN services. This information can (in theory) be used to force providers to offer compensation if the SLA guarantees fall short in practice.

Before purchasing a monitoring tool you should first examine the SLA. No product can help you if your SLA is poorly defined or badly negotiated. The looseness of many agreements makes it easy for providers to ignore significant service shortfalls and the corresponding compensation due to customers. The interested reader is referred to standards bodies such as the Frame Relay Forum [15] and the Telemanagement Forum (formerly known as the Network Management Forum) [16] for tips on designing better SLAs. These bodies are defining key parameters that can be used in setting up SLAs, and this is an active area for collaborative discussion. It is worth noting that some carriers have standard SLAs that are nonnegotiable. In these cases if you are unhappy with the SLA and you cannot exert sufficient commercial pressure, then your only choices are either to talk to other providers or to live with the SLA offered.

Features of commercial monitoring tools

Monitoring tools come in a variety of guises. Some products have been designed from the ground up; others are extensions of existing network management and monitoring systems. In any case, it is important when deploying these systems to understand how a product gathers information, what level of granularity is available, and what kind of polling mechanisms are used. In general they comprise the following elements:

  • Data collection models—Products may be purely software (data collection modules installed on key network components, with collection from standard networking devices) or a combination of software and dedicated hardware (in the form of remote monitoring probes or CSU/DSUs). Probes are typically more expensive than CSU/DSUs, but they can track much more information, since they are able to access the upper layers of the protocol stack. CSU/DSU-based solutions are limited to Physical Layer statistics and may be limited in geographic scope (these products may not be certified or compatible for international networks).

  • WAN interfaces—Monitoring tools can typically monitor traditional WAN technologies such as Frame Relay, ISDN, and leased lines. Not all tools currently support ATM, and none can currently be used to monitor Internet services.

  • External data feeds—Products use a number of external interfaces to capture real-time data (e.g., SNMP, RMONv1 and RMONv2, serial interfaces, CLI capture, proprietary protocols, and management systems such as HP Openview). These data can be consolidated with data captured by their own agents (if deployed).

  • Predictive features—Predictive capabilities can be extremely useful when considering the position of the products within the infrastructure:

    • Trend analysis can be used to graphically highlight key performance patterns and areas requiring additional capacity in the future. Many vendors offer this capability as part of their products. Forecasting also fits here; it makes use of statistical analysis to let network managers predict overall future performance based on current metrics.

    • What-if analysis allows users to model and test scenarios for their possible impact on performance. For example, a link can be temporarily broken (in software) to determine how an outage would affect other circuits.

  • Traffic shaping—Some vendors incorporate traffic shaping and filtering facilities into their products to help in maintaining QoS pro-actively at the edge. These features can be used to optimize the use of WAN bandwidth. This also adds an extra level of control that could be used to extend service-level guarantees to in-house users. Some products in this class can be set up to identify periods of congestion and can then allocate bandwidth dynamically to keep mission-critical applications alive.

  • Data capture and storage—On a large high-speed network, database requirements can be brutal, with respect to both storage space and database access speed (particularly Write actions). Consider the number of objects required to be stored from multiple feeds if all data are to be recorded in real time. Many products have filtering capabilities to reduce the amount of data captured. Some customers may, however, insist that all pertinent data be recorded.

  • Reporting features—An important feature of these products is report generation. Pay particular attention to Web accessibility, frequency of report generation, composite measurement capabilities, security, and customization. Database products offering SQL-based interfaces enable data to be easily exported to other reporting tools and databases.

  • Diagnostic features—Troubleshooting features range from basic threshold alarms that can be set when traffic patterns vary (something nearly every vendor offers) to more complicated event correlation tools and packet-decoding features. If customers discover a problem, they can pay to have it corrected immediately and then (in theory) get the provider to reimburse them later on.

Key metrics to monitor

Given the importance of SLAs one would assume that there are standard metrics to monitor for each technology type. In fact only the Frame Relay Forum has begun to make progress on this issue at the time of writing. The ATM Forum has discussions under way, but no documents have yet been drafted. The Telemanagement Forum (TMF) has established definitions and documents for joint use by carriers and their customers, but much of this work pertains specifically to theoretical models carriers can use in setting up customer service systems. The types of metrics you need to focus on include the following:

  • Network availability (uptime)—Armed with standard network equipment (such as a router) you can track availability by recording how long a service is up over a specified period of time, and these products also typically maintain statistics on error rates, throughput, utilization, and so on.

  • Circuit error rates (e.g., bit error rate, CRC failure rate)—Again, standard network equipment typically maintains statistics on error rates. Specialized equipment such as BERT testers can also be employed.

  • Throughput—Throughput is important as a measure of efficiency; not all products are currently able to provide a consistent way to measure the amount of data successfully delivered.

  • Network latency (delay) and possibly jitter (variability of delay)—Several monitoring products can track network latency, delay, and response time. They typically use the ICMP ping facility to reach an agent at the other end of the link, measuring how long it takes to get a response. Some products can track the response time of specific applications running over WAN circuits. These intelligent monitoring agents can run alongside popular business applications, monitoring time stamps to measure how well the application is responding.

  • Circuit stability (oscillations in link up/down status)—Link oscillations can cripple a network, even if they represent only minor downtime at the WAN interface. For example, in a routed intranet, if the WAN link is part of an OSPF routing domain, each oscillation could force the network to reconverge frequently; this could lead to packet or even session loss.

It is important to fully understand the implications of the SLA specifications. For example, an SLA specifying 99 percent uptime per month on a particular circuit may seem more than adequate. In practice this could mean that the user experiences over seven hours of downtime on a regular basis with the carrier still meeting its SLA. If your business demands no more than 30 minutes downtime per month, you would be looking to specify at least 99.932 percent availability. Clearly, there are cost implications here. For further information on useful parameters refer to [15, 16].
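
The arithmetic behind these figures is straightforward; the sketch below computes the downtime permitted by a given availability target, assuming a 30-day month.

```python
# Worked check of the uptime figures quoted above.

def allowed_downtime_minutes(availability_percent, days=30):
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)

print(allowed_downtime_minutes(99.0))     # 432 minutes: over 7 hours per month
print(allowed_downtime_minutes(99.932))   # ~29 minutes: roughly the 30-minute goal
```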

Challenging an SLA

Recording data is one thing; challenging a service provider on SLAs is not quite so straightforward. Capture agents must collect data at frequent intervals in order to present a realistic view, and reports must be unambiguously timestamped and clearly identify the resources being tracked. Before investing in monitoring tools, you must discuss with your provider what it is prepared to reasonably accept in terms of SLA challenges. From the provider's perspective it is conceivable that you could be running inadequate tools to provide a credible challenge or that you could in some way alter the data. There are currently no hard rules or standards concerning what carriers will accept as proof of poor performance, other than their own measurements. Providers are likely to use their own tools, and these could offer much greater resolution than your own, or they might use commercial products that you could purchase and mirror at your site. Organizations that cannot afford to buy additional tools, or simply do not have the resources to manage the process, can outsource this activity to an independent service monitoring agency. Another subtle problem with SLA metric recording is security. If you cannot prove that the data collection devices are secure, then it is possible for a provider to argue that the data you have collected could have been tampered with. Surprisingly, only a few commercial products currently have security features that extend beyond simple password control. A small number of products now support SSL and integral firewall features.

Key vendors

Key vendors of SLA monitoring solutions include ADC Kentrox, Concord Communications, Digital Link, Hewlett-Packard, Infovista, Ion Networks, The Knowledge Group, Netscout Systems, Netreality, Paradyne, Quallaby, Sync Research, and Visual Networks.


