The Enterprise: Evaluating Service Provider Management Capabilities

One of the most important parts of the Request for Proposal (RFP) should be a section on service provider network management. It is crucial that each applicant be vetted on the following:

Provisioning: What is the service provider's process for provisioning the VPN and subsequently new sites within it?
SLA monitoring: How does the service provider intend to monitor the SLAs?
Fault management: Includes the tools, what faults are collected, and escalation procedures.
Reporting: Accessibility, frequency, and content are most relevant.
Root-cause analysis (RCA): After a "severe" fault, what RCA policy does the service provider have?

Provisioning

Provisioning involves not only the configuration changes required to create the VPN, but also the process by which the enterprise is informed of progress and the level of involvement required of it. This differs, of course, depending on whether the service is managed or unmanaged.

For managed services, the idealized model is one of almost zero touch, where the enterprise simply "plugs in" the CE device and it automatically provisions itself.

This model has tremendous advantages for both parties:

It reduces errors and operating expenditures (opex) while providing more centralized provisioning control.
It automates an otherwise-manual process, allowing service providers to invest their scarce resources in producing new revenue-generating services rather than maintaining old ones.
It helps service providers scale their limited technical resources by centralizing the configuration steps with a stored configuration.
It lets service providers offer incremental services by remotely adding new configurations and managing the end customers' configuration.
It reduces costs of warehousing, shipping, and manual intervention because CE devices can be drop-shipped directly to subscribers.
It shortens time to billable services.

For unmanaged services, there is of course a requirement on the enterprise to supply and configure the CE devices after agreement with the service provider on aspects such as routing protocols and security. Even here, however, the service provider may be able to help the process by supplying part of the necessary configuration for the CE.

SLA Monitoring

It is crucial that, having signed a specific SLA, the enterprise has a high degree of confidence in the service provider's capability to satisfy it. To begin, the enterprise should ask the service provider how it would ideally like to monitor the SLAs.

The level of monitoring required varies depending on the types of applications that will be delivered across the network. For example, voice over IP (VoIP) traffic is highly sensitive to delay and packet loss. It is advisable for the enterprise to require the service provider to monitor adherence of VoIP traffic to well-defined standards, such as International Telecommunication Union (ITU) G.114, which specifies metrics for delay characteristics in voice networks.

Another assessment criterion is to inquire whether the service provider has attained any form of third-party "certification." An example is the Cisco-Powered Network QoS Certification. This specific example means that the service provider has met best practices and standards for QoS and real-time traffic in particular. Taking voice as an example, this certification means that the service provider has satisfied the following requirements:

Maximum 150 ms one-way delay for voice/video packets
Maximum one-way packet jitter of 30 ms for voice/video traffic
Maximum voice/video packet loss of 1.0 percent

More information on this specific program can be found at http://www.cisco.com/en/US/netsol/ns465/net_value_proposition0900aecd8023c83f.html.
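
The three voice/video requirements listed above lend themselves to a simple automated check. The following is a minimal sketch, assuming the measurements have already been collected by some probe; the `VoiceMeasurement` record and the function name are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass

# Thresholds taken from the three voice/video requirements listed above.
MAX_ONE_WAY_DELAY_MS = 150.0
MAX_ONE_WAY_JITTER_MS = 30.0
MAX_PACKET_LOSS_PCT = 1.0


@dataclass
class VoiceMeasurement:
    """One measurement interval for a voice/video path (hypothetical record)."""
    one_way_delay_ms: float
    one_way_jitter_ms: float
    packet_loss_pct: float


def voice_sla_violations(m: VoiceMeasurement) -> list:
    """Return the names of the thresholds that this measurement exceeds."""
    violations = []
    if m.one_way_delay_ms > MAX_ONE_WAY_DELAY_MS:
        violations.append("delay")
    if m.one_way_jitter_ms > MAX_ONE_WAY_JITTER_MS:
        violations.append("jitter")
    if m.packet_loss_pct > MAX_PACKET_LOSS_PCT:
        violations.append("loss")
    return violations
```

Because the requirements are stated as maximums, a value exactly at the limit is treated here as compliant. Whether the contract reads the limits that way is something to confirm with the service provider.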

Other traffic types typically have their own QoS classes, as defined in the DiffServ model. The level of proactive monitoring expected of the service provider therefore should be negotiated for each class of traffic. For example, it might be that the enterprise specifies the following:

A target for overall per-class packet delivery (less than 0.0001 percent packet loss, rising to higher percentages for non-drop-sensitive traffic or low-priority classes).
Targets for delivery of the interactive video (AF4x) class.
Targets for delivery of signaling/high-priority data classes. These are usually marked with the DiffServ code points AF3x or CS3.
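
When negotiating per-class targets, it can help to know the numeric code points behind names such as AF4x and CS3. The sketch below derives them from the standard DiffServ numbering (Assured Forwarding class x with drop precedence y is 8x + 2y; Class Selector x is 8x). The function names are illustrative only.

```python
def af_dscp(af_class: int, drop_precedence: int) -> int:
    """Return the DSCP value for Assured Forwarding class AFxy."""
    if not 1 <= af_class <= 4:
        raise ValueError("AF class must be 1..4")
    if not 1 <= drop_precedence <= 3:
        raise ValueError("drop precedence must be 1..3")
    return 8 * af_class + 2 * drop_precedence


def cs_dscp(cs_class: int) -> int:
    """Return the DSCP value for Class Selector CSx."""
    if not 0 <= cs_class <= 7:
        raise ValueError("CS class must be 0..7")
    return 8 * cs_class


# The code points mentioned in the list above, keyed by name.
CODE_POINTS = {
    "AF41": af_dscp(4, 1), "AF42": af_dscp(4, 2), "AF43": af_dscp(4, 3),
    "AF31": af_dscp(3, 1), "AF32": af_dscp(3, 2), "AF33": af_dscp(3, 3),
    "CS3": cs_dscp(3),
}
```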

The service provider can employ specific tools and techniques to adhere to such requirements. These are discussed more fully in the section "The Service Provider: How to Meet and Exceed Customer Expectations."

Fault Management

The purpose of fault management is to detect, isolate, and correct malfunctions in the network. For assessment purposes, the three main questions are as follows:

How does the service provider respond to faults after they are reported?
What is the service provider's passive fault management strategy?
What proactive fault monitoring techniques are used?

Handling Reported Faults

Sometimes, the enterprise detects faults with the VPN service and needs to report them to the service provider. These might range from serious outages to intermittent performance problems as experienced by some applications and end users.

In such scenarios, it is important to know what the processes, escalation, and reporting procedures are within the service provider. Here are some possible questions to ask:

What are the contact options for reporting a fault?
Is this 24/7?
How is progress reported?
What is the skill level of the Tier 1/help desk operator? For example, will he or she know the contracted service when a problem is reported? Will he or she be able to report the service's health within the service provider network?
If an outage is reported, will the service provider verify that the connection is healthy from CE-CE (managed service) and from PE-CE (unmanaged)?
What techniques will the service provider employ to determine reachability health?
If a performance issue is reported, what techniques will the service provider employ to validate that its network is responding correctly to the type of traffic experiencing problems from the enterprise perspective?
If a Tier 1 operator cannot provide immediate help on the problem, what are the escalation procedures? That is, to whom would a problem be assigned? How does this process continue, and what are the time frames? In other words, how long does a problem reside with one support person before being escalated?

Some of these issues are discussed from the service provider's perspective in the section "The Service Provider: How to Meet and Exceed Customer Expectations."

Passive Fault Management

Passive fault management can be further subdivided into monitoring network element-generated events and capturing and analyzing customer traffic.

Network Events

Events from network elements are usually obtained from Simple Network Management Protocol (SNMP) traps, Remote Monitoring (RMON) probes, and other, potentially proprietary messages (such as Syslog on Cisco equipment).

In terms of SNMP, the enterprise should ask if the service provider is monitoring notifications from the following Management Information Bases (MIBs):

MIBs related to MPLS transport include LDP-MIB and LSR-MIB:
- LDP-MIB The Label Distribution Protocol (LDP) MIB allows a network management station to retrieve status and performance data. But probably most importantly, it provides notifications on the health of LDP sessions. For example, if an LDP session fails, an mplsLdpSessionDown notification is generated. Within the notification, the LDP peer (the device the session was with) is specified. This allows the network management station to perform some correlation (by examining other events received in the same time frame) and impact analysis (is the LDP session over links carrying customer traffic?).

- LSR-MIB The label switch router (LSR) MIB allows the network management station to monitor the data plane characteristics of MPLS. From a fault-management perspective, the enterprise should inquire which parameters are retrieved, because the vendor may not support the notifications in the MIB definition. Care should be taken with excessive use of this MIB because of scale implications. Tools starting to appear within vendor equipment will negate the need to poll for the existence and operational health of label-switched paths (LSPs). These are discussed later in this chapter.

MIBs related to MPLS VPNs include MPLS-VPN-MIB and vendor-specific MIBs such as the Cisco BGPv4 MIB:
- MPLS-VPN-MIB This MIB allows the network management station to monitor the health and existence of VPN routing and forwarding interfaces. For example, if one of these interfaces fails, the network management station is informed and should be able to determine which customers and locations are affected.

- Vendor-specific MIBs An example is the Cisco BGPv4 MIB. This MIB in particular facilitates monitoring of the multiprotocol Border Gateway Protocol (MP-BGP) sessions that are vital to the exchange of route information within VPNs.
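
The time-frame correlation mentioned for the LDP-MIB can be illustrated with a small sketch. It assumes the management station has already decoded incoming notifications into simple records; the `Event` type, its fields, and the 30-second default window are hypothetical choices for illustration, not part of any MIB.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """A decoded notification (hypothetical record, not a MIB structure)."""
    timestamp: float   # seconds since some common epoch
    device: str        # device that sent the notification
    name: str          # for example "mplsLdpSessionDown" or "linkDown"
    detail: str = ""   # for example the LDP peer or the interface


def correlate(trigger: Event, events: list, window_s: float = 30.0) -> list:
    """Return the other events received within window_s of the trigger."""
    related = [
        e for e in events
        if e is not trigger and abs(e.timestamp - trigger.timestamp) <= window_s
    ]
    # Oldest first, so a likely cause tends to appear before its symptoms.
    return sorted(related, key=lambda e: e.timestamp)
```

For example, a linkDown received from the same device a few seconds before an mplsLdpSessionDown would be returned as a related event, giving the operator a starting point for diagnosis.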

There are also standard and proprietary MIBs for the Open Shortest Path First (OSPF) Interior Gateway Protocol (IGP).

Certain generic MIBs must be included in any effective fault-management strategy. This is not only because they provide important data relating to the health of the router and its functions, but also because they may help diagnose and correlate MPLS VPN problems. These MIBs allow the network manager to focus on the following categories:

Hardware-related errors
Environmental characteristics
Resources and processes
Interface problems

Note

Vendor-specific MIBs are almost always available to provide more useful events, but these are beyond the scope of this book. One point to stress, however, is that when no notifications are explicitly supported, it may still be possible to achieve monitoring through the use of the EVENT and EXPRESSION MIBs (provided that they are supported!). These MIBs allow data from other MIBs to be defined such that when certain thresholds are crossed, events are generated for the network management system (NMS). For example, a network manager could define a rule that says that when available free memory drops below 1 MB, a notification should be generated.
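
The free-memory rule in the note can be pictured as a falling-threshold check. The sketch below is an NMS-side analogue written in Python, not a configuration of the EVENT or EXPRESSION MIBs themselves; the class name is hypothetical, and 1 MB is assumed to mean 1,048,576 bytes.

```python
ONE_MB = 1_048_576  # assumption: 1 MB taken as 2**20 bytes


class FallingThreshold:
    """Fire once each time a sampled value drops below a threshold."""

    def __init__(self, threshold: float):
        self.threshold = threshold
        self._below = False  # True while the last sample was under the threshold

    def update(self, value: float) -> bool:
        """Return True only on the sample that crosses below the threshold."""
        below = value < self.threshold
        fired = below and not self._below
        self._below = below
        return fired
```

Firing only on the crossing, rather than on every low sample, keeps a persistently low value from flooding the NMS with repeated notifications.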

Hardware-related errors: The most important of these are the reload, coldStart (SNMPv2-MIB), and linkUp/linkDown (IF-MIB) notifications. Several vendor-specific MIBs provide notifications related to the chassis inventory, such as those within the OLD-CISCO-CHASSIS-MIB and CISCO-ENTITY-MIB family.
Environmental characteristics: Most of these are from vendor-specific MIBs and allow attributes such as temperature, power supply, and voltage to be monitored. In Cisco devices, the notifications are defined in the CISCO-ENVMON-MIB family.
Resources and processes: This covers aspects such as CPU and memory utilization. Again, this is mainly covered via vendor-specific MIBs such as the Cisco CISCO-PROCESS-MIB family.

The section "The Service Provider: How to Meet and Exceed Customer Expectations" has more details for service providers on SNMP usage and proprietary events.

Of course, such events need to be captured and responded to. The enterprise should ask the service provider what tool(s) it uses to capture events and how quickly the tools can be customized to deal with new ones. It may also be necessary to reconfigure how the fault-management system deals with specific events. For example, SNMP traps usually are translated into an alarm with a specific severity rating (Informational, Warning, Error, Critical, and so on). It may be necessary to change this, especially if the enterprise is experiencing problems in a certain area. Ideally, the service provider should be able to reconfigure the tool without involving the manufacturer/reseller.
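
The translation of traps into alarm severities, and the ability to change that translation without vendor involvement, might look something like the following sketch. The default table is entirely hypothetical (real tools ship their own defaults); only the severity names come from the text above.

```python
SEVERITIES = ("Informational", "Warning", "Error", "Critical")

# Hypothetical defaults for illustration; not taken from any product.
DEFAULT_SEVERITY = {
    "linkUp": "Informational",
    "coldStart": "Warning",
    "linkDown": "Error",
    "mplsLdpSessionDown": "Critical",
}


def alarm_severity(trap_name: str, overrides: dict = None) -> str:
    """Map a trap name to a severity, letting operator overrides win."""
    if overrides and trap_name in overrides:
        severity = overrides[trap_name]
        if severity not in SEVERITIES:
            raise ValueError("unknown severity: " + severity)
        return severity
    # Unrecognized traps get a middle-of-the-road default in this sketch.
    return DEFAULT_SEVERITY.get(trap_name, "Warning")
```

An enterprise experiencing repeated access-link problems could, for instance, ask for linkDown on its own circuits to be overridden to Critical.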

Related to this are the procedures after the alarm is raised. The service provider may automatically alert an operator when certain important alarms occur. This might be done by e-mail, pager, Short Message Service (SMS), or fax. Other alarms may be handled manually. Either way, the enterprise should ask the service provider what its procedures are. Ideally, they should contain an element of automation.

Customer Traffic Monitoring

Why should the enterprise care about the service provider's ability to monitor customer traffic? Apart from the obvious billing implications, this becomes important if the enterprise starts to experience problems, particularly those of a performance nature. Possible performance problems include the following:

End users start experiencing poor-quality voice calls.
End users start experiencing poor-quality video feeds.
Application responsiveness, such as e-mail and/or web access that is slow or degrading.

These performance problems are, of course, from an end user's perspective. The underlying causes are likely to be one or more of the following:

The enterprise is marking traffic incorrectly, resulting in the wrong treatment in the service provider network.
The enterprise is oversubscribing its classes of service (CoSs) and the VPN is (correctly) dropping or remarking the traffic.
The service provider has incorrect QoS configurations, resulting in dropped packets.
The service provider has reached bandwidth limits on certain links.
The service provider has a performance bottleneck on a router.

Within an MPLS VPN, the enterprise shares the service provider's infrastructure with other organizations. In the event of a performance problem, it is important for the enterprise to know if the service provider can clearly identify its traffic, as well as answer the following questions:

Are enterprise QoS classifications being preserved through your network?
Is the bandwidth allocated to the enterprise as agreed in the SLA?
Is any enterprise traffic being dropped? If so, which type of packets (protocol, source, destination, and so on)?

From the service provider's perspective, when an issue is reported, it quickly wants to determine if the problem is within its network. To do this, the service provider needs to use a variety of techniques. It is likely (and recommended) that it will use synthetic traffic probing (such as Cisco IP SLA probes) to measure the connection's responsiveness. If this initial test seems to pass, it is essential that the service provider examine specific customer traffic flows to help answer the enterprise's questions. The techniques and tools that can be employed here are mostly vendor-specific. Here are some examples:

Cisco NetFlow lets you identify traffic flows down to the protocol level.
Cisco Network-Based Application Recognition (NBAR): Classification at the application layer for subsequent QoS treatment.
Specific MIBs, such as CISCO-CLASS-BASED-QOS-MIB.
Specific command-line interface (CLI) commands for retrieving QoS, interface, and buffer data relating to customer traffic.
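
To illustrate how flow data could answer the question of whether QoS classifications are preserved, the sketch below compares the marking seen for each flow where it enters and leaves the service provider network. The `FlowRecord` type is a hypothetical, simplified stand-in for exported flow data and does not reflect any actual NetFlow record format; it also assumes one marking per flow at each observation point.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowRecord:
    """Simplified flow summary (hypothetical; not a real export format)."""
    src: str
    dst: str
    protocol: str
    dscp: int


def remarked_flows(ingress: list, egress: list) -> dict:
    """Return {(src, dst, protocol): (ingress_dscp, egress_dscp)} for
    flows whose DSCP differs between the two observation points."""
    seen = {(f.src, f.dst, f.protocol): f.dscp for f in ingress}
    changed = {}
    for f in egress:
        key = (f.src, f.dst, f.protocol)
        if key in seen and seen[key] != f.dscp:
            changed[key] = (seen[key], f.dscp)
    return changed
```

An empty result suggests markings are being preserved for the sampled flows. A non-empty one points to either intentional remarking of out-of-contract traffic or a QoS misconfiguration, which are two of the causes listed earlier.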

The section "The Service Provider: How to Meet and Exceed Customer Expectations" discusses in more detail these tools and how the service provider can use them.

Proactive Monitoring

One of the most important questions the enterprise can ask the service provider is what its proactive monitoring strategy is. For large VPNs, it is very difficult for the service provider to monitor every path, but the enterprise should at least expect the service provider to monitor the critical ones, such as from certain locations to the data center. In doing so, it is important to be able to distinguish between the control and data planes. It is fundamental that the service provider monitor at least part of the data plane in the enterprise connectivity path. This is because even though the control plane may appear healthy, the data plane may be broken. More information on this topic appears in the section "The Service Provider: How to Meet and Exceed Customer Expectations." For the moment, the following provides useful guidelines in forming questions in this area.

Figure 8-3 shows the areas within the service provider network where proactive fault monitoring can be employed.

Figure 8-3. Proactive Monitoring Segments in an MPLS VPN Network

As shown, a number of segments and paths can be monitored:

PE-CE (WAN) link
PE-PE
CE-CE
Combinations, such as PE-Core-PE-CE

Each of these segments and paths is discussed in the following sections.

PE-CE

Layer 1 is traditionally monitored using the passive techniques already discussed, such as link up/down traps. However, depending on the vendor, it may be necessary to poll MIB variables. An example might be for subinterface status, where if a subinterface goes down, a trap generally is not generated. Subinterfaces are usually modeled as separate rows in the ifTable from the IF-MIB and hence have an associated ifOperStatus.
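
Where no trap is generated, a management application can detect subinterface failures by comparing successive polls of ifOperStatus. The sketch below assumes the polling itself has already been done and that each poll is available as a mapping from ifIndex to the integer ifOperStatus value; the function name is illustrative. The status names are those defined for ifOperStatus in the IF-MIB.

```python
# ifOperStatus values as defined in the IF-MIB.
IF_OPER_STATUS = {
    1: "up", 2: "down", 3: "testing", 4: "unknown",
    5: "dormant", 6: "notPresent", 7: "lowerLayerDown",
}


def status_changes(previous: dict, current: dict) -> list:
    """Compare two polls (ifIndex -> ifOperStatus) and return
    (ifIndex, old_status, new_status) for every row that changed."""
    changes = []
    # Only rows present in both polls are compared in this sketch.
    for if_index in sorted(previous.keys() & current.keys()):
        old, new = previous[if_index], current[if_index]
        if old != new:
            changes.append((if_index, IF_OPER_STATUS[old], IF_OPER_STATUS[new]))
    return changes
```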

Layer 2 is heavily dependent on the WAN access technology being used. Point-to-point technologies such as serial, PPP, and High-Level Data Link Control (HDLC) do not have monitoring built in to the protocol. This situation is also true for broadcast technologies, such as Ethernet (although developments in various standards bodies will shortly address this, specifically in the area of Ethernet OAM).

The situation is somewhat different for connection-oriented technologies, such as ATM and Frame Relay. Both these technologies have native OAM capability, which should be fully enabled if possible.

At Layer 3, some data-plane monitoring functions are starting to appear in the various routing protocols. The most important of these is Bidirectional Forwarding Detection (BFD). The enterprise should ask the service provider if and when it plans to use this technology.

It is more likely, however, that the service provider will actively probe the PE-CE link. This is done either via a management application simply pinging from PE-CE or by using one of the vendor-specific tools, such as IP SLA from Cisco.

In the control plane, the service provider should have some capability to monitor route availability within the virtual routing/forwarding instance (VRF) on the PE. This would let the service provider detect when specific customer routes were withdrawn, hence facilitating some form of proactive troubleshooting. Unfortunately, no MIBs provide notifications for this, so the service provider has to inspect the routing tables periodically using a management application.
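
Periodic inspection of the VRF routing table amounts to comparing snapshots. A minimal sketch follows, assuming each snapshot has already been retrieved as a list of prefix strings in a single address family; how the snapshot is obtained (CLI scrape, MIB walk, or otherwise) is outside the sketch, and the function name is hypothetical.

```python
import ipaddress


def withdrawn_routes(previous: list, current: list) -> list:
    """Return prefixes present in the previous snapshot but not the current one."""
    # Parsing normalizes the text form, so "10.1.0.0/16" and
    # "10.1.0.0/255.255.0.0" compare as the same network.
    prev = {ipaddress.ip_network(p) for p in previous}
    curr = {ipaddress.ip_network(p) for p in current}
    return [str(n) for n in sorted(prev - curr)]
```

A non-empty result for a customer VRF is the cue for proactive troubleshooting, ideally before the enterprise notices the loss of reachability.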

PE-PE

This segment introduces MPLS into the forwarding path. In this context, the enterprise should ask if the service provider would monitor the LSPs, in particular the LSPs being used to transport VPN traffic from PE-PE. An additional check would be the VPN path from PE-PE, which includes VPN label imposition/disposition. There may be some overlap in the latter case with SLA monitoring. For example, if the service provider is already monitoring PE-PE for delay/jitter data, ideally it would combine this with the availability requirement.

CE-CE

This option is only really available in a managed VPN service (unless the enterprise grants permission for the service provider to access its routers). In theory, this option offers the best monitoring solution, because it more closely follows the path of the customer's traffic. This option has some important caveats, however, as discussed in the section "The Service Provider: How to Meet and Exceed Customer Expectations."

The two main options are periodic pinging via a management application and using synthetic traffic from IP SLA probes. In both cases, the enterprise should ask what happens when a fault is detected. The expectation should be that an alarm is generated and fed into the central fault management system, from where troubleshooting can be initiated.
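
The step from a failed probe to an alarm in the fault management system can be sketched as follows. The rule of raising an alarm only after several consecutive failures is an assumption added here to avoid alarming on a single lost packet; it is not something the text prescribes. The class and its return values are hypothetical.

```python
from typing import Optional


class ProbeMonitor:
    """Turn a stream of probe results for one path into alarm actions."""

    def __init__(self, path: str, failures_to_alarm: int = 3):
        self.path = path
        self.failures_to_alarm = failures_to_alarm
        self._consecutive_failures = 0
        self._alarmed = False

    def record(self, success: bool) -> Optional[str]:
        """Return "raise", "clear", or None for one probe result."""
        if success:
            self._consecutive_failures = 0
            if self._alarmed:
                self._alarmed = False
                return "clear"
            return None
        self._consecutive_failures += 1
        if self._consecutive_failures >= self.failures_to_alarm and not self._alarmed:
            self._alarmed = True
            return "raise"
        return None
```

The enterprise may wish to ask the service provider what the equivalent threshold is in its own tooling, because it directly affects how quickly a genuine outage is noticed.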

PE-Core-PE-CE

This option makes sense when the service provider wants to monitor as much of the customer path as possible but cannot access the CEs. The same techniques apply as for the CE-CE case.

Reporting

Enterprise customers should expect the service provider to offer a reporting facility for their VPN service. This normally takes the form of a web portal through which the enterprise network manager can log in and receive detailed metrics on the service's current state and performance. Figure 8-4 shows a performance-reporting portal.

Figure 8-4. MPLS VPN Performance-Reporting Portal

The following metrics should be supported:

Utilization per CoS
Packet loss
Latency
Jitter

If the service is managed, the reporting tool should also support CE-related data, such as memory, CPU, and buffer utilization.
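
As a rough illustration of how three of these metrics relate to raw probe data, the sketch below summarizes one reporting interval. It assumes a list of latency samples for the probes that were answered, in the order sent, and defines jitter simply as the mean absolute difference between consecutive latency samples. That definition is a simplification chosen for this sketch, and a service provider's portal may well compute jitter differently.

```python
from statistics import mean


def summarize(latencies_ms: list, sent: int) -> dict:
    """Summarize one interval: packet loss (%), mean latency, and jitter (ms)."""
    received = len(latencies_ms)
    if sent <= 0 or received > sent:
        raise ValueError("sent must be positive and at least the number received")
    loss_pct = 100.0 * (sent - received) / sent
    if received == 0:
        # Nothing came back, so latency and jitter are undefined.
        return {"loss_pct": loss_pct, "latency_ms": None, "jitter_ms": None}
    diffs = [abs(b - a) for a, b in zip(latencies_ms, latencies_ms[1:])]
    return {
        "loss_pct": loss_pct,
        "latency_ms": mean(latencies_ms),
        "jitter_ms": mean(diffs) if diffs else 0.0,
    }
```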

Root Cause Analysis

Faults will occur in the service provider network. When they do, the enterprise should be informed as soon as possible. However, it is also important that when faults are rectified, the root cause be passed on. This allows the network manager to provide accountability information to his or her upper management and internal customers, as well as make ongoing assessments of the QoS he or she is receiving. For example, if outages related to maintenance occur, the enterprise might request more advance notice and specific details of the planned disruption.

Such data might take the form of a monthly report. The enterprise should look for the following data:

Site availability (expressed as a percentage)
Reliability
Service glitches per router
Defects encountered on vendor equipment, with tracking numbers
Partner site availability and reliability (if applicable)
Fault summary, including categorization (for example, configuration error, defect, weather, protocol, and so on)
Planned changes
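
Site availability expressed as a percentage is usually derived from outage durations over the reporting period. A minimal sketch of that arithmetic follows; the function name is illustrative, and whether planned maintenance counts as downtime is a contractual matter that the sketch leaves to the caller.

```python
def availability_pct(period_minutes: float, outage_minutes: list) -> float:
    """Return availability as a percentage of the reporting period."""
    downtime = sum(outage_minutes)
    if period_minutes <= 0 or downtime < 0 or downtime > period_minutes:
        raise ValueError("outage time must fit within a positive reporting period")
    return 100.0 * (period_minutes - downtime) / period_minutes


# A 30-day month has 30 * 24 * 60 = 43,200 minutes, so 43.2 minutes
# of outage corresponds to roughly 99.9 percent availability.
MONTH_MINUTES = 30 * 24 * 60
```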