This section discusses the implications of enterprise expectations on the service provider and how these translate into management functions and practices that help exceed them.

Provisioning

The provisioning function is heavily dependent on whether the service is managed or unmanaged, because a managed service adds the CE to the service provider's domain of responsibility.

Zero-Touch Deployment

For managed services, one of the most time-consuming and expensive tasks within the rollout or upgrade of the VPN is the provisioning of CE equipment. Vendors now provide solutions that can help automate this process. Figure 8-10 is a high-level example of how such a process might operate.

Figure 8-10. Automated CE Provisioning (Source: Cisco Configuration Express, http://www.cisco.com/cx)

The steps are as follows:
Of course, certain key technologies need to be in place from the vendor for such a system to be used. The key factor here is that this "zero-touch provisioning" system pushes intelligence down into the network. Devices require more advanced management functionality and cooperation with offline systems. A typical system might contain the following components:
PE Configuration

Of course, provisioning the CE is only one of the tasks required in turning up a new circuit. The PE is where most of the configuration is required. Because of the relative complexity of this task, it is highly recommended that the service provider automate it using a management provisioning system. Most vendors supply such a system. In general, they take one of two approaches:
Fault Monitoring

An effective fault-monitoring strategy should concentrate on ensuring that any potential service-affecting events generated by vendor equipment are collected first. Enhancements can then be added to include functions such as impact analysis and correlation. Furthermore, a number of management products in this space provide specialized functions, such as route availability monitoring and correlation of control/data plane events. (For example, perhaps a connection was lost because a route was withdrawn, which in turn was caused by a link failure in the network.)

MPLS-Related MIBs

MIBs are the primary source of fault-related events and data from the network elements. Because "standard" MIBs are often implemented by multiple vendors, they can simplify multivendor management. This section discusses the relevant MPLS MIBs and the features that are most relevant within them. Figure 8-11 shows the points in an MPLS VPN where the MIBs are applicable.

Figure 8-11. MPLS MIB Applicability Points

MPLS-VPN-MIB

The following notifications are useful for monitoring the health of the VRF interfaces when they are created and when they are removed:

mplsVRFIfUp/mplsVRFIfDown notifications These are generated when
Problems can sometimes occur when a PE starts to exceed its available resources. For example, memory is consumed as routes are installed in the VRFs. Therefore, it can be beneficial to set route limits that warn when specific thresholds are reached. Additionally, service providers might want to charge their customers in relation to the number of routes. The following notifications are useful for both of these purposes:

mplsNumVrfRouteMidThreshExceeded and mplsNumVrfRouteMaxThreshExceeded These are generated when
BGPv4-MIB and Vendor BGP MIBs

In the context of VPNs, the standard BGPv4-MIB does not support the VPNv4 routes used by MP-BGP. This functionality currently is provided by vendor-specific MIBs. For example, CISCO-BGPV4-MIB provides support for tracking MP-BGP sessions, which is essential for successful operation of the VPN. The Cisco MIB provides notifications that reflect the MP-BGP session Finite State Machine (FSM). For example, notifications are sent when the Border Gateway Protocol (BGP) FSM moves from a higher-numbered state to a lower-numbered state and when the prefix count for an address family on a BGP session has exceeded the configured threshold value.

MIBs related to MPLS transport:
There are also standard and proprietary MIBs for the OSPF IGP.

Resource Monitoring

To effectively monitor the network, the network manager should pay specific attention to resources on PEs. Because of their position and role in the VPN, they are particularly susceptible to problems caused by low memory and high CPU. Some vendors may provide either MIBs or proprietary techniques to set thresholds on both of these resources. In any case, it is recommended that baseline figures be obtained and alerts be generated should deviations occur. The service provider should consult each vendor for the best way to extract memory and CPU utilization data, especially because this may vary across different platform types and architectures. Vendors should also be able to recommend "safe" amounts of free memory. Often, extraction may have to be done via scripts that log in and retrieve the data via the CLI, but the ideal approach is to use a vendor element management system (EMS) that is tuned to detect and troubleshoot resource problems. The following factors might contribute to low resource levels:
OAM and Troubleshooting

It is inevitable that the service provider network will experience problems that affect enterprise VPN availability. When such situations occur, it is essential that the service provider have the best possible troubleshooting tool set at its disposal. Recent advancements in MPLS OAM in particular have the potential to give this technology carrier-class OAM capability. As discussed, fault management is a combination of proactive and reactive techniques. From the service provider perspective, proactive monitoring and subsequent troubleshooting are ideal because they have the potential to detect problems before the end customer does. The next section discusses both areas and recommends tools and strategies that the service provider can adopt. Legacy Layer 2 access technologies are beyond the scope of this book; the emphasis is very much on the MPLS VPN itself.

Proactive Monitoring in Detail

The "Proactive Monitoring" section earlier in this chapter discussed the scope options a service provider has when monitoring the VPN for faults. Fundamentally, the techniques employed are either "off-box" (NMS-based) or "on-box" (probe-based). This section discusses these options in more detail and outlines the relative advantages and disadvantages of each. These approaches are illustrated in Figures 8-12 and 8-13.

Figure 8-12. Off-Box Reachability Testing

Figure 8-13. On-Box Reachability Testing

Active monitoring using intelligent probes is the idealized approach because it pushes the responsibility down into the network, limiting the amount of external configuration and traffic on the data communications network (DCN). However, there are several reasons why this might not always be chosen:
In fact, often the best approach for the service provider is to employ a mixture of off-box and on-box testing to gain the required coverage. The next section explores what on- and off-box tools are available to the service provider.

VPN Layer

At this layer, the service provider monitors VPN reachability. As discussed earlier, the three main test path options are PE-PE, PE-core-PE-CE, and CE-CE. Each is basically a trade-off between accuracy and scalability, although logistics issues (such as CE access) might restrict what can be done. PE-PE testing can be done in one of two ways: either the service provider can build a dedicated "test VPN," or it can test within the customer VPN itself. These approaches are shown in Figures 8-14 and 8-15.

Figure 8-14. PE-PE Monitoring Using a Test VPN
Figure 8-15. PE-PE Monitoring Within Customer VPNs

In the "test VPN" approach, each PE tests reachability to all other PEs within the same route reflector domain, even though they may not have an explicit customer VPN relationship. This is still of the "n squared" magnitude discussed earlier. Service providers that use this technique are therefore concerned simply that the PEs have basic VPN-layer connectivity with one another. In "customer VRF" testing, the service provider verifies specific customer VRF paths. This is more accurate because it closely mimics the customer traffic path, but it has scale implications. Because of the potential number of VRFs involved, the service provider is faced with the decision of which VRFs to monitor. This is usually influenced by specific customer SLAs. Figure 8-15 shows PE-PE testing, but the concept is similar (and more accurate still) for the other paths.

Off-Box Testing

In terms of the actual available instrumentation, off-box testing typically uses a "VRF-aware ping" that is supported by many vendors. This is essentially an Internet Control Message Protocol (ICMP) ping within the context of a VRF, using the VRF at the ingress/egress PEs and MPLS in the core to route the packet toward and back from the destination. If this ping succeeds, it provides strong validation that the PE-PE path is healthy. That is, the double label stack imposed as a result of using a VPN prefix is the correct one, the underlying MPLS transport is healthy, and the VRF route lookups and forwarding are correct. Here's an example of the VRF-aware ping:

cl-7206vxr-4#ping vrf red_vpn 8.1.1.2

Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 8.1.1.2, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 92/96/108 ms
cl-7206vxr-4#

On-Box Testing

This technique uses one of the proprietary probe tools available from vendors.
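An off-box tester that drives the VRF-aware ping over the CLI ultimately has to parse the router's summary line to decide whether the PE-PE path is healthy. The following Python sketch shows one way this might be done; the regular expression and the health threshold are illustrative assumptions, not part of any vendor product.

```python
import re

# Matches the IOS ping summary line, e.g.:
# "Success rate is 100 percent (5/5), round-trip min/avg/max = 92/96/108 ms"
SUMMARY = re.compile(
    r"Success rate is (\d+) percent \((\d+)/(\d+)\)"
    r"(?:, round-trip min/avg/max = (\d+)/(\d+)/(\d+) ms)?"
)

def parse_vrf_ping(output, loss_threshold_pct=100):
    """Extract metrics from captured 'ping vrf' output and flag the
    path as unhealthy when the success rate falls below the threshold."""
    m = SUMMARY.search(output)
    if not m:
        raise ValueError("no ping summary found")
    success = int(m.group(1))
    result = {
        "success_pct": success,
        "received": int(m.group(2)),
        "sent": int(m.group(3)),
        "healthy": success >= loss_threshold_pct,
    }
    if m.group(4):  # RTT figures are absent when every ping fails
        result["rtt_min"], result["rtt_avg"], result["rtt_max"] = (
            int(m.group(4)), int(m.group(5)), int(m.group(6)))
    return result
```

Feeding it the session text captured above would report a healthy path with an average RTT of 96 ms; a run of "....." with "Success rate is 0 percent (0/5)" would be flagged unhealthy with no RTT fields.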
A common deployment model for PE-PE testing is to use a separate, dedicated router attached to a PE to host the probes. This is commonly called a "shadow router." It has a number of advantages:
This scheme has two variations. In the "shadow CE" variation, the probe router is not VRF-aware and simply emulates one or more CEs. A problem with this approach is that it cannot deal with overlapping IP addresses that may be advertised by remote sites within different VPNs. An example using the Cisco IP SLA technology is shown in Figure 8-16.

Figure 8-16. Shadow CE Scheme

In the "shadow PE" model, the probe router is VRF-aware and is effectively a peer of the other PE routers. This solves the overlapping IP address problem, as shown in Figure 8-17.

Figure 8-17. Shadow PE Scheme

CE-CE monitoring seemingly is the optimal solution. Indeed, it's often used even in unmanaged services, where the enterprise may grant access to the CE for probe deployment. The main problem with this approach is one of scale: as the VPN grows, so does the number of remote sites that have to be tested, and hence the number of probes. Given that CEs are typically low-end devices, this can present a performance problem. Another issue is the maintenance overhead on the management systems in terms of keeping up with adds/moves/changes and also tracking and correlating events from the CEs. In practice, the service provider should work with the equipment vendors to establish the performance profile for probing. Such a profile should include factors such as the QoS and SLA metrics required. This can then be used to negotiate with each customer which sites will be monitored end to end. An example might be that customers get monitoring from certain key sites to the hubs or data centers. Example 8-4 is a sample configuration taken from the Cisco IP SLA technology in Cisco IOS. It shows a basic VRF-aware probe configuration.

Example 8-4. VRF-Aware Probe Creation
Line 1 creates the probe. Line 2 specifies that a User Datagram Protocol (UDP) echo test is required toward the destination address and port. Line 3 specifies the VRF within which to execute the operation. Line 4 instructs the probe to start sending immediately with the default frequency.

Note: IP SLA was originally called the Response Time Reporter (RTR). This is reflected in the CLI of older IOS versions, such as the rtr 3 command in Example 8-4.

If such a probe were to detect a connectivity problem followed by restoration, the traps would look similar to Example 8-5.

Example 8-5. IP SLA Connection Lost/Restored Traps
Note: This output can be obtained by using the show logging command with debug snmp switched on.

In Example 8-5, the line in bold shows the var bind that indicates the connection being lost.
The preceding line in bold shows the var bind that indicates the connection being restored.

MPLS Layer

The recent growth of MPLS as a transport technology has brought with it demand from network managers for OAM capabilities. Specifically, operators are asking for OAM features akin to those found in circuit-oriented technologies such as ATM and Frame Relay. Additionally, many service providers are used to basing services on highly resilient and fault-tolerant time-division multiplexing (TDM) networks. This has led to requirements being placed on MPLS to acquire built-in protocol operations to rapidly detect and respond to failure conditions. The common requirement for such tools is the ability to test and troubleshoot the data plane, because the data and control planes may sometimes lose synchronization. A key message for the service provider here is that it is not enough to simply monitor the control plane of an MPLS VPN. For example, even though routes may be installed in the global and VRF tables, there is absolutely no guarantee that the data plane is operational. Only by testing the data plane does the service provider have confidence that customer traffic will be transported correctly across its network. Various standards have been proposed to equip MPLS with such technology. The next section concentrates on the main ones that are available or will be shortly.

LSP Ping/Traceroute

This tool set is currently an IETF draft, defined at http://www.ietf.org/internet-drafts/draft-ietf-mpls-lsp-ping-03.txt. The intent of these tools is to provide a mechanism to let operators and management systems test LSPs and help isolate problems. Conceptually, these tools mirror the principles of traditional ICMP ping/traceroute: an echo request is sent down a specific path, and the receiver sends an echo reply. However, several differences are essential to verifying the health of individual LSPs:
Figure 8-18 illustrates the LSP ping.

Figure 8-18. LSP Ping (Source: MPLS Embedded Management: LSP Ping/Traceroute and AToM VCCV, http://www.cisco.com/en/US/products/sw/iosswrel/ps1829/products_feature_guide09186a00801eb054.html)

If you initiate an MPLS LSP ping request at LSR1 to a prefix at LSR6, the following sequence occurs:
The following is an example of using LSP ping via the Cisco CLI. In this case, the LSP is broken, as revealed by the "R" return code (meaning that the last LSR to reply is not the target egress router):
A successful LSP ping looks something like this:
LSP traceroute provides hop-by-hop fault localization by using TTL settings to force expiration of the TTL along an LSP. LSP traceroute incrementally increases the TTL value in its MPLS echo requests (TTL = 1, 2, 3, 4, and so on) to discover the downstream mapping of each successive hop. The success of LSP traceroute depends on the transit router processing the MPLS echo request when it receives a labeled packet with TTL = 1. On Cisco routers, when the TTL expires, the packet is sent to the route processor (RP) for processing. The transit router returns an MPLS echo reply containing information about the transit hop in response to the TTL-expired MPLS packet. The echo request and echo reply are UDP packets with source and destination ports set to 3503. Figure 8-19 shows an MPLS LSP traceroute example with an LSP from LSR1 to LSR4.

Figure 8-19. LSP Traceroute

If you enter an LSP traceroute to an FEC at LSR4 from LSR1, the steps and actions shown in Table 8-1 occur.
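The TTL-expansion behavior just described can be sketched in a few lines of Python. This is a toy simulation under stated assumptions: the hop list is hypothetical, and a real implementation sends MPLS echo requests and waits for replies rather than iterating over a list; a None entry models a transit LSR that drops the expired packet instead of answering.

```python
def lsp_traceroute(lsrs, max_ttl=30):
    """Simulate LSP traceroute's TTL expansion: probes are sent with
    TTL = 1, 2, 3, ... so each expires at a successive transit LSR,
    which replies with its address (and downstream label mapping).
    `lsrs` is the ordered list of hops; None models a silent hop."""
    hops = []
    for ttl in range(1, max_ttl + 1):
        if ttl > len(lsrs):            # ran past the egress: stop probing
            break
        reply = lsrs[ttl - 1]          # LSR where this probe's TTL expires
        hops.append(reply if reply is not None else "*")  # '*' = timeout
        if reply is not None and ttl == len(lsrs):
            break                      # egress LSR answered: path verified
    return hops
```

For the two-hop path in the successful CLI trace, `lsp_traceroute(["6.6.1.25", "6.6.1.26"])` returns both hops in order, mirroring the numbered output lines.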
Here's a CLI example of a broken path:
In this case, the break occurs because the LSP segment on interface 6.6.1.5 sends an untagged packet. By way of comparison, a successful traceroute looks something like this:

cl-12008-1#traceroute mpls ipv4 6.6.7.4/32

Tracing MPLS Label Switched Path to 6.6.7.4/32, timeout is 2 seconds

Codes: '!' - success, 'Q' - request not transmitted,
  '.' - timeout, 'U' - unreachable,
  'R' - downstream router but not target,
  'M' - malformed request

Type escape sequence to abort.
  0 6.6.1.25 MRU 1709 [implicit-null]
! 1 6.6.1.26 4 ms

The main difference here is that the successful traceroute ends with a !, as with a regular IP ping. As will be discussed shortly, these tools can be used as essential building blocks in a service provider's MPLS VPN troubleshooting strategy.

Proactive Monitoring of PE-PE LSPs

Although the LSP ping/traceroute tools provide invaluable troubleshooting capability, they are not designed for monitoring. Instead, they are of more use to an operator who wants to troubleshoot a reported problem or verify the health of certain paths after network changes. Vendors such as Cisco are developing probe-based techniques that use the LSP ping/traceroute mechanism, but in such a manner as to allow monitoring of a full-mesh PE-PE network. The key to the scalability of such probes is to test only those paths that are relevant to service delivery. In the context of an MPLS VPN, this means that from a given ingress PE, only LSPs that are used to carry VPN traffic are tested. This is important because traffic in different VPNs that is destined for the same egress PE is essentially multiplexed onto the same transport LSP. This concept is shown in Figure 8-20.

Figure 8-20. Intelligent LSP Probing

This figure shows that Blue VPN Site 1 and Red VPN Site 3 share the same transport LSP to reach Blue VPN Site 3 and Red VPN Site 1. With such a mechanism in place, the service provider has the means to monitor LSPs at a high rate.
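The path-selection idea behind such intelligent probing can be sketched as follows. This is a simplification, assuming the ingress PE (or its management system) can enumerate the BGP next hops seen in each VRF; the VRF and address names are hypothetical.

```python
def lsps_to_probe(vrf_routes):
    """Given {vrf_name: [egress_pe, ...]} for an ingress PE, return the
    distinct egress PEs. Traffic from different VPNs toward the same
    egress PE shares one transport LSP, so one probe per egress PE
    covers every VPN multiplexed onto it."""
    targets = set()
    for next_hops in vrf_routes.values():
        targets.update(next_hops)
    return sorted(targets)
```

For example, with a blue VPN reaching egress PEs 10.0.0.3 and 10.0.0.5 and a red VPN reaching 10.0.0.3, only two transport LSPs need probing rather than three, and the saving grows with the number of VRFs per PE.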
Failure of an LSP results in an SNMP notification being sent to the NMS, whereupon service impact analysis, correlation, and troubleshooting can begin. One point to stress in such testing is the presence of Equal-Cost Multipath (ECMP). Very often, ECMP is in use within a service provider core, meaning that multiple LSPs may be available to carry traffic from the ingress to the egress PE. Service providers therefore should ask vendors how both the reactive CLI tools and any probe-based tools will exercise the available ECMPs. Equally important is that if a failure occurs, the notification sent to the NMS clearly identifies which path has failed.

Performance Problems

This is perhaps the most difficult problem category for an operator to troubleshoot. Not only do these problems tend to be transient in nature, but they also are inherently more complex because of the many different network segments and features involved. A typical example of a problem in this area is an enterprise customer reporting poor performance on one of his or her services, such as VoIP. Packet loss, delay, and jitter all adversely affect the quality of this service. How does the operator tackle such problems? The first task is to identify the customer traffic at the VRF interface on the PE. Sampling tools such as Cisco NetFlow are extremely useful here. They identify flows of traffic (for example, based on source/destination IP addresses and ports) and cache statistics related to these flows, which can then be exported for offline analysis. An important piece of data is the ToS marking (IP Precedence or DSCP). If this can be obtained, the operator can answer one of the first questions: is the customer's data being marked correctly? Next, the QoS policies on the VRF interface can be analyzed to determine whether any of the customer traffic is being dropped. The following example is from the Cisco modular QoS CLI (MQC) show policy-map interface command.
It shows that 16 packets from the class (that is, traffic matching specific properties: the target traffic) have been dropped due to the policer's actions:

Service-policy output: ce_6cos_out_A_40M_21344K (1159)
  Class-map: ce_mgmt_bun_output (match-any) (1160/7)
    314997339 packets, 161278311131 bytes
    5 minute offered rate 952503000 bps, drop rate 943710000 bps
    Match: access-group 199 (1161)
      1 packets, 608 bytes
      5 minute rate 0 bps
    Match: access-group 198 (1162)
      0 packets, 0 bytes
      5 minute rate 0 bps
    Match: ip precedence 0 (1163)
      314997338 packets, 161278310523 bytes
      5 minute rate 952503000 bps
    bandwidth: 1955 kbps (EIR Weight 0%)
    police:
      8000 bps, 8000 limit, 8000 extended limit
      conformed 2580 packets, 1319729 bytes; rate 7000 bps; action: set-dscp-transmit 48
      exceeded 16 packets, 7968 bytes; rate 0 bps; action: drop

If traffic is being dropped, this might indicate that the customer is exceeding the allocated bandwidth (or bursting above agreed-on values), which may explain the problem. If the ingress PE classification and QoS policy actions seem correct, the next stage is to analyze the remaining path across the service provider network. The first task is to find out exactly what that path is. From PE to PE, it can be obtained using the LSP traceroute feature described earlier. Then, for each interface in the path, the effects of any QoS policies are examined to ensure that no unexpected drops are occurring. In an MPLS core, this typically involves looking at any QoS congestion avoidance or congestion management mechanisms. For example, traffic-engineered tunnels are becoming increasingly popular for guaranteeing bandwidth across the core network. It may not always be possible to inspect the live customer traffic and associated network behavior. In these cases, synthetic probes are extremely useful because they allow an operator to probe the problem path for a certain time period.
This reveals any major problems with the service provider network, such as wrong QoS classification/action, congestion, or bottlenecks. If these tests are positive, the problem is almost certainly with the customer traffic (incorrect marking or allocated bandwidth being exceeded).

Fault Management

The previous sections have illustrated techniques for both the data and control planes of an MPLS VPN. This is of little value, however, if the service provider does not have adequate fault management systems and processes in place. If the service provider detects faults, they need to be acted on. More significantly, however, the enterprise customer often detects problems before the service provider does. The enterprise's ability to test connectivity at higher frequencies is one factor in this. However, individual end users also report problems, mainly of a performance nature, to their own IT departments. Many enterprises can rule out their own networks and systems as the root cause and blame the service provider. At this point, a call is placed to the service provider's first-line support. This section helps the service provider ensure that it has the correct reactive and proactive fault systems in place to deal with both scenarios.

Proactive Fault Management

Assume that a fault has been detected by the service provider's own monitoring system, within a VRF, from the PE across the MPLS core to the remote CE. Traditionally, such a fault would be picked up by network operations in the network operations center (NOC), and the problem's severity would be assessed. Given that we are talking about a potential outage scenario, how should operations proceed in isolating, diagnosing, and repairing the fault? This is in fact a multilayered problem that requires a systematic approach to troubleshooting. The first area that the service provider should look at is recent events that have been collected from the network.
Problems within the network may have resulted in events being generated that immediately explain the outage. Examples include link up/down, loss of contact with a router, and specific protocol issues such as LDP or BGP session losses. What is really needed is a means to correlate the data plane outage with the other events that have been collected from the network. Several systems on the market perform this task and immediately point to a possible root cause. Now assume that there are no obvious reasons why the connection has failed. What is required is a means to identify which layer has the problem: VPN, IP, or MPLS. One approach is to rule out each one in turn by performing data plane tests. If both the IP and MPLS data planes are correct, there is a problem with either VPN route availability or the VPN switching path, as shown in Figure 8-21.

Figure 8-21. Troubleshooting a VPN Outage
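The layer-by-layer elimination just described can be expressed as a small decision function. This is a sketch only; in practice the three boolean inputs would come from the data plane tools discussed in this chapter (ICMP ping, LSP ping, and VRF-aware ping, respectively).

```python
def suspect_layer(ip_plane_ok, mpls_plane_ok, vpn_plane_ok):
    """Apply the elimination order described above: rule out IP first,
    then MPLS; if both pass but the VPN-layer test still fails, the
    fault lies in VPN route availability or the VPN switching path."""
    if not ip_plane_ok:
        return "IP"
    if not mpls_plane_ok:
        return "MPLS"
    if not vpn_plane_ok:
        return "VPN"
    return "none"
```

The ordering matters: an IP failure usually makes the MPLS and VPN tests fail too, so reporting the lowest broken layer first avoids chasing symptoms.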
After the initial diagnosis has been made (control/data plane, VRF/MPLS/IP, and so on), further troubleshooting is required to isolate and then fully diagnose the problem. Isolation within the data plane typically uses one of the traceroute tools discussed previously. You must be careful here, however, especially in heterogeneous networks, because not all boxes may support the OAM standards, and even if they do, some interoperability issues may exist. False negatives are the most obvious condition that may arise, such as one vendor's box failing to reply to an echo packet even though there is no problem. The service provider therefore should ask each vendor which draft version of the standard is supported and then conduct lab testing to identify any issues. Another point worth stressing with the OAM isolation tools is that the last box returned in a trace output may not always be the failure point. For example, the fault may in fact lie with the next downstream router, which cannot reply for some reason. Some intelligence is needed in this part of the process to rule out the last node in the trace output before inspecting (ideally automatically) those downstream of it. Root cause identification requires inspecting the router's configuration, status, and forwarding engines. Although many issues are caused by misconfigurations and deterministic errors, defects within hardware or software are the most difficult to find and often are the most costly. Manual or automatic troubleshooting via management systems therefore should concentrate on ruling out obvious causes before looking for symptoms caused by defects in the data/control planes. A good illustration of the latter is label inconsistencies. For example, suppose an egress PE router receives a new route from a CE. BGP then allocates a label, installs the route and the label into the local forwarding tables, and propagates them to the relevant ingress PE routers.
Various things can go wrong here, such as allocation failures, local/remote installation problems, and propagation issues. These issues, as surfaced through the Cisco CLI, are shown in Figure 8-22.

Figure 8-22. MPLS VPN Label Problem Areas

Problems within the control plane require inspection of individual routers. For example, if the ingress PE router loses a route (or never receives it in the first place), the egress router from which it should first have been learned should be inspected first (to see whether the route was received, whether it was then propagated, and so on). LDP problems tend to manifest themselves as MPLS data plane problems and hence can be found using the same isolation/inspection technique previously described. Hopefully, the picture being painted here is that MPLS VPN troubleshooting is not straightforward. The real message, however, is that the service provider should ask some serious questions of the vendors in terms of what management support they provide to help automate the troubleshooting process. At a minimum, vendors should supply element management systems that check the health of the MPLS and VPN data/control planes on a per-box basis. More useful are network-level systems that set up proactive monitoring and respond automatically to faults when detected. Such systems should employ the systematic techniques described in this section, as well as offer integration points into other Operations Support System (OSS) components.

Case Study: Troubleshooting a Problem with the Acme, Inc. VPN

It is useful to go through an example of troubleshooting an MPLS VPN connectivity problem within the Acme, Inc. VPN. This helps describe how the different tools fit together and how a service provider should prepare in the event that a fault is reported or detected. In this scenario, assume that Acme, Inc. reports loss of connectivity between its Glasgow and London sites, as shown in Figure 8-23.

Figure 8-23.
Connectivity Problem Between Glasgow and London

The problem is reported to first-line support, at which point an attempt is made to reproduce it within the service provider network. In an unmanaged service, this could be done by testing from the local PEs to the source and destination sites, as shown in Figure 8-24.

Figure 8-24. Testing VPN Connectivity in the Service Provider Network

These tests fail in both directions. At this point, first-line support may choose to open a trouble ticket and escalate the problem. (This is very much dependent on the service provider's support model.) Assume that second-line support is now alerted. The EMS could be used to further narrow down the problem by testing whether IP and LSP connectivity are valid in the core, as shown in Figure 8-25.

Figure 8-25. Testing IP and MPLS Paths

These tests reveal that basic IP connectivity is healthy but the transport LSP is broken. The EMS now issues an LSP traceroute, as shown in Figure 8-26.

Figure 8-26. Using LSP Trace to Help Isolate the Problem

In this example, the traceroute gets as far as router P2. This router is now examined from both a control-plane and a data-plane perspective by the EMS, as illustrated in Figure 8-27.

Figure 8-27. Diagnosing LSRs in a Broken Path

Notice that in this case, the actual failure point was not the last node returned from the trace output. Because of this, the next hop (the LDP downstream neighbor) had to be calculated and inspected. This illustrates the value of having an automated management solution to help with troubleshooting.

Reactive Fault Management

Many of the same principles already discussed apply to reactive fault management. However, some subtle distinctions need to be observed. First, in the scenario where a customer calls first-line support to report a problem, the service provider needs a management system that will help this part of the business, as well as link into the troubleshooting tools already discussed.
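A sketch of what such a front-office linkage might look like follows. Every name here is hypothetical: the inventory structure and the `tester` callable stand in for whatever OSS inventory and VRF-aware test harness the provider actually has, and are assumptions rather than any vendor's API.

```python
def verify_customer_report(inventory, tester, customer, src_site, dst_site):
    """Reproduce a customer-reported connectivity problem: look up the
    VRF and attachment PEs from inventory, run a VRF-aware reachability
    test in both directions, and return a report for the trouble ticket.
    `tester(vrf, from_pe, to_pe)` returns True when the path is healthy."""
    vrf = inventory[customer]["vrf"]
    src_pe = inventory[customer]["sites"][src_site]
    dst_pe = inventory[customer]["sites"][dst_site]
    forward = tester(vrf, src_pe, dst_pe)
    reverse = tester(vrf, dst_pe, src_pe)
    return {
        "vrf": vrf,
        "forward_ok": forward,
        "reverse_ok": reverse,
        "reproduced": not (forward and reverse),  # any failure confirms it
    }
```

The point of the design is that the operator supplies only the customer and site names; the VRF lookup and test execution happen behind the scenes, and the returned report can seed a trouble ticket for escalation.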
An important aspect of handling customer-reported faults is being able to quickly reproduce the problem, thereby verifying whether the service provider or the enterprise is the likely source. A second requirement is to deal with the varying skill sets of support personnel. Some support staff have good network knowledge; others have little or no understanding of an MPLS VPN. An ideal system is one that allows an inexperienced operator to simply enter the details of the customer and the source/destination of the problem. The tool then retrieves all necessary VPN and network-related data "behind the scenes" and attempts to verify the problem. If the problem exists, a report is generated, highlighting the nature of the problem and any other contextual data. This allows the operator to raise an appropriate trouble ticket and alert the necessary network engineers for further analysis. Service providers might choose to build (or have built) the "front-office" applications themselves. However, at some point in the troubleshooting process, detailed knowledge of vendor equipment is required if the whole process is to be simplified to reduce opex and maximize service uptime. Service providers therefore should discuss the availability and integration of vendor MPLS VPN NMS/EMS systems into their OSS.

SLA Monitoring

Providing an excellent service is the primary goal of any service provider. Fundamental to this is adhering to the SLAs agreed to with end customers. But how do service providers know they are satisfying such SLAs, especially given the more stringent and complex ones associated with IP services delivered across an MPLS VPN? The key to successful monitoring in an MPLS VPN is to ensure that the technology exists within vendor equipment to proactively monitor the network from a performance perspective. Next comes the OSS to allow configuration, detection, reporting, and troubleshooting related to such monitoring.
As has been stressed, the technology usually exists in the form of synthetic probing, because this provides the most accurate data. Such probes must support the following requirements to be useful in monitoring modern SLAs:
Accuracy

The service provider should ask the vendor serious questions about the accuracy of probe data. If there is any dispute over the SLA, it is crucial that the service provider be able to rely on the data provided by the probes to prove or disprove any performance claims. Here are some specific questions that should be asked:
Probe Metric Support

To minimize overhead, the service provider should ideally have a probe technology that supports the basic metrics of delay, jitter, packet loss, and availability from within a single operation. This can then be combined with one of the core monitoring strategies outlined earlier to provide minimal SLA monitoring. For example, the Cisco IP SLA jitter probe supports these combined metrics. The following example shows how to configure such a probe that sends five packets every 20 ms at a frequency of once every 10 seconds with a packet size of 60 bytes:
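A minimal sketch of such a configuration, using the legacy rtr command syntax referenced later in this section (the destination address 10.1.1.1 and port 16384 are placeholders; exact keywords vary by IOS release):

```
rtr 1
 ! jitter operation: 5 packets per burst, 20-ms inter-packet interval
 type jitter dest-ipaddr 10.1.1.1 dest-port 16384 num-packets 5 interval 20
 ! 60-byte payload
 request-data-size 60
 ! run the operation once every 10 seconds
 frequency 10
rtr schedule 1 start-time now life forever
```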
The metrics collected by this probe would then be obtained from the show rtr operational-state command:
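An abbreviated, illustrative sketch of the kind of output this command returns is shown here; all values are hypothetical, and field names vary by IOS release:

```
Router# show rtr operational-state 1
Entry number: 1
Latest operation return code: OK
RTT Values:
 NumOfRTT: 5     RTTSum: 60      RTTMin: 10      RTTMax: 15
Packet Loss Values:
 PacketLossSD: 0                 PacketLossDS: 0
Jitter Values:
 MinOfPositivesSD: 1             MaxOfPositivesSD: 3
 MinOfNegativesSD: 1             MaxOfNegativesSD: 2
```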
Here are some important points to note from this output:
QoS Support

Enterprise customers typically require QoS markings to be preserved across the service provider network. This means that when the packets arrive at their destination, they should have the same QoS classification as when they entered. However, it is not practical or scalable for the service provider to create unique QoS classes for each customer. A common approach is for the service provider to offer a standard set of QoS classes onto which customer traffic is mapped, such as voice, business-class, and best-effort. It then becomes essential that the service provider monitor how the network handles these classes. To do that, the probe technology needs to be able to set QoS in the packets. The following example shows how QoS is marked within the Cisco IP SLA technology. In this case, the ToS bits in the IP header are used (there is a standard mapping between ToS and DSCP). The following example shows how to configure a jitter probe to have a DSCP value of 101110 (ToS equivalent 0xB8), which is the recommended marking for voice traffic:
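A sketch of the marking configuration, again using the legacy rtr syntax with placeholder destination values; the tos keyword takes the full ToS byte in decimal (184 = 0xB8, the DSCP EF marking shifted into the ToS byte):

```
rtr 2
 type jitter dest-ipaddr 10.1.1.1 dest-port 16384 num-packets 5 interval 20
 ! 184 decimal = 0xB8 = DSCP 101110 (EF) in the ToS byte
 tos 184
rtr schedule 2 start-time now life forever
```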
Specialized Voice Probes

Delay, jitter, and packet loss are the primary factors that impair voice quality. Although it is essential that the service provider have access to detailed metrics for these properties, it can be complex to translate them into an immediate assessment of voice quality. The service provider therefore should look to the vendor to provide voice-quality "scores" as a guide. Two common examples are Mean Opinion Score (MOS) and International Calculated Planning Impairment Factor (ICPIF). Additionally, it is important that common codecs be supported; specifying a codec would then ideally select the appropriate probe packet sizes and intervals required to test the network more accurately. Examples include G.711 mu-law (g711ulaw), G.711 A-law (g711alaw), and G.729A (g729a). The following example shows how Cisco SAA (Cisco Service Assurance Agent, since renamed IP SLA) has extended its jitter probe to support such features. This example simulates a G.711 mu-law codec with 1000 packets, a 20-ms interval, and a frequency of 1 minute:
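A sketch of the codec-aware jitter operation (destination address and RTP port are placeholders; specifying a codec implies the packet size, so no request-data-size is needed):

```
rtr 3
 ! g711ulaw codec simulation: 1000 packets per burst, 20-ms spacing
 type jitter dest-ipaddr 10.1.1.1 dest-port 16384 codec g711ulaw
  codec-numpackets 1000 codec-interval 20
 ! run once per minute
 frequency 60
rtr schedule 3 start-time now life forever
```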
Threshold Breach Notification

Expanding on the proactive fault management theme discussed earlier, probes ideally should support informing management stations when certain SLA conditions might be breached. A further requirement is the concept of low and high watermarks, as shown in Figure 8-28.

Figure 8-28. High and Low Watermark Thresholds

This minimizes notifications caused by transient network behavior: events are sent only when the high watermark has been breached and thereafter only after the low watermark has been passed. Again, the service provider should look for these features in the probe technology because they allow proactive management of network performance to be put in place, hopefully alleviating problems before the end customer detects them. The following example shows how to configure high and low thresholds for a Cisco SAA jitter probe, with immediate generation of a trap:
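A sketch of a reaction configuration for probe entry 1; the 100-ms high and 50-ms low watermarks are illustrative, and the exact reaction-configuration keywords differ across IOS releases:

```
! trap immediately when average jitter crosses the 100-ms high watermark;
! re-arm only after it falls back below the 50-ms low watermark
rtr reaction-configuration 1 react jitterAvg threshold-value 100 50
 threshold-type immediate action-type trapOnly
! enable SAA/rtr trap generation toward the SNMP host
snmp-server enable traps rtr
```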
Reporting

A useful tool to help the service provider support customer VPN reports is VPN-aware SNMP. If the vendor supports such a feature, it can help the service provider offer each customer secure access to data for that customer's VPNs only. This feature works by accepting SNMP requests on any configured VRF and returning responses to the same VRF. A trap host can also be associated with a specific VRF. This allows the service provider to restrict the view that a given user on a given SNMP server has. When polling a device through a VRF for a given MIB, the user has a restricted view of specific tables of that MIB. This concept is shown in Figure 8-29.

Figure 8-29. VPN-Aware SNMP
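As an illustrative sketch, on Cisco IOS a trap host reachable through a customer VRF might be configured as follows (the VRF name, address, and community string are placeholders):

```
! read-only community for the customer's management station
snmp-server community CustA-ro RO
! send traps to a host reachable via the CustomerA VRF
snmp-server host 192.168.1.10 vrf CustomerA version 2c CustA-ro
```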
A service provider could also use this technique to offer partial information views to a peering service provider or a third party in charge of measuring performance and service uptime for SLA verification purposes. Also, the protocol's VRF awareness allows a management VRF to be used to communicate with a NOC.