The Service Provider: How to Meet and Exceed Customer Expectations


This section discusses the implications of enterprise expectations for the service provider and how these translate into management functions and practices that help meet and exceed them.

Provisioning

The provisioning function is heavily dependent on whether the service is managed or unmanaged, because a managed service adds the CE to the service provider's domain of responsibility.

Zero-Touch Deployment

For managed services, one of the most time-consuming and expensive tasks within the rollout or upgrade of the VPN is the provisioning of CE equipment. Vendors now provide solutions that can help automate this process. Figure 8-10 is a high-level example of how such a process might operate.

Figure 8-10. Automated CE Provisioning

Cisco Configuration Express, http://www.cisco.com/cx


The steps are as follows:

Step 1.

The subscriber orders managed service from the service provider.

Step 2.

The service provider orders the subscriber CE from the vendor with the optional bootstrap configuration. Then it registers the CE with the configuration management application.

Step 3.

The vendor ships the customer CE with the bootstrap configuration. The service provider can track the shipment online.

Step 4.

The subscriber plugs in the CE. The device boots, pulls the service configuration, and validates it.

Step 5.

The device publishes "configuration success," indicating that service is on.

Of course, certain key technologies need to be in place from the vendor for such a system to be used. The key factor here is that this "zero-touch provisioning" system pushes intelligence down into the network. Devices require more advanced management functionality and cooperation with offline systems. A typical system might contain the following components:

  • An online ordering system that supports the following functions:

    - A bootstrap configuration that lets the CE connect to the management system when it is powered up and connected at the customer site (a sample bootstrap sketch follows this list)

    - A shipping address for the end customer

    - Interfaces with back-end manufacturing, which also results in a bootstrap configuration being created with chassis-specific details (such as MAC addresses and serial numbers)

    - Order tracking

    - Offline configuration management

  • An offline configuration management system must support the following functions:

    - Inventory of CE devices

    - Initial configuration of CE devices and submodules

    - Secure communication with CE devices

    - Building and downloading service configurations when requested by CE devices

    - Progress, reporting, and status of CE communication

    - Embedded intelligence

  • Embedded intelligence must support the following functions:

    - Automatically contacting the configuration management application on the bootstrap

    - Providing inventory (physical and logical) to the management system

    - Retrieval, checking, and loading of the configuration from the management system

    - Publishing status events on this process to the management system
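As an illustration of the bootstrap configuration referred to in this list, the following is a minimal sketch of what might be preloaded on a Cisco CE at manufacturing time so that it can contact the configuration management application on first boot. It assumes the Cisco CNS agents are used; the hostname, server name, port numbers, and interface details are purely illustrative, and the exact commands vary by platform and IOS release.

! Hypothetical bootstrap configuration loaded before shipment
hostname acme-site1-ce1
!
interface Serial0/0
 encapsulation ppp
 ip address negotiated
 no shutdown
!
! CNS agents: identify the chassis and call home for the full service configuration
cns id hardware-serial
cns config initial cns-server.sp.example.com 80
cns event cns-server.sp.example.com 11011

Once the device retrieves and validates its service configuration, it publishes a success event back to the management system, as described in Step 5.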

PE Configuration

Of course, provisioning the CE is only one of the tasks required in turning up a new circuit. The PE is where most of the configuration is required. Because of the relative complexity of this task, it is highly recommended that the service provider automate this task from a management provisioning system.

Most vendors supply such a system. In general, they take one of two approaches:

  • Element-focused This essentially means that the system supplies a GUI and/or a northbound application programming interface (API) to support the configuration of individual PEs. Such systems tend to be driven from a higher-order provisioning system, which has a wider, VPN-centric view. Their main benefit is to provide a clean, abstract API to the underlying vendor elements.

  • Network-focused These systems tend to model the VPN and related network properties, such as sites and service providers. They may also support element-specific APIs but have the additional use of allowing the whole provisioning operation to be driven entirely by the GUI. Such systems may be more suitable for Tier 2/3 service providers or those that have single-vendor networks.
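To give a sense of what such provisioning systems actually generate, the following is a minimal sketch of the per-VPN configuration typically pushed to a Cisco PE when a new site is turned up. The VRF name, route distinguisher, route targets, interface, addresses, and AS numbers are all illustrative.

! Customer VRF definition
ip vrf acme_vpn
 rd 65000:101
 route-target export 65000:101
 route-target import 65000:101
!
! Bind the customer-facing (CE-facing) interface to the VRF
interface FastEthernet1/0.101
 encapsulation dot1Q 101
 ip vrf forwarding acme_vpn
 ip address 192.168.101.1 255.255.255.252
!
! Carry the customer routes in MP-BGP
router bgp 65000
 address-family ipv4 vrf acme_vpn
  redistribute connected
  neighbor 192.168.101.2 remote-as 65101
  neighbor 192.168.101.2 activate

Whether this is driven through an element-focused API or a network-focused GUI, automating its generation removes a significant source of configuration error.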

Fault Monitoring

An effective fault-monitoring strategy should concentrate on ensuring that any potential service-affecting events generated by vendor equipment are collected first.

Enhancements can then be added to include functions such as impact analysis and correlation. Furthermore, a number of management products in this space provide specialized functions, such as route monitoring availability and correlation of control/data plane events. (For example, perhaps a connection was lost because a route was withdrawn, which in turn was caused by link failure in the network.)

MPLS-Related MIBs

MIBs are the primary source of fault-related events and data from the network elements. Because "standard" MIBs are often implemented by multiple vendors, they can also simplify multivendor management.

This section discusses relevant MPLS MIBs and what features are most relevant within them.

Figure 8-11 shows the points in an MPLS VPN where the MIBs are applicable.

Figure 8-11. MPLS MIB Applicability Points


MPLS-VPN-MIB

The following notifications are useful for monitoring the health of the VRF interfaces when they are created and when they are removed:

mplsVRFIfUp/mplsVRFIfDown notifications

These are generated when

  • The ifOperStatus of the interface associated with the VRF changes to the up/down state.

  • The interface with ifOperStatus = up is (dis)associated with a VRF.

Problems can occur when a PE starts to exhaust its available resources; memory, for example, is consumed by the routes held in each VRF. Therefore, it can be beneficial to set route limits that warn when specific thresholds are reached (a configuration sketch follows the list below). Additionally, service providers might want to charge their customers in relation to the number of routes. The following notifications are useful for both of these purposes:

mplsNumVrfRouteMidThreshExceeded and mplsNumVrfRouteMaxThreshExceeded

These are generated when

  • The number of routes in a given VRF exceeds mplsVrfMidRouteThreshold.

  • The number of routes contained by the specified VRF reaches or attempts to exceed the maximum allowed value, mplsVrfMaxRouteThreshold.
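As a hedged illustration of how these thresholds might be applied on a Cisco PE, the sketch below caps a VRF at 1000 routes, warns at 80 percent of that limit, and enables the corresponding MPLS VPN notifications toward an NMS. The VRF name, limits, NMS address, and community string are hypothetical, and the exact syntax varies by IOS release.

ip vrf acme_vpn
 rd 65000:101
 ! Warn at 80 percent of 1000 routes; alarm at the maximum
 maximum routes 1000 80
!
! Send MPLS VPN notifications (VRF interface up/down, route thresholds) to the NMS
snmp-server enable traps mpls vpn
snmp-server host 192.0.2.10 version 2c public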

BGPv4-MIB and Vendor BGP MIBs

In the context of VPNs, the standard BGPv4-MIB does not support the VPNv4 routes used by MP-BGP. This functionality currently is provided by vendor-specific MIBs. For example, CISCO-BGPV4-MIB provides support for tracking MP-BGP sessions, which is essential for successful operation of the VPN.

The Cisco MIB provides notifications that reflect the MP-BGP session Finite State Machine (FSM). For example, notifications are sent when the Border Gateway Protocol (BGP) FSM moves from a higher numbered state to a lower numbered state and when the prefix count for an address family on a BGP session has exceeded the configured threshold value.

MIBs related to MPLS transport:

  • LDP-MIB In particular, the label distribution protocol (LDP) session up/down traps

  • LSR-MIB Segment and cross-connect traps (if implemented by the vendor)

There are also standard and proprietary MIBs for the IGP (for example, OSPF).

Resource Monitoring

To effectively monitor the network, the network manager should pay specific attention to resources on PEs. Because of their position and role in the VPN, they are particularly susceptible to problems caused by low memory and high CPU.

Some vendors may provide either MIBs or proprietary techniques to set thresholds on both these resources. However, it is recommended that baseline figures be obtained and alerts be generated should deviations occur.

The service provider should consult each vendor for the best way to extract memory and CPU utilization data, especially because this may vary across platform types and architectures. Vendors should also be able to recommend "safe" amounts of free memory. Often, this data has to be retrieved via scripts that log in and gather it from the CLI, but the ideal approach is to use a vendor element management system (EMS) that is tuned to detect and troubleshoot resource problems.
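On Cisco IOS platforms that support them, CPU and memory threshold notifications can complement EMS polling and CLI scripts. The following is a hedged sketch only; feature availability, command syntax, and sensible threshold values vary considerably by platform and release.

! Notify when total CPU utilization rises above 80% (and again when it falls below 40%)
process cpu threshold type total rising 80 interval 60 falling 40 interval 60
snmp-server enable traps cpu threshold
!
! Notify when free processor memory drops below roughly 20 MB
memory free low-watermark processor 20000
!
! Quick manual checks an operator or script might use
! show processes cpu history
! show memory statistics

The thresholds themselves should be derived from the baseline figures mentioned above rather than chosen arbitrarily.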

The following factors might contribute to low resource levels:

  • Memory:

    - Operating system image sizes

    - Route table size (the number of routes and their distribution)

    - BGP table size (paths, prefixes, and number of peers)

    - BGP configuration (soft reconfiguration and multipath)

    - IGP size (number of routes)

    - Any use of transient memory to send updates to linecards

    - Transient memory to send BGP updates to neighbors

  • Main CPU:

    - BGP table size (number of routes)

    - BGP configuration (number of peers)

    - IGP size (number of routes and peers)

    - Routing protocol activity

    - Network-based interrupts (SNMP and Telnet)

  • Linecard CPU:

    - Main factors contributing to higher resource usage

    - Rate of packets being switched by linecard CPU

    - Forwarding table updates

  • Linecard memory:

    - Route memory

    - Forwarding table size

    - Any sampling processes that are running, such as Cisco NetFlow

  • Hardware forwarding memory:

    - Forwarding table size

    - Number of labels

    - Number and size of access lists

    - Number of multicast entries

    - Type of routes in the forwarding table (IGP versus BGP versus BGP multipath)

OAM and Troubleshooting

It is inevitable that the service provider network will experience problems that affect enterprise VPN availability. When such situations occur, it is essential that the service provider have the optimal troubleshooting tool set at its disposal. Recent advancements in MPLS OAM in particular have the potential to give this technology carrier-class OAM capability.

As discussed, fault management is a combination of proactive and reactive techniques. From the service provider perspective, proactive monitoring and subsequent troubleshooting are ideal because they have the potential to detect problems before the end customer. The next section discusses both areas and recommends tools and strategies that the service provider can adopt. Legacy Layer 2 access technologies are beyond the scope of this book, the emphasis being very much on MPLS VPN.

Proactive Monitoring in Detail

The "Proactive Monitoring" section earlier in this chapter discussed the scope options a service provider has when monitoring the VPN for faults. Fundamentally, the techniques employed are either "off-box" (NMS-based) or "on-box" (probe-based). This section discusses these options in more detail and outlines the differences between them and the relative advantages and disadvantages of each. These approaches are illustrated in Figures 8-12 and 8-13.

Figure 8-12. Off-Box Reachability Testing


Figure 8-13. On-Box Reachability Testing


Active monitoring using intelligent probes is the ideal approach because it pushes the responsibility down into the network, limiting the amount of external configuration and traffic on the data communications network (DCN). However, there are several reasons why this might not always be chosen:

  • Such probes are usually proprietary in nature and may not work well in a multivendor network. For example, if the network consists of multivendor PE devices, it may be that probes can be used with equipment from only one vendor, and this may complicate the management implementation.

  • Due to the scale of some VPNs, the probes may consume too many resources on the routers.

  • Probes must be maintained as the network grows.

  • Probes might be unavailable on the software releases deployed on the routers.

In fact, often the best approach for the service provider is to employ a mixture of off-box and on-box testing to gain the required coverage.

The next section explores what on/off-box tools are available to the service provider.

VPN Layer

At this layer, the service provider monitors VPN reachability. As discussed earlier, the three main test path options are PE-PE, PE-core-PE-CE, and CE-CE. Each is basically a trade-off between accuracy and scalability, although logistics issues (such as CE access) might restrict what can be done.

PE-PE testing can be done in one of two ways. Either the service provider can build a dedicated "test VPN" or it can test within the customer VPN itself. These approaches are shown in Figures 8-14 and 8-15.

Figure 8-14. PE-PE Monitoring Using a Test VPN


Figure 8-15. PE-PE Monitoring Within Customer VPNs


In the "test VPN," each PE tests reachability to all other PEs within the same route reflector domain, even though they may not have an explicit customer VPN relationship. This is still of the "n squared" magnitude discussed earlier. Service providers that use this technique therefore are more concerned that the PEs have basic VPN layer connectivity with one another.

In "customer VRF" testing, the service provider verifies specific customer VRF paths. This is more accurate because it closely mimics the customer traffic path, but it has scale implications. Because of the potential number of VRFs involved, the service provider is faced with the decision of which VRFs to monitor. This is usually influenced by specific customer SLAs. Figure 8-15 shows PE-PE testing, but the concept is similar to (and more accurate for) the other paths.

Off-Box Testing

In terms of the actual available instrumentation, off-box testing typically uses a "VRF-aware ping" that is supported by many vendors. This is essentially an Internet Control Message Protocol (ICMP) ping within the context of a VRF, using the VRF at the ingress/egress PEs and MPLS in the core to route the packet toward and back from the destination. If this ping succeeds, it provides strong validation that the PE-PE path is healthy. That is, the double label stack imposed as a result of using a VPN prefix is the correct one, the underlying MPLS transport is healthy, and the VRF route lookups and forwarding are correct. Here's an example of the VRF-aware ping:

cl-7206vxr-4#ping vrf red_vpn 8.1.1.2
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 8.1.1.2, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 92/96/108 ms
cl-7206vxr-4#


On-Box Testing

This technique uses one of the proprietary probe tools available from vendors. A common deployment model for PE-PE testing is to use a separate, dedicated router attached to a PE to host the probes. This is commonly called a "shadow router." It has a number of advantages:

  • In an unmanaged service, it allows CE emulation.

  • If existing PEs are overloaded, it avoids placing an additional burden on them.

  • If existing PEs are low on memory, it avoids further reduction.

  • If the probes and/or the associated engine need to be updated, this can be performed without disturbing the existing network.

This scheme has two variations. The "shadow CE" is when the probe router is not VRF-aware and simply emulates one or more CEs. A problem with this approach is that it cannot deal with overlapping IP addresses that may be advertised by remote sites within different VPNs. An example using the Cisco IP SLA technology is shown in Figure 8-16.

Figure 8-16. Shadow CE Scheme


In the "shadow PE" model, the probe router is VRF-aware and is effectively a peer of the other PE routers. This solves the overlapping IP address problem, as shown in Figure 8-17.

Figure 8-17. Shadow PE Scheme


CE-CE monitoring seemingly is the optimal solution. Indeed, it's often used even in unmanaged services where the enterprise may grant access to the CE for probe deployment. The main problem with this approach is one of scale: As the VPN grows, so do the number of remote sites that have to be tested, and hence the number of probes. Given that CEs are typically low-end devices, this can present a performance problem. Another issue is the maintenance overhead on the management systems in terms of keeping up with adds/moves/changes and also tracking and correlating events from the CEs. In practice, the service provider should work with the equipment vendors to establish the performance profile for probing. Such a profile should include factors such as QoS and SLA metrics required. This can then be used to negotiate with each customer which sites will be monitored end to end. An example might be that customers get monitoring from certain key sites to the hubs or data centers.

Example 8-4 is a sample configuration taken from the Cisco IP SLA technology in the Cisco IOS. It shows a basic VRF-aware probe configuration.

Example 8-4. VRF-Aware Probe Creation

Router(config)#rtr 3
Router(config-rtr)#type udpEcho dest-ipaddr 172.16.1.1 dest-port 1213
Router(config-rtr)#vrf vpn1
Router(config)#rtr schedule 3 start now

Line 1 creates the probe.

Line 2 specifies that a User Datagram Protocol (UDP) echo test is required toward the destination address and port.

Line 3 specifies the VRF within which to execute the operation.

Line 4 instructs the probe to start sending immediately with the default frequency.

Note

IP SLA was originally called the Response Time Reporter (RTR). This is reflected in the CLI of older IOS versions, such as rtr 3 in Example 8-4.


If such a probe were to detect a connectivity problem followed by restoration, the traps would look similar to Example 8-5.

Example 8-5. IP SLA Connection Lost/Restored Traps

rttMonConnectionChangeNotification notification received from: 10.49.157.202 at 13/03/2003 15:32:14
  Time stamp: 0 days 08h:21m:01s.92th
  Agent address: 10.49.157.202 Port: 56806 Transport: IP/UDP Protocol: SNMPv2c Notification
  Manager address: 10.49.157.206 Port: 162 Transport: IP/UDP
  Community: (zero-length)
  Bindings (5)
    Binding #1: sysUpTime.0 *** (timeticks) 0 days 08h:21m:01s.92th
    Binding #2: snmpTrapOID.0 *** (oid) rttMonConnectionChangeNotification
    Binding #3: rttMonCtrlAdminTag.1 *** (octets) (zero-length) [ (hex)]
    Binding #4: rttMonHistoryCollectionAddress.1.0.0.0 *** (octets) AC.10.01.01 (hex)
    Binding #5: rttMonCtrlOperConnectionLostOccurred.1.0.0.0 *** (int32) true(1)

Note

This output can be obtained by using the show logging command with debug snmp switched on.


In Example 8-5, Binding #5 (rttMonCtrlOperConnectionLostOccurred set to true(1)) is the var bind that indicates the connection has been lost. The corresponding trap for the connection being restored looks like this:

rttMonConnectionChangeNotification notification received from: 10.49.157.202 at 13/03/2003 15:41:29
  Time stamp: 0 days 08h:30m:17s.01th
  Agent address: 10.49.157.202 Port: 56806 Transport: IP/UDP Protocol: SNMPv2c Notification
  Manager address: 10.49.157.206 Port: 162 Transport: IP/UDP
  Community: (zero-length)
  Bindings (5)
    Binding #1: sysUpTime.0 *** (timeticks) 0 days 08h:30m:17s.01th
    Binding #2: snmpTrapOID.0 *** (oid) rttMonConnectionChangeNotification
    Binding #3: rttMonCtrlAdminTag.1 *** (octets) (zero-length) [ (hex)]
    Binding #4: rttMonHistoryCollectionAddress.1.0.0.0 *** (octets) AC.10.01.01 (hex)
    Binding #5: rttMonCtrlOperConnectionLostOccurred.1.0.0.0 *** (int32) false(2)



In this second trap, Binding #5 (rttMonCtrlOperConnectionLostOccurred set to false(2)) is the var bind that indicates the connection has been restored.

MPLS Layer

The recent growth of MPLS as a transport technology has brought with it demand from network managers for OAM capabilities. Specifically, operators are asking for OAM features akin to those found in circuit-oriented technologies such as ATM and Frame Relay. Additionally, many service providers are used to basing services on highly resilient and fault-tolerant time-division multiplexing (TDM) networks. This has led to requirements being placed on MPLS to acquire built-in protocol operations to rapidly detect and respond to failure conditions.

The common requirement for such tools is an ability to test and troubleshoot the data plane. This is because the data and control planes may sometimes lose synchronization.

A key message for the service provider here is that it is not enough to simply monitor the control plane of an MPLS VPN. For example, even though routes may be installed in global and VRF tables, there is absolutely no guarantee that the data plane is operational. Only by testing the data plane does the service provider have confidence that customer traffic will be transported correctly across its network.

Various standards have been proposed to embellish MPLS with such technology. The next section concentrates on the main ones that are available or will be shortly.

LSP Ping/Traceroute

This tool set is currently an IETF draft, defined at http://www.ietf.org/internet-drafts/draft-ietf-mpls-lsp-ping-03.txt. The intent of these tools is to provide a mechanism to let operators and management systems test LSPs and help isolate problems. Conceptually, these tools mirror the same principles of traditional ICMP ping/trace: an echo request is sent down a specific path, and the receiver sends an echo reply. However, several differences are essential to ensuring the health of individual LSPs:

  • The echo request packet uses the same label stack as the LSP being tested.

  • The IP address destination of the echo request is a 127/8 address, and the packet's Type-Length-Value (TLV) content carries the Forwarding Equivalence Class (FEC) information. This is important, because if a broken LSP is encountered, a transit router punts the packet for local processing when it detects the 127/8 IP address, resulting in the packet's being dropped. In other words, IP doesn't forward the packet toward the destination if the LSP is broken.

  • If the packet reaches an egress router, a check is performed to see whether this is the correct router for the FEC being tested. In the Cisco implementation, the 127/8 address forces the packet to be processed by the route processor at the egress LSR.

Figure 8-18 illustrates the LSP ping.

Figure 8-18. LSP Ping

MPLS Embedded Management LSP Ping/Traceroute and AToM VCCV, http://www.cisco.com/en/US/products/sw/iosswrel/ps1829/products_feature_guide09186a00801eb054.html


If you initiate an MPLS LSP ping request at LSR1 to a prefix at LSR6, the following sequence occurs:

Step 1.

LSR1 initiates an MPLS LSP ping request for an FEC at the target router LSR6 and sends an MPLS echo request to LSR2.

Step 2.

LSR2 receives the MPLS echo request packet and forwards it through transit routers LSR3 and LSR4 to the penultimate router, LSR5.

Step 3.

LSR5 receives the MPLS echo request, pops the MPLS label, and forwards the packet to LSR6 as an IP packet.

Step 4.

LSR6 receives the IP packet, processes the MPLS echo request, and then sends an MPLS echo reply to LSR1 through an alternative route.

Step 5.

LSR7 through LSR10 receive the MPLS echo reply and forward it back toward LSR1, the originating router.

Step 6.

LSR1 receives the MPLS echo reply in response to its MPLS echo request.

The following is an example of using LSP ping via the Cisco CLI. In this case, the LSP is broken, as revealed by the "R" return code (meaning that the last LSR to reply is not the target egress router):

cl-12016-1#ping mpls ipv4 6.6.7.6/32
Sending 5, 100-byte MPLS Echos to 6.6.7.6/32,
      timeout is 2 seconds, send interval is 0 msec:
Codes: '!' - success, 'Q' - request not transmitted,
       '.' - timeout, 'U' - unreachable,
       'R' - downstream router but not target
Type escape sequence to abort.
RRRRR
Success rate is 0 percent (0/5)



A successful LSP ping looks something like this:

cl-12008-1#ping mpls ipv4 6.6.7.6/32
Sending 5, 100-byte MPLS Echos to 6.6.7.6/32,
      timeout is 2 seconds, send interval is 0 msec:
Codes: '!' - success, 'Q' - request not transmitted,
       '.' - timeout, 'U' - unreachable,
       'R' - downstream router but not target,
       'M' - malformed request
Type escape sequence to abort.
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/2/4 ms



LSP traceroute provides hop-by-hop fault localization and uses TTL settings to force expiration of the TTL along an LSP. LSP traceroute incrementally increases the TTL value in its MPLS echo requests (TTL = 1, 2, 3, 4, and so on) to discover the downstream mapping of each successive hop. The success of the LSP traceroute depends on the transit router processing the MPLS echo request when it receives a labeled packet with TTL = 1. On Cisco routers, when the TTL expires, the packet is sent to the route processor (RP) for processing. The transit router returns an MPLS echo reply containing information about the transit hop in response to the TTL-expired MPLS packet.

The echo request and echo reply are UDP packets with source and destination ports set to 3503.

Figure 8-19 shows an MPLS LSP traceroute example with an LSP from LSR1 to LSR4.

Figure 8-19. LSP Traceroute



If you enter an LSP traceroute to an FEC at LSR4 from LSR1, the steps and actions shown in Table 8-1 occur.

Table 8-1. LSP Traceroute Operation

Packet: MPLS echo request with a target FEC pointing to LSR4 and a downstream mapping
Action: Sets the TTL of the label stack to 1. Sends the request to LSR2.

Packet: MPLS echo reply
Action: Receives the packet with TTL = 1. Processes the UDP packet as an MPLS echo request. Finds a downstream mapping and replies to LSR1 with its own downstream mapping, based on the incoming label.

Packet: MPLS echo request with the same target FEC and the downstream mapping received in the echo reply from LSR2
Action: Sets the TTL of the label stack to 2. Sends the request to LSR2.

Packet: MPLS echo request
Action: Receives the packet with TTL = 2. Decrements the TTL. Forwards the echo request to LSR3.

Packet: MPLS echo reply
Action: Receives the packet with TTL = 1. Processes the UDP packet as an MPLS echo request. Finds a downstream mapping and replies to LSR1 with its own downstream mapping, based on the incoming label.

Packet: MPLS echo request with the same target FEC and the downstream mapping received in the echo reply from LSR3
Action: Sets the TTL of the label stack to 3. Sends the request to LSR2.

Packet: MPLS echo request
Action: Receives the packet with TTL = 3. Decrements the TTL. Forwards the echo request to LSR3.

Packet: MPLS echo request
Action: Receives the packet with TTL = 2. Decrements the TTL. Forwards the echo request to LSR4.

Packet: MPLS echo reply
Action: Receives the packet with TTL = 1. Processes the UDP packet as an MPLS echo request. Finds a downstream mapping and also finds that the router is the egress router for the target FEC. Replies to LSR1.


Here's a CLI example of a broken path:

cl-12016-1#traceroute mpls ipv4 6.6.7.4/32 ttl 10
Tracing MPLS Label Switched Path to 6.6.7.4/32, timeout is 2 seconds
Codes: '!' - success, 'Q' - request not transmitted,
       '.' - timeout, 'U' - unreachable,
       'R' - downstream router but not target
Type escape sequence to abort.
  0 6.6.1.1 MRU 1200 [Labels: 24 Exp: 0]
R 1 6.6.1.5 MRU 4474 [No Label] 1 ms
R 2 6.6.1.6 3 ms
R 3 6.6.1.6 4 ms
R 4 6.6.1.6 1 ms
R 5 6.6.1.6 2 ms
R 6 6.6.1.6 3 ms
R 7 6.6.1.6 4 ms
R 8 6.6.1.6 1 ms
R 9 6.6.1.6 3 ms
R 10 6.6.1.6 4 ms



In this case, the break occurs because the LSR at 6.6.1.5 forwards the packet with no label ([No Label] in the trace output), so the LSP is broken at that point.

By way of comparison, a successful traceroute looks something like this:

cl-12008-1#traceroute mpls ipv4 6.6.7.4/32
Tracing MPLS Label Switched Path to 6.6.7.4/32, timeout is 2 seconds
Codes: '!' - success, 'Q' - request not transmitted,
       '.' - timeout, 'U' - unreachable,
       'R' - downstream router but not target,
       'M' - malformed request
Type escape sequence to abort.
  0 6.6.1.25 MRU 1709 [implicit-null]
! 1 6.6.1.26 4 ms


The main difference here is that the successful traceroute ends with a ! as per regular IP ping.

As will be discussed shortly, these tools can be used as essential building blocks in a service provider's MPLS VPN troubleshooting strategy.

Proactive Monitoring of PE-PE LSPs

Although the LSP ping/trace tools provide invaluable troubleshooting capability, they are not designed for monitoring. Instead, they are of more use to an operator who wants to troubleshoot a reported problem or verify the health of some paths after network changes. Vendors such as Cisco are developing probe-based techniques that use the LSP ping/trace mechanism, but in such a manner as to allow monitoring of a full-mesh PE-PE network.

The key to scalability of such probes is to test only those paths that are relevant to service delivery. In the context of an MPLS VPN, this means that from a given ingress PE, only LSPs that are used to carry VPN traffic are tested. This is important, because traffic in different VPNs that is destined for the same egress PE is essentially multiplexed onto the same transport LSP. This concept is shown in Figure 8-20.

Figure 8-20. Intelligent LSP Probing


This figure shows that Blue VPN Site 1 and Red VPN Site 3 share the same transport LSP to reach Blue VPN Site 3 and Red VPN Site 1.
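A hedged illustration of this idea using the Cisco CLI: the ingress PE needs to test only those transport LSPs toward egress PEs that actually appear as BGP next hops for its VRF routes. The VRF name and addresses are hypothetical.

! Which egress PEs (BGP next hops) carry this VRF's traffic?
show ip bgp vpnv4 vrf red_vpn
! Suppose 10.0.0.6 is the only next hop listed; test just that transport LSP
ping mpls ipv4 10.0.0.6/32

A probe-based implementation automates this discovery and repeats the test at a configurable frequency.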

With such a mechanism in place, the service provider has the means to monitor LSPs at a high rate. Failure of an LSP results in an SNMP notification being sent to the NMS, whereupon service impact, correlation, and troubleshooting can begin.

One point to stress in such testing is the presence of equal-cost multipath (ECMP). Very often, ECMP is in use within a service provider core, meaning that multiple LSPs may be available to carry traffic from the ingress to the egress PE. Service providers therefore should ask vendors how both the reactive CLI and any probe-based tools test the available ECMP paths. Equally important is that if a failure occurs, the notification sent to the NMS clearly identifies which path failed.

Performance Problems

This is perhaps the most difficult problem category for an operator to troubleshoot. Not only do these problems tend to be transient in nature, but they also are inherently more complex due to the many different network segments and features involved.

A typical example of a problem in this area is an enterprise customer reporting poor performance on one of his or her services, such as VoIP. Packet loss, delay, and jitter all adversely affect the quality of this service. How does the operator tackle such problems?

The first task is to identify the customer traffic at the VRF interface on the PE. Sampling tools such as Cisco NetFlow are extremely useful here. They identify flows of traffic (for example, based on source/destination IP addresses and ports) and cache statistics related to these flows, which can then be exported offline. An important piece of data is the ToS marking (IP Precedence or DSCP). If this can be done, the operator can answer one of the first questions: Is the customer's data being marked correctly?
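The following is a minimal sketch of enabling such sampling on the VRF interface of a Cisco PE using traditional NetFlow; the interface, collector address, and port are illustrative, and older releases use ip route-cache flow rather than ip flow ingress.

! Collect flow records (including the ToS/DSCP marking) on the customer-facing interface
interface FastEthernet1/0.101
 ip flow ingress
!
! Export the cached flows to an offline collector for analysis
ip flow-export version 5
ip flow-export destination 192.0.2.20 9996
!
! Quick on-box inspection of the flow cache
! show ip cache flow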

Next, the QoS policies on the VRF interface can be analyzed to determine whether any of the customer traffic is being dropped. The following example is from the Cisco modular QoS CLI (MQC) show policy-map interface command. It shows that 16 packets from the class (that is, traffic matching specific properties, in this case the target traffic) have been dropped due to the policer's actions:

Service-policy output: ce_6cos_out_A_40M_21344K (1159)
    Class-map: ce_mgmt_bun_output (match-any) (1160/7)
      314997339 packets, 161278311131 bytes
      5 minute offered rate 952503000 bps, drop rate 943710000 bps
      Match: access-group 199 (1161)
        1 packets, 608 bytes
        5 minute rate 0 bps
      Match: access-group 198 (1162)
        0 packets, 0 bytes
        5 minute rate 0 bps
      Match: ip precedence 0  (1163)
        314997338 packets, 161278310523 bytes
        5 minute rate 952503000 bps
      bandwidth: 1955 kbps (EIR Weight 0%)
      police:
        8000 bps, 8000 limit, 8000 extended limit
        conformed 2580 packets, 1319729 bytes; rate 7000 bps; action: set-dscp-transmit 48
        exceeded 16 packets, 7968 bytes; rate 0 bps; action: drop


If traffic is being dropped, this might indicate that the customer is exceeding the allocated bandwidth (or bursting above agreed-on values), which may explain the problem.

If the ingress PE classification and QoS policy actions seem correct, the next stage is to analyze the remaining path across the service provider network. The first task is to find out exactly what that path is. From PE-PE, this can be obtained using the LSP traceroute feature described earlier. Then, for each interface in the path, the effects of any QoS policies are examined to ensure that no unexpected drops are occurring. In an MPLS core, this typically involves looking at any QoS congestion avoidance or management mechanisms. For example, traffic-engineered tunnels are becoming increasingly popular for guaranteeing bandwidth across the core network.

It may not always be possible to inspect the live customer traffic and associated network behavior. In these cases, synthetic probes are extremely useful because they allow an operator to probe the problem path for a certain time period. This reveals any major problems with the service provider network, such as wrong QoS classification/action, congestion, or bottlenecks. If these tests show no problems, the issue is almost certainly with the customer traffic (incorrect marking or bandwidth being exceeded).

Fault Management

The previous sections have illustrated techniques for both the data and control planes of an MPLS VPN. This is of little value, however, if the service provider does not have adequate fault management systems and processes in place. If the service provider detects faults, they need to be acted on. More significantly, however, the enterprise customer often detects problems before the service provider does. The ability to test connectivity at higher frequencies is one factor in this. However, individual end users also report problems, mainly of a performance nature, to their own IT departments. Many enterprises can rule out their own networks and systems as the root cause and blame the service provider. At this point, a call is placed to the service provider's first-line support.

This section helps the service provider ensure that it has the correct reactive and proactive fault systems in place to deal with both scenarios.

Proactive Fault Management

Assume that a fault has been detected by the service provider's own monitoring system, within a VRF, from the PE across the MPLS core to the remote CE.

Traditionally, such a fault would be picked up by network operations in the network operations center (NOC), and the problem's severity would be assessed. Given that we are talking about a potential outage scenario, how should operations proceed in isolating, diagnosing, and repairing the fault?

This is in fact a multilayered problem that requires a systematic approach to troubleshooting.

The first area that the service provider should look at is recent events that have been collected from the network. Problems within the network may have resulted in events being generated that immediately explain why there is an outage. Examples include link up/down, contact loss with a router, and specific protocol issues such as LDP or BGP session losses. What is really needed is a means to correlate the data plane outage with the other events that have been collected from the network. Several systems on the market perform this task and immediately point to a possible root cause.

Furthermore, assume that there are no obvious reasons why the connection has failed. What is required is a means to identify what layer has the problem: VPN, IP, or MPLS. One approach is to rule out each one by performing data plane tests. If both the IP and MPLS data planes are correct, there is a problem with either VPN route availability or the VPN switching path, as shown in Figure 8-21.

Figure 8-21. Troubleshooting a VPN Outage


After the initial diagnosis has been made (control/data plane, VRF/MPLS/IP, and so on), further troubleshooting is required to isolate and then fully diagnose the problem.

Isolation within the data plane typically uses one of the traceroute tools discussed previously. You must be careful here, however, especially in heterogeneous networks. This is because not all boxes may support the OAM standards, and even if they do, some interoperability issues may exist. False negatives are the most obvious condition that may arise, such as one vendor's box failing to reply to an echo packet even though there is no problem. The service provider therefore should ask each vendor which draft version of the standard is supported and then conduct lab testing to identify any issues.

Another point worth stressing with the OAM isolation tools is that the last box returned in a trace output may not always be the failure point. For example, the fault may in fact lie with the next downstream router, but it cannot reply for some reason. Some intelligence is needed with this part of the process to rule out the last node in the trace output before inspecting (ideally automatically) those downstream.

Root cause identification requires inspecting the router's configuration, status, and forwarding engines. Although many issues are caused by misconfigurations and deterministic errors, defects within hardware or software are the most difficult to find and often are the most costly. Manual or automatic troubleshooting via management systems therefore should concentrate on ruling out obvious causes before looking for symptoms caused by defects in the data/control planes. A good illustration of the latter is label inconsistencies. For example, suppose an egress PE router receives a new route from a CE. BGP then allocates a label, installs the route and the label into the local forwarding tables, and propagates it to relevant ingress PE routers. Various things can go wrong here, such as allocation failures, local/remote installation problems, and propagation issues. These issues using Cisco CLI are shown in Figure 8-22.

Figure 8-22. MPLS VPN Label Problem Areas
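
As a complement to the figure, the following hedged sketch shows the kind of Cisco CLI checks used to spot such label inconsistencies; the VRF name and prefix are hypothetical.

! On the egress PE: did BGP allocate a VPN label for the customer prefix,
! and is it installed in the forwarding tables?
show ip bgp vpnv4 vrf red_vpn labels
show mpls forwarding-table
!
! On the ingress PE: was the VPN route (and its remote label) received
! and programmed into the VRF forwarding path?
show ip bgp vpnv4 vrf red_vpn 10.1.1.0
show ip cef vrf red_vpn 10.1.1.0 detail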


Problems within the control plane require inspection of individual routers. For example, if the ingress PE router loses a route (or never receives it in the first place), the egress router from which that route should be learned should be inspected first (to see whether the route was received, whether it was then propagated, and so on). LDP problems tend to manifest themselves as MPLS data plane problems and hence can be found using the same isolation/inspection technique previously described.

Hopefully, the picture being painted here is that MPLS VPN troubleshooting is not straightforward. The real message, however, is that the service provider should ask some serious questions of the vendors in terms of what management support they provide to help automate the troubleshooting process. At the minimum, vendors should supply element management systems that check the health of the MPLS and VPN data/control planes on a per-box basis. More useful are network-level systems that set up proactive monitoring and respond automatically to faults when detected. Such systems should employ the systematic techniques described in this section, as well as offer integration points into other Operations Support System (OSS) components.

Case Study: Troubleshooting a Problem with the Acme, Inc. VPN

It is useful to go through an example of troubleshooting an MPLS VPN connectivity problem within the Acme, Inc. VPN. This helps describe how the different tools fit together and how a service provider should prepare in the event that a fault is reported or detected.

In this scenario, assume that Acme, Inc. reports loss of connectivity between sites Glasgow and London, as shown in Figure 8-23.

Figure 8-23. Connectivity Problem Between Glasgow and London


The problem is reported into first-line support, at which point an attempt is made to reproduce it within the service provider network. In an unmanaged service, this could be done by testing from the local PEs to the source and destination sites, as shown in Figure 8-24.

Figure 8-24. Testing VPN Connectivity in the Service Provider Network


These tests fail in both directions. At this point, first-line support may choose to open a trouble ticket and escalate the problem. (This is very much dependent on the service provider support model.)

Assume that second-line support is now alerted. The EMS could be used to further narrow down the problem. It can do this by testing whether IP and LSP connectivity are valid in the core, as shown in Figure 8-25.

Figure 8-25. Testing IP and MPLS Paths


These tests reveal that basic IP connectivity is healthy but the transport LSP is broken.

The EMS now issues an LSP traceroute, as shown in Figure 8-26.

Figure 8-26. Using LSP Trace to Help Isolate the Problem


In this example, the traceroute gets as far as router P2. This router is now examined from both a control and data-plane perspective by the EMS, as illustrated in Figure 8-27.

Figure 8-27. Diagnosing LSRs in a Broken Path


Notice that in this case, the actual failure point was not the last node returned from the trace output. Because of this, the next hop (LDP downstream neighbor) had to be calculated and inspected. This illustrates the value of having an automated management solution to help with troubleshooting.

Reactive Fault Management

Many of the same principles already discussed apply to reactive fault management. However, some subtle distinctions need to be observed.

First, in the scenario where a customer calls in to first-line support to report a problem, the service provider is looking for a management system that will help this part of the business, as well as link into the troubleshooting tools already discussed. An important aspect of handling customer-reported faults is being able to quickly reproduce the problem, thereby verifying whether the service provider or the enterprise is the likely source. A second requirement is to deal with the varying skill sets of support personnel. Some support staff have good network knowledge; however, others have little or no understanding of an MPLS VPN. An ideal system is one that allows an inexperienced operator to simply enter the details of the customer and the source/destination of the problem. The tool then retrieves all necessary VPN and network-related data "behind the scenes" and attempts to verify the problem. If the problem exists, a report is generated, highlighting the nature of the problem and any other contextual data. This allows the operator to raise an appropriate trouble ticket and alert the necessary network engineers for further analysis.

Service providers might choose to build (or have built) the "front-office" applications themselves. However, at some point in the troubleshooting process, detailed knowledge of vendor equipment is required if the whole process is to be simplified to reduce opex and maximize service uptime. Service providers therefore should discuss with their vendors the availability of MPLS VPN NMS/EMS systems and how they integrate into the OSS.

SLA Monitoring

Providing an excellent service is the primary goal of any service provider. Fundamental to this is adhering to the SLAs agreed to with end customers. But how do service providers know they are satisfying such SLAs, especially given the more stringent and complex ones associated with IP services delivered across an MPLS VPN?

The key to successful monitoring in an MPLS VPN is to ensure that the technology exists within vendor equipment to proactively monitor the network from a performance perspective. Next comes the OSS to allow configuration, detection, reporting, and troubleshooting related to such monitoring.

As has been stressed, the technology usually exists in the form of synthetic probing, because this provides the most accurate data. The following requirements and questions should be considered if such probes are to be useful in monitoring modern SLAs:

  • How do the probes achieve their accuracy? How do they account for other activities that routers may be performing?

  • There is an obvious overlap between probing for performance and probing for faults. If a path is broken, the SLA may be compromised. As will be discussed, the service provider might want to combine both strategies. But to do so, the probes used for SLA monitoring must also support basic detection and notification of unavailable paths.

  • SLA monitoring requires probes that support QoS.

  • Specialized voice-quality support (codecs and quality measurements) should be provided.

  • Metrics such as packet loss, delay, jitter, and out-of-sequence packets should be supported, ideally within the context of a VRF.

  • Thresholds should be supported that result in notifications being sent to proactively inform when certain parameters have been breached, for example, 500-ms jitter.

Accuracy

The service provider should ask the vendor serious questions about the accuracy of probe data. If there is any dispute over the SLA, it is crucial that the service provider rely on the data provided by the probes to prove or disprove any performance claims. Here are some specific questions that should be asked:

  • What testing does the vendor perform, and on what platforms? For example, does the vendor test probes against known delays in the Unit Under Test (UUT) that can be independently verified?

  • If the probes will be deployed on production routers carrying customer traffic (nonshadow model), are they suitably loaded with control/data traffic as per a live network?

  • If basic ICMP ping figures were used as a performance metric, one of the major problems would be that ICMP traffic is generally treated as low-priority by the routers in the path. This naturally leads to inaccurate results, because the reply to test packets might be sitting in a low-priority queue, waiting to be processed. The service provider should therefore ask how such delays are accounted for in the vendor's probe measurements.

Probe Metric Support

To minimize overhead, the service provider should ideally have a probe technology that supports the basic metrics of delay, jitter, packet loss, and availability from within a single operation. This can then be combined with one of the core monitoring strategies outlined earlier to provide minimal SLA monitoring. For example, the Cisco IP SLA jitter probe supports these combined metrics. The following example shows how to configure such a probe that sends five packets every 20 ms at a frequency of once every 10 seconds with a packet size of 60 bytes:

(config)#rtr 1
(config-rtr)#type jitter dest-ip 10.51.20.105 dest-port 99 num-packets 5 interval 20
(config-rtr)#frequency 10
(config-rtr)#request-data-size 60



The metrics collected by this probe would then be obtained from the show rtr operational-state command:

red-vpn#sh rtr op 1
        Current Operational State
Entry Number: 1
Modification Time: 08:22:34.000 PDT Thu Aug 22 2002
Diagnostics Text:
Last Time this Entry was Reset: Never
Number of Octets in use by this Entry: 1594
Number of Operations Attempted: 1
Current Seconds Left in Life: 574
Operational State of Entry: active
Latest Operation Start Time: 08:22:34.000 PDT Thu Aug 22 2002
Latest Oper Sense: ok
RTT Values:
NumOfRTT: 997   RTTSum: 458111   RTTSum2: 238135973
Packet Loss Values:
PacketLossSD: 3   PacketLossDS: 0
PacketOutOfSequence: 0   PacketMIA: 0   PacketLateArrival: 0
InternalError: 0   Busies: 0
Jitter Values:
MinOfPositivesSD: 1   MaxOfPositivesSD: 249
NumOfPositivesSD: 197   SumOfPositivesSD: 8792   Sum2PositivesSD: 794884
MinOfNegativesSD: 1   MaxOfNegativesSD: 158
NumOfNegativesSD: 761   SumOfNegativesSD: 8811   Sum2NegativesSD: 139299
MinOfPositivesDS: 1   MaxOfPositivesDS: 273
NumOfPositivesDS: 317   SumOfPositivesDS: 7544   Sum2PositivesDS: 581458
<snip>



Here are some important points to note from this output:

  • Three packets have been lost from approximately 1000 sent.

  • The average round-trip time was 458,111/997 = 459 ms.

  • Source-to-destination jitter fields are postfixed with "SD," such as NumOfPositivesSD.

  • Destination-to-source jitter fields are postfixed with "DS," such as NumOfPositivesDS.

QoS Support

Enterprise customers typically require QoS markings to be preserved across the service provider network. This means that when the packets arrive at their destination, they should have the same QoS classification as when they entered. However, it is not practical or scalable for the service provider to create unique QoS classes for each customer. A common approach is for the service provider to offer a standard set of QoS classes onto which customer traffic is mapped, such as voice, business-class, and best-effort.

It then becomes essential that the service provider monitor how the network handles these classes. To do that, the probe technology needs to be able to set QoS in the packets. The following example shows how QoS is marked within the Cisco IP SLA technology. In this case, the ToS bits in the IP header are used (there is a standard mapping between ToS and DSCP); the jitter probe is configured with a DSCP value of 101110 (ToS equivalent 0xB8), which is the recommended marking for voice traffic:

Router(config)#rtr 1
Router(config-rtr)#type jitter dest-ipaddr 10.52.130.68 dest-port 16384 num-packets 1000 interval 20
Router(config-rtr)#tos 0xB8
Router(config-rtr)#frequency 60
Router(config-rtr)#request-data-size 200
Router(config)#rtr schedule 1 life forever start-time now



Specialized Voice Probes

Delay, jitter, and packet loss are the primary impairment factors with voice quality. Although it is essential that the service provider have access to detailed metrics around these properties, it can be complex to translate them into an instant assessment of voice quality. The service provider therefore should look to the vendor to provide voice quality "scores" as a guide. Two common examples are Mean Opinion Score (MOS) and International Calculated Planning Impairment Factor (ICPIF).

Additionally, it is important that common codecs be supported. Specifying the codec would ideally select the appropriate probe and packet formats required to test the network more accurately. Examples include G.711 mu-Law (g711ulaw), G.711 A-Law (g711alaw), and G.729A (g729a).

The following example shows how Cisco SAA (the Cisco Service Assurance Agent, since renamed IP SLA) has extended its jitter probe to support such features. This example simulates a G.711 codec with 1000 packets, an interval of 20 ms, and a frequency of 1 minute:

Router(config)#rtr 2
Router(config-rtr)#type jitter dest-ipaddr 10.52.132.71 dest-port 16001 codec g711alaw
Router(config-rtr)#tos 0xB8
Router(config)#rtr schedule 2 life forever start-time now



Threshold Breach Notification

Expanding on the proactive fault management theme discussed earlier, probes ideally should support informing management stations when certain SLA conditions might be breached.

A further requirement is the concept of low and high watermarks, as shown in Figure 8-28.

Figure 8-28. High and Low Watermark Thresholds


This minimizes notifications caused by transient network behavior: an event is sent only when the high watermark is breached, and again only after the value has fallen back below the low watermark.

Again, the service provider should look for these features in the probe technology because they allow proactive management of network performance to be put in place, hopefully alleviating any problems before the end customer detects them.

The following example shows how to configure high and low thresholds for a Cisco SAA jitter probe, with immediate generation of a trap:

(config)#rtr 1
(config-rtr)#threshold 100
(config)#rtr reaction-configuration 1 threshold-type immediate action-type trapOnly threshold-falling 50



Reporting

A useful tool to help the service provider support customer VPN reports is VPN-aware SNMP. If the vendor supports such a feature, it can help the service provider offer each customer secure access to management data for only that customer's VPNs.

This feature works by allowing SNMP requests on any configured VRF and returning responses to the same VRF. A trap host can also be associated with a specific VRF. This allows the service provider to restrict the view that a given user on a given SNMP server has. When polling a device through a VRF for a given MIB, the user has restricted access/view to specific tables of the MIBs. This concept is shown in Figure 8-29.

Figure 8-29. VPN-Aware SNMP


A service provider could also use this technique to offer partial information views to a peering service provider or third party in charge of measuring performance and service uptime for SLA verification purposes. Also, the protocol's VRF awareness allows for a management VRF to be used to communicate with a NOC.
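The following is a hedged sketch of how a trap host and community might be tied to a VRF on a Cisco PE; the VRF name, NMS address, and community string are illustrative, and support for SNMP over VPNs varies by software release.

! Deliver traps to an NMS reachable through the customer's (or a management) VRF
snmp-server host 10.10.1.5 vrf acme_vpn version 2c acme-ro
!
! Read-only community used by that customer's management station
snmp-server community acme-ro RO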



