Assuming that the decision has been made (based on the business, technical, and operational considerations outlined in previous chapters) to migrate to a Layer 3 VPN, planning is required to implement that migration successfully. This section examines writing Request for Proposal (RFP) documents, the issues to discuss with the selected provider(s) and how these relate to service-level agreements (SLAs), training operators for the new network, and recommendations for project-managing the migration.

Writing the RFP

After the decision to implement a Layer 3 VPN has been made, the provider currently delivering WAN connectivity should be assessed to determine whether it has the service model, geographic coverage, SLA commitment, support infrastructure, and track record necessary to deliver the new network. In many cases, having an existing relationship with a provider may count against that provider: there will undoubtedly have been service issues in the past, and other providers carry no such baggage. In other respects, the attitude of "better the devil you know than the one you don't" prevails. Neither view is completely objective, so actually writing down the requirements and evaluating all potential providers against them equally is the best course of action. The first job is to select a short list of potential providers of WAN connectivity in which you will invest time, send them an RFP, and fully evaluate their responses. Send RFPs to enough providers to get a good feel for what is competitive in the market at the time; the right number varies with requirements, but between 5 and 15 is reasonable. The first thing to consider is what type of information to put into the RFP. You can base your RFP on this comprehensive table of contents:
This table of contents is only a guideline; some elements may not apply to all networks. However, it is a good starting point for identifying the things that will be important to your network. For each of the technical issues described earlier in this book, this framework lets you define the solution that fits your corporation's needs and specify how you want the providers to respond to those needs.

Note: It is worthwhile to decide ahead of time how you will rate the providers' responses by allocating a marking or weighting scheme that places more importance on your corporate network's needs than on less critical issues. For example, you may determine that, for operational reasons, you need the provider edge-customer edge (PE-CE) protocol to be Enhanced Interior Gateway Routing Protocol (EIGRP) rather than any alternative, so the marks awarded for that functionality may be greater than, say, those for multicast support if you do not make extensive use of multicast applications.

Architecture and Design Planning with the Service Providers

With the RFP written, responses gathered, and a selection made, a detailed design document must be created that records the technical definition of the future WAN topology and services. This document forms the basis of the future design, built on the peer-to-peer network architecture that the new Layer 3 MPLS IP VPNs provide. It is a working document for both the VPN provider and those managing the network migration, so that a common understanding of the technical requirements can be achieved. Clearly, it will closely resemble the requirements defined in the RFP; however, because compromises are always required in accepting proposals to RFPs, different trade-offs will be required when evaluating each provider's offerings.
After the provider(s) are selected, this document replaces the RFP as a source of technical description and takes into account what the chosen provider(s) can actually offer and how that will be implemented in the enterprise network to deliver the desired service. The following is a sample of a table of contents for a design document:
This list only suggests topics to consider for the working document that defines how the network will be designed and how it will operate. The implementation teams from both the provider and the corporation need intimate working knowledge of the network's design and operations. Should a systems integrator be used to manage the transition from Frame Relay to MPLS VPN connectivity, the systems integrator should be able to demonstrate a good understanding of these topics. Beyond basic understanding, a good systems integrator will be able to tell you how the different options within each of these topics will affect your network after they are implemented.

Project Management

Converting a network from a Layer 1 time-division multiplexing (TDM) service, or from a Layer 2 offering, to a Layer 3 IP VPN is a significant task for any corporation. To manage that transition successfully, some minimal project management is advisable. Many project-management methods are available. A suitable one can efficiently do the following:
SLAs with the Service Providers

This is one of the most contentious topics in negotiations between providers and their customers. Naturally, a customer paying for a service wants the delivered service to be measured against what is being paid for and wants a penalty to apply if the delivered service falls short. With point-to-point Layer 2 connections, this is relatively simple: it's easy to measure the path's availability and the delivered capacity on that path. However, after the any-to-any connectivity of an IP VPN is delivered, with support for multiple classes of service (CoSs), the situation is more complex. The typical service provider SLA defines the loss, latency, and jitter that the provider's network will deliver between PE points of presence (POPs) in its network. In almost all cases, this is an average figure, so POPs near each other compensate for the more remote POPs in terms of latency contribution. Some providers also offer different loss/latency/jitter targets for different CoSs. Again, this is normally for traffic between provider POPs. What is of interest to enterprise applications, and hence to enterprise network managers, is the service's end-to-end performance, not just the bit in the middle. Specifically, the majority of latency and jitter (and most commonly loss, too) is introduced on the access circuits because of their constrained bandwidth and slower serialization times. To solve this problem, you need SLAs that reflect the service required by the applications. By this, I mean that latency and jitter can be controlled by implementing a priority queuing (PQ) mechanism; in a PQ system, loss is a function of the amount of traffic the user places in the queue, which the provider cannot control. For classes using something like Cisco class-based weighted fair queuing (CBWFQ), the latency and jitter are a function of the load offered to the queuing mechanism.
This is not surprising, because this mechanism is designed to allocate bandwidth to specific classes of traffic, not necessarily to deliver latency or jitter guarantees. Some providers have signed up to deliver the Cisco Powered Network (CPN) IP Multiservice SLA, which provides 60-ms edge-to-edge latency, 20-ms jitter, and 0.5 percent loss between PE devices. With this strict delivery assured, designing the edge connectivity to meet end-to-end requirements is simplified. With advances to the Cisco IP SLA, it will be possible to link the measuring of latency and jitter to class load. It is then reasonable for a provider to offer delay guarantees for CBWFQ classes, provided that the offered load is less than 100 percent of the class bandwidth. This then puts the CBWFQ's latency and jitter performance under the enterprise's control. If the enterprise does not overload the class, good latency and jitter should be experienced; however, if the class is overloaded, that will not be the case. There should be more to an SLA than loss, latency, and jitter characteristics. The SLA should define the metrics for each service delivered, the process each side should follow to deliver the service, and what remedies and penalties are available. Here is a suggested table of contents to consider when crafting an SLA with a provider:
It is worth discussing each element in more detail. It is important to base performance characteristics on the requirements of the application being supported and to consider them from the point of view of end-to-end performance. Starting with the PQ service, which will be used for voice, see Figure 10-1, which shows the results of ITU G.114 testing for voice quality performance. The E-model rating is simply a term given to a set of tests used to assess user satisfaction with the quality of a telephone call.

Figure 10-1. SLA Metrics: One-Way Delay (VoIP)

If you select a mouth-to-ear delay budget of 150 ms, you may determine that the codec and LAN delay account for, say, 50 ms (this varies from network to network), leaving you 100 ms for the VPN. If the provider is managing the service to the CE, this 100 ms is the performance target. However, if the provider is managing the service only to the PE, perhaps only 30 ms of PE-to-PE delay is acceptable to stay within the end-to-end budget. This more stringent requirement comes from the serialization times of the access link speed (for maximum-sized fragments), the PQ's queue depth, and the size of the first in, first out (FIFO) transmit ring on the routers in use, which together can take up 35 ms for the ingress link and 35 ms for the egress link. Whether the provider manages from CE to CE or PE to PE, targets must be set for the connection type, and reports need to be delivered against contracted performance. From the enterprise perspective, it's simplest to have the provider measure and report on performance from CE to CE; however, that comes with a drawback: to do so, the provider must be able to control the CE to set up IP SLA probes that measure CE-to-CE performance and collect statistics. This is generally done by having the provider manage the CE device.
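The budget arithmetic above can be sketched in a few lines. The link speed, fragment size, and queue/transmit-ring figures here are illustrative assumptions chosen to reproduce the 35-ms-per-link example; substitute your own network's values.

```python
# Back-of-the-envelope VoIP delay budget (illustrative figures, not fixed values).

def serialization_ms(frame_bytes: int, link_bps: int) -> float:
    """Time to clock one frame onto the wire, in milliseconds."""
    return frame_bytes * 8 / link_bps * 1000

MOUTH_TO_EAR_BUDGET_MS = 150   # ITU G.114 target used in the text
CODEC_AND_LAN_MS = 50          # example figure from the text

vpn_budget_ms = MOUTH_TO_EAR_BUDGET_MS - CODEC_AND_LAN_MS   # 100 ms for the VPN

# Assumed edge contribution per access link: serialization of a maximum-sized
# fragment plus PQ queue depth plus the FIFO transmit ring.
frag = serialization_ms(640, 256_000)   # 640-byte fragment at 256 kbps -> 20 ms
edge_per_link_ms = frag + 10 + 5        # + assumed queue and tx-ring delay -> 35 ms

pe_to_pe_budget_ms = vpn_budget_ms - 2 * edge_per_link_ms   # what remains PE to PE

print(f"VPN budget:       {vpn_budget_ms} ms")
print(f"Edge per link:    {edge_per_link_ms:.0f} ms")
print(f"PE-to-PE budget:  {pe_to_pe_budget_ms:.0f} ms")
```

With these assumed numbers, the two access links consume 70 ms of the 100-ms VPN budget, leaving the 30 ms PE-to-PE target mentioned above; a faster access link or smaller fragments relax the core requirement accordingly.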
However, not all enterprises want the IOS revision on the CE to be controlled by the provider, because the enterprise might want to upgrade its routers to take advantage of a new IOS feature. Clearly, this needs to be negotiated between the provider and enterprise to reach the optimum solution for the network in question. For the data class, some research suggests that, for a user to retain his train of thought when using an application, the application needs to respond within one second (see Jakob Nielsen's Usability Engineering, published by Morgan Kaufmann, 1994). To reach this, it is reasonable to budget 700 ms for server-side processing and to require the end-to-end round-trip time to be less than 300 ms for the data classes. Jitter, or delay variation, is a concern for real-time applications. With today's newest IP phones, adaptive jitter buffers compensate for jitter within the network and automatically optimize their settings. This is done by effectively turning a variable delay into a fixed delay by having the buffer delay all packets for a length of time that allows the buffer to smooth out any variations in packet delivery. This reduces the need for tight bounds on jitter to be specified, as long as the fixed delays plus the variable delays are less than the overall delay budget. However, for older jitter buffers, the effects of jitter above 30 or 35 ms can be catastrophic in terms of meeting user expectations for voice or other real-time applications. Clearly, knowledge of your network's ability to deal with jitter is required to define appropriate performance characteristics for the WAN. The effects of loss are evident in both real-time and CBWFQ classes. For real time, it is possible for jitter buffers to use packet interpolation techniques to conceal the loss of 30 ms of voice samples. 
Given that a typical sample rate for voice is 20 ms, this tells you that a loss of two or more consecutive samples will cause a blip to be heard in the voice conversation that packet interpolation techniques cannot conceal. Assuming a random drop distribution within a single voice flow, a 0.25-percent packet drop rate within the real-time class results in a loss every 53 minutes that cannot be concealed. The enterprise must decide whether this is acceptable or whether tighter or looser loss characteristics are required. For the data classes, loss affects the attainable TCP throughput, as shown in Figure 10-2.

Figure 10-2. TCP Throughput

Graph created based on information from "The Macroscopic Behavior of the TCP Congestion Avoidance Algorithm," Matthew Mathis, Computer Communication Review, July 1997.
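Both the voice-loss interval above and the throughput bound plotted in Figure 10-2 can be checked with a short sketch. The independent-random-drop model follows the text; the constant sqrt(3/2) is from the cited Mathis paper, while the MSS, RTT, and loss values in the example are assumptions for illustration.

```python
import math

# Part 1: unconcealable voice loss (20-ms samples, random 0.25% drop rate).
# An "event" is two consecutive samples lost -- more than packet
# interpolation can conceal (~30 ms of missing audio).
samples_per_sec = 1 / 0.020                       # 50 packets/s at 20-ms samples
p_drop = 0.0025                                   # 0.25 percent random loss
events_per_sec = samples_per_sec * p_drop ** 2    # a drop immediately followed by another
minutes_between_events = 1 / events_per_sec / 60
print(f"Unconcealable loss roughly every {minutes_between_events:.0f} minutes")

# Part 2: Mathis et al. upper bound on sustained TCP throughput:
#   rate <= (MSS / RTT) * sqrt(3/2) / sqrt(p)
def mathis_bps(mss_bytes: int, rtt_s: float, p_loss: float) -> float:
    return (mss_bytes * 8 / rtt_s) * math.sqrt(1.5) / math.sqrt(p_loss)

# Example: 1460-byte MSS, 100-ms RTT, 0.1 percent loss (assumed values).
print(f"Max TCP throughput: {mathis_bps(1460, 0.100, 0.001) / 1e6:.1f} Mbps")
```

The first calculation reproduces the 53-minute figure quoted above; the second shows why, for a fixed RTT, even modest loss rates cap the throughput a data class can attain, which is the relationship Figure 10-2 plots.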
In Figure 10-2, you can see the maximum attainable TCP throughput for different packet-loss probabilities, given different round-trip time characteristics. As long as the throughput per class, the loss, and the round-trip time fall within the performance envelopes illustrated, the network should perform as required. The primary reporting concern with the data classes is how well they perform for delay and throughput, which depends almost entirely on the load the enterprise offers to them. Should the enterprise send more traffic than is contracted for and configured within a data class, loss and delay grow rapidly, and the provider cannot control this. Realistically, some sort of cooperative model between the provider and enterprise is required to ensure that data classes are not overloaded or, if they are, that performance guarantees are expected only when the class is less than 100 percent utilized. Other subjects listed in the SLA are more straightforward. Availability, MTTR, and installation and upgrade performance are mostly self-explanatory:
Network Operations Training

Clearly, with a new infrastructure to support, system administrators need appropriate training in the technology itself, the procedures used to turn up or troubleshoot new sites, and the tools they will have to assist them in their responsibilities. The question of whether to train the enterprise operations staff in the operation of MPLS VPNs (with respect to the service operation within the provider's network) is open. Some enterprises may decide that, because no MPLS encapsulation or MPLS protocols will be seen by the enterprise network operators, no training in this technology is necessary. However, experience to date has shown that knowledge of MPLS VPN operation is helpful when troubleshooting issues with service provider staff. The following high-level topics were taught to a large enterprise that successfully migrated network operations to a provider-delivered MPLS VPN service. These topics can be used as a template to evaluate training offerings to see whether all necessary topics are covered:
These topics can be covered with course outlines that are similar to the following: