Section 8.1. SPF Enhancements | OSPF and IS-IS: Choosing an IGP for Large-Scale Networks: Choosing an IGP for Large-Scale Networks

8.1. SPF Enhancements

Chapter 2 gave you an idea of how an SPF algorithm works. And while that overview is enough to give you an understanding of the SPF algorithm, it is too simplistic for a practical link state protocol implementation. In fact, even if you follow the OSPF and IS-IS standards, you will get an implementation that is workable (if naïve) for small networks, but that lacks the stability, scalability, and accuracy necessary to survive in large-scale networks. For router vendors wanting to market to the largest network operators, sophisticated OSPF and ISIS implementations are essential. Demonstrably robust implementations are a competitive advantage, and as a result, many of the enhancements to the SPF algorithmand in fact the SPF algorithm itselfcan be closely guarded corporate secrets. Therefore, it is impossible to detail many vendor-specific enhancements, but the general areas of enhancement can be examined.

8.1.1. Equal-Cost Multipath

The basic Dijkstra calculation described in Chapter 2 uses the following steps:

1.	The router adds itself to the tree database as the root of the tree. It shows itself as its own neighbor, with a cost of 0.
2.	All entries in the link state database describing links from the root to its neighbors are added to the candidate database.
3.	The cost from the root to each node in the candidate database is calculated. The link in the candidate database with the lowest cost is moved to the tree database, along with the cost from the root. If two or more links are an equally low cost from the root, choose one. If any entries are left in the candidate database with a link to the neighbor just moved to the tree, those entries are deleted from the candidate database.
4.	The router ID of the neighbor on the link just added to the tree is examined. Entries originated by that neighbor are added to the candidate database, except for entries in which the ID of the neighbor is already in the tree database.
5.	If entries remain in the candidate database, return to Step 3. If the candidate database is empty, terminate the calculation. At this time, every router in the network should be represented as a neighbor on one of the links in the tree database, and every router should be represented just once.

However, there is an inefficiency in Step 3. Namely, "If two or more links are an equally low cost from the root, choose one. If any entries are left in the candidate database with a link to the neighbor just moved to the tree, those entries are deleted from the candidate database." An example is needed to understand why this is an inefficiency. Figure 8.1 shows a small network and its link state database. What is important about this network is that unlike the network in Figure 2.15 that was used for the SPF example in that chapter, all of the links in this network have the same cost. This example will show the SPF calculation at R1.

Figure 8.1. All links in this network have a cost of 2.

Steps 1 and 2 are shown in Figure 8.2. R1 adds itself to the tree database, the links to its neighbors are added to the candidate database, and the costs from the root to the destination nodes are calculated.

Figure 8.2. The links from R1 to its neighbors are added to the candidate database.

Step 3 is performed in Figure 8.3. The costs to the root are equal, so one of the entries is randomly chosen and moved from the candidate database to the tree database. The inefficiency in Step 3 will not be apparent until the end of the SPF calculation.

Figure 8.3. The link from R1 to R2 is moved to the tree database.

Step 4 is performed in Figure 8.4. R2 was added to the tree database, so all the links from R2 to its neighbors are examined in the link state database. R1 is already in the tree database, so only the link from R2 to R4 is added to the candidate database. The cost from the root to R4 is 4.

Figure 8.4. The link from R2 to R4 is added to the candidate database.

There are entries remaining in the candidate database, so, as Step 5 prescribes, Figure 8.5 shows a return to Step 3. The cost to root associated with the R1-R3 link is the lowest, so that link is moved from the candidate database to the tree database.

Figure 8.5. The R1-R3 link is moved to the tree database.

With R3 added to the database, the links to its neighbors are examined in the link state database. R1 is already in the tree database, so only the R3-R4 link is added to the candidate database in Figure 8.6. The cost from the root to R4 is 4.

Figure 8.6. The R3-R4 link is added to the candidate database.

In Figure 8.7, the cost from the root of both entries in the candidate database is 4, so one (R2-R4) is randomly chosen and moved to the tree database.

Figure 8.7. The R2-R4 link is moved to the tree database.

In Figure 8.8, the link state database is examined for R4's links to its neighbors: There are two, R4-R2 and R4-R3. However, both R2 and R3 are already in the tree database, so there are no new entries to add to the candidate database. But there is still an entry in the candidate database, and Step 5 says the SPF calculation stops only when the candidate database is empty. This is the reason for the part of Step 3 that says, "If any entries are left in the candidate database with a link to the neighbor just moved to the tree, those entries are deleted from the candidate database." If the "leftover" entries in the candidate database are not removed, the SPF calculation cannot end.

Figure 8.8. The R3-R4 link remains on the candidate database and must be removed before the SPF calculation can end.

But the inefficiency of this rule also reveals itself at this point. If multiple equal-cost paths exist, choosing only one of the paths means all traffic will be routed on that path, possibly congesting it, while other equally good paths are ignored. We can modify Step 3 so that if there are multiple links in the candidate database to the same node and equally low cost, all the links can be moved from the candidate database to the tree database. In Figure 8.9, moving the R3-R4 link to the tree database empties the candidate database so that the SPF can stop. An enhancement to the router's forwarding processes can then be made that takes advantage of these multiple equal-cost paths by spreading traffic across all of them. This is equal-cost multipath (ECMP), or load balancing.^[1]

^[1] Load balancing is a more generic term that can also refer to data processing, such as spreading traffic across multiple Web servers. ECMP is more specific to the forwarding of network traffic.

Figure 8.9. By moving the R3-R4 link to the tree database rather than just deleting it from the candidate database, an SPF tree is created in which two equal-cost paths from R1 to R4 exist.

Figure 8.10 shows one situation in which ECMP can apply. In this topology, the next hop from Router A to subnet 10.1.1.0 is Router B. But there are three links to Router B, each with a cost of 50. Router A can balance the traffic going to 10.1.1.0and any other destination for which Router B is the next hopacross the three links.

Figure 8.10. ECMP allows Router A to utilize all three links to Router B for traffic going to subnet 10.1.1.0.

Figure 8.11 shows another application of ECMP. Here, Router A has two next hopsRouter B and Router Cto subnet 10.2.2.0. Routes to the destination through either next hop have the same cost because all four of the links shown have the same cost. So in this case, Router A can select both next hops and balance the traffic to 10.2.2.0and any other traffic to destinations that pass through Router Dbetween Routers B and C.

Figure 8.11. ECMP allows Router A to use two equal-cost routes to subnet 10.2.2.0.

Given multiple equal-cost pathseither to the same next-hop router as in Figure 8.10 or through separate next-hop routers as in Figure 8.11a router can balance traffic across them in several ways. One approach is per-packet load balancing, in which the router forwards packets round-robin or randomly over each path as shown in Figure 8.12. The problem with this approach is that different links, although of equal cost, might not be of equal delay. Link propagation, router latency and buffering, and link MTUs requiring fragmentation might differ from one path to another. As a result, packets can become reordered at the receiving end. This can cause problems for TCP, which when it sees out-of-sequence packets can request retransmits. The result is reduced performance and wasted bandwidth.

Figure 8.12. Per-packet load balancing distributes packets individually over equal-cost paths.

A better approach is per-destination load balancing, in which packets are distributed across multiple paths by destination, as shown in Figure 8.13. In this example, Router A knows it has two equal-cost paths to the destinations 10.1.1.0 and 10.2.2.0. It chooses Router B as the next hop for the route to 10.1.1.0, and Router C for the route to 10.2.2.0. As Router A discovers more routes to destinations reachable through Router D, it continues to alternately assign Router B and Router C as next hops. The functional difference is that per-packet load balancing assigns all ECMP next hops to each route, whereas per-destination load balancing selects, either round-robin or randomly, one next hop from the set of ECMP next hops for each route.

Figure 8.13. Per-destination load balancing distributes next hops or outgoing interfaces among multiple equal-cost routes.

Although per-destination load balancing is an improvement over per-packet, it is still not very good. If substantially more packets are sent to one destination than to another, the overall bandwidth utilization will be uneven. For example, suppose Router E in Figure 8.14 is providing Internet access. When it advertises BGP routes to the other routers in the figure, the routes are given a next-hop address of 10.255.0.1, Router E's loopback address. This is common practice; Router E is advertising that it is the next hop for any Internet destinations. When Router A receives these BGP advertisements, it does a lookup of 10.255.0.1 in its IGP routes to find out how to reach the BGP next hop. If the router is performing per-destination load balancing, it selects either Router B or Router C as the "next hop to the next hop"that is, the IGP next hop to the BGP next hop. The potential problem is that if traffic passing through Router A to the Internet is relatively heavy compared to other traffic passing through Router A to Router D, the loading across the equal cost paths can be severely unbalanced.

Figure 8.14. Per-destination load balancing might be inefficient in this topology.

Per-flow load balancing improves the traffic distribution by identifying individual traffic flows and forwarding all packets belonging to the flow to the same next hop or out the same interface. At a minimum, a flow might be defined as all packets with the same source and destination addresses. However, flows can be made more granular by defining a flow not only by source and destination address but also by protocol number, source and destination port,^[2] and perhaps ToS or DSCP (differentiated services code point) values. When a packet is to be forwarded, the values of these fields become the input to a hashing algorithm. (Router vendors define their own hashing algorithms.) The packet is then associated with a flow based on the resulting hash value, and all packets belonging to the same flow are forwarded to the same next hop.

^[2] When source and destination ports are included in the flow identification, the flow is called a microflow. Implementations that load balance by microflow are more challenging to create because they require the router to look further into the packet than just the IP header.

8.1.2. Pseudonodes and ECMP

If some but not all of the links in an ECMP group are broadcast links, a subtle problem can arise from simplistic SPF implementations. Another example SPF run will help you understand the problem.

The physical topology in Figure 8.15 shows two routers, R1 and R2, interconnected with both a point-to-point link and an Ethernet link. The outgoing cost on both of R1's interfaces is 2, so the two links comprise equal cost paths, and load balancing can be performed between R1 and R2. The logical topology shows that the Ethernet link is represented as a pseudonode, which is labeled in this example as P3. Recall from the discussion of pseudonode basics in Section 4.4 that the cost from a pseudonode to any of its attached routers is 0, so that the pseudonode does not affect the transit cost of the broadcast link. Figure 8.16 shows the start of the SPF calculation at R1 for this small network, in which R1 installs itself in the tree database as the root and then adds all links to its neighbors to the candidate database.

Figure 8.15. There are two equal-cost paths between R1 and R2, one of which is point-to-point and one of which is broadcast.

Figure 8.16. R1 installs itself as root and adds the links to its neighbors to the candidate database.

The two links added to the candidate database are of equal cost (2) to the root, so as in previous examples, one is chosen at random and added to the tree database (Figure 8.17). In this case, the chosen link is R1-R2. Although not yet apparent, this random choice will cause a problem with ECMP.

Figure 8.17. The R1-R2 and R1-P3 links are of equal cost from the root, so one is randomly chosen to move to the tree database.

Because a link to R2 was added to the tree, the links to R2's neighbors are examined in the link state database (Figure 8.18). There are two, R2-R1 and R2-P3. R1 is already in the tree database, so only the R2-P3 link is added to the candidate database.

Figure 8.18. A link to R2 was added to the tree database in the last step, so the R2-P3 link is added to the candidate database.

In Figure 8.19, R1-P3 has the lowest cost from the root and so is moved to the tree database. Because there is now a path to P3 in the tree database, the higher-cost R2-P3 link is deleted from the candidate database.

Figure 8.19. The lower-cost R1-P3 link is moved to the tree database and the higher-cost R2-P3 link is deleted from the candidate database.

With P3 added to the tree, its links to its neighbors are examined in the link state database. There are two, P3-R1 and P3-R2. But both R1 and R2 are already on the tree, so neither of these links is moved to the candidate database. At this point, the candidate database is empty, so the SPF calculation stops. And now, in Figure 8.20, we can look at the resulting tree and see the ECMP problem. R1 has branches to R2 and to P3, each at cost of 2, but the tree does not include a branch from P3 to R2. Therefore, R1 forwards all traffic to R2 across the point-to-point link and none across the Ethernet link. No load balancing can take place.

Figure 8.20. The resulting tree does not allow load balancing across the two equal-cost paths because it does not include a branch from P3 to R2.

As mentioned, the problem actually arose with the random selection in Figure 8.17 of the R1-R2 link from the candidate database. So let's back up to that step and select the other link instead. Figure 8.21 shows the candidate database in the same state as in Figure 8.17, but now R1-P3 is selected and moved to the tree database instead of R1-R2.

Figure 8.21. Returning to the point in the SPF calculation where either R1-R2 or R1-P3 is randomly selected for the tree database, R1-P3 is now chosen instead of R1-R2 as in Figure 8.17.

Continuing on with the SPF calculation, because a link to P3 was added to the tree, the links to P3's neighbors are examined in the link state database. R1 is already on the tree, so only P3-R2 is added to the candidate database (Figure 8.22).

Figure 8.22. The P3-R2 link is added to the candidate database.

Now, in Figure 8.23, the candidate database contains two equal-cost entries for links to R2. Using the modification to the SPF algorithm described in Section 8.1.1 to accommodate ECMP, rather than randomly selecting one and discarding the other, both are moved to the tree database. No links remain in the link state database to nodes that are not already on the tree, and the candidate database is now empty, so SPF stops.

Figure 8.23. R1-R2 and P3-R2 are equal-cost links to the same node, so they are both moved to the tree database.

Figure 8.24 shows the resulting tree. Both equal-cost paths are now represented, and R1 can load balance traffic to R2. The significant step that prevented disruption of ECMP is that in Figure 8.21 the link to the pseudonode was selected for the tree database before the link to R2. Therefore, to ensure that ECMP works correctly when the ECMP group consists of a mixture of point-to-point and broadcast links, the following simple rule is added to the SPF procedure:

If there are multiple entries in the candidate database with equally low cost, and if at least one link is to a pseudonode and at least one link is to a router, always select the link to the pseudonode first rather than randomly selecting among the links.

Figure 8.24. The resulting tree correctly accounts for the two equal-cost paths, and so allows load balancing.

Specific OSPF or IS-IS implementations might or might not include this modification. So if load balancing is a part of your network design, be sure to verify with your vendor that their SPF routine includes this rule.

8.1.3. Incremental SPF Calculations

It is not difficult to find network engineers and operators who think distance vector protocols are preferable over link state for all but the largest networks. Ask them why, and their answer is "because the SPF calculations are complex and CPU intensive, and they will hurt my routers' performance." These folks are stuck in the nineties. Concern over the price in router resources needed to pay for SPF was partially justified a decade or so ago, but there is no longer reason to fear SPF (in relation to distance vector protocols, at least). Several factors contribute to this change in attitude:

Increased router performance and memory capacity
Increased sophistication of physical and logical router architectures
Increased base of experience designing and operating large, complex networks
Increased base of vendor experience designing sophisticated SPF processes

One of the improvements several router vendors have incorporated into their SPF procedures is Incremental SPF (iSPF). To understand iSPF, consider the network shown in Figure 8.25. Using the link costs given in the illustration, you can determine with little trouble that R1 will calculate the shortest-path tree shown in Figure 8.26.

Figure 8.25. An example network and link costs.

Figure 8.26. The shortest-path tree calculated by R1.

Now consider a change to the network topology. Figure 8.27 shows the addition of a new node, R8. The change to the tree to account for the new node is shown in Figure 8.28. With the SPF procedures as you have seen them so far, the addition of the new node requires the reflooding of LSAs or LSPs and a recalculation of SPF by all routers in the area. But a cursory look at the new topology shows that most of the tree has not changed; the addition of R8 involves the simple addition of a branch from R5.

Figure 8.27. Node R8 is added to the topology of Figure 8.25.

Figure 8.28. The new shortest-path tree from R1 shows the added router R8.

Consider another topological change, shown in Figure 8.29. Here, the link between R4 and R5 has failed or been disabled. Again, our understanding of SPF so far requires that this information be flooded throughout the area and that all nodes rerun SPF. But this link is not a part of the shortest-path tree shown in Figure 8.26, so rerunning SPF results in a tree that looks exactly the same as it did before the link failure.

Figure 8.29. Failure of the R4-R5 link does not change the shortest-path tree, because it was not a branch of the tree to begin with.

These two topological changes are the basis for iSPF. In the first example, in which a router with a single link to the network was added (that is, a stub router), there are no alternative links to consider and so no need to attempt to calculate a shortest path. There is only one path to the new node. Therefore, a full SPF run is unnecessary for the routers that were already on the tree; they only need to add the new branch from R5 to R8.

In the second example, the failure of the link has no affect whatsoever on the shortest-path tree and so again there is no need for the nodes in the area to rerun SPF. Only a reduction in the metrics associated with the R4-R5 link would necessitate a full SPF run.

iSPF takes into consideration topological changes, such as the two shown in this section, and performs SPF only to the extent necessary to allow for the changes. In the first case, a simple distance vectorlike addition to the tree is required; in the second case, no action at all is required. Experience has shown that iSPF provides the greatest benefit in cases of the first example, when stub routers are added to or removed from the topology.

8.1.4. Partial Route Calculations

The topology changes described in the previous section are by no means unusual in an IGP-routed topology, so it is easy to understand how the changes to the basic SPF procedure incorporated by iSPF can improve efficiency. Another network change, as common as those of the previous section, is the addition, deletion, or metric change of IP addresses. Reviewing the basic SPF procedure, you can see that the only addresses used in the computation of the shortest-path tree are the RIDs. Although recording the distance and direction of IP prefixes from each router is the ultimate goal of any routing protocol, the prefixes themselves have no bearing on an SPF calculation; after the location and cost of each node on the shortest-path tree are determined, it should be a simple matter to record what prefixes are attached to or advertised by what node.

This is the basis of Partial Route Calculations (PRC). When a node advertises the addition, deletion, or metric change of an IP prefix, rather than unnecessarily rerunning the SPF calculation the other nodes on the tree simply record the change. Although a full SPF run in even large networks is very fasta high-performance router should be able to process 500 to 1000 nodes in 50 to 100msPRC is substantially faster, ranging from 0.5 to 10ms depending on the number of prefixes to be scanned.

PRC's increased efficiency is more substantial in IS-IS than in OSPFv2 because of the way each protocol advertises IP prefixes. IS-IS advertises all IP prefixes in IP Reachability TLVs, whereas node information necessary for SPF calculations is advertised in IS Neighbors or IS Reachability TLVs. This clear separation of prefix information from topology information makes PRC easily applicable to any IP address change.

OSPFv2, on the other hand, incorporates IP address semantics into type 1 and type 2 LSAs. That is, routers or pseudonodes advertise both topological information and their attached IP addresses in the same LSAs. So for example, if an IP address attached to an OSPFv2 router changes, the router must originate a type 1 LSA to advertise the change. However, because node information is also carried in type 1 LSAs, flooding the LSA triggers a full SPF calculation in all routers in an areaeven though the address change is irrelevant to the node topology. As a result, only type 3, 4, 5, and 7 LSAswhose only purpose is to advertise prefix informationtrigger a PRC in OSPFv2.

The applicability of PRC to all IP prefixes in an IS-IS domain, not just prefixes external to an area as in OSPFv2, is a significant contributing factor to the better scaling properties of IS-IS in a single area. You will see in Chapter 12 that OSPFv3 improves its scaling by removing addressing semantics from its Router and Network LSAs and using a new type of LSA to advertise attached prefixes.

8.1.5. SPF Delay

Although the kinds of network events that trigger iSPF and PRC are common, they are not necessarily frequent. In large networks, the kinds of address changes that trigger PRC happen only a few times a day, and the kinds of topological changes that trigger iSPF might happen only a few times a week. Moreover, a full SPF run in even a very large network takes only tens of milliseconds. Given these facts, the value of iSPF and PRC becomes evident mainly when abnormal things happen. But in a large area, LSAs or LSPs will flood regularly because of the random expiration of refresh timers around the network, requiring regular SPF runs. And when bad things happen, routers can become inundated with changing LSA/LSPs. If each LSA/LSP triggers a full or incremental SPF run, and if they are arriving fast and furious, SPF can begin eating up the majority of CPU cycles. Scheduler slips result, important tasks become delayed or completely missed, and the router can become seriously destabilized.

The challenge in large-scale networks, then, is to quickly react to network changes while at the same time not allowing SPF calculations to dominate the route processors. This is the goal of SPF delay, also called SPF holddown or SPF throttling. Rather than kick off an SPF calculation every time a new LSA/LSP arrives, SPF delay forces the router to wait a bit between SPF runs. If a large number of LSA/LSPs are being flooded, a delay between SPF runs means that more LSA/LSPs are added to the link state database during the holddown period. Efficiency is then increased because when the holddown period expires and SPF is run, more network changes are included in a single calculation.

SPF delay arguably adds another small efficiency. When an SPF calculation is being run, it is necessary to "freeze" the link state database so that new LSA/LSPs are not added.^[3] The reason for this is obvious: Changing the link state database in the middle of an SPF calculation could corrupt the results. So, all LSA/LSPs that arrive during a running SPF calculation must be buffered until the calculation ends, and only then added to the link state database (likely triggering another SPF run). SPF delay, by reducing the overall frequency of SPF runs, reduces the frequency that LSA/LSPs must be buffered.

^[3] A more elegant approach than freezing the database is to lock just the LSPs that are being processed.

Of course, the price you pay for delaying an SPF calculation is an increased network convergence time. So, the challenge when determining a delay interval is to make it long enough that a heavy surge of LSA/LSPs does not destabilize the router while still keeping it short enough that convergence is not hurt significantly. Even better would be to have normal "fast" SPF calculations when conditions in the network are normal, and delayed SPF calculations when the network is unstable. Several vendors do offer such adaptive SPF timers.

Juniper Networks uses a delay scheme in which the normal period between SPF runs is short. This period is 200ms by default, meaning an SPF run cannot take place any sooner than 200ms after the last run. The period is configurable with the spf-delay command to between 50 and 1000ms; the command can be used with both OSPF and IS-IS, and applies to both full and partial SPF calculations. If three SPF runs are triggered in quick succession, indicating instability in the network, the delay period is automatically changed to 5 seconds. This "slow mode" period is not configurable. The routers remain in this "slow mode" until 20 seconds have passed since the last SPF runindicating that the network has stabilizedand then switches back to "fast mode."

Cisco Systems originally had a similar linear fast/slow algorithm (configured for OSPF with the command timers spf). But more recently, Cisco has taken a different approach to adaptive SPF delays by using an exponential backoff algorithm. Initial delay, delay increment, and maximum delay periods^[4] are configured. The router waits the initial delay period before first running SPF. After the first run, the delay is increased by doubling the delay increment every time SPF runs. So for example, if the initial delay is 100ms and the delay increment is 1000ms, the router delays the first SPF run by 100ms, the second by 1000ms, the third by 2000ms, the fourth by 4000ms, and so on. The maximum delay value specifies in seconds the largest value to which the delay can be incrementedan obvious necessity to prevent an unstable network from causing the SPF delay to increase so much that SPF does not run at all. When SPF has not run for twice the time specified by the maximum delay period, the router switches back to "fast" mode in which the initial delay period is used.

^[4] The actual IOS command terms for these three values varies depending on whether you are configuring exponential backoff for OSPF or IS-IS, but their function is the same.

These three timers are specified for OSPF with the IOS command timers throttle spf and for full IS-IS SPF with the command spf-interval. Partial SPF for IS-IS is configured separately with the IOS command prc-interval.