3.4 Routing protocols

3.4 Routing protocols

3.4.1 Routing Information Protocol (RIP)


RIP is formally defined in the XNS Internet Transport Protocols publication (1981) and in RFC 1058. Since its initial release there have been several improvements to RIP, culminating in a new standard called RIP version 2 (RIPv2), specified in RFC 2453. RIPv2 overcomes some of the more basic limitations of RIPv1, most significantly with the following features:

  • The ability to carry subnet mask data with routing entries

  • Support for authorization data for link security

  • Optional support for sending routing updates using a multicast destination address (instead of broadcasting)

  • Support for autonomous systems and IGP/EGP interactions through externally tagged information

These features are all welcome extensions to RIPv1, but they do nothing to deal with scalability limitations such as slow convergence and routing hop limitations.

Protocol stack


RIP runs directly over UDP and uses socket 520 for all RIP communication.

Packet formats

Common header format

Figure 3.7 illustrates the RIP packet header format for both RIPv1 and RIPv2 (specified in RFC 1058).

click to expand
Figure 3.7: RIPv1 header format.

RIPv1 header definitions
  • Cmd—The command field indicates either a request or a response. A request command asks another system to send all or part of its routing table. A response command can be a reply to a request or a regular unsolicited routing update. With unsolicited routing update response messages the host router will include the entire routing table.

  • Vers—Specifies the RIP version implemented. This field is used to differentiate potentially incompatible implementations.

  • Reserved—A 16-bit field of all zeros. Must be zero. Not used in RIPv1.

RIPv1 routing entries

RIPv1 routing entries (see Figure 3.8) do not carry subnet mask information and so cannot distinguish among various types of addresses (host address, subnet number, network number, or all zeros—default route).

click to expand
Figure 3.8: RIPv1 routing entry format.

RIPv1 routing entry definitions
  • AFI—This is an address family identifier specifying the address family being used. With IP this address family is set to IP (value = 2), but other network address types are allowed (e.g., Novell IPX).

  • Reserved—A 16-bit field of all zeros. Must be zero. Not used in RIPv1.

  • Net Addr—One of a series of network addresses. With IP only the first four bytes of each address field are used (leaving eight bytes effectively wasted).

  • Metric—This is a distance-vector metric associated with each route entry and is implemented as a hop count. The hop count indicates how many hops must be traversed before the destination can be reached. The limit is 15.

An RIP router must, therefore, use the mask associated with an interface (if a direct route) or the default natural mask when interpreting routing updates. RIPv1 cannot, therefore, support Variable-Length Subnet Masks (VLSM), and this is a significant drawback in anything other than small internetworks. The maximum datagram size (i.e., the RIP portion of the packet) is 512 octets.

A typical RIPv1 router update is illustrated in Figure 3.9. A single RIP routing update can hold up to 25 route entries. The RIP routing update timer is generally set to 30 seconds, ensuring that each router will send a complete copy of its routing table to all neighbors every 30 seconds. The route invalid timer determines how much time must expire without a router having heard about a particular route before that route is considered invalid. When a route is marked invalid, neighbors are notified of this fact by setting the metric to 16 (infinity) in the next routing update. This notification must occur prior to expiration of the route flush timer.

start figure

 File:IPXDIRSV.ENC  Type:SNIFFER_ENC Mode:••••••• Records:402 ====== 306       Error  : None T Elapsed: 01:27:19:913      T Delta : 00:00:00:527 -------------------------------[mac]--------------------------------------- Dest Mac : ffffffffffff       Sourc Mac: Xyplex12a362   Type   : IP -------------------------------[ip]---------------------------------------- IP Ver  : 4        IP         HLen : 20 Bytes TOS   : 0x00       Pkt Len : 292       Seg ID  : 0x008f Flags  : FRAG:X.LAST   Frag Ptr : 0   (8 Octet) TTL   : 64 PID   : UDP ( 17)    Checksum : 0x7599 (Good) Dest IP : Source IP: -------------------------------[udp]--------------------------------------- Dest Port: RIP [  520]         Src Port : RIP [ 520] Length  : 272             Checksum : 0x0000 -------------------------------[rip]--------------------------------------- Command : RESPonse     Version : 1        Reserved : 00 -------------------------------[routes:  13]------------------------------- Family: 2 {192. 0. 1. 0} Hops: 1  Family: 2 {192. 0. 2. 0} Hops: 1 Family: 2 {192. 0. 3. 0} Hops: 1  Family: 2 {193.128. 2. 64} Hops: 1 Family: 2 {193.128. 2. 96} Hops: 1  Family: 2 {193.128. 2.236} Hops: 1 Family: 2 {193.128. 2.237} Hops: 1  Family: 2 {193.128. 2.238} Hops: 1 Family: 2 {193.128. 9. 0} Hops: 1  Family: 2 {194. 0. 2. 0} Hops: 1 Family: 2 {195. 0. 0. 0} Hops: 1  Family: 2 {196. 0. 0. 0} Hops: 1 Family: 2 {197. 0. 0. 0} Hops: 1 ===============================[data: 0]=================================== 

end figure

Figure 3.9: RIPv1 routing update message.

RIPv2 routing entries

RIPv2 uses the same header format as RIPv1 but adds functionality to routing entries. Since RIPv1 has a lot of spare capacity in its packet structure, all of the RIPv2 changes are neatly packaged into the same RIP messages, as illustrated in Figure 3.10.

click to expand
Figure 3.10: RIPv2 routing entry format.

All fields are the same as RIPv1 with the following exceptions.

RIPv2 routing entry definitions
  • Route tag—The RT field is an attribute assigned to a route that must be preserved and readvertised with a route. The intended use of the route tag is to provide a method of separating internal RIP routes from external routes (e.g., those imported from an exterior gateway protocol).

  • IP address—A 32-bit IP address.

  • Subnet mask—A 32-bit subnet mask. If this field is zero, then no subnet mask has been included for this entry.

  • Next hop—The immediate next-hop IP address should be forwarded to the destination specified by this route entry. A value of in this field indicates that routing should be via the originator of the RIP advertisement. The purpose of the next-hop field is to eliminate packets being routed through extra hops in the system. It is particularly useful when RIP is not being run on all of the routers on a network.

  • Metric—This is a distance-vector metric associated with each route entry and is implemented as a hop count. The hop count indicates how many hops must be traversed before the destination can be reached. The limit is 15.

Handling new subnet mask information

RIPv2 is likely to coexist in environments where RIPv1 is also running, so special attention needs to be paid on how routing information is interpreted and distributed. On an RIPv2-enabled interface the following rules apply:

  • Information internal to one network must never be advertised into another network.

  • Information about a more specific subnet may not be advertised where RIPv1 considers it a host route.

  • Supernet routes (where the network mask is less specific than the natural mask) must not be advertised where they could be misinterpreted by RIPv1 routers.


RIPv1 has no security features; RIPv2 improves on this by implementing authentication support on a per message basis. It was thought that there was insufficient space in the RIP header for a useful authentication scheme (only two bytes spare), so the authentication scheme for RIPv2 uses a whole RIP entry, denoted by a special AFI value of 0xFFFF. This means that there are only 24 routing entries per advertisement if authentication is used. The RIPv2 authentication entry format is illustrated in Figure 3.11. The only authentication type supported at present is a simple password (authentication type 2). The remaining 16 bytes contain the plaintext password. If the password is under 16 bytes, it must be left justified and padded to the right with null (0x00) characters. RFC 2082 describes enhancements to this scheme.

click to expand
Figure 3.11: RIPv2 authentication format.


RIPv1 uses broadcasts (both network and MAC) to advertise and request routes; all updates will have a MAC destination of 0xFFFFFFFFFFFF and an IP destination of This makes RIPv1 traffic difficult to isolate without using network Layer 4 filters (e.g., on dial-up bridge links you may want to isolate some broadcast types, and with RIP you would have to access at least the UDP port fields). RIPv2 supports either broadcast or multicast operations. Multicasts can be easily filtered and also reduce unnecessary load on those hosts that are not listening to RIPv2 messages. The IP multicast address can optionally be used by RIPv2 stacks for periodic routing updates. On NonBroadcast MultiAccess (NBMA) networks, unicast addressing may also be used; however, if a response addressed to the RIPv2 multicast address is received, it should be accepted and processed.


Routing algorithm

RIP uses a distance-vector algorithm to calculate optimal routes; more specifically it uses the Bellman-Ford Algorithm. Each router is responsible for building its own shortest path routing table, based on its own knowledge of directly attached interfaces and announcements from other routers. Each route has an associated distance (in hops) to a destination network. The metric takes no account of link speed or delay. The SPF algorithm is very effective (in some cases more efficient than Dijkstra); the distribution protocol is quite inefficient, and sometimes these two aspects are confused. One key aspect to appreciate here is that an RIP router is aware only of the next hop to a destination network; it has no higher-level map of the topology (as implemented in link-state protocols such as OSPF).

Each RIP router sends out regular announcements to all RIP-enabled interfaces every 30 seconds. These messages contain the entire routing table and are propagated to all neighbors on directly connected segments. Clearly, when a router first boots up, the only information it can announce is the distance to its own interfaces, at a cost of 1. Announcements are sent whenever an update timer elapses or whenever there is a detected change (via triggered updates). Once a router has integrated all routing information from its neighbors, it can rebuild a Spanning Tree based on which immediate neighbors offer the lowest metric to each destination. If a new destination is subsequently announced, this will be added into the routing table. If a destination is subsequently discovered with a lower metric, the current route entry is replaced. If a destination is announced that has the same next hop as the current route entry, but a lower metric, then the new metric will be used.

Routing tables

Each entry in an RIP routing table includes the following information: the destination network IP address, the next-hop router's IP address, and the cost to get there (i.e., the metric). The metric indicates the distance in number of hops to the destination—that is, how many routers the packet must traverse in order to reach the destination. Default routes are advertised as Depending upon the implementation, other information may also be present in the routing table (such as timers associated with the route; the source of the routing information; and possibly an administrative distance, which is used to set preferences if two routes to the same destination are advertised from different IGPs). A typical RIP routing table would look as follows:


Next Hop












RIP selects the best route to a destination; there is no inherent support for load balancing. All routes are advertised regularly in routing updates as broadcasts. If a topology change occurs, it is reflected in the next routing update.


The main timers controlling RIP operation are as follows:

  • Routing update timer—Default is 30 seconds (the basic route table update interval).

  • Dead interval—Default is 120 seconds (triggered when a route metric goes to 16).

  • Garbage collection timer—Default is 120 seconds (triggered after dead interval lapses).

  • Hold-down timer—Default is 180 seconds (optional, depending upon implementation).

On some implementations the first three timers may be referred to as update, invalid, and flush, respectively, and the defaults may vary. For example, on a Cisco router these timers will be 30, 180, 240, and 180, respectively.

Passive and active RIP

A router or host will typically offer the choice of configuring RIP on specific interfaces, in either active or passive mode. Active mode means that full routing capability is enabled. Passive mode means that the interface is in listen-only mode and no RIP updates are sent out. For example, on a Cisco router we could define the first serial interface as passive using the following command:

   Helsinki(config)#router rip   Helsinki(config-router)#passive interface serial 0 

RIP issues and enhancements

RIP suffers from the classic issues associated with distance-vector routing protocols (i.e., routing loops and slow convergence). As such, RIP implementations typically include several enhancements—namely, triggered updates, hold-down timers, split horizon, and poison reverse. Check your vendor's implementation to see which of these features are supported.


RIP has several significant performance implications for any network. When planning any design, especially with low-speed WAN links, you must be aware of the following issues:

  • Broadcast overheads

  • Protocol inefficiency

  • Convergence


RIP specifies a maximum hop count of 15, so by definition we cannot build a backbone with RIP that is larger than 15 hops in diameter (or we could, but the farthest points of such a network would not be able to communicate). This generally limits the use of RIP to small local internetworks and end-system integration. Another problem with RPv1 is the lack of VLSM support. This can seriously deplete the available address space and limit the growth of your internetwork. RIPv2 does support VLSM.

Demand circuits

RIP may be required to operate in an environment with demand circuits. Demand circuits refer to those network segments whose monetary cost depends on factors such as connect time or use (typically expressed in terms of bytes or packets). Examples include ISDN dial circuits and X.25 SVCs. A generalized algorithm for demand circuits that can be applied to distance-vector routing protocols (such as IP RIP, NetWare RIP, or NetWare SAP) is documented in RFC 1582. A summary of the main features is as follows:

  • Presumption of reachability—Once routing information is exchanged over dial-up links, those networks are assumed to be reachable even though a circuit's data-link connection may subsequently be closed, unless error conditions indicate otherwise.

  • Triggered updates—Routing updates are sent only when there has been an update to the routing database, a change in the reachability of a next-hop router, or a specific request for a routing update has been received. Low-demand RIP adds support for triggered requests, triggered responses, and triggered acknowledgments.

  • Acknowledgments and retransmissions—An acknowledgment and retransmission system is required to provide reliability. If a triggered request or response is not acknowledged after ten retransmissions, routes to the destination should be marked as unreachable for the duration of a hold-down timer before being deleted. The destination should then be polled at a lower frequency using triggered request packets.

  • Flow control—To overcome the bandwidth limitations associated with dial-up and low-bandwidth environments, the routing application can perform a form of self-imposed flow control, spreading out routing updates over a period of time.

RIPv1 and RIPv2 compatibility

For compatibility RFC 2453 specifies that packets received by an RIPv1 node that have a version number greater than one should not have their reserved (i.e., must be zero) fields validated. This means that, in theory, an RIPv2 packet should be backward compatible. Unfortunately, this depends on which RIP implementation you are talking to, since vendors may have chosen to ignore this recommendation, and several implementations precede the standard. In order to maintain backward compatibility, the use of the multicast destination address should also be configurable. If multicasting is used, it should be used on all interfaces that support it.

Distance-vector routing enhancements

Distance-vector routing protocols such as RIP and IGRP include a number of additional features (or bug fixes, depending upon your point of view) designed to make routing operations more stable, speed up convergence, and preclude failures such as counting to infinity (which results from the creation of routing loops). These enhancements include the following:

  • Triggered updates—Used to speed up the distribution of routing changes rather than waiting for the next announcement interval.

  • Split horizon—Routes are not propagated down the interface from which they were learned. Prevents routing loops between adjacent routers.

  • Poison reverse—Routes are propagated down the interface from which they were learned with a cost of infinity. Prevents routing loops and speeds up convergence.

  • Hold downs—Prevent larger routing updates by imposing a period of calm until the topology is deemed to be sufficiently stable. Prevents counting to infinity.

All of these features are especially important during periods of topological instability. They effectively prevent a bad situation from getting much worse.

3.4.2 Open Shortest Path First (OSPF)


OSPF is currently perhaps the most important open standard IGP for medium to large IP network designs. Many major commercial networks run OSPF backbones, including many sizable Cisco backbones. OSPF has two key characteristics: it is open, in that its specification is in the public domain, and it is based on the shortest path first, or link-state, technology (specifically the Dijkstra algorithm), where the routing database is fully distributed and each router is responsible for calculating the entire routing table independently. OSPF is a hierarchical routing protocol. It distributes routing information between routers belonging to a single Autonomous System (AS).

OSPF was developed by the IGP working group of the IETF, which was formed in the spring of 1988 to design an IGP based on the Shortest Path First (SPF) algorithm for use on the Internet. OSPF was developed to provide efficient and scalable routing for large IP internetworks, primarily due to the failings of existing distance-vector protocols such as RIP and IGRP. The first version of the OSPF protocol was specified in RFC 1131; OSPF version 2 is specified in RFC 1583. OSPF modifications for supporting IPv6 are specified in RFC 2740.

OSPF promotes a hierarchical design, scales well, and converges quickly. It provides features such as a distributed routing database, triggered updates, designated routers, reliable flooding, and explicit support for VLSM and the tagging of externally derived routing information. Optional OSPF features include the authentication of routing updates, equal cost, multipath routing, and routing based on Type of Service (ToS) requests. ToS-based routing is not widely implemented at present, since it is very resource intensive (you need a separate SPF running for each ToS). OSPF is relatively easy to deploy and offers major benefits while demanding more discipline at the design stage.

Packet formats

Figure 3.12 illustrates an OSPF header format.

click to expand
Figure 3.12: OSPF header format.

OSPF field definitions
  • Ver—Identifies the version number of the OSPF implementation being used.

  • Type—Specifies one of the following OSPF packet types:

    • Hello—Sent at regular intervals to establish and maintain neighbor relationships.

    • Database description—Describes the contents of the topological database and are exchanged when an adjacency is being initialized.

    • Link-state request—Requests pieces of a neighbor's topological database. They are exchanged after a router has discovered (through examination of database description packets) that parts of its topological database are out of date.

    • Link-state update—Responses to link-state request packets. They are also used for the regular dispersal of LSAs. Several LSAs may be included within a single packet.

    • Link-state acknowledgment—Acknowledges link-state update packets. Link-state update packets must be explicitly acknowledged to ensure that link-state flooding throughout an area is a reliable process.

    • Each LSA in a link-state update packet contains a type field. There are four LSA types: Router Links Advertisements (RLAs), Network Links Advertisements (NLAs), Summary Links Advertisements (SLAs), and AS external links advertisements.

  • Packet length—Specifies in bytes the packet's length, including the OSPF header.

  • Router ID—Identifies the packet's source.

  • Area ID—Identifies the area to which the packet belongs. All OSPF packets are associated with a single area.

  • Checksum—Checks the entire packet contents for potential damage suffered in transit.

  • Authentication type—Contains an authentication type. "Simple password" is an example of an authentication type. All OSPF protocol exchanges are authenticated. The authentication type is configurable on a per area basis.

  • Authentication—Contains authentication information and is 64 bits in length.

OSPF communication uses a number of variable-length packets for ensuring both neighbor reachability and database distribution. A typical hello packet is shown in Figure 3.13. Notice that the Designated Router (DR) field is blank, because the router that generated this packet was rebooted and had not yet synchronized with any other router; it was not aware of an established DR on the network. Notice also that OSPF does its own checksumming. The hello priority field is used during the election process for designated and backup designated routers.

start figure

 File:OSPF07.SCP   Type:XYP1820_SCP Mode:••••••• Records:114 ================================================================================ Frame  : 61        Len   : 78        Error  : None T Elapsed: 01:02:26:385   T Delta : 00:00:00:270 ---------------------------------- -------------------------------[mac]----------------------------------------------- Dest Mac : 01005e000005   Sourc Mac: Xyplex1208e5   Type   : IP -------------------------------[ip]------------------------------------------------ IP Ver  : 4        IP HLen : 20 Bytes TOS   : 0x00       Pkt Len : 64        Seg ID  : 0x003a Flags  : FRAG:X.LAST   Frag Ptr : 0   (8 Octet) TTL   : 1 (Low) PID   : OSPF ( 89)    Checksum : 0xf4f1 (Good) Dest IP :    Source IP: -------------------------------[ospf]--------------------------------------------- OSPF Ver : 2        Type   : HELLO      Length  : 44 Router ID:  Area  ID:     Checksum : 0xe963 AU Type : 00        AU Key  : {    } -------------------------------[hello]--------------------------------------------- Hello Msk:  Hello Int: 10        Hello Pri: 1 Options : 00        Dead Int : 40 DR    :     BDR   : ===============================[data:  0]======================================== 

end figure

Figure 3.13: OSPF hello packet.

Link-State Advertisements (LSAs)

Link-State Advertisements (LSAs) include information about a router's adjacencies and are the fundamental data structure used to distribute routing information. There are four different types of LSAs, as follows:

  • Type 1 Router Links Advertisements (RLAs)—Describe the collective states of a router's links to a specific area. A router sends an RLA for every interface. RLAs are flooded throughout the entire area and no farther.

  • Type 2 Network Links Advertisements (NLAs)—NLAs are generated by Designated Router (DRs) and describe all routers that are attached to a multiaccess network (Ethernet, Token Ring, FDDI, or NBMA). NLAs are flooded throughout the area containing the multiaccess network.

  • Type 3 and Type 4 Summary Links Advertisements (SLAs)—Summarize routes to destinations outside an area but within the AS (this is how network reachability information is disseminated between areas). SLAs are generated by Area Border Routers (ABRs) and injected into the backbone (area 0), where they will be flooded and disseminated to other areas. Only intra-area routes are advertised into the backbone. Both intra-area and interarea routes are advertised into the other areas. ABRs also have the job of propagating the reachability of the ASBR (this is how routers learn how to get to external routes outside the AS). Type 3 LSAs are used when the destination is an IP network. Type 4 LSAs are used when the destination is an AS boundary router.

  • Type 5 AS external links advertisements—Describe a route to a destination that is outside the AS. These networks are injected into OSPF via redistribution, and it is the Autonomous System Border Router's (ASBR's) responsibility to inject these routes into the AS. These are the only type of LSA that is forwarded everywhere in the AS; all others are forwarded only within specific areas.

A new variation of the AS external LSA, called a Type 7 LSA, is defined for use with not so stubby areas, and is described later in the chapter.


The default metric used in OSPF is an arbitrary two-byte integer, and ToS is not supported as standard. ToS is optionally supported through the use of a separate metric for each of the eight combinations created by the three IP ToS bits (delay, throughput, and reliability). For example, if the IP ToS bits specify low delay, low throughput, and high reliability, OSPF calculates routes to all destinations based on this ToS designation. Note that there will be a separate routing table for each ToS combination, so we can see that the overheads of supporting ToS in this manner are quite extravagant.


OSPF uses a number of timers and variables—some configurable, some constant. The main timers in OSPF network design are as follows:

  • HelloInterval—A variable, defaults normally to ten seconds. The hello interval represents the length of time, in seconds, between the hello packets sent by a router on an interface. It must be the same for all routers attached to a common network. The smaller the hello interval, the faster topological changes will be detected but at the expense of increased traffic. For an X.25 PDN network we might choose 30 seconds; for a local area network we might choose ten seconds or less.

  • RouterDeadInterval—A variable, defaults normally to 40 seconds. The dead interval represents the number of seconds a router will wait before assuming that a neighbor is down (i.e., since the last hello packet received). This value again must be the same for all routers attached to a common network.

  • RxmtInterval—A variable, defaults normally to five seconds. The retransmit interval represents the number of seconds between link-state advertisement retransmissions for adjacencies belonging to an interface. It is also used when retransmitting database description and link-state request packets. The retransmit interval should easily exceed the expected round-trip time between neighbors. A typical LAN value is five seconds; it should, therefore, be set larger on low-speed serial lines and virtual links.

  • InfTransDelay—A variable, defaults normally to one second, must be > 0. The interface transit delay represents the estimated time (in seconds) taken to transmit an LSU over an interface. LSAs contained in the LSU must have their age incremented by InfTransDelay before transmission. InfTransDelay should take into account the transmission and propagation delays of the interface.

  • LSRefreshTime—A constant, set to 30 minutes. The LSA aging timer is not configurable. This timer controls how long a self-originated (i.e., not learned) LSA is held in the link-state database before it is considered too old. When the LS age field of an LSA reaches the value LSRefreshTime, a new instance of that LSA is originated. In practice this means that there will be a slow release of all LSAs over time, even though the physical link-states have not changed. This ensures integrity of the database. Note that this behavior differs in dial or demand circuit environments.

  • MaxAge—A constant set to one hour. MaxAge is also not configurable and represents the maximum age that an LSA can attain. If the LS age field of an LSA reaches MaxAge, it is reflooded in an attempt to flush the advertisement from the routing domain. LSAs of MaxAge are not used in the routing table calculation. MaxAge must be greater than LSRefreshTime.

The hello and dead intervals are often tuned together, according to the network topology and network medium speeds. These affect how often keep-alive messages are sent and how long a neighbor should wait before assuming a peer is down. By speeding up these timers you can improve convergence in the case where a peer is not responding. In general you should keep the dead interval as follows:

  • Dead interval >= 4 × hello interval

For detailed information about OSPF timers and variables, the interested reader is referred to RFC 1587.

VLSM support

OSPF routing updates support Variable-Length Subnet Masks (VLSM), since each route distributed by OSPF has a destination and a mask. With variable-length subnet masks, an IP network can be broken into many subnets of various sizes. This provides network administrators with extra network configuration flexibility.

Message authentication

OSPF contains an optional authentication field. If enabled, all routers within an area must agree on the value of the authentication field. The authentication field prevents bad routing exchanges and malicious attempts to destabilize the network. Currently there are three modes of operation. By default, a router uses a null authentication, which means that routing exchanges over a network are not authenticated. There are two additional authentication methods: simple password authentication and message digest authentication.

Protocol stack

OSPF runs directly over IP, as protocol 89 (0x59), and supports only Internet Protocol (IP) routing environments (i.e., it is not multiprotocol). OSPF has its own mechanisms for providing reliable delivery and authentication, which we discuss later. OSPF uses two reserved IP multicast addresses when sending and receiving the updates ( and Implementations may additionally support routing based on the IP Type of Service (ToS) bits in the IP packet header, and OSPF has been adapted for QoS-based routing.


OSPF is a link-state routing protocol, and each router is responsible for monitoring the state of its attached interfaces and the distribution of this status information to neighbors via Link-State Advertisements (LSAs). LSAs are reliably flooded throughout the AS (note that the term flooding here is possibly misleading, since each LSA has a TTL of one). Once OSPF routers have acquired neighbors and synchronized all link-state information, they run the SPF algorithm to calculate a shortest path tree to each node (specifically, the SPF algorithm used is Dijkstra). Each router within an area maintains an identical database describing the autonomous system's topology. Part of this database reflects an individual router's local state (i.e., the router's attached interfaces and neighbors). OSPF contrasts with distance-vector routing protocols RIP and IGRP, where each router distributes its entire routing tables.

In an OSPF network all routers run the exact same SPF algorithm in parallel. From the topological database, each router constructs a tree of shortest paths with itself as root. This shortest path tree gives the destination in the autonomous system. Externally derived routing information appears on the tree as leaves. OSPF optionally calculates separate routes for each Type of Service (ToS). When several equal-cost routes to a destination exist, traffic is load balanced among them. The cost of a route is described by a single, dimensionless 32-bit metric (either set automatically based on interface bandwidth or configured by the network administrator).

Routing hierarchy

OSPF is purely an Interior Gateway Protocol (IGP) but does provide excellent support for hierarchical designs within Autonomous Systems (ASs). OSPF enables an AS to be partitioned into one or more areas, joined together by a backbone. This leads to two different types of OSPF routing, depending on whether the source and destination are in the same or different areas: intra-area routing (source and destination in the same area) and inter-area routing (source and destination in different areas). This leads to the following four classifications of OSPF routers, as illustrated in Figure 3.14.

click to expand
Figure 3.14: Hierarchical OSPF internetwork showing the four basic router classifications.

  • Autonomous System Border Routers (ASBR)—where a router has external interfaces to another AS.

  • Backbone Routers (BR)—where each router has its interfaces connected only to the backbone.

  • Area Border Router (ABR)—where a router interfaces to multiple areas (including the backbone).

  • Internal Router (IR)—where all interfaces are connected purely within an area.

In Figure 3.14, all areas are directly connected to the backbone apart from Area 4. In cases where a direct physical link to the backbone is not possible, a virtual link must be configured to maintain backbone continuity. The figure shows various types of routing information. Routes originated and targeted within an area are called intra-area routes. Routes that originate from other areas are called interarea or summary routes. Routes that originate from external processes (e.g., RIP, BGP, or different OSPF processes) that are injected into OSPF via redistribution are called external routes. Multiple routes to the same destination are preferred in the following order: intra-area, interarea, external.


An area is essentially an arbitrary collection of subnetworks. Areas are used to constrain the flooding of link-state updates within manageable bounds. The scope of the Dijkstra algorithm on a router is limited to changes within an area, and all routers within an area hold an identical copy of the link-state database. There are several types of areas, according to the interfaces they have and the kind of routing information they are configured to carry, as follows:

  • Stub areas are recommended for simple areas that have only one ABR (i.e., a single exit point). Stub areas carry a default route, intra-area routes, and interarea routes. External routes, such as those redistributed from other protocols into OSPF, are not allowed to be flooded into a stub area. Virtual links cannot be configured through a stub area, and stubs cannot contain an ASBR.

  • Totally stubby areas (Cisco proprietary) are sometimes referred to as stub areas without summaries and are recommended for simple configurations where a single router connects an area to the backbone. Totally stubby areas have a single ABR and carry only a default route and intra-area routes. External routes and summaries are blocked.

  • Not So Stubby Areas (NSSAs) are stub areas that use default routes, but the restriction on not importing externals is relaxed. NSSAs have the capability of importing AS external routes in a limited fashion. NSSAs may be connected to the backbone at more than one ABR but may not be used as a transit area. NSSAs are designed to be used with moderately complex leaf sites.

  • Nonstub areas—that is, plain old areas, contain an ASBR. An area must also be non-stub if a virtual link is configured across it. Nonstub areas are the most resource-intensive type of area. Nonstub areas carry a default route, static routes, intra-area routes, interarea routes, and external routes.

The topology of an area is transparent to other areas and the rest of the AS. By keeping area topologies separate, OSPF passes less routing traffic than it would if the AS were not partitioned. The main advantages of area partitioning are as follows:

  • Faster convergence—Each area runs its own instance of Dijkstra, and so the number of nodes in the SPF calculation is limited to the area. This improves convergence and reduces the demand for CPU and memory on routers.

  • Better stability—Mission-critical areas of the network can be isolated from areas of potential instability. This partitioning also limits the scope of bad routing data. For example, a flaky wide area network need not bring down the whole server farm every time a link goes down.

  • Overall reduction in routing traffic by constraining information on a need-to-know basis.

Routers can be attached to multiple areas concurrently; these routers are called Area Border Routers (ABRs). An ABR maintains a separate topological database for each area and has the responsibility of disseminating routing information or routing changes between areas. The topological database contains the summary of all LSAs received from all routers in the same area.


When designing OSPF networks, if multiple areas are configured then one of these areas must be defined as area 0, the so-called backbone. The backbone operates as an area itself, so BRs maintain area routing information just as any area router would. It is generally good practice to start with area 0 before defining other areas. The backbone comprises all ABRs, networks not wholly contained in any area, and their attached routers. Ideally, all areas should be physically connected to the backbone, though in practise this may not be possible and virtual links may be required. The reason for this is that OSPF expects all areas to inject routing information into the backbone. The backbone has the special responsibility of disseminating that information into other areas. Note that the backbone in this context is an abstract routing entity and should not be confused with physical backbone networks such as an ATM or Frame Relay core. An OSPF backbone could, for example, be configured inside a router or over a single LAN segment.

Figure 3.14 shows an example of an internetwork with several areas. In this figure, routers 4, 5, 6, 7, 10, and 11 make up the backbone. If host H1 in area 2 needs to send a packet to host H2 in area 3, the packet is first sent to router 9, which forwards the packet to Area Border Router (ABR) 7. ABR 7 forwards the packet across the backbone to ABR 10, which forwards the packet through ABR 11 onto intra-area router 11 and finally host H2. The backbone topology is transparent to all intra-area routers, and individual area topologies are transparent to the backbone.

Virtual links

Virtual links are essentially unnumbered point-to-point links. Virtual links should be configured between any backbone routers that share a link to a nonbackbone area, as follows:

  • Scenario A—To logically connect an area to the backbone that has no direct physical connection.

  • Scenario B—To patch the backbone where a discontinuity occurs.

Routing information

There are four potential types of routing information in an area, as follows:

  • Default—If an explicit route cannot be found for a given IP network or subnetwork, the router will forward the packet to the destination specified in the default route.

  • Intra-area routes—Explicit network or subnet routes must be carried for all networks or subnets inside an area.

  • Interarea route—Areas may carry explicit network or subnet routes for networks or subnets that are in this AS but not in this area.

  • External routes—When different ASs exchange routing information, the routes they exchange are referred to as external routes. In general, it is desirable to restrict routing information in any area to the minimum required.

ASBRs running OSPF learn about external routes through dynamic exterior gateway protocols, such as BGP-4 or static routes. These externally derived routing data are passed transparently throughout the AS and kept separate from the OSPF link-state data. Each external route can also be tagged by the advertising router, enabling the passing of additional information between routers on the boundaries of the AS.

Neighbor acquisition

When an OSPF router is powered up, it initializes its data structures, checks for functional lower-level interfaces, and then begins to solicit potential neighbors by transmitting regular hello messages (advertising its presence on a subnet using the destination multicast address Any potential neighbors that receive these hello messages will attempt to peer with the transmitter. For two routers to become neighbors they must agree on the following: AreaID, authentication, hello/dead intervals, and stub area flag. If all of these are consistent, then the routers will attempt to establish an adjacency (where link-state databases are synchronized). If there is a Designated Router (DR) on the subnet, then only it will establish adjacencies to minimize overheads.


In order to establish an adjacency, each router will go through several well-defined states (usually you can view this in the neighbor events on a router). These states include the following:

  • Down—No hellos seen yet.

  • Attempt—On NBMA networks such as Frame Relay, this indicates that no recent information has been received from the neighbor. Efforts should be made to contact the neighbor by sending hello packets at the reduced rate.

  • Init—The interface has detected a hello packet coming from a neighbor; however, bidirectional communication has not yet been established.

  • Two-way—There is bidirectional communication with a neighbor. The router has seen its own address in the hello packets originating from its neighbor. The routers decide whether to carry on building an adjacency or not, based on whether one of the routers is a DR or a BDR, or the link is a point-to-point or a virtual link.

  • Exchange start—Routers begin to establish the initial sequence number used in the exchange. One router will become the primary and the other will become secondary. The primary router will poll the secondary for information.

  • Exchange—Routers describe their entire link-state database by sending database description packets. At this stage packets could be flooded to other interfaces on the router.

  • Loading—Routers finalize the database exchange. Routers have built a link-state request list and a link-state retransmission list. Any information that looks incomplete or outdated will be put on the request list. Any update that is sent will be put on the retransmission list until it gets acknowledged.

  • Full—The database synchronization process is complete and the neighbors are said to be fully adjacent.

Figure 3.15 shows a typical trace sequence from a real network where two routers become fully adjacent by announcing themselves to each other and exchanging link-state data. The first stage involves both routers describing (using database descriptions) the state of their own topology databases to each other; this establishes which entries in the database the two routers have in common and which entries they do not have in common. Armed with this knowledge each router will then request all unknown entries from its peer (using link-state requests) until both databases are synchronized. Updates are sent using link-state updates and are explicitly acknowledged using link-state acknowledgments. Once a router becomes adjacent (i.e., synchronized) it will use its own copy of the topological database to calculate a shortest path tree, with itself as root. The shortest path tree is used as the basis for the forwarding table.

start figure

 File:OSPF07.SCP   Type:XYP1820_SCP Mode:••••••• Records:114  A:[••00000] ================================================================================ FRAME HH:MM:SS:ms  LEN S PACKET CONTENT -------------------------------------------------------------------------------  59 00 00:00:069  78  1208e5->000005 IP OSPF Hello  61 00:00:00:300  78  1208e5->000005 IP OSPF Hello  62 00:00:00:270  82  1208e6->000005 IP OSPF Hello  67 00:00:00:458  66  1208e5->1208e6 IP OSPF DataDes  68 00:00:00:002  66  1208e6->1208e5 IP OSPF DataDes  69 00:00:00:002 346  1208e6->1208e5 IP OSPF DataDes  70 00:00:00:002 346  1208e5->1208e6 IP OSPF DataDes  71 00:00:00:001 134  1208e6->000005 IP OSPF LSUpd  72 00:00:00:003  66  1208e6->1208e5 IP OSPF DataDes  73 00:00:00:001  70  1208e6->1208e5 IP OSPF LSReq  74: 00:00:00:002 134  1208e5->1208e6 IP OSPF LSUpd  75 00:00:00:085  78  1208e5->000005 IP OSPF LSAck  77 00:00:00:040  78  1208e6->000005 IP OSPF LSAck  78 00:00:00:331 134  1208e5->000005 IP OSPF LSUpd  79 00:00:00:035  82  1208e5->000005 IP OSPF Hello  80 00:00:00:010 134  1208e6->000005 IP OSPF LSUpd  81 00:00:00:127 134  1208e6->000005 IP OSPF LSUpd  82 00:00:00:047  78  1208e5->000005 IP OSPF LSAck  83 00:00:00:071  82  1208e6->000005 IP OSPF Hello  85 00:00:00:499 134  1208e6->1208e5 IP OSPF LSUpd  86 00:00:00:001  78  1208e5->1208e6 IP OSPF LSAck  87 00:00:00:229  82  1208e5->000005 IP OSPF Hello  88 00:00:00:270  82  1208e6->000005 IP OSPF Hello =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- 

end figure

Figure 3.15: Trace of OSPF neighbors forming an adjacency.

Adjacencies are a fundamental part of OSPF, since they determine the distribution of routing protocol packets. These packets are sent and received only between adjacent routers. Once adjacency is established, the routers maintain and monitor the relationship through the use of regular hello messages, which act as keep alives. We can see in packets 87 and 88 that both routers have reached a position where their respective databases are fully up-to-date. Note that even when routers are fully synchronized, each router periodically sends Link-State Advertisements (LSAs) due to an LSA aging process. LSAs are also sent immediately if any interfaces on the router change state. By comparing established adjacencies to link states, failed routers can be quickly detected and the network's topology altered appropriately.

Adjacencies on NonBroadcast MultiAccess (NBMA)

OSPF treats NonBroadcast MultiAccess (NBMA) networks like a regular broadcast media (e.g., Ethernet). However, NBMA networks are typically partial mesh topologies and so cannot provide the multiaccess features that OSPF DR operations require. The DR and BDR, therefore, need to have a static list of all other routers attached to the cloud; this list would normally be configured by the network administrator. There are methods to avoid configuring static neighbors and having specific routers becoming DRs or BDRs on the nonbroadcast cloud. These include the following:

  • Point-to-point subinterface—As far as OSPF is concerned, an adjacency is always formed over a point-to-point subinterface with no DR or BDR election. We could, for example, split a serial interface into two point-to-point subinterfaces: S0.1 and S0.2. The only drawback for the point to point is that each segment will belong to a different subnet. Another workaround is to use IP unnumbered interfaces on the cloud. This also might be a problem for some administrators who manage the WAN based on IP addresses of the serial lines.

  • Point-to-multipoint interfaces is defined as a numbered point-to-point interface having one or more neighbors. This should work well for people who are migrating into the point-to-point concept with no change in IP addressing on the cloud. Also, they would not have to worry about DRs and neighbor statements. The only drawback for point to multipoint is that it generates multiple host routes (routes with mask for all the neighbors.

  • Broadcast interfaces—This approach is a workaround for statically listing all existing neighbors. The interface will be logically set to broadcast and will behave as if the router were connected to a LAN. DR and BDR election will still be performed, so special care should be taken to assure either a full-mesh topology or a static selection of the DR based on the interface priority.

Designated router

The OSPF concept of a Designated Router (DR) has a very specific meaning and is often misunderstood. There is a common misconception among practicing engineers that DRs are directly associated with areas and that you can have only one DR per area. This is quite incorrect. DRs are used to promote scalability on multiaccess networks (typically LANs, such as Ethernet).

The primary purpose of the DR and BDR is to minimize the number of adjacencies on multiaccess media, so that both bandwidth and router resources are optimized (see Figure 3.16). DRs facilitate a significant reduction in network traffic and in the size of the topological database. Consider the situation where we have five OSPF routers on an Ethernet. Without the concept of the DR (Figure 3.16[a]) each router must establish a full-mesh adjacency with all other routers, which in effect means 4 + 3 + 2 + 1 = 10 sessions. Clearly, if one of these routers is nominated to be the relay point for all of this link-state information (Figure 3.16[b]), then the number of sessions drops to four, so we can see that this approach will scale much better. The BDR is simply implemented for backup, in case the DR dies—the idea being that the BDR also shadows the DR's adjacencies and cuts over quickly when the DR stops working, without all routers having to go through a new election process and resynchronize all their databases.

click to expand
Figure 3.16: (a) OSPF full-mesh peering required without designated router (DR) support (adjacencies scale exponentially and so clearly cannot scale). (b) Peering required when router 1 acts as a DR. Note that normally a Backup Designated Router (BDR) would also be used for resilience, so we scale adjacencies linearly as (n ×2) — 2.

The election of the DR is established via the hello protocol, where the router with the highest OSPF priority on a segment is elected DR for that segment. The same process is then repeated for the BDR. In the event of a tie, the router with the highest router ID (RID) wins. The default for the interface OSPF priority is one. A priority value of zero is typically used to indicate that an interface should never be elected (typically shown on an OSPF interface as the DROTHER state).


One of the most attractive features about OSPF is the ability to quickly adapt to topology changes while using a minimum of routing traffic. There are two components to routing convergence, as follows:

  • Detection of topology changes—OSPF uses two mechanisms to detect topology changes. Interface status changes, and failure of OSPF to receive a hello packet from its neighbor within a timing window is called a dead interval.

  • Recalculation of routes—Once a failure has been detected, the router that detected the failure sends a link-state packet with the change information to all routers in the area. All the routers recalculate all of their routes using the Dijkstra (or SPF) algorithm. The time required to run the algorithm depends on a combination of the size of the area and the number of routes in the database.


The ability to scale an OSPF internetwork will be influenced by several factors, including the following:

  • Future-proof network addressing

  • Imposing hierarchy and enabling route summarization

  • Ensuring that routing resources and bandwidth are adequately provisioned

    We now discuss each topic briefly.

Addressing model OSPF networks should be designed so that areas do not need to be split to accommodate growth. Address space should be reserved to permit the addition of new areas. We covered addressing models and the techniques involved in Chapter 2.

Routing resources and bandwidth provisioning Routing devices should be factored into network growth projections. Scaling is determined by the utilization of memory, CPU, and bandwidth, as follows:

  • Memory—An OSPF router stores all link-states for all of the areas that it is configured for. It can also store summaries and externals. Judicious use of route summarization and stub areas can reduce memory overheads substantially.

  • CPU—An OSPF router uses CPU cycles whenever a link-state change occurs. Keeping areas small and using route summarization dramatically reduces CPU overheads.

  • Bandwidth—OSPF sends partial updates when a link-state change occurs. The updates are flooded to all routers in the area. In a quiet network, OSPF is a quiet protocol. In a network with many probable topology changes you should try to isolate the main culprits from critical resources.

It is possible on some routers to run multiple OSPF processes on the same router (say, to handle different areas discretely); however, it is not recommended since it creates multiple database instances that add extra overhead to the router.

Hierarchy and route summarization As we saw in Chapter 2, hierarchical routing is a key technique used to build very large networks. As networks grow, their use of finite routing resources, such as memory, CPU, and bandwidth, grow; a more efficient design will reduce the overall resource deterioration. In flat network designs the OSPF routing table grows linearly, as the number of IP segments increases (at a rate of one for one). Each OSPF router in a flat network is aware of all network segments. By implementing hierarchical routing we can slow the growth of the routing table sizes to a logarithmic rate, O(log[n]), where n is the number of segments. It is important to note that designing networks using hierarchical routing can prove difficult, and many of the bugs found in OSPF over the years (including some only recently) have been related to this mode of routing support. In general, however, the benefits for large networks far outweigh any disadvantages. As Figure 3.17 illustrates, hierarchical routing is clearly beneficial as networks become very large.

click to expand
Figure 3.17: Linear versus logarithmic growth.

Choosing an exterior gateway protocol

Exterior Gateway Protocols (EGPs) are not really routing protocols; it would be more accurate to call them reachability protocols. The choice of exterior gateway protocols is also somewhat easier than choosing an IGP; essentially it boils down to the following options:

  • Static routes have been used for many years in the AS-AS environment, primarily because they are easy to deploy, require no routing protocol, and by their very nature, enforce policy (i.e., no route equals no access). On the downside, static routes do not promote scalability, can be difficult to maintain in a changing environment, and do not respond to topology changes (as mentioned previously, some vendors associate preferences with default routes to enable a limited form of dynamic path selection).

  • EGP was the first real protocol to be available in the AS-AS domain and is described in [5, 6]. Unfortunately, it suffers from several design limitations that make it unsuitable for a scalable and robust internetwork design (it requires a single Spanning Tree for a start). EGP has been subsequently declared historical by the IETF and is now obsolete.

  • BGP (BGP-4 is the latest incarnation) represents a new generation of AS-AS protocol, and overcomes many of the design limitations of EGP, while also supporting Classless InterDomain Routing (CIDR). BGP-4 is the favored protocol of ISPs and carriers, in combination with static routes.

3.4.3 Border Gateway Routing Protocol (BGP)


Exterior Gateway Protocols (EGPs) are designed to route between groups of routing domains called Autonomous Systems (ASs). The simplest form of exterior gateway protocol is the static route. Static routes are explicit in enforcing policy and are simple to configure and understand. The first dynamic exterior gateway protocol to achieve widespread acceptance was the Exterior Gateway Protocol (EGP). EGP, unfortunately, had several major weaknesses, including its inability to detect routing loops and support multiple paths, which meant it was somewhat limiting for robust internetworking applications such as the Internet. EGP has subsequently been displaced by the Border Gateway Protocol (BGP), another inter-AS routing protocol. Today, BGP is widely deployed over Internet Service Provider (ISP) networks. It addresses the most serious EGP problems, runs over the reliable TCP transport protocol, and includes many other useful enhancements. The latest version of BGP is version 4 (BGP-4) and is defined in RFC 1771. In effect, EGP and all previous versions of BGP are now obsolete. This section, therefore, discusses only the capabilities of BGP-4.

Protocol stack

BGP runs directly over TCP on port 179 (0×b3). The TCP connection is essential in order for the two peer routers to start exchanging routing updates reliably. This contrasts with EGP, which runs directly over IP (port 8) and is essentially a best-effort service. Note that BGP supports only IP routing environments; it is not multiprotocol.

Packet formats

BGP fields
  • Marker (16 bytes)—Contains a value that the receiver of the message can predict. This field is used for authentication.

  • Length (2 bytes)—Contains the total length of the message in bytes.

  • Type (1 byte)—Specifies the message type.

  • Data (variable)—Additional fields and/or data associated with the message type.

The BGP packet format is shown in Figure 3.18. All BGP packets use the common 19-byte header. The smallest BGP message is 19 bytes; the largest BGP message supported is 4,096 bytes. RFC 1163 specifies four message types: open, update, notification, and keep alive. Keep-alive messages require only the 19-byte header; open, update, and notification messages require additional fields. These message types are defined as follows:

Figure 3.18: BGP packet format.

Message types

  • Open—The first message sent by each side after a transport connection is established is an open message. If accepted by the recipient, a keep-alive message is returned as confirmation; thereafter, updates, keep alives, and notifications may be exchanged. In addition to the BGP header, open messages define several fields.

    • Version provides a BGP version number and allows the recipient to check that it is running the same version as the sender.

    • My autonomous system provides the AS number of the sender.

    • Hold time indicates the maximum number of seconds that may elapse without receipt of a message before the transmitter is assumed to be dead.

    • BGP identifier—The sender's ID (typically the router ID, though this is implementation dependent).

    • Optional parameter length—The length of a list of any optional parameters included in the message.

    • Optional data—A list of optional parameters and data in <type, length, value> format. This includes the ability to perform authentication.

  • Update—BGP update messages provide routing updates to other BGP systems. Network Layer Reachability Information (NLKRI) included in these messages is used to construct a graph describing the relationships of the various ASs. In addition to the common BGP header, update messages include several additional fields, including unfeasible routes length, withdrawn routes, and total path attribute length. These fields provide routing information by listing path attributes corresponding to each network. Path attributes are covered later in this section.

  • Notification—Sent when an error condition is detected, and a router needs to inform its peer why it is terminating the connection between them. In addition to the BGP header, notification messages include an error code field, an error subcode field, plus error data. The error code indicates one of six classes of error, some of which have subcodes indicating more specific problems: 1—message header error, 2—open message error, 3—update message error, 4—hold time expired, 5—finite state machine error, and 6—cease. For full details refer to RFC 1771.

  • Keep alive—These messages comprise only the BGP header. Keep alives are exchanged between peers periodically to maintain peer integrity (i.e., by keeping the hold timer from expiring).


Fundamental to the operation of BGP is the concept of routing policy. All routing information passed between BGP routers is normally done under policy control. The network administrator configures routing policy manually, since by default a BGP router is usually configured not to receive or transmit any routing data. BGP routers exchange network reachability information, which basically comprises information on the various Autonomous Systems (AS) that need to be traversed in order to reach other networks. This information is then used to construct a graph of AS connectivity from which routing loops are pruned and with which AS-level policy decisions can be enforced. Each BGP router maintains a routing table that lists all feasible paths to a particular network.

BGP operates as a conventional link-state protocol, in that BGP neighbors initially exchange complete routing information when they peer, after which only incremental updates are used for any detected changes. Although a BGP router maintains a complete routing table, with all feasible paths to a particular network, it advertises only the primary (i.e., optimal) path in its update messages. BGP update messages comprise <network number><AS path> pairs; the AS path contains the string of ASs through which the specified network can be reached. Update messages are sent over TCP to ensure reliable delivery.

BGP peering

A pair of BGP speakers establish a peer or neighbor relationship with other BGP routers by first establishing a reliable transport connection via TCP. Once the connection is up, both routers use open messages to establish a BGP routing session and exchange values such as the AS number, the BGP version they are running (version 3 or 4), the BGP router ID, and the keep-alive hold time. BGP sessions may begin by using BGP version 4 and negotiating downward to earlier versions if necessary (some routers allow this negotiation to be disabled—for example, in a BGPv4-only network). Once these values are confirmed and accepted the neighbor connection is established. Any state other than established means that the two routers could not become neighbors (and so BGP updates cannot be exchanged). Assuming peering is successful, routing information is then passed between neighbors via update messages, with any errors or special conditions flagged by notification messages. In order to maintain sessions BGP peers regularly exchange keep-alive messages. This exchange is illustrated in Figure 3.19. After the TCP session is established both routers open up the BGP session and update each other with routing information. Thereafter, the session is maintained via keep alive.

click to expand
Figure 3.19: Two BGP routers peering.

During session establishment, BGP peers initially exchange their full BGP routing tables. Thereafter, BGP peers send incremental updates only. In Figure 3.20, where multiple BGP routers are enabled inside an AS, internal BGP (iBGP) is run between these routers (see R10 and R9). Between ASs external BGP (eBGP) is run. Peering is normally done via the loopback interfaces of the routers rather than using real interface addresses (for fault tolerance). Note that between R1 and R3 there is a non-BGP router, so eBGP multihop must be configured from R1 to R3's loopback address and vice versa. A combination of static and IGP routing is required to ensure reachability between Rl and R3. The routing information consists of a series of AS numbers that describe the full path to the destination network. BGP uses this information to construct a loop-free map of ASs.

click to expand
Figure 3.20: Conceptual BGP design.

All BGP speakers within an AS must establish a peer relationship with each other before BGP can exchange routing information with other ASs. In effect this means that iBGP routers inside an AS must be fully meshed logically (to improve scalability BGP-4 provides two techniques to alleviate this requirement—confederations and route reflectors—described later in this chapter). This process ensures that networks within the AS are reachable and is achieved by peering among iBGP routers and by redistributing BGP routing information to IGPs (such as OSPF) that run within that AS. A BGP peer group is a group of BGP neighbors that share the same update policies.

Internal and external BGP

Although BGP was designed primarily as an inter-AS protocol, it can be used within ASs and may also operate over ASs that do not support BGP explicitly, as follows:

  • Inter-AS routing is the conventional EGP mode, where two or more BGP routers in different ASs peer with each other directly to maintain a consistent view of the internetwork topology. BGP neighbors communicating between ASs must reside on the same physical network. For example, the Internet uses this type of routing to integrate hundreds of ASs from various universities, corporations, and government bodies. BGP is frequently used for path determination to provide optimal routing within the Internet.

  • Intra-AS routing occurs between two or more BGP routers located within the same AS. BGP peer routers within the same autonomous system use BGP to maintain a consistent view of the system topology. BGP also is used to determine which router will serve as the connection point for specific external autonomous systems. Once again, the Internet provides an example of inter-AS routing. An organization, such as a university, could make use of BGP to provide optimal routing within its own administrative domain or AS. The BGP protocol can provide both inter- and intra-AS routing services.

  • Pass-through (transit) AS routing occurs between two or more BGP peer routers that exchange traffic across an AS that does not run BGP. In a pass-through AS environment, the BGP traffic does not originate within the AS in question and is not destined for a node in the AS. BGP must interact with whatever intra-autonomous system routing protocol is being used to successfully transport BGP traffic through that AS.

Routers that belong to the same AS and exchange BGP updates are said to be running internal BGP (iBGP), and routers that belong to different ASs and exchange BGP updates are said to be running external BGP (eBGP). Note that eBGP neighbors are typically directly connected (e.g., over a point-to-point WAN link); iBGP neighbors do not have to be directly connected as long as there is an IGP (or static route) that enables the two neighbors to reach each another.

Internal BGP (iBGP)

Internal BGP (iBGP) is the form of BGP that exchanges BGP updates within an AS. This might, for example, be used by a large organization to exchange routing information between different domains under its administrative control. Use of iBGP in this respect is not mandatory; the routes learned via eBGP could be redistributed into an IGP within the AS and then redistributed again into another AS. However, iBGP is more flexible, provides more efficient ways of controlling the exchange of information within the AS, and presents a consistent view of the AS to external neighbors. For example, iBGP provides ways to control the exit point from an AS.

An important point to remember is that when a BGP speaker receives an update from other BGP speakers in its own AS (via iBGP), the receiving BGP speaker will not redistribute that information to other BGP speakers in its own AS. The receiving BGP speaker will redistribute that information to other BGP speakers outside of its AS using eBGP. This is why it is necessary for BGP speakers within an AS to be fully meshed.

External BGP (eBGP)

When two BGP speakers in different ASs run BGP to exchange routing information, they are said to be running external BGP (eBGP). Again, it is not mandatory to run BGP between ASs (one could use static or default routes depending upon the topology), but the advantages in a large dynamic environment far outweigh the additional complexity required. A good example of the application of eBGP is in the design of multihomed ASs connected to the Internet, where multiple service providers are used to provide resilience and possible load sharing.

BGP decision algorithm

When a BGP speaker receives updates from multiple ASs that describe different paths to the same destination, it must choose the single best path for reaching that destination. Once chosen, BGP puts the selected path in its routing table and propagates the path to its neighbors. The decision is based on the value of path attributes (such as next hop, administrative weights, local preference, the origin of the route, and path length) that the update contains and other BGP-configurable parameters. BGP uses the following criteria, in the order presented, to select a path for a destination:

  • If the path specifies a next hop that is inaccessible, drop the update.

  • Prefer the path with the largest local preference.

  • If the local preferences are the same, prefer the path that was originated by BGP running on this router.

  • If no route was originated, prefer the route that has the shortest AS path.

  • If all paths have the same AS path length, prefer the path with the lowest origin type (where IGP is lower than GP, and EGP is lower than incomplete).

  • If the origin codes are the same, prefer the path with the lowest MED attribute.

  • If the paths have the same MED, prefer the external path over the internal path.

  • If the paths are still the same, prefer the path through the closest IGP neighbor.

  • Prefer the path with the lowest IP address, as specified by the BGP router ID.

BGP metrics

BGP uses an abstract routing metric (the inter-AS metric) to determine the best path to a given network. This metric is an arbitrary number that specifies the preference for a particular path and is typically assigned to each path by the network administrator. The value assigned can be based on any number of criteria, including AS count (the number of ASs through which the path passes), stability, speed, delay, cost, and other factors.

BGP attributes

BGP attributes comprise a number of parameters used to maintain path information, such as preferences, next hops, and so on. BGP update messages include variable-length sequences of these attributes, each formatted as a <type, length, value> triple. Attributes are classified as follows:

  • Well-known, mandatory—Must be included in an update message and must be recognized by all BGP implementations.

  • Well-known, discretionary—May or may not be included in an update message but must be recognized by all BGP implementations.

  • Optional, transitive—If the transitive flag is set, then this attribute should be accepted and passed along to other BGP speakers.

  • Optional, nontransitive—If the transitive flag is not set, then this attribute should be ignored and not passed along to other BGP speakers.

The attribute type field is a two-byte field that comprises a one-byte flag and one-byte type code. The most significant bit of the flag field indicates whether the parameter is well known (0) or optional (1). The next bit indicates whether the parameter is nontransitive (0) or transitive (1). For full details of these attributes refer to RFC 1771.

Scalability and large-scale designs

For improved scalability you should make use of attributes (especially community). In large BGP networks also make use of peer groups and route reflectors. For stability use loopback addresses for iBGP, generate aggregates, apply passwords, and always filter inbound and outbound.

Because of the increasing size of Internet BGP routing tables there is work underway to research the possibility of imposing additional hierarchy on backbones, enabling the routing database to be partially distributed with a few core super routers holding a complete database.

Route filtering and policy

Route filtering enables a BGP speaker to control what routes to send and receive from any of its peers. Route filtering is, therefore, key to defining routing policy at the AS level. Route filtering is also used to control route importing and exporting from other protocols (such as OSPF or RIP). Note that some implementations of BGP use the term route maps when referring to this capability.

Filtering may be inbound or outbound and available actions include permit or deny. Routes may typically be identified by criteria such as the IP-prefix, source ASN, intermediate ASNs, or specific attributes (the usual method is by AS-path list or Network Layer Reachability Information [NLRI]). Routes permitted through a filter may either be accepted as is or have their attributes modified. Routes that are denied are simply discarded. Attribute manipulation in route filtering is key to establishing successful routing policies and also to enable load balancing and routing symmetry. This requires careful planning and attention to detail.

CIDR and aggregate addresses

BGP-4 supports supernetting (also known as Classless InterDomain Routing [CIDR]), which is a major improvement over BGP-3 and BGP-2. CIDR enables BGP to aggregate routes, combining several distinct routes within a single route advertisement. This minimizes the size of routing tables and reduces the bandwidth overhead, particularly in backbone routers using a single route entry (an aggregate). Note that a BGP router cannot aggregate an address if it does not have a more specific route (i.e., a route with a longer prefix) in the BGP routing table.

Multihoming introduces some interesting problems for applying aggregation rules. In the example in Figure 3.21 we see a customer network, comprising sites in New York and London, multihomed to a single service provider at these two locations. The customer has a class B network, which has been subnetted using a 24-bit prefix. Clearly, it would at first glance appear pointless to advertise all of these routes to AS 200, since we can simply summarize them using a aggregate route. We could also change attributes for the two aggregates that are advertised, if required. For example, we could set the MED differently for each of the two links.

click to expand
Figure 3.21: BGP aggregation.

Default routes

The default route is by convention expressed as network and associated with a next-hop router address, router interface, or network number. In BGP default routes have the same use as any other routing protocol; they indicate the last resort route when no other routing information is available or applicable. There may be a single default route or multiple default routes pointing to different next hops with different preference levels (i.e., a primary default route with one or more backups in case it fails).

Default routes are clearly useful in single-homed situations, where it makes little sense to advertise large numbers of external routes when there is only one way out of the network; however, they may also be used in multi-homed scenarios. Figure 3.22 illustrates a customer network (AS 100), which is multihomed to the same provider (AS 200). Note that if AS 100 used two routers to give two parallel links, then the only additional requirement would be to run iBGP between the two customer routers.

click to expand
Figure 3.22: BGP default routes.

Static configuration is often preferred because it gives the designer more control and avoids possible routing loops (service providers also tend to filter out dynamically learned default routes to avoid any possible issues).

Backdoor routes

Cisco promotes a concept called backdoor routes. This enables IGP routes to take precedence over eBGP routes where a more direct path is available. This concept relies on eBGP routes being tagged as backdoor routes with an administrative distance of 200 (same as BGP local in Cisco's implementation). In this way learned IGP route will take precedence, since it will have a lower distance. For information on how Cisco applies protocol distances refer to [4].

Multihoming, load balancing, and symmetry

Figure 3.23 illustrates some of the main topology options for BGP when connecting to service providers. Multihoming is an important concern in BGP designs because of the need for resilient Internet access (especially if the business relies on the Internet for e-business functions), the cost of wide area links, and the relative clumsiness of BGP control features used to control routing preferences.

click to expand
Figure 3.23: BGP homing topology options. (a) Single homing. (b) Multihoming, single provider. (c) Multihoming, multiple providers. (d) Multiple customers sharing a private iBGP link to a single multihomed provider. (e) Multiple customers sharing a private iBGP link to multiple multihomed providers.

When a customer has multiple paths to the Internet, as shown in Figure 3.23(a) through (e), it is often desirable to apply some form of load balancing on these connections to maximize the available bandwidth. It is also desirable to have control on how traffic flows exit and enter the network. For performance reasons it is generally desirable to have traffic exit and enter the network at the same interface and at an interface closest to the originators. With BGP, administrative control of routing choices is devolved, and there are a number of ways to influence path cost that need to be consistently configured between the provider and a customer network. Without careful design, there is a real possibility of session asymmetry in a multihomed environment. There are several ways to influence path selection with BGP, including the following:

  • AS path attribute

  • Local preference attribute

  • MultiExitDiscriminator (MED) attribute

  • Default, partial, and full routing

A key point to be aware of here is the possibility of routing loops, where routing information conflicts, and this is a real possibility where a combination of IGP metrics and default routes are used. Another point to appreciate is that although the path taken by inbound traffic depends on how a customer advertises paths externally, this information may be ignored by the routing policies of other ASs. The only guaranteed way to ensure session symmetry in a multihomed environment is to clearly nominate one link as primary for both incoming and outgoing traffic.

Choosing a router discovery protocol

Most routers today support most, if not all, of the router discovery protocols listed in section 3.3.7. You should, therefore, choose the most appropriate mechanism for your network environment. As a general rule, the preferred solution is the one that imposes the least demand on network and system load, while maintaining some level of automation and dynamic discovery attributes. Using those prerequisites IDRP represents the best choice available today, though it may not be available on all hosts. On legacy host equipment you may be limited to static gateway configuration or passive RIP.

If faced with a large number of end systems with only static gateway configuration available, and it is critical that service is uninterrupted, then all is not lost. An increasingly attractive alternative is to run a protocol such as the Virtual Redundant Router Protocol (VRRP) between multiple routers.

ICMP Router Discovery Protocol (IRDP)

Reference [7] describes a dynamic router discovery mechanism called ICMP Router Discovery Protocol (IRDP), which relies on extensions to ICMP and multicast interfaces. IRDP also scales to more than two routers. IRDP is not a routing protocol; although it facilitates the discovery of nearby routers it does not enable hosts to discover optimal routes to a destination. If a host chooses a poor first-hop router for a particular destination, it should receive an ICMP redirect from that router, indicating a better first hop. IRDP does, however, eliminate the need for manual configuration of router addresses and is independent of any specific routing protocol. IDRP operations require both the end system and router to support the enhanced version of ICMP. Reference [7] specifies two new ICMP router discovery messages: router advertisements and router solicitations.


Each router periodically multicasts a router advertisement (default address, ICMP type 9, code 0) from each of its multicast interfaces, announcing the IP address(es) of that interface. If the router does not support IP multicast, then advertisements are sent using a limited broadcast ( Hosts discover the addresses of their neighboring routers simply by listening for advertisements. When a host attached to a multicast link starts up, it may multicast a router solicitation (default address, ICMP type 10, code 0) to request immediate advertisements, rather than waiting for the next update. If no advertisements are received, the host may retransmit the solicitation a few times and is then required to back off. Any routers that subsequently start up, or that were not discovered because of packet loss or temporary link partitioning, are eventually discovered through their periodic advertisements. Links that are susceptible to high packet loss or frequent partitioning can be accommodated by increasing the advertisements rate. For each multicast interface on a router the following timers are available:

  • MaxAdvertisementInterval—Maximum time allowed between sending router advertisements from the interface, in seconds. Must be not less than four seconds and not greater than 1,800 seconds. Default: 600 seconds.

  • MinAdvertisementInterval—The minimum time allowed between sending unsolicited router advertisements from the interface, in seconds. Must be not less than three seconds and not greater than MaxAdvertisementInterval. Default: 0.75 × MaxAdvertisementInterval.

  • AdvertisementLifetime—The value to be placed in the lifetime field of router advertisements sent from the interface, in seconds. Must not be less than MaxAdvertisementInterval and not greater than 9,000 seconds. Default: 3 × MaxAdvertisementInterval.

For each of the router's IP addresses on its multicast interfaces the following features are configurable:

  • Advertise—A flag indicating whether or not the address is to be advertised. Default: True.

  • PreferenceLevel—The preferability of the address as a default router address, relative to other router addresses on the same subnet. A 32-bit, signed, two's-complement integer, with higher values indicating preferability. The minimum value (0x80000000) is used to indicate that the address, even though it may be advertised, is not to be used by neighboring hosts as a default router address. Default: 0.

A router advertisement includes a preference level for each advertised router address. When a host must choose a default router address (i.e., when, for a particular destination, the host has not been redirected or configured to use a specific router address), it is expected to choose from those router addresses that have the highest preference level [8]. A network administrator can configure router address preference levels to encourage or discourage the use of particular routers as default routers. A router advertisement also includes a lifetime field, specifying the maximum length of time that the advertised addresses are to be considered as valid router addresses by hosts, in the absence of further advertisements. This is used to ensure that hosts eventually forget about routers that fail, become unreachable, or stop acting as routers. A network administrator, who wishes to employ advertisements as a supplemental black-hole detection mechanism, is free to configure smaller values. For further information the interested reader is referred to [9].

Data Networks. Routing, Seurity, and Performance Optimization
ActionScripting in Flash MX
EAN: 2147483647
Year: 2001
Pages: 117

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net