The operational functions of BGP are defined in RFC 1771. An implementation must comply with these definitions to be successful. Failure to do so could raise serious concerns about the integrity of the information provided by the protocol and potentially create a chaotic state of routing.

It is necessary to understand that BGP operates as a finite state machine (FSM). This means that at all times of operation the process is in a defined state. The process cannot move on to the next state or perform other functions without first meeting a predetermined set of criteria. Once these criteria are met, the process proceeds to another predefined state, where it must again meet certain criteria before proceeding, and so on. This also implies the ability to handle error conditions.

9.2.1 Transport

BGP uses TCP (port 179) as the underlying transport protocol. TCP provides the reliable transport function, error correction, and retransmission of the higher-level data, if necessary. TCP relies on IP as the network layer protocol. So, if there is IP connectivity to a destination, the TCP session should remain active. This will become relevant during our discussion on peering and how logical peering works.

9.2.2 Events

In BGP there are 13 different events that cause the FSM to change states. Because an FSM is based on a calculated response to a predetermined number of possible events, the state change is predictable. However, the resulting transition depends not only on the event, but also on the current state of the FSM. The following events allow the FSM state to change:
The next section will discuss the various connection states and how the events listed above determine state changes.

9.2.3 Connection States

There are six different states in the FSM, as documented in RFC 1771. Each state represents where the BGP process is in terms of operation and also dictates which events can cause the state to change.

9.2.3.1 Idle

The initial state of the BGP process is Idle. In this state, the local system is essentially waiting for some start event to get the process moving. The most common start event is nothing more than the network administrator configuring the local system to peer with a remote system. Depending on which configuration parameters are used, the local system will either wait for an incoming TCP connection on port 179 and subsequent OPEN messages, or it will attempt to establish a TCP connection and begin sending OPEN messages.

In JUNOS, if the passive parameter is used, the local system will listen for an incoming TCP connection on TCP port 179, then listen for incoming OPEN messages. If the passive parameter is not used, the local system will start a connect-retry timer and attempt to establish a TCP connection on port 179 to the remote system. BGP is typically configured on one router and then on the other, so the first one is usually trying to establish a connection before the remote system has been configured. Regardless, if both systems attempt to create a peering session at the same time, the potential exists for two peering sessions between the same neighbors. RFC 1771 outlines a mechanism called connection-collision detection, which prevents multiple sessions between the same neighbors from being established. Once the TCP connection attempt is started, the FSM will transition to the Connect state.

9.2.3.2 Connect

In this state, a TCP connection has been initiated on port 179 and the local system is waiting for it to complete. This state is key to the overall operation of BGP.
If the local system cannot transition out of the Connect state, there is a potential problem with establishing the TCP connection. This can be a useful tip in troubleshooting BGP peering problems. When the TCP connection is established, the local system will reset the connect-retry timer it started earlier and will send an OPEN message to the remote system. When the local system sends the first OPEN message, the FSM transitions to the OpenSent state.

9.2.3.3 Active

If the router is unable to create the TCP connection, the FSM will transition to the Active state. In this state, the local system will still try to create the TCP connection. If it is able to complete the connection, the local system will send out the OPEN message to the potential peer, and the FSM will transition to the OpenSent state. In the Active state, the local system is also capable of receiving a connection attempt from a potential neighbor.

9.2.3.4 OpenSent

When the local system sends out its OPEN message, the FSM will transition to the OpenSent state. In this state, the local system listens for an OPEN message from the remote system. When the local system receives an OPEN message from the remote system, it sends out a KEEPALIVE message to the remote system, and the FSM transitions to the OpenConfirm state. Once the OPEN message is received from the remote system, the local system is able to determine certain characteristics of the session. If the remote ASN differs from the local system's ASN, the session will be external; if it is the same, the session will be internal.

9.2.3.5 OpenConfirm

When the FSM in the local system reaches this state, it waits for a KEEPALIVE message from the remote system. When the local system receives the KEEPALIVE message, the FSM will change to the Established state. If the hold timer expires here, the local system will send a NOTIFICATION message to the remote system, and the FSM will change to the Idle state.
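The transitions described in these sections can be sketched as a small lookup table. This is a minimal illustration only: the event names are hypothetical labels chosen for readability (RFC 1771 numbers its events), only the transitions mentioned in the text are listed, and sending every unlisted state/event pair back to Idle is a simplifying assumption rather than the full RFC behavior.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    CONNECT = auto()
    ACTIVE = auto()
    OPEN_SENT = auto()
    OPEN_CONFIRM = auto()
    ESTABLISHED = auto()


# (current state, event) -> next state. Event names are illustrative assumptions.
TRANSITIONS = {
    (State.IDLE, "start"): State.CONNECT,
    (State.CONNECT, "tcp_established"): State.OPEN_SENT,   # OPEN is sent here
    (State.CONNECT, "tcp_failed"): State.ACTIVE,
    (State.ACTIVE, "tcp_established"): State.OPEN_SENT,    # OPEN is sent here
    (State.OPEN_SENT, "open_received"): State.OPEN_CONFIRM,
    (State.OPEN_CONFIRM, "keepalive_received"): State.ESTABLISHED,
    (State.OPEN_CONFIRM, "hold_timer_expired"): State.IDLE,
}


def next_state(state, event):
    """Return the next state; unlisted pairs fall back to Idle (simplification)."""
    return TRANSITIONS.get((state, event), State.IDLE)


def run(events, state=State.IDLE):
    """Feed a sequence of events through the table and return the final state."""
    for event in events:
        state = next_state(state, event)
    return state
```

Because the next state depends on both the current state and the event, the same event (for example `tcp_established`) is listed once for each state in which it is meaningful.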
9.2.3.6 Established

When the FSM for the peering session transitions to the Established state, the peers are ready to begin sending UPDATE messages to exchange reachability information. When a BGP session is truly UP and the FSM is in the Established state, the following can be assumed:
9.2.4 Message Types and Formats

The importance of messages was shown in the previous section on FSM events and states. If you refer to the list of events, you will see that the last four events involve the OPEN, UPDATE, NOTIFICATION, and KEEPALIVE messages. Each message has a distinct purpose and provides details necessary for establishing, maintaining, and discontinuing BGP peer sessions between local and remote systems.

The maximum message size supported in BGP is 4,096 bytes, and the minimum message size is 19 bytes. The minimum-size message contains only the BGP header without any trailing data and is used as the KEEPALIVE message. Each message header has a fixed length, and not every bit of it has to carry data. There are three parts to a message header (see Figure 9-14):
Figure 9-14. BGP Message Header
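As a rough sketch of how the fixed-length header might be handled in code, the following assumes the RFC 1771 layout of a 16-byte marker, a 2-byte length, and a 1-byte type, with type codes 1 through 4 for OPEN, UPDATE, NOTIFICATION, and KEEPALIVE. The function names are illustrative, and the all-ones marker is the value used when no authentication is in effect.

```python
import struct

HEADER_LEN = 19         # minimum message size: header only
MAX_MESSAGE_LEN = 4096  # maximum message size
MESSAGE_TYPES = {1: "OPEN", 2: "UPDATE", 3: "NOTIFICATION", 4: "KEEPALIVE"}


def parse_header(data: bytes):
    """Split the first 19 bytes into (marker, length, type name)."""
    if len(data) < HEADER_LEN:
        raise ValueError("a BGP header needs at least 19 bytes")
    # "!" selects network byte order with no padding: 16 + 2 + 1 = 19 bytes.
    marker, length, msg_type = struct.unpack("!16sHB", data[:HEADER_LEN])
    if not HEADER_LEN <= length <= MAX_MESSAGE_LEN:
        raise ValueError(f"message length {length} is out of range")
    if msg_type not in MESSAGE_TYPES:
        raise ValueError(f"unknown message type {msg_type}")
    return marker, length, MESSAGE_TYPES[msg_type]


def build_keepalive() -> bytes:
    """A KEEPALIVE is the bare header: all-ones marker, length 19, type 4."""
    return struct.pack("!16sHB", b"\xff" * 16, HEADER_LEN, 4)
```

The length check mirrors the 19-byte minimum and 4,096-byte maximum given above; a real implementation would respond to a violation with a NOTIFICATION rather than a Python exception.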
9.2.4.1 OPEN Message

The OPEN message is sent by the local system once the TCP connection between the two potential peers has been created. Figure 9-15 illustrates the following OPEN message fields:
Figure 9-15. OPEN Message Format
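A hedged sketch of decoding the OPEN message body follows. It assumes the RFC 1771 field order (version, My Autonomous System, Hold Time, BGP Identifier, optional parameters length, optional parameters) and operates on the bytes that follow the 19-byte header. The function names and the returned dictionary keys are illustrative choices, not part of any real library.

```python
import ipaddress
import struct

OPEN_FIXED_LEN = 10  # 1 + 2 + 2 + 4 + 1 bytes before the optional parameters


def parse_open_body(body: bytes) -> dict:
    """Decode the fixed OPEN fields; optional parameters are returned raw."""
    if len(body) < OPEN_FIXED_LEN:
        raise ValueError("OPEN body is too short")
    version, my_as, hold_time, bgp_id, opt_len = struct.unpack(
        "!BHH4sB", body[:OPEN_FIXED_LEN]
    )
    if len(body) < OPEN_FIXED_LEN + opt_len:
        raise ValueError("optional parameters are truncated")
    return {
        "version": version,
        "my_as": my_as,
        "hold_time": hold_time,
        "bgp_id": str(ipaddress.IPv4Address(bgp_id)),
        "opt_params": body[OPEN_FIXED_LEN:OPEN_FIXED_LEN + opt_len],
    }


def session_type(local_as: int, peer_as: int) -> str:
    """Same ASN means an internal session; a different ASN means external."""
    return "internal" if local_as == peer_as else "external"
```

The `session_type` helper reflects the point made in the OpenSent discussion: the ASN carried in the received OPEN message is what tells the local system whether the peering is internal or external.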
9.2.4.2 UPDATE Message

BGP uses UPDATE messages to exchange all routing information related to BGP. A single message can contain both NLRI with attributes and a list of withdrawn routes. If the attributes are equal, multiple prefixes can be sent in a single message. This functionality provides an efficient method of information exchange by combining multiple functions into a single message. The UPDATE message can be thought of as having three distinct categories:
Unlike some IGPs, where route information is used to construct a topology (usually with the local system as the root of that tree), BGP uses a list of the ASs that the NLRI has passed through. This list is what allows BGP to build a loop-free path to the destination prefix. Figure 9-17 illustrates the format of the UPDATE message. The BGP UPDATE message fields are as follows:
Figure 9-17. UPDATE Message
The above example refers to 10.5.0.0/16. There are two important points regarding the withdrawn routes field. First, the prefix length and value can be set to 0, which would withdraw all routes learned from the advertising neighbor. Second, regardless of the length of the actual prefix, RFC 1771 calls for trailing bits to be added so that the prefix field ends on the next byte boundary. The value of these trailing bits is insignificant because they are used only for padding. Figure 9-19 illustrates the attribute encoding of the UPDATE message.
Figure 9-19. UPDATE Attribute Encoding
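The prefix padding rule described before Figure 9-19 can be illustrated with a short sketch. It assumes the RFC 1771 encoding of a one-byte length in bits followed by just enough bytes to hold the prefix, handles IPv4 only, and uses function names that are purely illustrative.

```python
import ipaddress


def encode_prefix(prefix: str) -> bytes:
    """Encode an IPv4 prefix as <length in bits><prefix bytes, padded to a byte>."""
    net = ipaddress.IPv4Network(prefix)
    nbytes = (net.prefixlen + 7) // 8  # round up to the next byte boundary
    return bytes([net.prefixlen]) + net.network_address.packed[:nbytes]


def decode_prefix(data: bytes) -> str:
    """Reverse encode_prefix; bytes omitted on the wire are treated as zero."""
    length = data[0]
    nbytes = (length + 7) // 8
    padded = data[1:1 + nbytes] + bytes(4 - nbytes)
    return f"{ipaddress.IPv4Address(padded)}/{length}"
```

For 10.5.0.0/16 the prefix fits exactly in two bytes, so no padding bits are needed. An 18-bit prefix such as 10.5.64.0/18 occupies three bytes, and the last six bits of the third byte are padding.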
Attributes are used to provide specific information regarding the characteristics of a particular prefix being advertised. Each of these attributes and their meanings are described in Section 9.2.5. However, for now it is important to understand that BGP interprets and advertises attributes based upon four distinct categories as defined in RFC 1771 and shown in Figure 9-20. The first high-order bit indicates whether the attribute is well known or optional (0 = well known, 1 = optional). The second high-order bit indicates whether the attribute is transitive or nontransitive (0 = nontransitive, 1 = transitive). Well-known attributes are always transitive, so for them this bit must be set to 1. The third high-order bit is the partial bit. This bit must be set to 0 for well-known attributes and for optional nontransitive attributes. The fourth high-order bit is the extended length bit. If this bit is set to 1, then the extended length may be used, but only if the length of the attribute is greater than 255 bytes. The four low-order bits are not currently used in BGP:
Figure 9-20. UPDATE Attribute Type Field
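The flag rules above translate directly into bit masks. The following is a minimal sketch with illustrative function names. Note that the flags alone cannot distinguish well-known mandatory from well-known discretionary attributes, so the sketch reports only three of the four categories.

```python
OPTIONAL = 0x80         # first high-order bit: 0 = well known, 1 = optional
TRANSITIVE = 0x40       # second high-order bit: 1 = transitive
PARTIAL = 0x20          # third high-order bit: partial
EXTENDED_LENGTH = 0x10  # fourth high-order bit: two-byte length field


def decode_flags(flags: int) -> dict:
    """Break the attribute flags byte into its four meaningful bits."""
    return {
        "optional": bool(flags & OPTIONAL),
        "transitive": bool(flags & TRANSITIVE),
        "partial": bool(flags & PARTIAL),
        "extended_length": bool(flags & EXTENDED_LENGTH),
    }


def category(flags: int) -> str:
    """Name the category implied by the optional and transitive bits."""
    f = decode_flags(flags)
    if not f["optional"]:
        return "well-known"
    return "optional transitive" if f["transitive"] else "optional nontransitive"


def flags_are_valid(flags: int) -> bool:
    """Apply the consistency rules described in the text."""
    f = decode_flags(flags)
    if not f["optional"] and not f["transitive"]:
        return False  # well-known attributes must be transitive
    if f["partial"] and not (f["optional"] and f["transitive"]):
        return False  # partial is allowed only for optional transitive
    return True
```

For example, a flags byte of 0x40 describes a well-known attribute, and 0x80 describes an optional nontransitive one.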
9.2.4.3 NOTIFICATION Message

Any BGP-speaking router will send a NOTIFICATION message whenever an error condition exists. When this message is sent, the sending router closes the BGP session and transitions to the Idle state. The NOTIFICATION message consists of three fields (see Figure 9-21):
Figure 9-21. NOTIFICATION Message Format
The error codes and subcodes indicate the major and minor reasons for which the error was detected. The various error codes and subcodes are essential to the control of the BGP process over the session. If this information were not evaluated and the error codes were not available, the effects on the integrity of the route information passed through ASs and the Internet would be devastating. Table 9-3 lists these error codes. The numbers following each code indicate the decimal equivalent.

Table 9-3. NOTIFICATION Message Error Codes
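To make the message layout concrete, here is a minimal sketch of building and decoding a NOTIFICATION. It assumes the RFC 1771 body layout (one-byte error code, one-byte subcode, variable data) and the six top-level error codes that RFC 1771 defines. Subcodes are passed through as plain integers, and the function names are illustrative.

```python
import struct

# Top-level error codes as defined in RFC 1771 (decimal values).
ERROR_CODES = {
    1: "Message Header Error",
    2: "OPEN Message Error",
    3: "UPDATE Message Error",
    4: "Hold Timer Expired",
    5: "Finite State Machine Error",
    6: "Cease",
}


def build_notification(code: int, subcode: int = 0, data: bytes = b"") -> bytes:
    """Return a complete NOTIFICATION message (type 3), header included."""
    body = struct.pack("!BB", code, subcode) + data
    header = struct.pack("!16sHB", b"\xff" * 16, 19 + len(body), 3)
    return header + body


def parse_notification_body(body: bytes):
    """Return (error name, subcode, data) from the bytes after the header."""
    if len(body) < 2:
        raise ValueError("NOTIFICATION body needs an error code and a subcode")
    code, subcode = struct.unpack("!BB", body[:2])
    return ERROR_CODES.get(code, f"Unknown ({code})"), subcode, body[2:]
```

A hold timer expiry, for instance, would be signaled with `build_notification(4)`, after which the sender closes the session and returns to Idle as described above.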
9.2.4.4 KEEPALIVE Message

KEEPALIVE messages are exchanged to let each peering neighbor know that the other is still there (see Figure 9-22). Two elements work together here: the hold timer and the KEEPALIVE messages. KEEPALIVE messages cannot be sent more frequently than one per second. When the BGP session is first negotiated, the HoldTime is agreed upon. If the agreed-upon HoldTime is 0, then no KEEPALIVE messages will be sent. As noted previously, a KEEPALIVE message consists only of the BGP header.

Figure 9-22. KEEPALIVE Message Format
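The timer relationship can be sketched as follows. Two details here come from RFC 1771 rather than from the text above and should be read as assumptions of this sketch: the negotiated HoldTime is the smaller of the two values exchanged in the OPEN messages, and one third of the HoldTime is only a suggested KEEPALIVE interval. The function names are illustrative.

```python
def negotiated_hold_time(local: int, remote: int) -> int:
    """Use the smaller of the two proposed HoldTime values (in seconds)."""
    hold = min(local, remote)
    if hold in (1, 2):
        # RFC 1771 requires the HoldTime to be 0 or at least 3 seconds.
        raise ValueError("HoldTime must be 0 or at least 3 seconds")
    return hold


def keepalive_interval(hold_time: int):
    """Return seconds between KEEPALIVEs, or None when none are sent."""
    if hold_time == 0:
        return None  # a HoldTime of 0 disables KEEPALIVE messages
    # One third of the HoldTime, but never more often than once per second.
    return max(1, hold_time // 3)
```

With a local HoldTime of 90 seconds and a remote HoldTime of 180 seconds, the session would use 90 seconds and send a KEEPALIVE roughly every 30 seconds.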
9.2.5 Attributes

This section will discuss the attributes associated with BGP and what each means. Attributes are passed in UPDATE messages. There are four categories of BGP path attributes:
Table 9-4 lists the ten most common attributes used in BGP by category name and type code. The type code is the decimal value that is used in the UPDATE message to specify the type of attribute being passed.

Table 9-4. BGP Path Attributes
9.2.5.1 ORIGIN

This attribute gives an indication of how this particular prefix was learned. The following possible values can be used:
By default, JUNOS uses the ORIGIN code value 0, regardless of whether the route was learned from the IGP, statically defined, or part of an aggregate route.

9.2.5.2 AS_PATH

The AS_PATH attribute lists the ASs through which a prefix has been announced. This attribute serves two functions: routing loop avoidance and path selection. If the receiving AS sees its own ASN in the AS_PATH list, it will ignore that announcement. The AS_PATH lists only the ASs that announced the prefix. A single AS may have 10 routers or 100, so the AS_PATH provides no additional granularity into how a packet would travel to a destination within a given AS.

The contents of the AS_PATH are identified by a type field. If the type field has a value of 1 (AS_SET), then the resulting list is an unordered set of ASs. If the type value is 2 (AS_SEQUENCE), then the resulting list is an ordered set of ASs. This means that when the local system readvertises the prefix, it will prepend the local system's ASN to the AS_SEQUENCE. Prepending always occurs at the leftmost position of the list. If the prefix originates in the local AS, then the border router will add the local AS to the prefix and send it to the external neighbor. If the prefix is advertised internally, then no prepending is necessary. Remember, the AS_PATH is the listing of ASs that have ANNOUNCED the prefix, ANNOUNCED meaning advertised to external neighboring ASs.

9.2.5.3 NEXT_HOP

This attribute is vital in the route-selection process. In short, the NEXT_HOP attribute indicates the IP address of the border router that can be used to reach a given destination, not the next hop in the sense of the interface or gateway to the next Layer 3 device. The show route <prefix> detail or show route <prefix> extensive commands can be used to see both physical next-hops and protocol or border router next-hops. Understanding NEXT_HOP and how it is used is essential to understanding BGP route selection.
A case study in Sections 10.2.2 and 10.2.3 covers NEXT_HOP.

9.2.5.4 MED

MED is a metric specified by an announcing external neighbor to identify the ingress point to use in the announcing AS for a given prefix. This attribute is used by the announcing system to influence the local system's decision process. When it is received, it can be propagated via IBGP, but it does not get propagated when the local AS advertises the route to another external AS. A case study in Section 10.2.4 can be referenced for more information.

9.2.5.5 LOCAL_PREF

LOCAL_PREF is used by IBGP to influence internal routers to use a particular border router to reach a given prefix. The higher the value, the greater the degree of preference. This means that if border router A advertises prefix 10.10.0.0/16 with a LOCAL_PREF of 100, and border router B advertises 10.10.0.0/16 with a LOCAL_PREF of 150, the internal routers will choose to send packets via border router B.

9.2.5.6 ATOMIC_AGGREGATE

Aggregation is a method by which the local system advertises a route representative of several more specific routes that it knows about. When this occurs, there is a potential loss of information relating to the more specific prefixes, such as AS_PATH. In that case, the local system will attach the ATOMIC_AGGREGATE attribute to the prefix when advertising it. This is important for the system receiving the route. If a local system receives a prefix with the ATOMIC_AGGREGATE attribute set and has a more specific route, it will not advertise the more specific route. It can therefore be assumed in some cases that routes carrying the ATOMIC_AGGREGATE attribute will traverse ASs that may not be in the AS_PATH list. Aggregation can naturally cause loss of path information, hence the need to signal other systems that this has occurred.
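The AS_PATH and LOCAL_PREF behavior described in Sections 9.2.5.2 and 9.2.5.5 can be sketched in a few lines. This is a deliberately reduced model: the AS_PATH is treated as a single AS_SEQUENCE held in a Python list, routes are plain dictionaries, and LOCAL_PREF is the only selection criterion considered. The function names, dictionary keys, and the ASNs in the usage note are illustrative assumptions.

```python
def accept_route(as_path, local_as):
    """Loop avoidance: ignore an announcement that already contains our ASN."""
    return local_as not in as_path


def advertise(as_path, local_as, external):
    """Prepend the local ASN at the leftmost position for external peers only."""
    return [local_as] + list(as_path) if external else list(as_path)


def best_by_local_pref(routes):
    """Pick the route with the highest LOCAL_PREF (higher is preferred)."""
    return max(routes, key=lambda route: route["local_pref"])
```

Using the example from Section 9.2.5.5, a route for 10.10.0.0/16 from border router B with a LOCAL_PREF of 150 is chosen over the same prefix from border router A with a LOCAL_PREF of 100. Likewise, a system in AS 65001 readvertising a path of `[65002, 65003]` to an external neighbor would send `[65001, 65002, 65003]`, and would ignore any announcement whose path already contains 65001.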
9.2.5.7 AGGREGATOR

If a local system performs aggregation on a series of routes, it will include in the aggregated prefix advertisement the ASN and IP address of the local system that performed the aggregation.

9.2.5.8 COMMUNITY

Communities and policy go hand in hand with BGP. You can assign several prefixes the same values by including them in a particular community. Associated with this attribute are three well-known communities, as defined in RFC 1997:
Communities are essential in service provider networks. They can play a vital role in route coloring and further enhance the routing domain's ability to maintain routing-policy control. Section 11.7 provides coverage on the use of communities.

9.2.5.9 ORIGINATOR_ID

This attribute is defined in RFC 1966. Simply put, the ORIGINATOR_ID is the RID of the router that originated the route into the AS. Route reflectors will not send a route learned from an originator back to that originator. The route-reflector server will set this attribute when advertising the prefix to internal neighbors. This attribute will not be sent to external neighbors.

9.2.5.10 CLUSTER_LIST

This is an optional nontransitive attribute and is used by BGP in route-reflector scenarios as well. The route-reflector server sets the CLUSTER_LIST value. Any route received with a CLUSTER_LIST that contains the local CLUSTER_ID will be ignored. This, too, is part of the loop avoidance scheme in BGP route reflection and is especially useful when implementing multiple route-reflector clusters within a single AS.
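The two route-reflection loop checks can be sketched as follows. This is a simplified model under stated assumptions: routes are plain dictionaries, router IDs and cluster IDs are strings, and the function names and keys are illustrative. It follows RFC 1966 in setting ORIGINATOR_ID only when the attribute is not already present and in prepending the local CLUSTER_ID when reflecting.

```python
def reflect(route, cluster_id, originator_rid):
    """Return the route as a reflector would readvertise it, or None if ignored."""
    cluster_list = route.get("cluster_list", [])
    if cluster_id in cluster_list:
        return None  # our CLUSTER_ID is already listed: a loop, so ignore it
    reflected = dict(route)
    # Keep an existing ORIGINATOR_ID; set it only on the first reflection.
    reflected.setdefault("originator_id", originator_rid)
    reflected["cluster_list"] = [cluster_id] + list(cluster_list)
    return reflected


def accept_reflected(route, router_id):
    """A router ignores a reflected route that it originated itself."""
    return route.get("originator_id") != router_id
```

A route reflected once by cluster 1.1.1.1 carries that ID in its CLUSTER_LIST, so if the route ever arrives back at a reflector in the same cluster, `reflect` returns `None` and the route is ignored.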