Section 12.5. Routing

   


12.5. Routing

The networking system was designed for a heterogenous network environment, in which a collection of local-area networks are connected at one or more points through routers, as shown in the example in Figure 12.7. Routers are nodes with multiple network interfaces, one on each local-area or long-haul network. In such an environment, issues related to packet routing are important. Certain of these issues, such as congestion control, are handled simplistically in FreeBSD (see Section 12.6). For others, the network system provides simple mechanisms on which more involved policies can be implemented. These mechanisms ensure that, as these problems become better understood, their solutions can be incorporated into the system. Note that at the time of the original design of this part of the system, a network node that forwarded network-level packets was generally known as a gateway. The current term is router. We use both terms interchangeably, in part because the system data structures continue to use the name gateway.

Figure 12.7. Example of the topology for which routing facilities were designed.


This section describes the facilities provided for packet routing. The routing facilities were designed for use by singly connected and multiply connected hosts, as well as for routers. There are several components involved in routing, illustrated in Figure 12.8. The design of the routing system places some components within the kernel and others at user level. Routing is actually an overbroad term. In a complex, modern network there are two major components to a routing system. The gathering and maintenance of route information (i.e., which interfaces are up, what the costs are in terms of using particular links, etc.) as well as the implementation of routing policies (which interfaces can be used to forward traffic in an administrative sense) are handled at user level by routing daemons. The actual forwarding of packets, which is the selection of the interface on which a packet will be sent, is handled by the kernel.

Figure 12.8. Routing design.


The forwarding mechanism is a simple lookup that provides a first-hop route (a specific network interface and immediate destination) for each outbound packet. The current design places enough information in the kernel for packets to be sent on their way without external help; all other components are outside the kernel. User-level routing daemons communicate with the kernel via a routing socket to manipulate the kernel forwarding table and to listen for internal changes such as interfaces being brought up or down. Each of these components is described in this section.

Kernel Routing Tables

The kernel routing mechanism implements a routing table for looking up first-hop routes (or next hop, when forwarding packets). It includes two distinct portions: a data structure describing each specific route (a routing entry) and a lookup algorithm to find the correct route for each possible destination. This subsection describes the entries in the routing table, and the next subsection explains the lookup algorithm. A destination is described by a sockaddr structure with an address family, a length, and a value. Routes are classified in the following ways: as either host or network routes and as either direct or indirect. The host-network distinction determines whether the route applies to a specific host or to a group of hosts with a portion of their addresses in common usually a prefix of the address. For host routes, the destination address of a route must exactly match the desired destination; the address family, length, and bit pattern of the destination must match those in the route. For network routes, the destination address in the route is paired with a mask. The route matches any address that contains the same bits as the destination in the positions indicated by bits set in the mask. A host route is a special case of a network route, in which all the mask bits are set, and thus no bits are ignored in the comparison. Another special case is a wildcard route a network route with an empty mask. Such a route matches every destination and serves as a default route for destinations not otherwise known. This fallback network route is usually pointed to a router that can then make more informed routing decisions.

The other major distinction between types of routes is either direct or indirect. A direct route is one that leads directly to the destination: The first hop of the path is the entire path, and the destination is on a network shared with the source. Most routes are indirect: The route specifies a router on a local network that is the first-hop destination for the packet. Much of the literature (especially for Internet protocols) refers to a local-remote decision, where an implementation checks first whether a destination is local to an attached network or is remote. In the first case, a packet is sent locally (via the link layer) to the destination; in the latter case, it is sent to a router that can forward it to the destination. In the FreeBSD implementation, the local-remote decision is made as part of the routing lookup. If the best route is direct, then the destination is local. Otherwise, the route is indirect, the destination is remote, and the route entry specifies the router for the destination. In either case, the route specifies only the first-hop gateway a link-level interface to be used in sending packets and the destination for the packets in this hop if different from the final destination. This information allows a packet to be sent via a local interface to a destination directly reachable via that interface either the final destination or a router on the path to the destination. This distinction is needed when the link-layer encapsulation is done. If a packet is destined for a peer that is not directly connected to the source, the packet header will contain the address of the eventual destination, whereas the link-layer protocol header will address the intervening router.

The network system maintains a set of routing tables that is used by protocols in selecting a network interface to use in delivering a packet to its destination. These tables are composed of entries of the form shown in Table 12.5.

Table 12.5. Elements of a routing-table entry (rtentry) structure.

Element

Description

rt_nodes[2]

internal and leaf radix nodes (with references to destination and mask)

rt_gateway

address of the router/next hop

rt_refent

reference count

rt_flags

flags; see Table 12.6

rt_ifp

reference to interface, ifnet

rt_ifa

reference to interface address, ifaddr

rt_genmask

mask for cloning

rt_llinfo

pointer to link-layer private data

rt_rmx

route metrics (e.g., MTU)

rt_gwroute

if indirect, route to gateway

rt_parent

cloning parent of this route

rt_output

output routine for this route/interface, unused

rt_mtx

mutex for locking the entry (kernel only)


Table 12.6. Route entry flags.

Flag

Description

RTF_UP

route is valid

RTF_GATEWAY

destination is a gateway

RTF_HOST

host entry (net otherwise)

RTF_REJECT

host or net unreachable

RTF_DYNAMIC

created dynamically (by redirect)

RTF_MODIFIED

modified dynamically (by redirect)

RTF_DONE

message confirmed

RTF_MASK

subnet mask present

RTF_CLONING

generate new routes on use

RTF_XRESOLVE

external daemon resolves name

RTF_LLINFO

generated by link layer

RTF_STATIC

manually added by administrator

RTF_BLACKHOLE

just discard packets (during updates)

RTF_PROTO1

protocol-specific routing flag

RTF_PROTO2

protocol-specific routing flag

RTF_PROTO3

protocol-specific routing flag

RTF_WASCLONED

route generated through cloning

RTF_LOCAL

route represents a local address

RTF_BROADCAST

route represents a beast address

RTF_MULTICAST

route represents an mcast address


Routing entries are stored in an rtentry structure, which contains a reference to the destination address and mask (unless the route is to a host, in which case the mask is implicit). The destination address, the address mask, and the gateway address are variable in size and thus are placed in separately allocated memory. Routing entries also contain a reference to a network interface, a set of flags that characterize the route, and optionally a gateway address. The flags indicate a route's type (host or network, direct or indirect) and the other attributes shown in Table 12.6 (on page 498). The route entry also contains a field for use by the link-layer driver, a set of metrics, and a mutex for locking the entry. The RTF_HOST flag in a routing-table entry indicates that the route applies to a single host, using an implicit mask containing all the bits of the address. The RTF_GATEWAY flag in a routing-table entry indicates that the route is to a router and that the link-layer header should be filled in from the rt_gateway field, instead of from the final destination address. The route entry contains a field that can be used by the link layer to cache a reference to the direct route for the router. The RTF_UP flag is set when a route is installed. When a route is removed, the RTF_UP flag is cleared, but the route entry is not freed until all users of the route have noticed the failure and have released their references. The route entry contains a reference count because it is allocated dynamically and cannot be freed until all references have been released. The RTF_CLONING flag indicates that a route is a generic route that must be cloned and made more specific before use. This flag is usually used for link-layer routes that apply to a directly attached network, and the cloned routes are generally host routes for hosts on that network that contain some link-level information about that host. When a route is cloned, an external agent may be invoked to complete the link-layer information needed for a destination. Other flags (RTF_REJECT and RTF_BLACKHOLE) mark the destination of the route as being unreachable, causing either an error or a silent failure when an attempt is made to send to the destination. Reject routes are useful when a router receives packets for a cluster of addresses from the outside but may not have routes for all hosts or networks in the cluster at all times. It is undesirable for packets with unreachable destinations to be sent outside the cluster via a default route because the default router would send back such packets for delivery within the cluster. Black-hole routes are used during routing transients when a new route may become available shortly.

Many connection-oriented protocols wish to retain information about the characteristics of a particular network path. Some of this information can be estimated dynamically for each connection, such as the round-trip time or path MTU. It is useful to cache such information so that the estimation does not need to begin anew for each connection [Mogul & Deering, 1990]. The routing entry contains a set of route metrics stored in a rt_metrics_lite structure that may be set externally or may be determined dynamically by the protocols. These metrics include the maximum packet size for the path, called the maximum transmission unit (MTU); the lifetime for the route; and the number of packets that have been sent using this route. In versions of FreeBSD prior to 5.2 path MTU discovery was implemented using a much larger structure, called rt_metrics. Most of the fields contained in the rt_metrics structure were related only to TCP, and so they were moved from the routing entry structure into the TCP code.

When a route is added or created by cloning, and when a route is deleted, the link layer is called via the ifa_rtrequest entry point stored in the ifaddr structure for this interface address. The link layer can allocate private storage associated with the route entry. This feature is used with direct routes to networks that are marked as cloning routes; the link layer can use this mechanism to manage link-layer address-translation information for each host. The address translation can be arranged within the system it can be handled outside the kernel when the RTF_XRESOLVE flag is set.

Routing Lookup

Given a set of routing entries describing various destinations, from specific hosts to a wildcard route, a routing lookup algorithm is required. The lookup algorithm in FreeBSD uses a modification of the radix search trie [Sedgewick, 1990]. (The initial design was to use a PATRICIA search, also described in Sedgewick [1990], which differs only in the details of storage management.) The radix search algorithm provides a way to find a bit string, such as a network address, in a set of known strings. Although the modified search was implemented for routing lookups, the radix code is implemented in a more general way so that it can be used for other purposes. For example, the filesystem code uses a radix tree to manage information about clients to which filesystems can be exported. Each kernel route entry begins with the data structures for the radix tree, including an internal radix node and a leaf node that refers to the destination address and mask.

The radix search algorithm uses a binary tree of nodes beginning with a root node for each address family. Figure 12.9 (on page 500) shows an example radix tree. A search begins at the root node and descends through some number of internal nodes until a leaf node is found. Each internal node requires a test of a specific bit in the string, and the search descends in one of two directions depending on the value of that bit. The internal nodes contain an index of the bit to be tested, as well as a precomputed byte index and mask for use in the test. A leaf node is marked with a bit index of -1, which terminates the search. For example, a search for the address 127.0.0.1 (the loopback address) with the tree in Figure 12.9 would start at the head and would branch left when testing bit 0, right at the node for bit 1, and right on testing bit 31. This search leads to the leaf node containing a host route specific to that host; such a route does not contain a mask but uses an implicit mask with all bits set.

Figure 12.9. Example radix tree. This simplified example of a radix tree contains routes for the IPv4 protocol family, which uses 32-bit addresses. The circles represent internal nodes, beginning with the head of the tree at the top. The bit position to be tested is shown within the circle. Leaf nodes are shown as rectangles containing a key (a destination address, listed as four decimal bytes separated by dots) and the corresponding mask (in hexadecimal). Some interior nodes are associated with masks found lower in the tree, as indicated by dashed arrows.


This lookup technique tests the minimum number of bits required to distinguish among a set of bit strings. Once a leaf node is found, either it specifies the specific bit string in question or that bit string is not present in the tree. This algorithm allows a minimal number of bits to be tested in a string to lookup an unknown, such as a host route; however, it does not provide for partial matching as required by a routing lookup for a network route. Thus, the routing lookup uses a modified radix search, in which each network route includes a mask, and nodes are inserted into the tree such that longer masks are found earlier in the search [Sklower, 1991]. Interior nodes for subtrees with a common prefix are marked with a mask for that prefix. (Masks generally select a prefix from an address, although the mask does not need to specify a contiguous portion of the address.) As the routing lookup proceeds, the internal nodes that are passed are associated with masks that increase in specificity. If the route that is found at the leaf after the lookup is a network route, the destination is masked before comparison with the key, thus matching any destination on that network. If the leaf node does not match the destination, one of the interior nodes visited during the route lookup should refer to the best match. After a lookup does not find a match at the leaf node, the lookup procedure iterates backward through the tree, using a parent pointer in each node. At each interior node that contains a mask, a search is made for the part of the destination under that mask from that point. For example, a search for the address 128.32.33.7 in the tree in Figure 12.9 would test bits 0, 18, and 29 before arriving at the host route on the right (128.32.33.5). Because this address is not a match, the search moves up one level, where a mask is found. The mask is a 24-bit prefix, and it is associated with the route to 128.32.33.0, which is the best match. If the mask was not a prefix (in the code, a route with a mask specifying a prefix is called a normal route), a search would have been required for the value 128.32.33.7 starting from this point.

The first match found is the best match for the destination; that is, it has the longest mask for any matching route. Matches are thus found by a combination of a radix search, testing 1 bit per node on the way down the tree, plus a full comparison under a mask at the leaf node. If the leaf node (either host or network) does not match, the search backtracks up the tree, checking each parent with a mask until a match is found. This algorithm avoids a complete comparison at each step when searching down the tree, which would eliminate the efficiency of the radix search algorithm. It is optimized for matches to routes with longer masks and performs least efficiently when the best match is the default route (the route with the shortest mask).

Another complication of using a radix search is that a radix tree does not allow duplicated keys. There are two possible reasons for a key to be duplicated in the tree: Either multiple routes exist to the same destination or the same key is present with different masks. The latter case is not a complete duplicate, but the two routes would occupy the same location in the tree. The routing code does not support completely duplicate routes, but it supports multiple routes that differ in only the mask. When the addition of a route causes a key to be duplicated, the affected routes are chained together from a single leaf node. The routes are chained in order of mask significance, most specific mask first. If the masks are contiguous, longer masks are considered to be more specific (with a host route considered to have the longest possible mask). If a routing lookup visits a node with a duplicated key when doing a masked comparison (either at the leaf node or while moving back up the tree), the comparison is repeated for each duplicate node on the chain, with the first successful comparison producing the best match.

As we noted, FreeBSD does not support multiple routes to the same destination (identical key and mask). The main reason to support multiple paths would be to allow the load to be split among the paths but interleaving of traffic across the available paths would often be suboptimal. A better design would be to add a pointer to an output function in each route. Most routes would copy the output pointer for the interface used by the route. Routes for which multiple paths were available would be represented by a virtual route containing references to the individual routes, which would not be placed in the radix tree. The virtual route would interpose an intermediate output function that would distribute packets to the output functions for the individual routes. This scheme would allow good packet interleaving even when a path was used by a single connection. Although the output routine field has been added to the rtentry structure, it is not in use in the code as of FreeBSD 5.2.

Routing Redirects

A routing redirect is a control request from a protocol to the routing system to modify an existing routing-table entry or to create a new routing-table entry. Protocols usually generate such requests in response to redirect messages that they receive from routers. Routers generate redirect messages when they recognize that a better route exists for a packet that they have been asked to forward. For example, if two hosts, A and B, are on the same network, and host A sends a packet to host B via a router C, then C will send a redirect message to A indicating that A should send packets to B directly.

On hosts where exhaustive routing information is too expensive to maintain (e.g., SOHO routers, DSL modems, PDAs and other embedded systems), the combination of wildcard routing entries and redirect messages can be used to provide a simple routing-management scheme without the use of a higher-level policy process. Current connections can be rerouted after notification of the protocols by the protocols' pr_ctlinput() entries. Statistics are kept by the routing-table routines on the use of routing-redirect messages and on the latter's effect on the routing tables. A redirect causes the gateway for a route to be changed if the redirect applies to all destinations to which the route applies; otherwise, a new host route is cloned from the appropriate network route. Cloned host routes can easily pollute the routing table because an automated method for removing stale host routes created by redirects does not exist in the system. A user-level routing daemon will normally clean up stale host routes, but most hosts do not run routing daemons.

Routing-Table Interface

A protocol accesses the routing tables through three routines: one to allocate a route, one to free a route, and one to process a routing-redirect control message. The routine rtalloc() allocates a route; it is called with a pointer to a route structure, which contains the desired destination, as shown in Figure 12.10, and a pointer that will be set to reference the routing entry that is the best match for the destination. The destination is recorded so that subsequent output operations can check whether the new destination is the same as the previous one, allowing the same route to be used. The route returned is assumed to be held by the caller until released with a call to the RTFREE macro. All accesses to the routing table must be properly locked in FreeBSD and the RTFREE macro handles the locking as well as decrementing the route's reference count, freeing the route entry when the reference count reaches zero. TCP uses a host cache to hold onto routes for the duration of the connection's lifetime. Connectionless protocols, such as UDP, allocate and free routes whenever the routes' destination address changes. The rtalloc() routine simply checks whether the route already contains a reference to a valid route. If no route is referenced or the route is no longer valid, rtalloc() calls the rtalloc1 () routine to lookup a routing entry for the destination, passing a flag indicating whether the route will be used or is simply being checked. If packets will be sent, the route is created by cloning if necessary.

Figure 12.10. Data structures used in route allocation.


The rtredirect() routine is called to process a redirect control message. It is called with a destination address and mask, the new gateway to that destination, and the source of the redirect. Redirects are accepted from only the current router for the destination. If a nonwildcard route exists to the destination, the gateway entry in the route is modified to point at the new gateway supplied. Otherwise, a new host route is cloned from an existing network route in the routing table. Routes to interfaces and routes to gateways that are not directly accessible from the host are ignored.

User-Level Routing Policies

The kernel routing facilities deliberately refrain from making policy decisions. Instead, routing policies are determined by user processes, which then add, delete, or change entries in the kernel routing tables. The decision to place policy decisions in a user process implies that routing-table updates may lag a bit behind the identification of new routes or the failure of existing routes. This period of instability is normally short if the routing process is implemented properly. Internet-specific advisory information, such as ICMP error messages, may also be read from raw sockets (described in Section 12.7).

Several routing-policy processes have been implemented. The system standard routing daemon, routed, uses the Routing Information Protocol (RIP) [Hedrick, 1988]. Many sites that require the use of other routing protocols or more configuration options than are provided by routed use either a commercial package or the open source Quagga Routing Suite [Ishiguro, 2003].

User-Level Routing Interface: Routing Socket

User-level processes that implement routing policy and protocols require an interface to the kernel routing table so that they can add, delete, and change kernel routes.

The interface to the kernel routing layer in FreeBSD uses a socket to communicate with the kernel routing layer. A privileged process creates a raw socket in the routing protocol family, AF_ROUTE, and then passes messages to and from the kernel routing layer. This socket operates like a normal datagram socket, including queueing of messages received at the socket, except that communication takes place between a user process and the kernel. Messages include a header with a message type identifying the action, as listed in Table 12.7. Messages to the kernel are requests to add, modify, or delete a route, or are requests for information about the route to a specific destination. The kernel sends a message in reply with the original request, an indication that the message is a reply, and an error number in case of failure. Because routing sockets are raw sockets, each open routing socket receives a copy of the reply and must filter for the messages it wants. The message header includes a process ID and a sequence number so that each process can determine whether this message is a reply to its own request and can match replies with requests. The kernel also sends messages as indications of asynchronous events, such as redirects and changes in local interface state. These messages allow a daemon to monitor changes in the routing table made by other processes, events detected by the kernel, and changes to the local interface addresses and state. The routing socket is also used to deliver requests for external resolution of a link-layer route when the RTF_XRESOLVE flag is set on a route entry.

Table 12.7. Routing message types.

Message type

Description

RTM_ADD

add route

RTM_DELETE

delete route

RTM_CHANGE

change metrics or flags

RTM_GET

report route and metrics

RTM_LOSING

kernel suspects partitioning

RTM_REDIRECT

told to use different route

RTM_MISS

lookup failed on this address

RTM_LOCK

lock specified metrics

RTM_OLDADD

caused by SIOCADDRT

RTM_OLDDEL

caused by SIOCDELRT

RTM_RESOLVE

request to resolve link address

RTM_NEWADDR

address added to interface

RTM_DELADDR

address removed from interface

RTM_IFINFO

interface going up or down

RTM_IFANNOUNCE

interface arrival and departure

RTM_NEWMADDR

multicast group membership being added to interface

RTM_DELMADDR

multicast group membership being deleted from interface


Requests to add or change a route include all the information needed for the route. The header has a field for the route flags listed in Table 12.6, and contains a rt_metrics structure of metrics that may be set or locked. The only metric that may be set is the MTU. All other metrics are ignored but were kept for backward compatibility with software that used the routing socket interface prior to FreeBSD 5.2. The header also carries a bit vector that describes the set of addresses carried in the message; the addresses follow the header as an array of variable-sized sockaddr structures. A destination address is required, as is a mask for network routes. A gateway address is generally required as well. The system normally determines the interface to be used by the route from the gateway address, using the interface shared with that gateway. By convention, direct routes contain the local address of the interface to be used. In some cases, the gateway address is not sufficient to determine the interface, and an interface address can be passed as well, generally using a sockaddr_dl structure containing the interface name or index (see Section 12.1).


   
 


The Design and Implementation of the FreeBSD Operating System
The Design and Implementation of the FreeBSD Operating System
ISBN: 0201702452
EAN: 2147483647
Year: 2003
Pages: 183

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net