16.3 Implementation

The following discussion divides the routing implementation in Linux into three functional units:

As described in Section 14.2, ip_route_input() and ip_route_output() are the two functions invoked when IP packets are handled to run routing-specific tasks; they will be described in Section 16.3.4. These functions are also called "forwarding functions" in the following discussion.
Routing rules and routing tables together form the so-called forwarding-information base (FIB). Whenever necessary, forwarding functions query the forwarding-information base; this action is also called a forwarding query or FIB request in the following discussion. An FIB request is initiated by calling the fib_lookup() function. The implementation of routing rules is strongly encapsulated within the FIB, so routing rules and routing tables will be discussed separately in Sections 16.3.1 and 16.3.2.
Because consulting the FIB for each single IP packet received or sent would require too much time, there is an additional routing cache that stores the table entries used the most recently and allows fast access to these entries. Section 16.3.3 describes how this routing cache is implemented.

16.3.1 Routing Rules

As we described in Section 16.1.6, rule-based routing uses a set of rules to decide which routing tables should be searched in which sequence for a suitable entry to forward a packet and whether the packet may be forwarded at all. The rules are processed successively by ascending priority value until a decision can be made.

The entire implementation of the rules-processing method, including the data types used, is included in the fib_rules.c file. The rather narrow interface is described by some function prototypes and inline functions in a common header file, ip_fib.h. If rule-based routing was disabled in the kernel configuration (CONFIG_IP_MULTIPLE_TABLES option; see Section 16.2.1), then fib_rules.c is not compiled. In this case, the "replacement functionality" (use of the two routing tables local and main, in this sequence) is fully handled by the inline functions in ip_fib.h.

Data Structures

The set of rules is represented in the kernel by a linear list of fib_rule structures, sorted in ascending order by priority value and hooked into the static fib_rules variable. Initially, this list contains three entries: the fib_rule structures default_rule, main_rule, and local_rule, which are statically defined. Figure 16-6 shows this initial state of the rules list. A read-write spinlock called fib_rules_lock is used to regulate access to the list.

Figure 16-6. List with routing rules (initial state). Optional structure entries are not shown.

`struct fib_rule`	net/ipv4/fib_rule.c

The fib_rule structure, to begin with, contains two management fields, a link pointer, struct fib_rule *r_next, and a reference counter, atomic_t r_clntref. The latter specifies the number of references to a specific instance of the structure. This counter is incremented by atomic_inc() when new rules are added, and especially when a reference to a rule is returned in a result for a forwarding request. A call to atomic_dec_and_test() in the interface function fib_rule_put(), which frees fib_rule instances, decrements the reference counter. The memory is actually freed when this counter reaches a value of zero. In this situation, the function additionally checks for whether the entry int r_dead was set to one by explicitly deleting the rule (by inet_rtm_delrule(); see further below). If this is not the case, then there must be an implementation error, which is written to the system log for the user's attention.

Next within the structure follows some information about the described rule: the priority value u32 r_preference, the unsigned char r_table identifier for a routing table to be used, and the field unsigned char r_action, which specifies the action that should run if the rule's selector matches the packet currently being processed. The five rule types mentioned in Section 16.1.6 can be coded with the values RTN_UNICAST, RTN_BLACKHOLE, RTN_UNREACHABLE, RTN_PROHIBIT, and RTN_NAT, which are declared in include/linux/netlink.h. Additional attributes for the action are the u32 r_srcmap und __u32 r_tclassid entries. The first of these two entries includes the new source address for static address translation. The second entry is present only if the kernel was configured with the CONFIG_NET_CLS_ROUTE option. (See Section 16.2.1.) It contains a class identifier for a queuing discipline that is assigned to packets, if the routing table entry we select later does not itself contain a class identifier.

The rule's selector is represented by an address prefix and a corresponding network mask for the packet's source and destination addresses (u32 r_src, r_src-mask, r_dst, r_dstmask), by the network interface index (int r_ifindex), by the content in the TOS field (u8 r_tos), and additionally by the fwmark (u32 r_fwmark), if activated in CONFIG_IP_ROUTE_FWMARK. Moreover, the structure includes the lengths of the address prefixes (unsigned char r_src_len, r_dst_len) and the network interface name (char r_ifname[IFNAMSIZ]). These attributes are used only when rules are inserted, deleted, and displayed at the RT netlink interface. In contrast, forwarding decisions use only the address masks and the much faster integer index for the network interface.

Though the u8 r_flags entry is accepted from the RT netlink interface, it has no meaning in rules processing.

Initialization and Internal Functions

The initialization function fib_rules_init(), which is invoked (from within ip_fib_init(), which, in turn, is invoked by ip_rt_init()) when the routing is initialized during system start, does not have to initialize the rules list, because its initial entries are already statically linked and anchored. However, it registers the callback function fib_rules_export() in the notification chain for state changes to network devices. (See Section 5.2.4.) If changes occur, fib_rules_event() is invoked, which branches to fib_rules_attach() when a new network device is registered, but to fib_rules_detach() when a network device is unregistered. These two functions visit all existing rules and correct the ifindex entry, which is meaningful for registered devices only and should otherwise be set to ?.

As was mentioned in connection with the r_clntref reference counter, the fib_rule_put() function is used to release references that point to fib_rule structures. This function is also declared in ip_fib.h, because it is invoked not only internally, but also when structures are released that were returned as replies to forwarding requests and contain a reference to the rule that led to the selection of a route from a routing table.

RT Netlink Interface

The RT netlink interface represents the only way to manage routing rules. For this purpose, the table inet_rtnetlink_table[] (net/ipv4/devinet.c) has pointers for the message types RTM_NEWRULE, RTM_DELRULE, and RTM_GETRULE that point to the functions inet_rtm_newrule(), inet_rtm_delrule(), and inet_dump_rules(), so that these functions are invoked to handle the corresponding RT netlink messages.

A large part of the implementation of these functions consists in converting between the data structures of RT netlink messages and the fib_rule structure. In addition, when new rules are added, the values entered are checked for whether or not they are valid and consistent. If a rule is added without stating a required routing table, then the function fib_empty_table() is invoked and searches for a table not yet used.^[2] If no priority is stated, then a new rule is added to the rules list before all other rules with a nonzero priority. When rules are polled (message type RTM_GETRULE), the auxiliary function inet_fill_rule() is used; it appends the information of a single rule to the RT netlink reply message currently being built.

^[2] If you invoke ip rule add without stating a table number, then it automatically uses the table main. To create an RT netlink message with unspecified table, you have to use the table 0 option specifically with ip rule add.

Interface to Forwarding Functions

The rules database represents the access point to the FIB virtually, because rules have to be used to identify a suitable routing table for each forwarding request. For this reason, the "central" request function, fib_lookup(), is also implemented in fib_rules.c (or in ip_fib.h, if rule-based routing is disabled). However, it handles only a small part of the work involved; important parts are handled by invoking functions from other FIB parts, as we will see later.

In addition to these FIB interface functions, there are several functions that access specific elements of the fib_rule structure. This structure is not visible outside the fib_rule.c file.

fib_rules_tclass() simply returns the queuing discipline's class identifier assigned to a rule.
fib_rules_policy() transforms the source address as specified by a rule, if applicable.
fib_rules_map_destination() transforms the destination address for NAT routes. (See Section 16.1.6.)

`fib_lookup()`	net/ipv4/fib_rules.c

This function, which represents the most commonly used FIB interface, returns a matching routing-table entry for a key passed as argument. The key is passed as pointer to an rt_key structure (declared in route.h). This structure contains the source and destination addresses (src and dst), the input and output network interface indices (iff and oif), the TOS value (tos), and the fwmark (?) for the packet to be forwarded, if applicable:

 struct rt_key {            __u32         dst;            __u32         src;            int           iif;            int           oif; #ifdef CONFIG_IP_ROUTE_FWMARK            __u32         fwmark; #endif            __u8          tos;            __u8          scope; };

The scope information (scope) can be used to limit the search range. For this, each entry in a routing table includes a scope identifier, and only entries with equal or smaller scopes are returned for a request (but smaller scopes have bigger identifiers). The following scopes are predefined:

Symbol	Value	Scope
`RT_SCOPE_UNIVERSE`	0	any destination
`RT_SCOPE_LINK`	253	destination in the same physical network
`RT_SCOPE_HOST`	254	destination in the local system
`RT_SCOPE_NOWHERE`	255	destination does not exist

To handle a request, the rules list is visited in the order of ascending priority value, and the action corresponding to the rule type runs for each rule with a selector that matches the key passed. Rules of the types unreachable, blackhole, and prohibit cause the function to be aborted and to return an appropriate error value. For unicast and nat rules, the routing table identified by the r_table entry of the rule is consulted.

Section 16.3.2 discusses the data structures used to represent routing tables and how these structures are searched. The interface for this is the function pointer tb_lookup() in the fib_table structure representing the root of a routing table. If a table search is successful, then the result supplied by tb_lookup() is returned; otherwise, fib_lookup() continues with the next rule.

`fib_select_default()`	net/ipv4/fib_rules.c

This function serves to select a route from several default routes; it is invoked whenever a previous FIB request returns a routing-table entry with a network prefix of length null. It obtains the request key and the request result as parameters. The default route is actually selected by the function pointer tb_select_default(), which is included in the fib_table structure.

16.3.2 Routing Tables

In the Linux kernel, routing tables are represented by rather complex data structures, which manage entries by using a number of hash tables for different prefix lengths.

Data Structures

A fib_table structure forms the basis for a routing table. This structure includes a pointer to an fn_zone structure for each potential prefix length (0 to 32 bits). All routing table entries with the same prefix length are allocated to a specific fn_zone structure. The fn_zone structure uses an additional hash table to store the individual entries, each represented by a fib_node structure. The hash function used for this purpose also uses the entry's network prefix. If several routing-table entries have the same hash value, then the corresponding fib_node structures are linked in a linear list. Ultimately, the actual data of an entry is not in the fib_node structure itself, but in a fib_info structure referenced in the former structure.

There are up to 255 different routing tables when rule-based routing is used. The associated fib_table structures are managed by using the array variable struct fib_table * fib_tables[RT_TABLE_MAX+1]. Their positions within the array correspond to the table numbers, which are used in routing rules to identify routing tables. Only position null is not used; at the interfaces, identifier null denotes an unspecified table and normally is mapped to the main table. If rule-based routing is not used, there are only two routing tables, each referenced by a global variable: local_table and main_table.

Figure 16-7 shows a possible instance of fib_table and fn_zone structures, where the routing table with number 254 (RT_TABLE_MAIN) includes entries with three different prefix lengths: 0 (default route), 16, and 24. The hash tables used to reference the respective routing-table entries are shown on the right-hand side of the figure. They have different sizes: For prefix length null, the hash function will always yield the same value, which is the reason why one single entry is sufficient here, while 16 entries are allocated for other prefix lengths. The hash tables grow automatically when they fill up, if the option CONFIG_IP_ROUTE_LARGE_TABLES is active.

Figure 16-7. The `fib_table` structure with references to zones of different prefix lengths.

The fib_node and fib_info structures belonging to the routing-table entries are not shown in Figure 16-7, because of limited space. They are shown in Figure 16-8, which is described further along.

Figure 16-8. Hash table of a zone with fib_node and fib_info structures.

`struct fib_table`	include/net/ip_fib.h

In addition to its number, tb_id, and an unused element by the name of tb_stamp, the fib_table structure includes a number of function pointers, forming the interface to manage the entries stored in the table:

tb_insert() and tb_delete() serve to insert and delete entries; they are used in net/ipv4/fib_frontend.c to handle RT netlink messages and ioctl() and kernel-internal calls.
tb_dump() serves to output entries over RT netlink, and tb_get_info() serves to output entries in the /proc/net/route format.

tb_lookup() searches the table for an entry matching a key; it is used by the main query function, fib_lookup().
tb_flush() frees all entries in the table that previously have been marked as deleted.
tb_select_default() serves to select one route from several existing default routes.

These function pointers are set to functions in fib_hash.c, with names matching those of the pointers, except for the prefix (fn_hash_ instead of tb_), when a new fib_table structure is created by a call to fib_new_table() (include/net/ip_fib.h and net/ipv4/fib_frontend.c). This initialization is accomplished by the function fib_hash_init() (net/ipv4/fib_hash.c), invoked from within fib_new_table().

These function pointers are the only way to access internal routing table data structures; their implementation is fully encapsulated. The "core" structures fn_zone and fib_node are defined exclusively in the file net/ipv4/fib_hash.c. Also, the fn_hash structure, which is physically part of the fib_table structure, is declared "anonymously" as unsigned char tb_data[0] in include/net/ip_fib.h; it is actually used only within net/ipv4/fib_hash.c. The fib_info structure, which includes information about individual entries, is the only structure visible to the outside.

`struct fn_zone`	net/ipv4/fib_hash.c

An fn_zone structure manages all entries with the same prefix length by use of a hash table. The fixed prefix length is noted in the fz_order element, and fz_mask includes an appropriate network mask. The hash table consists of an array of fib_node structures, which are referenced by the fz_hash pointer. The array size is specified by fz_divisor. fz_hashmask holds a bit mask used to mask the hash value to the range permitted for indexing in the array in its last computation step.

An fn_zone structure for a specific prefix length exists only provided that entries with this prefix length actually exist. All fn_zonestructures of a table are linked in a linear list in the order of descending prefix length (by using the fz_next element), which is hooked into the fn_zone_list element of the fn_hash structure at the bottom of the fib_table structure. This list forms the basis of search for an entry with the longest network prefix matching a specific destination address.

`struct fib_node`	net/ipv4/fib_hash.c

Each single entry in a routing table is represented by a fib_node structure. In its fn_key element, this structure contains the destination network prefix (with an identical length for all entries of one zone), and its fn_tos elements includes a TOS value, which is also part of the key used to search a routing table. The type and scope of an entry are coded in fn_type and fn_scope, and the fn_state element stores two flags used to manage the structure.

All additional information in a routing-table entry, which is not required for searching an entry, but represents merely part of the search results,^[3] is located in the fib_info structure, which can be reached over fn_info.

^[3] Though quality information (metric) plays a role in searching, it is not explicitly used. The reason is that it already influences an entry's position in the fib_node list when the entry is inserted, so that entries with lower metric values are automatically found earlier.

`struct fib_info`	include/net/ip_fib.h

The fib_info structure represents information about the result of an FIB query, including the output interface to be used and the next hop along the route to the destination system, if necessary. This information is included in a fib_nh structure in the element fib_nh of the fib_info structure. The fib_nh element is an array to represent the situation where several equivalent routes lead to the same destination in the FIB. This array is declared with size null, and sufficient space is reserved for new entries that all stated routes can be stored. The number of these routes is then stored in the fib_nhs element of the fib_info structure.

The fib_nh structure is declared in ip_fib.h and contains the output interface to be used in the form of its index (nh_oif); as a pointer to the net_device structure (nh_dev), it also includes the IP address of the next router (nh_gw).

The fib_info structure does not contain any backward references to its position within the routing-table data structures. Instead, the fib_next and fib_prev pointers serve to arrange all existing fib_info structures to form a doubly linked list, which is hooked into a global variable, fib_info_list. When creating a new entry, the function fib_create_info(), which is invoked by tb_insert(), first checks for whether an identical entry exists in the list. Rather than duplicating such an entry, it would merely increment the reference counter fib_clntref. This means that several fib_node structures can reference one single fib_info structure. When a fib_node is freed, then the fib_info structure is also freed, if the reference counter has reached null.

In contrast to fn_zone and fib_node, the fib_info structure becomes visible to the outside again: It is declared in include/net/ip_fib.h, and its contents are read directly in some places (e.g., in net/ipv4/route.c). All operations to manage fib_info structures that go beyond the plain reading of a data element are implemented in the file net/ipv4/fib_semantics.s, however.

Managing and Initializing Memory

The FIB implementation is initialized in ip_fib_init() (net/ipv4/fib_frontend.c). If rule-based routing is not used, then the function fib_hash_init() is invoked directly for the two tables table_local and table_main held in global variables. This function occupies memory for a fib_table structure and sets the enclosed function pointers to the fn_hash_ functions. The substructure of the type fn_hash, which contains the zone hash table, is initialized to null.

In contrast, if rule-based routing is used, routing tables are not previously initialized. The global array that references the tables is created as a memory area initialized to null when the kernel is loaded. Whenever a routing table is accessed via an element of the array, then it is checked whether the table already exists, and, if this is not the case, it is created by fib_hash_init().

The function fib_hash_init() also ensures that a slab cache (see Section 2.6.2) called "ip_fib_hash" exists. This slab cache supplies memory for fib_node structures, which are allocated in the function fn_hash_insert() to create new entries when needed.

Managing Hash Structures

The functions that access fn_zone and fib_node structures, which are used internally to manage routing-table entries by using hash tables, are collected in the file net/ipv4/fib_hash.c. They are invoked by all other FIB functions to access fn_zone and fib_node structures.

Encapsulation of these internal structures allows them to be reimplemented, if necessary, by using other data structures offering a more efficient search process, without the need to effect changes in other places.

The most important functions in net/ipv4/fib_hash.c are as follows:

fn_rehash_zone() enlarges the hash tables, if necessary.
fn_new_zone() creates a new fn_zone structure and sorts it into the zone list based on its prefix length.
fn_hash_lookup() handles the main task in an FIB query: The fn_zone structures in a routing table are walked through in the sequence of descending prefix length, and the hash table is searched for an entry matching the key passed as an argument.
fn_hash_select_default() selects one out of several default routes, considering whether the intermediate system specified as the next router is currently reachable.
fn_hash_insert(), fn_hash_delete(), and fn_hash_dump() serve to insert, delete, and display entries over the RT netlink interface.
fn_hash_flush() removes all fib_info structures of a zone that were previously marked as invalid.
fn_hash_get_info() serves to display routing table entries over the proc file system.

Interfaces to the User-Address Space

From within the user-address space, you can manage routing tables both over the traditional ioctl() interface and over RT netlink.

For the RT netlink interface, the functions inet_rtm_newroute(), inet_rtm_delroute(), and inet_dump_fib() from net/ipv4/fib_frontend.c are registered in the table inet_rtnetlink_table[] in net/ipv4/devinet.c, so that they can be invoked to add, delete, or output a routing table entry and handle corresponding messages.

ioctl() system calls are handled by ip_rt_ioctl() (net/ipv4/fib_frontend.c), which is invoked in af_inet.c by the general routine that handles ioctl() calls at PF_INET sockets. (See Chapter 26.) The parameters for the call are converted into an RT netlink message by fib_convert_rtentry() and passed to inet_rtm_newroute() or inet_rtm_delroute() for further handling.

proc File System

The contents of the pseudo file /proc/net/route, which can be used to view the main routing table, are created by the function fib_get_procinfo() (net/ipv4/fib_frontend.c), which is registered by ip_fib_init(), using proc_net_create(), for this purpose. The function creates a header line and uses the function pointer tb_get_info() from the main table, which normally points to fn_hash_get_info() (ipv4/net/fib_hash.c), to output the data. There, all fib_node structures in fn_zone_list are visited, and the appropriate data is eventually output by fib_node_get_info() (net/ipv4/fib_semantics.c).

Reacting to Changes in Network Interfaces

The functions fib_inetaddr_event() and fib_netdev_event() (net/ipv4/fib_frontend.c) are registered in two notification chains for state changes to network interfaces or changes to their IP addresses when ip_fib_init() initializes the FIB.

As soon as a network device obtains an IP address or when it is reactivated after it had addresses and was deactivated previously, then fib_add_ifaddr() (invoked by fib_inetaddr_event() or fib_netdev_event()) creates entries for local and broadcast routes in the local table. When addresses are removed, these entries are deleted accordingly by fib_del_ifaddr(). In addition, during removal of addresses, all routing entries that use the removed address as their preferred source address have to be deactivated. This is handled by the function fib_sync_down() from net/ipv4/fib_semantics.c.

When deactivating or deleting a network interface (and also when the last address of an interface was removed), fib_netdev_event() invokes the function fib_disable_ip() which, in turn, uses fib_sync_down() (parametrized differently) to declare all those fib_nh structures, that use the interface as their forwarding-output interface as invalid. Subsequently, the routing cache is deleted, and, finally, fib_disable_ip() invokes arp_ifdown() to inform the ARP implementation that the interface disappeared.

Interfaces to the Forwarding Functions

This section describes several operations in addition to those discussed in Section 16.3.1, which are used by the functions in net/ipv4/route.c to access the FIB.

`fib_validate_source()`	net/ipv4/fib_frontend.c

The function fib_validate_source() serves to check the source addresses of IP packets within a forwarding process. This check can be implemented elegantly by an FIB query, using fib_lookup() for the opposite direction (i.e., with source and destination addresses swapped): If the entry found is not of the type RTN_UNICAST (but RTN_LOCAL, for example, which means that the address is assigned to a local interface), then this address is not a valid source address for an incoming packet.

If the output interface noted in an inverted FIB query matches the actual input interface, then the validation is completed successfully. This is the only acceptable result when reverse-path filtering is active.

If the actual input device does not currently have an address, then the above check is considered successful, even if the FIB query returns no result at all. If the output device found does not match the actual input device, then another FIB query is done with the actual input device specified as a fixed output device. If this query supplies either a result of the type RTN_UNICAST or no result at all, the check is considered to have been successful.

`fib_select_multipath()`	net/ipv4/fib_semantics.c

When a routing-table entry with several routes is used in a forwarding process, the function fib_select_multipath() is invoked to select one of these routes. This decision is made randomly (the jiffies counter is used rather than a "real" random number generator), taking weights assigned to each of these routes into account.

`ip_dev_find()`	net/ipv4/fib_frontend.c

To find the network interface that belongs to an IP address passed as parameter, ip_dev_find() uses the function pointer tb_lookup() to search the local table. The result has to be an entry of the type RTN_LOCAL, and a reference to the wanted net_device structure can be taken from the entry.

`inet_addr_type()`	net/ipv4/fib_frontend.c

inet_addr_type() is another function that searches the local table for a specific address, but the address type is the result looked for in this case. The semantics is slightly different and is characterized by the specific requirements of the functions that invoke inet_addr_type(): Formally invalid addresses are previously filtered and then yield the RTN_MULTICAST result, and an address is treated as RTN_UNICAST, even if no entry is found in the local table.

16.3.3 The Routing Cache

Though the FIB data structures offer relatively fast queries, the cost to run such a query for each single IP packet would be altogether excessive. For this reason, the Linux kernel has a cache, in addition to the FIB, that stores the results of the forwarding queries used most recently allowing them to be accessed quickly. Each forwarding operation first consults the routing cache, and the FIB is queried only if no matching entry exists in the cache. The result of the FIB query is then used to create a new cache entry immediately.

The routing cache is based on a relatively simple data structure. One single hash table includes the cache entries, which are linearly linked when the hash value is identical. The hash function processes the source and destination addresses for packets to be forwarded, plus their TOS values. Each cache entry contains all information required to forward a packet.

Figure 16-9 shows the hash table (left-hand side), organized as an array of rt_hash_bucket structures. The pointer struct rt_hash_bucket *rt_hash_table references this array. Within each rt_hash_bucket structure, the chain element forms the anchoring point for a list of rtable structures, representing the cache entries. Access to this list is controlled by the read-write spinlock lock in the rt_hash_bucket structure. The size of the hash table is defined according to the main memory size when the routing code is initialized and remains unchanged afterwards.

Figure 16-9. Data structures of the routing cache.

`struct rtable`	include/net/route.h

The rtable structure is rather extensive, and Figure 16-9 shows only an excerpt. In addition, we somewhat simplified the representation of its first element. Actually, the first element of the rtable structure is a union structure, u, which contains a dst_entry structure and an rtable pointer:

 union {        struct dst_entry dst;        struct rtable *rt_next; } u;

Both elements of u are used concurrently, without representing a problem, because the first element of the dst_entry structure is the dst_entry pointer next, and dst_entry structures occur only within rtable structures. This means that u.dst.next and u.rt_next are two different names for the same pointer; based on their types, the total object or only part of it are referenced.

The rt_src and rt_dst elements specify the source and destination addresses for IP packets handled by this entry. rt_gateway stores the address of the next router or the destination address again, if no router is required. The interface identifier in iif can denote the output or input interface.

The rt_key structure integrated as element key is used as the key in searching the routing cache. The values stored there do not necessarily have to match those outside. For example, in searching for an entry of a packet that was created locally and should be sent now, normally no source address is specified, which means that rt_key.src in the cache entry is null. However, the source address to be assigned to the new packet is stored in rt_src.

`struct dst_entry`	include/net/dst.h

The dst_entry structure contains a large number of information elements. The most important elements are introduced below by origin and purpose:

Some pointers refer to other data structures required when forwarding IP packets and so they are available there without additional effort the net_device structure of the output interface, the neighbour structure to the next router or the destination system (corresponding to the rt_gateway element in the rtable structure), and the hh_cache structure, which includes a packet header of the data-link layer that just has to be copied.
Function pointers to operations for further processing when receiving (input()) or transmitting (output()) an IP packet that matches the cache entry are simply invoked at the appropriate positions within the IP protocol procedure, to result in the packet's being handled appropriately. The input() pointer in entries that can be used for incoming packets refers to ip_local_deliver() (unicast packets to be delivered locally), to ip_mr_input() (multicast packets), to ip_forward() (packets to be forwarded), or to ip_error() (forwarding impossible because destination is unreachable). The output() pointer refers to ip_output() (unicast packets) or ip_mc_output() (multicast packets).
A reference to a dst_ops structure for IPv4 entries always points to the globally defined structure ipv4_dst_ops. It contains several function pointers, and users of routing-cache entries (e.g., TCP or ARP) can use them to supply feedback to the routing cache (e.g., about connection failures). In addition, the dst_ops structure holds a counter for occupied cache entries and stores a threshold value which, when exceeded, causes the cache garbage collection to start (as described later).
The usage counter, __use, is incremented whenever the cache entry is used. The time of the last use is also stored in the lastuse element.
Various parameters for transport protocols are copied from the fib_info structure.
Some functions to manage dst_entry structures are defined in the files include/net/dst.h and net/core/dst.c.

Interface to Forwarding Functions

The implementation of the routing cache is hardly encapsulated against the other program logic used for forwarding; you find both in the file net/ipv4/route.c. For example, there is no separate function that searches for a cache entry. Each search operates directly on the cache data structure in both main interface functions for IP packet processing (ip_route_input() and ip_route_output() see Section 16.3.4). Part of the creation of new cache entries is done there as well.

Nevertheless, there are a few methods that are used by the forwarding functions, yet clearly belong to the routing cache. The most important such methods are described below.

`rt_hash_code()`	net/ipv4/route.c

rt_hash_code() is used to calculate the hash value from the source and destination addresses and the TOS value passed as parameters. This hash value serves as index in the hash table of the routing cache.

`rt_intern_hash()`	net/ipv4/route.c

The forwarding functions use the rt_intern_hash() function to insert an almost complete rtable structure into the hash table of the routing cache. New entries are always inserted in the first position in a potential collision-resolution list. If an entry with identical key exists in the list, then this entry is moved to the front, and the new entry is discarded.

In addition to inserting entries, rt_intern_hash() procures the pointer to a neighbour structure matching the next router or destination system from the rtable structure.

Initialization

The initialization function ip_rt_init() allocates memory for the routing cache array, once an appropriate size has been computed. Subsequently, each element of the array is initialized. In addition, the slab cache "ip_dst_cache" that serves to hold rtable structures is initialized. The initialization functions of the IP interface management and FIB are invoked, the timer for cache garbage collection is started (see below), and finally, proc entries are created (see below).

Cache Garbage Collection

An entry is added to the routing cache for each new communication partner in the Internet, so we have to ensure that old entries that are no longer needed are deleted occasionally to limit the memory requirement and to keep cache searches fast in the long run.

First, a timer called rt_periodic_timer, which is initially started by add_timer() (see Section 2.7.1) in the initialization function ip_rt_init(), invokes the function rt_check_expire() at regular intervals. This timer is restarted in rt_check_expire(), where the interval (measured in 1/HZ seconds) is specified by the global configuration variable ip_rt_gc_interval (which can be set by sysctl() or by a proc directory entry called /proc/sys/net/ipv4/route/gc_interval).

Second, whenever the number of entries in the cache exceeds a threshold value, the function rt_garbage_collect() is invoked to delete old entries also. A call to rt_garbage_collect() is also made when an attempt to allocate a neighbour structure has failed. This is done because it is then assumed that the corresponding tables are full, because some neighbour entries are referenced by routing cache entries and therefore they cannot be deleted. Old routing cache entries are deleted to remedy this problem.

`rt_check_expire()`	net/ipv4/route.c

The function rt_check_expire() successively checks cache-entry lists anchored in the elements of the array rt_hash_table[]. If the element u.dst.expires of an entry specifies an expiry time, this entry is deleted if this time has been exceeded. If no expiry time is specified, then rt_may_expire() is invoked to check on whether this entry should be deleted nevertheless.

First of all, rt_may_expire() refuses to delete referenced entries (u.dst_refcnt != 0). Otherwise, it determines how long an entry has not been used and compares this "age" with two threshold values passed as arguments. Three types of entries are distinguished:

Low-value entries: This includes broadcast and multicast entries, which are additionally not in the last position in a collision resolution list. Such entries are always deleted immediately.
"Normal" entries: These entries are deleted if their age exceeds the first threshold.
High-value entries: These are entries created by ICMP redirect messages. They are deleted only if their age exceeds the second threshold.

The first threshold passed by rt_check_expire() varies according to the entry's position in the collision-resolution list: Starting with a value identical to that of the second threshold, which comes from the configuration variable ip_rt_gc_timeout, it decreases by half in each step.

rt_check_expire() interrupts its work after a number of steps, depending on the configuration variables ip_rt_gc_interval and ip_rt_gc_timeout (by default, after one-fifth of the entire array has been processed), but at the latest when the jiffies counter changes its value. The next call continues processing from the same position in the array.

`rt_garbage_collect()`	net/ipv4/route.c

The approach taken by rt_garbage_collect() to deleting cache entries is similar to that of rt_check_expire(), except for the following points:

Instead of checking a fixed number of array elements for old entries, rt_garbage_collect() sets a specific target with regard to the number of entries to be deleted. It stops once this target has been reached (where the time is limited to the interval between two jiffies changes as well).
The second time threshold passed to rt_may_expire() is also specified dynamically; it reduces by half after each passage across the entire array until the target has been reached.
The number of entries to be deleted is computed on the basis of the number used in the last call, at the beginning of rt_garbage_collect(). The computation tries to find a state of equilibrium, where about as many entries are being deleted as are being newly created.

rt_garbage_collect() is a very labor-intensive function, so there must be at least the time distance defined in the configuration variable ip_rt_gc_min_interval between each two calls, except when there is acute lack of space. If this is not the case, then the function returns immediately.

RT Netlink Interface

Routing-cache entries can be read via the RT netlink interface. The corresponding messages are created by ip_rt_dump(), which is invoked by inet_dump_fib() (see above) when cache entries are requested.

In addition, you can simulate a forwarding process (by using the ip route get command) over the RT netlink interface. More specifically, this procedure invokes ip_route_input() or ip_route_output() and returns a result, with the side effect being to create an entry in the routing cache. The corresponding RT netlink message is processed by inet_rtm_getroute() from net/ipv4/route.c, registered in the table inet_rtnetlink_table[] in net/ipv4/devinet.c.

The Proc File System

Some configuration variables, including variables to control the behavior of the cache garbage collection, are mapped to files in the /proc/sys/net/ipv4/route directory. Except for the missing ip_rt_prefix, the file names correspond to those of the variables. The only exception is the flush file, which causes the routing cache to be deleted when it is accessed.

The files are defined by the control structure ctl_table ipv4_route_table[] in net/ipv4/route.c; the function ipv4_sysctl_rtcache_flush(), which handles access to the flush file, is also registered there. The ctl_table structure ipv4_table[] in net/ipv4/sysctl_net_ipv4.c includes a reference to ipv4_route_table[]. This structure is embedded in the central sysctl() tree over net_table[] in net/sysctl.net.c and root_table[] in kernel/sysctl.c.

The file /proc/net/rt_cache lets you read the content of the entire routing cache. The content is formatted by the function rt_cache_get_info() from net/ipv4/route.c, which is registered using proc_net_create() when the routing is initialized in ip_rt_init().

16.3.4 The Forwarding Process

Section 14.2.1 discussed how a forwarding query is embedded into the processing of incoming IP packets: ip_rcv_finish() invokes ip_route_input() to find a dst_entry structure to determine the packet's further route. Section 14.2.2 discussed outgoing packets: their routing decision is made in ip_route_output(), which is invoked by, for example, ip_queue_xmit().

`ip_route_input()`	net/ipv4/route.c

The function ip_route_input() is invoked for each IP packet arriving over a network interface. The parameters are a pointer to the socket-buffer structure, the destination and source addresses, the TOS value, and a pointer to the net_device structure of the receiving network interface.^[4]

^[4] The last four parameters could, alternatively, be worked out from the first. However, because they are passed separately, knowledge about their representation in the socket-buffer structure does not have to be present in ip_route_input(). Though the socket buffer structure is still not entirely treated as an encapsulated "black box," at least only few data elements especially present for routing are accessed.

First, rt_hash_code() is used on the addresses and the TOS value to compute an index in the hash table of the routing cache. If necessary, the list anchored in the chain element is walked through to find a cache entry matching addresses, input interface, TOS value, and fwmark, if present. If this search is successful, then a pointer to the entry is placed as dst in the sk_buff structure, and the task is complete.

If no matching cache entry is found, then either of the two following functions is responsible for further handling:

ip_route_input_mc() is invoked if the destination address is a multicast address. Another prerequisite is that the input interface either belongs to that multicast group or has been configured for multicast routing. The packet can be discarded if this is not the case. The function ip_route_input_mc() will be discussed later in the chapter about IP multicast Section 17.4. What is done there is similar to the procedure for local-destination addresses described below, the only difference being that the packet is always delivered to the local machine rather than causing an FIB query.
ip_route_input_slow() serves to handle "normal" destination addresses and is described next.

Both functions take the same parameters as ip_route_input() itself.

`ip_route_input_slow()`	net/ipv4/route.c

To begin with, an rt_key structure is filled with the parameters passed. However, before it is used to run an FIB query, the addresses are checked for invalid values multicast source addresses, and addresses moving a network prefix beginning with null. Such packets are dropped, and, if verbose messages are configured with CONFIG_IP_ROUTE_VERBOSE, they are registered in the system log. The use of 0.0.0.0 as source and destination addresses in the sense of limited broadcast is explicitly allowed as an exception, for this is occasionally used for automatic network configuration.

Next, the FIB query is started by calling fib_lookup(). If no matching entry is found, then ip_route_input_slow() also aborts processing and returns an error code, which subsequently causes ip_rcv_finish() (from where ip_route_input_slow() was invoked) to discard the packet.

If the routing NAT mentioned in Section 16.1.6 is active, the next step transforms the source address according to the information in the routing rule used (or the destination address, if the route found is an nat route). However, the new addresses are added only to the rt_key structure; the old addresses are maintained in the call parameters of the function and are available for other operations. Once the destination address has been transformed, fib_lookup() has to be invoked once more to find a regular routing entry (no other transformations are permitted) to the new destination address.

Among other things, the result from the FIB query also shows whether the destination address is a local address, which means that the packet is intended for the local system. This case is handled separately in the further process.

Local destination address: A new cache entry is created once the source address has been checked by the calling of fib_validate_source(). The function pointer output() gets the ip_rt_bug() value, because the packet is not allowed to leave the system. Next, the input() pointer is set to ip_local_deliver(), to cause the packet to be delivered to the local machine. Because there is no next router, the rt_gateway element of the cache entry is set to the destination address.
Broadcast addresses, such as the 0.0.0.0 address mentioned above as source and destination or the normally limited broadcast to 255.255.255.255 are detected in the address validation at the beginning and handled identically to local destination addresses. However, fib_validate_source() is not called for 0.0.0.0 addresses.
Nonlocal destination address: Nonlocal destination addresses have to be handled only if the forwarding function for the input interface is enabled, so this is checked first.

If the routing-table entry found describes several output routes, then fib_select_multipath() is called to select one of those routes. Subsequently, the source address is checked by the function fib_validate_source(), which can also consider the output network interface found in this case.

ip_forward() and ip_output() are set in the new cache entry for the input() and output() function pointers. The function rt_set_nexthop() does the necessary assignment to rt_gateway and also fills in other elements of the dst_entry structure that are of interest for forwarded packets only.

rt_intern_hash() is used to integrate the rtable structure, which is almost complete, into the hash table of the routing cache. It also supplies the return value of ip_route_input_slow(), which then is complete.

`ip_route_output()`	include/net/route.h

According to a comment in the route.h header file, where it is implemented as an inline function, the function ip_route_output() is going to be replaced by ip_route_output_key(). Actually, however, it is invoked in many different positions within the network implementation as the main routing interface (e.g., by the IP transmit function ip_queue_xmit() [see Section 14.2.2] or by udp_sendmsg() [see Section 14.2.2 25.3.1] for packets created by UDP).

Its only function currently is to create an rt_key structure with the source and destination addresses, the TOS value, and the output device from the passed parameters and to subsequently invoke the function ip_route_output_key(), which will be described next. The last parameter is the pointer struct rtable **rp, which serves to return the result; it is also passed to ip_route_output_key().

`ip_route_output_key()`	net/ipv4/route.c

The task of ip_route_output_key() is to determine a routing entry for the rt_key structure passed. The procedure is similar to that of ip_route_input(); the routing cache is searched for a matching entry, and the process branches to ip_route_output_slow() only if no such entry is found. Rather than the input interface, the output interface is used for hash computation and comparisons.

`ip_route_output_slow()`	net/ipv4/route.c

ip_route_output_slow() is invoked if no entry for the destination of a locally created IP packet exists in the routing cache. This function runs an FIB query, enters the result in the routing cache, and returns the new entry. In addition, it handles several special cases for which fib_lookup() alone is not sufficient.

As with ip_route_output_key(), the only input parameter is an rt_key structure. This structure initially is copied to a new structure of the same type, which can then be modified without losing the information passed. The iif and scope information is not considered; instead, the loopback device is always assumed for iif, and scope is set to either RT_SCOPE_LINK or RT_SCOPE_UNIVERSE, depending on the RTO_ONLINK flag in the tos element.

The input parameters are first checked for errors or special cases. For example, for multicast destination addresses, a route is created immediately without FIB query if a valid source address is specified, which can be used to identify a usable output device. This special handling simplifies the transmission of multicast packets (and some multicast tools that traditionally utilize this possibility continue to work). For a specified output interface, a matching source address is found, and 127.0.0.1 is used for the destination address, if none is specified.

An FIB query is started once all preparations have been completed. Notice that the process can, in some cases, continue even if this query returns a negative result. In fact, if an output interface was specified during the call, then this interface is used by simply assuming that the destination is in the adjacent network.

The query result can be used to distinguish between local and nonlocal destination addresses. The loopback device is always the output interface used for local addresses, but fib_select_multipath() might have to select one out of several routes, or fib_select_default() might have to choose from several default routes for non-local addresses.

Again, ip_route_output_slow() completes the job by filling a new rtable structure, where the function rt_set_nexthop() is used, similarly to ip_route_input_slow(). The output() function pointer is set to ip_output(), and the input() pointer is set to ip_local_deliver(), if the destination address is in the local system. rt_intern_hash() is used to add the cache entry to the hash table and also yields the return value for ip_route_output_slow().

16.3.1 Routing Rules

Data Structures

Figure 16-6. List with routing rules (initial state). Optional structure entries are not shown.

Initialization and Internal Functions

RT Netlink Interface

Interface to Forwarding Functions

16.3.2 Routing Tables

Data Structures

Figure 16-7. The fib_table structure with references to zones of different prefix lengths.

Figure 16-8. Hash table of a zone with fib_node and fib_info structures.

Managing and Initializing Memory

Managing Hash Structures

Interfaces to the User-Address Space

proc File System

Reacting to Changes in Network Interfaces

Interfaces to the Forwarding Functions

16.3.3 The Routing Cache

Figure 16-9. Data structures of the routing cache.

Interface to Forwarding Functions

Initialization

Cache Garbage Collection

RT Netlink Interface

The Proc File System

16.3.4 The Forwarding Process

Figure 16-7. The `fib_table` structure with references to zones of different prefix lengths.