3.7 Router architecture

This section reviews some of the basic operations and design choices applicable to modern routers. These have a fundamental impact upon price, performance, and the features available from today's commercial router platforms, and, as many engineers recognize, these design choices can significantly influence the network design. Note that here we are primarily interested in packet switching; for an overview of circuit-switching architectures refer to [11].

3.7.1 Router operations

The range of features supported by routers has mushroomed over the last decade, incorporating features such as multiprotocol operations, bridging, and basic firewalling. This is partly because of the drive for tighter integration and also because routers traditionally sit at key boundaries in the network infrastructure. Apart from specialized backbone routers optimized for packet forwarding, routers are becoming almost general-purpose forwarding devices, likely to be involved in the following operations:

Routing information maintenance—Routers participate in the maintenance and manipulation of routing information and typically maintain locally constructed views of the network topology called routing tables. Routers interact with their neighbors, listening for and generating updates as appropriate in order to maintain a real-time view of the network.
Layer 3 switching—When routers forward packets (based on routing information), they must create a new outbound Layer 2 encapsulation (since MAC addresses will differ) and generate a new Layer 2 checksum (such as the Ethernet FCS). TTL or hop counts should be updated and, if appropriate, Layer 3 checksums should be recalculated.
Packet classification—Packets are classified for manipulation, queuing, and possibly filtering. Some routers provide advanced bandwidth management and QoS features.
Recording accounting and billing statistics—for example, interface statistics (packets/bytes in/out, error statistics, flow data, and so on).
Network management—Routers typically support in-band and out-of-band management features such as Telnet, SNMP, ICMP Echo (ping), traceroute, and possibly an HTTP server. Routers may also support RMON.
Security/NAT services—If appropriate, perform address translation and packet filtering.
Tunneling—Routers may originate and terminate secure Virtual Private Networks (VPNs) and possibly carry nonroutable protocols inside tunnels.
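
The Layer 3 switching step above (update the TTL, then recalculate the Layer 3 checksum) can be illustrated for IPv4 with a minimal sketch. The function names are hypothetical, and the code assumes a 20-byte header without options; it is not how any particular router implements the step.

```python
import struct
from typing import Optional


def ipv4_checksum(header: bytes) -> int:
    """One's-complement sum of the header's 16-bit words, complemented."""
    if len(header) % 2:
        header += b"\x00"
    total = sum(struct.unpack(f"!{len(header) // 2}H", header))
    while total >> 16:                      # fold carries back into 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def forward_header(header: bytes) -> Optional[bytes]:
    """Decrement the TTL and rewrite the checksum.

    Returns None when the TTL would reach zero, in which case the
    packet must be discarded rather than forwarded.
    """
    ttl = header[8]                         # TTL is byte 8 of the IPv4 header
    if ttl <= 1:
        return None
    hdr = bytearray(header)
    hdr[8] = ttl - 1
    hdr[10:12] = b"\x00\x00"                # checksum field is zero while summing
    hdr[10:12] = struct.pack("!H", ipv4_checksum(bytes(hdr)))
    return bytes(hdr)
```

A header with a valid checksum sums to zero under ipv4_checksum, which gives a simple way to verify the result. Forwarding hardware commonly updates the checksum incrementally instead of recomputing it over the whole header; the full recomputation is used here only because it is easier to follow.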

Clearly, some of these operations are extremely complex and resource intensive, and for high-performance routing applications (such as enterprise and backbone nodes) it is important that the architecture of these devices is highly optimized for high-speed forwarding.

3.7.2 Router applications

Routers can be loosely classified into three classes, broadly divided by application: backbone routers, enterprise routers, and access routers. Each class of router is designed to deal with a different set of challenges and is, therefore, broadly differentiated based on factors such as performance, software functionality, port density, media support, and price.

Backbone routers

Backbone routers are used on public networks such as the Internet, telco networks, and large private internetworks. The primary issues here are, therefore, reliability and performance. Backbone routers enable organizations to build large, private international internetworks over high-bandwidth, long-haul trunks. They also enable ISPs and telcos to connect enterprise networks over public networks such as the Internet. The key characteristics of a backbone router are summarized as follows:

High-speed trunk connectivity
Performance
Minimal Layer 3 protocol support
Hardware fault tolerance
Software reliability
Management
Cost

In the gigabit range there are products such as the M40 (Juniper) and Cisco 7500. Notably, the M40 supports in-service software upgrades (the ability to upgrade the router code online without bringing down the router). At the top end of this class of routers are the so-called carrier class terabit router platforms, for which there are only a handful of competing vendors (the term carrier class implies extremely high levels of fault tolerance and compliance with the Network Equipment-Building System [NEBS], a Bellcore-originated de facto standard for carrier class equipment). Products in this class include the NX64000 (Lucent), TSR (Avici Systems), Aranea Terabit Router-1 (Charlotte's Web Networks), Pluris 2000 Series TNR (Pluris), Versalar 25000 (Nortel), and the Cisco 12016.

Enterprise routers

Enterprise routers are typically employed within a campus or enterprise network. In this environment there may be thousands of LAN users interfacing with a wide diversity of interfaces. The key characteristics of an enterprise router are summarized as follows:

Bulk port concentration
Multiprotocol support
Service quality
Additional services

The job of designing an enterprise router is not easy; it must resolve the conflicting design goals of providing a rich feature set at each port, reducing the cost per port, and ensuring ease of configuration, QoS support, and strong management and diagnostic features. Examples of routers in this class include the Cisco AGS+ and 7000 series and the Xyplex 9000 Series.

Access routers

Access routers allow home users and small businesses to access the Internet via an Internet Service Provider (ISP). They are also used in remote offices for central-office connectivity. Access networks began life as modem pools attached to terminal concentrators, serving a large number of low-speed dial-up circuits. In recent years this model has changed dramatically with the emergence of high-speed modems, ISDN, xDSL, and cable modems. ADSL is about to add an extra order of magnitude of affordable bandwidth into the SOHO environment, and this will increase the load on access routers yet further. The key characteristics of an access router are summarized as follows:

LAN-WAN interfaces
Multiprotocol support
Configuration management
VPN support
Voice encapsulation

Examples of routers in this class include the Cisco IGS and 2500 series, the AN range (Bay Networks), and Xyplex 3000 series.

Reducing port costs

Both enterprise and access routers are subject to very tight cost constraints, and stiff competition means there is always pressure to minimize the price per port. This ratio depends on factors such as memory size and type, interface logic, processing architecture, and the complexity of the protocol used between the port and the routing processor. Cost reduction is a serious consideration for all components on these platforms, and a constant concern for designers. Considerations include the following:

Processor architecture
Memory
Buffer memory
Control protocol

For the interested reader [12] provides an excellent review of this subject.

3.7.3 Router architecture and design

A router is essentially a specialized network computing device optimized for packet switching and as such typically contains CPU, ROM, RAM, some form of bus, possibly flash media, and NVRAM. In order to implement high-speed routing, the architecture of a generic router embodies several key components: a routing processor, switching fabric, and interface cards providing one or more input ports and output ports. While this may seem of academic interest to the casual observer, hard-bitten routing engineers understand all too well the implications of router architecture on performance and scalability in real networks. In the routing world, performance is often critical, especially as we go up the food chain into the backbone domain. The performance characteristics of a router become very exposed when CPU-intensive features such as quality of service are required on top of basic packet forwarding.

Packet switch evolution

We can broadly divide packet switches into three generations, as follows:

First-generation packet switches—Characterized by a general-purpose processor (e.g., a Pentium) distributing packets to dumb interface cards from main memory. The main bottlenecks are CPU and memory access speeds and possibly the interface adaptor. Such switches are cheap and easy to build and give relatively low performance.
Second-generation packet switches—Characterized by more intelligent interface cards, which perform input/output queuing and are capable of offloading the CPU by distributing packets locally (i.e., by performing port mapping) over a shared bus. There is typically a fast-path and slow-path mode of operation, depending upon whether a packet can be switched or routed. The main bottleneck is shared bus contention. Typically these are high-end routers (e.g., the Cisco 7500) and ATM switches (e.g., FORE Systems ASX-200) or hybrid devices such as the Ipsilon (now part of Nokia) IP switching products (a combination of an IP router with an ATM switch, which enables flows to be classified and switched or routed dynamically).
Third-generation packet switches—Characterized by the ability to perform parallel packet transfers over multiple buses (i.e., a switch fabric). The switch fabric is self-routing; once the port mapper hands over the packet to the fabric it is automatically routed to the appropriate output port where it is queued. The main bottlenecks are the input queue arbiter and the output queue controller (depending upon whether the design has buffers largely at the input or output ports). Typically these are high-end ATM switches or high-end routers with ATM-like switching fabrics.

We will now briefly review the key components of router architecture and their potential impact on features and performance.

Routing processor

The routing processor participates in routing protocols and creates a forwarding table that is used in packet forwarding. The routing processor also runs the software used to configure and manage the router. It handles any packet whose destination address cannot be found in the forwarding table in the line card. Route processing can be implemented in several configurations, including the following:

Centralized, general-purpose processor
Centralized, general-purpose processor with hot standby or load-sharing twin processor
Centralized, general-purpose processor with subsidiary bit-slice processors to prefilter traffic
Centralized, general-purpose processor with auxiliary processors dedicated to specific tasks
Fully distributed processing

The architecture employed depends on a number of factors—the class of router, level of fault tolerance required, hardware specification, price, and the following:

Single processor—Single-processor systems are simpler to implement, but there is a threshold above which throughput will plateau as the offered load increases. Dedicated router platforms can offset this effect (referred to as the von Neumann bottleneck) by limiting the number and speed of interfaces serviced; employing multiple, subsidiary, or auxiliary processors; or enabling load sharing through clustering or stackable routers. Many routers opt for standard CISC processors such as the Intel x86 and Pentium range or the Motorola 680xx series. RISC processors (as implemented in Proteon's CNX 500 router) have largely failed to deliver the level of performance expected.
Subsidiary processors—Several router vendors have invested effort into offloading the main processor by using subsidiary Bit-Slice Processors (BSPs) on interface cards to prefilter traffic. Another area of development is the implementation of more sophisticated multiple bus architectures. Some of these implementations (such as the Cisco cBus) are essentially facelifts, designed to improve performance in existing product lines without a major redesign and to allow router vendors to move from small numbers of 10 Mbps Ethernet interfaces up to high-performance, higher-density interfaces for technologies such as FDDI, HSSI, and Fast Ethernet.
Coprocessors—Some router vendors have implemented auxiliary processors to improve performance by offloading nonrouting tasks, such as the Command-Line Interface (CLI), network management, compression, encryption, and encapsulation. This approach was used by Proteon in its CNX600 router design, where the central RISC CPU was supplemented by two AMD 29000 processors.
Multiprocessors—To date symmetric multiprocessor-like architectures have offered the highest switching rates for routing applications. Some router architectures compromise full symmetric multiprocessing by utilizing one CPU for packet forwarding and the other for general housekeeping. Even if one CPU is idle and the other busy, load cannot be shared because the tasks handled differ. Reference [13] provides some useful insights into why parallel processor designs have to date been applied only for niche, high-end platforms.

Another factor in these design options is whether the routing database itself is centralized, distributed but shared, or fragmented (distributed among routing processors, each with its own routing database). For backbone and large enterprise routers handling many interfaces it is advantageous to have a centralized or closely coupled distributed routing database; however, this requires either substantial horsepower and resources or a highly sophisticated multiprocessor architecture. Some mid-range enterprise and access routers employ a collapsed routing architecture, where each route processor acts like an independent router inside the chassis, connected over a common medium fabric (often internalized Ethernet, FDDI, or ATM cores). Although some highly flexible designs have been deployed, this may mean that internal routes (i.e., routes between cards in the chassis) appear in the routing table just like external routes, potentially slowing down route lookup.

Interface modules

An interface module (sometimes called an IO card or line card) is the physical point of attachment for local and wide area circuits to the router. Interface cards typically support one or more physical ports, depending upon the media type, the amount of logic required on the card, and the size of the connectors. Incoming packets are fed into an input port, and outgoing packets are fed to an output port. Many media technologies operate in full duplex mode, so in some cases the input and output ports may refer to the same physical interface, differentiated only by the direction of traffic flow.

Switching fabric

The switching fabric interconnects input ports with output ports. If the switching fabric has a bandwidth greater than the sum of the bandwidths of the input ports, then packets are queued only at the outputs, and the router is called an output-queued router. Otherwise, queues may build up at the inputs, and the router is called an input-queued router. Fabrics may be physically represented as a traditional backplane (typically one set of large connectors at the back of the chassis) or as a midplane (two opposing sets of connectors in the middle of the chassis). Midplane chassis tend to offer greater flexibility but at the cost of additional complexity. One thing to take special note of is the physical connector presentation. Ideally the backplane/midplane should present female interfaces, since this has the least potential for damage internally. If this is not the case, you run the risk of having to replace the whole chassis if you inadvertently damage the pins of a male connector.

Bus fabric

The traditional and simplest switching fabric to implement is a bus, which interconnects all the input and output ports. It is relatively easy to build a router today using a personal computer equipped with multiple line cards. The problems with this architecture are that the bus is a single point of failure and throughput is constrained by both the capacity of the bus and the overheads imposed by bus arbitration logic.

Shared-memory fabric

In a shared-memory router incoming packets are stored in a shared memory and only pointers to packets are switched. This increases switching capacity. However, there is an intrinsic performance limitation with shared-memory switch fabrics. The speed of the switch is limited to the speed at which we can access memory. Although memory sizes have roughly doubled every 18 months, memory access times have improved by only about 5 percent annually. Unlike crossbar switches, however, shared-memory switches do not suffer from Head-of-Line (HOL) blocking, and multicasting is relatively straightforward to implement.
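
The idea that only pointers are switched can be sketched as follows. This is a minimal illustration under stated assumptions: the class name, the fixed-size buffer pool, and the reference counting used for multicast are all hypothetical choices, not a description of any particular product.

```python
from collections import deque
from typing import List, Optional, Sequence


class SharedMemorySwitch:
    """Packets are stored once; output queues hold buffer indices only."""

    def __init__(self, num_ports: int, num_buffers: int) -> None:
        self.buffers: List[Optional[bytes]] = [None] * num_buffers
        self.refcount = [0] * num_buffers
        self.free = deque(range(num_buffers))          # indices of empty buffers
        self.out_queues = [deque() for _ in range(num_ports)]

    def receive(self, packet: bytes, out_ports: Sequence[int]) -> bool:
        """Store the packet once and queue its index on each output port."""
        if not out_ports or not self.free:
            return False                               # nothing to do, or pool exhausted
        idx = self.free.popleft()
        self.buffers[idx] = packet
        self.refcount[idx] = len(out_ports)            # >1 means multicast
        for port in out_ports:
            self.out_queues[port].append(idx)
        return True

    def transmit(self, port: int) -> Optional[bytes]:
        """Send the next packet queued for this port, freeing the buffer
        once every destination port has sent its copy."""
        if not self.out_queues[port]:
            return None
        idx = self.out_queues[port].popleft()
        packet = self.buffers[idx]
        self.refcount[idx] -= 1
        if self.refcount[idx] == 0:
            self.buffers[idx] = None
            self.free.append(idx)
        return packet
```

Multicast here costs one extra queue entry per destination rather than one extra copy of the packet. Every receive and transmit touches the same memory, however, which is the access-speed limit described above.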

Crossbar fabric

The crossbar fabric is the simplest form of switching fabric, comprising an N × N matrix of N input buses, N output buses, and N² crosspoints (this can be visualized as 2N buses). A crossbar radically improves performance by enabling multiple simultaneous data paths to operate through one or more switching fabrics and is at best case N times faster than a second-generation switch.

On the downside, the control status of each crosspoint must be under continuous control for every flow of packets transferred in parallel across the crossbar fabric, and this requires a scheduler. Therefore, with the crossbar architecture, although much improved, the scheduler ultimately constrains performance through the fabric. Furthermore, although a crossbar is internally nonblocking, if multiple packets at the input ports want to go to the same destination then output blocking occurs. This can be resolved by either running the crossbar N times faster than the input ports (difficult and expensive) or by placing buffers inside the crossbar. Crossbars do not scale well. Multicasting is complex in a crossbar switch and requires the switch fabric to be quiet during the period of the multicast.
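
The role of the scheduler and the effect of output blocking can be sketched for a single time slot. The function below is a deliberately simple greedy arbiter with a rotating starting input; the name and the request format are assumptions made for illustration, and production schedulers are considerably more elaborate.

```python
from typing import Dict, List, Optional


def schedule_crossbar(requests: List[Optional[int]], start: int = 0) -> Dict[int, int]:
    """Choose which crosspoints to close for one time slot.

    requests[i] is the output wanted by the packet at input i, or None
    if input i is idle. Each output is granted to at most one input.
    Inputs are scanned from `start` so the favored input can rotate
    from slot to slot.
    """
    n = len(requests)
    granted: Dict[int, int] = {}
    busy_outputs = set()
    for k in range(n):
        i = (start + k) % n
        out = requests[i]
        if out is not None and out not in busy_outputs:
            granted[i] = out                # close crosspoint (i, out)
            busy_outputs.add(out)
    return granted
```

With requests [2, 2, 0, None], inputs 0 and 1 both want output 2. Only one of them is granted in the slot, and which one depends on `start`; the other is output blocked and must wait, while input 2 proceeds to output 0 in parallel.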

Hybrid media fabrics

During the mid-1990s some innovative designs emerged that used pseudo-media buses inside the chassis. Some of these devices also implemented a kind of symmetric multiprocessing (promoted by BAY Networks and Xyplex). Some of these implementations were closer to real symmetric multiprocessing than others. For example, Xyplex implemented a highly flexible midplane (as opposed to backplane) architecture in their 9000 series, using three internal Ethernet buses. This design is a form of multiprocessing; however, each route processor operates like a discrete router, even to the point of exchanging routing protocol over the internal media buses. A number of IO ports are dedicated to each processor, and ports can be soft-switched internally or externally, enabling a high degree of flexibility in the design.

This design offers an impressive number of topology options and appears to offer scalability and good fault tolerance. Scalability is limited only by the speed of the internal bus, and individual slots could run different processors to match different types of applications. Each slot is autonomous; in the event of a route processor failing, only the IO ports connected to that slot are affected. Hot swapping is available. Managed load-sharing power supplies are available. The disadvantages are that this is an expensive architecture to implement. Each slot has its own OS and routing processor and requires large amounts of RAM as we move into the backbone domain. Another disadvantage is that routing is taking place internally (e.g., OSPF peering across internal buses) and route entries for internal networks appear in the routing tables.

ATM-like fabrics

In the early 1990s designers began to tackle large ATM switch architectures. Although ATM did not dominate the networking world as predicted, designers of IP routers saw an opportunity to improve throughput and offer (conceptually) mixed-media services by wrapping segmentation and reassembly modules around ATM-like switching fabric cores. With this architecture ATM PVCs are mapped between all ports in the router (i.e., a full-mesh ATM backbone in a box). During the forwarding process, a longest-prefix match determines the destination port for an IP packet. The packet is then segmented into 53-byte ATM cells and switched over the appropriate PVC to the output port, where it is reassembled back into an IP packet before transmission. From the IP perspective the ATM core could be viewed as a transparent fat pipe; however, this architecture does not allow the possibility of providing different service guarantees, and these designs suffer some of the inherent problems of ATM, including the following.

ATM switches do not generally provide good support for multicasting. Multicasting requires an ATM VCI to be mapped to multiple VCIs and cells to be replicated either at the input port or within the ATM switch fabric. This degrades the overall efficiency of the switching fabric.
A subtle problem occurs because traffic control algorithms are usually specified in terms of packets rather than in terms of cells. So, with cell-based fabrics, implementing semantics such as those required by shared filters in RSVP can be a challenge.
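
The segmentation step described above also has a bandwidth cost inside the fabric, which a short calculation makes visible. The sketch assumes AAL5-style framing (an 8-byte trailer, with padding up to a whole number of 48-byte cell payloads); the text does not say which adaptation layer these designs use, so treat the trailer size as an assumption. Any further encapsulation header is ignored.

```python
ATM_CELL_BYTES = 53       # 5-byte header + 48-byte payload
ATM_PAYLOAD_BYTES = 48
AAL5_TRAILER_BYTES = 8    # assumption: AAL5 framing, not stated in the text


def cells_for_packet(ip_len: int) -> int:
    """Number of cells needed to carry one IP packet of ip_len bytes."""
    framed = ip_len + AAL5_TRAILER_BYTES
    return -(-framed // ATM_PAYLOAD_BYTES)     # ceiling division


def fabric_efficiency(ip_len: int) -> float:
    """Fraction of fabric bandwidth that carries IP packet bytes."""
    return ip_len / (cells_for_packet(ip_len) * ATM_CELL_BYTES)
```

Under these assumptions a 40-byte packet fits in one cell, a 41-byte packet needs two, and a 1,500-byte packet needs 32. Efficiency therefore depends heavily on packet size, with the worst case just above each cell boundary.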

Despite these issues, ATM fabrics have appeared in many recent router designs. The interested reader is referred to [11, 12, 14] for more details of switch design and buffering techniques.

Issues in switch design

Buffering

One of the hardest problems in switch design today is not the fabric design but the buffering component. In short, there are three main buffering techniques, as follows:

Input queuing is where packets are queued at the input port and pulled off the queue when access to the switching fabric is gained via an arbiter. With FIFO input queuing, however, packets may suffer from head-of-line blocking, which can be avoided by using separate input queues for each output port.
Output queuing is where packets are queued at the output ports, awaiting scheduling for release onto external interfaces. The problem arises when all packets at the input ports are destined for the same output port, since that output must then accept N packets in a single arrival time. Costs can be reduced by assuming a probability for N inputs requiring access to a single output simultaneously, scaling the output port speed accordingly (e.g., the so-called knockout switch; see [14]).
Shared memory is where input and output ports share a common memory. The output port scheduler removes the packet from memory and passes it to the appropriate output port. Such a device is easy to build, since only headers are switched; however, an N × N switch must process N packets in a single packet arrival time, and, since memory bandwidth is usually a major bottleneck, this limits the size of the switch.
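
Head-of-line blocking in FIFO input queuing, and the benefit of separate per-output input queues, can be seen in a small slot-by-slot model. This is a minimal sketch: the function names are hypothetical, each input sends at most one packet per slot, each output accepts at most one, and the arbiter is a simple greedy scan rather than an optimized matching algorithm.

```python
from collections import deque
from typing import List, Sequence


def drain_fifo(inputs: Sequence[Sequence[int]]) -> int:
    """Slots needed when each input may offer only its head packet.

    inputs[i] lists, in arrival order, the output port of each packet
    waiting at input i.
    """
    queues = [deque(q) for q in inputs]
    slots = 0
    while any(queues):
        busy = set()
        for q in queues:
            if q and q[0] not in busy:      # head packet's output is still free
                busy.add(q.popleft())
        slots += 1
    return slots


def drain_voq(inputs: Sequence[Sequence[int]]) -> int:
    """Slots needed when each input keeps a separate queue per output,
    so a packet for a free output can bypass a blocked one."""
    queues: List[List[int]] = [list(q) for q in inputs]
    slots = 0
    while any(queues):
        busy = set()
        for q in queues:
            for k, out in enumerate(q):
                if out not in busy:         # first packet whose output is free
                    busy.add(out)
                    del q[k]
                    break
        slots += 1
    return slots
```

Take two inputs holding packets for outputs [0, 1] and [0, 2]. With FIFO queuing the second input's packet for output 2 waits behind its blocked head packet, so draining takes three slots; with per-output queues it takes two.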

For further information on buffering techniques, the interested reader is referred to [14]. For details on performance improvements and arbitration algorithm design in relation to input queuing, see [15, 16].

Route lookup

Perhaps the single most important performance issue for backbone routers is route lookup time. The lookup must return the most specific (longest) IP prefix that matches the destination address. Done naively, by comparing the address against every entry in a table of N prefixes, this would be an O(N) operation per packet just to build a set of candidates, which is completely unacceptable at backbone speeds. This issue is amplified if small packets represent a high proportion of the traffic, since this ultimately means more lookups per second. The two main factors that affect the speed of a route lookup algorithm are memory access time and the data structures used to construct the forwarding table. Performance can be improved in a number of ways, including the following:

Optimized data structures—Trie-based algorithms [17] optimize storage space at the expense of performing more memory lookups, and, as memory prices drop, this is perhaps the wrong way to go. Recent research indicates that routing tables are essentially quite stable, requiring updates only about once every two minutes [18].
Hardware-oriented techniques—The most common hardware-oriented solutions to lookup problems are Content Addressable Memories (CAMs) and caches. Both techniques scale poorly with routing table size and, therefore, cannot be used for backbone routers. One approach to solve this problem integrates logic and memory together in a single device, dramatically reducing memory access time. Another solution is to increase the amount of memory used for the routing table [19], although cost would possibly preclude this for enterprise and access routers. One problem is that as the table grows it becomes very hard to update. Reference [19] describes inexpensive, special-purpose hardware that can be used to perform rapid updates.
Table compaction techniques—These techniques create complicated but compact data structures for the forwarding table. The table is stored in the primary cache of a processor, enabling route lookup at gigabit speeds. Reference [20] describes such an algorithm.
Hashing techniques—Hash-based solutions are commonly used for fast lookup problems, but there are problems applying hashing techniques to a forwarding table (given a destination IP address, we do not know in advance what prefix should be used for finding the longest match). Reference [21] offers a scalable, hash-based algorithm to look up the longest prefix match for an N bit address in O(log N) steps. This algorithm computes a separate hash table for each possible prefix length, and, instead of naively searching for a successful hash, this algorithm does a binary search on the prefix lengths.
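
To make the longest-prefix match problem concrete, the sketch below contrasts a linear scan of the table with a simple binary trie, in which a lookup costs at most one step per address bit regardless of table size. It is an illustration only: it handles IPv4, uses an uncompressed one-bit-per-level trie built from dictionaries, and is not the algorithm of [17], [20], or [21]. The class and function names are hypothetical.

```python
import ipaddress
from typing import Dict, Optional


def lpm_linear(table: Dict[str, str], addr: str) -> Optional[str]:
    """O(N) scan: test every prefix and keep the longest one that matches."""
    ip = ipaddress.ip_address(addr)
    best_len, best_hop = -1, None
    for prefix, next_hop in table.items():
        net = ipaddress.ip_network(prefix)
        if ip in net and net.prefixlen > best_len:
            best_len, best_hop = net.prefixlen, next_hop
    return best_hop


class BinaryTrie:
    """One trie level per address bit; nodes are dicts keyed by 0 or 1,
    with the next hop stored under the key "hop"."""

    def __init__(self) -> None:
        self.root: dict = {}

    def insert(self, prefix: str, next_hop: str) -> None:
        net = ipaddress.ip_network(prefix)
        bits = int(net.network_address)
        node = self.root
        for i in range(net.prefixlen):
            node = node.setdefault((bits >> (31 - i)) & 1, {})
        node["hop"] = next_hop

    def lookup(self, addr: str) -> Optional[str]:
        bits = int(ipaddress.ip_address(addr))
        node = self.root
        best = node.get("hop")                  # default route, if any
        for i in range(32):
            node = node.get((bits >> (31 - i)) & 1)
            if node is None:
                break
            best = node.get("hop", best)        # remember the deepest match
        return best
```

Both functions return the same next hop for any address. The difference is cost: the scan grows with the number of prefixes, while the trie walk is bounded by the 32 bits of the address but makes one memory reference per level.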

Instead of reducing the cost of route lookups, backbone routers can use two techniques to avoid route lookups altogether, as follows:

Map destination addresses to VCIs at the edge—Backbone networks such as ATM provide a Virtual Circuit Identifier (VCI). If edge devices map IP destination addresses to a VCI, then the longest prefix match problem can be avoided in backbone routers. Since VCIs are integers drawn from a small pool, they can be looked up with a single memory access. This approach, however, requires the edge devices to somehow distribute address-to-VCI mappings via protocols such as IP switching and tag switching.
Table size reduction through routing hierarchy—Another technique is to have backbone routers maintain routes purely for destinations served by that backbone. Since table size is greatly reduced, route lookups are considerably faster. All unknown destinations are routed to gateways at a Network Access Point (NAP), which holds the global routing table. This approach promotes scalability (in particular for BGP).
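
The first technique works because a VCI can index a table directly, with no search. A minimal sketch follows, assuming a 16-bit VCI space and a table entry holding an output port and an outgoing VCI; the class and method names are hypothetical.

```python
from typing import List, Optional, Tuple

VCI_BITS = 16   # assumption: a 16-bit identifier space


class VciTable:
    """Direct-indexed table: the VCI itself is the array index."""

    def __init__(self) -> None:
        self.entries: List[Optional[Tuple[int, int]]] = [None] * (1 << VCI_BITS)

    def bind(self, vci: int, out_port: int, out_vci: int) -> None:
        """Install a mapping, as the edge-driven control protocol would."""
        self.entries[vci] = (out_port, out_vci)

    def switch(self, vci: int) -> Optional[Tuple[int, int]]:
        """Single indexed read; no prefix comparison is involved."""
        return self.entries[vci]
```

The table is small and fixed in size, so the per-packet cost is constant. The complexity moves into the control protocol that must keep the mappings consistent across devices.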

There are clearly a number of useful solutions to the fast route lookup problem, and several of these techniques will no doubt be incorporated into new generations of router designs. An excellent review of router design trends is given in [12].

3.7.4 Scheduling techniques

The simplest method of scheduling packets from a buffer is to release them in the order of arrival, a strategy commonly referred to as first come, first served (FCFS) or first in, first out (FIFO). Many first-generation routers provided a single FIFO as their only scheduling option (perhaps with one additional queue for control and management traffic). With mixed local and wide area routing there can be a much greater imbalance in line speeds, leading to frequent buffer overflows or timeouts if the amount of remote data is too large to be sustained.

Traffic prioritization and queuing mechanisms have become increasingly sophisticated over recent years and are closely associated with service guarantees. In addition to FCFS/FIFO, queuing strategies currently deployed include priority queuing, custom queuing, Weighted Fair Queuing (WFQ), and Random Early Detection (RED).
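
The difference between FIFO and priority queuing can be shown in a few lines. The sketch below implements strict priority queuing only; the class name and the convention that a lower number means higher priority are assumptions made for illustration, and the other strategies listed above are not modeled.

```python
import heapq
from itertools import count
from typing import Any, List, Optional, Tuple


class PriorityScheduler:
    """Strict priority queuing: always release the highest-priority
    packet waiting, and use arrival order within a priority level."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, Any]] = []
        self._seq = count()                 # arrival counter, breaks ties FIFO

    def enqueue(self, packet: Any, priority: int) -> None:
        # Lower number = higher priority (a convention chosen for this sketch).
        heapq.heappush(self._heap, (priority, next(self._seq), packet))

    def dequeue(self) -> Optional[Any]:
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]
```

If packets arrive in the order bulk1, voice1, bulk2, voice2, with voice at priority 0 and bulk at priority 2, a FIFO releases them in that same order, whereas this scheduler releases voice1, voice2, bulk1, bulk2. The drawback of strict priority is that sustained high-priority traffic can starve lower classes, which is one motivation for weighted schemes such as WFQ.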

3.7.5 Router operating systems

Historically, routers have been viewed as hardware devices optimized for high-speed packet forwarding; consequently, the operating systems implemented in these systems were often not much more than stripped-down schedulers. Many router operating systems are derived from early versions of FreeBSD UNIX, and there is increasing interest today in Linux. Both of these environments have access to a huge suite of essentially free routing software developed over many years and publicly available on the Internet. Over the past decade, however, router hardware has become almost a commodity, and the software component of the router has been increasingly viewed as the true value item. Router vendors such as Cisco have differentiated themselves largely on increasingly rich and sophisticated software functionality, ranging from huge tracts of protocol support to exotic queuing systems and bandwidth management features.