2.2 Embedded Communications Software

Host machines running general-purpose operating systems are not the best platforms for building communications devices. Even though some routers have been built on top of UNIX and Windows NT, they have seen limited use in the Internet. These routers perform all processing in software and have to work within the constraints of the general-purpose operating system, for example, equal-priority, timeslice-based scheduling, which can result in packet processing delays. Moreover, these general-purpose systems are often used to run other application code at the same time as the networking application.

The solution is to use dedicated communication hardware, or an “appliance,” which could be a router, switch, terminal server, remote access server, and so on. For our discussion, we call this dedicated appliance an embedded communications device, to differentiate it from the host, which participates in the communication.

Common characteristics of a communications appliance are as follows:

  • It typically runs a real-time operating system

  • It has limited memory and flash

  • It either has limited disk space or is diskless

  • It provides a terminal and/or Ethernet interface for control and configuration

  • It frequently has hardware acceleration capability

2.2.1 Real-Time Operating System

The real-time operating system (RTOS) is the software platform on which communications functionality and applications are built. Real-time operating systems may be proprietary (homegrown) or obtained from a commercial vendor; examples include VxWorks™ from Wind River Systems, OSE™ from OSE Systems, and Nucleus™ from Mentor Graphics.

The embedded market still relies heavily on homegrown platforms. Often, engineers from the desktop world are surprised that so many embedded development engineers still develop their own real-time operating systems. Some commercial real-time operating systems have too much functionality or too many lines of code for a given embedded project. Perhaps more significant, many commercial real-time operating systems require high license fees. The per-seat developer’s license can be expensive, and royalty payments may be required for each product shipped with the RTOS running on it. Some engineering budgets are too tight to handle this burden. Last, some engineers feel they can do a better job than an off-the-shelf software component—since the component may or may not be optimal for the application, especially if the RTOS vendor does not provide source code. However, building and maintaining your own RTOS is rarely a simple task.

RTOSes for Communications Systems

Early communications architectures were quite simple. The entire functionality was handled by a big dispatch loop—when data arrived, it was classified and appropriate actions performed. The operations were sequential, and there was no interaction between the modules which performed the actions. In this case, an RTOS would have been overkill and the dispatch loop was more than sufficient. However, communications systems are rarely this simple. They require interaction between functional modules at various layers of the OSI Seven Layer Model. If, say, TCP and IP were implemented as separate tasks, they would need an inter-task communication mechanism between them, which could be in the form of queues, mailboxes, or shared memory. If the two tasks needed to access some common data such as a table, semaphores might be used to protect access to the table. Commercial real-time operating systems provide these mechanisms and functions, and there is no need to recreate them.
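As an illustration, consider how the TCP and IP tasks above might interact using such primitives. The following sketch assumes a generic RTOS; the rtos_queue_* and rtos_sem_* calls, the packet structure, and the helper routines are hypothetical stand-ins for whatever queue and semaphore services the chosen RTOS actually provides.

/* Hypothetical RTOS types and primitives -- stand-ins for the queue and
 * semaphore services a commercial RTOS provides. */
typedef struct rtos_queue *rtos_queue_t;
typedef struct rtos_sem   *rtos_sem_t;
#define WAIT_FOREVER (-1)

extern int rtos_queue_send(rtos_queue_t q, void *msg, unsigned len, int timeout);
extern int rtos_queue_receive(rtos_queue_t q, void *msg, unsigned len, int timeout);
extern int rtos_sem_take(rtos_sem_t s, int timeout);
extern int rtos_sem_give(rtos_sem_t s);

/* Packet handed from the IP task to the TCP task. */
typedef struct packet {
    unsigned char *data;                 /* start of the TCP segment    */
    unsigned int   len;                  /* segment length in bytes     */
} packet_t;

extern rtos_queue_t ip_to_tcp_q;         /* created at system init      */
extern rtos_sem_t   conn_table_sem;      /* protects a shared table     */
extern void tcp_process_segment(packet_t *pkt);   /* reads/writes table */
extern void packet_free(packet_t *pkt);

/* IP task: after stripping the IP header, pass the segment up. */
void ip_deliver_to_tcp(packet_t *pkt)
{
    if (rtos_queue_send(ip_to_tcp_q, &pkt, sizeof(pkt), 0) != 0)
        packet_free(pkt);                /* queue full: drop, don't block */
}

/* TCP task main loop: wait for segments and update the shared table
 * under the semaphore so the two tasks never collide on it. */
void tcp_task(void)
{
    packet_t *pkt;

    for (;;) {
        rtos_queue_receive(ip_to_tcp_q, &pkt, sizeof(pkt), WAIT_FOREVER);

        rtos_sem_take(conn_table_sem, WAIT_FOREVER);
        tcp_process_segment(pkt);
        rtos_sem_give(conn_table_sem);

        packet_free(pkt);
    }
}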

RTOS vendors typically support the complete development environment, including the processor used for the project, the “tool chain” like compilers, assemblers, and debuggers, and other development tools. RTOS vendors have also tuned their operating systems to utilize some of the features of the processor being used for the development—something that would take developers on a homegrown project a much longer time to accomplish and is usually beyond the scope of the development project.

It is strongly recommended that developers of communication systems use commercially available real-time operating systems and place engineering focus on communications development rather than infrastructure.

Typically, homegrown projects end up adding more systems software to handle common infrastructure functions that a commercial RTOS provides out of the box. Furthermore, newly developed OS software tends to be more complicated, convoluted, and buggy, making it more expensive to develop and maintain than a commercial RTOS. The following table summarizes the issues to consider when deciding between a commercial RTOS and developing your own for a specific project.

Issue | Standard RTOS | Proprietary RTOS
----- | ------------- | ----------------
Performance for a specific application | Less optimized | More optimized
Maintenance | Responsibility of RTOS vendor | Responsibility of developer
Portability to multiple processors | Provided by RTOS vendor via separate packages | Has to be provided by developer for each processor
Support for standard Ethernet/serial devices | Provided as part of RTOS package for board | Has to be developed
Modifiability | Can be done only if RTOS vendor provides source code | Easily modifiable, since source code is available internally
Tool chain support | Supported by RTOS vendor | Needs to be built using third-party development tools
Standard interfaces, IPC mechanisms, and APIs | Provided by RTOS vendor | Has to be designed in by the developer
Cost | High in dollar terms | Low in upfront dollar terms, but could be high because of development effort/time and debugging

Board Support Package (BSP)

A typical embedded communications platform has a CPU, flash or DRAM from which the code runs, DRAM for data structures, and peripherals like serial and Ethernet controllers. The CPU requires an initialization sequence, after which it can perform a diagnostic check on the various peripherals. This software is CPU and board specific and is usually included in the RTOS package specific to the board. This software, including the RTOS, is also known as a Board Support Package (BSP). For Common Off The Shelf (COTS) communications boards, the BSP is provided by the board manufacturer via a license from the RTOS vendor. Alternatively, the RTOS vendor can provide BSPs for several popular COTS boards along with the RTOS developer license.

For boards developed internally, engineers have to create the BSP, which is not a trivial task. Most vendors offer board porting kits, which instruct engineers on how to create and test a homegrown BSP. RTOS vendors and commercial system integrators often provide board support as a consulting service.

Once the BSP is created, it allows an executable image of the RTOS or a portion of the RTOS to run on the target board and can be linked in with the communications application to form the complete image.

2.2.2 Memory Issues

Embedded communications devices rarely have a disk drive, except when they need to store a large amount of data. These systems boot off a PROM (or flash) and continue their function. In a typical scenario, the boot code on flash (see Figure 2.4) decompresses the image to be executed and copies it to RAM. Once decompression is complete, the boot code transfers control to the entry point of the image in RAM, from which the device continues to function.

Figure 2.4: Boot sequence using ROM/Flash and RAM.
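A minimal sketch of this sequence is shown below. The flash and RAM addresses, the decompress() routine, and the initialization helpers are illustrative assumptions; real boot code is CPU and board specific.

/* A minimal sketch of the Figure 2.4 boot sequence.  Addresses and helper
 * routines are illustrative assumptions, not a particular vendor's boot ROM. */

#define FLASH_IMAGE_BASE  0xFF020000u    /* compressed image in flash   */
#define RAM_LOAD_ADDR     0x00100000u    /* where the image executes    */

typedef void (*entry_fn_t)(void);

extern void cpu_init(void);              /* caches, memory controller   */
extern void run_diagnostics(void);       /* quick RAM/peripheral check  */
extern void decompress(const unsigned char *src, unsigned char *dst);

void boot_main(void)
{
    cpu_init();
    run_diagnostics();

    /* Decompress the stored image from flash into RAM... */
    decompress((const unsigned char *)FLASH_IMAGE_BASE,
               (unsigned char *)RAM_LOAD_ADDR);

    /* ...then transfer control to its entry point in RAM. */
    ((entry_fn_t)RAM_LOAD_ADDR)();       /* never returns */
}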

The embedded communications system comprises the RTOS and the communication application. The communication application is the complete set of protocol and systems software required for a specific function—for example, a residential gateway implementation. The software application and the RTOS use the RAM for data structures and dynamic buffer and memory requirements.

We earlier discussed memory protection in user and kernel modes on a UNIX host. The most significant difference between UNIX and RTOSes is that there is no defined kernel mode for execution in RTOSes. Memory protection is enforced between tasks where appropriate, but that is all. Interestingly, many of the popular early RTOSes did not support memory protection between tasks at all. This did not limit the development engineers’ flexibility during testing and debugging. In the field, however, these systems ran the risk that user tasks with buggy code could corrupt the RTOS system code and data areas, causing the system to crash.

Recently, RTOSes with memory protection have been seeing use in communications systems. Note that memory corruption bugs can manifest themselves in several indirect ways. For example, a task can overwrite the data structures used to maintain the memory pool. The system will crash only when malloc is next called, causing much grief to the developer, who cannot figure out why the system crashed or which task is responsible. Our recommendation is to use memory protection if it is available, via a memory management unit (MMU) in the hardware and support in the RTOS.

Configuration and Image Download

A communications system needs to be able to save its current configuration so that, if the system goes down, it can read the last saved configuration and use it to start up and be operational as soon as possible. For this reason, the communications device configuration is often stored in local non-volatile memory like an EEPROM or flash. Alternatively, the configuration can be saved to, and restored from (on restart), a remote host. In the same context, these systems also permit the download of a new image to upgrade the existing image on the flash. Most communications systems require an extensive amount of configuration to function effectively. For example, a router needs to be given the IP addresses of its interfaces; routing protocols need to be configured with timer values, peer information, and so on. Losing this configuration on reset would require the user to set these parameters all over again, which is not only a demand on network administrator time but also carries the potential for misconfiguration.

Frequently, the flash system is used for field upgrades to avoid shipping the system back to the manufacturer for bug fixes and upgrades. To ensure that a new image does not cause the system to become non-operational, systems provide a backup to the last saved image. That is, there will be two sections on the flash, one for the existing saved image and the other for the newly downloaded image.

The new image is downloaded to a separate area of the flash so that the system can recover if the download is unsuccessful or if the new image has problems. If the new image were instead to overwrite the existing image, the system could not recover from such an error without manual intervention. A key feature of telecom systems deployed in service provider networks is their ability to perform a stepwise upgrade and to roll back to the previous version of the software if the upgrade is unsuccessful.
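The following sketch outlines one way such a two-bank scheme might be implemented. The bank layout, the flash driver calls, and the boot-parameter helpers are assumptions for illustration only.

/* Sketch of a two-bank flash upgrade.  The new image goes into the
 * inactive bank and is only marked bootable after it verifies, so a
 * failed or corrupt download leaves the running image untouched;
 * rollback simply re-selects the previous bank. */

enum { BANK_A = 0, BANK_B = 1 };

extern int  flash_erase_bank(int bank);
extern int  flash_write_bank(int bank, const unsigned char *buf, unsigned len);
extern int  image_checksum_ok(int bank);
extern int  boot_params_get_active_bank(void);
extern void boot_params_set_active_bank(int bank);

int download_new_image(const unsigned char *img, unsigned int len)
{
    int active   = boot_params_get_active_bank();    /* bank in use now  */
    int inactive = (active == BANK_A) ? BANK_B : BANK_A;

    flash_erase_bank(inactive);
    if (flash_write_bank(inactive, img, len) != 0)
        return -1;                   /* download failed; old image intact */

    if (!image_checksum_ok(inactive))
        return -1;                   /* corrupt image; old image intact   */

    /* Commit: the next reboot runs from the new bank. */
    boot_params_set_active_bank(inactive);
    return 0;
}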

2.2.3 Device Issues

Unlike PCs or workstations, embedded communications devices do not come with a monitor and a keyboard. The only way to communicate with the “headless” embedded device is through a serial port or Ethernet. Headless devices can be booted and continue to operate without a keyboard and monitor. To communicate with the device through a serial port, you need to connect a terminal or use a PC with a terminal emulation program (like HyperTerminal on a Windows PC). The communications device typically has a Command Line Interface (CLI), which allows the user to type commands for configuring the device, displaying status and statistics, and so on.

In addition, an Ethernet port is typically used for communicating with the device for management purposes. The Ethernet port can be used to boot up a system as well as download new software versions. The TCP/IP stack runs over this Ethernet port, so that the network administrator can telnet to the device over this port to access the CLI. This facility is used by network administrators to manage the device from a remote location.

Device Drivers

Some RTOSes have their own communications stacks integrated or available as an additional package. These stacks interface to the hardware drivers through standard interfaces. Usually, the only hardware driver provided is the default Ethernet driver for the device used for management communication. Other drivers need to be written for specific ports to be supported by the application.

Standard driver interfaces ensure that the higher layer stacks such as IP will use the same interfaces irrespective of the device they are running on. For example, as in UNIX, some RTOSes use the following set of calls to access drivers, independent of the type of driver (Ethernet, Serial, and so on).

  • open() causes the device to be made active

  • close() results in the device being made inactive

  • read() is used for reading data received by the device

  • write() is used for writing data to the device

  • ioctl() is used for configuring and controlling the device

Applications using this interface will not need to be modified when moving to a new platform, as long as the new driver provides the same interface.
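As a sketch of how a higher layer might use this interface, consider bringing up an Ethernet port for the management path. The device name "/eth0", the ioctl command code, and ip_input() are assumptions for illustration; the point is that the stack sees only open/read/write/ioctl, whatever controller sits underneath.

#include <fcntl.h>      /* open() flags; the exact header depends on the RTOS */
#include <unistd.h>     /* read(), close()                                    */
#include <sys/ioctl.h>  /* ioctl()                                            */

#define ETH_SET_MTU  1                   /* hypothetical ioctl command code */

extern void ip_input(unsigned char *frame, int len);   /* hand up to stack */

void ethernet_port_task(void)
{
    unsigned char frame[1518];
    int fd = open("/eth0", O_RDWR);      /* make the device active      */
    if (fd < 0)
        return;

    ioctl(fd, ETH_SET_MTU, 1500);        /* configure the controller    */

    for (;;) {
        int n = read(fd, frame, sizeof(frame));   /* data received      */
        if (n > 0)
            ip_input(frame, n);
    }
    /* close(fd) would make the device inactive again (not reached here). */
}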

2.2.4 Hardware and Software Partitioning

Partitioning functionality between hardware and software is an engineering decision, driven by the available technology and by cost and performance constraints. For example, a DSL modem/router may implement some compression algorithms in software instead of hardware, to keep the cost of the hardware lower. This need not be a restriction if the performance of the software implementation is sufficient for normal operation. As technology progresses, it may become easier and less expensive to incorporate such functionality into a hardware chip. Developers should therefore provide flexibility in their implementation via clearly defined interfaces, so that an underlying function can be implemented in either software or hardware.

Size–Performance Tradeoff

Students of computer science and embedded systems will be familiar with size and performance issues and how they sometimes conflict with each other. Caching is a common size–performance tradeoff made in the embedded communications space. A lookup table can be cached in high-speed memory for faster lookups; instead of caching the entire table, only the most recently accessed entries might be cached. While this improves the performance of the system, there is an associated cost in terms of additional memory usage and the complexity of the caching algorithms.
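A sketch of this tradeoff is shown below: a small, direct-mapped cache of recently used forwarding entries, intended to live in fast memory, in front of the full table in DRAM. The sizes, the hash, and fib_full_lookup() are illustrative assumptions.

#define FWD_CACHE_SLOTS 256              /* small enough for SRAM       */

struct fwd_entry {
    unsigned int dest_ip;                /* key                         */
    unsigned int next_hop;
    unsigned int out_port;
    int          valid;
};

extern struct fwd_entry fib_full_lookup(unsigned int dest_ip);  /* DRAM table */

static struct fwd_entry fwd_cache[FWD_CACHE_SLOTS];

struct fwd_entry *fwd_lookup(unsigned int dest_ip)
{
    struct fwd_entry *e = &fwd_cache[dest_ip % FWD_CACHE_SLOTS];

    if (e->valid && e->dest_ip == dest_ip)
        return e;                        /* hit: answered from the cache */

    /* Miss: consult the full table in DRAM and refill this slot.  The
     * extra memory and this refill logic are the cost of the speedup. */
    *e = fib_full_lookup(dest_ip);
    e->valid = 1;
    return e;
}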

Depending on application and system requirements, memory is used in different ways. For example, the configuration to boot up a system could be held in EEPROM. The boot code can be in a boot ROM or housed on flash. DRAM is typically used to house the executable image if the code is not executing from flash.

DRAM is also used to store the packets/buffers as they are received and transmitted. SRAM is typically used to store tables used for caching, since caching requires faster lookups. SRAM tends to be more expensive and occupies more space than DRAM for the same number of bits.

High-speed memory is used in environments where switching is performed using shared memory. Dual-port memory is used for receive and transmit buffer descriptors in controllers such as the Motorola PowerQUICC™ line of processors.

For cost-sensitive systems, the incremental memory cost may not be justifiable even if it results in higher performance. Conversely, when performance is key, the incremental complexity of a design that uses SRAM for caching may be justified.

Fast Path and Slow Path

When designing communications systems, the architecture needs to be optimized for the fast path. This is the path followed by most (i.e., normal) packets through the system. From a software perspective, it is the code path optimized for the most frequently encountered case(s).

Consider a Layer 2 switch, which switches Ethernet frames at Layer 2 between multiple LAN segments. The same switch also acts as an IP end node for management purposes. The code path should be optimized so that switching is done at the fastest rate possible, since that is the main function of the system. If some Ethernet frames are destined to the switch itself (say, SNMP packets to manage the switch), these packets will not be sent through the fast path. Rather, they will be processed at a lower priority, i.e., they will follow the slow path.

start sidebar
Host Operating Systems versus RTOSes

Host operating systems like UNIX or Linux are seeing deployment in some embedded communications devices, though they were not originally tuned for real-time applications. The following is a checklist of issues to consider when choosing between host and real-time operating systems. The Linux operating system is chosen as an example of a desktop OS for this purpose.

Evaluation Criterion | Choice of Linux or RTOS
-------------------- | -----------------------
Does the application “really” need real-time performance? (E.g., if the application tasks are scheduled only periodically and most of the time-critical functions are handled via hardware, there is really no need to go for an RTOS.) | Linux
Offers standard APIs (like the socket API) for applications | Linux, or an embedded RTOS if it offers the same APIs
Possibility of modifying the Linux scheduler to be “closer” to real time | Linux, if the modification is possible
Availability for a specific hardware platform | Commercial RTOSes support more platforms
Tool chain support | Commercial RTOSes have better tool chain integration
Cost | Linux has no upfront fee or royalties

From the table, it is clear that an open source operating system like Linux is a growing threat to the RTOS business. We recommend that developers choose an OS platform by using evaluation criteria similar to the ones above.

end sidebar

Another example of fast- and slow-path processing is the handling of IP packets with options in a router. The router normally forwards IP packets from one interface to another based on the destination address in the packet. However, the IP protocol defines some optional parameters, called IP options, that can be present in the IP header. One such option is the Record Route option, where the router has to record its IP address in the designated space in the IP header. This will indicate that this router was on the path that the packet took to reach the destination. Options are typically used for diagnostic purposes; most packets will not include options.

If the IP forwarding logic is done in software, the fast-path software is the code optimized to handle packets without options. If this software sees a packet with options, it hands the packet off to the slow-path software and returns, so that it can process the next packet. The slow-path software, typically in a separate task, handles the packet when it is scheduled.
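A sketch of this dispatch might look like the following. It relies on the fact that an IPv4 header without options is exactly five 32-bit words (an IHL of 5); the packet structure, the queueing helper, and the forwarding routines are assumptions for illustration.

struct pkt {
    unsigned char *l3_start;             /* first byte of the IP header */
    unsigned int   len;
};

extern void fast_forward(struct pkt *p);       /* optimized no-options path */
extern int  slow_path_enqueue(struct pkt *p);  /* queue to slow-path task   */
extern void pkt_free(struct pkt *p);

void ip_input_dispatch(struct pkt *p)
{
    unsigned char ver_ihl = p->l3_start[0];     /* version (4 bits) + IHL */

    if ((ver_ihl & 0x0F) == 5) {
        fast_forward(p);                 /* common case: no options      */
        return;
    }

    /* Rare case: options present (e.g., Record Route).  Hand the packet
     * off and return immediately so the next packet can be processed. */
    if (slow_path_enqueue(p) != 0)
        pkt_free(p);                     /* queue full: drop, don't stall */
}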

The separation between the fast path and slow path is the basis for hardware acceleration, discussed next.

2.2.5 Hardware Acceleration

All the information presented so far about networking software assumed that it runs on a single, general-purpose processor (GPP). These are processors like the ones used in workstations; they include the MIPS™ and PowerPC™ lines of processors, which have a strong RISC (Reduced Instruction Set Computer) bias.

While these processors are powerful, there is always a limit to the performance gain from a software-only implementation. For devices with a small number of lower-speed interfaces like 10/100 Mbps Ethernet, these processors may be sufficient. When data rates increase, and/or when there are a large number of interfaces to be supported, software-based implementations are unable to keep up.

To increase performance, networking equipment often includes hardware acceleration support for specific functions. This acceleration typically happens on the fast-path function. Consider a Layer 2 switch, which requires acceleration of the MAC frame forwarding function. A Gigabit Ethernet switch with 24 ports, for example, can be built with the Ethernet switching silicon available from vendors like Broadcom, Intel or Marvell along with a GPP on the same board. The software, including the slow-path functions, runs on the GPP and sets up the various registers and parameters on the switching chipset.

These chips are readily available to any Ethernet switch design from the semiconductor manufacturer and are known as merchant silicon. They come with specifications, application notes, and even sample designs for equipment vendors to build into their systems. In the case of an Ethernet switching chip, the specification describes how the device connects to an Ethernet MAC or to an Ethernet PHY (if the MAC is already included on the chip, as is often the case). Hardware designers can quickly build their boards around such a silicon device.

The Broadcom BCM5690 is one such chip. It implements hardware-based switching between 12 Gigabit Ethernet ports, so two 5690s can be used to build a 24-port Gigabit Ethernet switch. Once the 5690s have been programmed via the CPU, the switching of frames happens without CPU intervention.

Software vendors are provided with the programming interfaces to these devices. For example, for a switching device, we may be able to program new entries into a Layer 2 forwarding table. These interfaces may be provided through a software library from the silicon vendor when the vendor does not want to disclose the internal details of the chip. Often, the details of the chip are provided, so software engineers can directly program the registers on the chip for the required function. Independent of the method used to program the device, software performance can be enhanced by offloading the performance-intensive functions to the hardware device.
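A sketch of the kind of wrapper involved is shown below; it hides whether a Layer 2 forwarding entry is installed through a vendor-supplied library or by writing the chip’s registers directly. Every name in it (vendor_l2_add(), the register offsets, and so on) is an illustrative assumption, not the actual Broadcom programming interface.

#define L2_MAC_LO_REG  0x1000            /* hypothetical register offsets */
#define L2_MAC_HI_REG  0x1004
#define L2_CMD_REG     0x1008
#define L2_CMD_INSERT  0x80000000u

extern int  vendor_l2_add(const unsigned char mac[6], int vlan, int port);
extern void chip_write(unsigned int reg, unsigned int val);
extern int  chip_poll_done(unsigned int reg);

int l2_table_add(const unsigned char mac[6], int vlan, int port)
{
#ifdef USE_VENDOR_LIB
    /* Binary library: used when the vendor keeps register details closed. */
    return vendor_l2_add(mac, vlan, port);
#else
    /* Register-level programming: used when the chip is fully documented. */
    chip_write(L2_MAC_LO_REG, ((unsigned)mac[2] << 24) | ((unsigned)mac[3] << 16) |
                              ((unsigned)mac[4] << 8)  |  (unsigned)mac[5]);
    chip_write(L2_MAC_HI_REG, ((unsigned)vlan << 16) |
                              ((unsigned)mac[0] << 8) | (unsigned)mac[1]);
    chip_write(L2_CMD_REG,    L2_CMD_INSERT | (unsigned)port);
    return chip_poll_done(L2_CMD_REG);   /* wait for the insert to complete */
#endif
}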

In summary, hardware acceleration is used for the fast-path processing.

ASICs

Not all hardware acceleration devices are available as merchant silicon. Some equipment vendors find that merchant silicon does not address all their performance requirements or support the number of ports required. For example, a design may require MAC and switching functionality for 48 Gigabit Ethernet ports on a single line card. Merchant silicon may not be able to satisfy this requirement, so an Application Specific Integrated Circuit (ASIC) needs to be developed. When designing such a chip, engineers can add functionality specific to their system—in our example, this could include additional functionality for Layer 3 and Layer 4 switching, which may not be available in merchant silicon.

While an ASIC is very efficient for the functions needed, it is quite expensive to develop. It typically takes about nine months to develop and, depending upon the tools and the engineering effort needed, can cost millions of dollars. The upside is that the equipment vendor now has a proprietary chip that provides superior functionality and performance to any merchant silicon and thus provides competitive differentiation. Several communications equipment vendors do not use merchant silicon for their core products; they maintain large engineering teams dedicated to working on custom chips. ASICs are also commonly used when the technology is not yet proven, or when the vendor has a proprietary twist on a technology.

Some merchant silicon devices, like the Broadcom BCM5690 and the Marvell Prestera line of products, are also known as Net ASICs or configurable processors, to distinguish them from network processors, discussed next.

Network Processors

Network processors (NPs) are another type of network acceleration hardware, available from vendors such as Agere, AMCC, IBM, Intel, and Motorola. A network processor is essentially a “programmable ASIC” optimized to perform networking functions. At high speeds, a processor has very little time to process a packet before the next packet arrives, so the common functions that a packet processor performs are optimized and implemented in a reduced instruction set in a network processor. The performance of an NP is close to that of an ASIC, but it offers greater flexibility, since new “microcode” can be downloaded to the NP to perform specific functions.

Programmable hardware is important because networking protocols evolve, requiring changes in packet processing. The hardware needs to analyze multiple fields in the packet and take action based on those fields. For example, for an application like Network Address Translation (NAT), there may be fields in the packet which need to be manipulated to reflect an address change performed by the address translation device. These manipulations are implemented via functions called Application Layer Gateways (ALGs), which are present in the address translation device implementing NAT. ALGs are dynamic entities, and more are defined as new applications are developed. Since address translation performance should not degrade as more and more ALGs are added, programmable hardware like a network processor is a good fit for implementing NAT.

2.2.6 Control and Data Planes

A high-level method of partitioning the functionality of the system is to separate it into functions that perform:

  1. All the work required for the basic operation of the system (e.g. switching, routing)

  2. All the work required for (1) to happen correctly

The classical planar networking architecture model uses this partitioning scheme to separate the communications functionality into three distinct planes (see Figure 2.5):

  • Control Plane

  • Data Plane

  • Management Plane

    Figure 2.5: Classical planar networking architecture.

The data plane is where the work required for the basic operation of the system takes place. The control and management planes ensure that the data plane operates correctly.

The control plane is responsible for communicating with peers and building up tables for the data plane to operate correctly. The functions include peer updates, signaling messages, and algorithmic calculations. The control plane functions are typically complex and might also involve conditional execution. Examples of control plane protocols include the Open Shortest Path First (OSPF), Signaling System 7 (SS7) in the telecom world, signaling protocols like Q.933 in Frame Relay, and so on.

The data plane is responsible for the core functions of the system. For example, in the case of a router, the data plane performs IPv4 forwarding based on tables built up by control protocols such as OSPF. Data plane functions are relatively simple and repetitive; they involve little or no conditional execution and none of the complex calculations found in the control plane (e.g., the OSPF protocol requires a complex Shortest Path First calculation based on information obtained from protocol updates).

The management plane spans both the control and data planes and is responsible for the control and configuration of the system. This is the part of the system that performs housekeeping functions; it also includes functions to change configuration and to obtain status and statistics. Functions like SNMP, Command Line Interface (CLI), and HTTP-based management operate on the management plane.

2.2.7 Engineering Software for Hardware Acceleration

Hardware acceleration is typically used in the data plane, usually for fast-path processing. Software on the data plane is responsible for initializing, configuring, and programming the hardware used for acceleration. It also handles the slow-path processing.

When writing communications software for the data plane, it is essential to partition functionality so that hardware acceleration can be added very quickly. Consider an IPSec implementation. IPSec is used in the TCP/IP world for securing communications between hosts or between routers. It does this by adding authentication and encryption functionality to the IP packet. The contents of the authentication and encryption headers are determined by security algorithms which could be implemented in hardware or software. An algorithm like the Advanced Encryption Standard (AES) can be implemented in a security chip, like those from Broadcom and HiFn.

An IPSec implementation should be written to simply make function calls to the encryption or authentication algorithm, without needing to know whether the function is implemented in software or hardware. In Figure 2.6, an encryption abstraction layer provides this isolation for the IPSec module; the IPSec module will not change when moving to a new chipset or to software-based encryption.

Figure 2.6: Encryption abstraction layer for an IPSec module.
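One way to realize such an abstraction layer is a table of function pointers that the IPSec module calls, filled in at initialization with either the software routines or the driver for the security chip. The structure and names below are illustrative assumptions, not a particular vendor’s API.

struct crypto_ops {
    int (*encrypt)(const unsigned char *key, unsigned int keylen,
                   unsigned char *buf, unsigned int len);
    int (*decrypt)(const unsigned char *key, unsigned int keylen,
                   unsigned char *buf, unsigned int len);
};

/* Two possible bindings behind the same interface (both hypothetical). */
extern const struct crypto_ops sw_aes_ops;    /* pure software AES        */
extern const struct crypto_ops hw_chip_ops;   /* offload to security chip */

static const struct crypto_ops *crypto = &sw_aes_ops;

/* Called at initialization once we know whether a security chip is present. */
void crypto_select(int hw_present)
{
    crypto = hw_present ? &hw_chip_ops : &sw_aes_ops;
}

/* IPSec outbound processing never knows which implementation it got. */
int ipsec_encrypt_payload(const unsigned char *key, unsigned int keylen,
                          unsigned char *payload, unsigned int len)
{
    return crypto->encrypt(key, keylen, payload, len);
}

With this arrangement, moving to a new security chip means supplying a new crypto_ops table; the IPSec module itself does not change.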

Software with Hardware Acceleration—A Checklist

The following is a checklist for engineering software in the presence of hardware acceleration. The underlying premise is that the software is modular, so it will be efficient with and without acceleration.

  • Design the code to be modular so that functions in the data plane can be easily plugged in or out depending upon hardware or software implementation.

  • Separate the fast-path and slow-path implementations of the data plane up front.

  • Maximize performance in the data plane even if it is a software-only implementation—it is very likely that the software implementation is sufficient for some low-end systems. This includes efficient search algorithms and data structures.

  • Handle all exception processing in software.

  • Ensure that interrupt processing code is efficient—for example, read counters and other parameters from the hardware via software as fast as possible, since they run the risk of being overwritten.

  • Do not restrict the design such that only certain parts of the data plane may be built in hardware. Network processor devices can move several data plane functions into software.

  • Ensure that performance calculations are made for the system both with and without hardware acceleration—this is also one way to determine the effectiveness of the partitioning.

  • When interfacing to hardware, use generic APIs instead of direct calls to manipulate registers on the controller. This will ensure that only the API implementations need to be changed when a new hardware controller is used. Applications will see no change since the same API is used.

The use of hardware acceleration is a function of the product requirements and the engineering budget. While it is certainly desirable, developers should not depend upon it as a way to fix all the performance bottlenecks in their code.


