4.3 Network Software

In order for applications to use the network, applications need to access the network via a set of software. This software stack will provide a range of functionality, and will exist in a number of forms. At the highest level, there are communication libraries, for example, MPI implementations. These are typically used by applications because they provide a transport and platform independent interface to communication. (These libraries won't be discussed in detail in this chapter; see Chapters 8–11 for details.) At a slightly lower level, protocol stacks are used. These protocol stacks provide transport properties like reliable message delivery, ordered message delivery, message framing, and flow control. The lowest level of network software is the driver layer. Network drivers interact directly with network interfaces to control transmission and receipt of packets on the network.

4.3.1 Network Protocols

Network protocols are a series of procedures used to setup and conduct data transmissions between a group of machines. Such protocols abstract the physical transmission medium to provide some portability to applications. Protocols are used to provide various properties to network communications sessions. Note that not all protocols provide all of these properties, and the following list is by no means exhaustive.

Media contention: work around collisions and other physical errors.
Addressing: A station addressing scheme that is network layer independent.
Fragmentation: A means to break down messages into smaller pieces (called datagrams) for transmission, and reassemble them at the receiver.
Reliable delivery: A means for the client to determine if transmission completed properly or an error has occurred.
Ordered delivery: Messages are delivered in order to the application from end to end.
Flow control: Transmission can be slowed to improve performance or prevent the exhaustion of resources at the destination or along the route to the destination.

Most applications will actually use a combination of network protocols in the course of communications. This means that all protocols do not need to provide all of the above properties. For example, the IP protocol only provides an addressing scheme and message fragmentation: the IP protocol provides an addressing scheme that allows a message to be delivered to another end station and that it is fragmented and reassembled if necessary. Most IP applications also use either TCP or UDP. TCP is used when reliability and ordered delivery are desired. The following are descriptions of a number of common network protocols and the properties they implement.

Ethernet provides media collision detection and avoidance. The Ethernet protocol also provides an addressing scheme. Each client uses a 48-bit address, assigned by the vendor of its network interface.

IP is a protocol that provides the features of addressing and fragmentation. Addressing is implemented in the following way. Each client address has a 32 bit address, broken into a network address and a host address. Network addresses are used to route packages from one network segment to another. Fragmentation is implemented using a identification field in the header. IP also includes a header field that specifies the transport layer protocol as well. This will in most cases be either TCP or UDP, but other protocols can be used as well.

The IP protocol must be adapted to the underlying physical network type. IP addresses must be able to be mapped to physical network addresses. In the case where IP is used on top of Ethernet, the address resolution protocol (ARP) is used to determine the Ethernet address of the intended recipient. This process consists of a broadcasted query for the MAC address of an IP address. The owner of that IP address will respond with the MAC address. This value is cached. At this point, IP can be used on top of Ethernet transparently. See Section 5.2 for a more detailed discussion of IP, TCP and UDP.

TCP specifies a set of steps required to establish a communication session. Once this is established, it provides reliable, in-order delivery of messages.

UDP provides about the same functionality as IP. It is generally used so that an application can implement its own network protocol for reliable delivery of messages. UDP is also used in cases where reliable message delivery is not as important as low latency or jitter. UDP is frequently used for streaming audio and video.

GM is the driver, firmware, and user-space library used to access Myrinet interfaces. It provides all of the properties necessary to use the network for reliable communications. Addressing is implemented in GM using interface hardware addresses and a routing table that exists on each node. This routing table has a set of source routes for all nodes on the network. Fragmentation is not necessary, as GM messages are not limited in size. GM also implements reliable, in-order message delivery. Because of the switched nature of myrinet switch complexes, media contention is not an issue.

The kernel driver providing support for GM on Myrinet interfaces also provides Ethernet emulation. This means that protocols like Ethernet and IP (TCP and UDP) can be run over Myrinet hardware.

4.3.2 Network Protocol Stacks

Network protocol stacks are the software implementations of the network protocols mentioned in the previous subsection. These implementations are typically operating system specific. Many of these are implemented inside of the kernel, but this is not universally the case. These stacks provide a syscall interface for user-space programs. The most common example of this interface is the socket interface used by all IP-based protocols. An application will set up a socket, and then send and receive data using this socket. All of these function calls are implemented as system calls. The network stack uses network drivers to actually send and receive data. The purpose of the syscall layer is to provide portability between different implementations of facilities provided by the kernel. This layer is tightly coupled with network drivers, as it is the sole consumer of their functions.

4.3.3 Network Drivers

Network drivers are the software that allows network interface hardware to be used by the kernel, network protocol stacks, and ultimately user applications. Network drivers have a few responsibilities. First, the driver will initialize the network card, so that it can be enabled. This setup consists of internal setup like on-card register initialization, but also includes external setup like link auto-negotiation. After these steps are complete, the network hardware should be initialized. This does not mean the interface is completely configured, as some configuration processes like DHCP use the network interface itself to configure settings like IP addresses.

The driver also provides functions necessary to send and receive packets via the network. The send functions are typically called from a protocol stack. The set of transmission steps is as follows: an application makes a system call, providing data to be sent. This data is processed by the network protocol stack. The protocol stack calls functions provided by the driver to copy the data across the I/O bus and actually transmit the data.

When receiving data, the network interface will receive data from the network. It will then do some amount of processing of the data. This processing varies from card to card. Some cards implement parts of the protocol stack in hardware in order to improve performance. When the card is finished processing the packets received from the network, it causes an interrupt. This causes the kernel to call functions defined in the network interface driver. These functions are called interrupt handlers. An interrupt handler will copy the data from the network interface to system main memory, via the I/O bus. At this point, the network protocol stack finishes processing the packets, and copies the data out to the application.

The process of servicing interrupts is very invasive; it typically causes other operations to be preempted. Under high network receive load, this causes the primary computational task of the system to be frequently stopped. As context switches are not free, this constant switch comes at a high performance price. In order to address this issue, most high-end (gigabit and custom network) NIC manufacturers have implemented interrupt mitigation, or coalescing strategies. This means that the NIC will buffer some number of processed packets before issuing an interrupt. This means that instead of interrupting after every packet is processed, the NIC may only issue an interrupt after 10 or 100 packets. This allows the spend less time switching between the network and computational task, and more time executing the user's application. See Sections 5.5.4 and 5.5.5 for more information about driver performance settings and techniques.

4.3.4 Network Software In Action

In general, cluster use is characterized by the execution of parallel applications. These applications consist of many instances of the same application, running on multiple cluster nodes simultaneously. These instances of the application use the system area network to communicate. These messages are typically used for coordination between instances of the parallel application.

When communication occurs, a complex series of actions is performed. First, the application makes a library call to initiate message transfer. This call usually does a variety of things; it will frame the message and potentially split the message into multiple packets if the message size is too large. At this point, the packet is passed to the network driver stack. The data is transferred across the I/O bus to the NIC.

The network controller transmits the packets to the network controller on the intended recipient. The packet reaches receiving network controller, where it is processed by the hardware and processed by the network driver stack. These packets are reconstituted into the original message by the protocol stack, and this message is passed to the application when if calls a receive function in its messaging library.