6.2 Processes on the Data-Link Layer

   


As was mentioned in the beginning of this chapter, the data-link layer forms the connecting layer between drivers or network devices and the higher world of protocol instances. This section gives an overview of the processes on the data-link layer. We will explain what activity forms play an important role on this layer and how the transition between them occurs. Section 6.2.1 describes the process involved when a packet arrives, and Section 6.2.2 discusses how a packet is sent. First, however, we introduce the activity forms and their tasks in the Linux network architecture.

Figure 6-2 gives an overview of the activity forms in the Linux network architecture. As compared with earlier kernel versions, Version 2.4 and up introduced significant performance improvements. Above all, the use of software interrupts in place of the poorly scaling bottom halves brings a clear performance increase on multiprocessor systems. As shown in Figure 6-2, we can distinguish between the following activities:

  • Hardware interrupts accept incoming data packets from the network adapters and introduce them to the Linux network architecture (per Chapter 5). To ensure that the interrupt can terminate as quickly as possible (see Section 2.2.2), incoming data packets are put immediately into the incoming queue of the processing CPU, and the hardware interrupt is terminated. The software interrupt NET_RX_SOFTIRQ is marked for execution to handle these packets further.

  • The software interrupt NET_RX_SOFTIRQ (for short, NET_RX soft-IRQ) assumes subsequent (not time-critical) handling of incoming data packets. This includes mainly the entire handling of protocol instances on layers 2 through 4 (for packets to be delivered locally) or on layers 2 and 3 (for packets to be forwarded). This means that most of the protocol instances introduced in Chapters 7 through 25 run in the context of NET_RX soft-IRQ.

    Packets destined for an application are handled by the NET_RX soft-IRQ up to the kernel boundary and then forwarded to the waiting process. At this point, the packet leaves the kernel domain.

    Packets to be forwarded are placed into the outgoing queue of a network device by the layer-3 protocol used (or by the bridge implementation). If the NET_RX soft-IRQ has not yet spent more than one tick (1/HZ) handling network protocols, it immediately tries to send the next packet from this queue. If this succeeds, the packet is handled to the point where it is passed to the network adapter. (See Chapter 5 and Section 6.2.2.)

  • The software interrupt NET_TX_SOFTIRQ (for short, NET_TX soft-IRQ) also sends data packets, but only if it was marked explicitly for this task. This case occurs, among others, when a packet cannot be sent immediately after it has been put into the output queue, for example because it has to be delayed for traffic shaping. In such a case, a timer is responsible for marking the NET_TX soft-IRQ for execution at the target transmission time (see Section 6.2.2), which then transmits the packet.

    This means that the NET_TX soft-IRQ can transmit packets in parallel with other activities in the kernel. It primarily assumes the transmission of packets that had to be delayed.

  • Data packets to be sent by application processes are handled by system calls in the kernel. In the context of a system call, a packet is handled by the corresponding protocol instances until it is put into one of the output queues of the sending network device. As with NET_RX soft-IRQ, this activity tries to pass the next packet to the network adapter immediately after the previous one.

  • Other activities of the kernel (tasklets, timer handling routines, etc.) do various tasks in the Linux network architecture. However, unlike the tasks of the activities described so far, they cannot be clearly classified, because they are activated by other activities upon demand. In general, these activity forms run tasks at a specific time (timer handling routines) or at a less specified, later time (tasklets).

  • Application processes are not activities in the operating-system kernel. Nevertheless, we mention them here within the interplay of the kernel's activities, because they initiate kernel activities by system calls and because incoming packets are delivered to them.

Figure 6-2. Activity forms in the Linux network architecture.

graphics/06fig02.gif


In earlier kernel versions, where no software interrupts were yet available, their task was performed by the net bottom half (NET_BH). Unfortunately, NET_BH could not run on two processors in parallel, so it performed considerably worse than software interrupts, which can run in parallel on several CPUs.
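For orientation, the following condensed excerpt shows where the two soft-IRQs are bound to their handling routines. It is a sketch based on net_dev_init() in net/core/dev.c of kernel 2.4; all other initialization steps of that function are omitted.

/* Condensed sketch based on net_dev_init() (net/core/dev.c, kernel 2.4):
   the two network soft-IRQs are bound to their handling routines here.
   All other initialization steps are omitted. */
static int __init net_dev_init(void)
{
        /* ... initialize softnet_data[] per CPU, /proc entries, etc. ... */

        open_softirq(NET_TX_SOFTIRQ, net_tx_action, NULL);  /* transmit side */
        open_softirq(NET_RX_SOFTIRQ, net_rx_action, NULL);  /* receive side  */

        return 0;
}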

The next two sections describe how packets are received and sent in the data-link layer, but details with regard to network adapters, which were introduced in Chapter 5, will be discussed only superficially.

6.2.1 Receiving a Packet

The path of each packet not generated locally in the computer begins in a network adapter or a comparable interface (e.g., the parallel port in PLIP). This port receives a packet and informs the kernel about its arrival by triggering an interrupt. The following process in the network driver was described in Chapter 5, but we will repeat it here briefly for the sake of completeness.

If the transmission was correct, then the path of a packet through the kernel begins at this point (as in Figure 6-3). Up to the moment the interrupt is triggered, the Linux kernel has had nothing to do with the packet. This means that the interrupt-handling routine is the first activity of the kernel that handles an incoming packet.

Figure 6-3. The path of a packet in the data-link layer of the Linux kernel.

graphics/06fig03.gif


  • When it has received a packet correctly, the network adapter triggers an interrupt, which is handled by the interrupt-handling routine of the network driver. For the example driver described in Section 5.3 (drivers/net/isa_skeleton.c), this is the method net_interrupt(). As soon as the interrupt has been identified as signaling an incoming packet, net_rx() is responsible for further handling. If the interrupt was caused not by an incoming packet, but by notification that a data transmission has completed, then net_tx() continues the processing.

    net_rx() uses dev_alloc_skb(pkt_len) to obtain a socket-buffer structure and copies the incoming packet from the network adapter into its packet-data space. (See Chapter 4 and Section 5.3.) Subsequently, the pointer skb->dev is set to the receiving network device, and the type of the payload contained in the layer-2 frame is determined. For this purpose, Ethernet drivers can use the method eth_type_trans(); similar methods exist for other MAC technologies (FDDI, token ring). (A sketch of such a receive routine follows this list.)

    The demultiplexing process takes place in the LLC layer at this point. The exact process will be explained in Section 6.3.1.

  • netif_rx() completes the interrupt handling. First, the current time is entered in skb->stamp, and the socket buffer is placed in the input queue. In contrast to earlier versions of the Linux kernel, there is no longer one single queue named "backlog"; instead, each CPU stores "its" incoming packets in the structure softnet_data[cpu].input_pkt_queue. This means that the processor that handles the interrupt always stores the packet in its own queue. This mechanism was introduced to avoid kernel-wide locking of a single input queue.

    Once the packet has been placed in the queue, the interrupt handling is complete. As was explained in Section 2.2.2, the handling routine of a hardware interrupt should run only the operations that are absolutely required, so that other activities of the computer (software interrupts, tasklets, processes) are not interrupted unnecessarily.

    Incoming packets are further handled by the software interrupt NET_RX_SOFTIRQ, which replaces the net bottom half (NET_BH) used in earlier versions of the Linux kernel. NET_RX_SOFTIRQ is marked for execution by __cpu_raise_softirq(cpu, NET_RX_SOFTIRQ). This mechanism is similar to bottom halves, but the use of software interrupts allows much more parallelism and thus enables better performance on multiprocessor systems. (See Section 2.2.3.)
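The following fragment sketches how these steps fit together in a driver receive routine, in the style of the example driver from Section 5.3. The parameters buf and pkt_len stand in for the hardware-specific accesses with which the real driver reads the frame and its length from the adapter; the code illustrates only the sequence described above and is not the original driver source.

#include <linux/string.h>
#include <linux/netdevice.h>
#include <linux/etherdevice.h>
#include <linux/skbuff.h>

/* Sketch of a driver receive routine (kernel 2.4 interface). buf and pkt_len
   replace the hardware-specific register and buffer accesses of a real driver. */
static void net_rx(struct net_device *dev, unsigned char *buf, int pkt_len)
{
        struct sk_buff *skb;

        /* Obtain a socket-buffer structure for the frame (see Chapter 4). */
        skb = dev_alloc_skb(pkt_len);
        if (skb == NULL)
                return;                 /* out of memory: drop the frame */

        skb->dev = dev;                 /* note the receiving network device */

        /* Copy the frame from the adapter into the packet-data space. */
        memcpy(skb_put(skb, pkt_len), buf, pkt_len);

        /* Determine the type of the payload in the layer-2 frame. */
        skb->protocol = eth_type_trans(skb, dev);

        /* Enqueue the buffer in softnet_data[cpu].input_pkt_queue and
           mark NET_RX_SOFTIRQ for execution. */
        netif_rx(skb);
}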

The path of a packet initially ends in the queue for incoming packets. The interrupt handling is now terminated, and the kernel continues the interrupted activity (process, software interrupt, tasklet, etc.). When the process scheduler (schedule() in kernel/sched.c) is invoked again after a certain time, it first checks whether a software interrupt has been marked for execution. This is the case here, so it uses do_softirq() to start the marked soft-IRQ. The following section assumes that this is the NET_RX soft-IRQ:

net_rx_action()

net/core/dev.c


net_rx_action() is the handling routine of NET_RX_SOFTIRQ. In a continuous loop (for(;;){...}), packets are fetched one after the other from the input queue of the processing CPU and passed to the protocol-handling routines, until the input queue is empty. The continuous loop is also exited when the packet handling has lasted longer than one tick (1/HZ; 10 ms at HZ = 100) or when budget = netdev_max_backlog[1] packets have been removed from the queue and processed. This prevents the protocol handling from blocking the remaining activities of the computer and thereby inhibits denial-of-service attacks.

[1] In earlier versions of the Linux kernel, netdev_max_backlog specified the maximum length of the (only) input queue, backlog, and was initialized with the value 300 (packets). In the new kernel versions, it is the maximum length of each processor's input queue.

The first action in the continuous loop is to request a packet from the input queue of the CPU with the method __skb_dequeue(). If a socket buffer is found, then skb_bond() first checks whether the receiving interface belongs to a bonding group and, if so, replaces skb->dev with the group's master device. Subsequently, the socket buffer is handed to the instances of the protocols that process it.

First, the socket buffer is passed to all protocols registered in the list ptype_all. (See Section 6.3.) In general, no protocols are registered in this list. However, this interface is very well suited for hooking in analysis tools, as sketched below.
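Such a hook is created by registering a packet_type structure with the identifier ETH_P_ALL via dev_add_pack() (see Section 6.3). The following sketch shows what this could look like for a hypothetical analysis module; the names sniff_rcv and sniff_ptype are invented for the example.

#include <linux/netdevice.h>
#include <linux/if_ether.h>
#include <linux/skbuff.h>

/* Hypothetical handling routine: after registration it is called for every
   incoming packet, in addition to the regular layer-3 protocols. */
static int sniff_rcv(struct sk_buff *skb, struct net_device *dev,
                     struct packet_type *pt)
{
        /* ... inspect the packet data here ... */
        kfree_skb(skb);         /* release our reference to the socket buffer */
        return 0;
}

static struct packet_type sniff_ptype = {
        __constant_htons(ETH_P_ALL),    /* type: 0x0003, all protocols        */
        NULL,                           /* dev:  NULL = any network device    */
        sniff_rcv,                      /* func: handling routine             */
        (void *) 1,                     /* data: nonzero marks a new-style
                                                 (softnet-capable) handler    */
        NULL                            /* next: managed by dev_add_pack()    */
};

/* dev_add_pack(&sniff_ptype);     registration, e.g., in the module init    */
/* dev_remove_pack(&sniff_ptype);  deregistration in the module exit routine */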

If the computer is configured as a bridge (CONFIG_BRIDGE) and the function pointer br_handle_frame_hook is set, then the packet is passed to the method handle_bridge() and processed further in the bridge instance. (See Chapter 12.)

The last action (which is generally the most common case) passes the socket buffer to all protocols registered for the protocol identifier stored in skb->protocol. They are managed in the hash table ptype_base. Section 6.3 will explain the details of how layer-3 protocols are managed.

For example, for an IP packet, the method eth_type_trans() recognizes the protocol identifier 0x0800 and stores it in skb->protocol. In net_rx_action(), this identifier is then mapped by the hash function to the entry of the Internet Protocol (IP) instance. Handling of the protocol is started by a call of the corresponding protocol-handling routine (func()); in the case of the Internet Protocol, this is the well-known method ip_rcv(). If other protocol instances are registered for the identifier 0x0800, then a pointer to the socket buffer is passed to each of these protocols in turn.
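The following heavily condensed fragment sketches this demultiplexing; it is based on the loop of net_rx_action() (net/core/dev.c, kernel 2.4), with the bridge hook and the time/budget checks omitted. Unlike the original, which hands its own reference to the last matching handler, the sketch takes a separate reference for every handler for clarity.

/* Heavily condensed sketch of the demultiplexing loop in net_rx_action()
   (net/core/dev.c, kernel 2.4); bridge hook and time/budget checks omitted. */
struct softnet_data *queue = &softnet_data[smp_processor_id()];
struct sk_buff *skb;
struct packet_type *ptype;

for (;;) {
        skb = __skb_dequeue(&queue->input_pkt_queue);
        if (skb == NULL)
                break;                          /* input queue is empty */

        /* 1. All protocols in the list ptype_all (e.g., analysis tools). */
        for (ptype = ptype_all; ptype; ptype = ptype->next)
                if (!ptype->dev || ptype->dev == skb->dev) {
                        atomic_inc(&skb->users); /* one reference per handler */
                        ptype->func(skb, skb->dev, ptype);
                }

        /* 2. Protocols registered for skb->protocol in the hash table
              ptype_base; the identifier 0x0800 leads to ip_rcv(). */
        for (ptype = ptype_base[ntohs(skb->protocol) & 15];
             ptype; ptype = ptype->next)
                if (ptype->type == skb->protocol &&
                    (!ptype->dev || ptype->dev == skb->dev)) {
                        atomic_inc(&skb->users);
                        ptype->func(skb, skb->dev, ptype);
                }

        kfree_skb(skb);         /* drop the reference taken on reception */
}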

This means that the actual work with protocol instances of the Linux kernel begins at this point. In general, the protocols that start at this point are layer-3 protocols. However, this interface is also used by several other protocols that instead fit in the first two layers of the ISO/OSI basic reference model. The following section describes the inverse process (i.e., how a data packet is sent).

6.2.2 Transmitting a Packet

As is shown in Figure 6-3, the process of transmitting a packet can be handled in several activity forms of the kernel. We distinguish two main transmission processes:

  • The normal transmission process, in which an activity places a packet in the output queue of a network device and then immediately tries to send the packets that are ready over that device. This means that the transmission process is executed either by the NET_RX soft-IRQ or in the context of a system call. This form of transmitting packets is discussed in the following section.

  • The second type of transmission is handled by NET_TX soft-IRQ. It is marked for execution by some activity of the kernel and invoked by the scheduler at the next possible time. The NET_TX soft-IRQ is normally used when packets are to be sent outside the regular transmission process or at a specific time for certain reasons. This transmission process is introduced after the section describing the normal transmission process.

The Normal Transmission Process

dev_queue_xmit()

net/core/dev.c


dev_queue_xmit(skb) is used by the protocol instances of higher protocols to send a packet in the form of a socket buffer, skb, over a network device. The network device is specified by the parameter skb->dev of the socket buffer structure. (See Figure 6-4.)

Figure 6-4. The process involved when sending a packet by dev_queue_xmit().

graphics/06fig04.gif


First, the socket buffer is placed in the output queue of the network device. This is done by use of the method dev->qdisc->enqueue(). In general, packets are handled by the FIFO (First In, First Out) principle. However, it is also possible to define several queues and introduce various mechanisms for differentiated handling of packets. (See Chapters 18 and 22.)

Once the packet has been placed in the queue by the desired method (qdisc), further handling of packets ready to be sent is triggered. This task is handled by qdisc_run().

There is one special case: a network device that has not defined any methods for queue management (dev->qdisc->enqueue == NULL). In this case, a packet is simply sent right away by dev->hard_start_xmit(). In general, this case concerns logical network devices, such as the loopback device or tunnel network devices.
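In condensed form, the logic of dev_queue_xmit() just described looks roughly as follows. The sketch is based on net/core/dev.c (kernel 2.4); the locking of dev->queue_lock and further special cases are omitted.

/* Condensed sketch of dev_queue_xmit() (net/core/dev.c, kernel 2.4).
   Locking of dev->queue_lock and further special cases are omitted. */
int dev_queue_xmit(struct sk_buff *skb)
{
        struct net_device *dev = skb->dev;
        struct Qdisc *q = dev->qdisc;

        if (q->enqueue) {
                /* Regular case: hand the packet to the queueing discipline,
                   then try to send packets that are ready. */
                int ret = q->enqueue(skb, q);

                qdisc_run(dev);
                return ret;
        }

        /* Special case: devices without queue management (loopback,
           tunnel devices) transmit directly via the driver method. */
        if ((dev->flags & IFF_UP) && !netif_queue_stopped(dev) &&
            dev->hard_start_xmit(skb, dev) == 0)
                return 0;

        kfree_skb(skb);         /* device down or unable to send: drop */
        return -ENETDOWN;
}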

qdisc_run()

include/net/pkt_sched.h


qdisc_run(dev) has rather little functionality. All it actually does is call qdisc_restart() until that function returns a value greater than or equal to zero (no further packet could be handed to the network adapter) or until the network device does not accept any more packets (netif_queue_stopped(dev)).
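In kernel 2.4, the function is essentially just this loop (slightly reformatted from include/net/pkt_sched.h):

/* qdisc_run() (include/net/pkt_sched.h, kernel 2.4), slightly reformatted:
   keep calling qdisc_restart() as long as it reports a successful
   transmission (return value < 0) and the device still accepts packets. */
static inline void qdisc_run(struct net_device *dev)
{
        while (!netif_queue_stopped(dev) &&
               qdisc_restart(dev) < 0)
                /* NOTHING */;
}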

qdisc_restart()

net/sched/sch_generic.c


qdisc_restart(dev) is responsible for getting the next packet from the queue of the network device and sending it. In general, the network device has only a single queue and works by the FIFO principle. However, it is possible to define several queues and serve them by a special strategy (qdisc).

This means that dev->qdisc->dequeue() is used to request the next packet. If this request is successful, then the packet is sent by the driver method dev->hard_start_xmit(). (See Chapter 5.) Of course, the method also checks whether the network device is currently able to send packets (i.e., whether netif_queue_stopped(dev) == 0 is true).

Another problem that can occur in qdisc_restart() is that the lock dev->xmit_lock is already held. This spinlock is normally taken when the transmission of a packet is started in qdisc_restart(); at the same time, the number of the locking CPU is recorded in dev->xmit_lock_owner.

If this lock is set, then there are two options:

  • The locking CPU is not identical to the one considered here, which is currently trying to acquire the lock dev->xmit_lock. This means that another CPU is concurrently sending another packet over this network device. This is actually not a major problem; it merely means that the other CPU was simply a little faster. The socket buffer is placed back into the queue (dev->qdisc->requeue()). Finally, NET_TX_SOFTIRQ is marked for execution in netif_schedule() to trigger the transmission process again.

  • If the locking CPU is identical to the CPU considered here, then a so-called dead loop is present: The forwarding of a packet to the network adapter was somehow interrupted on this processor, and in the meantime an attempt was made to send another packet over the same device. In response, this packet is dropped, and qdisc_restart() returns immediately so that the first transmission process can complete.

The return value of qdisc_restart() can take one of the following values:

  • = 0: The queue is empty.

  • > 0: The queue is not empty, but no packet could be handed to the network adapter, either because the queueing discipline (dev->qdisc) is holding the packets back (e.g., because a packet has not yet reached its target transmission time in active traffic shaping) or because the network device currently cannot accept any more packets; in the latter case, the packet is placed back into the queue and the NET_TX soft-IRQ is marked.

  • < 0: A packet was passed successfully to the network adapter.

If the packet can be forwarded successfully to a network adapter, then the kernel assumes that this transmission process is completed, and the kernel turns to the next packet (qdisc_run()).
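The following condensed sketch summarizes this procedure; it is based on qdisc_restart() in net/sched/sch_generic.c (kernel 2.4), with statistics, the packet tap for ptype_all, and the handling of dev->queue_lock omitted.

/* Condensed sketch of qdisc_restart() (net/sched/sch_generic.c, kernel 2.4).
   Statistics, the ptype_all packet tap, and queue_lock handling are omitted. */
static int qdisc_restart(struct net_device *dev)
{
        struct Qdisc *q = dev->qdisc;
        struct sk_buff *skb;

        if ((skb = q->dequeue(q)) != NULL) {
                if (spin_trylock(&dev->xmit_lock)) {
                        dev->xmit_lock_owner = smp_processor_id();

                        if (!netif_queue_stopped(dev) &&
                            dev->hard_start_xmit(skb, dev) == 0) {
                                /* Packet handed over to the adapter. */
                                dev->xmit_lock_owner = -1;
                                spin_unlock(&dev->xmit_lock);
                                return -1;
                        }
                        dev->xmit_lock_owner = -1;
                        spin_unlock(&dev->xmit_lock);
                } else if (dev->xmit_lock_owner == smp_processor_id()) {
                        /* Dead loop: this CPU interrupted its own
                           transmission; drop the packet. */
                        kfree_skb(skb);
                        return -1;
                }

                /* Device busy or locked by another CPU: put the packet back
                   and let the NET_TX soft-IRQ try again later. */
                q->ops->requeue(skb, q);
                netif_schedule(dev);
                return 1;
        }
        return q->q.qlen;   /* 0: queue empty; > 0: qdisc holds packets back */
}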

Transmitting over NET_TX Soft-IRQ

The NET_TX_SOFTIRQ is an alternative for sending packets. It is marked for execution (__cpu_raise_softirq()) by the method netif_schedule(). netif_schedule() is invoked whenever a socket buffer cannot be sent within the normal transmission process described in the previous section. This problem can have several causes:

  • Problems occurred when a packet was forwarded to the network adapter (e.g., no free buffer spaces).

  • The socket buffer has to be sent later, to honor special handling of packets. In the case of traffic shaping, packets might have to be delayed artificially, to maintain a specific data rate. For this purpose, a timer is used, which starts the transmission of the packet when the transmission time is reached. (See Figure 6-4.)

Now, if NET_TX_SOFTIRQ is marked for execution by netif_schedule(), it is started at the next call of the CPU scheduler.
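The marking itself happens in __netif_schedule(), which chains the network device into the output_queue of the processing CPU. The following slightly shortened sketch is based on include/linux/netdevice.h (kernel 2.4); cpu_raise_softirq() used here is a thin wrapper around __cpu_raise_softirq().

/* Slightly shortened sketch based on include/linux/netdevice.h (kernel 2.4):
   netif_schedule() marks a network device for the NET_TX soft-IRQ. */
static inline void __netif_schedule(struct net_device *dev)
{
        if (!test_and_set_bit(__LINK_STATE_SCHED, &dev->state)) {
                unsigned long flags;
                int cpu = smp_processor_id();

                local_irq_save(flags);
                /* Chain the device into this CPU's output_queue ... */
                dev->next_sched = softnet_data[cpu].output_queue;
                softnet_data[cpu].output_queue = dev;
                /* ... and mark the transmit soft-IRQ for execution. */
                cpu_raise_softirq(cpu, NET_TX_SOFTIRQ);
                local_irq_restore(flags);
        }
}

static inline void netif_schedule(struct net_device *dev)
{
        if (!test_bit(__LINK_STATE_XOFF, &dev->state))
                __netif_schedule(dev);
}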

net_tx_action()

net/core/dev.c


net_tx_action() is the handling routine of the NET_TX_SOFTIRQ software interrupt. The main task of this method is to call qdisc_run(), and thereby qdisc_restart(), for the network devices that were marked by netif_schedule(), in order to start the transmission of their queued packets.

The benefit of using the NET_TX_SOFTIRQ software interrupt is that processes can be handled in parallel in the Linux network architecture. In addition to NET_RX_SOFTIRQ, which is responsible for the main protocol handling, the NET_TX soft-IRQ can also be used to increase the throughput considerably in multiprocessor computers.
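A condensed sketch of the handling routine, based on net_tx_action() in net/core/dev.c (kernel 2.4), is shown below; the part that releases already transmitted socket buffers from the CPU's completion_queue is omitted.

/* Condensed sketch of net_tx_action() (net/core/dev.c, kernel 2.4). The part
   that frees transmitted socket buffers from the completion_queue is omitted. */
static void net_tx_action(struct softirq_action *h)
{
        int cpu = smp_processor_id();
        struct net_device *head;

        if (softnet_data[cpu].output_queue == NULL)
                return;

        /* Take over the list of devices scheduled for transmission. */
        local_irq_disable();
        head = softnet_data[cpu].output_queue;
        softnet_data[cpu].output_queue = NULL;
        local_irq_enable();

        while (head != NULL) {
                struct net_device *dev = head;

                head = head->next_sched;
                clear_bit(__LINK_STATE_SCHED, &dev->state);

                if (spin_trylock(&dev->queue_lock)) {
                        qdisc_run(dev);         /* calls qdisc_restart() */
                        spin_unlock(&dev->queue_lock);
                } else {
                        /* Queue currently locked: reschedule the device. */
                        netif_schedule(dev);
                }
        }
}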


       

