5.3 Network Drivers | Linux Network Architecture

The large number of different protocols in the Linux network architecture leads to considerable differences in the implementations of drivers for different physical network adapters. As was mentioned in the section that described the net_device structure, the properties of different network adapters are hidden at the interface of network devices, which means that they offer a uniform view upwards.

Hiding specific functions (i.e., abstracting from the driver used) is achieved by using function pointers in the net_device structure. For example, a higher-layer protocol instance uses the method hard_start_xmit() to send an IP packet over a network device. Notice, however, that this is merely a function pointer, hiding the method el3_start_xmit() in the case of a 3c509 network adapter. This method takes the steps required to pass a socket buffer to the 3c509 adapter. The upper layers of the Linux network architecture don't know which driver or network adapter is actually used. The function pointer can be used to abstract from the hardware actually used and its particularities.
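The indirection described above can be illustrated outside the kernel. The following is a minimal user-space sketch, not kernel code: the types toy_skb and toy_net_device and the two "drivers" are hypothetical stand-ins invented for this illustration, and only the member name hard_start_xmit is taken from the text.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical, heavily reduced stand-ins for sk_buff and net_device. */
struct toy_skb {
    const char *data;
    unsigned int len;
};

struct toy_net_device {
    const char *name;
    /* Set by the driver; upper layers only ever call through this pointer. */
    int (*hard_start_xmit)(struct toy_skb *skb, struct toy_net_device *dev);
    unsigned long tx_packets;
};

/* Two made-up drivers with different hardware-specific transmit paths. */
static int toy_a_start_xmit(struct toy_skb *skb, struct toy_net_device *dev)
{
    printf("%s: programmed I/O copy of %u bytes\n", dev->name, skb->len);
    dev->tx_packets++;
    return 0;
}

static int toy_b_start_xmit(struct toy_skb *skb, struct toy_net_device *dev)
{
    printf("%s: queued %u bytes on a transmit ring\n", dev->name, skb->len);
    dev->tx_packets++;
    return 0;
}

/* What a higher-layer protocol instance does: it never names a driver. */
static int toy_send(struct toy_net_device *dev, const char *payload)
{
    struct toy_skb skb = { payload, (unsigned int)strlen(payload) };

    return dev->hard_start_xmit(&skb, dev);
}
```

A caller would fill in one toy_net_device per adapter, for example { "eth0", toy_a_start_xmit, 0 }, and then call toy_send() without knowing which transmit function runs.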

The following sections provide an overview of the typical structuring and implementation characteristics of the functions of a network driver, without discussing adapter-specific properties, such as manipulating the hardware registers or describing the transmit buffers. In general, these tasks depend on the hardware, so we will skip them here. Readers interested in these details can use the large number of network drivers included in the drivers/net directory as examples. We use the skeleton driver to explain how driver methods work. This is a sample driver used to show usual processes in driver methods rather than a real driver for a network adapter. For this reason, it is particularly useful for explaining the implementation characteristics of network drivers.^[2]

^[2] At this point, we would like to thank Donald Becker, who implemented most of the network drivers for Linux, greatly contributing to the success of Linux. Donald Becker is also the author of the skeleton driver used here.

Some of the methods listed below are not implemented by some drivers (e.g., example_set_config() to change system resources at runtime); others are essential, such as example_hard_start_xmit() to start a transmission process.

5.3.1 Initializing Network Adapters

Before a network device can be activated, we first have to find the appropriate network adapter; otherwise, it won't be added to the list of registered network devices. The init() function of the network driver is responsible for searching for an adapter and initializing its net_device structure with matching driver information. Because it searches for a network adapter, this function is often called the search function.

The argument of the init() method is a pointer to the initializing device dev. The return value of init() is usually 0, but a negative error code (e.g., -ENODEV) when no adapter was found.

`net_init()/net_probe()`	net/core/dev.c

The tasks of the method dev->init(dev) are explained in the source text of our example driver, isa_skeleton. There is an example driver in drivers/net/pci_skeleton.c for PCI network adapters, but we will not describe it here.

As was mentioned earlier, the main task of the init() method is to search for a matching network adapter (i.e., it has to discover the adapter's I/O ports, in particular the basic address, which is stored in dev->base_addr).

We distinguish between two different cases of searching for a network adapter:

Specifying the basic address: In this case, the previously created net_device structure of the network device is passed as parameter to the init() method. The caller can use this structure to specify a basic address for I/O ports in advance. When no matching adapter is found in this address, then the init() method returns the error message -ENODEV. The basic address can be specified in either of the two following ways:
- For modularized drivers, parameters can be passed when loading the module, including the I/O basic address (e.g., io=0x280). In this case, it should be transferred to the net_device structure of the network device in the init_module() method of the driver module, so that it will be considered during the search for the network adapter.
- For drivers permanently integrated in the kernel, we can also pass parameters when the system boots; these parameters are maintained in the list dev_boot_setup. They are transferred to the net_device structure of a network device in the method init_netdev() (see Section 5.2) and can be used when the network adapter is initialized.
Searching in known basic addresses: A network adapter generally supports a set of defined port addresses. If no basic address is specified when calling the init() method, then the addresses in this list can be probed one after the other. If no adapter can be found in any of these basic addresses in the list, then -ENODEV is returned.

The following source code of the init() method for the skeleton driver handles only the selection of basic addresses where we want to search (by the methods described above). The actual verification of a specific basic address and the initialization of the net_device structure takes place in the method netcard_probe1(dev, ioaddr), which is actually part of the init() method and was implemented separately to keep the code simple and easy to understand.

    /* The name of the card. Is used for messages and in the requests for
     * io regions, irqs and dma channels */
    static const char *cardname = "netcard";

    /* A zero-terminated list of I/O addresses to be probed. */
    static unsigned int netcard_portlist[] __initdata =
        { 0x200, 0x240, 0x280, 0x2C0, 0x300, 0x320, 0x340, 0 };

    /* The number of low I/O ports used by the ethercard. */
    #define IO_NUM 32

    /* Information that needs to be kept for each board. */
    struct net_local {
        struct net_device_stats stats;
        long open_time;        /* Useless example local info. */

        /* Tx control lock. This protects the transmit buffer ring
         * state along with the "tx full" state of the driver. This
         * means all netif_queue flow control actions are protected
         * by this lock as well. */
        spinlock_t lock;
    };

    /* The station (ethernet) address prefix, used for IDing the board. */
    #define SA_ADDR0 0x00
    #define SA_ADDR1 0x42
    #define SA_ADDR2 0x65

    int __init netcard_probe(struct net_device *dev)
    {
        int i;
        int base_addr = dev->base_addr;

        SET_MODULE_OWNER(dev);

        if (base_addr > 0x1ff)      /* Check a single specified location. */
            return netcard_probe1(dev, base_addr);
        else if (base_addr != 0)    /* Don't probe at all. */
            return -ENXIO;

        /* No address given: try each known port address in turn. */
        for (i = 0; netcard_portlist[i]; i++) {
            int ioaddr = netcard_portlist[i];

            if (check_region(ioaddr, IO_NUM))
                continue;           /* Range already taken by someone else. */
            if (netcard_probe1(dev, ioaddr) == 0)
                return 0;
        }

        return -ENODEV;
    }

Once we have selected a basic address for the network adapter in the above method, the method netcard_probe1(dev, ioaddr) tests whether the adapter we searched for is really at this basic address. For this purpose, the method has to check specific properties of the card, where access should be limited to read access on the I/O ports to ensure that no other adapters will be involved. At this point, it is still unknown whether the adapter we're searching for is really present in the basic address ioaddr.

A very simple method to identify the adapter compares the manufacturer identification with the MAC address. Each network adapter has a unique MAC address, where the first three bytes identify the manufacturer. This identification must correspond with the manufacturer code of the searched card. In any event, additional checks should be done, but they are adapter-specific and are not described in detail here.
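The manufacturer check described above can be sketched in isolation. This is a user-space illustration under stated assumptions: the function toy_oui_matches() and the array toy_expected_oui are hypothetical names, and the prefix bytes are simply the SA_ADDR0 to SA_ADDR2 values from the skeleton listing. In a real driver the station address would be read from the adapter with inb().

```c
#include <string.h>

#define TOY_OUI_LEN 3

/* Manufacturer prefix taken from the skeleton listing (SA_ADDR0..SA_ADDR2). */
static const unsigned char toy_expected_oui[TOY_OUI_LEN] = { 0x00, 0x42, 0x65 };

/* Returns 1 if the first three octets of a 6-byte station address
 * equal the given manufacturer prefix, otherwise 0. */
static int toy_oui_matches(const unsigned char mac[6],
                           const unsigned char oui[TOY_OUI_LEN])
{
    return memcmp(mac, oui, TOY_OUI_LEN) == 0;
}
```

A match only shows that the bytes found at the probed address look like the expected manufacturer prefix, which is why the text recommends additional adapter-specific checks.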

Once we are sure that the network adapter we searched for is present in the basic address ioaddr, this address is stored in the net_device structure (dev->base_addr), and the network device is initialized. The I/O ports, starting from the basic address, are reserved by request_region(ioaddr, IO_NUM, cardname) at the end of the initialization function to ensure that no other initialization method can get write access to it.

The initialization process can be divided into the following three phases:

- If the network adapter does not support dynamic interrupt allocation, then the interrupt set by jumpers on the network adapter should be determined and reserved at this point. The kernel supports the search for the interrupt number. Calling the method autoirq_setup() makes the kernel remember interrupt lines not currently registered in a variable. Subsequently, the network adapter should be caused to trigger an interrupt. We can then use the method autoirq_report() to discover, from the previously stored and the actual interrupt vectors, which interrupt was actually active. Next, the interrupt found is reserved for the network adapter by the method request_irq(). In addition, the DMA channel is determined and reserved by request_dma(). For modern adapters that do not necessarily require specific interrupt or DMA lines, the two system resources are allocated not at this point, but rather when the device is opened. This is necessary to avoid conflicts with other devices.
- Once system resources have been allocated (for older adapters only), memory is reserved for the private data structure of the network device dev->priv and is initialized. This data structure stores the private data of the network driver and statistic information collected during the operation of the network device (net_device_stats structure).
- Finally, the references to driver-specific methods are set in the net_device structure, so that they can be used by the higher layers and protocols. The adapter-specific methods (see also Section 5.1.1) have to be set explicitly. Methods specific to the MAC protocol used (e.g., Ethernet) can be set by special methods (e.g., ether_setup()).

If the network adapter was found and all data structures were initialized correctly, then dev->init() returns 0.

    /* This is the real probe routine. Linux has a history of friendly device
     * probes on the ISA bus. A good device probe avoids doing writes, and
     * verifies that the correct device exists and functions. */
    static int __init netcard_probe1(struct net_device *dev, int ioaddr)
    {
        struct net_local *np;
        static unsigned version_printed = 0;
        int i;

        /*
         * For Ethernet adaptors the first three octets of the station address
         * contain the manufacturer's unique code. That might be a good probe
         * method. Ideally you would add additional checks.
         */
        if (inb(ioaddr + 0) != SA_ADDR0
            || inb(ioaddr + 1) != SA_ADDR1
            || inb(ioaddr + 2) != SA_ADDR2) {
            return -ENODEV;
        }

        if (net_debug && version_printed++ == 0)
            printk(KERN_DEBUG "%s", version);

        printk(KERN_INFO "%s: %s found at %#3x, ", dev->name, cardname, ioaddr);

        /* Fill in the 'dev' fields. */
        dev->base_addr = ioaddr;

        /* Retrieve and print the Ethernet address. */
        for (i = 0; i < 6; i++)
            printk(" %2.2x", dev->dev_addr[i] = inb(ioaddr + i));

    #ifdef jumpered_interrupts
        /* If this board has jumpered interrupts, allocate the interrupt
         * vector now. There is no point in waiting since no other device
         * can use the interrupt, and this marks the irq as busy. Jumpered
         * interrupts are typically not reported by the boards, and we must
         * use autoIRQ to find them. */
        /* ... REMOVED for this book, details see in drivers/net/isa-skeleton.c */
    #endif /* jumpered interrupt */

    #ifdef jumpered_dma
        /* If we use a jumpered DMA channel, that should be probed for and
         * allocated here as well. See lance.c for an example. */
        /* ... REMOVED for this book, details see in drivers/net/isa-skeleton.c */
    #endif /* jumpered DMA */

        /* Initialize the device structure. */
        if (dev->priv == NULL) {
            dev->priv = kmalloc(sizeof(struct net_local), GFP_KERNEL);
            if (dev->priv == NULL)
                return -ENOMEM;
        }
        memset(dev->priv, 0, sizeof(struct net_local));

        np = (struct net_local *)dev->priv;
        spin_lock_init(&np->lock);

        /* Grab the region so that no one else tries to probe our ioports. */
        request_region(ioaddr, IO_NUM, cardname);

        /* Register the driver-specific methods. */
        dev->open = net_open;
        dev->stop = net_close;
        dev->hard_start_xmit = net_send_packet;
        dev->get_stats = net_get_stats;
        dev->set_multicast_list = &set_multicast_list;
        dev->tx_timeout = &net_tx_timeout;
        dev->watchdog_timeo = MY_TX_TIMEOUT;

        /* Fill in the fields of the device structure with Ethernet values. */
        ether_setup(dev);

        return 0;
    }

Helper Functions to Allocate System Resources

`request_region(), release_region(), check_region()`	kernel/resource.c

request_region(port, range, name) reserves a region of I/O ports, starting with the address port, and marks them as allocated. The kernel manages these reserved port ranges in a linear list. This list can be output from the proc file /proc/ioports, where name is the output name of the reserved instance.

We reserve ports to prevent a driver that searches for an adapter from accessing the ports of another device, causing that device to take an undefined or unintended state. For this reason, before port ranges are assigned, we should always use check_region() to check on whether that range is already taken. The address of the first I/O port of an adapter is stored in the variable dev->base_addr.

release_region(start, n) can be used to release allocated port ranges.
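The bookkeeping behind these three helpers can be pictured with a small user-space model. The sketch below is an assumption-laden illustration, not the kernel implementation: the toy_ functions, the fixed-size table, and the return values are all made up here, and only the idea of a linear list of reserved ranges comes from the text.

```c
#include <stddef.h>

#define TOY_MAX_REGIONS 16

struct toy_region {
    unsigned long start;
    unsigned long n;
    const char *name;   /* What would be shown in /proc/ioports. */
    int used;
};

static struct toy_region toy_regions[TOY_MAX_REGIONS];

/* Returns nonzero if [start, start + n) overlaps a reserved range. */
static int toy_check_region(unsigned long start, unsigned long n)
{
    for (size_t i = 0; i < TOY_MAX_REGIONS; i++) {
        const struct toy_region *r = &toy_regions[i];

        if (r->used && start < r->start + r->n && r->start < start + n)
            return 1;
    }
    return 0;
}

/* Reserves the range; returns 0 on success, -1 if busy or the table is full. */
static int toy_request_region(unsigned long start, unsigned long n,
                              const char *name)
{
    if (toy_check_region(start, n))
        return -1;
    for (size_t i = 0; i < TOY_MAX_REGIONS; i++) {
        if (!toy_regions[i].used) {
            toy_regions[i] = (struct toy_region){ start, n, name, 1 };
            return 0;
        }
    }
    return -1;
}

/* Releases a range that was reserved with exactly this start and length. */
static void toy_release_region(unsigned long start, unsigned long n)
{
    for (size_t i = 0; i < TOY_MAX_REGIONS; i++) {
        struct toy_region *r = &toy_regions[i];

        if (r->used && r->start == start && r->n == n)
            r->used = 0;
    }
}
```

In this model the request itself reports whether the range was free, so a caller that checks the result of the reservation does not depend on a separate check made earlier.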

`request_irq(), free_irq()`	kernel/irq.c

request_irq(irq, handler, flags, device, dev_id) reserves and initializes the interrupt line with number irq. At the same time, the handling routine handler() is registered for this interrupt.

Similarly to what it does with I/O ports, the kernel manages a list of reserved interrupts and can output this list in the proc directory (/proc/interrupts). Again, the string device tells you who reserved this interrupt. The parameter flags can be used to specify options when reserving an interrupt. For more information, see [RuCo01].

A reserved interrupt can be released by free_irq(irq, dev_id).
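The pairing of reservation, handler registration, and release can also be modeled in user space. The following sketch is hypothetical throughout: the toy_ names, the reduced handler signature, and the return codes are assumptions for illustration, and shared interrupt lines are deliberately left out.

```c
#include <stddef.h>

#define TOY_NR_IRQS 16

typedef void (*toy_irq_handler)(int irq, void *dev_id);

struct toy_irq_slot {
    toy_irq_handler handler;
    const char *device;     /* Owner name, as listed in /proc/interrupts. */
    void *dev_id;
};

static struct toy_irq_slot toy_irq_table[TOY_NR_IRQS];

/* Returns 0 on success, -1 for bad arguments, -2 if the line is taken. */
static int toy_request_irq(int irq, toy_irq_handler handler,
                           const char *device, void *dev_id)
{
    if (irq < 0 || irq >= TOY_NR_IRQS || handler == NULL)
        return -1;
    if (toy_irq_table[irq].handler != NULL)
        return -2;
    toy_irq_table[irq] = (struct toy_irq_slot){ handler, device, dev_id };
    return 0;
}

/* Frees the line only if dev_id identifies the current owner. */
static void toy_free_irq(int irq, void *dev_id)
{
    if (irq < 0 || irq >= TOY_NR_IRQS)
        return;
    if (toy_irq_table[irq].dev_id == dev_id)
        toy_irq_table[irq] = (struct toy_irq_slot){ NULL, NULL, NULL };
}

/* Simulates the hardware raising a line; returns 1 if a handler ran. */
static int toy_raise_irq(int irq)
{
    if (irq < 0 || irq >= TOY_NR_IRQS || toy_irq_table[irq].handler == NULL)
        return 0;
    toy_irq_table[irq].handler(irq, toy_irq_table[irq].dev_id);
    return 1;
}

/* Example handler: counts interrupts in the integer behind dev_id. */
static void toy_count_handler(int irq, void *dev_id)
{
    (void)irq;
    (*(int *)dev_id)++;
}
```

The dev_id argument plays the same role as in the text: it is handed back to the handler and identifies the owner again when the line is released.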

`request_dma(), free_dma()`	kernel/dma.c

request_dma(dmarr, device_id) tries to reserve the DMA channel dmarr. free_dma(dmarr) can be used to release a reserved DMA channel.

5.3.2 Opening and Closing a Network Adapter

We know from Section 5.2 that network devices are activated and deactivated by the command ifconfig. More specifically, ioctl() calls invoke the methods dev_open() or dev_close(), where the general steps to activate and deactivate a network device are executed. The adapter-specific actions are handled in the driver methods dev->open() and dev->stop(), respectively, of the present network adapter. We use the skeleton sample driver to explain these steps.

`net_open()`	drivers/net/isa_skeleton.c

The open() method is responsible for initializing and activating the network adapter. At the beginning, the system resources required (interrupt, DMA channel, etc.) are requested. To make these system resources available, the kernel offers various helper methods, which were introduced briefly in the previous section. System resources are reserved in the open() method for modern adapters, which do not have fixed values for IRQ and DMA lines. For older cards, the resources are searched for and reserved in the init() method. (See init().)

Once a network adapter has been initialized successfully, the use counter of the module should be incremented for modularized drivers, to prevent inadvertent unloading of the driver module from the kernel. We can use the macro MOD_INC_USE_COUNT for this purpose.

The network adapter is initialized when all system resources have been allocated successfully. Each adapter is initialized in an individual manner. Normally, a specific value is written to a hardware register (I/O port) of the adapter, which causes the adapter to initialize itself.

The transmission of packets over the network device is started by netif_start_queue(dev). Finally, the value 0 is returned if the adapter was opened successfully; otherwise, a negative error code is returned.

    /*
     * Open/initialize the board. This is called (in the current kernel)
     * sometime after booting when the 'ifconfig' program is run.
     *
     * This routine should set everything up anew at each open, even
     * registers that "should" only need to be set once at boot, so that
     * there is a non-reboot way to recover if something goes wrong.
     */
    static int net_open(struct net_device *dev)
    {
        struct net_local *np = (struct net_local *)dev->priv;
        int ioaddr = dev->base_addr;

        /*
         * This is used if the interrupt line can be turned off (shared).
         * See 3c503.c for an example of selecting the IRQ at config-time.
         */
        if (request_irq(dev->irq, &net_interrupt, 0, cardname, dev)) {
            return -EAGAIN;
        }

        /*
         * Always allocate the DMA channel after the IRQ, and clean up on failure.
         */
        if (request_dma(dev->dma, cardname)) {
            free_irq(dev->irq, dev);
            return -EAGAIN;
        }

        MOD_INC_USE_COUNT;

        /* Reset the hardware here. Don't forget to set the station address. */
        chipset_init(dev, 1);
        outb(0x00, ioaddr);
        np->open_time = jiffies;

        /* We are now ready to accept transmit requests from
         * the queuing layer of the networking.
         */
        netif_start_queue(dev);

        return 0;
    }

Deactivating a Network Adapter

`net_close()`	drivers/net/isa_skeleton.c

During deactivation of a network adapter, all operations done when the adapter was opened should be undone. This concerns mainly allocated system resources (interrupts, DMA channels, etc.), which should now be freed.

For modularized drivers, the use counter has to be decremented with MOD_DEC_USE_COUNT, and the network device must not accept any more packets from higher layers (netif_stop_queue). Again, the return value is either 0, if successful, or a negative error code.

    /* The inverse routine to net_open(). */
    static int net_close(struct net_device *dev)
    {
        struct net_local *lp = (struct net_local *)dev->priv;
        int ioaddr = dev->base_addr;

        lp->open_time = 0;

        /* Accept no more packets from the higher layers. */
        netif_stop_queue(dev);

        /* Flush the Tx and disable Rx here. */
        disable_dma(dev->dma);

        /* If not IRQ or DMA jumpered, free up the line. */
        outw(0x00, ioaddr + 0);    /* Release the physical interrupt line. */

        free_irq(dev->irq, dev);
        free_dma(dev->dma);

        /* Update the statistics here. */

        MOD_DEC_USE_COUNT;

        return 0;
    }

5.3.3 Transmitting Data

Each data transmission in the Linux network architecture occurs over a network device, more specifically by use of the method hard_start_xmit() (start hardware transmission). Of course, this is a function pointer, pointing to a driver-specific transmission function, ..._start_xmit(). This method is responsible for forwarding the packet in the form of a socket buffer and starting the transmission. Before we discuss the usual steps involved in the driver method dev->hard_start_xmit() in this section, we will briefly describe the common architecture of network adapters.

A network adapter is an interface adapter that automatically transmits and receives network packets according to a defined MAC protocol (Ethernet, token ring, etc.). This means that a network adapter has an independent logic that works in parallel to the regular central processor(s). The network adapter and a system processor interact over I/O ports (hardware registers) and interrupts. When a processor wants to pass data to the network adapter, then the processor writes its data to the appropriate I/O ports and starts the desired action. When the adapter wants to pass data to the processor (e.g., a packet it received), then the adapter triggers an interrupt, and the processor uses the interrupt-handling routine of the network adapter to serve the network adapter. This shows clearly that system processors have a leading role versus interface adapters (master slave relationship).

Transmitting Data Packets

`net_send_packet()`	drivers/net/isa_skeleton.c

dev->hard_start_xmit(skb, dev) is responsible for forwarding a data packet to the network adapter so that the latter can transmit it. The packet data of the socket buffer is copied to an internal buffer location in the network adapter, and the time stamp dev->trans_start = jiffies is attached, marking the beginning of that transmission. If this copying action was successful, it is also assumed that the transmission will be successful. In this case, hard_start_xmit() has to return a value of 0. Otherwise, it should return 1, so that the kernel knows that the packet could not be sent.

When forwarding network packets between the operating system and the network adapter, we can distinguish between two different techniques:

Older network adapters (e.g., 3Com 3c509) have an internal buffer memory on the adapter for packets to be sent. This means that the kernel can always forward only one single packet to the adapter at a time. If a buffer is free, a packet is copied to the adapter right away and the kernel can delete the corresponding socket buffer.
More recent network adapters work differently. The driver manages a ring buffer consisting of 16 to 64 pointers to socket buffers. When a packet is ready to be sent, then the corresponding socket buffer is arranged within this ring, and a pointer to the packet data is passed to the network adapter. Subsequently, the socket buffer remains in the ring buffer until the network adapter, using an interrupt, has notified that the packet was transmitted. Finally, the socket buffer is removed from the ring buffer and freed.

If the transmission was successful, then the socket buffer is no longer required, and it can be freed by dev_kfree_skb(). (See Section 4.1.1.) If an error occurred during the transmission, then the socket buffer should not be touched, because the kernel will most likely try to retransmit the packet.

When the method hard_start_xmit() is called, we can assume that there is currently at least one free place in the ring buffer, because netif_queue_stopped(dev) is checked before the call. Once the socket buffer has been placed in the ring buffer (add_to_tx_ring() stands for this step in the example), we should check whether more buffer places are free. If this is not the case (i.e., if the ring buffer is fully occupied), then we have to use netif_stop_queue() to prevent more packets from being forwarded to the network adapter. The network device remains stopped until places in the ring buffer become free. The kernel is notified about this by an interrupt, as explained in the following section.

    /* This will only be invoked if your driver is _not_ in XOFF state.
     * What this means is that you need not check it, and that this
     * invariant will hold if you make sure that the netif_*_queue()
     * calls are done at the proper times.
     */
    static int net_send_packet(struct sk_buff *skb, struct net_device *dev)
    {
        struct net_local *np = (struct net_local *)dev->priv;
        int ioaddr = dev->base_addr;
        short length = ETH_ZLEN < skb->len ? skb->len : ETH_ZLEN;
        unsigned char *buf = skb->data;

        /* If some error occurs while trying to transmit this
         * packet, you should return '1' from this function.
         * In such a case you _may not_ do anything to the
         * SKB, it is still owned by the network queuing
         * layer when an error is returned. This means you
         * may not modify any SKB fields, you may not free
         * the SKB, etc.
         */

    #if TX_RING
        /* This is the most common case for modern hardware.
         * The spinlock protects this code from the TX complete
         * hardware interrupt handler. Queue flow control is
         * thus managed under this lock as well.
         */
        spin_lock_irq(&np->lock);

        add_to_tx_ring(np, skb, length);
        dev->trans_start = jiffies;

        /* If we just used up the very last entry in the
         * TX ring on this device, tell the queuing
         * layer to send no more.
         */
        if (tx_full(dev))
            netif_stop_queue(dev);

        /* When the TX completion hw interrupt arrives, this
         * is when the transmit statistics are updated.
         */
        spin_unlock_irq(&np->lock);
    #else
        /* This is the case for older hardware which takes
         * a single transmit buffer at a time, and it is
         * just written to the device via PIO.
         *
         * No spin locking is needed since there is no TX complete
         * event. If by chance your card does have a TX complete
         * hardware IRQ then you may need to utilize np->lock here.
         */
        hardware_send_packet(ioaddr, buf, length);
        np->stats.tx_bytes += skb->len;

        dev->trans_start = jiffies;

        /* You might need to clean up and record Tx statistics here. */
        if (inw(ioaddr) == /*RU*/81)
            np->stats.tx_aborted_errors++;
        dev_kfree_skb(skb);
    #endif

        return 0;
    }

Receiving Packets and Messages from a Network Adapter

`net_interrupt()`	drivers/net/isa_skeleton.c

A network adapter uses interrupts and its driver-specific interrupt-handling routine to communicate with the operating system. More specifically, the network adapter triggers an interrupt to stop the current processor operation and notify it about an event. When a network adapter uses an interrupt, we generally distinguish between three different events:

Receive a data packet: The network adapter has accepted and buffered a data packet and now wants to forward this packet to the operating system.
Acknowledge a packet transmission: The network adapter uses this interrupt to acknowledge that a packet previously forwarded by the operating system was sent and that there is now space available in the ring buffer. However, this acknowledgment does not mean that the receiver received the packet successfully; it merely means that the network adapter has put the packet successfully to the medium.
Notify an error situation: Depending on the network adapter used, an interrupt can be used to notify the driver of error situations.

Figure 5-5 shows how the interrupt handling routine of a network driver works.

Figure 5-5. A network adapter uses an interrupt to send messages.


First, we should set an IRQ lock to prevent the function from being executed more than once at the same time. In older versions of the Linux kernel, the flag dev->interrupt was used to this end. From version 2.4 and higher, the driver should have its own lock variable.

Next, we want to know the cause of the interrupt. For this purpose, we normally read a state value from a state register, which shows whether a new packet has been received, whether a transmission was completed, and whether an error situation occurred. If a packet was received, then the driver-specific receive function net_rx() is called. If a packet transmission was fully completed, then the statistics are updated first; then netif_wake_queue(dev) (or dev->busy = 0; mark_bh(NET_BH) in earlier versions) announces the end of transmission and marks the NET_TX software interrupt for execution, so that waiting packets can be sent.

The NET_RX soft IRQ handles all incoming packets. Because it interrupts the normal work of a processor, an interrupt should complete its job quickly. Unfortunately, handling an incoming packet can be very complex, mainly because many protocols (e.g., PPP, IP, TCP, and FTP) normally participate in the process. To ensure that a processor's work is not interrupted for an excessive duration, the interrupt-handling routine carries out only those steps absolutely required to receive a packet. The more intensive part of protocol handling is done in the NET_RX software interrupt, which has a lower priority than interrupt handling.

    static void net_interrupt(int irq, void *dev_id, struct pt_regs *regs)
    {
        struct net_device *dev = dev_id;
        struct net_local *np;
        int ioaddr, status;

        ioaddr = dev->base_addr;
        np = (struct net_local *)dev->priv;

        /* Read the state register to learn the cause of the interrupt. */
        status = inw(ioaddr + 0);

        if (status & RX_INTR) {
            /* Got a packet(s). */
            net_rx(dev);
        }
    #if TX_RING
        if (status & TX_INTR) {
            /* Transmit complete. */
            net_tx(dev);
            np->stats.tx_packets++;
            netif_wake_queue(dev);
        }
    #endif
        if (status & COUNTERS_INTR) {
            /* Increment the appropriate 'localstats' field. */
            np->stats.tx_window_errors++;
        }
    }
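The division of labor between the interrupt handler and the NET_RX software interrupt described above can be sketched as a small queue model. This is a user-space illustration with hypothetical names (toy_backlog, toy_netif_rx(), toy_net_rx_action()) and an arbitrary queue limit of eight packets; it is not how the kernel implements its input queue.

```c
#include <stddef.h>

#define TOY_BACKLOG_MAX 8

struct toy_packet {
    int id;
    int len;
};

/* Fixed-size FIFO standing in for the input queue filled by netif_rx(). */
struct toy_backlog {
    struct toy_packet slots[TOY_BACKLOG_MAX];
    size_t head;
    size_t count;
    unsigned long dropped;
    int softirq_pending;
};

/* Interrupt-time part: only enqueue the packet and request later work.
 * Returns 0 on success, -1 if the queue was full and the packet dropped. */
static int toy_netif_rx(struct toy_backlog *q, struct toy_packet pkt)
{
    if (q->count == TOY_BACKLOG_MAX) {
        q->dropped++;
        return -1;
    }
    q->slots[(q->head + q->count) % TOY_BACKLOG_MAX] = pkt;
    q->count++;
    q->softirq_pending = 1;
    return 0;
}

/* Deferred part: drain the queue and do the expensive protocol work.
 * Returns the number of packets handled. */
static int toy_net_rx_action(struct toy_backlog *q, long *bytes_seen)
{
    int handled = 0;

    while (q->count > 0) {
        struct toy_packet pkt = q->slots[q->head];

        q->head = (q->head + 1) % TOY_BACKLOG_MAX;
        q->count--;
        *bytes_seen += pkt.len;     /* Placeholder for protocol handling. */
        handled++;
    }
    q->softirq_pending = 0;
    return handled;
}
```

The interrupt-time function does a constant amount of work per packet, while the loop that may take long runs only in the deferred function.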

Acknowledging a Transmission Process

`net_tx()`	drivers/net/isa_skeleton.c

With more recent network adapters, the network driver manages a ring buffer of socket buffers, which should be used to transmit over the network adapter. These socket buffers remain in the ring buffer until the network adapter uses an interrupt to acknowledge their transmission. The method net_tx() shows the tasks to be executed when a network adapter acknowledges a transmission. The method net_tx() is actually part of the interrupt-handling routine and normally is implemented as a separate function only for clarity.

First, we should set a lock (normally a spinlock) to ensure that parallel access attempts cannot cause inconsistent states in data structures. Subsequently, the adapter is repeatedly asked which packets have been sent, until all sent packets have been recorded. Next, the packets are removed from the ring buffer and freed by dev_kfree_skb_irq(skb).

Finally, we should check whether the network device was halted because of a full ring buffer. At least one buffer place has now been released, so the network device can be restarted by netif_wake_queue(dev). To restart the network device, the flag LINK_STATE_XOFF is cleared, as described in Section 5.1.1.

    void net_tx(struct net_device *dev)
    {
        struct net_local *np = (struct net_local *)dev->priv;
        int entry;

        /* This protects us from concurrent execution of
         * our dev->hard_start_xmit function above.
         */
        spin_lock(&np->lock);

        /* Free every socket buffer the adapter has finished sending. */
        entry = np->tx_old;
        while (tx_entry_is_sent(np, entry)) {
            struct sk_buff *skb = np->skbs[entry];

            np->stats.tx_bytes += skb->len;
            dev_kfree_skb_irq(skb);

            entry = next_tx_entry(np, entry);
        }
        np->tx_old = entry;

        /* If we had stopped the queue due to a "tx full"
         * condition, and space has now been made available,
         * wake up the queue.
         */
        if (netif_queue_stopped(dev) && !tx_full(dev))
            netif_wake_queue(dev);

        spin_unlock(&np->lock);
    }
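The interplay between the transmit path and the completion path can be tried out in user space. The sketch below is a simplified model under stated assumptions: the ring has only four entries, all toy_ names are hypothetical, the adapter is simulated by toy_adapter_complete(), and the locking shown in the listings is omitted because the model is single-threaded.

```c
#include <stddef.h>

#define TOY_TX_RING_SIZE 4

struct toy_tx_ring {
    const void *skbs[TOY_TX_RING_SIZE]; /* Buffers handed to the adapter. */
    int sent[TOY_TX_RING_SIZE];         /* Set by the simulated adapter. */
    size_t tx_new;                      /* Next entry to fill. */
    size_t tx_old;                      /* Oldest entry not yet reclaimed. */
    size_t in_flight;
    int queue_stopped;                  /* Stands in for LINK_STATE_XOFF. */
    unsigned long freed;
};

static int toy_tx_full(const struct toy_tx_ring *r)
{
    return r->in_flight == TOY_TX_RING_SIZE;
}

/* Transmit path: returns 0 on success, 1 if the packet was not queued. */
static int toy_start_xmit(struct toy_tx_ring *r, const void *skb)
{
    if (toy_tx_full(r))
        return 1;
    r->skbs[r->tx_new] = skb;
    r->sent[r->tx_new] = 0;
    r->tx_new = (r->tx_new + 1) % TOY_TX_RING_SIZE;
    r->in_flight++;
    if (toy_tx_full(r))
        r->queue_stopped = 1;           /* Role of netif_stop_queue(). */
    return 0;
}

/* Simulated adapter: marks up to n of the oldest entries as transmitted. */
static void toy_adapter_complete(struct toy_tx_ring *r, size_t n)
{
    size_t entry = r->tx_old;

    for (size_t i = 0; i < n && i < r->in_flight; i++) {
        r->sent[entry] = 1;
        entry = (entry + 1) % TOY_TX_RING_SIZE;
    }
}

/* Completion path: reclaim transmitted entries, then wake the queue. */
static void toy_tx_complete(struct toy_tx_ring *r)
{
    while (r->in_flight > 0 && r->sent[r->tx_old]) {
        r->skbs[r->tx_old] = NULL;      /* Role of dev_kfree_skb_irq(). */
        r->sent[r->tx_old] = 0;
        r->tx_old = (r->tx_old + 1) % TOY_TX_RING_SIZE;
        r->in_flight--;
        r->freed++;
    }
    if (r->queue_stopped && !toy_tx_full(r))
        r->queue_stopped = 0;           /* Role of netif_wake_queue(). */
}
```

Filling all four entries stops the queue; after the simulated adapter completes two of them and toy_tx_complete() runs, two entries are free again and the queue is awake.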

Receiving a Data Packet

In contrast to sending a data packet, receiving a data packet from the network is an unforeseeable event for the operating system. The network adapter receives a packet in parallel to processor operations and wants to forward this packet to the kernel. In general, there are two methods to inform the kernel that a packet has arrived.

First, the system could periodically ask the network adapter whether data has been received; this is the so-called polling principle. One major problem of this method is the size of the time interval in which the network adapter should be asked. If this interval is too short, then unnecessary computing time is wasted, but, if it is too long, then the data exchange is unnecessarily delayed and the network adapter might be unable to buffer all incoming packets.
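The upper bound on the polling interval can be estimated with a short calculation. The following sketch is a rough model under stated assumptions: frames arrive back to back, preamble and inter-frame gap are ignored, and the function name and its parameters are hypothetical.

```c
/* Longest polling interval, in microseconds, after which an adapter with
 * buffer_slots receive buffers would overflow if frames of frame_bytes
 * arrive back to back at link_bits_per_sec. Returns 0 if any argument is 0. */
static unsigned long toy_max_poll_interval_us(unsigned long buffer_slots,
                                              unsigned long frame_bytes,
                                              unsigned long link_bits_per_sec)
{
    if (buffer_slots == 0 || frame_bytes == 0 || link_bits_per_sec == 0)
        return 0;

    /* Total bits that fit in the buffer, converted to time on the wire. */
    return (unsigned long)((unsigned long long)buffer_slots * frame_bytes
                           * 8ULL * 1000000ULL / link_bits_per_sec);
}
```

Under these assumptions, an adapter with 16 buffers on a 10-Mbit/s link would have to be polled roughly every 819 microseconds for 64-byte frames, but only about every 19 milliseconds for 1518-byte frames, which shows why no single fixed interval suits all traffic.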

The second and better method uses an interrupt and an appropriate interrupt-handling routine to inform the operating system about an incoming packet. A processor of the system is briefly interrupted in its current work, accepts the packet received, and stores it in a queue. Next, the packet is further handled as soon as the processor has time. This interrupt principle clearly performs better than the polling principle, and it adapts itself better to the current system load. For this reason, each modern network adapter works by this principle (i.e., the receive function of the network driver is called by its interrupt-handling routine).

`net_rx()`	drivers/net/isa_skeleton.c

The driver method used to handle incoming packets is responsible for requesting a socket buffer and for filling the packet data space with the packet received. The method dev_alloc_skb() can be used to request a new socket buffer. This method attempts to get a used socket buffer from the socket-buffer cache to avoid slow memory management. Section 4.1.1 described how dev_alloc_skb() works.

Occasionally, more than one packet arrives. In the example discussed in this section, up to ten packets are accepted from the network adapter in one call and passed as socket buffers to the Linux network architecture. The status of a received packet can generally be verified from specific hardware registers (i.e., whether the packet was received correctly and, if not, which error occurred). If errors occur, then these are generally collected in a net_device_stats structure, which is not part of the net_device structure; it has to be managed in the private data space (dev->priv) of the network device.

Once a packet has been received correctly and the packet data has been transferred to the packet data range of the socket buffer, the receiving network device dev is registered in the sk_buff structure, and the protocol type present in the packet is learned. Notice that this information cannot be derived from the payload of the MAC packet, so it has to be learned here, before the packet is forwarded to the higher layers. For Ethernet packets, the method eth_type_trans() handles this task and extracts this information from the protocol field of the Ethernet frame. (See Section 6.3.1.)
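As a rough user-space illustration of the core step that a function such as eth_type_trans() performs for Ethernet frames, the following sketch reads the two-byte type field that follows the destination and source addresses. The function name frame_ether_type() and the constants are assumptions made for this example only; the kernel function handles further cases (for instance, frames that carry a length instead of a type in this field) that are omitted here.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_ETH_ALEN 6    /* bytes per MAC address */
#define SKETCH_ETH_HLEN 14   /* destination + source + type field */

/*
 * Return the type field of an Ethernet frame in host byte order,
 * or 0 if the buffer is too short to hold a complete header.
 */
uint16_t frame_ether_type(const uint8_t *frame, size_t len)
{
    if (frame == NULL || len < SKETCH_ETH_HLEN)
        return 0;

    /* The field is transmitted in network (big-endian) byte order. */
    return (uint16_t)((frame[2 * SKETCH_ETH_ALEN] << 8) |
                      frame[2 * SKETCH_ETH_ALEN + 1]);
}
```

For a frame whose bytes 12 and 13 are 0x08 and 0x00, the function returns 0x0800, the type value used for IPv4.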

Subsequently, netif_rx(skb) can place the socket buffers in the input queue. Finally, the statistics for the network device are updated, and the interrupt-handling routine either continues handling the next packet received or terminates the interrupt handling.

    /* We have a good packet(s), get it/them out of the buffers. */
    static void net_rx(struct net_device *dev)
    {
        struct net_local *lp = (struct net_local *)dev->priv;
        int ioaddr = dev->base_addr;
        int boguscount = 10;        /* handle at most ten packets per call */

        do {
            int status = inw(ioaddr);
            int pkt_len = inw(ioaddr);

            if (pkt_len == 0)       /* Read all the frames? */
                break;              /* Done for now */

            if (status & 0x40) {    /* There was an error. */
                lp->stats.rx_errors++;
                if (status & 0x20) lp->stats.rx_frame_errors++;
                if (status & 0x10) lp->stats.rx_over_errors++;
                if (status & 0x08) lp->stats.rx_crc_errors++;
                if (status & 0x04) lp->stats.rx_fifo_errors++;
            } else {
                /* Malloc up new buffer. */
                struct sk_buff *skb;

                skb = dev_alloc_skb(pkt_len);
                if (skb == NULL) {
                    printk(KERN_NOTICE "%s: Memory squeeze, dropping packet.\n",
                           dev->name);
                    lp->stats.rx_dropped++;
                    break;
                }

                /* 'skb->data' points to the start of sk_buff data area. */
                memcpy(skb_put(skb, pkt_len), (void *)dev->rmem_start,
                       pkt_len);
                /* or */
                insw(ioaddr, skb->data, (pkt_len + 1) >> 1);

                skb->dev = dev;
                skb->protocol = eth_type_trans(skb, dev);
                netif_rx(skb);
                dev->last_rx = jiffies;

                /* Count the packet once, after it was handed to the kernel. */
                lp->stats.rx_packets++;
                lp->stats.rx_bytes += pkt_len;
            }
        } while (--boguscount);

        return;
    }
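The control flow of such a receive routine can be tried out without hardware. The following user-space sketch models the bounded loop and the decoding of the status bits; the types fake_frame and rx_counters and the function rx_drain() are hypothetical stand-ins for the adapter registers and the statistics structure, and only the bit values are taken from the listing above.

```c
#include <stddef.h>

/* Status bits as they are tested in the skeleton driver's net_rx(). */
#define RX_ERR        0x40
#define RX_ERR_FRAME  0x20
#define RX_ERR_OVER   0x10
#define RX_ERR_CRC    0x08
#define RX_ERR_FIFO   0x04

struct fake_frame {          /* hypothetical: one pending frame on the adapter */
    int status;
    int len;                 /* 0 means "no more frames" */
};

struct rx_counters {         /* hypothetical: reduced statistics */
    unsigned long packets, bytes, errors;
    unsigned long frame_errors, over_errors, crc_errors, fifo_errors;
};

/*
 * Handle at most `budget` frames from `queue` (which has `n` slots).
 * Returns the number of frames consumed, good or faulty.
 */
size_t rx_drain(const struct fake_frame *queue, size_t n, int budget,
                struct rx_counters *c)
{
    size_t done = 0;

    while (done < n && budget-- > 0) {
        const struct fake_frame *f = &queue[done];

        if (f->len == 0)     /* all frames read */
            break;
        done++;

        if (f->status & RX_ERR) {
            c->errors++;
            if (f->status & RX_ERR_FRAME) c->frame_errors++;
            if (f->status & RX_ERR_OVER)  c->over_errors++;
            if (f->status & RX_ERR_CRC)   c->crc_errors++;
            if (f->status & RX_ERR_FIFO)  c->fifo_errors++;
        } else {
            c->packets++;
            c->bytes += (unsigned long)f->len;
        }
    }
    return done;
}
```

With twelve good frames pending and a budget of ten, rx_drain() consumes ten frames and leaves two for the next call. The budget keeps a single invocation from occupying the processor indefinitely under heavy load.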

5.3.4 Problems In Transmitting Packets

Even when a packet was passed to the network adapter, it is not yet certain whether the packet can be transmitted. The network adapter could be faulty, or the interrupt with the acknowledgment of the transmission process could have been lost. For this reason, a watchdog timer is used to detect errors.

During the registration of a network device (register_netdevice(); see Section 5.2.1), the watchdog timer dev->watchdog_timer is initialized in the function dev_watchdog_init(). The handling routine of the timer is set not to the function dev->tx_timeout(), but to dev_watchdog(). Also, the net_device structure of the network device is entered as the timer's private data.

When the network device is activated (dev_open()) at a later point in time (see Section 5.2.2), then the watchdog timer is started by the method dev_activate() or dev_watchdog_up(). The time when the timer should be triggered is set to jiffies + dev->watchdog_timeo. If no valid value was stated for the interval when the network device was registered, then dev_watchdog_up() takes 5 · HZ.

This means that the handling routine of the watchdog timer runs every dev->watchdog_timeo ticks. Each time it runs, dev_watchdog() first checks whether the network device is active and usable at all. If the transmit queue is still stopped because the transmit buffers of the network adapter are full (netif_queue_stopped(dev)) and the condition (jiffies - dev->trans_start > dev->watchdog_timeo) is met, then there is a problem. The driver method dev->tx_timeout() is called to solve this problem, as is described later in this chapter.

If no problem occurred, or if the network device is not active, then the timer is registered again to be executed in dev->watchdog_timeo ticks.

In earlier kernel versions, the drivers of network devices were themselves responsible for implementing and managing a watchdog timer. In newer versions, the mechanism described above takes over that task, so a driver only has to implement the adapter-specific reset method dev->tx_timeout().

`net_tx_timeout()`	drivers/net/isa_skeleton.c

When a problem occurs during the transmission of data packets, the watchdog timer of the network device (dev->watchdog_timer) described above detects it. As soon as more than dev->watchdog_timeo ticks have passed since the start of the last transmission (trans_start), the handling routine dev->tx_timeout() should take care of the problem.

This handling routine is responsible for analyzing the problem and for handling it. Often, the only way to solve the problem is to reset and reinitialize the complete hardware of the network adapter. In any event, an attempt should be made to send the packets waiting in the queue.

    static void net_tx_timeout(struct net_device *dev)
    {
        struct net_local *np = (struct net_local *)dev->priv;

        printk(KERN_WARNING "%s: transmit timed out, %s?\n", dev->name,
               tx_done(dev) ? "IRQ conflict" : "network cable problem");

        /* Try to restart the adaptor. */
        chipset_init(dev, 1);

        np->stats.tx_errors++;

        /* If we have space available to accept new transmit
         * requests, wake up the queuing layer. This would
         * be the case if the chipset_init() call above just
         * flushes out the tx queue and empties it.
         *
         * If instead, the tx queue is retained then the
         * netif_wake_queue() call should be placed in the
         * TX completion interrupt handler of the driver instead
         * of here.
         */
        if (!tx_full(dev))
            netif_wake_queue(dev);
    }

5.3.5 Runtime Configuration

`example_set_config()`	drivers/net/isa_skeleton.c

In certain situations, it can be necessary to change the configuration of the system resources used by a network adapter at runtime, for example, when the interrupt cannot be identified automatically, or when there are conflicts with other devices. The driver method set_config() can be used to manipulate the configuration of system resources (i.e., interrupt, DMA, etc.) at runtime.

When the current configuration is polled on the application level, then the irq, dma, base_addr, mem_start, and mem_end parameters can be read directly from the net_device structure. However, when one of these parameters has to be changed, then we need a driver-specific method to effect the changes in the adapter. We will use the method net_set_config(), which allows us to change only the interrupt line, as an example to show how system resources can be changed in general.

The driver method set_config() is called when an application process invokes the ioctl() command SIOCSIFMAP (Socket I/O Control Set InterFace MAP). Beforehand, however, the process should have read the current configuration by use of the ioctl() command SIOCGIFMAP (Socket I/O Control Get InterFace MAP). The reason is that, when it wants to change a value, the other parameters should have the current values.

For both ioctl() commands, the system parameters are passed in a structure of the type ifmap. The ifmap structure has the following fields, corresponding to the fields with the same names in the net_device structure.

    struct ifmap {
        unsigned long  mem_start;
        unsigned long  mem_end;
        unsigned short base_addr;
        unsigned char  irq;
        unsigned char  dma;
        unsigned char  port;
    };

The method's return value is also used as return value for the ioctl() call. Drivers that don't implement set_config() return -EOPNOTSUPP.

    static int net_set_config(struct net_device *dev, struct ifmap *map)
    {
        if (dev->flags & IFF_UP)        /* no changes on running devices */
            return -EBUSY;

        /* we don't allow to change the port address */
        if (map->base_addr != dev->base_addr) {
            return -EOPNOTSUPP;
        }

        /* changing the irq is o.k. */
        if (map->irq != dev->irq) {
            dev->irq = map->irq;
        }

        /* ... */
        return 0;
    }
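Seen from an application, the sequence described above (first read the map with SIOCGIFMAP, then write it back with SIOCSIFMAP) might look like the following sketch. The helper names ifreq_init(), ifmap_get(), and ifmap_set_irq() are invented for this example. Changing the map normally requires administrator privileges, and whether a change is accepted depends entirely on the driver.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE          /* for struct ifmap in <net/if.h> */
#endif
#include <net/if.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

/* Prepare an ifreq for `ifname`; returns -1 if the name does not fit. */
static int ifreq_init(struct ifreq *ifr, const char *ifname)
{
    size_t n = strlen(ifname);

    if (n == 0 || n >= IFNAMSIZ)
        return -1;
    memset(ifr, 0, sizeof(*ifr));
    memcpy(ifr->ifr_name, ifname, n);
    return 0;
}

/* Read the current resource map of `ifname`. Returns 0 on success. */
int ifmap_get(const char *ifname, struct ifmap *map)
{
    struct ifreq ifr;
    int fd, rc;

    if (ifreq_init(&ifr, ifname) < 0)
        return -1;
    fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;

    rc = ioctl(fd, SIOCGIFMAP, &ifr);
    if (rc == 0)
        *map = ifr.ifr_map;
    close(fd);
    return rc == 0 ? 0 : -1;
}

/* Change only the IRQ; all other fields keep the values just read. */
int ifmap_set_irq(const char *ifname, unsigned char irq)
{
    struct ifreq ifr;
    int fd, rc;

    if (ifreq_init(&ifr, ifname) < 0)
        return -1;
    fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;

    rc = ioctl(fd, SIOCGIFMAP, &ifr);
    if (rc == 0) {
        ifr.ifr_map.irq = irq;
        rc = ioctl(fd, SIOCSIFMAP, &ifr);
    }
    close(fd);
    return rc == 0 ? 0 : -1;
}
```

Reading before writing matters because the driver compares every field of the map it receives; with the sample driver above, a request that carries a zeroed base_addr would be rejected with -EOPNOTSUPP even though only the interrupt line was meant to change.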

5.3.6 Adapter-Specific ioctl() Commands

ioctl() commands are extremely useful tools to start certain actions from within the user address space. Normally, executing the system call ioctl() in a socket causes an ioctl() command of a network protocol to be invoked. The corresponding symbols are defined in the file include/linux/sockios.h and normally relate to a specific protocol instance. However, when an ioctl() command of higher protocol instances cannot be processed, then the kernel forwards it to the network devices, which can then define their own commands in the driver method do_ioctl().

`net_do_ioctl()`	drivers/net/isa_skeleton.c

The ioctl() implementation for sockets knows 16 additional ioctl() commands, which can be used by drivers. More specifically, these are the commands SIOCDEVPRIVATE to SIOCDEVPRIVATE + 15. When one of these commands is used, then the method dev->do_ioctl() of the relevant network device is invoked.

When called, do_ioctl(dev, ifr, cmd) gets a pointer to a structure of the type ifreq. This pointer (ifr) points to an address in the kernel address space, which contains a copy of the ifreq structure passed by the user. After the do_ioctl() method returns, this structure is copied back to the user address space. This means that a network driver can use its own ioctl() commands both to receive and to output data. Examples of driver-specific ioctl() commands include reading or writing special registers, such as the MII register of some modern network adapters (eepro100, epic100, etc.).

We use the following basic example to demonstrate a driver-specific ioctl() implementation:

    static int net_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd)
    {
        struct net_local *lp = (struct net_local *)dev->priv;
        long ioaddr = dev->base_addr;
        u16 *data = (u16 *)&ifr->ifr_data;
        int phy = lp->phy[0] & 0x1f;

        switch (cmd) {
        case SIOCDEVPRIVATE:        /* Get the address of the PHY in use */
            data[0] = phy;
            return 0;
        case SIOCDEVPRIVATE+1:      /* Special ioctl command 1 */
            special_ioctl_1();
            return 0;
        case SIOCDEVPRIVATE+2:      /* Special ioctl command 2 */
            special_ioctl_2();
            return 0;
        /* ... */
        default:
            return -EOPNOTSUPP;
        }
    }
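On the application side, a call to such a driver-specific command might look like the following sketch. The function name query_phy_address() is invented for this example, and the interpretation of the result assumes a driver that, like the sample above, writes the PHY address into the first 16 bits of the ifreq data area; other drivers may assign a different meaning to SIOCDEVPRIVATE or not implement it at all.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE          /* for the full struct ifreq in <net/if.h> */
#endif
#include <net/if.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Ask the driver of `ifname` for the PHY address via SIOCDEVPRIVATE.
 * Returns 0 and stores the value in *phy on success, -1 otherwise.
 */
int query_phy_address(const char *ifname, uint16_t *phy)
{
    struct ifreq ifr;
    size_t n = strlen(ifname);
    int fd, rc;

    if (n == 0 || n >= IFNAMSIZ)
        return -1;
    memset(&ifr, 0, sizeof(ifr));
    memcpy(ifr.ifr_name, ifname, n);

    fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;
    rc = ioctl(fd, SIOCDEVPRIVATE, &ifr);
    close(fd);
    if (rc != 0)
        return -1;

    /* The driver stored data[0] in the ifreq union itself. */
    memcpy(phy, &ifr.ifr_data, sizeof(*phy));
    return 0;
}
```

The result is copied out of the ifreq structure only after ioctl() has returned, which corresponds to the kernel copying the structure back to the user address space.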

5.3.7 Statistical Information About a Network Device

In most cases, we want to obtain statistical information about the operation of a network device or its network adapter. Detailed logging of events can help us find and troubleshoot errors and faulty configurations easily. For this purpose, the Linux kernel always uses the data structure net_device_stats.

`struct net_device_stats`	include/linux/netdevice.h

rx_packets and tx_packets contain the total number of packets successfully received and transmitted, respectively, over this network device.
rx_errors and tx_errors store the number of faulty packets received and unsuccessful transmissions, respectively. Typical receive errors are wrong checksums or wrong packet sizes. Transmit errors are mainly due to physical problems or faulty configurations.
rx_dropped and tx_dropped give the number of incoming and outgoing packets that were dropped for various reasons (e.g., memory unavailable for packet data).
multicasts shows the number of multicast packets received.

The net_device_stats structure has a number of additional fields you can use to specify occurring errors in more detail, such as the number of ring buffer overflows, CRC errors, and synchronization errors. The exact structure and content of the net_device_stats structure can be found in <linux/netdevice.h>. In addition, there is a separate structure (iw_statistics) for wireless network adapters, containing radio connection data. (See the file include/linux/wireless.h.)

`net_get_stats()`	drivers/net/isa_skeleton.c

Interestingly, there is no pointer to the net_device_stats structure for statistical data in the net_device structure. The structure for statistical data has to be accommodated in the private data space of a network driver and is made accessible through the driver method get_stats().

get_stats(dev) returns a pointer to the statistical data of a network device (dev). A sample implementation might look like this:

    /*
     * Get the current statistics.
     * This may be called with the card open or closed.
     */
    static struct net_device_stats *net_get_stats(struct net_device *dev)
    {
        struct net_local *lp = (struct net_local *)dev->priv;
        short ioaddr = dev->base_addr;

        /* Update the statistics from the device registers. */
        lp->stats.rx_missed_errors = inw(ioaddr+1);

        return &lp->stats;
    }
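The arrangement described in this section (counters kept in the driver's private data and reached only through a method pointer) can be modeled outside the kernel. In the following sketch, all names beginning with sketch_ are hypothetical and stand in for net_device, net_local, and net_device_stats in strongly reduced form.

```c
#include <stddef.h>

struct sketch_stats {               /* reduced net_device_stats */
    unsigned long rx_packets, tx_packets;
    unsigned long rx_errors, tx_errors;
};

struct sketch_dev {                 /* reduced net_device */
    const char *name;
    void *priv;                     /* driver-private data */
    struct sketch_stats *(*get_stats)(struct sketch_dev *dev);
};

struct sketch_priv {                /* reduced net_local */
    struct sketch_stats stats;
    /* further driver state would follow here */
};

/* Driver method: the only code that knows where the counters live. */
static struct sketch_stats *sketch_get_stats(struct sketch_dev *dev)
{
    struct sketch_priv *lp = dev->priv;

    return &lp->stats;
}

/* Driver initialization: attach private data and register the method. */
void sketch_dev_setup(struct sketch_dev *dev, struct sketch_priv *priv)
{
    dev->priv = priv;
    dev->get_stats = sketch_get_stats;
}

/* Generic caller: works for any driver, tolerates a missing method. */
unsigned long sketch_rx_packets(struct sketch_dev *dev)
{
    struct sketch_stats *s = dev->get_stats ? dev->get_stats(dev) : NULL;

    return s ? s->rx_packets : 0;
}
```

The generic code never touches the private structure directly, so each driver remains free to lay out its private data as it likes.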

5.3.8 Multicast Support on Adapter Level

`net_set_multicast_list()`	drivers/net/isa_skeleton.c

A network adapter uses the MAC destination address of a data packet to decide whether it will accept or ignore it. This process runs on the network adapter, so it doesn't interfere with the central processor's work. The central processor will be interrupted in its work only if the network adapter triggers an interrupt because it wants to forward the packet to higher protocol instances. In general, a network adapter accepts only packets intended for it, to ensure that the processor is not unnecessarily interrupted. Of course, an exception to this rule is the promiscuous mode, where all packets are accepted for analytical purposes.

For unicast packets, it is relatively easy to see whether the computer is interested in a packet. The network adapter merely has to detect its own MAC address as the destination address contained in the layer-2 packet header. Broadcast packets are also accepted without exception. However, the situation is different when detecting the correct multicast packets. How can the card know whether the computer is interested in the data of that group? In case of doubt, the card accepts the packet and passes it on to higher protocols, which should be able to know the groups subscribed. Though this method is very expensive, because the central processor has to check each multicast packet, it is the only way for some (older) network adapters to receive the correct multicast packets.
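The acceptance rule of such a simple adapter can be written down in a few lines. The following sketch is a user-space model with invented function names, not adapter firmware; it relies on the fact that group (multicast) addresses have the least significant bit of the first address byte set, which is also true for the broadcast address.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MAC_LEN 6

/* Broadcast: all six address bytes are 0xff. */
bool mac_is_broadcast(const uint8_t addr[MAC_LEN])
{
    static const uint8_t bcast[MAC_LEN] =
        {0xff, 0xff, 0xff, 0xff, 0xff, 0xff};

    return memcmp(addr, bcast, MAC_LEN) == 0;
}

/* Group address: least significant bit of the first byte is set. */
bool mac_is_group(const uint8_t addr[MAC_LEN])
{
    return (addr[0] & 0x01) != 0;
}

/*
 * Decision of an adapter without a multicast filter: it cannot tell
 * which groups are subscribed, so it passes every group frame upward.
 */
bool adapter_accepts(const uint8_t dst[MAC_LEN], const uint8_t own[MAC_LEN],
                     bool promisc)
{
    if (promisc)
        return true;                     /* accept everything */
    if (memcmp(dst, own, MAC_LEN) == 0)
        return true;                     /* unicast to this adapter */
    if (mac_is_broadcast(dst))
        return true;
    return mac_is_group(dst);            /* let higher layers decide */
}
```

A frame addressed to another station's unicast address is dropped on the adapter, whereas every multicast frame reaches the processor, which is exactly the cost described above.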

Modern network adapters offer better support for multicast on the MAC level. Such adapters manage a list of MAC group addresses for which they want to receive packets. If only the packets of a specific multicast group should be received, then the corresponding MAC group address is passed to the network adapter, which will then receive the multicast packets of that group. Section 17.4.1 describes the connection between groups and group addresses on the MAC and IP levels.

A network device stores the active MAC group addresses in a list (dev->mc_list). Whenever a new address is added or the state of the network device changes, the driver method dev->set_multicast_list transfers this list to the adapter. The accompanying example illustrates how this method works.

When the network device is in promiscuous mode, then this mode is activated on the card. If all multicast packets should be received or if the list of MAC multicast addresses is bigger than the filter memory on the adapter, then all multicast packets are received; otherwise, the desired MAC addresses are transferred to the adapter, for example, by a function such as hardware_set_filter().

    /*
     * Set or clear the multicast filter for this adaptor.
     * num_addrs == -1 Promiscuous mode, receive all packets
     * num_addrs == 0  Normal mode, clear multicast list
     * num_addrs > 0   Multicast mode, receive normal and MC packets,
     *                 and do best-effort filtering.
     */
    static void set_multicast_list(struct net_device *dev)
    {
        short ioaddr = dev->base_addr;

        if (dev->flags & IFF_PROMISC)
        {
            /* Enable promiscuous mode */
            outw(MULTICAST|PROMISC, ioaddr);
        }
        else if ((dev->flags & IFF_ALLMULTI) || dev->mc_count > HW_MAX_ADDRS)
        {
            /* Disable promiscuous mode, use normal mode. */
            hardware_set_filter(NULL);
            outw(MULTICAST, ioaddr);
        }
        else if (dev->mc_count)
        {
            /* Walk the address list, and load the filter */
            hardware_set_filter(dev->mc_list);
            outw(MULTICAST, ioaddr);
        }
        else
            outw(0, ioaddr);
    }
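The order of the tests in this method is what determines the receive mode. The following user-space sketch isolates that decision; the enumeration rx_mode, the function choose_rx_mode(), and the filter size SKETCH_HW_MAX_ADDRS are assumptions made for this example, while the flags IFF_PROMISC and IFF_ALLMULTI are the ones defined in <net/if.h>.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE          /* for the IFF_* flags in <net/if.h> */
#endif
#include <net/if.h>

#define SKETCH_HW_MAX_ADDRS 16   /* hypothetical size of the adapter filter */

enum rx_mode {
    RX_NORMAL,                   /* own address and broadcast only */
    RX_MULTICAST_FILTERED,       /* adapter filters the listed groups */
    RX_MULTICAST_ALL,            /* every multicast frame is accepted */
    RX_PROMISC                   /* every frame is accepted */
};

/* Choose the receive mode from interface flags and group-list length. */
enum rx_mode choose_rx_mode(unsigned int flags, int mc_count)
{
    if (flags & IFF_PROMISC)
        return RX_PROMISC;
    if ((flags & IFF_ALLMULTI) || mc_count > SKETCH_HW_MAX_ADDRS)
        return RX_MULTICAST_ALL;     /* list would not fit the filter */
    if (mc_count > 0)
        return RX_MULTICAST_FILTERED;
    return RX_NORMAL;
}
```

With the assumed filter size of 16, a list of 16 group addresses is still filtered on the adapter, whereas a list of 17 addresses makes the adapter accept all multicast frames and leaves the selection to the higher layers.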