17.4 Multicast Data Path in the Linux Kernel

   


This section describes how multicast data packets are processed in the Linux kernel. To get a good insight into matters, we will first explain the path a multicast packet takes across the kernel, then discuss different aspects of the data path. We will begin on the MAC layer, to see how multicasting is supported in local area networks and to introduce the following IP multicast concepts:

  • virtual network devices,

  • multicast routing tables, and

  • replication of data packets.

As we introduce the implementation, we will emphasize differences between multicast-capable end systems and multicast routers.

17.4.1 Multicast Support on the MAC Layer

In general, IEEE-802.x LANs are broadcast-enabled: each data packet is sent to each participant. Each network adapter looks at the MAC destination address to decide whether it will accept and process a packet. This process is normally handled by the network adapter and doesn't interfere with the central processor's work. The central processor is interrupted only when the adapter decides that a packet has to be forwarded to the higher layers. This means that the filtering of packets in the network adapter takes load off the CPU and ensures that the CPU receives only packets that are actually addressed to the local computer.

Filtering undesired MAC frames works well in the case of unicast packets, because each adapter should know its MAC address. However, how can the card know whether the computer is interested in the data of a group when a multicast packet arrives? In case of doubt, the adapter accepts the packet and passes it on to the higher-layer protocols, which should know all subscribed groups. The next question is whether multicast packets use the MAC address at all. The MAC format supports group addresses, but how are they structured?

There is a clever solution for IP multicast groups that solves the problems described above: on the one hand, it prevents broadcasting of multicast packets; on the other hand, it filters IP groups already on the MAC layer. The method described here is simple, and it relieves the central processing unit of many unnecessary interrupts.

IP multicast packets are packed in MAC frames before they are sent to the local area network, and they contain a MAC group address. The MAC address is selected so that it gives a clue about which multicast group the packet could belong to. Figure 17-9 shows how this address is structured; it contains the following elements:

  • The first 25 bits of the MAC address identify the group address for IP multicast.

    The first byte (0x01) shows that the address is a group MAC address; the decisive bit is the group bit, the least-significant bit of this byte. Notice that the address shown in Figure 17-9 is represented in network byte order.

    The next 17 bits (the two bytes 0x00 0x5E, followed by a 0 bit) state that the MAC frame carries an IP multicast packet. The identifier here would be different for other layer-3 protocols.

  • The last 23 bits carry the last 23 bits of the IP multicast address.

Figure 17-9. Mapping an IP multicast group address to an IEEE-802 MAC address.

graphics/17fig09.gif


We can easily see that this is not a reversible mapping between an IP multicast address and a MAC address. For each multicast MAC address, there are 2^5 = 32 possible IP multicast groups matching this MAC address. This means that the network adapters cannot filter exactly; instead, they pass all packets with matching MAC group addresses to IP. In any event, this filtering takes a lot of load off the central processing unit, because normally only packets of multicast groups actually subscribed are delivered to IP.
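The ambiguity of this mapping is easy to demonstrate with a small user-space sketch of the address computation, modeled on the kernel's ip_eth_mc_map(); the helper name and the host-byte-order interface are illustrative, not the kernel's:

```c
#include <stdint.h>

/* Map an IPv4 multicast address (host byte order, for clarity)
 * to the corresponding IEEE-802 MAC group address.
 * Modeled on the kernel's ip_eth_mc_map(). */
static void ip_mc_map(uint32_t ip, uint8_t mac[6])
{
    mac[0] = 0x01;               /* group bit set; IANA prefix 01-00-5E */
    mac[1] = 0x00;
    mac[2] = 0x5E;
    mac[3] = (ip >> 16) & 0x7F;  /* only the low 23 bits of the IP survive */
    mac[4] = (ip >> 8) & 0xFF;
    mac[5] = ip & 0xFF;
}
```

Because the upper 5 variable bits of the IP address are masked out, 224.1.2.3 and 225.1.2.3 (among 30 others) yield the identical MAC address 01:00:5E:01:02:03.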

We still have to answer one question: How can the network adapter know which groups IP has subscribed to? To solve this problem, each network adapter manages a list of multicast group addresses that should be received on this adapter. If the network adapter technically supports hardware filtering of multicast packets, then a driver method (set_multicast_list()) transfers this list to the adapter. The adapter can then filter without disturbing the central processing unit.

The active multicast groups of a network device's IP instance are maintained in the mc_list of the in_device structure. When an application joins an IP multicast group, the IP group address and some other information are recorded in this list. In addition, ip_eth_mc_map() computes the appropriate MAC group address (for Ethernet networks) and adds it to the list dev->mc_list. Once set_multicast_list() has updated the card, the network adapter should be able to receive packets for this group. Figure 17-10 shows schematically how groups are managed in the IP instance and in the network device.

Figure 17-10. Managing multicast groups or addresses in an IP instance and in a network device.

graphics/17fig10.gif


struct ip_mc_list

include/linux/igmp.h


  • multiaddr is the multicast address of the subscribed group.

  • interface points to the net_device structure of the network adapter.

  • next points to the next entry in the list.

  • timer is a timer used by IGMP to delay membership reports.

  • tm_running shows whether or not the timer is currently active.

  • reporter contains the value 1, if this computer sent the last membership report for this group to the multicast router. If another computer was faster, then reporter is set to 0. This information is required to send leave messages, which are transmitted exclusively by the reporter.

  • users counts the number of sockets that have subscribed to this group. When one of these sockets is closed, an IGMPv2 leave message is sent only if no other users remain.
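The interplay of the users counter and the reporter flag can be sketched with a hypothetical user-space helper (the names mc_group, mc_join(), and mc_leave() are illustrative, not the kernel's):

```c
#include <stdbool.h>

/* Sketch of the refcounting described above: a leave message is due
 * only when the last subscribing socket closes, and only if this host
 * sent the most recent membership report for the group. */
struct mc_group {
    int  users;     /* number of subscribing sockets */
    bool reporter;  /* did this host send the last membership report? */
};

static void mc_join(struct mc_group *g)
{
    g->users++;
}

/* Returns true if an IGMPv2 leave message should be sent now. */
static bool mc_leave(struct mc_group *g)
{
    if (--g->users > 0)
        return false;       /* other sockets are still subscribed */
    return g->reporter;     /* only the reporter sends the leave */
}
```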

dev->mc_list

include/linux/netdevice.h


  • next points to the next entry in the list.

  • dmi_addr[MAX_ADDR_LEN]: This is the layer-2 address of the group the packets of which should be received.

  • dmi_addrlen specifies the length of the layer-2 address in dmi_addr.

17.4.2 Multicast Data Path in the Internet Protocol

Now that we have described how multicasting is supported on the MAC layer, this section explains how IP multicast packets are processed in the Linux implementation of the Internet Protocol. We will first look at end systems and multicast routers.

However, before we discuss the details of the processes involved, we want to introduce two important components of the multicast implementation: virtual network devices, which abstract from the two possible transmission types for multicast packets (i.e., LAN adapter card or IP tunnel), and the multicast routing table (multicast forwarding cache).

Virtual Network Devices

Multicast packets can be received and sent in two different ways: either directly over the network adapter in a LAN, or packed inside a second (unicast) IP packet and transported through a tunnel. To avoid having to distinguish these two cases throughout the multicast implementation, an abstraction, the so-called virtual network device or virtual interface (VIF), was introduced for both types.

struct vif_device

include/linux/mroute.h


In the Linux kernel, virtual network devices are represented by the vif_device structure. A virtual interface describes either a physical network device of the type net_device or an IP-IP tunnel. A flag is used to distinguish these two methods.

 struct vif_device {
        struct net_device  *dev;                /* Device we are using */
        unsigned long      bytes_in, bytes_out;
        unsigned long      pkt_in, pkt_out;     /* Statistics */
        unsigned long      rate_limit;          /* Traffic shaping (NI) */
        unsigned char      threshold;           /* TTL threshold */
        unsigned short     flags;               /* Control flags */
        __u32              local, remote;       /* Addresses (remote for tunnels) */
        int                link;                /* Physical interface index */
 };

  • dev: a pointer to the network device (net_device), which may be used

  • bytes_in, bytes_out: statistical information about the transported bytes

  • pkt_in, pkt_out: number of packets handled

  • threshold: threshold value for packets that should be sent over this virtual network device (as in Section 17.2.3)

  • flags: Flags to specify, for example, whether the VIF represents a tunnel (VIFF_TUNNEL).

  • local and remote: either (a) the IP addresses of the tunnel starting point and the tunnel end point, or (b) the IP address of the network device

  • link: Index of the physical network device.

Figure 17-11 shows how the vif_device structure is structured and embedded in its environment. Virtual network devices are stored in the table (array) vif_table, which can maintain a maximum of MAXVIFS entries per computer. This maximum is 32 and cannot be increased on current personal computers, because the allocation state of all VIFs has to fit into the variable vifc_map (net/ipv4/ipmr.c). Each bit of vifc_map marks whether the virtual network device with the corresponding index has already been created. The maximum of 32 VIFs on a 32-bit architecture thus results from the data type, unsigned long, of this variable; MAXVIFS can limit (but not extend) this number. One consequence of this limitation is that a multicast router can be directly connected to at most MAXVIFS other multicast routers.
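The bitmap-based limit can be illustrated with a minimal allocation sketch, assuming a 32-bit map like vifc_map; the helper name vif_alloc() is hypothetical:

```c
#include <stdint.h>

#define MAXVIFS 32

/* Bit i set means: VIF index i is already allocated. A 32-bit map
 * caps the number of virtual network devices at 32. */
static uint32_t vifc_map;

/* Return the first free VIF index and mark it used, or -1 if full. */
static int vif_alloc(void)
{
    for (int i = 0; i < MAXVIFS; i++) {
        if (!(vifc_map & (1u << i))) {
            vifc_map |= 1u << i;
            return i;
        }
    }
    return -1;   /* vif_table full: at most MAXVIFS entries */
}
```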

Figure 17-11. The vif_table of a virtual network device.

graphics/17fig11.gif


Each entry in the array represents either a physical network device or a tunnel. The entries are set and removed by the multicast routing daemon (e.g., mrouted) by use of the socket options MRT_ADD_VIF and MRT_DEL_VIF. The parameters of the vif_device structure are passed in a vifctl structure (VIF control). Section 17.5.2 discusses how virtual network devices are configured.

Multicast Forwarding Cache

The multicast forwarding cache (MFC) is the central structure used to store information about how often incoming multicast packets have to be replicated and where they have to be forwarded. This means that the MFC implements the multicast routing table. The MFC is built in the form of a hash table, as we can easily see from the mfc_cache structure. All entries with the same hash value are linked into a singly linked list, forming a cache line. All cache lines are grouped in the array mfc_cache_array to form an MFC hash table with a size specified by MFC_LINES (include/linux/mroute.h). By default, the multicast forwarding cache comprises 64 lines. Figure 17-12 shows schematically how the MFC is structured.

Figure 17-12. Structure of the multicast forwarding cache (mfc_cache).

graphics/17fig12.gif


When a route has to be found for an incoming multicast packet, the multicast forwarding cache has to be searched for a matching entry. To find it, mfc_hash is initially used to determine the correct cache line; the packet's source address and the multicast group address are used as parameters. Next, the linked list in the cache line is processed until a matching entry is found. The entries in the MFC are of the mfc_cache data type.
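The lookup procedure can be sketched in user space as follows; the entry type is reduced to its key fields, and the hash function is merely illustrative (the kernel's mfc_hash() differs in detail):

```c
#include <stddef.h>
#include <stdint.h>

#define MFC_LINES 64   /* number of cache lines, as in the kernel */

/* Simplified MFC entry: only the linkage and the (origin, group) key. */
struct mfc_entry {
    struct mfc_entry *next;   /* singly linked cache line */
    uint32_t origin;          /* IP address of the sender */
    uint32_t mcastgrp;        /* IP multicast group address */
};

static struct mfc_entry *mfc_cache_array[MFC_LINES];

/* Illustrative hash over the two key fields. */
static unsigned mfc_hash(uint32_t origin, uint32_t mcastgrp)
{
    return (origin ^ mcastgrp ^ (mcastgrp >> 16)) % MFC_LINES;
}

/* Prepend an entry to its cache line. */
static void mfc_insert(struct mfc_entry *e)
{
    unsigned h = mfc_hash(e->origin, e->mcastgrp);
    e->next = mfc_cache_array[h];
    mfc_cache_array[h] = e;
}

/* Walk the cache line until the (origin, group) key matches. */
static struct mfc_entry *mfc_find(uint32_t origin, uint32_t mcastgrp)
{
    struct mfc_entry *e = mfc_cache_array[mfc_hash(origin, mcastgrp)];
    while (e && (e->origin != origin || e->mcastgrp != mcastgrp))
        e = e->next;
    return e;                 /* NULL if no matching route exists */
}
```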

The ttls field includes information about virtual network devices that can be used to forward the packet. A TTL entry smaller than 255 means that the virtual network device with an index identical to the index in the ttls array should forward these packets. The ttls array is processed sequentially. To optimize this process, minvif and maxvif store the minimum and maximum VIF indexes that should receive the packet.

struct mfc_cache

include/linux/mroute.h


 struct mfc_cache {
        struct mfc_cache *next;          /* Next entry on cache line */
        __u32            mfc_mcastgrp;   /* Group the entry belongs to */
        __u32            mfc_origin;     /* Source of packet */
        vifi_t           mfc_parent;     /* Source interface */
        int              mfc_flags;      /* Flags on line */
        union {
                struct {
                        unsigned long expires;
                        struct sk_buff_head unresolved; /* Unresolved buffers */
                } unres;
                struct {
                        unsigned long last_assert;
                        int minvif, maxvif;
                        unsigned long bytes;
                        unsigned long pkt;
                        unsigned long wrong_if;
                        unsigned char ttls[MAXVIFS];    /* TTL thresholds */
                } res;
        } mfc_un;
 };

  • next points to the next entry in the multicast forwarding cache. The cache lines are organized in singly linked lists. A NULL pointer in this field marks the end of a cache line.

  • mfc_mcastgrp and mfc_origin together form the key for an entry in the multicast forwarding cache. mfc_origin is the IP address of the sending computer, and mfc_mcastgrp specifies the multicast group for a multicast packet, the route of which is represented by this entry.

  • mfc_parent is the index of the virtual network device in the vif_table, over which packets of this MFC entry should arrive.

  • The mfc_un structure is a union: it contains either an unres structure or a res structure. The two are combined in a union because either one or the other is required, but never both.

  • unres is used for entries in the multicast forwarding cache when the multicast routing daemon has not yet finalized the routing selection. The entry for the mfc_cache structure is created in the MFC as soon as a packet for it arrives.

    • unresolved is a queue for socket buffers that store packets for this routing entry until the multicast routing daemon has selected a route.

    • expires specifies the time by which the daemon should have selected a route.

  • res is used in the mfc_un union when the multicast routing daemon has specified the routes for this entry.

    • minvif and maxvif are indexes to elements in the ttls list of the MFC entry. They limit the range of virtual network devices currently used to send packets of this MFC entry. Stating these indexes saves computing time when multicast packets are duplicated. The maxvif index is limited by the MAXVIFS constant.

    • ttls[MAXVIFS] is an array with MAXVIFS entries, where each entry specifies whether a packet should be forwarded over the virtual network device with the corresponding index in the vif_table list. This is the case when the entry holds a value less than 255. The value 0 cannot occur, because it is mapped to the value 255 when the table is built. However, an entry less than 255 is not by itself sufficient to forward a packet; the TTL value of the packet also has to exceed the threshold value in the ttls array. This is the method used to implement the threshold value for multicast packets described in Section 17.2.3.
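The per-VIF forwarding decision can be summarized by a small predicate, assuming the strict comparison used in the ip_mr_forward() loop; the helper name vif_forwards() is hypothetical:

```c
#include <stdbool.h>

/* Decide whether one VIF should forward a packet, given the entry
 * from the ttls[] array and the packet's remaining TTL. */
static bool vif_forwards(unsigned char ttl_threshold, unsigned char pkt_ttl)
{
    if (ttl_threshold >= 255)       /* 255 marks "do not forward here" */
        return false;
    return pkt_ttl > ttl_threshold; /* strict comparison, as in the
                                       kernel's forwarding loop */
}
```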

Paths of a Multicast Packet Through the Linux Kernel

Figure 17-13 shows the paths a multicast packet can take to travel through the Linux kernel. Like any other IP packet, a multicast packet is received by ip_rcv(). The routing cache is normally asked for further packet-forwarding instructions in ip_route_input() (see Chapter 16), and multicast packets are no exception. In the case of an end system, ip_check_mc() checks on whether the multicast group is required in the computer at all. This means that all undesired multicast packets (i.e., packets passed upwards by the network card, either because of an unclear mapping of IP multicast addresses to MAC addresses, or because of an adapter without hardware filter, or by the promiscuous mode) are discarded. If a socket desires a packet, then the flag RTCF_LOCAL is attached to the packet at this point. In general, the flag RTCF_MULTICAST is also set in multicast packets.

Figure 17-13. Overview of how a multicast packet can travel through the Linux kernel.

graphics/17fig13.gif


If the packet is accepted (i.e., if the group is desired or the station is a multicast router, which accepts all packets), then the multicast packet continues its path in ip_route_input_mc(), where paths go in different directions, depending on whether the station is an end system or a multicast router. In end systems, local packets are passed to the function ip_local_deliver(), which forwards them to the application layer.

In multicast routers, all packets (including local packets) are first handled by ip_mr_input(). The most important task of this function is to use ipmr_cache_find() to find the entry in the MFC. Local packets are passed to ip_local_deliver(), and packets with other destinations are transported to ip_mr_forward(), where they will eventually be replicated. The details of each of these functions are described next.

ip_route_input_mc()

net/ipv4/route.c


ip_route_input() invokes ip_route_input_mc(skb, daddr, saddr, tos, dev, our) if the incoming packet is a multicast packet. First, the function checks the source and destination addresses (saddr, daddr) and returns an error, if present. Also, packets originating from the same computer should not arrive over this input routine for several reasons, including protection against spoofing. Once the sender address has been checked in the forwarding information base, memory space is allocated for a new entry in the routing cache. The route is entered in the cache, and the flag RTCF_MULTICAST is added. Subsequently, the packet is handled in ip_local_deliver() (for end systems) or ip_mr_input() (for multicast routers), and the computed hash value is returned.

ip_mr_input()

net/ipv4/ipmr.c


ip_rcv_finish() invokes ip_mr_input(skb) by use of the function pointer dst->input(), which is set to ip_mr_input() in ip_route_input_mc(), if the computer was configured as a multicast router. The first thing to be checked is whether the packet is an IGMP packet for the multicast routing daemon. If so, the packet is delivered to this daemon over the raw socket. Subsequently, the function ipmr_cache_find searches the multicast forwarding cache for a matching entry for the skb packet. If no matching entry can be found, then a local packet (RTCF_LOCAL) is passed to ip_local_deliver(), and packets that have to be forwarded are added to the queue for incomplete routing entries (unresolved queue in the mfc_cache structure). As soon as the routing daemon has determined the route, these packets are forwarded by ip_mr_forward().

If a valid entry was found in the multicast forwarding cache, then ip_mr_input() passes a pointer to this routing entry to the function ip_mr_forward, which duplicates and forwards the packet. Finally, packets marked local (RTCF_LOCAL) are passed upwards by ip_local_deliver.

ipmr_cache_find()

net/ipv4/ipmr.c


ipmr_cache_find(origin, mcastgrp) searches the multicast forwarding cache (see above) for a specific entry. The source address of the packet and the multicast group address together form the search key. A matching MFC entry lists all virtual network devices (VIFs) that should be used to forward packets. The result is returned as an mfc_cache structure (see above).

ip_mr_forward()

net/ipv4/ipmr.c


ip_mr_forward(skb, mfc, local) obtains the mfc pointer from ip_mr_input(); it points to the entry for the skb packet in the multicast forwarding cache. The function then duplicates the packet for each virtual network device that should be used to forward a copy:

 for (ct = cache->mfc_un.res.maxvif - 1; ct >= cache->mfc_un.res.minvif; ct--) {
        if (skb->nh.iph->ttl > cache->mfc_un.res.ttls[ct]) {
                if (psend != -1)
                        ipmr_queue_xmit(skb, cache, psend, 0);
                psend = ct;
        }
 }

The actual replication of multicast packets is driven by this for loop, which checks all array entries in the ttls field of the mfc structure from maxvif-1 down to minvif. However, ip_mr_forward() does not copy the socket buffers directly; the copies are created later in ipmr_queue_xmit(), which is invoked whenever the multicast packet has sufficient lifetime (TTL) left to be sent over the current virtual network device. For each such VIF, ip_mr_forward() passes the entire MFC structure and the VIF's index to ipmr_queue_xmit().

Before a packet is replicated, ip_mr_forward() checks whether the packet arrived on the expected network device and discards it if it did not. One reason for this check is discussed in Section 17.5.3. A conflict situation can occur if a computer acts as a multicast router and runs multicast applications at the same time: if multicast packets are transported across the loopback network device, then the input network device can deviate from the default multicast route for these packets, and they will be discarded.

ipmr_queue_xmit()

net/ipv4/ipmr.c


ip_mr_forward() invokes ipmr_queue_xmit(skb, mfc, vifi, last) to transmit the multicast packet skb. If the packet is not exclusively available (for example, because there are several references to the packet payload (cloned)), or if the packet is not the last of all of its replicates, then a clone of the socket buffer is created. In addition, a decision has to be made as to whether the packet should be transported through a tunnel or through a regular network device. In either case, ip_route_output() is invoked with the respective parameters. For a tunneled packet, the variables local and remote from the pertaining VIF structure are passed; otherwise, the destination IP address from the socket buffer's structure is sufficient. In both cases, the variable link from the VIF structure passes an index to the relevant physical network device. Subsequently, the netfilter hook NF_IP_FORWARD is invoked. If the packet does not have to be fragmented, it is sent over the function pointer dst->output(); otherwise it is sent over ip_fragment(skb, dst->output).


       


    Linux Network Architecture
    ISBN: 131777203
    Year: 2004
    Pages: 187