17.5 Multicasting in Today's Internet

Multicasting was unheard of at the advent of the Internet: neither group addresses nor protocols to manage groups or multicast routing were available. In fact, the most important prerequisites for an efficient group communication service were missing; the Internet was a pure unicast network. At the beginning of the nineties, when the Internet community had started to think that such a service was necessary, several proposals in this field were made [Deer91]. Eventually, IP multicast was born when the Internet Group Management Protocol and the class D address range were standardized. In addition, multicast routing protocols were proposed, so that nothing actually impeded the introduction of the new communication form. However, though the Internet had evolved into an enormous global network over the preceding twenty years, it was still a unicast network, and each system connected to it would gradually have had to be extended to support IP multicast. This change would certainly have taken several years to complete, and the new technology had not yet been tested extensively. Consequently, a decision was made to build a multicast test network within the unicast Internet, the so-called MBone (Multicast Backbone On the Internet), rather than converting the Internet to multicast from scratch.

17.5.1 The Multicast Backbone (MBone)

The Internet Engineering Task Force (IETF) ran a pilot transmission session to officially introduce MBone in March 1992. Since then, more than 10,000 subnetworks have been connected to this network worldwide. MBone enables the connected multicast-enhanced subnetworks to run IP multicasting over the existing Internet, even though the Internet itself is not multicast capable.
The solution offered by MBone is relatively simple: it builds a virtual multicast network over the conventional Internet, which understands only unicasting, and connected systems communicate over multicast-capable routers (multicast routers). As soon as there is a non-multicast network between them, multicast routers bridge it with a so-called IP-in-IP tunnel. This tunnel consists of a unicast connection used to transport multicast traffic. For this purpose, the multicast router packs a multicast packet into another IP packet at the beginning of the tunnel and sends it as a normal unicast IP packet over the network to the tunnel exit. The multicast router at that end of the tunnel removes the outer unicast packet and sends the multicast packet into the multicast-capable network. This method led to the formation of many multicast-capable islands interconnected by tunnels over the conventional Internet. Figure 17-14 shows an example of the basic MBone architecture. Technically, MBone is a virtual overlay network on top of the Internet. Similar overlay networks have been built to study other Internet technologies, including 6Bone (Six-Bone) for IPv6 and QBone for quality-of-service (QoS) mechanisms.

Figure 17-14. MBone consists of multicast islands connected by tunnels.

17.5.2 Accessing MBone Over the mrouted Daemon

The mrouted daemon is a tool you can use to connect to MBone. It enables you to build tunnels to other MBone nodes and ensure connectivity. In addition, this daemon enables multicast routing for multicast packets within or at the boundaries of a multicast network. The standard implementation of mrouted in UNIX uses the Distance Vector Multicast Routing Protocol (DVMRP; see Section 17.5.3). Like all daemons, mrouted operates in the user address space and can be started and stopped at system runtime. It communicates with the kernel over specific interfaces, which will be introduced in the course of this chapter.
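Before we turn to mrouted's internals, the IP-in-IP encapsulation from Section 17.5.1 can be sketched briefly. The following stand-alone C function is our own simplification, not kernel code: it prepends a 20-byte outer IPv4 header, with protocol number 4 (IP-in-IP), to an existing multicast packet, as a multicast router at the tunnel entry conceptually does.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Standard Internet checksum over a header. */
static uint16_t ip_checksum(const uint8_t *p, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)p[i] << 8 | p[i + 1];
    if (len & 1)
        sum += (uint32_t)p[len - 1] << 8;
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Wrap the multicast packet `inner` in an outer unicast IPv4 header
 * from tunnel entry `src` to tunnel exit `dst` (host byte order).
 * Returns the total length written to `out`. */
size_t ipip_encapsulate(uint8_t *out, const uint8_t *inner, size_t inner_len,
                        uint32_t src, uint32_t dst)
{
    size_t total = 20 + inner_len;
    memset(out, 0, 20);
    out[0] = 0x45;                      /* IPv4, 20-byte header, no options */
    out[2] = (uint8_t)(total >> 8);     /* total length of outer packet */
    out[3] = (uint8_t)total;
    out[8] = 64;                        /* TTL of the outer packet */
    out[9] = 4;                         /* protocol 4: IP-in-IP */
    out[12] = (uint8_t)(src >> 24); out[13] = (uint8_t)(src >> 16);
    out[14] = (uint8_t)(src >> 8);  out[15] = (uint8_t)src;
    out[16] = (uint8_t)(dst >> 24); out[17] = (uint8_t)(dst >> 16);
    out[18] = (uint8_t)(dst >> 8);  out[19] = (uint8_t)dst;
    uint16_t c = ip_checksum(out, 20);  /* checksum over the outer header */
    out[10] = (uint8_t)(c >> 8);
    out[11] = (uint8_t)c;
    memcpy(out + 20, inner, inner_len); /* multicast packet as payload */
    return total;
}
```

At the tunnel exit, the receiving multicast router simply strips these 20 bytes again and hands the inner multicast packet to its multicast-capable network.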
The mrouted daemon can be exchanged at runtime, so different routing algorithms can be implemented. The mrouted daemon is not the only multicast routing daemon for Linux, but it is the most popular and the most frequently used today.

How mrouted Operates

Multicast packet forwarding is separate from the selection of forwarding routes: the kernel is responsible for forwarding, and the routing daemon determines the routes. Again, this shows a well-known principle, the separation of the data path from the control path. To determine routes, the mrouted daemon obtains information about all incoming multicast packets, including their destination and origin. Using this information, it computes the multicast routing tables and passes them to the kernel over a specific interface. This means that the daemon tells the kernel how it should forward packets from a specific sender to a specific group. The paths or routes for multicast packets are stated in the form of virtual network devices, which were introduced in Section 17.4.2.

Interface Between the Multicast Routing Daemon and the Kernel

The mrouted daemon communicates with the Linux kernel over a special socket. The kernel obtains routing information for multicast packets through setsockopt() calls over a raw socket, which mrouted opens ahead of time with the IPPROTO_IGMP protocol. Within the kernel, a reference to the socket used by mrouted to communicate with the kernel is stored in the variable mroute_socket. If mroute_socket contains the value NULL, then no instance of the mrouted daemon is running yet; otherwise, an instance is already active, and a further socket is denied. This makes it possible both to detect an attempt to create a second instance of the daemon and to ensure that the commands to manipulate multicast routing information in the kernel originate from the correct mrouted socket.
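As a sketch of the daemon side of this interface, the following user-space fragment shows how a multicast routing daemon might announce itself and register a tunnel VIF. The MRT_* option values and the structure layout mirror include/linux/mroute.h; struct vifctl_sketch and make_tunnel_vif() are our own illustrative names, and the setsockopt() handshake itself requires root privileges and a multicast-routing-enabled kernel, so it appears only in a comment.

```c
#include <stdint.h>
#include <string.h>
#include <netinet/in.h>

/* Option values as defined in include/linux/mroute.h (MRT_BASE is 200). */
#define MRT_INIT    200
#define MRT_ADD_VIF 202
#define VIFF_TUNNEL 0x1
#define MAXVIFS     32

typedef unsigned short vifi_t;

/* Mirrors struct vifctl from include/linux/mroute.h; renamed here to
 * emphasize that this is an illustration, not the kernel header. */
struct vifctl_sketch {
    vifi_t         vifc_vifi;        /* index into vif_table */
    unsigned char  vifc_flags;       /* VIFF_TUNNEL, ... */
    unsigned char  vifc_threshold;   /* minimum TTL for forwarding */
    unsigned int   vifc_rate_limit;  /* rate limiter (not implemented) */
    struct in_addr vifc_lcl_addr;    /* our address / tunnel entry */
    struct in_addr vifc_rmt_addr;    /* tunnel exit */
};

/* Fill a vifctl describing VIF `index` as an IP-in-IP tunnel from
 * `local_be` to `remote_be` (both in network byte order). */
int make_tunnel_vif(struct vifctl_sketch *v, vifi_t index,
                    uint32_t local_be, uint32_t remote_be)
{
    if (index >= MAXVIFS)
        return -1;   /* the kernel would answer such a request with -ENFILE */
    memset(v, 0, sizeof(*v));
    v->vifc_vifi            = index;
    v->vifc_flags           = VIFF_TUNNEL;
    v->vifc_threshold       = 1;
    v->vifc_lcl_addr.s_addr = local_be;
    v->vifc_rmt_addr.s_addr = remote_be;
    return 0;
}

/* The handshake with the kernel would then look like this:
 *
 *   int s = socket(AF_INET, SOCK_RAW, IPPROTO_IGMP);
 *   int one = 1;
 *   setsockopt(s, IPPROTO_IP, MRT_INIT, &one, sizeof(one));
 *   struct vifctl_sketch vc;
 *   make_tunnel_vif(&vc, 0, local, remote);
 *   setsockopt(s, IPPROTO_IP, MRT_ADD_VIF, &vc, sizeof(vc));
 */
```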
The options available to change the multicast routing table are handled by the function ip_mroute_setsockopt() (defined in net/ipv4/ipmr.c). MRT_INIT has to be the first option or command sent to the socket. All other commands (MRT_DONE, MRT_ADD_VIF, MRT_DEL_VIF, MRT_ADD_MFC, MRT_DEL_MFC, and MRT_ASSERT) are ignored and return the result -EACCES if the socket was not previously reserved by MRT_INIT. The following subsections describe the most important functions and structures available in the Linux kernel for tasks handled by the multicast routing daemon. The functions are defined primarily in the file net/ipv4/ipmr.c; the structures are declared in include/linux/mroute.h. Subsequently, we use a few examples to show how the daemon and the kernel exchange data by means of ioctl() commands and socket options.

Kernel Functions Used by mrouted

ip_mroute_setsockopt() | net/ipv4/ipmr.c |
ip_mroute_setsockopt(sk, opt_name, opt_val, opt_len) accepts commands from the mrouted daemon and processes them, including, for example, the creation and deletion of virtual network devices (VIFs) and of entries in the multicast routing table (MFC). ip_mroute_setsockopt() is invoked by ip_setsockopt() in net/ipv4/ip_sockglue.c, because these are socket options. In addition to the socket sk and the actual command, opt_name, this function can accept other parameters: additional information (opt_val), if required for the command, and the length of this information (opt_len). In summary, this function can accept the following commands and additional information:

MRT_INIT: This command sets the variable mroute_socket. If mroute_socket already has a value unequal to NULL, then another instance of mrouted (or another multicast routing daemon) is running, and an -EADDRINUSE error is returned. Otherwise, mroute_socket obtains a pointer to the socket used to send MRT_INIT. This command has to be the first command for mrouted, because all other commands check whether they originate from the socket specified in mroute_socket; if not, -EACCES is returned, and the command is not executed. The socket itself has to have been created with the IPPROTO_IGMP protocol; otherwise, an -EOPNOTSUPP error is returned. opt_val has to point to an integer with the value 1; otherwise, the result will again be an error (this time, -ENOPROTOOPT).

MRT_ADD_VIF: This command instructs the kernel to create a new virtual network device (VIF). The additional information this command takes is a pointer to a vifctl structure (see Section 17.5.2) in opt_val, which is passed to the kernel because it includes all required parameters. These parameters include the index of the virtual network device, the address of the physical network device, and tunnel information.
Another important item is the flag that states whether this virtual network device is a tunnel (VIFF_TUNNEL). If it is, then the kernel uses the function ipmr_new_tunnel() to create a new tunnel; otherwise, ip_dev_find() is used to search for the corresponding network device. If MRT_ADD_VIF is sent before MRT_INIT has initialized the socket, then -EACCES is returned; this is also the case when MRT_ADD_VIF is invoked over another socket. Other errors that can occur are -EINVAL (opt_val is an invalid structure), -EFAULT (the structure cannot be copied from the user address space), -ENFILE (the VIF index is outside the valid range), and -EADDRINUSE (the virtual network device already exists). When a new virtual network device is created, a new vif_device structure is added to the list of virtual network devices (vif_table), and the information relevant for this entry is taken from the vifctl structure.

MRT_DEL_VIF: This command is used to remove the entry for a virtual network device from the list of virtual network devices (vif_table). The parameters used here are identical to those used with MRT_ADD_VIF; however, only the VIF index is evaluated.

MRT_ADD_MFC: This command instructs the kernel to add a new entry to its multicast routing table (multicast forwarding cache) or, if an entry matching the specified search key exists, to update this entry. For this purpose, an mfcctl structure (see below) is additionally passed in opt_val. Once the validity check of the structure (error: -EINVAL) has been completed, ipmr_cache_find() (see below) checks the MFC cache to see whether a corresponding entry exists. If so, this entry is adapted to the new values; otherwise, a new entry is created. When a new cache entry is created, an mfc_cache structure is allocated at the same time, and all relevant information is taken over from the mfcctl structure.
In this process, entries in the field mfc_ttls that have the value 0 are mapped to 255. The reason is that it is then sufficient to test whether a value is smaller than 255 to see whether a packet is to be routed to a VIF.

MRT_DEL_MFC: This command instructs the kernel to remove an entry from the MFC. The kernel uses the function ipmr_mfc_delete() to delete the entry.

ip_mroute_getsockopt() | net/ipv4/ipmr.c |
The function ip_mroute_getsockopt(sk, opt_name, opt_val, opt_len) is normally used to poll the version number (0x0305) of the multicast routing implementation. For this purpose, MRT_VERSION should be specified in opt_name; the version number is returned in opt_val. ip_mroute_getsockopt() is invoked by ip_getsockopt() in net/ipv4/ip_sockglue.c. We can use ioctl() queries to poll additional status information for multicast forwarding.

ipmr_ioctl() | net/ipv4/ipmr.c |
ipmr_ioctl(sk, cmd, arg) can be used to poll various status information. If cmd has the value SIOCGETVIFCNT, then a pointer to a sioc_vif_req structure (see below) in arg is expected. It includes the index of the virtual network device for which information is to be queried (e.g., how many packets, and what data volume, have been received and sent over this device). The result is written to the passed sioc_vif_req structure. If cmd is equal to SIOCGETSGCNT, then a pointer to a sioc_sg_req structure in arg is expected, and the corresponding information about the matching entry in the multicast forwarding cache is determined. ipmr_ioctl() is entered in inet_create() (in net/ipv4/af_inet.c) as the ioctl() handling routine in the proto structure for raw sockets.

ip_mr_init() | net/ipv4/ipmr.c |
ip_mr_init(void) initializes the multicast routing functions in the kernel; it is invoked when inet_proto_init() (in net/ipv4/af_inet.c) starts. ipmr_get_route() | net/ipv4/ipmr.c |
ipmr_get_route(skb, rtm, nowait) determines the route for a packet from a specific source to a specific group. This function is used for informative purposes only: the result is packed into a packet and sent to other routers (e.g., so that these routers can exchange routing information). The route for the actual forwarding of multicast packets is determined directly in ip_mr_forward().

ipmr_cache_find() | net/ipv4/ipmr.c |
ipmr_cache_find(origin, mcastgrp) searches the multicast forwarding cache (see Section 17.4.2) for a specific entry. The packet's source address and the multicast group address serve as search keys. If an MFC entry is found, it includes all virtual network devices (VIFs) that can be used to forward the packet. The result is returned in the form of a pointer to an mfc_cache structure. ipmr_new_tunnel() | net/ipv4/ipmr.c |
ipmr_new_tunnel(v) is responsible for creating a new tunnel. All information required for this purpose is passed in the vifctl structure (v). The function gets the tunnel network device, tunl0, and tries to create a tunnel to the destination specified in v. If it is successful, then the new virtual network device is returned; otherwise, NULL is returned.

ipmr_cache_unresolved() | net/ipv4/ipmr.c |
ipmr_cache_unresolved(cache, vifi, skb) is invoked by ip_mr_input() and ipmr_get_route() if an entry polled from the multicast forwarding cache either does not exist or has been requested from mrouted but not yet filled in. ipmr_cache_unresolved() creates a new entry in the multicast forwarding cache and sets its status to MFC_QUEUED, which means that the route for this entry has not yet been entered by mrouted. Subsequently, a timer is activated, and ipmr_cache_report() is invoked to ask mrouted for the required route. The timer causes the entry to be deleted from the cache after a certain time if mrouted cannot determine the route before the timer expires.

ipmr_cache_report() | net/ipv4/ipmr.c |
ipmr_cache_report(pkt, vifi, assert) asks mrouted to determine the route for a specific packet (i.e., for its origin and multicast group) and to create a corresponding entry in the multicast forwarding cache (MFC). For this purpose, a packet is created, and sock_queue_rcv_skb() is used to send it to mrouted.

ipmr_cache_timer() | net/ipv4/ipmr.c |
ipmr_cache_timer(data) deletes an entry from the MFC, if the mrouted daemon was requested to determine the route, and if it was unable to do this within a specific time. ipmr_cache_timer() is the timer-handling routine for mfc_timer defined in the mfc_cache structure. ipmr_cache_alloc() | net/ipv4/ipmr.c |
ipmr_cache_alloc(priority) creates a new mfc_cache structure and adds a few initial values, including the timer data and handling routine and information stating that the route does not exist yet. ipmr_cache_alloc() does not write the created structure to the multicast forwarding cache; ipmr_cache_insert() has to be invoked separately for that purpose. ipmr_mfc_modify() | net/ipv4/ipmr.c |
ipmr_mfc_modify(action, mfc) is invoked when the mrouted daemon uses MRT_ADD_MFC or MRT_DEL_MFC to manipulate an MFC entry over setsockopt(). In addition to performing this action, the function checks whether this is a new entry or an entry that already exists and merely has to be filled in. If the latter is the case, and if the MFC entry has the MFC_QUEUED status, then ipmr_cache_resolve() is invoked to send waiting packets.

ipmr_cache_resolve() | net/ipv4/ipmr.c |
ipmr_cache_resolve(cache) is invoked by ipmr_mfc_modify() when the route is set in an MFC entry and packets are waiting to be sent. The timer (see above) is deleted, and ip_mr_forward() sends the waiting packets.

Data Exchange Between the mrouted Daemon and the Kernel

This section explains the most important structures exchanged between the Linux kernel and the mrouted daemon. We use the example of querying a virtual network device and an MFC entry to explain the procedure. Initially, the mrouted daemon allocates memory for a sioc_vif_req structure and writes the index of the desired virtual network device to the vifi field. It then calls the ioctl() command SIOCGETVIFCNT, passing the address of this structure as the argument, which causes the appropriate part of ipmr_ioctl() to be executed. There, it is checked whether the vifi index points to a valid virtual network device. If it does, then the referenced structure is filled with the appropriate values and copied back to the user address space by copy_to_user(). The structure used to pass the data of a virtual network device is as follows:

struct sioc_vif_req | include/linux/mroute.h |
struct sioc_vif_req {
    vifi_t        vifi;     /* Which iface */
    unsigned long icount;   /* In packets */
    unsigned long ocount;   /* Out packets */
    unsigned long ibytes;   /* In bytes */
    unsigned long obytes;   /* Out bytes */
};

vifi: This is the index of the virtual network device in the vif_table: the VIF about which information is requested.

icount, ocount: This is the number of packets received or sent, respectively, over this VIF.

ibytes, obytes: This is the sum of bytes included in the packets received or sent, respectively.

The procedure involved in querying an entry in the multicast forwarding cache is identical, except that the sioc_sg_req structure and the ioctl() command SIOCGETSGCNT are used:

struct sioc_sg_req | include/linux/mroute.h |
struct sioc_sg_req {
    struct in_addr src;
    struct in_addr grp;
    unsigned long  pktcnt;
    unsigned long  bytecnt;
    unsigned long  wrong_if;
};

The elements of the sioc_sg_req structure handle the following tasks:

src: This is the sender address; it is the first part of the key for the MFC entry.

grp: This is the multicast group address; it is the second part of the key for the MFC entry.

pktcnt: This is the number of packets sent over the desired MFC entry.

bytecnt: This is the number of bytes forwarded over the MFC entry.

Because multicast packets can be received and sent both over physical network adapters and through tunnels, we use an abstraction: virtual network devices (VIFs). The mrouted daemon passes vifctl structures to the kernel. These structures define virtual network devices, including whether a VIF is a physical network device or a tunnel. The kernel's multicast routing table is manipulated by means of mfcctl structures, which represent the routing entries (multicast forwarding cache entries, MFCs) that tell the kernel how to forward multicast packets.

struct vifctl | include/linux/mroute.h |
struct vifctl {
    vifi_t         vifc_vifi;        /* Index of VIF */
    unsigned char  vifc_flags;       /* VIFF_ flags */
    unsigned char  vifc_threshold;   /* ttl limit */
    unsigned int   vifc_rate_limit;  /* Rate limiter values (NI) */
    struct in_addr vifc_lcl_addr;    /* Our address */
    struct in_addr vifc_rmt_addr;    /* IPIP tunnel addr */
};

vifc_vifi: This is the index of this virtual network device in the array vif_table, which stores all VIFs (0 <= vifc_vifi < MAXVIFS).

vifc_flags: This field can be used to set options (i.e., VIFF_TUNNEL, if you want to use a tunnel).

vifc_lcl_addr: This is the local address of the network device. For a tunnel, this address represents the virtual entry point into the tunnel.

vifc_rmt_addr: This field includes the destination address of the tunnel, if the VIF is a tunnel.

struct mfcctl | include/linux/mroute.h |
struct mfcctl {
    struct in_addr mfcc_origin;          /* Origin of mcast */
    struct in_addr mfcc_mcastgrp;        /* Group in question */
    vifi_t         mfcc_parent;          /* Where it arrived */
    unsigned char  mfcc_ttls[MAXVIFS];   /* Where it is going */
    unsigned int   mfcc_pkt_cnt;         /* pkt count for src-grp */
    unsigned int   mfcc_byte_cnt;
    unsigned int   mfcc_wrong_if;
    int            mfcc_expire;
};

mfcc_origin: This is the packet's source address (i.e., the address of the computer that originally sent the packet).

mfcc_mcastgrp: This identifies the multicast group that should receive the packet.

mfcc_parent: This identifies the VIF that received the packet.

mfcc_ttls: This field specifies the VIFs a packet should be routed to. The field includes one entry for each potential VIF. A value of 0 or 255 means that the VIF with the corresponding index in the vif_table is not interested in the packet.

17.5.3 The DVMRP Routing Algorithm

The Distance Vector Multicast Routing Protocol (DVMRP) is the oldest multicast routing protocol; it was defined initially in [WaPD88] and extended later [Pusa00]. DVMRP was the multicast routing protocol used as the basis to build MBone, and it has since remained the most popular multicast routing protocol. DVMRP uses a distance-vector routing algorithm, which determines the shortest path to the sender. This means that it extends the principles of the RIP distance-vector unicast routing protocol [Malk98] with multicasting capabilities. DVMRP supports both physical network devices and tunnels as potential routes for forwarding multicast packets, mainly because these capabilities were required when MBone was introduced. This section briefly explains the approach of DVMRP, including a practical example.

How DVMRP Works

To build a distribution tree, DVMRP in its current version uses a principle called Reverse Path Multicasting. When a multicast router receives a multicast packet over a network adapter, it determines the shortest path back to the sender.
If this route leads over the network device that received the packet, then the packet is forwarded to all neighboring multicast routers, except over the interface that received it. Otherwise, the multicast packet is discarded, because it did not arrive on an optimal route, which means that it is assumed not to originate from the direct path to the sender; more specifically, it is assumed to be a duplicate that might previously have been received. A multicast router is not able to see whether it has already received a packet or the packet is arriving for the first time; this is the reason why packets are accepted only if it can be assumed that they come directly from the sender. This approach avoids the large number of duplicates that the routing principle called flooding would create. Figure 17-15 shows how DVMRP works.

Figure 17-15. Schematic representation of how the Distance Vector Multicast Routing Protocol (DVMRP) works.

A unicast routing protocol is used to determine the shortest path back to the sender. For this purpose, the mrouted daemon does not use the kernel's unicast routing table, but instead builds separate tables. The routing information stored in these tables is then exchanged between the DVMRP routers in the network. Though the Reverse Path Multicasting principle enables multicast packets to be distributed across the entire network without creating duplicates, it does not consider whether a specific subtree of the multicast routing tree is interested in the packets of a group. Packets are simply distributed, which means that they load the network with undesired traffic. This is the reason why the method was extended by pruning: When the subnetwork of a router does not want to receive data for a specific multicast group, the router can return a prune message to the next-higher multicast router in the multicast routing tree.
If this router has no interested computers for that group either, it can itself send a prune message to the next-higher router. This method prevents an excessive number of packets from being forwarded into networks where there are no receivers. In addition, graft messages can be used to include a router or subnetwork in the distribution tree again dynamically. So that a router does not have to keep the prune state for each network device and each group indefinitely, a timeout mechanism is used: as soon as the time interval expires, the state is discarded, and the subnetwork is included in the multicast tree again. DVMRP belongs to the class of dense-mode routing protocols. Because of the initial flooding, this method works best in scenarios where the group members are geographically close in the network infrastructure, so that flooding does not consume excessive bandwidth before pruning has built the multicast routing tree. Other multicast routing protocols are Multicast OSPF and Protocol Independent Multicast (PIM) in sparse mode (PIM-SM) or dense mode (PIM-DM).
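The reverse-path check and the pruning state described above can be condensed into a few lines. The following toy function is our own simplification: interfaces are numbered 0..NUM_IFACES-1, and rpf_iface_for() stands in for the lookup in mrouted's separate unicast routing tables. It decides whether an incoming multicast packet is dropped or to which interfaces it is copied.

```c
#include <stdint.h>

#define NUM_IFACES 4

/* Illustrative stand-in for the lookup in the daemon's own unicast
 * routing tables: the interface on the shortest path back to src. */
static int rpf_iface_for(uint32_t src)
{
    return (int)(src % NUM_IFACES);
}

/* Reverse-path check plus pruning: returns a bitmask of the interfaces
 * the packet should be copied to, or 0 if it must be discarded.
 * `pruned` is a bitmask of downstream interfaces that sent a prune
 * message for this group. */
unsigned rpf_forward_mask(uint32_t src, int arrival_iface, unsigned pruned)
{
    /* Packet did not arrive over the shortest path back to the sender:
     * assume it is a duplicate and drop it. */
    if (arrival_iface != rpf_iface_for(src))
        return 0;

    unsigned all = (1u << NUM_IFACES) - 1;
    /* Forward everywhere except the arrival interface and pruned subtrees. */
    return all & ~(1u << (unsigned)arrival_iface) & ~pruned;
}
```

A graft message would simply clear the corresponding bit in `pruned` again, and the timeout mechanism does the same when the prune state expires.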