26.3 Protocol-Specific Sockets

The central structure of all protocol-specific sockets underneath the BSD sockets is struct sock. This structure was oriented to TCP/UDP and IP in earlier kernel versions. Along with the addition of other protocols (e.g., ATM), the sock structure was extended, and other entries were partially removed from the structure. Initially, this created an extremely unclear construction having a number of entries needed for only a few protocols. Together with the introduction of the three union structures (net_pinfo, tp_pinfo, and protinfo), each of which contains a reference to protocol options of the matching layer, this situation should gradually improve, and the structure should become easier to understand. The structure is still rather extensive, but we will introduce only the entries of interest in the following sections.

26.3.1 `PF_INET` Sockets

This section describes the initialization on the level of PF_INET sockets when an application uses a socket() call.

`inet_create()`	net/ipv4/af_inet.c

As has been mentioned, this function is invoked by the function sock_create() to initialize the sock structure. Initially, the state of the BSD socket is set to SS_UNCONNECTED, and then the function sk_alloc(), which was described in Chapter 4, allocates a sock structure. The protocol family can only be PF_INET at this point, but we still have to distinguish by type and protocol. This is now done by comparing against the information in a list, struct inet_protosw inetsw_array, which is created by inet_register_protosw() (net/ipv4/af_inet.c) when the kernel starts.
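The lookup by type and protocol can be pictured with a small user-space model. This is only a sketch under stated assumptions: `demo_protosw`, `demo_inetsw`, and `demo_lookup()` are hypothetical names, and the wildcard handling (protocol 0 selects the default entry of a type, and the raw entry accepts any protocol) is a simplification rather than the kernel's actual code.

```c
#include <stddef.h>
#include <sys/socket.h>   /* SOCK_STREAM, SOCK_DGRAM, SOCK_RAW */
#include <netinet/in.h>   /* IPPROTO_TCP, IPPROTO_UDP, IPPROTO_IP */

/* Hypothetical stand-in for struct inet_protosw: one row per (type, protocol). */
struct demo_protosw {
    int type;
    int protocol;
    const char *prot_name;
};

/* Hypothetical table, loosely modeled on what inetsw_array might hold. */
static const struct demo_protosw demo_inetsw[] = {
    { SOCK_STREAM, IPPROTO_TCP, "TCP" },
    { SOCK_DGRAM,  IPPROTO_UDP, "UDP" },
    { SOCK_RAW,    IPPROTO_IP,  "RAW" },
};

/*
 * Return the entry matching the socket type and protocol, or NULL.
 * Assumption of this sketch: IPPROTO_IP (0) acts as a wildcard on either
 * side, so socket(PF_INET, SOCK_STREAM, 0) resolves to the TCP entry and
 * the raw entry accepts every protocol number.
 */
const struct demo_protosw *demo_lookup(int type, int protocol)
{
    size_t n = sizeof(demo_inetsw) / sizeof(demo_inetsw[0]);

    for (size_t i = 0; i < n; i++) {
        const struct demo_protosw *e = &demo_inetsw[i];

        if (e->type != type)
            continue;
        if (protocol == e->protocol ||
            protocol == IPPROTO_IP ||
            e->protocol == IPPROTO_IP)
            return e;
    }
    return NULL;   /* no transport protocol registered for this combination */
}
```

A call such as `demo_lookup(SOCK_STREAM, IPPROTO_UDP)` returns NULL in this model, which corresponds to a socket() call that fails because the type and protocol do not fit together.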

Next, the ops field of the BSD socket structure can be filled. The sk pointer connects the BSD socket to the new sock structure, and the latter is connected to the BSD socket by the socket pointer. (See sock_init_data() in net/core/sock.c.) Similarly to the proto_ops structure, the proto structure supplies the functions of the lower-layer protocols:

 struct proto {
         void          (*close)        (...);
         int           (*connect)      (...);
         int           (*disconnect)   (...);
         struct sock*  (*accept)       (...);
         int           (*ioctl)        (...);
         int           (*init)         (...);
         int           (*destroy)      (...);
         void          (*shutdown)     (...);
         int           (*setsockopt)   (...);
         int           (*getsockopt)   (...);
         int           (*sendmsg)      (...);
         int           (*recvmsg)      (...);
         int           (*bind)         (...);
         int           (*backlog_rcv)  (...);
         void          (*hash)         (...);
         void          (*unhash)       (...);
         int           (*get_port)     (...);
         char          name[32];
         struct {
                 int inuse;
                 u8  __pad[SMP_CACHE_BYTES - sizeof(int)];
         } stats[NR_CPUS];
 };

Consequently, the proto structure represents the interface from the socket layer to the transport protocols. The hash() and unhash() functions serve to position or find a sock structure in a hash table.

Finally, inet_create() invokes the init() function for the identified protocol from the proto structure, if it exists. For example, this would be the function tcp_v4_init_sock() (net/ipv4/tcp_ipv4.c) in the case of TCP. A similar function exists in net/ipv4/raw.c for direct access to IP (SOCK_RAW). Within these init() functions, the other protocol functions are made available through the sock structure.

In TCP, the structure tcp_func (defined in include/net/tcp.h) is filled with TCP-specific functions in net/ipv4/tcp_ipv4.c and made available to the sock structure over the entry tp_pinfo.af_tcp.af_specific.

The transmit function of TCP (tcp_transmit_skb(); see Chapter 24) accesses the transmit function ip_build_xmit() (described in Chapter 14) available from the IP layer (network layer). However, this approach is not uniform: UDP, for example, does not implement an init() function, but accesses ip_build_xmit() directly.
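The optional init() step can be sketched as a call through a function pointer that may be absent. The model below is hypothetical throughout (`demo_sock`, `demo_proto`, `demo_create()` and the two protocol objects are invented for illustration); it only mirrors the idea that a protocol with an init() function gets it called at socket creation, while a protocol without one is skipped.

```c
#include <stddef.h>

/* Hypothetical, heavily reduced stand-in for struct sock. */
struct demo_sock {
    int initialized;   /* set by a protocol-specific init function */
};

/* Hypothetical stand-in for struct proto with only the init entry. */
struct demo_proto {
    const char *name;
    int (*init)(struct demo_sock *sk);   /* may be NULL */
};

/* Plays the role that tcp_v4_init_sock() has for TCP. */
static int demo_tcp_init(struct demo_sock *sk)
{
    sk->initialized = 1;
    return 0;
}

const struct demo_proto demo_tcp = { "TCP", demo_tcp_init };
const struct demo_proto demo_udp = { "UDP", NULL };   /* no init() function */

/* Call the protocol's init() function only if the protocol provides one. */
int demo_create(struct demo_sock *sk, const struct demo_proto *prot)
{
    sk->initialized = 0;
    if (prot->init)
        return prot->init(sk);
    return 0;
}
```

The NULL check is the whole point of the sketch: the socket layer does not need to know which protocols bring their own initialization.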

26.3.2 `PF_PACKET` Sockets

The PF_PACKET sockets represent a type of socket created to allow applications to access a network device directly. The basic idea was to let an application state a packet type at a PF_PACKET socket (e.g., PPPoE connection-management packets with the type ETH_P_PPP_DISC). This means that all incoming packets of this type are delivered directly to this socket; and, vice versa, all packets sent over this socket are sent directly over the specified network device. Consequently, no protocol processing occurs within the kernel, so you can implement network protocols in the user space. At the socket interface, the application sets the family field to PF_PACKET. Formerly, PF_INET had to be selected, and the type SOCK_PACKET had to be specified.
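From the application's side, stating a packet type at a PF_PACKET socket might look roughly like the following sketch. It uses the user-space interface of current Linux systems (struct sockaddr_ll from linux/if_packet.h), which may differ in detail from the kernel version discussed here; the helper names `demo_packet_addr()` and `demo_open_packet_socket()` are hypothetical. Opening such a socket normally requires elevated privileges.

```c
#include <string.h>
#include <unistd.h>            /* close */
#include <arpa/inet.h>         /* htons */
#include <sys/socket.h>        /* socket, bind, PF_PACKET, SOCK_RAW */
#include <linux/if_packet.h>   /* struct sockaddr_ll */
#include <linux/if_ether.h>    /* ETH_P_PPP_DISC */

/* Build the link-layer address used to bind a packet socket to one device. */
struct sockaddr_ll demo_packet_addr(int ifindex, unsigned short ethertype)
{
    struct sockaddr_ll sll;

    memset(&sll, 0, sizeof(sll));
    sll.sll_family   = AF_PACKET;
    sll.sll_protocol = htons(ethertype);   /* packet type in network byte order */
    sll.sll_ifindex  = ifindex;            /* index of the network device */
    return sll;
}

/*
 * Open a packet socket for one packet type and bind it to one device.
 * Returns the descriptor, or -1 on failure (for example, missing privileges).
 */
int demo_open_packet_socket(int ifindex, unsigned short ethertype)
{
    struct sockaddr_ll sll = demo_packet_addr(ifindex, ethertype);
    int fd = socket(PF_PACKET, SOCK_RAW, htons(ethertype));

    if (fd < 0)
        return -1;
    if (bind(fd, (struct sockaddr *)&sll, sizeof(sll)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

A PPPoE daemon in user space could call `demo_open_packet_socket(ifindex, ETH_P_PPP_DISC)` and then read discovery frames from the descriptor with ordinary receive calls.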

As with PF_INET sockets, a create() function is exported when the PF_PACKET support starts; in this case, it is the function packet_create() (net/packet/af_packet.c). This function registers the functions of the PF_PACKET socket with the BSD socket. In addition, the functions register_netdevice_notifier() and packet_notifier() are used to invoke the function dev_add_pack() (net/core/dev.c) for all packet types registered with the PF_PACKET socket. This means that a handler is installed in the specified network device for all incoming packets of the desired packet type. This handler forwards the packets to the packet_rcv() function for processing, as was described in Chapter 5.

In the following, we will briefly introduce the transmit and receive functions.

`packet_sendmsg()`	net/packet/af_packet.c

Pointers to the sock structure and the message to be sent are passed to this function. Next, the network device that should be used for this transmission has to be selected. This can be done by using the source address specified at the socket, unless the device had already been added to the field protinfo.af_packet->ifindex of the sock structure as a consequence of a previous bind() call. Subsequently, the message is copied to an skb and directly sent to the network device by the function dev_queue_xmit(), without using protocols underneath the PF_PACKET socket.

`packet_rcv()`	net/packet/af_packet.c

When data of the registered packet type is received, the network device passes a pointer to the sk_buff that contains the receive data to the packet_rcv() function. The next step fills other fields of the sk_buff structure. Subsequently, the function __skb_queue_tail() inserts sk_buff in the receive queue of the associated sock structure. Next, the process waiting for a packet at the socket has to be notified. This is done by the function pointer data_ready() of the sock structure, which was bound to the function sock_def_readable() when the structure was initialized.

`sock_def_readable()`	net/core/sock.c

This function informs a process that data have been received. For this purpose, two cases have to be distinguished: If the application is in blocking wait at the socket, wake_up_interruptible() is used to add the relevant process to the list of executable processes. In case of nonblocking wait, the field of the descriptor that probes the relevant process has to be converted. This is done by the function sk_wake_async() (include/net/sock.h), which accesses the fasync_list of the BSD socket, as mentioned earlier.

26.3.3 `PF_NETLINK` Sockets

PF_NETLINK sockets are used to exchange data between parts of the kernel and the user space. The protocol family to be stated here is PF_NETLINK, and the type is SOCK_RAW or SOCK_DGRAM. The protocol can be selected from among a number of options (see include/linux/netlink.h): NETLINK_ROUTE (interface to the routing and the network), NETLINK_USERSOCK (reserved for protocols that run in the user space), NETLINK_FIREWALL (interface to the firewall functionality), NETLINK_ARPD (management of the ARP table), NETLINK_ROUTE6 (IPv6 routing), NETLINK_IP6_FW (IPv6 firewall), and others.

A header containing information about the length and content of the message is prepended to the data to be sent from the user space to the kernel. The format of this header is defined as struct nlmsghdr (include/linux/netlink.h):

 struct nlmsghdr {
         __u32        nlmsg_len;     /* Length of message including header */
         __u16        nlmsg_type;    /* Message content */
         __u16        nlmsg_flags;   /* Additional flags */
         __u32        nlmsg_seq;     /* Sequence number */
         __u32        nlmsg_pid;     /* Sending process PID */
 };

The field nlmsg_type can take any of three values: NLMSG_NOOP, NLMSG_ERROR, and NLMSG_DONE. The valid values for nlmsg_flags are likewise defined in include/linux/netlink.h.

This general Netlink interface is now used for communication between the user space and different parts of the kernel. For further adaptation to the respective application purpose, additional values for nlmsg_type were defined (e.g., for NETLINK_ROUTE6 in include/linux/ipv6_route.h and for NETLINK_ROUTE in include/linux/rtnetlink.h).
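To make the header layout concrete, the following sketch fills in a struct nlmsghdr for an RT Netlink request in user space. It relies on the headers shipped with current Linux systems (linux/netlink.h and linux/rtnetlink.h); the choice of RTM_GETLINK as message type and the names `demo_request` and `demo_build_getlink()` are assumptions made for illustration only.

```c
#include <string.h>
#include <sys/socket.h>        /* AF_UNSPEC */
#include <linux/netlink.h>     /* struct nlmsghdr, NLMSG_LENGTH, NLM_F_* */
#include <linux/rtnetlink.h>   /* RTM_GETLINK, struct rtgenmsg */

/* Hypothetical request layout: Netlink header followed by a small payload. */
struct demo_request {
    struct nlmsghdr hdr;
    struct rtgenmsg gen;
};

/* Fill in a dump request; the message type comes from rtnetlink.h. */
struct demo_request demo_build_getlink(unsigned int seq)
{
    struct demo_request req;

    memset(&req, 0, sizeof(req));
    /* nlmsg_len covers the header plus the payload that follows it. */
    req.hdr.nlmsg_len   = NLMSG_LENGTH(sizeof(req.gen));
    req.hdr.nlmsg_type  = RTM_GETLINK;
    req.hdr.nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP;
    req.hdr.nlmsg_seq   = seq;   /* lets the sender match replies to requests */
    req.hdr.nlmsg_pid   = 0;     /* left at 0 in this sketch */
    req.gen.rtgen_family = AF_UNSPEC;
    return req;
}
```

An application would send such a request over a socket opened with socket(PF_NETLINK, SOCK_RAW, NETLINK_ROUTE); that part is omitted to keep the example focused on the header fields.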

In addition to the PF_NETLINK sockets, there is another Netlink interface over a character device (see net/netlink/netlink_dev.c), but it is still included for reasons of backward compatibility only. We will mainly discuss the so-called RT Netlink interface below. RT Netlink sockets are actually PF_NETLINK sockets with the NETLINK_ROUTE protocol option set. These are currently the most interesting for network implementation, so we will continue describing the PF_NETLINK sockets with emphasis on this type.

PF_NETLINK sockets reside underneath the BSD sockets, where they register themselves exactly as do PF_INET sockets. For registering PF_NETLINK sockets over the function sock_register(&netlink_family_ops), the function netlink_create() (both in net/netlink/af_netlink.c) is first registered as create() function with the BSD socket. The latter function handles all initializations required and registers the proto_ops structure.

Now, when data is sent to the kernel over a PF_NETLINK socket, the function netlink_sendmsg() (net/netlink/af_netlink.c) invokes the function netlink_unicast(). The vector struct sock *nl_table[MAX_LINKS] serves to manage sock structures that use a common protocol (e.g., NETLINK_ROUTE). sock structures having the same protocol are distinguished by the process ID of the requesting application, which is copied into the protinfo.af_netlink->pid field of the sock structure.

`netlink_unicast()`	net/netlink/af_netlink.c

This function uses the function netlink_lookup() to access the sock structure that contains the desired value in its protocol field (e.g., NETLINK_ROUTE) and belongs to the requesting process. Subsequently, it has to wait until the field protinfo.af_netlink->state is cleared (access synchronization). Next, the sk_buff structure is placed in the receive queue of the sock structure, and, finally, sk->data_ready() is used to access the respective input() function.

In the case of RT Netlink sockets, the function rtnetlink_rcv() is invoked here. This function was loaded by rtnetlink_init() (net/core/rtnetlink.c) when the system started: rtnetlink_init() uses rtnetlink_kernel_create() to set the function pointer data_ready() of the sock structure for NETLINK_ROUTE to the function netlink_data_ready() (net/netlink/af_netlink.c) and the entry protinfo.af_netlink->data_ready() to the function rtnetlink_rcv() (net/core/rtnetlink.c). At first sight, the naming convention of this function is confusing, because it is used for transmission, from the socket's view, but it serves to receive data from the user space, from the kernel's view.

On the basis of the values defined for nlmsg_type in include/linux/rtnetlink.h, the elements to be addressed within the kernel are now selected. Once again, we distinguish by family and type. type is identical to the type specified at the BSD socket (as mentioned above, RT_NETLINK defines additional types, which are exported from include/linux/rtnetlink.h); the RT_NETLINK support defines new, peculiar message formats and allocates them to a family (e.g., rtm_family to manage the routing table, tcm_family for traffic control messages). Each of these families defines its own structure, which determines the format of messages. These RT_NETLINK messages are within the data range of a PF_NETLINK message, which means that they are transparent for a PF_NETLINK socket.

A vector, static struct rtnetlink_link link_rtnetlink_table[RTM_MAX-RTM_BASE+1], is used to enable kernel elements, such as the traffic control element, to allocate functions to messages of their families. The structure rtnetlink_link (include/linux/rtnetlink.h) has two entries:

 int (*doit)   (struct sk_buff*, struct nlmsghdr*, void *attr);
 int (*dumpit) (struct sk_buff*, struct netlink_callback *cb);

An appropriate function can be allocated to each function pointer. The function doit() passes a command; the function dumpit() is used to read an existing configuration. For example, the following allocation is done for traffic control (net/sched/sch_api.c), among other things:

 link_p[RTM_GETQDISC-RTM_BASE].doit   = tc_get_qdisc;
 link_p[RTM_GETQDISC-RTM_BASE].dumpit = tc_dump_qdisc;

As was mentioned earlier, the function rtnetlink_rcv() is used as an input() function. It invokes the function rtnetlink_rcv_skb() for each skb. The latter function uses the function rtnetlink_rcv_msg() (all in net/core/rtnetlink.c) to discover the type of a Netlink message and the family of the RT Netlink message and eventually invokes either doit() or dumpit().
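The table-driven dispatch to doit() or dumpit() can be modeled in a few lines of user-space C. Everything in this sketch is hypothetical: the message-type numbers, the reduced handler signatures, and the names `demo_link`, `demo_register()`, and `demo_rcv_msg()` are invented to show the indexing scheme, not taken from the kernel.

```c
#include <stddef.h>

/* Hypothetical message-type range; the real values live in rtnetlink.h. */
#define DEMO_BASE 16
#define DEMO_MAX  19
enum { DEMO_NEWQDISC = DEMO_BASE, DEMO_DELQDISC, DEMO_GETQDISC };

/* Reduced stand-in for struct rtnetlink_link. */
struct demo_link {
    int (*doit)(int msg_type);     /* passes a command */
    int (*dumpit)(int msg_type);   /* reads an existing configuration */
};

/* One slot per message type, indexed by (type - base). */
static struct demo_link demo_table[DEMO_MAX - DEMO_BASE + 1];

static int demo_get_qdisc(int msg_type)  { (void)msg_type; return 1; }
static int demo_dump_qdisc(int msg_type) { (void)msg_type; return 2; }

/* A kernel element registers its handlers for the types it understands. */
void demo_register(void)
{
    demo_table[DEMO_GETQDISC - DEMO_BASE].doit   = demo_get_qdisc;
    demo_table[DEMO_GETQDISC - DEMO_BASE].dumpit = demo_dump_qdisc;
}

/* Select the handler for a message; -1 means no handler is registered. */
int demo_rcv_msg(int msg_type, int is_dump)
{
    const struct demo_link *link;

    if (msg_type < DEMO_BASE || msg_type > DEMO_MAX)
        return -1;
    link = &demo_table[msg_type - DEMO_BASE];
    if (is_dump)
        return link->dumpit ? link->dumpit(msg_type) : -1;
    return link->doit ? link->doit(msg_type) : -1;
}
```

Indexing by message type minus base keeps the lookup constant-time and lets each kernel element fill in only the slots for its own message types.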
