Socket Addressing

	Network Programming with Perl, by Lincoln D. Stein
	Chapter 3. Introduction to Berkeley Sockets

In order for one process to talk to another, each of them has to know the other's address. Each networking domain has a different concept of what an address is. For the UNIX domain, which can be used only between two processes on the same host machine, addresses are simply paths on the host's filesystem, such as /usr/tmp/log . For the Internet domain, each socket address has three parts : the IP address, the port, and the protocol.

IP Addresses

In the currently deployed version of TCP/IP, IPv4, the IP address is a 32-bit number used to identify a network interface on the host machine. A series of subnetworks and routing tables enables any machine on the Internet to send packets to any other machine based on its IP address.

For readability, the four bytes of a machine's IP address are usually spelled out as a series of four decimal numbers separated by dots to create the "dotted quad address" that network administrators have come to know and love. For example, 143.48.7.1 is the IP address of one of the servers at my workplace. Expressed as a 32-bit number in the hexadecimal system, this address is 0x8f300701.
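The relationship between the dotted-quad form and the underlying 32-bit number can be sketched in a few lines of Perl. This is only an illustration: dotted_to_int() and int_to_dotted() are hypothetical helper names, not part of any standard module, and the sketch assumes a well-formed dotted-quad string.

```perl
use strict;
use warnings;

# Convert a dotted-quad string to a 32-bit integer.
# (dotted_to_int and int_to_dotted are illustrative names only.)
sub dotted_to_int {
    my ($dotted) = @_;
    # 'C4' packs four unsigned bytes; 'N' reads the same four bytes
    # back as one big-endian (network order) 32-bit integer.
    return unpack 'N', pack 'C4', split /\./, $dotted;
}

# Convert a 32-bit integer back to dotted-quad form.
sub int_to_dotted {
    my ($int) = @_;
    return join '.', unpack 'C4', pack 'N', $int;
}

my $int = dotted_to_int('143.48.7.1');
printf "0x%08x\n", $int;            # the address as a hexadecimal number
print int_to_dotted($int), "\n";    # and back to dotted-quad form
```

Each decimal number in the dotted quad becomes exactly two hexadecimal digits, which is why the hexadecimal form always has eight digits.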

Many of Perl's networking calls require you to work with IP addresses in the form of packed binary strings. IP addresses can be converted manually to binary format and back again using pack() and unpack() with a template of "C4" (four unsigned characters). For example, here's how to convert 18.157.0.125 into its packed form and then reverse the process:

    # dotted quad to packed binary form
    ($a,$b,$c,$d)      = split /\./, '18.157.0.125';
    $packed_ip_address = pack 'C4',$a,$b,$c,$d;

    # packed binary form back to dotted quad
    ($a,$b,$c,$d)      = unpack 'C4',$packed_ip_address;
    $dotted_ip_address = join '.', $a,$b,$c,$d;

You usually won't have to do this, however, because Perl provides convenient high-level functions to handle this conversion for you.

Most hosts have two addresses: the "loopback" address 127.0.0.1 (often known by its symbolic name "localhost") and a public Internet address. The loopback address is associated with a device that loops transmissions back onto itself, allowing a client on the host to make an outgoing connection to a server running on the same host. Although this sounds a bit pointless, it is a powerful technique for application development, because it means that you can develop and test software on the local machine without access to the network.

The public Internet address is associated with the host's network interface card, such as an Ethernet card. The address is either assigned to the host by the network administrator or, in systems with dynamic host addressing, by a Boot Protocol (BOOTP) or Dynamic Host Configuration Protocol (DHCP) server. If a host has multiple network interfaces installed, each one can have a distinct IP address. It's also possible for a single interface to be configured to use several addresses. Chapter 21 discusses IO::Interface, a third-party Perl module that allows a Perl script to examine and alter the IP addresses assigned to its interface cards.

Reserved IP Addresses, Subnets, and Netmasks

In order for a packet of information to travel from one location to another across the Internet, it must hop across a series of physical networks. For example, a packet leaving your desktop computer must travel across your LAN (local area network) to a modem or router, then across your Internet service provider's (ISP) regional network, then across a backbone to another ISP's regional network, and finally to its destination machine.

Network routers keep track of how the networks interconnect, and are responsible for determining the most efficient route to get a packet from point A to point B. However, if IP addresses were allocated ad hoc, this task would not be feasible because each router would have to maintain a map showing the locations of all IP addresses. Instead, IP addresses are allocated in contiguous chunks for use in organizational and regional networks.

For example, my employer, the Cold Spring Harbor Laboratory (CSHL), owns the block of IP addresses that range from 143.48.0.0 through 143.48.255.255 (this is a so-called class B address). When a backbone router sees a packet addressed to an IP address in this range, it needs only to determine how to get the packet into CSHL's network. It is then the responsibility of CSHL's routers to get the packet to its destination. In practice, CSHL and other large organizations split their allocated address ranges into several subnets and use routers to interconnect the parts.

A computer that is sending out an IP packet must determine whether the destination machine is directly reachable (e.g., over the Ethernet) or whether the packet must be directed to a router that interconnects the local network to more distant locations. The basic decision is whether the packet is part of the local network or part of a distant network.

To make this decision possible, IP addresses are arbitrarily split into a host part and a network part. For example, in CSHL's network, the split occurs after the second byte: the network part is 143.48. and the host part is the rest. So 143.48.0.0 is the first address in CSHL's network, and 143.48.255.255 is the last.

To describe where the network/host split occurs for routing purposes, networks use a netmask, which is a bitmask with 1s in the positions of the network part of the IP address. Like the IP address itself, the netmask is usually written in dotted-quad form. Continuing with our example, CSHL has a netmask of 255.255.0.0, which, when written in binary, is 11111111,11111111,00000000,00000000.
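The local-versus-distant decision described above comes down to a bitwise AND: two addresses are on the same network if they have the same network part once the netmask is applied. The following is a minimal sketch of that test, not the code an operating system actually runs. is_local() and ip_to_int() are hypothetical helper names, and 143.48.200.9 is simply an invented address inside CSHL's range.

```perl
use strict;
use warnings;
use Socket qw(inet_aton);

# Turn a dotted quad into a 32-bit integer so that it can be masked.
sub ip_to_int { return unpack 'N', inet_aton($_[0]) }

# True if $dest has the same network part as $local under $netmask.
# (is_local is an illustrative name, not a Socket function.)
sub is_local {
    my ($local, $dest, $netmask) = @_;
    my $mask = ip_to_int($netmask);
    return (ip_to_int($local) & $mask) == (ip_to_int($dest) & $mask);
}

for my $dest ('143.48.200.9', '18.157.0.125') {
    my $where = is_local('143.48.7.1', $dest, '255.255.0.0')
              ? 'deliver directly'
              : 'send to a router';
    print "$dest: $where\n";
}
```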

Historically, IP networks were divided into three classes on the basis of their netmasks (Table 3.5). Class A networks have a netmask of 255.0.0.0 and approximately 16 million hosts. Class B networks have a netmask of 255.255.0.0 and some 65,000 hosts, and class C networks use the netmask 255.255.255.0 and support 254 hosts (as we will see, the first and last host numbers in a network range are unavailable for use as a normal host address).

Table 3.5. Address Classes and Their Netmasks

Class	Netmask	Example Address	Network Part	Host Part
A	255.0.0.0	120.155.32.5	120.	155.32.5
B	255.255.0.0	128.157.32.5	128.157.	32.5
C	255.255.255.0	192.66.12.56	192.66.12.	56

As the Internet has become more crowded, however, networks have had to be split up in more flexible ways. It's common now to see netmasks that don't end at byte boundaries. For example, the netmask 255.255.255.128 (binary 11111111,11111111,11111111,10000000) splits the last byte in half, creating a set of 126-host networks. The modern Internet routes packets based on this more flexible scheme, called Classless Inter-Domain Routing (CIDR). CIDR uses a concise convention to describe networks in which the network address is followed by a slash and an integer containing the number of 1s in the mask. For example, CSHL's network is described by the CIDR address 143.48.0.0/16. CIDR is described in detail in RFCs 1517 through 1520, and in the FAQs listed in Appendix D.
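The correspondence between a CIDR prefix length and its netmask can be worked out mechanically: write that many 1 bits, pad with 0 bits to 32, and read the result as four bytes. The sketch below does this with pack()'s "B32" template. prefix_to_netmask() and usable_hosts() are hypothetical names chosen for illustration, and the host count is only meaningful for prefix lengths of 30 or less.

```perl
use strict;
use warnings;
use Socket qw(inet_ntoa);

# Build the dotted-quad netmask for a CIDR prefix length (0 through 32).
# (prefix_to_netmask and usable_hosts are illustrative names only.)
sub prefix_to_netmask {
    my ($bits) = @_;
    # 'B32' packs a string of 32 '0'/'1' characters into four bytes.
    return inet_ntoa(pack 'B32', ('1' x $bits) . ('0' x (32 - $bits)));
}

# Number of host addresses left once the first and last address in the
# range are set aside. Meaningful for prefix lengths up to 30.
sub usable_hosts {
    my ($bits) = @_;
    return 2 ** (32 - $bits) - 2;
}

for my $bits (16, 24, 25) {
    printf "/%-3d %-16s %d hosts\n",
        $bits, prefix_to_netmask($bits), usable_hosts($bits);
}
```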

Figuring out the network and broadcast addresses can be confusing when you work with netmasks that do not end at byte boundaries. The Net::Netmask module, available on CPAN, provides facilities for calculating these values in an intuitive way. You'll also find a short module that I wrote, Net::NetmaskLite, in Appendix A. You might want to peruse this code in order to learn the relationships among the network address, broadcast address, and netmask.

The first and last addresses in a subnet have special significance and cannot be used as ordinary host addresses. The first address, sometimes known as the all-zeroes address, is reserved for use in routing tables to denote the network as a whole. The last address in the range, known as the all-ones address, is reserved for use as the broadcast address. IP packets sent to this address will be received by all hosts on the subnet. For example, for the network 192.18.4.x (a class C address or 192.18.4.0/24 in CIDR format), the network address is 192.18.4.0 and the broadcast address is 192.18.4.255. We will discuss broadcasting in detail in Chapter 20.
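For readers who want to see the arithmetic, here is a small sketch that derives the network and broadcast addresses by hand rather than through Net::Netmask. The network address is the IP address with all host bits cleared, and the broadcast address is the same with all host bits set. network_and_broadcast() is a hypothetical helper, and the host addresses passed to it are invented for the example.

```perl
use strict;
use warnings;
use Socket qw(inet_aton inet_ntoa);

# Return the network and broadcast addresses for an address and a CIDR
# prefix length. (network_and_broadcast is an illustrative name; it is
# not part of Net::Netmask or any other module.)
sub network_and_broadcast {
    my ($dotted, $bits) = @_;
    my $ip   = unpack 'N', inet_aton($dotted);
    my $mask = unpack 'N', pack 'B32', ('1' x $bits) . ('0' x (32 - $bits));

    my $network   = $ip & $mask;                       # host bits all zero
    my $broadcast = $network | (~$mask & 0xFFFFFFFF);  # host bits all one
    return map { inet_ntoa(pack 'N', $_) } $network, $broadcast;
}

# A mask that ends on a byte boundary.
my ($net, $bcast) = network_and_broadcast('192.18.4.77', 24);
print "network $net, broadcast $bcast\n";

# A mask that does not end on a byte boundary.
($net, $bcast) = network_and_broadcast('192.18.4.200', 25);
print "network $net, broadcast $bcast\n";
```

The extra "& 0xFFFFFFFF" keeps the complemented mask to 32 bits on Perls built with 64-bit integers.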

In addition, several IP address ranges have been set aside for special purposes (Table 3.6). The class A network 10.x.x.x, the 16 class B networks 172.16.x.x through 172.31.x.x, and the 256 class C networks 192.168.0.x through 192.168.255.x are reserved for use as internal networks. An organization may use any of these networks internally, but must not connect the network directly to the Internet. The 192.168.x.x networks are used frequently in testing, or placed behind firewall systems that translate all the internal network addresses into a single public IP address. The network addresses 224.x.x.x through 239.x.x.x are reserved for multicasting applications (Chapter 21), and everything from 240.x.x.x upward is reserved for future expansion.

Table 3.6. Reserved IP Addresses

Address	Description
127.0.0.x	Loopback interface
10.x.x.x	Private class A address
172.16.x.x–172.31.x.x	Private class B addresses
192.168.0.x–192.168.255.x	Private class C addresses

Finally, the address range 127.0.0.x is reserved for use as the loopback network (strictly speaking, the whole 127.x.x.x block is set aside for this purpose). Anything sent to an address in this range is received by the local host.
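The reserved ranges discussed in this section can be expressed as a small lookup table of CIDR blocks. The following sketch classifies an address against that table. classify_address() and the labels it returns are invented for illustration, and the loopback entry is written as 127.0.0.0/8 on the assumption that the entire 127.x.x.x block should be treated as loopback.

```perl
use strict;
use warnings;
use Socket qw(inet_aton);

# Reserved ranges as [network, prefix length, label].
# The labels are illustrative; 127.0.0.0/8 assumes that the whole
# 127.x.x.x block counts as loopback.
my @RESERVED = (
    [ '127.0.0.0',    8, 'loopback'                ],
    [ '10.0.0.0',     8, 'private (class A)'       ],
    [ '172.16.0.0',  12, 'private (class B)'       ],  # 172.16 through 172.31
    [ '192.168.0.0', 16, 'private (class C)'       ],
    [ '224.0.0.0',    4, 'multicast'               ],  # 224 through 239
    [ '240.0.0.0',    4, 'reserved for future use' ],
);

# (classify_address is a hypothetical helper name.)
sub classify_address {
    my ($dotted) = @_;
    my $ip = unpack 'N', inet_aton($dotted);
    for my $range (@RESERVED) {
        my ($network, $bits, $label) = @$range;
        my $mask = unpack 'N', pack 'B32', ('1' x $bits) . ('0' x (32 - $bits));
        my $net  = unpack 'N', inet_aton($network);
        return $label if ($ip & $mask) == $net;
    }
    return 'ordinary';
}

for my $addr (qw(127.0.0.1 10.1.2.3 172.31.9.9 192.168.1.10 230.0.0.1 143.48.7.1)) {
    printf "%-14s %s\n", $addr, classify_address($addr);
}
```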

IPv6

Although there are more than 4 billion possible IPv4 addresses, the presence of several large reserved ranges and the way the addresses are allocated into subnetworks reduces the effective number of addresses considerably. This, coupled with the recent explosion in network-connected devices, means that the Internet is rapidly running out of IP addresses. The crisis has been forestalled for now by various dynamic host-addressing and address-translation techniques that share a pool of IP addresses among a larger set of hosts. However, the new drive to put toaster ovens, television set-top boxes, and cell phones on the Internet is again threatening to exhaust the address space.

This is one of the major justifications for the new version of TCP/IP, known as IPv6, which expands the IP address space from 4 to 16 bytes. IPv6 is being deployed on the Internet backbones now, but this change will not immediately affect local area networks, which will continue to use addresses backwardly compatible with IPv4. Perl has not yet been updated to support IPv6, but will undoubtedly do so by the time that IPv6 is widely implemented.

More information about IPv6 can be found in [Stevens 1996] and [Hunt 1998].

Network Ports

Once a message reaches its destination IP address, there's still the matter of finding the correct program to deliver it to. It's common for a host to be running multiple network servers, and it would be impractical, not to say confusing, to deliver the same message to them all. That's where the port number comes in. The port number part of the socket address is an unsigned 16-bit number; usable port numbers range from 1 to 65535. In addition to its IP address, each active socket on a host is identified by a unique port number; this allows messages to be delivered unambiguously to the correct program. When a program creates a socket, it may ask the operating system to associate a port with the socket. If the port is not being used, the operating system will grant this request, and will refuse other programs access to the port until the port is no longer in use. If the program doesn't specifically request a port, one will be assigned to it from the pool of unused port numbers.
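Port assignment can be observed directly with the low-level socket calls. The sketch below is hedged in several ways: it assumes a host with a working loopback interface, it passes 0 as the protocol to socket() so that the system default for a stream socket (TCP) is used, and it requests port 0, which is the conventional way to ask the operating system to pick any unused port. It then tries to claim the same port from a second socket.

```perl
use strict;
use warnings;
use Socket qw(PF_INET SOCK_STREAM INADDR_LOOPBACK
              pack_sockaddr_in unpack_sockaddr_in);

# Create a TCP socket (protocol 0 selects the default for SOCK_STREAM)
# and ask for any unused port on the loopback interface by binding to port 0.
socket(my $first, PF_INET, SOCK_STREAM, 0) or die "socket: $!";
bind($first, pack_sockaddr_in(0, INADDR_LOOPBACK)) or die "bind: $!";

# getsockname() returns the packed address actually given to the socket.
my ($port) = unpack_sockaddr_in(getsockname($first));
print "operating system assigned port $port\n";

# A second socket now asks for the port that the first one holds.
socket(my $second, PF_INET, SOCK_STREAM, 0) or die "socket: $!";
my $granted = bind($second, pack_sockaddr_in($port, INADDR_LOOPBACK));
print $granted ? "second bind succeeded\n"
               : "second bind refused: $!\n";
```

On typical systems the second bind() is refused with an "address already in use" error for as long as the first socket holds the port.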

There are actually two sets of port numbers, one for use by TCP sockets, and the other for use by UDP-based programs. It is perfectly all right for two programs to be using the same port number provided that one is using it for TCP and the other for UDP.
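The separation of the TCP and UDP port spaces can be checked with a short experiment. This sketch assumes a working loopback interface. It lets the operating system choose a TCP port and then requests the same port number for a UDP socket. The UDP request could still fail if some unrelated program happens to hold that UDP port, so the code reports the outcome instead of assuming it.

```perl
use strict;
use warnings;
use Socket qw(PF_INET SOCK_STREAM SOCK_DGRAM INADDR_LOOPBACK
              pack_sockaddr_in unpack_sockaddr_in);

# Bind a TCP socket to whichever free port the operating system picks.
socket(my $tcp, PF_INET, SOCK_STREAM, 0) or die "socket: $!";
bind($tcp, pack_sockaddr_in(0, INADDR_LOOPBACK)) or die "bind: $!";
my ($tcp_port) = unpack_sockaddr_in(getsockname($tcp));

# Request the same port number for a UDP socket. The TCP socket above
# does not stand in the way, because UDP ports are tracked separately.
socket(my $udp, PF_INET, SOCK_DGRAM, 0) or die "socket: $!";
my $udp_ok = bind($udp, pack_sockaddr_in($tcp_port, INADDR_LOOPBACK));

my $report = $udp_ok
    ? "TCP and UDP sockets are both bound to port $tcp_port"
    : "UDP bind to port $tcp_port failed: $!";
print "$report\n";
```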

Not all port numbers are created equal. The ports in the range 0 through 1023 are reserved for the use of "well-known" services, which are assigned and maintained by ICANN, the Internet Corporation for Assigned Names and Numbers. For example, TCP port 80 is reserved for the HTTP protocol used by Web servers, TCP port 25 is used for the SMTP protocol used by e-mail transport agents, and UDP port 53 is used for the domain name service (DNS). Because these ports are well known, you can be pretty certain that a Web server running on a remote machine will be listening on port 80. On UNIX systems, only the root user (i.e., the superuser) is allowed to create a socket using a reserved port. This is partly to prevent unprivileged users on the system from inadvertently running code that will interfere with the operations of the host's network services.

A list of reserved ports and their associated well-known services is given in Appendix C. Most services are either TCP- or UDP-based, but some can communicate with both protocols. In the interest of future compatibility, ICANN usually reserves both the UDP and TCP ports for each service. However, there are many exceptions to this rule. For example, TCP port 514 is used on UNIX systems for remote shell (login) services, while UDP port 514 is used for the system logging daemon.

In some versions of UNIX, the high-numbered ports in the range 49152 through 65535 are reserved by the operating system for use as "ephemeral" ports to be assigned automatically to outgoing TCP/IP connections when a port number hasn't been explicitly requested. The remaining ports, those in the range 1024 through 49151, are free for use in your own applications, provided that some other service has not already claimed them. It is a good idea to check the ports in use on your machine by using one of the network tools introduced later in this chapter (Network Analysis Tools) before claiming one.
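The three port ranges can be summarized as a simple lookup. In the sketch below, port_class() and its labels are invented for illustration. The ephemeral range is the one quoted above, and it varies between operating systems.

```perl
use strict;
use warnings;

# Classify a port number by the ranges described in the text.
# (port_class is a hypothetical helper; the ephemeral boundary
# differs from one operating system to another.)
sub port_class {
    my ($port) = @_;
    return 'invalid'                    if $port !~ /^\d+$/ || $port > 65535;
    return 'well-known (reserved)'      if $port <= 1023;
    return 'available for applications' if $port <= 49151;
    return 'ephemeral';
}

for my $port (25, 80, 514, 8080, 49152, 70000) {
    printf "%-6s %s\n", $port, port_class($port);
}
```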

The sockaddr_in Structure

A socket address is the combination of the host address and the port, packed together in a binary structure called a sockaddr_in. This corresponds to a C structure of the same name that is used internally to call the system networking routines. (By analogy, UNIX domain sockets use a packed structure called a sockaddr_un.) Functions provided by the standard Perl Socket module allow you to create and manipulate sockaddr_in structures easily:

$packed_address = inet_aton($dotted_quad)

Given an IP address in dotted-quad form, this function packs it into binary form suitable for use by sockaddr_in(). The function will also operate on symbolic hostnames. If the hostname cannot be looked up, it returns undef.

$dotted_quad = inet_ntoa($packed_address)

This function takes a packed IP address and converts it into human-readable dotted-quad form. It does not attempt to translate IP addresses into hostnames. You can achieve this effect by using gethostbyaddr(), discussed later.
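A short round trip shows how inet_aton() and inet_ntoa() relate to the manual pack() and unpack() approach shown earlier. The sketch uses only a dotted-quad string, so no hostname lookup takes place.

```perl
use strict;
use warnings;
use Socket qw(inet_aton inet_ntoa);

# Pack a dotted-quad address into its four-byte binary form.
my $packed = inet_aton('18.157.0.125');
die "could not pack address\n" unless defined $packed;

printf "packed length: %d bytes\n", length $packed;

# The result is the same string that pack 'C4' produces by hand.
my $same = $packed eq pack('C4', 18, 157, 0, 125);
print "matches pack 'C4': ", ($same ? 'yes' : 'no'), "\n";

# Convert back to human-readable form.
print "round trip: ", inet_ntoa($packed), "\n";
```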

$socket_addr = sockaddr_in($port,$address)

($port,$address) = sockaddr_in($socket_addr)

When called in a scalar context, sockaddr_in() takes a port number and a binary IP address and packs them together into a socket address, suitable for use by socket calls such as bind() and connect(). When called in a list context, sockaddr_in() does the opposite, translating a socket address into the port and IP address. The IP address must still be passed through inet_ntoa() to obtain a human-readable string.
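The two faces of sockaddr_in() are easiest to see side by side. In this sketch the port and address are arbitrary example values. The assignment to a scalar triggers packing, and the assignment to a list triggers unpacking.

```perl
use strict;
use warnings;
use Socket qw(inet_aton inet_ntoa sockaddr_in);

# Scalar context: a port and a packed IP address go in,
# and a packed socket address comes out.
my $socket_addr = sockaddr_in(80, inet_aton('143.48.7.1'));

# List context: the packed socket address goes in,
# and the port and packed IP address come out.
my ($port, $packed_ip) = sockaddr_in($socket_addr);

print "port $port, address ", inet_ntoa($packed_ip), "\n";
```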

$socket_addr = pack_sockaddr_in($port,$address)

($port,$address) = unpack_sockaddr_in($socket_addr)

If you don't like the confusing behavior of sockaddr_in(), you can use these two functions to pack and unpack socket addresses in a context-insensitive manner.
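One place where the context-insensitive forms are convenient is inside map(), whose block is evaluated in list context. A call to sockaddr_in() there would be treated as a request to unpack, whereas pack_sockaddr_in() packs regardless. The ports in this sketch are arbitrary example values.

```perl
use strict;
use warnings;
use Socket qw(inet_aton inet_ntoa pack_sockaddr_in unpack_sockaddr_in);

my $ip = inet_aton('127.0.0.1');

# The map block runs in list context, so the explicit packing
# function is used to build one socket address per port.
my @addrs = map { pack_sockaddr_in($_, $ip) } 8080, 8081;

for my $addr (@addrs) {
    my ($port, $packed_ip) = unpack_sockaddr_in($addr);
    print inet_ntoa($packed_ip), ":$port\n";
}
```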

We'll use several of these functions in the example that follows in the next section.

In some references, you'll see a socket's address referred to as its "name." Don't let this confuse you. A socket's address and its name are one and the same.
