Plumbing the Internet | Network Scripting

Unless you've been living in a cave for the last decade, you are probably already familiar with what the Internet is about, at least from a user's perspective. Functionally, we use it as a communication and information medium, by exchanging email, browsing web pages, transferring files, and so on. Technically, the Internet consists of many layers of abstraction and devices -- from the actual wires used to send bits across the world to the web browser that grabs and renders those bits into text, graphics, and audio on your computer.

In this book, we are primarily concerned with the programmer's interface to the Internet. This too consists of multiple layers: sockets, which are programmable interfaces to the low-level connections between machines, and standard protocols, which add structure to discussions carried out over sockets. Let's briefly look at each of these layers in the abstract before jumping into programming details.

10.2.1 The Socket Layer

In simple terms, sockets are a programmable interface to network connections between computers. They also form the basis, and low-level "plumbing," of the Internet itself: all of the familiar higher-level Net protocols like FTP, web pages, and email, ultimately occur over sockets. Sockets are also sometimes called communications endpoints because they are the portals through which programs send and receive bytes during a conversation.

To programmers, sockets take the form of a handful of calls available in a library. These socket calls know how to send bytes between machines, using lower-level operations such as TCP, the Transmission Control Protocol. At the bottom, TCP knows how to transfer bytes, but doesn't care what those bytes mean. For the purposes of this text, we will generally ignore how bytes sent to sockets are physically transferred. To understand sockets fully, though, we need to know a bit about how computers are named.

10.2.1.1 Machine identifiers

Suppose for just a moment that you wish to have a telephone conversation with someone halfway across the world. In the real world, you would probably either need that person's telephone number, or a directory that can be used to look up the number from his or her name (e.g., a telephone book). The same is true on the Internet: before a script can have a conversation with another computer somewhere in cyberspace, it must first know that other computer's number or name.

Luckily, the Internet defines standard ways to name both a remote machine, and a service provided by that machine. Within a script, the computer program to be contacted through a socket is identified by supplying a pair of values -- the machine name, and a specific port number on that machine:

Machine names

A machine name may take the form of either a string of numbers separated by dots called an IP address (e.g., 166.93.218.100), or a more legible form known as a domain name (e.g., starship.python.net). Domain names are automatically mapped into their dotted numeric address equivalent when used, by something called a domain name server -- a program on the Net that serves the same purpose as your local telephone directory assistance service.

Port numbers

A port number is simply an agreed-upon numeric identifier for a given conversation. Because computers on the Net can support a variety of services, port numbers are used to name a particular conversation on a given machine. For two machines to talk over the Net, both must associate sockets with the same machine name and port number when initiating network connections.

The combination of a machine name and a port number uniquely identifies every dialog on the Net. For instance, an Internet Service Provider's computer may provide many kinds of services for customers -- web pages, Telnet, FTP transfers, email, and so on. Each service on the machine is assigned a unique port number to which requests may be sent. To get web pages from a web server, programs need to specify both the web server's IP or domain name, and the port number on which the server listens for web page requests.

If this all sounds a bit strange, it may help to think of it in old-fashioned terms. In order to have a telephone conversation with someone within a company, for example, you usually need to dial both the company's phone number, as well as the extension of the person you want to reach. Moreover, if you don't know the company's number, you can probably find it by looking up the company's name in a phone book. It's almost the same on the Net -- machine names identify a collection of services (like a company), port numbers identify an individual service within a particular machine (like an extension), and domain names are mapped to IP numbers by domain name servers (like a phone book).
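The machine-name and port-number pair described above can be sketched in a few lines. This is only an illustration, assuming a modern Python 3 interpreter; the function names resolve and service_address are made up for this example and are not part of any library. It uses the standard socket module's gethostbyname call, which maps a domain name to its dotted numeric form and returns a dotted IPv4 address unchanged.

```python
import socket


def resolve(machine_name):
    """Map a machine name to its dotted IPv4 address.

    A domain name is looked up through the system's name resolver;
    a string that is already a dotted IPv4 address comes back unchanged.
    """
    return socket.gethostbyname(machine_name)


def service_address(machine_name, port):
    """Build the (address, port) pair that identifies one service."""
    return (resolve(machine_name), port)


if __name__ == "__main__":
    # The dotted address is the example from the text; no lookup is needed.
    print(service_address("166.93.218.100", 80))
```

Passing a domain name instead of a dotted address triggers a real name lookup, so the result then depends on your network and resolver configuration.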

When programs use sockets to communicate in specialized ways with another machine (or with other processes on the same machine), they need to avoid using a port number reserved by a standard protocol -- numbers in the range of 0-1023 -- but we first need to discuss protocols to understand why.

10.2.2 The Protocol Layer

Although sockets form the backbone of the Internet, much of the activity that happens on the Net is programmed with protocols,[1] which are higher-level message models that run on top of sockets. In short, Internet protocols define a structured way to talk over sockets. They generally standardize both message formats and socket port numbers:

Message formats provide structure for the bytes exchanged over sockets during conversations.
Port numbers are reserved numeric identifiers for the underlying sockets over which messages are exchanged.

[1] Some books also use the term protocol to refer to lower-level transport schemes such as TCP. In this book, we use protocol to refer to higher-level structures built on top of sockets; see a networking text if you are curious about what happens at lower levels.

Raw sockets are still commonly used in many systems, but it is perhaps more common (and generally easier) to communicate with one of the standard higher-level Internet protocols.

10.2.2.1 Port number rules

Technically speaking, socket port numbers can be any 16-bit integer value between 0 and 65,535. However, to make it easier for programs to locate the standard protocols, port numbers in the range of 0-1023 are reserved and preassigned to the standard higher-level protocols. Table 10-1 lists the ports reserved for many of the standard protocols; each gets one or more preassigned numbers from the reserved range.
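As a small illustration of these rules, the following sketch checks whether a port number is legal and whether it falls in the reserved range. The function name classify_port and the labels it returns are invented for this example; only the numeric ranges come from the text.

```python
MAX_PORT = 65535          # largest value that fits in 16 bits
LAST_RESERVED_PORT = 1023  # top of the range set aside for standard protocols


def classify_port(port):
    """Return "reserved" for ports 0-1023 and "unreserved" for 1024-65535.

    Raises ValueError for anything outside the 16-bit range.
    """
    if not 0 <= port <= MAX_PORT:
        raise ValueError("port must be between 0 and %d" % MAX_PORT)
    return "reserved" if port <= LAST_RESERVED_PORT else "unreserved"


if __name__ == "__main__":
    for candidate in (80, 1023, 1024, 50007):
        print(candidate, classify_port(candidate))
```

A script that defines its own conversation would pick a number for which classify_port returns "unreserved".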

Table 10-1. Port Numbers Reserved for Common Protocols
Protocol	Common Function	Port Number	Python Module
HTTP	Web pages	80	`httplib`
NNTP	Usenet news	119	`nntplib`
FTP data default	File transfers	20	`ftplib`
FTP control	File transfers	21	`ftplib`
SMTP	Sending email	25	`smtplib`
POP3	Fetching email	110	`poplib`
IMAP4	Fetching email	143	`imaplib`
Finger	Informational	79	n/a
Telnet	Command lines	23	`telnetlib`
Gopher	Document transfers	70	`gopherlib`
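Several of the port numbers in Table 10-1 can be read straight out of the library modules, which record their protocol's default port as a module-level constant. The sketch below assumes a modern Python 3 interpreter, where the module this chapter calls httplib is named http.client; the dictionary name DEFAULT_PORTS is made up for this example, and only modules that are still shipped with current Python releases are included.

```python
import ftplib
import http.client
import imaplib
import poplib
import smtplib

# Default port constants defined by the protocol modules themselves,
# keyed by the protocol names used in Table 10-1.
DEFAULT_PORTS = {
    "HTTP": http.client.HTTP_PORT,
    "FTP control": ftplib.FTP_PORT,
    "SMTP": smtplib.SMTP_PORT,
    "POP3": poplib.POP3_PORT,
    "IMAP4": imaplib.IMAP4_PORT,
}

if __name__ == "__main__":
    for protocol, port in sorted(DEFAULT_PORTS.items(), key=lambda item: item[1]):
        print("%-12s %d" % (protocol, port))
```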

10.2.2.2 Clients and servers

To socket programmers, the standard protocols mean that port numbers 0-1023 are off-limits to scripts, unless they really mean to use one of the higher-level protocols. This is both by standard and by common sense. A Telnet program, for instance, can start a dialog with any Telnet-capable machine by connecting to its port 23; without preassigned port numbers, each server might install Telnet on a different port. Similarly, web sites listen for page requests from browsers on port 80 by standard; if they did not, you might have to know and type the HTTP port number of every site you visit while surfing the Net.

By defining standard port numbers for services, the Net naturally gives rise to a client/server architecture. On one side of a conversation, machines that support standard protocols run a set of perpetually running programs that listen for connection requests on the reserved ports. On the other end of a dialog, other machines contact those programs to use the services they export.

We usually call the perpetually running listener program a server and the connecting program a client. Let's use the familiar web browsing model as an example. As shown in Table 10-1, the HTTP protocol used by the Web allows clients and servers to talk over sockets on port 80:

Server

A machine that hosts web sites usually runs a web server program that constantly listens for incoming connection requests, on a socket bound to port 80. Often, the server itself does nothing but watch for requests on its port perpetually; handling requests is delegated to spawned processes or threads.

Clients

Programs that wish to talk to this server specify the server machine's name and port 80 to initiate a connection. For web servers, typical clients are web browsers like Internet Explorer or Netscape, but any script can open a client-side connection on port 80 to fetch web pages from the server.

In general, many clients may connect to a server over sockets, whether it implements a standard protocol or something more specific to a given application. And in some applications, the notion of client and server is blurred -- programs can also pass bytes between each other more as peers than as master and subordinate. For the purpose of this book, though, we usually call programs that listen on sockets servers, and those that connect, clients. We also sometimes call the machines that these programs run on server and client (e.g., a computer on which a web server program runs may be called a web server machine, too), but this has more to do with the physical than the functional.
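The two roles can be sketched with the socket module alone. What follows is a minimal illustration rather than a production server, and it assumes a modern Python 3 interpreter: the names read_all, serve_once, and echo_roundtrip are invented for this example, the conversation stays on the local machine (127.0.0.1), and the server binds to port 0 so that the operating system picks a free unreserved port instead of a standard one such as 80. The server handles exactly one client and then exits, where a real server would loop forever.

```python
import socket
import threading


def read_all(sock):
    """Read from a socket until the other side signals end of data."""
    chunks = []
    while True:
        data = sock.recv(1024)
        if not data:
            break
        chunks.append(data)
    return b"".join(chunks)


def serve_once(server_sock):
    """Server role: accept one connection, echo its bytes back, close."""
    conn, _client_address = server_sock.accept()
    with conn:
        request = read_all(conn)
        conn.sendall(b"Echo: " + request)


def echo_roundtrip(message):
    """Run one client/server conversation and return the server's reply."""
    # Server side: bind and listen before any client tries to connect.
    server_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server_sock.bind(("127.0.0.1", 0))  # port 0: let the OS choose a free port
    server_sock.listen(1)
    host, port = server_sock.getsockname()
    worker = threading.Thread(target=serve_once, args=(server_sock,))
    worker.start()

    # Client side: name the machine and port, connect, send, read the reply.
    with socket.create_connection((host, port), timeout=5) as client:
        client.sendall(message)
        client.shutdown(socket.SHUT_WR)  # tell the server we are done sending
        reply = read_all(client)

    worker.join()
    server_sock.close()
    return reply


if __name__ == "__main__":
    print(echo_roundtrip(b"Hello network world"))
```

The listener runs in a thread here only so that both ends fit in one script; in practice the server and client are usually separate programs, often on separate machines.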

10.2.2.3 Protocol structures

Functionally, protocols may accomplish a familiar task like reading email or posting a Usenet newsgroup message, but they ultimately consist of message bytes sent over sockets. The structure of those message bytes varies from protocol to protocol, is hidden by the Python library, and is mostly beyond the scope of this book, but a few general words may help demystify the protocol layer.

Some protocols may define the contents of messages sent over sockets; others may specify the sequence of control messages exchanged during conversations. By defining regular patterns of communication, protocols make communication more robust. They can also minimize deadlock conditions -- machines waiting for messages that never arrive.
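To make the idea of a message format concrete, here is a toy example, assuming a modern Python 3 interpreter. It is not the format of FTP or of any other standard protocol: each message is simply preceded by a 4-byte length, so the receiver always knows how many bytes to wait for and never blocks expecting data that is not coming. The names send_message, recv_exactly, and recv_message are invented for this sketch, and socket.socketpair stands in for a real network connection.

```python
import socket
import struct

# Toy message format: a 4-byte big-endian length, then that many payload bytes.
HEADER = struct.Struct("!I")


def send_message(sock, payload):
    """Send one length-prefixed message."""
    sock.sendall(HEADER.pack(len(payload)) + payload)


def recv_exactly(sock, count):
    """Read exactly count bytes, or fail if the connection closes early."""
    buffer = b""
    while len(buffer) < count:
        chunk = sock.recv(count - len(buffer))
        if not chunk:
            raise ConnectionError("socket closed in the middle of a message")
        buffer += chunk
    return buffer


def recv_message(sock):
    """Receive one length-prefixed message and return its payload."""
    (length,) = HEADER.unpack(recv_exactly(sock, HEADER.size))
    return recv_exactly(sock, length)


if __name__ == "__main__":
    left, right = socket.socketpair()  # two connected sockets in one process
    with left, right:
        send_message(left, b"send me a file")
        send_message(left, b"spam.txt")
        print(recv_message(right))
        print(recv_message(right))
```

Because both messages travel over the same byte stream, the length header is what lets the receiver tell where the first ends and the second begins.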

For example, the FTP protocol prevents deadlock by conversing over two sockets: one for control messages only, and one to transfer file data. An FTP server listens for control messages (e.g., "send me a file") on one port, and transfers file data over another. FTP clients open socket connections to the server machine's control port, send requests, and send or receive file data over a socket connected to a data port on the server machine. FTP also defines standard message structures passed between client and server. The control message used to request a file, for instance, must follow a standard format.

10.2.3 Python's Internet Library Modules

If all of this sounds horribly complex, cheer up: Python's standard protocol modules handle all the details. For example, the Python library's ftplib module manages all the socket and message-level handshaking implied by the FTP protocol. Scripts that import ftplib have access to a much higher-level interface for FTPing files and can be largely ignorant of both the underlying FTP protocol, and the sockets over which it runs.[2]

[2] Since Python is an open source system, you can read the source code of the ftplib module if you are curious about how the underlying protocol actually works. See file ftplib.py in the standard library source directory on your machine. Its code is complex (since it must format messages and manage two sockets), but along with the other standard Internet protocol modules, it is a good example of low-level socket programming.

In fact, each supported protocol is represented by a standard Python module file with a name of the form xxxlib.py, where xxx is replaced by the protocol's name, in lowercase. The last column in Table 10-1 gives the module name for each standard protocol module. For instance, FTP is supported by module file ftplib.py. Moreover, within the protocol modules, the top-level interface object is usually named for the protocol. So, for instance, to start an FTP session in a Python script, you run import ftplib and pass appropriate parameters in a call to ftplib.FTP(); for Telnet, create a telnetlib.Telnet().
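A session started this way might look like the following sketch, which assumes a modern Python 3 interpreter. The function name list_directory is invented for this example, and the host name in the usage lines is a placeholder rather than a real server. The calls themselves (ftplib.FTP, login, and retrlines) are part of the standard ftplib module; calling login with no arguments requests an anonymous session.

```python
import ftplib


def list_directory(host, timeout=10):
    """Return the lines of a directory listing from an FTP server.

    The control-port handshake, the separate data connection, and the
    message formats are all handled inside ftplib.
    """
    lines = []
    with ftplib.FTP(host, timeout=timeout) as ftp:  # connects to port 21
        ftp.login()                                 # anonymous login
        ftp.retrlines("LIST", lines.append)         # one callback per line
    return lines


if __name__ == "__main__":
    # Placeholder host: substitute a server you are allowed to use.
    for line in list_directory("ftp.example.com"):
        print(line)
```

Note that the script never mentions sockets or port numbers; the module supplies the standard control port by default.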

In addition to the protocol implementation modules in Table 10-1, Python's standard library also contains modules for parsing and handling data once it has been transferred over sockets or protocols. Table 10-2 lists some of the more commonly used modules in this category.

Table 10-2. Common Internet-Related Standard Modules
Python Modules	Utility
`socket`	Low-level network communications support (TCP/IP, UDP, etc.).
`cgi`	Server-side CGI script support: parse input stream, escape HTML text, etc.
`urllib`	Fetch web pages from their addresses (URLs), escape URL text
`httplib`, `ftplib`, `nntplib`	HTTP (web), FTP (file transfer), and NNTP (news) protocol modules
`poplib`, `imaplib`, `smtplib`	POP, IMAP (mail fetch), and SMTP (mail send) protocol modules
`telnetlib`, `gopherlib`	Telnet and Gopher protocol modules
`htmllib`, `sgmllib`, `xmllib`	Parse web page contents (HTML, SGML, and XML documents)
`xdrlib`	Encode binary data portably (also see the `struct` and `socket` modules)
`rfc822`	Parse email-style header lines
`mhlib`, `mailbox`	Process complex mail messages and mailboxes
`mimetools`, `mimify`	Handle MIME-style message bodies
`multifile`	Read messages with multiple parts
`uu`, `binhex`, `base64`, `binascii`, `quopri`	Encode and decode binary (or other) data transmitted as text
`urlparse`	Parse URL string into components
`SocketServer`	Framework for general net servers
`BaseHTTPServer`	Basic HTTP server implementation
`SimpleHTTPServer`, `CGIHTTPServer`	Specific HTTP web server request handler modules
`rexec`, `bastion`	Restricted code execution modes

We will meet many of this table's modules in the next few chapters of this book, but not all. The modules demonstrated are representative, but as always, be sure to see Python's standard Library Reference Manual for more complete and up-to-date lists and details.
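As one small taste of the data-handling modules, the sketch below splits a URL into the machine name and port number discussed earlier in this section. It assumes a modern Python 3 interpreter, where the functionality this table lists under urlparse lives in urllib.parse; the function name host_and_port and the path in the sample URLs are made up for this example.

```python
from urllib.parse import urlsplit


def host_and_port(url, default_port=80):
    """Return the (machine name, port) pair a URL refers to.

    When the URL does not name a port, fall back to default_port,
    which is HTTP's standard port 80 unless the caller says otherwise.
    """
    parts = urlsplit(url)
    port = parts.port if parts.port is not None else default_port
    return parts.hostname, port


if __name__ == "__main__":
    print(host_and_port("http://starship.python.net/index.html"))
    print(host_and_port("http://starship.python.net:8080/index.html"))
```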

More on Protocol Standards

If you want the full story on protocols and ports, at this writing you can find a comprehensive list of all ports reserved for protocols, or registered as used by various common systems, by searching the web pages maintained by the Internet Engineering Task Force (IETF) and the Internet Assigned Numbers Authority (IANA). The IETF is the organization responsible for maintaining Internet protocols and standards. The IANA is the central coordinator for the assignment of unique parameter values for Internet protocols. Another standards body, the W3C (the World Wide Web Consortium), also maintains relevant documents. See these web pages for more details:

http://www.ietf.org

http://www.iana.org/numbers.html

http://www.iana.org/assignments/port-numbers

http://www.w3.org

It's possible that more recent repositories for standard protocol specifications will arise during this book's shelf life, but the IETF web site will likely be the main authority for some time to come. If you do look, though, be warned that the details are, well, detailed. Because Python's protocol modules hide most of the socket and messaging complexity documented in the protocol standards, you usually don't need to memorize these documents to get web work done in Python.
