Unless you've been living in a cave for the last decade, you are probably already familiar with what the Internet is about, at least from a user's perspective. Functionally, we use it as a communication and information medium, by exchanging email, browsing web pages, transferring files, and so on. Technically, the Internet consists of many layers of abstraction and device -- from the actual wires used to send bits across the world to the web browser that grabs and renders those bits into text, graphics, and audio on your computer.
In this book, we are primarily concerned with the programmer's interface to the Internet. This too consists of multiple layers: sockets, which are programmable interfaces to the low-level connections between machines, and standard protocols, which add structure to discussions carried out over sockets. Let's briefly look at each of these layers in the abstract before jumping into programming details.
10.2.1 The Socket Layer
In simple terms, sockets are a programmable interface to network connections between computers. They also form the basis, and low-level "plumbing," of the Internet itself: all of the familiar higher-level Net protocols like FTP, web pages, and email, ultimately occur over sockets. Sockets are also sometimes called communications endpoints because they are the portals through which programs send and receive bytes during a conversation.
To programmers, sockets take the form of a handful of calls available in a library. These socket calls know how to send bytes between machines, using lower-level operations such as the TCP network transmission control protocol. At the bottom, TCP knows how to transfer bytes, but doesn't care what those bytes mean. For the purposes of this text, we will generally ignore how bytes sent to sockets are physically transferred. To understand sockets fully, though, we need to know a bit about how computers are named.
10.2.1.1 Machine identifiers
Suppose for just a moment that you wish to have a telephone conversation with someone halfway across the world. In the real world, you would probably either need that person's telephone number, or a directory that can be used to look up the number from his or her name (e.g., a telephone book). The same is true on the Internet: before a script can have a conversation with another computer somewhere in cyberspace, it must first know that other computer's number or name.
Luckily, the Internet defines standard ways to name both a remote machine, and a service provided by that machine. Within a script, the computer program to be contacted through a socket is identified by supplying a pair of values -- the machine name, and a specific port number on that machine:
Machine names
A machine name may take the form of either a string of numbers separated by dots called an IP address (e.g., 166.93.218.100), or a more legible form known as a domain name (e.g., starship.python.net). Domain names are automatically mapped into their dotted numeric address equivalent when used, by something called a domain name server -- a program on the Net that serves the same purpose as your local telephone directory assistance service.
Port numbers
A port number is simply an agreed-upon numeric identifier for a given conversation. Because computers on the Net can support a variety of services, port numbers are used to name a particular conversation on a given machine. For two machines to talk over the Net, both must associate sockets with the same machine name and port number when initiating network connections.
The combination of a machine name and a port number uniquely identifies every dialog on the Net. For instance, an Internet Service Provider's computer may provide many kinds of services for customers -- web pages, Telnet, FTP transfers, email, and so on. Each service on the machine is assigned a unique port number to which requests may be sent. To get web pages from a web server, programs need to specify both the web server's IP or domain name, and the port number on which the server listens for web page requests.
If this all sounds a bit strange, it may help to think of it in old-fashioned terms. In order to have a telephone conversation with someone within a company, for example, you usually need to dial both the company's phone number, as well as the extension of the person you want to reach. Moreover, if you don't know the company's number, you can probably find it by looking up the company's name in a phone book. It's almost the same on the Net -- machine names identify a collection of services (like a company), port numbers identify an individual service within a particular machine (like an extension), and domain names are mapped to IP numbers by domain name servers (like a phone book).
When programs use sockets to communicate in specialized ways with another machine (or with other processes on the same machine), they need to avoid using a port number reserved by a standard protocol -- numbers in the range of 0-1023 -- but we first need to discuss protocols to understand why.
10.2.2 The Protocol Layer
Although sockets form the backbone of the Internet, much of the activity that happens on the Net is programmed with protocols,[1] which are higher-level message models that run on top of sockets. In short, Internet protocols define a structured way to talk over sockets. They generally standardize both message formats and socket port numbers:
[1] Some books also use the term protocol to refer to lower-level transport schemes such as TCP. In this book, we use protocol to refer to higher-level structures built on top of sockets; see a networking text if you are curious about what happens at lower levels.
Raw sockets are still commonly used in many systems, but it is perhaps more common (and generally easier) to communicate with one of the standard higher-level Internet protocols.
10.2.2.1 Port number rules
Technically speaking, socket port numbers can be any 16-bit integer value between and 65,535. However, to make it easier for programs to locate the standard protocols, port numbers in the range of 0-1023 are reserved and preassigned to the standard higher-level protocols. Table 10-1 lists the ports reserved for many of the standard protocols; each gets one or more preassigned numbers from the reserved range.
Protocol |
Common Function |
Port Number |
Python Module |
---|---|---|---|
HTTP |
Web pages |
80 |
httplib |
NNTP |
Usenet news |
119 |
nntplib |
FTP data default |
File transfers |
20 |
ftplib |
FTP control |
File transfers |
21 |
ftplib |
SMTP |
Sending email |
25 |
smtplib |
POP3 |
Fetching email |
110 |
poplib |
IMAP4 |
Fetching email |
143 |
imaplib |
Finger |
Informational |
79 |
n/a |
Telnet |
Command lines |
23 |
telnetlib |
Gopher |
Document transfers |
70 |
gopherlib |
10.2.2.2 Clients and servers
To socket programmers, the standard protocols mean that port numbers 0-1023 are off-limits to scripts, unless they really mean to use one of the higher-level protocols. This is both by standard and by common sense. A Telnet program, for instance, can start a dialog with any Telnet-capable machine by connecting to its port 23; without preassigned port numbers, each server might install Telnet on a different port. Similarly, web sites listen for page requests from browsers on port 80 by standard; if they did not, you might have to know and type the HTTP port number of every site you visit while surfing the Net.
By defining standard port numbers for services, the Net naturally gives rise to a client/server architecture. On one side of a conversation, machines that support standard protocols run a set of perpetually running programs that listen for connection requests on the reserved ports. On the other end of a dialog, other machines contact those programs to use the services they export.
We usually call the perpetually running listener program a server and the connecting program a client. Let's use the familiar web browsing model as an example. As shown in Table 10-1, the HTTP protocol used by the Web allows clients and servers to talk over sockets on port 80:
Server
A machine that hosts web sites usually runs a web server program that constantly listens for incoming connection requests, on a socket bound to port 80. Often, the server itself does nothing but watch for requests on its port perpetually; handling requests is delegated to spawned processes or threads.
Clients
Programs that wish to talk to this server specify the server machine's name and port 80 to initiate a connection. For web servers, typical clients are web browsers like Internet Explorer or Netscape, but any script can open a client-side connection on port 80 to fetch web pages from the server.
In general, many clients may connect to a server over sockets, whether it implements a standard protocol or something more specific to a given application. And in some applications, the notion of client and server is blurred -- programs can also pass bytes between each other more as peers than as master and subordinate. For the purpose of this book, though, we usually call programs that listen on sockets servers, and those that connect, clients. We also sometimes call the machines that these programs run on server and client (e.g., a computer on which a web server program runs may be called a web server machine, too), but this has more to do with the physical than the functional.
10.2.2.3 Protocol structures
Functionally, protocols may accomplish a familiar task like reading email or posting a Usenet newsgroup message, but they ultimately consist of message bytes sent over sockets. The structure of those message bytes varies from protocol to protocol, is hidden by the Python library, and is mostly beyond the scope of this book, but a few general words may help demystify the protocol layer.
Some protocols may define the contents of messages sent over sockets; others may specify the sequence of control messages exchanged during conversations. By defining regular patterns of communication, protocols make communication more robust. They can also minimize deadlock conditions -- machines waiting for messages that never arrive.
For example, the FTP protocol prevents deadlock by conversing over two sockets: one for control messages only, and one to transfer file data. An FTP server listens for control messages (e.g., "send me a file") on one port, and transfers file data over another. FTP clients open socket connections to the server machine's control port, send requests, and send or receive file data over a socket connected to a data port on the server machine. FTP also defines standard message structures passed between client and server. The control message used to request a file, for instance, must follow a standard format.
10.2.3 Python's Internet Library Modules
If all of this sounds horribly complex, cheer up: Python's standard protocol modules handle all the details. For example, the Python library's ftplib module manages all the socket and message-level handshaking implied by the FTP protocol. Scripts that import ftplib have access to a much higher-level interface for FTPing files and can be largely ignorant of both the underlying FTP protocol, and the sockets over which it runs.[2]
[2] Since Python is an open source system, you can read the source code of the ftplib module if you are curious about how the underlying protocol actually works. See file ftplib.py in the standard source library directory in your machine. Its code is complex (since it must format messages and manage two sockets), but with the other standard Internet protocol modules, it is a good example of low-level socket programming.
In fact, each supported protocol is represented by a standard Python module file with a name of the form xxxlib.py, where xxx is replaced by the protocol's name, in lowercase. The last column in Table 10-1 gives the module name for protocol standard modules. For instance, FTP is supported by module file ftplib.py. Moreover, within the protocol modules, the top-level interface object is usually the name of the protocol. So, for instance, to start an FTP session in a Python script, you run import ftplib and pass appropriate parameters in a call to ftplib.FTP(); for Telnet, create a telnetlib.Telnet().
In addition to the protocol implementation modules in Table 10-1, Python's standard library also contains modules for parsing and handling data once it has been transferred over sockets or protocols. Table 10-2 lists some of the more commonly used modules in this category.
Python Modules |
Utility |
---|---|
socket |
Low-level network communications support (TCP/IP, UDP, etc.). |
cgi |
Server-side CGI script support: parse input stream, escape HTML text, etc. |
urllib |
Fetch web pages from their addresses (URLs), escape URL text |
httplib, ftplib, nntplib |
HTTP (web), FTP (file transfer), and NNTP (news) protocol modules |
poplib, imaplib, smtplib |
POP, IMAP (mail fetch), and SMTP (mail send) protocol modules |
telnetlib, gopherlib |
Telnet and Gopher protocol modules |
htmllib, sgmllib, xmllib |
Parse web page contents (HTML, SGML, and XML documents) |
xdrlib |
Encode binary data portably (also see the struct and socket modules) |
rfc822 |
Parse email-style header lines |
mhlib, mailbox |
Process complex mail messages and mailboxes |
mimetools, mimify |
Handle MIME-style message bodies |
multifile |
Read messages with multiple parts |
uu, binhex, base64, binascii, quopri |
Encode and decode binary (or other) data transmitted as text |
urlparse |
Parse URL string into components |
SocketServer |
Framework for general net servers |
BaseHTTPServer |
Basic HTTP server implementation |
SimpleHTTPServer, CGIHTTPServer |
Specific HTTP web server request handler modules |
rexec, bastion |
Restricted code execution modes |
We will meet many of this table's modules in the next few chapters of this book, but not all. The modules demonstrated are representative, but as always, be sure to see Python's standard Library Reference Manual for more complete and up-to-date lists and details.
Introducing Python
Part I: System Interfaces
System Tools
Parallel System Tools
Larger System Examples I
Larger System Examples II
Part II: GUI Programming
Graphical User Interfaces
A Tkinter Tour, Part 1
A Tkinter Tour, Part 2
Larger GUI Examples
Part III: Internet Scripting
Network Scripting
Client-Side Scripting
Server-Side Scripting
Larger Web Site Examples I
Larger Web Site Examples II
Advanced Internet Topics
Part IV: Assorted Topics
Databases and Persistence
Data Structures
Text and Language
Part V: Integration
Extending Python
Embedding Python
VI: The End
Conclusion Python and the Development Cycle