Section 14.10. HTTP: Accessing Web Sites

Python's standard library (the modules that are installed with the interpreter) also includes client-side support for HTTP, the Hypertext Transfer Protocol: a message structure and port standard used to transfer information on the World Wide Web. In short, this is the protocol that your web browser (e.g., Internet Explorer, Netscape) uses to fetch web pages and run applications on remote servers as you surf the Web. Essentially, it's just bytes sent over port 80.

To really understand HTTP-style transfers, you need to know some of the server-side scripting topics covered in Chapter 16 (e.g., script invocations and Internet address schemes), so this section may be less useful to readers with no such background. Luckily, though, the basic HTTP interfaces in Python are simple enough for a cursory understanding, even at this point in the book, so let's take a brief look here.

Python's standard httplib module automates much of the protocol defined by HTTP and allows scripts to fetch web pages much like web browsers. For instance, the script in Example 14-29 can be used to grab any file from any server machine running an HTTP web server program. As usual, the file (and descriptive header lines) is ultimately transferred as formatted messages over a standard socket port, but most of the complexity is hidden by the httplib module.
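For a sense of what the module hides, here is a rough sketch of the text of a minimal GET request as it might travel over the socket. It is written for Python 3, and the function name build_get_request is hypothetical, chosen only for illustration; real clients typically send more header lines than these.

```python
def build_get_request(host, path, accept='text/html'):
    """Return the bytes of a minimal HTTP/1.0 GET request (illustrative only)."""
    lines = [
        'GET %s HTTP/1.0' % path,     # request line: method, path, protocol version
        'Host: %s' % host,            # names the site being addressed
        'Accept: %s' % accept,        # reply types the client is willing to take
        '', '',                       # a blank line ends the header section
    ]
    return '\r\n'.join(lines).encode('ascii')
```

Each line ends with a carriage return and line feed, and the empty line tells the server that the headers are complete.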

Example 14-29. PP3E\Internet\Other\http-getfile.py

 #######################################################################
 # fetch a file from an HTTP (web) server over sockets via httplib;
 # the filename param may have a full directory path, and may name a CGI
 # script with query parameters on the end to invoke a remote program;
 # fetched file data or remote program output could be saved to a local
 # file to mimic FTP, or parsed with str.find or the htmllib module;
 #######################################################################

 import sys, httplib
 showlines = 6
 try:
     servername, filename = sys.argv[1:]           # cmdline args?
 except ValueError:                                # wrong arg count: use defaults
     servername, filename = 'starship.python.net', '/index.html'

 print servername, filename
 server = httplib.HTTP(servername)                 # connect to http site/server
 server.putrequest('GET', filename)                # send request and headers
 server.putheader('Accept', 'text/html')           # POST requests work here too
 server.endheaders()                               # as do CGI script filenames

 errcode, errmsg, replyheader = server.getreply()  # read reply info headers
 if errcode != 200:                                # 200 means success
     print 'Error sending request', errcode
 else:
     file = server.getfile()                       # file obj for data received
     data = file.readlines()
     file.close()                                  # show lines with eoln at end
     for line in data[:showlines]: print line,     # to save, write data to file

Desired server names and filenames can be passed on the command line to override hardcoded defaults in the script. You need to know something of the HTTP protocol to make the most sense of this code, but it's fairly straightforward to decipher. When run on the client, this script makes an HTTP object to connect to the server, sends it a GET request along with acceptable reply types, and then reads the server's reply. Much like raw email message text, the HTTP server's reply usually begins with a set of descriptive header lines, followed by the contents of the requested file. The HTTP object's getfile method gives us a file object from which we can read the downloaded data.
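The httplib module was renamed http.client in Python 3. As a rough sketch, assuming a Python 3 interpreter, the same connect, request, and read-reply steps might look like the following. The function name fetch_lines and its port and timeout parameters are illustrative additions, not part of Example 14-29.

```python
import http.client

def fetch_lines(servername, filename, port=80, showlines=6):
    """Send a GET request and return (status, reason, first few reply lines)."""
    conn = http.client.HTTPConnection(servername, port, timeout=10)
    try:
        # send the request line and headers in one call
        conn.request('GET', filename, headers={'Accept': 'text/html'})
        reply = conn.getresponse()            # status line and headers are parsed here
        text = reply.read().decode('utf-8', errors='replace')
        lines = text.splitlines(keepends=True)
        return reply.status, reply.reason, lines[:showlines]
    finally:
        conn.close()                          # always release the socket
```

A call such as fetch_lines('starship.python.net', '/index.html') would return the status code, the reason text, and up to six lines of the reply body, provided the server is reachable. The rest of this section continues with the httplib version in Example 14-29.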

Let's fetch a few files with this script. Like all Python client-side scripts, this one works on any machine with Python and an Internet connection (here it runs on a Windows client). Assuming that all goes well, the first few lines of the downloaded file are printed; in a more realistic application, the text we fetch would probably be saved to a local file, parsed with Python's htmllib module, and so on. Without arguments, the script simply fetches the HTML index page at http://starship.python.net, a Python community resources site:

 C:\...\PP3E\Internet\Other>python http-getfile.py
 starship.python.net /index.html
 <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
 <HTML>
 <HEAD>
   <META NAME="GENERATOR" CONTENT="HTMLgen">
   <TITLE>Starship Python -- Python Programming Community</TITLE>
   <LINK REL="SHORTCUT ICON" HREF="http://starship.python.net/favicon.ico">

But we can also list a server and file to be fetched on the command line, if we want to be more specific. In the following code, we use the script to fetch files from two different web sites by listing their names on the command lines (I've added line breaks to make these lines fit in this book). Notice that the filename argument can include an arbitrary remote directory path to the desired file, as in the last fetch here:

 C:\...\PP3E\Internet\Other>python http-getfile.py www.python.org /index.html
 www.python.org /index.html
 <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
                       "http://www.w3.org/TR/html4/loose.dtd" >
 <?xml-stylesheet href="./css/ht2html.css" type="text/css"?>
 <html>
 <!-- THIS PAGE IS AUTOMATICALLY GENERATED.  DO NOT EDIT. -->
 <!-- Mon Jan 16 13:02:12 2006 -->

 C:\...\PP3E\Internet\Other>python http-getfile.py www.python.org index
 www.python.org index
 Error sending request 404

 C:\...\PP3E\Internet\Other>python http-getfile.py www.rmi.net /~lutz
 www.rmi.net /~lutz
 Error sending request 301

 C:\...\PP3E\Internet\Other>python http-getfile.py www.rmi.net
                                                         /~lutz/index.html
 www.rmi.net /~lutz/index.html
 <HTML>
 <HEAD>
 <TITLE>Mark Lutz's Home Page</TITLE>
 </HEAD>
 <BODY BGCOLOR="#f1f1ff">

Also notice the second and third attempts in this session: if the request fails, the script receives and displays an HTTP error code from the server (we forgot the leading slash on the second, and the "index.html" on the third, which is required for this server and interface). With the raw HTTP interfaces, we need to be precise about what we want.
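A script could guard against some of these slips itself. The following is a minimal sketch, assuming Python 3 and its standard http.HTTPStatus enumeration; the helper names normalize_path and explain_status are hypothetical. Note that normalize_path only repairs a missing leading slash; it cannot know whether the resulting path exists on the server.

```python
from http import HTTPStatus

def normalize_path(filename):
    """Hypothetical helper: make sure the request path starts with '/'."""
    return filename if filename.startswith('/') else '/' + filename

def explain_status(code):
    """Turn a numeric reply code into a short readable description."""
    try:
        status = HTTPStatus(code)             # raises ValueError for unknown codes
    except ValueError:
        return '%d (unrecognized status code)' % code
    return '%d %s' % (status.value, status.phrase)

print(normalize_path('index'))                # → /index
print(explain_status(404))                    # → 404 Not Found
print(explain_status(301))                    # → 301 Moved Permanently
```

A 301 reply means the resource has moved; a fuller client would typically look for the new address in the reply's headers and retry there, which browsers do automatically.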

Technically, the string we call filename in the script can refer to either a simple static web page file or a server-side program that generates HTML as its output. Those server-side programs are usually called CGI scripts, the topic of Chapters 16 and 17. For now, keep in mind that when filename refers to a script, this program can be used to invoke another program that resides on a remote server machine. In that case, we can also specify parameters (called a query string) to be passed to the remote program after a ?.

Here, for instance, we pass a language=Python parameter to a CGI script we will meet in Chapter 16 (we're first spawning a locally running HTTP web server coded in Python, using a script we first met in Chapter 2, but will revisit in Chapter 16):

 In a different window
 C:\...\PP3E\Internet\Web>webserver.py
 webdir ".", port 80

 C:\...\PP3E\Internet\Other>http-getfile.py localhost
                                 /cgi-bin/languages.py?language=Python
 localhost /cgi-bin/languages.py?language=Python
 <TITLE>Languages</TITLE>
 <H1>Syntax</H1><HR>
 <H3>Python</H3><P><PRE>
  print 'Hello World'
 </PRE></P><BR>
 <HR>
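Rather than typing the query string by hand, a script can assemble it from a dictionary of parameters. This is a small sketch assuming Python 3, where urlencode lives in urllib.parse; the function name build_request_path is hypothetical.

```python
from urllib.parse import urlencode

def build_request_path(script, params):
    """Append URL-encoded parameters to a script path after a '?'."""
    query = urlencode(params)                 # escapes characters unsafe in URLs
    return script + '?' + query if query else script

print(build_request_path('/cgi-bin/languages.py', {'language': 'Python'}))
# → /cgi-bin/languages.py?language=Python
print(build_request_path('/cgi-bin/languages.py', {'language': 'C++'}))
# → /cgi-bin/languages.py?language=C%2B%2B
```

The second call shows why encoding matters: a literal + in a query string would otherwise be read by the server as a space.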

Later, this book has much more to say about HTML, CGI scripts, and the meaning of the HTTP GET request used in Example 14-29 (along with POST, one of two ways to format information sent to an HTTP server), so we'll skip additional details here.

Suffice it to say, though, that we could use the HTTP interfaces to write our own web browsers, and build scripts that use web sites as though they were subroutines. By sending parameters to remote programs and parsing their results, web sites can take on the role of simple in-process functions (albeit much more slowly and indirectly).
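As one illustration of the parsing half of that idea, the sketch below pulls the text inside the PRE element out of a reply like the one the languages.py script returned earlier. It assumes Python 3, where the standard html.parser module fills the role this section assigns to htmllib; the class and function names are hypothetical, and fetching the reply text is left to the caller.

```python
from html.parser import HTMLParser

class PreTextExtractor(HTMLParser):
    """Collect the text found inside <PRE>...</PRE> elements."""
    def __init__(self):
        super().__init__()
        self.in_pre = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == 'pre':                      # HTMLParser lowercases tag names
            self.in_pre = True

    def handle_endtag(self, tag):
        if tag == 'pre':
            self.in_pre = False

    def handle_data(self, data):
        if self.in_pre:                       # keep only text inside PRE
            self.chunks.append(data)

def extract_pre_text(reply_text):
    """Return the stripped text of all PRE elements in an HTML reply."""
    parser = PreTextExtractor()
    parser.feed(reply_text)
    parser.close()
    return ''.join(parser.chunks).strip()
```

Given the reply text shown in the earlier session, extract_pre_text would return the string print 'Hello World', so a caller could treat the remote script much like a function that maps a language name to sample code.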