14.11. Module urllib Revisited

The httplib module we just met provides low-level control for HTTP clients. When dealing with items available on the Web, though, it's often easier to code downloads with Python's standard urllib module, introduced in the FTP section earlier in this chapter. Since this module is another way to talk HTTP, let's expand on its interfaces here.

Recall that given a URL, urllib either downloads the requested object over the Net to a local file, or gives us a file-like object from which we can read the requested object's contents. As a result, the script in Example 14-30 does the same work as the httplib script we just wrote, but requires noticeably less code.

Example 14-30. PP3E\Internet\Other\http-getfile-urllib1.py

    ###################################################################
    # fetch a file from an HTTP (web) server over sockets via urllib;
    # urllib supports HTTP, FTP, files, etc. via URL address strings;
    # for HTTP, the URL can name a file or trigger a remote CGI script;
    # see also the urllib example in the FTP section, and the CGI
    # script invocation in a later chapter; files can be fetched over
    # the net with Python in many ways that vary in complexity and
    # server requirements: sockets, FTP, HTTP, urllib, CGI outputs;
    # caveat: should run urllib.quote on filename--see later chapters;
    ###################################################################

    import sys, urllib
    showlines = 6

    try:
        servername, filename = sys.argv[1:]              # cmdline args?
    except:
        servername, filename = 'starship.python.net', '/index.html'

    remoteaddr = 'http://%s%s' % (servername, filename)  # can name a CGI script too
    print remoteaddr
    remotefile = urllib.urlopen(remoteaddr)              # returns input file object
    remotedata = remotefile.readlines()                  # read data directly here
    remotefile.close()
    for line in remotedata[:showlines]: print line,

Almost all HTTP transfer details are hidden behind the urllib interface here. This version works in almost the same way as the httplib version we wrote first, but it builds and submits an Internet URL address to get its work done (the constructed URL is printed as the script's first output line). As we saw in the FTP section of this chapter, the urllib urlopen function returns a file-like object from which we can read the remote data. But because the constructed URLs begin with "http://" here, the urllib module automatically employs the lower-level HTTP interfaces to download the requested file, not FTP:

    C:\...\PP3E\Internet\Other>python http-getfile-urllib1.py
    http://starship.python.net/index.html
    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
    <HTML>
    <HEAD>
      <META NAME="GENERATOR" CONTENT="HTMLgen">
      <TITLE>Starship Python -- Python Programming Community</TITLE>
      <LINK REL="SHORTCUT ICON" HREF="http://starship.python.net/favicon.ico">

    C:\...\PP3E\Internet\Other>python http-getfile-urllib1.py www.python.org /index
    http://www.python.org/index
    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
                          "http://www.w3.org/TR/html4/loose.dtd" >
    <?xml-stylesheet href="./css/ht2html.css" type="text/css"?>
    <html>
    <!-- THIS PAGE IS AUTOMATICALLY GENERATED.  DO NOT EDIT. -->
    <!-- Mon Jan 16 13:02:12 2006 -->

    C:\...\PP3E\Internet\Other>python http-getfile-urllib1.py www.rmi.net /~lutz
    http://www.rmi.net/~lutz
    <HTML>
    <HEAD>
    <TITLE>Mark Lutz's Home Page</TITLE>
    </HEAD>
    <BODY BGCOLOR="#f1f1ff">

    C:\...\PP3E\Internet\Other>python http-getfile-urllib1.py
                                      localhost /cgi-bin/languages.py?language=Java
    http://localhost/cgi-bin/languages.py?language=Java
    <TITLE>Languages</TITLE>
    <H1>Syntax</H1><HR>
    <H3>Java</H3><P><PRE>
     System.out.println("Hello World");
    </PRE></P><BR>
    <HR>

As before, the filename argument can name a simple file or a program invocation with optional parameters at the end, as in the last run here. If you read this output carefully, you'll notice that this script still works if you leave the "index.html" off the end of a filename (in the third command line); unlike the raw HTTP version of the preceding section, the URL-based interface is smart enough to do the right thing.
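Because urlopen hands back a file-like object whatever the URL scheme, the same read-and-close pattern can be tried without a network connection. The following is a minimal sketch, not one of the book's examples: it assumes a temporary local file with made-up contents, addressed through a file: URL, and it imports urlopen from either the Python 2 location used in this chapter or the urllib.request module where Python 3 keeps it.

```python
import os
import tempfile

try:                                     # Python 2 location, as in this chapter
    from urllib import urlopen, pathname2url
except ImportError:                      # Python 3 keeps them in urllib.request
    from urllib.request import urlopen, pathname2url

showlines = 2

# assumption: a throwaway local file stands in for a page on a web server
fd, path = tempfile.mkstemp(suffix='.html')
os.write(fd, b'<HTML>\n<HEAD>\n<TITLE>Stand-in page</TITLE>\n</HEAD>\n')
os.close(fd)

localurl = 'file:' + pathname2url(path)  # a file: URL rather than http:
remotefile = urlopen(localurl)           # returns an input file-like object
remotedata = remotefile.readlines()      # read all lines, as in Example 14-30
remotefile.close()
os.remove(path)                          # discard the stand-in file

for line in remotedata[:showlines]:
    print(line.rstrip())
```

Only the URL string changes between this and an HTTP fetch; the calls made on the returned object are the same ones Example 14-30 uses.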

14.11.1. Other urllib Interfaces

One last mutation: the following urllib downloader script uses the slightly higher-level urlretrieve interface in that module to automatically save the downloaded file or script output to a local file on the client machine. This interface is handy if we really mean to store the fetched data (e.g., to mimic the FTP protocol). If we plan on processing the downloaded data immediately, though, this form may be less convenient than the version we just met: we need to open and read the saved file. Moreover, we need to provide an extra protocol for specifying or extracting a local filename, as in Example 14-31.

Example 14-31. PP3E\Internet\Other\http-getfile-urllib2.py

    ####################################################################
    # fetch a file from an HTTP (web) server over sockets via urllib;
    # this version uses an interface that saves the fetched data to a
    # local file; the local file name is either passed in as a cmdline
    # arg or stripped from the URL with urlparse: the filename argument
    # may have a directory path at the front and query params at end,
    # so os.path.split is not enough (only splits off directory path);
    # caveat: should run urllib.quote on filename--see later chapters;
    ####################################################################

    import sys, os, urllib, urlparse
    showlines = 6

    try:
        servername, filename = sys.argv[1:3]              # first 2 cmdline args?
    except:
        servername, filename = 'starship.python.net', '/index.html'

    remoteaddr = 'http://%s%s' % (servername, filename)   # any address on the Net
    if len(sys.argv) == 4:                                # get result filename
        localname = sys.argv[3]
    else:
        (scheme, server, path, parms, query, frag) = urlparse.urlparse(remoteaddr)
        localname = os.path.split(path)[1]

    print remoteaddr, localname
    urllib.urlretrieve(remoteaddr, localname)             # can be file or script
    remotedata = open(localname).readlines()              # saved to local file
    for line in remotedata[:showlines]: print line,

Let's run this last variant from a command line. Its basic operation is the same as the last two versions: like the prior one, it builds a URL, and like both of the last two, we can list an explicit target server and file path on the command line:

    C:\...\PP3E\Internet\Other>python http-getfile-urllib2.py
    http://starship.python.net/index.html index.html
    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
    <HTML>
    <HEAD>
      <META NAME="GENERATOR" CONTENT="HTMLgen">
      <TITLE>Starship Python -- Python Programming Community</TITLE>
      <LINK REL="SHORTCUT ICON" HREF="http://starship.python.net/favicon.ico">

    C:\...\PP3E\Internet\Other>python http-getfile-urllib2.py
                                               www.python.org /index.html
    http://www.python.org/index.html index.html
    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
                          "http://www.w3.org/TR/html4/loose.dtd" >
    <?xml-stylesheet href="./css/ht2html.css" type="text/css"?>
    <html>
    <!-- THIS PAGE IS AUTOMATICALLY GENERATED.  DO NOT EDIT. -->
    <!-- Mon Jan 16 13:02:12 2006 -->

Because this version uses a urllib interface that automatically saves the downloaded data in a local file, it's similar to FTP downloads in spirit. But this script must also somehow come up with a local filename for storing the data. You can either let the script strip and use the base filename from the constructed URL, or explicitly pass a local filename as a last command-line argument. In the prior run, for instance, the downloaded web page is stored in the local file index.html in the current working directory; this is the base filename stripped from the URL (the script prints the URL and local filename as its first output line). In the next run, the local filename is passed explicitly as python-org-index.html:

    C:\...\PP3E\Internet\Other>python http-getfile-urllib2.py www.python.org
                                            /index.html python-org-index.html
    http://www.python.org/index.html python-org-index.html
    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
                          "http://www.w3.org/TR/html4/loose.dtd" >
    <?xml-stylesheet href="./css/ht2html.css" type="text/css"?>
    <html>
    <!-- THIS PAGE IS AUTOMATICALLY GENERATED.  DO NOT EDIT. -->
    <!-- Mon Jan 16 13:02:12 2006 -->

    C:\...\PP3E\Internet\Other>python http-getfile-urllib2.py www.rmi.net
                                            /~lutz/home/index.html
    http://www.rmi.net/~lutz/index.html index.html
    <HTML>
    <HEAD>
    <TITLE>Mark Lutz's Home Page</TITLE>
    </HEAD>
    <BODY BGCOLOR="#f1f1ff">

    C:\...\PP3E\Internet\Other>python http-getfile-urllib2.py www.rmi.net
                                            /~lutz/home/about-pp.html
    http://www.rmi.net/~lutz/about-pp.html about-pp.html
    <HTML>
    <HEAD>
    <TITLE>About "Programming Python"</TITLE>
    </HEAD>
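The save-then-reopen pattern of urlretrieve can also be exercised offline. This is only a sketch under stated assumptions: the source "page" is a made-up file in a temporary directory, reached through a file: URL, and the names workdir, srcname, and saved-index.html are invented for illustration. The import falls back to urllib.request, where Python 3 keeps urlretrieve.

```python
import os
import shutil
import tempfile

try:                                     # Python 2 location, as in this chapter
    from urllib import urlretrieve, pathname2url
except ImportError:                      # Python 3 keeps them in urllib.request
    from urllib.request import urlretrieve, pathname2url

# assumption: a throwaway directory and source file stand in for a server
workdir = tempfile.mkdtemp()
srcname = os.path.join(workdir, 'index.html')
with open(srcname, 'w') as srcfile:
    srcfile.write('<HTML>\n<HEAD>\n<TITLE>Stand-in page</TITLE>\n')

remoteaddr = 'file:' + pathname2url(srcname)
localname = os.path.join(workdir, 'saved-index.html')

# urlretrieve copies the URL's content to localname and returns
# a (filename, headers) pair
savedname, headers = urlretrieve(remoteaddr, localname)

with open(localname) as savedfile:       # must reopen the copy to process it
    remotedata = savedfile.readlines()

shutil.rmtree(workdir)                   # discard the stand-in files
print(len(remotedata))
```

The extra open and readlines step is the inconvenience noted earlier: the data lands in a file first, so immediate processing takes a second pass.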

What follows is a listing showing this third version being used to trigger a remote program. As before, if you don't give the local filename explicitly, the script strips the base filename out of the filename argument. That's not always easy or appropriate for program invocations: the filename can contain both a remote directory path at the front, and query parameters at the end for a remote program invocation.

Given a script invocation URL and no explicit output filename, the script extracts the base filename in the middle by using first the standard urlparse module to pull out the file path, and then os.path.split to strip off the directory path. However, the resulting filename is a remote script's name, and it may or may not be an appropriate place to store the data locally. In the first run that follows, for example, the script's output goes in a local file called languages.py, the script name in the middle of the URL; in the second, we instead name the output CxxSyntax.html explicitly to suppress filename extraction:

    C:\...\PP3E\Internet\Other>python http-getfile-urllib2.py localhost
                                  /cgi-bin/languages.py?language=Scheme
    http://localhost/cgi-bin/languages.py?language=Scheme languages.py
    <TITLE>Languages</TITLE>
    <H1>Syntax</H1><HR>
    <H3>Scheme</H3><P><PRE>
     (display "Hello World") (newline)
    </PRE></P><BR>
    <HR>

    C:\...\PP3E\Internet\Other>python http-getfile-urllib2.py localhost
                          /cgi-bin/languages.py?language=C++ CxxSyntax.html
    http://localhost/cgi-bin/languages.py?language=C++ CxxSyntax.html
    <TITLE>Languages</TITLE>
    <H1>Syntax</H1><HR>
    <H3>C  </H3><P><PRE>
    Sorry--I don't know that language
    </PRE></P><BR>
    <HR>
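The two-step filename extraction used by Example 14-31 can be tried in isolation. The sketch below wraps it in a helper named base_filename, a hypothetical name chosen for illustration, and compares it with calling os.path.split alone; the import falls back to urllib.parse, where Python 3 keeps urlparse.

```python
import os

try:                                   # Python 2 module, as in Example 14-31
    from urlparse import urlparse
except ImportError:                    # Python 3 moved it to urllib.parse
    from urllib.parse import urlparse

def base_filename(url):
    """hypothetical helper: mirror Example 14-31's local filename logic"""
    scheme, server, path, parms, query, frag = urlparse(url)
    return os.path.split(path)[1]      # drop the directory path at the front

url = 'http://localhost/cgi-bin/languages.py?language=Scheme'

naive = os.path.split(url)[1]          # query parameters stay attached
clean = base_filename(url)             # urlparse removes the query first

print(naive)                           # languages.py?language=Scheme
print(clean)                           # languages.py
```

One limit of this approach: a URL whose path ends in a slash leaves nothing after the last separator, so the helper returns an empty string and a local filename would have to be passed explicitly in that case.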

The remote script returns a not-found message when passed "C++" in the second of these runs. It turns out that "+" is a special character in URL strings (meaning a space), and to be robust, both of the urllib scripts we've just written should really run the filename string through something called urllib.quote, a tool that escapes special characters for transmission. We will talk about this in depth in Chapter 16, so consider this a preview for now. But to make this invocation work, we need to use special sequences in the constructed URL. Here's how to do it by hand:

    C:\...\PP3E\Internet\Other>python http-getfile-urllib2.py  localhost
                   /cgi-bin/languages.py?language=C%2b%2b CxxSyntax.html
    http://localhost/cgi-bin/languages.py?language=C%2b%2b CxxSyntax.html
    <TITLE>Languages</TITLE>
    <H1>Syntax</H1><HR>
    <H3>C++</H3><P><PRE>
     cout &lt;&lt; "Hello World" &lt;&lt; endl;
    </PRE></P><BR>
    <HR>

The odd %2b strings in this command line are not entirely magical: the escaping required for URLs can be seen by running standard Python tools manually (this is what these scripts should do automatically to handle all possible cases well):

    C:\...\PP3E\Internet\Other>python
    >>> import urllib
    >>> urllib.quote('C++')
    'C%2b%2b'
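As a rough illustration of what doing this automatically might look like, the sketch below builds the URL from separately escaped pieces. The helper script_url and its parameters are hypothetical, not part of the book's scripts. It quotes the path and the parameter value on their own, because running quote over the whole filename string would also escape the "?" and "=" separators. The import falls back to urllib.parse, where Python 3 keeps quote.

```python
try:                                   # Python 2 location, as in this chapter
    from urllib import quote
except ImportError:                    # Python 3 moved it to urllib.parse
    from urllib.parse import quote

def script_url(servername, scriptpath, language):
    """hypothetical helper: build a languages.py-style URL with escaping"""
    # escape the path and the parameter value separately, so the '?' and
    # '=' separators between them are left intact
    return 'http://%s%s?language=%s' % (
        servername, quote(scriptpath), quote(language))

remoteaddr = script_url('localhost', '/cgi-bin/languages.py', 'C++')
print(remoteaddr)
```

The hex digits in the escapes may come back in upper or lower case depending on the Python version; both forms denote the same character.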

Again, don't work too hard at understanding these last few commands; we will revisit URLs and URL escapes in Chapter 16, while exploring server-side scripting in Python. I will also explain there why the C++ result came back with other oddities like &lt;&lt;, the HTML escapes for <<, generated by the tool cgi.escape in the script on the server that produces the reply:

    >>> import cgi
    >>> cgi.escape('<<')
    '&lt;&lt;'
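To see how the full reply line from the C++ run could have been produced, here is a small sketch that escapes the whole string. It assumes the server-side script escapes its text roughly this way; the script itself is not shown in this section. In Python 3 the equivalent function is html.escape, so the import tries that first and falls back to cgi.escape.

```python
try:                                   # Python 3 home of the escape function
    from html import escape
except ImportError:                    # Python 2: cgi.escape, as used above
    from cgi import escape

reply = 'cout << "Hello World" << endl;'
escaped = escape(reply, quote=False)   # leave the double quotes untouched
print(escaped)                         # cout &lt;&lt; "Hello World" &lt;&lt; endl;
```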

Also in Chapter 16, we'll meet urllib support for proxies, and the support for client-side cookies in the newer urllib2 standard library module. We'll discuss the related HTTPS concept in Chapter 17: HTTP transmissions over secure sockets, supported by urllib and urllib2 if SSL support is compiled into your Python.