Recipe 13.2. Grabbing a Document from the Web

Credit: Gisle Aas, Magnus Bodin

Problem

You need to grab a document from a URL on the Web.

Solution

urllib.urlopen returns a file-like object, and you can call the read method on that object to get all of its contents:

    from urllib import urlopen
    doc = urlopen("http://www.python.org").read()
    print doc

Discussion

Once you obtain a file-like object from urlopen, you can read it all at once into one big string by calling its read method, as I do in this recipe. Alternatively, you can read the object as a list of lines by calling its readlines method, or, for special purposes, just get one line at a time by looping over the object in a for loop. In addition to these file-like operations, the object that urlopen returns offers a few other useful features. For example, the following snippet gives you the headers of the document:

    doc = urlopen("http://www.python.org")
    print doc.info()

such as the Content-Type header (text/html in this case), which defines the MIME type of the document. doc.info returns a mimetools.Message instance, so you can access it in various ways besides printing it or otherwise transforming it into a string. For example, doc.info().getheader('Content-Type') returns the 'text/html' string. The maintype attribute of the mimetools.Message object is the 'text' string, subtype is the 'html' string, and type is the full 'text/html' string. If you need to perform sophisticated analysis and processing, all the tools you need are right there. At the same time, if your needs are simpler, you can meet them in very simple ways, as this recipe shows.
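The recipe's code targets Python 2. As a rough Python 3 counterpart, the following sketch uses urllib.request.urlopen (where urlopen lives in Python 3) to show the three ways of consuming the response, plus the header accessors. The data: URL is my stand-in for a real web address, chosen only so that the example runs without a network connection. In Python 3 the headers object is an email.message.Message (or a subclass of it) rather than a mimetools.Message, so the accessor names differ from the ones described above.

```python
from urllib.request import urlopen

# Assumption: a data: URL stands in for a real address such as
# http://www.python.org, so that the sketch runs offline.
URL = "data:text/html,first%0Asecond%0A"

# Read the whole body at once; in Python 3 the result is bytes, not str.
with urlopen(URL) as doc:
    body = doc.read()

# Read the body as a list of lines.
with urlopen(URL) as doc:
    lines = doc.readlines()

# Loop over the response to get one line at a time.
with urlopen(URL) as doc:
    line_count = sum(1 for _ in doc)

# Inspect the headers without reading the body.
with urlopen(URL) as doc:
    headers = doc.info()
    content_type = headers.get_content_type()   # full MIME type
    maintype = headers.get_content_maintype()   # part before the slash
    subtype = headers.get_content_subtype()     # part after the slash

print(body, lines, line_count, content_type, maintype, subtype)
```

With a real HTTP URL the same calls apply, though the body must be decoded (for example with body.decode("utf-8")) before you treat it as text.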

If what you need to do with the document you grab from the Web is specifically to save it to a local file, urllib.urlretrieve is just what you need, as the "Introduction" to this chapter describes.
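As a minimal sketch of that save-to-file case in Python 3, where the function is urllib.request.urlretrieve: the data: URL and the page.txt filename below are hypothetical stand-ins of mine, used so that the example runs offline and writes only into a temporary directory.

```python
import os
import tempfile
from urllib.request import urlretrieve

# Assumption: a data: URL stands in for a real web address.
URL = "data:text/plain,saved%20by%20urlretrieve"

# Hypothetical target path inside a fresh temporary directory.
target = os.path.join(tempfile.mkdtemp(), "page.txt")

# urlretrieve copies the document to the given path and returns
# the path together with the response headers.
filename, headers = urlretrieve(URL, target)

with open(filename, "rb") as saved:
    content = saved.read()

print(filename, content)
```

If you omit the second argument, urlretrieve picks a temporary filename itself and returns it.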

urllib implicitly supports the use of proxies, as long as the proxies do not require authentication; the current implementation of urllib does not support authentication-requiring proxies. Just set the environment variable HTTP_PROXY to a URL, such as 'http://proxy.domain.com:8080', to use the proxy at that URL. If HTTP_PROXY is not set, urllib may also look for the information in other platform-specific locations, such as the Windows registry if you're running under Windows.
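To see what proxy settings would be picked up, here is a small Python 3 sketch built on urllib.request.getproxies, which reports the proxies found in the environment (falling back to platform-specific settings when none are set). The helper name proxies_from_env is my own, and the proxy address is the illustrative one from the text. Python 3 accepts the variable in either case but prefers the lowercase http_proxy when both are set, so the sketch sets the lowercase name.

```python
import os
from urllib.request import getproxies


def proxies_from_env(proxy_url):
    """Temporarily set http_proxy and return the proxies urllib detects."""
    saved = os.environ.get("http_proxy")
    os.environ["http_proxy"] = proxy_url
    try:
        # getproxies() maps each scheme to the proxy URL it found.
        return getproxies()
    finally:
        # Restore the environment to its previous state.
        if saved is None:
            del os.environ["http_proxy"]
        else:
            os.environ["http_proxy"] = saved


proxies = proxies_from_env("http://proxy.domain.com:8080")
print(proxies.get("http"))
```

Normally you would set the variable in your shell rather than from within the program; setting it in code here just keeps the example self-contained.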

If you have more advanced needs, such as using proxies that require authentication, you can use the more sophisticated urllib2 module of the Python Standard Library rather than the simpler urllib module. At http://pydoc.org/2.3/urllib2.html, you can find an example of how to use urllib2 for the specific task of accessing the Internet through a proxy that does require authentication.
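For a feel of what such a setup involves, here is a sketch in Python 3, where the urllib2 machinery lives in urllib.request. It is not the example from the page cited above; the helper name build_proxy_opener, the proxy address, and the credentials are all hypothetical. The code only builds the opener and does not contact any server.

```python
from urllib.request import (
    HTTPPasswordMgrWithDefaultRealm,
    ProxyBasicAuthHandler,
    ProxyHandler,
    build_opener,
)


def build_proxy_opener(proxy_host, user, password):
    """Return an opener that sends HTTP requests via an authenticating proxy."""
    # Store the credentials under the default realm (None), so they apply
    # whatever realm name the proxy announces.
    password_mgr = HTTPPasswordMgrWithDefaultRealm()
    password_mgr.add_password(None, proxy_host, user, password)
    # Route plain HTTP requests through the proxy.
    proxy_handler = ProxyHandler({"http": "http://" + proxy_host})
    # Answer the proxy's authentication challenge with the stored credentials.
    auth_handler = ProxyBasicAuthHandler(password_mgr)
    return build_opener(proxy_handler, auth_handler)


# Hypothetical proxy address and credentials, for illustration only.
opener = build_proxy_opener("proxy.domain.com:8080", "alice", "secret")
# opener.open("http://www.python.org") would then fetch through the proxy.
```

Passing the opener to urllib.request.install_opener makes plain urlopen calls use it as well.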
