20.3. Advanced Web Clients

Web browsers are basic Web clients. They are used primarily for searching for and downloading documents from the Web. Advanced Web clients are applications that do more than download single documents from the Internet.

One example of an advanced Web client is a crawler (aka spider, robot). These are programs that explore and download pages from the Internet for different reasons, some of which include:

  • Indexing into a large search engine such as Google or Yahoo!

  • Offline browsing: downloading documents onto a local hard disk and rearranging hyperlinks to create an almost-mirror image for local browsing

  • Downloading and storing for historical or archival purposes, or

  • Web page caching, to avoid superfluous downloads on Web site revisits.

The crawler we present below, crawl.py, takes a starting Web address (URL), downloads that page, and then follows the links found in each downloaded page, but only those that are in the same domain as the starting page. Without such a limitation, you would quickly run out of disk space! The source for crawl.py appears in Example 20.2.

Line-by-Line (Class-by-Class) Explanation

Lines 1-11

The top part of the script consists of the standard Python Unix start-up line and the importation of various module attributes that are employed in this application.
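As a rough idea of what such a start-up block looks like (an approximation only; the exact imports belong to Example 20.2), a Python 2 crawler of this kind might begin with:

    #!/usr/bin/env python

    from sys import argv                      # command-line arguments
    from os import makedirs, sep              # directory creation, path separator
    from os.path import dirname, isdir, splitext
    from urllib import urlretrieve            # downloading pages
    from urlparse import urlparse, urljoin    # parsing and joining URLs
    from htmllib import HTMLParser            # extracting links
    from formatter import AbstractFormatter, DumbWriter
    from cStringIO import StringIO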

Lines 13-49

The Retriever class has the responsibility of downloading pages from the Web and parsing the links located within each document, adding them to the "to-do" queue if necessary. A Retriever instance object is created for each page that is downloaded from the net. Retriever consists of several methods to aid in its functionality: a constructor (__init__()), filename(), download(), and parseAndGetLinks().

The filename() method takes the given URL and comes up with a safe and sane corresponding filename to store locally. Basically, it removes the "http://" prefix from the URL and uses the remaining part as the filename, creating any directory paths necessary. URLs without trailing filenames will be given a default filename of "index.htm". (This name can be overridden in the call to filename().)
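As an illustration, a standalone helper with the same behavior might be sketched as follows (url_to_filename() is a hypothetical name; the real logic lives in Retriever.filename() in Example 20.2):

    from os import makedirs, sep
    from os.path import dirname, isdir, splitext
    from urlparse import urlparse

    def url_to_filename(url, deffile='index.htm'):
        # drop the scheme; use host + path as the local file name
        parsed = urlparse(url, 'http:', 0)
        path = parsed[1] + parsed[2]        # e.g. www.null.com/home/index.html
        if splitext(path)[1] == '':         # no file extension: use the default name
            if path[-1] == '/':
                path = path + deffile
            else:
                path = path + '/' + deffile
        ldir = dirname(path)                # local directory that will hold the file
        if sep != '/':                      # switch to the native path separator
            ldir = ldir.replace('/', sep)
        if ldir and not isdir(ldir):        # create any missing directories
            makedirs(ldir)
        return path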

The constructor instantiates a Retriever object and stores both the URL string and the corresponding file name returned by filename() as local attributes.

The download() method, as you may imagine, actually goes out to the net to download the page at the given link. It calls urllib.urlretrieve() with the URL and saves the result to the filename (the one returned by filename()). If the download fails, an error string is returned; otherwise the downloaded copy is left on disk for parsing.
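In isolation, the download step might be sketched like this (download() appears here as a free function for illustration; in Example 20.2 it is a method of Retriever):

    from urllib import urlretrieve

    def download(url, localfile):
        # fetch the page at url and save it to localfile;
        # if the URL is bad, return an error string instead
        try:
            retval = urlretrieve(url, localfile)
        except IOError:
            retval = ('*** ERROR: invalid URL "%s"' % url,)
        return retval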

If the Crawler determines that no error has occurred, it will invoke the parseAndGetLinks() method to parse the newly downloaded page and determine the course of action for each link located on that page.
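One plausible way to do this parsing with the standard library of that era is htmllib, whose HTMLParser collects every anchor it sees in an anchorlist attribute. The sketch below (parse_and_get_links() is a hypothetical stand-in for the Retriever method) assumes the page has already been saved locally:

    from htmllib import HTMLParser
    from formatter import AbstractFormatter, DumbWriter
    from cStringIO import StringIO

    def parse_and_get_links(localfile):
        # feed the saved page into htmllib's parser and return the anchors found
        parser = HTMLParser(AbstractFormatter(DumbWriter(StringIO())))
        parser.feed(open(localfile).read())
        parser.close()
        return parser.anchorlist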

Lines 51-98

The Crawler class is the "star" of the show, managing the entire crawling process for one Web site. If we added threading to our application, we would create separate instances for each site crawled. The Crawler consists of three items stored by the constructor during the instantiation phase, the first of which is q, a queue of links to download. Such a list will fluctuate during execution, shrinking as each page is processed and growing as new links are discovered within each downloaded page.

The other two data values stored by the Crawler are seen, a list of the links we have already "seen" (downloaded), and dom, the domain name of the main link; the latter is used to determine whether any succeeding links are part of the same domain.

Crawler also has a static data item named count. The purpose of this counter is just to keep track of the number of objects we have downloaded from the net. It is incremented for every page successfully downloaded.

Crawler has a pair of other methods in addition to its constructor, getPage() and go(). go() is simply the method that is used to start the Crawler and is called from the main body of code. go() consists of a loop that will continue to execute as long as there are new links in the queue that need to be downloaded. The workhorse of this class, though, is the getPage() method.

getPage() instantiates a Retriever object with the first link and lets it go off to the races. If the page was downloaded successfully, the counter is incremented and the link added to the "already seen" list. getPage() then examines every link found in the downloaded page and determines whether any of them should be added to the queue. The main loop in go() will continue to process links until the queue is empty, at which time victory is declared.

Links that are part of another domain, have already been downloaded, are already in the queue waiting to be processed, or are "mailto:" links are ignored and not added to the queue.
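Pulling the pieces together, a rough sketch of such a Crawler, built on the hypothetical helpers from the earlier sketches rather than on the actual Example 20.2 listing, might look like this:

    from urlparse import urlparse, urljoin

    class Crawler(object):
        count = 0                               # class-wide tally of pages downloaded

        def __init__(self, url):
            self.q = [url]                      # links still to download
            self.seen = []                      # links already downloaded
            self.dom = urlparse(url)[1]         # domain of the starting link

        def getPage(self, url):
            # download one page, then queue any new links in the same domain
            localfile = url_to_filename(url)
            retval = download(url, localfile)
            if retval[0].startswith('*** ERROR'):
                return                          # download failed; nothing to parse
            Crawler.count = Crawler.count + 1
            self.seen.append(url)
            for link in parse_and_get_links(localfile):
                link = urljoin(url, link)       # resolve relative links (an assumption)
                if link.lower().startswith('mailto:'):
                    continue                    # mailto link: ignore
                if link in self.seen or link in self.q:
                    continue                    # already processed or already queued
                if self.dom not in link:
                    continue                    # outside the starting domain
                self.q.append(link)             # new link in our domain: queue it

        def go(self):
            while self.q:                       # run until the queue drains
                self.getPage(self.q.pop())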

Lines 100-114

main() is executed if this script is invoked directly and is the starting point of execution; other modules that import crawl.py will need to invoke main() themselves. main() needs a URL to begin: if one is given on the command line (for example, when this script is invoked directly), it uses that one; otherwise, the script enters interactive mode and prompts the user for a starting URL. With a starting link in hand, the Crawler is instantiated and away we go.
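Under those assumptions, the entry point might be sketched as:

    from sys import argv

    def main():
        # take the starting URL from the command line, or prompt for one
        if len(argv) > 1:
            url = argv[1]
        else:
            try:
                url = raw_input('Enter starting URL: ')
            except (KeyboardInterrupt, EOFError):
                url = ''
        if not url:
            return
        robot = Crawler(url)
        robot.go()

    if __name__ == '__main__':
        main()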

One sample invocation of crawl.py may look like this:

    % crawl.py
    Enter starting URL: http://www.null.com/home/index.html

    ( 1 )
    URL: http://www.null.com/home/index.html
    FILE: www.null.com/home/index.html
    * http://www.null.com/home/overview.html ... new, added to Q
    * http://www.null.com/home/synopsis.html ... new, added to Q
    * http://www.null.com/home/order.html ... new, added to Q
    * mailto:postmaster@null.com ... discarded, mailto link
    * http://www.null.com/home/overview.html ... discarded, already in Q
    * http://www.null.com/home/synopsis.html ... discarded, already in Q
    * http://www.null.com/home/order.html ... discarded, already in Q
    * mailto:postmaster@null.com ... discarded, mailto link
    * http://bogus.com/index.html ... discarded, not in domain

    ( 2 )
    URL: http://www.null.com/home/order.html
    FILE: www.null.com/home/order.html
    * mailto:postmaster@null.com ... discarded, mailto link
    * http://www.null.com/home/index.html ... discarded, already processed
    * http://www.null.com/home/synopsis.html ... discarded, already in Q
    * http://www.null.com/home/overview.html ... discarded, already in Q

    ( 3 )
    URL: http://www.null.com/home/synopsis.html
    FILE: www.null.com/home/synopsis.html
    * http://www.null.com/home/index.html ... discarded, already processed
    * http://www.null.com/home/order.html ... discarded, already processed
    * http://www.null.com/home/overview.html ... discarded, already in Q

    ( 4 )
    URL: http://www.null.com/home/overview.html
    FILE: www.null.com/home/overview.html
    * http://www.null.com/home/synopsis.html ... discarded, already processed
    * http://www.null.com/home/index.html ... discarded, already processed
    * http://www.null.com/home/synopsis.html ... discarded, already processed
    * http://www.null.com/home/order.html ... discarded, already processed


After execution, a www.null.com directory would be created in the local file system, with a home subdirectory. Within home, all the HTML files processed will be found.


