20.10. Exercises

20-1.

urllib Module and Files. Update the friends3.py script so that it stores names and corresponding number of friends into a two-column text file on disk and continues to add names each time the script is run.

Extra Credit: Add code to dump the contents of such a file to the Web browser (in HTML format). Additional Extra Credit: Create a link that clears all the names in this file.

20-2.

urllib Module. Write a program that takes a user-input URL (either a Web page or an FTP file, e.g., http://python.org or ftp://ftp.python.org/pub/python/README) and downloads it to your machine with the same filename (or a modified name similar to the original if it is invalid on your system). Web pages (HTTP) should be saved as .htm or .html files, and FTP'd files should retain their extensions.
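
A minimal sketch of one way to derive the local filename before downloading (Python 2); the name-mangling policy and the 'index.html' fallback are assumptions, not the only reasonable choices:

import os
import urllib
from urlparse import urlparse

def download(url):
    path = urlparse(url)[2]                  # the path component
    fname = os.path.basename(path)
    if not fname:                            # e.g., http://python.org
        fname = 'index.html'                 # assumed fallback name
    if url.startswith('http') and not \
            (fname.endswith('.htm') or fname.endswith('.html')):
        fname = fname + '.html'
    urllib.urlretrieve(url, fname)
    return fname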

20-3.

urllib Module. Rewrite the grabWeb.py script of Example 11.4, which downloads a Web page and displays the first and last non-blank lines of the resulting HTML file, so that you use urlopen() instead of urlretrieve() to process the data directly (as opposed to downloading the entire file first before processing it).
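
A sketch of processing the page directly with urlopen() (Python 2); note that no intermediate file is written to disk:

import urllib

f = urllib.urlopen('http://python.org')     # any URL will do
first = last = None
for line in f.readlines():
    line = line.strip()
    if line:
        if first is None:
            first = line
        last = line
f.close()
print 'first non-blank line:', first
print 'last  non-blank line:', last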

20-4.

URLs and Regular Expressions. Your browser may save your favorite Web site URLs as a "bookmarks" HTML file (Mozilla-flavored browsers do this) or as a set of .URL files in a "favorites" directory (IE does this). Find your browser's method of recording your "hot links," along with where and how they are stored. Without altering any of the files, strip out the URLs and names of the corresponding Web sites (if given), produce a two-column list of names and links as output, and store this data in a disk file. Truncate site names or URLs as necessary to keep each line of output within 80 columns.
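
A sketch for the Mozilla-style bookmarks file (Python 2); the file location and the deliberately simple RE are assumptions you will likely need to adapt to your own browser:

import re

pattern = re.compile(r'<A HREF="([^"]+)"[^>]*>([^<]*)</A>', re.I)

data = open('bookmarks.html').read()          # hypothetical location
for url, name in pattern.findall(data):
    print '%-39s %s' % (name[:39], url[:40])  # keep within 80 columns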

20-5.

URLs, urllib Module, Exceptions, and REs. As a follow-up problem to the previous one, add code to your script to test each of your favorite links. Report back a list of dead links (and their names), i.e., Web sites that are no longer active or a Web page that has been removed. Only output and save to disk the still-valid links.
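
One way to probe each link (Python 2): urllib2.urlopen() raises an exception on HTTP errors, which urllib's urlopen() would silently swallow by returning the error page instead:

import urllib2

def is_alive(url):
    try:
        f = urllib2.urlopen(url)
        f.close()
        return True
    except (urllib2.URLError, ValueError):   # HTTPError subclasses URLError
        return False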

20-6.

Error Checking. The friends3.py script reports an error if no radio button was selected to indicate the number of friends. Update the CGI script to also report an error if no name (e.g., blank or whitespace) is entered.

Extra Credit: We have so far explored only server-side error checking. Explore JavaScript programming and implement client-side error checking by creating JavaScript code to check for both error situations so that these errors are stopped before they reach the server.
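
For the server-side portion, a minimal sketch of the blank-name check (Python 2); 'who' is an assumed field name and would need to match the actual friends3.py form:

import cgi

form = cgi.FieldStorage()
if not form.has_key('who') or not form['who'].value.strip():
    print 'Content-Type: text/html\n'
    print '<H3>ERROR: no name was entered.</H3>'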

Problems 20-7 to 20-10 below pertain to Web server access log files and regular expressions. Web servers (and their administrators) generally have to maintain an access log file (usually logs/access_log under the main Web server directory) that tracks incoming file requests. Over a period of time, such files get large and need to be either archived or truncated. Why not save only the pertinent information and delete the files to conserve disk space? The exercises below are designed to give you some practice with REs and show how they can be used to help archive and analyze Web server data.
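
As a starting point for the four exercises that follow, here is a sketch of parsing the common log format with a (simplified) RE; real log lines may need a more forgiving pattern:

import re

logre = re.compile(
    r'(\S+) \S+ \S+ \[[^]]+\] "(\S+) (\S+)[^"]*" (\d+) \S+')

for line in open('access_log'):              # assumed log location
    m = logre.match(line)
    if m:
        host, method, link, status = m.groups()
        # ... tally method (20-7), status/link (20-8, 20-9), host (20-10)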

20-7.

Count how many of each type of request (GET versus POST) exist in the log file.

20-8.

Count the successful page/data downloads: Display all links that resulted in a return code of 200 (OK [no error]) and how many times each link was accessed.

20-9.

Count the errors: Show all links that resulted in errors (return codes in the 400s or 500s) and how many times each link was accessed.

20-10.

Track IP addresses: For each IP address, output a list of each page/data downloaded and how many times that link was accessed.

20-11.

Simple CGI. Create a "Comments" or "Feedback" page for a Web site. Take user feedback via a form, process the data in your script, and return a "thank you" screen.
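
A bare-bones sketch of the processing script (Python 2); field names such as 'comments' are assumptions tied to whatever form you build:

import cgi

form = cgi.FieldStorage()
comments = form.getvalue('comments', '').strip()

print 'Content-Type: text/html\n'
if comments:
    # save or mail the feedback here, then thank the user
    print '<H3>Thank you for your feedback!</H3>'
else:
    print '<H3>Please go back and enter a comment.</H3>'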

20-12.

Simple CGI. Create a Web guestbook. Accept a name, an e-mail address, and a journal entry from a user and log it to a file (format of your choice). Like the previous problem, return a "thanks for filling out a guestbook entry" page. Also provide a link that allows users to view guestbooks.

20-13.

Web Browser Cookies and Web Site Registration. Update your solution to Exercise 20-4 so that your user-password information now pertains to Web site registration instead of a simple text-based menu system.

Extra Credit: Familiarize yourself with setting Web browser cookies, and maintain a login session for four hours from the last successful login.
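
A sketch of issuing a session cookie good for four hours (Python 2); the cookie name and opaque session value are assumptions:

import Cookie

c = Cookie.SimpleCookie()
c['session'] = 'some-opaque-session-id'
c['session']['max-age'] = 4 * 3600           # four hours, in seconds

print 'Content-Type: text/html'
print c.output()                             # emits the Set-Cookie header
print                                        # blank line ends the headers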

20-14.

Web Clients. Port Example 20.1, crawl.py, the Web crawler, to use the HTMLParser module or the BeautifulSoup parsing system.
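
A sketch of the parsing half using the standard HTMLParser module (Python 2); crawl.py's downloading and queueing logic is untouched here:

from HTMLParser import HTMLParser

class LinkGrabber(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':                       # collect anchor targets
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

p = LinkGrabber()
p.feed(open('page.html').read())             # or data from urlopen()
print p.links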

20-15.

Errors. What happens when a CGI script crashes? How can the cgitb module be helpful?
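
With cgitb enabled, an uncaught exception is rendered as a detailed HTML traceback in the browser instead of a bare "Internal Server Error":

import cgitb
cgitb.enable()      # place at the very top of the CGI script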

20-16.

CGI, File Upload, and Zip Files. Create a CGI application that not only saves files to the server's disk, but also intelligently unpacks Zip files (or other archives) into a subdirectory named after the archive file.
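
A sketch of the unpacking step only (extractall() requires Python 2.6+); the upload-handling code that produces fname is assumed:

import os
import zipfile

def unpack(fname):
    if zipfile.is_zipfile(fname):
        subdir = os.path.splitext(fname)[0]  # archive name sans ".zip"
        if not os.path.isdir(subdir):
            os.mkdir(subdir)
        zf = zipfile.ZipFile(fname)
        zf.extractall(subdir)
        zf.close()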

20-17.

Zope, Plone, TurboGears, Django. Investigate each of these complex Web development platforms and create one simple application in each.

20-18.

Web Database Application. Think of a database schema you want to provide as part of a Web database application. For this multi-user application, you want to provide everyone read access to the entire contents of the database, but write access only to each individual's own entry. One example may be an "address book" for your family and relatives. Each family member, once successfully logged in, is presented with a Web page with several options: add an entry, view my entry, update my entry, remove or delete my entry, and view all entries (the entire database).

Design a UserEntry class and create a database entry for each instance of this class. You may use any solution created for any previous problem to implement the registration framework. Finally, you may use any type of storage mechanism for your database, either a relational database such as MySQL or some of the simpler Python persistent storage modules such as anydbm or shelve.
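
A minimal sketch of the persistence layer using shelve (Python 2); the UserEntry fields shown are assumptions for an "address book" schema:

import shelve

class UserEntry(object):
    def __init__(self, name, email, address):
        self.name = name
        self.email = email
        self.address = address

db = shelve.open('addrbook.db')
entry = UserEntry('Wesley', 'wesley@example.com', '123 Main St.')
db[entry.name] = entry                       # keyed by (unique) name
db.close()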

20-19.

Electronic Commerce Engine. Use the classes created for your solution to Exercise 13-11 and add some product inventory to create a potential electronic commerce Web site. Be sure your Web application also supports multiple customers and provides registration for each user.

20-20.

Dictionaries and cgi module. As you know, the cgi.FieldStorage() method returns a dictionary-like object containing the key-value pairs of the submitted CGI variables. You can use methods such as keys() and has_key() for such objects. In Python 1.5, a get() method was added to dictionaries; it returns the value of the requested key, or a default value if the key does not exist. FieldStorage objects do not have such a method. Let's say we grab the form in the usual manner of:

form = cgi.FieldStorage()


Add a similar get() method to the class definition in cgi.py (you can rename it to mycgi.py or something like that) such that code that looks like this:

if form.has_key('who'):
    who = form['who'].value
else:
    who = '(no name submitted)'


... can be replaced by a single line which makes forms even more like a dictionary:

who = form.get('who', '(no name submitted)')
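
One way to avoid editing cgi.py itself is to subclass FieldStorage in your own mycgi.py; a sketch (it ignores the case of multiple values per key):

import cgi

class FieldStorage(cgi.FieldStorage):
    def get(self, key, default=None):
        if self.has_key(key):
            return self[key].value
        return default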


20-21.

Creating Web Servers. Our code for myhttpd.py in Section 20.7 is only able to read HTML files and return them to the calling client. Add support for plain text files with the ".txt" ending. Be sure that you return the correct MIME type of "text/plain."

Extra credit: add support for JPEG files ending with either ".jpg" or ".jpeg" and having a MIME type of "image/jpeg."
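
A sketch of dispatching on the file extension; the surrounding handler and file-reading code follow the myhttpd.py structure only loosely:

import os

MIME_TYPES = {
    '.html': 'text/html',  '.htm': 'text/html',
    '.txt':  'text/plain',
    '.jpg':  'image/jpeg', '.jpeg': 'image/jpeg',   # extra credit
}

def content_type(path):
    ext = os.path.splitext(path)[1].lower()
    return MIME_TYPES.get(ext, 'application/octet-stream')

# remember to open image files in binary ('rb') mode before sending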

20-22.

Advanced Web Clients. URLs given as input to crawl.py must have the leading "http://" protocol indicator and top-level URLs must contain a trailing slash, i.e., http://www.prenhallprofessional.com/. Make crawl.py more robust by allowing the user to input just the hostname (without the protocol part [make it assume HTTP]) and also make the trailing slash optional. For example, www.prenhallprofessional.com should now be acceptable input.
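
One sketch of normalizing the user's input before crawling (Python 2); both fix-ups below match the behavior the exercise asks for:

from urlparse import urlparse

def normalize(url):
    if '://' not in url:                 # bare hostname: assume HTTP
        url = 'http://' + url
    parts = urlparse(url)
    if not parts[2]:                     # empty path: add trailing slash
        url += '/'
    return url

print normalize('www.prenhallprofessional.com')
# http://www.prenhallprofessional.com/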

20-23.

Advanced Web Clients. Update the crawl.py script in Section 20.3 to also download links that use the "ftp:" scheme. All "mailto:" links are ignored by crawl.py. Add support to ensure that it also ignores "telnet:", "news:", "gopher:", and "about:" links.

20-24.

Advanced Web Clients. The crawl.py script in Section 20.3 only downloads .html files via links found in Web pages at the same site and does not handle/save images that are also valid "files" for those pages. It also does not handle servers that are sensitive to URLs missing the trailing slash ( / ). Add a pair of classes to crawl.py to deal with these problems.

A My404UrlOpener class should subclass urllib.FancyURLopener and consist of a single method, http_error_404(), which determines whether a 404 error was reached using a URL without a trailing slash. If so, it adds the slash and retries the request (once only). If it still fails, it should return a genuine 404 error. You must set urllib._urlopener to an instance of this class so that urllib uses it.

Create another class called LinkImageParser, which derives from htmllib.HTMLParser. This class should contain a constructor to call the base class constructor as well as initialize a list for the image files parsed from Web pages. The handle_image() method should be overridden to add image filenames to the image list (instead of discarding them as the current base class method does).
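
A rough skeleton of the two classes (Python 2); note that urllib passes the scheme-less URL into the error handler, hence the self.type reconstruction below, and the retry naturally happens only once because the retried URL already ends in a slash:

import urllib
import htmllib, formatter

class My404UrlOpener(urllib.FancyURLopener):
    def http_error_404(self, url, fp, errcode, errmsg, headers, data=None):
        if not url.endswith('/'):
            # retry once with the slash added; if that also 404s, the
            # retried URL ends in '/' and falls through to the default
            return self.open('%s:%s/' % (self.type, url))
        return urllib.FancyURLopener.http_error_default(
            self, url, fp, errcode, errmsg, headers)

urllib._urlopener = My404UrlOpener()         # make urllib use our opener

class LinkImageParser(htmllib.HTMLParser):
    def __init__(self):
        htmllib.HTMLParser.__init__(self, formatter.NullFormatter())
        self.images = []                     # image files found in the page

    def handle_image(self, src, alt, *args):
        self.images.append(src)              # keep, rather than discard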


