Section 16.5. Saving State Information in CGI Scripts

16.5. Saving State Information in CGI Scripts

One of the most unusual aspects of the basic CGI model, and one of its starkest contrasts to the GUI programming techniques we studied in the prior part of this book, is that CGI scripts are statelesseach is a standalone program, normally run autonomously, with no knowledge of any other scripts that may run before or after. There is no notion of things such as global variables or objects that outlive a single step of interaction and retain context. Each script begins from scratch, with no memory of where the prior left off.

This makes web servers simple and robusta buggy CGI script won't interfere with the server process. In fact, a flaw in a CGI script generally affects only the single page it implements, not the entire web-based application. But this is a very different model from callback-handler functions in a single process GUI, and it requires extra work to remember things longer than a single script's execution.

Lack of state retention hasn't mattered in our simple examples so far, but larger systems are usually composed of multiple user interaction steps and many scripts, and they need a way to keep track of information gathered along the way. As suggested in the last two sections, generating query parameters on URL links and hidden form fields in reply pages are two simple ways for a CGI script to pass data to the next script in the application. When clicked or submitted, such parameters send preprogrammed selection or session information back to another server-side handler script. In a sense, the content of the generated reply page itself becomes the memory space of the application.

For example, a site that lets you read your email may present you with a list of viewable email messages, implemented in HTML as a list of hyperlinks generated by another script. Each hyperlink might include the name of the message viewer script, along with parameters identifying the selected message number, email server name, and so onas much data as is needed to fetch the message associated with a particular link. A retail site may instead serve up a generated list of product links, each of which triggers a hardcoded hyperlink containing the product number, its price, and so on. Alternatively, the purchase page at a retail site may embed the product selected in a prior page as hidden form fields.

In fact, one of the main reasons for showing the techniques in the last two sections is that we're going to use them extensively in the larger case study in the next chapter. For example, we'll use generated stateful URLs with query parameters to implement lists of dynamically generated selections that "know" what to do when clicked. Hidden form fields will also be deployed to pass user login data to the next page's script. From a more general perspective, both techniques are ways to retain state information between pagesthey can be used to direct the action of the next script to be run.

Generating URL parameters and hidden form fields works well for retaining state information across pages during a single session of interaction. Some scenarios require more, though. For instance, what if we want to remember a user's login name from session to session? Or what if we need to keep track of pages at our site visited by a user in the past? Because such information must be longer lived than the pages of a single session of interaction, query parameters and hidden form fields won't suffice.

In general, there are a variety of ways to pass or retain state information between CGI script executions and across sessions of interaction:

URL query parameters: Session state embedded in pages
Hidden form fields: Session state embedded in pages
Cookies: Smaller information stored on the client that may span sessions
Server-side databases: Larger information that might span sessions
CGI model extensions: Persistent processes, session management, and so on

We'll explore most of these in later examples, but since this is a core idea in server-side scripting, let's take a brief look at each of these in turn.

16.5.1. URL Query Parameters

We met these earlier in this chapter: hardcoded URL parameters in dynamically generated hyperlinks embedded in reply web pages. By including both a processing script name and input to it, such links direct the operation of the next page when selected. The parameters are transmitted from client to server automatically, as part of a GET-style request.

Coding query parameters is straightforwardprint the correctly formatted URL to standard output from your CGI script as part of the reply page (albeit following some escaping conventions we'll meet later in this chapter):

 script = "onViewListLink.py" user = 'bob' mnum = 66 pswd = 'xxx' site = 'pop.rmi.net' print ('<a href="%s?user=%s&pswd=%s&mnum=%d&site=%s">View %s</a>'               % (script, user, pswd, mnum, site, mnum))

The resulting URL will have enough information to direct the next script when clicked:

 <a href="onViewListLink.py      ?user=bob&pswd=xxx&mnum=66&site=pop.rmi.net">View 66</a>

Query parameters serve as memory, and they pass information between pages. As such, they are useful for retaining state across the pages of a single session of interaction. Since each generated URL may have different attached parameters, this scheme can provide context per user-selectable action. Each link in a list of selectable alternatives, for example, may have a different implied action coded as a different parameter value. Moreover, users can bookmark a link with parameters, in order to return to a specific state in an interaction.

Because their state retention is lost when the page is abandoned, though, they are not useful for remembering state from session to session. Moreover, the data appended as URL query parameters is generally visible to users and may appear in server logfiles; in some applications, it may have to be manually encrypted to avoid display or forgery.

16.5.2. Hidden Form Input Fields

We met these in the prior section as well: hidden form input fields that are attached to form data and are embedded in reply web pages, but are not displayed on web pages. When the form is submitted, all the hidden fields are transmitted to the next script along with any real inputs, to serve as context. The net effect provides context for an entire input form, not a particular hyperlink. An already entered username, password, or selection, for instance, can be implied by the values of hidden fields in subsequently generated pages.

In terms of code, hidden fields are generated by server-side scripts as part of the reply page's HTML, and are later returned by the client with all of the form's input data:

 print '<form method=post action="%s/onViewSubmit.py">' % urlroot print '<input type=hidden name=mnum value="%s">' % msgnum print '<input type=hidden name=user value="%s">' % user print '<input type=hidden name=site value="%s">' % site print '<input type=hidden name=pswd value="%s">' % pswd

Like query parameters, hidden form fields can also serve as a sort of memory, retaining state information from page to page. Also like query parameters, because this kind of memory is embedded in the page itself, hidden fields are useful for state retention among the pages of a single session of interaction, but not for data that spans multiple sessions.

And like both query parameters and cookies (up next), hidden form fields may be visible to userstheir values are displayed if the page's source HTML code is displayed. As a result, hidden form fields are not secure; encryption of the embedded data may again be required in some contexts to avoid display on the client, or forgery in form submissions.

16.5.3. HTTP "Cookies"

Cookies, an extension to the HTTP protocol underlying the web model, are a way for server-side applications to directly store information on the client computer. Because this information is not embedded in the HTML of web pages, it outlives the pages of a single session. As such, cookies are ideal for remembering things that must span sessions.

Things like usernames and preferences, for example, are prime cookie candidatesthey will be available the next time the client visits our site. However, because cookies may have space limitations, are seen by some as intrusive, and can be disabled by users on the client, they are not always well suited to general data storage needs. They are often best used for small pieces of noncritical cross-session state information.

Operationally, HTTP cookies are strings of information stored on the client machine and transferred between client and server in HTTP message headers. Server-side scripts generate HTTP headers to request that a cookie be stored on the client as part of the script's reply stream. Later, the client web browser generates HTTP headers that send back all the cookies matching the server and page being contacted. In effect, cookie data is embedded in the data streams much like query parameters and form fields, but is contained in HTTP headers, not in a page's HTML. Moreover, cookie data can be stored permanently on the client, and so outlives both pages and interactive sessions.

For web application developers, Python's standard library includes tools that simplify the task of sending and receiving: cookielib does cookie handling for HTTP clients that talk to web servers, and the module Cookie simplifies the task of creating and receiving cookies on the server. Moreover, the module urllib2 has support for opening URLs with automatic cookie handling.

16.5.3.1. Creating a cookie

Web browsers such as Firefox and Internet Explorer generally handle the client side of this protocol, storing and sending cookie data. For the purpose of this chapter, we are mainly interested in cookie processing on the server. Cookies are created by sending special HTTP headers at the start of the reply stream:

 Content-type: text/html Set-Cookie: foo=bar; <HTML>...

The full format of a cookie's header is as follows:

 Set-Cookie: name=value; expires=date; path=pathname; domain=domainname; secure

The domain defaults to the hostname of the server that set the cookie, and the path defaults to the path of the document or script that set the cookiethese are later matched by the client to know when to send a cookie's value back to the server. In Python, cookie creation is simple; the following in a CGI script stores a last-visited time cookie:

 import Cookie, time cook = Cookie.SimpleCookie( ) cook["visited"] = str(time.time( ))     # a dictionary print cook.output( )                    # "Set-Cookie: visited=1137268854.98;" print 'Content-type: text/html\n'

The SimpleCookie call here creates a dictionary-like cookie object whose keys are strings (the names of the cookies), and whose values are "Morsel" objects (describing the cookie's value). Morsels in turn are also dictionary-like objects with one key per cookie propertypath and domain, expires to give the cookie an expiration date (the default is the duration of the browser session), and so on. Morsels also have attributesfor instance, key and value give the name and value of the cookie, respectively. Assigning a string to a cookie key automatically creates a Morsel from the string, and the cookie object's output method returns a string suitable for use as an HTTP header (printing the object directly has the same effect, due to its _ _str_ _ operator overloading). Here is a more comprehensive example of the interface in action:

 >>> import Cookie, time >>> cooks = Cookie.SimpleCookie( ) >>> cooks['visited']  = time.asctime( ) >>> cooks['username'] = 'Bob' >>> cooks['username']['path'] = '/myscript' >>> cooks['visited'].value 'Sun Jan 15 11:31:24 2006' >>> print cooks['visited'] Set-Cookie: visited="Sun Jan 15 11:31:24 2006"; >>> print cooks Set-Cookie: username=Bob; Path=/myscript; Set-Cookie: visited="Sun Jan 15 11:31:24 2006";

16.5.3.2. Receiving a cookie

Now, when the client visits the page again in the future, the cookie's data is sent back from the browser to the server in HTTP headers again, in the form "Cookie: name1=value1; name2=value2 ...". For example:

 Cookie: visited=1137268854.98

Roughly, the browser client returns all cookies that match the requested server's domain name and path. In the CGI script on the server, the environment variable HTTP_COOKIE contains the raw cookie data headers string uploaded from the client; it can be extracted in Python as follows:

 import os, Cookie cooks = Cookie.SimpleCookie(os.environ.get("HTTP_COOKIE")) vcook = cooks.get("visited")     # a Morsel dictionary if vcook != None:     time = vcook.value

Here, the SimpleCookie constructor call automatically parses the passed-in cookie data string into a dictionary of Morsel objects; as usual, the dictionary get method returns a default None if a key is absent, and we use the Morsel object's value attribute to extract the cookie's value string if sent.

16.5.3.3. Using cookies in CGI scripts

To help put these pieces together, Example 16-16 lists a CGI script that stores a client-side cookie when first visited, and receives and displays it on subsequent visits.

Example 16-16. PP3E\Internet\Web\cgi-bin\cookies.py

 ####################################################### # create or use a client-side cookie storing username; # there is no input form data to parse in this example ####################################################### import Cookie, os cookstr  = os.environ.get("HTTP_COOKIE") cookies  = Cookie.SimpleCookie(cookstr) usercook = cookies.get("user")                     # fetch if sent if usercook == None:                               # create first time     cookies = Cookie.SimpleCookie( )                    # print Set-cookie hdr     cookies['user']  = 'Brian'     print cookies     greeting = '<p>His name shall be... %s</p>' % cookies['user'] else:     greeting = '<p>Welcome back, %s</p>' % usercook.value print "Content-type: text/html\n"                  # plus blank line now print greeting

Assuming you are running this chapter's local web server from Example 16-1, you can invoke this script with a URL such as http://localhost/cgi-bin/cookies.py (type this in your browser's address field, or submit it interactively with the module urllib2). The first time you visit the script, the script sets the cookie within its reply's headers, and you'll see a reply page with this message:

 His name shall be... Set-Cookie: user=Brian;

Thereafter, revisiting the script's URL (use your browser's reload button) produces a reply page with this message:

 Welcome back, Brian

This is because the client is sending the previously stored cookie value back to the script, at least until you kill and restart your web browserthe default expiration of a cookie is the end of a browsing session. In a realistic program, this sort of structure might be used by the login page of a web application; a user would need to enter his name only once per browser session.

16.5.3.4. Handling cookies with the module urllib2

As mentioned earlier, the urllib2 module provides an interface similar to urllib for reading the reply from a URL, but it uses the cookielib module to also support storing and sending cookies on the client. For example, to use it to test the last section's script, we simply need to enable the cookie-handler class:

 >>> import urllib2 >>> opener = urllib2.build_opener(urllib2.HTTPCookieProcessor( )) >>> urllib2.install_opener(opener) >>> >>> reply = urllib2.urlopen('http://localhost/cgi-bin/cookies.py').read( ) >>> print reply <p>His name shall be... Set-Cookie: user=Brian;</p> >>> reply = urllib2.urlopen('http://localhost/cgi-bin/cookies.py').read( ) >>> print reply <p>Welcome back, Brian</p> >>> reply = urllib2.urlopen('http://localhost/cgi-bin/cookies.py').read( ) >>> print reply <p>Welcome back, Brian</p>

This works because urllib2 mimics the cookie behavior of a web browser on the client. Just as in a browser, the cookie is deleted if you exit Python and start a new session to rerun this code. See the library manual for more on this module's interfaces.

Although easy to use, cookies have potential downsides. For one, they may be subject to size limitations (4 KB per cookie, 300 total, and 20 per domain are one common limit). For another, users can disable cookies in most browsers, making them less suited to critical data. Some even see them as intrusive, because they can be abused to track user behavior. Many sites simply require cookies to be turned on, finessing the issue completely. Finally, because they are transmitted over the network between client and server, they are still only as secure as the transmission stream itself; this may be an issue for sensitive data if the page is not using secure HTTP transmissions between client and server. We'll explore secure cookies and server concepts in the next chapter.

For more details on the cookie modules and the cookie protocol in general, see Python's library manual, and search the Web for resources.

16.5.4. Server-Side Databases

For more industrial-strength state retention, Python scripts can employ full-blown database solutions in the server. We will study these options in depth in Chapter 19 of this book. Python scripts have access to a variety of server-side data stores, including flat files, persistent object pickles and shelves, object-oriented databases such as ZODB, and relational SQL-based databases such as MySQL, PostgreSQL, and Oracle. Besides data storage, such systems may provide advanced tools such as transaction commits and rollbacks, concurrent update synchronization, and more.

Full-blown databases are the ultimate storage solution. They can be used to represent state both between the pages of a single session (by tagging the data with generated per-session keys) and across multiple sessions (by storing data under per-user keys).

Given a user's login name, for example, CGI scripts can fetch all of the context we have gathered in the past about that user from the server-side database. Server-side databases are ideal for storing more complex cross-session information; a shopping cart application, for instance, can record items added in the past in a server-side database.

Databases outlive both pages and sessions. Because data is kept explicitly, there is no need to embed it within the query parameters or hidden form fields of reply pages.

Because the data is kept on the server, there is no need to store it on the client in cookies. And because such schemes employ general-purpose databases, they are not subject to the size constraints or optional nature of cookies.

In exchange for their added utility, full-blown databases require more in terms of installation, administration, and coding. As we'll see in Chapter 19, luckily the extra coding part of that trade-off is remarkably simple in Python. Moreover, Python's database interfaces may be used in any application, web-based or otherwise.

16.5.5. Extensions to the CGI Model

Finally, there are more advanced protocols and frameworks for retaining state on the server, which we won't cover in this book. For instance, the Zope web application framework, discussed briefly in Chapter 18, provides a product interface, which allows for the construction of web-based objects that are automatically persistent.

Other schemes, such as FastCGI, as well as server-specific extensions such as mod_python for Apache, may attempt to work around the autonomous, one-shot nature of CGI scripts, or otherwise extend the basic CGI model to support long-lived memory stores. For instance:

FastCGI allows web applications to run as persistent processes, which receive input data from and send reply streams to the HTTP web server over Inter-Process Communication (IPC) mechanisms such as sockets. This differs from normal CGI, which communicates inputs and outputs with environment variables, standard streams, and command-line arguments, and assumes scripts run to completion on each request. Because a FastCGI process may outlive a single page, it can retain state information from page to page, and avoids startup performance costs.
mod_python extends the open source Apache web server by embedding the Python interpreter within Apache. Python code is executed directly within the Apache server, eliminating the need to spawn external processes. This package also supports the concept of sessions, which can be used to store data between pages. Session data is locked for concurrent access and can be stored in files or in memory, depending on whether Apache is running in multiprocess or multithreaded mode. mod_python also includes web development tools, such as the Python Server Pages templating language for HTML generation (described later in this book).

Such models are not universally supported, though, and may come with some added cost in complexityfor example, to synchronize access to persistent data with locks. Moreover, a failure in a FastCGI-style web application impacts the entire application, not just a single page, and things like memory leaks become much more costly. For more on persistent CGI models, and support in Python for things such as FastCGI, search the Web or consult web-specific resources.

16.5.6. Combining Techniques

Naturally, these techniques may be combined to achieve a variety of memory strategies, both for interaction sessions and for more permanent storage needs. For example:

A web application may use cookies to store a per-user or per-session key on the client, and later use that key to index into a server-side database to retrieve the user's or session's full state information.
Even for short-lived session information, URL query parameters or hidden form fields may similarly be used to pass a key identifying the session from page to page, to be used by the next script to index a server-side database.
Moreover, URL query parameters and hidden fields may be generated for temporary state memory that spans pages, even though cookies and databases are used for retention that must span sessions.

The choice of appropriate technique is driven by the application's storage needs. Although not as straightforward as the in-memory variables and objects of single process GUI programs running on a client, with a little creativity, CGI script state retention is entirely possible.