Things are going so well here at Butterthlies, Inc. that we are hard put to keep up with the flood of demand. Everyone, even the cat, is hard at work typing in orders that arrive incessantly by mail and telephone.
Then someone has a brainstorm: "Hey," she cries, "let's use the Internet to take the orders!" The essence of her scheme is simplicity itself. Instead of letting customers read our catalog pages on the Web and then, drunk with excitement, phone in their orders, we provide them with a form they can fill out on their screens. At our end we get a chunk of data back from the Web, which we then pass to a script or program we have written. This brings us into the world of scripting, where the web site can take a much more active role in interacting with users. These tools make Apache a foundation for building applications, not just publishing web pages.
While many sites act as simple repositories, providing users with a collection of files they can retrieve and navigate through with hyperlinks, web sites are capable of much more sophisticated interactions. Sites can collect information from users through forms, customize their appearance and their contents to reflect the interests of particular users, or let users interact with a wide variety of information sources. Sites can also serve as hosts for services provided not to browsers but to other computers, as "web services" become a more common part of computing.
Apache provides a solid foundation for applications, using its core web server to manage HTTP transactions and a wide variety of modules and interfaces to connect those transactions to programs. Developers can create logic that manages a much more complex flow of information than just reading pages, and they can use the development environment of their choice alongside Apache's services for HTTP, security, and other web-specific aspects of application design. Everything from simple inclusion of changing information to sophisticated integration of different environments and applications is possible.
In publishing a site, we've been focusing on only one method of the HTTP protocol, GET. Apache's basic handling of GET is more than adequate for sites that just need to publish information from files, but HTTP (and Apache) can support a much wider range of options. Developers who want to create interactive sites will have to write some programs to supply the basic logic. However, many useful tasks are simple to create, and Apache is quite capable of supporting much more complex applications, including applications that connect to databases or other information sources.
Every HTTP request must specify a method. This tells the server how to handle the incoming data. For a complete account, see the HTTP 1.1 specification (http://www.w3.org/Protocols/rfc2616/rfc2616.html). Briefly, however, the methods are as follows:
GET
Returns the data asked for. To save network traffic, a "conditional GET" only returns the data if a condition is satisfied. For instance, a client may already hold a copy of a page that alters frequently. When the client asks for the page again and it hasn't changed since last time, the conditional GET generates a response telling the client to get it from its local cache. (GET may also include extra path information, as well as a query string with information an application needs to process.)
HEAD
Returns the headers that a GET would have included, but without data. It can be used to test the freshness of the client's cache without the bandwidth expense of retrieving the whole document.
POST
Tells the server to accept the data and do something with it, using the resource identified by the URL. (Often this will be the ACTION field from an HTML form, but in principle at least, it could be generated other ways.) For instance, when you buy a book across the Web, you fill in a form with the book's title, your credit card number, and so on. Your browser will then POST this data to the server.
PUT
Tells the server to store the data.
DELETE
Tells the server to delete the data.
TRACE
Tells the server to return a diagnostic trace of the actions it takes.
CONNECT
Used to ask a proxy to make a connection to another host and simply relay the content, rather than attempting to parse or cache it. This is often used to make SSL connections through a proxy.
Note that servers do not have to implement all these methods. See the HTTP 1.1 specification (RFC 2616) for more detail. The most commonly used methods are GET and POST, which handle the bulk of interactions with users.
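The conditional GET described above can be sketched in a few lines of Python. This is only an illustration of the decision a server makes, not Apache's own code: the function name respond_to_get and the placeholder page body are our inventions, and we assume the client states the age of its cached copy in an If-Modified-Since header.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime

PAGE = b"<html>...catalog page...</html>"  # placeholder body, for illustration


def respond_to_get(last_modified, if_modified_since=None):
    """Return (status, headers, body) for a GET on a single resource.

    last_modified: timezone-aware UTC datetime of the resource's last change.
    if_modified_since: raw If-Modified-Since header value, or None.
    """
    # HTTP dates carry whole seconds only, so compare at that resolution.
    last_modified = last_modified.replace(microsecond=0)
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}
    if if_modified_since is not None:
        try:
            cached_at = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            cached_at = None  # unparseable header: send the full page
        if cached_at is not None:
            if cached_at.tzinfo is None:
                cached_at = cached_at.replace(tzinfo=timezone.utc)
            if last_modified <= cached_at:
                # Unchanged since the client's copy: status line and headers only.
                return 304, headers, b""
    return 200, headers, PAGE


if __name__ == "__main__":
    changed = datetime(2002, 6, 1, 12, 0, 0, tzinfo=timezone.utc)
    print(respond_to_get(changed)[0])                                   # 200
    print(respond_to_get(changed, "Sat, 01 Jun 2002 12:00:00 GMT")[0])  # 304
```

A 304 response carries no body, which is where the saving in network traffic comes from.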
Forms are the most common type of interaction between users and web applications, providing a much wider set of possibilities for user input than simple hypertext linking. HTML provides a set of components for collecting information from users, which HTTP then transmits to the server using your choice of methods. On the server side, your application processes the information sent from the form and generally replies to the user as you deem appropriate.
Creating the form is a simple matter of editing our original brochure to turn it into a form. We have to resist the temptation to fool around, making our script more and more beautiful. We just want to add four fields to capture the number of copies of each card the customer wants and, at the bottom, a field for the credit card number.
The catalog, now a form with the new lines marked:
<!-- NEW LINE - <explanation> -->
looks like this:
<html>
<body>
<FORM METHOD="POST" ACTION="cgi-bin/mycgi.cgi"> <!-- see text -->
<h1> Welcome to Butterthlies Inc</h1>
<h2>Summer Catalog</h2>
<p> All our cards are available in packs of 20 at $2 a pack.
There is a 10% discount if you order more than 100.
</p>
<hr>
<p>
Style 2315
<p align="center">
<img src="bench.jpg" alt="Picture of a bench">
<p align="center">
Be BOLD on the bench
<p>How many packs of 20 do you want? <INPUT NAME="2315_order" > <!-- new line -->
<hr>
<p>
Style 2316
<p align="center">
<img src="hen.jpg" alt="Picture of a hencoop like a pagoda">
<p align="center">
Get SCRAMBLED in the henhouse
<p>How many packs of 20 do you want? <INPUT NAME="2316_order" >
<HR>
<p>
Style 2317
<p align="center">
<img src="tree.jpg" alt="Very nice picture of tree">
<p align="center">
Get HIGH in the treehouse
<p>How many packs of 20 do you want? <INPUT NAME="2317_order"> <!-- new line -->
<hr>
<p>
Style 2318
<p align="center">
<img src="bath.jpg" alt="Rather puzzling picture of a bathtub">
<p align="center">
Get DIRTY in the bath
<p>How many packs of 20 do you want? <INPUT NAME="2318_order"> <!-- new line -->
<hr>
<p> Which Credit Card are you using?
<ol>
<li>Access <INPUT NAME="card_type" TYPE="checkbox" VALUE="Access">
<li>Amex <INPUT NAME="card_type" TYPE="checkbox" VALUE="Amex">
<li>MasterCard <INPUT NAME="card_type" TYPE="checkbox" VALUE="MasterCard">
</ol>
<p>Your card number? <INPUT NAME="card_num" SIZE=20> <!-- new line -->
<hr>
<p align=right>
Postcards designed by Harriet@alart.demon.co.uk
<hr>
<br>
Butterthlies Inc, Hopeful City, Nevada, 99999
</br>
<p><INPUT TYPE="submit"><INPUT TYPE="reset"> <!-- new line -->
</FORM>
</body>
</html>
This is all pretty straightforward stuff, except perhaps for the line:
<FORM METHOD="POST" ACTION="cgi-bin/mycgi.cgi">
which on Windows might look like this:
<FORM METHOD="POST" ACTION="mycgi.bat">
The tag <FORM> introduces the form; at the bottom, </FORM> ends it. The METHOD attribute tells the browser how to send the data to the server, and therefore how Apache will hand it to the CGI script we are going to write, in this case using POST.
In the Unix case, the ACTION attribute names the URL to which the data is sent, cgi-bin/mycgi.cgi (which the server may internally expand to /usr/www/cgi-bin/mycgi.cgi, depending on server configuration). The script at that URL then does something about it all.
It would be good if we wrote perfect HTML, which this is not. Although most browsers allow some slack in the syntax, they don't all allow the same slack in the same places. If you write HTML that deviates from the standard, you have to expect that your pages will behave oddly somewhere, sometime. To make sure you have not done so, you can submit your pages to a validator (for instance, http://validator.w3.org).
For more information on the many HTML features used to create forms, see HTML & XHTML: The Definitive Guide by Chuck Musciano and Bill Kennedy (O'Reilly, 2002).
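To make the round trip concrete, here is a rough sketch of what a script at the receiving end might do with the chunk of data the form POSTs back. We assume the browser sends the fields in the usual URL-encoded form (name=value pairs joined by ampersands); the function name read_order and the sample card number are made up for illustration, and a real script would need far stricter checking.

```python
from urllib.parse import parse_qs


def read_order(body):
    """Decode a URL-encoded POST body into pack counts and card details."""
    # keep_blank_values=True so that boxes left empty still show up.
    fields = parse_qs(body, keep_blank_values=True)
    packs = {}
    for name, values in fields.items():
        if name.endswith("_order"):
            text = values[0].strip()
            style = name[: -len("_order")]
            # An empty or non-numeric box is treated as "none wanted".
            packs[style] = int(text) if text.isdecimal() else 0
    return {
        "packs": packs,
        # Checkboxes sharing one name can arrive more than once.
        "card_types": fields.get("card_type", []),
        "card_num": fields.get("card_num", [""])[0],
    }


if __name__ == "__main__":
    sample = ("2315_order=3&2316_order=&2317_order=12&2318_order="
              "&card_type=Amex&card_num=1234+5678")
    print(read_order(sample))
```

Note that the three card_type checkboxes share a single name, so nothing stops a customer from ticking more than one; the sketch keeps them all in a list rather than guessing which was meant.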
While HTML forms are likely the most common use for application logic on web servers, there are many other cases where users interact with applications without necessarily filling out forms. Large sites often use content-management systems to store the information the site presents in databases, generating content regularly even though it may look to users exactly like an ordinary site with static files. Even smaller sites may use tools like Cocoon (discussed in Chapter 19) to manage and generate content for users.
Many sites create customized experiences for their users, making suggestions based on prior visits to the site or information users have provided previously. These sites typically use "cookies," a mechanism that lets sites store a tiny amount of information on the user's computer and that the browser will report each time the user visits the site. Cookies may last for a single session, expiring when the user quits the browser, or they may last longer, expiring at some preset date. Cookies raise a number of privacy issues, but are frequently used in applications that interact with users over more than a single transaction. Using mechanisms like this, a web site might in fact generate every page a user sees, customizing the entire site.
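The two lifetimes of cookie mentioned above can be sketched with Python's standard http.cookies module. The cookie names basket and visitor, and the 30-day lifetime, are our own choices for illustration; here the longer-lived cookie is given a maximum age rather than a fixed expiry date, which amounts to much the same thing.

```python
from http.cookies import SimpleCookie


def make_cookies(visitor_id):
    """Build one session cookie and one longer-lived cookie."""
    jar = SimpleCookie()
    # No lifetime given: the browser drops this when the user quits.
    jar["basket"] = "2315"
    # A lifetime in seconds: the browser keeps this for about 30 days.
    jar["visitor"] = visitor_id
    jar["visitor"]["max-age"] = 30 * 24 * 3600
    jar["visitor"]["path"] = "/"
    return jar


def read_cookies(cookie_header):
    """Turn the Cookie header a browser sends back into a plain dict."""
    jar = SimpleCookie()
    jar.load(cookie_header)
    return {name: morsel.value for name, morsel in jar.items()}


if __name__ == "__main__":
    # Each line printed is a Set-Cookie header for the response.
    print(make_cookies("abc123").output())
    print(read_cookies("basket=2315; visitor=abc123"))
```

The server only ever sees what the browser chooses to send back, so anything stored in a cookie should be treated as untrusted input.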
Building complex web applications is well beyond the scope of this book, which focuses on the Apache server you would use as their foundation. For more on web-application design in general, see Information Architecture for the World Wide Web by Louis Rosenfeld and Peter Morville (O'Reilly, 2002). For more on application design in specific environments, see the books referenced in the environment-specific chapters.
While you could write Apache modules that provide the logic for your applications, most developers find it much easier to use scripting languages and integrate them with Apache using modules others have already written. Ultimately, all any computer language can do is to make the CPU compare, add, subtract, multiply, and divide bytes. An important point about scripting languages is that they should run without modification on as many platforms as possible, so that your site can move from machine to machine. On the other hand, if you are a beginner and know someone who can help with one particular language, then that one might be the best choice. We devote a chapter to installing support for each of the major languages and run over the main possibilities here.
The discussion of computer languages is made rather difficult by the fact that human beings fall into two classes: those who love some particular language and those who don't. Naturally, the people who discuss languages fall into the first class; many of the people who read books like this in the hope of doing something useful with a computer tend more towards the second. The authors regard computer languages as a necessary evil. Languages all have their quirks, ranging from the mildly amusing to pleasures comparable to gargling battery acid. We would like enthusiasts for each of these languages to know that our comments on the others have reduced those enthusiasts to fury as well.
Server-side includes are more of a means of avoiding scripting languages than a proper scripting language. If your needs are very limited, you may also find that the basic functionality this tool provides can solve a number of content issues, and it may also prove useful in combination with other approaches. Server-side includes are covered in Chapter 14.
Another approach to the problem of orchestrating HTML with CGI scripts, databases, and Apache is PHP. Someone who is completely new to programming of any sort might do best to start with PHP, which extends HTML, and one has to learn HTML anyway.
Instead of writing CGI scripts in a language like Perl or Java, which then run in interaction with Apache and generate HTML pages to be sent to the client, PHP's strategy is to embed itself into the HTML. The author then writes HTML with embedded commands, which are interpreted by the PHP package as the page is served up. For instance, you could include the line:
Hello world!<BR>
in your HTML. Or, you could have the PHP statement:
<?php print "Hello world!<BR>";?>
which would produce exactly the same effect. The <?php ... ?> construction embeds PHP commands within standard HTML. PHP has resources to interact with databases and do most things that other scripting languages do.
The syntax of PHP is based on that of C with bits of Perl. The main problem with learning a new programming language is unlearning irrelevant bits of the ones you already know. So if you have no programming experience to confuse you, PHP may be as good a place to start as any. Its promoters claim that over a million web sites use it, so you will not be the first.
Also, since it was designed for its web function from the start, it avoids a lot of the bodging that has proven necessary to get Perl to work properly in a web environment. On the other hand, it is relatively new and has not accumulated the wealth of prewritten modules that fill the Comprehensive Perl Archive Network (CPAN) library (see http://www.cpan.org).
For example, one of us (PL) was creating a web site that offered a full-text search on a medical encyclopedia. The problem with text searching is that the visitor looks for "operation," but the text talks about "operated on," "operating theater," etc. The answer is to work back to the word stem, and there are several Perl modules in CPAN that strip the endings from English words to get, for instance, the stem "operat" from "operation," the word the enquirer entered. If one wanted to go further and parse English sentences into their parts of speech, modules to do that exist as well. But they might not exist for PHP and it might be hard to create them on your own. An early decision to take the simple route might prove expensive later on.
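To show the idea of working back to the word stem, here is a deliberately crude sketch in Python. It is not one of the CPAN modules mentioned above, nor a proper stemming algorithm: the short suffix list and the function names crude_stem and matches are our own inventions, and real English needs many more rules than this.

```python
# A tiny, hand-picked suffix list; checked in this order.
SUFFIXES = ("ing", "ion", "ed", "es", "s")


def crude_stem(word, min_stem=3):
    """Strip one common ending, leaving at least min_stem letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def matches(query, text):
    """True if any word in text shares a crude stem with the query word."""
    target = crude_stem(query)
    return any(crude_stem(w.strip('.,;:"!?')) == target for w in text.split())


if __name__ == "__main__":
    print(crude_stem("operation"))                             # operat
    print(matches("operation", "The patient was operated on."))  # True
    print(matches("operation", "a quiet ward"))                 # False
```

Even this toy version finds "operated on" and "operating theater" for a visitor who typed "operation," which is the whole point; the hard part, and the reason prewritten modules are valuable, is the long tail of exceptions.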
PHP installation is covered in Chapter 15.
Perl, on the other hand, is an effective but annoyingly idiosyncratic language that has not been designed along sound theoretical lines. However, it has been around since 1987, has had many tiresome features ironed out of it, and has accumulated an enormous body of enthusiasts and supporting software in the CPAN archive. Its star feature is its regular expression tool for parsing lines of text. When one is programming for the Web, this is constantly in use to dissect URLs and strip meaning out of the returns from HTML forms. Perl also has a construct called an "associative array," which gives names to the array elements. This can be very useful, but its syntax can also be very complicated and mind-bending.
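The same two tools exist in most scripting languages, and a sketch in Python may make their use on the Web clearer without the Perl syntax getting in the way. Here a regular expression dissects a URL, and a dictionary (Python's counterpart to the associative array) holds the named values. The URL layout and the function name dissect are assumptions made for the example, and no decoding of escaped characters is attempted.

```python
import re

# Assumed layout: /cgi-bin/<script>[/<extra path>][?<name>=<value>&...]
URL_PATTERN = re.compile(
    r"^(?P<script>/cgi-bin/[^/?]+)"   # the script itself
    r"(?P<path_info>/[^?]*)?"         # optional extra path information
    r"(?:\?(?P<query>.*))?$"          # optional query string
)


def dissect(url):
    """Split a CGI-style URL into script, extra path, and named fields."""
    match = URL_PATTERN.match(url)
    if match is None:
        return None
    fields = {}  # name -> value, like a Perl associative array
    for pair in (match.group("query") or "").split("&"):
        if "=" in pair:
            name, value = pair.split("=", 1)
            fields[name] = value
    return {
        "script": match.group("script"),
        "path_info": match.group("path_info") or "",
        "fields": fields,
    }


if __name__ == "__main__":
    print(dissect("/cgi-bin/mycgi.cgi/summer?style=2315&packs=3"))
```

Looking a value up by name, as in fields["style"], is exactly the convenience the associative array provides.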
Perhaps the most serious defect of Perl is its absence of variable declaration. You can make up variable names on the fly (usually by mistyping or misthinking): Perl will create them and reference them, even if they are wrong and should not exist. This problem can be mitigated, however, with the use of the -w command line flag, as well as the following:
use strict;
within the scripts.
Anyone who writes Perl needs the "Camel Book" from O'Reilly & Associates. For all its occasional jokes, this is a fairly heavyweight book that is not meant to guide novices' first steps. Sriram Srinivasan's Advanced Perl Programming (O'Reilly, 1997) is also useful. If you are a complete newcomer to programming (and we all were once) you might like to look at Perl for Web Site Management by John Callender (O'Reilly, 2001) or Learning Perl by Randal L. Schwartz and Tom Phoenix (O'Reilly, 2001).
The use of Perl in CGI applications is covered in Chapter 16, while mod_perl is covered in Chapter 17.
Java is a more "proper" (and compiled) programming language, but it is newish. In the Apache world, server-side Java is now available through Tomcat. See Chapter 18. Whether you choose Java over Perl, Python, or PHP probably depends on what you think of Java. As President Lincoln once famously said: "People who like this sort of thing will find this the sort of thing they like." But it is the strongly held, if possibly cranky, view of at least one of us (PL) that a lot of what is wrong with the Web is due to Java. Java makes it possible for web creators to invest their energies in an interestingly complicated medium that allows them to make pages that judder, vibrate, bounce, flash, dissolve, and swim about... By the time a programmer has mastered Java and all its distracting tricks, it is probably far too late to suggest that what the viewer really wants is static information in lucidly laid out words and pictures, for which Perl or PHP are perfectly adequate and much easier to use.
As we went to press with this edition, it became plain that this Luddite view might have other supporters. Velocity, seemingly yet another page-authoring language, but one written in Java so that you can mess with its innards, was announced:
Velocity is a Java-based template engine. It permits web page designers to use simple yet powerful template language to reference objects defined in Java code. Web designers can work in parallel with Java programmers to develop web sites according to the Model-View-Controller (MVC) model, meaning that web page designers can focus solely on creating a site that looks good, and programmers can focus solely on writing top-notch code. Velocity separates Java code from the web pages, making the web site more maintainable over the long run and providing a viable alternative to Java Server Pages (JSPs) or PHP.
The curious will find Velocity at http://jakarta.apache.org/velocity/.
In addition to these stylistic reservations about Java as a creative medium, we felt that Tomcat showed several symptoms of being an over-complicated project, which is as yet in an early stage of development. There seemed to be a lot of loose ends and many ways of getting things wrong. Certainly, we struggled over the interface between Tomcat and Apache for several months without success. Each time we returned to the problem, a new release of Tomcat had changed a lot of the ground rules. But in the end we succeeded, though we had to hack both Apache and Tomcat to make it work.
Using Java with Apache is covered in Chapter 18.
Python is fairly similar to Perl: less well known but also less idiosyncratic. It is also a scripting language, but one that has been properly written along sound academic lines (not necessarily a bad thing) and is easy to learn.
Extensible Markup Language (XML) has taken off in the last few years as a generic format for storing information. XML looks much like HTML, with a similar combination of elements and attributes for marking up text, but it lets developers create their own vocabularies. Some XML is shared directly over the Web; some XML is used by web services applications; and some XML is used as a foundation for web sites that need to present information in multiple forms. Serving XML documents is just like serving any other files in Apache, requiring only putting the files up and setting a MIME type identifier for them. Web services generally require the installation of modules specific to a particular web-service protocol, which then act as a gateway between the web server and application logic elsewhere on the computer.
The last option, using XML as a foundation for information the Apache server needs to be able to present in multiple forms, is growing more common and fits well in more typical web-server applications. In this case, XML typically provides a format for storing information separate from its presentation details. When the Apache server gets a request for a particular file, say in HTML, it passes it to a tool that deals with the XML. That tool typically loads the XML document, generates a file in the format requested, and passes it back to Apache, which then transmits it to the user. (The XML processor may pull the file from a cache if the file has been requested previously.) If a site is only serving up HTML files, all this extra work is probably unnecessary, but sites that provide HTML, PDF, WML (Wireless Markup Language), and plain-text versions of the same content will likely find this approach very useful. Even sites that offer multiple HTML renditions of the same information may find this approach easier than managing multiple files.
Most commonly, the transformation between the original XML document and the result the user wants is defined using Extensible Stylesheet Language Transformations (XSLT). Developers use XSLT to create templates that define the production of result documents from original XML documents, and these templates can generally be applied to many originals to produce many results.
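The principle of one XML original and several results can be sketched without any XSLT at all. In the Python below, ordinary functions stand in for the stylesheet templates, so this is an illustration of the idea rather than of XSLT itself. The little card vocabulary is invented for the example, borrowing the details of Style 2315 from our catalog.

```python
import xml.etree.ElementTree as ET
from html import escape

# A made-up XML vocabulary holding the content, with no presentation details.
CARD_XML = """<card style="2315">
  <slogan>Be BOLD on the bench</slogan>
  <picture src="bench.jpg"/>
</card>"""


def to_html(card):
    """Render the card for a browser."""
    return '<h2>Style {}</h2>\n<img src="{}">\n<p>{}</p>'.format(
        escape(card.get("style")),
        escape(card.find("picture").get("src")),
        escape(card.findtext("slogan")),
    )


def to_text(card):
    """Render the same card as plain text."""
    return "Style {}: {}".format(card.get("style"), card.findtext("slogan"))


# One renderer per output format; each plays the part of a template.
RENDERERS = {"html": to_html, "txt": to_text}


def render(xml_source, fmt):
    """Load the XML original and produce the requested rendition."""
    return RENDERERS[fmt](ET.fromstring(xml_source))


if __name__ == "__main__":
    print(render(CARD_XML, "html"))
    print(render(CARD_XML, "txt"))
```

Adding another output format means adding another renderer, while the original document stays untouched. That separation is what the XSLT-based tools provide on a larger scale, with caching added.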
Making this work on Apache requires adding some parts that support XSLT and manage the caching process. Chapter 19 will explore Cocoon, a Java-based sub-project of the Apache Project that is widely used for this work. Perl devotees may want to explore AxKit, another Apache project that does similar work in Perl. (For a complete list of XML-related projects at Apache, visit http://xml.apache.org/.)
XML and XSLT are subjects that go well beyond the scope of this book. Chapter 19 will provide a brief introduction, but you may also want to explore Learning XML by Erik Ray (O'Reilly, 2001), XSLT by Doug Tidwell (O'Reilly, 2001), and XML in a Nutshell by Elliotte Rusty Harold and Scott Means (O'Reilly, 2002).
Note on the "Camel Book": Wall, Larry, Jon Orwant, and Tom Christiansen. Programming Perl (O'Reilly, 2000).
Note on "newish": "New" is a bad four-letter word in computing.