7.1 How the Web Works | Mastering Perl for Bioinformatics

The Internet is short for "interconnected networks." It is a set of conventions, or protocols, with which computers and networks can intercommunicate. Its development from earlier work before 1980 allowed many different networks to join and users on many computers to communicate. This communication was originally done in several ways, such as by email (electronic mail) and by FTP (File Transfer Protocol). These methods remain very popular and widely used.

It wasn't until the early 1990s that the World Wide Web, or Web, was born as a new Internet service. The Web was based on the new Hypertext Transfer Protocol, or HTTP, and the first software to use it was in the form of programs called web browsers and web servers. Web browsers are programs that handle user requests and display results to the user; the most widely known web browsers include Internet Explorer and Netscape. Web servers are programs that accept requests from web browsers and send results back to them for display; Apache is the most widely used web server. With the development of web browsers and their ability to handle images as well as text, these new protocols sparked intense popular interest in computers. At the same time, computer costs were falling steadily, and their capabilities were growing, which made the new web protocol even more widespread.

The Web has become critical to scientific programming; in fact, it started there. The Web and its associated protocols such as HTTP were originally developed at a high-energy physics laboratory in Switzerland, CERN, and they have been heavily used in the sciences ever since. In biology, as elsewhere, the Web has become one of the principal means of communication.

7.1.1 URLs

The Web is essentially a two-part system of browsers and servers, in which browsers get results from servers and display them for the user. This type of architecture is called a client-server design, in which the client (web browser) requests service from the server (web server). Web browsers and servers are just programs that run on computers. They may both be on the same computer, or, thanks to the Internet, they may be on opposite ends of the earth.

In order for this scheme to work, the web browser has to be able to send its request to the web server. For instance, say you want to see the New York Times from your Internet Explorer web browser (or your Netscape, Mozilla, or other web browser). You have to know the location of the New York Times on the Web and type it into the space provided in your browser screen.

So, you type http://www.nytimes.com, hit the Return or Enter key on your keyboard, and the next thing you know you're reading the latest articles about human cloning and double-stranded RNA. How does this work, exactly?

The answer is really very simple. The web browser sends your request to the Internet; the actual location of the desired computer is determined, and your request is sent to the web server program on that computer. The web server handles your request and sends back a web page your browser then displays. This web page may include other URLs of specific articles. You can click on one, and the whole process is repeated, but this time your request is for a specific article, which is then returned to your computer and displayed by your web browser.

Behind this simple overall architecture are several steps. A basic familiarity with some of these steps and the associated terminology is needed in order to learn the fundamentals of web programming.

The location you typed in, http://www.nytimes.com, is called a Uniform Resource Locator or URL. The Internet (to which you must be connected for this to work, of course) takes the hostname in the URL you typed in and, with the help of a network of name servers configured for this task (the Domain Name System, or DNS), resolves it into an Internet address (IP address), which is numeric. The name you typed in has a vague resemblance to English; www.nytimes.com is translated by these name servers into the numeric IP address, which is then used to route your request to the right computer.

The details of this are not important; if you know the numeric Internet address, you can usually just type it in instead of the domain name. The advantage of this design is that it allows the name servers of the Domain Name System (DNS) to take a domain name and translate it to the correct current Internet address. Then, if the New York Times changes its main computer, or decides to move to Paris, ^[1] all that needs to be done is for the name server records to be updated with the new Internet address. You can still type www.nytimes.com and get to the proper web server without worrying about where it actually lives on the Internet.

^[1] The Paris Review moved from Paris to New York, after all.
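To make the translation from name to number concrete, here is a minimal sketch that asks your system's resolver for the numeric address of a hostname. It uses the core Socket module; the helper name resolve_host is my own invention for this illustration, and the sketch handles only the older four-part (IPv4) style of address. It defaults to the name localhost so that it runs even without a network connection.

```perl
use strict;
use warnings;
use Socket qw(inet_aton inet_ntoa);

# Hypothetical helper: return the dotted-quad IPv4 address for a
# hostname, or undef if the name cannot be resolved.
sub resolve_host {
    my ($hostname) = @_;
    my $packed = inet_aton($hostname);    # consults the system resolver
    return defined($packed) ? inet_ntoa($packed) : undef;
}

# Use the hostname given on the command line, or "localhost" by default.
my $name    = @ARGV ? $ARGV[0] : 'localhost';
my $address = resolve_host($name);

if (defined $address) {
    print "$name resolves to $address\n";
}
else {
    print "Could not resolve $name\n";
}
```

If you run it with www.nytimes.com as its argument while connected to the Internet, it prints whatever address that name currently resolves to, which may well change over time.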

A URL can have several parts:

It begins with a scheme, which is "http" in this case and specifies the protocol for the request. "http" is the most common scheme; others include "https" for increased security, "ftp" for file transfer protocol, and so on.
A colon and two forward slashes (://) separate the scheme from the hostname, which is www.nytimes.com. This is the part that's resolved into a numeric address by the name servers of the Domain Name System (DNS); it gives the address of the computer with which you actually want to communicate.
Following the hostname, several other bits of information may optionally appear in the URL, such as a particular location on the server's computer, a port number, some parameters to pass to the server, and other details that tell the server exactly what the browser is requesting.

For instance, say you want to use the web page the reporters for the New York Times use to organize their list of helpful web sites. You'd type in http://www.nytimes.com/navigator. Now, following the hostname is the additional information /navigator. This is a pathname for the particular web page you're interested in. It's just like a pathname of a file or directory on your filesystem. Sometimes it is longer, such as http://www.nytimes.com/library/tech/reference/cynavi.html, and includes several directories and subdirectories, and finally a filename (in this case, it's cynavi.html).

These pathnames are relative to the way the web server is installed and configured; they tell the web server exactly what resource is being requested by the browser.
Following the pathname may come other information. If the pathname is the name of a CGI program (discussed later in this chapter), certain arguments may be sent to that program; of course, these vary depending on the particular CGI program involved. The arguments, or query string, follow a question mark; they give desired values to parameters as name=value pairs, which are separated from each other by ampersands. A typical example might be http://www.mycomputer.com/cgi/rebase.cgi?enzyme=EcoRI&enzyme=HinDIII. This requests a web page from the web server on the computer www.mycomputer.com. The web page to be returned is generated on the fly by the CGI script on that computer in the file cgi/rebase.cgi. The URL also passes that CGI script the names of two enzymes (which the script will presumably use to formulate its reply), EcoRI and HinDIII.

Other information may appear in a URL, and other variations are possible. As one more example, if you have a web page saved on your computer in the file /home/tisdall/arabidopsis.html, you can display it by typing the following into a web browser running on the same computer: file:/home/tisdall/arabidopsis.html.

If you have to manipulate URLs in your program (and you very well may at some point), there is a collection of modules available on CPAN, URI and its older interface URI::URL, that will make your life a whole lot easier.
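To see how the parts of a URL fit together, here is a rough sketch that pulls one apart by hand. It is not how the CPAN modules work internally; it simply applies the generic pattern given in Appendix B of RFC 3986, using only core Perl. The name split_url and the keys of the hash it returns are hypothetical, and the example URL assumes the two enzyme parameters are separated by an ampersand.

```perl
use strict;
use warnings;

# Hypothetical helper: split a URL into scheme, host, port, path, and
# query parameters. The pattern follows the generic one in RFC 3986,
# Appendix B; it does no validation.
sub split_url {
    my ($url) = @_;

    my ($scheme, $authority, $path, $query) =
        $url =~ m{^(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?};

    # The authority may carry an optional ":port" after the hostname.
    my ($host, $port) = ($authority // '') =~ /^([^:]*)(?::(\d+))?$/;

    # Parameters are name=value pairs separated by "&" (or ";").
    # A name may repeat, so each name maps to a list of values.
    my %params;
    for my $pair (split /[&;]/, $query // '') {
        next unless length $pair;
        my ($name, $value) = split /=/, $pair, 2;
        for ($name, $value) {
            next unless defined;
            tr/+/ /;                                # "+" encodes a space
            s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;    # undo %XX escapes
        }
        push @{ $params{$name} }, $value // '';
    }

    return {
        scheme => $scheme,
        host   => $host,
        port   => $port,
        path   => $path,
        params => \%params,
    };
}

my $parts = split_url(
    'http://www.mycomputer.com/cgi/rebase.cgi?enzyme=EcoRI&enzyme=HinDIII');

print "scheme:  $parts->{scheme}\n";
print "host:    $parts->{host}\n";
print "path:    $parts->{path}\n";
print "enzymes: @{ $parts->{params}{enzyme} }\n";
```

Because a parameter name such as enzyme can appear more than once, the sketch stores each name's values in a list rather than overwriting earlier ones. In a real program, I'd still reach for the CPAN modules, which handle the many corner cases this pattern ignores.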

7.1.2 HTML

The Hypertext Markup Language (HTML) is the language that embellishes text so that it can be displayed in a web browser.

There are two important parts of HTML. It formats text, specifying such things as paragraphs, italics, numbered section headings, and the like. Although text is the most common type of information displayed, other types of information such as images and sound are also commonly incorporated into a document.

The other important part of HTML is that it incorporates hypertext links, which make a document interactive by giving the user viewing the document in a web browser the ability to click on links and go to other web pages.

The basic idea of HTML is to embed within a document directions for how to display the document. The directions are rather vague, compared to real typesetting tools such as FrameMaker or Quark. HTML commands may be interpreted differently by different web browsers so that your HTML document can look considerably different when viewed by different people. This limitation was a deliberate part of the design of HTML and web browsers. The disadvantage of not being able to exactly specify how a web page appears is offset by the advantages of the simplicity of HTML and the possibility to view HTML documents on a variety of computers and operating systems.

7.1.2.1 HTML web page example

To demonstrate, let's see a short example of an HTML web page: ^[2]

^[2] The Rebase web page that I'll develop in this chapter will give you a more complete example. Most web browsers allow you to see the HTML for whatever web page you're viewing by clicking on the Page Source link in the View menu of the web browser. (Your browser may use slightly different names, but all the major web browsers enable you to look at the HTML source by selecting a menu item.)

 <html>
 <head>
 <title>Double stranded RNA can regulate genes</title>
 </head>
 <body>
 <h2>Double stranded RNA can regulate genes</h2>
 <p>A recent article in <b>Nature</b> describes the important discovery of
 <i>RNA interference</i>, the action of snippets of double-stranded RNA in
 suppressing gene expression.
 </p>
 <p>
 The discovery has provided a powerful new tool in investigating gene
 function, and has raised many questions about the nature of gene regulation
 in a wide variety of organisms.
 </p>
 </body>
 </html>

This HTML, if contained in a file, can be displayed in a web browser. If the file is on the same computer as the web browser you're using, you can display it easily. If it's on a different computer, the file has to be in a place that computer's web server has been configured to look.

For instance, if the file /home/tisdall/htmlexample1.html on your computer contains the previous HTML content, you can type the URL into your web browser like so:

 file:/home/tisdall/htmlexample1.html

and the browser would display something like that in Figure 7-1.

Figure 7-1. HTML example

I say "something like that" because many of the details of exactly how the text and layout appear are left to the browser program. The browser program may be set to use different font sizes or font types, break the lines at different places, display a different colored background, and, in general, specify locally several of the formatting options for the web page to be displayed. For example, the browser window may be very small, in which case the text will be reformatted to fit as well as possible into the available window size. Still, the basic content of the text should appear similar to what is shown in Figure 7-1.

7.1.2.2 HTML directives

Let's take a look at how the HTML directives are embedded into the document.

HTML directives, called tags, are mostly specified by enclosing them in angle brackets. Most tags come in pairs, and the text between the opening and closing tag is affected. The second member of a pair has an added forward slash / before the tag name.

So, for example, to make a word italicized, you surround the word with the <i> and the </i> pair of tags. In the previous example, the term "RNA interference" is surrounded in this fashion, so it appears in italics in the browser.
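Since this is a Perl book, here is a tiny sketch of the same idea from the program's point of view. The subroutine wrap_in_tag is a hypothetical helper of my own, not part of any module; it just surrounds a piece of text with an opening and a closing tag.

```perl
use strict;
use warnings;

# Hypothetical helper: surround text with an opening tag and the
# matching closing tag (the same name preceded by a forward slash).
sub wrap_in_tag {
    my ($tag, $text) = @_;
    return "<$tag>$text</$tag>";
}

# prints: <i>RNA interference</i>
print wrap_in_tag('i', 'RNA interference'), "\n";

# Pairs can be nested; prints: <p>A recent article in <b>Nature</b></p>
print wrap_in_tag('p', 'A recent article in ' . wrap_in_tag('b', 'Nature')), "\n";
```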

The pair of tags <html> and </html> surrounds the entire document, and serves to delimit the HTML content for the web browser (or other HTML-reading program).

HTML documents have two major sections: the head and the body. The pair of tags <head> and </head> surrounds text that is related to the document as a whole. In Figure 7-1, there is only one item in the head section: a title, which is displayed in the titlebar of the web browser. The title tags <title> and </title> surround the title "Double stranded RNA can regulate genes".

The head section can contain many different kinds of directives that influence the display of a document. It is followed by the "body" of the document which is surrounded by the tags <body> and </body> and comprises the rest of the document.

The body, in this simple example, contains a heading, two paragraphs, and a few formatting directives.

Headings can be of different levels, so you can give a document a structure with primary headings and various subsections. This example specifies just a single heading, as follows:

 <h2>Double stranded RNA can regulate genes</h2>

The first paragraph makes the journal name "Nature" appear in bold font, and the new term "RNA interference" appears in italics:

 <p>A recent article in <b>Nature</b> describes the important discovery of <i>RNA interference</i>, the action of snippets of double-stranded RNA in suppressing gene expression. </p>

Notice how the paragraph tags <p> and </p> surround a paragraph. (Actually, the closing paragraph tag can be omitted as a time-saving convenience; some very common tags have this feature, but most do not.)

The second paragraph contains only text:

 <p> The discovery has provided a powerful new tool in investigating gene function, and has raised many questions about the nature of gene regulation in a wide variety of organisms. </p>

The following summary of the document highlights the major sections and omits the details within the head and the body:

 <html>
 <head>
   ...   header information goes here
 </head>
 <body>
   ...   the body of the document goes here
 </body>
 </html>
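The same outline is easy to produce from a Perl program, which is essentially what a CGI script does when it generates a page on the fly. The following is only a sketch under simple assumptions: html_page is a hypothetical subroutine of mine, and it expects the title and paragraphs to be plain text containing no markup of their own.

```perl
use strict;
use warnings;

# Hypothetical helper: build a complete HTML page from a title and a
# list of paragraphs (plain text, assumed to contain no markup).
sub html_page {
    my ($title, @paragraphs) = @_;

    # Each paragraph gets its own pair of <p> ... </p> tags.
    my $body = join '', map { "<p>$_</p>\n" } @paragraphs;

    return <<"END_HTML";
<html>
<head>
<title>$title</title>
</head>
<body>
<h2>$title</h2>
$body</body>
</html>
END_HTML
}

print html_page(
    'Double stranded RNA can regulate genes',
    'The discovery has provided a powerful new tool in investigating gene function.',
);
```

The here-document keeps the head and body sections visible in the program, so the structure of the page is easy to check at a glance.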

That's all there is to say about this simple example. Other features of HTML include embedded hyperlinks to web pages, email, and so forth. HTML has expanded in several ways over the last few years, and many more types of formatting are possible.

7.1.3 HTTP

The Web is based on a protocol called the Hypertext Transfer Protocol, or HTTP. HTTP is the protocol that web browsers and servers use to communicate.

Recall that in this chapter I'm using CGI to handle the communication between browsers and servers, and CGI can be thought of as a simplified interface to HTTP. So, it's necessary and useful to learn a few basic facts about HTTP before embarking on CGI programming.

HTTP works in a simple fashion. The browser sends a request, which is made of a header and, often, a body. The server receives the request and sends a response, which is also made of a header and, sometimes, a body.

The first line of a request header is called the request line and contains the request method, the resource being requested, and the protocol version. The request method is usually GET. This is the most common request, and it asks the web server for a specific resource, usually named by the pathname portion of a URL.

The remaining lines of the request header are called header fields and consist of name-value pairs, which include such items as the hostname the request is being sent to.
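As a rough illustration, the following sketch builds the text of a simple GET request as a string, without sending it anywhere. The subroutine name build_get_request is hypothetical, and I'm assuming HTTP version 1.1 with just two header fields, Host and Connection; a real browser sends many more.

```perl
use strict;
use warnings;

# HTTP lines end with a carriage return and a line feed.
my $CRLF = "\r\n";

# Hypothetical helper: build the text of a minimal GET request.
# The request line comes first, then the header fields, then a blank
# line marking the end of the header (a GET request has no body).
sub build_get_request {
    my ($host, $path) = @_;
    return join $CRLF,
        "GET $path HTTP/1.1",     # request line: method, resource, protocol
        "Host: $host",            # header field: the host being asked
        "Connection: close",      # header field: one request, then hang up
        '', '';                   # blank line ends the header
}

print build_get_request('www.nytimes.com', '/navigator');
```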

The reply message also has a special first line, called the status line, which reports the protocol, a numeric code representing the specific response, and a text version of the response (for example, the code 200 and the text "OK" for a successful request).

The remaining lines of the header contain name-value pairs of various other parameters. For instance, the name and version of the web server may be specified.

After the reply header may come the body of the response. This is always separated from the reply header by a blank line (actually a carriage return and line feed). If the request were for the simple web page concerning RNA interference shown earlier, the body of the reply would be exactly the HTML code for that page.
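Here is a small sketch of how a program might take such a reply apart into its status line, header fields, and body. The reply text is made up for the illustration (a real server sends more header fields), and parse_response is a hypothetical helper, not the interface of any module.

```perl
use strict;
use warnings;

my $CRLF = "\r\n";

# Hypothetical helper: split the text of an HTTP reply into its parts.
sub parse_response {
    my ($raw) = @_;

    # The first blank line separates the header from the body.
    my ($head, $body) = split /\r?\n\r?\n/, $raw, 2;

    # The first header line is the status line; the rest are fields.
    my ($status_line, @field_lines) = split /\r?\n/, $head;
    my ($protocol, $code, $reason) = split ' ', $status_line, 3;

    # Each field is a name-value pair separated by a colon.
    my %fields;
    for my $line (@field_lines) {
        my ($name, $value) = split /:\s*/, $line, 2;
        $fields{ lc $name } = $value;
    }

    return {
        protocol => $protocol,
        code     => $code,
        reason   => $reason,
        fields   => \%fields,
        body     => $body // '',
    };
}

# A made-up reply, standing in for what a server might send.
my $sample = join $CRLF,
    'HTTP/1.1 200 OK',
    'Server: Apache',
    'Content-Type: text/html',
    '',
    '<html><body><p>RNA interference</p></body></html>';

my $reply = parse_response($sample);
print "Status: $reply->{code} $reply->{reason}\n";
print "Server: $reply->{fields}{server}\n";
print "Body:   $reply->{body}\n";
```

The field names are stored in lowercase because HTTP treats them as case-insensitive.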

There are many name-value pairs I have not mentioned and many other details that can have significance within this basically simple HTTP protocol scheme. However, this overview gives you the basic idea and the essential structure of the protocol that is exchanged between the web browser and the server.