Fetching a URL with HTTP

   

Practical Programming in Tcl & Tk, Third Edition
By Brent B. Welch

Chapter 17.  Socket Programming


The Hypertext Transfer Protocol (HTTP) is the protocol used on the World Wide Web. This section presents a procedure to fetch pages or images from a server on the Web. Items on the Web are identified with a Uniform Resource Locator (URL) that specifies a host, port, and location on the host. The basic outline of HTTP is that a client sends a URL to a server, and the server responds with some header information and some content data. The header information describes the content, which can be hypertext, images, PostScript, and more.

Example 17-5 Opening a connection to an HTTP server.
 proc Http_Open {url} {
    global http
    # The + sits inside the port group so that all digits are captured
    if {![regexp -nocase {^(http://)?([^:/]+)(:([0-9]+))?(/.*)} \
           $url x protocol server y port path]} {
       error "bogus URL: $url"
    }
    if {[string length $port] == 0} {
       set port 80
    }
    set sock [socket $server $port]
    puts $sock "GET $path HTTP/1.0"
    puts $sock "Host: $server"
    puts $sock "User-Agent: Tcl/Tk Http_Open"
    puts $sock ""
    flush $sock
    return $sock
 }

The Http_Open procedure uses regexp to pick out the server and port from the URL. This regular expression is described in detail on page 149. The leading http:// is optional, and so is the port number. If the port is left off, then the standard port 80 is used. If the regular expression matches, then a socket command opens the network connection.
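The URL parsing step can be tried on its own, without opening a socket. The following is a minimal sketch, assuming a hypothetical helper named UrlSplit that is not part of the original examples. It uses the same style of pattern, with the port digits captured as a whole, and applies the default port of 80:

```tcl
# UrlSplit is a hypothetical helper for illustration only.
# It returns a list of server, port, and path.
proc UrlSplit {url} {
    # Groups: 1 = optional http://, 2 = server, 3 = optional :port,
    # 4 = the port digits, 5 = the path, which must start with /
    if {![regexp -nocase {^(http://)?([^:/]+)(:([0-9]+))?(/.*)$} \
            $url x protocol server y port path]} {
        error "bogus URL: $url"
    }
    if {[string length $port] == 0} {
        set port 80    ;# standard HTTP port
    }
    return [list $server $port $path]
}

puts [UrlSplit http://www.example.com:8080/index.html]
puts [UrlSplit www.example.com/]
```

The host name www.example.com is a placeholder. Note that a URL with no path at all, not even a trailing slash, does not match this pattern and raises the "bogus URL" error.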

The protocol begins with the client sending a line that identifies the command (GET), the path, and the protocol version. The path is the part of the URL after the server and port specification. The rest of the request is lines in the following format:

 key: value 

The Host header names the server the client wants to reach, which lets a single server answer for more than one server name. The User-Agent header identifies the client program, which is often a browser like Netscape Navigator or Internet Explorer. The key-value lines are terminated with a blank line. This data is flushed out of the Tcl buffering system with the flush command. The server will respond by sending the URL contents back over the socket. This is described shortly, but first we consider proxies.
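To make the request layout concrete, here is a small sketch that builds the request as a string instead of writing it to a socket. The RequestText procedure and the www.example.com host are assumptions made for illustration. They do not appear in the original examples.

```tcl
# RequestText is a hypothetical helper: it formats an HTTP/1.0 request.
proc RequestText {cmd path server {agent "Tcl/Tk Http_Open"}} {
    set lines [list "$cmd $path HTTP/1.0" \
                    "Host: $server" \
                    "User-Agent: $agent"]
    # The empty element leaves a trailing newline after the last header.
    # Together with the newline that puts adds, it forms the blank line
    # that terminates the request header.
    lappend lines ""
    return [join $lines \n]
}

puts [RequestText GET /index.html www.example.com]
```

Building the text first makes it easy to inspect what would be sent before any network connection is involved.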

Proxy Servers

A proxy is used to get through firewalls that many organizations set up to isolate their network from the Internet. The proxy accepts HTTP requests from clients inside the firewall and then forwards the requests outside the firewall. It also relays the server's response back to the client. The protocol is nearly the same when using the proxy. The difference is that the complete URL is passed to the GET command so that the proxy can locate the server. Example 17-6 uses a proxy if one is defined:

Example 17-6 Opening a connection to an HTTP server.
 # Http_Proxy sets or queries the proxy
 proc Http_Proxy {{new {}}} {
    global http
    if {[string length $new] == 0} {
       # Query: return host:port, or {} if no proxy is defined
       if {![info exists http(proxy)]} {
          return {}
       }
       return $http(proxy):$http(proxyPort)
    } else {
       regexp {^([^:]+):([0-9]+)$} $new x \
          http(proxy) http(proxyPort)
    }
 }
 proc Http_Open {url {cmd GET} {query {}}} {
    global http
    if {![regexp -nocase {^(http://)?([^:/]+)(:([0-9]+))?(/.*)} \
           $url x protocol server y port path]} {
       error "bogus URL: $url"
    }
    if {[string length $port] == 0} {
       set port 80
    }
    if {[info exists http(proxy)] &&
           [string length $http(proxy)]} {
       set sock [socket $http(proxy) $http(proxyPort)]
       puts $sock "$cmd http://$server:$port$path HTTP/1.0"
    } else {
       set sock [socket $server $port]
       puts $sock "$cmd $path HTTP/1.0"
    }
    puts $sock "User-Agent: Tcl/Tk Http_Open"
    puts $sock "Host: $server"
    if {[string length $query] > 0} {
       puts $sock "Content-Length: [string length $query]"
       puts $sock ""
       # Send exactly Content-Length bytes of query data
       puts -nonewline $sock $query
    } else {
       puts $sock ""
    }
    flush $sock
    fconfigure $sock -blocking 0
    return $sock
 }
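The difference between a direct request and a proxied request comes down to where the client connects and what goes on the request line. The sketch below isolates that decision in a hypothetical RequestTarget procedure. The host names are placeholders, and nothing here opens a connection.

```tcl
# RequestTarget is a hypothetical helper. It returns a list of
# the host to connect to, the port to connect to, and the request line.
proc RequestTarget {server port path {proxy {}} {proxyPort {}}} {
    if {[string length $proxy] > 0} {
        # Through a proxy: connect to the proxy and send the complete URL
        return [list $proxy $proxyPort \
            "GET http://$server:$port$path HTTP/1.0"]
    }
    # Direct: connect to the server and send only the path
    return [list $server $port "GET $path HTTP/1.0"]
}

puts [RequestTarget www.example.com 80 /index.html]
puts [RequestTarget www.example.com 80 /index.html proxy.example.com 8080]
```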

The HEAD Request

In Example 17-6, the Http_Open procedure takes a cmd parameter so that the user of Http_Open can perform different operations. The GET operation fetches the contents of a URL. The HEAD operation just fetches the description of a URL, which is useful to validate a URL. The POST operation transmits query data to the server (e.g., values from a form) and also fetches the contents of the URL. All of these operations follow a similar protocol. The reply from the server is a status line followed by lines that have key-value pairs. This format is similar to the client's request. The reply header is followed by content data with GET and POST operations. Example 17-7 implements the HEAD command, which does not involve any reply data:

Example 17-7 Http_Head validates a URL.
 proc Http_Head {url} {
    upvar #0 $url state
    catch {unset state}
    set state(sock) [Http_Open $url HEAD]
    fileevent $state(sock) readable [list HttpHeader $url]
    # Specify the real name, not the upvar alias, to vwait
    vwait $url\(status)
    catch {close $state(sock)}
    return $state(status)
 }
 proc HttpHeader {url} {
    upvar #0 $url state
    if {[eof $state(sock)]} {
       set state(status) eof
       close $state(sock)
       return
    }
    if {[catch {gets $state(sock) line} nbytes]} {
       set state(status) error
       lappend state(headers) [list error $nbytes]
       close $state(sock)
       return
    }
    if {$nbytes < 0} {
       # Read would block
       return
    } elseif {$nbytes == 0} {
       # Header complete
       set state(status) head
    } elseif {![info exists state(headers)]} {
       # Initial status reply from the server
       set state(headers) [list http $line]
    } else {
       # Process key-value pairs
       regexp {^([^:]+): *(.*)$} $line x key value
       lappend state(headers) [string tolower $key] $value
    }
 }

The Http_Head procedure uses Http_Open to contact the server. The HttpHeader procedure is registered as a fileevent handler to read the server's reply. A global array keeps state about each operation. The URL is used in the array name, and upvar is used to create an alias to the name (upvar is described on page 86):

 upvar #0 $url state 

You cannot use the upvar alias as the variable specified to vwait. Instead, you must use the actual name. The backslash turns off the array reference in order to pass the name of the array element to vwait, otherwise Tcl tries to reference url as an array:

 vwait $url\(status) 
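The interplay of upvar and vwait can be tried without a network connection. The following is a minimal sketch, assuming a hypothetical WaitForStatus procedure in which a timer stands in for the fileevent handler that would normally set the status:

```tcl
# WaitForStatus is a hypothetical demo. The global array is named
# after the URL, as in Http_Head.
proc WaitForStatus {url} {
    upvar #0 $url state      ;# state is a local alias for the global array
    catch {unset state}
    # Simulate an I/O handler that finishes 50 milliseconds from now.
    # The after script runs at the global level, so it uses the real name.
    after 50 [list set ${url}(status) done]
    # vwait resolves its argument as a global name, so pass the real
    # array element name and not the alias.
    vwait $url\(status)
    return $state(status)
}

puts [WaitForStatus http://www.example.com/]
```

After vwait returns, the alias can be used again to read the value, because both names refer to the same global array element.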

The HttpHeader procedure checks for special cases: end of file, an error on the gets, or a short read on a nonblocking socket. The very first reply line contains a status code from the server that is in a different format than the rest of the header lines:

 version code message

The version is the protocol version, such as HTTP/1.0. The code is a three-digit numeric code. 200 is OK. Codes in the 400s and 500s indicate an error. The codes are explained fully in RFC 1945, which specifies HTTP 1.0. The first line is saved with the key http:

 set state(headers) [list http $line] 

The rest of the header lines are parsed into key-value pairs and appended onto state(headers). This format can be used to initialize an array:

 array set header $state(headers) 
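As an illustration of this format, the sketch below runs the same parsing steps over a made-up reply header held in a list, then loads the result into an array. The sample lines and the final regexp that extracts the numeric code are assumptions for the sake of the example.

```tcl
# Hypothetical reply header lines, as gets would return them
set lines {
    "HTTP/1.0 200 OK"
    "Server: Apache/1.1.1"
    "Content-Type: text/html"
}

set headers {}
foreach line $lines {
    if {[llength $headers] == 0} {
        # The first line is the status line, saved under the key http
        set headers [list http $line]
    } elseif {[regexp {^([^:]+): *(.*)$} $line x key value]} {
        # Keys are folded to lowercase so lookups are predictable
        lappend headers [string tolower $key] $value
    }
}

# The flat key-value list initializes an array directly
array set header $headers

# Pull the three-digit code out of the status line
regexp {^HTTP/[0-9.]+ +([0-9][0-9][0-9])} $header(http) x code
puts "$code $header(content-type)"
```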

When HttpHeader gets an empty line, the header is complete and it sets the state(status) variable, which signals Http_Head. Finally, Http_Head returns the status to its caller. The complete information about the request is still in the global array named by the URL. Example 17-8 illustrates the use of Http_Head:

Example 17-8 Using Http_Head.
 set url http://www.sun.com/
 set status [Http_Head $url]
 => eof
 upvar #0 $url state
 array set info $state(headers)
 parray info
 info(http)          HTTP/1.0 200 OK
 info(server)        Apache/1.1.1
 info(last-modified) Nov ...
 info(content-type)  text/html

The GET and POST Requests

Example 17-9 shows Http_Get, which implements the GET and POST requests. The difference between these is that POST sends query data to the server after the request header. Both operations get a reply from the server that is divided into a descriptive header and the content data. The Http_Open procedure sends the request and the query, if present. The HttpHeader handler reads the reply header, and Http_Get sets up the handlers that read the content.

The descriptive header returned by the server is in the same format as the client's request. One of the key-value pairs returned by the server specifies the Content-Type of the URL. The types come from the MIME standard, which is described in RFC 1521. Typical content types are:

  • text/html - HyperText Markup Language (HTML), which is introduced in Chapter 3.

  • text/plain - plain text with no markup.

  • image/gif - image data in GIF format.

  • image/jpeg - image data in JPEG format.

  • application/postscript - a PostScript document.

  • application/x-tcl - a Tcl program! This type is discussed in Chapter 20.

Example 17-9 Http_Get fetches the contents of a URL.
 proc Http_Get {url {query {}}} {
    upvar #0 $url state        ;# Alias to global array
    catch {unset state}        ;# Aliases still valid.
    if {[string length $query] > 0} {
       set state(sock) [Http_Open $url POST $query]
    } else {
       set state(sock) [Http_Open $url GET]
    }
    set sock $state(sock)
    fileevent $sock readable [list HttpHeader $url]
    # Specify the real name, not the upvar alias, to vwait
    vwait $url\(status)
    set header(content-type) {}
    set header(http) "500 unknown error"
    array set header $state(headers)
    # Check return status.
    # 200 is OK, other codes indicate a problem.
    regsub "HTTP/1.. " $header(http) {} header(http)
    if {![string match 2* $header(http)]} {
       catch {close $sock}
       if {[info exists header(location)] &&
              [string match 3* $header(http)]} {
          # 3xx is a redirection to another URL
          set state(link) $header(location)
          return [Http_Get $header(location) $query]
       }
       return -code error $header(http)
    }
    # Set up to read the content data
    switch -glob -- $header(content-type) {
       text/*     {
          # Read HTML into memory
          fileevent $sock readable [list HttpGetText $url]
       }
       default   {
          # Copy content data to a file
          fconfigure $sock -translation binary
          set state(filename) [File_TempName http]
          if {[catch {open $state(filename) w} out]} {
             set state(status) error
             set state(error) $out
             close $sock
             return $header(content-type)
          }
          set state(fd) $out
          fcopy $sock $out -command [list HttpCopyDone $url]
       }
    }
    vwait $url\(status)
    return $header(content-type)
 }

Http_Get uses Http_Open to initiate the request, and then it looks for errors. It handles redirection errors that occur if a URL has changed. These have error codes that begin with 3. A common case of this error is when a user omits the trailing slash on a URL (e.g., http://www.scriptics.com). Most servers respond with:

 302 Document has moved
 Location: http://www.scriptics.com/

If the content-type is text, then Http_Get sets up a fileevent handler to read this data into memory. The socket is in nonblocking mode, so the read handler can read as much data as possible each time it is called. This is more efficient than using gets to read a line at a time. The text will be stored in the state(body) variable for use by the caller of Http_Get. Example 17-10 shows the HttpGetText fileevent handler:

Example 17-10 HttpGetText reads text URLs.
 proc HttpGetText {url} {
    upvar #0 $url state
    if {[eof $state(sock)]} {
       # Content complete
       set state(status) done
       close $state(sock)
    } elseif {[catch {read $state(sock)} block]} {
       set state(status) error
       lappend state(headers) [list error $block]
       close $state(sock)
    } else {
       append state(body) $block
    }
 }

The content may be in binary format. This poses a problem for Tcl 7.6 and earlier. A null character will terminate the value, so values with embedded nulls cannot be processed safely by Tcl scripts. Tcl 8.0 supports strings and variable values with arbitrary binary data. Example 17-9 uses fcopy to copy data from the socket to a file without storing it in Tcl variables. This command was introduced in Tcl 7.5 as unsupported0, and became fcopy in Tcl 8.0. It takes a callback argument that is invoked when the copy is complete. The callback gets additional arguments that are the bytes transferred and an optional error string. In this case, these arguments are added to the url argument specified in the fcopy command. Example 17-11 shows the HttpCopyDone callback:

Example 17-11 HttpCopyDone is used with fcopy.
 proc HttpCopyDone {url bytes {error {}}} {
    upvar #0 $url state
    if {[string length $error]} {
       set state(status) error
       lappend state(headers) [list error $error]
    } else {
       set state(status) ok
    }
    close $state(sock)
    close $state(fd)
 }

The user of Http_Get uses the information in the state array to determine the status of the fetch and where to find the content. There are four cases to deal with:

  • There was an error, which is indicated by the state(error) element.

  • There was a redirection, in which case, the new URL is in state(link). The client of Http_Get should change the URL and look at its state instead. You can use upvar to redefine the alias for the state array:

     upvar #0 $state(link) state

  • There was text content. The content is in state(body).

  • There was another content type that was copied to state(filename).
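A caller might fold these four cases into one helper. The following is a sketch only, assuming a hypothetical FetchResult procedure that inspects the state array after Http_Get has returned. It follows state(link) to the array for the new URL, and it does not guard against redirection loops. The state arrays below are filled in by hand to stand in for a real fetch.

```tcl
# FetchResult is a hypothetical helper. It returns a two-element list:
# the kind of result (error, text, file, or unknown) and its value.
proc FetchResult {url} {
    upvar #0 $url state
    if {[info exists state(error)]} {
        return [list error $state(error)]
    } elseif {[info exists state(link)]} {
        # Redirected: the result lives in the array named by the new URL
        return [FetchResult $state(link)]
    } elseif {[info exists state(body)]} {
        return [list text $state(body)]
    } elseif {[info exists state(filename)]} {
        return [list file $state(filename)]
    }
    return [list unknown {}]
}

# Simulated state, as it might look after a redirected fetch
array set http://old.example.com/ {link http://new.example.com/}
array set http://new.example.com/ {body <html>hello</html>}
puts [FetchResult http://old.example.com/]
```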

The fcopy Command

The fcopy command can do a complete copy in the background. It automatically sets up fileevent handlers, so you do not have to use fileevent yourself. It also manages its buffers efficiently. The general form of the command is:

 fcopy input output ?-size size? ?-command callback? 

The -command argument makes fcopy work in the background. When the copy is complete or an error occurs, the callback is invoked with one or two additional arguments: the number of bytes copied and, if an error occurred, an error string:

 fcopy $in $out -command [list CopyDone $in $out]
 proc CopyDone {in out bytes {error {}}} {
    close $in ; close $out
 }

With a background copy, the fcopy command transfers data from input until end of file or size bytes have been transferred. If no -size argument is given, then the copy goes until end of file. It is not safe to do other I/O operations with input or output during a background fcopy. If either input or output gets closed while the copy is in progress, the current copy is stopped. If the input is closed, then all data already queued for output is written out.

Without a -command argument, the fcopy command reads as much as possible depending on the blocking mode of input and the optional size parameter. Everything it reads is queued for output before fcopy returns. If output is blocking, then fcopy returns after the data is written out. If input is blocking, then fcopy can block attempting to read size bytes or until end of file.
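A background copy can be exercised with ordinary files instead of sockets. The sketch below is one way to do it, under the assumption that the current directory is writable. The file names, the copyStatus variable, and the 10,000-byte test payload are all made up for the example.

```tcl
# Hypothetical scratch files in the current directory
set src fcopy_demo_src.[pid]
set dst fcopy_demo_dst.[pid]

# Create a 10,000-byte source file
set out [open $src w]
puts -nonewline $out [string repeat "0123456789" 1000]
close $out

# The callback receives the byte count and, only on failure, an error string
proc CopyDone {in out bytes {error {}}} {
    global copyStatus
    close $in
    close $out
    if {[string length $error] > 0} {
        set copyStatus [list error $error]
    } else {
        set copyStatus [list ok $bytes]
    }
}

set in [open $src r]
set out [open $dst w]
fconfigure $in -translation binary
fconfigure $out -translation binary
fcopy $in $out -command [list CopyDone $in $out]
vwait copyStatus     ;# run the event loop until the callback fires
puts $copyStatus
file delete $src $dst
```

Because fcopy returns immediately when -command is given, the script has to enter the event loop, here with vwait, before the copy makes any progress.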


       


