More Practical Examples | Network Programming with Perl

	Network Programming with Perl, by Lincoln D. Stein

	Chapter 5. The IO::Socket API


We'll now look at additional examples that illustrate important aspects of the IO::Socket API. The first is a rewritten and improved version of the reverse echo server from Chapter 4. The second is a simple Web client.

Reverse Echo Server Revisited

We'll now rewrite the reverse echo server in Figure 4.2. In addition to being more elegant than the earlier version (and easier to follow, in my opinion), this version makes several improvements on the original. It replaces the dangerous signal handler with a handler that simply sets a flag and returns. This avoids problems stemming from making I/O calls within the handler and problems with exit() on Windows platforms. The second improvement is that the server resolves the names of incoming connections, printing the remote hostname and port number to standard error. Finally, we will observe the Internet convention that servers use CRLF sequences for line endings. This means that we will set $/ to CRLF and append CRLF to the end of all our writes. Figure 5.4 lists the code for the improved server.

Figure 5.4. The reverse echo server, using IO::Socket


Lines 1-7: Initialize script. We turn on strict syntax checking and load the IO::Socket module. We import the default constants and the newline-related constants by importing the tags :DEFAULT and :crlf.

We define our local port as a constant, and initialize the byte counters for tracking statistics. We also set the global $/ variable to CRLF in accordance with the network convention.
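As a rough illustration of this initialization, here is a minimal sketch. It is not the Figure 5.4 listing: the constant name MY_ECHO_PORT and the counter names $bytes_out and $bytes_in are assumptions, and port 2007 is taken from the sample session shown later.

```perl
use strict;
use warnings;
use IO::Socket qw(:DEFAULT :crlf);   # :crlf brings in CRLF and $CRLF

use constant MY_ECHO_PORT => 2007;   # assumed constant name
my ($bytes_out, $bytes_in) = (0, 0); # assumed counter names

$/ = CRLF;   # <> and chomp() now treat CRLF as the line terminator

# With $/ set this way, chomp() strips both characters at once.
my $line = "hello" . CRLF;
chomp $line;
print length($line), "\n";
```

Because $/ is global, every later line-oriented read in the script sees the change.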

Lines 8-9: Install INT signal handler. We install a signal handler for the INT signal, so that the server will shut down gracefully when the user presses the interrupt key. The improved handler simply sets the flag named $quit to true.
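A flag-setting handler of this kind can be sketched as follows. To make the example run on its own, the script sends INT to itself instead of waiting for a keypress; that part is purely for demonstration and assumes a UNIX-like system.

```perl
use strict;
use warnings;

my $quit = 0;
$SIG{INT} = sub { $quit++ };   # no I/O and no exit(): just record the request

# Demonstration only: stand in for the user pressing the interrupt key.
kill INT => $$;

# Give the signal a moment to be delivered, without waiting forever.
for (1 .. 50) {
    last if $quit;
    select(undef, undef, undef, 0.1);
}
print "quit flag is ", ($quit ? "set" : "not set"), "\n";
```

The main loop of a server can then test $quit at a convenient point and shut down in an orderly way.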

Lines 10-15: Create the socket object. We recover the port number from the command line or, if no port number is provided, we default to the hard-coded constant. We now call IO::Socket::INET->new() with arguments that cause it to create a listening socket bound to the specified local port. Other arguments set the SO_REUSEADDR option to true, and specify a 1-hour timeout (60*60 seconds) for the accept() operation.

The Timeout parameter makes each call to the accept() method return undef if an incoming connection has not been received within the specified time. However, we activate this feature not for its own sake, but because it changes the behavior of accept() so that it is not automatically restarted after being interrupted by a signal. This allows us to interrupt the server with the ^C key without bothering to wrap accept() in an eval{} block (see Chapter 2, Timing Out Slow System Calls).
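The effect of Timeout on accept() can be seen in a small sketch. The values here are assumptions chosen so that the example finishes quickly: it binds to the loopback address, lets the operating system pick a free port (LocalPort 0), and uses a 1-second timeout where the server in the text uses 60*60.

```perl
use strict;
use warnings;
use IO::Socket qw(:DEFAULT :crlf);

my $sock = IO::Socket::INET->new(
    Proto     => 'tcp',
    Listen    => 20,
    LocalAddr => '127.0.0.1',   # demonstration only: loopback
    LocalPort => 0,             # demonstration only: any free port
    Reuse     => 1,             # sets SO_REUSEADDR
    Timeout   => 1,             # the text uses 60*60
) or die "Can't create listening socket: $@\n";

# Nobody connects, so accept() gives up after about a second.
my $session = $sock->accept;
print "accept() returned ",
      (defined $session ? "a connected socket" : "undef"), "\n";
```

A server loop treats that undef as a cue to go around again and re-check its quit flag.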

Lines 16-31: Enter main loop. After printing a status message, we enter a loop that continues until the INT interrupt handler has set $quit to true. Each time through the loop, we call the socket's accept() method. If the accept() method completes without being interrupted by a signal or timing out on its own, it returns a new connected socket object, which we store in a variable named $session. Otherwise, accept() returns undef, in which case we go back to the beginning of the loop. This gives us a chance to test whether the interrupt handler has set $quit to true.

Lines 19-21: Get remote host's name and port. We call the connected socket's peeraddr() method to get the packed IP address at the other end of the connection, and attempt to translate it into a hostname using gethostbyaddr(). If this fails, it returns undef, and we call the peerhost() method to give us the remote host's address in dotted-quad form.

We get the remote host's port number using peerport(), and print the address and port number to standard error.
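The lookup-with-fallback idiom might be written as in the sketch below. So that there is a peer to look up, the example connects to its own listening socket over the loopback interface; the variable names $listen and $client are assumptions made for the demonstration.

```perl
use strict;
use warnings;
use IO::Socket qw(:DEFAULT :crlf);

# Demonstration scaffolding: a loopback listener and a client of our own.
my $listen = IO::Socket::INET->new(
    Proto => 'tcp', Listen => 5, LocalAddr => '127.0.0.1', LocalPort => 0,
) or die "Can't listen: $@\n";
my $client = IO::Socket::INET->new(
    PeerAddr => '127.0.0.1', PeerPort => $listen->sockport,
) or die "Can't connect: $@\n";
my $session = $listen->accept or die "accept failed: $!\n";

# Try for a hostname; fall back to the dotted-quad address.
my $peer = gethostbyaddr($session->peeraddr, AF_INET) || $session->peerhost;
my $port = $session->peerport;
warn "Connection from [$peer,$port]\n";
```

On most systems the loopback address resolves to localhost, which gives a message in the same form as the sample session.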

Lines 22-30: Handle the connection. We read lines from the connected socket, reverse them, and print them back out to the socket, keeping track of the number of bytes sent and received while we do so. The only change from the earlier example is that we now terminate each line with CRLF.

When the remote host is done, we get an EOF when we try to read from the connected socket. We print out a warning, close the connected socket, and go back to the top of the loop to accept() again.
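The read, reverse, and write cycle can be sketched as a small subroutine. The name handle_session() and its two-filehandle interface are assumptions made so that the example can run against in-memory filehandles; in a server, the connected socket would be passed for both arguments.

```perl
use strict;
use warnings;
use IO::Socket qw(:DEFAULT :crlf);

# handle_session() is a hypothetical helper, not part of the original script.
sub handle_session {
    my ($in, $out) = @_;
    local $/ = CRLF;                    # lines arrive CRLF-terminated
    my ($bytes_in, $bytes_out) = (0, 0);
    while (my $line = <$in>) {
        $bytes_in += length $line;
        chomp $line;                    # strip the CRLF
        my $reply = scalar(reverse $line) . CRLF;
        print {$out} $reply;
        $bytes_out += length $reply;
    }
    return ($bytes_in, $bytes_out);     # EOF: the remote side is done
}

# Demonstration with in-memory filehandles standing in for the socket.
my $input  = "hello" . CRLF . "world" . CRLF;
my $output = '';
open my $in,  '<', \$input  or die "open: $!";
open my $out, '>', \$output or die "open: $!";
my ($received, $sent) = handle_session($in, $out);
close $out;
print "bytes_sent = $sent, bytes_received = $received\n";
```

Note the explicit scalar(): in list context reverse() would reverse a list of one element and leave the string unchanged.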

When we run the script, it acts like the earlier version did, but the status messages give hostnames rather than dotted-IP addresses.

 % tcp_echo_serv2.pl
 waiting for incoming connections on port 2007...
 Connection from [localhost,2895]
 Connection from [localhost,2895] finished
 Connection from [formaggio.cshl.org,12833]
 Connection from [formaggio.cshl.org,12833] finished
 ^C
 bytes_sent = 50, bytes_received = 50

A Web Client

In this section, we develop a tiny Web client named web_fetch.pl. It reads a Uniform Resource Locator (URL) from the command line, parses it, makes the request, and prints the Web server's response to standard output. Because it returns the raw response from the Web server without processing it, it is very useful for debugging misbehaving CGI (Common Gateway Interface) scripts and other types of Web-server dynamic content.

The Hypertext Transfer Protocol (HTTP) is the main protocol for Web servers. Part of the power and appeal of the protocol is its simplicity. A client wishing to fetch a document makes a TCP connection to the desired Web server, sends a brief request, and then receives the response from the server. After the document is delivered, the Web server breaks the connection. The hardest part is parsing the URL. HTTP URLs have the following general format:

 http://hostname:port/path/to/document#fragment

All HTTP URLs begin with the prefix http://. This is followed by a hostname such as www.yahoo.com, a colon, and the port number that the Web server is listening on. The colon and port may be omitted, in which case the standard server port 80 is assumed. The hostname and port are followed by the path to the desired document using UNIX-like file path conventions. The path may be followed by a "#" sign and a fragment name, which indicate a subsection in the document that the Web browser should scroll to.

Our client will parse the components of this URL into the hostname:port combination and the path. We ignore the fragment name. We then connect to the designated server using a TCP socket and send an HTTP request of this form:

 GET /path/to/document HTTP/1.0 CRLF CRLF

The request consists of the request method "GET" followed by a single space and the designated path, copied verbatim from the URL. This is followed by another space, the protocol version number HTTP/1.0, and two CRLF pairs. After the request is sent, we wait for a response from the server. A typical response looks like this:

 HTTP/1.1 200 OK
 Date: Wed, 01 Mar 2000 17:00:41 GMT
 Server: Apache/1.3.6 (UNIX)
 Last-Modified: Mon, 31 Jan 2000 04:28:15 GMT
 Connection: close
 Content-Type: text/html

 <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
 <html> <head> <title> Presto Home Page </title> </head>
 <body>
 <h1>Welcome to Presto</h1>
 ...

The response is divided into two parts: a header containing information about the returned document, and the requested document itself. The two parts are separated by a blank line formed by two CRLF pairs.

We will delve into the structure of HTTP responses in more detail in Chapter 9, where we discuss the LWP library, and Chapter 13, where we develop a more sophisticated Web client capable of retrieving multiple documents simultaneously. The only issue to worry about here is that, whereas the header is guaranteed by HTTP to consist of human-readable lines of text, each terminated by a CRLF pair, the document itself can have any format. In particular, we must be prepared to receive binary data, such as the contents of a GIF or MP3 file.

Figure 5.5 shows the web_fetch.pl script in its entirety.

Figure 5.5. The web_fetch.pl script


Lines 1-5: Initialize module. We turn on strict syntax checking and load the IO::Socket module, importing the default and newline-related constants. As in previous examples, we are dealing with CRLF-delimited data. However, in this case, we set $/ to be a pair of CRLF sequences. Later, when we call the <> operator, it will read the entire header down through the CRLF pair that terminates it.

Lines 6-8: Parse URL. We read the requested URL from the command line and parse it using a pattern match. The match returns the hostname, possibly followed by a colon and port number, and the path up through, but not including, the fragment.
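A pattern match along these lines is sketched below. The subroutine name parse_http_url() is an assumption, and the pattern is one reasonable way to capture the two parts; it is not necessarily the exact expression used in Figure 5.5.

```perl
use strict;
use warnings;

# parse_http_url() is a hypothetical helper name.
# Returns (host[:port], path), or an empty list if the URL does not match.
sub parse_http_url {
    my $url = shift;
    my ($host, $path) = $url =~ m!^http://([^/]+)(/[^#]*)!
        or return;
    return ($host, $path);
}

my ($host, $path) =
    parse_http_url('http://www.cshl.org:8080/path/to/doc.html#intro');
print "host = $host\n";   # www.cshl.org:8080
print "path = $path\n";   # /path/to/doc.html
```

This simple pattern requires at least a "/" after the hostname, so a URL such as http://www.cshl.org with no trailing slash would be rejected.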

Lines 9-10: Open socket. We open a socket connected to the remote Web server. If the URL contained the port number, it is included in the hostname passed to PeerAddr, and the PeerPort argument is ignored. Otherwise, PeerPort specifies that we should connect to the standard "http" service, port 80.
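The way a port embedded in PeerAddr takes precedence over PeerPort can be checked with a short sketch. To avoid depending on a real Web server, it connects to a listening socket of its own on the loopback interface; that listener and the variable names are assumptions for the demonstration.

```perl
use strict;
use warnings;
use IO::Socket qw(:DEFAULT :crlf);

# Demonstration scaffolding: a local listener standing in for the Web server.
my $listen = IO::Socket::INET->new(
    Proto => 'tcp', Listen => 5, LocalAddr => '127.0.0.1', LocalPort => 0,
) or die "Can't listen: $@\n";

# A "hostname:port" string, as it would come out of the URL.
my $host = '127.0.0.1:' . $listen->sockport;

# The port inside PeerAddr wins; 'http(80)' is used only when it is absent.
my $socket = IO::Socket::INET->new(
    PeerAddr => $host,
    PeerPort => 'http(80)',
) or die "Can't connect: $@\n";

print "connected to port ", $socket->peerport, "\n";
```

The 'http(80)' form asks for the service named "http" and falls back to port 80 if the name cannot be looked up.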

Line 11: Send the request. We send an HTTP request to the server using the format described earlier.
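Building the request string takes one line. In this sketch the path is a placeholder, and the string is inspected instead of being sent.

```perl
use strict;
use warnings;
use IO::Socket qw(:DEFAULT :crlf);

my $path    = '/path/to/document';               # placeholder path
my $request = "GET $path HTTP/1.0" . CRLF . CRLF;

# In the client this would be: print $socket $request;
# Here we make the line endings visible instead.
(my $visible = $request) =~ s/\015/\\r/g;
$visible =~ s/\012/\\n/g;
print "$visible\n";
```

The second CRLF produces the blank line that tells the server the request is complete.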

Lines 12-14: Read and print the header. Our first read is line-oriented. We read from the socket using the <> operator. Because $/ is set to a pair of CRLF sequences, this read grabs the entire header up through the blank line. We now print the header, but since we don't particularly want extraneous CRs to mess up the output, we first replace all occurrences of $CRLF with a logical newline ("\n", which will evaluate to whatever is the appropriate newline character for the current platform).
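The header read can be tried out without a network connection. In this sketch an in-memory filehandle holding a made-up two-line response stands in for the socket.

```perl
use strict;
use warnings;
use IO::Socket qw(:DEFAULT :crlf);

# A made-up response: two header lines, a blank line, then the document.
my $response = join(CRLF,
    'HTTP/1.1 200 OK',
    'Content-Type: text/html',
    '',
    '<html>...</html>',
);
open my $socket, '<', \$response or die "open: $!";

$/ = CRLF . CRLF;            # a "line" now ends at the blank line
my $header = <$socket>;      # the entire header in one read
$header =~ s/$CRLF/\n/g;     # CRLF pairs become logical newlines
print $header;
```

After this read the filehandle is positioned at the first byte of the document.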

Line 15: Read and print the document. Our subsequent reads are byte-oriented. We call read() in a tight loop, reading up to 1024 bytes with each operation, and immediately printing them out with print(). We exit when read() hits the EOF and returns 0.
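A byte-oriented copy loop of this kind is sketched below. In-memory filehandles stand in for the socket and for standard output, and the 3000-byte binary "document" is invented for the demonstration.

```perl
use strict;
use warnings;

# Invented binary document: 3000 bytes that are not valid text.
my $document = "\x00\xFF" x 1500;
my $copy     = '';
open my $socket, '<', \$document or die "open: $!";
open my $out,    '>', \$copy     or die "open: $!";

my ($data, $chunks) = ('', 0);
# read() returns the number of bytes read, and 0 at EOF.
while (read($socket, $data, 1024)) {
    print {$out} $data;
    $chunks++;
}
close $out;
print "copied ", length($copy), " bytes in $chunks reads\n";
```

Because read() ignores $/, it does not matter that the data contains no line terminators.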

Here is an example of what web_fetch.pl looks like when it is asked to fetch the home page for www.cshl.org:

 % web_fetch.pl http://www.cshl.org/
 HTTP/1.1 200 OK
 Server: Netscape-Enterprise/3.5.1C
 Date: Wed, 16 Aug 2000 00:46:12 GMT
 Content-type: text/html
 Last-modified: Fri, 05 May 2000 13:19:29 GMT
 Content-length: 5962
 Accept-ranges: bytes
 Connection: close

 <HTML>
 <HEAD>
 <TITLE>Cold Spring Harbor Laboratory</TITLE>
 <META NAME="GENERATOR" CONTENT="Adobe PageMill 2.0 Mac">
 <META Name="keywords" Content="DNA, genes, genetics, genome, genome
 sequencing, molecular biology, biological science, cell biology,
 James D. Watson, Jim Watson, plant genetics, plant biology,
 bioinformatics, neuroscience, neurobiology, cancer, drosophila,
 Arabidopsis, double-helix, oncogenesis, Cold Spring Harbor
 Laboratory, CSHL">
 ...

Although it seems like an accomplishment to fetch a Web page in a mere 15 lines of code, client scripts that use the LWP module can do the same thing, and more, with just a single line. We discuss how to use LWP to access the Web in Chapter 9.

Performance and Style

Although the IO::Socket API simplifies programming, and is generally a big win in terms of development effort and code maintenance, it has some drawbacks compared with the built-in function-oriented interface.

If memory usage is an issue, you should be aware that IO::Socket adds to the Perl process's memory "footprint" by a significant amount: approximately 800 K on my Linux/Intel-based system, and more than double that on a Solaris system.

The object-oriented API also slows program loading slightly. On my laptop, programs that use IO::Socket take about half a second longer to load than those that use the function-oriented Socket interface. Fortunately, the execution speed of programs using IO::Socket is not significantly different from that of programs using the classical interface. Network programs are generally limited by the speed of the network rather than the speed of processing.

Nevertheless, for the many IO::Socket methods that are thin wrappers around the corresponding system calls and do not add significant functionality, I prefer to use IO::Socket objects as plain filehandles rather than use the object-oriented syntax. For example, rather than writing:

 $socket->syswrite("A man, a plan, a canal, panama!");

I will write:

 syswrite ($socket,"A man, a plan, a canal, panama!");

This has exactly the same effect as the method call, but avoids its overhead.
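The equivalence of the two forms is easy to check. The sketch below assumes a UNIX-like system, since it uses IO::Socket->socketpair() with AF_UNIX to get a connected pair of sockets without any network setup; the variable names are invented for the example.

```perl
use strict;
use warnings;
use IO::Socket;

# A connected pair of stream sockets (UNIX-like systems only).
my ($left, $right) = IO::Socket->socketpair(AF_UNIX, SOCK_STREAM, PF_UNSPEC)
    or die "socketpair: $!\n";

$left->syswrite("A man, a plan, ");     # method call
syswrite($left, "a canal, panama!");    # built-in on the same object
close $left;                            # so the reader sees EOF

my $received = do { local $/; <$right> };   # slurp everything sent
print "$received\n";
```

Both writes arrive on the other socket as one continuous stream, with nothing to distinguish how each was issued.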

For methods that do improve on the function call, don't hesitate to use them. For example, the accept() method is an improvement over the built-in function because it returns a connected IO::Socket object rather than a plain filehandle. The method also has a syntax that is more Perl-like than the built-in.
