| The remainder of this chapter deals with nonblocking connects and accepts. In addition to read and write operations, sockets can block under two other circumstances: during a call to connect() when the remote host is slow to respond and during calls to accept() while waiting for incoming connections.

connect() may block indefinitely under a variety of conditions, most typically when the remote host is down or a broken router makes it unreachable. In these cases, connect() blocks indefinitely until the error is corrected. Less often, the remote server is overtaxed by incoming requests and is slow to call accept(). In both cases, you can use a nonblocking connect() to limit the time that connect() will block. In addition, you can initiate multiple connects simultaneously and handle each one as it completes.

accept() is typically used in a blocking mode by servers waiting for incoming connections. However, for servers that need to do some background processing between calls to accept(), you can use nonblocking accept() to limit the time the server spends blocked in the accept() call.

The IO::Socket Timeout Parameter

If you are just interested in timing out a connect() or accept() call after a certain period has elapsed, the object-oriented IO::Socket modules provide a simple way to do this. When you create a new IO::Socket object, you can provide it with a Timeout parameter indicating the number of seconds you are willing to block. Internally, IO::Socket uses nonblocking I/O to implement these timeouts.

For outgoing connections, the connect() occurs automatically during object creation, so in the case of a timeout, the IO::Socket new() method returns undef. The following example attempts to connect to port 80 of the host 192.168.3.1, giving it up to 10 seconds for the connect(). If the connection completes during the time frame, then the connected IO::Socket object is returned and saved in $sock.
Otherwise, we die with the error message stored in $@. For reasons that will become clear later, the error message for timeouts is "IO::Socket::INET: Operation now in progress."

    $sock = IO::Socket::INET->new(PeerAddr => '192.168.3.1:80',
                                  Timeout  => 10);
    $sock or die $@;

The timeout for accepts is applied by IO::Socket at the time that accept() is called. The following bit of code creates a listening socket with a timeout of 5 seconds and then enters a loop awaiting incoming connections. Because of the timeout, accept() waits at most 5 seconds for an incoming connection, returning either the connected socket object, if one is available, or undef. In the latter case, the loop prints a warning and returns to the top of the loop. Otherwise, it processes the connected socket as usual.

    $sock = IO::Socket::INET->new( LocalPort => 8000,
                                   Listen    => 20,
                                   Reuse     => 1,
                                   Timeout   => 5 );
    while (1) {
       my $connected = $sock->accept();
       unless ($connected) {
          warn "timeout! ($@)\n";
          next;
       }
       # otherwise process connected socket
       ...
    }

If accept() times out before returning a connection, $@ will contain "IO::Socket::INET: Operation now in progress."

Nonblocking connect()

In this section we look at how IO::Socket implements timeouts on the connect() call. This will help you understand how to use nonblocking connect() in more sophisticated applications.

To accomplish a nonblocking connect using the IO::Socket module, you need to create an IO::Socket object without allowing it to connect automatically, put it into nonblocking mode, and then make the connect() call manually.
This code fragment illustrates the idiom:

    use IO::Socket;
    use Errno qw(EWOULDBLOCK EINPROGRESS);
    use IO::Select;
    my $TIMEOUT = 10;  # ten second timeout
    my $sock = IO::Socket::INET->new(Proto => 'tcp',
                                     Type  => SOCK_STREAM) or die $@;
    $sock->blocking(0);  # nonblocking mode
    my $addr = sockaddr_in(80,inet_aton('192.168.3.1'));
    my $result = $sock->connect($addr);

Because we're going to do the connect manually, we don't pass PeerAddr or PeerHost arguments to the IO::Socket new() method, either of which would trigger a connection attempt. Instead we provide Proto and Type arguments to ensure that a TCP socket is created. If the socket was created successfully, we put it into nonblocking mode by passing a false argument to the blocking() method. We now need to connect it explicitly by calling its connect() method. Because connect() doesn't accept any of the naming shortcuts that the object-oriented new() method does, we must explicitly create a packed Internet address structure using the sockaddr_in() and inet_aton() functions discussed in Chapter 3 and pass that to connect().

Recall that connect() will return a result code indicating whether the connection was successful. In a few cases, such as when connecting to the loopback address, a nonblocking connect succeeds immediately and returns a true result. In most cases, however, the call returns false and sets $! to one of several error codes. The most likely value is EINPROGRESS, which indicates simply that the nonblocking connect is in progress and should be checked periodically for completion. However, various failure codes are also possible; ECONNREFUSED, for instance, indicates that the remote host has refused the connection.

If the connect() is immediately successful, we can proceed to use the socket without further ado. Otherwise, we check the result code.
If it is anything other than EINPROGRESS, the connect was unsuccessful and we die:

    unless ($result) { # potential failure
       die "Can't connect: $!" unless $! == EINPROGRESS;

Otherwise, if the result code indicates EINPROGRESS, the connect is still in progress. We now have to wait until the connection completes. Recall from Chapter 12 that select() will indicate that a socket is marked as writable immediately after a nonblocking connect completes. We take advantage of this feature by creating a new IO::Select object, adding the socket to it, and calling its can_write() method with a timeout. If the socket completes its connect before the timeout, can_write() returns a one-element list containing the socket. Otherwise, it returns an empty list and we die with an error message:

       my $s = IO::Select->new($sock);
       die "timeout!" unless $s->can_write($TIMEOUT);

If can_write() returns the socket, we know that the connect has completed, but we don't know whether the connection was actually successful. It is possible for a nonblocking connect to return a delayed error such as ECONNREFUSED. We can determine whether the connect was successful by calling the socket object's connected() method, which returns true if the socket is currently connected and false otherwise:

       unless ($sock->connected) {
          $! = $sock->sockopt(SO_ERROR);
          die "Can't connect: $!"
       }
    }

If the result from connected() is false, then we probably want to know why the connect failed. However, we can't simply check the contents of $!, because that will contain the error message from the most recent system call, not the delayed error. To get this information, we call the socket's sockopt() method with an argument of SO_ERROR to recover the socket's delayed error. This returns a standard numeric error code, which we assign to $!. Now when we die with an error message, the magical behavior of $!
ensures that the error code will be displayed as a human-readable message when used in a string context.

At the end of this block, we have a connected socket. We turn its blocking mode back on and proceed to work with it as usual:

    $sock->blocking(1);
    # handle IO on the socket, etc.
    ...

Figure 13.8 shows the complete code fragment in the form of a subroutine named connect_with_timeout(). You can call it like this:

    my $socket = connect_with_timeout($host,$port,$timeout);

Figure 13.8. A subroutine to connect() with a timeout

If you examine the source code for IO::Socket, you will see that a very similar technique is used to implement the Timeout option.

Multiple Simultaneous Connects

An elaboration on the idiom used to make a nonblocking connect with a timeout can be used to initiate multiple connections in parallel. This can dramatically improve the performance of certain applications.

Consider a Web browser application. When a browser fetches an HTML page, it parses the page looking for embedded images. Each image is associated with a separate URL, and each potentially lives on a different Web server, some of which may be slower to respond than others. If the client were to take the naive approach of connecting to each server individually, downloading the image, and then proceeding to the next server, the slowest server to respond would delay all subsequent operations. Instead, by initiating multiple connection attempts in parallel, the program can handle the servers in the order in which they respond. Coupled with concurrent data-transfer and page-rendering processes, this technique allows Web browsers to begin rendering the page as soon as the HTML is downloaded.

A Simple HTTP Client

To illustrate this, this section will develop a small Web client application on top of HTTP.
This is not nearly as sophisticated as the functionality provided by the LWP library (Chapter 9), but it has the ability to perform its fetches in parallel, something that LWP cannot (yet) do.

Because the client isn't fancy, it won't do any rendering or browsing, but will instead just retrieve a series of URLs specified on the command line and store copies to disk. You might use this application to mirror a set of pages locally. The program has the following structure:

1. Parse URLs specified on the command line, retrieving the hostnames and port numbers.
2. Create a set of nonblocking IO::Socket handles.
3. Initiate nonblocking connects on each of the handles and deal with any immediate errors.
4. Add each handle to an IO::Select set that will be monitored for writing, and select() across them until one or more becomes ready for writing.
5. Send the request for the appropriate Web document and add the handle to an IO::Select set that will be monitored for reading.
6. Read the document data from each of the handles in a select() loop, and write the data to local files as the sockets become ready for reading.

In practice, steps 4, 5, and 6 can be combined in a single select() loop to increase parallelism even further.

The script is basically an elaboration of the web_fetch.pl script that we developed in Chapter 5 (Figure 5.5). In addition to the nonblocking connects and the parallel downloads, we improve on the first version by storing each retrieved document in a directory hierarchy based on its URL. For example, the URL http://www.cshl.org/meetings/index.html will be stored in the current directory in the file www.cshl.org/meetings/index.html.

In addition to generating the appropriate GET request, we will perform minimal parsing of the returned HTTP header to determine whether the request was successful.
A typical response looks like this:

    HTTP/1.1 200 OK
    Date: Wed, 01 Mar 2000 17:00:41 GMT
    Server: Apache/1.3.6 (UNIX)
    Last-Modified: Mon, 31 Jan 2000 04:28:15 GMT
    Connection: close
    Content-Type: text/html

    <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
    <html>
    <head>
    <title>Presto Home Page</title>
    </head>
    <body>
    <h1>Welcome to Presto</h1>
    ...

The important part of the response is the topmost line, which indicates the success or the failure status of the request. The line begins with a protocol version code, in this case HTTP/1.1, followed by the status code and the status message.

The status code is a three-digit integer indicating the outcome of the request. As described in Chapter 9, there are a large number of status codes, but the one that we care about is 200, which indicates that the request was successful and the requested document follows. If the client sees a 200 status code, it will read to the end of the header and copy the document body to disk. Otherwise, it treats the response as an error. We will not attempt to process redirects or other fancy HTTP features.

The script, dubbed web_fetch_p.pl, comes in two parts. The main script reads URLs from the command line and runs the select() loop. A helper module, named HTTPFetch, is used to track the status of each URL fetch. It creates the outgoing connection, reads and parses the HTTP header, and copies the returned document to disk. We'll look at the main script first (see Figure 13.9).

Figure 13.9. The web_fetch_p.pl script uses nonblocking connects to parallelize URL fetches

Lines 1–6: Initialize script

We begin by bringing in the IO::Socket, IO::Select, and HTTPFetch modules. We also declare a global hash named %CONNECTIONS, which will be responsible for maintaining the correspondence between sockets and HTTPFetch objects.
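
The bookkeeping role of %CONNECTIONS can be sketched in isolation. This is only an illustration, assuming the hash is keyed by the socket handle itself: Perl turns a reference used as a hash key into a string, so the key cannot be converted back into a handle, and the object being tracked has to be stored as the value. Everything named here is a stand-in rather than code from the figure: plain IO::Handle objects play the part of sockets, small anonymous hashes play the part of HTTPFetch objects, and fetch_for() and the example.com URLs are hypothetical.

```perl
use strict;
use warnings;
use IO::Handle;

# Hypothetical stand-ins: two handles playing the role of sockets, and
# two small hashes playing the role of HTTPFetch objects.
my @handles = (IO::Handle->new, IO::Handle->new);
my @fetches = ({ url => 'http://www.example.com/a.html' },
               { url => 'http://www.example.com/b.html' });

# The handle is the key (stringified by Perl); the tracking object is the value.
my %CONNECTIONS;
$CONNECTIONS{$handles[$_]} = $fetches[$_] for 0..$#handles;

# Later, given a handle (for instance one returned by select()),
# the same handle stringifies to the same key.
sub fetch_for {
    my $handle = shift;
    return $CONNECTIONS{$handle};
}

print fetch_for($handles[1])->{url}, "\n";

# When a fetch is finished, forget the association.
delete $CONNECTIONS{$handles[0]};
print scalar(keys %CONNECTIONS), " connection(s) still tracked\n";
```

The lookup works only while the handle is still alive, which is the case here because the select sets and the tracking objects both hold references to it.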

Lines 7–9: Create IO::Select objects

We now create two IO::Select sets, one for monitoring sockets for reading and the other for monitoring sockets for writing.

Lines 10–15: Create the HTTPFetch connection objects

In the next section of the code, we read a set of URLs from the command line. For each one, we create a new HTTPFetch object by calling HTTPFetch->new() with the URL to fetch.

Behind the scenes, HTTPFetch->new() does a lot. It parses the URL, creates a TCP socket, and initiates a nonblocking connection to the corresponding Web server host. If any of these steps fails, new() returns undef and we skip to the next URL. Otherwise, new() returns a new HTTPFetch object.

Each HTTPFetch object has a method called socket() that returns its underlying IO::Socket. We will monitor this socket for the completion of the nonblocking connect. We add the socket to the $writers IO::Select set, and remember the association between the socket and the HTTPFetch object in the %CONNECTIONS hash.

Line 16: Start the select loop

The remainder of the script is a select() loop. Each time through the loop, we call IO::Select->select() on the $readers and $writers select sets. Initially $readers is empty, but it becomes populated as each of the sockets completes its connection.

Lines 17–22: Handle sockets that are ready for writing

We first deal with the sockets that are ready for writing. This comprises those sockets that have either completed their connections or have tried and failed. We index into %CONNECTIONS to retrieve the corresponding HTTPFetch object and invoke the object's send_request() method.

This method checks first to see that its socket is connected, and if so, submits the appropriate GET request. If the request was submitted successfully, send_request() returns a true result, and we add the socket to the list of sockets to be monitored for reading.
In either case, we don't need to write to the socket again, so we remove it from the $writers select set.

Lines 23–30: Handle sockets that are ready for reading

The next section handles readable sockets. These correspond to HTTPFetch sessions that have successfully completed their connections and submitted their requests to the server.

Again, we use the socket as an index to recover the HTTPFetch object and call its read() method. Internally, read() takes care of reading the header and body and copying the body data to a local file. This is done in such a way that the read never blocks, preventing one slow Web server from holding all the rest up.

The read() call returns a true value if it successfully read from the socket, or false in case of a read error or an end of file. If it returns false, we're done with the socket, so we remove it from the $readers set and delete the socket from the %CONNECTIONS hash.

Line 31: Finish up

The loop is done when no more handles remain in the $readers or $writers sets. We check for this by calling the select objects' count() methods.

The HTTPFetch Module

We turn now to the HTTPFetch module, which is responsible for most of this program's functionality (Figure 13.10).

Figure 13.10. The HTTPFetch module

Lines 1–7: Load modules

We begin by bringing in the IO::Socket, IO::File, and Carp modules. We also import the EINPROGRESS constant from the Errno module and load the File::Path and File::Basename modules. These import the mkpath() and dirname() functions, which we use to create the path to the local copy of the downloaded file.

Lines 8–31: The new() constructor

The new() method creates the HTTPFetch object. Its single argument is the URL to fetch. We begin by parsing the URL into its host, port, and path parts using an internal routine named parse_url().
If the URL can't be parsed, we call an internal method called error(), which sends an error message to STDERR and returns undef.

If the URL was successfully parsed, then we call our connect() method to initiate the nonblocking connect. If an error occurs at this point, we again issue an error message and return undef.

The next task is to turn the URL path into a local filename. In this implementation, we create a local path based on the remote hostname and remote path. The local path is stored relative to the current working directory. In the case of a URL that ends in a slash, we set the local filename to index.html, simulating what Web servers normally do. This local filename ultimately becomes an instance variable named localpath.

We now stash the original URL, the socket object, and the local filename into a blessed hash. We also set up an instance variable named status, which will keep track of the state of the connection. The status starts out at "waiting." After the completion of the nonblocking connect, it will be set to "reading header," and then to "reading body" after the HTTP header is received.

Line 32: The socket() accessor

The socket() method is a public routine that returns the HTTPFetch object's socket.

Lines 33–41: The parse_url() method

The parse_url() method breaks an HTTP URL into its components in two steps, first splitting the host:port and path parts, and then splitting the host:port part into its two components. It returns a three-element list containing the host, port number, and path.

Lines 42–55: The connect() method

The connect() method initiates a nonblocking connect in the manner described earlier. We create an unconnected IO::Socket object, set its blocking status to false, and call its connect() method with the desired destination address. If connect() indicates immediate success, or if connect() returns undef but $! is equal to EINPROGRESS, we return the socket.
Otherwise, some error has occurred and we return false.

Lines 56–68: The send_request() method

The send_request() method is called when the socket has become writable, either because it has completed the nonblocking connect or because an error occurred and the connection failed.

We first test the status instance variable and die if it isn't the expected "waiting" state. That would represent a programming error, not that this could ever happen ;-) If the test passes, we check that the socket is connected. If not, we recover the delayed error, stash it into $!, and return an error message to the caller.

Otherwise the connection has completed successfully. We put the socket back into blocking mode and attempt to write an appropriate GET request to the Web server. In the event of a write error, we issue an error message and return undef. Otherwise, we can conclude that the request was sent successfully and set the status variable to "reading header."

Lines 69–74: The read() method

The read() method is called when the HTTPFetch object's socket has become ready for reading, indicating that the server has begun to send the HTTP response. We look at the contents of the status variable. If it is "reading header," we call the read_header() method. Otherwise, we call read_body().

Lines 75–93: The read_header() method

The read_header() method is a bit complicated because we have to read until we reach the two CRLF pairs that end the header. We can't use the <> operator, because that might block and would definitely interfere with the calls to select() in the main program.

We call sysread() on the socket, requesting a 1,024-byte chunk. We might get the whole chunk in a single operation, or we might get a partial read and have to read again later when the socket is ready. In either case, we append what we get to the end of our internal header instance variable and use rindex() to see whether we have the two CRLF pairs.
rindex() returns the index of a search string in a larger string, beginning from the rightmost position.

If we haven't gotten the full header yet, we just return. The main loop will give us another chance to read from the socket the next time select() indicates that it is ready. Otherwise, we parse out the topmost line, recovering the HTTP status code and message. If the status code indicates that an HTTP error of some sort occurred, we call error() and return undef. Otherwise, we're going to advance to the "reading body" state. However, we need to deal with the fact that the last sysread() might have read beyond the header and gotten some of the document itself. We know where the header ends, so we simply extract the document data using substr() and call write_local() to write the beginning of the document to the local file. write_local() will be called repeatedly during subsequent steps to write the rest of the document to the local file.

We set status to "reading body" and return.

Lines 94–100: The read_body() method

The read_body() method is remarkably simple. We call sysread() to read data from the server in 1,024-byte chunks and pass this on to write_local() to copy the document data to the local file. In case of an error during the read or write, we return undef. We also return undef when sysread() returns 0 bytes, indicating EOF.

Lines 101–111: The write_local() method

This method is responsible for writing a chunk of data to the local file. The file is opened only when needed. We check the HTTPFetch object for an instance variable named localfh. If it is undefined, then we call the mkpath() function to create the required parent directories, if needed, and IO::File->new() to open the file indicated by localpath. If the file can't be opened, then we exit with an error. Otherwise, we call syswrite() to write the data to the file, and stash the filehandle into localfh for future use.
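
The header handling described for read_header() can be sketched as a pure function over the accumulated buffer, which makes the partial-read case easy to try without a network. This is an illustration under stated assumptions, not the code in Figure 13.10: the name split_response() and its return convention are hypothetical, index() is used here in place of the rindex() call the text mentions, and the status-line pattern is a simplification.

```perl
use strict;
use warnings;

my $CRLF = "\015\012";

# Hypothetical helper: given everything read so far, return an empty list
# if the header is still incomplete, otherwise (status code, status
# message, leading body bytes).
sub split_response {
    my $buffer = shift;

    # The header ends at the first blank line, i.e. two CRLF pairs in a row.
    my $end = index($buffer, $CRLF . $CRLF);
    return if $end < 0;                      # need more data

    my $header = substr($buffer, 0, $end);
    my $body   = substr($buffer, $end + 4);  # skip the four terminator bytes

    # The topmost line looks like "HTTP/1.1 200 OK".
    my ($status_line) = split /$CRLF/, $header;
    my ($code, $message) = $status_line =~ m{^HTTP/\d+\.\d+\s+(\d{3})\s*(.*)}
        or return;                           # not an HTTP response

    return ($code, $message, $body);
}

# Simulate a response arriving in two partial reads.
my $buffer = "HTTP/1.1 200 OK${CRLF}Content-Type: text/html${CRLF}";
my @first  = split_response($buffer);        # header not finished yet
$buffer   .= "${CRLF}<html>";                # second read overshoots into the body
my ($code, $message, $body) = split_response($buffer);

print "incomplete after first read\n" unless @first;
print "$code $message, body starts with '$body'\n";
```

In a select() loop, the caller would append each sysread() chunk to the buffer and call the helper again; once it returns a body fragment, that fragment is the part that must be written to the local file before switching to the "reading body" state.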

Lines 112–118: The error() method

This method uses carp() to write the indicated error message to standard error. For convenience, we precede the error message with the URL that HTTPFetch is responsible for.

To test the effect of parallelizing connects, I compared this program against a version of the web_fetch.pl script that performs its fetches in a serial loop. When fetching the home pages of three popular Web servers (http://www.yahoo.com/, http://www.google.com/, and http://www.infoseek.com/) over several trials, I observed a speedup of approximately threefold.

Nonblocking accept()

Aside from its use in implementing timeouts, nonblocking accept() is infrequently used. One application of nonblocking accept() is in a server that must listen on multiple ports. In this case, the server creates multiple listening sockets and select()s across them. select() indicates that the socket is ready for reading if accept() can be called without blocking.

This code fragment illustrates the idiom. It creates three sockets, bound to ports 80, 8000, and 8080, respectively (these ports are typically used by Web servers):

    my $sock80 = IO::Socket::INET->new( LocalPort => 80,
                                        Listen    => 20,
                                        Reuse     => 1);
    my $sock8000 = IO::Socket::INET->new( LocalPort => 8000,
                                          Listen    => 20,
                                          Reuse     => 1);
    my $sock8080 = IO::Socket::INET->new( LocalPort => 8080,
                                          Listen    => 20,
                                          Reuse     => 1);

Each socket is marked nonblocking and added to an IO::Select set:

    foreach ($sock80,$sock8000,$sock8080) {
        $_->blocking(0);
    }
    my $listeners = IO::Select->new($sock80,$sock8000,$sock8080);

The main loop calls the IO::Select can_read() method, which returns the list of sockets that are ready to accept().
We call each ready socket's accept() method, and handle the connected socket that is returned by turning on blocking again and passing it to some routine that handles the connection.

It is possible for accept() to return undef and an error code of EWOULDBLOCK even if select() indicates that the socket is readable. This can happen if the remote host terminated the connection between the time that select() returned and accept() was called. In this case, we simply skip that socket and try again later.

    while (1) {
      my @ready = $listeners->can_read;
      foreach (@ready) {
        next unless my $connected = $_->accept();
        $connected->blocking(1);
        handle_connection($connected);
      }
    }
|