Source Discussion | Visual Basic Developers Guide to ASP and IIS: Build Powerful Server-Side Web Applications with Visual Basic. (Visual Basic Developers Guides)

Let's now walk through the source for the WebAgent. To more easily understand the overall structure of the WebAgent, consider the data flow in Figure 11.4.

Figure 11.4: WebAgent data flow diagram.

Data collection occurs on a timed basis (every 10 minutes), which feeds data up to the data repositories. Using the search criteria stored in the configuration file (predefined by the user), the data is pruned. When the user requests the current information via the Web page, the HTML generator constructs a dynamic Web page that is served back through the HTTP server (see Figure 11.5).

Figure 11.5: Sample view of the WebAgent Web page.

The user can then view specific news items by clicking on the links (see Figure 11.6).

Figure 11.6: Sample view of WebAgent news article.

When the user finishes reading the articles of interest, the "Mark Read" button can be pressed to clear the current news set. This state is saved by the WebAgent so that these news items will not be shown again.

While the configuration is defined in a file outside of the WebAgent, the user can review the configuration settings through the WebAgent. The current configuration settings are presented by requesting the file config.html (see Figure 11.7).

Figure 11.7: Viewing the configuration of the WebAgent.

The following sections describe the WebAgent software in the layers defined by Figure 11.4. The Web interfaces layer provides the ability to communicate with external data servers using standard protocols. The user instruction layer defines how the user gives the agent knowledge of the user's search constraints. The data collection and filtering layer performs the actual filtering of incoming data per the user's instruction. Finally, the user interface layer provides the HTTP server for viewing the filtered news.

Web Interfaces

The WebAgent implements simple versions of the NNTP client interface and HTTP client interface. A simple HTTP server is also implemented, which will be covered in the user interface section.

Simple HTTP Client

The HTTP client's purpose is site monitoring. We want to know when a Web site has changed, so that the user can be notified. To do this, the HTTP client interface implements a simple version of the GET request. The GET request names a file on the remote server, which is then collected through the socket. We're interested only in the header, specifically the "Content-Length" header element. This element tells us the size of the actual content (the size of the file). We use the file size as an indication of whether the file has changed. It's not perfect (a page can change without changing size), but unfortunately, not all servers send the Last-Modified header element.

The monitorSite function uses the monitors structure array to know which sites to monitor (the initialization of this structure will be discussed later). Function monitorSite is shown in Listing 11.1.

Listing 11.1: Simple HTTP Client Interface.

    typedef struct {
      int  active;
      char url[MAX_URL_SIZE];
      char urlName[MAX_SEARCH_ITEM_SIZE+1];
      int  length;
      int  changed;
      int  shown;
    } monitorEntryType;

    int monitorSite( int siteIndex )
    {
      int ret=0, sock, result, len;
      struct sockaddr_in servaddr;
      char buffer[MAX_BUFFER+1];
      char fqdn[80];
      extern monitorEntryType monitors[];

      /* Create a new client socket */
      sock = socket(AF_INET, SOCK_STREAM, 0);
      prune( monitors[siteIndex].url, fqdn );
      memset(&servaddr, 0, sizeof(servaddr));
      servaddr.sin_family = AF_INET;
      servaddr.sin_port = htons( 80 );

      /* Try and resolve the address */
      servaddr.sin_addr.s_addr = inet_addr( fqdn );

      /* If the target was not an IP address, then it must be a fully
       * qualified domain name. Try to use the DNS resolver to resolve
       * the name to an address.
       */
      if ( servaddr.sin_addr.s_addr == 0xffffffff ) {
        struct hostent *hptr =
          (struct hostent *)gethostbyname( fqdn );
        if ( hptr == NULL ) {
          close(sock);
          return -1;
        } else {
          struct in_addr **addrs;
          addrs = (struct in_addr **)hptr->h_addr_list;
          memcpy( &servaddr.sin_addr, *addrs, sizeof(struct in_addr) );
        }
      }

      /* Connect to the defined HTTP server */
      result = connect(sock,
                    (struct sockaddr *)&servaddr, sizeof(servaddr));
      if (result == 0) {
        /* Perform a simple HTTP GET command */
        strcpy(buffer, "GET / HTTP/1.0\n\n");
        len = write(sock, buffer, strlen(buffer) );
        if ( len == (int)strlen(buffer) ) {
          char *cur;
          len = grabResponse( sock, buffer );
          cur = strstr(buffer, "Content-Length:");
          if (cur != NULL) {
            int curLen;
            /* Parse from the matched header element (cur), not from the
             * start of the buffer, and compare the parsed content length
             * against the cached length.
             */
            if ( (sscanf(cur, "Content-Length: %d", &curLen) == 1) &&
                 (curLen != monitors[siteIndex].length) ) {
              monitors[siteIndex].shown = 0;
              monitors[siteIndex].changed = 1;
              monitors[siteIndex].length = curLen;
              ret = 1;
            }
          }
        }
      }
      close(sock);
      return(ret);
    }

The monitorEntryType structure includes an URL (Uniform Resource Locator), otherwise known as the Web address. The urlName is a simple textual description (the name of the Web site being monitored). The length field holds the cached length of the Web site, used to determine if the site has changed.


The monitorSite function implements a very simple client socket application. A socket is created using the socket function. The URL is pruned using the prune function to take an address in the form http://www.mtjones.com/ and translate it to www.mtjones.com (not shown in text, see CD-ROM).
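Because prune is not shown in the text, the following is only a minimal sketch of the kind of transformation it performs. The name prune_url, its signature, and the decision to stop at a ':' (a port number) are assumptions made for illustration; the version on the CD-ROM may differ.

```c
#include <string.h>

/* Hypothetical stand-in for prune: copy the host portion of a URL
 * such as "http://www.mtjones.com/" into fqdn. The caller supplies
 * the size of fqdn, which must be at least 1.
 */
void prune_url( const char *url, char *fqdn, size_t fqdnLen )
{
  const char *start = strstr( url, "://" );
  size_t i = 0;

  /* Skip the protocol specification if one is present */
  start = (start != NULL) ? start + 3 : url;

  /* Copy up to the first path separator, port separator, or end */
  while ( (start[i] != 0) && (start[i] != '/') &&
          (start[i] != ':') && (i < fqdnLen-1) ) {
    fqdn[i] = start[i];
    i++;
  }
  fqdn[i] = 0;
}
```

Passing the destination size keeps the copy within bounds even if a long URL appears in the configuration file.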

The final address (called a fully-qualified domain name) can then be resolved using a client resolver (resolve from an FQDN to an IP address). Using the IP address, we can connect to the remote site. Note that our address could be an FQDN or a simple IP address. Therefore, we try the inet_addr function first to convert a string IP address to a numeric IP address. If this doesn't work, the resolver (interfaced via the gethostbyname function) will convert the name into an IP address using an external domain name server (DNS).
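The "try the numeric form first" step can be sketched in isolation. The helper below is hypothetical (parse_ipv4 is not part of the WebAgent) and uses inet_pton rather than inet_addr; inet_addr reports failure with the value 0xffffffff, which is also the valid address 255.255.255.255, whereas inet_pton returns a separate status.

```c
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Hypothetical helper: returns 0 and fills addr if host is a
 * dotted-quad IPv4 string, or -1 if it is not (in which case the
 * caller would fall back to a resolver lookup).
 */
int parse_ipv4( const char *host, struct in_addr *addr )
{
  memset( addr, 0, sizeof(*addr) );
  return (inet_pton( AF_INET, host, addr ) == 1) ? 0 : -1;
}
```

A name such as www.mtjones.com fails this test, which is the signal to consult the DNS resolver.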

Once we have a numeric IP address (within the servaddr structure), we try to connect to the remote server using the connect function. This function creates a bidirectional connection between the two endpoints that can be used for communication. Since we're connecting to the HTTP port on the remote server (port 80), we know that the application layer protocol used on this socket is HTTP. We issue the GET command and then await the response from the server with the grabResponse function (see Listing 11.2). Using the response, returned in buffer, we search for the "Content-Length:" header element and store the value associated with it. If the length differs from the stored length, we mark the site as changed so that it will show up in the filtering layer.
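The header search can be illustrated on an in-memory response. This is a minimal sketch, and content_length is a hypothetical helper name rather than a WebAgent function. Note that the match is case-sensitive, so a server that sends "content-length:" would not be recognized.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns the Content-Length value found in an
 * HTTP response header, or -1 if the element is absent or malformed.
 */
int content_length( const char *header )
{
  const char *cur = strstr( header, "Content-Length:" );
  int value;

  if (cur == NULL) return -1;

  /* Parse beginning at the matched element */
  if (sscanf( cur, "Content-Length: %d", &value ) != 1) return -1;

  return value;
}
```

Returning -1 for a missing element lets the caller distinguish "no length reported" from a genuine length of zero.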

Listing 11.2: Retrieving an HTTP Response.

    int grabResponse( int sock, char *buf )
    {
      int i, len, stop, state, bufIdx;

      if (buf == NULL) return -1;
      len = bufIdx = state = stop = 0;

      while (!stop) {
        if (bufIdx+len > MAX_BUFFER - 80) break;
        len = read( sock, &buf[bufIdx], (MAX_BUFFER-bufIdx) );

        /* Stop on a read error or when the peer closes the connection */
        if (len <= 0) break;

        /* Search for the end-of-header indicator in the current buffer */
        for ( i = bufIdx ; i < bufIdx+len ; i++ ) {
          if      ( (state == 0) && (buf[i] == 0x0d) ) state = 1;
          else if ( (state == 1) && (buf[i] == 0x0a) ) state = 2;
          else if ( (state == 2) && (buf[i] == 0x0d) ) state = 3;
          else if ( (state == 3) && (buf[i] == 0x0a) ) {
            stop = 1;
            break;
          }
          else state = 0;
        }
        bufIdx += len;
      }

      /* Trim the trailing terminator bytes, guarding against a short read */
      bufIdx = (bufIdx > 3) ? bufIdx - 3 : 0;
      buf[bufIdx] = 0;
      return bufIdx;
    }

The header of the HTTP response is terminated (as the request) with two carriage-return/line-feed pairs. A simple state machine (shown in Listing 11.2) reads data from the socket until this pattern is found. Once found, the loop is terminated and a null terminator is added to the end of the buffer.
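The state machine can be exercised without a socket by running it over a buffer already in memory. The sketch below uses a hypothetical function name (find_header_end) and makes one deliberate change from the listing: on a mismatch it restarts in state 1 if the offending byte is itself a carriage return, so a stray CR directly before the terminator does not hide it.

```c
/* Hypothetical helper: returns the index just past the first
 * CR LF CR LF sequence in buf, or -1 if the sequence is absent.
 */
int find_header_end( const char *buf, int len )
{
  int i, state = 0;

  for ( i = 0 ; i < len ; i++ ) {
    if      ( (state == 0) && (buf[i] == 0x0d) ) state = 1;
    else if ( (state == 1) && (buf[i] == 0x0a) ) state = 2;
    else if ( (state == 2) && (buf[i] == 0x0d) ) state = 3;
    else if ( (state == 3) && (buf[i] == 0x0a) ) return i + 1;
    else state = (buf[i] == 0x0d) ? 1 : 0;  /* a CR may start a new match */
  }

  return -1;
}
```

The returned index is where the response body would begin, if the caller wanted to read it.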

This is the basis for a simple HTTP client. For each site that's monitored, a socket is created to the site and an HTTP request sent. The response is then captured to parse the length of the resulting content. This value is used to determine if the site has changed since the last check.

Simple NNTP Client

The NNTP client provides a simple API to communicate with news servers. The API allows an application to connect to a news server (nntpConnect), set the group of interest (nntpSetGroup), peek at the header of a news message (nntpPeek), retrieve the entire news message (nntpRetrieve), parse the news message (nntpParse), skip the current news message (nntpSkip), and close the connection to the news server (nntpDisconnect).

NNTP is an interactive command-response protocol that is entirely ASCII-text based. By opening a simple telnet session to port 119 of the NNTP server (see Listing 11.3) we can carry out a dialog with the server. Lines prefixed with C: are user (client) input, and lines prefixed with S: are server responses.

Listing 11.3: Sample Interaction with an NNTP Server.

    [root@plato /root]# telnet localhost 119
    S: 201 plato.mtjones.com DNEWS Version 5.5d1, SO, posting OK
    C: list
    S: 215 list of newsgroups follows
    S: control 2 3 y
    S: control.cancel 2 3 y
    S: my.group 10 3 y
    S: new.group 6 3 y
    S: .
    C: group my.group
    S: 211 8 3 10 my.group selected
    C: article 3
    S: 220 3 <3C36AF8E.1BD3047E@mtjones.com> article retrieved
    S: Message-ID: <3C36AF8E.1BD3047E@mtjones.com>
    S: Date: Sat, 05 Jan 2002 00:47:27 -0700
    S: From: "M. Tim Jones" <mtj@mtjones.com>
    S: X-Mailer: Mozilla 4.74 [en] (Win98; U)
    S: X-Accept-Language: en
    S: MIME-Version: 1.0
    S: Newsgroups: my.group
    S: Subject: this is my post
    S: Content-Type: text/plain; charset=us-ascii
    S: Content-Transfer-Encoding: 7bit
    S: NNTP-Posting-Host: sartre.mtjones.com
    S: X-Trace: plato.mtjones.com 1010328764 sartre.mtjones.com (6 Jan 2002 07:52:44 -0700)
    S: Lines: 6
    S: Path: plato.mtjones.com
    S: Xref: plato.mtjones.com my.group:3
    S:
    S:
    S: Hello
    S:
    S: This is my post.
    S:
    S:
    S: .
    C: date
    S: 111 20020112122419
    C: quit
    S: 205 closing connection - goodbye!

From Listing 11.3, we see that the NNTP server connection is created using a telnet client. The NNTP server responds with a salutation (the server type, etc.). We can now issue commands through the connection. There are two basic types of responses that are expected from the server: single-line responses and multi-line responses. A single-line response is simple (see the date command above). The multi-line response is similarly easy to deal with given the universal termination symbol. NNTP follows the Simple Mail Transfer Protocol (SMTP) by using a single '.' on a line by itself to identify the end of the response (see the list and article commands, for example).
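As a minimal sketch of the termination rule, the hypothetical helper below (multiline_complete is not part of the WebAgent API) tests whether the data gathered so far ends with a '.' on a line by itself. It assumes lines end with a carriage-return/line-feed pair, as they do on the wire.

```c
#include <string.h>

/* Hypothetical helper: returns 1 if buf (len bytes) ends with the
 * multi-line terminator, a '.' alone on a line, and 0 otherwise.
 */
int multiline_complete( const char *buf, size_t len )
{
  static const char term[] = "\r\n.\r\n";
  size_t tlen = sizeof(term) - 1;

  /* The terminator may be the only line present */
  if ( (len == 3) && (memcmp( buf, ".\r\n", 3 ) == 0) ) return 1;

  if (len < tlen) return 0;

  return memcmp( buf + len - tlen, term, tlen ) == 0;
}
```

A receiver would call this after each read and keep reading until it returns 1.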

Now that we have a basic understanding of NNTP, let's look at the API that will be used to communicate with the news server.

The basic type that is used within the NNTP API is the news_t type. This type defines the message unit that is communicated by the NNTP API functions (see Listing 11.4).

Listing 11.4: The Basic Message Structure, news_t.

    typedef struct {
      char *msg;
      int msgLen;
      int msgId;
      char subject[MAX_LG_STRING+1];
      char sender[MAX_SM_STRING+1];
      char msgDate[MAX_SM_STRING+1];
      char *bodyStart;
    } news_t;

The news_t structure includes the unparsed buffer (msg), the length of the unparsed message (msgLen), and the numeric identifier for the message (msgId). Also parsed out of the message header are the subject, sender, and msgDate. Finally, the bodyStart field points to the body of the news message.

The first function that must be used when starting a new NNTP session is the nntpConnect function. This function establishes a connection to an NNTP server given the address of an NNTP server (either an IP address or a fully-qualified domain name). The nntpConnect function is shown in Listing 11.5.

Listing 11.5: nntpConnect API Function.

    int nntpConnect ( char *nntpServer )
    {
      int result = -1;
      struct sockaddr_in servaddr;

      if (!nntpServer) return -1;

      sock = socket( AF_INET, SOCK_STREAM, 0 );
      bzero( &servaddr, sizeof(servaddr) );
      servaddr.sin_family = AF_INET;
      servaddr.sin_port = htons( 119 );
      servaddr.sin_addr.s_addr = inet_addr( nntpServer );

      if ( servaddr.sin_addr.s_addr == 0xffffffff ) {
        struct hostent *hptr =
               (struct hostent *)gethostbyname( nntpServer );
        if ( hptr == NULL ) {
          return -1;
        } else {
          struct in_addr **addrs;
          addrs = (struct in_addr **)hptr->h_addr_list;
          memcpy( &servaddr.sin_addr, *addrs, sizeof(struct in_addr) );
        }
      }

      result = connect( sock,
                        (struct sockaddr *)&servaddr, sizeof(servaddr) );
      if ( result >= 0 ) {
        buffer[0] = 0;
        result = dialog( sock, buffer, "201", 3 );
        if (result < 0) nntpDisconnect();
      }
      return ( result );
    }

The nntpConnect function first resolves the address passed in from the user (nntpServer). As with the monitorSite function, this can be a string IP address or a fully-qualified domain name, so each case is handled (see the monitorSite function discussion for a more detailed analysis of this step). The connect function is then used to connect to the remote server. As with all NNTP responses, a numeric ID is returned to identify the status of the response (through the dialog function). With an initial connect, the NNTP server should respond with a code 201 (successful connect). This is illustrated in the interactive NNTP session in Listing 11.3. If this code is found, success is returned to the caller; otherwise the NNTP session is disconnected and a failure code is returned (-1).

The dialog function is used by all NNTP API functions to validate the server's response (see Listing 11.6). We pass in the socket descriptor ID, a buffer that will be used to grab the response, our expected status response, and its length. We pass in our own buffer because of TCP re-packetization. Even though the server may emit one string containing the status line and another string with data, the network stack may combine these two before sending them out. Therefore, we provide the buffer to use to gather the response because it may have extra data within it that we'll need to parse later.

Listing 11.6: The NNTP dialog Support Function.

    int dialog( int sd, char *buffer, char *resp, int rlen )
    {
      int ret, len;

      if ((sd == -1) || (!buffer)) return -1;

      /* Send the command, if the caller provided one */
      if (strlen(buffer) > 0) {
        len = strlen( buffer );
        if ( write( sd, buffer, len ) != len ) return -1;
      }

      /* Read the response and check its status code, if one is expected */
      if (resp != NULL) {
        ret = read( sd, buffer, MAX_LINE );
        if (ret >= 0) {
          buffer[ret] = 0;
          if (strncmp( buffer, resp, rlen )) return -1;
        } else {
          return -1;
        }
      }
      return 0;
    }

Since we may not always want to send a command to the server, the buffer argument is checked to see if it contains a command (using strlen). If it does, we'll send this through the socket using the write function. Likewise, a response is not always desired. If the caller passes an expected status string (resp), we'll read some amount of data through the socket and check our status code against it. If it matches, a success code is returned (0); otherwise a failure code is returned (-1).

Once connected to a news server, a group must be set in order to look at any messages available. This is accomplished through the nntpSetGroup API function. The caller passes in the group name to be subscribed (such as "comp.ai.alife") and the last message read from the group. At initialization, the caller would pass in -1 for lastRead, which indicates that no messages have been read. When messages are finally read through nntpPeek or nntpRetrieve, the first message available will be read. Otherwise, the caller may specify the last message read, allowing the NNTP client to ignore previously read messages (see Listing 11.7).

Listing 11.7: nntpSetGroup API Function.

    int nntpSetGroup( char *group, int lastRead )
    {
      int result = -1;
      int numMessages = -1;

      if ((!group) || (sock == -1)) return -1;

      snprintf( buffer, 80, "group %s\n", group );
      result = dialog( sock, buffer, "211", 3 );
      if (result == 0) {
        sscanf( buffer, "211 %d %d %d ",
                 &numMessages, &firstMessage, &lastMessage );
        if (lastRead == -1) {
          curMessage = firstMessage;
        } else {
          curMessage = lastRead+1;
          numMessages = lastMessage - lastRead;
        }
        printf("Set news group to %s\n", group);
      }
      return( numMessages );
    }

Within the NNTP protocol, the group command is used to specify the group for subscription. The response will be a 211 (success) code, three numbers representing the number of messages available to read, and the first message and last message identifiers. These are stored internally on the NNTP client, and used on subsequent calls to NNTP API functions. The function returns the number of messages that are available to read.
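The parsing of the group response can be tried on its own with the "211 8 3 10 my.group selected" line from Listing 11.3. This is a minimal sketch; the group_info_t type and both function names are hypothetical and exist only for this illustration.

```c
#include <stdio.h>

/* Hypothetical container for the three numbers in a 211 response */
typedef struct {
  int count;   /* number of messages available */
  int first;   /* first message identifier     */
  int last;    /* last message identifier      */
} group_info_t;

/* Returns 0 on success, or -1 if resp is not a well-formed 211 line */
int parse_group_response( const char *resp, group_info_t *info )
{
  if (sscanf( resp, "211 %d %d %d",
               &info->count, &info->first, &info->last ) != 3) {
    return -1;
  }
  return 0;
}

/* Number of messages left to read, given the last message read
 * (-1 meaning that nothing has been read yet).
 */
int unread_count( const group_info_t *info, int lastRead )
{
  if (lastRead == -1) return info->count;
  return info->last - lastRead;
}
```

Checking the sscanf return value guards against a truncated or unexpected response line.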

Once subscribed to a group, the user may retrieve messages from the server using the message identifier attached to the message. Two API functions are available to read messages from the server: nntpPeek and nntpRetrieve.

The nntpPeek function reads only the header of the message, while nntpRetrieve reads the entire message (header and message body). The nntpPeek function is shown in Listing 11.8.

Listing 11.8: nntpPeek API Function.

    int nntpPeek ( news_t *news, int totalLen )
    {
      int result = -1, i, len=0, stop, state, bufIdx=0;

      if ((!news) || (sock == -1)) return -1;
      if ((curMessage == -1) || (curMessage > lastMessage)) return -2;

      /* Save the message id for this particular message */
      news->msgId = curMessage;

      snprintf( buffer, 80, "head %d\n", curMessage );
      result = dialog( sock, buffer, "221", 3 );
      if (result < 0) return -3;

      /* Skip the status response line and grab any data that followed
       * it (the status line ends with CRLF)
       */
      len = strlen( buffer );
      for ( i = 0 ; i < len-1 ; i++ ) {
        if ( (buffer[i] == 0x0d) && (buffer[i+1] == 0x0a) ) {
          /* Bytes remaining after the status line and its CRLF */
          len -= i+2;
          memmove( news->msg, &buffer[i+2], len );
          bufIdx = len;
          break;
        }
      }

      state = stop = 0;
      while (!stop) {
        if (bufIdx+len > totalLen - 80) break;
        len = read( sock, &news->msg[bufIdx], (totalLen-bufIdx) );

        /* Search for the end-of-mail indicator in the current buffer */
        for ( i = bufIdx ; i < bufIdx+len ; i++ ) {
          if      ( (state == 0) && (news->msg[i] == 0x0d) ) state = 1;
          else if ( (state == 1) && (news->msg[i] == 0x0a) ) state = 2;
          else if ( (state == 2) && (news->msg[i] == 0x0d) ) state = 1;
          else if ( (state == 2) && (news->msg[i] ==  '.') ) state = 3;
          else if ( (state == 3) && (news->msg[i] == 0x0d) ) state = 4;
          else if ( (state == 4) && (news->msg[i] == 0x0a) ) {
            stop = 1; break;
          } else state = 0;
        }
        bufIdx += len;
      }

      bufIdx -= 3;
      news->msg[bufIdx] = 0;
      news->msgLen = bufIdx;
      return bufIdx;
    }

The first task of nntpPeek is to emit the head command through the socket to the NNTP server. The NNTP server should respond with a '221' status code indicating that the head command succeeded. We then copy any other data that may have accompanied the status line from the NNTP server to our news message (news->msg). Finally, we read additional data from the socket until the end-of-mail indicator is found (a '.' on a line by itself). At this point, our message (stored in news->msg) contains only the message header and can be parsed accordingly (see nntpParse discussion).

The nntpRetrieve function is very similar to nntpPeek, except that the entire message is downloaded instead of the header alone (see Listing 11.9).

Listing 11.9: nntpRetrieve API Function.

    int nntpRetrieve ( news_t *news, int totalLen )
    {
      int result = -1, i, len=0, stop, state, bufIdx=0;

      if ((!news) || (sock == -1)) return -1;
      if ((curMessage == -1) || (curMessage > lastMessage)) return -1;

      /* Save the message id for this particular message */
      news->msgId = curMessage;

      /* Request the article and advance to the next message */
      snprintf( buffer, 80, "article %d\n", curMessage++ );
      result = dialog( sock, buffer, "220", 3 );
      if (result < 0) return -1;

      len = strlen(buffer);
      for ( i = 0 ; i < len-1 ; i++ ) {
        if ( (buffer[i] == 0x0d) && (buffer[i+1] == 0x0a) ) {
          /* Bytes remaining after the status line and its CRLF */
          len -= i+2;
          memmove( news->msg, &buffer[i+2], len );
          break;
        }
      }

      state = stop = 0;
      while (!stop) {
        if (bufIdx+len > totalLen - 80) break;

        /* Search for the end-of-mail indicator in the current buffer */
        for ( i = bufIdx ; i < bufIdx+len ; i++ ) {
          if      ( (state == 0) && (news->msg[i] == 0x0d) ) state = 1;
          else if ( (state == 1) && (news->msg[i] == 0x0a) ) state = 2;
          else if ( (state == 2) && (news->msg[i] == 0x0d) ) state = 1;
          else if ( (state == 2) && (news->msg[i] ==  '.') ) state = 3;
          else if ( (state == 3) && (news->msg[i] == 0x0d) ) state = 4;
          else if ( (state == 4) && (news->msg[i] == 0x0a) ) {
            stop = 1; break;
          } else state = 0;
        }
        bufIdx += (i-bufIdx);

        if (!stop) {
          len = read( sock, &news->msg[bufIdx], (totalLen-bufIdx) );
          if ( (len <= 0) || (bufIdx+len > totalLen) ) {
            break;
          }
        }
      }

      bufIdx -= 3;
      news->msg[bufIdx] = 0;
      news->msgLen = bufIdx;
      return bufIdx;
    }

The nntpRetrieve function uses the article NNTP command to request the entire message. Recall that nntpPeek used the head command to request the headers. NNTP returns the header and full message in the same way: a sequence of characters with a single '.' on a line by itself as an end-of-message indicator. Therefore, the nntpPeek and nntpRetrieve functions serve similar purposes, but result in differing amounts of data. One additional difference between the two functions is the advancement of the current message identifier. The nntpPeek function does not advance the current message, while the nntpRetrieve function does. This is primarily because the nntpPeek function is used to look at the message to know if the entire message should be downloaded. If the user does not wish to download the message, the nntpSkip function can be used to advance the message identifier (see Listing 11.10).

Listing 11.10: nntpSkip API Function.

    void nntpSkip( void )
    {
      curMessage++;
    }

Recall that curMessage is a static variable within the NNTP client that is initialized when the nntpSetGroup command is called.

Once the message (or header) has been downloaded from the NNTP server, it is contained within the news_t structure (see Listing 11.4). This structure can then be passed to the nntpParse function to parse out the subject, date, and sender of the news posting. Additionally, the start of the message body (message excluding the NNTP headers) is found and loaded into the bodyStart field. The nntpParse function can be found in Listing 11.11.

Listing 11.11: nntpParse API Function.

    int nntpParse( news_t *news, unsigned int flags )
    {
      int result;

      if (!news) return -1;

      result = parseEntry( news, "Subject:", news->subject );
      if (result < 0) return -1;

      result = parseEntry( news, "Date:", news->msgDate );
      if (result < 0) return -2;

      result = parseEntry( news, "From:", news->sender );
      if (result < 0) return -3;
      fixAddress( news->sender );

      if (flags == FULL_PARSE) {
        result = findBody( news );
      }
      return result;
    }


The caller must specify whether a header-only parse is required or whether the entire message should be parsed. A header-only parse is requested by passing in flags equal to HEADER_PARSE. For a full parse (which includes locating the message body), flags should be equal to FULL_PARSE. The nntpParse function uses the support functions parseEntry and findBody. While not discussed in this text, they can be found on the CD-ROM.

The parseEntry function simply parses the value associated with the header element passed into it. Function findBody finds where the body of the message begins within the full message (header and body).
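Since parseEntry is not shown in the text, the following is only a guess at the general approach: find the header element by name and copy its value up to the end of the line. The name header_value and its signature are hypothetical, and the CD-ROM version may work differently. The plain strstr search is a simplification; it could also match the name inside another header (for example "X-Subject:").

```c
#include <string.h>

/* Hypothetical sketch of header parsing: copy the value that follows
 * name (such as "Subject:") into out. Returns 0 on success, or -1 if
 * the element is not present.
 */
int header_value( const char *msg, const char *name,
                  char *out, size_t outLen )
{
  const char *cur = strstr( msg, name );
  size_t i = 0;

  if ( (cur == NULL) || (outLen == 0) ) return -1;

  /* Step over the element name and any leading spaces */
  cur += strlen( name );
  while (*cur == ' ') cur++;

  /* Copy the value up to the end of the line */
  while ( (*cur != 0) && (*cur != 0x0d) && (*cur != 0x0a) &&
          (i < outLen-1) ) {
    out[i++] = *cur++;
  }
  out[i] = 0;

  return 0;
}
```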

The final API function for NNTP is the nntpDisconnect function. This closes the session with the NNTP server (see Listing 11.12).

Listing 11.12: nntpDisconnect API Function.

    int nntpDisconnect ( void )
    {
      if (sock == -1) return -1;
      close(sock);
      sock = curMessage = firstMessage = lastMessage = -1;
      return 0;
    }

In addition to closing the socket associated with the NNTP session, the function also returns the internal state variables to their initial values (-1), ready to be set again when a new session is opened.

These seven API functions provide the ability for the WebAgent to retrieve news from an NNTP server and further filter this news based upon a user's criteria.

User Instruction

The user provides their filtering knowledge to the WebAgent through a simple configuration text file. This file has a very simple format, as illustrated by Listing 11.13.

Listing 11.13: WebAgent Configuration File Example.

    #
    # Sample config file
    #
    [monitor]
    http://www.foxnews.com;Fox News
    http://www.wnd.com/;WorldNetDaily
    [feeds]
    nntp://yellow.geeks.org
    [groups]
    comp.robotics.misc;camera;68HC11
    sci.space.moderated;mars;gemini
    sci.space.news;micro;satellite

The configuration file is made up of three sections, each of which is optional. The first section identifies the Web sites that are to be monitored (under the heading [monitor]). The Web sites must be specified in full URL format, including the http:// protocol specification. After the URL, a semicolon is used to separate the text name of the site (used for display purposes only).

A single feed is supported by the WebAgent, which defines where the agent may collect news items (under the heading [feeds]). The single feed is defined in URL format, including the protocol specification (in this case nntp://, specifying the Network News Transfer Protocol). This is the site to which the WebAgent is permitted to connect. In this case, a freely available (and reliable) news server is illustrated.

Given the definition of a news feed, one or more groups may be defined (specified by the [groups] heading). Each line contains a news-group name followed by a semicolon-separated list of search keywords. The WebAgent uses these keywords to determine if a given news article should be presented to the user. The more keywords that are found within the subject of the news article, the higher it is rated and the higher on the list it will be presented.
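The rating idea can be sketched as a simple count of keywords found in the subject. The function below is hypothetical (the WebAgent's actual rating code is not shown in this section) and assumes the keywords are already lowercase, so only the subject is converted before the comparison.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical rating function: returns the number of keywords that
 * appear anywhere in the subject, ignoring the case of the subject.
 * Subjects longer than 255 characters are truncated for the search.
 */
int rate_subject( const char *subject,
                  const char *keywords[], int numKeywords )
{
  char lower[256];
  size_t i;
  int k, score = 0;

  /* Build a lowercase copy of the subject */
  for ( i = 0 ; (subject[i] != 0) && (i < sizeof(lower)-1) ; i++ ) {
    lower[i] = (char)tolower( (unsigned char)subject[i] );
  }
  lower[i] = 0;

  /* Count each keyword that occurs in the subject */
  for ( k = 0 ; k < numKeywords ; k++ ) {
    if (strstr( lower, keywords[k] ) != NULL) score++;
  }

  return score;
}
```

Articles could then be sorted by this score, with a score of zero meaning the article is not presented at all.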

Let's now look at how the file is parsed. Recall that in Listing 11.1, we presented the monitorEntryType structure, which is used to represent the Web sites to monitor. In Listing 11.14, we present the feedEntryType structure that represents the news feed and news groups to monitor within that feed.

Listing 11.14: Types feedEntryType and groupEntryType.

    #define MAX_URL_SIZE            80
    #define MAX_SEARCH_ITEM_SIZE    40
    #define MAX_SEARCH_STRINGS      10
    #define MAX_GROUPS              20

    typedef struct {
      int  active;
      char groupName[MAX_URL_SIZE+1];
      int  lastMessageRead;
      char searchString[MAX_SEARCH_STRINGS][MAX_SEARCH_ITEM_SIZE+1];
      int  numSearchStrings;
    } groupEntryType;

    typedef struct {
      char           url[MAX_URL_SIZE];
      groupEntryType groups[MAX_GROUPS];
    } feedEntryType;

The feedEntryType includes the URL of the feed itself (as parsed from the configuration file) and one or more group elements. A group structure maps to each line under the [groups] heading. This includes the name of the group, the last message that was read from the group and a set of search strings (as parsed after the group name). These two linked structures define the working state of the news-monitoring aspect of the WebAgent.

The first step to parsing the configuration file is a call to the parseConfigFile function. This is the main function for parsing and includes support for parsing each of the three elements from the configuration file (see Listing 11.15).

Listing 11.15: Main Configuration File Parsing Function.

    int parseConfigFile( char *filename )
    {
      FILE *fp;
      char line[MAX_LINE+1], *cur;
      int parse, i;

      bzero( &feed, sizeof(feed) );
      bzero( monitors, sizeof(monitors) );

      fp = fopen(filename, "r");
      if (fp == NULL) return -1;

      while( !feof(fp) ) {
        fgets( line, MAX_LINE, fp );
        if (feof(fp)) break;

        /* Skip comments and empty lines */
        if      (line[0] == '#') continue;
        else if (line[0] == 0x0a) continue;

        if        (!strncmp(line, "[monitor]", 9)) {
          parse = MONITOR_PARSE;
        } else if (!strncmp(line, "[feeds]", 7)) {
          parse = FEEDS_PARSE;
        } else if (!strncmp(line, "[groups]", 8)) {
          parse = GROUPS_PARSE;
        } else {
          if (parse == MONITOR_PARSE) {
            if (!strncmp(line, "http://", 7)) {
              cur = parseURLorGroup( line, monitors[curMonitor].url );
              parseString( cur, monitors[curMonitor].urlName );
              monitors[curMonitor].active = 1;
              curMonitor++;
            } else { fclose(fp); return -1; }
          } else if (parse == FEEDS_PARSE) {
            if (!strncmp(line, "nntp://", 7)) {
              cur = parseURLorGroup( line, feed.url );
            } else { fclose(fp); return -1; }
          } else if (parse == GROUPS_PARSE) {
            cur = parseURLorGroup( line,
                                   feed.groups[curGroup].groupName );
            i = 0;
            while (*cur) {
              cur = parseString(
                       cur, feed.groups[curGroup].searchString[i] );
              if (strlen(feed.groups[curGroup].searchString[i])) i++;
              if (i == MAX_SEARCH_STRINGS) break;
            }
            feed.groups[curGroup].numSearchStrings = i;
            feed.groups[curGroup].active = 1;
            curGroup++;
          }
        }
      }

      /* Release the file before loading the archived group state */
      fclose(fp);
      readGroupStatus();
      return 0;
    }

After initializing the WebAgent base structures, the configuration file is opened and each line is read from the file in the while loop. If a line starts with a '#' sign or with a line-feed (0x0a, an empty line), the line is ignored and the loop continues to read the next line. Otherwise, we test the line to see if it labels a new section (for monitor, feed, or groups). If so, we set our local parse variable to the appropriate state to know how to parse subsequent lines of the file.
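The line classification described above can be separated from the file handling and tested on its own. This is a minimal sketch; the SEC_ constants and both function names are hypothetical stand-ins, since the values of MONITOR_PARSE, FEEDS_PARSE, and GROUPS_PARSE are not given in the text.

```c
#include <string.h>

/* Hypothetical section identifiers for this illustration */
enum { SEC_NONE = 0, SEC_MONITOR, SEC_FEEDS, SEC_GROUPS };

/* Returns 1 if the line is a comment or an empty line */
int is_ignored( const char *line )
{
  return (line[0] == '#') || (line[0] == 0x0a) || (line[0] == 0);
}

/* Returns the section that a label line introduces, or SEC_NONE if
 * the line is not a section label.
 */
int section_of( const char *line )
{
  if (!strncmp( line, "[monitor]", 9 )) return SEC_MONITOR;
  if (!strncmp( line, "[feeds]",   7 )) return SEC_FEEDS;
  if (!strncmp( line, "[groups]",  8 )) return SEC_GROUPS;
  return SEC_NONE;
}
```

Starting the parse state at a "no section" value such as SEC_NONE also gives a defined behavior for entries that appear before the first label.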

If the line does not contain a label, we parse the line according to the current parse state (as defined by the parse variable).

For monitor state parsing, we first test the line to see if it contains a hypertext protocol URL. If it does, we call the parseURLorGroup function to parse the URL from the line. We then call the parseString function to parse the text name of the Web site that represents the URL. These two functions can be seen in Listing 11.16. Finally, we set the current monitor row to active (contains an URL to monitor) and increment the curMonitor variable for the next parse.

Listing 11.16: Functions parseURLorGroup and parseString.

    char *parseURLorGroup( char *line, char *url )
    {
      int i = 0;

      /* Search for the ';' or ' ' separator */
      while ((*line != ' ') && (*line != ';') &&
             (*line != 0x0a) && (*line != 0)) {
        url[i++] = *line++;
        if (i == MAX_URL_SIZE-1) break;
      }
      url[i] = 0;

      /* Advance the line pointer to the next separator or end of line */
      while ((*line != ';') && (*line != 0) && (*line != 0x0a)) line++;

      return( line );
    }

    char *parseString( char *line, char *string )
    {
      int j=0;

      if (*line != ';') {
        *line = 0;
        return line;
      }
      line++;

      while (*line == ' ') line++;

      /* Copy the string in lowercase up to the next separator */
      while ((*line != ';') && (*line != 0x0a) && (*line != 0)) {
        string[j++] = tolower(*line++);
        if (j == MAX_SEARCH_ITEM_SIZE-1) break;
      }
      string[j] = 0;

      while ((*line != ';') && (*line != 0)) line++;

      return( line );
    }

For feeds parsing, we look for a single line that defines the NNTP server to use. We use the parseURLorGroup function to parse the URL from the line and store this within feed.url. The WebAgent will later connect to this URL to gather news.

Parsing a group is very similar to monitor state parsing, except for the fact that search strings follow the news-group name. As shown in Listing 11.14, up to 10 search strings may be present. If more are provided in the configuration file, they are simply ignored. The number of search strings is stored within the numSearchStrings field, and the group is activated by setting the active field to one.

The parseURLorGroup function is used to parse a URL or group from a line. The function treats a URL and a group name identically, as it simply looks for a separator (space, semicolon, or line-feed). Each character that is not in the separator set is copied into the url character array passed by the caller. Once a separator is found, we advance to the next semicolon (or the end of the line) to prepare for the next parsing function.

The parseString function is similar to parseURLorGroup in that we copy the string until a separator or line-feed is found, but we convert each character to lowercase as we copy. In this way, search strings are case independent and simpler to match. When the separator is found, the new string is null-terminated and we advance to the next separator in preparation for another potential call to parseString.
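The two parsers differ mainly in whether they convert to lowercase, so the shared idea can be sketched as a single bounded helper. This is only an illustration: the name nextField, its parameters, and the sample line format (fields separated by semicolons) are assumptions made for this sketch and are not part of the WebAgent source.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper (not from the WebAgent source): copy the next
 * ';'-separated field of line into out (at most outSize-1 characters),
 * skipping leading spaces and optionally converting to lowercase.
 * outSize must be at least 1. Returns a pointer just past the field's
 * separator, or to the line feed / terminating NUL if no separator
 * remains. */
const char *nextField( const char *line, char *out,
                       size_t outSize, int lower )
{
  size_t n = 0;

  /* Skip leading spaces */
  while (*line == ' ') line++;

  /* Copy until a separator, line feed, or end of string */
  while ((*line != ';') && (*line != '\n') && (*line != '\0')) {
    if (n + 1 < outSize) {
      out[n++] = lower ? (char)tolower((unsigned char)*line) : *line;
    }
    line++;
  }
  out[n] = '\0';

  /* Step over the separator so the next call starts on a new field */
  if (*line == ';') line++;

  return line;
}
```

Calling nextField repeatedly on the returned pointer walks the fields of a line one at a time. Characters that do not fit in the output buffer are dropped rather than written past its end.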

Continuing from Listing 11.15, the parsing process continues until the end of the file is reached (no additional configuration entries found). At this point, a special function called readGroupStatus is called (see Listing 11.17). The purpose of this function is to read the archived information about the last message read for the news groups read from the configuration file.

Listing 11.17: Function readGroupStatus to Read the News Group State.

void readGroupStatus( void )
{
  FILE *fp;
  int i, curMsg;
  char line[80];

  for (i = 0 ; i < MAX_GROUPS ; i++) {
    feed.groups[i].lastMessageRead = -1;
  }

  /* No status file yet (such as on the first run): keep the defaults */
  fp = fopen(GRPSTS_FILE, "r");
  if (fp == NULL) return;

  /* Each entry has the form "<group> : <last message read>" */
  while (fscanf( fp, "%79s : %d", line, &curMsg ) == 2) {

    for (i = 0 ; i < MAX_GROUPS ; i++) {
      if (feed.groups[i].active) {
        if (!strcmp(feed.groups[i].groupName, line)) {
          feed.groups[i].lastMessageRead = curMsg;
          break;
        }
      }
    }

  }

  fclose(fp);

  return;
}

The purpose of readGroupStatus is to read the last message read for each of the configured news groups (if available). If a group was just added to the configuration file, no last-read message is recorded for it, and the WebAgent will begin with the first available message (as gathered by the nntpSetGroup function). Listing 11.18 shows the format of the file.

Listing 11.18: Format of the Group Status File Read by readGroupStatus.

comp.robotics.misc : 96000
sci.space.history : 135501

The group status file (named group.sts within the file system) contains one group per line. Each line consists of the news-group name, a ':' separator, and the number of the last message read for that group.

The first step in reading the group status is clearing out the lastMessageRead field for each of the groups within the feed structure. We then walk through each element of the group status file and try to parse each line to a group name and a message number. Upon reading a line, we search the feed groups to see if the group name exists (since the user could have removed it). If the group is found, we update the lastMessageRead field for that group with the message number ( curMsg ) read from the file. We'll look later at how and when this file is generated.
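As a rough illustration of how one line of the group status file breaks apart, the sketch below parses a single "name : number" entry with sscanf. The helper name parseStatusLine and the 80-character group buffer are assumptions made for this example; they simply mirror the sizes used in Listing 11.17.

```c
#include <stdio.h>

/* Hypothetical helper (not from the WebAgent source): split one
 * "<group> : <last message>" status line into its two parts.
 * Returns 1 on success, 0 if the line does not match the format. */
int parseStatusLine( const char *text, char group[80], int *lastMsg )
{
  /* %79s bounds the copy so the 80-byte group buffer cannot overflow */
  return sscanf( text, "%79s : %d", group, lastMsg ) == 2;
}
```

A caller could read the file one line at a time with fgets and pass each line to this helper, skipping any line for which it returns 0.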

That completes the configuration process for the WebAgent. We'll now look at the meat of the WebAgent, data collection and filtering.

News Gathering and Filtering

Let's now focus on the process of gathering news and filtering it according to the user's specification. The process of gathering and filtering news is performed by a call to the checkNewsSources function (see Listing 11.19). This function very simply walks through the active groups in the feed structure and calls the checkGroup function to check the specific group.

Listing 11.19: Function checkNewsSources.

void checkNewsSources( void )
{
  int i;
  extern feedEntryType feed;

  for (i = 0 ; i < MAX_GROUPS ; i++) {
    if ( feed.groups[i].active ) {
      checkGroup( i );
    }
  }

  return;
}

The checkGroup function utilizes the previously discussed NNTP API to gather news based upon the user's specification (see Listing 11.20).

Listing 11.20: Function checkGroup.

void checkGroup( int group )
{
  int result, count, index = 0;
  char fqdn[80];
  news_t news;

  /* Allocate the buffer that the NNTP API fills with each message */
  news.msg = (char *)malloc(MAX_NEWS_MSG+1);
  if (news.msg == NULL) return;
  bzero( news.msg, MAX_NEWS_MSG+1 );
  news.msgLen = MAX_NEWS_MSG;

  prune( feed.url, fqdn );

  /* Connect to the defined NNTP server */
  count = nntpConnect( fqdn );

  if (count == 0) {

    /* Set to the defined group */
    count = nntpSetGroup( feed.groups[group].groupName,
                          feed.groups[group].lastMessageRead );

    index = 0;

    /* Limit the work performed in a single pass */
    if (count > 200) count = 200;

    while (count-- > 0) {

      result = nntpPeek( &news, MAX_NEWS_MSG );

      if (result > 0) {
        result = nntpParse( &news, HEADER_PARSE );
        if (result == 0) {
          testNewsItem( group, &news );
        }
      }

      feed.groups[group].lastMessageRead = news.msgId;
      nntpSkip();

    }

  }

  free( news.msg );
  nntpDisconnect();

  return;
}

Function checkGroup first constructs the news message. The NNTP API requires the user to supply the buffer used to collect news (since a message could be arbitrarily large). We allocate a 64K buffer here and load it into the msg field of the news structure. The size of this buffer is also loaded into the msgLen field so that the NNTP API does not write beyond the bounds of the buffer. Next, the NNTP server name is pruned using the prune function. Recall that this function removes the initial protocol specification from the URL (nntp://) and any trailing '/' if it exists.

A session is constructed to the named NNTP server through the nntpConnect API function. If the return value indicates connection success, the group is set. Note that the group is set here based upon the user specification of the group argument. This is the index for the particular group of interest. This function returns the number of messages that are available to read. To avoid spending an inordinate amount of time working through all of the available messages, we cap the value at 200.

A loop is then performed to read the computed number of messages for the current group. We use the nntpPeek function since we're only interested in getting the message header information, specifically the subject field, to test whether it matches the group's search criteria. The nntpParse function is called to parse the subject (and other fields) from the header into the news structure. Given successful returns from nntpPeek and nntpParse, we pass the news item to testNewsItem to see if the search criteria match.

After the news item is tested, we update the lastMessageRead field to account for the current message and then skip the message. Recall that when only the nntpPeek function is used, we must also call nntpSkip to step to the next available message.

Upon completion of the loop, we free our previously malloc'd news buffer (news.msg) and disconnect from the NNTP server using nntpDisconnect.

Testing the news message against the current news group's search criteria is a very simple process, as shown in Listing 11.21.

Listing 11.21: Function testNewsItem.

void testNewsItem( int group, news_t *news )
{
  int i, count=0;
  char *cur;

  if (feed.groups[group].numSearchStrings > 0) {

    /* Count the search strings that appear in the subject */
    for ( i = 0 ; i < feed.groups[group].numSearchStrings ; i++ ) {
      cur = strstr( news->subject,
                     feed.groups[group].searchString[i] );
      if (cur) count++;
    }

  } else {
    /* No search strings were defined, so every message is kept */
    count = -1;
  }

  if (count) {
    insertNewsItem( group, count, news );
  }

  return;
}

We use a very simple threshold to define whether a message matches the search criteria. If any of the search strings defined for the group are found within the subject of the message, we keep the news item to later present to the user. The search is performed using the strstr function, which identifies whether one string is found in another. For each of the search strings of the current group, we search the subject using strstr . If the return is non-NULL, a match was found and we increment a count variable. The count variable represents the number of matched search strings and is used as a 'goodness' indicator of the message (the higher the count, the higher the item will appear on the news list). If the count is non-zero (or no search strings were presented by the user), then we add this to the list of news items for later presentation.
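One detail worth noting is that strstr is case sensitive, while the search strings were converted to lowercase when the configuration was parsed. The listings in this section do not show whether the subject is also lowercased (for example, inside nntpParse). If it is not, a match count along the following lines could be used. The function name countMatches and the SUBJECT_MAX limit are assumptions made for this sketch only.

```c
#include <ctype.h>
#include <string.h>

#define SUBJECT_MAX 256   /* assumed limit, for this sketch only */

/* Hypothetical helper (not from the WebAgent source): count how many
 * of the (already lowercase) search terms occur in the subject,
 * ignoring the case of the subject. */
int countMatches( const char *subject, const char *terms[], int numTerms )
{
  char lowered[SUBJECT_MAX];
  size_t i;
  int t, count = 0;

  /* Work on a lowercase copy so the original subject is untouched */
  for (i = 0 ; (subject[i] != '\0') && (i < SUBJECT_MAX-1) ; i++) {
    lowered[i] = (char)tolower((unsigned char)subject[i]);
  }
  lowered[i] = '\0';

  for (t = 0 ; t < numTerms ; t++) {
    if (strstr( lowered, terms[t] ) != NULL) count++;
  }

  return count;
}
```

As in testNewsItem, the returned count can serve directly as the rank of the news item.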

The news list is a list of elements that act as containers for news data. The elementType is provided in Listing 11.22. This structure contains all the necessary information to describe a news message so that if the user desires to view the full message later, the relevant information can be used to retrieve the message.

Listing 11.22: Structure elementType for Storing Interesting News Items.

typedef struct elementStruct *elemPtr;

typedef struct elementStruct {
  int  group;
  int  rank;
  int  msgId;
  char subject[MAX_LG_STRING+1];
  char msgDate[MAX_SM_STRING+1];
  char link[MAX_SM_STRING+1];
  int  shown;
  struct elementStruct *next;
} elementType;

While many of these items are self-explanatory, the rest will be discussed within the context of the insertNewsItem function (see Listing 11.23).

Listing 11.23: Function insertNewsItem.

// Define the head of the news list.
elementType head;

void insertNewsItem( int group, int count, news_t *news )
{
  elementType *walker = &head;
  elementType *newElement;

  newElement = (elementType *)malloc(sizeof(elementType));
  if (newElement == NULL) return;

  newElement->group = group;
  newElement->rank = count;
  newElement->msgId = news->msgId;

  /* strncpy does not terminate a truncated copy, so do it here */
  strncpy( newElement->subject, news->subject, MAX_LG_STRING );
  newElement->subject[MAX_LG_STRING] = 0;
  strncpy( newElement->msgDate, news->msgDate, MAX_SM_STRING );
  newElement->msgDate[MAX_SM_STRING] = 0;

  newElement->shown = 0;
  snprintf( newElement->link, sizeof(newElement->link),
            "art%d_%d", group, news->msgId );
  newElement->next = (elementType *)NULL;

  while (walker) {

    /* If no next element, add new element to the end */
    if (walker->next == NULL) {
      walker->next = newElement;
      break;
    }

    /* Otherwise, insert in rank order (descending) */
    if (walker->next->rank < newElement->rank) {
      newElement->next = walker->next;
      walker->next = newElement;
      break;
    }

    walker = walker->next;

  }

  return;
}

The first step of adding a news item to the list is to create a new news element (of type elementType ). The element is created by mallocing a block of memory, and casting the memory to type elementType . The group for which the item was found is then loaded into group and the count is loaded as the rank (relative position based upon other news items in the list). The message identifier is stored as msgId (the unique ID of the message within the group) and the subject and msgDate are copied.

The shown field indicates whether the particular item has been shown to the user. This is important because the user has the ability to clear the list of currently viewed news items. When the clearing process occurs, it is performed only on those items that the user has seen (not items that may have been collected but not yet seen). When the item is displayed, the shown flag is set to one. It's initialized to zero here, representing a new item not yet seen.

The link field is a special field used by the WebAgent to uniquely identify the article. The link is presented to the user as part of an HTML link tag. When the WebAgent's HTTP server receives a request for this link, it understands how to identify which particular message the user has requested to see. For example, if the group of the message is seven and the message ID is 20999, then the link will be defined as 'art7_20999'. The WebAgent understands how to parse this link to retrieve the article (identifier 20999) from the group (index 7 in the groups table of the feed structure).
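The code that decodes the link is not shown in this section, but the reverse mapping can be sketched with sscanf. The function name parseArticleLink is hypothetical, and the sketch assumes that the browser requests the link with a leading '/' (as in '/art7_20999').

```c
#include <stdio.h>

/* Hypothetical helper (not from the WebAgent source): recover the
 * group index and message ID from a requested filename such as
 * "/art7_20999". Returns 1 on success, 0 if the name does not match. */
int parseArticleLink( const char *filename, int *group, int *msgId )
{
  if (sscanf( filename, "/art%d_%d", group, msgId ) != 2) return 0;

  /* Negative values can never name a valid group or message */
  return (*group >= 0) && (*msgId >= 0);
}
```

Before using the result, a caller would still need to check that the group index falls within the groups table and that the group is active.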

Finally, the next field is the next element in the list. We initialize this to NULL (end of list) until we figure out where this particular item should be inserted.

At this point, we have a new elementType structure with the fields initialized based upon the news argument passed by the caller. The task is now simply to insert the item into the list based upon the rank. The higher the rank, the higher it will appear in the list. Recall that at the beginning of the function we set a walker variable to the header of the list. The head is a dummy element that contains no data, but exists only to simplify the management of the linked list (see Figure 11.8).

Figure 11.8: News list example.

From Listing 11.23, it's clear that the algorithm walks down the list of elements by sitting on one element and looking forward to the next. In a singly linked list, an insertion must be made from the preceding element, because we need to update that element's next pointer as well as the next pointer of the item being inserted.

The first case to be handled is where there's no element next on the list. In this case, we simply add our new element to the tail (point the next pointer of our current element to the element to be added). We then break from the loop and return. Otherwise, we test the rank of the next element against the rank of the element to be inserted. If our element to be inserted has a rank greater than the element next on the list, then the element should be inserted here (between the current element on which we sit, and the next element to which it points). To insert the item, we set the next pointer of our element to be inserted to the next element in the list, and then point the next pointer of our current element to our element to be inserted. This completes the chain with the newly inserted element.

If the rank test was not satisfied, then we walk to the next element in the list (set our current element to the next element), and repeat the test as discussed above. This process creates a linked-list of news items that are to be presented to the user, in rank descending order. We'll look at the presentation of this data in the next section.

User Interface

The user interface presented by the WebAgent is an HTTP server accessible through a simple Web browser. In this section, we'll describe the HTTP server and how it operates to present dynamic data collected through the NNTP API.

The HTTP server must be initialized through a call to initHttpServer. This function is called once by the main application to create the HTTP server and socket to which clients may connect (see Listing 11.24).

Listing 11.24: Function initHttpServer.

int initHttpServer( void )
{
  int on=1, ret;
  struct sockaddr_in servaddr;

  if (listenfd != -1) close( listenfd );

  listenfd = socket( AF_INET, SOCK_STREAM, 0 );
  if (listenfd < 0) return -1;

  /* Make the port immediately reusable */
  ret = setsockopt( listenfd, SOL_SOCKET,
                     SO_REUSEADDR, &on, sizeof(on) );
  if (ret < 0) return -1;

  /* Set up the server socket to accept connections from any
   * address at port 8080.
   */
  bzero( (void *)&servaddr, sizeof(servaddr) );
  servaddr.sin_family = AF_INET;
  servaddr.sin_addr.s_addr = htonl( INADDR_ANY );
  servaddr.sin_port = htons( 8080 );

  /* Bind the socket with the prior servaddr structure */
  ret = bind( listenfd,
              (struct sockaddr *)&servaddr, sizeof(servaddr) );
  if (ret < 0) return -1;

  listen(listenfd, 1);

  return 0;
}

The initHttpServer function simply creates the server socket for which another function will check to see if clients have connected. This server will be non-traditional in that we won't sit blocked on the accept call awaiting client connections. Instead, we'll make use of the select call to know when a client has connected.

Upon creating the socket (in Listing 11.24), we enable the SO_REUSEADDR socket option so that we can bind to port 8080 immediately. If this option were not used, we could be required to wait (typically up to a few minutes) between stopping and starting the WebAgent, because connections from the previous run may remain in the TIME_WAIT state. We then bind to port 8080 and, using the INADDR_ANY symbol, permit connections on any interface (which matters if the host happens to be multi-homed). Once bind succeeds, we call listen to put the socket into the listening state (and allow clients to connect).

Since the WebAgent is required to multi-task (gather news, monitor Web sites, and serve user requests), we must be able to perform tasks periodically while awaiting user interaction through the HTTP server. The WebAgent accomplishes this by checking for client connections and, at timeout intervals, checking whether any data collection is necessary. The timeout is provided by the select function (see Listing 11.25).

Listing 11.25: Function checkHttpServer.

void checkHttpServer( void )
{
  fd_set rfds;
  struct timeval tv;
  int ret = -1;
  int connfd;
  socklen_t clilen;
  struct sockaddr_in cliaddr;

  FD_ZERO( &rfds );
  FD_SET( listenfd, &rfds );

  /* Wait up to one second for a client to connect */
  tv.tv_sec = 1;
  tv.tv_usec = 0;

  ret = select( listenfd+1, &rfds, NULL, NULL, &tv );

  if (ret > 0) {

    if (FD_ISSET(listenfd, &rfds)) {

      clilen = sizeof(cliaddr);
      connfd = accept( listenfd,
                       (struct sockaddr *)&cliaddr, &clilen );

      /* accept returns -1 on error; any other value is valid */
      if (connfd >= 0) {
        handleConnection( connfd );
        close( connfd );
      }

    } else {

      /* Some kind of error, reinitialize... */
      initHttpServer();

    }

  } else if (ret < 0) {

    /* Some kind of error, reinitialize... */
    initHttpServer();

  } else {

    // timeout -- no action.

  }

  return;
}

Recall from Listing 11.24 that listenfd represents the socket descriptor for the HTTP server. This descriptor is used with the select call to notify us when a client connection is available. The select call also provides a timeout feature, so that if a client connection does not occur within some period of time, the timeout forces the select call to unblock and notify the caller. The caller can then perform other processing. This is the basis for the WebAgent's ability to perform data collection on a periodic basis. Once data collection is complete, WebAgent returns to checkHttpServer to see if a client has connected to gather data. A full discussion of the select call is outside of the scope of this book, but a few useful references are provided in the resources section.

The rfds structure is used by the select call as a bitmap of the sockets to monitor. We have only one, listenfd, so we clear the set with FD_ZERO and use the FD_SET macro to enable this socket within the rfds structure. We also initialize our timeval structure, specifying that the timeout is one second. The call to select is then made, specifying that we're awaiting a read event (a pending client connection is reported as a read event on a listening socket) or a timeout. The caller is notified of whichever occurs first.

Once select returns, we check the return value. If the return is less than zero, then some type of error has occurred and we reinitialize the HTTP server through a call to initHttpServer. A return of zero indicates that the call timed out (based upon the caller's predefined timeout value). In this case, we take no action and simply return. Data collection may then occur, if the time has come to do this. Finally, if the return value is greater than zero, a socket event occurred. We use the FD_ISSET macro to identify which socket caused the event (which should be our listenfd, since no other sockets were configured). If the FD_ISSET macro confirms that listenfd indeed has a client waiting, we accept the connection using the accept call and invoke handleConnection to service this request. Otherwise, an internal error has occurred and we reinitialize using initHttpServer.

The handleConnection function handles a single HTTP transaction. The HTTP request is first parsed to identify what the client is asking us to do. Based upon this request, a handler function is called to generate an HTTP response (see Listing 11.26).

Listing 11.26: Function handleConnection.

void handleConnection( int connfd )
{
  int len, max, loop;
  char buffer[MAX_BUFFER+1];
  char filename[80+1];

  /* Read in the HTTP request message */
  max = 0; loop = 1;
  while (loop) {
    /* Give up rather than read past the end of the buffer */
    if (max + 255 > MAX_BUFFER) return;
    len = read(connfd, &buffer[max], 255);
    if (len <= 0) return;
    max += len;
    buffer[max] = 0;
    /* An empty line (CR LF CR LF) terminates the request */
    if ((max >= 4) &&
        (buffer[max-4] == 0x0d) && (buffer[max-3] == 0x0a) &&
        (buffer[max-2] == 0x0d) && (buffer[max-1] == 0x0a)) {
      loop = 0;
    }
  }

  /* Determine the HTTP request */
  if (!strncmp(buffer, "GET", 3)) {

    getFilename(buffer, filename, 4);

    /* Within this tiny HTTP server, the filename parsed from the
     * request determines the function to call to emit an HTTP
     * response.
     */
    if      (!strncmp(filename, "/index.html", 11))
      emitNews( connfd );
    else if (!strncmp(filename, "/config.html", 12))
      emitConfig( connfd );
    else if (!strncmp(filename, "/art", 4))
      emitArticle( connfd, filename );
    else
      write(connfd, notfound, strlen(notfound));

  } else if (!strncmp(buffer, "POST", 4)) {

    getFilename(buffer, filename, 5);

    /* Ditto for the POST request (as above for GET). The POST
     * filename determines what to do -- the only case though
     * is the "Mark Read" button which clears the shown entries.
     */
    if (!strncmp(filename, "/clear", 6)) {
      clearEntries();
      emitHTTPResponseHeader( connfd );
      strcpy(buffer, "<P><H1>Click Back and Reload to "
                     "refresh page.</H1><P>\n\n");
      write(connfd, buffer, strlen(buffer));
    } else {
      write(connfd, notfound, strlen(notfound));
    }

  } else {

    strcpy(buffer, "HTTP/1.1 501 Not Implemented\r\n\r\n");
    write(connfd, buffer, strlen(buffer));

  }

  return;
}

The first step in handling a new HTTP connection is to read the HTTP request. The initial loop reads in the request, looking for an empty line. The empty line that follows the request is HTTP's way of terminating the request. Once we have our buffer with the completed request, we test it to see if the browser client sent us a GET request or a POST request. The GET request is used to request a file from the HTTP server; in our case, all files are dynamically generated based upon the filename that was requested. The POST request specifies that the user clicked a button within the served page, and is accompanied by a filename (traditionally a CGI filename, or Common Gateway Interface). CGI provides the means to interface the HTTP server with scripts to allow the server to perform actions. We simply look at the filename that the client POSTed, and use this to determine which action to perform.
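The end-of-request test can be expressed as a small predicate, which also makes it easy to guard against looking at bytes that have not yet arrived. The helper below is a sketch for illustration; the name requestComplete is an assumption and does not appear in the WebAgent source.

```c
#include <string.h>

/* Hypothetical helper (not from the WebAgent source): report whether
 * the len bytes read so far end with the empty line (CR LF CR LF)
 * that terminates an HTTP request header. */
int requestComplete( const char *buf, size_t len )
{
  /* Fewer than four bytes cannot hold the terminator, and checking
   * the length first avoids indexing before the start of buf. */
  return (len >= 4) && (memcmp( buf + len - 4, "\r\n\r\n", 4 ) == 0);
}
```

Note that this test only inspects the end of the data received so far, so it suits requests that carry no body after the empty line.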

For GET requests, the server knows how to serve three filenames. The '/index.html' file is the default file and represents the main page for news browsing and Web site monitoring (recall Figure 11.5). When the client browser requests this file, the function emitNews is called to serve the news page. File '/config.html' represents the WebAgent configuration page (displays the current configuration). This was shown in Figure 11.7. If the filename begins with '/art' then the client browser has requested a particular piece of news (recall Figure 11.6). We'll see in the emitNews function how articles are linked within the news page. Function emitArticle is used to satisfy this request. Finally, if the HTTP server does not recognize the filename requested, an HTTP error response is generated (HTTP error code 404, file not found).

For POST requests, the associated filename is parsed from the HTTP request message. If the requested file was '/clear', then the client has requested that we clear the news that has already been viewed (as invoked through a button-click on the news page). We clear the viewed news items using clearEntries and then emit the HTTP response message header using emitHTTPResponseHeader. Finally, we write a short message to the user (which is displayed in the browser) to hit the back button and refresh to view the current news. If '/clear' was not the requested file, then an error has occurred and the error response is generated.

One final error leg in the handleConnection function exists for an unknown request. If the HTTP request message was not of type GET or POST, then we emit an error response that specifies that the feature is unimplemented.

Let's now look at some of the support functions used by handleConnection. The first, getFilename (Listing 11.27), is used to parse the filename from the HTTP request message. It parses the filename by skipping the HTTP request type and then copying any characters present until a space is found. One special case for getFilename is if the file requested was simply '/'. This is automatically converted to '/index.html', our news presentation page.

Listing 11.27: Support Function getFilename.

void getFilename(char *inbuf, char *out, int start)
{
  int i=start, j=0;

  /*
   * Skip any initial spaces
   */
  while (inbuf[i] == ' ') i++;

  /*
   * Copy up to the next space (or the end of the request). The
   * caller's buffer holds 80 characters plus the terminator.
   */
  while ((inbuf[i] != 0) && (inbuf[i] != ' ') && (j < 80)) {
    out[j++] = inbuf[i++];
  }
  out[j] = 0;

  if (!strcmp(out, "/")) strcpy(out, "/index.html");

  return;
}

The next support function is emitHTTPResponseHeader , and is used to generate a simple HTTP response header (see Listing 11.28). This response instructs the client browser that the request was understood and that the response will be in HTML format (through the Content-Type header element).

Listing 11.28: Support Function emitHTTPResponseHeader.

void emitHTTPResponseHeader( int connfd )
{
  char line[80];

  /* HTTP header lines end with CR LF; an empty line ends the header */
  strcpy( line, "HTTP/1.1 200 OK\r\n" );
  write( connfd, line, strlen(line) );

  strcpy( line, "Server: tinyHttp\r\n" );
  write( connfd, line, strlen(line) );

  strcpy( line, "Connection: close\r\n" );
  write( connfd, line, strlen(line) );

  strcpy( line, "Content-Type: text/html\r\n\r\n" );
  write( connfd, line, strlen(line) );

  return;
}

The caller passes in the socket descriptor for the current HTTP connection, which the emitHTTPResponseHeader function uses to send back the response.

The final support function for handleConnection is clearEntries (see Listing 11.29). This function is used to clear any news elements that have already been viewed by the client. We initialize the shown variable of the news element to zero, representing not yet viewed. This flag is set to one after the news page is served that contains those news elements. When the user clicks the "Mark Read" button on the news page, this function is called to clear the previously viewed items.

Listing 11.29: Support Function clearEntries.

void clearEntries( void )
{
  elementType *walker = &head;
  elementType *temp;
  int i;
  extern monitorEntryType monitors[];

  /* Clear the news chain (for items that have been viewed) */
  while (walker->next) {

    if (walker->next->shown) {
      temp = walker->next;
      walker->next = walker->next->next;
      free(temp);
    } else {
      walker = walker->next;
    }

  }

  /* Clear sites to be monitored (that have been seen) */
  for (i = 0 ; i < MAX_MONITORS ; i++) {
    if ((monitors[i].active) && (monitors[i].shown)) {
      monitors[i].changed = 0;
      monitors[i].shown = 0;
    }
  }

  emitGroupStatus();
}

The clearEntries function works similarly to insertNewsItem (shown in Listing 11.23). The function walks through the list of news items looking for items that have the shown flag set. Note that we look forward from the current element, because removing an element from a singly linked list requires setting the next pointer of the current element to the next pointer of the next element. This effectively removes the item from the chain. We save a pointer to the removed item in temp so that we can free it once the chain is updated.

Function clearEntries also clears the shown flag for Web sites that are monitored. If the Web site has been noted as changed and shown, these flags are cleared so that it will not show up on the next request of the news page. Finally, the clearEntries function calls the function emitGroupStatus to write the group status file. This file identifies the last message read for each of the current news groups (see Listing 11.30).

Listing 11.30: Function emitGroupStatus.

void emitGroupStatus( void )
{
  FILE *fp;
  int i;

  fp = fopen(GRPSTS_FILE, "w");
  if (fp == NULL) return;

  for (i = 0 ; i < MAX_GROUPS ; i++) {
    if (feed.groups[i].active) {
      fprintf( fp, "%s : %d\n",
                feed.groups[i].groupName,
                feed.groups[i].lastMessageRead );
    }
  }

  fclose(fp);

  return;
}

Function emitGroupStatus simply emits the group name and last message read for each active group in the feed structure. If the WebAgent were stopped after this function, it could be restarted without missing any new messages or reintroducing older messages.

We continue with the user interface functionality by analyzing the three functions that generate the content served through the HTTP server. Recall from Listing 11.26, once the HTTP GET request was identified, the filename was parsed to route the request to the function to provide the required content.

The first content delivery function that we'll investigate is emitConfig (see Listing 11.31). This function displays the current configuration of the WebAgent for user review. The configuration cannot be changed through the Web page, but must instead be modified by editing the configuration file.

Listing 11.31: Function emitConfig.

const char *prologue={
  "<HTML><HEAD><TITLE>WebAgent</TITLE></HEAD>"
  "<BODY TEXT=\"#000000\" bgcolor=\"#FFFFFF\" link=\"#0000EE\" "
  "vlink=\"#551A8B\" alink=\"#FF0000\">"
  "<BR><font face=\"Bauhaus Md BT\"><font color=\"#000000\">"
};

const char *epilogue={
  "</BODY></HTML>\n"
};

void emitConfig( int connfd )
{
  char line[MAX_LINE+1];
  int i, j;
  extern monitorEntryType monitors[];
  extern feedEntryType feed;

  emitHTTPResponseHeader( connfd );

  write( connfd, prologue, strlen(prologue));

  strcpy(line, "<H1>Configuration</H1></font></font><BR><BR>");
  write( connfd, line, strlen(line));

  strcpy(line, "<font size=+2>Sites to Monitor</font><BR><BR>");
  write( connfd, line, strlen(line));

  /* Emit the table of monitored Web sites */
  strcpy(line, "<center><table BORDER=3 WIDTH=100% NOSAVE><tr>\n");
  write( connfd, line, strlen(line));

  for (i = 0 ; i < MAX_MONITORS ; i++) {
    if (monitors[i].active) {
      sprintf(line, "<tr><td><font size=+1>%s</font></td><td>"
                    "<font size=+1>%s</font></td></tr>\n",
               monitors[i].urlName, monitors[i].url);
      write( connfd, line, strlen(line));
    }
  }

  strcpy(line, "</tr></table></center><BR><BR>\n");
  write( connfd, line, strlen(line));

  sprintf(line, "<H2>Feed %s</H2><BR><BR>\n", feed.url);
  write( connfd, line, strlen(line));

  /* Emit the table of news groups and their search strings */
  strcpy(line, "<center><table BORDER=3 WIDTH=100% NOSAVE><tr>\n");
  write( connfd, line, strlen(line));

  for (i = 0 ; i < MAX_GROUPS ; i++) {

    if (feed.groups[i].active) {

      sprintf(line, "<tr><td><font size=+1>Group %s</font></td>\n",
               feed.groups[i].groupName);
      write( connfd, line, strlen(line));

      strcpy(line, "\n<td><font size=+1>");

      if (feed.groups[i].numSearchStrings > 0) {
        for (j = 0 ; j < feed.groups[i].numSearchStrings ; j++) {
          if (j > 0) strcat(line, ", ");
          strcat(line, feed.groups[i].searchString[j]);
        }
      } else {
        strcat(line, "[*]");
      }

      strcat(line, "</font></td></tr>\n");
      write( connfd, line, strlen(line) );

    }

  }

  strcpy(line, "</tr></table></center><BR><BR>\n");
  write( connfd, line, strlen(line));

  write( connfd, epilogue, strlen(epilogue));

  return;
}

The first item to notice is the construction of two constant character arrays that contain the HTML header and trailer information. The prologue sets up the color scheme, font size, and page title. The epilogue serves to complete the HTML page.

The first step in serving an HTML page through the server is emitting the HTTP response header (using the previously discussed emitHTTPResponseHeader ). Emitting the prologue follows next along with some captions for the page. All of these elements use the connfd socket descriptor passed into the function. This is our conduit to the client browser; anything we send through this socket will be received and interpreted by the client.

Prior to emitting our Web site monitoring information, an HTML table is created using the <table> tag. Each element within the table is then enclosed in the row tag <tr>. We walk through the monitors table looking for active elements. Once found, we emit them with the appropriate tags to create a row with two columns. The columns represent the URL name (a textual name for the Web site) and the actual URL. After we've exhausted the elements of the table, we close out the table using the </table> tag.
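Because each row is formatted into the fixed-size line buffer with sprintf, a long site name or URL could overrun it. One way to guard against that, sketched here under the assumption that truncated rows should simply be reported to the caller, is to format with snprintf and check its return value. The function name formatMonitorRow is hypothetical.

```c
#include <stdio.h>

/* Hypothetical helper: format one two-column table row into buf.
 * Returns 0 on success, -1 if the row did not fit (buf then holds a
 * truncated, NUL-terminated row). */
int formatMonitorRow( char *buf, size_t size,
                      const char *urlName, const char *url )
{
  int n = snprintf( buf, size,
                    "<tr><td><font size=+1>%s</font></td><td>"
                    "<font size=+1>%s</font></td></tr>\n",
                    urlName, url );

  /* snprintf returns the length it wanted to write */
  return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```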

Next, we emit the feed URL as a single write through the socket. The line is constructed as one element using sprintf, and then emitted through the socket using the standard write call.

Emitting the group information is very similar to the method used to emit the monitors. A new table is defined and each of the rows is emitted. Two columns are present per row, one for the newsgroup name and the other for the search strings applied to that group. The group name is a single element, but the search strings are independent elements. Therefore, a loop constructs a single string from the independent search strings, separated by ", ". If no search strings are present, we emit the "[*]" symbol instead. Note that without any search strings, all messages in the group are presented to the user. Once the outer loop has completed, the table is terminated with the </table> tag.
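The joining loop can be isolated into a small function that also checks the available space, since strcat itself performs no bounds checking. This is a sketch only: the name joinSearchStrings is hypothetical, and it assumes the search strings are available as an array of C string pointers, which may differ from how feedEntryType actually stores them.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: join count strings with ", " into out.
 * An empty list is written as "[*]", mirroring emitConfig.
 * Returns 0 on success, -1 if out is too small. */
int joinSearchStrings( char *out, size_t size,
                       const char *strings[], int count )
{
  size_t used = 0;
  int j;

  if (size == 0) return -1;
  out[0] = '\0';

  if (count <= 0) {
    if (size < 4) return -1;          /* "[*]" plus terminator */
    strcpy( out, "[*]" );
    return 0;
  }

  for (j = 0 ; j < count ; j++) {
    /* Prefix every string except the first with the separator */
    int n = snprintf( out + used, size - used, "%s%s",
                      (j > 0) ? ", " : "", strings[j] );
    if (n < 0 || (size_t)n >= size - used) return -1;
    used += (size_t)n;
  }
  return 0;
}
```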

Finally, the epilogue is emitted to close out the HTML page. This tells the client browser that it can render the page and present it to the user.

The next function (emitNews, shown in Listing 11.32) is very similar to emitConfig except for a couple of minor points. We'll illustrate these, and ignore the duplicate elements that were discussed for emitConfig.

Listing 11.32: Function emitNews.

void emitNews( int connfd )
{
  int i;
  char line[MAX_LINE+1];
  elementType *walker;
  extern monitorEntryType monitors[];
  extern feedEntryType feed;
  extern elementType head;

  emitHTTPResponseHeader( connfd );
  write( connfd, prologue, strlen(prologue));

  strcpy(line,
          "<H1>Web Agent Results</H1></font></font><BR><BR>");
  write( connfd, line, strlen(line));

  /* Monitored sites that are active and have changed */
  strcpy(line, "<center><table BORDER=3 WIDTH=100% NOSAVE><tr>\n");
  write( connfd, line, strlen(line));

  for (i = 0 ; i < MAX_MONITORS ; i++) {
    if ((monitors[i].active) && (monitors[i].changed)) {
      sprintf(line, "<tr><td><font size=+1>%s</font></td>\n"
                    "<td><font size=+1><a href=\"%s\">%s</a>"
                    "</font></td></tr>\n",
               monitors[i].urlName, monitors[i].url,
               monitors[i].url);
      write( connfd, line, strlen(line));
      monitors[i].shown = 1;
    }
  }

  strcpy(line, "</tr></table></center><BR><BR>\n");
  write( connfd, line, strlen(line));

  /* News items collected from the feed */
  walker = head.next;
  if (walker) {
    strcpy(line,
           "<center><table BORDER=3 WIDTH=100% NOSAVE><tr>\n");
    write( connfd, line, strlen(line));

    while (walker) {
      sprintf(line, "<tr><td><font size=+1>%s</font></td>\n"
                    "<td><font size=+1><a href=\"%s\">"
                    "%s</a></font></td>"
                    "<td><font size=+1>%s</font></td></tr>",
                    feed.groups[walker->group].groupName,
                    walker->link,
                    walker->subject,
                    walker->msgDate );
      write( connfd, line, strlen(line));
      walker->shown = 1;
      walker = walker->next;
    }

    strcpy(line, "</tr></table></center>\n");
    write( connfd, line, strlen(line));
  }

  /* "Mark Read" button posts to /clear */
  strcpy(line, "<FORM METHOD=\"POST\" ACTION=\"/clear\">");
  write( connfd, line, strlen(line));

  strcpy(line, "<BR><BR><INPUT TYPE=\"submit\" "
               "VALUE=\"Mark Read\"><BR></FORM>\n");
  write( connfd, line, strlen(line));

  write( connfd, epilogue, strlen(epilogue));

  return;
}

The first item to note is in the section emitting the monitors (the monitors loop). Only those rows that are active and have been flagged as changed are emitted. Once an item is displayed, its shown flag is set. Recall from the discussion of Listing 11.29 (clearEntries) that the shown flag marks items that have already been displayed: when the client requests a clear, items with this flag set can be safely deleted. The shown flag is set in the same way for the news items that are displayed.
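To make the role of the shown flag concrete, the following sketch removes every flagged element from a singly linked list with a dummy head node, the same shape suggested by the head.next usage in emitNews. It is not the book's clearEntries (Listing 11.29); the element_t type and the clearShown name are hypothetical simplifications.

```c
#include <stdlib.h>

/* Hypothetical, simplified element: the real elementType also carries
 * group, link, subject, and msgDate fields. */
typedef struct element {
  int shown;
  struct element *next;
} element_t;

/* Unlink and free every element whose shown flag is set.
 * head is a dummy node whose next field points at the first item.
 * Returns the number of elements removed. */
int clearShown( element_t *head )
{
  element_t *prev = head;
  int removed = 0;

  while (prev->next) {
    element_t *cur = prev->next;
    if (cur->shown) {
      prev->next = cur->next;   /* unlink before freeing */
      free( cur );
      removed++;
    } else {
      prev = cur;               /* keep unseen items in the list */
    }
  }
  return removed;
}
```

Items that were collected after the page was served never had shown set, so they survive the clear and appear on the next request.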

The final item to note in Listing 11.32 is the use of a link reference that lets the user click the subject of a news item in the browser to view the full article. The HTML <a href> tag is used when displaying the subject to create this reference. The link field of the elementType supplies the target of the reference (recall the discussion of the link field with Listing 11.23).

The final content generation function is emitArticle. This function is slightly more complicated in that it must communicate with the NNTP server to gather the body of the news item. Recall that only the header of the news message was initially extracted, to save time and space. When the user requests to view the article, a connection is created to the NNTP server to retrieve the entire article (see Listing 11.33).

Listing 11.33: Function emitArticle.

void emitArticle( int connfd, char *filename )
{
  int group, article, count, result;
  news_t news;
  char line[MAX_LINE+1];
  extern feedEntryType feed;

  /* Request path has the form /art<group>_<article> */
  sscanf(filename, "/art%d_%d", &group, &article);

  news.msg = (char *)malloc(MAX_NEWS_MSG+1);
  bzero( news.msg, MAX_NEWS_MSG+1 );
  news.msgLen = MAX_NEWS_MSG;

  emitHTTPResponseHeader( connfd );
  write( connfd, prologue, strlen(prologue));

  prune( feed.url, line );

  count = nntpConnect( line );

  if (count == 0) {

    /* Last message read is article-1, so the next is our article */
    count = nntpSetGroup( feed.groups[group].groupName,
                           article-1 );

    if (count > 0) {

      result = nntpRetrieve( &news, MAX_NEWS_MSG );

      if (result > 0) {

        result = nntpParse( &news, FULL_PARSE );

        if (result == 0) {

          /* Write to http */
          sprintf( line,
               "<font size=+1>Subject  : %s\n</font><BR><BR>",
                    news.subject );
          write( connfd, line, strlen(line) );

          sprintf( line,
               "<font size=+1>Sender   : %s\n</font><BR><BR>",
                    news.sender );
          write( connfd, line, strlen(line) );

          sprintf( line,
               "<font size=+1>Group    : %s\n</font><BR><BR>",
                    feed.groups[group].groupName );
          write( connfd, line, strlen(line) );

          sprintf( line, "<font size=+1>Msg Date : %s\n</font>"
                         "<BR><BR><hr><PRE>",
                    news.msgDate );
          write( connfd, line, strlen(line) );

          write( connfd,
                  news.bodyStart, strlen(news.bodyStart) );

          sprintf(line,
              "</PRE><BR><BR>End of Message\n<BR><BR>");
          write( connfd, line, strlen(line) );

        } else {

          /* Write error */
          printf("Parse error\n");

        }
      }
    }
  }

  write( connfd, epilogue, strlen(epilogue));

  free( news.msg );
  nntpDisconnect();

  return;
}

The first step in emitting an article is identifying which article the client browser has requested. The filename argument (parsed from the HTTP GET request in handleConnection) specifies this. We parse the filename to extract the group and article number. Recall that the group number is the index into the groups array of the feed structure. The article number is the actual numeric identifier of the article from the NNTP server.
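Listing 11.33 does not check the result of sscanf, so a malformed request path leaves group and article uninitialized. A more cautious version of the parsing step might look like the sketch below. The function name parseArticlePath and the maxGroups parameter (standing in for MAX_GROUPS) are hypothetical, and the check that article numbers start at 1 is an assumption.

```c
#include <stdio.h>

/* Hypothetical helper: extract the group index and article number from
 * a request path of the form "/art<group>_<article>".
 * Returns 0 on success, -1 if the path is malformed or out of range. */
int parseArticlePath( const char *filename, int maxGroups,
                      int *group, int *article )
{
  int g, a;

  /* Both conversions must succeed */
  if (sscanf( filename, "/art%d_%d", &g, &a ) != 2) return -1;

  /* The group indexes feed.groups[]; articles are assumed to be >= 1 */
  if (g < 0 || g >= maxGroups || a < 1) return -1;

  *group = g;
  *article = a;
  return 0;
}
```

A caller could respond with an error page when the function returns -1 instead of contacting the NNTP server.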

Next, we allocate the news structure that will hold the full news message. The URL of the news feed is parsed and used to connect to the NNTP server with the nntpConnect function. If successful, we set the group to the one identified by the "/art" filename (as feed.groups[group]). Note that we set the last message read to article-1. This means that once nntpSetGroup has completed, the first message to retrieve will be the article of interest. We then call nntpRetrieve and parse the results using nntpParse. One difference here is that we pass the FULL_PARSE symbolic constant to nntpParse so that the body of the message is identified.

What remains is to emit the retrieved information to the user through the passed socket descriptor connfd. We've seen most of the data that's being emitted. The new item is bodyStart, which represents the start of the message body of the article. Function emitArticle completes by writing the epilogue, freeing the buffer allocated for the news message, and disconnecting from the NNTP server using nntpDisconnect.
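One point the listing leaves open is that the article body is written to the browser unchanged. Characters such as < and & are still interpreted as markup inside a <PRE> block, so a body containing them may not display as written. A possible refinement, sketched here with a hypothetical helper named htmlEscape, is to replace those characters with HTML entities before writing.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy src into dst, replacing the characters that
 * HTML treats specially. Returns 0 on success, -1 if dst is too small. */
int htmlEscape( char *dst, size_t size, const char *src )
{
  size_t used = 0;

  for ( ; *src ; src++) {
    const char *rep;
    char single[2] = { *src, '\0' };
    size_t len;

    switch (*src) {
      case '<': rep = "&lt;";   break;
      case '>': rep = "&gt;";   break;
      case '&': rep = "&amp;";  break;
      case '"': rep = "&quot;"; break;
      default:  rep = single;   break;   /* ordinary character */
    }

    len = strlen( rep );
    if (used + len + 1 > size) return -1;  /* keep room for the NUL */
    memcpy( dst + used, rep, len );
    used += len;
  }

  if (size == 0) return -1;
  dst[used] = '\0';
  return 0;
}
```

The same treatment would apply to the subject and sender fields, which also originate from the news server.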

Main Function

Let's now put it all together with the main function for WebAgent. This function provides the basic loop for the WebAgent functionality (see Listing 11.34).

Listing 11.34: The WebAgent main() Function.

int main()
{
  int timer=0, ret, i;
  extern monitorEntryType monitors[];

  /* Parse the configuration file */
  ret = parseConfigFile( "config" );

  if (ret != 0) {
    printf("Error reading configuration file\n");
    exit(0);
  }

  /* Start the HTTP server */
  initHttpServer();

  while (1) {

    /* Check the news and monitor sites every 10 minutes */
    if ((timer % 600) == 0) {

      /* Check news from the defined net news server */
      checkNewsSources();

      /* Check to see if any defined Web-sites have been
       * updated.
       */
      for (i = 0 ; i < MAX_MONITORS ; i++) {
        if (monitors[i].active) monitorSite( i );
      }

    }

    /* Check to see if a client has made a request */
    checkHttpServer();

    timer++;

  }

}

WebAgent is first initialized by reading and parsing the configuration file using parseConfigFile and then starting the HTTP server with initHttpServer. We then start an infinite loop that performs two basic functions. The first is data collection and the second is checking for HTTP client connections.

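Data collection is performed every 10 minutes, whenever (timer % 600) is zero. Because timer is incremented once per pass and each pass waits about one second in checkHttpServer when no client connects, 600 passes approximate 10 minutes. The test is also true when timer is 0, so the first collection runs immediately at startup. Function checkNewsSources checks whether any news is available that matches the user's search criteria. The monitorSite function is then called in a loop for each active entry of the monitors table to check whether that Web site has changed.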
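Counting loop passes keeps the schedule simple, but the interval stretches whenever a pass takes longer than a second (for example, during data collection itself). As a variation, and not the approach used in Listing 11.34, the decision can be based on wall-clock time. The names collectionDue and COLLECT_INTERVAL below are hypothetical.

```c
#include <time.h>

#define COLLECT_INTERVAL 600   /* seconds; 10 minutes, as in the text */

/* Hypothetical helper: returns 1 when at least COLLECT_INTERVAL seconds
 * have passed since lastRun, or when nothing has run yet (lastRun == 0);
 * returns 0 otherwise. */
int collectionDue( time_t now, time_t lastRun )
{
  if (lastRun == 0) return 1;
  return difftime( now, lastRun ) >= COLLECT_INTERVAL;
}
```

In the main loop, the test (timer % 600) == 0 would become collectionDue( time(NULL), lastRun ), with lastRun updated to the current time after each collection.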

At the end of the loop, the HTTP server is checked for incoming client connections using checkHttpServer. The checkHttpServer function blocks for one second awaiting client connections. If no connection arrives in that time, the function returns and we check whether it is time to collect data. If a connection arrives while we're collecting data, the client waits and we pick up the connection at the next call to checkHttpServer.
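The body of checkHttpServer is not shown in this section. One common way to wait a bounded time for a connection is select with a timeout, and the sketch below illustrates that technique under the assumption that the server holds a listening socket descriptor. The function name waitReadable is hypothetical.

```c
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>

/* Hypothetical helper: wait up to `seconds` for fd to become readable.
 * For a listening socket, readable means a connection can be accepted.
 * Returns 1 if ready, 0 on timeout, -1 on error. */
int waitReadable( int fd, int seconds )
{
  fd_set rfds;
  struct timeval tv;
  int ret;

  FD_ZERO( &rfds );
  FD_SET( fd, &rfds );

  tv.tv_sec = seconds;
  tv.tv_usec = 0;

  ret = select( fd + 1, &rfds, NULL, NULL, &tv );
  if (ret < 0) return -1;

  return (ret > 0) ? 1 : 0;
}
```

A server loop built this way would call accept only when waitReadable returns 1, and otherwise fall through to the timer check.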