MiniCrawler: A Case Study


To show how easy WebRequest and WebResponse make Internet programming, a skeletal web crawler called MiniCrawler is developed. A web crawler is a program that simply moves from link to link. Search engines use web crawlers to catalog content. MiniCrawler is very simple. It starts at the URI that you specify and then reads the content at that address, looking for a link. If a link is found, it then asks if you want to go to that link, search for another link on the existing page, or quit.

MiniCrawler has several limitations. First, only absolute links that are specified using an href="http" attribute are found. Relative links are not used. Second, there is no way to go back to an earlier link. Third, it displays only the links and no surrounding content. Despite these limitations, the skeleton is fully functional, and you will have no trouble enhancing MiniCrawler to perform other tasks. In fact, adding features to MiniCrawler is a good way to learn more about the networking classes and networking in general.

Here is the entire code for MiniCrawler:

 /* MiniCrawler: A skeletal Web crawler.

    Usage:
      To start crawling, specify a starting
      URI on the command line. For example,
      to start at McGraw-Hill.com use this
      command line:

        MiniCrawler http://McGraw-Hill.com
 */

 using System;
 using System.Net;
 using System.IO;

 class MiniCrawler {

   // Find a link in a content string.
   static string FindLink(string htmlstr,
                          ref int startloc) {
     int i;
     int start, end;
     string uri = null;
     string lowcasestr = htmlstr.ToLower();

     i = lowcasestr.IndexOf("href=\"http", startloc);
     if(i != -1) {
       start = htmlstr.IndexOf('"', i) + 1;
       end = htmlstr.IndexOf('"', start);
       uri = htmlstr.Substring(start, end-start);
       startloc = end;
     }

     return uri;
   }

   public static void Main(string[] args) {
     string link = null;
     string str;
     string answer;

     int curloc; // holds current location in response

     if(args.Length != 1) {
       Console.WriteLine("Usage: MiniCrawler <uri>");
       return;
     }

     string uristr = args[0]; // holds current URI

     try {

       do {
         Console.WriteLine("Linking to " + uristr);

         // Create a WebRequest to the specified URI.
         HttpWebRequest req = (HttpWebRequest)
                WebRequest.Create(uristr);

         uristr = null; // disallow further use of this URI

         // Send that request and return the response.
         HttpWebResponse resp = (HttpWebResponse)
                req.GetResponse();

         // From the response, obtain an input stream.
         Stream istrm = resp.GetResponseStream();

         // Wrap the input stream in a StreamReader.
         StreamReader rdr = new StreamReader(istrm);

         // Read in the entire page.
         str = rdr.ReadToEnd();

         curloc = 0;

         do {
           // Find the next URI to link to.
           link = FindLink(str, ref curloc);

           if(link != null) {
             Console.WriteLine("Link found: " + link);
             Console.Write("Link, More, Quit? ");
             answer = Console.ReadLine();

             if(string.Compare(answer, "L", true) == 0) {
               uristr = string.Copy(link);
               break;
             } else if(string.Compare(answer, "Q", true) == 0) {
               break;
             } else if(string.Compare(answer, "M", true) == 0) {
               Console.WriteLine("Searching for another link.");
             }
           } else {
             Console.WriteLine("No link found.");
             break;
           }

         } while(link.Length > 0);

         // Close the Response.
         resp.Close();

       } while(uristr != null);

     } catch(WebException exc) {
       Console.WriteLine("Network Error: " + exc.Message +
                         "\nStatus code: " + exc.Status);
     } catch(ProtocolViolationException exc) {
       Console.WriteLine("Protocol Error: " + exc.Message);
     } catch(UriFormatException exc) {
       Console.WriteLine("URI Format Error: " + exc.Message);
     } catch(NotSupportedException exc) {
       Console.WriteLine("Unknown Protocol: " + exc.Message);
     } catch(IOException exc) {
       Console.WriteLine("I/O Error: " + exc.Message);
     }

     Console.WriteLine("Terminating MiniCrawler.");
   }
 }

Here is a short sample session that begins crawling at McGraw-Hill.com:

 Linking to http://mcgraw-hill.com
 Link found: http://sti.mcgraw-hill.com:9000/cgi-bin/query?mss=search&pg=aq
 Link, More, Quit? M
 Searching for another link.
 Link found: http://investor.mcgraw-hill.com/phoenix.zhtml?c=96562&p=irol-irhome
 Link, More, Quit? L
 Linking to http://investor.mcgraw-hill.com/phoenix.zhtml?c=96562&p=irol-irhome
 Link found: http://www.mcgraw-hill.com/index.html
 Link, More, Quit? L
 Linking to http://www.mcgraw-hill.com/index.html
 Link found: http://sti.mcgraw-hill.com:9000/cgi-bin/query?mss=search&pg=aq
 Link, More, Quit? Q
 Terminating MiniCrawler.

Let's take a close look at how MiniCrawler works. The URI at which MiniCrawler begins is specified on the command line. In Main( ), this URI is stored in the string called uristr. A request is created to this URI, and then uristr is set to null, which indicates that this URI has already been used. Next, the request is sent and the response is obtained. The content is then read by wrapping the stream returned by GetResponseStream( ) inside a StreamReader, and then calling ReadToEnd( ), which returns the entire contents of the stream as a string.
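Distilled from the program above, this core request/response sequence can be expressed as a small helper. The DownloadPage method shown next is hypothetical, not part of MiniCrawler, and the using statements are an addition here; they ensure that the response and the reader are closed even if an exception occurs:

 static string DownloadPage(string uristr) {
   // Create a WebRequest to the specified URI.
   HttpWebRequest req = (HttpWebRequest)
          WebRequest.Create(uristr);

   // Send the request, obtain the response, and wrap
   // the response stream in a StreamReader.
   using(HttpWebResponse resp = (HttpWebResponse)
          req.GetResponse())
   using(StreamReader rdr =
          new StreamReader(resp.GetResponseStream()))
     // Read the entire page as a single string.
     return rdr.ReadToEnd();
 }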

Using the content, the program then searches for a link. It does this by calling FindLink( ), which is a static method also defined by MiniCrawler. FindLink( ) is called with the content string and the starting location at which to begin searching. The parameters that receive these values are htmlstr and startloc, respectively. Notice that startloc is a ref parameter. FindLink( ) first creates a lowercase copy of the content string and then looks for a substring that matches href="http, which indicates a link. If a match is found, the URI is copied to uri, and the value of startloc is updated to the end of the link. Because startloc is a ref parameter, this causes its corresponding argument to be updated in Main( ), enabling the next search to begin where the previous one left off. Finally, uri is returned. Since uri was initialized to null, if no match is found, a null reference is returned, which indicates failure.
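For example, if FindLink( ) were given public access, the following hypothetical fragment shows how the ref parameter advances the search position from one call to the next:

 string html = "<a href=\"http://a.com\">A</a>" +
               "<a href=\"http://b.com\">B</a>";
 int loc = 0;

 // Each call resumes searching where the previous one stopped.
 string first  = MiniCrawler.FindLink(html, ref loc); // "http://a.com"
 string second = MiniCrawler.FindLink(html, ref loc); // "http://b.com"
 string third  = MiniCrawler.FindLink(html, ref loc); // null -- no more links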

Back in Main( ), if the link returned by FindLink( ) is not null, the link is displayed, and the user is asked what to do. The user can go to that link by pressing l, search the existing content for another link by pressing m, or quit the program by pressing q. If the user presses l, the link is followed and the content of the link is obtained. The new content is then searched for a link. This process continues until all potential links are exhausted.

You might find it interesting to increase the power of MiniCrawler. For example, you might try adding the ability to follow relative links. (This is not hard to do.) You might try completely automating the crawler by having it go to each link that it finds without user interaction. That is, starting at an initial page, have it go to the first link it finds. Then, in the new page, have it go to the first link, and so on. Once a dead-end is reached, have it backtrack one level, find the next link, and then resume linking. To accomplish this scheme, you will need to use a stack to hold the URIs and the current location of the search within a URI. One way to do this is to use a Stack collection. As an extra challenge, try creating tree-like output that displays the links.
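As a starting point for the relative-link enhancement, note that the System.Uri class can resolve a relative link against the address of the page on which it appears. The fragment below is a minimal sketch, not part of MiniCrawler; it assumes that FindLink( ) has been broadened to return relative href values as well:

 // new Uri(baseUri, relative) applies the standard
 // URI resolution rules.
 Uri page = new Uri("http://mcgraw-hill.com/books/index.html");

 Console.WriteLine(new Uri(page, "errata.html"));
 // -> http://mcgraw-hill.com/books/errata.html

 Console.WriteLine(new Uri(page, "/contact.html"));
 // -> http://mcgraw-hill.com/contact.html

 Console.WriteLine(new Uri(page, "http://www.osborne.com"));
 // an absolute link resolves to itself

For the fully automated version, a Stack collection can record the URI of each page along with the current search location within it, so the crawler can backtrack when it reaches a dead end.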



