Building a Browser


A protocol is the exchange between identical layers of two hosts, and a service is what one layer offers to the layer above it. Instead of delving into the details of how this process works, take a look at Listing 11.1, which shows you how to build a quick-and-easy browser. By default, the browser displays a URL that is hard-coded in the source. If you pass a URL on the command line when you run the application, the browser displays that page instead. Note that Listings 11.1 through 11.6 must be tested on a machine that is connected to the Internet.

Listing 11.1 Building a Basic Web Browser Using the URL Class
import java.io.*;
import java.net.*;

public class SimpleBrowser
{
   public static void main(String[] args) throws Exception
   {  //default URL displayed by browser
      String url = "www.google.com";
      if (args.length == 1)
      {  //URL provided by user at command line
         url = args[0].toLowerCase();
      }
      if (!url.startsWith("http://"))
      {  //add protocol if missing
         url = "http://" + url;
      }
      // the URL object does the work on getting the Web page
      URL webPage = new URL(url);
      BufferedReader page = new BufferedReader(
             new InputStreamReader( webPage.openStream() ) );
      String html;
      StringBuffer pageBuffer = new StringBuffer();
      while ((html = page.readLine()) != null)
      {  //get each line of HTML from Web site
         pageBuffer.append(html);
      }
      System.out.println(pageBuffer);
      page.close();
   }
}
//returns:
//<html><head><META HTTP-EQUIV="content-type" CONTENT="text/html;
//charset=ISO-8859-1"><title>Google</title><style><!--
//body {font-family: arial,sans-serif;} .q {text-decoration: none; color:
//#0000cc;}//--></style>
//remainder of HTML removed for space

The URL class (from the java.net package) used in Listing 11.1 in turn uses sockets, but you don't have to worry about the details of sockets with this class.
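To give a rough idea of what the URL class spares you, here is a minimal sketch of the same page fetch done directly over a socket. It is an illustration only, not the code the URL class actually runs: the class name RawHttpGet, the helper buildRequest(), and the default host www.example.com are assumptions made for this example, and the request is a bare HTTP/1.0 GET with no redirect, error, or encoding handling.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical example class; not part of the chapter's listings.
public class RawHttpGet
{
   // Builds the text of a minimal HTTP/1.0 GET request.
   // The blank line (final \r\n) tells the server the headers are done.
   static String buildRequest(String host, String path)
   {
      return "GET " + path + " HTTP/1.0\r\n"
           + "Host: " + host + "\r\n"
           + "\r\n";
   }

   public static void main(String[] args) throws IOException
   {
      // assumed default host, used only when none is given
      String host = (args.length == 1) ? args[0] : "www.example.com";

      // port 80 is the standard HTTP port
      try (Socket socket = new Socket(host, 80))
      {
         OutputStream out = socket.getOutputStream();
         out.write(buildRequest(host, "/")
                   .getBytes(StandardCharsets.US_ASCII));
         out.flush();

         BufferedReader in = new BufferedReader(
                new InputStreamReader(socket.getInputStream(),
                                      StandardCharsets.ISO_8859_1));
         String line;
         // prints the status line and headers first, then the HTML
         while ((line = in.readLine()) != null)
         {
            System.out.println(line);
         }
      }
   }
}
```

Unlike Listing 11.1, this version also prints the response headers, because nothing has parsed them away for you. That parsing is part of the work the URL class does on your behalf.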

Listing 11.1 builds a browser. It has no GUI, but at its core it does what Netscape Navigator and Internet Explorer do. These browser applications get a stream of characters from a Uniform Resource Locator (URL) and then process the stream. In the little browser from Listing 11.1, you simply buffer the stream until you have the whole page, and then print it to the screen all at once. IE and Navigator do the same thing, except they perform more processing, such as converting tags into display elements and fetching other resources that the page references (for example, the file named in an image tag).

Don't think Netscape Navigator and Internet Explorer have the browser market cornered. Sure, their interfaces are beautiful, but there is plenty of room to add specialized functionality. For example, with Java you can build a browser that cuts to the chase by eliminating all "garbage" and displaying just the interesting text. You can filter out pictures, scripts, and applets (Navigator and IE allow you to do this), and remove headers, legal warnings, menus, and even silly advertisements. (Navigator and IE can't do all this on their own.) You can programmatically arrange pertinent content in summary form in the order you like and show statistics. You can also compile information across Web sites by gathering interesting material from various pages, within a single Web site or from many sites. You can extract all the links on a page and then surf to them, and in turn go through the links on each of those pages and repeat the process. This is how most search engine robots (such as Google's) work.
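The link-gathering step mentioned above can be sketched with the java.util.regex package. The class name LinkExtractor, the method extractLinks(), and the pattern itself are assumptions made for this illustration; the pattern handles only quoted href values inside anchor tags, which is enough for a demonstration but far short of what a real robot needs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical example class; not part of the chapter's listings.
public class LinkExtractor
{
   // Matches <a ... href="value"> or <a ... href='value'>.
   // Group 1 is the quote character, group 2 is the link target.
   private static final Pattern HREF = Pattern.compile(
          "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
          Pattern.CASE_INSENSITIVE);

   // Returns every href value found in anchor tags, in page order.
   static List<String> extractLinks(String html)
   {
      List<String> links = new ArrayList<String>();
      Matcher matcher = HREF.matcher(html);
      while (matcher.find())
      {
         links.add(matcher.group(2));
      }
      return links;
   }

   public static void main(String[] args)
   {
      // made-up sample page so the example runs without a network
      String html = "<html><body>"
                  + "<a href=\"/onefolder/twofolders/home\">home</a>"
                  + "<A HREF='http://www.example.com/news'>news</A>"
                  + "</body></html>";
      for (String link : extractLinks(html))
      {
         System.out.println(link);
      }
   }
}
```

A robot would feed each extracted link back into a page fetch like the one in Listing 11.1 and call extractLinks() again on the result, keeping track of the pages it has already visited.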

Try adding filtering capability to the basic browser from Listing 11.1 to see how easy it is to customize this browser. The code in Listing 11.2 adds the capability to filter image and body tags. (You can add other tags to filter, if you want to experiment more with this code.) Here, the body and image tags are replaced with a comment tag to demonstrate how you can filter specific HTML tags from a Web page. You can easily expand this capability to remove other tags or to change targeted tags, such as translating HTML into XHTML.

Listing 11.2 Filtering Tags with the URL Class
import java.io.*;
import java.net.*;
import java.util.regex.*;

/*
Body and image tags are replaced with comment tags.
Demonstrates filtering specific HTML tags from a Web page.
Could adapt to translate HTML into XHTML.
*/
public class FilteringBrowser
{
   public static void main(String[] args) throws Exception
   {
      String url = "www.google.com";
      if (args.length == 1)
      {
         url = args[0].toLowerCase();
      }
      String html;
      //use this test HTML file if needed
      if (url.startsWith("test"))
      {
         html = "<!DOCTYPE html "
                + "PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" "
                + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">"
                + "<html>"
                + "<head>"
                + "<title>your sample html</title>"
                + "</head>"
                + "<body bgcolor=\"white\" >"
                + "<p>some text</p>"
                + "<a href=\"/onefolder/twofolders/home\">"
                + "<img src=\"/home/doc/resource/picture.jpg\" "
                + " alt=\"resource pointer\"></a>"
                + "</body>"
                + "</html>";
      } else
      {
         if (!url.startsWith("http://"))
         {
            url = "http://" + url;
         }
         //gets Web page
         URL webPage = new URL(url);
         BufferedReader page = new BufferedReader(
                new InputStreamReader( webPage.openStream() ) );
         String lineHtml = "";
         StringBuffer pageBuffer = new StringBuffer();
         while ((lineHtml = page.readLine()) != null)
         {  //places HTML into buffer for later processing
            pageBuffer.append(lineHtml);
         }
         html = pageBuffer.toString();
         page.close();
      }
      final int imageTag = 23;
      final int bodyTag = 3;
      String replaceImageTagWith = "<!--removed image-->";
      String replaceBodyTagWith = "<!--removed body-->";
      //the page is lowercased so the patterns only need to match
      //lowercase tag names
      Filter filterImages = new Filter(html.toLowerCase());
      //remove all image tags
      filterImages.setPattern(imageTag);
      filterImages.setReplacement(replaceImageTagWith);
      filterImages.remove();
      //remove all body tags
      filterImages.setPattern(bodyTag);
      filterImages.setReplacement(replaceBodyTagWith);
      filterImages.remove();
      String finalHtml = filterImages.getHtml();
      System.out.println(finalHtml);
   }
}

/*
This class acts on a given HTML tag. Here it replaces
a target tag with new one. It could do more, such as
translate HTML into XHTML.
*/
class Filter
{
   private String html;
   private String pattern;
   private String replacement;
   private final int imageTagFlag = 23;
   private final int bodyTagFlag = 3;
   /*
   These text patterns can be as complex as desired.
   You can filter for entire HTML pages or single tags.
   Note the doubled backslashes: \\s in Java source is
   the regex \s (whitespace).
   */
   private final String imageTagPattern =
                   "[<]\\s*img\\s*([^>]*)\\s*[>]";
   private final String bodyTagPattern =
                   "[<]\\s*body\\s*([^>]*)\\s*[>]";

   public Filter(StringBuffer html)
   {
      this.html = html.toString();
   }

   public Filter(String html)
   {
      this.html = html;
   }

   public void setReplacement(String replacement)
   {
      this.replacement = replacement;
   }

   public void setPattern(int removePattern)
   {
      switch (removePattern)
      {
         case(bodyTagFlag):
            this.pattern = this.bodyTagPattern;
            break;
         case(imageTagFlag):
            this.pattern = this.imageTagPattern;
      }
   }

   //replaces every match of the current pattern
   public void remove()
   {
      Pattern newPattern = Pattern.compile(this.pattern);
      Matcher newMatcher = newPattern.matcher(this.html);
      String newHtml = newMatcher.replaceAll(this.replacement);
      this.html = newHtml;
   }

   public String getHtml()
   {
      return this.html;
   }
}
// java  FilteringBrowser test
//returns:
//<!doctype html public "-//w3c//dtd xhtml 1.0 strict//en"
//"http://www.w3.org/tr/xhtml1/dtd/xhtml1-strict.dtd">
//<html><head><title>your sample html</title></head>
//<!--removed body--><p>some text</p>
//<a href="/onefolder/twofolders/home">
//<!--removed image--></a></body></html>

Notice that the Filter.remove() method uses the pattern-matching classes Pattern and Matcher (new in J2SE 1.4), a powerful capability. I could spend several chapters on just the java.util.regex package, but for the purposes of this book, I'll just mention that you can easily expand Listing 11.2 to filter out any standard HTML tag or the text between the start and end tags. You will do more with patterns later in this chapter, but keep in mind that you don't have to discard the text: You can extract it and make it the focus of further processing.
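As a small illustration of keeping the text rather than removing it, the following sketch pulls out whatever sits between a start tag and its matching end tag by using a capturing group. The class name TagText and the method textInside() are invented for this example, and the pattern assumes the tag is not nested inside itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical example class; not part of the chapter's listings.
public class TagText
{
   // Returns the content of every <tag ...>...</tag> pair found
   // in the page. Assumes the tag is not nested inside itself.
   static List<String> textInside(String html, String tag)
   {
      String name = Pattern.quote(tag);
      // group 1 captures everything between the start and end tags;
      // DOTALL lets the content span several lines
      Pattern pattern = Pattern.compile(
             "<" + name + "\\b[^>]*>(.*?)</" + name + "\\s*>",
             Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
      Matcher matcher = pattern.matcher(html);
      List<String> found = new ArrayList<String>();
      while (matcher.find())
      {
         found.add(matcher.group(1).trim());
      }
      return found;
   }

   public static void main(String[] args)
   {
      // the same kind of test page used in Listing 11.2
      String html = "<html><head><title>your sample html</title></head>"
                  + "<body bgcolor=\"white\"><p>some text</p></body></html>";
      System.out.println(textInside(html, "title"));
      System.out.println(textInside(html, "p"));
   }
}
```

The only real difference from Filter.remove() is that the matcher's find() and group() methods are used in place of replaceAll(), so the matched text is collected instead of being overwritten.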

Interrogating a URL

Each Web server has properties about itself and its files that you can interrogate. Before bringing your network to its knees trying to download a patch file, for example, it would be helpful to know that the file you are about to download is 100GB. You can find that out by looking at the file attributes before actually starting the download. Also, you might want to know whether a host has other IP addresses of interest. Listing 11.3 shows what properties you can see with the InetAddress object.
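The file-size check described above is a property of the file rather than of the host, so it is not something InetAddress reports. One way to sketch it is with java.net.URLConnection, asking an HTTP server for the headers only. The class name SizeCheck, the helpers reportedSize() and withinLimit(), the 50MB limit, and the default address are all assumptions made for this illustration. Servers are also free to omit the Content-Length header, in which case the size comes back as -1.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URLConnection;

// Hypothetical example class; not part of the chapter's listings.
public class SizeCheck
{
   // Assumed policy: download only when the size is known (not -1)
   // and no larger than the limit.
   static boolean withinLimit(long reportedBytes, long maxBytes)
   {
      return reportedBytes >= 0 && reportedBytes <= maxBytes;
   }

   // Returns the length reported for the resource, or -1 if unknown.
   static long reportedSize(String address) throws IOException
   {
      URLConnection connection =
             URI.create(address).toURL().openConnection();
      if (connection instanceof HttpURLConnection)
      {
         HttpURLConnection http = (HttpURLConnection) connection;
         //HEAD asks for the headers only; the body is not sent
         http.setRequestMethod("HEAD");
         try
         {
            return http.getContentLengthLong();
         } finally
         {
            http.disconnect();
         }
      }
      return connection.getContentLengthLong();
   }

   public static void main(String[] args) throws IOException
   {
      // assumed default address, used only when none is given
      String address = (args.length == 1)
                     ? args[0] : "http://www.example.com/";
      long limit = 50L * 1024 * 1024;   // arbitrary 50MB limit
      long size = reportedSize(address);
      System.out.println("Reported size: " + size + " bytes");
      if (withinLimit(size, limit))
      {
         System.out.println("OK to download");
      } else
      {
         System.out.println("Skipping: size unknown or over the limit");
      }
   }
}
```

Treating an unknown size as a reason to skip is a deliberately cautious choice for this sketch; a real application might instead start the download and abort once it passes the limit.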

Listing 11.3 Interrogating a URL Using InetAddress
import java.net.*;

public class InetAddressTest
{
   public static void main(String[] args) throws Exception
   {
      String url = "www.microsoft.com";
      if (args.length == 1)
      {  //get URL from user
         url = args[0].toLowerCase();
      }
      if (url.startsWith("http://"))
      {  //InetAddress expects a host name, not a full URL
         url = url.substring(7);
      }
      try
      {
         //local machine
         InetAddressReport urlReport = new InetAddressReport("local");
         urlReport.setName("Local Machine");
         urlReport.print();
         //url machine
         urlReport.setAddress(url);
         urlReport.setName(url);
         urlReport.print();
         //multiple IP machine
         url = "www.microsoft.com";
         InetAddressReport nIPReport = new InetAddressReport(url,true);
         nIPReport.setName(url);
         nIPReport.printN();
      } catch(Exception e)
      {
         System.out.println(e);
      }
   }
}

/*
This class reports the major attributes associated with
a given URL. It uses the InetAddress class to interrogate
the attributes.
*/
class InetAddressReport
{
   private InetAddress inetAddress;
   private InetAddress[] nInetAddress;
   private String name;

   InetAddressReport(InetAddress address)
   {
      this.inetAddress = address;
   }

   InetAddressReport(String address, boolean multipleIP)
   {
      if (multipleIP)
      {
         try
         {
            this.nInetAddress = InetAddress.getAllByName(address);
         } catch(UnknownHostException e)
         {
            System.out.println(e);
         }
      } else
      {
         try
         {
            this.inetAddress = InetAddress.getByName(address);
         } catch(UnknownHostException e)
         {
            System.out.println(e);
         }
      }
   }

   //creates the URL attributes report
   InetAddressReport(String url)
   {
      //compare strings with equals(), not ==
      if ("local".equals(url.toLowerCase()))
      {
         try
         {
            this.inetAddress = InetAddress.getLocalHost();
         } catch(UnknownHostException e)
         {
            System.out.println(e);
         }
      } else
      {
         try
         {
            this.inetAddress = InetAddress.getByName(url);
         } catch(UnknownHostException e)
         {
            System.out.println(e);
         }
      }
   }

   public void setName(String name)
   {
      this.name = name;
   }

   //setters for the Internet address(es)
   public void setAddress(InetAddress address)
   {
      this.inetAddress = address;
   }

   public void setAddress(String url)
   {
      if ("local".equals(url.toLowerCase()))
      {
         try
         {
            this.inetAddress = InetAddress.getLocalHost();
         } catch(UnknownHostException e)
         {
            System.out.println(e);
         }
      } else
      {
         try
         {
            this.inetAddress = InetAddress.getByName(url);
         } catch(UnknownHostException e)
         {
            System.out.println(e);
         }
      }
   }

   //getter for the Internet address(es)
   public InetAddress getAddress()
   {
      return this.inetAddress;
   }

   //this actually prints out all the attributes
   //for the URL provided at the command line.
   public void print()
   {
      boolean flag;
      int i;
      String s;
      System.out.println("For : " + this.name);
      try
      {  //these are all the major attributes
         // or properties of an InetAddress object.
         s = this.inetAddress.getHostAddress();
         System.out.println("IP : " + s);
         s = this.inetAddress.getHostName();
         System.out.println("Machine Name : " + s);
         i = this.inetAddress.hashCode();
         System.out.println("hashCode = " + i);
         s = this.inetAddress.toString();
         System.out.println(s);
         //byte[] b = this.inetAddress.getAddress(); //raw IP bytes, if you need them
         flag = this.inetAddress.isAnyLocalAddress();
         System.out.println("wildcard address=" + flag);
         //link-local address (for IPv4, 169.254.x.x)
         flag = this.inetAddress.isLinkLocalAddress();
         System.out.println("local address=" + flag);
         flag = this.inetAddress.isLoopbackAddress();
         System.out.println("loopback address=" + flag);
         //the isMC... methods report the scope of a multicast address
         flag = this.inetAddress.isMCGlobal();
         System.out.println("global scope=" + flag);
         flag = this.inetAddress.isMCLinkLocal();
         System.out.println("link scope=" + flag);
         flag = this.inetAddress.isMCNodeLocal();
         System.out.println("node scope=" + flag);
         flag = this.inetAddress.isMCOrgLocal();
         System.out.println("organization scope=" + flag);
         flag = this.inetAddress.isMCSiteLocal();
         System.out.println("site scope=" + flag);
         flag = this.inetAddress.isMulticastAddress();
         System.out.println("IP multicast address=" + flag);
         //site-local address (private ranges such as 192.168.x.x)
         flag = this.inetAddress.isSiteLocalAddress();
         System.out.println("local address=" + flag);
         System.out.println();
      } catch(Exception e)
      {
         System.out.println(e);
      }
   }

   //this also prints out the attributes,
   //but for an array of addresses.
   public void printN()
   {
      boolean flag;
      String s;
      System.out.println("For : " + this.name);
      try
      {
         for (int i = 0; i < this.nInetAddress.length; i++)
         {
            s = this.nInetAddress[i].getHostAddress();
            System.out.println("IP : " + s);
            s = this.nInetAddress[i].getHostName();
            System.out.println("Machine Name : " + s);
            flag = this.nInetAddress[i].isMulticastAddress();
            System.out.println("IP multicast = " + flag);
            int hash = this.nInetAddress[i].hashCode();
            System.out.println("hashCode = " + hash);
            s = this.nInetAddress[i].toString();
            System.out.println(s);
            System.out.println();
         }
      } catch(Exception e)
      {
         System.out.println(e);
      }
   }
}
//java  InetAddressTest
//returns:
/*
For : Local Machine
IP : 192.168.0.2
Machine Name : machinename
hashCode = -1062731774
machinename/192.168.0.2
wildcard address=false
local address=false
loopback address=false
global scope=false
link scope=false
node scope=false
organization scope=false
site scope=false
IP multicast address=false
local address=true

For : www.microsoft.com
IP : 207.46.134.222
Machine Name : www.microsoft.com
hashCode = -819034402
www.microsoft.com/207.46.134.222
wildcard address=false
local address=false
loopback address=false
global scope=false
link scope=false
node scope=false
organization scope=false
site scope=false
IP multicast address=false
local address=false

For : www.microsoft.com
IP : 207.46.134.222
Machine Name : www.microsoft.com
IP multicast = false
hashCode = -819034402
www.microsoft.com/207.46.134.222

IP : 207.46.249.222
Machine Name : www.microsoft.com
IP multicast = false
hashCode = -819004962
www.microsoft.com/207.46.249.222

IP : 207.46.134.190
Machine Name : www.microsoft.com
IP multicast = false
hashCode = -819034434
www.microsoft.com/207.46.134.190

IP : 207.46.249.27
Machine Name : www.microsoft.com
IP multicast = false
hashCode = -819005157
www.microsoft.com/207.46.249.27

IP : 207.46.249.190
Machine Name : www.microsoft.com
IP multicast = false
hashCode = -819004994
www.microsoft.com/207.46.249.190
*/

As you can see, ample information about the machine is available before you even get to the files. Listing 11.3 also demonstrates how to overload constructors and methods for cleaner code. It is best to overload like this when the parameter list is the only thing changing. Because InetAddress works on host names rather than on a particular protocol, the same lookup applies to a host whether it serves HTTP, FTP, Gopher, or something else.
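The two "local address" lines in the report come from different tests (link-local and site-local), which is easy to miss. The following sketch makes the address categories explicit by classifying a few literal IP addresses. The class name AddressKinds, the method classify(), and the category labels are invented for this example. Passing a literal IP address to InetAddress.getByName() only parses the text, so the example runs without a network connection.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical example class; not part of the chapter's listings.
public class AddressKinds
{
   // Names the first category that applies to a literal IP address.
   // The labels are this example's own, not anything InetAddress returns.
   static String classify(String literalIp)
   {
      InetAddress address;
      try
      {
         address = InetAddress.getByName(literalIp);
      } catch(UnknownHostException e)
      {
         return "unresolvable";
      }
      if (address.isAnyLocalAddress())  return "wildcard";
      if (address.isLoopbackAddress())  return "loopback";
      if (address.isLinkLocalAddress()) return "link-local";
      if (address.isSiteLocalAddress()) return "site-local";
      if (address.isMulticastAddress()) return "multicast";
      return "other";
   }

   public static void main(String[] args)
   {
      String[] samples = {
         "0.0.0.0",         // wildcard
         "127.0.0.1",       // loopback
         "169.254.10.20",   // link-local
         "192.168.0.2",     // site-local, as in the report above
         "224.0.0.1",       // multicast
         "207.46.134.222"   // none of the special categories
      };
      for (int i = 0; i < samples.length; i++)
      {
         System.out.println(samples[i] + " : " + classify(samples[i]));
      }
   }
}
```

This matches the report from Listing 11.3: 192.168.0.2 is a private, site-local address, which is why only the final "local address" line is true for the local machine.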



Java™ 2 Developer Exam Cram™ 2 (Exam CX-310-252A and CX-310-027)
Year: 2003
Pages: 187
