HTML Parsing


In addition to supporting the hosting of dynamic pages for Web sites and providing an advanced Web services infrastructure, the Resin Server family has a large class library you can use during development of pages and services. One area of support in the class libraries is document parsing. The parsing classes are broken into three areas: DOM XML, SAX XML, and HTML. In this chapter, we will discuss each of the parsing methodologies and show through example code how to use the provided libraries. This information matters because XML is increasingly common, and you are likely to encounter it in professional software development.

XML Parsing

The current standard interface for XML processing in Java is Java API for XML Processing (JAXP). By way of imports, a servlet can use XML functionality while being hosted within a Resin server. To maintain platform independence, all applications that need XML processing should stick to the API provided by JAXP. You can find more information about JAXP at http://java.sun.com/xml/jaxp/index.html.

Before any processing of an XML document can take place, the document must be parsed. There are currently two methods for parsing XML: Document Object Model (DOM) and Simple API for XML (SAX). JAXP provides an API for both methods through classes defined in the package javax.xml.parsers. Within the package are two factories, SAXParserFactory and DocumentBuilderFactory, that produce SAXParser and DocumentBuilder objects for parsing XML documents.

What makes JAXP an interesting API is that the included factory classes for DOM and SAX don't limit you to Sun-provided parsing code; instead, they allow you to plug in third-party parsers. If no third-party parser is configured, the default reference implementations are used. In this chapter you will use both the reference implementation and Resin's own parsers, which are designed for performance.
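To check which implementation the factories resolved to in a given JVM, a small probe like the following can help. It is only a sketch that uses standard JAXP calls; the class name JaxpProbe is made up for illustration, and the names it prints depend on the JVM and classpath in use.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;

// Hypothetical helper (not part of Resin or JAXP): reports which concrete
// factory classes JAXP resolved to in the current JVM.
final class JaxpProbe {

    // Fully qualified name of the DOM factory implementation in use.
    static String domFactoryName() {
        return DocumentBuilderFactory.newInstance().getClass().getName();
    }

    // Fully qualified name of the SAX factory implementation in use.
    static String saxFactoryName() {
        return SAXParserFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        System.out.println("DOM factory: " + domFactoryName());
        System.out.println("SAX factory: " + saxFactoryName());
    }
}
```

Running the probe inside and outside a Resin web application is one way to confirm whether a plug-in parser has actually replaced the default.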

DOM XML Parsing

DOM parsing involves building a node hierarchy representing the elements in the XML document. Listing 10.1 shows about the simplest code for building a DOM object from an XML document on the local hard drive.

Listing 10.1: DOM XML parsing servlet.

    Line 1:   package servlet;

              import java.io.*;
              import javax.servlet.http.*;
              import javax.servlet.*;
              import javax.xml.parsers.*;
              import org.w3c.dom.*;
              import com.caucho.xml.*;

    Line 10:  public class Parsing extends HttpServlet {
                public void doGet(HttpServletRequest request,
                  HttpServletResponse response)
                  throws IOException, ServletException {

                  try {
                    DocumentBuilderFactory factory =
                      DocumentBuilderFactory.newInstance();
                    DocumentBuilder parser = factory.newDocumentBuilder();

    Line 20:        Document doc = parser.parse("test.xml");

                    PrintWriter pw = response.getWriter();
                    FileOutputStream os = new FileOutputStream("out.xml");

                    XmlPrinter printer = new XmlPrinter(pw);   // to the browser
                    XmlPrinter oprinter = new XmlPrinter(os);  // to out.xml

                    printer.printHtml(doc);
                    oprinter.print(doc);
    Line 30:        pw.close();
                    os.close();
                  } catch(Exception e) {
                    e.printStackTrace();
                  }
                }
              }

Some setup steps are involved in executing the code in Listing 10.1. First, add the line

    <web-app id='parsing'/> 

to the resin.conf file. Second, under the /doc directory of the Resin server installation, create a directory structure that looks like Figure 10.1.

Figure 10.1: Directory structure for the DOM XML parsing servlet.

The web.xml file in the WEB-INF directory looks like this:

    <web-app>
      <servlet-mapping>
        <url-pattern>/servlet/*</url-pattern>
        <servlet-name>servlet/Parsing</servlet-name>
      </servlet-mapping>
    </web-app>

In the main Resin server directory, create a file called test.xml that contains a sampling of XML. Execute the servlet by browsing to the following location:

  • http://localhost:8080/parsing/servlet/Parsing

Once you compile and execute the servlet, two outputs are created: a file called out.xml in the main Resin server directory, and the XML from the test.xml file, which is sent to the Web browser. Figure 10.2 shows the source sent to the browser.

Figure 10.2: Parsing output.

Let's take a moment and look at the code. It starts with a number of imports necessary to support three primary functions: I/O, servlet, and XML. For XML, three imports are shown in the code: the first references the parser classes, the second references the DOM classes necessary for building a DOM object, and the last is used for a Resin-specific print class.

Lines 10 through 13 define the doGet() servlet method for the application. When the servlet is activated through a call from a browser, a DOM factory object is obtained in lines 16 and 17 through a call to the DocumentBuilderFactory class's newInstance() method. From the factory instance, a parser object is instantiated with a call to newDocumentBuilder().

In line 20, the parser's parse() method is called with the path to an XML document supplied as the single parameter. The parse() method attempts to build a new Document object based on the XML found in the supplied file. If the method is successful, the new DOM object is assigned to the doc variable.

In a production-level application, the DOM object would be used to process the imported XML document as needed by the business logic of the problem being solved. In this case, you return the XML document to the browser that made the call to the servlet, and output a new file with the DOM object contents.
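As a rough idea of what such processing might look like, the following sketch walks the root element's children and collects their tag names. It uses only the standard JAXP and DOM interfaces; the class name DomWalkSketch and the sample document are invented for illustration, and the XML is parsed from a string rather than from test.xml so the example stands alone.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical example class; not part of the servlet in Listing 10.1.
final class DomWalkSketch {

    // Parses XML text into a DOM tree and returns the tag names of the
    // root element's direct child elements, in document order.
    static List<String> childElementNames(String xml) {
        try {
            DocumentBuilder parser =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = parser.parse(new InputSource(new StringReader(xml)));

            List<String> names = new ArrayList<>();
            NodeList children = doc.getDocumentElement().getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                // Skip text, comment, and other non-element nodes.
                if (child.getNodeType() == Node.ELEMENT_NODE) {
                    names.add(((Element) child).getTagName());
                }
            }
            return names;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("DOM parsing failed", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample document standing in for test.xml.
        String xml = "<order><item/><item/><total>3</total></order>";
        System.out.println(childElementNames(xml)); // prints [item, item, total]
    }
}
```

Because the whole tree is in memory, the code can visit nodes in any order and as often as it likes, which is the main convenience of DOM.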

Line 22 retrieves a PrintWriter object from the response object; it is used to send text back to the browser. Line 23 creates a FileOutputStream for a server-side file called out.xml. Lines 25 and 26 create separate XmlPrinter objects with parameters based on the response PrintWriter and the FileOutputStream, respectively.

The XmlPrinter class is provided by Caucho to enable easy printing of XML documents. As shown in lines 28 and 29, the class includes methods for several kinds of output; printHtml() and print() are the two used here. The full class structure is outlined in the Javadoc pages at www.caucho.com/resin/javadoc/com/caucho/xml/XmlPrinter.html.

Once the printing has been accomplished, the FileOutputStream and PrintWriter objects are closed and the application ends. Although this is a simple example of parsing, it shows the basic operation.
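If you need to write a DOM object out without the Resin-specific XmlPrinter class, one portable alternative is an identity transform from the javax.xml.transform package. This is a different technique from the one in Listing 10.1, shown here only as a sketch; the class name DomSerializeSketch is invented, and the exact formatting of the output can vary between transformer implementations.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical example class: serializes a DOM tree with standard JAXP only.
final class DomSerializeSketch {

    // Parses the XML text into a DOM tree, then writes the tree back out
    // as text using a transformer that applies no stylesheet.
    static String roundTrip(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

            // newTransformer() with no arguments yields an identity transform.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            StringWriter out = new StringWriter();
            identity.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (ParserConfigurationException | SAXException
                 | IOException | TransformerException e) {
            throw new IllegalStateException("Serialization failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("<a><b>hi</b></a>"));
    }
}
```

In a servlet, the StreamResult could wrap the response's PrintWriter or a FileOutputStream instead of a StringWriter.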

SAX XML Parsing

SAX parsing can also be performed with the JAXP API and either the reference implementation provided with JAXP or a parser provided with Resin. Listing 10.2 shows a simple servlet that uses SAX for XML document parsing.

Listing 10.2: Simple SAX XML parsing.

    Line 1:   package servlet;

              import java.io.*;
              import javax.servlet.http.*;
              import javax.servlet.*;
              import org.xml.sax.*;
              import org.xml.sax.helpers.DefaultHandler;
              import com.caucho.xml.*;

    Line 10:  public class SAXParsing extends HttpServlet {

                private class MyHandler extends DefaultHandler {
                  public void startElement(String namespaceURI,
                    String localName,
                    String qName, Attributes atts) {
                    //Do something with element start
                  }

                  public void endElement(String namespaceURI, String localName,
    Line 20:                             String qName) {
                    //Do something at element end
                  }
                }

                public void doGet(HttpServletRequest request,
                  HttpServletResponse response)
                  throws IOException, ServletException {

                  try {
    Line 30:        Xml xml = new Xml();

                    xml.setContentHandler(new MyHandler());

                    xml.parse("test.xml");
                  } catch(Exception e) {
                    e.printStackTrace();
                  }
                }
              }

The code for SAX parsing is quite different from the DOM code. With DOM, the entire XML document is held in the application's memory as a tree. If the XML document is large, the corresponding DOM object will also be large, and the time needed to build it grows with the size of the document. With SAX, on the other hand, no tree is built: the parser calls content handler methods as it reaches specific points in the document, such as the start and end of an element.

Lines 10 through 23 define a private class that extends the DefaultHandler class. This class overrides the content handler methods for SAX processing; the example overrides startElement and endElement so the system can process specific elements. Next comes the code for the servlet's doGet() method. Line 30 instantiates a new SAX parser, in this case the Xml class from the com.caucho.xml package. Line 32 associates the private content handler class with the parser. Finally, line 34 performs the actual parsing of the XML document. As each element in the XML document is encountered, the appropriate content handler methods are executed. In this case, every handler method other than startElement and endElement is the empty default inherited from DefaultHandler.
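The same handler pattern works with the JAXP factory instead of the Resin Xml class. The sketch below fills in startElement with something concrete, a tally of element names, and obtains the parser from SAXParserFactory. The class names SaxCountSketch and CountingHandler are invented for illustration, and the XML comes from a string so the example stands alone.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: counts elements with a SAX content handler.
final class SaxCountSketch {

    // Content handler that tallies how often each element name starts.
    // All other callbacks keep the empty defaults from DefaultHandler.
    static final class CountingHandler extends DefaultHandler {
        final Map<String, Integer> counts = new LinkedHashMap<>();

        @Override
        public void startElement(String namespaceURI, String localName,
                                 String qName, Attributes atts) {
            counts.merge(qName, 1, Integer::sum);
        }
    }

    // Parses the XML text and returns element counts in first-seen order.
    static Map<String, Integer> countElements(String xml) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            CountingHandler handler = new CountingHandler();
            parser.parse(new InputSource(new StringReader(xml)), handler);
            return handler.counts;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("SAX parsing failed", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample document standing in for test.xml.
        String xml = "<order><item/><item/><total>3</total></order>";
        System.out.println(countElements(xml)); // prints {order=1, item=2, total=1}
    }
}
```

Only the running tally is kept in memory, not the document itself, which is why this style suits large inputs.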

Changing Parsers

If you aren't happy with the functionality found in the reference DOM and SAX parsers, you can swap them with other third-party parsers. The Resin server provides an easy way to configure which parser is used for either SAX or DOM parsing. Once the parsers are swapped, you don't need to do anything except restart the application—as long as you use the API provided in the JAXP package, you don't have to change any code. To specify which parser the factory classes should use, set the following properties in the resin.conf file:

    <web-app>
      <system-property javax.xml.parsers.DocumentBuilderFactory="dom" />
      <system-property javax.xml.parsers.SAXParserFactory="sax" />
      <system-property javax.xml.transform.TransformerFactory="trans" />
    </web-app>

As you can see, you place the parser and XSL system properties in the <web-app> portion of the configuration file.

Although we don't discuss the transformation aspect of JAXP in this chapter, note that Resin includes its own XSL processor, which can replace the reference implementation used in JAXP. The dom, sax, and trans values are placeholders for the fully qualified name of the class to use. To use the implementations provided by Resin, replace the values as follows:

    dom = com.caucho.xml.parsers.XmlDocumentBuilderFactory
    sax = com.caucho.xml.parsers.XmlSAXParserFactory
    trans = com.caucho.xsl.Xsl

The Resin documentation cites two other possible replacements, which we list here for completeness. For Xalan/Xerces, use the following:

    dom = org.apache.xerces.jaxp.DocumentBuilderFactoryImpl
    sax = org.apache.xerces.jaxp.SAXParserFactoryImpl
    trans = org.apache.xalan.processor.TransformerFactoryImpl

For the parsers and transformer provided with Java 1.4, use these values:

    dom = org.apache.crimson.jaxp.DocumentBuilderFactoryImpl
    sax = org.apache.crimson.jaxp.SAXParserFactoryImpl
    trans = org.apache.xalan.processor.TransformerFactoryImpl
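The JAXP factories consult a JVM system property with the same name as the factory class before falling back to a default, which is why no code changes are needed when parsers are swapped. The following sketch shows that lookup from plain Java, on the assumption that the configuration entries above end up as ordinary system properties. The class name FactoryPropertySketch is invented, and to stay runnable anywhere the example simply re-selects whatever implementation is already the default.

```java
import javax.xml.parsers.DocumentBuilderFactory;

// Hypothetical example class: shows the system-property lookup that
// DocumentBuilderFactory.newInstance() performs.
final class FactoryPropertySketch {

    static final String DOM_KEY = "javax.xml.parsers.DocumentBuilderFactory";

    // Temporarily sets the DOM factory property to className, asks JAXP for
    // a factory, and returns the class name JAXP actually instantiated.
    // If className cannot be loaded, newInstance() throws
    // FactoryConfigurationError.
    static String resolveWith(String className) {
        String previous = System.getProperty(DOM_KEY);
        System.setProperty(DOM_KEY, className);
        try {
            return DocumentBuilderFactory.newInstance().getClass().getName();
        } finally {
            // Restore the earlier setting so other code is unaffected.
            if (previous == null) {
                System.clearProperty(DOM_KEY);
            } else {
                System.setProperty(DOM_KEY, previous);
            }
        }
    }

    public static void main(String[] args) {
        // Use the current default as the "replacement" so the class is
        // guaranteed to exist; in practice this would be one of the
        // values listed above, with the matching JAR on the classpath.
        String current = DocumentBuilderFactory.newInstance().getClass().getName();
        System.out.println("Requested: " + current);
        System.out.println("Resolved:  " + resolveWith(current));
    }
}
```

The same pattern applies to the javax.xml.parsers.SAXParserFactory and javax.xml.transform.TransformerFactory keys.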




Mastering Resin
ISBN: 0471431036
Year: 2002
Pages: 180
