What Does an XML Parser Do? | XML Programming Bible

Parsing a language means taking a piece of code or data written in that language and breaking it into its component parts, as defined by the rules of that language. XML parsers are classified along two independent dimensions: validating vs. non-validating, and stream-based vs. tree-based.

Validating and Non-Validating Parsers

A validating parser can use a DTD or schema to verify that a document is properly constructed according to the rules of the XML application it's an instance of, and it is supposed to complain loudly if the rules aren't followed. A DTD can also specify default values for the attributes of various elements, and a validating parser can fill those values in when it encounters elements that omit the attributes. Validation can be important when you're processing XML documents you've received from the outside world. For example, if vendors send XML-marked invoices to your company, you'll want to ensure that they contain the right elements in the right order.

A non-validating parser requires only that the document be well-formed. Because of the design of XML, it's possible to parse well-formed documents without referring to a DTD or an XML Schema (XSD). Well-formedness was discussed in Chapter 2; the full rules can be found in the XML 1.0 Recommendation (http://www.w3.org/TR/2000/REC-xml-20001006).
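
As a minimal sketch of what "well-formed only" means in practice, the following Python example uses the standard library's xml.etree.ElementTree module, which is built on a non-validating parser. The helper name is_well_formed and the sample invoice markup are assumptions made for illustration; they are not taken from this chapter.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if text parses as well-formed XML, False otherwise."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        # The parser rejects anything that breaks the well-formedness rules,
        # such as mismatched or unclosed tags.
        return False
    return True


# Hypothetical sample documents.
good = "<invoice><item>Widget</item><total>10.00</total></invoice>"
bad = "<invoice><item>Widget</invoice>"  # <item> is never closed
odd_order = "<invoice><total>10.00</total><item>Widget</item></invoice>"

print(is_well_formed(good))
print(is_well_formed(bad))
print(is_well_formed(odd_order))
```

The third document is accepted even though its elements appear in an unexpected order. A non-validating parser has no notion of "the right elements in the right order"; checking that would require a validating parser with a DTD or schema, or application logic.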

Non-validating parsers are simpler, and many of the free parsers available over the Web are non-validating. They are usually adequate for processing XML documents generated within the same organization or documents whose validity constraints are so complex that they can't be expressed by a DTD and need to be verified by application logic instead.

Stream-Based and Tree-Based Parsers

A parser can make the components of an XML document known to an application in two ways. It can read through the document and signal the application every time a new component appears, or it can read the entire document and give the application a tree structure corresponding to the element structure of the document. Parsers that use the first method are called stream-based or event-driven parsers. Parsers that use the second method are tree-based parsers. Both methods will be discussed in greater detail later.

You'll hear two common terms regarding these methods: the Simple API for XML (SAX) and the Document Object Model (DOM). SAX is a standard, developed informally by members of the xml-dev mailing list, for how a stream-based parser should "talk" to an application (see http://www.megginson.com/SAX/index.html). The DOM is a formal Recommendation of the W3C on how an application can access and manipulate the tree structure of a document (see, for example, the first edition, known as Level 1, at http://www.w3.org/TR/1998/REC-DOM-Level-1-19981001).
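
The difference between the two approaches can be sketched with Python's standard library, which ships a SAX implementation (xml.sax) and a small DOM implementation (xml.dom.minidom). The sample document, the EventLogger class, and the two helper functions are hypothetical names chosen for this illustration.

```python
import xml.sax
import xml.dom.minidom

# Hypothetical sample document.
DOC = "<invoice><item>Widget</item><item>Gadget</item></invoice>"


class EventLogger(xml.sax.ContentHandler):
    """Record each event the stream-based parser signals."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def characters(self, content):
        # A parser may deliver text in several chunks; skip whitespace-only ones.
        if content.strip():
            self.events.append(("text", content))

    def endElement(self, name):
        self.events.append(("end", name))


def stream_events(text):
    """Stream-based: the parser calls back into the handler as it reads."""
    handler = EventLogger()
    xml.sax.parseString(text.encode("utf-8"), handler)
    return handler.events


def tree_item_names(text):
    """Tree-based: the whole document is parsed first, then queried."""
    dom = xml.dom.minidom.parseString(text)
    return [node.firstChild.data for node in dom.getElementsByTagName("item")]


for event in stream_events(DOC):
    print(event)
print(tree_item_names(DOC))
```

With the stream-based version, the application sees each component once, in document order, and must keep any state it needs itself. With the tree-based version, the application can navigate the finished tree in any order, at the cost of holding the entire document in memory.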