Normalization | Processing XML with Java™: A Guide to SAX, DOM, JDOM, JAXP, and TrAX

Implementations have quite a bit of leeway in exactly how they parse and serialize any given document. For example, a parser may represent CDATA sections as CDATASection objects, or it may merge them into neighboring Text objects. A parser may include entity reference nodes in the tree, or it may instead include the nodes corresponding to each entity's replacement text. A parser may include comments, or it may ignore them. DOM3 adds four methods to the Document interface to control exactly how a parser makes these choices:

 public void normalizeDocument()
 public boolean canSetNormalizationFeature(String name, boolean state)
 public void setNormalizationFeature(String name, boolean state)
 public boolean getNormalizationFeature(String name)

The canSetNormalizationFeature() method tests whether the implementation supports the desired value (true or false) for the named feature. The setNormalizationFeature() method sets the value of the named feature. It throws a DOMException with the error code NOT_FOUND_ERR if the implementation does not support the feature at all. It throws a DOMException with the error code NOT_SUPPORTED_ERR if the implementation does not support the requested value for the feature (for example, if you try to set to true a feature that must have the value false). The getNormalizationFeature() method returns the current value of the named feature. Finally, after all of the features have been set, client code can invoke the normalizeDocument() method to modify the tree in accordance with the current values of all the features.
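The draft method names above are not part of the org.w3c.dom package bundled with the JDK. The finished DOM Level 3 Core recommendation keeps the same test-then-set pattern but moves it onto a DOMConfiguration object returned by Document.getDomConfig(), with canSetParameter(), setParameter(), and getParameter() taking the place of the three feature methods. What follows is a minimal sketch of the pattern under that assumption. The class name, the trySet() and parse() helpers, and the sample document are invented for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.DOMConfiguration;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class FeatureCheckSketch {

    // Hypothetical helper: asks the implementation whether it accepts the
    // requested value before setting it, so no DOMException is triggered.
    // Returns true if the value was set, false if it was declined.
    static boolean trySet(Document doc, String name, boolean state) {
        DOMConfiguration config = doc.getDomConfig();
        Boolean value = Boolean.valueOf(state);
        if (!config.canSetParameter(name, value)) {
            return false;
        }
        config.setParameter(name, value);
        return true;
    }

    // Hypothetical helper: parses a string with the JAXP default parser.
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample document", e);
        }
    }

    public static void main(String[] args) {
        // Sample document (an assumption): one text node and one comment.
        Document doc = parse("<root>text<!-- note --></root>");
        boolean accepted = trySet(doc, "comments", false);
        System.out.println("comments=false accepted: " + accepted);
        if (accepted) {
            // Apply the configured parameters to the existing tree.
            doc.normalizeDocument();
        }
        System.out.println("children of root: "
            + doc.getDocumentElement().getChildNodes().getLength());
    }
}
```

Checking first with canSetParameter() keeps the normal path free of exception handling. Code that prefers to set the value unconditionally still has to catch DOMException.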

Caution

These are very bleeding-edge ideas from the latest DOM3 Core Working Draft. Xerces 2.0.2 is the only parser that supports any of this so far.

The DOM3 specification defines 13 standard features:

normalize-characters, optional, default false

If true, document text should be normalized according to the W3C Character Model. For example, the word resumé would be represented as the six-character string r e s u m é rather than the seven-character string r e s u m e followed by a combining acute accent. Implementations are only required to support a false value for this feature.

split-cdata-sections, required, default true

If true, CDATA sections containing the CDATA section end delimiter ]]> are split into pieces, and the ]]> is included in a raw text node. If false, such a CDATA section is not split.

entities, optional, default true

If false, entity reference nodes are replaced by their children. If true, they're not.

whitespace-in-element-content, optional, default true

If true, all white space is retained. If false, text nodes containing only white space are deleted if the parent element's declaration from the DTD/schema does not allow #PCDATA to appear at that point.

discard-default-content, required, default true

If true, the implementation throws away any nodes whose presence can be inferred from the DTD or schema; for example, default attribute values.

canonical-form, optional, default false

If true, the document is arranged according to the rules of the canonical XML specification, at least within the limits of what can be represented in a DOM implementation. For example, EntityReference nodes are replaced by their content, and CDATASection objects are converted to Text objects. However, there's no way in DOM to control everything that the canonical XML specification requires. For instance, a DOM Element does not know the order of its attributes or whether an empty element will be written as a single empty-element tag or start-tag/end-tag pair. Thus, full canonicalization has to be deferred to serialization time.

namespace-declarations, optional, default true

If false, then all Attr nodes representing namespace declarations are deleted from the tree. Otherwise they're retained. This has no effect on the namespaces associated with individual elements and attributes.

validate, optional, default false

If true, then the document's schema or DTD is used to validate the document as it is being normalized. Any validation errors that are discovered are reported to the registered error handler. (Both validation and error handlers are new features in DOM3.)

validate-if-schema, optional, default false

If true and the validate feature is also true, then the document is validated if and only if it has some kind of schema (for example, a DTD, a W3C XML Schema Language schema, or a RELAX NG schema).

datatype-normalization, required, default false

If true, datatype normalization is performed according to the schema. For example, an element declared to have type xsd:boolean and represented as <state>1</state> could be changed to <state>true</state>.

cdata-sections, optional, default true

If false, all CDATASection objects are changed into Text objects and merged with any adjacent Text objects. If true, each CDATA section is represented as its own CDATASection object.

comments, required, default true

If true, comments are included in the Document; if false, they're not.

infoset, optional

If true, the Document only contains information provided by the XML Infoset. This is the same as setting namespace-declarations, validate-if-schema, entities, and cdata-sections to false; and datatype-normalization, whitespace-in-element-content, and comments to true.

In addition, vendors are allowed to define their own nonstandard features. Feature names must be XML 1.0 names and should use a vendor-specific prefix such as apache: or oracle:.
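Because the set of recognized features varies by implementation, it can help to ask the implementation what it knows about. The sketch below assumes the DOMConfiguration interface from the finished DOM Level 3 Core recommendation, as found in the JDK's org.w3c.dom package, rather than the draft methods described in this section. The class and helper names are invented for illustration. It collects every boolean-valued parameter the implementation reports, along with its current value.

```java
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMConfiguration;
import org.w3c.dom.DOMException;
import org.w3c.dom.DOMStringList;
import org.w3c.dom.Document;

class ListFeaturesSketch {

    // Hypothetical helper: returns the boolean-valued parameters the
    // implementation reports for this document, sorted by name.
    static Map<String, Boolean> booleanParameters(Document doc) {
        DOMConfiguration config = doc.getDomConfig();
        DOMStringList names = config.getParameterNames();
        Map<String, Boolean> result = new TreeMap<>();
        for (int i = 0; i < names.getLength(); i++) {
            String name = names.item(i);
            Object value;
            try {
                value = config.getParameter(name);
            } catch (DOMException e) {
                continue; // listed but not readable; skip it
            }
            // Non-boolean parameters (an error handler, for example) are ignored.
            if (value instanceof Boolean) {
                result.put(name, (Boolean) value);
            }
        }
        return result;
    }

    // Hypothetical helper: creates an empty document with the JAXP default parser.
    static Document emptyDocument() {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No DOM implementation available", e);
        }
    }

    public static void main(String[] args) {
        Map<String, Boolean> features = booleanParameters(emptyDocument());
        for (Map.Entry<String, Boolean> entry : features.entrySet()) {
            System.out.println(entry.getKey() + " = " + entry.getValue());
        }
    }
}
```

The names and values printed depend on the DOM implementation in use, so the output is best treated as a diagnostic, not as a fixed list.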

For an example of how these could be useful, consider the SOAP servlet in Example 10.14. It needed to locate the calculateFibonacci element in the request document and extract its full text content. This had to work even if that element contained comments and CDATA sections. The getFullText() method that accomplished this wasn't too hard to write. Nonetheless, in DOM3 it's even easier. Set the cdata-sections and comments features to false and call normalizeDocument() as soon as the document is parsed. Once this is done, the calculateFibonacci element contains only one text-node child.

 try {
     Document request = parser.parse(in);
     request.setNormalizationFeature("cdata-sections", false);
     request.setNormalizationFeature("comments", false);
     request.normalizeDocument();
     NodeList ints = request.getElementsByTagNameNS(
         "http://namespaces.cafeconleche.org/xmljava/ch3/",
         "calculateFibonacci");
     Node calculateFibonacci = ints.item(0);
     Node text = calculateFibonacci.getFirstChild();
     String generations = text.getNodeValue();
     // ...
 } catch (DOMException e) {
     // The cdata-sections feature is true by default and
     // parsers aren't required to support a false value for it, so
     // you should be prepared to fall back on manual normalization
     // if necessary. The comments feature, however, is required.
 }
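The fragment above uses the working-draft method names, which the org.w3c.dom package in the JDK does not provide. A self-contained version can be sketched against the DOMConfiguration interface of the finished DOM Level 3 Core recommendation. In this sketch the sample request string, the class name, and the helper names are assumptions, and the fallback to Node.getTextContent() is a substitute for the manual normalization the catch block mentions.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.DOMConfiguration;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class FibonacciRequestSketch {

    static final String NS = "http://namespaces.cafeconleche.org/xmljava/ch3/";

    // Hypothetical helper: parses a string with the JAXP default parser.
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for getElementsByTagNameNS
            return factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse request", e);
        }
    }

    // Returns the text of the first calculateFibonacci element after
    // comments are dropped and CDATA sections are merged into text.
    static String generations(String xml) {
        Document request = parse(xml);
        NodeList ints = request.getElementsByTagNameNS(NS, "calculateFibonacci");
        Node calculateFibonacci = ints.item(0);
        try {
            DOMConfiguration config = request.getDomConfig();
            config.setParameter("cdata-sections", Boolean.FALSE);
            config.setParameter("comments", Boolean.FALSE);
            request.normalizeDocument();
        } catch (DOMException e) {
            // The implementation declined a value. getTextContent() also
            // skips comments and includes CDATA text, so use it instead.
            return calculateFibonacci.getTextContent();
        }
        // After normalization a single text node child is expected.
        Node text = calculateFibonacci.getFirstChild();
        return text.getNodeValue();
    }

    public static void main(String[] args) {
        // Sample request body (an assumption): text, a comment, and a CDATA section.
        String xml = "<f:calculateFibonacci xmlns:f=\"" + NS + "\">"
            + "1<!-- split here --><![CDATA[0]]></f:calculateFibonacci>";
        System.out.println(generations(xml));
    }
}
```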

This wouldn't work for the XML-RPC case, however, because XML-RPC documents can contain processing instructions, and there's no feature to turn off processing instructions.
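Where processing instructions do have to go, they can be removed by hand before the text is read. The following is a minimal sketch of one way to do that with a recursive tree walk followed by Node.normalize(), which merges the text nodes that become adjacent once the processing instructions are gone. The class name, method names, and sample document are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class StripProcessingInstructionsSketch {

    // Removes every processing instruction at or below the given node.
    static void removeProcessingInstructions(Node node) {
        Node child = node.getFirstChild();
        while (child != null) {
            // Remember the sibling first; removal detaches the child.
            Node next = child.getNextSibling();
            if (child.getNodeType() == Node.PROCESSING_INSTRUCTION_NODE) {
                node.removeChild(child);
            } else {
                removeProcessingInstructions(child);
            }
            child = next;
        }
    }

    // Hypothetical helper: parses the string, strips processing
    // instructions, and returns the cleaned document.
    static Document parseAndStrip(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample document", e);
        }
        removeProcessingInstructions(doc);
        doc.normalize(); // merge text nodes that are now adjacent
        return doc;
    }

    public static void main(String[] args) {
        // Sample document (an assumption): a processing instruction splits the text.
        Document doc = parseAndStrip("<value>4<?target data?>2</value>");
        Node value = doc.getDocumentElement();
        System.out.println("children: " + value.getChildNodes().getLength());
        System.out.println("text: " + value.getFirstChild().getNodeValue());
    }
}
```

Node.normalize() merges only Text nodes, so a document that also contains comments or CDATA sections still needs those handled separately.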