5.1 XML Authoring | The O'Reilly Java Authors - Java Enterprise Best Practices


Today's Java programmers frequently see a wealth of tips and tricks on the Internet to improve their use of APIs such as SAX, DOM, and JAXP. While I'll address each of these in turn in the second part of this chapter, all the good coding in the world won't make up for poor document authoring. In this section I'll present a few ideas that can make your documents cleaner and less error-prone.

5.1.1 Use Entity References

An entity reference (which points to a value set up by an entity declaration) is one of those topics in XML that seems a little obscure. However, just think of an entity reference as a variable in XML. That variable has a declared value, and every time the variable occurs, the parser substitutes that value into the parsed document. In that regard, an entity reference is like a static final variable in Java in that it cannot alter its value from an initial value defined in a Document Type Definition (DTD).

An entity reference often refers to an online resource (you'll see examples of this later in Section 5.2), but it can also have a value defined in a DTD, such as the following:

 <!ENTITY phoneNumber "800-775-7731">

Instead of typing the phone number for O'Reilly several times in your XML document, and possibly introducing typographical errors, you can just refer to the value through its reference:

 <content>O'Reilly's phone number is &phoneNumber;.</content>

Of course, this seems pretty trivial, so let's look at a more realistic example. Example 5-1 shows a simple XML document fragment intended for display on a web page.

Example 5-1. Sample document without entity references

 <page>
   <title>O'Reilly Java Enterprise Best Practices</title>
   <content type="html">
     <center><h1>O'Reilly Java Enterprise Best Practices</h1></center>
     <p>
       Welcome to the website for <i>O'Reilly Java Enterprise Best
       Practices</i>. This book was written by O'Reilly's Java
       authors for Java Enterprise professionals. And so on and
       so on, ad infinitum.
     </p>
   </content>
 </page>

Notice that the title "O'Reilly Java Enterprise Best Practices" was repeated three times. Not only does this introduce room for error, but it also makes it a pain to change all occurrences. After all, there might be 10 or 15 more instances of this title in this and related documents in the future! Criteria such as these make the title a good candidate for an entity reference. First, add the following definition in your DTD:

 <!ENTITY bookTitle "O'Reilly Java Enterprise Best Practices">

Then, change the XML to look like this:

 <page>
   <title>&bookTitle;</title>
   <content type="html">
     <center><h1>&bookTitle;</h1></center>
     <p>
       Welcome to the website for <i>&bookTitle;</i>. This book was
       written by O'Reilly's Java authors for Java Enterprise
       professionals. And so on and so on, ad infinitum.
     </p>
   </content>
 </page>

Now, by simply changing the entity reference's value, you can change all references in the XML document to the new value.
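To see the substitution from the Java side, here is a minimal sketch. It assumes the entity is declared in an internal DTD subset so that the example stays self-contained; the class name EntityReferenceDemo and the trimmed-down page markup are illustrative and not part of the original example. It relies on the standard JAXP DocumentBuilder, which expands internal entities by default.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Illustrative class: parses a page whose title is supplied by an entity.
class EntityReferenceDemo {

    // Assumption: the entity lives in an internal DTD subset so that no
    // external DTD file is needed to run the sketch.
    static final String XML =
          "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE page [\n"
        + "  <!ENTITY bookTitle \"O'Reilly Java Enterprise Best Practices\">\n"
        + "]>\n"
        + "<page>\n"
        + "  <title>&bookTitle;</title>\n"
        + "  <content>Welcome to the website for &bookTitle;.</content>\n"
        + "</page>";

    // Returns the text of the first <title> element after entity expansion.
    static String titleText(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getElementsByTagName("title").item(0).getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(titleText(XML));
    }
}
```

Changing the single ENTITY declaration changes what every &bookTitle; reference resolves to, without touching the element content.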

5.1.2 Use Parameter Entities

The natural extension of using entity references in an XML document is using parameter entities in a DTD. A parameter entity looks very much like an entity reference:

 <!ENTITY % common.attributes
   'id        ID    #IMPLIED
    account   CDATA #REQUIRED'
 >

Here, a stricter textual replacement occurs. In this case, the common.attributes definition can be used to specify two attributes that most elements in a constraint set should have. You can then define those elements in the DTD, as shown in Example 5-2.

Example 5-2. Sample DTD with parameter entities

 <!ELEMENT purchaseOrder (item+, manufacturer, purchaser, purchaseInfo)>
 <!ATTLIST purchaseOrder %common.attributes;>

 <!ELEMENT item (price, quantity)>
 <!ATTLIST item %common.attributes;>

 <!ELEMENT manufacturer (#PCDATA)>
 <!ATTLIST manufacturer %common.attributes;>

 <!ELEMENT purchaser (#PCDATA)>
 <!ATTLIST purchaser %common.attributes;>

 <!ELEMENT purchaseInfo (creditCard | check | cash)>

In Example 5-2, each element uses the common.attributes parameter entity, which will be converted into the string in the example (including the id and account attributes). This is done for each attribute list. And, like entity references, changing the value of the parameter entity changes the definitions for all elements. Again, this technique can be used to clean up the organization of your DTDs.
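One way to confirm that the parameter entity really expands into both attribute declarations is to validate a document against the DTD. The following is a rough sketch under several assumptions: the DTD is cut down to the manufacturer element, it is served from memory through an EntityResolver under the hypothetical system ID order.dtd, and the class name ParameterEntityDemo is made up for illustration. The DTD is kept external because XML only allows parameter-entity references inside a declaration in the external subset.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

// Illustrative class: validates documents against a DTD that uses a parameter entity.
class ParameterEntityDemo {

    // Cut-down version of the DTD in Example 5-2 (assumption: one element only).
    static final String DTD =
          "<!ENTITY % common.attributes\n"
        + "  'id      ID    #IMPLIED\n"
        + "   account CDATA #REQUIRED'>\n"
        + "<!ELEMENT manufacturer (#PCDATA)>\n"
        + "<!ATTLIST manufacturer %common.attributes;>\n";

    // "order.dtd" is a hypothetical system ID; the resolver below supplies its content.
    static final String VALID =
          "<!DOCTYPE manufacturer SYSTEM \"order.dtd\">"
        + "<manufacturer account=\"A-100\">Acme</manufacturer>";

    static final String INVALID =
          "<!DOCTYPE manufacturer SYSTEM \"order.dtd\">"
        + "<manufacturer>Acme</manufacturer>";

    // Returns the validation error messages reported for the given document.
    static List<String> validate(String xml) {
        List<String> errors = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // enable DTD validation
            DocumentBuilder builder = factory.newDocumentBuilder();
            // Serve the DTD from memory instead of the file system.
            builder.setEntityResolver(
                (publicId, systemId) -> new InputSource(new StringReader(DTD)));
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { /* ignored in this sketch */ }
                public void error(SAXParseException e) { errors.add(e.getMessage()); }
                public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse document", e);
        }
        return errors;
    }

    public static void main(String[] args) {
        System.out.println("valid document errors:   " + validate(VALID));
        System.out.println("invalid document errors: " + validate(INVALID));
    }
}
```

The second document omits the account attribute, which the expanded parameter entity marks as #REQUIRED, so the validating parser should report it.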

5.1.3 Use Elements Sparingly, Attributes Excessively

After giving you two recommendations about organization, I will now make what might seem like a counterintuitive suggestion: use elements infrequently and, instead, use attributes whenever possible.

To get a better idea of what I'm talking about, take a look at the XML fragment in Example 5-3.

Example 5-3. An element-heavy document fragment

 <person>
   <firstName>Adam</firstName>
   <lastName>Duritz</lastName>
   <address type="home">
     <street>102 Elizabeth Lane</street>
     <street>Apartment 23</street>
     <city>Los Angeles</city>
     <state>California</state>
     <zipCode>92013</zipCode>
   </address>
 </person>

To optimize this XML, you should try to convert as much as possible into attributes. The rule of thumb here is that any single-valued content can be turned into an attribute, while multivalued content must stay as elements. So, the firstName and lastName elements can be converted into attributes; each will always have only one value. Hence, the XML can be modified to look as follows:

 <person firstName="Adam" lastName="Duritz">
   <address type="home">
     <street>102 Elizabeth Lane</street>
     <street>Apartment 23</street>
     <city>Los Angeles</city>
     <state>California</state>
     <zipCode>92013</zipCode>
   </address>
 </person>

The address element could not be converted to an attribute. First, it has its own content, and second, there could be multiple addresses for the same person (a home address, work address, and so forth). Within that element, you can perform the same checks: street is multivalued, so it stays as an element, but city, state, and zipCode are all single-valued, and can be moved to attributes:

 <person firstName="Adam" lastName="Duritz">
   <address type="home" city="Los Angeles" state="California" zipCode="92013">
     <street>102 Elizabeth Lane</street>
     <street>Apartment 23</street>
   </address>
 </person>

To a lot of developers and content authors, this might look a bit odd. However, if you get into the habit of writing your XML in this fashion, it will soon seem completely natural. In fact, you'll soon look at XML with a wealth of elements as the odd bird.

Of course, I have yet to tell you why to perform this change; what is worth all this trouble? The reason behind this is in the way that SAX processes elements and attributes.

Some of you might be thinking that you don't want to use SAX, or that by using DOM or JAXP (or another API such as JAXB or SOAP), you'll get around this issue. However, it's unwise to assume that you will never need a specific API. In fact, almost all higher-level APIs such as DOM, SOAP, and JAXB use SAX at the lowest levels. So, while you might not think this practice affects your XML code, it almost certainly will.

Every time the SAX API processes an element, it invokes the startElement( ) callback, with the following signature:

 public void startElement(String namespaceURI, String localName,
                          String qName, Attributes attributes)
     throws SAXException;

Typically, there is a great deal of decision-processing logic in this method, which goes something like this: if the element is named "this," perform some processing; if it is named "that," do some other processing; if it's named "something else," do something else again. Consequently, every invocation of this method tends to involve numerous string comparisons (which are not particularly fast) as well as several expression evaluations (e.g., if/then/else, etc.).

In addition, for every startElement( ) call, there is an accompanying endElement( ) call. So, if you processed the element-heavy fragment in Example 5-3, you would suddenly find yourself staring at the lengthy list of method calls shown in Example 5-4. And that's without even looking at invocations of characters( ) and the like within each element!
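The decision logic described here might look something like the following sketch. The class name DispatchingHandler, the log list, and the sample document are all hypothetical; only the startElement( ) callback and the SAXParser entry point come from the standard SAX and JAXP APIs. The parser is left in its default, non-namespace-aware mode, so element names arrive in qName.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative handler: branches on the element name in startElement().
class DispatchingHandler extends DefaultHandler {

    // Records what the handler decided to do for each element it saw.
    final List<String> log = new ArrayList<>();

    @Override
    public void startElement(String namespaceURI, String localName,
                             String qName, Attributes attributes) {
        // One string comparison per branch, on every element in the document.
        if ("person".equals(qName)) {
            log.add("person " + attributes.getValue("firstName")
                    + " " + attributes.getValue("lastName"));
        } else if ("address".equals(qName)) {
            log.add("address in " + attributes.getValue("city"));
        } else if ("street".equals(qName)) {
            log.add("street");
        }
    }

    // Parses the document and returns the handler's log.
    static List<String> parse(String xml) {
        try {
            DispatchingHandler handler = new DispatchingHandler();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
            return handler.log;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse document", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<person firstName=\"Adam\" lastName=\"Duritz\">"
                   + "<address city=\"Los Angeles\">"
                   + "<street>102 Elizabeth Lane</street>"
                   + "</address></person>";
        parse(xml).forEach(System.out::println);
    }
}
```

Every additional element type adds another branch to this chain, and every element in the document walks the chain until it finds a match.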

Example 5-4. Element-heavy SAX processing

 startElement( ) // "person"
 startElement( ) // "firstName"
 endElement( )   // "firstName"
 startElement( ) // "lastName"
 endElement( )   // "lastName"
 startElement( ) // "address"
 startElement( ) // "street" (1st one)
 endElement( )   // "street" (1st one)
 startElement( ) // "street" (2nd one)
 endElement( )   // "street" (2nd one)
 startElement( ) // "city"
 endElement( )   // "city"
 startElement( ) // "state"
 endElement( )   // "state"
 startElement( ) // "zipCode"
 endElement( )   // "zipCode"
 endElement( )   // "address"
 endElement( )   // "person"

That is a lot of processing time! However, with each invocation, the attributes for the element are passed along. This means an element with several attributes triggers no more callbacks than an element with just one attribute. So, as I mentioned earlier, decreasing the number of single-value elements and instead loading them as attributes onto an element can drastically decrease the parsing time.

Revisiting Example 5-3 and converting most of the elements to attributes, the long list of method calls in Example 5-4 comes out much shorter, as shown in Example 5-5.

Example 5-5. Element-light SAX processing

 startElement( ) // "person"
 startElement( ) // "address"
 startElement( ) // "street" (1st one)
 endElement( )   // "street" (1st one)
 startElement( ) // "street" (2nd one)
 endElement( )   // "street" (2nd one)
 endElement( )   // "address"
 endElement( )   // "person"

Eighteen method calls became eight, a reduction of over 50%. ^[2] Add to that the reduction in decision-processing logic in the startElement( ) method because there are fewer elements, and the reduction in characters( ) callback invocations, and this is clearly a good practice to follow.

^[2] This ignores the work to parse the attributes, which may reduce the savings below 50%.
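The two call counts can be checked with a small counting handler. This is only a sketch: the class name CallbackCounter and its constants are invented for illustration, and it counts startElement( ) and endElement( ) invocations only, so it says nothing about attribute-parsing cost or actual timings.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative handler: counts element callbacks for a document.
class CallbackCounter extends DefaultHandler {

    // The element-heavy fragment from Example 5-3.
    static final String ELEMENT_HEAVY =
          "<person>"
        + "<firstName>Adam</firstName><lastName>Duritz</lastName>"
        + "<address type=\"home\">"
        + "<street>102 Elizabeth Lane</street><street>Apartment 23</street>"
        + "<city>Los Angeles</city><state>California</state>"
        + "<zipCode>92013</zipCode>"
        + "</address></person>";

    // The same data with single-valued content moved into attributes.
    static final String ATTRIBUTE_HEAVY =
          "<person firstName=\"Adam\" lastName=\"Duritz\">"
        + "<address type=\"home\" city=\"Los Angeles\""
        + " state=\"California\" zipCode=\"92013\">"
        + "<street>102 Elizabeth Lane</street><street>Apartment 23</street>"
        + "</address></person>";

    int calls; // total startElement() + endElement() invocations

    @Override
    public void startElement(String namespaceURI, String localName,
                             String qName, Attributes attributes) {
        calls++;
    }

    @Override
    public void endElement(String namespaceURI, String localName, String qName) {
        calls++;
    }

    // Parses the document and returns the number of element callbacks.
    static int count(String xml) {
        try {
            CallbackCounter counter = new CallbackCounter();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), counter);
            return counter.calls;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("element-heavy:   " + count(ELEMENT_HEAVY));   // 18
        System.out.println("attribute-heavy: " + count(ATTRIBUTE_HEAVY)); // 8
    }
}
```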
