Item 16. Prefer URLs to Unparsed Entities and Notations | Effective XML: 50 Specific Ways to Improve Your XML

Unparsed entities and notations are among the weirder parts of the XML specification. They're not well understood in the community and are not properly implemented in all APIs. There's little you can do with an unparsed entity that you can't do with a URL in an attribute value anyway, so there's not a lot of need for them. They're mostly a holdover from SGML and the pre-Web world.

For example, suppose you want to embed images in a variety of formats in your documents. This is a completely reasonable thing to do. The unparsed entity approach is to first define notations for all the different image formats in the DTD.

 <!NOTATION GIF  SYSTEM "image/gif">
 <!NOTATION JPEG SYSTEM "image/jpeg">
 <!NOTATION PNG  SYSTEM "image/png">
 <!NOTATION SVG  SYSTEM "image/svg+xml">

Unfortunately, there's no standard format for notation values. I've used MIME types here because that seems reasonable. However, there's no guarantee that any particular program will recognize these types. The parser will simply tell the client application what the notation is. It will not tell the client application how to interpret data with this notation. (Some parsers and APIs won't even do that much.)
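As a rough illustration of how little the parser does with a notation, the following sketch registers a DTDHandler and simply records each notation name with its system identifier. The class name NotationLister and its list method are hypothetical, and the sketch assumes the SAX parser bundled with the JDK. Note that a SAX parser may turn a relative system identifier such as image/gif into an absolute URL before reporting it, so the reported value is not guaranteed to match the DTD character for character.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.DTDHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

// Hypothetical helper: records notation names and system identifiers.
final class NotationLister implements DTDHandler {

  private final Map<String, String> notations = new LinkedHashMap<>();

  @Override
  public void notationDecl(String name, String publicID, String systemID) {
    // The parser reports the declaration and nothing more. The system ID
    // may have been resolved against the base URL by the parser.
    notations.put(name, systemID);
  }

  @Override
  public void unparsedEntityDecl(String name, String publicID,
      String systemID, String notationName) {
    // Not needed for this sketch.
  }

  // Parses the document and returns notation name -> reported system ID.
  static Map<String, String> list(String xml) {
    NotationLister lister = new NotationLister();
    try {
      XMLReader reader =
          SAXParserFactory.newInstance().newSAXParser().getXMLReader();
      reader.setDTDHandler(lister);
      reader.parse(new InputSource(new StringReader(xml)));
    } catch (ParserConfigurationException | SAXException | IOException e) {
      throw new IllegalStateException("Could not parse document", e);
    }
    return lister.notations;
  }

  public static void main(String[] args) {
    String xml = "<!DOCTYPE Image [\n"
        + "  <!NOTATION GIF  SYSTEM \"image/gif\">\n"
        + "  <!NOTATION JPEG SYSTEM \"image/jpeg\">\n"
        + "]>\n"
        + "<Image/>";
    System.out.println(list(xml));
  }
}
```

Deciding what to do with data labeled GIF or JPEG is still entirely up to the application.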

Having defined the notations, the next step is to define the unparsed entities. This is normally done in the internal DTD subset so that different documents can load different entities. For example, the following document type declaration defines two unparsed entities, one named PUPPY for a JPEG image at the relative URL images/fido.jpg and the other named LOGO for an SVG image at the absolute URL http://www.example.com/images/cup.svg:

 <!ENTITY PUPPY SYSTEM "images/fido.jpg" NDATA JPEG>
 <!ENTITY LOGO  SYSTEM
     "http://www.example.com/images/cup.svg" NDATA SVG>

Unparsed entities like these cannot be referenced with a simple entity reference as parsed entities can be. Instead, you have to declare an attribute with type ENTITY and place the name of the unparsed entity in that attribute value. For example, the following line declares that the source attribute of the Image element has type ENTITY.

 <!ATTLIST Image source ENTITY #REQUIRED>

Finally, in the instance document you would add an Image element with a source attribute.

 <Image source="PUPPY"/>

That's a huge amount of work, especially for what little it buys you. There is a much simpler alternative: Put the URL for the image in the source attribute directly.

 <Image source="images/fido.jpg"/>

In every API I've seen, it is much easier to read the URL directly out of the attribute than to load it from the entities declared in the DTD. For example, in SAX, loading an unparsed entity from an attribute requires first storing all the entities declared in the DTD using an implementation of the DTDHandler interface such as the one below.

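 import org.xml.sax.*;
 import java.util.Hashtable;

 public class UnparsedEntityCache implements DTDHandler {

   // Maps each unparsed entity name to its system ID (the URL)
   private Hashtable<String, String> entities
    = new Hashtable<String, String>();

   public void unparsedEntityDecl(String name, String publicID,
    String systemID, String notationName) {
     entities.put(name, systemID);
   }

   // DTDHandler also requires this method; notations are ignored here
   public void notationDecl(String name, String publicID,
    String systemID) {
   }

   public String getUnparsedEntity(String name) {
     return entities.get(name);
   }

 }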

Then you need to reference this data structure from inside the startElement method.

 public void startElement(String namespaceURI, String localName,
  String qualifiedName, Attributes attributes) {
   // The attribute value is the entity name, not the URL
   String source = attributes.getValue("source");
   String url = cache.getUnparsedEntity(source);
   // Download the image from the URL...
 }

Here's the equivalent code that loads the same URL directly from an attribute. No separate cache is required.

 public void startElement(String namespaceURI, String localName,
  String qualifiedName, Attributes attributes) {
   String url = attributes.getValue("source");
   // Download the image from the URL...
 }

I think you'll agree this is much simpler (and reading unparsed entities is actually much easier in SAX than in every other common API).
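To make the comparison concrete, here is a sketch of everything the unparsed entity approach has to wire together before a single URL comes out. It assumes the SAX parser bundled with the JDK and relies on the fact that DefaultHandler implements both DTDHandler and ContentHandler, so one object can play both roles. The class name EntitySourceCollector and its collect method are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: resolves ENTITY-typed source attributes to URLs.
final class EntitySourceCollector extends DefaultHandler {

  // Unparsed entity name -> system ID, filled in while the DTD is read
  private final Map<String, String> entities = new HashMap<>();
  private final List<String> urls = new ArrayList<>();

  @Override
  public void unparsedEntityDecl(String name, String publicID,
      String systemID, String notationName) {
    entities.put(name, systemID);
  }

  @Override
  public void startElement(String namespaceURI, String localName,
      String qualifiedName, Attributes attributes) {
    String entityName = attributes.getValue("source");
    if (entityName == null) return;
    // The second lookup is the extra step a direct URL would not need.
    String url = entities.get(entityName);
    if (url != null) urls.add(url);
  }

  // Parses the document and returns the URL behind each source attribute.
  static List<String> collect(String xml) {
    EntitySourceCollector handler = new EntitySourceCollector();
    try {
      // parse() registers the handler for both DTD and content events.
      SAXParserFactory.newInstance().newSAXParser()
          .parse(new InputSource(new StringReader(xml)), handler);
    } catch (ParserConfigurationException | SAXException | IOException e) {
      throw new IllegalStateException("Could not parse document", e);
    }
    return handler.urls;
  }

  public static void main(String[] args) {
    String xml = "<!DOCTYPE Image [\n"
        + "  <!NOTATION SVG SYSTEM \"image/svg+xml\">\n"
        + "  <!ENTITY LOGO SYSTEM\n"
        + "    \"http://www.example.com/images/cup.svg\" NDATA SVG>\n"
        + "  <!ATTLIST Image source ENTITY #REQUIRED>\n"
        + "]>\n"
        + "<Image source=\"LOGO\"/>";
    System.out.println(collect(xml));
  }
}
```

With a URL placed directly in the attribute, the map, the unparsedEntityDecl override, and the second lookup all disappear.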

If you want something marginally more standard, you can always use XLinks instead. For example:

 <Image xmlns:xlink="http://www.w3.org/1999/xlink"
        xlink:type="simple" xlink:actuate="onLoad"
        xlink:show="embed" xlink:href="images/fido.jpg"/>
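Reading the XLink form in SAX is nearly as simple as reading a plain attribute. The only difference is that the parser must be namespace aware and the attribute is looked up by namespace URI and local name rather than by qualified name. The following is a minimal sketch under those assumptions; the class name XLinkHrefReader and its read method are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the value of every xlink:href attribute.
final class XLinkHrefReader extends DefaultHandler {

  static final String XLINK_NAMESPACE = "http://www.w3.org/1999/xlink";

  private final List<String> hrefs = new ArrayList<>();

  @Override
  public void startElement(String namespaceURI, String localName,
      String qualifiedName, Attributes attributes) {
    // Look up by namespace URI so the prefix chosen by the author
    // of the document does not matter.
    String href = attributes.getValue(XLINK_NAMESPACE, "href");
    if (href != null) hrefs.add(href);
  }

  // Parses the document and returns the xlink:href values in order.
  static List<String> read(String xml) {
    XLinkHrefReader handler = new XLinkHrefReader();
    try {
      SAXParserFactory factory = SAXParserFactory.newInstance();
      factory.setNamespaceAware(true); // off by default in JAXP
      factory.newSAXParser()
          .parse(new InputSource(new StringReader(xml)), handler);
    } catch (ParserConfigurationException | SAXException | IOException e) {
      throw new IllegalStateException("Could not parse document", e);
    }
    return handler.hrefs;
  }

  public static void main(String[] args) {
    String xml = "<Image xmlns:xlink=\"http://www.w3.org/1999/xlink\"\n"
        + "       xlink:type=\"simple\" xlink:actuate=\"onLoad\"\n"
        + "       xlink:show=\"embed\" xlink:href=\"images/fido.jpg\"/>";
    System.out.println(read(xml));
  }
}
```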

The one thing the unparsed entity offers that a direct URL in an attribute value doesn't is the notation. However, in practice, the data type can often be determined from the file name or the metadata associated with the URL stream. For example, HTTP includes a Content-Type header that specifies the MIME type for all images it transmits.

 HTTP/1.1 200 OK
 Date: Thu, 30 Jan 2003 15:55:18 GMT
 Server: Apache/1.3.27 (Unix) DAV/1.0.3 mod_fastcgi/2.2.12
 Last-Modified: Tue, 28 Apr 1998 13:31:47 GMT
 Content-Length: 900
 Connection: close
 Content-Type: image/gif
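Both ways of recovering the type are available in the standard Java library, as the following sketch suggests. The class name ImageTypeGuesser and its two methods are hypothetical. The fromName method consults the JDK's built-in table of file extensions and returns null for extensions it does not know. The fromHeaders method contacts the server, so it needs network access and is only as reliable as the Content-Type header the server sends.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URLConnection;

// Hypothetical helper: works out an image's MIME type without a notation.
final class ImageTypeGuesser {

  // Guesses the type from the file name extension alone.
  // Returns null if the JDK's extension table has no match.
  static String fromName(String url) {
    return URLConnection.guessContentTypeFromName(url);
  }

  // Asks the server for the type. For HTTP URLs a HEAD request is used
  // so that only the headers are transferred, not the image itself.
  static String fromHeaders(String url) throws IOException {
    URLConnection connection = URI.create(url).toURL().openConnection();
    if (connection instanceof HttpURLConnection) {
      ((HttpURLConnection) connection).setRequestMethod("HEAD");
    }
    return connection.getContentType();
  }

  public static void main(String[] args) {
    System.out.println(fromName("images/fido.jpg"));
  }
}
```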

Notations can also be used to identify the type of an element rather than an unparsed entity. The only XML application I've ever seen that actually uses this approach is DocBook, which uses the linespecific notation to identify elements in which white space should be preserved. For example:

 <!NOTATION linespecific SYSTEM "linespecific">
 <!ATTLIST programlisting format NOTATION (linespecific)
                                          "linespecific">

However, this is mostly a holdover from DocBook's SGML legacy. In XML, the preferred way to do this is with an xml:space attribute.

 <!ATTLIST programlisting xml:space (preserve) #FIXED "preserve">

You can imagine other uses for notation type attributes. For example, you could use them to assign data types to elements.

 <!NOTATION decimal SYSTEM
   "http://www.w3.org/TR/xmlschema-2/#decimal">
 <!ATTLIST weight type NOTATION (decimal) #REQUIRED>

However, the W3C Schema Working Group chose to go with a global xsi:type attribute instead.

 <!ATTLIST weight xsi:type NMTOKEN #FIXED "xsd:decimal">

In effect, the Working Group chose to use namespaces and a predefined attribute rather than notations. In general, this seems to be the way the wind is blowing. Most parsers and APIs have some support for notations, but if you actually rely on them in your applications, you're just going to confuse most users. This is a case where the simpler, more direct solution is much preferred.