White Space in Attributes | Effective XML: 50 Specific Ways to Improve Your XML

The final and trickiest case is attribute values. Depending on attribute type, the parser normalizes attribute values before reporting them to the client application. First it converts all tabs, carriage returns, and line feeds to one space each. This is done for all attributes. Next, if the attribute has a declared type and that type is anything other than CDATA, the parser also condenses all runs of spaces to a single space and finally trims all leading and trailing white space from the value. However, the parser does not perform this second step for attributes that have type CDATA or are undeclared.
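The two steps just described can be sketched in a few lines of Python. This is only an illustration of the rules as stated here, not any parser's actual implementation: the function name normalize_attribute_value and its declared_type parameter are hypothetical, and the sketch ignores the line-end handling and the entity and character reference expansion that a real parser also performs.

```python
import re
from typing import Optional


def normalize_attribute_value(raw: str, declared_type: Optional[str]) -> str:
    """Hypothetical helper that mimics the two normalization steps.

    declared_type is the type from the ATTLIST declaration (for example
    "CDATA" or "NMTOKEN"), or None if the attribute is undeclared.
    """
    # Step 1, applied to every attribute: each tab, carriage return,
    # and line feed becomes exactly one space.
    value = re.sub(r"[\t\r\n]", " ", raw)

    # CDATA and undeclared attributes stop after step 1.
    if declared_type is None or declared_type == "CDATA":
        return value

    # Step 2, only for declared non-CDATA types: condense runs of
    # spaces to a single space, then trim leading and trailing spaces.
    value = re.sub(r" +", " ", value)
    return value.strip(" ")
```

Under these assumptions, normalize_attribute_value("  red \n green ", "NMTOKENS") gives "red green", while the same input with type "CDATA" or None keeps every space and only turns the line feed into a space.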

For example, consider the following document. The parser will trim the leading and trailing white space from the year attribute because it has type NMTOKEN, but it will not trim the white space from the source attribute that has type CDATA or the group attribute that is undeclared.

    <?xml version="1.0"?>
    <!DOCTYPE motto [
      <!ATTLIST motto year   NMTOKEN #IMPLIED
                      source CDATA   #IMPLIED
    ]>
    <motto year=" 1908 " source=" Scouting  for  Boys " group=" BSA ">
      Be prepared
    </motto>

It is not necessary for the document to be valid in order for normalization to apply (indeed, the document above is not valid). It is only necessary that the attribute be declared on the element where it appears and that the parser read that declaration. All conforming XML parsers are required to read the internal DTD subset (up to the first external parameter entity reference they don't read) and use ATTLIST declarations in the internal DTD subset to decide whether to normalize or not. However, if an attribute is declared in the external DTD subset, then nonvalidating parsers may or may not read the declaration. This means validating and nonvalidating parsers can report different values for the same attribute, as can two nonvalidating parsers. The values will differ only in white space, but this can still be important. If this is a concern, make sure you use either a fully validating parser or a nonvalidating parser that is known to read the external DTD subset.

Note

Tim Bray, one of the primary authors of XML 1.0, has admitted that normalization of attribute values was a mistake. In his words, "Why the $#%%!@! should attribute values be 'normalized' anyhow? This was a pure process failure: at no point during the 18-month development cycle of XML 1.0 did anyone stand up and say 'why are you doing this?' I'd bet big bucks that if someone had, the silly thing would have died a well-deserved death." ^[1]

^[1] "Re: Attribute normalisation and character entities," posted on the xml-dev mailing list, January 27, 2000. Accessed in June 2003 at http://www.lists.ic.ac.uk/hypermail/xml-dev/xml-dev-Jan-2000/1085.html.