Tag Each Unit of Information | Effective XML: 50 Specific Ways to Improve Your XML

The key idea is that what's between two tags should be the minimum unit of text that can usefully be processed as a whole. It should not need to be further subdivided for the common use cases. An Amount element contains a complete amount and nothing else. The amount is a single thing, a whole unit of information, in this case a number. It does not have internal structure that any application is likely to care about.

Occasionally the question of what constitutes a unit may depend on where and how the data is used. For example, consider the Date element in the above Transaction element. It contains implicit markup based on the hyphen. It could instead be written like this:

 <Date>
   <Year>2003</Year>
   <Month>12</Month>
   <Day>15</Day>
 </Date>

Whether this is useful or not depends on how the dates will be used. If they're merely formatted on a page as is or passed to an API that knows how to create Date objects from strings like 2003-12-15, you may not need to separate out the month, day, and year as separate elements. Generally, whether to further subdivide data depends on the use to which the information will be put and the operations that will be performed on it. If dates are intended purely to notate a particular moment in history, then a format like <Date>2003-12-15</Date> is appropriate. This would be useful for figuring out whether it's time to drink a bottle of wine, determining whether a worker is eligible for retirement benefits, or calculating how much time remains on a car's warranty, for example. In none of these cases is the individual day, month, or even year very significant. Only the combination of these quantities matters. That dates are even divided into these quantities in the first place is mostly a fluke of astronomy and the planet we live on, not something intrinsic to the nature of time.

On the other hand, consider weather data. Since weather varies with the seasons and has a roughly periodic structure tied to the years and the months, it does make sense to compare weather from one February to the next, without necessarily considering the year. Other real-world data tied to annual and monthly cycles includes birthdays, pay periods, and financial results. If you're modeling this sort of data, you will want to be able to separate months, days, and years from each other. In this case, more structured markup such as <date><year>2003</year><month>12</month><day>15</day></date> is appropriate. The question is really whether processes manipulating this data are likely to want to treat the text as a single unit of information or as a composite of more fundamental data.
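The difference shows up quickly in processing code. The following is a minimal Python sketch using only the standard library; the sample documents and the function names whole_date and month_of are illustrative assumptions, not part of any real schema or API.

```python
from datetime import date
import xml.etree.ElementTree as ET

# Hypothetical sample data: one date as a single unit, one as a composite.
FLAT = "<Date>2003-12-15</Date>"
STRUCTURED = "<date><year>2003</year><month>12</month><day>15</day></date>"


def whole_date(xml_text: str) -> date:
    """Treat the element content as one unit and hand it to a date API."""
    return date.fromisoformat(ET.fromstring(xml_text).text.strip())


def month_of(xml_text: str) -> int:
    """Read the month straight from structured markup; no string splitting."""
    return int(ET.fromstring(xml_text).findtext("month"))


print(whole_date(FLAT))      # the application never looks inside the date
print(month_of(STRUCTURED))  # the parser has already isolated the month
```

When only the whole date matters, the first form is enough. When the month has to be compared across years, the second form lets the XML parser do the separating.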

However, just because you don't need to extract the individual components of a date does not mean that no one who works with the data will need to do that. Generally, I prefer to err on the side of too much markup rather than too little. Larger chunks of data can normally be formed by manipulating the parent or ancestor elements when necessary. It is easier to remove structure when processing than to add it.

The classic example of what not to do is Scalable Vector Graphics (SVG). SVG uses huge amounts of non-XML-based markup. For example, consider the following polygon element.

 <polygon points="350,75 379,161 469,161 397,215 423,301                  350,250 277,301 303,215 231,161 321,161" />

In particular, look at the value of the points attribute. That's not just a string of characters; it's a sequence of x, y coordinates. An SVG processor cannot simply work with the attribute value. Instead, it first has to divide the attribute value into matching pairs and decide which are x's and which are y's. The proper approach would have been to define the coordinates as child elements.
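To make the extra work concrete, here is a rough Python sketch of the second parsing pass an application needs for the points attribute. The function name parse_points is hypothetical, and the splitting rule is an assumption that covers only the comma-and-whitespace form shown above, not the full SVG grammar for this attribute.

```python
import re
import xml.etree.ElementTree as ET

SVG_POLYGON = (
    '<polygon points="350,75 379,161 469,161 397,215 423,301 '
    '350,250 277,301 303,215 231,161 321,161" />'
)


def parse_points(value: str) -> list[tuple[float, float]]:
    """Split a points attribute into (x, y) pairs by hand."""
    tokens = [t for t in re.split(r"[\s,]+", value.strip()) if t]
    if len(tokens) % 2 != 0:
        raise ValueError("odd number of coordinates")
    numbers = [float(t) for t in tokens]
    # Even positions are x values, odd positions are y values.
    return list(zip(numbers[0::2], numbers[1::2]))


# The XML parser delivers one opaque string; the rest is up to the application.
polygon = ET.fromstring(SVG_POLYGON)
points = parse_points(polygon.get("points"))
```

Every program that reads such a polygon has to carry its own version of this code, including its own decisions about what counts as an error.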

 <polygon>
   <point x="350" y="75"/>
   <point x="379" y="161"/>
   <point x="469" y="161"/>
   <point x="397" y="215"/>
   <point x="423" y="301"/>
   <point x="350" y="250"/>
   <point x="277" y="301"/>
   <point x="303" y="215"/>
   <point x="231" y="161"/>
   <point x="321" y="161"/>
 </polygon>

This way the XML processor would present the coordinates to the application already nicely parsed. This also demonstrates the important point that attributes don't support structure very well. (See Item 12.) Structured data normally needs to be stored in element hierarchies. Only the lowest, most unstructured pieces should be put in attributes.
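With element-based coordinates, reading a polygon reduces to ordinary tree navigation. The sketch below assumes the hypothetical point element proposed here, which real SVG does not define, and uses a shortened three-point sample.

```python
import xml.etree.ElementTree as ET

# Hypothetical element-based polygon; real SVG does not define <point>.
POLYGON = """
<polygon>
  <point x="350" y="75"/>
  <point x="379" y="161"/>
  <point x="469" y="161"/>
</polygon>
""".strip()

polygon = ET.fromstring(POLYGON)

# Each coordinate arrives already separated and already labeled as x or y.
points = [(float(p.get("x")), float(p.get("y"))) for p in polygon.findall("point")]
```

No regular expressions or pairing logic are needed, and a missing y shows up as a missing attribute on one specific element rather than as an odd-length list of numbers.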

The reason for this bad design was to avoid excessive file size and verbosity. However, terseness of markup is an explicit nongoal of XML. If you really care that much about how many characters a user must type, you shouldn't be using XML in the first place. In this case, however, terseness truly has no benefits. Almost all practical SVG is either generated by a computer program or drawn in a WYSIWYG application such as Adobe Illustrator. Software can easily handle a more verbose, pure XML format. Indeed, it would be considerably easier to write such SVG processing and generating software if all the structures were based on XML. File size is even less important. SVG documents are routinely gzipped in practice anyway, which rapidly eliminates any significant differences between the less and more verbose formats. (See Item 50.)

SVG goes even further in the wrong direction by incorporating the non-XML Cascading Style Sheets (CSS) format. For example, a polygon can be filled, stroked, and colored like this:

 <polygon style="fill: red; stroke: blue; stroke-width: 10"
          points="350,75 379,161 469,161 397,215 423,301
                  350,250 277,301 303,215 231,161 321,161" />

Fortunately for the most important and common styles, SVG also allows an attribute-based alternative. For example, this is an equivalent polygon:

 <polygon fill="red" stroke="blue" stroke-width="10"
          points="350,75 379,161 469,161 397,215 423,301
                  350,250 277,301 303,215 231,161 321,161" />

Nonetheless, because the CSS style attribute is allowed, an SVG renderer needs both an XML parser and a CSS parser. It's easier to write a CSS parser than an XML parser, but it's still a nontrivial amount of work. Furthermore, it's much harder to detect violations of CSS. Its less draconian error handling makes it easier to produce incorrect SVG documents that may not be noticed by authors. SVG is less interoperable and reliable than it would be if it were pure XML.
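The contrast in error handling can be sketched in a few lines of Python. The parse_style function below is a hypothetical, deliberately naive stand-in for a CSS parser: it ignores comments, quoting, and escapes, and its habit of silently skipping malformed declarations is my assumption about how a forgiving parser behaves, not a description of any particular SVG renderer.

```python
import xml.etree.ElementTree as ET


def parse_style(value: str) -> dict[str, str]:
    """Naive, forgiving split of a style attribute into properties."""
    properties = {}
    for declaration in value.split(";"):
        name, sep, val = declaration.partition(":")
        if not sep or not name.strip():
            continue  # malformed declarations are dropped without complaint
        properties[name.strip()] = val.strip()
    return properties


good = parse_style("fill: red; stroke: blue; stroke-width: 10")
typo = parse_style("fill: red; stroke blue; stroke-width: 10")  # missing colon

# The XML layer, by contrast, refuses a comparable slip outright.
try:
    ET.fromstring('<polygon fill="red" stroke="blue />')  # missing quote
    xml_error = None
except ET.ParseError as exc:
    xml_error = exc
```

In this sketch the mistyped style quietly loses its stroke and the author may never find out, whereas the mistyped attribute stops the XML parser immediately.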

XSL Formatting Objects (XSL-FO), by contrast, is an example of how to properly integrate XML formats with legacy formats such as CSS. It maintains the CSS property names, values, and meanings. However, it replaces CSS's native structure with an XML equivalent. XSL-FO doesn't have polygons, but here's a paragraph whose color is blue, whose background color is red, and whose border is ten pixels wide:

 <fo:block color="blue" background-color="red" border="10px">
   The text of the paragraph goes here.
 </fo:block>

This has all the advantages of familiarity with CSS but none of the disadvantages of non-XML structure. The semantics of CSS are retained while the syntax is changed to more convenient XML.