Where to Stop? | Effective XML: 50 Specific Ways to Improve Your XML

At the absolute extreme I've seen it suggested (facetiously) that an integer such as 6587 should be written like this:

    <integer>
      <thousands>6</thousands>
      <hundreds>5</hundreds>
      <tens>8</tens>
      <ones>7</ones>
    </integer>

Obviously, this is going too far. It would be far more troublesome to process than a simple, unmarked-up number. After all, almost everyone who wants to use a number treats it as an atomic quantity rather than a composition of single digits. However, this does suggest a good rule of thumb for where to stop inserting tags. Anything that will normally be treated as a single atomic value should not be further divided by markup. However, if a value is composed of smaller parts that will need to be addressed individually, they should be marked up.

Here are a few other common edge cases and my thoughts on why I would or wouldn't further divide them.

Numbers with units such as 7px, 8.5kg, or 108dB: Neither the unit nor the number means anything in isolation. It doesn't help much to know that a mass is denoted in kilograms without knowing how many kilograms. Similarly, there's not much point to knowing that the mass is 3.2 if you don't know whether that's 3.2 grams, 3.2 kilograms, or 3.2 metric tons. Thus I prefer to write such quantities as <mass>7.5kg</mass> and <speed>32mph</speed>.
Time: The division of time into hours, minutes, and seconds is very similar to the date case. Indeed, a date is just a somewhat more coarsely grained measure of time, and times can be appended to dates to more precisely identify a moment. However, durations of time are a different story. These include quantities such as the flight time from San Jose to New York or the number of minutes that can be recorded on a video tape in SP mode. Here it is the total time that matters, not the beginning point and end point. The division of time into 24 hours per day, 60 minutes per hour, and 60 seconds per minute is a historical relic of Babylonian astronomy and their base-60 number system, not anything fundamentally related to natural quantities (a point proved by the fact that durations can be flattened to a total number of minutes or seconds rather than using three different units). Thus I tend to treat a duration as a single quantity and write it using a form like <FlightTime>6h32m</FlightTime> instead of a more structured form such as <FlightTime><hours>6</hours><minutes>32</minutes></FlightTime>.
Lists: Both DTDs and schemas define list data types that can describe content separated by white space. In DTDs, these include attributes declared to have type IDREFS or ENTITIES. In schemas this includes any element or attribute declared with a list type. I really don't like this. This may be the only way to store plural quantities such as a list of entities or numbers in attributes. However, when faced with potentially plural things I prefer to use child elements. Overuse of attributes leads to markup that's hard to manage.
URLs: A URL (or URI) has a lot of internal structure. For instance, the URL http://www.cafeconleche.org:80/books/xmljava/chapters/ch09s07.html#d0e15480 has a protocol, a host (which itself has a host name, a domain name, and a top-level domain), a port, a file path, and a fragment identifier. Theoretically, you could mark this up like so:
```
<url>
  <protocol>http</protocol>
  <host>www.cafeconleche.org</host>
  <port>80</port>
  <file>/books/xmljava/chapters/ch09s07.html</file>
  <fragment>d0e15480</fragment>
</url>
```
However, in practice this is almost never done, and with good reason. Almost every use of a URL, from passing it to a method in a programming API to copying it and pasting into the browser location bar to painting it on the side of a building, expects to receive an entire URL, not a piece of one. In those rare cases where you need to divide a URL into its component parts, most APIs provide adequate support. Thus it's best not to subdivide the URL beyond what everyone expects.
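
As a rough illustration of taking such atomic values apart only when needed, here is a minimal Python sketch. It uses the standard library's `urllib.parse.urlsplit` on the URL from the example above, and a small regular expression for a number with a unit. The helper `split_quantity`, its pattern, and the dictionary keys (chosen to mirror the element names in the hypothetical markup) are assumptions made for this sketch, not part of any standard.

```python
import re
from urllib.parse import urlsplit

# The URL from the text, kept as one atomic string in the document.
url = "http://www.cafeconleche.org:80/books/xmljava/chapters/ch09s07.html#d0e15480"

# The standard library splits it on demand; no extra markup is required.
parts = urlsplit(url)
url_pieces = {
    "protocol": parts.scheme,
    "host": parts.hostname,
    "port": parts.port,          # parsed as an integer
    "file": parts.path,
    "fragment": parts.fragment,
}

# Hypothetical helper: split a quantity such as "7.5kg" into number and unit.
# The pattern is an assumed format (decimal number, then letters), nothing more.
_QUANTITY = re.compile(r"^\s*([+-]?\d+(?:\.\d+)?)\s*([A-Za-z]+)\s*$")


def split_quantity(text):
    """Return (number, unit) for strings like '7.5kg' or '32mph'."""
    match = _QUANTITY.match(text)
    if match is None:
        raise ValueError(f"not a number with a unit: {text!r}")
    return float(match.group(1)), match.group(2)
```

Consumers that want the whole value simply use the element content as is; only the rare consumer that needs a piece pays the cost of splitting it.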

In general, if I suspect that an element might usefully be further divided, I will divide it. XML has the opposite of the Humpty-Dumpty problem: It's much easier to put the pieces back together again when content is split by tags than it is to break it apart when there aren't enough tags. Having too much markup in your data is rarely a practical problem. Having too little markup is much more cumbersome.
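
The asymmetry between joining and splitting can be sketched in Python with the two `FlightTime` forms shown earlier. This is only an illustration under stated assumptions: the function names and the duration grammar (optional hours followed by optional minutes) are invented for the example, and a real format might need a more careful parser.

```python
import re
import xml.etree.ElementTree as ET

# Structured form: the pieces are already separated by tags.
structured = ET.fromstring(
    "<FlightTime><hours>6</hours><minutes>32</minutes></FlightTime>"
)

# Compact form: the pieces have to be recovered from the text.
compact = ET.fromstring("<FlightTime>6h32m</FlightTime>")


def join_duration(elem):
    """Easy direction: concatenate child element content."""
    return f"{elem.findtext('hours')}h{elem.findtext('minutes')}m"


# Assumed grammar for this sketch: optional hours, then optional minutes.
_DURATION = re.compile(r"^(?:(\d+)h)?(?:(\d+)m)?$")


def split_duration(elem):
    """Harder direction: the code must know the text format in advance."""
    text = (elem.text or "").strip()
    match = _DURATION.match(text)
    if not text or match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    hours = int(match.group(1) or 0)
    minutes = int(match.group(2) or 0)
    return hours, minutes


def total_minutes(elem):
    """Flatten a compact duration to a single number of minutes."""
    hours, minutes = split_duration(elem)
    return hours * 60 + minutes
```

Joining needs no knowledge beyond the element names, while splitting depends on a text grammar that every consumer has to agree on and reimplement.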