Avoid Implicit Structure | Effective XML: 50 Specific Ways to Improve Your XML

You need to be especially wary of implicit markup, often indicated by white space. For example, consider the simple case of a name:

 <Name>Lenny Bruce</Name>

The name is sometimes treated as a single thing, but quite often you need to extract the first name and last name separately, most commonly to sort by last name. This seems easy enough to do: just split the string on the white space. The first name is everything before the space. The last name is everything after the space. Of course, this algorithm falls apart as soon as you add middle names:

 <Name>Lenny Alfred Bruce</Name>

You may decide that you don't really care about middle names, that they can just be appended to the first name. You're just going to sort by last name anyway. However, now consider what happens when the last name contains white space:

 <Name>Stefania de Kennessey</Name>

The obvious algorithm assigns people the wrong last name. This can be quite offensive to the person whose name you've butchered, not that I haven't seen a lot of naïve software that does exactly this.
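To make the failure concrete, here is a minimal Python sketch of the split-on-white-space approach just described. The function name `naive_split` is my own, chosen only for illustration, and the sketch uses the variant that folds any middle names into the first name by splitting on the last space.

```python
def naive_split(full_name: str) -> tuple[str, str]:
    """Split on the last space: the final word is treated as the last name,
    and everything before it (middle names included) as the first name."""
    first, _, last = full_name.rpartition(" ")
    return first, last


names = ["Lenny Bruce", "Lenny Alfred Bruce", "Stefania de Kennessey"]
results = [naive_split(name) for name in names]

for name, (first, last) in zip(names, results):
    print(f"{name!r}: first={first!r}, last={last!r}")
```

The first two names come out as intended, but the third is assigned the last name Kennessey rather than de Kennessey, and nothing in the string itself tells the code that it has gone wrong.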

What about titles? For example, consider these names:

 <Name>Mr. Lenny Bruce</Name>
 <Name>Dr. Benjamin Spock</Name>
 <Name>Timothy Leary, Ph.D.</Name>
 <Name>William Kunstler, Esq.</Name>
 <Name>Ms. Anita Hoffman</Name>
 <Name>Prof. John H. Exton, M.D., Ph.D.</Name>

Given a large list of likely titles you can probably design an algorithm that accounts for these, but what seemed like a simple operation is rapidly complexifying in the face of real-world data.

Finally, let's recall that not all cultures put the family name last. For example, in Japan the family name normally comes first:

 <Name>Kawabata Yasunari</Name>

Thus when sorting Japanese names, you sort by the name that comes first rather than the one that comes last. Do you really want to try to design a system that can guess whether a string is a Japanese name or an English one? To make matters worse, the order of the names is often, but not always, reversed when Japanese names are translated into English:

 <Name>Yasunari Kawabata</Name>

In fact, Japanese written in Kanji normally doesn't even use white space between the family and given name:

 <Name>川端康成</Name>

The problem is a lot messier than it looks at first glance.

All of this goes away as soon as you use explicit markup to identify the different components of a name, instead of relying on software to sort it out.

 <Name><Given>Lenny</Given> <Family>Bruce</Family></Name>
 <Name><Given>Lenny</Given> <Middle>Alfred</Middle> <Family>Bruce</Family></Name>
 <Name>
   <Given>Stefania</Given> <Family>de Kennessey</Family>
 </Name>
 <Name>
   <Title>Mr.</Title> <Given>Lenny</Given> <Family>Bruce</Family>
 </Name>
 <Name>
   <Title>Dr.</Title>
   <Given>Benjamin</Given> <Family>Spock</Family>
 </Name>
 <Name>
   <Given>Timothy</Given>
   <Family>Leary</Family>, <Title>Ph.D.</Title>
 </Name>
 <Name>
   <Given>William</Given>
   <Family>Kunstler</Family>, <Title>Esq.</Title>
 </Name>
 <Name>
   <Title>Ms.</Title> <Given>Anita</Given>
   <Family>Hoffman</Family>
 </Name>
 <Name>
   <Title>Prof.</Title>
   <Given>John</Given> <MiddleInitial>H.</MiddleInitial>
   <Family>Exton</Family>,
   <Title>M.D.</Title>, <Title>Ph.D.</Title>
 </Name>
 <Name><Family>川端</Family><Given>康成</Given></Name>
 <Name><Family>Kawabata</Family> <Given>Yasunari</Given></Name>
 <Name><Given>Yasunari</Given> <Family>Kawabata</Family></Name>
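Once the components are marked up, sorting by family name no longer requires any guessing. The following Python sketch uses the standard library's `xml.etree.ElementTree` module. The wrapping `Names` element and the helper `family_name` are assumptions of mine, added only so the fragment parses as a single document.

```python
import xml.etree.ElementTree as ET

# A hypothetical <Names> wrapper around a few of the explicitly marked-up names.
DOC = """<Names>
  <Name><Title>Dr.</Title> <Given>Benjamin</Given> <Family>Spock</Family></Name>
  <Name><Family>Kawabata</Family> <Given>Yasunari</Given></Name>
  <Name><Given>Stefania</Given> <Family>de Kennessey</Family></Name>
  <Name><Given>Lenny</Given> <Family>Bruce</Family></Name>
</Names>"""


def family_name(name: ET.Element) -> str:
    """Read the family name straight from the markup, wherever it appears."""
    return name.findtext("Family", default="")


root = ET.fromstring(DOC)
# Case-insensitive sort so that "de Kennessey" is not pushed after the capitals.
sorted_families = sorted(
    (family_name(name) for name in root.iter("Name")), key=str.casefold
)
print(sorted_families)
```

The position of the `Family` element inside `Name`, the presence of titles, and white space inside the family name make no difference to this code, which is the point of making the structure explicit.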

Another example of abuse of white space occurs in narrative documents that attempt to treat white space as significant, as in the following poem. ^[1]

^[1] "Now" by Eleanor Alexander, republished in The County Series of Contemporary Poetry No. IX, Middlesex Poetry (High Holborn, U.K.: Fowler Wright Ltd., 1928).

 <poem type="sonnet" poet="Eleanor Alexander">
   For me, my friend, no grave-side vigil keep
   With tears that memory and remorse might fill;
   Give me your tenderest laughter earth-bound still,
   And when I die you shall not want to weep.
   No epitaph for me with virtues deep
   Punctured in marble pitiless and chill:
   But when play time is over, if you will,
   The songs that soothe beloved babes to sleep.

   No lenten lilies on my breast and brow
   Be laid when I am silent; roses red,
   And golden roses bring me here instead,
   That if you love or bear me I may know;
   I may not know, nor care, when I am dead:
   Give me your songs, and flowers, and laughter now.
 </poem>

Here the line breaks indicate the end of a verse, and the blank lines indicate the end of a stanza. However, this can be problematic when the content is displayed in an environment where the lines are wrapped or the white space is otherwise adjusted for typographical reasons. Furthermore, these white-space-based constraints can't be validated with respect to either XML (every stanza contains one or more lines) or poetry (the first stanza of a sonnet has eight lines; the second has six). Authors are likely to make mistakes when the white space is too significant. It's much better to make the stanza and line division explicit, as shown below.

 <poem type="sonnet" poet="Eleanor Alexander">
   <stanza>
     <line>For me, my friend, no grave-side vigil keep</line>
     <line>With tears that memory and remorse might fill;</line>
     <line>Give me your tenderest laughter earth-bound still,</line>
     <line>And when I die you shall not want to weep.</line>
     <line>No epitaph for me with virtues deep</line>
     <line>Punctured in marble pitiless and chill:</line>
     <line>But when play time is over, if you will,</line>
     <line>The songs that soothe beloved babes to sleep.</line>
   </stanza>
   <stanza>
     <line>No lenten lilies on my breast and brow</line>
     <line>Be laid when I am silent; roses red,</line>
     <line>And golden roses bring me here instead,</line>
     <line>That if you love or bear me I may know;</line>
     <line>I may not know, nor care, when I am dead:</line>
     <line>Give me your songs, and flowers, and laughter now.</line>
   </stanza>
 </poem>
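With explicit `stanza` and `line` elements, the constraints mentioned earlier become checkable by ordinary code. The sketch below is one possible check in Python, not a substitute for a schema. The function `sonnet_problems` and the helper `build_poem`, which fills a poem with placeholder lines, are hypothetical names introduced for this illustration.

```python
import xml.etree.ElementTree as ET


def sonnet_problems(poem: ET.Element) -> list[str]:
    """Return a list of structural problems; an empty list means none found."""
    problems = []
    counts = [len(stanza.findall("line")) for stanza in poem.findall("stanza")]
    # XML-level rule from the text: every stanza contains one or more lines.
    if any(count == 0 for count in counts):
        problems.append("every stanza must contain at least one line")
    # Poetry-level rule from the text: eight lines, then six, in a sonnet.
    if poem.get("type") == "sonnet" and counts != [8, 6]:
        problems.append(f"expected stanzas of 8 and 6 lines, found {counts}")
    return problems


def build_poem(line_counts: list[int]) -> ET.Element:
    """Hypothetical helper: build a sonnet whose stanzas hold placeholder lines."""
    poem = ET.Element("poem", type="sonnet")
    for count in line_counts:
        stanza = ET.SubElement(poem, "stanza")
        for number in range(count):
            ET.SubElement(stanza, "line").text = f"placeholder line {number + 1}"
    return poem


print(sonnet_problems(build_poem([8, 6])))  # well-formed sonnet
print(sonnet_problems(build_poem([8, 5])))  # second stanza is one line short
```

No comparable check is possible on the white-space version without first guessing where each line and stanza was meant to end.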

I think the only time you should insist on exact white space preservation is when the white space is actually a significant component of the content, as in the poetry of e.e. cummings or Python source code.

Computer source code, whether in Python or in other languages, is a special case. It has a huge amount of structure that just does not lend itself to expression in XML. Furthermore, parsers for this structure exist and are as common and useful as parsers for XML. (They're generally bundled as parts of compilers.) Most importantly, there are only two normal uses for source code embedded in XML documents:

Passing the code to a compiler
Displaying the complete, unformatted code to an end user, as in a programming tutorial

In neither of these cases is the process reading the XML likely to want to subdivide the data into smaller parts and treat them individually, even though these parts demonstrably exist. Thus it makes sense to leave the structure in source code implicit.
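As a rough illustration of the first use, the following Python sketch pulls a block of source code out of an XML document as one opaque string and hands it to Python's own compiler. The element names `example` and `code`, the embedded `greet` function, and the choice of a CDATA section are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# The embedded source is kept as a single text node with its white space intact.
DOC = """<example><code xml:space="preserve"><![CDATA[def greet(name):
    if name:
        return "Hello, " + name
    return "Hello"
]]></code></example>"""

# The XML layer never looks inside the code; it only extracts the whole string.
source = ET.fromstring(DOC).findtext("code")

# The language's own parser then recovers the structure left implicit in the XML.
namespace = {}
exec(compile(source, "<embedded>", "exec"), namespace)
print(namespace["greet"]("Lenny"))
```

The process reading the XML treats the code as a unit, and the indentation that Python depends on reaches the compiler unchanged.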