Overlapping Markup | Effective XML: 50 Specific Ways to Improve Your XML

Not all markup fits neatly into tree structures. The classic case of overlapping markup is tracking the structure of a text such as The Aeneid along with the identity of the scribe (usually a medieval monk) who copied it. This can be important for recognizing likely transcription errors, determining which sources multiple monasteries in different areas shared, whether different versions of a text were extant in the ancient world, and more. It's not uncommon for one scribe to pick up in the middle of a paragraph where another monk left off, as shown below.

    <Scribe name="Marcus">
    <Stanza>
      <Verse>ARMA virumque cano, Troiae qui primus ab oris</Verse>
      <Verse>Italiam, fato profugus, Laviniaque venit</Verse>
      <Verse>litora, multum </Scribe>
      <Scribe name="Josephus">ille et terris iactatus et alto</Verse>
      <Verse>vi superum saevae memorem Iunonis ob iram;</Verse>
      <Verse>multa quoque et bello passus,
             dum conderet urbem,</Verse>
      <Verse>inferretque deos Latio, genus unde Latinum,</Verse>
      <Verse>Albanique patres, atque altae moenia Romae.</Verse>
    </Stanza>
    <Stanza>
      <Verse>Musa, mihi causas memora, quo numine laeso,</Verse>
      <Verse>quidve dolens,</Scribe>
      <Scribe name="Marcus"> regina deum tot volvere casus</Verse>
      <Verse>insignem pietate virum, tot adire labores</Verse>
      <Verse>impulerit.  Tantaene animis caelestibus irae?</Verse>
    </Stanza>
    </Scribe>

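This is completely malformed XML: the Scribe start- and end-tags fall in the middle of Verse elements, so the elements overlap rather than nest. One solution occasionally used is to mark the beginning and end of each scribe's work with processing instructions.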

    <?beginscribe name="Marcus"?>
    <Stanza>
      <Verse>ARMA virumque cano, Troiae qui primus ab oris</Verse>
      <Verse>Italiam, fato profugus, Laviniaque venit</Verse>
      <Verse>litora, multum <?beginscribe name="Josephus"?>
             ille et terris iactatus et alto</Verse>
      <Verse>vi superum saevae memorem Iunonis ob iram;</Verse>
      <Verse>multa quoque et bello passus,
             dum conderet urbem,</Verse>
      <Verse>inferretque deos Latio, genus unde Latinum,</Verse>
      <Verse>Albanique patres, atque altae moenia Romae.</Verse>
    </Stanza>
    <Stanza>
      <Verse>Musa, mihi causas memora, quo numine laeso,</Verse>
      <Verse>quidve dolens,
      <?beginscribe name="Marcus"?>
             regina deum tot volvere casus</Verse>
      <Verse>insignem pietate virum, tot adire labores</Verse>
      <Verse>impulerit.  Tantaene animis caelestibus irae?</Verse>
    </Stanza>

On the one hand, this is well-formed XML. On the other hand, it doesn't really do anything you couldn't do with empty elements. Revision tracking presents similar issues because users may revise and delete ranges of text that do not necessarily coincide neatly with element boundaries. In both cases, the source of the difficulty is the same: the linear sequence of characters in a text doesn't always match up with the tree structure of well-formed XML. You can't always fit all useful information into a single tree.

I think empty elements are a much better fit here than processing instructions. The content of the instructions in examples like this is really a key part of the markup. It's not just supplementary information for one particular process. It reflects real information about the content. Thus it properly belongs in tags, not processing instructions. If the tag structure is too limiting, it's time to consider whether XML really fits the data in the first place. XML is a round hole that works very well with round pegs. Square pegs might better be pounded into different holes.
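To make the empty-element alternative concrete, here is a minimal sketch in Python. It assumes a hypothetical empty `<scribe name="..."/>` milestone element and a hypothetical `<Poem>` root wrapper, neither of which comes from an established vocabulary. The document stays well-formed, and a short walk in document order attributes each run of text to whichever scribe was most recently announced.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: <scribe/> is an empty milestone element and
# <Poem> is an assumed root wrapper. Neither name is standardized.
DOC = """<Poem>
<scribe name="Marcus"/>
<Stanza>
  <Verse>ARMA virumque cano, Troiae qui primus ab oris</Verse>
  <Verse>litora, multum <scribe name="Josephus"/>ille et terris iactatus et alto</Verse>
</Stanza>
</Poem>"""


def text_by_scribe(xml_text):
    """Return (scribe, text) pairs in document order."""
    root = ET.fromstring(xml_text)
    runs = []
    current = None  # name from the most recent <scribe/> milestone

    def add(text):
        # Skip whitespace-only runs; normalize internal whitespace.
        if text and text.strip():
            runs.append((current, " ".join(text.split())))

    def walk(elem):
        nonlocal current
        if elem.tag == "scribe":
            # A milestone changes the scribe for all text that follows.
            current = elem.get("name")
            return
        add(elem.text)
        for child in elem:
            walk(child)
            add(child.tail)  # text after the child, still inside elem

    walk(root)
    return runs


if __name__ == "__main__":
    for scribe, text in text_by_scribe(DOC):
        print(f"{scribe}: {text}")
```

The tree still describes only stanzas and verses. Authorship is recovered by the application from the milestones, which is the price of keeping both kinds of information in one well-formed document.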

One somewhat more plausible use of begin and end processing instructions occurred in a spell-checking program that recognized <?begin-nospell?> and <?end-nospell?> instructions to identify words that should not be checked, such as foreign words, proper names, and technical terms. For example:

    <para>
      Francine read <title>Processing
      <?begin-nospell?>XML<?end-nospell?> with Java</title> in
      its French translation, <?begin-nospell?><title>Traitement
      de XML avec Java</title>.
    </para>

In this case, overlapping markup is not a fundamental issue, as it is in the classic text example. However, such processing instructions can still fail to nest properly within elements. Worse, unlike overlapping elements caused by misplaced start- and end-tags, the parser will not detect this as an error. And people will and do make such errors. Did you notice that I left off the last <?end-nospell?> instruction in the above example?
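Because the parser will not catch such mistakes, the application has to. The following is a minimal sketch of such a check in Python, using the standard library's SAX interface. The class name, function name, and problem messages are illustrative assumptions, not part of the spell checker described above. It reports begin instructions that are never closed, end instructions with no matching begin, and regions whose begin and end sit inside different elements.

```python
import xml.sax

# The example from the text, with its missing final <?end-nospell?>.
SAMPLE = (
    "<para>Francine read <title>Processing "
    "<?begin-nospell?>XML<?end-nospell?> with Java</title> in "
    "its French translation, <?begin-nospell?><title>Traitement "
    "de XML avec Java</title>.</para>"
)


class NoSpellChecker(xml.sax.ContentHandler):
    """Hypothetical checker for begin-nospell/end-nospell pairing."""

    def __init__(self):
        super().__init__()
        self.stack = []          # ids of the currently open elements
        self.next_id = 0
        self.region_open = False
        self.region_parent = None
        self.problems = []

    def _parent(self):
        return self.stack[-1] if self.stack else None

    def startElement(self, name, attrs):
        self.next_id += 1
        self.stack.append(self.next_id)

    def endElement(self, name):
        self.stack.pop()

    def processingInstruction(self, target, data):
        if target == "begin-nospell":
            if self.region_open:
                self.problems.append("begin-nospell inside an open region")
            self.region_open = True
            self.region_parent = self._parent()
        elif target == "end-nospell":
            if not self.region_open:
                self.problems.append("end-nospell without begin-nospell")
            elif self.region_parent != self._parent():
                self.problems.append("region crosses an element boundary")
            self.region_open = False

    def endDocument(self):
        if self.region_open:
            self.problems.append("begin-nospell never closed")


def check_nospell(xml_text):
    """Parse xml_text and return a list of pairing problems."""
    handler = NoSpellChecker()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.problems


if __name__ == "__main__":
    for problem in check_nospell(SAMPLE):
        print(problem)
```

A check like this has to be written and run separately for every such pair of instructions, whereas mismatched element tags are rejected by any XML parser for free.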