6.3 XML utilities
6.3.1 XML diff tools
Dealing with plain text or programming in any text-based language is unthinkable without the diff and patch utilities. Sooner or later, you will find yourself looking for their XML analogs. Automatic diffs might be especially useful for group work: Figuring out exactly what a colleague has changed in the file you're working on, or merging changes to a collaboratively edited document, is much easier if you can extract a diff and then apply it back.
Don't touch what you can't parse. Of course you can use the regular, plain text versions of the diff and patch utilities. This is a less than optimal approach, however.
One problem is that two XML documents may be the same in the "XML sense" while being worlds apart from the viewpoint of a plain text diff. XML permits a fair amount of syntactic variation that does not change the semantics of a document, such as varying the amount of whitespace in tags or the order of attributes.
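The mismatch between the two viewpoints can be sketched in a few lines of Python. This is only an illustration, not one of the tools discussed in this section: the two document strings are hypothetical, and the comparison relies on the canonicalize function of the standard xml.etree.ElementTree module (Python 3.8 or later).

```python
import difflib
from xml.etree.ElementTree import canonicalize

# Two hypothetical serializations of the same document: attribute order,
# quoting, and whitespace inside the tags differ; the content does not.
doc_a = '<page id="team" lang="en"><p>Our team</p></page>'
doc_b = "<page  lang='en'\n      id='team' ><p>Our team</p ></page>"

# A plain text diff reports changes...
text_diff = list(
    difflib.unified_diff(doc_a.splitlines(), doc_b.splitlines(), lineterm="")
)

# ...while the canonical (C14N) forms of the two documents are identical.
same_xml = canonicalize(doc_a) == canonicalize(doc_b)

print(len(text_diff) > 0)
print(same_xml)
```

Canonicalization only removes syntactic noise; it does not locate or describe real changes, which is what the utilities below are for.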
Another problem is that if a document to be patched has changed even slightly since the diff was extracted, a plain text patch may either fail or (worse) produce invalid or non-well-formed output. The plain text utilities work with lines, but to keep XML well-formed, you must change it only at the level of elements or attributes. An XML diff utility must therefore include a complete XML parser and produce differences attached to specific locations in the XML tree rather than to line numbers.
Several open source diff projects for XML exist, implementing various approaches to this task. Some of them can only do diff; others attempt patching as well. Formats of diff lists also vary. Commercial products are also available. 
6.3.1.1 Diffs for viewing
If you are only interested in seeing the changes, then the diffmk utility, written in Perl by Norman Walsh, might do the trick. Given two XML files, it outputs the second of the two with some additional markup: Revision attributes are added to the elements whose content or attributes changed, and revision elements enclose new or deleted text nodes. (Deleted element nodes are not shown; if you want to see them, simply compare the same two documents in reverse order.)
You can specify different element type and attribute names for this revision markup each time you run diffmk. With the utility's output, it is easy to visualize the differences using CSS (6.1.5). Alternatively, you can write an XSLT stylesheet (or modify your existing one) so that the changed fragments are, for example, painted different colors in an HTML rendition.
Overall, the utility is very useful, even though it is not perfect. For one thing, it does not differentiate between elements whose data content has changed and those whose attributes have changed.
A more fundamental problem with diffmk is that its revision markup is only as granular as your source markup. For example, if a long paragraph is a single text node (i.e., has no child elements), then the entire paragraph will be marked as changed even if a single character was modified in it.
6.3.1.2 Diffs for patching
Other XML diff projects promise to do something diffmk cannot: patch an XML document so you can accumulate changes from several independent revisions.
The two utilities called diffxml and patchxml were written in Java by Adrian Mouat. They use their own format for diffs called DUL (Delta Update Language). A DUL diff is an XML document where each element (insert, delete, update, etc.) describes one change. These elements use XPath expressions to uniquely identify the changed nodes, addressing each node by its position among its siblings. As you can guess, such an XPath is rather fragile, as it relies upon the number of nodes of any type before the changed node remaining constant. Once you add or remove a node in your document, such an XPath expression will likely miss its target.
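The fragility is easy to reproduce. The following Python sketch does not use DUL or diffxml; it assumes a purely positional address built from node()[n] steps (an illustrative form, not necessarily what diffxml emits) and uses the standard xml.dom.minidom module. The helper names and the sample document are hypothetical.

```python
from xml.dom import minidom

def positional_path(node):
    """Address a node by counting child nodes of any type at each level."""
    steps = []
    while node.parentNode is not None:
        index = list(node.parentNode.childNodes).index(node) + 1
        steps.append(f"node()[{index}]")
        node = node.parentNode
    return "/" + "/".join(reversed(steps))

def resolve(doc, path):
    """Follow a path produced by positional_path; None if a step is out of range."""
    node = doc
    for step in path.strip("/").split("/"):
        index = int(step[len("node()["):-1])
        if index > len(node.childNodes):
            return None
        node = node.childNodes[index - 1]
    return node

original = minidom.parseString("<page><title>Team</title><p>Old text</p></page>")
target = original.documentElement.childNodes[1].firstChild  # text inside <p>
path = positional_path(target)

# The same document after someone inserted a comment before <title>.
edited = minidom.parseString(
    "<page><!-- note --><title>Team</title><p>Old text</p></page>"
)

print(path)
print(resolve(original, path).data)
print(resolve(edited, path).data)
```

In the edited document the same path lands on the title text instead of the paragraph text, so a patch applied there would silently change the wrong node.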
Another utility, called xmldiff, is written in Python and uses a similar XML-based diff language called XUpdate. Again, XPath is used for specifying the location of a change, but XUpdate uses a more robust syntax that stores the element names of all ancestors of a changed node.
To merge XUpdate changes, you can use another utility called 4update, which is a part of 4Suite, a Python-based XML processing platform.
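To see why name-based steps hold up better, here is a minimal Python sketch of applying one XUpdate-style change. It is not 4update: it handles only update elements whose select path consists of name[n] steps, and the namespace URI and element names are assumed from the XUpdate working draft. The sample document and the apply_updates helper are hypothetical.

```python
import xml.etree.ElementTree as ET

XU = "http://www.xmldb.org/xupdate"  # namespace assumed from the XUpdate draft

def apply_updates(doc_root, modifications):
    """Apply xupdate:update elements whose select uses name[n] steps only."""
    applied = 0
    for change in modifications.findall(f"{{{XU}}}update"):
        steps = change.get("select").strip("/").split("/")
        # ElementTree paths are relative to an element, so the first step
        # (the document element itself) is checked by hand and then dropped.
        if steps[0] != f"{doc_root.tag}[1]":
            continue
        target = doc_root.find("/".join(steps[1:])) if steps[1:] else doc_root
        if target is not None:
            target.text = change.text
            applied += 1
    return applied

doc = ET.fromstring("<page><title>Team</title><p>First</p><p>Old text</p></page>")
diff = ET.fromstring(
    f'<xupdate:modifications version="1.0" xmlns:xupdate="{XU}">'
    '<xupdate:update select="/page[1]/p[2]">New text</xupdate:update>'
    "</xupdate:modifications>"
)

# An unrelated edit made after the diff was extracted: a new first child.
doc.insert(0, ET.Element("note"))

applied = apply_updates(doc, diff)
print(applied)
print(doc.find("p[2]").text)
```

Because p[2] counts only p elements, the inserted note element does not shift the target. Adding or removing another p before it still would.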
6.3.2 XPath tools
Many XML editors offer an XPath facility allowing you to see the results of addressing a document with an XPath expression. A number of standalone XPath tools are worth checking out as well. In some respects, these tools may even be superior to an average XPath engine built into an XML editor.
6.3.2.1 Command line
The simplest tool in this category is a command-line utility that allows you to apply an XPath expression to a document and display the results. Such a utility, called simply xpath, was written by Matt Sergeant as part of the XML::XPath Perl package. You'll need an up-to-date Perl installation in order to use this package. Similar utilities, written in Java and in C++ (xml.apache.org/xalan-c/samples.html#xpathwrapper), are included with the Xalan XSLT processor.
No matter which XPath utility you choose, the usage is straightforward. You just give it the document pathname and the XPath expression as command-line parameters, and it displays (in serialized form) the nodeset returned by that expression. For example, the command
xpath en/team/index.xml //int
applied to our sample page document (Example 3.1, page 141) will display
Found 2 nodes:
-- NODE --
<int link="solutions">products</int>
-- NODE --
<int link="fbplus">FooBar Plus</int>
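If none of these utilities is at hand, the same kind of report can be approximated with Python's standard library. The sketch below is an assumption-laden stand-in, not the xpath utility itself: xpath_report is a hypothetical helper, the sample string merely imitates the page document (which is not reproduced here), and ElementTree supports only a subset of XPath, so the expression must be written as .//int rather than //int.

```python
import xml.etree.ElementTree as ET

def xpath_report(xml_text, expression):
    """Return a report shaped like the output shown above."""
    root = ET.fromstring(xml_text)
    lines = []
    for node in root.findall(expression):
        node.tail = None  # do not serialize the text that follows the element
        lines.append("-- NODE --")
        lines.append(ET.tostring(node, encoding="unicode"))
    return "\n".join([f"Found {len(lines) // 2} nodes:"] + lines)

# Hypothetical stand-in for the sample page document.
sample = (
    '<page><p>See our <int link="solutions">products</int>, especially '
    '<int link="fbplus">FooBar Plus</int>.</p></page>'
)

print(xpath_report(sample, ".//int"))
```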
Context provided separately. Note that the version of the utility from Xalan C++ requires two parameters: one is the XPath context and the other is the expression that is evaluated in that context. Formally speaking, this is unnecessary, because you can always combine the two into an equivalent composite expression evaluating in the "context" of the entire document. However, for an XSLT programmer, this separation makes practical sense: In an XSLT stylesheet, any XPath expression is also evaluated with regard to some context that is set outside of the expression.
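The equivalence of the two styles can be checked with a short Python sketch. It uses the standard xml.etree.ElementTree module rather than Xalan, and the document and expressions are hypothetical.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<page><block><p>one</p></block><block><p>two</p><p>three</p></block></page>"
)

# Context and expression supplied separately...
context = root.find("block[2]")
separate = context.findall("p")

# ...or merged into one composite expression evaluated from the top.
composite = root.findall("block[2]/p")

print([p.text for p in separate])
print(separate == composite)
```

Both calls return the same element objects; only the place where the context is established differs.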
If you prefer a GUI to a command-line interface, then the XPath Explorer utility (XPE, Figure 6.8) may be what you are looking for. It provides source and tree views of an XML document and lets you evaluate XPath expressions against it, highlighting the matching nodes (and listing them on the "Matching Nodes" tab).
Figure 6.8. XPath Explorer focuses on one thing, evaluating XPath expressions on a document, but does that really well.
Both matching and calculating. Unlike the more simplistic XPath tools we've seen in the previous section, XPE offers some useful extensions to this basic functionality. Most importantly, it not only shows you the matching nodes, but displays the value returned by the expression as a string, as a boolean, and as a number. This makes it possible to run not only match-type expressions but arbitrary calculations and comparisons as well. (You can even use XPE as a simple numeric calculator; type 6*9 in the "XPath:" field and see the result in the "Number value:" field.)
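The three values follow from the XPath 1.0 conversion rules for node-sets: the string value is that of the first node, the boolean value tells whether the set is non-empty, and the number is parsed from the string value. A rough Python sketch of these rules follows. It is not XPE's code; the helper names and sample document are hypothetical, and Python's float() accepts a few spellings (such as exponents) that XPath would turn into NaN.

```python
import math
import xml.etree.ElementTree as ET

def as_string(nodes):
    # String-value of the first node: all of its descendant text, concatenated.
    return "".join(nodes[0].itertext()) if nodes else ""

def as_boolean(nodes):
    # A node-set is true if and only if it is non-empty.
    return len(nodes) > 0

def as_number(nodes):
    # The number is parsed from the string value; unparsable text gives NaN.
    try:
        return float(as_string(nodes).strip())
    except ValueError:
        return math.nan

root = ET.fromstring(
    "<order><price>12.50</price><item>Foo<b>Bar</b> Plus</item></order>"
)

price = root.findall("price")
item = root.findall("item")
missing = root.findall("missing")

print(as_string(price), as_boolean(price), as_number(price))
print(as_string(item), as_boolean(item), as_number(item))
print(repr(as_string(missing)), as_boolean(missing), as_number(missing))
```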
Expression expanded. The "Expanded:" field is another nice touch. It shows you an equivalent of your expression with all XPath abbreviations expanded (e.g., @ is replaced with attribute::, element with child::element, and so on).
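The expansion rules themselves come from the XPath 1.0 abbreviated syntax and are simple enough to sketch. The Python function below is a deliberately limited illustration, not XPE's implementation: the name expand is hypothetical, and it handles only location paths without slashes or abbreviations inside predicates.

```python
def expand(path):
    """Expand abbreviated steps in a simple location path."""
    # '//' is short for '/descendant-or-self::node()/'.
    steps = path.replace("//", "/descendant-or-self::node()/").split("/")
    expanded = []
    for step in steps:
        if step == "" or "::" in step:
            expanded.append(step)  # leading slash, or an axis already spelled out
        elif step == ".":
            expanded.append("self::node()")
        elif step == "..":
            expanded.append("parent::node()")
        elif step.startswith("@"):
            expanded.append("attribute::" + step[1:])
        else:
            expanded.append("child::" + step)  # child is the default axis
    return "/".join(expanded)

print(expand("//int/@link"))
print(expand("../p"))
print(expand("./block/p[1]"))
```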
Generated paths. Finally, you can explore the loaded document by clicking on a node and seeing its corresponding XPath in the "Generated:" field below. There is an unbounded number of XPath expressions matching a given node, but most XPath-enabled tools (such as XML diff utilities, 6.3.1) generate a positional expression that walks down from the document root.
XPE uses the same "canonical" form for generated XPath expressions, but with a twist: If the target node is an element with an id attribute, XPE assumes that this attribute uniquely identifies the element and uses it as a selector.
Such an address is more robust: it will still work even if the target node is moved around in the document.
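A path generator along these lines might look like the Python sketch below. The exact expressions XPE produces are not shown here, so both output shapes (name[n] steps from the root, and //name[@id='...'] for elements with an id) are assumptions made for illustration; generated_path and the sample document are hypothetical as well.

```python
import xml.etree.ElementTree as ET

def generated_path(root, target):
    """Build an address for target: id-based if possible, positional otherwise."""
    ident = target.get("id")
    if ident is not None:
        return f"//{target.tag}[@id='{ident}']"
    # ElementTree elements do not know their parents, so build a lookup table.
    parents = {child: parent for parent in root.iter() for child in parent}
    steps = []
    node = target
    while node is not None:
        parent = parents.get(node)
        siblings = [node] if parent is None else [c for c in parent if c.tag == node.tag]
        steps.append(f"{node.tag}[{siblings.index(node) + 1}]")
        node = parent
    return "/" + "/".join(reversed(steps))

root = ET.fromstring(
    '<page><block><p>a</p></block><block id="news"><p>b</p><p>c</p></block></page>'
)
news = root[1]
last_p = news[1]

before = (generated_path(root, news), generated_path(root, last_p))

# Move the "news" block to the front of the page.
root.remove(news)
root.insert(0, news)

after = (generated_path(root, news), generated_path(root, last_p))

print(before)
print(after)
```

The id-based address is the same before and after the move, while the positional address of the paragraph inside the block changes from block[2] to block[1].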
Another nice tool is xsh (XML Shell), written by Petr Pajas. It is a full-featured interactive shell that can be used not only for exploring the structure of XML documents with XPath, but also for modifying, transforming, and creating new XML documents. Programs in the xsh command language can thus perform many of the functions of XSLT stylesheets, but look more like traditional Unix shell scripts.
So long as it's hierarchical. The principal metaphor of xsh is that of "XML document as a hierarchical file system." Thus, inside xsh, Unix shell commands such as ls, cd, and cp work on nodes or nodesets, and XPath expressions are used in place of file system pathnames. This idea may seem far-fetched at first, but it is very convenient once you are used to it. Manipulating and exploring documents from such a command-line interface may be almost as convenient and fast as, and in some cases even faster than, working with the document source in a text editor.
Tab, complete that. The really nifty capability of xsh is its "tab completion." As in any other Unix shell, you can press the Tab key at any time while editing the command line, and the program will fill in as much of the current XPath expression as is unambiguous, or list possible completions otherwise. Thus, at any step of a complex XPath, you can get immediate feedback on what nodes are located along any of the axes, starting from the context of this step (i.e., in terms of XPath 2.0, from the inner focus).
Tab completion in xsh speeds up XPath hacking incredibly. With it, you can learn a lot of things about the document structure even before you complete and run your command. For example, getting a list of all element type names at any level in a document is as simple as typing ls // and pressing Tab. If you think that Unix shells' tab completion is handy (and most people who ever tried it will agree), then you'll probably find the xsh version of this feature addictive.
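The file system metaphor is easy to imitate, which may help to see why it works. The Python sketch below is not xsh and does not use its command language: NodeShell and its three methods are hypothetical, paths are limited to what xml.etree.ElementTree understands, and completion is reduced to listing the element names found below the current context.

```python
import xml.etree.ElementTree as ET

class NodeShell:
    """A toy "document as file system" navigator."""

    def __init__(self, root):
        self.root = root
        self.cwd = root  # the current context node

    def cd(self, path):
        target = self.root if path == "/" else self.cwd.find(path)
        if target is None:
            raise ValueError(f"no such node: {path}")
        self.cwd = target

    def ls(self):
        # Child elements of the context, in document order.
        return [child.tag for child in self.cwd]

    def complete(self, prefix=""):
        # Element names anywhere below the context that start with prefix.
        names = {el.tag for el in self.cwd.iter() if el is not self.cwd}
        return sorted(name for name in names if name.startswith(prefix))

root = ET.fromstring(
    "<page><title/><block><p/><int/></block><block><p/></block></page>"
)
shell = NodeShell(root)

print(shell.ls())
print(shell.complete())
shell.cd("block[1]")
print(shell.ls())
print(shell.complete("p"))
```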
6.3.3 Grammar generation
As we've seen in the review of XML editors (6.1), having a grammar-based schema of your source documents is valuable, if only because it enables guided editing. Schematron rules can only stop you from going in a wrong direction, but they cannot tell you where to go when you are lost. If you're not very familiar with the source definition you must comply with, guided editing is a big boon.
Luckily, if you need a DTD, you don't have to write it from scratch. The idea is to use an existing valid document as input to an algorithm that deduces the DTD from that sample. Some of the XML editors mentioned in this chapter include this functionality, but there also exist standalone tools that will do the job just as well.
Saxon in the 6.* series includes a sample application called DTDGenerator that, given a well-formed XML sample, outputs a DTD to which it conforms. This DTD can then be fed to a converter to produce an equivalent schema in XSDL or any other schema language. An example of such a converter is NekoDTD, which works by first converting a DTD into an XML representation and then using a set of XSLT stylesheets to convert this representation into any of the several supported schema languages.
This procedure cannot be completely automatic. You will have to review the generated DTD and fix its misconceptions, usually by relaxing the restrictions that were deduced from a too small and unrepresentative sample. To make this manual fixup less of a necessity, it's a good idea to aggregate as much material as possible from real-world documents into the sample document that will be used by the grammar generator.
Of course, this aggregation must make sense structurally; for example, you cannot have multiple body elements in your sample if a body is supposed to occur only once per page. Instead, you might be able to take the content from many pages' body elements and insert it into your sample's body.
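To get a feel for what such a generator does, and for why a small sample misleads it, consider the following Python sketch. It is not Saxon's DTDGenerator and makes much cruder guesses: every content model is a simple repeatable choice, every attribute is CDATA, and namespaces are ignored. The function name guess_dtd and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict

def guess_dtd(xml_text):
    """Deduce a loose DTD from one well-formed sample document."""
    root = ET.fromstring(xml_text)
    children = defaultdict(list)   # element name -> child names, first-seen order
    has_text = defaultdict(bool)   # element name -> any non-whitespace text seen?
    count = Counter()              # element name -> number of instances
    attr_count = Counter()         # (element name, attribute) -> instances using it

    for el in root.iter():
        name = el.tag
        count[name] += 1
        kids = children[name]      # also registers elements that have no children
        texts = [el.text] + [child.tail for child in el]
        has_text[name] |= any(t and t.strip() for t in texts)
        for child in el:
            if child.tag not in kids:
                kids.append(child.tag)
        for attr in el.attrib:
            attr_count[(name, attr)] += 1

    lines = []
    for name, kids in children.items():
        if kids and has_text[name]:
            model = "(#PCDATA | " + " | ".join(kids) + ")*"
        elif kids:
            model = "(" + " | ".join(kids) + ")*"
        elif has_text[name]:
            model = "(#PCDATA)"
        else:
            model = "EMPTY"
        lines.append(f"<!ELEMENT {name} {model}>")
        for (owner, attr), used in attr_count.items():
            if owner == name:
                # Seen on every instance in the sample: guessed to be required.
                default = "#REQUIRED" if used == count[name] else "#IMPLIED"
                lines.append(f"<!ATTLIST {name} {attr} CDATA {default}>")
    return "\n".join(lines)

sample = (
    "<page><title>Team</title>"
    '<p>See <int link="solutions">products</int>.</p><p>Plain</p></page>'
)
print(guess_dtd(sample))
```

With this one-page sample, the link attribute comes out as #REQUIRED only because the single int element happens to carry it. This is exactly the kind of over-restriction you would have to relax by hand or prevent by feeding the generator a richer sample.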