Item 40. Avoid Vendor Lock-In | Effective XML: 50 Specific Ways to Improve Your XML

Although XML is a nonproprietary, vendor-independent technology, it doesn't have to stay that way. Be extremely cautious of any tool that would tie you to one vendor's systems. In some cases the lock-in is obvious. For instance, one vendor went so far as to patent its DTDs. That's easy to avoid. But sometimes the lock-in is less obvious. The real danger is complexity. If the system is so complex that you cannot imagine writing your own tools to process the documents it uses, avoid it. It's one thing to buy a useful tool from a vendor that will save you a couple of weeks of programmer time. It's a completely different thing to depend on a system that you couldn't reimplement given a couple of years of expert developer time.

Things to watch out for include the following.

Opaque, binary data used in place of marked-up text. Base64 encoding a proprietary, undocumented binary format and then stuffing it between two tags does not make it XML. Structure should be indicated by tag names.
Overabbreviated, unclear element and attribute names like F17354 and grgyt. Tag names should be obviously related to what they contain.
Binary encodings of XML that can't be processed with open tools, most especially those that rely on patented algorithms.
APIs that try to shield developers from the raw XML, especially when the API becomes the spec. The web services space is rife with these. Specifications for web services need to be defined in terms of the XML transmitted, not the methods called.
Products that focus on the Infoset to the exclusion of real XML. The Infoset is a useful vocabulary for talking about XML, but it is not XML. The only interoperable form of XML is Unicode in angle brackets. All transmissions between different systems should take place as real XML text.
Alternate serializations of XML. Once again, there is no XML other than Unicode in angle brackets. The Unicode in angle brackets can be hidden in different forms such as raw ASCII text files, gzipped data, or database fields. However, it always comes back to Unicode characters in a row. Beware of formats that attempt to replace rather than encode Unicode characters in a row.
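
The difference between encoding and replacing the text can be illustrated with a minimal sketch in Python, using only the standard library. The sample document and its element name are invented for illustration. The point is only that a container such as gzip hides the XML text without changing it, so any gzip implementation plus any XML parser can read it back.

```python
import gzip
import xml.etree.ElementTree as ET

# A small sample document (hypothetical content, for illustration only).
xml_text = '<?xml version="1.0" encoding="UTF-8"?><greeting lang="fr">Ça va?</greeting>'

# Encoding step: the XML text is wrapped in a gzip container.
packed = gzip.compress(xml_text.encode("utf-8"))

# Decoding step: unpacking yields exactly the same characters in a row.
unpacked = gzip.decompress(packed).decode("utf-8")

# An ordinary XML parser can process the unpacked bytes directly.
root = ET.fromstring(gzip.decompress(packed))

print(unpacked == xml_text)
print(root.tag, root.get("lang"), root.text)
```

A format that passes this kind of round trip through open tools is merely encoding the XML. A format that cannot be turned back into the original text without the vendor's software is replacing it.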

Make sure the XML that your tools emit is standards-compliant, well-formed XML 1.0. At one time or another various vendors have published formats and tools that attempt to "fix" things they feel are broken in XML. For instance, some early Microsoft formats such as CDF were case-insensitive and used XML declarations like this:

 <?XML version="1.0"?>

Needless to say, this could only be parsed by Microsoft's parser, not by everybody else's. For the most part the market has resoundingly rejected such fundamental nonconformance, so you don't have to worry a lot about it today. Nonetheless, in the last month alone I've seen two different companies pushing their own nonstandard and noninteroperable (with each other or anybody else) variants of XML. Worse yet, they're labeling their tools as XML tools and obscuring the difference between what they're doing and real XML. Casual XML users could easily be fooled into thinking that what they're buying from these companies makes them XML-compliant, even though it won't work with any of the other products on the market. Always remember that well-formed Unicode in angle brackets is the absolute minimum for XML. Don't accept any tool that tries to get away with less.
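
One inexpensive way to hold a tool to that minimum is to run its output through an independent, conformant parser. The following is a minimal sketch in Python, assuming the standard library's Expat-based parser as the independent check. The function name `is_well_formed` and the `doc` element are hypothetical and chosen only for this example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(document: str) -> bool:
    """Return True if a standard XML 1.0 parser accepts the document."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


# The standard, lowercase XML declaration.
standard = '<?xml version="1.0"?><doc/>'

# An uppercase declaration in the style shown above. XML is case-sensitive,
# and names that spell "xml" in any mix of case are reserved, so a
# conformant parser does not accept this.
nonstandard = '<?XML version="1.0"?><doc/>'

print(is_well_formed(standard))
print(is_well_formed(nonstandard))
```

Running the same check with a second parser from a different vendor gives additional assurance that the output does not depend on one implementation's leniency.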

Variations of this approach take a different, less obvious path. Schema languages can be changed from standard, vendor-independent languages such as DTDs, the W3C XML Schema Language, or RELAX NG to proprietary systems like Microsoft's XML Data Reduced (XDR) or the patented XLinkIt. Don't accept languages based on such owned foundations. Insist on nonproprietary schema languages as the normative definitions of any XML application you use.

Stylesheet languages have also been subject to vendor efforts to embrace and extend them in ways that are incompatible with the pure language. The worst offender here is Microsoft's XSLT implementation. Internet Explorer 5 shipped with an XSLT engine that partially implemented an early working draft of XSL. At the time this was understandable because there wasn't anything better, although if nothing else it was a serious error in judgment to ship software based on such an early, rapidly changing draft specification. However, Microsoft continued to ship this nonconformant engine in later versions of Internet Explorer even after the final version of XSLT was available. The company eventually wrote a parser and transformer that did implement the specification. However, Microsoft then configured the installer so that it did not actually replace the original, broken transformer. Worst of all, even after shipping a conformant engine, the company's web sites and trainers continued to evangelize and teach the old, experimental version of XSLT that was never implemented by anyone other than Microsoft.

Microsoft finally shipped a mostly conformant version of Internet Explorer in version 6, but even the allegedly conformant IE6 still has numerous standards compliance issues, among them those listed below.

Internet Explorer only recognizes the phantom MIME type text/xsl, not standard types like text/xml and application/xslt+xml. No text/xsl MIME type has been or is likely to be registered with the IANA. The XSLT specifications are clear that you should use text/xml or application/xml to identify XSLT stylesheets in xml-stylesheet processing instructions until the more specific MIME type application/xslt+xml has completed the registration process.
By default, the MSXML parser IE uses throws away all white-space-only nodes before transforming. Microsoft has made the jesuitical argument that this takes place during parsing and tree construction rather than during transformation, so it's allowed by the specification. The same argument applies equally to the claim that it's OK to turn all text nodes in the document to "All work and no play makes Jack a dull boy." I don't buy it.
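
To see what is at stake in the white space issue, it may help to look at what a parser that keeps everything reports for an ordinary indented document. This sketch uses Python's standard `xml.dom.minidom`, not MSXML, and the `list` and `item` element names are invented for the example.

```python
from xml.dom import minidom

# A tiny indented document (hypothetical element names).
source = "<list>\n  <item>a</item>\n  <item>b</item>\n</list>"

doc = minidom.parseString(source)
children = doc.documentElement.childNodes

# Text nodes that contain nothing but white space. These are the nodes
# that disappear if a parser strips white space while building the tree.
whitespace_only = [
    node for node in children
    if node.nodeType == node.TEXT_NODE and not node.data.strip()
]

print(len(children))         # element children plus the text between them
print(len(whitespace_only))  # how many of those are white space only
```

A stylesheet that copies or counts text nodes will behave differently depending on whether those nodes reach it, which is why stripping them silently before the transformation matters.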

Worst of all, the latest versions of IE6 and MSXML still accept the old, broken half XSLT beta/half meat byproduct syntax from IE5, thus continuing to foist this monstrosity on the world.

Perhaps the least offensive effort to "embrace and extend" the independent standards is in the area of APIs. XML deliberately does not define a standard data model or API. Interoperability comes through shared document syntax, not through a common API or binary representation. Of course, we do need APIs to process XML; but each developer is free to choose the one that works best for him or her. I may use JDOM and you may use SAX, but we can still understand each other as long as we pass XML back and forth. However, as Item 31 indicates, using a standard API like SAX or DOM that's implemented by many parsers does make your code more portable and allow you to switch vendors as necessary or convenient.
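
As a rough illustration of that portability, here is a minimal sketch using Python's binding of SAX in the standard `xml.sax` package. The `ElementCounter` class and `count_elements` function are hypothetical names for this example. The handler depends only on the SAX interfaces, so the parser behind it is a parameter rather than a hard-wired choice.

```python
import io
import xml.sax
from collections import Counter


class ElementCounter(xml.sax.ContentHandler):
    """Counts element names using only the standard SAX callbacks."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def startElement(self, name, attrs):
        self.counts[name] += 1


def count_elements(xml_bytes, parser=None):
    # Any object implementing the SAX XMLReader interface will do here.
    # make_parser() simply picks the default one that is installed.
    parser = parser or xml.sax.make_parser()
    handler = ElementCounter()
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(xml_bytes))
    return dict(handler.counts)


print(count_elements(b"<a><b/><b/></a>"))
```

Switching to a different SAX parser means passing a different `parser` argument. The handler code that does the real work stays untouched.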

Another common way to attempt to lock you in to particular tool sets is to hide the XML behind an apparently simple API. This is particularly common in the web services world. As long as the API works, and as long as you stick with the same vendor, you may not notice a problem. However, if you try to migrate off that vendor's tools or introduce a new platform that the vendor doesn't support, you may find that it's not so easy to get your data out of the proprietary system. Here a lot depends on just how well the vendor has documented all its different XML formats. If it has, that's not such a big deal. If it hasn't, the migration process will be considerably more challenging. You might even discover that the vendor has chosen to encode key information in binary data, all wrapped up in a nice Base64 package, and hasn't told you anything about the format of that binary data. Before committing to an API, always inspect the formats it generates and the documentation for those formats. Make sure you're not limited to that one library for producing and consuming the data.
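
Part of that inspection can be automated. The sketch below, in Python, scans a document for elements whose entire text content looks like a Base64 payload. It is a heuristic under stated assumptions, not a definitive detector: the `opaque_payloads` function, the 40-character threshold, and the `order`, `id`, `note`, and `state` element names are all invented for the example.

```python
import base64
import binascii
import re
import xml.etree.ElementTree as ET

# Base64 alphabet, optional line breaks, optional trailing padding.
# Ordinary prose fails this test because it contains spaces.
BASE64_LIKE = re.compile(r"^[A-Za-z0-9+/\r\n]+={0,2}$")


def opaque_payloads(xml_text, min_length=40):
    """Return (tag, decoded byte count) for elements that look like Base64."""
    findings = []
    for element in ET.fromstring(xml_text).iter():
        text = (element.text or "").strip()
        if len(text) < min_length or not BASE64_LIKE.match(text):
            continue
        compact = "".join(text.split())  # drop any line wrapping
        try:
            decoded = base64.b64decode(compact, validate=True)
        except binascii.Error:
            continue  # not valid Base64 after all
        findings.append((element.tag, len(decoded)))
    return findings


# Sample data: one readable field and one opaque binary field.
blob = base64.b64encode(bytes(range(48))).decode("ascii")
sample = (
    "<order><id>1234</id>"
    "<note>Deliver to the rear entrance after five in the evening please</note>"
    f"<state>{blob}</state></order>"
)

print(opaque_payloads(sample))
```

Every element such a scan reports is a question to put to the vendor: what is the format of the decoded bytes, and where is it documented?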

XML is designed for openness. It is a text-based format because text is less opaque and more interoperable than binary formats. In practice XML data is far easier to reverse engineer, exchange, and export than almost any traditional data format. Needless to say, this is good news for users but bad news for companies that have built business models around locking data up into their proprietary formats. Some of these companies are trying to stuff the XML genie back into proprietary bottles. Fortunately, there are many more genuinely open tools for working with XML than closed ones. Be a little skeptical when evaluating vendor hype, especially if the vendor is promising to save you from all the evils of XML, and you should be fine.