16.2 Developing Record-Like XML Formats

Despite the mature status of most of XML's core technologies, XML application development is only now being recognized as a distinct discipline. Many architects and XML developers are attempting to apply existing design methodologies (like UML) and design patterns to the problem of constructing markup languages, but a widely accepted design process for creating XML applications still does not exist.

The term "XML application" is often used in XML contexts to describe an XML vocabulary for a particular domain rather than the software used to process it. This may seem a little strange to developers who are used to creating software applications, but it makes sense if you think about integrating a software application with an XML application, for instance.

XML applications can range in scope from a proprietary vocabulary used to store a single computer program's configuration settings to an industry-wide standard for storing consumer loan applications. Although the specifics and sometimes the sequence will vary, the basic steps involved in creating a new XML application are as follows:

Determine the requirements of the application.
Look for existing applications that might meet those requirements.
Choose a validation model.
Decide on a namespace structure.
Plan for expansion.
Consider the impact of the design on application developers.
Determine how old and new versions of the application will coexist.

The following sections explore each of these steps in greater depth.

16.2.1 Basic Application Requirements

The first step in designing a new XML application is like the first step in many design methodologies. Before the application can be designed, it is important to determine exactly what needs the application will fulfill. Some basic questions must be answered before proceeding.

16.2.1.1 Where and how will new documents be created?

Documents that will be created automatically by a software application or database server can be structured differently than those that need to be created by humans using an editor. While software wouldn't have a problem generating 100 elements with attributes that indicate cross-references, a human being probably would find those expectations frustrating.

If you have an application or a legacy format to which you're adding XML, you may already have data structures you need to map to the XML. Depending on the other requirements for the application, you may be able to base your XML format on the existing structures. If you're starting from scratch or need to share the information with other programs that don't share those structures, you probably need to look at the data itself and build the application creating the XML around the information.

16.2.1.2 How complex will the document be?

Obviously, the complexity of the data that will be modeled by the XML document has some impact on how the application will be designed. A document containing a few simple element types is much easier to describe than one that contains dozens of different elements with complex data type requirements. The complexity of an application will affect what type of validation should be used and how documents will be created and processed.

16.2.1.3 How will documents be consumed?

If the XML documents using this vocabulary will only pass between similar programs, it may make sense to model the XML documents directly on the internal structures of the programs without much concern for how easy or difficult that makes using the document for other programs or for humans. If there's a substantial chance that this information needs to be reused by other applications, read by humans (for debugging purposes or for direct access to information), or will be stored for unknown future use, it probably makes sense to ensure that the document is easy to read and process even if that makes creating the document a slightly more difficult task.

16.2.1.4 How widely will the resulting documents be distributed?

Generally, the audience for a new XML application is known in advance. Some documents are created and read by the same application without ever leaving a single system. Other documents will be used to transmit important business information between the IT systems of different organizations. Some documents are created for publication on the Web to be viewed by hundreds or even thousands of people around the world. XML formats that will be shared widely typically need comprehensive documentation readily available to potential users. Formal validation models may also be more important for documents that are shared outside of a small community of trusted participants.

16.2.1.5 Will others need to incorporate this document structure into their own applications?

Some XML applications are never intended to be shared and are only useful when incorporated into other XML applications. Others are useful standards on their own but are also suitable for inclusion in other applications. Here are a few different methods that might be used to incorporate markup from one application into another:

Simple inclusion: Markup from one application is included within a container element of another application. Embedding XHTML content in another document is a common example of this.
Mixed element inclusion: Markup from one application is mixed inline with content from another application. This can complicate validation and makes the including application sensitive to changes in the included application. The Global Document Annotation (GDA, http://www.oasis-open.org/cover/gda.html) Initiative provides an example of this type of application.
Mixed attribute inclusion: Some XML applications consist of attributes that may be attached to elements from the host application. XLink is a prime example of this type of application, defining only attributes that may be used in other vocabularies.
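As a rough sketch of two of these methods, the document below embeds XHTML inside a container element (simple inclusion) and attaches XLink attributes to a host element (mixed attribute inclusion). The catalog vocabulary, its namespace URI, and the link target are hypothetical and exist only for illustration; the XHTML and XLink namespace URIs are the standard ones.

```xml
<?xml version="1.0"?>
<!-- "catalog", "item", and "description" are hypothetical host elements -->
<catalog xmlns="http://example.com/ns/catalog"
         xmlns:xlink="http://www.w3.org/1999/xlink">
  <!-- Mixed attribute inclusion: XLink attributes on a host element -->
  <item xlink:type="simple" xlink:href="http://example.com/items/42">
    <!-- Simple inclusion: XHTML nested inside a container element -->
    <description>
      <div xmlns="http://www.w3.org/1999/xhtml">
        <p>A <em>sturdy</em> widget.</p>
      </div>
    </description>
  </item>
</catalog>
```

Because the XHTML is confined to the description container, the host vocabulary only has to allow foreign content at that one point.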

Answering these questions will provide a basic set of requirements to keep in mind when deciding whether to build a new application, acquire an existing application, or some combination of the two.

16.2.2 Investigating Available Options

Before committing to designing and implementing a new XML application, it is a good idea to take a few minutes to search the Internet for prior art. Since the first version of the XML Recommendation was released in 1998, thousands of new XML applications have been developed and released around the world. Although the quality and completeness of these applications vary greatly, it is often more efficient to start with an existing DTD or schema (however imperfect) than to start from scratch. In some cases, supporting software is already available, potentially saving software development work as well.

16.2.2.1 XML vocabulary development

It is also possible that the work your application needs to do may fit into an existing generic framework, such as XML-RPC or SOAP. If this is the case, you may or may not need to create your own XML vocabulary. XML-RPC only uses its own vocabulary, while different styles of SOAP may reduce the amount of work your vocabulary needs to perform.
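To give a sense of what "only uses its own vocabulary" means, here is a minimal sketch of an XML-RPC request. The element names come from the XML-RPC vocabulary; the method name and the parameter value are hypothetical.

```xml
<?xml version="1.0"?>
<!-- The method name "loans.getStatus" and its argument are hypothetical -->
<methodCall>
  <methodName>loans.getStatus</methodName>
  <params>
    <param>
      <value><int>1234</int></value>
    </param>
  </params>
</methodCall>
```

Every piece of application data is expressed through generic elements such as value and int, so no domain-specific markup needs to be designed.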

Beyond the average search engine, XML Cover Pages (http://xml.coverpages.org) provides information about a wide variety of XML-related vocabularies, software, and projects. The search for existing applications may also find potential collaborators, which is helpful if the XML format is intended for use across multiple organizations.

16.2.3 Planning for Growth

Some applications may not need to evolve over time, but some thought should be given to how users of the application will be able to extend it to meet their own needs. In DTD-based applications, this is done by providing parameter entity "hooks" into the document type definition, which could either be referenced or redefined by an instance document. Take the simple DTD shown in Example 16-8.

Example 16-8. extensible.dtd

 <!ENTITY % varContent "EMPTY">
 <!ELEMENT variable %varContent;>

This fragment is not a very interesting application by itself, but since it provides the capability for extension, the document author can make it more useful by providing an alternative entity declaration for the content of the variable element, as shown in Example 16-9.

Example 16-9. Document extending extensible.dtd

 <?xml version="1.0"?>
 <!DOCTYPE variable SYSTEM "extensible.dtd" [
   <!ENTITY % varContent "(#PCDATA)">
 ]>
 <variable>Useful content.</variable>

The W3C XML Schema language provides more comprehensive and controlled support for extending markup using the extension, include, redefine, and import elements. These mechanisms can be used in conjunction to create very powerful, customizable application frameworks.
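As one possible illustration of the extension element, the schema sketch below derives a richer type from a minimal base type. The type names, the value element, and the name and units attributes are assumptions made for this example, not part of any published schema.

```xml
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- Hypothetical base type: a variable that carries only a name -->
  <xs:complexType name="variableType">
    <xs:attribute name="name" type="xs:NCName" use="required"/>
  </xs:complexType>

  <!-- Derived type: keeps the name attribute, adds content and units -->
  <xs:complexType name="typedVariableType">
    <xs:complexContent>
      <xs:extension base="variableType">
        <xs:sequence>
          <xs:element name="value" type="xs:string"/>
        </xs:sequence>
        <xs:attribute name="units" type="xs:string"/>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>

  <xs:element name="variable" type="typedVariableType"/>
</xs:schema>
```

Unlike a redefined parameter entity, the derived type is explicitly tied to its base, so a processor can tell exactly what was added.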

16.2.4 Choosing a Validation Method

The first major implementation decision when designing a new XML application is what type of validation (if any) will be performed on instance documents. In many cases, prototyping a set of instance documents is the best way to determine what level of validation must be performed.

If your application is simply saving some internal program state between invocations (such as window positions or menu configurations within a GUI application), the structure is fixed by the program logic itself. Even though these configuration documents will always be written and read by the same program, writing a schema and validating documents on input can detect file corruption, not to mention bugs in the software itself. All too often we've watched our computers crash because various software (most often Microsoft Word) went down in flames when it encountered content in a document it had assumed could not possibly be present (most recently while working on Chapter 27 of this book). Validation may be a key defense against such attacks, intentional or otherwise.

Validation is even more important when XML documents are exchanged between different related systems that are not maintained by the same development organization. In this case, a DTD or schema can serve as a definitive blueprint to ensure that all systems are sending and receiving information in the expected formats. If something does go wrong and one process begins rejecting the other processes' inputs, validation can help assign the blame and the concomitant responsibility for fixing the problem.

The most rigorous type of validation is required when developing a new XML standard that will be implemented independently by many different vendors without any explicit control or restrictions. For example, the XHTML 1.1 standard is enforced by a very explicit and well-documented DTD that is hosted by the W3C. This well-known public DTD allows tool and application vendors to ensure that their systems will interoperate as long as instance documents conform to the standard.

After determining the level of validation for a particular application, it must be decided what validation language will be used. DTDs are still the most widely supported standard, although they lack the expressive power that is required by many record-like applications. The W3C XML Schema language provides very rich type and content model expression, but brings with it a commensurate level of complexity.
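The difference in expressive power shows up quickly with record-like data. In the sketch below, the loan, amount, and applied elements are hypothetical. A DTD could only declare the two leaf elements as character data, while the schema constrains one to a positive decimal with at most two fraction digits and the other to a date.

```xml
<?xml version="1.0"?>
<!-- Hypothetical vocabulary. The closest DTD declarations would be
     <!ELEMENT amount (#PCDATA)> and <!ELEMENT applied (#PCDATA)>,
     which accept any text at all. -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="loan">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="amount">
          <xs:simpleType>
            <xs:restriction base="xs:decimal">
              <xs:minExclusive value="0"/>
              <xs:fractionDigits value="2"/>
            </xs:restriction>
          </xs:simpleType>
        </xs:element>
        <xs:element name="applied" type="xs:date"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

The extra length of the schema relative to the two-line DTD equivalent is a fair picture of the added complexity.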

Developers can also provide both DTDs and XML Schemas for a given vocabulary, or even combine them with other vocabularies for describing XML structures, notably RELAX NG (http://www.oasis-open.org/committees/relax-ng/) and Schematron (http://www.ascc.net/xml/resource/schematron/schematron.html). Some organizations, particularly the W3C, are using RELAX NG as a base and generating DTDs and XML Schemas from the RELAX NG schemas. RDDL, described in Chapter 15, provides a set of tools for supporting and explaining such combinations for formats that use namespaces.
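For comparison, a RELAX NG schema in its XML syntax can be very compact. This sketch describes a variable element holding text; the optional name attribute is an assumption added for illustration.

```xml
<?xml version="1.0"?>
<!-- Hypothetical schema: a variable element with an optional name
     attribute and text content -->
<element name="variable" xmlns="http://relaxng.org/ns/structure/1.0">
  <optional>
    <attribute name="name">
      <text/>
    </attribute>
  </optional>
  <text/>
</element>
```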

16.2.5 Namespace Support

Virtually every XML application that will be shared with the public should include at least a basic level of namespace support. Even if there are no current plans to publish documents in a particular vocabulary to the outside world, it is much simpler to implement namespaces from the ground up than it is to retrofit an existing application with a namespace.

Namespaces affect everything from how the document is validated to how it is transformed (using a stylesheet language such as XSLT). Here are a few namespace issues to consider before selecting a URI and starting work.

16.2.5.1 Will instance documents need to be validated using a DTD?

If so, some planning of how namespace prefixes will be assigned and incorporated into the DTD is necessary. DTDs are not namespace aware, so strategic use of parameter entities can make modification of prefixes much simpler down the road.
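One common way to do this is to build every qualified name from a single prefix entity. The external DTD sketched below is hypothetical, including the file name, the element names, and the namespace URI. It must be an external DTD, because parameter entity references are not allowed inside markup declarations in the internal subset.

```xml
<!-- invoice.dtd: a hypothetical external DTD; all names are illustrative -->

<!-- The prefix is declared in exactly one place -->
<!ENTITY % inv.prefix "inv">

<!-- Qualified names are built from the prefix -->
<!ENTITY % inv.invoice.qname "%inv.prefix;:invoice">
<!ENTITY % inv.item.qname    "%inv.prefix;:item">

<!-- The namespace declaration attribute has to follow the prefix too -->
<!ENTITY % inv.xmlns.attrib
  "xmlns:%inv.prefix; CDATA #FIXED 'http://example.com/ns/invoice'">

<!ELEMENT %inv.invoice.qname; (%inv.item.qname;)+>
<!ATTLIST %inv.invoice.qname; %inv.xmlns.attrib;>
<!ELEMENT %inv.item.qname; (#PCDATA)>
```

A document that needs a different prefix can declare its own inv.prefix entity in its internal DTD subset. The internal subset is read first and the first declaration of an entity wins, so every name in the DTD changes with it.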

16.2.5.2 Will markup from this application need to be embedded in other applications?

If so, some thought needs to be given to potential name collisions. The safest approach is to force every element and possibly every attribute from your application to be explicitly qualified with a namespace. This can be done within an XML Schema by setting the elementFormDefault and attributeFormDefault attributes of the schema element to qualified. If you expect to be mixing the vocabulary only at the element level, you should probably leave your attributes unqualified.
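A schema set up for element-level mixing might look like the following sketch, with elements qualified and attributes left unqualified. The target namespace and the part, label, and sku names are hypothetical.

```xml
<?xml version="1.0"?>
<!-- Hypothetical vocabulary. A conforming instance would look like:
     <inv:part sku="A100"><inv:label>Widget</inv:label></inv:part> -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://example.com/ns/inventory"
           elementFormDefault="qualified"
           attributeFormDefault="unqualified">
  <xs:element name="part">
    <xs:complexType>
      <xs:sequence>
        <!-- Local element: qualified because of elementFormDefault -->
        <xs:element name="label" type="xs:string"/>
      </xs:sequence>
      <!-- Local attribute: stays in no namespace -->
      <xs:attribute name="sku" type="xs:string"/>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

Changing attributeFormDefault to qualified would require the instance to write the attribute with a prefix as well, which is the stricter option described above.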

16.2.5.3 Are there legacy XML document formats to support?

If an application will include existing XML documents, some thought should be given to the effort involved in migrating them. In many cases, where the document didn't use namespaces at all, simply adding a default namespace declaration will be sufficient to make the documents work with applications that depend on namespaces to distinguish among vocabularies. Once documents and document formats are out "in the wild," it's difficult to get people to change. It may be necessary to keep programs around that handle both the original format and the new format or to create transformations from the older format to the new format. These multiple levels of processing or transformation are maintenance problems over time, so it's generally worth encouraging users to switch to the new format, possibly turning off the old one at some point.
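For a legacy document that used no namespaces, the migration can be as small as one attribute on the root element, as in this sketch. The inventory vocabulary and the namespace URI are hypothetical.

```xml
<?xml version="1.0"?>
<!-- Before migration the root start-tag was simply <inventory>.
     The xmlns attribute places the root and every unprefixed
     descendant element in the namespace; attributes such as sku
     remain in no namespace. -->
<inventory xmlns="http://example.com/ns/inventory">
  <part sku="A100">
    <label>Left-handed widget</label>
  </part>
</inventory>
```

Software that matches elements by name alone may need updating, since namespace-aware processors now see each element as a different name than before.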

16.2.6 Maintaining Compatibility

Maintaining backward compatibility with existing documents and processing software is a primary concern for XML applications that are widely used by diverse audiences. Standards organizations face formidable difficulties when updating a popular application (such as HTML). While few applications will become as widespread as HTML, some thought should be given in advance to how new versions of a schema or DTD will interact with existing documents.

One possible, although problematic, approach to maintaining backward compatibility is to create a new, distinct namespace that will be used to mark new element declarations or perhaps to change the namespace of the entire document to reflect a substantially changed version. This has substantial costs, however, and generally makes sense only when the new functionality is itself a separate vocabulary. Working with documents that have parts written in different namespace-indicated versions is a tough problem for developers.

A better strategy is only to extend existing applications without removing prior functionality. In this case, it is a good idea to ensure that each instance document for an application has some readily identifiable marker that associates it with a particular version of a DTD or schema. The good news is that the highly transformable nature of XML makes it very easy to migrate old documents to new document formats.
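One simple marker is a version attribute on the root element, as in the sketch below. The inventory vocabulary, the attribute name, and the version numbers are assumptions chosen for illustration.

```xml
<?xml version="1.0"?>
<!-- Hypothetical vocabulary; the version attribute names the schema
     revision this document was written against -->
<inventory xmlns="http://example.com/ns/inventory" version="1.1">
  <part sku="A100">
    <label>Left-handed widget</label>
    <!-- Optional element added in 1.1; a 1.0 document simply omits it -->
    <warranty months="12"/>
  </part>
</inventory>
```

Because 1.1 only adds optional markup, every 1.0 document is still acceptable to a 1.1 processor, and the namespace does not need to change.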

Removing functionality is possible, but frequently difficult, once a format is widely used. Deprecating functionality (marking it as a likely target for removal a version or several before it is actually removed) is one approach. While deprecated features often linger in implementations long after they've been targeted for removal, they change the expectations of developers building new applications and make it possible, if slow, to remove functionality.