Devising an Architecture | Using XML with Legacy Business Applications

Several issues must be addressed when considering how to build XML support into your application. We've talked a bit in earlier chapters about choosing an appropriate API and processing paradigm. We'll talk a bit more about it later in this chapter, too. Beyond the considerations we've already mentioned throughout the book, there are two others we need to review.

Add-on or Major Redesign?

This is as much an assessment for scoping your project as it is a choice to make. You need to determine whether you are going to add incremental functionality and modules or whether you are going to do a major application redesign for XML. Depending on several factors, XML might be supported as just one more import/export format. Here are a few criteria to bear in mind when making the assessment.

What language is your application written in, and what is your development environment? We touched on this briefly in Chapter 1 and will discuss it more later, but here's a basic assessment. If your application is written in Java, C++, or Visual Basic, or if your development and runtime environment supports linking with modules written in those languages, you will have a relatively easy time integrating at the code level. If your application is written in a strictly procedural language like C, COBOL, or FORTRAN and you can't link with C++, you do have some code-level integration options, but they aren't necessarily going to be very cheap or pretty. You may want to rely on external utilities (such as those developed in this book) for XML support.
From a code-level perspective, how well does the current application support importing and exporting data? If the code is modular and well structured, supporting XML as an additional format may not require much redesign. However, if it's a real mess of spaghetti, you may want to redesign the code before trying to build in XML support.
From a functional perspective, what is the support for traditional, document-oriented electronic commerce? Many applications with good support for EDI integration offer users features for managing the information needed for trading relationships and by trading partners. (We'll discuss some specifics later.) It is a near certainty that many of the same features would be needed for electronic commerce using XML rather than EDI formats. If you offer these features now, an XML add-on may not require much work. If you don't, you are looking at a major project.

This brief list serves to introduce the topic. More issues and details will arise as we look at integration in greater detail.

If, How, and Where to Validate

This is a fundamental choice, and you'll benefit by making it as early as possible. We'll assume in this discussion that you're going to use a standard DOM or SAX API that offers validation against W3C schemas. A major choice you need to make is whether or not you are going to have the API validate documents against a schema. This is primarily of concern for documents that your application will consume, but it may also relate to documents that your application will produce. A related decision is the choice of schema, which we'll talk about shortly; you can pick a standard (if you can find an appropriate one) or you can create your own. If you do create your own, you have flexibility about how strictly you write it.
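To make the choice concrete, here is a minimal sketch in Java of a DOM parse with schema validation switched on or off through the standard JAXP validation classes. The class name, the inline schema, and the Order and ItemNumber Element names are hypothetical and exist only for illustration; a real application would load its schema from wherever it has decided to store it.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

final class ValidatingParseSketch {

    // Hypothetical schema: an Order must contain exactly one ItemNumber.
    static final String ORDER_XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='Order'><xs:complexType><xs:sequence>"
      + "<xs:element name='ItemNumber' type='xs:string'/>"
      + "</xs:sequence></xs:complexType></xs:element>"
      + "</xs:schema>";

    /** Returns true if the document parses (and, when validate is true, is schema valid). */
    static boolean parses(String xml, boolean validate) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            if (validate) {
                SchemaFactory schemaFactory =
                    SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
                Schema schema =
                    schemaFactory.newSchema(new StreamSource(new StringReader(ORDER_XSD)));
                factory.setSchema(schema); // the parser now validates while it builds the DOM
            }
            DocumentBuilder builder = factory.newDocumentBuilder();
            // Treat every validation error as a failure instead of letting it pass silently.
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { /* ignored in this sketch */ }
                public void error(SAXParseException e) throws SAXParseException { throw e; }
                public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (Exception e) {
            return false; // not well formed, not valid, or the schema could not be compiled
        }
    }
}
```

With validation on, an Order that lacks its ItemNumber is rejected; with validation off, the same document loads because it is well formed, and any structural checks are left to the application.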

Schema validation has its tradeoffs just like any other design decision. Here are a few.

Schema validation may incur performance penalties. We may be concerned about CPU and memory usage on a loaded system, delays in elapsed time, or both. Both can be considerations if you are validating against large schemas, especially if they are spread over several different files. Time delays due to network latency can be a factor if the schemas are standard schemas that must be retrieved from some organization's server on the Internet, though various schema caching mechanisms may ease this. The overall performance penalty may or may not be very great, but I know of at least a few organizations that don't use validation because of performance concerns. A compromise between always validating and never validating is to validate during testing and then turn validation off in production, once schema validation errors have been eliminated from the normally anticipated production data.
The choice to forgo validating can make writing code harder. Particularly if you are using the DOM, being able to make reliable assumptions about the structure of a document you are reading can make writing the code much easier. If you aren't able to assume that an Element will be present, you must test for empty or null results from methods such as getElementsByTagName or nextSibling before trying to retrieve an Attribute or child Elements from the Element. Users don't like null pointer exceptions. Also, in general terms the code can be trickier to write if you can't be sure that the structure of the document is what you expect. If you get a document that is well formed but has subtrees in the wrong places, what is your application going to do?
Schema validation is either on or off; there is no in-between. Some EDI management systems offer the ability to turn on or off the validation of coded values against a code list while still validating for required segments and data elements. It would be nice if the XML APIs offered strict or lax validation and various flavors of validation. However, the XML Schema Recommendation doesn't define levels of validation so the APIs don't either. Most of them stop cold at the first invalid component, no matter how seemingly trivial the noncompliance might be.
Schema validation removes the ability to correct errors from within the application (generally). Let me give you an example: Some systems that import sales orders received by EDI hold them in a review state, allowing a clerk to review them and correct errors or missing data before accepting them into the system. If your application has a similar feature and you validate in the schema for an invalid item number, the data may never reach such a review state. (One way around this limitation would be to load the document without validation and somehow mark it as invalid.)
Schema validation removes many validation burdens from application code. This is perhaps one of the greatest benefits to validating against schemas. You let the API validate that your business rules are satisfied, and you don't have to write code for your application to do validation.
There are limits to schema validation. Schema validation is very good for determining the presence or absence of data and for ensuring the data type and contents of individual items of data. However, by itself it won't allow you to enforce constraints based on conditional relationships between different data items. Specifically, it won't flag one Element or Attribute as being invalid based on the contents of a different Element or Attribute. Some types of processing are rife with these types of requirements. Let's take an example in health care. A health care claim might have a coded field for the type of medical procedure performed. Some types of procedures might require information in other Elements that isn't relevant to other procedures. A tooth extraction would require that the tooth be identified, whereas fitting a bite guard would not. Some readers may be thinking that this is stupid; problems like this could easily be avoided by modeling the data correctly. You're probably right, but the fact of the matter is that data is not always modeled to prevent the need for this type of validation. It is a very real concern for some applications.
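The health care example in the last item can be sketched in application code. The following Java fragment is only an illustration under stated assumptions: the Claim, ProcedureType, and ToothNumber Element names and the EXTRACTION code value are hypothetical, and the document is loaded without schema validation. It also shows the kind of defensive test for a missing Element that becomes necessary when you can't rely on the schema.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class ClaimRuleSketch {

    /** Returns the trimmed text of the first descendant with this tag name, or null if absent. */
    static String childText(Element parent, String tagName) {
        NodeList matches = parent.getElementsByTagName(tagName);
        if (matches.getLength() == 0) {
            return null; // the Element is not in the document at all
        }
        return matches.item(0).getTextContent().trim();
    }

    /** Cross-field rule: a tooth extraction must identify the tooth; other procedures need not. */
    static boolean satisfiesToothRule(String claimXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(claimXml)));
            Element claim = doc.getDocumentElement();
            String procedure = childText(claim, "ProcedureType");
            String tooth = childText(claim, "ToothNumber");
            if ("EXTRACTION".equals(procedure)) {
                return tooth != null && !tooth.isEmpty();
            }
            return true; // the rule does not apply to this procedure
        } catch (Exception e) {
            return false; // a document that is not well formed cannot satisfy the rule
        }
    }
}
```

A check like this would typically run after the document has been loaded, whether or not the schema has already confirmed the presence and data type of the individual Elements.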

Beyond these considerations, if you do validate you must determine where your schemas are going to be stored. This decision in itself may be relevant to whether or not you validate. For example, if your application is a desktop bookkeeping system that will process standard purchase orders received from customers, you may not want to enforce validation if a significant number of your users rely on dial-up Internet connections. You can't validate if you can't get to the schema. If your bookkeeping system requires that imported orders be validated, an order that can't be validated can't be processed. If you validate against a remote schema, how reliable is the server that hosts the schema? If you provide your own schemas, like any other application component you must decide on a directory location. You must also take steps to ensure that schemas won't be modified or inadvertently deleted by users.
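If you ship your own schemas, the application should at least notice when one has gone missing or has been damaged. Here is a minimal Java sketch, assuming the schemas live in a local directory that the application knows about; the class name, the directory, and the file name are hypothetical.

```java
import java.io.File;
import javax.xml.XMLConstants;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

final class LocalSchemaSketch {

    /**
     * Compiles a schema stored in a local directory.
     * Returns null if the file is missing, unreadable, or not a usable schema,
     * so the caller can decide whether to refuse the import or continue unvalidated.
     */
    static Schema load(File schemaDir, String fileName) {
        File schemaFile = new File(schemaDir, fileName);
        if (!schemaFile.isFile() || !schemaFile.canRead()) {
            return null; // deleted, moved, or protected from this user
        }
        try {
            return SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                                .newSchema(schemaFile);
        } catch (SAXException e) {
            return null; // the file exists but has been modified into something invalid
        }
    }
}
```

Whether a null result means "refuse the order" or "import without validating" is exactly the architectural decision discussed above; the sketch only surfaces the condition.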

I have one final word on schema validation. Just because a document is schema valid doesn't mean in and of itself that the document is what you expect. The default behavior of most APIs is to validate against the schema referenced in the document's root Element. That schema could be the one you coded for, or it could be something completely unrelated. You could be processing a valid shipment notice when you were expecting a valid invoice. So, to properly use schema validation in your architecture, you'll need to do things beyond just validating against a schema. The first step is to check the root Element name. We routinely did this in several of the book's routines. The next step is to retrieve and check the document namespace and the schema location Attributes from the document's root Element. There is no way you can be absolutely sure of the document you're processing unless you can match the schema location URI to a known schema. If the document uses a named namespace, verifying that it matches what you expect provides another level of assurance. (Add that to your list of reasons for using namespaces.) I haven't verified namespace names in the book's utilities, but it is something you should consider doing in your applications.
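These checks can be sketched with the DOM in Java as follows. This is an illustration rather than code from the book's utilities: the class and method names are hypothetical, and it assumes a document in a named namespace that uses the xsi:schemaLocation Attribute (a document without a namespace would use xsi:noNamespaceSchemaLocation instead, which this sketch does not handle).

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

final class DocumentIdentitySketch {

    static final String XSI_NS = "http://www.w3.org/2001/XMLSchema-instance";

    /** Checks root Element name, namespace, and schema location against expected values. */
    static boolean isExpected(String xml, String rootName, String namespace, String schemaUri) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for getLocalName and getNamespaceURI
            Document doc = factory.newDocumentBuilder()
                                  .parse(new InputSource(new StringReader(xml)));
            Element root = doc.getDocumentElement();

            if (!rootName.equals(root.getLocalName())) {
                return false; // e.g. a shipment notice where an invoice was expected
            }
            if (!namespace.equals(root.getNamespaceURI())) {
                return false;
            }
            // xsi:schemaLocation holds whitespace-separated pairs: namespace URI, schema URI.
            String[] tokens =
                root.getAttributeNS(XSI_NS, "schemaLocation").trim().split("\\s+");
            for (int i = 0; i + 1 < tokens.length; i += 2) {
                if (tokens[i].equals(namespace) && tokens[i + 1].equals(schemaUri)) {
                    return true;
                }
            }
            return false; // no schema location, or not the schema we coded for
        } catch (Exception e) {
            return false; // not well formed
        }
    }
}
```

A caller would run this check before, or in addition to, validation, passing in the root name, namespace, and schema URI that the importing routine was written for.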