Is There Data or Not? | Using XML with Legacy Business Applications

It is customary to specify that certain data must be present in a document before it can successfully be imported into a business application. Many people coming from an EDI background are accustomed to having their EDI systems verify that essential information like purchase order numbers and item IDs is present in a message. They look to schemas and schema validating parsers to help them enforce these types of constraints. However, things in this area don't quite work as many people expect. I can best show this with an example.

Let's consider a simple invoice. Our customer requires us to send them back their original purchase order number when we send them an invoice electronically. They have set their PurchaseOrderNumber Element to be mandatory, using a minOccurs Attribute with a value of "1" in the xs:complexType sequence where it lives. It is a string field since their purchase order numbers have a mix of alphabetic and numeric characters. Which of the following PurchaseOrderNumber Elements do you think will pass their invoice schema's validation constraints, and which do you think will fail?

<PurchaseOrderNumber>AZ-39AAY</PurchaseOrderNumber>
<PurchaseOrderNumber> </PurchaseOrderNumber>
<PurchaseOrderNumber></PurchaseOrderNumber>
<PurchaseOrderNumber/>

If you understood the schema constraint as saying that a purchase order number must be present, you would think that cases 2, 3, and 4 (counting the Elements above from the top) would fail. If you were from an EDI background, you would probably think that cases 3 and 4 would fail and that case 2 might not even be legal XML. However, you would be wrong. If you said that all four cases would pass, you win the bonus prize!

Especially in the EDI world, saying that a data element is present is synonymous with saying that there is data in the data element. This is not the case with XML, at least not for string Elements. The empty Elements in cases 3 and 4 satisfy the minOccurs of one constraint. Case 2 may not be so obvious. The string data type allows any Unicode character to be present in the Element, including spaces. So, a single space is valid. In EDI a single space in a data element is not valid.
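
To see why all four cases satisfy the occurrence constraint, it can help to look at what a parser reports for each one. The following is a minimal sketch in Python using the standard library's ElementTree parser, which is not a schema validating parser. The Invoice wrapper Element and the function names are assumptions made for illustration. The sketch only counts how many PurchaseOrderNumber Elements are present, which is all that minOccurs examines, and reports the character content separately.

```python
import xml.etree.ElementTree as ET

# The four candidate Elements from the text, in order (cases 1 to 4).
CASES = [
    "<PurchaseOrderNumber>AZ-39AAY</PurchaseOrderNumber>",
    "<PurchaseOrderNumber> </PurchaseOrderNumber>",
    "<PurchaseOrderNumber></PurchaseOrderNumber>",
    "<PurchaseOrderNumber/>",
]


def occurrences(fragment):
    """Wrap the fragment in a hypothetical Invoice parent and count the
    PurchaseOrderNumber children. This mimics only the occurrence count."""
    invoice = ET.fromstring("<Invoice>" + fragment + "</Invoice>")
    return len(invoice.findall("PurchaseOrderNumber"))


def content(fragment):
    """Return the Element's character content ('' when there is none)."""
    return ET.fromstring(fragment).text or ""


for number, case in enumerate(CASES, start=1):
    print(number, occurrences(case), repr(content(case)))
```

Every case reports exactly one occurrence, even though only the first has usable content. The occurrence count and the content are separate questions, and minOccurs answers only the first.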

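There are two ways to get around this problem. The easiest is to use a type derived from the token data type, with a minLength facet of 1, instead of string. Token on its own is not enough: a schema processor collapses whitespace in a token before validating it, removing leading and trailing whitespace and reducing any run of whitespace between other characters to a single space, so an empty Element or a lone space is still a valid token. With the minLength facet added, an Element that is empty or contains only whitespace ends up with a length of zero and fails. However, if you would rather not have whitespace normalized like that, for example because someone might use two or more spaces between a first and last name, you have to fall back to the other method. Create a new simple type, derived by restriction from string, with a pattern facet of at least one character that isn't a space. Then, use that wherever you want to have at least one nonspace character. Here's an example of how this might be done.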

Example of a Type Requiring at Least One Nonwhitespace Character in nonSpaceString.xsd

<xs:simpleType name="nonSpaceString">
  <xs:annotation>
    <xs:documentation>
      This is a type, derived from string by restriction, that
      requires at least one nonwhitespace character to be present.
    </xs:documentation>
  </xs:annotation>
  <xs:restriction base="xs:string">
    <xs:pattern value=".*\S.*"/>
  </xs:restriction>
</xs:simpleType>

There are many ways to write the regular expressions used in pattern facets. In this particular example I describe a string that starts and ends with zero or more characters other than the new line or carriage return. This is indicated by the .* notation. The \S notation, bracketed in the middle, indicates a single character that is not a whitespace character. (The \s notation is a "multicharacter escape" that indicates a character from the class of whitespace characters. This class, or set, is composed of the space character, tab, new line, and carriage return. Capitalizing it to \S indicates any character that is not in that class.)
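
The effect of this pattern can be tried outside a schema processor. The sketch below uses Python's re module as a stand-in, which is an assumption rather than an exact equivalent. XML Schema patterns always have to match the whole value, so fullmatch is used. Python's . and \S also differ from the schema definitions for a few characters (carriage return and some Unicode whitespace) that don't matter for these samples. The function name is hypothetical.

```python
import re

# Same expression as the pattern facet in nonSpaceString.
NON_SPACE = re.compile(r".*\S.*")


def satisfies_non_space_string(value):
    """Approximate the pattern facet: the whole value must match, and it
    must contain at least one nonwhitespace character somewhere."""
    return NON_SPACE.fullmatch(value) is not None


for sample in ["AZ-39AAY", "John  Smith", " ", " \t ", ""]:
    print(repr(sample), satisfies_non_space_string(sample))
```

The purchase order number and the name with two spaces are accepted, while the single space, the mixed whitespace, and the empty string are rejected.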

I don't know whether people just aren't aware of this problem, are concerned about the processing overhead of pattern validation, or just don't care. But I have yet to see any schemas that enforce this type of constraint.

It is important to note that this problem occurs only with the string data type. The other commonly used built-in data types don't have this problem (at least not with most parser APIs!).
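
As an illustration of why a type such as integer doesn't have this problem, consider what has to happen before an integer value is accepted. The sketch below approximates in Python what a schema processor does for xs:integer: it collapses whitespace and then checks the result against the lexical form of an optional sign followed by one or more digits. The function names are hypothetical, and a real schema validating parser does this work for you.

```python
import re

# Lexical form of an integer: optional sign, then one or more digits.
INTEGER_LEXICAL = re.compile(r"[+-]?[0-9]+")


def collapse(value):
    """Approximate the 'collapse' whitespace rule: runs of space, tab,
    new line, and carriage return become one space, then the ends are
    trimmed."""
    return re.sub(r"[ \t\n\r]+", " ", value).strip(" ")


def is_valid_integer(content):
    """True when the collapsed content has the lexical form of an integer."""
    return INTEGER_LEXICAL.fullmatch(collapse(content)) is not None


for sample in ["42", " 42 ", " ", ""]:
    print(repr(sample), is_valid_integer(sample))
```

An empty Element or one holding only a space leaves nothing that looks like a number, so it fails without any extra facets. String has no such lexical requirement, which is why it needs the additional restriction.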

So, with any luck you should now be able to read and understand most schemas you will encounter. The next chapter will show programmers how to code to use schemas. In doing that we'll add validation capability to the two utilities we developed in Chapters 2 and 3.