Describing XML Documents

Built-In Datatypes

One of the more useful features of XML Schema is that it defines a core set of datatypes. These include basic programming types such as string, int, float, and double; mathematical types such as integer and decimal; and XML types such as NMTOKEN and IDREF.

One of the most significant advantages of the XML Schema type system is that it is completely platform independent. Values of types are consistently represented no matter what hardware, operating system, or XML processing software is used. The XML Schema type system allows XML-based protocols such as SOAP to achieve strong interoperability in heterogeneous computing environments.

Datatypes are useful for defining schemas that describe the type of data that must be contained within a document. I will heavily leverage datatypes when I describe how to create schemas later in this chapter. Another way that you can leverage the XML Schema type system is by annotating an XML document with the type of data it contains. This helps remove ambiguity about the intentions of the document's creator.

In the previous section, I created a SOAP message to submit a PurchaseItem request. Here, I will use the built-in datatypes to indicate the types of the parameters being passed:

<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <soap:Body>
    <PurchaseItem>
      <item xsi:type="xsd:string">Apple</item>
      <quantity xsi:type="xsd:int">1</quantity>
    </PurchaseItem>
  </soap:Body>
</soap:Envelope>

I added an xsi:type attribute to each parameter element. The value of the attribute represents the datatype of the encoded parameter. Decorating elements within an XML document with type information removes all ambiguity regarding the type of data the sender encoded into the message. The recipient of the preceding message will know that quantity is represented as an int and item is represented as a string.
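As a rough illustration of how a recipient might use these annotations, here is a minimal sketch in Python rather than a .NET language. The CONVERTERS table and the typed_parameters function are hypothetical names introduced for this example. The sketch assumes the type prefix can be ignored, whereas a real processor would resolve the prefix against the namespace declarations in scope.

```python
import xml.etree.ElementTree as ET

SOAP_BODY = "{http://schemas.xmlsoap.org/soap/envelope/}Body"
XSI_TYPE = "{http://www.w3.org/2001/XMLSchema-instance}type"

# Hypothetical mapping from XML Schema type names to Python converters.
CONVERTERS = {"string": str, "int": int}

MESSAGE = """<soap:Envelope
    xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <soap:Body>
    <PurchaseItem>
      <item xsi:type="xsd:string">Apple</item>
      <quantity xsi:type="xsd:int">1</quantity>
    </PurchaseItem>
  </soap:Body>
</soap:Envelope>"""


def typed_parameters(message: str) -> dict:
    """Convert each parameter according to its xsi:type annotation."""
    root = ET.fromstring(message)
    request = root.find(SOAP_BODY)[0]  # first child of the SOAP body
    values = {}
    for param in request:
        # Simplification: keep only the local part of the type name.
        qname = param.get(XSI_TYPE, "xsd:string")
        local_name = qname.split(":")[-1]
        values[param.tag] = CONVERTERS[local_name](param.text)
    return values


print(typed_parameters(MESSAGE))  # {'item': 'Apple', 'quantity': 1}
```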

The SOAP specification defines a polymorphic accessor as an element whose type is determined at run time. The polymorphic accessor is conceptually similar to the Object type in Visual Basic .NET. If a polymorphic accessor appears within a SOAP message, it must contain an xsi:type attribute indicating the type of data the element contains.

The Appendix provides a full list of XML Schema built-in datatypes. Note that SOAP 1.1 was defined against a working draft of the XML Schema specification published in 1999. Some of the built-in types have undergone name changes through the course of becoming a recommendation. Even though the SOAP 1.1 specification and its related schemas reference a previous version of the XML Schema, most SOAP implementations including ASP.NET and Remoting reference the built-in datatypes defined by the current XML Schema specification.

In the remainder of this section, I cover a few of the more interesting datatypes.

Integers

The XML Schema language defines a number of types that are used to describe integers. The two types that often get confused are integer and int. The int type is actually a derivative of the integer type with additional restrictions.

Even though both int and integer represent an integer value, they serve different purposes. Elements and attributes of type integer contain a value that meets the mathematical definition of an integer. Values of type integer are unbounded and therefore might cause an overflow if copied into a fixed-width variable or a CPU register. Because elements and attributes of type int are restricted to a signed 32-bit integer (-2147483648 through 2147483647), they are better suited to values that will be stored in programming-language variables.
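The difference can be sketched in a few lines of Python. The function names parse_xsd_integer and parse_xsd_int are hypothetical. Python's int() is also more lenient about lexical form than XML Schema is, so treat this as an illustration of the value ranges only.

```python
XSD_INT_MIN = -2**31
XSD_INT_MAX = 2**31 - 1


def parse_xsd_integer(text: str) -> int:
    """integer: any whole number, with no upper or lower bound."""
    return int(text.strip())


def parse_xsd_int(text: str) -> int:
    """int: a whole number that must fit in a signed 32-bit range."""
    value = int(text.strip())
    if not XSD_INT_MIN <= value <= XSD_INT_MAX:
        raise ValueError(f"{value} does not fit in a 32-bit int")
    return value


# A 20-digit value is a valid integer but is rejected as an int.
big = "12345678901234567890"
print(parse_xsd_integer(big))
try:
    parse_xsd_int(big)
except ValueError as error:
    print("rejected:", error)
```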

The same distinction holds true for decimal numbers. Instances of the float type conform to the Institute of Electrical and Electronics Engineers (IEEE) single-precision floating point type. On the other hand, the decimal type represents an arbitrary precision decimal number.
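To make the float and decimal contrast concrete, the following Python sketch rounds a value through the IEEE single-precision format and compares it with an arbitrary-precision decimal. The helper name to_single_precision is an assumption made for this example.

```python
import struct
from decimal import Decimal


def to_single_precision(value: float) -> float:
    """Round a Python float (double precision) to IEEE single precision."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


# 0.1 has no exact binary representation, so the float type stores
# only the nearest single-precision approximation.
print(to_single_precision(0.1))

# The decimal type keeps the value exactly as written.
print(Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))  # True
```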

Strings

XML Schema defines a string datatype, but it is not identical to the string type in many database or programming languages. In particular, string types in many database and programming languages allow characters that are forbidden to appear in the XML Schema string datatype.

Characters that might invalidate or alter the meaning of the XML document cannot appear literally within an element or attribute of type string. For example, the less-than sign and the ampersand carry special meaning as markup delimiters and cannot appear as literal character data. Quotation marks and apostrophes cannot appear literally within an attribute value that is delimited by the same character. These characters must be escaped or encoded in some fashion before they can be serialized into an XML document.

XML defines a means of encoding individual characters within an XML document by using character references. A character reference consists of an ampersand followed by a character identifier and then a semicolon. The character identifier can be either the numeric identifier of a Unicode character or a character entity reference.

A numeric character reference is used to identify a specific character in the Unicode (ISO/IEC 10646) character set. The character identifier is either the decimal or hexadecimal value of the character, prefixed by a pound sign. For example, the character A can be encoded as &#65; or &#x41;.
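A short Python sketch shows both forms being produced and decoded. The function numeric_reference is a hypothetical helper. The decoding step uses html.unescape from the standard library, which is written for HTML but handles numeric character references the same way.

```python
import html


def numeric_reference(char: str, hexadecimal: bool = False) -> str:
    """Build the decimal or hexadecimal character reference for one character."""
    code_point = ord(char)
    if hexadecimal:
        return f"&#x{code_point:X};"
    return f"&#{code_point};"


print(numeric_reference("A"))                    # &#65;
print(numeric_reference("A", hexadecimal=True))  # &#x41;
print(html.unescape("&#65; and &#x41;"))         # A and A
```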

The XML specification provides character entity references, which are more readable character identifiers for a small subset of Unicode characters. Even though the HTML 4 specification defines hundreds of character entity references, the XML specification defines only five, for characters that interfere with well-formed XML, as shown in Table 4-1.

Table 4-1  XML Character Entity References

Character    Numeric Character Reference    Character Entity Reference
"            &#34; or &#x22;                &quot;
'            &#39; or &#x27;                &apos;
&            &#38; or &#x26;                &amp;
<            &#60; or &#x3C;                &lt;
>            &#62; or &#x3E;                &gt;
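The five entity references can be applied programmatically. This Python sketch uses xml.sax.saxutils, whose escape function handles the ampersand, less-than, and greater-than characters by default. The two quote characters are supplied through an extra mapping. The names QUOTE_ENTITIES, escape_all, and unescape_all are assumptions made for this example.

```python
from xml.sax.saxutils import escape, unescape

# escape() covers &, <, and > on its own; the quotes are added here.
QUOTE_ENTITIES = {'"': "&quot;", "'": "&apos;"}
REVERSE_QUOTES = {"&quot;": '"', "&apos;": "'"}


def escape_all(text: str) -> str:
    """Replace all five reserved characters with entity references."""
    return escape(text, QUOTE_ENTITIES)


def unescape_all(text: str) -> str:
    """Reverse escape_all."""
    return unescape(text, REVERSE_QUOTES)


original = "\"a\" < 'b' & c > d"
encoded = escape_all(original)
print(encoded)  # &quot;a&quot; &lt; &apos;b&apos; &amp; c &gt; d
print(unescape_all(encoded) == original)  # True
```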

Double quotes and apostrophes have character entity references defined for them because under certain conditions they are not allowed within the value of an attribute. If the value of an attribute is surrounded by double quotes, a double quote cannot appear within the value of the attribute. The same holds true for apostrophes. For example, the following elements contain illegal characters:

<e a="Scott says, "This is illegal."">
<e a='Don't do this, either.'>

The following elements are valid:

<e a='Scott says, "This is perfectly fine."'>
<e a="This isn't a problem, either.">
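Choosing the delimiter can be automated. As a sketch, Python's xml.sax.saxutils.quoteattr picks whichever quote character does not occur in the value, and falls back to &quot; when both occur. The build_element helper is a hypothetical name used only for this example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr


def build_element(value: str) -> str:
    """Serialize an empty <e> element whose attribute a holds value."""
    return f"<e a={quoteattr(value)}/>"


samples = [
    'Scott says, "This is perfectly fine."',
    "This isn't a problem, either.",
    'Both "kinds" aren\'t a problem, either.',
]
for sample in samples:
    markup = build_element(sample)
    print(markup)
    # Parsing the result returns the original attribute value.
    assert ET.fromstring(markup).get("a") == sample
```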

Yet another way to embed strings with reserved characters within an XML document is within CDATA sections. XML defines the sequence of characters <![CDATA[ to tell the XML processor to ignore special characters until ]]> is encountered. Here is an example:

<myString><![CDATA[I can now use all five reserved characters. (", ', &, <,  and >)]]></myString>

When you serialize string variables into an XML document, be sure to encode special characters using character references or escape the string within a CDATA section.
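One detail worth illustrating is that a CDATA section cannot itself contain the sequence ]]>. The following Python sketch wraps a string in CDATA and splits the section wherever that sequence occurs. The cdata function is a hypothetical helper, not part of any library.

```python
import xml.etree.ElementTree as ET


def cdata(text: str) -> str:
    """Wrap text in a CDATA section, splitting around any ']]>' sequence."""
    # ']]>' would end the section early, so close the section after ']]'
    # and reopen a new one that starts with '>'.
    safe = text.replace("]]>", "]]]]><![CDATA[>")
    return "<![CDATA[" + safe + "]]>"


value = "I can now use all five reserved characters. (\", ', &, <, and >)"
document = f"<myString>{cdata(value)}</myString>"
print(document)

# The parser hands back the original string unchanged.
assert ET.fromstring(document).text == value
```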

Binary Data

Binary data must be encoded before being inserted into an XML document to ensure that it does not introduce any characters that might invalidate the XML. The XML Schema specification defines two built-in datatypes for binary data, base64Binary and hexBinary.

Type hexBinary encodes each binary octet into its two-character hexadecimal equivalent. For example, the binary value of 11111111 would be encoded as FF, ff, Ff, or fF.
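In Python, the same encoding can be sketched with the built-in bytes methods. The function names are assumptions made for this example. Note that bytes.fromhex is slightly more lenient than the hexBinary type because it also skips whitespace between byte pairs.

```python
def encode_hex_binary(data: bytes) -> str:
    """Encode each octet as two uppercase hexadecimal characters."""
    return data.hex().upper()


def decode_hex_binary(text: str) -> bytes:
    """Decode hexadecimal text; upper and lower case are both accepted."""
    return bytes.fromhex(text)


print(encode_hex_binary(bytes([0b11111111])))  # FF
for spelling in ("FF", "ff", "Ff", "fF"):
    # All four spellings decode to the same single octet.
    assert decode_hex_binary(spelling) == b"\xff"
```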

The .NET platform provides support for encoding and decoding binhex. You can use the XmlTextReader.ReadBinHex method to decode binhex to binary data and the XmlTextWriter.WriteBinHex method to encode binary data to binhex.

It is far more common to see binary data of type base64Binary. This is especially true with Web services because the SOAP 1.1 specification recommends that all binary data embedded in a message be encoded using the base64 algorithm defined by RFC 2045.

Elements and attributes of type base64Binary contain data that are encoded using the Base64 encoding algorithm described in RFC 2045. As Table 4-2 shows, each 6-bit chunk of an array of binary octets is encoded into an XML-compatible character.

Table 4-2  The Base64 Alphabet

Binary   Base64    Binary   Base64    Binary   Base64    Binary   Base64
000000   A         010000   Q         100000   g         110000   w
000001   B         010001   R         100001   h         110001   x
000010   C         010010   S         100010   i         110010   y
000011   D         010011   T         100011   j         110011   z
000100   E         010100   U         100100   k         110100   0
000101   F         010101   V         100101   l         110101   1
000110   G         010110   W         100110   m         110110   2
000111   H         010111   X         100111   n         110111   3
001000   I         011000   Y         101000   o         111000   4
001001   J         011001   Z         101001   p         111001   5
001010   K         011010   a         101010   q         111010   6
001011   L         011011   b         101011   r         111011   7
001100   M         011100   c         101100   s         111100   8
001101   N         011101   d         101101   t         111101   9
001110   O         011110   e         101110   u         111110   +
001111   P         011111   f         101111   v         111111   /

Base64 also defines a 65th character for padding purposes. One or more = signs can appear at the end of the encoded string. If the binary object fits neatly into 6-bit chunks, no padding characters are applied to the end of the Base64 string. All other conditions require zeros added to the end of the binary object. An = character is appended to the end of the encoded string for every two zeros added to the binary object. Because a binary object is composed of a series of bytes (8 bits), there are three possible scenarios, including the one just mentioned:

  • A single byte remains to be encoded. In this case, four zeros are appended, the two resulting 6-bit chunks are encoded, and two = characters are appended to the end of the encoded string.

  • Two bytes remain to be encoded. In this case, two zeros are appended, the three resulting 6-bit chunks are encoded, and a single = character is appended to the end of the encoded string.

  • Three bytes remain to be encoded. In this case, the remaining bytes can be evenly divided into four 6-bit chunks, so no = character is appended to the end of the encoded string.
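The three scenarios can be traced with a small hand-written encoder. This Python sketch is meant only to mirror the steps described above. In practice you would call the standard base64 module, which the sketch uses as a cross-check. The names ALPHABET and encode_base64 are assumptions made for this example.

```python
import base64
import string

# The 64-character alphabet: A-Z, a-z, 0-9, then + and /.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def encode_base64(data: bytes) -> str:
    """Encode bytes three at a time into 6-bit chunks, padding with '='."""
    encoded = []
    for start in range(0, len(data), 3):
        group = data[start:start + 3]
        bits = "".join(f"{byte:08b}" for byte in group)
        # Append zeros until the bit string divides evenly into 6-bit chunks:
        # four zeros for one leftover byte, two zeros for two leftover bytes.
        zeros_added = (-len(bits)) % 6
        bits += "0" * zeros_added
        for offset in range(0, len(bits), 6):
            encoded.append(ALPHABET[int(bits[offset:offset + 6], 2)])
        # One '=' for every two zeros that were appended.
        encoded.append("=" * (zeros_added // 2))
    return "".join(encoded)


for sample in (b"M", b"Ma", b"Man"):
    result = encode_base64(sample)
    assert result == base64.b64encode(sample).decode("ascii")
    print(sample, "->", result)
```

The three samples end in two, one, and zero padding characters respectively, matching the three cases in the list.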

As with binhex, XmlTextWriter and XmlTextReader provide the WriteBase64 and ReadBase64 methods for encoding and decoding Base64 data. In addition, the .NET platform takes care of properly encoding and decoding binary data for Web services built on top of the ASP.NET and Remoting frameworks. You saw an example of this in Chapter 2, where I created an ASP.NET Web service that sent and received binary files.



Building XML Web Services for the Microsoft .NET Platform
ISBN: 0735614067
EAN: 2147483647
Year: 2002
Pages: 94
Authors: Scott Short
