5.2 XML Definitions | Programming Web Services with Perl

This chapter introduces the structure and mechanics of SOAP and SOAP messages. In practice, the majority of the work behind creating SOAP messages and disassembling them on the receiving end is done using a toolkit such as SOAP::Lite, which will be introduced in Chapter 6. Understanding the form and function of SOAP messages helps you understand the functionality of the toolkit components. This chapter is by no means a complete overview of SOAP. The number (and length) of books devoted to the topic of SOAP is a testimony to the depth of the subject.

Because SOAP is by its very nature a more complex protocol than XML-RPC, it should be no surprise that the XML it uses is correspondingly more complex. One of the key differences between parsing SOAP messages and parsing XML-RPC messages is the need for XML namespace support. SOAP not only uses namespaces for their original purpose of mixing elements from different document types, it also uses them to distinguish between different versions of the specification.

Another important factor in processing a typical SOAP message is that the specification requires that the message contain no DTD declaration and no processing instructions. This may limit the ability to apply other technologies, such as XSLT (Extensible Stylesheet Language Transformations), to either the request or response messages. The current draft proposals for SOAP 1.2 allow processing instructions to be present but mandate that receivers ignore them. This allows other processing, such as XSLT, without burdening the servers.

Despite these minor differences, a SOAP message is still at its core just another XML document. As such, it is self-describing and (in most cases) very readable by the average viewer. Thoughtful and consistent labeling of the namespaces can also add to the readability.

5.2.1 The Basic Message Structure

In the simplest of terms, a SOAP message is made up of the following parts:

The containing Envelope tag, with proper namespace qualification and declaration.
An optional Header tag, which provides additional information directly to the server attempting to handle the message.
The Body tag, which carries the encoded content of the message, from the details of the request being made to any data that accompanies it.

Example 5-1 is a simple message with Envelope, Header, and Body.

Example 5-1. A basic SOAP message

    <SOAP-ENV:Envelope xmlns:SOAP-ENV=
        "http://schemas.xmlsoap.org/soap/envelope/">
      <SOAP-ENV:Header>
        <auth:Authorize xmlns:auth=
            "http://auth.xmlsoap.org/auth/">
          <auth:type>user</auth:type>
          <auth:name>soapclient</auth:name>
          <auth:passwd>s0m3*th!ng</auth:passwd>
        </auth:Authorize>
      </SOAP-ENV:Header>
      <SOAP-ENV:Body>
        <call:getQuote xmlns:call=
            "http://bigtrade.com/soap/service/">
          ...detail of function call and data
        </call:getQuote>
      </SOAP-ENV:Body>
    </SOAP-ENV:Envelope>

The message starts with the required Envelope tag. The tag is qualified as being in a namespace that is referenced by the URI http://schemas.xmlsoap.org/soap/envelope/ . This particular namespace is what identifies the message as being SOAP 1.1, so the declaration does double-duty, both declaring the prefix and identifying the SOAP version.
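The double duty of the namespace declaration can be illustrated with a short sketch. It is written in Python, using only the standard library, and is not part of any toolkit: the function name `soap_version` and the version labels are assumptions made for illustration, while the two URIs are the envelope namespaces listed in Table 5-1.

```python
import xml.etree.ElementTree as ET

# Envelope namespace URIs as listed in Table 5-1 (labels are illustrative).
SOAP_VERSIONS = {
    "http://schemas.xmlsoap.org/soap/envelope/": "1.1",
    "http://www.w3.org/2002/06/soap-envelope": "1.2 (working draft)",
}

def soap_version(message):
    """Return the SOAP version implied by the Envelope's namespace."""
    root = ET.fromstring(message)
    # ElementTree spells a qualified tag as "{namespace-uri}localname".
    if not root.tag.startswith("{"):
        raise ValueError("root element is not namespace-qualified")
    uri, local = root.tag[1:].split("}", 1)
    if local != "Envelope":
        raise ValueError("root element is not an Envelope")
    if uri not in SOAP_VERSIONS:
        raise ValueError("unrecognized envelope namespace: " + uri)
    return SOAP_VERSIONS[uri]

message = """<SOAP-ENV:Envelope
    xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/">
  <SOAP-ENV:Body/>
</SOAP-ENV:Envelope>"""

print(soap_version(message))
```

Note that the prefix itself plays no part in the decision. A message that labels the same URI `env` instead of `SOAP-ENV` is identified the same way.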

The Header element is an optional part of the message. In this case it illustrates how a client might pass authentication credentials to the server. In a more complete example, there would likely be attributes in the Authorize tag that control how the server would respond if it did not understand what to do with the tag (discussed in more detail in Section 5.2.3). Here, it is enough to see how the child tags are required to be namespace-qualified as well, to set them apart from the SOAP-ENV space. Note that the child tags of Authorize use the same namespace as their parent. (We made up the namespace for the Authorize tag, despite the xmlsoap.org domain, so don't try to use it in your own programs!)

Because HTTP messages aren't encrypted, in practice you wouldn't send a password in the clear like this. The easiest way to secure it is to use a transport layer of HTTP over Secure Sockets Layer (SSL). This is the https protocol used to implement secure web sites that handle credit card numbers and other data that shouldn't be sent unencrypted over the wire.

Following the Header is the mandatory Body element. As with the header part of the message, the immediate child tag within Body is expected to have a separate namespace from that of the SOAP envelope. Of course, the authorization and the function call could have used the same namespace, but the getQuote tag would still have had to declare it. At this point, the earlier declaration made for Authorize has gone out of scope.
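The scoping rule is easy to check with any namespace-aware parser. The following sketch, in Python with only the standard library, reuses the made-up `auth` namespace from Example 5-1. The helper name `is_well_formed` and the trimmed-down envelope are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Report whether a namespace-aware parser accepts the text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# The auth prefix is declared on Authorize, so its scope ends with that tag.
ENVELOPE = """<SOAP-ENV:Envelope
    xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/">
  <SOAP-ENV:Header>
    <auth:Authorize xmlns:auth="http://auth.xmlsoap.org/auth/"/>
  </SOAP-ENV:Header>
  <SOAP-ENV:Body>
    %s
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>"""

# Reusing the prefix in the body without declaring it again is an error.
out_of_scope = ENVELOPE % "<auth:getQuote/>"
# Declaring it again on the body's child tag makes the message acceptable.
redeclared = ENVELOPE % (
    '<auth:getQuote xmlns:auth="http://auth.xmlsoap.org/auth/"/>')

print(is_well_formed(out_of_scope))
print(is_well_formed(redeclared))
```

Had the prefix been declared on the Envelope tag instead, it would have been in scope for both the header and the body.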

The tag in the body of the message is the beginning of a common SOAP example, the stock trading-price quote. ^[1] The details of the call itself and the data within will be explained when we get to the Body element.

^[1] It may be taken on very good authority that this is the only reference to stock-quote applications to be found in this entire book.

Because SOAP messages can become lengthy and complex, many of the examples through the rest of this book will be expressed as fragments, rather than whole and complete messages.

5.2.2 The Envelope Tag: Declaring Namespaces

In Example 5-1, the Envelope tag was used only to declare the basic SOAP namespace. In fact, a tag may be used to declare as many namespaces as the application needs to define at a given point, and the start of the message is often the place chosen. Some frequently used namespaces that appear in SOAP messages are listed in Table 5-1. The table also shows the labels that are consistently used to identify them throughout the book.

Table 5-1. The most common namespaces

Namespace URI	Label
http://schemas.xmlsoap.org/soap/envelope/	`SOAP-ENV` (1.1)
http://www.w3.org/2002/06/soap-envelope	`SOAP-ENV` or `env` (1.2)
http://schemas.xmlsoap.org/soap/encoding/	`SOAP-ENC` (1.1)
http://www.w3.org/2002/06/soap-encoding	`SOAP-ENC` or `enc` (1.2)
http://www.w3.org/2001/XMLSchema	`xs` (sometimes `xsd` )
http://www.w3.org/2001/XMLSchema-instance	`xsi`

The SOAP-ENV and SOAP-ENC labels occur multiple times in the table because the namespace URI is also used to identify which version of the SOAP specification has been used for the message. The working drafts for SOAP 1.2 use the labels env and enc in place of SOAP-ENV and SOAP-ENC. The older labels are used by some toolkits and languages, regardless of the SOAP version. Within this and the chapters that follow, the newer and shorter labels will be used for SOAP 1.2 examples, while the longer labels will be retained for examples that are SOAP 1.1. The last two lines in the table correspond to XML Schema namespaces that SOAP messages use heavily in expressing data.

The SOAP-ENV label is familiar; it was used in Example 5-1. But what exactly is the SOAP-ENC label used to annotate? This label is used in the SOAP specification to define the namespace for the data encoding SOAP itself provides for convenience. Messages may often be able to express all data using these encodings, without having to use any XML Schema. Some messages may use multiple encodings. It isn't unusual to see a tag like that in Example 5-2, where several namespaces are declared at once.

Example 5-2. Envelope tag with several declarations

    <SOAP-ENV:Envelope
        xmlns:SOAP-ENV=
            "http://schemas.xmlsoap.org/soap/envelope/"
        SOAP-ENV:encodingStyle=
            "http://schemas.xmlsoap.org/soap/encoding/"
        xmlns:xsd="http://www.w3.org/2001/XMLSchema"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
      ...rest of message
    </SOAP-ENV:Envelope>

Besides specifying a namespace for the encoding rules, an envelope may define a default encoding style that should be assumed over all elements in the message that aren't explicitly assigned a different style. Later, other attributes will be discussed in the context of the header and body of the message.

5.2.3 The Header Tag: Routing and More

The Header tag is the only optional part of the basic SOAP message. Despite this, it is present more often than not. The header section of a message may include authentication credentials (as shown in Example 5-1), information about other servers the message will visit, transaction management, and so forth. Another way of looking at this is that headers provide metadata for the message, adding more detail and information.

Each tag that is a child element of the Header tag is considered to be a single block, treated independently of all others. Each block must be qualified by a namespace. There are no child elements defined for Header within the SOAP specification itself; thus, all blocks use tags imported from a different specification. You may see attributes from the SOAP envelope namespace on these tags because XML permits the mixing of attributes within a tag by way of namespace qualification, just as it does with the tags themselves.

Example 5-1 showed a single block within the Header element, a block that provided authentication credentials for the caller. Example 5-3 uses the same Header but adds some additional blocks.

Example 5-3. A header containing several blocks

    <SOAP-ENV:Header>
      <auth:Authorize xmlns:auth=
          "http://auth.xmlsoap.org/auth/">
        <type>user</type>
        <name>soapclient</name>
        <passwd>s0m3*th!ng</passwd>
      </auth:Authorize>
      <p:priority xmlns:p="http://tasks.xmlsoap.org/priority"
          SOAP-ENV:mustUnderstand="0">
        <value xsi:type="xs:int">19</value>
      </p:priority>
      <tr:transaction xmlns:tr="soap-transaction"
          SOAP-ENV:mustUnderstand="1">
        <transactionID>141421</transactionID>
      </tr:transaction>
    </SOAP-ENV:Header>

Taking the tags somewhat literally, it can be assumed that the second block acts to specify that the task may be run at a particular priority. It refers to a different namespace, and thus a completely separate specification from SOAP itself. Following the "priority" block is a third block that specifies a transaction identifier for the message. Its namespace is a relative URI computed against the namespace URI of the parent (Header) tag.

Both of the new blocks also introduce a new piece of the SOAP picture. They each bear an attribute called mustUnderstand, which is qualified into the namespace that SOAP-ENV represents. The SOAP envelope provides two global attributes that may be used within blocks in the header: mustUnderstand and actor. In simple terms, actor associates blocks from the header with a specific step (or node, as will be defined soon) in the lifecycle of the message, while mustUnderstand defines whether the node that would handle a given block must do so. SOAP also provides the attribute introduced earlier, encodingStyle, that can choose a different encoding to express the actual information contained in those blocks.
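To make the idea of header blocks and their global attributes concrete, here is a minimal sketch in Python (standard library only). It walks the immediate children of a SOAP 1.1 Header and reports each block's namespace and envelope-qualified attributes. The function name `describe_blocks` and the dictionary layout are assumptions made for this illustration; the header is a trimmed copy of the last two blocks of Example 5-3.

```python
import xml.etree.ElementTree as ET

ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1, Table 5-1

def describe_blocks(header):
    """Summarize each block (immediate child) of a Header element."""
    summary = []
    for block in header:
        # Every block must carry a namespace of its own.
        if not block.tag.startswith("{"):
            raise ValueError("block %r is not namespace-qualified" % block.tag)
        namespace, name = block.tag[1:].split("}", 1)
        summary.append({
            "name": name,
            "namespace": namespace,
            # Both attributes live in the envelope namespace, not the block's.
            "mustUnderstand": block.get("{%s}mustUnderstand" % ENV),
            "actor": block.get("{%s}actor" % ENV),
        })
    return summary

HEADER = """<SOAP-ENV:Header
    xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/">
  <p:priority xmlns:p="http://tasks.xmlsoap.org/priority"
      SOAP-ENV:mustUnderstand="0">
    <value>19</value>
  </p:priority>
  <tr:transaction xmlns:tr="soap-transaction"
      SOAP-ENV:mustUnderstand="1">
    <transactionID>141421</transactionID>
  </tr:transaction>
</SOAP-ENV:Header>"""

blocks = describe_blocks(ET.fromstring(HEADER))
for block in blocks:
    print(block["name"], block["mustUnderstand"], block["actor"])
```

An absent attribute is reported as `None`, which leaves the choice of default to the caller.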

5.2.3.1 The encodingStyle attribute

Looking back to Table 5-1, there are predefined namespaces for the basic encoding rules the SOAP specification provides. These are generally referenced in the Envelope tag, at the same point that the main namespaces are being declared. An encoding style may also be specified at the header-block level. This provides a different default encoding from that given in the Envelope tag and extends to the closing tag that matches the declaring tag.
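The scoping of encodingStyle can be sketched as a walk from an element up through its ancestors, stopping at the first one that carries the attribute. The Python below uses only the standard library. The function name `effective_encoding_style` and the URI `http://example.org/other-encoding` are hypothetical, chosen only to show an override at the block level.

```python
import xml.etree.ElementTree as ET

ENV = "http://schemas.xmlsoap.org/soap/envelope/"
STYLE = "{%s}encodingStyle" % ENV

def effective_encoding_style(root, element):
    """Return the encodingStyle in force for `element`, or None."""
    # ElementTree keeps no parent pointers, so build a child-to-parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    node = element
    while node is not None:
        style = node.get(STYLE)
        if style is not None:
            return style          # nearest declaration wins
        node = parents.get(node)  # None once the root has been passed
    return None

MESSAGE = """<SOAP-ENV:Envelope
    xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/"
    SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/">
  <SOAP-ENV:Header>
    <p:priority xmlns:p="http://tasks.xmlsoap.org/priority"
        SOAP-ENV:encodingStyle="http://example.org/other-encoding">
      <value>19</value>
    </p:priority>
    <tr:transaction xmlns:tr="soap-transaction">
      <transactionID>141421</transactionID>
    </tr:transaction>
  </SOAP-ENV:Header>
  <SOAP-ENV:Body/>
</SOAP-ENV:Envelope>"""

root = ET.fromstring(MESSAGE)
value = root.find(".//{http://tasks.xmlsoap.org/priority}priority/value")
txn_id = root.find(".//{soap-transaction}transaction/transactionID")
print(effective_encoding_style(root, value))
print(effective_encoding_style(root, txn_id))
```

The value inside the priority block picks up the block's own style, while the transaction identifier falls back to the default given on the Envelope tag.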

Data encoding, and the differences in encoding styles themselves, are covered in greater detail later within this chapter.

5.2.3.2 Actors, roles, nodes, and responsibility

Before looking at the actor (called a role in SOAP 1.2) and mustUnderstand settings, it may help to understand what actors and nodes are, in the context of a SOAP message.

A SOAP transaction isn't limited to just one receiver, or endpoint. The more traditional model of the client/server system has the server as the single destination of a message, with the response that the client receives coming completely from that server. In the SOAP model, a message may pass through several hands before reaching a final destination. Each intermediate step may act on and alter some part of the message before passing control to the next. The response that the initiating client eventually receives may be a composite of response data from any number of the intermediaries involved.

Each step on such a path may be thought of as a node. There are always at least two nodes, each of which has exactly one role: it is either the initial sender of a request (the client) or it is the ultimate receiver of a request and responsible for generating the response (the server). In some cases, though, a node may be responsible for doing both. A request that it gets may only be making one of several stops. In these cases, the node must process those parts it has responsibility for, then send the message along to the next node in the chain.

SOAP manages these routes through the definition of actors and responsibility. An actor is merely a role played by a node along the trip. An actor is said to have responsibility for a block if the block refers to the current actor, or if the block refers to the anonymous actor, and the acting node is the ultimate receiver of the message. Actor roles are specified through attributes within the block's container element.

5.2.3.3 The actor/role attributes

There are two specifically defined actors in SOAP 1.1, with a third one added to the specification with SOAP 1.2. The actor attribute is specified in the opening tag of a given block and must be qualified into the envelope namespace (as appropriate for the SOAP version). If there is no actor attribute in a block's tag, that block is said to be the responsibility of the anonymous actor. This actor represents the ultimate receiver of the message, regardless of the number of intermediate nodes involved.

In SOAP 1.2, the actor attribute was renamed as role. The names "actor" and "role" should be considered identical in the rest of the text.

The second of the common actor roles is the current actor. This refers to the currently active node. The actor attribute is set to a particular URI, slightly different between Versions 1.1 and 1.2, which tells the node that it has responsibility for the given block. An actor will see this value of the attribute only if it was specifically set by the client from which the message was received. However many nodes there are in the complete message lifespan, as soon as one of those nodes receives a header block with the actor set to this URI, that node will have to either process it, ignore it, or report an error. Whichever of these may be the case, if the message proceeds forward to a new node, the block is removed from the header.

With SOAP 1.2, a third predefined role was added to the specification: that of the null actor. If a sender of a message wishes to convey some information in the header that isn't intended to be explicitly processed at any point, it may signal this by defining an actor attribute with the appropriate URI. This prevents the final recipient of the message from wrongly trying to process the information while acting in the role of anonymous actor. In fact, the SOAP 1.2 specification makes it clear that a node must not process any block that specifies this actor role. There is no corresponding functionality in SOAP 1.1. However, this should only be needed when there is a chance that the data block might be mistaken for a block to be processed. If there is no such concern, the null actor specification is probably not needed.

Table 5-2 lists these predefined actor roles and the URIs that identify them in SOAP 1.1 and 1.2:

Table 5-2. The URIs that define actor roles

Actor role	Identifier
Current actor	http://schemas.xmlsoap.org/soap/actor/next (1.1); http://www.w3.org/2002/06/soap-envelope/role/next (1.2)
Null actor	http://www.w3.org/2002/06/soap-envelope/role/none (1.2; not defined for SOAP 1.1)
Some other actor	Any other valid URI

The third row is there to illustrate that the current actor, null actor, and anonymous actor are simply those explicitly defined by SOAP. The application (and application developer) is free to use this attribute in other ways. One common function is to explicitly provide the URI for the next node in the chain. In situations in which the locations are already known, the actor may simply be a unique identification of the next service that should receive the message. There is no requirement that the URI provided as the value of an actor attribute actually refers to a valid, reachable web server. The interpretation of these actor URI values is left to the implementation of the applications and servers themselves.
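The rules for responsibility can be condensed into a small decision function. The following Python sketch is an illustration rather than any toolkit's API: the name `is_responsible` and its parameters are assumptions, and the SOAP 1.2 role URIs are assumed to be formed by extending the draft envelope namespace listed in Table 5-1.

```python
# Predefined actor/role URIs (SOAP 1.2 values assume the draft namespace
# from Table 5-1).
NEXT_ROLES = {
    "http://schemas.xmlsoap.org/soap/actor/next",         # SOAP 1.1
    "http://www.w3.org/2002/06/soap-envelope/role/next",  # SOAP 1.2 draft
}
NONE_ROLE = "http://www.w3.org/2002/06/soap-envelope/role/none"

def is_responsible(block_actor, node_roles, is_ultimate_receiver):
    """Decide whether a node should process a header block.

    block_actor: the block's actor/role attribute, or None if absent.
    node_roles: set of application-defined actor URIs this node answers to.
    is_ultimate_receiver: True if the message stops at this node.
    """
    if block_actor is None:
        # Anonymous actor: the block belongs to the ultimate receiver.
        return is_ultimate_receiver
    if block_actor == NONE_ROLE:
        # Null actor: no node may process the block.
        return False
    if block_actor in NEXT_ROLES:
        # Current actor: whichever node has just received the message.
        return True
    # Any other URI is interpreted by the application itself.
    return block_actor in node_roles

signer = {"http://soap.ssl.org/soap/checkSignature"}
print(is_responsible(None, signer, False))
print(is_responsible("http://soap.ssl.org/soap/checkSignature", signer, False))
```

The null-actor test comes before the application-defined lookup so that a node cannot claim such a block even if its own configuration lists that URI by mistake.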

5.2.3.4 The mustUnderstand attribute

The mustUnderstand attribute is used to tell the actor currently processing the message whether the block bearing this attribute is something the application must handle. The values allowed for the attribute in SOAP 1.1 are either 0 or 1. In Version 1.2, those values are acceptable as well, and in addition the value may also be an instance of the boolean type from the XML Schema namespace (generally tagged with xs or xsd; refer to Table 5-1 earlier).

The interpretation of the value is simple: when the value is true (either the explicit boolean true, or 1), the block must be understood by the responsible party. Combining this with the actor attribute prevents other nodes from having to understand all elements. Referring back to Example 5-3, the priority block is specified as not being mandatory. If the application understands the block and acts on the information it provides, it's acceptable. But if it can't, the block can be safely disregarded. Conversely, the transaction block is marked as mandatory. If the server application can't act on it, it has to signal a failure to the requesting client. In Example 5-3, all the blocks are the responsibility of the same actor, the ultimate recipient of the message. If the transaction block also had an actor attribute present, only that actor would have to honor the mustUnderstand attribute.

When this attribute isn't present at all, it is the same as having it present with a value of 0 (or false). In Example 5-3, the authorization block doesn't have a mustUnderstand, so the default is used. Because of this, if the application doesn't understand the block, it's free to ignore it.
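The interpretation of the attribute value, including the default, fits in a few lines. This Python sketch is illustrative only: the function name `must_understand` and the choice to raise an error on unexpected values are assumptions, not behavior taken from any toolkit.

```python
def must_understand(value, soap_version="1.1"):
    """Interpret a mustUnderstand attribute value.

    value: the attribute's text, or None when the attribute is absent.
    soap_version: "1.1" or "1.2"; only 1.2 accepts the boolean words.
    """
    if value is None:
        return False              # absent means 0 (false)
    text = value.strip()
    if text in ("0", "1"):
        return text == "1"        # accepted by both versions
    if soap_version == "1.2" and text in ("true", "false"):
        return text == "true"     # XML Schema boolean, SOAP 1.2 only
    raise ValueError("invalid mustUnderstand value: %r" % value)

print(must_understand(None))           # the Authorize block in Example 5-3
print(must_understand("0"))            # the priority block
print(must_understand("1"))            # the transaction block
print(must_understand("true", "1.2"))  # the style used in SOAP 1.2
```

Whether a node then has to act on a mandatory block still depends on the actor attribute, which decides if the block is that node's responsibility in the first place.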

5.2.3.5 Attribute placement and example

The global attributes are recognized only when associated with the outer tag of a block; that is, the immediate children of the Header tag. Both versions of the specification allow for the attributes to be present in deeper tags but instruct that the application must ignore them in such cases. Because of this, all the blocks in Table 5-3 are equivalent.

Table 5-3. Header blocks using mustUnderstand

Block	Reasoning
<env:Header> <tr:transaction xmlns:tr="soap-transaction"> <transactionID>12217</transactionID> </tr:transaction> </env:Header>	The specification says that when there is no `mustUnderstand` attribute given, the server must treat it as though it were explicitly set to the false value.
<env:Header> <tr:transaction xmlns:tr="soap-transaction" env:mustUnderstand="0"> <transactionID>12217</transactionID> </tr:transaction> </env:Header>	The attribute is present, in a tag that is an immediate child of the `Header` tag. However, it's set to the value that both SOAP 1.1 and 1.2 (the case here, since `env` is used) consider `false`.
<env:Header> <tr:transaction xmlns:tr="soap-transaction"> <transactionID env:mustUnderstand="1">12217</transactionID> </tr:transaction> </env:Header>	The attribute is both specified and set to one of the `true` values that this version uses, but it isn't in an element that is an immediate child of `Header`. Thus it is ignored by the application.

With this ability, a client can, for example, specify a priority without the danger that a server might reject the request out-of-hand because of an inability to act on that part of the header.
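The placement rule can be demonstrated by looking for the attribute only on the immediate children of Header. The Python sketch below uses the standard library and the SOAP 1.2 draft envelope namespace from Table 5-1. The function name `mandatory_blocks` is an assumption for illustration, and the two headers are modeled on the rows of Table 5-3, with the second carrying a true value on the block itself for contrast.

```python
import xml.etree.ElementTree as ET

ENV = "http://www.w3.org/2002/06/soap-envelope"  # SOAP 1.2 draft, Table 5-1
MUST_UNDERSTAND = "{%s}mustUnderstand" % ENV

def mandatory_blocks(header):
    """Return the tags of header blocks flagged as mandatory."""
    flagged = []
    for block in header:  # immediate children only; deeper tags are skipped
        if block.get(MUST_UNDERSTAND) in ("1", "true"):
            flagged.append(block.tag)
    return flagged

# Attribute on a nested tag: ignored, as in the last row of Table 5-3.
NESTED = """<env:Header xmlns:env="http://www.w3.org/2002/06/soap-envelope">
  <tr:transaction xmlns:tr="soap-transaction">
    <transactionID env:mustUnderstand="1">12217</transactionID>
  </tr:transaction>
</env:Header>"""

# Attribute on the block itself: honored.
ON_BLOCK = """<env:Header xmlns:env="http://www.w3.org/2002/06/soap-envelope">
  <tr:transaction xmlns:tr="soap-transaction" env:mustUnderstand="1">
    <transactionID>12217</transactionID>
  </tr:transaction>
</env:Header>"""

print(mandatory_blocks(ET.fromstring(NESTED)))
print(mandatory_blocks(ET.fromstring(ON_BLOCK)))
```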

Example 5-4 shows a larger header with many content blocks to tie all this material together. The header shown here will be traced through several stages of processing by multiple nodes.

Example 5-4. A complex header block, illustrating the attributes

    <env:Header>
      <t:data xmlns:t="null" env:actor=
          "http://www.w3.org/2002/06/soap-envelope/role/none">
        <value id="timestamp">1014457345</value>
      </t:data>
      <crm:userUpdate xmlns:crm="x-schema:crm_structure.xml"
          env:mustUnderstand="true"
          env:actor=
          "http://www.w3.org/2002/06/soap-envelope/role/next">
        <crm:userIdNum><value>501</value></crm:userIdNum>
      </crm:userUpdate>
      <cert:checkSignature xmlns:cert="http://cert.ssl.org/cert"
          env:mustUnderstand="true"
          env:actor="http://soap.ssl.org/soap/checkSignature">
        <cert:signatureType>seeded-gpg</cert:signatureType>
        <cert:signatureSeed>
            <value href="#timestamp" />
        </cert:signatureSeed>
        <cert:pkKeySerial>128-4095-1014458368</cert:pkKeySerial>
        <cert:md5>823dd310c0d0842b2b5e1b4d28822db9</cert:md5>
      </cert:checkSignature>
      <crm:enactUpdate xmlns:crm="x-schema:crm_structure.xml"
          env:mustUnderstand="true"
          env:actor="http://soap.blackperl.com/crm/enactUpdate">
        <crm:userId>
          <value href="#userID" />
        </crm:userId>
        <crm:updateTokens>
          <value href="#data_in_body" />
        </crm:updateTokens>
      </crm:enactUpdate>
    </env:Header>

This header contains four subblocks within it:

A <data> tag that refers to the null actor, thus ensuring that none of the nodes will attempt to directly process it.
A <userUpdate> tag that starts the processing chain due to the fact that it is marked with the current actor URI, as specified in SOAP 1.2. It is given a namespace, crm, that is expressed as an XML Schema Language file, which it is assumed the relevant nodes have access to.
A <checkSignature> block in the cert namespace of a fictional certification service at the ssl.org domain. Of all the blocks, this one contains the most in terms of child tags, which implies that processing it is probably not meant to involve the message body.
Lastly, another tag from the crm namespace, this time <enactUpdate>. It too has some data tags within it, each of which refers to an identifier. The identifiers may be in the body, or they may be added by other steps along the way.

When the header is received, the node, filling the role of current actor, is responsible for the userUpdate block. For this example, assume that this node is expected to either verify that a valid user ID is present, or to retrieve one based on some information provided. Here, the numerical user ID is present as a <value> tag within userUpdate. This node must turn this data into something useful to the ultimate recipient prior to passing along the message to the next node in the chain. The next block is meant to prove the message's validity through a digital signature. The current actor must eventually pass the message off to this next service at the URL given in that block's actor attribute. In doing this, the acting node creates what is essentially a new message, with the header shown in the following code. Note that in this new header, the block that has already been processed is now removed, and a new block has been added to the end of the chain. Additionally, the actor attribute for the next block has been set so that the receiving node will know that it is responsible for handling that block. Once all this is done, the intermediary is then responsible for routing this new message, based on the URL from the original actor attribute. The changed elements are the actor value on checkSignature and the data block appended at the end:

    <env:Header>
      <t:data xmlns:t="null" env:actor=
          "http://www.w3.org/2002/06/soap-envelope/role/none">
        <value id="timestamp">1014457345</value>
      </t:data>
      <cert:checkSignature xmlns:cert="http://cert.ssl.org/cert"
          env:mustUnderstand="true"
          env:actor=
          "http://www.w3.org/2002/06/soap-envelope/role/next">
        <cert:signatureType>seeded-gpg</cert:signatureType>
        <cert:signatureSeed>
            <value href="#timestamp" />
        </cert:signatureSeed>
        <cert:pkKeySerial>128-4095-1014458368</cert:pkKeySerial>
        <cert:md5>823dd310c0d0842b2b5e1b4d28822db9</cert:md5>
      </cert:checkSignature>
      <crm:enactUpdate xmlns:crm="x-schema:crm_structure.xml"
          env:mustUnderstand="true"
          env:actor="http://soap.blackperl.com/crm/enactUpdate">
        <crm:userId>
          <value href="#userID" />
        </crm:userId>
        <crm:updateTokens>
          <value href="#data_in_body" />
        </crm:updateTokens>
      </crm:enactUpdate>
      <t:data xmlns:t="null" env:actor=
          "http://www.w3.org/2002/06/soap-envelope/role/none">
        <value id="userID">rjray</value>
      </t:data>
    </env:Header>

The assumption here is that the process of checking the digital signature will immediately respond with an error notification (called a fault within SOAP, and discussed later in this chapter) back to the sender if the signature doesn't pass validation. Note that this block uses the data in the first header block, the one tagged with the null actor. Taking it on faith that the signature is valid, the key-service node creates a new message with the third-generation header as shown here:

    <env:Header>
      <t:data xmlns:t="null" env:actor=
          "http://www.w3.org/2002/06/soap-envelope/role/none">
        <value id="timestamp">1014457345</value>
      </t:data>
      <crm:enactUpdate xmlns:crm="x-schema:crm_structure.xml"
          env:mustUnderstand="true"
          env:actor=
          "http://www.w3.org/2002/06/soap-envelope/role/next">
        <crm:userId>
          <value href="#userID" />
        </crm:userId>
        <crm:updateTokens>
          <value href="#data_in_body" />
        </crm:updateTokens>
      </crm:enactUpdate>
      <t:data xmlns:t="null" env:actor=
          "http://www.w3.org/2002/06/soap-envelope/role/none">
        <value id="userID">rjray</value>
      </t:data>
      <cert:data xmlns:cert="http://cert.ssl.org/cert">
        <cert:messageString>signature: good</cert:messageString>
        <cert:messageCode>200</cert:messageCode>
      </cert:data>
    </env:Header>

With this final version of the header, it can be assumed that the node handling it is going to do the actual requested operation, if it can. It may be guessed that the functionality the enactUpdate tag represents required the signature verification from a trusted service, such as the one used here.

Throughout this example, all the changes that were enacted by intermediaries took place within the header itself. In fact, the various nodes may add blocks to either the header or the body.
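The bookkeeping each intermediary performs on the header can be sketched as three steps: drop the blocks it has handled, retarget the next service's block to the current actor URI, and append whatever data it produced. The Python below (standard library only) is one possible reading of that procedure, not a prescribed algorithm. The function name `forward_header` is hypothetical, the header is a stripped-down version of Example 5-4, and the role URIs are assumed to extend the draft envelope namespace from Table 5-1.

```python
import xml.etree.ElementTree as ET

ENV = "http://www.w3.org/2002/06/soap-envelope"  # SOAP 1.2 draft, Table 5-1
ACTOR = "{%s}actor" % ENV
NEXT = ENV + "/role/next"
NONE = ENV + "/role/none"

def forward_header(header, result_block):
    """Rewrite `header` in place for the next hop; return that hop's URL."""
    # 1. Remove the blocks this node was responsible for.
    for block in list(header):
        if block.get(ACTOR) == NEXT:
            header.remove(block)
    # 2. The first remaining block addressed to a named service identifies
    #    the next hop; mark it so that the receiver knows to handle it.
    next_url = None
    for block in header:
        actor = block.get(ACTOR)
        if actor not in (None, NONE):
            next_url = actor
            block.set(ACTOR, NEXT)
            break
    # 3. Append the data this node produced for later nodes to use.
    header.append(result_block)
    return next_url

HEADER = """<env:Header xmlns:env="http://www.w3.org/2002/06/soap-envelope">
  <t:data xmlns:t="null"
      env:actor="http://www.w3.org/2002/06/soap-envelope/role/none"/>
  <crm:userUpdate xmlns:crm="x-schema:crm_structure.xml"
      env:actor="http://www.w3.org/2002/06/soap-envelope/role/next"/>
  <cert:checkSignature xmlns:cert="http://cert.ssl.org/cert"
      env:actor="http://soap.ssl.org/soap/checkSignature"/>
  <crm:enactUpdate xmlns:crm="x-schema:crm_structure.xml"
      env:actor="http://soap.blackperl.com/crm/enactUpdate"/>
</env:Header>"""

header = ET.fromstring(HEADER)
result = ET.Element("{null}data", {ACTOR: NONE})
ET.SubElement(result, "value", {"id": "userID"}).text = "rjray"

next_url = forward_header(header, result)
print(next_url)
print([block.tag.split("}", 1)[1] for block in header])
```

Running the same function again at the signature service would remove checkSignature and retarget enactUpdate, which mirrors the step from the second header to the third.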

5.2.4 The Body Tag: Anatomy of a Message

The chained examples just shown managed to illustrate a complex, multinode process that made little use of the message's body. According to the specification, the Body tag is still a required element. It also marks the end of the message because the Envelope tag is expected to close after the Body tag does.

There is a subtle but significant distinction between SOAP Versions 1.1 and 1.2. In SOAP 1.1, it wasn't explicitly stated that the Body tag was the end of the envelope. As such, it silently permitted additional tags following Body . This loophole is closed in SOAP 1.2, which explicitly states that no other tags may follow the body. A SOAP 1.2 application that plans to handle SOAP 1.1 messages would have to watch for this.

As with the Header tag, the Body tag is required to be qualified with the same namespace as the Envelope tag. The opening tag of a message body doesn't provide any attributes itself. There is no purpose for any of the common SOAP attributes at this level. All child elements of the body may utilize the encodingStyle attribute. However, neither actor nor mustUnderstand has any relevance to the child blocks of a message body. In essence, all parts of the body are given over to the same ultimate recipient. This is the same node that fills the role of the anonymous actor for the header blocks that don't specify an actor.

Like the header of the message, the contents of the body must all bear namespace qualifiers. But unlike the header, the specification does provide one tag within the envelope's namespace that may appear within a body. This is the Fault tag, which is used in error reporting. Faults themselves are covered in detail later. The body isn't required to have any child elements. The only limitation on child elements is that only one Fault element can be present in a Body .
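The structural rules for the body lend themselves to a simple check. The following Python sketch (standard library only) tests three of them: the Body is present exactly once, nothing follows it in a SOAP 1.2 message, and it holds at most one Fault. The function name `envelope_problems` and the wording of the messages are assumptions made for illustration; the SOAP 1.2 URI is the draft namespace from Table 5-1.

```python
import xml.etree.ElementTree as ET

SOAP_12 = "http://www.w3.org/2002/06/soap-envelope"  # draft, Table 5-1

def envelope_problems(message):
    """Return a list of structural problems found in a SOAP message."""
    root = ET.fromstring(message)
    if not root.tag.endswith("}Envelope"):
        return ["root element is not a namespace-qualified Envelope"]
    env = root.tag[1:root.tag.index("}")]
    header, body = "{%s}Header" % env, "{%s}Body" % env
    names = [child.tag for child in root]
    if names.count(body) != 1:
        return ["exactly one Body element is required"]
    problems = []
    position = names.index(body)
    if names[:position] not in ([], [header]):
        problems.append("only an optional Header may precede Body")
    # SOAP 1.1 silently tolerated trailing tags; SOAP 1.2 does not.
    if env == SOAP_12 and names[position + 1:]:
        problems.append("SOAP 1.2 allows nothing after Body")
    if len(root[position].findall("{%s}Fault" % env)) > 1:
        problems.append("Body may hold at most one Fault")
    return problems

TRAILING = """<env:Envelope xmlns:env="%s">
  <env:Body/>
  <x:extra xmlns:x="http://example.org/extra"/>
</env:Envelope>"""

print(envelope_problems(TRAILING % "http://schemas.xmlsoap.org/soap/envelope/"))
print(envelope_problems(TRAILING % SOAP_12))
```

The same trailing element passes under the SOAP 1.1 namespace and is reported under the SOAP 1.2 one, which is the situation a SOAP 1.2 application handling older messages has to allow for.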

So what role does the body of the message play? It carries information, just as the Header does (and to a lesser degree, the Envelope as well). The message body enjoys the distinction of always being present, and always being the responsibility of the ultimate recipient of the message. Intermediaries may add to the body of a message, but they are expected to not actually try to process any part of it in any way.

Depending on the nature of the application itself, the body of the message may be little more than a storehouse of information. In the extended set of header blocks that were traced in Examples 5-4 through 5-6, all the functionality of the request was contained in the headers of the message. Each of these tasks was directed by a header block: the lookup of a username through numerical ID, the validation of the message against a digital signature, and the actual request to update the user information against whatever system was being modeled. The body of the message is used only to provide the data the update action utilizes. In that extended example, none of the intermediaries updated the body itself. Both intermediate operations (the ID lookup and the signature check) appended their information to the header.

There are other ways this could have been done. For example, the digital signature application might have taken as part of its specification the element ID of one of the body's child elements. Applications on either side of the signature function could then have added their data to the body without interfering with the signature. The signature application itself didn't have to worry about changing the body because the signature was already computed and evaluated. All these choices were open to the designers of the applications involved. Their only responsibility was to provide clear and accurate specification of their interfaces when dealing with each other.

Of course, the body can be responsible for much more than just carrying data. Encapsulation of RPC within SOAP can use the body to provide the remote call being marshaled, for example. The part of the header that provides the actual update can be placed within the body, instead of in the header. Bodies can contain many child elements, ranging from multiple application instructions to complex serialized objects.

5.2.5 Expressing and Encoding Data

Messages necessarily demand the ability to communicate data. The encoding of data was a major issue in the cross-platform compatibility of the early RPC implementations. The XML-RPC protocol provides a simple yet flexible encoding that covers the same ranges the original RPC implementations did (and more than most).

Encoding (often called serialization) of data within SOAP is considerably more complex than in XML-RPC. As such, trying to comprehend it just from the specification by itself can be an intimidating task. This is made even more complex by the fact that a message may use an encoding other than the basic SOAP encoding. In fact, a message may use several different encodings at various places within the envelope. Specifying these encodings is the role of the encodingStyle attribute described earlier.

5.2.5.1 XML Schema and encoding

The SOAP encoding model is based primarily on the XML Schema Language, generally referred to as XML Schema. This is convenient because it provides many of the most-common "basic" types as building blocks from which more complex data descriptions are constructed.

However, SOAP also requires that a message be manageable without forcing the recipient to process any schemas. Because of this, all attributes that are parts of data definitions must be present; they can't be assumed to be inherited from the corresponding schema.

The encoding schema that describes the data serializations provided by SOAP is the one specified earlier in Table 5-1, in which the commonly used namespace labels were introduced. Refer to those URIs (for SOAP 1.1 and 1.2), as well as the URIs that are associated with the namespace labels xsi and xsd (the SOAP 1.2 specification uses xs in place of xsd). While the URI is necessarily different for the basic SOAP encoding itself, both versions of SOAP use the same namespace URIs for the XML Schema declarations.

5.2.5.2 Simple types, values and enumerations

Through XML Schema, SOAP provides some very familiar simple types, listed in Table 5-4.

Table 5-4. Simple types available

Type	Example
`int`	2147483647, -1, 2600
`float`	2.7182818284590452354, -6.28318, 3.40282347E+38
`negativeInteger`	-212, -32768
`string`	Just about anything you want

Unlike XML-RPC, SOAP data is expressed in the form of named parameters. Take the following construct:

 <Name>James</Name>

This doesn't refer to Name as a type. Name is the parameter itself, and the string "James" is its value. It is presumed that the schema describing the document in which this parameter appears has defined Name as a string (or as one of several other possibilities that are explored shortly).

The XML Schema specification provides the definitions of these types, and attributes that may be used in qualifying elements that contain such data. However, no tags are provided by XML Schema directly. To aid in encoding values in places and situations in which the type may not be immediately known, the SOAP encoding schemas provide tags in their namespaces as a convenience. The previous fragment can also be written as follows:

 <Name><SOAP-ENC:string>James</SOAP-ENC:string></Name>

Within SOAP terminology (and borrowing from XML Schema), a simple type categorizes a set (or class) of simple values. A simple value is like that in earlier examples. Simple values don't have named parts or even multiple parts.

Likewise, one of the XML Schema types can also be provided in an attribute:

 <Lang xsi:type="xsd:string">Greek</Lang>

This naming of data elements is what SOAP calls accessors. Accessors are explained in greater detail in a later section, but for now it is enough to understand that the name by which data is referred to is its accessor. When the type of the data a name refers to can be changed from one instance to the next using the type attribute, that is what SOAP considers polymorphic. To Perl, of course, it seems like an ordinary scalar.
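Because the type travels with the value, a sender can pick the xsi:type for an accessor by inspecting each Perl scalar as it is serialized. The following is a minimal sketch of that idea, not the behavior of any particular toolkit. The routine names guess_xsd_type and typed_element are hypothetical, the patterns are simple heuristics, and the xsd and xsi prefixes are assumed to be declared on an enclosing element.

```perl
use strict;
use warnings;

# Guess an XML Schema type from the look of a Perl scalar. This is only a
# heuristic: Perl does not distinguish the string "42" from the number 42.
sub guess_xsd_type {
    my ($value) = @_;
    if ($value =~ /\A[+-]?\d+\z/) {
        # xsd:int is a 32-bit signed type; larger values need xsd:integer.
        return ($value >= -2147483648 && $value <= 2147483647)
            ? 'xsd:int'
            : 'xsd:integer';
    }
    return 'xsd:float'
        if $value =~ /\A[+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?\z/;
    return 'xsd:string';
}

# Emit an accessor whose type is stated inline with xsi:type.
sub typed_element {
    my ($name, $value) = @_;
    my $type = guess_xsd_type($value);
    # Replace markup characters with numeric character references.
    (my $text = $value) =~ s/([&<>])/'&#' . ord($1) . ';'/ge;
    return qq{<$name xsi:type="$type">$text</$name>};
}

print typed_element(Lang  => 'Greek'), "\n";
print typed_element(Count => 2600),    "\n";
```

The same accessor name can therefore carry an integer in one message and a string in the next, which is the polymorphism described above.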

One simple type remains, the array of bytes. This is a type provided as a "catch-all" of sorts, for data that doesn't conveniently fit into the other types, such as audio or image data. While the encoding isn't mandated as such, the recommended encoding is base64, as defined in XML Schema (and generally familiar outside of XML Schema, as well). The SOAP encodings provide a corresponding subtype for this, as well. An example of an array of bytes might look like this: ^[2]

^[2] If this looks familiar, it's because it's the example from the specification itself. Because the Base64 format is fairly reader-unfriendly, it's what almost all authors use as an example of this type.

 <picture xsi:type="SOAP-ENC:base64">
   aG93IG5vDyBicm73biBjb3cNCg==
 </picture>

The general definition of the "base64" encoding includes limits on line length that aren't enforced in the context of SOAP values. Lines may be broken (or not broken) at any point convenient to the application.
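In Perl, the core MIME::Base64 module handles both directions. The sketch below is illustrative only: the routine names base64_element and base64_content are hypothetical, the regular expression is a stand-in for a real XML parser, and the SOAP-ENC and xsi prefixes are assumed to be declared elsewhere in the message.

```perl
use strict;
use warnings;
use MIME::Base64 qw(encode_base64 decode_base64);

# Wrap raw bytes in an element typed as SOAP-ENC:base64. The second argument
# to encode_base64 is the line terminator; the empty string suppresses the
# line breaks that the module would otherwise add every 76 characters.
sub base64_element {
    my ($name, $bytes) = @_;
    my $encoded = encode_base64($bytes, '');
    return qq{<$name xsi:type="SOAP-ENC:base64">$encoded</$name>};
}

# Recover the bytes from such an element. decode_base64 ignores characters
# outside the base64 alphabet, so whitespace and line breaks placed anywhere
# in the content do no harm.
sub base64_content {
    my ($xml) = @_;
    my ($text) = $xml =~ m{>([^<]*)</}
        or die "no element content found\n";
    return decode_base64($text);
}

my $xml = base64_element(picture => "how now brown cow\r\n");
print "$xml\n";
print base64_content($xml);
```

Because the decoder skips anything that isn't part of the encoding alphabet, a receiver written this way doesn't care where, or whether, the sender broke the lines.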

5.2.5.3 Compound types and values

SOAP and XML Schema provide for more complex data expression through compound types, types that are built up by creating associations of several other values under one name. Compound types come in two forms: the structure (sometimes referred to simply as struct) and the array.

The structure is a familiar concept. In terms of Perl, this may be thought of as similar to the hash table, in which the named values that make up the parts of the compound type are the key/value pairs, and the hash table itself represents the container. Any elements in a compound type may refer to a simple type or another compound type. Here's an example of a simple structure:

 <Book xmlns="http://www.loc.gov/schema">
   <title>Perl Guidebook, The</title>
   <isbn>01-55677-1234</isbn>
 </Book>

This corresponds to the simple hash:

 %Book = (
          title => 'Perl Guidebook, The',
          isbn  => '01-55677-1234'
         );

As with the simple types, compound types are described using the XML Schema syntax. A schema fragment to describe the previous structure might look like this:

 <xs:element name="Book" type="tns:Book"/>
 <xs:complexType name="Book">
   <xs:sequence>
     <xs:element name="title" type="xs:string"/>
     <xs:element name="isbn" type="xs:string"/>
   </xs:sequence>
 </xs:complexType>
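The correspondence between a hash and a structure can be made concrete with a few lines of Perl. This is a minimal sketch under stated assumptions: the routine names xml_escape and serialize_struct are hypothetical, namespace declarations are omitted for brevity, and every hash key is assumed to be a legal XML element name.

```perl
use strict;
use warnings;

# Escape the characters that must not appear raw in element content.
sub xml_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Serialize a hash reference as a structure: one child element per key.
# A value that is itself a hash reference becomes a nested structure.
# Keys are sorted only to make the output repeatable.
sub serialize_struct {
    my ($name, $hash, $indent) = @_;
    $indent = '' unless defined $indent;
    my $xml = "$indent<$name>\n";
    for my $key (sort keys %$hash) {
        my $value = $hash->{$key};
        if (ref $value eq 'HASH') {
            $xml .= serialize_struct($key, $value, "$indent  ");
        }
        else {
            $xml .= "$indent  <$key>" . xml_escape($value) . "</$key>\n";
        }
    }
    return $xml . "$indent</$name>\n";
}

my %Book = (
    title => 'Perl Guidebook, The',
    isbn  => '01-55677-1234',
);
print serialize_struct(Book => \%Book);
```

The hash keys become the element names, and the hash itself becomes the containing element.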

5.2.5.4 Accessors, scoping, and reference

The introduction of the structure also means introducing the concept of the data accessor. The term is used in this context to mean the retrieval of a part of a compound value such as a structure or an array. This is potentially confusing, as the term "accessor" already has a place in object-oriented programming. The usage in this context is similar but not completely the same. The SOAP usage is presented here because the specifications use the word in this role consistently, and following them helps avoid confusion.

An accessor in the SOAP sense of the word is the way a specific data element is uniquely referred to. In more traditional object-oriented terms, an accessor is generally a type of method that provides access to internal ("protected") class elements without exposing those elements directly. A SOAP accessor isn't a routine or method. In a structure, the name of the structure element is the accessor. This can mean that the accessor is a combined sequence of element names when referring to a deeply nested data element. In an array, the position within the array itself is the accessor.

The scoping of an accessor is important when a specific accessor isn't unique throughout the entire message. For example, if a message body contains two or more instances of the same structure with the containing tags at the same depth within the Body content, an accessor can refer to either structure, unless there is enough information to narrow it down to one specific structure. But when the focus is already within a structure, an accessor to a piece of data contained within is valid, even though the same accessor points to a different value in a different structure. This is what is known as the scope of an accessor and isn't unlike the concept of scope within Perl.

Another type of accessor is the reference. References in the XML sense aren't really new, either. Any opening tag may bear an id attribute whose value is a string that is unique within the message. Other tags may then refer to that element using the href attribute, giving that id (prefixed with #) as the value, as in Example 5-5.

Example 5-5. Defining references between values

 <enc:string id="question-1">
   What is the answer to life, the universe and
   everything?
 </enc:string>
 <enc:int id="answer-1">42</enc:int>

 <q:TestItem xmlns:q="http://testgiver.org/basic">
   <q:Question href="#question-1"/>
   <q:Answer href="#answer-1"/>
 </q:TestItem>

Both versions of the SOAP specification strongly encourage that references be used only on data that is being referred to at least twice. If data is being referred to in only one place, it should be directly embedded within the containing tags, rather than being given a reference. The TestItem tag of Example 5-5 should have been expressed as follows:

 <q:TestItem xmlns:q="http://testgiver.org/basic">
   <q:Question>
     What is the answer to life, the universe and
     everything?
   </q:Question>
   <q:Answer>42</q:Answer>
 </q:TestItem>

The usefulness of references in such a simple fragment may not be obvious, but references don't have to refer to the current document. A reference can point to an external resource, just as the href attribute often does within HTML. The TestItem tag may wish to refer to the answers externally, away from the prying eyes of XML-skilled students:

 <q:TestItem xmlns:q="http://testgiver.org/basic">
   <q:Question href="#question-1"/>
   <q:Answer href="http://test.server.edu/test-1#answer-1"/>
 </q:TestItem>

Recall that the header blocks in Example 5-4 also used references, both to refer between the header and the body and to refer to data provided by intermediary processes.
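On the receiving side, resolving a reference amounts to a lookup for local fragments and a deferred fetch for everything else. The sketch below assumes the message has already been parsed and that the elements bearing an id attribute have been collected into a hash; the routine name resolve_href and the shape of that hash are hypothetical.

```perl
use strict;
use warnings;

# Values that carried an id attribute, keyed by that id. This is stand-in
# data for the enc:string and enc:int elements of Example 5-5.
my %values_by_id = (
    'question-1' =>
        'What is the answer to life, the universe and everything?',
    'answer-1'   => 42,
);

# Resolve an href value. A bare fragment ("#name") points into the current
# message. Anything else names an external resource, which this sketch
# hands back untouched for the caller to fetch.
sub resolve_href {
    my ($href, $ids) = @_;
    if (my ($id) = $href =~ /\A#(.+)\z/) {
        die "dangling reference: $href\n" unless exists $ids->{$id};
        return (local => $ids->{$id});
    }
    return (external => $href);
}

my %test_item = (
    Question => '#question-1',
    Answer   => 'http://test.server.edu/test-1#answer-1',
);
for my $accessor (sort keys %test_item) {
    my ($kind, $value) = resolve_href($test_item{$accessor}, \%values_by_id);
    print "$accessor ($kind): $value\n";
}
```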

5.2.5.5 Arrays and partial arrays

As was mentioned earlier, arrays are regarded as compound types in which the ordinal position of each element, rather than its name, provides the accessor. Arrays may contain elements of any variety of types, including structures and other arrays. Arrays may also be constrained to a certain subset of types by providing type information in the opening tag or in the schema that defines the namespace being used. The arrayType attribute provided by the SOAP encoding namespace is the mechanism used to provide this information.

The value of the arrayType attribute is a type identifier. That value is generally from the XML Schema namespace, but can also be other namespace-qualified types. Example 5-6 illustrates three arrays.

Example 5-6. Different kinds of arrays

 <enc:Array enc:arrayType="xs:int[3]">
   <enc:int>1</enc:int>
   <enc:int>2</enc:int>
   <enc:int>3</enc:int>
 </enc:Array>

 <a:AnyThingGoes xmlns:a="something.xml"
     enc:arrayType="xs:anyType[4]">
   <enc:int>27</enc:int>
   <enc:string>Three-cubed</enc:string>
   <enc:float>3.14159</enc:float>
   <enc:anyURI>http://www.oreilly.com</enc:anyURI>
 </a:AnyThingGoes>

 <loc:Booklist xsi:type="enc:Array"
     xmlns:loc="http://www.loc.gov/schema"
     enc:arrayType="loc:Book[2]">
   <loc:Book>
     <loc:title>Perl Guidebook, The</loc:title>
     <loc:isbn>01-55677-1234</loc:isbn>
   </loc:Book>
   <loc:Book>
     <loc:title>Perl Primer, The</loc:title>
     <loc:isbn>01-55832-4321</loc:isbn>
   </loc:Book>
 </loc:Booklist>

The first of these is just an array of integers, while the second shows an array in which elements are different types. The third array uses the Book structure definition from an earlier fragment to define the element types. It also uses a tag taken from the same namespace to serialize the array itself.

In this example, the second array used a named parameter tag provided by a namespace that was defined in the opening tag. In the third example, the tag was again a named parameter, but the type was provided in an inline fashion, using the xsi:type attribute. The second array also illustrated how different types are encoded into a single array. When the type provided for arrayType has derivatives (such as short, which XML Schema derives from int), a derivative value is allowed to appear in the array. This is like Java or other strongly typed object languages in which a subclass instance may be passed in a place where the parameter had been typed as an ancestor class.

The arrayType attribute provides not only the type for the data, but the size of the array as well. The specifications define the format of the arrayType value using a set of rules similar to the style used for parser definitions. In simple terms, an array type definition has two parts: the element type and the dimension. The dimension is given in the last set of square brackets ([ and ]), and may be zero or more integers separated by commas. The element type may also be followed by a pair of brackets, which indicates that each array element is itself also an array. This inner set of brackets doesn't have any numbers in it, but has commas to show multiple dimensions. It is the responsibility of the arrays in the data to provide their specific size information when they are serialized. Additionally, if the dimension of the array itself isn't given, that means the application should determine size by inspecting the data. Table 5-5 shows some declarations with their explanations.

Table 5-5. Sample arrayType values and explanations

Value	Type	Dimension	Explanation
int[10]	int	[10]	One row of integers, 10 elements in length
string[ ][3]	string[ ]	[3]	One row of 3 elements, each an array of strings
int[,][ ]	int[,]	[ ]	One row of elements, each of which is a 2D array of integers, with the size of the outer array to be determined from the data
anyType[5,5]	anyType	[5,5]	A 5-by-5 matrix whose elements may be of any type that includes other arrays

The special type specifier anyType is the XML Schema equivalent of the Object type in Java or Smalltalk. It may be thought of as the true Perl scalar of the SOAP encoding mechanism. It simply means that any type may occur as an element.
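The two-part reading of an arrayType value shown in Table 5-5 can be expressed as a single regular expression. What follows is a minimal sketch, not the parser used by any toolkit: the routine name parse_array_type is hypothetical, and the pattern accepts a somewhat looser set of type names than the specifications' grammar does.

```perl
use strict;
use warnings;

# Split an arrayType value such as "xs:int[,][ ]" into its element type and
# a reference to the list of dimension sizes. Returns an empty list when
# the value does not have the expected shape.
sub parse_array_type {
    my ($value) = @_;
    my ($type, $size) = $value =~ m{
        \A \s*
        ( [\w:.-]+ (?: \[ [\s,]* \] )* )   # type name plus any rank brackets
        \[ ( [\d\s,]* ) \]                 # the dimension of this array
        \s* \z
    }x or return;
    $type =~ s/\s+//g;
    $size =~ s/\s+//g;
    my @dims = grep { length } split /,/, $size;
    return ($type, \@dims);
}

for my $value ('int[10]', 'string[ ][3]', 'int[,][ ]',
               'anyType[5,5]', 'int[5][5]') {
    if (my ($type, $dims) = parse_array_type($value)) {
        my $size = @$dims ? join(' x ', @$dims) : 'determined from the data';
        print "$value: element type $type, size $size\n";
    }
    else {
        print "$value: not a valid arrayType\n";
    }
}
```

An empty dimension list signals that the receiver must count the elements itself. The rank brackets inside the element type may hold only commas, never numbers.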

You may have noticed that an array's size may be left to examination only in the case of single-dimensional arrays. A serialized array specifies elements only in a linear sequence. Thus, an array of dimension two or higher must explicitly provide the sizes of the dimensions. This is a change from C or Perl, in which secondary and later dimensions were implemented as arrays of references or pointers. In other words, the type int[5,5] isn't the same as int[5][5]. The latter isn't a valid type definition, for one thing, and the first declaration results in a series of 25 integer values being serialized, not 5 occurrences of a further 5-element array. The following 2x2 array shows this in more detail:

 <enc:Array enc:arrayType="xs:string[2,2]">
   <enc:string>Row 1, Column 1</enc:string>
   <enc:string>Row 1, Column 2</enc:string>
   <enc:string>Row 2, Column 1</enc:string>
   <enc:string>Row 2, Column 2</enc:string>
 </enc:Array>

This may seem like an unusual approach, but it becomes more clear when encoding partial arrays and sparse arrays.
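A Perl receiver that prefers the familiar array-of-array-references layout has to rebuild it from the linear sequence, using the sizes given in arrayType. The following sketch shows one way to do that; the routine names reshape and _fold are hypothetical, and the values are assumed to arrive in the row-by-row order shown in the 2x2 fragment.

```perl
use strict;
use warnings;

# Rebuild nested Perl arrays from the flat sequence of values found in a
# serialized array, given the dimension sizes from arrayType.
sub reshape {
    my ($values, @dims) = @_;
    die "no dimensions given\n" unless @dims;
    my $expected = 1;
    $expected *= $_ for @dims;
    die "expected $expected values, got " . scalar(@$values) . "\n"
        unless @$values == $expected;
    return _fold([@$values], @dims);    # work on a copy of the caller's list
}

# Consume values from the front of $flat: the last dimension takes a plain
# slice, and every outer dimension collects the results of the inner ones.
sub _fold {
    my ($flat, $size, @rest) = @_;
    return [ splice @$flat, 0, $size ] unless @rest;
    return [ map { _fold($flat, @rest) } 1 .. $size ];
}

my $grid = reshape(
    [ 'Row 1, Column 1', 'Row 1, Column 2',
      'Row 2, Column 1', 'Row 2, Column 2' ],
    2, 2,
);
print $grid->[1][0], "\n";    # second row, first column
```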

Partial and sparse arrays are similar, but aren't identical. A partial array is one that isn't sent in its entirety. The elements that are sent are still considered to be ordered; they just don't represent a complete array, and they may not start at the very first element. The attribute offset provides the information as to where in the array the data starts. This fragment encodes elements 3 and 4 of a 5-element array:

 <enc:Array enc:arrayType="xs:int[5]" enc:offset="[2]">
   <enc:int>3</enc:int>
   <enc:int>4</enc:int>
 </enc:Array>
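A receiver can place the transmitted values into a full-size Perl array once it knows the declared size and the offset. This is a small sketch of that step for the single-dimensional case; the routine name expand_partial is hypothetical, and the size and offset are assumed to have been extracted from the attributes already.

```perl
use strict;
use warnings;

# Place the values of a partial array into a full-size Perl array.
# $size comes from arrayType and $offset from the offset attribute, which
# counts from zero. Slots that were not transmitted are left undefined.
sub expand_partial {
    my ($size, $offset, @values) = @_;
    die "values overrun the declared size\n" if $offset + @values > $size;
    my @full = (undef) x $size;
    @full[ $offset .. $offset + $#values ] = @values;
    return \@full;
}

my $array = expand_partial(5, 2, 3, 4);
print join(', ', map { defined $_ ? $_ : 'undef' } @$array), "\n";
```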

In contrast, a sparse array is one in which some number of elements are sent, but there is no guarantee about the relative positions of any of the elements. The elements that are sent have an attribute called position, which provides the placement within the whole for that element, as follows:

 <enc:Array enc:arrayType="xs:string[10,10]">
   <enc:string enc:position="[2,4]">Row 3, Col 5</enc:string>
   <enc:string enc:position="[7,9]">Row 8, Col 10</enc:string>
 </enc:Array>

Using the position attribute, only two of the 100 elements are passed, but the receiver knows exactly which two they are.
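For sparse data, a Perl hash keyed by position is often a more natural target than a mostly empty array. The sketch below illustrates the idea with hypothetical routine names (parse_position and collect_sparse) and assumes each element has already been reduced to a pair holding its position string and its value.

```perl
use strict;
use warnings;

# Parse a position value such as "[2,4]" into a list of zero-based indexes.
sub parse_position {
    my ($text) = @_;
    my ($inner) = $text =~ /\A\s*\[([\d,\s]*)\]\s*\z/
        or die "malformed position: $text\n";
    $inner =~ s/\s+//g;
    return split /,/, $inner;
}

# Store sparse elements in a hash keyed by the joined indexes, which avoids
# allocating all 100 cells of a 10-by-10 array to hold two values.
sub collect_sparse {
    my (@elements) = @_;    # each element is [ position string, value ]
    my %cells;
    for my $element (@elements) {
        my ($position, $value) = @$element;
        $cells{ join ',', parse_position($position) } = $value;
    }
    return \%cells;
}

my $cells = collect_sparse(
    [ '[2,4]', 'Row 3, Col 5'  ],
    [ '[7,9]', 'Row 8, Col 10' ],
);
print "$_ => $cells->{$_}\n" for sort keys %$cells;
```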

References may also be used within arrays, including partial and sparse ones. The references may refer to individual data elements or to other arrays that are used to fill in when the datatype for the referencing array allows:

 <enc:Array enc:arrayType="xs:int[ ][10]">
   <enc:Array enc:arrayType="xs:int[10]">
     <enc:int enc:position="[9]">-1</enc:int>
     <enc:int enc:position="[0]">3</enc:int>
   </enc:Array>
   <enc:Array enc:position="[3]" href="#array-3"/>
 </enc:Array>

The fragment defines a sparse array with 10 arrays of integers. The first slot (slot 0, since the opening tag had no offset, and the first child tag had no position attribute) has data for its slots 0 and 9. The fourth slot of the outer array is a reference elsewhere in the document, to what will presumably be an array whose type matches what is expected.

5.2.5.6 Structures and generic compound types

Returning briefly to the subject of structures, the encoding rules aren't limited to structures that have all their accessors known in advance. It is possible and permitted to have structures in which the accessors are known only by inspection of the serialized data itself. Any accessors that contain data whose type can't be derived in advance still must provide the type information by means of an xsi:type attribute.

To further support generic compound data, the rules also allow for compound types that mix together accessors that are distinguished by name and accessors that are distinguished by both name and ordinal position, as in Example 5-7.

Example 5-7. A compound type with mixed accessors

 <newUserList xmlns="http://linux.com/userSchema">
   <ShellUser>
     <name>rjray</name>
     <shell>/bin/tcsh</shell>
     <password xsi:type="md5pw">
       8be3d7e3ccf03a98026acfce9dcc6487
     </password>
   </ShellUser>
   <CvsUser>
     <name>rjray</name>
     <commitAccess xsi:type="xs:boolean">true</commitAccess>
     <password xsi:type="md5pw">
       8be3d7e3ccf03a98026acfce9dcc6487
     </password>
   </CvsUser>
   <CvsUser>
     <name>guest</name>
     <commitAccess xsi:type="xs:boolean">false</commitAccess>
     <password xsi:type="md5pw">
       473640010efda30acaedea407ab4aa64
     </password>
   </CvsUser>
 </newUserList>
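A compound value like this one doesn't map onto a plain Perl hash, because the second CvsUser would overwrite the first. One possible representation, offered here only as a sketch, keeps the members as an ordered list of name and value pairs. The routine name member and the abbreviated data are hypothetical.

```perl
use strict;
use warnings;

# Members as an ordered list of [ name, value ] pairs, because a plain hash
# could hold only one CvsUser. The password fields are left out for brevity.
my @new_user_list = (
    [ ShellUser => { name => 'rjray', shell => '/bin/tcsh' } ],
    [ CvsUser   => { name => 'rjray', commitAccess => 'true' } ],
    [ CvsUser   => { name => 'guest', commitAccess => 'false' } ],
);

# Fetch a member by name and by its zero-based ordinal position among the
# members that share that name. Returns undef when there is no such member.
sub member {
    my ($members, $name, $index) = @_;
    my @matches = grep { $_->[0] eq $name } @$members;
    return $index < @matches ? $matches[$index][1] : undef;
}

print member(\@new_user_list, 'ShellUser', 0)->{shell}, "\n";
print member(\@new_user_list, 'CvsUser',   1)->{name},  "\n";
```

ShellUser is found by name alone, while the two CvsUser members can be told apart only by combining the name with a position.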

5.2.5.7 The SOAP root attribute

When a serialization contains multiple values and structures that are all children of the Body tag, SOAP generally assumes that the first one is the root of the object graph. This isn't always the case, especially when there are references involved in the serialization. To allow an application greater control over data expression, SOAP provides an attribute in the encoding namespace called root. This attribute is of type xs:boolean.

All elements have an implied value of false (or 0) for this attribute, except for the element considered to be the graph root. Changing the consideration of the root requires only that the attribute be explicitly provided in the true root. It is a good idea to include the attribute with a false value in other elements that might otherwise be considered possible root candidates (such as cases where the first element will not be the root).

The element that is considered the root may in some cases affect the way the message is deserialized for the server application. This is a very rarely used feature, and no examples were found in the specifications or related documents to clarify the functionality.

5.2.6 Signaling a Problem: Faults

The last of the XML and serialization topics to be covered is the SOAP Fault. In the simplest terms, faults are errors. The Fault tag was referred to earlier when discussing the Body tag. It has the distinction of being the only predefined child element for Body in the SOAP specifications. If one of the SOAP nodes has a problem with the message, this is the mechanism for communicating the problem back to the original client.

In practice, a fault child element looks just like a structured datatype. The biggest difference is that the permitted members are predefined, as is their order. The tag itself must be qualified into the same namespace as the Body, Header, and Envelope tags.

5.2.6.1 Fault elements

The elements within the Fault tag are strictly defined both in name and in order. The first two of the four are mandatory; they must be present in all fault messages. The third and fourth elements are optional and serve to provide greater detail on the nature of the error at hand. These elements are defined in Table 5-6 (with columns for SOAP 1.1 and 1.2, which use different names). Note that the types given for the elements are from XML Schema.

Table 5-6. The element tags of a fault

Tag name (1.1)	Tag name (1.2)	Type	Role
faultcode	Code	`QName` (SOAP 1.2 uses subelements)	This is a qualified name that identifies the specific type of fault that has occurred. SOAP has a small number of predefined fault codes for certain common situations.
faultstring	Reason	string	The string provided by this element is meant to be a human-readable representation of the error. It should be a relatively brief phrase, similar in nature to the text responses of HTTP itself.
faultactor	Role	anyURI	This element is intended to specify which node in the message path is the source of the fault. Any node that isn't the ultimate receiver, but is generating the fault, must set this.
detail	Detail	tag blocks	Conveys the application-specific fault information. The contents of this element are zero or more tag elements. This element may also have an attribute, the familiar `encodingStyle` (which propagates to the child tags).

The SOAP 1.2 specification for faults also includes a fifth element, called Node. This identifies at which node the fault occurred, if the data is relevant. The updated specification also changes the content of the Code element, so that it uses child elements called Value and Subcode (the latter is optional). More information on SOAP 1.2 faults can be found in the specifications.

The detail element can be an important piece of content. If the nature of the fault is a failure to process the message body, the detail element must be present, even if it contains no child tags. Likewise, the absence of this element can be safely interpreted as meaning that the fault was not related to the message body at all. No information on header-related errors may appear within detail. Any information on header-based errors must be provided in the header of the response.

Each of the child elements present in a detail block may be namespace-qualified and may have its own encodingStyle attribute for more localized encoding.
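The rules for the SOAP 1.1 form of the element can be captured in a short routine. The following is a minimal sketch rather than the output of any toolkit: the routine names xml_escape and fault_xml are hypothetical, the SOAP-ENV prefix is assumed to be bound to the envelope namespace by an enclosing element, and the detail content is assumed to be well-formed XML supplied by the caller.

```perl
use strict;
use warnings;

# Escape the characters that must not appear raw in element content.
sub xml_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Build a SOAP 1.1 Fault element. faultcode and faultstring are mandatory;
# faultactor and detail are written only when supplied. The children are
# always emitted in the order given in Table 5-6.
sub fault_xml {
    my (%args) = @_;
    for my $required (qw(faultcode faultstring)) {
        die "$required is mandatory\n" unless defined $args{$required};
    }
    my $xml = "<SOAP-ENV:Fault>\n";
    for my $name (qw(faultcode faultstring faultactor)) {
        next unless defined $args{$name};
        $xml .= "  <$name>" . xml_escape($args{$name}) . "</$name>\n";
    }
    # detail carries application-defined child elements, so it is passed
    # through as already-formed XML instead of being escaped.
    $xml .= "  <detail>$args{detail}</detail>\n" if defined $args{detail};
    return $xml . "</SOAP-ENV:Fault>\n";
}

print fault_xml(
    faultcode   => 'SOAP-ENV:Client',
    faultstring => 'Malformed body content',
    detail      => '',    # present but empty: the body could not be processed
);
```

Passing an empty string for detail produces an empty element, which matches the rule that the element must appear for body-related faults even when it has no children.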

The SOAP 1.1 specification explicitly allows for additional elements within a Fault, provided that all added elements are namespace-qualified. The SOAP 1.2 specification, however, makes no mention of this. Unless later drafts clarify, this stands as another potential trap when providing SOAP 1.1 compatibility in a SOAP 1.2 application.

5.2.6.2 Predefined faults

The designers of SOAP recognized that consistency in fault codes is an important aspect of interoperability between different clients and varying servers. It isn't enough to rely on the faultstring accessor, for several reasons:

Developers may not phrase the same thought using the exact same words.
Trying to deduce the nature of an error from the string completely ignores any and all issues of localization and language.
Errors naturally fall into categories, whether those categories are defined in terms of system resources ("IO Error," "Out of Memory," etc.) or protocol definitions ("404 Not Found," "500 Server Error," etc.).
The point of having two cooperating values is so that the code defines the general fault, with the string providing detail specific to the error itself.

Both SOAP 1.1 and 1.2 define sets of fault codes to use as a basis for generating these diagnostics. Neither specification expects its list to be exhaustive or restrictive. The goal is to provide for consistency in reporting the most common faults.

Of the changes between SOAP 1.1 and SOAP 1.2, this is the most radical. The fault codes defined for SOAP 1.1 were completely removed and replaced with different codes in SOAP 1.2. Each of the faults from 1.1 has a corresponding entry in the 1.2 table, while the 1.2 table defines an additional one.

Table 5-7 presents the fault codes for SOAP 1.1. In this version of the specification, the intent was to have the codes be the first element of a sequence of two or more elements linked with "." characters. The part of the code to the left of the first period specifies the class of the fault, and the portion to the right provides more specific classification of the fault itself.

Table 5-7. The predefined faults for SOAP 1.1

Code	Meaning
VersionMismatch	The processing node found a namespace (SOAP version) it can't process.
MustUnderstand	A child element of the `Header` that specified a `mustUnderstand` attribute of `1` can't be handled.
Client	The `Client` errors were used to describe errors in which the problem was in the encoding or specification of the data as it was received from the client.
Server	In contrast to the `Client` errors, `Server` errors were defined as those that came from problems the server encountered while processing an otherwise valid, correct request.

Based on the model that SOAP 1.1 described, the previous codes would have appeared in forms more like VersionMismatch.SOAP1-1Required, or Client.MalformedBodyContent. The longer text of the error would be left to the faultstring field, with the other fields used where appropriate and required.
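A client that wants to react to the class of a fault, while still logging the finer classification, can split the code on its periods. The sketch below uses a hypothetical routine name, split_fault_code, and also separates any namespace prefix, since the code is a qualified name.

```perl
use strict;
use warnings;

# Split a SOAP 1.1 faultcode into its namespace prefix (if any), its class,
# and the list of more specific parts that follow the first period.
sub split_fault_code {
    my ($code) = @_;
    my ($prefix, $local) = $code =~ /\A(?:([^:]+):)?(.+)\z/
        or die "empty fault code\n";
    my ($class, @specific) = split /\./, $local;
    return { prefix => $prefix, class => $class, specific => \@specific };
}

my $parts = split_fault_code('SOAP-ENV:Client.MalformedBodyContent');
print "class: $parts->{class}\n";
print "specific: @{ $parts->{specific} }\n";
```

Code that branches only on the class keeps working when a server adds a more specific suffix that the client has never seen.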

SOAP 1.2 provides a more complete fault model. Part of this is replacing the dotted groups with qualified names. Another part is to give more detail in the definition of the predefined faults. Table 5-8 lists the SOAP 1.2 faults.

Table 5-8. The predefined faults for SOAP 1.2

Code	Meaning
VersionMismatch	The meaning and use of this is the same as in SOAP 1.1.
MustUnderstand	As above, this matches the SOAP 1.1 fault of the same name.
DataEncodingUnknown	A header or body element that the current node is responsible for processing specified an encoding (via `encodingStyle`) the node doesn't support.
Sender	This fault corresponds to the `Client` code from SOAP 1.1. It causes the `detail` part of the fault to be required in the response.
Receiver	As with `Sender`, this corresponds to the `Server` code from SOAP 1.1. It, too, will make the `detail` element of the fault necessary.

The two codes that have exactly the same name as their SOAP 1.1 counterparts shouldn't be interpreted as meaning exactly the same thing. The SOAP 1.2 definitions and restrictions are much more specific with regard to these predefined values.
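An application that must speak both versions might keep the correspondence from Tables 5-7 and 5-8 in a lookup table. This is a sketch only: the hash and routine names are hypothetical, and the mapping translates names without accounting for the stricter SOAP 1.2 definitions just mentioned.

```perl
use strict;
use warnings;

# SOAP 1.1 fault classes and the SOAP 1.2 codes that correspond to them.
# DataEncodingUnknown is absent because it has no SOAP 1.1 counterpart.
my %SOAP12_CODE_FOR = (
    VersionMismatch => 'VersionMismatch',
    MustUnderstand  => 'MustUnderstand',
    Client          => 'Sender',
    Server          => 'Receiver',
);

# Translate a SOAP 1.1 code, which may carry a dotted suffix, into the
# matching SOAP 1.2 code. Returns undef for a class that isn't predefined.
sub soap12_code_for {
    my ($code11) = @_;
    my ($class) = split /\./, $code11;
    return $SOAP12_CODE_FOR{$class};
}

print soap12_code_for('Client.MalformedBodyContent'), "\n";
```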

As was mentioned earlier, the SOAP 1.2 specification also goes into more depth on the exact contents of the remaining elements of a Fault block, depending on the error itself. Rather than reproduce all that here while the specification is still in revision, you should check the sections of the SOAP 1.2 specification that deal with faults. One unfortunate reality of developing complex software is the need to provide clear, intuitive, and specific error messages. ^[3]

^[3] A "requirement" that many large software companies continue to overlook.