2.2 XML


The Extensible Markup Language, or XML, is arguably one of the most useful and important technologies to emerge as a result of HTML and the World Wide Web. While the basic concepts and theories behind it aren't very complicated, it has proven to be a critical tool in solving numerous problems, from providing neutral data representation between very different architectures, to bridging the gap between software systems with minimal effort.

2.2.1 Self-Describing Data

XML is often referred to as "self-describing data" because the XML version of the data carries information you would otherwise need to describe the data format separately: element and parameter names, structural and hierarchical relationships between elements, and so forth. If the element tags (names) are chosen to be meaningful and descriptive, the resulting XML can often be read and reasonably understood apart from the applications that use it.

At the simplest level, understanding XML is a matter of understanding the definitions and roles of the three primary building blocks: elements, attributes, and data. There are other things that can appear in an XML file, and they're also explained in the following sections.

This XML fragment shows an element (book), an attribute (isbn), and data ("Programming Web Services with Perl"):

 <book isbn="0596002068">Programming Web Services with Perl</book> 
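
As a rough illustration of the three building blocks, the sketch below parses the fragment and pulls out each piece. It uses Python's standard xml.etree.ElementTree module only as a neutral way to try the idea; the variable names are assumptions made for this example, and the Perl modules discussed in Section 2.2.3 have their own interfaces.

```python
import xml.etree.ElementTree as ET

# The fragment from the text: one element, one attribute, and character data.
fragment = '<book isbn="0596002068">Programming Web Services with Perl</book>'

book = ET.fromstring(fragment)

element_name = book.tag          # the element name
attributes = dict(book.attrib)   # attribute names mapped to their values
data = book.text                 # the character data inside the element

print(element_name, attributes, data)
```

The attribute value comes back as a plain string; XML itself attaches no datatype to it.
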
2.2.1.1 Elements and namespaces

Elements are the building blocks of XML. To those familiar with HTML, elements are what make XML look like HTML at first glance. However, XML is very different from HTML, and much of the difference is in the rules governing the elements.

An element (the term is often used interchangeably with tag) is identified by a name, or symbol, made up of alphabetic and numeric characters and a handful of special characters (hyphens, underscores, or periods). The very first character of an element name must be either alphabetic or an underscore; numbers and the other special characters can't start an element name. Also, the leading three characters can't be XML, in any combination of case. These are reserved for the W3C's use. Unlike HTML, the case of the letters is important: <Start>, <start>, and <StArT> are all different tags in an XML document, and each would have its own matching closing tag: </Start>, </start>, and </StArT>, respectively. (But using tags so similar to each other is a good way to confuse others who use the data, so it isn't recommended.)
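
The naming rules can be approximated in a few lines. The following Python sketch is deliberately simplified: it handles ASCII names only, whereas real XML accepts many non-ASCII letters and sets the colon aside for namespace prefixes. The function name is_plausible_element_name is made up for this example.

```python
import re

# Simplified, ASCII-only version of the naming rules described above.
# Real XML also permits many non-ASCII letters, so this is an approximation.
NAME_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")


def is_plausible_element_name(name):
    """Return True if name satisfies the simplified rules in the text."""
    if NAME_PATTERN.fullmatch(name) is None:
        return False
    # Names beginning with "xml" in any mix of case are reserved.
    return not name.lower().startswith("xml")


for candidate in ["Start", "_private-data.v2", "2fast", "-dash", "XmlThing"]:
    print(candidate, is_plausible_element_name(candidate))
```

A check like this is useful mainly when generating element names from program data; a parser enforces the full rules on input.
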

An element either contains data or is an empty element. An empty element is a special subtype of element that is defined from the outset to not contain any data (plain data or other elements). HTML introduced several empty tags: <img> for images, <hr> for horizontal rules, and <br> to force line breaks. Empty elements in XML look slightly different. They are denoted by putting the / character, which usually denotes the closing tag, before the closing > character. For example, the XHTML equivalents of the previous tags (XHTML is a reengineering of HTML 4 in XML) look like <img/>, <hr />, and <br/>. The presence or absence of a space between the tag name and the /> sequence is completely up to you; it is a matter of readability and has no effect on the syntax of the element itself.
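
A quick way to see that the spacing is purely cosmetic is to parse several spellings and compare the results. This Python sketch uses the standard ElementTree module. It also includes an opening and closing tag pair with nothing between them, which a parser reports the same way as the empty-element form.

```python
import xml.etree.ElementTree as ET

# Three spellings of the same empty element.
variants = ["<br/>", "<br />", "<br></br>"]

parsed = [ET.fromstring(text) for text in variants]

# Each one has no character data and no child elements.
all_empty = all(elem.text is None and len(elem) == 0 for elem in parsed)

# Serializing them again yields one identical form for all three.
serialized = {ET.tostring(elem, encoding="unicode") for elem in parsed}

print(all_empty, serialized)
```
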

In the previous paragraph, XHTML was described as a reengineering of HTML in XML. Such a specific use of XML is called an application. This is a case where the terminology of XML can sometimes be confusing. Programmers are used to the word "application" meaning a piece of software, a program (or collection of programs) that runs to provide some service or functionality. Here, it means a specific use of XML. Saying that XHTML is an XML application is the same as saying that it is an application of XML, in this case to the problem of defining HTML with the inherent strictness of XML.

An element isn't identified only by its name. Elements (and attributes, as will be shown in the next section) may also have a namespace associated with them. Namespaces associate tag names with specific XML applications, which matters most in cases in which tag names might conflict with each other. An XML file may contain data expressed using elements from any number of different XML applications. The namespace prefixes are what keep the elements different enough to manage.

Namespaces are defined by declaring them within an opening tag. They may be associated with a specific prefix, which is then used on all elements governed by the namespace; or a default namespace may be declared. When a prefix is declared, the namespace is applied to an element by joining the prefix and the element name with a ":" character. An element may declare several namespaces at once, but may only declare one default. The snippet in Example 2-1 shows the declaration of two namespaces, including a default, and their application to different elements.

Example 2-1. Declaring namespaces within elements
 <message xmlns="urn:namespace:example"
          xmlns:xsd="http://www.w3.org/2001/XMLSchema">
   <messageStructure>
     <xsd:schema>
       ...
     </xsd:schema>
   </messageStructure>
   <messageBody>
     ...
   </messageBody>
 </message>

In this simple fragment, the opening tag, message, declares both a labeled namespace and a default. The default namespace applies to all tags that don't have a specific namespace label. The second namespace is associated with the label xsd, and uses the URI http://www.w3.org/2001/XMLSchema. All the elements with names that start with the characters xsd: are considered associated with that namespace. In the example, the schema element is linked to this namespace. Without the prefix, it would be in the same namespace as message, and the distinction between the description of a message and the declaration of a schema fragment might be lost.

If a tag doesn't declare a default namespace, it either inherits a default from the parent tag, or (if there is no default defined at any higher level) it's said to have an empty namespace. Likewise, it's not unusual to see XML documents that don't use default namespaces at all but instead declare their namespaces and explicitly qualify every element. This is common in SOAP messages, as you'll see in later chapters.
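
To see how a namespace-aware parser resolves prefixes and defaults, the sketch below feeds the structure of Example 2-1 to Python's standard ElementTree module. The root element is closed with a proper end tag, and the "..." placeholders are kept as ordinary text. ElementTree's "{URI}name" notation is specific to that library and is shown here only to make the namespace of each element visible.

```python
import xml.etree.ElementTree as ET

# The structure of Example 2-1, with the root element properly closed.
document = """\
<message xmlns="urn:namespace:example"
         xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <messageStructure>
    <xsd:schema>...</xsd:schema>
  </messageStructure>
  <messageBody>...</messageBody>
</message>
"""

root = ET.fromstring(document)

# ElementTree reports each name as "{namespace-URI}local-name".
qualified_names = [elem.tag for elem in root.iter()]
for name in qualified_names:
    print(name)
```

The unprefixed elements inherit the default namespace from message, while schema lands in the XML Schema namespace. The prefix itself does not appear in the result, because only the URI identifies the namespace.
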

2.2.1.2 Attributes

Attributes provide information about an element, as opposed to the information the element itself provides (either in its contents, or merely by its presence). This is another case in which XML strays from the familiar ground of HTML; XML attributes always have a name and a value.

An attribute's name must follow the same rules as an element name; alphabetic, numeric, and a few special characters are all that are allowed. Just like elements, the leading character of an attribute must be alphabetic or an underscore. The value of an attribute must always be quoted, with either single or double quotes.

It is considered a good design principle to keep attributes focused on the element to which they are attached. Table 2-3 shows some examples of attributes, including some that violate this principle.

Table 2-3. Examples of attributes

| Element and attribute | Notes |
| --- | --- |
| <Text lang="english"> | Good; lang clearly refers to the language the content (text) of the element is in. |
| <age units="years"> | Also good; the attribute assists applications in interpreting the content. |
| <cost purch_order="3554"> | Dubious; the relevance of a purchase order number in a cost field is questionable. It represents data that probably should be associated with a higher-level element. |
| <img src="a.gif" noborder /> | Bad; while empty tags are capable of having attributes, and src is a valid attribute, attributes can't be "empty" in the sense of having no value component, so noborder is invalid. |
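
The last row of Table 2-3 is easy to confirm with any XML parser. The Python sketch below uses the standard ElementTree module. The helper name parses_cleanly is made up for this example, and the noborder="noborder" spelling is only one possible way to give the attribute a value.

```python
import xml.etree.ElementTree as ET


def parses_cleanly(text):
    """Return True if text parses as XML, False otherwise."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# A value-less attribute, as HTML allows, is rejected outright.
valueless = parses_cleanly('<img src="a.gif" noborder />')

# Giving the attribute an explicit value makes the element acceptable.
with_value = parses_cleanly('<img src="a.gif" noborder="noborder" />')

# Unquoted values are rejected as well.
unquoted = parses_cleanly("<age units=years>42</age>")

print(valueless, with_value, unquoted)
```
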

Besides the limitation of not using XML as the leading three characters of a name, there are some attribute names that are reserved in XML to have special meaning. These are shown in Table 2-4. An attribute may appear in a given element only once, but aside from that, their use and content are very flexible. XML entities (explained in the next section) expand within attribute values.

Table 2-4. Reserved attributes in XML

| Attribute | Function |
| --- | --- |
| xml:lang | Specifies the (human) language the content of the element is in, such as en for English. |
| xml:space | Used to specify how the XML parser treats whitespace in an element's data. |
| xml:link | This conveys information to an XLink processor. XLink is a type of XML processing, beyond the scope of this book. |
| xml:attributes | Also related to XLink processing, this is used to remap attributes in cases in which there could be a conflict between names XLink is expecting to see. (This may be changed to xlink:attributes in a future revision of the relevant specifications.) |

Attributes aren't generally given namespace qualification, unless the reference is to an attribute from a completely different XML application. In the previous examples of reserved attributes, all are prefixed with xml:, which is an indication that they belong to the core XML definition. In the previous explanation of namespaces, the declaration of a given namespace looked like an attribute whose name was xmlns.

Declaring a namespace prefix is a matter of using xmlns itself as a "prefix" on an attribute whose name is the desired prefix and whose value is the URI of the namespace:

 xmlns:xsd="http://www.w3.org/2001/XMLSchema" 

This example declares xsd as a prefix, but syntactically it can be confusing, given the fact that it looks more like an attribute itself. The later chapters on SOAP show elements using attributes from other sources to declare the datatype of an element by referencing a type attribute from the XML schema namespace, and in many cases by providing a value that is namespace-qualified into a related (but different) namespace.

2.2.1.3 Data

Data, the text within elements, is pretty much self-explanatory. The format and layout are up to the person who designed the XML structure. Data is the content between opening and closing tags of an element, minus specialized pieces such as processing instructions, comments, etc.

In many XML applications, data and elements aren't mixed as content. That is, the element hierarchy is designed such that an element contains either data or other elements, but not both. The syntax of UDDI (Universal Description, Discovery and Integration, a technology related to SOAP and covered in a later chapter) is a good example of this sort of design. As a counterexample, the DocBook XML application defines a wide range of elements, most of which allow a mixture of data and other elements as their content.
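
Mixed content is where parser interfaces start to differ noticeably. As one illustration, Python's standard ElementTree module splits the data around a child element into text and tail pieces. The para and emphasis element names below are made up in the spirit of DocBook and are not taken from any real document.

```python
import xml.etree.ElementTree as ET

# A made-up paragraph in which character data and a child element are mixed.
para = ET.fromstring("<para>Use <emphasis>XML</emphasis> wisely.</para>")

emphasis = para[0]

leading_text = para.text        # data before the first child element
child_text = emphasis.text      # data inside the child element
trailing_text = emphasis.tail   # data after the child, still inside <para>

# Reassemble the full text content of the paragraph.
full_text = "".join(para.itertext())

print(repr(leading_text), repr(child_text), repr(trailing_text))
print(full_text)
```

Designs that avoid mixed content never need the tail piece, which is one reason data-oriented XML applications tend to be simpler to process.
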

A special kind of data is the XML entity reference. This is also a familiar syntax to those experienced with HTML. An entity reference is a sequence that starts with & and ends with ; with no space in between. While HTML supports a large number of entities, there are only five predefined entities in XML, shown in Table 2-5. XML also allows any character to be expressed with a similar syntax, using the Unicode value of the character, in either decimal or hexadecimal, as the contents between the delimiters. These are also shown in Table 2-5.

Table 2-5. XML entities

| Entity | Character | Notes |
| --- | --- | --- |
| &amp; | & | Not allowed inside a processing instruction (see next section) |
| &lt; | < | Use inside attributes quoted with " " characters, to avoid processing problems |
| &gt; | > | Use after ]] in ordinary text and inside processing instructions |
| &quot; | " | Can be used inside attributes quoted with " " |
| &apos; | ' | Can be used within attributes quoted with ' ' |
| &#nnn; | Variable | The character whose decimal Unicode value is nnn (with no leading zeros) is returned |
| &#xnnn; | Variable | The character whose hexadecimal Unicode value is nnn (with no leading zeros) is returned |

A common mistake made by people moving from HTML to XML is to assume that all the same entities (such as &eacute; for é) are available. Depending on the parser and the application using it, the result may be a fatal error, or the unknown entities may be silently discarded.
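
The difference between the predefined entities and HTML's larger set shows up as soon as a document is parsed. In this Python sketch, which uses the standard ElementTree module, the note element is a made-up wrapper. The parser expands the predefined entities and numeric references, but treats the HTML-only entity as a fatal error because nothing declares it.

```python
import xml.etree.ElementTree as ET

# Predefined entities and numeric character references are always available.
ok = ET.fromstring("<note>caf&#233; &amp; caf&#xE9; &lt;open&gt;</note>")
expanded = ok.text

# An HTML-only entity such as &eacute; is undefined unless a DTD declares it.
try:
    ET.fromstring("<note>caf&eacute;</note>")
    html_entity_accepted = True
except ET.ParseError:
    html_entity_accepted = False

print(expanded)
print(html_entity_accepted)
```

Other parsers may react differently to the unknown entity, as the text notes; this run shows only how one Expat-based parser behaves.
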

2.2.1.4 Comments, processing instructions, and specialized content

No language is complete without the ability to provide notes to the reader that don't interfere with the processing of the file itself, and XML is no exception to this rule. Comments in XML follow the same syntax as in HTML; they start with the sequence <!-- , and end with the sequence --> .

Comments don't nest. A comment will end at the first occurrence of the closing sequence the parser finds, but the opening sequence may occur within the scope of the comment (and will be considered a part of the comment's text).

Processing instructions are special sequences that provide information to the application that is processing the XML document. They specify the type of processor that should receive the instruction, so an instruction for an XSLT (Extensible Stylesheet Language Transformations) processor is clearly marked as such and can then be discarded by an ordinary XML processor. A processing instruction looks like this:

 <?xml-stylesheet href="oreillystyle.xsl" type="text/xsl"?> 

Note the special syntax of the opening and closing delimiters; <? and ?> are what denote a processing instruction. The string immediately following the opening delimiter is called the target of the instruction, and everything else is the data. The data is usually made up of attributes, like an ordinary element. But it doesn't have to be, and the instruction may have no data.
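
The split between target and data can be observed directly with a DOM-style parser. The sketch below uses Python's standard xml.dom.minidom module. The trailing <doc/> element is a placeholder added only so that the input is a complete document.

```python
from xml.dom import minidom

# The stylesheet instruction from the text, followed by a placeholder
# root element so that the document is complete.
source = '<?xml-stylesheet href="oreillystyle.xsl" type="text/xsl"?><doc/>'

document = minidom.parseString(source)
instruction = document.firstChild

is_instruction = instruction.nodeType == instruction.PROCESSING_INSTRUCTION_NODE
target = instruction.target   # the string right after "<?"
data = instruction.data       # everything else, as one uninterpreted string

print(is_instruction, target, data)
```

The data arrives as a single string. Although it looks like a pair of attributes, interpreting it is left entirely to whichever application recognizes the target.
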

This instruction is the first line of most XML documents:

 <?xml version="1.0" encoding="iso-8859-1"?> 

It tells the XML processor that the document conforms to the 1.0 version of the XML specification, and that the character set (or encoding) used to express the document is the ISO 8859-1 set, also known as "Latin-1." XML documents can use any recognizable character set, and the character set is what defines alphabetic and numeric characters, so <café> would be a valid element name in the given character set.
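
The effect of the encoding declaration is easiest to see when a document is handled as raw bytes. This Python sketch, using the standard ElementTree module, encodes a small made-up document as Latin-1 and lets the parser decode it from the declaration. The element name and its content are illustrative only.

```python
import xml.etree.ElementTree as ET

text = '<?xml version="1.0" encoding="iso-8859-1"?><caf\u00e9>au lait</caf\u00e9>'

# Encode the document as Latin-1 so that the bytes match the declaration.
raw_bytes = text.encode("iso-8859-1")

# The parser reads the declaration and decodes the rest accordingly.
root = ET.fromstring(raw_bytes)
element_name = root.tag

print(element_name)
```

If the declared encoding and the actual bytes disagree, a parser will typically either fail or produce the wrong characters, so the two have to be kept in step.
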

Lastly, there are some specialized sequences used in XML documents. Most are rarely, if ever, seen in the context of web services. However, they deserve mention. One of these is the document type declaration, <!DOCTYPE>. This is most often used to associate a document with a Document Type Definition (DTD), but can also be used to declare entities. DTDs and entity declarations are covered in the next section.

Another type of specialized content is the CDATA section, which is used to express a segment of the document that shouldn't get any special processing. Within a CDATA section, entities aren't expanded, and elements aren't noted as anything other than character data. A CDATA section is initiated with the (complex) sequence <![CDATA[ and continues until the parser sees the ending sequence, ]]>.
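
A short experiment shows what "no special processing" means in practice. In this Python sketch, based on the standard ElementTree module, the code element and its contents are made up. The characters that would normally need escaping, and even something that looks like an entity reference, come through untouched.

```python
import xml.etree.ElementTree as ET

# Inside the CDATA section, <, &, and even "&amp;" are plain characters.
source = '<code><![CDATA[if (a < b && c) { print "&amp;"; }]]></code>'

snippet = ET.fromstring(source)
content = snippet.text

print(content)
```
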

2.2.2 Describing XML with DTD and XML Schema

XML documents have two levels of correctness: well-formed and valid. Well-formed simply means that the document is structurally sound: all opening tags have matching closing tags, elements don't overlap each other, and so forth. Most parsers generate errors for a document that isn't well-formed, although some are built to tolerate such faults. Being valid is a different subject completely.
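
Well-formedness is the level that every parser checks. The Python sketch below uses the standard ElementTree module, which does not validate, so it demonstrates only this first level. The helper name check_well_formed is made up for this example.

```python
import xml.etree.ElementTree as ET


def check_well_formed(text):
    """Return None if text is well-formed, else the parser's error message."""
    try:
        ET.fromstring(text)
    except ET.ParseError as error:
        return str(error)
    return None


nested = check_well_formed("<a><b>text</b></a>")        # properly nested
overlapping = check_well_formed("<a><b>text</a></b>")   # elements overlap
unclosed = check_well_formed("<a><b>text</b>")          # <a> is never closed

for result in (nested, overlapping, unclosed):
    print(result)
```
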

An XML document may be associated with a description document, a metadocument of sorts, against which it can be validated. Validation ensures that the elements that appear are allowed in the XML application the document supposedly represents, and that the order in which they appear is also permitted. Validation also checks attributes and whether data content is permitted where it appears.

There are different methods for expressing the valid syntax of an XML document, and this section will address the two most common: the DTD and the XML Schema representation.

2.2.2.1 The DTD

Initially, the Document Type Definition was the only tool for describing an XML application. The DTD syntax is inherited from XML's (and HTML's) roots in SGML, the Standard Generalized Markup Language. DTDs have the advantage of a wide range of SGML-oriented tools already in the marketplace.

A DTD can declare elements, attributes, and entities. Further, the way entities are declared and used often makes DTDs clearer and more concise. Example 2-2 shows a sample DTD.

Example 2-2. A typical DTD layout
 <!ENTITY % container '(name, version?, hidden?, signature+,
                        help?, package?, code)' >
 <!ELEMENT  proceduredef  %container; >
 <!ELEMENT  methoddef     %container; >
 <!ELEMENT  functiondef   %container; >
 <!ELEMENT  name          (#PCDATA) >
 <!ELEMENT  version       (#PCDATA) >
 <!ELEMENT  hidden        EMPTY >
 <!ELEMENT  signature     (#PCDATA) >
 <!ELEMENT  help          (#PCDATA) >
 <!ELEMENT  code          (#PCDATA) >
 <!-- Attribute values are typed CDATA; treating language as optional
      (#IMPLIED) is an assumption. -->
 <!ATTLIST  code          language CDATA #IMPLIED >

This example doesn't use all the features that a DTD may exhibit. The first declaration defines an entity, similar in nature to &amp;, which is already familiar to HTML developers. Like the character entities, container acts as a macro expansion. In the next three declarations, three elements are defined. Each can have the same sort of content, the sequence of elements defined by the container entity. This gives the visual effect of "defining" these elements as "containers." In fact, they are meant as the three choices for the document's top-most element, hence the association.

Example 2-2 is meant only to provide a rough overview of the syntax. The DTD is fading in popularity against the XML Schema language. Many books specifically on XML still cover DTD syntax in its entirety, however, such as XML in a Nutshell by Elliotte Rusty Harold and W. Scott Means (O'Reilly).

2.2.2.2 XML Schema

The XML Schema language is a much more flexible, and therefore much more complex, method of describing document content. It is covered in greater detail in a later section; here are some of the reasons for choosing it over DTD syntax.

First and foremost, XML Schema is a complete XML application, unlike the DTD syntax. A schema document can be parsed and processed as XML, whereas the tools for handling DTDs are confined mainly to the SGML world. This means that the same tools being used by the software application itself can also manage the syntax description. This is reason enough in many cases.
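
The practical consequence is that a schema can be inspected with the same parser used for the data. The Python sketch below reads a tiny, made-up schema with the standard ElementTree module and lists the elements it declares. The two declarations are illustrative and are not drawn from any real schema.

```python
import xml.etree.ElementTree as ET

XSD = "http://www.w3.org/2001/XMLSchema"

# A tiny, made-up schema declaring two elements.
schema_text = """\
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="name" type="xsd:string"/>
  <xsd:element name="version" type="xsd:string"/>
</xsd:schema>
"""

schema = ET.fromstring(schema_text)

# Because the schema is just XML, listing its declarations needs nothing
# beyond the ordinary parser API.
declared = [decl.get("name") for decl in schema.findall("{%s}element" % XSD)]

print(declared)
```

Doing the same with a DTD would require a separate parser for the DTD syntax.
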

As an XML application, XML Schema also integrates more easily with other XML applications. SOAP uses XML Schema datatypes as the basis for data modeling in remote method calls. WSDL, the Web Services Description Language (covered in a later chapter), uses XML Schema directly within WSDL files to provide the description of complex elements and datatypes used within the service description.

XML Schema isn't the only post-DTD description format, but it has the endorsement and backing of the W3C, which has lent it a great deal of momentum and credibility. The syntax and structure of XML Schema are covered later in this chapter.

2.2.3 XML Modules and Tools for Perl

Perl has a multitude of XML-related tools. In many cases, the challenge isn't whether a module exists to solve a given task, but rather which of the available modules would be the best choice. Since this book is less focused on XML itself, this section will just briefly examine some of the parsing-related tools. In fact, the toolkits for XML-RPC and SOAP that are available on CPAN abstract the underlying XML parser from the user, freeing the programmer to focus on other issues.

2.2.3.1 XML::Parser

This parser was the first XML parser for Perl. Larry Wall, Perl's author, developed its earlier incarnations. Over the years, the responsibility for its maintenance has changed hands several times, but it remains a very fast parser. It is built around the Expat parser library for C written by James Clark.

The parser itself suffers from certain drawbacks and limitations. It doesn't have full namespace support because namespaces weren't part of the XML suite of specifications when Expat was written. It uses an event model that was designed before XML experts settled on the SAX (Simple API for XML) and SAX2 models. Thus, while the event model is similar, it isn't fully compatible with SAX or SAX2. Furthermore, the parser doesn't validate; it detects only whether a document is well-formed or not.
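
The flavor of an Expat-style event model can be shown without Perl, because Python's standard xml.parsers.expat module is a binding to the same Expat library. The sketch below is not the XML::Parser interface; the handler attribute names are Python's, and the events list is an assumption made for this example.

```python
import xml.parsers.expat

events = []


def start_element(name, attrs):
    events.append(("start", name, attrs))


def end_element(name):
    events.append(("end", name))


def character_data(text):
    # Expat may deliver one run of text in several pieces.
    events.append(("text", text))


parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler = start_element
parser.EndElementHandler = end_element
parser.CharacterDataHandler = character_data

# The second argument marks this chunk as the final piece of the document.
parser.Parse(
    '<book isbn="0596002068">Programming Web Services with Perl</book>', True
)

for event in events:
    print(event)
```

The application never sees a tree; it receives callbacks in document order and must keep whatever state it needs. That is what makes this style fast and light on memory.
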

This shouldn't be taken as a condemnation, however. Many XML-based packages on CPAN are built around this parser, and it has become more and more portable as time has gone on. This parser is a useful tool that will continue to be used for some time to come.

2.2.3.2 XML::LibXML and XML::LibXSLT

The new kid on the XML parsing block is the XML::LibXML module. This is a validating parser built around a C library for parsing XML called simply libxml2. The C library itself integrates smoothly with a second library called libxslt, which applies XSLT transformations to XML based on stylesheet inputs. As such, the XML::LibXSLT package is usually also installed at the same time XML::LibXML is.

This parser is also very fast, and forms the basis for a SAX/SAX2 package available through CPAN. More and more tools are being built around this parser due to the more advanced features offered through libxml2 over Expat. The interface offered by this package gives the user a choice of parsing based on either DOM (the Document Object Model) or SAX events.

2.2.3.3 XML::SAX

Where the XML::LibXML parser supports a parsing style that emits SAX or SAX2 events, the XML::SAX package provides a more thorough implementation of the SAX and SAX2 models for developers to use. It can use the XML::LibXML parser as its basis, shoring it up with packages to manage namespaces. It also provides a pure-Perl implementation of a SAX-compliant parser that programs can fall back on if the faster parser isn't available.
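
SAX2 is a language-neutral model, so its namespace handling can be sketched with Python's standard xml.sax package. This is not the XML::SAX interface itself; the NameCollector class and the input document are made up for this example.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_namespaces


class NameCollector(ContentHandler):
    """Record the (namespace URI, local name) pair of every element."""

    def __init__(self):
        super().__init__()
        self.names = []

    def startElementNS(self, name, qname, attrs):
        # With namespace processing on, name is a (uri, localname) tuple.
        self.names.append(name)


source = (
    b'<message xmlns="urn:namespace:example" '
    b'xmlns:xsd="http://www.w3.org/2001/XMLSchema">'
    b"<xsd:schema/></message>"
)

handler = NameCollector()
parser = xml.sax.make_parser()
parser.setFeature(feature_namespaces, True)
parser.setContentHandler(handler)
parser.parse(io.BytesIO(source))

print(handler.names)
```

Compared with a bare Expat-style callback, the handler receives names already resolved to their namespaces, which is the main thing the SAX2 model adds.
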

2.2.3.4 XML::XPath, XML::Simple, and others

There are modules that offer alternatives to the SAX-based approaches to XML parsing. The W3C defined the XPath syntax as a way of referencing data within large XML documents using a path syntax based on the element names and attributes. The XML::XPath module implements the XPath syntax completely, while attempting to provide a means for other packages to add in extensions. It uses XML::Parser as the base parsing engine for the documents themselves.
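
To give a feel for path-based access, the sketch below runs two simple path expressions with Python's standard ElementTree module. ElementTree supports only a limited subset of XPath, unlike the complete implementation described above. The catalog document is made up, and only the first ISBN comes from the text.

```python
import xml.etree.ElementTree as ET

# A made-up catalog; only the first ISBN comes from the text.
catalog = ET.fromstring("""\
<catalog>
  <book isbn="0596002068"><title>Programming Web Services with Perl</title></book>
  <book isbn="0000000000"><title>A Placeholder Title</title></book>
</catalog>
""")

# Select the title of the book whose isbn attribute has a given value.
matches = catalog.findall("./book[@isbn='0596002068']/title")
titles = [title.text for title in matches]

# Select every title anywhere below the root.
all_titles = [title.text for title in catalog.findall(".//title")]

print(titles)
print(all_titles)
```
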

Another XML module worth noting is the XML::Simple package. This package provides one of the simplest interfaces to XML available. It converts XML data to a hash-table structure, maintaining as much of the inherent nature of the data as it can. Though not suitable for more demanding projects, such as handling SOAP messages, this package manages to meet the needs of many software projects.
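
The idea of folding XML into a hash-table structure can be sketched in a few lines of Python. This is not XML::Simple's actual algorithm or output format; the to_dict function, the "content" key, and the author names are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def to_dict(element):
    """Convert an element to nested dicts, lists, and strings (lossy)."""
    result = dict(element.attrib)
    for child in element:
        value = to_dict(child)
        if child.tag in result:
            # Repeated names are collected into a list.
            existing = result[child.tag]
            if not isinstance(existing, list):
                result[child.tag] = existing = [existing]
            existing.append(value)
        else:
            result[child.tag] = value
    text = (element.text or "").strip()
    if not result:
        return text           # a leaf element collapses to its text
    if text:
        result["content"] = text
    return result


book = ET.fromstring(
    '<book isbn="0596002068"><title>Programming Web Services with Perl</title>'
    "<author>First Author</author><author>Second Author</author></book>"
)
as_dict = to_dict(book)

print(as_dict)
```

The conversion loses information such as element order and the distinction between attributes and child elements. That loss is the kind of limitation that makes this style a poor fit for messages such as SOAP, where structure and namespaces matter.
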

This only scratches the surface of the XML tools available in Perl. The XML::RSS package was referred to earlier in the LWP programming example, and later chapters cover XML-based modules for XML-RPC and SOAP. Full details about parsing XML with Perl are provided in the book Perl & XML by Erik T. Ray and Jason McIntosh (O'Reilly).

Using XML in an application means trading efficiency in speed and memory for flexibility. Most XML parsers add significantly to the size of an application and slow it down, and XML data itself takes more storage than the same data would in a more compact, application-specific format. XML is best used in those places where the benefits outweigh the drawbacks, such as sharing data between several different applications or languages.



Programming Web Services with Perl
ISBN: 0596002068
EAN: 2147483647
Year: 2000
Pages: 123
