Fundamentals of XML | Oracle Application Server 10g Web Development (Oracle Press)

This chapter is not intended as a comprehensive guide to XML (for references on learning XML, see Appendix A). Addressing XML completely is beyond the scope of a single chapter in any book, and there are many fine texts available that cover the subject much more completely. Rather, in this section, we'll take a brief look at the fundamental aspects that make XML so useful for communicating data, to give you a basis for the subsequent sections on Oracle's implementation. We'll examine the basic structure of XML documents, parsing and validation, and converting XML data into other formats.

XML Document Declaration

Let's examine the first line of the XML document in Listing 1 again:

 <?xml version="1.0" encoding="UTF-8"?>

This is known as the XML declaration, and it identifies the contents of the file as XML data that adheres to the XML standard. It is not uncommon to see XML data in a file without this declaration, but without it, there can be no expectation of adherence to the conventions. If it does appear in the document, it must appear before any data elements in the XML document. The declaration specifies a couple of pieces of information to both human readers and software parsers of the document.

The first piece of information is the version of the XML standard that the document follows. The XML standard is active and is being improved on an ongoing basis. The creators of the standard realized that there would be a need to indicate, especially to parsing software, what features of the various versions of the XML standard applied to a particular document. At the time of writing, the applicable version numbers are 1.0 and 1.1. The second piece of information specified is the document character encoding. The standard supports character sets listed by the Internet Assigned Numbers Authority (IANA) (http://www.iana.org/assignments/character-sets).
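As a rough illustration of how parsing software sees these two pieces of information, the sketch below uses the expat parser from Python's standard library (chosen only for illustration; it is not one of the Oracle parsers discussed later). The helper name read_declaration is hypothetical.

```python
import xml.parsers.expat


def read_declaration(xml_bytes):
    """Return the version and encoding named in the XML declaration, if any."""
    found = {}

    def on_declaration(version, encoding, standalone):
        # expat calls this only when the document starts with <?xml ...?>
        found["version"] = version
        found["encoding"] = encoding

    parser = xml.parsers.expat.ParserCreate()
    parser.XmlDeclHandler = on_declaration
    parser.Parse(xml_bytes, True)  # True marks the final chunk of input
    return found


with_declaration = read_declaration(
    b'<?xml version="1.0" encoding="UTF-8"?><ADDRESSBOOK/>'
)
without_declaration = read_declaration(b"<ADDRESSBOOK/>")
print(with_declaration)
print(without_declaration)
```

A document without the declaration still parses here, but the parser has nothing to report about its version or encoding.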

XML Document Structure

An XML document has one element called the root element. All other elements are children (or grandchildren of varying degrees) of the root element. Technically, the root element could have attributes and its own data, but more commonly, it has no attributes or data of its own as its primary purpose is to identify the document data at the highest conceptual level. An example of this case is the <ADDRESSBOOK> root element of the XML document in Listing 1. The XML standard also defines how elements are to be nested. More precisely, all XML elements must begin and end in the same element scope. For example, consider the following XML fragment:

 <SITE>
      <NAME>Joes Garage</NAME>
      <WEBURL>http://www.joesgarage.com/</WEBURL>
 </SITE>

This XML fragment is correctly nested. The <SITE> element, because it is a parent of the <NAME> and <WEBURL> elements in the hierarchy, correctly nests the others within it. The <NAME> and <WEBURL> elements in Listing 2 do not overlap each other because they are at the same hierarchical level. However, the following copy of a similar fragment is incorrect (Listing 3):

 <SITE>
      <NAME>Joes Garage
      <WEBURL>http://www.joesgarage.com/
      </NAME>
      </WEBURL>
 </SITE>

Again, by the rules of proper XML element nesting, the </NAME> tag in Listing 3 must come before the <WEBURL> tag because the <NAME> and <WEBURL> elements are at the same hierarchical level. In short, once you are inside an XML element, another one cannot be specified at the same hierarchical level until the current one is closed. Now consider a third variation:

 <SITE>
      <NAME>Joes Garage
           <WEBURL>http://www.joesgarage.com/</WEBURL>
      </NAME>
 </SITE>

This code fragment (Listing 6) is correct because the <NAME> element is closed off after the </WEBURL> tag, making the <WEBURL> element a hierarchical child of <NAME>. Note that the indentation used in the XML document examples in this chapter is there for readability only, to make the element hierarchy easier to see. An XML document does not require such formatting to be proper XML (i.e., the elements and text data can all be left aligned in the text file); however, it is good practice and makes it easier to track down problems when trying to troubleshoot a parsing error.
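The nesting rules can be checked mechanically. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree parser as a stand-in for any XML parser; the function name is_well_formed is hypothetical. It feeds the three <SITE> fragments discussed above to the parser and reports which ones are accepted.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False on a parse error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Siblings: NAME is closed before WEBURL opens.
siblings = (
    "<SITE><NAME>Joes Garage</NAME>"
    "<WEBURL>http://www.joesgarage.com/</WEBURL></SITE>"
)
# Overlapping: NAME is closed while WEBURL is still open.
overlapping = (
    "<SITE><NAME>Joes Garage<WEBURL>http://www.joesgarage.com/"
    "</NAME></WEBURL></SITE>"
)
# Nested: WEBURL opens and closes entirely inside NAME.
nested = (
    "<SITE><NAME>Joes Garage<WEBURL>http://www.joesgarage.com/"
    "</WEBURL></NAME></SITE>"
)

results = [is_well_formed(text) for text in (siblings, overlapping, nested)]
print(results)
```

Only the overlapping fragment is rejected; the parser stops at the mismatched closing tag.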

Document Type Definitions and Schemas

One of the benefits of XML is that parsing the data in an XML document is easier than, say, writing custom code to parse a comma-separated value (CSV) file. In the former case, a prefabricated parser does all the work. In the latter, you have to write the parsing code yourself. The XML standard also defines a specific methodology for ensuring that the data in a given XML document will fit the expected model.

A Document Type Definition (DTD) is one way of accomplishing this. A DTD describes to a parser what type of data can be expected in an XML document, what values are allowed for elements and attributes, and how the hierarchy of elements will be arranged. In short, you, as the publisher of the XML document, can issue a DTD to declare (to any parser that might read the XML document) what parameters the data in the XML document must fit for the document to be considered valid. If the document is verified to be valid, the parser does its work. If the document is found to be invalid, the parser rejects the document and no work is done. The XML standard defines two structural verification concepts for its documents: well-formed and valid.

Note	A document must be well-formed to be valid, but the opposite is not necessarily true. In other words, a document can be well-formed without being valid. An XML document without an associated DTD can be well-formed but, by definition, cannot be valid.

Well-Formed and Valid XML Documents

An XML document is said to be well-formed if it follows the basic structure rules of XML. For instance, a well-formed document contains one or more elements, only one of which can be the root element, and the subordinate elements all nest properly (i.e., adhere to proper element scope). No elements can be open-ended in XML. In other words, all XML elements must have a discernible beginning and end. For a complete description of well-formed XML, please see the XML specification (http://www.w3.org/XML/).

An XML document is valid if it has an associated DTD and the XML data in the document fits the constraints expressed in the DTD.

XML Document Validation: Document Type Definition (DTD)

For an XML document to be considered valid, we need something to validate its contents against. This is where the DTD comes in. The listing below (Listing 7) shows a possible DTD for the Address Book XML document in Listing 1.

 <?xml encoding="UTF-8"?>
 <!ELEMENT ADDRESSBOOK (ENTRY)+>
 <!ELEMENT ENTRY (FIRSTNAME,LASTNAME,ADDRLINE1,
      ADDRLINE2?,CITY,STATE,ZIP,HOMEPHONE?,EMAIL?)>
 <!ATTLIST ENTRY ID CDATA #REQUIRED>
 <!ELEMENT FIRSTNAME (#PCDATA)>
 <!ELEMENT LASTNAME  (#PCDATA)>
 <!ELEMENT ADDRLINE1 (#PCDATA)>
 <!ELEMENT ADDRLINE2 (#PCDATA)>
 <!ELEMENT CITY      (#PCDATA)>
 <!ELEMENT STATE     (#PCDATA)>
 <!ELEMENT ZIP       (#PCDATA)>
 <!ELEMENT HOMEPHONE (#PCDATA)>
 <!ELEMENT EMAIL     (#PCDATA)>

You can think of a DTD as a set of instructions that identify what type of XML contents should be considered valid for a particular XML document. As we'll see later in this section, software called an XML parser uses DTDs in this way to determine whether an XML document is valid or not.

Let's look at the rules set forth by DTDs. In general, a DTD allows you to specify two major components of an XML document: the elements and their attributes (if any). The DTD also shows a clear relationship in the hierarchy of the elements represented in the XML documents it will be used to validate. Table 16-1 lists some of the more common DTD components:

Table 16-1: Some of the More Common DTD Components
DTD Component	Example Declaration	Definition
Element Declaration
<!ELEMENT ename EMPTY>	<!ELEMENT X EMPTY>	The element declaration that specifies an empty element (i.e., an element containing no data or other elements) with a name specified by ename.
<!ELEMENT ename ANY>	<!ELEMENT BOOK ANY>	The element declaration of an element with name specified by ename. The element may contain any mixture of character data and elements. The exact content of the element is, therefore, undefined.
<!ELEMENT ename (datatype)>	<!ELEMENT TITLE (#PCDATA)>	The element declaration that specifies an element called ename that contains data of type datatype.
<!ELEMENT ename (child, ...)>	<!ELEMENT BOOK (TITLE, PUBLISHER, ISBN)>	The element declaration that specifies an element called ename that contains a set of one or more child elements.
<!ELEMENT ename (datatype|child1|child2)*>	<!ELEMENT USERPROFILE (#PCDATA|CUSTOM|DEFAULT)*>	The element declaration that specifies an element called ename that contains any mixture of data of type datatype and child1 or child2 elements. When #PCDATA is combined with child elements, it must come first and the group must end with *.
Attribute Declaration
<!ATTLIST ename adef>	<!ATTLIST BOOK TITLE CDATA #REQUIRED>	The attribute declaration for element ename. Attribute details are specified in adef, which consists of the attribute name, datatype, and default value handling.
Optionality
(child1,child2)	<!ELEMENT PERSON (FIRSTNAME, LASTNAME)>	The set of child elements consists of child1 AND child2, in that order.
(child1|child2)	<!ELEMENT PHONENUM (HOME|WORK|MOBILE)>	The set of child elements consists of child1 OR child2; exactly one of the listed alternatives appears.
Datatypes
CDATA	<!ATTLIST ENTRY ID CDATA #REQUIRED>	Specifies that an attribute value is a character string that does not contain markup that needs to be parsed. CDATA is an attribute type and cannot be used in an element content model.
PCDATA	<!ELEMENT SUBJECT (#PCDATA)>	Specifies that the data is a character string but may contain markup information and needs to be parsed.
Repeat Rules
?	<!ELEMENT ADDRESS (ADDRL1, ADDRL2, CITY, STATE, ZIP)?>	The element contains zero or one sets consisting of the elements ADDRL1, ADDRL2, CITY, STATE, ZIP.
?	<!ELEMENT SINGLECOMMENT (CMTTEXT?)>	The element contains a set consisting of zero or one instance of the CMTTEXT element.
*	<!ELEMENT ADDRESS (ADDRL1, ADDRL2, CITY, STATE, ZIP)*>	The element contains zero or more sets consisting of the elements ADDRL1, ADDRL2, CITY, STATE, ZIP.
*	<!ELEMENT MULTICOMMENT (CMTTEXT*)>	The element contains a set consisting of zero or more instances of the CMTTEXT element.
+	<!ELEMENT ADDRESS (ADDRL1, ADDRL2, CITY, STATE, ZIP)+>	The element contains one or more sets consisting of the elements ADDRL1, ADDRL2, CITY, STATE, ZIP.
+	<!ELEMENT MULTICOMMENT (CMTTEXT+)>	The element contains a set consisting of one or more instances of the CMTTEXT element.

Let's take a look at our DTD in Listing 7 and examine how it relates to our XML document from Listing 1 (Table 16-2).

Table 16-2: Analysis of the Address Book XML Document and DTD
DTD Declaration	XML Element Tags	Rule Description
<!ELEMENT ADDRESSBOOK (ENTRY)+>	<ADDRESSBOOK> </ADDRESSBOOK>	Specifies the root element for the Address Book XML document; can contain one or more (+) ENTRY elements.
<!ELEMENT ENTRY (FIRSTNAME, LASTNAME, ADDRLINE1, ADDRLINE2?, CITY, STATE, ZIP, HOMEPHONE?, EMAIL?)>	<ENTRY ID="1"> </ENTRY>	Specifies the ENTRY element and that it contains one set of the following: one FIRSTNAME element, one LASTNAME element, one ADDRLINE1 element, zero or one ADDRLINE2 element, one CITY element, one STATE element, one ZIP element, zero or one HOMEPHONE element, and zero or one EMAIL element.
<!ATTLIST ENTRY ID CDATA #REQUIRED>	<ENTRY ID="1">	The ENTRY tag has an attribute called ID that contains unparsed character data. The attribute has no default value and is required on each ENTRY element.
<!ELEMENT FIRSTNAME (#PCDATA)> <!ELEMENT LASTNAME (#PCDATA)> <!ELEMENT ADDRLINE1 (#PCDATA)> <!ELEMENT CITY (#PCDATA)> <!ELEMENT STATE (#PCDATA)> <!ELEMENT ZIP (#PCDATA)> <!ELEMENT HOMEPHONE (#PCDATA)> <!ELEMENT EMAIL (#PCDATA)>	<FIRSTNAME>John</FIRSTNAME> <LASTNAME>Doe</LASTNAME> <ADDRLINE1>100 Maple Lane</ADDRLINE1>	Each element contains a parsed character data payload.
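To make the rules in Table 16-2 concrete, the sketch below re-expresses them by hand in Python. It is not a DTD validator, and the parser in Python's standard library does not validate against DTDs; the point is only to show that a content model such as the one for ENTRY behaves like a regular expression over the names of the child elements. The function check_address_book and the sample values other than John Doe and 100 Maple Lane are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# ENTRY content model from the DTD, written as a regular expression over
# the child element names (each name followed by one space).
ENTRY_MODEL = re.compile(
    r"FIRSTNAME LASTNAME ADDRLINE1 (ADDRLINE2 )?CITY STATE ZIP "
    r"(HOMEPHONE )?(EMAIL )?"
)


def check_address_book(xml_text):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    root = ET.fromstring(xml_text)
    if root.tag != "ADDRESSBOOK":
        return ["root element must be ADDRESSBOOK"]
    children = list(root)
    # (ENTRY)+ : one or more ENTRY children and nothing else
    if not children or any(child.tag != "ENTRY" for child in children):
        problems.append("ADDRESSBOOK must contain only ENTRY elements, at least one")
    for number, entry in enumerate(root.findall("ENTRY"), start=1):
        # <!ATTLIST ENTRY ID CDATA #REQUIRED>
        if "ID" not in entry.attrib:
            problems.append(f"ENTRY {number}: missing required ID attribute")
        names = "".join(child.tag + " " for child in entry)
        if not ENTRY_MODEL.fullmatch(names):
            problems.append(f"ENTRY {number}: children do not match the content model")
    return problems


VALID = """
<ADDRESSBOOK>
  <ENTRY ID="1">
    <FIRSTNAME>John</FIRSTNAME>
    <LASTNAME>Doe</LASTNAME>
    <ADDRLINE1>100 Maple Lane</ADDRLINE1>
    <CITY>Springfield</CITY>
    <STATE>IL</STATE>
    <ZIP>62701</ZIP>
    <EMAIL>jdoe@example.com</EMAIL>
  </ENTRY>
</ADDRESSBOOK>
"""

# No ID attribute, and LASTNAME appears before FIRSTNAME.
INVALID = """
<ADDRESSBOOK>
  <ENTRY>
    <LASTNAME>Doe</LASTNAME>
    <FIRSTNAME>John</FIRSTNAME>
  </ENTRY>
</ADDRESSBOOK>
"""

print(check_address_book(VALID))
print(check_address_book(INVALID))
```

Both sample documents are well-formed; only the first also satisfies the rules, which is exactly the distinction between well-formed and valid drawn earlier.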

Now that we have a DTD, we have to associate it with our XML document. There are two ways to do this. The internal DTD actually resides in the XML document itself:

 <?xml version="1.0" encoding="ISO-8859-1"?>
 <!DOCTYPE BOOKLIST [
      <!ELEMENT TITLE (#PCDATA)>
      <!ELEMENT ISBN (#PCDATA)>
      <!ELEMENT AUTHOR (#PCDATA)>
      <!ELEMENT PUBLISHER (#PCDATA)>
      <!ELEMENT PUBDATE (#PCDATA)>
 ]>

This method includes the DTD inside the XML document to be validated. The advantage to using an internal DTD is that everything a parser needs to validate the file is self-contained. The disadvantage is that the inclusion of the DTD can bloat what might already be a rather large XML file to begin with. Also, if you are receiving this file to insert into your database and need to validate that it has followed the specifications you've provided to the publisher of the data, there is no guarantee of this; the publisher of the XML document has control over the DTD because the publisher has included it in the XML document.

An alternate methodology is to specify an external DTD. This method allows flexibility with regard to who controls the validation of the data.

 <?xml version="1.0" encoding="ISO-8859-1"?>
 <!DOCTYPE BOOKLIST SYSTEM "http://www.booksonline.com/booklist.dtd">

The SYSTEM keyword instructs the parser to look in the location of the specified Uniform Resource Identifier (URI) (the http address in the preceding example) to find the document type definition information used to validate the XML document. The advantage to this method is added flexibility with regard to which party controls the DTD. In this case, it can be either the publisher or the consumer of the information. The disadvantage is that if a remote resource is specified but is unavailable due to network problems, validation of the document cannot proceed.
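A parser can tell the two association styles apart from the DOCTYPE declaration alone. The sketch below, again using Python's standard expat parser purely for illustration, reports whether a document carries an internal DTD subset or points at an external one. The helper name describe_doctype is hypothetical, the sample documents add an empty <BOOKLIST/> root so they parse completely, and expat as configured here does not fetch the external DTD.

```python
import xml.parsers.expat


def describe_doctype(xml_bytes):
    """Report the DOCTYPE name, its system identifier, and whether an
    internal subset is present."""
    info = {}

    def on_doctype(name, system_id, public_id, has_internal_subset):
        info["root"] = name
        info["system_id"] = system_id
        info["internal"] = bool(has_internal_subset)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.Parse(xml_bytes, True)
    return info


internal_doc = b"<!DOCTYPE BOOKLIST [<!ELEMENT TITLE (#PCDATA)>]><BOOKLIST/>"
external_doc = (
    b'<!DOCTYPE BOOKLIST SYSTEM "http://www.booksonline.com/booklist.dtd">'
    b"<BOOKLIST/>"
)

internal_info = describe_doctype(internal_doc)
external_info = describe_doctype(external_doc)
print(internal_info)
print(external_info)
```

A consumer that wants control over validation could use a check like this to ignore a publisher-supplied internal DTD and substitute its own.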

XML Schemas

While the DTD serves as a necessary and useful component for validation of an XML document's contents, it has some shortcomings:

DTDs do not provide a mechanism for strict data-type enforcement or support for complex data-types.
Support for namespaces in DTDs is not inherent, and what little support can be simulated is complicated and involves difficult-to-implement techniques.
DTDs use a specialized syntax that is not itself XML.

These are just some of the disadvantages that have cropped up over time using DTDs. To solve these shortcomings and others, a W3C group (http://www.w3.org/XML/) was formed to draft a standard for XML Schemas (XSDs). Some of the benefits of XML Schemas are:

XML Schemas are capable of enforcing tighter data-type checking during validation as well as supporting complex data-types.
XML Schema support for namespaces is built in; no work-around is needed.
XML Schema Definitions (XSDs) are XML files, so there is no new format to learn once you understand how to properly format an XML document.

The XML Schema Definition (XSD) language is much more powerful and complex than DTDs and does not lend itself easily to a short discussion of the subject.

Note

XML Schemas are an advanced subject and there are entire texts written on the topic alone; therefore, this chapter will not go into detailed coverage of them. Keep in mind that the concepts applied to DTDs with respect to their usage with Oracle also apply to XSDs. If you wish to learn about XSDs, we urge you to consult a more comprehensive text on the subject.

XML Parsers: Manipulating and Searching an XML Document

While presenting data in an XML document format has strengths of its own, the real power behind XML is what you are able to do with it. An XML parser is a software component that allows developers to access and manipulate raw XML document data.

XML Parsers:

Allow developers to write a minimal amount of code to extract the component data of the XML document for programmatic use. One application might be to populate Oracle tables with data from the XML document.
Can manipulate XML data either as entire documents using the Document Object Model (DOM) or as individual chunks called events using the Simple API for XML (SAX).
Are available for multiple languages. Oracle supports SAX and DOM parsers for C/C++, Java, and PL/SQL (DOM only).
Provide validation of the structure and content of XML documents, via the specification of a DTD or an XML Schema Definition (XSD).
Allow developers (using the DOM parser) to search a parsed XML document using XPath, whose expressions somewhat resemble URI path specifications in style.
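The path-like style of XPath searches can be sketched as follows. This example assumes Python's standard xml.etree.ElementTree module, which supports only a limited subset of XPath, and a hypothetical two-entry address book; it is an illustration rather than the Oracle parser API.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data in the shape of the Address Book document.
ADDRESS_BOOK = """
<ADDRESSBOOK>
  <ENTRY ID="1"><FIRSTNAME>John</FIRSTNAME><LASTNAME>Doe</LASTNAME></ENTRY>
  <ENTRY ID="2"><FIRSTNAME>Jane</FIRSTNAME><LASTNAME>Roe</LASTNAME></ENTRY>
</ADDRESSBOOK>
"""

root = ET.fromstring(ADDRESS_BOOK)

# Every LASTNAME that is a child of an ENTRY under the root element.
last_names = [node.text for node in root.findall("./ENTRY/LASTNAME")]

# The FIRSTNAME of the ENTRY whose ID attribute equals "2".
first_name_of_entry_2 = root.find("./ENTRY[@ID='2']/FIRSTNAME").text

print(last_names)
print(first_name_of_entry_2)
```

Each step between slashes names an element, much like a directory in a URI path, and the bracketed predicate filters on an attribute value.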

There are two major activities that XML parsers engage in: validation and parsing. A nonvalidating parser is an XML parser that only parses the XML given to it and does no active validation of the data. Most parsers, however, do both and the validation feature can usually be turned on or off with configuration parameters.

DOM Parsers

Document Object Model (DOM) is a platform-independent interface for accessing HTML and XML documents from within programming and scripting languages. The DOM presents documents in an object-oriented fashion. DOM exists in three levels. DOM Level 1 is a W3C Recommendation and is supported by many implementations. DOM Level 2, also a W3C Recommendation, extends DOM Level 1 with features such as access to the DTD and namespaces. It is specified in a modularized document structure: the Core, HTML, Views, Style, Events, and Traversal-Range specifications. Core is the entry point for reading the specification. DOM Level 3 builds on Level 2; its Core specification became a W3C Recommendation in 2004. One example of a DOM interface is Microsoft's XMLDOM ActiveX control.
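To show what working with a document as an object tree looks like, here is a minimal sketch using xml.dom.minidom from Python's standard library, a lightweight DOM implementation used here only for illustration. It reads a value from the <SITE> fragment shown earlier and then adds a new element; the PHONE element and its value are hypothetical.

```python
from xml.dom.minidom import parseString

doc = parseString(
    "<SITE><NAME>Joes Garage</NAME>"
    "<WEBURL>http://www.joesgarage.com/</WEBURL></SITE>"
)

# Read: the whole document is in memory, so any node can be reached directly.
name_element = doc.getElementsByTagName("NAME")[0]
name_text = name_element.firstChild.data

# Manipulate: create a new element and attach it to the root element.
phone = doc.createElement("PHONE")  # hypothetical element for illustration
phone.appendChild(doc.createTextNode("555-0100"))
doc.documentElement.appendChild(phone)

child_names = [
    node.tagName
    for node in doc.documentElement.childNodes
    if node.nodeType == node.ELEMENT_NODE
]

print(name_text)
print(child_names)
print(doc.documentElement.toxml())
```

Because the entire tree is held in memory, DOM suits documents that need random access or modification, at the cost of memory for very large files.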

SAX Parsers

SAX is a simple, standardized API for XML parsers developed by the contributors to the xml-dev mailing list. The interface is mostly language-independent, as long as the language is object-oriented. The first implementation was written for Java, but a Python implementation is also available. SAX is supported by many XML parsers.
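The event-driven style of SAX can be sketched with the xml.sax module from Python's standard library. In this minimal example, the handler class LastNameCollector and the sample document are hypothetical; the parser calls the handler's methods as it encounters each start tag, run of character data, and end tag, and never builds a tree of the whole document.

```python
import xml.sax


class LastNameCollector(xml.sax.ContentHandler):
    """Record every parsing event and collect the text of LASTNAME elements."""

    def __init__(self):
        super().__init__()
        self.events = []
        self.last_names = []
        self._buffer = None  # a list while inside LASTNAME, otherwise None

    def startElement(self, name, attrs):
        self.events.append(("start", name))
        if name == "LASTNAME":
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        self.events.append(("end", name))
        if name == "LASTNAME":
            self.last_names.append("".join(self._buffer).strip())
            self._buffer = None


ADDRESS_BOOK = (
    b"<ADDRESSBOOK>"
    b'<ENTRY ID="1"><LASTNAME>Doe</LASTNAME></ENTRY>'
    b'<ENTRY ID="2"><LASTNAME>Roe</LASTNAME></ENTRY>'
    b"</ADDRESSBOOK>"
)

handler = LastNameCollector()
xml.sax.parseString(ADDRESS_BOOK, handler)
print(handler.last_names)
print(handler.events[:3])
```

Since the handler keeps only what it needs, this approach works well for large documents that are read once from start to finish.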

JAXP

JAXP stands for Java API for XML Processing. It enables applications to parse and transform XML documents using an API that is independent of a particular XML processor implementation.

XSL Processors

Extensible Stylesheet Language (XSL) is a language for creating style sheets that describe how data sent over the Web as XML is to be presented to the user. XSL specifies the styling of an XML document by using XSL Transformations (XSLT) to describe how the document is transformed into another XML document that uses the formatting vocabulary. XSLT is a language for transforming XML documents into other XML documents; it is designed to be used as part of XSL. In addition to XSLT, XSL includes an XML vocabulary for specifying formatting. XSLT is a W3C specification in its own right as well as a part of the XSL specification.

Oracle XDK: The XML Developer's Kit

The Oracle XML Developer's Kit (XDK) 10g is a set of components, tools, and utilities in Java, C, and C++ that eases the task of building and deploying XML-enabled applications. It is available in Oracle Database 10g and Oracle Application Server 10g. Unlike many shareware and trial XML components, the Oracle XDK is fully supported. Oracle XDK consists of the following components:

XML parsers: Create and parse XML using DOM (including 3.0), SAX, and JAXP interfaces. Directly access XMLType in the Oracle Database 10g with unified C DOM interfaces.
XSLT processors: Transform or render XML. Support XSLT 2.0 in Java.
XSLT virtual machine (VM) and compiler: Provides a high-performance C XSLT transformation engine using compiled stylesheets.
XML Schema processors: Support XML schema validation. Include validation interfaces for stream-based processing.
XML Java Beans: Parse, transform, diff, retrieve, and compress XML documents via Java components.
XML Class Generator: Supports JAXB; automatically generates classes from DTDs and XML schemas to send XML from web forms or applications.
XML SQL Utility: Generates XML documents, DTDs, and XML schemas from SQL queries in Java and inserts XML documents into Oracle databases.
XSQL Servlet: Combines XML, SQL, and XSLT in the server to deliver dynamic web content and build sophisticated database-backed web sites and services.
XML Pipeline Processor: Invokes Java processes through XML control files.
TransX Utility: Makes it easier to load globalized seed data and messages into Oracle databases.