Using XML with Relational Database Management Systems

RDBMSs have emerged as the predominant DBMS available today. Corporations large and small depend on extremely reliable and scalable RDBMSs to manage the data required by applications for automating their business processes. Because of their popularity in the business world, most XML database integration projects are implemented with RDBMSs. In this section we will explore the different ways that RDBMSs have evolved to adopt support for XML.

Retrieving and Storing Data-Centric XML Documents

To retrieve and store data-centric XML documents, most RDBMSs provide a mapping mechanism to allow the transformation of relational data to and from XML data. Broadly speaking, the different mapping mechanisms provided can be categorized as two types: resultset-based mapping and schema-based mapping. These two mapping methods allow you to dynamically discover and represent your database structures from and into XML documents.

Resultset-Based Mapping

Resultset-based mapping models XML documents based on the tabular nature of the resultset generated by SQL queries. The columns and rows in a resultset are usually mapped as either XML elements or attributes or a combination of both. Consider Listing 4-3, a simple SQL query that returns records from the Employee table.

Listing 4-3 sample_query1.sql: A simple SQL query that returns a list of employee records.

 SELECT Employee_ID, First_Name, Last_Name, Dept_No FROM Employee

This query will generate a resultset that might look like that shown in Figure 4-3.

Figure 4-3 A sample resultset returned from the SQL query shown in Listing 4-3.

To transform this into an XML document, we can decide to map the rows as elements of the XML document and the columns as attributes of these elements. An example of such an XML document is shown in Listing 4-4.

Listing 4-4 sample_query1.xml: A sample XML representation of the Employee table shown in Figure 4-3.

 <?xml version="1.0"?> <ROOT> <row Employee_ First_Name="Adam" Last_Name="Barr" Dept_No="3"/> <row Employee_ First_Name="Katie" Last_Name="Jordan" Dept_No="2"/> <row Employee_ First_Name="Scott" Last_Name="MacDonald" Dept_No="1"/> <row Employee_ First_Name="Tim" Last_Name="O'Brien" Dept_No="2"/> <row Employee_ First_Name="Sunil" Last_Name="Koduri" Dept_No="3"/> <row Employee_ First_Name="Laura" Last_Name="Norman" Dept_No="2"/> </ROOT>

To ensure that the XML document is well-formed, we have included a root element that we call <ROOT>. One alternative naming convention is to use the name of the table, Employee in this case, as the root element name.

For each row in the resultset, we generated a <row> element in the XML document. The columns are then mapped to attributes of these <row> elements. Another way to map columns is to generate the child elements of the <row> elements, as shown in Listing 4-5.

Listing 4-5 sample_query2.xml: Same as Sample_Query1.xml, but using elements instead of attribute values to represent records.

 <?xml version="1.0"?> <ROOT> <row> <Employee_ID>1</Employee_ID> <First_Name>Adam</First_Name> <Last_Name>Barr</Last_Name> <Dept_No>3</Dept_No> </row> <row> <Employee_ID>2</Employee_ID> <First_Name>Katie</First_Name> <Last_Name>Jordan</Last_Name> <Dept_No>2</Dept_No> </row> <row> <Employee_ID>3</Employee_ID> <First_Name>Scott</First_Name> <Last_Name>MacDonald</Last_Name> <Dept_No>1</Dept_No> </row> <row> <Employee_ID>4</Employee_ID> <First_Name>Tim</First_Name> <Last_Name>O'Brien</Last_Name> <Dept_No>2</Dept_No> </row> <row> <Employee_ID>5</Employee_ID> <First_Name>Sunil</First_Name> <Last_Name>Koduri</Last_Name> <Dept_No>3</Dept_No> </row> <row> <Employee_ID>6</Employee_ID> <First_Name>Laura</First_Name> <Last_Name>Norman</Last_Name> <Dept_No>2</Dept_No> </row> </ROOT>

One advantage of resultset-based mapping is that the generated XML document automatically reflects the live data and schema information present in the resultset from which the XML document is generated. This means that, by controlling the number of columns and rows in a resultset using standard SQL constructs, you automatically restrict the amount of data that will be included in the resultant XML document. In addition, aliased and computed columns will also be reflected in the XML document. For example, the following SQL query will generate a resultset that can be transformed into an XML document:

 SELECT CONCATENATE( LastName, ` `, FirstName ) AS FullName FROM Employee WHERE LastName = `Jordan'

The document would then look like Listing 4-6.

Listing 4-6 sample_query3.xml: XML representation of a resultset with computed columns.

 <?xml version="1.0"> <ROOT> <row Employee_ FullName="Jordan, Katie"/> </ROOT>

Only the row matching the condition specified in the WHERE clause is included in the XML document. Also, we've automatically picked up the FullName alias from the resultset and used it in the XML document.

Resultset mapping will function just as effectively for a resultset that consists of rows originating from more than one table. Consider the following multijoin SQL query as an example:

 SELECT Dept.Dept_ID,   Dept.Dept_Name,   Employee.Employee_ID,   CONCATENATE( Employee.Last_Name, ' ', Employee.First_Name ) AS Employee_Name FROM Dept, Employee WHERE Dept.Dept_ID = Employee.Dept_ID ORDER BY Dept.Dept_ID, Employee.Employee_ID

This query will generate a resultset that looks like that shown in Figure 4-4. We can easily transform this resultset into the XML document in Listing 4-7.

Figure 4-4 A multitable resultset.

Dept_ID	Dept_Name	Employee_ID	Employee_Names
1	IT	3	McDonald, Scott
2	HR	2	Jordan, Katie
2	HR	4	O'Brien, Tim
3	HR	6	Norman, Laura
3	Operations	1	Barr, Adam
4	Operations	5	Koduri, Sunil

Listing 4-7 sample_query4.xml: XML generated from the resultset in Figure 4-4.

 <?xml version="1.0"?> <ROOT> <row Dept_ Dept_Name="IT" Employee_ Employee_Name="MacDonald, Scott"/> <row Dept_ Dept_Name="HR" Employee_ Employee_Name="Jordan, Katie"/> <row Dept_ Dept_Name="HR" Employee_ Employee_Name="O'Brien, Tim"/> <row Dept_ Dept_Name="HR" Employee_ Employee_Name="Norman, Laura"/> <row Dept_ Dept_Name="Operations" Employee_ Employee_Name="Barr, Adam"/> <row Dept_ Dept_Name="HR" Employee_ Employee_Name="Koduri, Sunil"/> </row> </ROOT>

Our multitable XML document has one significant weakness. It doesn't represent the true hierarchical nature of the resultset. Ideally, what we might want is to group Employee rows that have the same Dept_ID under the same container element. Listing 4-8 is an example.

Listing 4-8 sample_query5.xml: Using nested elements to represent a multitable resultset.

 <?xml version="1.0"?> <ROOT> <Dept Dept_ Dept_Name="IT"> <Employee Employee_ Employee_Name="MacDonald, Scott"/> </Dept> <Dept Dept_ Dept_Name="HR"> <Employee Employee_ Employee_Name="Jordan, Katie"/> <Employee Employee_ Employee_Name="O'Brien, Tim"/> <Employee Employee_ Employee_Name="Norman, Laura"/> </Dept> <Dept Dept_ Dept_Name="Operations"> <Employee Employee_ Employee_Name="Barr, Adam"/> <Employee Employee_ Employee_Name="Koduri, Sunil"/> </Dept> </ROOT>

Notice that we needed a new container element to hold data that is specific to a department (Dept). Instead of just using <row>, we decided to name this element according to the name of the table itself, <Dept>. Similarly, we've named the element used to hold data from each row of the Employee table as <Employee>.

To transform the resultset in Figure 4-4 into a nested XML document that looks like that shown in Listing 4-8, an RDBMS needs to navigate the rows in the resultset in a forward-only fashion and the columns in a left-to-right fashion. By comparing the column values of the current row with those of the previous rows, an RDBMS can infer the necessary hierarchical structure of the resultset. The nesting order is determined by the order in which the columns will appear in the resultset.

To better illustrate what's really involved, let's step through the process with the resultset shown in Figure 4-4. Before we begin using XML to mark up the data, we will first need to generate the header and root element, like the following:

 <?xml version="1.0"?> <ROOTSET>

Now let's start from the top of the resultset in Figure 4-4. Because this is the first row, however, we only need to "remember" the values here. We don't generate any XML at this point. We can move on to the second row.

Comparing the second row with the first row, we immediately find that the data has changed, so we need to generate elements in the XML. Because we see two groups of columns belonging to two different tables in these rows—Dept_ID and Dept_Name of the Dept table and Employee_ID and Employee_Name of the Employee table—we know we will need two separate elements to hold the column values. Furthermore, because the columns of the Dept table precede those of the Employee table, we will nest the element corresponding to the rows of the Employee table within those of the Dept table. The following XML fragment shows the result of this process:

 <Dept Dept_ Dept_Name="IT"> <Employee Employee_ Employee_Name="MacDonald, Scott"/>

Here we have generated the elements and attributes for the previous row (row one) and not for the current row. This occurs because at this point we know only that the previous row is different from the current row. We don't yet know whether the current one will be different from the next row. Therefore, we need to generate the XML data for the previous row and remember the data in the current row. This is the general pattern we're going to follow until we reach the end of the resultset.

After moving on to the third row and comparing it with the previous row (row two), we find that both Dept_ID and Dept_Name stayed the same, but Employee_ID and Employee_Name have changed. This change means we need to add another <Employee> element to represent the Employee data in the previous row.

There is one slight complication. Because the column values of the row we're about to generate for both the Dept table and Employee table (that is, those of the second row) are different from those we have just generated (that is, those of the first row), we know we need to generate a new <Dept> element. However, because <Dept> is rendered as a container element, we also need to close the previous <Dept> element, as shown in the following code:

 </Dept> <Dept Dept_ Dept_Name="HR"> <Employee Employee_ Employee_Name="Jordan, Katie"/>

Now we will move on to the fourth row, where we find that, compared to the previous row, the column values of the Employee table have changed, which means we need to generate a new <Employee> element. The resulting XML that will be added looks like the following:

 <Employee Employee_ Employee_Name="O'Brien, Tim"/>

Once we have this element, we move on to the fifth row where we encounter a situation similar to the second row. Therefore, we use the same solution and generate the necessary elements.

The XML result that we add to our growing document as follows here:

 <Employee Employee_ Employee_Name="Norman, Laura"/> </Dept> <Dept Dept_ Dept_Name="Operations">

Now we hit the sixth row. First we find that the Employee column values have changed, so we generate a new <Employee> element, as shown here:

 <Employee Employee_ Employee_Name="Barr, Adam"/>

Finally we find that we've hit the end of the resultset. Therefore, the only task left is to spit out the elements for the last row and properly end our XML elements. We also know that we have a <Dept> element open, so we need to close it. Lastly we add a </ROOT> end tag to get our remaining elements. These steps are shown below:

 <Employee Employee_ Employee_Name="Koduri, Sunil"/> </Dept> </ROOT>

You can see that the process is not that complicated and can easily be automated with some simple software.

Schema-Based Mapping

In this section we will show you how to generate database schemas to and from XSDs. Before we begin, however, you should realize that, because you can model an XSD in more than one way, the mapping method we are about to explore with you is not the only legitimate means of mapping XSD with database schemas. As you will learn soon, a few constructs in a database schema have no direct correspondence in XSD, and the solution we will use to solve this problem might be different from other people's similar attempts.

In a relational database the hierarchical relationship between tables can be modeled by using foreign keys. Therefore, in principle, we can model an XML representation of databases based on information found in the database schema. Take note, however, that this mapping method is most often used at design time to generate only an XML DTD or XSD schema. The reason for this is that, because most applications work with a known database schema, this method is not suitable for storing random XML documents in databases. That would require altering database schemas from XML at runtime.

This section makes heavy use of XSD. For a quick introduction to XSD, please read Appendix A. A more detailed coverage of the subject, "XML Schema Primer 0," can be found at http://www.w3.org/TR/xmlschema-0/.

Generating XSD from a Database Schema

The exercise in this section is to generate an XSD schema that can correctly model both the database schemas shown in Listing 4-9. Note that Employee and Dept have a parent-child relationship because of the presence of the foreign key reference in the Employee table.

Listing 4-9 db_schema1.sql: Database schema for sample Dept and Employee tables.

 CREATE TABLE Dept ( DeptID INT PRIMARY KEY NOT NULL, DeptName VARCHAR( 20 ) NOT NULL ) CREATE TABLE Employee ( EmployeeID INT PRIMARY KEY NOT NULL, FirstName VARCHAR( 20 ), LastName VARCHAR( 20 ) NOT NULL, DeptID NUMBER REFERENCES Dept( Dept_ID ) NOT NULL )

First we need to pick one of the two tables to generate an XSD model for it. Let's start with the Employee table. Now for each data column in the Employee table, namely FirstName and LastName, we need to generate a corresponding element in the XSD. Before we can do this, however, we need to define a new data type to model the VARCHAR( 20 ) data type definition that is used by both FirstName and LastName in our database schema. We can do this by extending the xsd:string data type to include this 20-character limit:

 <xsd:simpleType name="varchar20"> <xsd:restriction base="xsd:string"> <xsd:maxLength value="20"/> </xsd:restriction> </xsd:simpleType>

We can now define the elements for FirstName and LastName as varchar20.

 <xsd:element name="FirstName" type="varchar20"/> <xsd:element name="LastName" use="required" type="varchar20"/>

Notice that, in addition to using our new varchar20 data type, we have also specified LastName as required because it is defined as not nullable in the database schema.

Next we tackle the primary key field of the Employee table, namely EmployeeID. First we model it as a unique and required element.

 <xsd:element name="EmployeeID" use="required" type="xsd:int"> <xsd:unique name="EmployeeUnique"> <xsd:selector xpath="my:Employee"/> <xsd:field xpath="EmployeeID"/> </xsd:unique> </xsd:element>

Notice that the uniqueness of the EmployeeID element is defined using the <xsd:unique> element. The <xsd:selector> element specifies that any value of EmployeeID must be unique within the <Employee> element. (We will include the definition for the my namespace URI used here when we get to the final steps of completing our XSD.) The <xsd:field> element associates this <xsd:unique> definition with the EmployeeID element.

Our model for the EmployeeID field has a problem, though. In a database schema a column can be defined as not primary key but still unique. Just saying that our EmployeeID is required is not enough; we need to be able to specify that EmployeeID is also a primary key.

It turns out that XSD has no direct notation to represent the concept of a primary key. One possible way to embed this extra information in our XSD is by using an <appinfo><annotation>. An <appinfo><annotation> is analogous to processing instructions in XML 1 because it is intended to be used for passing information to a processing application. Therefore, let's add our custom primary-key <appinfo><annotation> to our XSD.

 <xsd:element name="EmployeeID" use="required" type="xsd:int"> <xsd:unique name="EmployeeUnique">  <xsd:selector xpath="my:Employee"/> <xsd:field xpath="EmployeeID"/> <xsd:annotation> <xsd:appinfo> primary-key </xsd:appinfo> </xsd:annotation> </xsd:unique> </xsd:element>

Here we have included the <annotation> element within the <unique> element so that a processor can easily associate this primary-key property with the correct element.

Besides primary key, many other attributes and metadata of databases cannot easily be modeled directly using the existing XSD constructs. One important example is all the information related to the definition of indexes. Of course, you can use the same <appinfo> <annotation> technique shown here for these purposes.

The only column left to model now is DeptID. In the Dept table, DeptID is defined as a foreign key. How do we represent this in our XSD? The XSD language has constructs that are perfect for this purpose, the <key> and <keyref> elements. Here <keyref> acts as a foreign key while <key> acts as a primary key. We can now model the DeptID field as follows:

 <xsd:element name="DeptID" use="required" type="xsd:int"> <xsd:keyref name="DeptKeyRef" refer="DeptKey"> <xsd:selector xpath="my:Employee"/> <xsd:field xpath="DeptID"/> </xsd:keyref> </xsd:element>

As you will notice, the syntax of <keyref> is very similar to <unique>, in that we use <selector> and <field> child elements to specify where we want the key relationship to be placed, and in what range it must be unique. The refer attribute is used to specify the name of the <key> element that we want to refer back to. When we model the DeptID field of the Dept table, we will name the associated <key> element as DeptKey.

Now that our XSD model for the Employee table is complete, we can move on to the Dept table. Modeling the DeptName field is easy because it is the same as the FirstName field of the Employee table:

 <xsd:element name="DeptName" type="varchar20"/>

The DeptID field is defined as a primary key in the Dept table. We already know how to model a primary key from when we tackled the <EmployeeID> element, so we can quickly jump to the following definition of the <DeptID> element:

 <xsd:element name="DeptID" use="required" type="xsd:int"/> <xsd:unique name="DeptUnique"> <xsd:selector xpath="my:Dept"/> <xsd:field xpath="DeptID"/> <xsd:annotation> <xsd:appinfo> primary-key </xsd:appinfo> </xsd:annotation> </xsd:unique>

However, one piece of information is still missing. We need to include a <key> element here so that our earlier <keyref> definition in the <DeptID> child element of the <Employee> element is complete. The final definition for the <DeptID> element is as follows:

 <xsd:element name="DeptID" use="required" type="xsd:int"/> <xsd:unique name="DeptUnique"> <xsd:selector xpath="my:Dept"/> <xsd:field xpath="DeptID"/> <xsd:annotation> <xsd:appinfo> primary-key </xsd:appinfo> </xsd:annotation> </xsd:unique> <xsd:key name="DeptKey"> <xsd:selector xpath="my:Dept"/> <xsd:field xpath="DeptID"/> </xsd:key> </xsd:element>

Now that we have all the pieces, we can tie them together into a complete XSD schema document. Our final completed XSD schema, including the XML header and definition for a root element, is shown in Listing 4-10.

Listing 4-10 employee_dept.xsd: XSD generated from the Employee and Dept schemas.

 <?xml version="1.0"?> <xsd:schema xmlns:xsd="http://www.w3c.org/2001/XMLSchema" targetNamespace="http://mspress.microsoft.com/corexml/" xmlns:my="http://mspress.microsoft.com/corexml/"> <xsd:element name="ROOT"> <xsd:complexType> <xsd:sequence> <xsd:element ref="Employee" minOccurs="0"  maxOccurs="unbounded"/> <xsd:element ref="Dept" minOccurs="0" maxOccurs="unbounded"/> </xsd:sequence> </xsd:complexType> </xsd:element> <xsd:simpleType name="varchar20"> <xsd:restriction base="xsd:string"> <xsd:maxLength value="20"/> </xsd:restriction> </xsd:simpleType> <xsd:element name="Employee"> <xsd:complexType> <xsd:sequence> <xsd:element name="EmployeeID" use="required" type="xsd:int"> <xsd:unique name="EmployeeUnique"> <xsd:selector xpath="my:Employee"/> <xsd:field xpath="EmployeeID"/> <xsd:annotation> <xsd:appinfo> primary-key </xsd:appinfo> </xsd:annotation> </xsd:unique> </xsd:element> <xsd:element name="FirstName" type="varchar20"/> <xsd:element name="LastName" use="required" type="varchar20"/> <xsd:element name="DeptID" use="required" type="xsd:int"> <xsd:keyref name="DeptKeyRef" refer="DeptKey">  <xsd:selector xpath="my:Employee"/> <xsd:field xpath="DeptID"/> </xsd:unique> </xsd:element> </xsd:sequence> </xsd:complexType> </xsd:element> <xsd:element name="Dept"> <xsd:complexType> <xsd:sequence> <xsd:element name="DeptID" use="required" type="xsd:int"/> <xsd:unique name="DeptUnique"> <xsd:selector xpath="my:Dept"/> <xsd:field xpath="DeptID"/> <xsd:annotation> <xsd:appinfo> primary-key </xsd:appinfo> </xsd:annotation> </xsd:unique> <xsd:key name="DeptKey"> <xsd:selector xpath="my:Dept"/> <xsd:field xpath="DeptID"/> </xsd:key> </xsd:element> <xsd:element name="DeptName" type="varchar20"/> </xsd:sequence> </xsd:complexType> </xsd:element> </xsd:schema>

As you can see, the XSD language provides many features, such as the unique and key elements we showed you here, to enable accurate mapping to and from database schemas. For the actions that cannot be easily implemented directly in XSD, such as the primary-key property, we can use methods such as annotation to achieve a workable solution at least.

Generating a Database Schema from an XSD

In this section we will show you how to generate a database schema based on an XSD. For the examples here we will use the XSD schema in Listing 4-10 that we generated in the previous section. To better illustrate the mapping process, our database schema will be represented by an SQL creation script.

The first detail we need to decide on is what tables we're going to be mapping. For our case this is easy, because in Listing 4-10 we know that all these tables will be listed under the <ROOT> element. For each of the tables listed we generate a CREATE TABLE statement block for it in our SQL script.

 CREATE TABLE Employee ( ) CREATE TABLE Dept ( )

Now we need to work through each of these tables to model their structures as XSD. Let's start with the Employee table because it's the first table listed in <ROOT>. In Listing 4-10 we know that every table is modeled as a <complexType> and that each column in a specific table is modeled as a child element. Armed with this knowledge, we find that the Employee table needs four columns.

Each of the columns in this table will have a corresponding child element defined under the <Employee><complexType>. In addition to their names, we also find other important information about these columns, such as whether they are required, unique, or both. We will need to generate the appropriate column definitions for each of these columns in our SQL script. Notice that the required attribute of the elements is represented in our SQL scripts using the NOT NULL standard SQL modifiers.

 CREATE TABLE Employee ( EmployeeID INT NOT NULL, FirstName VARCHAR( 20 ), LastName VARCHAR( 20 ) NOT NULL, DeptID INT NOT NULL )

Going back to our XSD in Employee_Dept.xsd, we find that EmployeeID has an associated unique element and a primary-key <appinfo><annotation>. As we've seen, the use of these two pieces of information represent our way of saying this element is a primary key column. Let's add this information to our SQL script:

 EmployeeID INT PRIMARY KEY NOT NULL,

Now comes the fun part: modeling the foreign key. We find from the definition of DeptID in the <Employee> element that it has a <keyref> to a <key> called DeptKey. Searching through Listing 4-10, we find that this DeptKey <keyref> is associated with the DeptID element within the <Dept><complexType> element. Now we have the information we need to model this foreign key relationship in our database schema.

 DeptID INT REFERENCES Dept( DeptID ) NOT NULL

This step completes our modeling for the Employee table, so now let's turn to the Dept table.

As before, first we add a column to our database schema for each element defined within the <Dept><complexType>.

 CREATE TABLE Dept ( DeptID INT NOT NULL, Name VARCHAR( 20 ) NOT NULL )

Now we find out that DeptID should be defined as a primary key because of the presence of the primary-key <appinfo><annotation>. Here we make the final change to our database schema, which yields Listing 4-11.

Listing 4-11 generated_db_schema.sql: DB Schema generated from the Employee_Dept.xsd XML document.

 CREATE TABLE Employee ( EmployeeID INT PRIMARY KEY NOT NULL, FirstName VARCHAR( 20 ), LastName VARCHAR( 20 ) NOT NULL, DeptID INTEGER REFERENCES Dept(DeptID) NOT NULL ) CREATE TABLE Dept ( DeptID INT PRIMARY KEY NOT NULL, Name VARCHAR( 20 ) NOT NULL )

Storing and Retrieving Document-Centric XML Documents

Support for document-centric XML documents in most RDBMS products is usually minimal, if not totally missing. This absence results from the freeform and less organized nature of document-centric XML documents that makes them difficult to fit into the framework of a relational model. To illustrate this difficulty, consider Listing 4-12.

Listing 4-12 sample_html.html: An sample document-centric HTML document.

 <html> <head> <title>Sample HTML</title> </head> <body> <h1>This is a header</h1> Some other text <p>This is a paragraph.</p> <p>This is <b>another</b> paragraph.</p> </body> </html>

Note that although some elements look well structured, such as the <title> element, some mixed content is in the document, such as the <b> element within the <p> element. The presence of mixed content and deeply nested elements in the document makes it difficult to model properly using a relational model. For example, which elements do you model as tables and which as rows/columns? Also, is there a point to decomposing this document to fit into a relational model? What will you gain by doing that?

Most RDBMS products recommend that you store document-centric XML documents as Character Large Object (CLOB) or BLOB columns in a relational database. Doing so allows XML documents of arbitrary size to be easily stored in and retrieved from databases.

This concludes our discussion of using XML with RDBMS in general. In the following sections, we will explore the XML features provided by some commercial RDBMS products.