Databases and XML | XML Programming Bible

One of the computer's most powerful applications is the ability to store, organize, and retrieve large quantities of data. An organized collection of data is referred to as a database, and the programs designed to manage databases are known as Database Management Systems (DBMSs).

In this era of information explosion, DBMS has evolved into the foundation of virtually all of an organization's information management. Performing a substantial information search without using a DBMS is rare indeed.

In this section we will take a quick look at some popular DBMSs. We will also briefly introduce some of the issues in using XML with databases.

Types of DBMS

A number of different types of DBMS have emerged over the years. The most significant difference between them is the model used to store, manage, and query databases. Besides affecting what software you need to acquire, the model used affects the way you will think about data and can be surprisingly difficult to change later. Let's take a quick look at some of the more popular DBMSs, their designs, their advantages, and their disadvantages.

Hierarchical DBMS

Widely used during the mainframe era, a Hierarchical DBMS (HDBMS) links records, also called nodes, together like a family tree such that each record type has only one owner. For example, an order is owned by only one customer. Figure 4-1 shows a sample hierarchical database containing customers and the orders they've placed.

Figure 4-1 An HDBMS example.

The Orders hierarchical database shown in Figure 4-1 has six nodes of type Customer and six of type Order. These nodes are linked together by pointers that the user must explicitly specify. For example, Order (0706) is linked to Customer (055). All the nodes linked together form a strictly defined tree structure.

The major disadvantage of a hierarchical database is that it can be hard to query flexibly in ways that go against the tight coupling of the data hierarchies. For example, a query to retrieve the Item node of Item No 000763 in the Orders database in Figure 4-1 might look like the following:

 055/0706/000763

Note that this query follows a precise path to the required node. (The acute reader will notice that this is very similar to how one would use XPath to traverse to a node in XML). However, if you wanted to know all the Orders that also used the Item of this ItemNo, the only way to discover this would be through an extremely costly search of every Order node.

Relational DBMS

The relation database model was first proposed by Dr. E. F. Codd of IBM in his seminal paper titled "A Relational Model of Data for Large Shared Data Banks" in 1970. A Relational DBMS (RDBMS) models data as a set of tables where each table consists of a fixed collection of columns, or fields. An indefinite number of rows, or records, can occur within each table. Compared to the hierarchical model, the relational model is complicated. However, the relational model provides a much more flexible framework for data access and manipulation. You can ask any question you want of an RDBMS, but for an HDBMS you can ask only the questions programmed into the system.

Figure 4-2 shows the relational database structure of three example tables: Customer, Order, and Item.

Figure 4-2 An RDBMS example.

Unlike hierarchical databases, relationships between the tables in a relational database are built at runtime by linking key columns from one table to another. The database uses two types of key columns. The first one, called a primary key, is mandatory for every table in a relational database and is used to uniquely identify rows in a table. The second type, called a foreign key, corresponds with the primary key of other tables to form a parent-child relationship. For example, in Figure 4-2, CustomerNo is the primary key column of the Customer table, while OrderNo is the primary key column of the Order table. The Order table also has a foreign key column, CustomerNo, which links to the CustomerNo column of the Customer table. Therefore, in this case the Customer table is said to be the parent and Order the child.

The process of linking tables together at runtime using key fields is called joining the tables. One of the key features of relational databases is that they require few assumptions about how tables can be joined and data extracted. This flexibility allows the same database to be viewed and used in different ways.

In addition, because a single relational database can be spread across several tables, the database can be designed so that data redundancy is minimized. This is known as normalization, and the objective is to isolate data so that additions, deletions, and modifications can be made in just one table and propagated throughout the rest of the database based on the defined relationships.

Finally, true RDBMSs use Structured Query Language (SQL) to query, extract, and update data stored in databases and conform the data as closely as possible to the theoretical relational rules of normalization. For example, to find all Items ordered by Customer John, you can use the following SQL query:

 SELECT ProductNo, Desc FROM Customer, Order, Item WHERE Customer.CustomerNo = Order.CustomerNo AND Order.ItemNo = Item.ItemNo AND Customer.FirstName = `John'

Object-Oriented DBMS

Object orientation for a database means the capability of storing and retrieving objects in addition to mere data. Objects are bundles of data and behaviors. Because the data in an object can be just another object or collection of objects, objects are similar to the nodes of an HDBMS, which likewise contains a collection of child nodes.

The ability to query objects is often provided through a set of methods inherited from a parent object provided by the Object-Oriented DBMS (OODBMS) framework. Each object can implement these methods in a manner appropriate for its purposes. A SQL, called OQL, that allows for the querying of these systems in a standardized way, has been proposed by the Object Database Management Group (ODMG) (http://www.odmg.org).

One of the biggest advantages of OODBMSs is their direct integration with object-oriented programming languages. Another important advantage is that the amount of data that can be modeled by an OODBMS is increased, and modeling this information is easier.

Why XML?

Over the past five years there has been an explosion in demand for access over the Internet to the data stored in enterprise databases and for the ability to use the databases to support electronic business operations. These include transactions between systems within an enterprise ("enterprise integration"), between businesses in a supply chain ("B2B e-commerce"), and directly to customers ("B2C e-commerce"). Because so many types and brands of DBMSs exist, two parties involved in the business transactional exchanges will likely be using incompatible DBMSs. On the other hand, even if they are using the same DBMS, their schemas might be so different that offering systems direct access to the other database can turn out to be complicated at best and disastrous at worst. If, miraculously, the two parties are using the same schemas, the protocols used by the DBMSs might not be accessible or even secured over a public network like the Internet.

Therefore, what many organizations are striving for is universal access to data in a fashion that is easy to manage and in a format that is self-describing and standards compliant. To help facilitate this goal, several industry vendors are turning to XML.

The Challenges of Using XML with a DBMS

Even though XML offers many advantages as a universal data-exchange format, using it with a DBMS can be problematic. For example, the way XML data is structured (that is, data modeling) can be radically different from the structure in many databases. This means that some sort of data model mapping must be used before XML can be retrieved from or stored in databases. In this section we will take a brief look at some of the issues that can arise through the use of XML with a DBMS.

Data Modeling

A data model is a conceptual representation of data structures. The data structures include the data objects, the relationships between the objects, and the rules that govern which operations can be performed on the objects.

Before we explore the problem with data modeling, you should realize that XML documents can be categorized into two major types: data-centric and document-centric. This categorization is important because it will often govern what's possible and what isn't when using XML with a DBMS.

Data-Centric XML Documents

The main characteristics of a data-centric XML document are a more organized and regular structure and a lack of (or very little) mixed content. Data-centric XML documents are designed for application consumption and application-to-application data exchange. Common examples include invoices, stock quotes, product catalogs, and application configuration files.

Listing 4-1 is an example of a data-centric XML document used to hold contact information for a personal phonebook. Notice that every information item, such as the full name, and the zip or postal code, is represented by an element, and there is no mixed content.

Listing 4-1 data_centric.xml: A sample data-centric XML document.

 <?xml version="1.0"?> <Contacts> <Contact ContactNumber="197459531"> <FullName>Abercrombie, Kim</FullName> <CompanyName>Tailspin Toys</CompanyName> <EMail>akim@tailspintoys.com</EMail> <PhoneNo type="business">1-415-123-1234</PhoneNo> <Address> <Street>123 Main St.</Street> <City>San Francisco</City> <State>CA</State> <PostalCode>94116</PostalCode> <CountryCode>US</CountryCode> </Address> </Contact> <Contact ContactNumber="250021025"> <FullName>Chen, John Y.</FullName> <CompanyName>Contoso, Ltd</CompanyName> <EMail>johny.chen@contoso.com</EMail> <PhoneNo>852-1234-5678</PhoneNo> <Address> <Street>99 Dry St.</Street> <City>Cardiff</City> <PostalCode>CF17XX</PostalCode> <CountryCode>UK</CountryCode> </Address> </Contact> </Contacts>

Document-Centric XML Documents

Document-centric design represents a more liberal use of free-form text that is "marked-up" with elements. Document-centric XML documents are usually meant for human consumption and are characterized by a less regular structure with plenty of mixed content. Examples are books, letters, e-mails, and HTML/XHTML documents. Listing 4-2 is an example of a document-centric XML document. Notice that while some parts of the document may look well-structured (<number>, <author>, and so forth), mixed content is present within the <bullet> and <para> elements.

Listing 4-2 document_centric.xml: A sample document-centric XML document.

 <?xml version="1.0"?> <chapter> <number>4</number> <title>Sample Chapter</title> <author email="john.kelly@thephone-company.com">Kelly, John </author> <contents> <header level="1"> <para>This is a sample paragraph.</para> <list> <bullet>This is <ref link="#12">item 1</ref></bullet> <bullet>This is <ref link="#13">item 2</ref></bullet> </list> <header level="2"> <para>This is paragraph with embedded <bold>formatting </bold>.</para> </header> </header> </contents> </chapter>

Modeling Issues

On the surface XML appears to fit in the hierarchical modeling category. In practice, XML is a hybrid. XML is probably most similar to object modeling because it can be regarded as consisting of nodes, and nodes can contain heterogeneous data. On the other hand, document-oriented XML documents can have a high degree of heterogeneity of nodes (almost any element can appear almost anywhere) that few database models are able to replicate in terms of flexibility. In fact, it can be argued that the verbosity and formatting looseness of XML is the opposite of the strategies most DBMSs use to maximize performance and reliability.

Because of all these issues, most DBMS vendors provide indirect support for XML. Support for retrieving data-centric XML documents is usually generated dynamically from the underlying data models using predefined mapping rules. Some vendors support persisting the data-centric XML documents, and they use similar mapping rules to translate XML into data models supported by the DBMS. The biggest problem with this approach is that perfect round-tripping of a document (that is, storing the data from a document in the database and then reconstructing the document from the data) might not be possible because of the difference in data representation.

The complexity of providing a modeling framework that can handle the sheer flexibility involved means that most DBMSs do not provide any special built-in support for storage and retrieval of document-centric XML documents. Some DBMSs allow the user to store XML documents as Binary Large Object (BLOB) columns designed for storing unstructured or binary data. This approach gives you the usual advantages provided by databases, such as transactional control, security and centralized access. More sophisticated DBMSs have XML-aware tools with full-text query capabilities (such as proximity, synonym, and fuzzy searches) for the XML documents stored in the databases.

Data Types

Strictly speaking, text is the only data type supported by an XML Document Type Definition (DTD). Most databases, however, have strong data type support. Therefore, RDBMSs will need to convert data from text in the XML document to their native types in the databases and vice versa. The biggest problem with this is that no universal standard exists for textual representation of the different types of data. Dates are especially problematic because the range of possible formats is enormous.

Fortunately, W3C XSD provides a richer set of data types than DTDs, including number and date, and allows users to derive their own data types. RDBMSs that support XSD can therefore provide a more accurate representation of a database table schema using XML.

Binary Data

Because everything in XML is textual, binary data must be encoded before they can be included in an XML document. The two most common ways of encoding binary data in XML are unparsed entities and Base64 encoding. Unfortunately, if you're using a DTD, no standard XML notation exists for indicating that an element contains Base64-encoded data, so software might not recognize that the data is binary-encoded. With RDBMSs that support XSD, however, users can use the base64Binary data type to indicate that a certain data unit contains encoded binary data.

Null Values

Data with null values has a special meaning in the database world. They're used to represent such factors as uninitialized columns, invalid values, or data that simply isn't there. Note that this is different from a value of 0 for numbers or zero length strings. For example, two columns of null values are not considered equal, while two zero length strings are.

In XML optional elements and attributes can be used to represent null values. If a database value is null, it is simply not included in the corresponding XML element or attribute. Note that, just as with databases, empty values (such as zero length string) are represented by elements or attributes without any content (for example, <element /> or attribute="").

Instead of leaving out optional elements and attributes, some RDBMSs allow the user to define what constitutes null in an XML document, including supporting the XSD xsi:null type attribute.

Charsets

A charset is a byte-sequence-to-character mapping method and its associated data structures. For most Western European languages the charset, specified by the ISO-8859-1 standard (also called Latin-1), is simply a one-to-one byte-to-character mapping table. For other languages a different charset might be required. For example, for the Hebrew alphabet the ISO-8859-8 charset can be used. Some languages, though, might require more complex mapping structures and rules. For example, for some Asian languages that contain thousands of characters, including Chinese, Korean, and Japanese, a charset that uses multibyte characters is required. The EUC-KR charset is used for Korean, and the Shift_JIS charset is used for Japanese. Finally, the Unicode standard provides a single 16-bit charset that can be used to represent practically all the common languages used in the world today.

A charset is not the same as a character set. For a full explanation of the difference, check out RFC 2278 at http://www.ietf.org/rfc/rfc2278.txt.

Being text, XML is designed to support different charsets, including Unicode. Databases, on the other hand, often offer limited support at best for charsets. Most RDBMSs can support only one charset for the entire table or even the entire database. Therefore, retrieving and storing XML documents with databases might sometimes require an intermediate charset conversion step.