Chapter 4
So far we've covered the basics of XML and parsing XML documents. While these topics lay the foundation for using XML, they do not show you how to really apply XML. This chapter begins our exploration of using XML for solving real-world problems by focusing on a usage currently in the spotlight: using XML with databases.
We will begin this chapter with a short introduction to some important database-
One of the computer's most powerful applications is the ability to store, organize, and retrieve large
In this era of information explosion, DBMS has evolved into the foundation of virtually all of an organization's information management. Performing a substantial information search without using a DBMS is rare indeed.
In this section we will take a quick look at some popular DBMSs. We will also
A number of different types of DBMS have emerged over the
Widely used during the mainframe era, a Hierarchical DBMS (HDBMS) links records, also called nodes, together like a family tree such that each record type has only one owner. For example, an order is owned by only one customer. Figure 4-1 shows a sample hierarchical database containing customers and the orders they've placed.
Figure 4-1 An HDBMS example.
The Orders hierarchical database shown in Figure 4-1 has six nodes of type Customer and six of type Order. These nodes are linked together by
pointers
that the
The major
055/0706/000763
Note that this query
The relation database model was first proposed by Dr. E. F. Codd of IBM in his seminal paper titled "A Relational Model of Data for Large Shared Data Banks" in 1970. A Relational DBMS (RDBMS) models data as a set of tables where each table consists of a fixed collection of
Figure 4-2 shows the relational database structure of three example tables: Customer, Order, and Item.
Figure 4-2 An RDBMS example.
Unlike hierarchical databases, relationships between the tables in a relational database are built at runtime by linking key columns from one table to another. The database uses two types of key columns. The first one, called a primary key , is mandatory for every table in a relational database and is used to uniquely identify rows in a table. The second type, called a foreign key , corresponds with the primary key of other tables to form a parent-child relationship. For example, in Figure 4-2, CustomerNo is the primary key column of the Customer table, while OrderNo is the primary key column of the Order table. The Order table also has a foreign key column, CustomerNo, which links to the CustomerNo column of the Customer table. Therefore, in this case the Customer table is said to be the parent and Order the child.
The process of linking tables together at runtime using key fields is called
joining the tables
. One of the key features of relational databases is that they require few assumptions about how tables can be joined and data extracted. This flexibility allows the same database to be
In addition, because a single relational database can be spread across several tables, the database can be designed so that data redundancy is minimized. This is known as normalization, and the objective is to isolate data so that additions, deletions, and modifications can be made in just one table and propagated throughout the rest of the database based on the defined relationships.
Finally, true RDBMSs use Structured Query Language (SQL) to query, extract, and update data stored in databases and conform the data as closely as possible to the theoretical relational rules of normalization. For example, to find all Items ordered by Customer John, you can use the following SQL query:
SELECT ProductNo, Desc FROM Customer, Order, Item WHERE Customer.CustomerNo = Order.CustomerNo AND Order.ItemNo = Item.ItemNo AND Customer.FirstName = `John'
Object orientation for a database means the capability of storing and retrieving objects in addition to mere data. Objects are bundles of data and behaviors. Because the data in an object can be just another object or collection of objects, objects are similar to the nodes of an HDBMS, which likewise contains a collection of child nodes.
The ability to query objects is often provided through a set of methods inherited from a parent object provided by the Object-Oriented DBMS (OODBMS) framework. Each object can implement these
One of the biggest advantages of OODBMSs is their direct integration with object-oriented programming languages. Another important advantage is that the amount of data that can be
Over the past five years there has been an explosion in demand for access over the Internet to the data stored in enterprise databases and for the ability to use the databases to support electronic business operations. These include transactions between systems within an enterprise ("enterprise integration"), between businesses in a supply chain ("B2B e-commerce"), and directly to customers ("B2C e-commerce"). Because so many types and brands of DBMSs exist, two parties involved in the business transactional exchanges will likely be using incompatible DBMSs. On the other hand, even if they are using the same DBMS, their schemas might be so different that offering systems direct access to the other database can
Therefore, what many organizations are striving for is universal access to data in a fashion that is easy to manage and in a format that is
Even though XML offers many advantages as a universal data-exchange format, using it with a DBMS can be
A data model is a conceptual representation of data structures. The data structures include the data objects, the relationships between the objects, and the rules that
Before we explore the problem with data modeling, you should realize that XML documents can be categorized into two major types: data-centric and document-centric . This categorization is important because it will often govern what's possible and what isn't when using XML with a DBMS.
The main characteristics of a data-centric XML document are a more organized and regular structure and a lack of (or very little) mixed content. Data-centric XML documents are designed for application consumption and
Listing 4-1 is an example of a data-centric XML document used to hold contact information for a personal phonebook. Notice that every information item, such as the full
Listing 4-1 data_centric.xml: A sample data-centric XML document.
<?xml version="1.0"?> <Contacts> <Contact ContactNumber="197459531"> <FullName>Abercrombie, Kim</FullName> <CompanyName>Tailspin Toys</CompanyName> <EMail>akim@tailspintoys.com</EMail> <PhoneNo type="business">1-415-123-1234</PhoneNo> <Address> <Street>123 Main St.</Street> <City>San Francisco</City> <State>CA</State> <PostalCode>94116</PostalCode> <CountryCode>US</CountryCode> </Address> </Contact> <Contact ContactNumber="250021025"> <FullName>Chen, John Y.</FullName> <CompanyName>Contoso, Ltd</CompanyName> <EMail>johny.chen@contoso.com</EMail> <PhoneNo>852-1234-5678</PhoneNo> <Address> <Street>99 Dry St.</Street> <City>Cardiff</City> <PostalCode>CF17XX</PostalCode> <CountryCode>UK</CountryCode> </Address> </Contact> </Contacts>
Document-centric design represents a more
Listing 4-2 document_centric.xml: A sample document-centric XML document.
<?xml version="1.0"?> <chapter> <number>4</number> <title>Sample Chapter</title> <author email="john.kelly@thephone-company.com">Kelly , John </author> <contents> <header level="1"> <para>This is a sample paragraph.</para> <list> <bullet>This is <ref link="#12">item 1</ref></bullet> <bullet>This is <ref link="#13">item 2</ref></bullet> </list> <header level="2"> <para>This is paragraph with embedded <bold>formatting </bold>.</para> </header> </header> </contents> </chapter>
On the surface XML appears to fit in the hierarchical modeling category. In practice, XML is a hybrid. XML is probably most similar to object modeling because it can be regarded as consisting of nodes, and nodes can contain heterogeneous data. On the other hand,
Because of all these issues, most DBMS vendors provide indirect support for XML. Support for retrieving data-centric XML documents is usually generated dynamically from the underlying data models using predefined mapping rules. Some vendors support
The complexity of providing a modeling framework that can handle the sheer flexibility involved means that most DBMSs do not provide any special built-in support for storage and retrieval of document-centric XML documents. Some DBMSs allow the user to store XML documents as Binary Large Object (BLOB) columns designed for storing unstructured or binary data. This approach gives you the usual advantages provided by databases, such as transactional control, security and centralized access. More sophisticated DBMSs have XML-aware tools with full-text query capabilities (such as proximity, synonym, and fuzzy searches) for the XML documents stored in the databases.
Strictly speaking, text is the only data type supported by an XML Document Type Definition (DTD). Most databases, however, have strong data type support. Therefore, RDBMSs will need to convert data from text in the XML document to their native types in the databases and vice versa. The biggest problem with this is that no universal standard exists for textual representation of the different types of data. Dates are
Fortunately, W3C XSD provides a richer set of data types than DTDs, including number and date, and allows users to derive their own data types. RDBMSs that support XSD can therefore provide a more accurate representation of a database table schema using XML.
Because everything in XML is textual, binary data must be encoded before they can be included in an XML document. The two most common ways of encoding binary data in XML are unparsed entities and Base64 encoding. Unfortunately, if you're using a DTD, no standard XML notation exists for indicating that an element contains Base64-encoded data, so software might not recognize that the data is binary-encoded. With RDBMSs that support XSD, however, users can use the base64Binary data type to
Data with null values has a special meaning in the database world. They're used to represent such factors as
In XML optional elements and attributes can be used to represent null values. If a database value is null, it is simply not included in the corresponding XML element or attribute. Note that, just as with databases, empty values (such as zero length string) are represented by elements or attributes without any content (for example, < element / > or attribute ="").
Instead of leaving out optional elements and attributes, some RDBMSs allow the user to define what
A charset is a byte-sequence-to-character mapping method and its associated data structures. For most Western European languages the charset, specified by the ISO-8859-1 standard (also called Latin-1), is simply a one-to-one byte-to-character mapping table. For other languages a different charset might be required. For example, for the Hebrew alphabet the ISO-8859-8 charset can be used. Some languages, though, might require more complex mapping structures and rules. For example, for some Asian languages that contain thousands of characters, including Chinese, Korean, and Japanese, a charset that uses multibyte
A charset is not the same as a character set. For a full explanation of the difference, check out RFC 2278 at http://www.ietf.org/rfc/rfc2278.txt.
Being text, XML is designed to support different charsets, including Unicode. Databases, on the other hand, often offer limited support at best for charsets. Most RDBMSs can support only one charset for the entire table or even the entire database. Therefore, retrieving and storing XML documents with databases might sometimes require an intermediate charset conversion step.