An XML Document Is a Database | Beginning XML Databases (Wrox Beginning Guides)

Is an XML document a database? Yes, of course. As you already know from previous chapters, an XML document contains both data and metadata. The most basic definition of a database is a repository for data. That repository contains information or data, such as all your companys account information, or in the case of this book, a slew of worldwide demographics data. In addition, the basic database structure describes the data or information contained within that database. What is the structure describing data in a database? The answer, of course, is the database metadata. Database metadata is the data in a database describing the information that is stored in that database. That metadata allows for sensible retrieval of that information, such that when your customer or demographics data is read, you actually know that you are reading the names of customers, or the populations of all countries of a particular region of the world. Without that metadata, the data stored in your database is completely meaningless. Obviously, the semantics of data can be written into programs. Those programs can be instructed where on the disk to read specific items, such as customer names. However, it is much easier to store some semantic metadata into a database itself. The resulting structure is much easier on programmers and applications in general, and thus more efficient and cheaper in that most costly of commodities: time. Time really is money!

One of the terms commonly used to describe XML data is that it is self-describing . XML is self-describing because it contains both data and metadata that describes the structure of the data in the XML document. That metadata is contained both in the elements and attributes of the XML document, and also in the hierarchy of the XML document itself. XML metadata can be described as being two-fold in nature:

Important

You might be asking something like this: Isnt the information about the data con tained in something like an XML Schema Definition (XSD) instead of the actual XML document itself? The answer to that question is of course, no! An XML document is a database unto itself it is what it is. An XSD or Document Type Definition (DTD) is only necessary to interpret XML documents as relational structures.

XML elements and attributes describe the properties of data. This is the equivalent of tables and fields in a relational database.
A properly structured hierarchical XML document describes relationships between different types of data in a dataset. This is more or less equivalent to relationships between tables in a relational database, but better matched to the structure established between classes in an object model of data.

Important

Again, you might be thinking something like this: I thought the structure of a rela tional database table in a database was related to an XML Schema in XML? Again, that would be incorrect. A schema, is a schema, is a schema. XSD and DTD are for mapping relational tables to objects. Because an XML document is an object structure, XSDs and DTDs are completely superfluous with respect to a native XML database. By definition, any XML document is a native XML database. Metadata in an XML document is inherent with the elements (tables), attributes (fields), and element hierarchy (relationships between tables) of the XML document they are the meta data. An XML document does not need a relational table structure to define itself because it is already a database in itself. See Chapters 6 and 13 for applications of XSD and DTD. This chapter focuses on an XML document as an independent stor age medium (a native XML database). If you have relational tables and schemas in your mind while reading this chapter, then try to mentally shelve those thoughts until you read Chapter 13 . An XML document does not need a schema, an XSD, or a DTD or anything other than itself to be a database. Only a relational database needs an XSD or DTD in order to map to XML document structure.

So, the advantages of using XML are that an XML document is actually a database structure in itself. XML is self-describing in that it contains both data and metadata. Metadata is also descriptive in both a scalar (as fields and values) and a structural sense (as a hierarchy). XML is portable because it is universally understood (that is the goal at least).

Disadvantages of using XML as a database storage medium are numerous . XML documents can include duplication, and thus become verbose. The result is that access to XML documents, in its most basic form, will involve full scanning and parsing of what could potentially be enormous text files.

Physical size comparison between commercial implementations of native XML databases and that of relational databases is not practical. Typical Oracle or SQL Server database implementations (both data warehouses and OLTP database) are considered as being VLDB (Very Large Database) at terabyte lev els. Five years ago relational database sizes were generally considered very large at a few hundred megabytes. I may be incorrect in this assumption but so far I have not found any native XML databases described as being larger than gigabyte-level in size. Thus, comparison is impractical and pointless. XML documents can be stored in Oracle and SQL Server XML data types. What you dont know is that a typical installation stores a collection of XML data types.

In reality, an XML document contains a large amount of information all in on place, and quite often even in a single file. This single-file nature can create tremendous issues with respect to large amounts of data, lots of users, and any general background processing. Native XML databases are therefore usually commercially most appropriate to that. This includes small scale applications, small quantities of data, very few users (low concurrency and multi- user requirements), low security, low data integrity needs, and above all, low performance.

So, how can an XML document be described as a database?

Defining a Native XML Database

By definition, an NXD (native XML database) can be both an XML document and an XML data type. An XML data type is a specialized storage facility containing a relational database. It follows that a native XML database is essentially any method of storing XML data as an XML document. Thus, an XML document, as executed in a browser, is a native XML database. Additionally, using XML data types in relational databases, such as Oracle or SQL Server databases (as described in Chapters 5 and 6) makes some relational databases native XML databasecapable as well. Essentially all that is needed to describe a database as being an NXD, or at least NXD-capable, is that an XML document is stored as an XML document. No particular modeling technique applies. Thus relational, object, or even hierarchical databases can be used.

A native XML database does not have to be a self-contained, independent database. An NXD can be included in other database engines using XML data types, as used in Oracle and SQL Server databases.

Creating a Native XML Database

Now let me demonstrate what the term native XML database really means. The way to do this is very simple you can try to create a native XML database, simply by describing how that native XML database might be created using XML.

The term native XML essentially means that the database part of that term is an intrinsic part of XML. In other words, as described previously and repeatedly in this book, XML documents describe both data and metadata. Data is the data, and metadata applies structure or at least some meaning to that data. Examine the example XML document shown in Figure 9-1.

Figure 9-1: A single-layered XML document

The XML document in Figure 9-1 shows a single-layered, poorly structured, XML document. However, those who are familiar with relational database structure will find this particular document easier to interpret as containing both data and metadata. Lets break down the document shown in Figure 9-1. This section of the XML document shown in Figure 9-1 shows data values in boldface:

   <planet name="  earth  ">    <record region="  Africa  " country="  Mauritius  " language="  Bhojpuri  " year="  1983  " fema les="  99467  " males="  97609  "></record>    <record region="  Africa  " country="  Mauritius  " language="  French  " year="  1983  " female s="  19330  " males="  16888  "></record>    <record region="  Europe  " country="  Finland  " language="  Czech  " year="  1985  " females="  42  " males="  36  "></record> ... </planet>

And this section of the XML document in Figure 9-1 shows metadata values in boldface:

   <  planet name  ="earth">    <  record region  ="Africa"  country  ="Mauritius"  language  ="Bhojpuri"  year  ="1983"  fema   les  ="99467"  males  ="97609"></record>    <  record region  ="Africa"  country  ="Mauritius"  language  ="French"  year  ="1983"  female   s  ="19330"  males  ="16888"></  record  >    <  record region  ="Europe"  country  ="Finland"  language  ="Czech"  year  ="1985"  females  =" 42"  males  ="36"></  record  > ... </  planet  >

If you can execute (some may call this rendering) an XML document in Windows Internet Explorer, then the XML document shown in Figure 9-1 should show the metadata and data in two different colors. The metadata will be a reddish-brown color , and the data will be both bold and black. Different browsers may show XML documents differently. The point is that even a browser program delineates between data and metadata in XML documents.

So, if you think in terms of a relational database, the XML document shown in these previous three examples could look something like what is shown in Figure 9-2.

Figure 9-2: Converting a single-layered XML document to a single relational table

In Figure 9-2 the metadata is on the left in the form of the PLANET table. The data is on the right side of Figure 9-2 shown as rows, or multiple iterations of the table structure (metadata). Each row is divided into specific fields called REGION, COUNTRY, LANGUAGE, YEAR, FEMALES, and MALES.

Now lets use the same information as shown in Figures 9-1 and 9-2, but this time make sure that the XML document has an acceptable hierarchical structure. That document is partially shown in Figure 9-3.

Figure 9-3: A properly structured, multiple-layered, hierarchical XML document

In Figure 9-3, you now see a completely different picture. The tags in the XML document are now structured within a hierarchy and parent object data is not duplicated . The tree structure shown in Figure 9-3 has a number of advantages over that of the flattened structure shown in Figure 9-1:

There is less repetition.
The structure of the data mimics reality.
Searching the hierarchy in Figure 9-3 can use a specialized tree scanning algorithm allowing for potentially much faster performance.

Additionally, as for the flattened structure, XML divides up its content between metadata and data where data is the values of the elements. The metadata is the actual element names. So, in the case of the XML document in Figure 9-3, the following XML document snippet shows the metadata in boldface:

   <  planet  name="earth">    <  region  name="Africa">      <  country  name="Mauritius">        <  language  name="Bhojpuri">          <  year  >1983</year>          <  female  >99467</female>          <  male  >97609</male>        </language>      </country>      <  country  name="Mauritius">        <  language  name="French">          <  year  >1983</year>          <  female  >19330</female>          <  male  >16888</male>        </language>      </country>    </region>    <region name="Europe">    ...    </region> </planet>

So this is all very well and good. An XML document can effectively be mentally pictured as being a database in itself, which is therefore natively XML, or a native XML database. Is there a way to allow some kind of consistent programming access to an XML document, or perhaps even a collection of XML documents, where the sum of those XML documents makes up a native XML database as a whole? The answer is yes.

Schema-Less Native XML Database Collections

Native XML databases often store and manage multiple XML documents as collections of XML fragments. A single native XML database can contain multiple collections of different types of XML documents, where those different XML documents cover different subject matter, and thus unrelated sets of data. For example, one collection could contain customers for a company, and another collection could contain demographics for a specific country in the world. Additionally, within each collection, the structure of individual XML fragments does not have to be consistent across the collection. In fact, every single fragment within a collection can have a different structure in relation to every other fragment within that same collection. The result is XML data that is structurally independent, and thus schema independent or schema-less .

Collections can be associated with schemas, as described at the end of Chapter 6, when briefly demonstrating schema collections and XSD (XML Schema Definition) with SQL Server database.

There is also the potential for validation of XML data using DTD (Document Type Definition) and XML Schema. All of these are advanced topics and will be covered later in this book.

Schema-independent XML is extremely flexible, which enables more rapid and easier development. However, flexibility comes at a price in terms of low data integrity and the risk of errors propagating throughout the data in the database.

In reality, there are probably innumerable approaches to making collections of XML documents into a native XML database, capable of useful storage capacity and performance. Lets present one simplistic method, which also happens to tie into much of the information presented throughout the rest of this book.

As already mentioned, you may remember that Chapter 6, which covers XML use in SQL Server database, contained a section at the end of the chapter describing schema collections, using XSD to apply schema structures to XML data type storage in. It also describes a SQL Server database. An XML document, or even a collection of XML documents as a database, does not need any schema definitions to function as a native XML database. Once again, this is because XML contains both data and metadata. This is what is known as a schema-less database. The following example defines a database describing the region of Africa:

   <?xml version="1.0"?> <database name="Africa">    <collections>      <collection name="Countries">      </collection>    </collections> </database>

The preceding example has no data in it but it includes a container for multiple collections, where each collection represents a set of countries. In this example, the set of countries would be in the region of Africa.

We can expand the previous example to allow definition of multiple databases. Each database defines a separate region. In this example, the two databases are the regions of Africa and Europe. Again each region contains collections of countries within their respective regions :

   <?xml version="1.0"?> <databases>    <database name="Africa">       <collections>         <collection name="Countries">         </collection>       </collections>    </database>    <database name="Europe">       <collections>         <collection name="Countries">         </collection>       </collections>    </database>    ... </databases>

And obviously, you can retain a single database and simply create more collections. This is shown in the following example where each region is a collection, and each country within each region is in a separate collection:

   <?xml version="1.0"?> <database>    <collections>       <collection name="Regions">          <region name="Africa">             <population>789548670</population>          </region>          <region name="Europe">             <population>488674441</population>          </region>       </collection>       <collection name="Countries">          <country name="Finland" region="Europe">             <population>5231372</population>          </country>          <country name="Germany" region="Europe">             <population>82422299</population>          </country>       </collection>       ...    </collections> </database>

You can also embed the countries (as collections) within their respective regions contained within regional collections:

   <?xml version="1.0"?> <database>    <collections>       <collection name="Regions">          <region name="Africa">             <population>789548670</population>          </region>          <region name="Europe">             <population>488674441</population>             <collection name="Countries">                <country name="Finland" region="Europe">                   <population>5231372</population>                </country>                <country name="Germany" region="Europe">                   <population>82422299</population>                </country>             </collection>          </region>       </collection>       ...    </collections> </database>

The collections within collections can be completely dispensed with in place of a realistic hierarchy of countries within their respective regions:

   <?xml version="1.0"?> <database>    <collections>       <collection name="Regions">          <region name="Africa">             <population>789548670</population>          </region>          <region name="Europe">             <population>488674441</population>             <country name="Finland" region="Europe">                <population>5231372</population>             </country>             <country name="Germany" region="Europe">                <population>82422299</population>             </country>          </region>       </collection>       ...    </collections> </database>

And you can even duplicate information into separate collections:

   <?xml version="1.0"?> <database>    <collections>       <collection name="Regions">          <region name="Africa">             <population>789548670</population>          </region>          <region name="Europe">             <population>488674441</population>             <country name="Finland">                <population>5231372</population>             </country>             <country name="Germany">                <population></population>             </country>          </region>       </collection>       <collection name="Countries">          <country name="Finland" region="Europe">             <population>5231372</population>          </country>          <country name="Germany" region="Europe">             <population>82422299</population>          </country>       </collection>       ...    </collections> </database>

The enormous flexibility of using XML data as a storage medium should be apparent. Obviously, misunderstanding and abusing this flexibility can lead to problems, for example, as too much data through duplication, too much or too little structural complexity, and so on. The list of potential pitfalls is as long as the list of different possible variations on a theme when using something as flexible as XML.

What Is Indexing?

What I have not intended to do in this section is to describe a specific native XML database or tool used for accessing XML documents as a database. What I have attempted to do is present you with a description of how a database can be created using XML documents. The result is that XML is incredibly flexible in terms of how it can be constructed . As you can see, creating a basic database storage facility or model using XML is actually very easy. Your options are limited only by what your applications require and your understanding of the data involved with your applications.

Indexing is an entirely different kettle of fish. In a database, an index is a physical copy of a small portion of an entire table. The concept behind the creation of the index that assists performance is, in its simplest form, that the physical size of the index is much less than the table size itself, as demonstrated in Figure 9-4.

Figure 9-4: Using indexing reduces I/O activity

In Figure 9-4, an index is created for only the region names. When searching for a region with a region name, only the index can be scanned, potentially scanning much less physical disk space. The result is less physical I/O activity and generally less hardware resource usage. The result is improved performance.

Creating indexing for fast access is a little more complex because, technically speaking, an index usually involves a copy of a small portion of data, and it allows for specialized searching algorithms. Figure 9-5 illustrates the physically scanned path through a specially constructed index.

Figure 9-5: Indexing allows fast scanning using specialized index structures and scanning algorithms.

When searching for the region of Europe in Figure 9-5, only the following physical areas of the index are read to find the region of Europe:

The physical start of the index (the root), which points to three other areas containing ranges of the first letters in region names.
A branch section containing the first letters of all regions between E and N. Each branch section contains references to physical areas on disk containing each region.
A leaf section containing the actual region of Europe.

It is important to note that the index shown in Figure 9-5 contains only the name of the region. When the region is found in the index, there will be a pointer (some kind of address reference) into the entire table, and all the fields relevant to the region of Europe.

So the objective of using indexing is to reduce disk I/O activity and to utilize specialized high performance searching algorithms. Commonly used algorithms for indexing relational databases include binary trees, hashing algorithms, and sometimes bitmaps. A binary tree creates a tree structure much like that shown in Figure 9-5. A hashing index creates a number code for each unique string value where subsequent searching scans the hash table of numbers . A bitmap literally creates a map of bits a 0 indicates a value does not exist, and 1 indicates a value does exist. Even a brief understanding of different indexing algorithms is well beyond the scope of this book. However, you now know that there are numerous different methods .

There is one more thing to remember about indexing. An index imposes a sorted order on itself. Thus, when an index is read the data is read in the sorted order in which the index is created, regardless of the order of data in a table or XML document. However, XML documents do contain an inherent sorting based on the structure of the XML document concerned so index sorting is not always an advantage with respect to native XML database access.

Indexing a Native XML Database

Different native XML database tools will probably use all sorts of different indexing methods in actual implementation. However, the indexing structures created are unlikely to vary from those just presented from an algorithmic perspective.

The most likely method for indexing XML documents in a relational database, even XML data types, is the creation of a separate structure containing the XML elements that are required to be indexed. So, in the case of the index in Figure 9-5, the index would contain a single field copy of all the records in the table. That single field would contain a single region name entry for each region. In addition, the index would contain some kind of pointer, which would allow a direct link between index and table (or XML document), as a disk address at the I/O level. In other words, your database needs to assign a disk address to every item in the XML document (every index item). Then the index contains a copy of that pointer address. The result is that when a region is found in the index, the pointer related to that region in the index object is passed to a procedure that finds the record in the table (or XML document) based on the disk address of the entire record. The disk addresses assigned to elements in the table or XML document will either be assigned on the creation of the full dataset, or perhaps when an index is created. The process and placement of all these various steps depend entirely on the software that is used to access an XML native database, or essentially what is actually a single XML document, or a set of XML documents maintained as a set of collections. And those collections are stored as XML as well.

Important

You might ask yourself some questions here. Perhaps something like this: Why store XML documents into XML data types when XML data type indexing is suspect, and perhaps inferior to relational database indexing? Also, Why not simply store data into relational tables, and convert backwards and forwards between tables and XML documents as needed? The second question should be answered first, and the answer is that storing as XML removes the need to continually convert. It also gives access to specialized, highly flexible and powerful XML functionality, such as XPath and XQuery. In answer to the first question: An index, is an index, is an index. Some relational databases have more sophisticated indexes than others. There is no reason why XML indexing is not particularly less efficient than all types of indexing, in all relational databases. Of course, XML data can be stored in relational tables. Its an option and a choice. There are advantages and disadvantages to both relational database and XML data type storage of XML data. In general, the larger a database is, the more prudent the option of relational table storage becomes and not XML storage into XML data types. Unless of course you use XML data type collections.

So thats how indexes might be constructed from a purely technical perspective. How would indexes need to be created with respect to XML document content? It is likely that indexing on an XML document is to be of four different types:

Structural index: Indexes of elements and attributes, plus the locations relative to other elements and attributes within a single XML document. These help searches are based on elements. For example, in a demographics database, you could search for all countries in a particular region that have more than one city for a country.
Value index: Text and attribute values are often searched for in a single XML document and thus creating indexes on some, all, or combinations of textual and attribute values makes perfect sense. For example, in a demographics database, you might want to find all countries where population exceeds 1,000,000.
Full-text index: This applies to searching for specific values across a collection of XML documents to return a subset collection of XML documents. This index is effectively a large value index across many XML documents, or fragments, in a collection.
Context index: This is a more generalized form of indexing, perhaps a little antiquated, where many documents are indexed by creating an index containing some value that uniquely identifies subject matter of each XML document. The indexed values are stored in what is sometimes known as a side table , and then an index is created on the side table. The result is fast indexed access into a large collection of XML documents, based on whatever indexed values are created for each XML document in the collection. This approach is probably very tedious . It is better to index XML document contents. This is because of the flexible nature of XML. It is more likely that XML documents will be large and disorganized, rather than small fragments that can be easily categorized. Any kind of manual categorization could take forever with the sheer physical size of information stored in modern databases.

What About Using XSL and the XML DOM?

XSL is used to apply consistent templating, or visible results, to repeated data content and context across an XML document. The XML DOM is a generic structure upon which an XML document is built. The result is that the XML DOM can be used to access XML documents programmatically, regardless of the data and metadata content of an XML document. XSL can be used to apply formatting based either on XML document content or even applied to the XML DOM, thus templating generically across an XML document, again regardless of data or metadata content. What does all that mean? It means that you dont have to know what the content, context, subject matter, or topic area of an XML document is in order to apply XML DOM and XSL templating at the programming level. In other words, XSL and the XML DOM simply dont care whether you are dealing with customers, demographics, aircraft parts , or simple wobbly widget things. Thats why its called programmatically . You can put something onto a web page in a similar fashion, regardless of what the actual data is. Obviously, generic access to data will result in generic output, depending on how much specific programming is applied.

When using the XML DOM, you can actually store the XML DOM in a native XML database. Thus you can then use the XML DOM structure for future generic programming and processing, when retrieving XML data from the database at a later point in time.

Using the XML DOM boils down to one simple point. The XML DOM, for any particular XML document, is generally much bigger than the document itself in terms of the amount of code it generates, and ultimately the amount of memory it occupies. As a result, using the XML DOM from within a native XML database, which requires storage of the XML DOM in the database, can cause serious performance issues down the road. This is true for both large and smaller XML documents, and is likely to be relatively exacerbated for the smallest of XML documents. The general consensus of opinion, as far as I can see, is that if you are going to use XSL and the XML DOM, do so at run-time and on a web or application server. Do not persistently store the XML DOM for XML documents into a native XML database. If you think about it, XML data encompasses all of data, metadata, and structure. Thats plenty of definition with respect to data and database storage. Why store programming semantics in a database, in the form of XML DOM and XSL programming structures?