What Can XML Do? | Beginning XML Databases (Wrox Beginning Guides)

XML can be used for all sorts of things. Mostly, it can manage complexity and help make data transfers easier. It is also far more readily human-readable than the contents of a relational database. In fact, it is more computer-readable as well, because it is platform independent and therefore universally understandable. That's the idea, anyway.

One of the problems with XML and database storage is its probable inability to compete with the sheer database sizes and volume processing capabilities of modern relational databases. Another factor often mentioned is that schema changes are easier in XML. You can decide that one for yourself, but I would suggest looking at it from a database perspective. Your conclusion might be quite different when you consider the same concepts from an application development perspective.

Managing Complex Data with XML

XML is sometimes touted as a revolutionary method for managing highly complex data structures. This is sometimes true and sometimes not. Why both?

XML is object structured and theoretically capable of being used to build what is effectively an object database. Obviously, a fully capable object database includes all object-oriented techniques, including capabilities such as inheritance. Going further into the realms of encapsulation of coding within an object-built XML database, you would have to expand on XML with specialized tools, such as DTDs and XSDs. Object databases are much more capable than relational databases at managing highly complex data structures. The real reasoning behind these comments is that as an object design breaks things down into smaller parts, that object design becomes more efficient and more amenable to applications. The opposite is often true of relational databases: as a relational database is normalized to extremes, it becomes less efficient and more difficult to write applications for. In commercial environments, denormalization of relational database models is common in order to increase performance.

XML is also not good at dealing with huge amounts of data, even when that data is complicated. Relational databases can do everything, even if their management of complexity is mediocre. The reasons for this are two-fold:
- XML and object structures are far too complex for large quantities of information.
- Relational databases are better industry tested than any other database modeling approach.

Does Database Size Matter?

So, XML can be useful for handling complex data structures, but generally only when the database content size is small. Small is a relative term. Ten years ago, small could have been 1 megabyte. Today, small could be 100 megabytes. In fact, measuring a database as being small is probably pointless. A better method might be to measure a database as being large, or not. Very large databases (VLDBs) are presently in the terabyte range. By contrast, a few years ago a VLDB might have been a few hundred gigabytes. The issue is that using XML documents to store even a few hundred megabytes is probably very brave, to say the least. A few megabytes, or anything under 100 megabytes, might realistically be within the capabilities of XML document storage.

Storing large quantities of data in XML documents can cause serious performance issues. Even with collections of XML documents, the individual XML fragments in the collection will ultimately become too large for efficient searching. Of course, some modern native XML database engines do allow some forms of indexing, which might help performance.

From a purely physical, disk I/O perspective, an XML document must be read from the beginning of the file to the end of the file, unless effective indexing can be used. Large files equate to heavy I/O activity. Reading all of a large file is incredibly inefficient, especially when you don't want to read the entire file. A relational database, on the other hand, is built to point at different isolated parts of a database using indexes. Additionally, even the indexes can be read selectively, because indexes are searched with specialized algorithms rather than read in full.

Selective disk I/O means that only a very small portion of a database must be read to satisfy database queries. The exception, of course, is when the intention is to read all the data, which is counterproductive when there are terabytes of data on disk.
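The difference between a full read and selective access can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `<cities>` document, the `city` table, and the `city_id_idx` index are hypothetical names invented for the example, and SQLite stands in for a relational database. The XML search has to parse from the start of the document until the record turns up, whereas the relational lookup goes through an index.

```python
import io
import sqlite3
import xml.etree.ElementTree as ET


def build_xml(n):
    """Build a hypothetical <cities> document holding n <city> records."""
    root = ET.Element("cities")
    for i in range(n):
        city = ET.SubElement(root, "city", id=str(i))
        ET.SubElement(city, "name").text = f"City{i}"
    return ET.tostring(root)


def find_in_xml(xml_bytes, wanted_id):
    """Sequential scan: parse from the top until the wanted record appears."""
    scanned = 0
    for _event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("end",)):
        if elem.tag == "city":
            scanned += 1
            if elem.get("id") == wanted_id:
                return elem.findtext("name"), scanned
            elem.clear()  # discard the children of records already passed
    return None, scanned


def build_table(n):
    """Load the same records into an indexed table (SQLite as a stand-in)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE city (id INTEGER, name TEXT)")
    conn.executemany(
        "INSERT INTO city VALUES (?, ?)",
        ((i, f"City{i}") for i in range(n)),
    )
    conn.execute("CREATE INDEX city_id_idx ON city (id)")
    return conn


def find_in_table(conn, wanted_id):
    """Indexed lookup: the engine searches the index, not the whole table."""
    row = conn.execute(
        "SELECT name FROM city WHERE id = ?", (wanted_id,)
    ).fetchone()
    return row[0] if row else None


def query_plan(conn):
    """Return SQLite's description of how it would run the lookup."""
    rows = conn.execute(
        "EXPLAIN QUERY PLAN SELECT name FROM city WHERE id = ?", (0,)
    ).fetchall()
    return " ".join(str(row[-1]) for row in rows)


if __name__ == "__main__":
    name, scanned = find_in_xml(build_xml(1000), "900")
    print(name, "found after scanning", scanned, "records")
    conn = build_table(1000)
    print(find_in_table(conn, 900), "found via:", query_plan(conn))
```

In the XML case the number of records scanned grows with the position of the record in the document. In the relational case the query plan reports a search using the index, regardless of how many rows the table holds.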

Once again, some native XML databases do incorporate indexing. How effective those XML indexes are is unknown because the technology is very new, and how they compare with Oracle or SQL Server indexing is also unknown. I wouldn't write home about it without extensive commercial implementation. And if anyone tells you that XML can handle data warehouse storage and activities, then you might want to experiment even more before implementing.

Are Schema Changes Easier with XML?

There is also much talk about how schema changes differ between XML data storage and relational databases. Yes, it is quite possible that some older relational databases do not allow easy changes to schemas. Most up-to-date relational databases do allow extensive and relatively simple schema changes, using built-in commands. For example, you can change a table with an ALTER TABLE command. Some relational databases even allow dynamic changes to metadata objects, and cater to concurrency during the processing of those changes. So, even multiple-user and concurrency aspects are automatically handled by a relational database when changing a schema.
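As a minimal sketch of such a built-in command, the following Python snippet adds a column to a table that already contains data. It assumes SQLite as the relational database, and the `customer` table and its columns are hypothetical names chosen for illustration; other databases support ALTER TABLE with their own syntax variations.

```python
import sqlite3


def add_email_column():
    """Create a small table, then change its schema with ALTER TABLE."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("INSERT INTO customer (name) VALUES ('Ann')")

    # The schema change itself: one built-in command, no reload of the data.
    conn.execute("ALTER TABLE customer ADD COLUMN email TEXT")

    # Column names come from the metadata; row values come from the data.
    columns = [row[1] for row in conn.execute("PRAGMA table_info(customer)")]
    rows = conn.execute("SELECT id, name, email FROM customer").fetchall()
    return columns, rows


if __name__ == "__main__":
    columns, rows = add_email_column()
    print(columns)
    print(rows)
```

The existing row survives the change, with the new column left empty (NULL) until a value is supplied.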

The issue with XML is that XML is so flexible that it more or less expects structural changes to its schemas. Part of the reason for its flexibility is its accessibility, which is of course its intention, and a sensible intention it certainly is. One question to ask about flexibility in schema structure is this: Do you really want flexibility in metadata? Yes, perhaps for B2B data transfers, but perhaps not in database environments requiring high levels of security from prying eyes.

The fact of the matter is that XML documents store both data and metadata. And program code can access and manipulate both data and metadata within the same XML document. Access to relational database data and metadata is a little more complex, in that data must be accessed separately from metadata. Additionally, in a relational database, changing any metadata, such as a table, does force data changes. This may not always be a requirement when changing XML metadata because some XML metadata changes alter structure only, not the data itself (easily transformed using XSLT). However, changing the hierarchical structure of XML data will still involve disk I/O activity, and probably heavier than that for a relational database. This is because more of the XML document will likely be read to accomplish any specific task.
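The point about data and metadata living together can be illustrated with a short Python sketch. Everything named here is an assumption made for the example: the `<country>` fragment and its values are invented, and SQLite plays the part of the relational database. In the XML case a single pass over the document yields both the element names (metadata) and their contents (data). In the relational case the rows come from a query, while the column definitions have to be fetched separately from the catalog.

```python
import sqlite3
import xml.etree.ElementTree as ET

# A hypothetical fragment; the names and values are purely illustrative.
DOC = "<country><name>Exampleland</name><population>1000</population></country>"


def xml_structure_and_data(xml_text):
    """One pass over the document returns (element name, value) pairs."""
    root = ET.fromstring(xml_text)
    return [(child.tag, child.text) for child in root]


def relational_structure_and_data():
    """Data and metadata are retrieved by two separate requests."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE country (name TEXT, population INTEGER)")
    conn.execute("INSERT INTO country VALUES ('Exampleland', 1000)")
    data = conn.execute("SELECT name, population FROM country").fetchone()
    metadata = [row[1] for row in conn.execute("PRAGMA table_info(country)")]
    return metadata, data


if __name__ == "__main__":
    print(xml_structure_and_data(DOC))
    print(relational_structure_and_data())
```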

The result is that schema changes are perhaps easier in XML for two reasons:

- When a database is small enough, it is more efficient to read it in its entirety than to read a small part of it.
- Accessing XML is visually much more appealing and requires only low-complexity, widely available tools (Internet Explorer is a good example).

In XML you can see the data and the metadata simply by popping it up in a browser. Retrieving the same data and metadata, on the same page, from a relational database might require complex coding and sometimes even more than one tool.

All this leads you back to using the size of a database to determine if XML is a better option. Try this as an experiment:

- Load a very small XML document, maybe a few kilobytes in size, into a browser such as Internet Explorer. How long does it take?
- Now do the same thing with a larger XML document, such as the demographics.xml document file mentioned throughout this book. It takes a lot longer: even though you see the top part of the XML document, the little load time bar at the bottom of Internet Explorer sits around for a while. If you open or close one of the subtrees, you might have to wait for a response. And this demographics XML document is only 4 megabytes. Just imagine the same thing, in a browser, with an XML document of 1 gigabyte, or even 1 terabyte.

Don't do this at home! You might wind up watching your browser load up the document for a week or so.
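A safer version of the same experiment can be run outside the browser. The sketch below is an assumption-laden stand-in, not a reproduction of the browser test: it generates synthetic documents rather than using demographics.xml, the `<demographics>` and `<region>` element names are invented, and it times Python's in-memory parser instead of a browser's rendering. The record counts are kept modest so that the script finishes quickly.

```python
import time
import xml.etree.ElementTree as ET


def make_document(records):
    """Generate a synthetic XML document with the given number of records."""
    parts = ["<demographics>"]
    for i in range(records):
        parts.append(
            f"<region id='{i}'><name>Region{i}</name>"
            f"<population>{i}</population></region>"
        )
    parts.append("</demographics>")
    return "".join(parts)


def time_parse(xml_text):
    """Parse the whole document and report (record count, seconds taken)."""
    start = time.perf_counter()
    root = ET.fromstring(xml_text)
    elapsed = time.perf_counter() - start
    return len(root), elapsed


if __name__ == "__main__":
    for records in (100, 10_000, 100_000):
        doc = make_document(records)
        count, seconds = time_parse(doc)
        print(f"{len(doc):>10} characters, {count:>7} records: {seconds:.4f} s")
```

Because the whole document must be parsed before anything can be found in it, the time taken should rise with document size. The exact figures depend entirely on the machine running the script.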

After 20 years of building databases and writing software, I find that an appropriate mix of development environments (this includes the database you use) is the most prudent option. An object approach is often the most efficient modern approach to building application code. However, as little as ten years ago, this was not the case because computers were simply not fast enough. Now they are. CPU processing speed and front-side bus speed (the thing between the processor and the RAM on the motherboard of your home computer) have caught up with software tools such as Java.

Then again, I also worked on financial applications more than ten years ago. This was back when computers could not cope with object-oriented application processing requirements. Those financial applications were so incredibly complicated that implementation in a relational database and a procedural programming language would have resulted in bankrupt software development companies.

I have also worked extensively with purely object-oriented databases, with tools such as Java as the front end. Object databases enable you to create classes, and allow multiple inheritance and encapsulation of procedural code, all black-boxed within the database itself. I concluded from my work with object databases that they have their niche. Object modeling can handle complexity as long as the data remains small (a few megabytes). Anything larger than that needs enormous hardware computing power.

Perhaps as computers become more powerful and cheaper, the use of XML as a database storage device will become increasingly cost effective and practical. In fact, this is highly likely. Then again, the established relational database vendors have in the past become very proficient at simply including new technologies within their databases (see Chapters 5 and 6). And some of those vendors perform this inclusion process quite successfully. As quite an odd comparison, one of the most successful civilizations in history was the Roman Empire. The Romans were savage, brutal, and quite undemocratic. But when it came to the religions of conquered nations, they assimilated those religions into their own. And sometimes it worked the other way around, with the Romans assimilating themselves. Relational database vendors have been quite adept at assimilating new technologies. History can often repeat itself.

Use whatever solves your problems. Don't dig yourself into a hole by not keeping an open mind about other options and products. Those other options and products include both older technologies (relational databases) and newer leading-edge technologies (native XML databases). Then again, beware of rushing in and being overzealous with the use of bleeding-edge technologies. It is always best to test, verify, and make sure something works before sinking precious time and money into it.