Section 20.8. Documents and data | XML in Office 2003: Information Sharing with Desktop XML



20.8. Documents and data

For many decades, data processing got the big budgets while document processing got a room in the basement with a copying machine. While the data processors relished their importance to the organization, the document processors basked in their importance to humanity. They were preservers of human knowledge, not just high-speed bean counters.

No wonder the two never got along!

Markup languages are changing all that. With XML, documents and databases both store data and can share it, so document processing and data processing can be performed at the same time, by the same people.

20.8.1 It's all data!

In an XML document, the text that isn't markup is data. You can edit it directly with an XML editor or plain text editor. With a stylesheet and a rendering system you can cause it to be displayed in various ways.

In a database, you can't touch the data directly. You can enter and revise it only through forms controlled by the database program. Rendition, however, works much as it does for XML documents, except that the stylesheet is usually called something like "report template".

The important thing is that, in both cases, the data can be kept in the abstract, untainted by the style information for rendering it. This is very different from word processing documents, of course, which normally keep their data in rendered form. Even WordML is a rendition, despite its use of XML.
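To make the separation concrete, here is a minimal sketch in Python, using only the standard library. The memo document, its element names, and the two rendering functions are hypothetical, invented for illustration; the functions stand in for real stylesheets. The point is only that the same abstract data can be given more than one rendition.

```python
import xml.etree.ElementTree as ET

# Hypothetical memo; the element names are illustrative only.
DOC = "<memo><to>Pat</to><body>Budget is <em>approved</em>.</body></memo>"

root = ET.fromstring(DOC)

# The data is the text that isn't markup.
data = "".join(root.itertext())


def render_plain(memo):
    """One rendition of the data: a plain-text layout."""
    to = memo.findtext("to")
    body = "".join(memo.find("body").itertext())
    return f"To: {to}\n{body}"


def render_html(memo):
    """Another rendition of the same data: an HTML fragment."""
    to = memo.findtext("to")
    body = "".join(memo.find("body").itertext())
    return f"<p><b>{to}</b>: {body}</p>"


print(render_plain(root))
print(render_html(root))
```

Neither function changes the memo itself; the style lives in the rendering code, not in the data.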

20.8.2 Data-centric vs. document-centric

Documents, data, and processes are sometimes characterized as "data-centric" in contrast to "document-centric". Since all XML documents (except empty ones) contain data, these terms are actually a misleading shorthand. Worse, they are applied in two very different contexts:

how much the XML resembles relational data; and,
whether you have to deal with the whole document at once.

20.8.2.1 How relational is it?

The data-centric misnomer is common among database hackers trying to describe structures that map easily onto relational tables and primitive datatypes. Structures that don't are called document-centric.

The intended meaning of data-centric is that the document structure – really element structure, since a document is essentially just the largest element – is fully predictable.

An element has a fully predictable structure if it and its subelements are constrained to contain either:

type-sequenced elements (e.g., a sequence of elements of the types: quantity, itemNum, description, price),
data characters only (i.e., #PCDATA), or
nothing at all.

Fully predictable elements can easily be visualized as forms. A business transaction document such as a purchase order is more likely to be fully predictable than a memo.
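The three cases above can be sketched as a rough check in Python. Strictly speaking, predictability is a property of the constraints (the schema), not of any one document, so this sketch only tests whether a given instance is consistent with those cases: it rejects elements that mix data characters with subelements. It does not verify the order or types of the subelements. The function names and sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET


def has_text(s):
    """True if s contains data characters other than whitespace."""
    return bool(s and s.strip())


def looks_predictable(elem):
    """Instance-level approximation: no element mixes data with subelements."""
    children = list(elem)
    if not children:
        return True  # data characters only, or nothing at all
    # Element content: no data characters allowed around the subelements.
    if has_text(elem.text) or any(has_text(c.tail) for c in children):
        return False
    return all(looks_predictable(c) for c in children)


# Hypothetical samples: a line item from a purchase order, and a memo.
ITEM = ("<item><quantity>2</quantity><itemNum>A-17</itemNum>"
        "<description>Stapler</description><price>9.50</price></item>")
MEMO = "<memo><body>Budget is <em>approved</em>.</body></memo>"

item = ET.fromstring(ITEM)
memo = ET.fromstring(MEMO)

# A form-like element maps directly onto a row of a relational table.
row = {child.tag: child.text for child in item}

print(looks_predictable(item), looks_predictable(memo))
print(row)
```

The line item passes the check and flattens into a single row; the memo fails because its body mixes text with an emphasized phrase.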

In addition to "data-centric", the misnomer "highly structured" is sometimes used. However, "highly predictable" would be more precise, particularly as many documents that aren't fully predictable are still much more predictable than they are freeform.

20.8.2.2 How granular is it?

Another (mis)use of data-centric is to characterize the storage and/or access of documents at the level of individual elements, rather than the entire document at once (document-centric). Once again, the usage is misleading because what it describes has nothing to do with data per se, and because it implies a contradiction between data and documents that does not exist.
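The difference in granularity can be sketched in Python with the standard library. The order document and the function name are hypothetical. The first approach loads the entire document before touching any of it; the second handles one item element at a time as the parser reaches it. Both yield the same data, which is why the distinction says nothing about the data itself.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical order document, as bytes so it can be read as a stream.
ORDER = (b"<order><item><itemNum>A-17</itemNum></item>"
         b"<item><itemNum>B-02</itemNum></item></order>")

# Whole document at once: parse everything, then query the tree.
whole = ET.fromstring(ORDER)
all_nums = [e.text for e in whole.iter("itemNum")]


def item_numbers(stream):
    """Element at a time: handle each item as soon as it has been parsed."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "item":
            yield elem.findtext("itemNum")
            elem.clear()  # discard the item's content once it has been used


streamed = list(item_numbers(io.BytesIO(ORDER)))
print(all_nums, streamed)
```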

20.8.3 Document processing vs. data processing

While "data-centric" and "document-centric" aren't rigorous terms for characterizing information, they are quite meaningful when applied to processing. However, because XML can preserve abstract data (like a database) but still be interchanged and processed as a character string (like a document), it is starting to break down the historic separation of the two paradigms. Applications can now intermix data processing and document processing techniques to get the job done.

20.8.4 Comparing documents to data

Since documents contain data, what are people doing when they compare or contrast documents and data?

They are being human. Which is to say, they are using a simplified expression for the complex and subtle relationship shown in Table 20-1. They are comparing the typical kind of data that is found in XML and word processing (WP) documents with business process (BP) transactional data (operational data), which usually resides in databases.

Table 20-1. Typical traits of data

	XML data	BP data	WP data
Presentability	Abstraction	Abstraction	Rendition
Source	Written	Captured	Written
Structure	Hierarchy + links	Tables	Paragraphs
Purpose	Processing	Processing	Presentation
Location	Document	Database	Document

Note that the characteristics in the table are typical, not fixed. For example, XML data can be a rendition (HTML and WordML are examples). In addition, XML data could:

Be captured from a data entry form or a program (rather than written);
Consist of simple fields like those in a relational table (rather than a deeply nested hierarchy with links among the nodes); and
Be intended for presentation as well as processing.

Caution

The true relationship between documents and data isn't as widely understood as it ought to be, even among experts. That is in part because the two domains existed independently for so long. This fact can complicate communication.

