3.1 Information capture and reuse
For all the valuable abstract data that is managed in database systems, there is even more that is hidden in rendered word processing documents. That fact represents an enormous intellectual property loss for enterprises, of course, but it also represents a nuisance and a time-waster for the information workers who work with those documents.
Consider the articles written for a company's websites and newsletters. Every one is likely to contain a title, author, and date within it, but more often than not that information has to be retyped, or individually copied and pasted, to get it into a catalog entry. That's because there is no reliable way for a computer to recognize those data items in order to extract them.
3.1.1 Word processing
In contrast, look at Figure 3-1, which shows an article being edited in Microsoft Word.
The article is actually an XML document that conforms to a schema of the user's choosing, in this case article. The user has opted to display icons that represent the start- and end-tags. Note that there are distinct elements for the title, author, and date.
Solution developers can use the XML elements to check and normalize information as it is entered, whether or not the tag icons are displayed. An application, for example, could notify the user if the text entered for a date element isn't really a valid date. Or it could automatically supply the current year if none was entered.
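The kind of check described above can be sketched outside Word in a few lines of Python. This is only an illustration, not how a Word solution would actually be wired up: the function names normalize_date and check_article are hypothetical, and the accepted input formats (YYYY-MM-DD, or MM-DD with the year omitted) are assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from datetime import date, datetime


def normalize_date(text, today=None):
    """Return an ISO date string, or None if the text is not a valid date.

    Assumed input formats: 'YYYY-MM-DD', or 'MM-DD' with the year omitted,
    in which case the current year is supplied.
    """
    today = today or date.today()
    text = text.strip()
    try:
        return datetime.strptime(text, "%Y-%m-%d").date().isoformat()
    except ValueError:
        pass
    try:
        # No usable year in the text: try again with the current year prepended.
        return datetime.strptime(f"{today.year}-{text}", "%Y-%m-%d").date().isoformat()
    except ValueError:
        return None


def check_article(root, today=None):
    """Normalize every date element in place and return a list of problems."""
    problems = []
    for elem in root.iter("date"):
        fixed = normalize_date(elem.text or "", today)
        if fixed is None:
            problems.append(f"not a valid date: {elem.text!r}")
        else:
            elem.text = fixed
    return problems
```

Because the check is keyed to the element name rather than to anything visible on screen, it works the same whether or not the tag icons are displayed.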
The right-hand pane is called the task pane; it can be used for various purposes. In the figure, the top of the task pane shows the XML structure of the document. At the bottom is a list of the types of element that are valid at the current point in the document, according to the article schema.
The document is also a normal Word document, so Word's formatting features can be used in the usual way.
There are three ways to save this document as XML:
WordML
WordML is Word's native XML file format. It preserves the Word document just as the DOC format would, including formatting and hyperlinks. However, it doesn't include any of the article markup, so we won't discuss this option further here. (We cover it in Chapter 5, "Rendering and presenting XML documents", on page 86.)
custom XML
The document can be saved as an XML document conforming to a custom schema; in this case, article. A custom schema would normally be defined by an enterprise, or by a committee set up by an industry to which the enterprise belongs. For that reason, it would be designed to preserve the abstract data needed for the user's applications. For example, the title, author, and date can easily be identified by software and extracted for use in a catalog of articles.
mixed XML
The saved document could contain both WordML and the article markup, since the two are in different namespaces. This option preserves the formatting applied by the user, while still preserving the abstract data and distinguishing it from the rendition information.
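To make the catalog example concrete, here is a minimal Python sketch that pulls the title, author, and date out of a saved article. Several things in it are assumptions made for illustration: the namespace URI urn:example:article, the function name catalog_entry, and the simplified shape of the sample documents. Because it looks elements up by namespace-qualified name and gathers all nested text, the same function should work on a custom XML file and on a mixed file in which WordML formatting elements sit inside the article elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace for the article schema; a real schema defines its own.
ARTICLE_NS = "urn:example:article"


def catalog_entry(xml_text):
    """Return a dict with the title, author, and date found in the document."""
    root = ET.fromstring(xml_text)
    entry = {}
    for field in ("title", "author", "date"):
        elem = root.find(f".//{{{ARTICLE_NS}}}{field}")
        # itertext() collects text from nested elements too, so any
        # formatting markup wrapped around the words is simply skipped.
        entry[field] = "".join(elem.itertext()).strip() if elem is not None else None
    return entry


# A custom XML save: article markup only.
custom = (
    '<article xmlns="urn:example:article">'
    "<title>XML in the office</title><author>A. Writer</author>"
    "<date>2004-03-15</date><body>Text of the article.</body></article>"
)

# A much simplified mixed save: article elements wrapping WordML runs.
mixed = (
    '<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"'
    ' xmlns:a="urn:example:article"><w:body><a:article>'
    "<a:title><w:p><w:r><w:t>XML in the office</w:t></w:r></w:p></a:title>"
    "<a:author><w:p><w:r><w:t>A. Writer</w:t></w:r></w:p></a:author>"
    "<a:date><w:p><w:r><w:t>2004-03-15</w:t></w:r></w:p></a:date>"
    "</a:article></w:body></w:wordDocument>"
)
```

Calling catalog_entry on either sample yields the same three values, which is the point of keeping the abstract data in its own namespace, separate from the rendition markup.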
In our example, the article is the entire Word document, but that isn't a requirement. It is possible to intersperse short XML documents within a larger Word document. For example, a travel guide might include multiple XML structures that describe hotels, with subelements for the name, address, number of rooms, rates, etc.
Using XML with Word documents enables companies to capture more of the intellectual property that is created informally by individuals and work groups, and that typically remains inaccessible to enterprise information systems. As XML, that property becomes a portable asset that can be reused as needed.
3.1.2 Forms
For many purposes, a data entry form is more suitable for information capture than a typically larger and less constrained word processing document. InfoPath lets you design and use forms that are really XML documents that conform to your own custom schemas.
Figure 3-2 shows the layout of an order form in InfoPath's design mode. The structure of the order schema is shown in the task pane on the right, from which element types can be dragged onto the form.
Note that there is only one item line in the form design. Because the order schema allows item elements to be repeated, a user entering data will be able to add item lines as needed. Had customer elements been repeatable, the form would expand to allow insertion of the group of customer information fields.
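Software that consumes a filled-in form has to cope with the same repetition. The following Python sketch processes however many item elements an order happens to contain. The subelement names quantity and price and the function name order_total are hypothetical; the actual order schema is not shown in this chapter.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal


def order_total(xml_text):
    """Return (number of item lines, order total) for an order document."""
    root = ET.fromstring(xml_text)
    items = root.findall("item")  # zero, one, or many item elements
    total = Decimal("0")
    for item in items:
        quantity = int(item.findtext("quantity"))
        price = Decimal(item.findtext("price"))
        total += quantity * price
    return len(items), total


# A sample order with two item lines (element names are assumptions).
sample_order = (
    "<order><customer>Example Co.</customer>"
    "<item><quantity>2</quantity><price>9.50</price></item>"
    "<item><quantity>1</quantity><price>20.00</price></item>"
    "</order>"
)
```

The loop does not need to know in advance how many item lines the user added, just as the form design needs only one.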
Unlike Word, InfoPath generates an XSLT stylesheet to control the rendering of the form. The formatting can even be based on the data entered in the form. For example, the dialog box in Figure 3-3 specifies that negative prices should be shown in a different color.
InfoPath is described in detail in Chapter 9, "Designing and using forms", on page 180.
3.1.3 Relational data
XML elements, whether captured in Word or Excel or InfoPath (or any other way, for that matter), are as well-defined and predictable as the columns and tables of a database. XML documents of all kinds are therefore a source of information as rich as any other operational data store. Companies can aggregate, parse, search, manage, and reuse the data in documents in the same way they do the transactional data that is typically captured for relational databases.
They can also import the document data into a database and use it in conjunction with data from other sources. In addition, they can export DBMS data as XML documents.
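As a rough illustration of that round trip, the Python sketch below loads article documents into a relational table and then exports the rows again as XML. It uses SQLite from the Python standard library as a stand-in for a DBMS such as Access. The table layout, the function names, and the un-namespaced element names are all assumptions made for the example.

```python
import sqlite3
import xml.etree.ElementTree as ET


def import_articles(conn, documents):
    """Insert the title, author, and date of each article document as a row."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles (title TEXT, author TEXT, date TEXT)"
    )
    for xml_text in documents:
        root = ET.fromstring(xml_text)
        conn.execute(
            "INSERT INTO articles VALUES (?, ?, ?)",
            (root.findtext("title"), root.findtext("author"), root.findtext("date")),
        )
    conn.commit()


def export_articles(conn):
    """Return the articles table as one XML document, sorted by date."""
    root = ET.Element("articles")
    rows = conn.execute("SELECT title, author, date FROM articles ORDER BY date")
    for title, author, article_date in rows:
        article = ET.SubElement(root, "article")
        ET.SubElement(article, "title").text = title
        ET.SubElement(article, "author").text = author
        ET.SubElement(article, "date").text = article_date
    return ET.tostring(root, encoding="unicode")
```

Once the data is in a table it can be joined with data from other sources using ordinary SQL, and the export step shows that the same rows can be turned back into a document whenever one is needed.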
Figure 3-4, for example, shows the options Access offers when exporting data as XML. You can specify which tables and records to export and how to sort and/or transform them.
Figure 3-5 shows the options for exporting a schema as XML. You can choose whether or not to export the schema, and whether it should be exported within the data document or as an independent schema document.