Section 1.2. Office documents as information assets



1.2. Office documents as information assets

These days, no one seems to doubt that data is a valuable company asset. Business data that is more up-to-date and accessible can lead to more accurate planning, improved customer service, and more efficient business processes.

Technologists therefore treat data with great care. They carefully design databases and make them available to a variety of business processes. They organize these databases so that the data can be quickly and easily queried, joined, sorted and summarized. They protect them using security measures and disaster recovery plans.

All too often, though, this great care is extended only to databases containing operational data about customers, products, sales, and suppliers. There is a vast wealth of equally important data, stored in documents, that is largely ignored. It may include legal contracts, Web pages, memos, budget spreadsheets, product specifications, press releases, marketing brochures, conference presentations, product manuals and company policies and procedures.

These documents are often isolated and inaccessible. They are stored in proprietary binary formats on file servers, using inconsistent directories, names and versioning strategies. They are in rendered form, and therefore cannot be searched, summarized or indexed consistently in any meaningful way.

To coin a phrase, the data in documents isn't integrated with the other data of the enterprise. Using XML in Office allows that data to be integrated. Ironically, XML enables integration by separation:

Separating the document representation from the software.

Because XML is a non-proprietary, open file format, a wide variety of software tools can act on your documents. This means that these documents can be integrated into your business processes just like any data in a database.

Separating the data content from style information.

In order to use your data effectively, you need to understand what it means rather than just what it looks like. XML lets you identify and describe the content of your document, not just its outward appearance.

Without delving too far into the technical details just yet, let's take a closer look at how these two separations can help you.

1.2.1 Separating the document representation from the software

Previous versions of Office primarily used proprietary formats, such as .doc and .xls files, to store documents. These binary formats could effectively be used only by the software that created them. If you've ever tried to open up a .doc file in Notepad, as shown in Figure 1-2, you know that the format is indecipherable to the human reader, and it is largely indecipherable to other software applications as well. Only the Microsoft Word application knows how to make complete sense of it.

Figure 1-2. Word `.doc` document viewed in a text editor

We are so accustomed to this state of affairs that we do not think of the documents separately from the software. A document is a "Word document" or an "Excel worksheet" that has no use outside Office.

XML, in contrast, is a non-proprietary character-based data representation that can be processed both by humans and, more importantly, by hundreds of computer programs. When an Office document is saved as XML, it can be used by other tools in addition to Microsoft Word.

The document is no longer a "Word document", but an XML document that can be edited in, or processed by, whatever tool makes the most sense for a given task. The document can be queried, transformed, sorted, viewed on the Web, passed around in e-commerce business processes, validated, stored in a database, shared with non-Office users and archived and indexed.

Separating the document format from the software makes the Office software more useful, too. Suddenly Office applications can be used to edit a wider variety of documents. The analytical tools of Excel can be used on any data that can be represented as XML, not just data that is stored in an Excel worksheet. Word can be used to edit any XML document.

Example 1-1 shows Doug's article represented as an XML document. Tags, such as <title> and </title>, are used to mark the start and end of data elements. Doug didn't type those tags; Word did it for him. In fact, Office users don't even have to see them on the screen if they don't want to.

Example 1-1. Doug's article, represented in XML

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<article xmlns="http://xmlinoffice.com/article"
         type="sales">
  <title>Sales Update</title>
  <author>Doug Jones</author>
  <date>February 3, 2004</date>
  <body>
    <section>
      <header>A great month!</header>
      <para>This month's figures are a <em>huge</em> improvement over this month last year. We sold 1,342 widgets for a total revenue of $14,327.</para>
    </section>
    <section>
      <header>More work to do</header>
      <para>Let's not rest on our past success. Let's get out there and sell, sell, sell!</para>
    </section>
  </body>
</article>

There is more information in Chapter 2, "XML concepts for Office users", on page 20.
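To make the point about non-Office tools concrete, here is a minimal sketch that reads the article with Python's standard `xml.etree.ElementTree` module. The choice of Python, the function name `read_article` and the abbreviated copy of the article are assumptions made for illustration; they are not part of Office or of the example itself.

```python
import xml.etree.ElementTree as ET

# The article from Example 1-1, abbreviated to a single section.
ARTICLE_XML = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<article xmlns="http://xmlinoffice.com/article" type="sales">
  <title>Sales Update</title>
  <author>Doug Jones</author>
  <date>February 3, 2004</date>
  <body>
    <section>
      <header>A great month!</header>
      <para>This month's figures are a <em>huge</em> improvement over this month last year. We sold 1,342 widgets for a total revenue of $14,327.</para>
    </section>
  </body>
</article>"""

# Every element in the example belongs to this namespace.
NS = {"a": "http://xmlinoffice.com/article"}


def read_article(xml_text):
    """Pull the descriptive fields out of an article document (hypothetical helper)."""
    # Parse from bytes so the encoding declaration in the document is honored.
    root = ET.fromstring(xml_text.encode("utf-8"))
    return {
        "type": root.get("type"),
        "title": root.findtext("a:title", namespaces=NS),
        "author": root.findtext("a:author", namespaces=NS),
        "date": root.findtext("a:date", namespaces=NS),
        "headers": [h.text for h in root.iterfind(".//a:header", NS)],
    }


info = read_article(ARTICLE_XML)
print(info)
```

Nothing here depends on Word: any tool with an XML parser could retrieve the same fields.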

1.2.2 Separating data content from style information

One thing you may have noticed about our XML example is the absence of formatting information. Nothing in the document tells an application to indent a particular paragraph, display a word as bold, or use any particular font for a header. Rather, the tags convey the meaning and structure of the document.

There are tags that identify simple data elements, such as date and title. There are also elements with complex structures, such as para and section.

The vocabulary in the tags was created just for articles, which ensures that there is a way to identify everything the company thinks is important about an article.

Separating the data content of the document from its formatting is one of the important principles of XML. The content itself is stored as an XML document, and different stylesheets are applied to it to achieve different renditions. This separation has two important benefits: self-describing content and flexible rendering.

1.2.2.1 Self-describing content

Describing the data rather than the style means that applications can identify what the document contains rather than what it looks like. The document becomes self-describing.

For example, in Doug's article, we know exactly where to find the author name, because tags identify it as an author element. Otherwise, we might have to assume that the author name is "that thing in bold on the second line".

Being able to identify data elements is powerful because it allows you to perform all kinds of automated functions on the document, such as:

searching for all articles whose author is "Doug Jones" (not just articles that contain the words "Doug Jones")
specifying rules for article documents, such as "at least one author must be specified", or "the title must be between 10 and 72 characters long"
automatically generating a list of all the article titles, authors and dates
generating a summary calculation of the average number of paragraphs in an article
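The four functions listed above can be sketched in a few lines of Python using the standard `xml.etree.ElementTree` module. This is only an illustration under stated assumptions: the second article, the author "Pat Smith" and all function names are hypothetical, and in practice rules like these would more likely be expressed in a schema than in hand-written code.

```python
import xml.etree.ElementTree as ET

NS = {"a": "http://xmlinoffice.com/article"}

# Two small articles in the vocabulary of Example 1-1.
# The second one is invented for this sketch.
DOCS = [
    """<article xmlns="http://xmlinoffice.com/article" type="sales">
  <title>Sales Update</title><author>Doug Jones</author>
  <date>February 3, 2004</date>
  <body>
    <section><header>A great month!</header>
      <para>We sold 1,342 widgets.</para></section>
    <section><header>More work to do</header>
      <para>Let's get out there and sell!</para></section>
  </body>
</article>""",
    """<article xmlns="http://xmlinoffice.com/article" type="memo">
  <title>Thanks</title><author>Pat Smith</author>
  <date>February 4, 2004</date>
  <body>
    <section><header>Kudos</header>
      <para>Doug Jones had a great month.</para></section>
  </body>
</article>""",
]

articles = [ET.fromstring(doc) for doc in DOCS]


def by_author(articles, name):
    """Articles whose author element equals name, ignoring mentions in the text."""
    return [a for a in articles
            if any(e.text == name for e in a.iterfind("a:author", NS))]


def rule_violations(article):
    """Check the two sample rules quoted in the list above."""
    problems = []
    if article.find("a:author", NS) is None:
        problems.append("at least one author must be specified")
    title = article.findtext("a:title", default="", namespaces=NS)
    if not 10 <= len(title) <= 72:
        problems.append("the title must be between 10 and 72 characters long")
    return problems


def listing(articles):
    """One (title, author, date) tuple per article."""
    return [tuple(a.findtext(f"a:{tag}", default="", namespaces=NS)
                  for tag in ("title", "author", "date"))
            for a in articles]


def average_paragraphs(articles):
    """Mean number of para elements per article."""
    counts = [len(a.findall(".//a:para", NS)) for a in articles]
    return sum(counts) / len(counts)


print([a.findtext("a:title", namespaces=NS) for a in by_author(articles, "Doug Jones")])
print([rule_violations(a) for a in articles])
print(listing(articles))
print(average_paragraphs(articles))
```

Note that the author search matches only the first article, even though the second mentions "Doug Jones" in its text. That distinction is possible only because the author is tagged as such.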

1.2.2.2 Flexible rendering

Another advantage of separating data from style is that it provides flexibility in the formatting. If the same material is to be presented in more than one way, it does not have to be written (and maintained) multiple times.

For example, suppose Doug's article is to appear in a printed newsletter and also be available on the Web. Perhaps the Web version has links and some sidebar information that does not appear in the print version. The look is different, too: the fonts are larger on the website, and the text is continuous rather than being broken into pages.

In this situation, the unrendered (abstract) data content is in a single XML document. There are two different stylesheets that create different renditions of the content. If the content must be changed, it only needs to be modified in one place.
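As a rough illustration of one source with two renditions, the sketch below uses two Python functions as stand-ins for the two stylesheets. Real stylesheets would normally be written in a stylesheet language rather than in Python; the function names, the HTML markup chosen and the 60-column print width are all assumptions made for this example.

```python
import textwrap
import xml.etree.ElementTree as ET
from html import escape

NS = {"a": "http://xmlinoffice.com/article"}

# A single-section copy of the article from Example 1-1.
ARTICLE_XML = """<article xmlns="http://xmlinoffice.com/article" type="sales">
  <title>Sales Update</title>
  <author>Doug Jones</author>
  <body>
    <section>
      <header>A great month!</header>
      <para>This month's figures are a <em>huge</em> improvement over this month last year.</para>
    </section>
  </body>
</article>"""


def inline_html(para):
    """Render a para as HTML, assuming em is its only kind of child element."""
    parts = [escape(para.text or "", quote=False)]
    for child in para:
        emphasized = escape("".join(child.itertext()), quote=False)
        parts.append(f"<em>{emphasized}</em>")
        parts.append(escape(child.tail or "", quote=False))
    return "".join(parts)


def render_web(article):
    """Stand-in for the Web stylesheet: continuous HTML."""
    title = escape(article.findtext("a:title", namespaces=NS), quote=False)
    author = escape(article.findtext("a:author", namespaces=NS), quote=False)
    out = [f"<h1>{title}</h1>", f"<p>By {author}</p>"]
    for section in article.iterfind("a:body/a:section", NS):
        header = escape(section.findtext("a:header", namespaces=NS), quote=False)
        out.append(f"<h2>{header}</h2>")
        for para in section.iterfind("a:para", NS):
            out.append(f"<p>{inline_html(para)}</p>")
    return "\n".join(out)


def render_print(article):
    """Stand-in for the print stylesheet: plain text wrapped at 60 columns."""
    title = article.findtext("a:title", namespaces=NS)
    author = article.findtext("a:author", namespaces=NS)
    out = [title.upper(), f"by {author}", ""]
    for section in article.iterfind("a:body/a:section", NS):
        header = section.findtext("a:header", namespaces=NS)
        out += [header, "-" * len(header)]
        for para in section.iterfind("a:para", NS):
            out.append(textwrap.fill("".join(para.itertext()), width=60))
        out.append("")
    return "\n".join(out)


article = ET.fromstring(ARTICLE_XML)
web_version = render_web(article)
print_version = render_print(article)
print(web_version)
print(print_version)
```

Both renditions are derived from the same `ARTICLE_XML`, so a correction to the text is made once and appears in both.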

You can also create different subsets or views of the same data. This is useful when, for example:

Different readers are interested in different aspects of the document.
Security concerns allow only part of an article to be read by a particular audience.
On the Web, you want to provide just the first paragraph of the article as a tease before requiring a reader to sign up for a service.
The document contains information that is not normally presented, such as search keywords, or information about who last updated the document and when.

In each of these situations, you can write a stylesheet that shows only the relevant parts of the article.
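A view of this kind can be sketched as a small selection over the XML. The example below, again in Python with the standard `xml.etree.ElementTree` module, produces the "tease" described above and a second view containing only the descriptive fields. The function names and the sign-up wording are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {"a": "http://xmlinoffice.com/article"}

ARTICLE_XML = """<article xmlns="http://xmlinoffice.com/article" type="sales">
  <title>Sales Update</title>
  <author>Doug Jones</author>
  <date>February 3, 2004</date>
  <body>
    <section>
      <header>A great month!</header>
      <para>This month's figures are a <em>huge</em> improvement over this month last year.</para>
    </section>
    <section>
      <header>More work to do</header>
      <para>Let's not rest on our past success. Let's get out there and sell, sell, sell!</para>
    </section>
  </body>
</article>"""


def teaser_view(article):
    """Show the title and only the first paragraph in document order."""
    title = article.findtext("a:title", namespaces=NS)
    first_para = article.find(".//a:para", NS)
    text = "".join(first_para.itertext()) if first_para is not None else ""
    return f"{title}\n\n{text}\n\n[Sign up to read the rest.]"


def byline_view(article):
    """Show only the descriptive fields, omitting the body entirely."""
    return {tag: article.findtext(f"a:{tag}", namespaces=NS)
            for tag in ("title", "author", "date")}


article = ET.fromstring(ARTICLE_XML)
print(teaser_view(article))
print(byline_view(article))
```

The source document is untouched in both cases; each view simply selects the elements it needs.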

Supporting multiple renditions of the same content is increasingly necessary. Browsing the Web has come to mean a lot more than just looking at HTML pages in a Web browser on a PC. People now use telephones, PDAs and other handheld devices to browse the Web, and they are inventing new ways of using the Web all the time. Different devices have different screen sizes and memory limits, and therefore need information presented in a different way.

When style information is kept separate, it is easier to change it without affecting the content. For example, if you decide you want the author name to appear in italics, you can simply change the stylesheet once, rather than restyling the author name in every article document.

