15.2. XML and Word's New File Format

Office 2007 uses XML in a big way. Word, Excel, and PowerPoint each have a new document file format that's based on XML. The file formats, like Word's .docx and .dotx files, are smarter, smaller, and tougher than the previous document files.

  • Why are the new files smarter? Your Word 2007 document file is essentially a package that contains several XML files, as shown in Figure 15-1. Each XML file describes a different part of the Word document. For example, one XML file describes images, another describes headers, and another describes the main body of the document. If you create custom XML elements in a Word document, they're stored in a special datastore folder. As a result, other programs can reach in and use parts of your Word documents. Now that's smart.

  • Why are the new files smaller? All these XML files are stored in a single Word file that's compressed to make it smaller. Smaller files travel faster when you send them over the Internet, which is a big advantage in today's world. Microsoft adopted the well-known Zip standard to combine all the files and then compress the whole shebang. Any other program (even a non-Microsoft program) that can open Zip files can easily open and work with the innards of a Word file. (Feeling brave? See the box in Section 15.2.1 to learn how to take a peek inside one of your Word files.)

    Figure 15-1. The new file format for Word documents (and other Office documents) consists of XML files combined in a Zip file. This figure shows what a Word file looks like on the inside. The XML files describe different parts of the document. The document.xml file holds the text for the document. The fontTable.xml file has details about the fonts used in the document.



    Note: You may be familiar with Zip compression if you've used programs like WinZip. Just how much Zip shrinks a file depends on the content. In some cases, if the stars are aligned properly, Zip can reduce a file to about a quarter of its original size.

  • Why are the new files tougher? These new-format files are more reliable than their predecessors, because even if part of the file gets corrupted, there's a good chance you can use the rest of it. Chances are, only one of the several files that make up a Word document will get corrupted at a time. The other files will be fine. So, if a picture or a sound clip that's embedded in your document gets fouled up, you can still read, retrieve, and save the text.
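If you'd like to see the "package of XML files" idea in action, here's a minimal Python sketch that lists the parts inside a Zip package along with their original and compressed sizes. It builds a tiny stand-in package in memory rather than a complete Word file; the helper names (build_sample_package, list_parts) and the sample contents are made up for illustration.

```python
import io
import zipfile


def build_sample_package() -> bytes:
    """Build a tiny stand-in for a .docx package in memory (not a complete Word file)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as package:
        package.writestr("[Content_Types].xml", "<Types/>")
        # Repetitive text stands in for the body of a document.
        package.writestr("word/document.xml", "<document>" + "Hello, Word. " * 200 + "</document>")
        package.writestr("word/fontTable.xml", "<fonts/>")
    return buffer.getvalue()


def list_parts(package_bytes: bytes) -> list[tuple[str, int, int]]:
    """Return (name, uncompressed size, compressed size) for every part in the package."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        return [(info.filename, info.file_size, info.compress_size) for info in package.infolist()]


parts = list_parts(build_sample_package())
for name, size, packed in parts:
    print(f"{name}: {size} bytes, {packed} bytes compressed")
```

To try it on one of your own documents, you could pass the bytes of a real .docx file to list_parts instead of the sample package. How much each part shrinks depends on its content, so expect different numbers from file to file.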

15.2.1. Reading XML Tags

As mentioned in the box in Section 15.1.1, HTML and XML both use tags between angle brackets. The tag at the end includes a slash mark. For example, a tag can describe a bit of information as a title:

<TITLE>A Tale of Two Cities</TITLE>

Another program looking at the document understands that the words between those tags are a title. In fact, since XML uses basic English words, some humans can figure it out too. Now, when you tag your information with XML, you don't know exactly how another program may use the information. But that's part of the beauty of XML. It gives you a way to use and reuse information now and in the future. Imagine a city newspaper keeping track of every story it publishes with XML tags: a tag for the title, the author, and the date. Another tag encompasses all the text for the story. Tags also categorize the story as business, crime, or politics. (And sometimes all three at once.)

Over time, the newspaper creates a great library of all the stories they've published. People can search for articles by specific authors who cover specific topics.

By simply storing the story and the details about the story inside XML tags, the newspaper creates a versatile, searchable library of newspaper articles that can continue to grow. They can even add new tags as time goes on (like Dateline: Mars), and that won't bother XML a bit. Today, businesses are storing information in the XML language. Letters, memos, proposals, reports, and sales transactions are stored using XML, making the information accessible to coworkers and clients.
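To make the newspaper example concrete, here's a small Python sketch of how a program might search such an archive by author and topic. The tag names (archive, story, title, author, date, category, text), the sample stories, and the titles_by helper are all invented for illustration; a real newspaper would choose its own tags.

```python
import xml.etree.ElementTree as ET

# A made-up archive: each story carries a title, author, date, categories, and text.
ARCHIVE = """
<archive>
  <story>
    <title>Mill Reopens</title>
    <author>Pat Doe</author>
    <date>2006-03-01</date>
    <category>business</category>
    <text>The old mill reopened on Monday.</text>
  </story>
  <story>
    <title>Council Audit Finds Missing Funds</title>
    <author>Pat Doe</author>
    <date>2006-04-12</date>
    <category>business</category>
    <category>crime</category>
    <category>politics</category>
    <text>An audit turned up gaps in the city books.</text>
  </story>
  <story>
    <title>Mayor Wins Second Term</title>
    <author>Sam Roe</author>
    <date>2006-11-08</date>
    <category>politics</category>
    <text>The mayor was reelected on Tuesday.</text>
  </story>
</archive>
"""


def titles_by(root, author, category):
    """Return the titles of stories by one author that carry a given category tag."""
    return [
        story.findtext("title")
        for story in root.findall("story")
        if story.findtext("author") == author
        and category in [c.text for c in story.findall("category")]
    ]


root = ET.fromstring(ARCHIVE)
print(titles_by(root, "Pat Doe", "politics"))
```

Because a story can hold more than one category element, the same article can turn up under business, crime, and politics without being stored three times.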

A pair of XML tags, including the information between them, is called an element in XML lingo. End tags in XML, like those in HTML, have a slash that distinguishes them from beginning tags. Elements can be nested inside other elements to establish relationships and create more complex objects. For example, here's a more detailed chunk of information describing a book:

<BOOK>
<TITLE>A Tale of Two Cities</TITLE>
<AUTHOR>Charles Dickens</AUTHOR>
<PUBLISHER>Chapman and Hall</PUBLISHER>
<PUBLICATION_DATE>1859</PUBLICATION_DATE>
<FICTION>Yes</FICTION>
<GENRE>Historical Fiction</GENRE>
<ILLUSTRATED>Yes</ILLUSTRATED>
<PAGES>358</PAGES>
</BOOK>

In this example, the TITLE, AUTHOR, and other tags are nested within the BOOK tags, making it clear that these are parts of the information regarding a book. XML tags are called extensible, which is computer lingo meaning that they can be extended to describe any type of information or data. HTML, by contrast, has a limited number of tags, and everyone agrees on just what they mean. With XML, you can create your own tags to describe your information. But that raises the question, if you create your own tags, how will someone else or some other computer program know what your tags mean? That's where different types of helper files come into play.
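Here's a short Python sketch of how a program might read a nested BOOK element like the one described above. One caveat: XML element names can't contain spaces, so this sketch writes the publication date tag with an underscore. The variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

BOOK_XML = """<BOOK>
<TITLE>A Tale of Two Cities</TITLE>
<AUTHOR>Charles Dickens</AUTHOR>
<PUBLISHER>Chapman and Hall</PUBLISHER>
<PUBLICATION_DATE>1859</PUBLICATION_DATE>
<PAGES>358</PAGES>
</BOOK>"""

# Parse the outer element, then walk the elements nested inside it.
book = ET.fromstring(BOOK_XML)
record = {child.tag: child.text for child in book}
print(record["TITLE"], "by", record["AUTHOR"])

# A tag name with a space in it isn't well-formed XML, so the parser rejects it.
try:
    ET.fromstring("<PUBLICATION DATE>1859</PUBLICATION DATE>")
    space_in_name_accepted = True
except ET.ParseError:
    space_in_name_accepted = False
print("Space in tag name accepted?", space_in_name_accepted)
```

The parser hands back the nested elements as children of BOOK, which is exactly the relationship the nesting is meant to express.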

POWER USERS' CLINIC
Unzipping a Word File

In the past, Microsoft's file formats were somewhat mysterious and proprietary. Those days are over. With Office 2007, the document formats for Word, Excel, and PowerPoint are an open book. They use the information-sharing XML language to describe their contents, and Zip technology to combine and compress the files. Both XML and Zip are well documented and freely available, so anyone can write a program that can read and write to Word files.

To see what the XML innards of a Word file look like, follow these steps:

  1. Using Windows Explorer, copy (Ctrl+C) one of your Word (.docx) files, and then paste (Ctrl+V) a copy back into the same folder. Windows creates a new file with a name that begins "Copy of" to distinguish it from the original.

  2. Right-click the new file, and then choose Rename from the shortcut menu. Then change the ending of the Word file from .docx to .zip. With this change, your computer will recognize the file as a file that's been compressed using the Zip standard. Notice that the icon for the file has changed from the big "W" Word icon to a folder with a zipper or, if you have WinZip, a folder that looks like it's being squashed by a C-clamp.

  3. Right-click an empty spot in the folder, and then choose New Folder from the pop-up menu. A new folder appears, and your cursor is poised for you to type in a folder name, say, Word Innards.

  4. To open your newly renamed Zip file, right-click it, and then choose Explore from the pop-up menu.

    Windows Explorer knows how to open up Zip files, and it shows them as if they were ordinary files and folders. When you explore a Zip file, you see its contents. In this case, that includes both folders and files. You see folders named _rels, docProps, and word. You also see an XML document called [Content_Types].xml.

  5. Drag over all the files and folders inside the Zip file to select them. Then, with everything selected, drag the whole bunch to your Word Innards folder. Windows copies the files to the Word Innards folder.

  6. Double-click to open the folders inside Word Innards.

    You can browse and explore to your heart's content, checking out the mysteries of Word's new file format. Double-click to open folders and to see what's stored inside them. In the glossary folder, you see XML files with names like fontTable and webSettings. (Almost all the files in the folders are XML files.)

  7. Double-click to view the contents of the XML files.

    You can view the insides of XML files using a Web browser like Internet Explorer, as shown in Figure 15-2. So, double-click any file you want to view, and your browser opens, displaying the XML file like a Web page. Everything's formatted nicely in different colors to show the XML tags. You may be able to recognize some of the text in your document if you view the document.xml file inside the word folder.

    You can see how easy it would be for a program to open and edit the contents of a Word file, and then Zip everything back up. You could do it yourself with the tools that Windows provides. You can't edit and save an XML file with Internet Explorer, but WordPad or Notepad will do the job.

    When you're done, you can reverse the process and create a Word file. Just select all the files in the Word Innards folder, and then right-click to show the pop-up menu. Choose Send To, and then choose Compressed (zipped) Folder. Rename the new file word innards.docx. If you made changes to the text in document.xml, you see them when you open word innards.docx in Word.


Figure 15-2. You can use Internet Explorer to view the contents of an XML file. Internet Explorer formats the text neatly and uses colors to separate tags from content. Click the + and - buttons to the left of the text to expand and collapse elements. This file shows the address label and tag at the top. The address text is shown below.
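The unzip, edit, and rezip routine in the box above can also be sketched in a few lines of Python. This is a rough illustration, not a full-featured tool: it works on a tiny stand-in package built in memory (not a complete Word file), and the helper name replace_text_in_part and the sample text are made up for the example.

```python
import io
import zipfile


def replace_text_in_part(package_bytes, part_name, old, new):
    """Copy every part of a Zip package, swapping text inside one named part."""
    source = zipfile.ZipFile(io.BytesIO(package_bytes))
    out_buffer = io.BytesIO()
    with source, zipfile.ZipFile(out_buffer, "w", zipfile.ZIP_DEFLATED) as target:
        for info in source.infolist():
            data = source.read(info.filename)
            if info.filename == part_name:
                # Naive text swap inside the XML; fine for a demo.
                data = data.decode("utf-8").replace(old, new).encode("utf-8")
            target.writestr(info.filename, data)
    # Both archives are closed here, so the new package is complete.
    return out_buffer.getvalue()


# Build a simplified stand-in package to experiment on.
sample = io.BytesIO()
with zipfile.ZipFile(sample, "w", zipfile.ZIP_DEFLATED) as package:
    package.writestr("[Content_Types].xml", "<Types/>")
    package.writestr("word/document.xml", "<w:t>Hello</w:t>")
original = sample.getvalue()

edited = replace_text_in_part(original, "word/document.xml", "Hello", "Goodbye")
with zipfile.ZipFile(io.BytesIO(edited)) as check:
    print(check.read("word/document.xml").decode("utf-8"))
```

In a real document.xml, Word often splits a sentence across several elements, so a plain find-and-replace like this one may not match text that looks contiguous on the page. Always experiment on a copy, just as the steps above suggest.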


15.2.2. The Files That Make XML Work

An open-ended system for tagging information with descriptions and making that information available to other programs is all well and good. But how do those other programs know how to interpret your tags and use your information? In addition to an XML file that stores the tagged information, there's often a file called a schema, which explains the tags in a way that makes sense to other computer programs. For example, the schema will explain that a telephone number should be ten digits and shouldn't include letters or punctuation. Another type of file called a transformation (or transform for short) can take the information in an XML file and turn it into a different type of document, like a Web page or a Word document.
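To get a feel for the kind of rule a schema expresses, here's a Python sketch that hand-codes the telephone number rule (exactly ten digits, no letters or punctuation). It is not real schema validation: Python's standard library doesn't include an XSD validator, so the rule is written out directly. The contacts and phone tag names and the invalid_phones helper are assumptions for the example.

```python
import re
import xml.etree.ElementTree as ET

# The rule a schema might state: exactly ten digits, nothing else.
TEN_DIGITS = re.compile(r"[0-9]{10}")

CONTACTS = """<contacts>
  <phone>4155550123</phone>
  <phone>415-555-0123</phone>
  <phone>41555501AB</phone>
</contacts>"""


def invalid_phones(xml_text):
    """Return the text of every phone element that breaks the ten-digit rule."""
    root = ET.fromstring(xml_text)
    return [
        element.text
        for element in root.iter("phone")
        if not TEN_DIGITS.fullmatch(element.text or "")
    ]


bad = invalid_phones(CONTACTS)
print(bad)
```

A schema states rules like this one in a standard, declarative form, so every program that reads the data can apply the same checks without custom code.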

Here are some of the types of files related to XML documents:

  • Data files (.xml) are the files that contain the information held within tags.

  • Schema files (.xsd) describe the contents of an XML file so that other programs know what's inside specific tags. Figure 15-3 shows what a schema looks like on the inside.

  • Transform files (.xslt) are used to transform an XML file into a different type of document, like a Web page or a database file.

  • XPath is the language used to find information inside an XML file.

Figure 15-3. This schema file describes the information inside a purchase order. If you look carefully, you can identify some of the elements you'd expect in a purchase order, such as poNum for purchase order number and poDate for the date. The details after "type=" explain the type of information that's expected: a string of characters for the PO number and a date for the PO date.
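As a rough illustration of XPath at work, the Python sketch below uses the limited XPath subset built into the standard ElementTree module to find information in a purchase order file. The poNum and poDate names come from the schema in Figure 15-3; the orders and purchaseOrder wrapper elements and the sample values are assumptions made up for the example.

```python
import datetime
import xml.etree.ElementTree as ET

ORDERS = """<orders>
  <purchaseOrder><poNum>A-1001</poNum><poDate>2006-11-30</poDate></purchaseOrder>
  <purchaseOrder><poNum>A-1002</poNum><poDate>2006-12-15</poDate></purchaseOrder>
</orders>"""

root = ET.fromstring(ORDERS)

# Path expression: every poNum inside every purchaseOrder.
po_numbers = [element.text for element in root.findall("./purchaseOrder/poNum")]
print(po_numbers)

# Predicate: the poDate of the order whose poNum is A-1002.
match = root.find("./purchaseOrder[poNum='A-1002']/poDate")

# The schema says poDate holds a date, so the text should parse as one.
po_date = datetime.date.fromisoformat(match.text)
print(po_date.year, po_date.month, po_date.day)
```

ElementTree understands only part of XPath. Programs that need the full language typically rely on a dedicated XML library.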




Word 2007: The Missing Manual
ISBN: 059652739X
EAN: N/A
Year: 2006
Pages: 180
