6.2 Converting into XML
Apart from authoring directly in XML, the only way to create source documents is by converting them from some other document format. This is not quite analogous to, say, converting from one image format to another. By itself, XML is a simple markup syntax, but the way it is used in semantic vocabularies (such as that of a web site source) is conceptually different from most other document formats used today.
Not a simple algorithm. Transforming a typical document format into XML does not amount to simply renaming, rearranging, or reformatting bits of content; the task always involves a certain amount of rethinking as well. Unless your other format was developed specifically to provide a one-to-one mapping to your particular source vocabulary, conversion will inevitably be an unreliable, heuristic approximation that always requires manual checks and fixes.
Generally, any document-to-XML conversion may be broken into two stages:
Low-level (syntactic) processing is where you break into the source format's envelope, extract the atoms of content you are interested in (such as words, numbers, or paragraphs), and decode or decipher them if necessary. The difficulty of this stage may range widely, depending on the clarity of the data format and how well it is documented. Its output is best stored in some intermediate XML vocabulary reflecting the structure of the source format, not the target format. Sometimes, however, software limitations leave HTML or even plain text as the only choice for such an intermediate format.
High-level (semantic) processing is the process where the atoms of meaning extracted during the first stage are examined, and educated guesses are made as to what each atom represents and how to mark it up using your target XML vocabulary. This stage may also be very difficult, especially if the source format provides no clues about the semantics of its content atoms or if these clues are not used consistently.
Sometimes, these two stages are combined into a single application. With your custom XML vocabulary, however, this is not likely to happen because existing conversion tools cannot be aware of your vocabulary. This means you'll have to use one of the existing tools for low-level processing, save the result as intermediate XML, and then write your own XSLT stylesheet for high-level transformation of that intermediate format into your target vocabulary.
Multiple inputs, multiple outputs. With this approach, you can have one low-level converter plugged into several high-level transformation stylesheets for different target vocabularies. Or, you can handle several source formats with different low-level converters sending their output to one high-level transformation stylesheet.
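To make the plumbing concrete, here is a minimal sketch of the two-stage arrangement. It is only an illustration under stated assumptions: plain Python functions stand in for the XSLT stylesheets that would normally do the high-level work, and the intermediate vocabulary (units, unit) and both target vocabularies (page, list) are invented for this example.

```python
import xml.etree.ElementTree as ET

# All element names below (units, unit, page, heading, para, list, item)
# are hypothetical; they are not taken from any real tool or vocabulary.

def lowlevel_lines(text):
    """Low-level converter for plain text: one <unit> per non-empty line."""
    root = ET.Element("units")
    for line in text.splitlines():
        if line.strip():
            ET.SubElement(root, "unit").text = line.strip()
    return root

def lowlevel_semicolons(text):
    """Low-level converter for a semicolon-separated record: one <unit> per field."""
    root = ET.Element("units")
    for field in text.split(";"):
        if field.strip():
            ET.SubElement(root, "unit").text = field.strip()
    return root

def highlevel_page(units):
    """High-level transform: the first unit is guessed to be a heading, the rest paragraphs."""
    page = ET.Element("page")
    for index, unit in enumerate(units.findall("unit")):
        ET.SubElement(page, "heading" if index == 0 else "para").text = unit.text
    return page

def highlevel_list(units):
    """High-level transform for a different target vocabulary: every unit is a list item."""
    result = ET.Element("list")
    for unit in units.findall("unit"):
        ET.SubElement(result, "item").text = unit.text
    return result

def convert(source, lowlevel, highlevel):
    """Plug any low-level converter into any high-level transform."""
    return ET.tostring(highlevel(lowlevel(source)), encoding="unicode")

print(convert("News\nThe site is up.", lowlevel_lines, highlevel_page))
print(convert("apples; pears", lowlevel_semicolons, highlevel_list))
```

Because both low-level converters emit the same intermediate vocabulary, either one can feed either high-level transform, which is the whole point of keeping the stages separate.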
6.2.1 Plain text
When you have documents or data in plain text, low-level processing is usually easy to implement. All programming languages, without exception, can read plain text files and break them into lines, words, or other units. Semantic processing, however, is much more difficult because the source format provides no regular markup and scarcely any clues as to what part of the file is meant to be what.
Still, since plain text has long been the only format of Usenet and email, a number of well-known conventions exist that can be used to recognize and mark up a few common semantic elements in a plain text file. Based on these conventions, txt2html, an amazingly versatile Perl script, converts plain text into HTML. It attempts to mark up headings, paragraphs, lists, links, inline elements, and even tables. With the --xhtml command-line option, txt2html will output XHTML, which you can then convert into your target XML vocabulary with an XSLT stylesheet.
 txt2html.sf.net; look at a sample text file, txt2html.sf.net/sample.txt, and its conversion to HTML, txt2html.sf.net/sample.html.
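To give a feel for how such conventions can be recognized, here is a small sketch that handles just three of them: underlined headings, bulleted lists, and paragraphs separated by blank lines. This is not how txt2html itself works (it is far more thorough); the function name and the output element names (doc, heading, list, item, para) are assumptions made up for this illustration.

```python
import re
import xml.etree.ElementTree as ET

def text_to_xml(text):
    """Guess headings, lists, and paragraphs in plain text from common conventions."""
    root = ET.Element("doc")
    # Blocks are assumed to be separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [line.rstrip() for line in block.splitlines()]
        if not lines:
            continue
        if len(lines) == 2 and re.fullmatch(r"[=-]{3,}", lines[1]):
            # A line underlined with = or - characters is taken to be a heading.
            ET.SubElement(root, "heading").text = lines[0].strip()
        elif all(re.match(r"\s*[*-]\s+", line) for line in lines):
            # A block in which every line starts with * or - is taken to be a list.
            items = ET.SubElement(root, "list")
            for line in lines:
                ET.SubElement(items, "item").text = re.sub(r"^\s*[*-]\s+", "", line)
        else:
            # Anything else is an ordinary paragraph; rejoin its wrapped lines.
            ET.SubElement(root, "para").text = " ".join(line.strip() for line in lines)
    return root

sample = (
    "Release notes\n=============\n\n"
    "This version fixes\ntwo bugs.\n\n"
    "* parser crash\n* wrong encoding\n"
)
print(ET.tostring(text_to_xml(sample), encoding="unicode"))
```

Every rule here is a guess, of course: a paragraph that happens to consist of lines starting with a hyphen will be misread as a list, which is exactly the kind of error that makes manual checks unavoidable.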
Chaperon  is used with Cocoon ( 7.2 ) as a text-to-XML generator, but it can also be run separately. It is controlled by grammar and lexicon files in a special XML vocabulary and can handle arbitrary structured text as input.
Another project worth looking at is txt2xml. This is a Java library that is more database-oriented than freeform-oriented; the example application on its web site is a conversion from a comma-separated plain text spreadsheet. This tool is similar to XSLT in that the mapping from a source text format to the output XML is defined by a set of processors and subprocessors (similar to XSLT's templates), each generating elements of one type when triggered by a regexp match in the source file.
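The general idea of regexp-triggered processors is easy to sketch. The following is not txt2xml's API (which is Java and configured differently); it is a rough Python analogy in which every name, including the PROCESSORS table and the sample record layout, is hypothetical. Each processor generates one element type when its pattern matches a line, and its named groups play the role of subprocessors generating child elements.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical rules: (element name, pattern). Named groups become child elements.
PROCESSORS = [
    ("comment", re.compile(r"#\s*(?P<text>.*)")),
    ("product", re.compile(r"(?P<name>[^,]+),(?P<price>\d+(?:\.\d+)?),(?P<stock>\d+)")),
]

def lines_to_xml(text, processors=PROCESSORS):
    """Apply the first processor whose pattern matches each whole line."""
    root = ET.Element("records")
    for line in text.splitlines():
        line = line.strip()
        for name, pattern in processors:
            match = pattern.fullmatch(line)
            if match:
                element = ET.SubElement(root, name)
                for field, value in match.groupdict().items():
                    ET.SubElement(element, field).text = value.strip()
                break
        else:
            # No processor matched: keep non-empty lines visible for manual review.
            if line:
                ET.SubElement(root, "unparsed").text = line
    return root

records = lines_to_xml("# price list\nWidget,9.99,12\n???")
print(ET.tostring(records, encoding="unicode"))
```

Keeping unmatched lines in an unparsed element, rather than silently dropping them, makes it easier to spot where the rules need another pattern.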
6.2.2 HTML
Even though this book is mostly devoted to the opposite task (converting semantic XML into presentation-oriented HTML), sometimes you may run into a pile of legacy data that is only available as HTML. By this I do not mean modern XHTML, which is actually XML and can be transformed into whatever you want with XSLT, but old, supposedly-SGML-but-really-just-lousy HTML with swamps of weird formatting code and swarms of markup errors that only a seasoned web browser can make sense of.
Luckily, there is some competition for browsers in this area. HTML Tidy (originally written by Dave Raggett) is a wonderful piece of software that knows the entire HTML specification by heart, including its most arcane bits. More importantly, it is smart enough to fix almost any broken HTML file you throw at it and output it as either valid HTML (you can specify which version) or XHTML.
Common to many toolchains. The importance of Tidy goes beyond handling broken web pages. Many older applications (e.g., some of the office suites discussed in the next section) are totally unaware of XML but can produce some sort of HTML. In most cases, before you can work with this intermediate HTML any further, you have to fix it using Tidy, which is the only way to rescue valuable data from these dead-end formats and let it flow into the boundless ocean of XML.
Forcibly tidy. Sometimes, an HTML document is so severely broken that Tidy will refuse to process it. In this situation, use the --force-output yes command-line option to force Tidy to produce output no matter what, but be aware that some of the original markup may be lost or misinterpreted.
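If you drive Tidy from a script, the retry-with-force logic might look like the sketch below. It assumes a tidy executable is on your PATH and uses Tidy's -quiet, -asxhtml, and -output command-line options along with the --force-output setting mentioned above; the helper names and file names are made up for this example, and you should check the options against the documentation of the Tidy version you have installed.

```python
import shutil
import subprocess

def tidy_command(src, dest, force=False):
    """Build the argument list asking Tidy to convert src into XHTML at dest."""
    cmd = ["tidy", "-quiet", "-asxhtml", "-output", dest]
    if force:
        # Produce output even when Tidy considers the input too broken.
        cmd += ["--force-output", "yes"]
    cmd.append(src)
    return cmd

def run_tidy(src, dest):
    """Run Tidy once normally, then once more with forced output if it reports errors."""
    if shutil.which("tidy") is None:
        raise RuntimeError("tidy executable not found on PATH")
    result = subprocess.run(tidy_command(src, dest), capture_output=True, text=True)
    # Assumption: exit status 2 or higher means errors prevented normal output.
    if result.returncode >= 2:
        result = subprocess.run(tidy_command(src, dest, force=True),
                                capture_output=True, text=True)
    return result.returncode, result.stderr

print(tidy_command("legacy.html", "legacy.xhtml", force=True))
```

Whenever the forced path is taken, it is worth logging Tidy's diagnostics and flagging the file for a manual look, since that is where markup is most likely to have been lost or misread.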
6.2.3 Office formats
Converting from office document formats is perhaps the most important practical scenario, as most content authors tend to use office suites for their work. And more often than not, they assume that everyone else does, too. 
 Sometimes, they even claim it must be so because their favored office suite is "industry standard." Explaining to them that the meaning of the word standard in that phrase is actually very different from that in standards-compliant or open standards based may be frustrating.
Only if they are willing to listen. It would be nice (and profoundly beneficial for everyone involved) to train all your content creators to use semantic XML markup, but in many cases this is simply not doable. User inertia is a force to be reckoned with; don't expect to get many XML converts among people who've spent a good share of their lives mastering Microsoft Word and feel quite comfy with it. Also, there are certain content-oriented tasks, such as collaborative editing, where word processors still have a lead over anything today's XML editors have to offer. 
 Word 2003 Professional Edition ( 18.104.22.168 ) offers some relief, with its ability to create valid custom XML using Word's normal WYSIWYG user interface.
Who is to blame? Unfortunately, converting legacy office formats into XML is also one scenario where even low-level processing may be far from trivial, and high-level processing can easily become a nightmare. An important difference, however, is in who is responsible for these two levels of obstacles, and what you can do about them.
Low-level processing problems are entirely the responsibility of the office suite's vendor. There is little you can do about it, other than switch to a different office suite with a more open document format. Reverse-engineering a closed office format and creating a low-level conversion utility for it is a daunting task that even large groups of dedicated programmers don't always succeed in accomplishing.
If an office suite cannot itself export all of its content and markup into a meaningfully parsable representation, you are stuck with whatever third parties have to offer for this format ( 22.214.171.124 ).
What is to be done? It is all quite different with the high-level semantic processing of the low-level converter output. Here, both content authors and yourself as the web site developer can do a lot to make your work anything from almost automatic to hard, largely manual, and excruciatingly tedious. Office applications are notoriously feature-bloated, and it is your job to make it known to those who use them what features are acceptable and what are not.
You must be proactive with this if you don't want to be swamped by tons of badly formatted, inconvertible documents. Start by testing the office suite your authors will be using. Search through all of its features (including those you may never have used yourself) to figure out the best possible approximation of your target XML vocabulary using the program's native formatting tools.
The reference implementation. The selected subset of approved features and the recommendations on how best to use them will be revised repeatedly as you test your chosen low-level converter and write the stylesheet for the high-level transformation. As the result of all this work, you will have created a template document in the office format that demonstrates and explains all the best practices of for-XML office suite authoring. This template must be publicized, and no web site author should begin working on a document without first studying it in detail.
This office-format template for site authors is an important part of your source definition (Chapter 2). Its scope and features will depend not only on your target vocabulary but, above all, on the expressiveness of the intermediate format that you standardize upon. Let's see what intermediate formats we can use and their implications for the setup of your office-format conversion system.
6.2.3.1 Converting via plain text
In the simplest possible case, you can use plain text as your intermediate format. This makes sense if you don't want to trouble your authors with any format or structure concerns at all; that is, if all you want to get from them is the data, not markup. This approach has one big advantage: You don't need any third-party converters because all office suites, without exception, can export their documents as plain text. 
 Even if they couldn't, you would still be able to use copy-and-paste. In fact, it may sometimes be faster and even give better results.
Extracting paragraphs. Plain text is not entirely structureless ( 6.2.1 ). One structural unit you can almost always identify in a text document is a paragraph (separated from other paragraphs by either newlines or empty lines). Approaching this problem from the other end, you can always ask the authors to structure their output at the document level by only producing small documents that correspond not to entire web pages but to high-level constructs within a page (such as sections or blocks, 3.1.2 ).
As such block-sized documents are likely to be small, you can parse them simply by guessing the role of each paragraph based on its position within the plain-text rendition of the document. For example, you may assume that the first paragraph of each block document is the heading, the last one is the author byline, and all paragraphs in between are the content of the block. Of course, any such conventions are very limited and relying on them is risky, but they will allow you to build a working word-processor-to-XML toolchain really fast, assuming you can get the necessary cooperation from the authors.
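The positional convention just described takes only a few lines to implement. The sketch below assumes that paragraphs are separated by empty lines and that a block always has a heading, at least one content paragraph, and a byline; the element names (block, heading, para, byline) are placeholders for whatever your target vocabulary actually uses.

```python
import re
import xml.etree.ElementTree as ET

def block_from_text(text):
    """Mark up a small plain-text document by paragraph position alone."""
    # Split on empty lines and normalize whitespace inside each paragraph.
    paragraphs = [" ".join(chunk.split())
                  for chunk in re.split(r"\n\s*\n", text.strip())
                  if chunk.strip()]
    if len(paragraphs) < 3:
        raise ValueError("expected a heading, some content, and a byline")
    block = ET.Element("block")
    ET.SubElement(block, "heading").text = paragraphs[0]   # first paragraph
    for paragraph in paragraphs[1:-1]:                     # everything in between
        ET.SubElement(block, "para").text = paragraph
    ET.SubElement(block, "byline").text = paragraphs[-1]   # last paragraph
    return block

sample = "New office opened\n\nWe now have a branch\nin Oslo.\n\nJane Doe"
print(ET.tostring(block_from_text(sample), encoding="unicode"))
```

Rejecting documents that are too short is a cheap way to catch at least some submissions that ignore the convention, such as a block whose author forgot the byline.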
6.2.3.2 Converting via HTML
Using HTML ( 6.2.2 ) as an intermediate format is the next option to consider after plain text. Like text, HTML does not normally require any external converter; it has been around for so long that all office applications can "Save as HTML" by now.
Microsoft HTML. HTML exported by Microsoft Office is notoriously bulky, contaminated with Office-specific extensions, and often simply broken. For many projects it is actually easier and more reliable to use plain text. Still, with the help of Tidy ( 6.2.2 ), Office's HTML can be used as a starting point for high-level processing.
If you need to access MS Word documents but do not wish to touch MS Word itself, I recommend the wv library and its accompanying utilities. This open source software runs on many platforms and converts MS Word documents to plain text, HTML, LaTeX, and other formats.
Rigidly quirky. HTML as an intermediate format will likely work only for relatively simple projects. This is because the inventory of structural units preserved in conversion cannot differ much from the inventory of element types in HTML, which pretty much limits you to headings, paragraphs, lists, links, and simple inline elements.
This approach isn't very flexible either: Generally, you have little control over which styles and formatting in the source get converted to which HTML elements. With some third-party converters, generated HTML will be adorned with CSS class attributes storing the names of the corresponding source styles, which helps.
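When a converter does preserve style names in class attributes, the high-level step can key off them directly. Here is a sketch of that idea; it assumes the HTML has already been cleaned into well-formed, namespace-free markup (for instance by Tidy), and the style names and target element names in STYLE_MAP are invented examples rather than anything a particular converter is known to emit.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from source style names (found in class attributes)
# to element types of a hypothetical target vocabulary.
STYLE_MAP = {"Heading1": "heading", "ListBullet": "item", "BodyText": "para"}

def convert_styled_html(xhtml, style_map=STYLE_MAP):
    """Translate <p class="..."> elements into target elements by style name."""
    body = ET.fromstring(xhtml)
    page = ET.Element("page")
    unknown = []
    for paragraph in body.iter("p"):
        style = paragraph.get("class", "")
        name = style_map.get(style)
        if name is None:
            # Unrecognized style: fall back to a plain paragraph and report it.
            unknown.append(style)
            name = "para"
        # Inline markup is flattened to text in this simplified sketch.
        ET.SubElement(page, name).text = "".join(paragraph.itertext()).strip()
    return page, unknown

source = ('<body><p class="Heading1">Prices</p>'
          '<p class="ListBullet">Widget: <b>9.99</b></p>'
          '<p class="Fancy">Call us.</p></body>')
page, unknown = convert_styled_html(source)
print(ET.tostring(page, encoding="unicode"))
print(unknown)
```

The list of unrecognized styles is worth keeping: it tells you which authors need a reminder about the approved styles, or which styles your mapping has yet to cover.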
6.2.3.3 Converting via XML
Using XML as an intermediate format for converting Microsoft Office documents is only available with third-party conversion utilities or with Office 2003 or later ( 220.127.116.11 ). Therefore, a lot of details of your project setup will depend on the capabilities of these conversion utilities. Some commercial XML editors, especially those implementing the word processor paradigm ( 6.1.4 ), will also import common office formats such as RTF and even Microsoft Word.
Le style est l'homme même. What kind of XML do we want to get from these converters? The key idea is this: When editing a document in an office application, apply named styles (such as "paragraph" or "list item") instead of anonymous formatting properties (such as margin width or font size) to the structural units. Then, ask the converter to please translate these styles into XML elements with the same names . . . thanks, we can take it from here. No, we don't need anything else. We'll do the rest in XSLT.
There are several MS Word to XML converters that will do this job. In our testing, Upcast proved sufficiently reliable. Other standalone converters include Logictran and Majix. All of them can handle RTF, but converting directly from a .doc file can only be done on Windows and requires that MS Word be installed. 
 Oh the joy of closed formats!
The law of inertia. The biggest problem with this approach is not technical, however; it is user inertia. The requirement that every structural unit (including, for example, inline emphasis as in "every" in this sentence) must be assigned a corresponding named style is likely to put off at least some users.
The concept of named styles has been in MS Word for ages, yet it is astounding how many users simply ignore it. They actually prefer to format their stuff by manually applying fonts and colors instead of just selecting one of the styles from a drop-down list. It's up for discussion whether Word's interface is to blame for this; in practice, just be prepared to annoy your Word authors for quite some time before you start getting consistent styles-only documents from them.
Styles are flat. There are some technical issues with styles as well. The biggest problem is that unlike XML elements, styles generally do not nest.
MS Word styles come in two flavors, paragraph (block-level) and character (inline-level). A fragment of text may have a character style of its own and be affected by the paragraph style of its paragraph. You cannot, however, mark a text fragment with more than one character style, or make a paragraph belong to two or more overlapping areas with different paragraph styles. Each character has exactly one character style, and each paragraph has exactly one paragraph style.
For example, suppose you want to use numbered programming code examples, each consisting of a preformatted listing and a caption. In Word, you can assign the corresponding paragraph styles to a listing and its caption, but then you cannot tie them together by applying another common style to both. Alternatively, you can mark both a listing and its caption by a common example style, but then you cannot separate the caption from the code.
As a result, your transformation stylesheet will have to be less straightforward, and therefore less reliable, than it could be if styles could nest. For instance, your stylesheet could provide that if a caption is immediately followed by a listing, both are wrapped into an example. Such a provision, apart from being utterly inelegant, is likely to break often, especially if captions or listings are allowed to be part of other constructs as well.
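The caption-plus-listing rule can be sketched as follows. In a real project this logic would live in your XSLT stylesheet; the Python below only illustrates the grouping, and it assumes the low-level converter has already produced a flat sequence of elements named after the styles (caption, listing, para), which are example names rather than anything prescribed.

```python
import xml.etree.ElementTree as ET

def wrap_examples(doc):
    """Wrap each caption that is immediately followed by a listing into an example."""
    result = ET.Element(doc.tag)
    children = list(doc)
    index = 0
    while index < len(children):
        current = children[index]
        following = children[index + 1] if index + 1 < len(children) else None
        if current.tag == "caption" and following is not None and following.tag == "listing":
            example = ET.SubElement(result, "example")
            example.append(current)
            example.append(following)
            index += 2
        else:
            # Anything else, including a stray caption, is copied through unchanged.
            result.append(current)
            index += 1
    return result

flat = ET.fromstring(
    "<doc><para>Intro</para><caption>Example 1</caption>"
    "<listing>print(1)</listing><caption>Stray</caption><para>End</para></doc>"
)
print(ET.tostring(wrap_examples(flat), encoding="unicode"))
```

Note how the stray caption slips through untouched. Whether that should be reported as an error or is a legitimate use of the style in another construct is precisely the ambiguity that makes such rules fragile.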
Write the Style Bible. It is in situations like these that a comprehensive template document is invaluable, showing samples of all styles that your stylesheet can handle, as well as their most common combinations. Simply creating the new styles and embedding them into the template is not sufficient; you must provide complete instructions on when to use each style, why this is important, and what will happen if the user just clicks on the I button instead of selecting the emphasis style.
As a part of the complete source definition, this template document has to make the same internal distinction between rules that can be enforced automatically and those that can only be explained (and then reiterated) to the user because they cannot be checked with software ( 18.104.22.168 ). The only difference is that in a word processor, the boundary between what you can and cannot enforce automatically is much lower than in any schema language.
Minimize the disruption. Here are some additional bits of advice:
Where possible, authorize the use of standard Word styles, as they have the big advantage of familiarity. You can change their formatting somewhat, to give better visual clues to what is being edited (e.g., by making the formatting of paragraphs or headings similar to that of the final web pages).
If there's no direct analog for what you need in the inventory of standard styles, create a new style with a consistent and descriptive name.
Keep the number of new styles low. (This one is important: If a user feels daunted by the amount of new stuff he or she has to learn, the motivation to comply with your rules will suffer.)
Don't forget to remove from the template any standard styles that you don't want to see in the submitted documents.
Include in the template document all relevant rules concerning special characters and typographic conventions ( 22.214.171.124 ).
Allow some time for users to adapt to the new word processing rules. Errors will be plentiful at first, but you must be persistent in tracking down each one to either fix what is broken in your software (the template or translation stylesheet) or talk over the error with the user who turns out to be the culprit.
Redefining the "source." After a page document is submitted, converted, transformed, and uploaded to the server, which format is the "master source" of the web page? In other words, will you edit the word processor file or the XML document when you need to make a change to the page?
The answer to this question depends on what kind of changes you will be making and how often, as well as on whether your word-processor-to-XML conversion is completely automatic. Obviously, if your updates are infrequent, small, and mostly technical in nature (i.e., not requiring the author's expertise), it is tempting to treat the word processor source of the page as a shed skin that you will never return to.
On the other hand, if the updates are more or less regular and can only be done by the original author of the page, and if your conversion routine is fully automatic, tested, and proved reliable, you should let the author maintain his or her own copy of the word processor file and rerun the conversion to XML (and then to HTML) on each change.
6.2.3.4 MS Office 2003
Microsoft Office 2003 (also known as Office 11) offers XML equivalents of the Word and Excel file formats.  For example, the Word format, called WordML, is a direct equivalent of the binary .doc format. It includes the data content of the document, the styles associated with it (whether or not used), and the various settings (such as page margins and tabs).
It is substantially easier to parse and use the Word document format with WordML than with .doc or RTF. One enterprising company has even developed an XSLT stylesheet for converting from WordML to XSL-FO. 
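As a taste of how approachable the format is, the following sketch pulls out each paragraph's style name and text. It rests on my understanding of the Word 2003 XML layout (the wordml namespace URI and the w:p, w:pPr, w:pStyle, w:r, and w:t elements), so treat those names as assumptions to verify against Microsoft's schema documentation; the sample document is a hand-written fragment, not real Word output.

```python
import xml.etree.ElementTree as ET

# Assumed namespace of the Word 2003 XML format; verify against the official schema.
W = "http://schemas.microsoft.com/office/word/2003/wordml"

def wordml_paragraphs(xml_text):
    """Return a (style name or None, text) pair for every paragraph in the document."""
    root = ET.fromstring(xml_text)
    result = []
    for paragraph in root.iter(f"{{{W}}}p"):
        style_element = paragraph.find(f"{{{W}}}pPr/{{{W}}}pStyle")
        style = style_element.get(f"{{{W}}}val") if style_element is not None else None
        # A paragraph's text is spread over runs (w:r), each holding w:t elements.
        text = "".join(t.text or "" for t in paragraph.iter(f"{{{W}}}t"))
        result.append((style, text))
    return result

sample = f"""<w:wordDocument xmlns:w="{W}"><w:body>
<w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr><w:r><w:t>Prices</w:t></w:r></w:p>
<w:p><w:r><w:t>Our </w:t></w:r><w:r><w:t>widgets</w:t></w:r></w:p>
</w:body></w:wordDocument>"""
print(wordml_paragraphs(sample))
```

A list of style and text pairs like this is already a usable intermediate format for the named-styles approach, with no third-party converter involved.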
The Professional editions of Office also support user-defined custom XSDL schemas (but not DTDs). Word, for example, provides guided editing and the option of saving the document both in WordML and as an instance of your custom schema. The Professional Enterprise edition of the suite also includes the new InfoPath forms-based editor.
6.2.3.5 Other office suites
The increasingly popular open source office suites, such as KOffice, OpenOffice.org, and Sun's StarOffice, also save documents in XML using their own presentation-oriented vocabularies; in fact, this is their native document format. However, no custom schema support and no guided editing are currently available.
 With a name like this, who needs a URL?
These XML document formats are pretty well documented and free of patent or license restrictions. The OASIS consortium has started a project to develop a common XML-based office format to be used by all office suite vendors. 
 For example, see xml.openoffice.org/xml_specification.pdf for the OpenOffice.org format documentation.
 www.oasis-open.org/committees/tc_home.php?wg_abbrev=office
 Microsoft, however, does not participate.
6.2.4 Semantic processing and XML-to-XML conversion
If you expect that I will now reveal a yet-unmentioned XML-to-XML conversion tool comparable to XSLT, I'm sorry to disappoint you. If you want to convert between arbitrary XML vocabularies, XSLT is the most powerful and, perhaps, the only practical solution.
This is one case where the low-level parsing of input is not a problem at all: you get it for free when you run your conversion stylesheet. Therefore, the only issue we'll face (but it may be a major one) is the semantic processing ( 6.2 ) of the source vocabulary and all the rearranging, renaming, and rethinking that may be involved. As we've just seen, most something-to-XML conversions have an XML-to-XML last stage, so this issue is relevant universally.
The right point to fork. Before starting to write a conversion stylesheet, it is worthwhile to check if the particular source XML vocabulary is your only option. If not, alternatives may be more suitable.
The complexity of an XML-to-XML conversion directly depends on how similar the conceptual bases of the source and target vocabularies are. For example, transforming from XSL-FO or SVG into a semantic vocabulary is bound to be hard. Figuring out what is what in XSL-FO's endless stream of fo:blocks may require complex heuristics that will be very unreliable.
Luckily, any XML vocabulary that has the information you need, but treats it differently or focuses on different aspects of it, is usually but a step of a stairway of abstractions ( 1.1.1 ) on which you can freely go in either direction. Your goal is therefore to find the most appropriate step from which to jump sideways to your target XML vocabulary.
So, instead of trying to parse your content out of an XSL-FO rendition, check if its semantic source is available and if it would better suit your needs. It is not always the most abstract level (stairway's topmost step) that you're interested in; for example, a document workflow may start by compiling data from several sources, and you will want to grab this data as soon as it is complete, but before it is migrated to a lower-level vocabulary.
Refactoring strategies. When translating between two similar XML vocabularies, a lot of work consists of simple renaming of element types: What was p becomes para, img turns into image, and so on. Each such mapping is a one-liner in XSLT. It is also very easy to remove extra markup or extra data (as long as it is marked up unambiguously so it can be separated).
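For comparison, the same renaming is nearly as short outside of XSLT. The sketch below uses Python's standard ElementTree module purely as an illustration; the RENAMES table simply repeats the two mappings mentioned above, and the sample input is made up.

```python
import xml.etree.ElementTree as ET

# Source element type -> target element type; anything not listed is kept as is.
RENAMES = {"p": "para", "img": "image"}

def rename_elements(root, renames=RENAMES):
    """Rename element types in place, leaving attributes and content untouched."""
    for element in root.iter():
        element.tag = renames.get(element.tag, element.tag)
    return root

source = ET.fromstring('<body><p>See <img src="map.png"/> for directions.</p></body>')
renamed = rename_elements(source)
print([element.tag for element in renamed.iter()])
```

A lookup table like this scales well as the vocabularies grow: each new mapping is one more entry, much as each new mapping is one more template in a stylesheet.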
More tricky is adding markup where none existed in the source. This usually involves parsing the source document's character data with regular expressions (see 184.108.40.206 for an example).
Sometimes, markup exists in the source but is less specific than you need. For example, the source XML may mark up list items as paragraphs, but you'll likely have a special element type for list items in your target vocabulary. When you need to infer missing semantics, there are several possibilities:
You can get clues from the formatting-related markup of the source. For example, a paragraph that is actually a list item may have a CSS style attribute setting a wider left margin. This is the primary strategy when dealing with XML or XHTML originating from a presentation-oriented tool such as a word processor with few (if any) named styles ( 220.127.116.11 ) available.
You can determine the position of a source element relative to other known elements and make your conclusions based on that position. For example, a paragraph of preformatted text immediately following a caption may automatically be made a listing. This is another trick commonly used with word-processor-generated XML, since word processor styles do not nest ( 18.104.22.168 ).
As a last resort, you may try to analyze the content of an element to determine its role. For example, if a paragraph starts with a hyphen character, you can mark it up as a list item and remove the hyphen (instead, a graphic bullet may be displayed when your XML is formatted for presentation). This is the primary approach for plain text ( 22.214.171.124 ), but it may turn out necessary for XML-to-XML conversions as well.
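Two of these strategies, the formatting clue and the content clue, are sketched below for the list item case. Everything specific here is an assumption made for illustration: the source marks paragraphs as p, the target calls list items item, a left margin of at least 20 units (the unit is ignored) signals a list item, and a leading hyphen followed by a space is the bullet convention.

```python
import re
import xml.etree.ElementTree as ET

# Matches the numeric part of a margin-left declaration in a style attribute.
MARGIN = re.compile(r"margin-left\s*:\s*(\d+)")

def infer_list_items(root, min_margin=20):
    """Turn paragraphs into list items based on content and formatting clues."""
    for paragraph in root.iter("p"):
        text = paragraph.text or ""
        margin = MARGIN.search(paragraph.get("style", ""))
        if text.lstrip().startswith("- "):
            # Content clue: a leading hyphen marks a list item; drop the hyphen.
            paragraph.tag = "item"
            paragraph.text = text.lstrip()[2:]
        elif margin and int(margin.group(1)) >= min_margin:
            # Formatting clue: a wide left margin marks a list item.
            paragraph.tag = "item"
            paragraph.attrib.pop("style", None)
    return root

source = ET.fromstring(
    '<doc><p>Intro</p><p>- first</p>'
    '<p style="margin-left: 36pt">second</p></doc>'
)
result = infer_list_items(source)
print([(element.tag, element.text) for element in result])
```

The threshold and the bullet character are the sort of parameters you will end up tuning against real documents, and recording them in the authors' template keeps the guesswork to a minimum.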