2.1 The big picture
Defining the definition. What this chapter's title, "The source definition," refers to is a detailed specification listing all the element types and attributes you will use in your XML, the rules of structuring these units within source documents, and various constraints on their values. All parts of this specification must be in place before you can translate your content into XML.
Some of the source definition rules may be stored in a special document called a schema definition, or simply a schema. Schemas enable automatic validation; that is, you can feed your documents and a corresponding schema to a validator program and get a list of all errors found. A common type of schema is a document type definition (DTD). However, a complete source definition is likely to include more than just a schema.
Vocabulary alert. Another common term, vocabulary, refers to a set of element types and attributes used in an XML document. Therefore, its chief difference from a schema is that a vocabulary does not imply a formalized description in any schema language. On the other hand, depending on the schema language, a schema may be anything from a vocabulary with simple rules for its use (a DTD) to an almost complete source definition (a Schematron schema).
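To make the distinction concrete, here is a minimal sketch that extracts the vocabulary of a document: nothing more than two sets of names, with no rules about how they may be combined. The page source and all of its element and attribute names are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical page source; the element and attribute names are made up.
PAGE = """
<page id="about">
  <title>About us</title>
  <block type="intro"><p>We make <em>things</em>.</p></block>
</page>
"""

def vocabulary(xml_text):
    """Return the sets of element type names and attribute names used."""
    root = ET.fromstring(xml_text)
    elements, attributes = set(), set()
    for node in root.iter():
        elements.add(node.tag)
        attributes.update(node.attrib)  # adds the attribute names only
    return elements, attributes

elements, attributes = vocabulary(PAGE)
print(sorted(elements))
print(sorted(attributes))
```

Note what the result does not tell you: whether a title may appear inside a p, or whether id is required. That is the part a schema adds.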
The ways of being correct. The point of creating a source definition is to ensure that conformant documents are transformed successfully by the stylesheet and yield web pages that are correct, in a broad sense. (What no source definition can guarantee is the meaningfulness and relevance of your web site content; this is why you are here, after all.) Here are just a few of the requirements that a set of correct web pages must satisfy:
all visible and invisible components of each page must be in their proper places;
there must be no missing or wrong page components;
there must be no missing pages or orphaned links;
all pages must be correctly linked up by the site's navigation system; and so on.
You can add your own requirements or limitations of almost any kind that the site's pages (or parts thereof) must meet. In most cases, the site's source definition is the best place to formally express (and thus enforce) these requirements.
After design but before implementation. A conformant source must, of course, match the site's transformation stylesheet (1.3.3). However, usually you create your source definition first and write a corresponding stylesheet later. What you do need to have before starting to work on a source definition is a detailed plan of what the final pages will look like and what they will contain. This means that your project must be well past the stages of content design (deciding what to put up on your web site and how to distribute the material across pages) and visual design (deciding how to present the material graphically).
In the real world, all of these stages tend to overlap. You may find it necessary to make design and content adjustments while working on source definition, then modify the source definition while writing and debugging the stylesheet, and finally polish all of these components both before and after the launch of the site.
2.1.1 Two-tier architecture
As we've just seen, a source definition is more than simply an inventory of the element types and attributes that you can use in your XML documents. It makes sense to subdivide the source definition into two layers: the document layer and the super-document layer.
Document layer. The document layer is where you declare what can be used within individual source documents. This includes declarations of element types and attributes, as well as the rules for what data types are allowed for them and how these structural units must be laid out in the source. The document layer of the source definition is further subdivided into document types (for example, one for the site's front page and another for the subpages). Each document type may have its own hierarchy of element types and enforce its own structure rules.
Often, each document type is described by a separate schema that can be written in the DTD notation or in any other schema language (2.2.1). The most valuable aspect of a schema is that it allows you to automatically validate your documents using corresponding validator software. For example, with a DTD, a validating XML parser such as Xerces can validate a document during parsing, so any errors will be reported as soon as you attempt an XSLT transformation (as it requires that your source document be parsed first).
Super-document layer. A complete source definition will likely include a number of rules that cannot be placed into any single document type. These rules involve relations among different XML documents or their parts, so I will call them collectively the super-document layer of the source definition. This layer's rules might control
information distribution: what data to store in what source documents;
file conventions: how to name the source documents (this often defines the URIs of the resulting HTML pages after the transformation) and in what directories to put them;
file-to-file correspondences: rules of the type "if you add a page document, you must provide an image file for that page's heading photo";
file-to-element correspondences: rules of the type "if you add a page document, you must add a corresponding page element to the site's master document (2.1.2.1)."
Depending on your site's requirements, you may not need all of the above rule types, but you may just as well need others. Not all super-document rules are equally important. Some of them are just recommendations or accepted conventions whose breach will cause no easily identifiable consequences. Breaking other rules, however, may result in more obvious and unpleasant problems, such as a missing menu item, a wrong image, or a broken link.
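To show what such rules amount to once they are written down precisely, here is a minimal sketch of a checker for one file-to-element rule and one file-to-file rule. Everything specific in it is an assumption made for the example: the master document markup (page elements with a src attribute), the naming convention (img/NAME.jpg holds the heading photo for NAME.xml), and the idea of passing the file listing in as a plain set of names.

```python
import xml.etree.ElementTree as ET

# Hypothetical master document: one <page> element per page document.
MASTER = """
<site>
  <page id="index" src="index.xml"/>
  <page id="team" src="team.xml"/>
</site>
"""

def check_super_document(master_xml, files):
    """Return a list of rule violations for a set of file names."""
    root = ET.fromstring(master_xml)
    registered = {page.get("src") for page in root.iter("page")}
    sources = {f for f in files if f.endswith(".xml") and f != "master.xml"}
    errors = []
    for src in sorted(sources):
        stem = src[:-len(".xml")]
        # File-to-element rule: each page document needs a <page> in the master.
        if src not in registered:
            errors.append(f"{src}: no <page> element in the master document")
        # File-to-file rule: each page document needs a heading photo.
        if f"img/{stem}.jpg" not in files:
            errors.append(f"{src}: missing heading photo img/{stem}.jpg")
    # The reverse direction: the master must not list pages that do not exist.
    for src in sorted(registered - sources):
        errors.append(f"{src}: listed in the master document but not found")
    return errors

FILES = {"master.xml", "index.xml", "team.xml", "news.xml",
         "img/index.jpg", "img/news.jpg"}
for message in check_super_document(MASTER, FILES):
    print(message)
```

With the sample data, the checker reports that news.xml is not registered in the master document and that team.xml has no heading photo. Where such a check should live (back-end, build layer, stylesheet, or schema) is a separate question.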
2.1.1.1 Implementing the super-document rules
Unfortunately, the DTD notation, as well as most other schema languages, is unable to express the super-document rules of a source definition. DTDs or XSDL schemas can only control (to various degrees) the XML markup within a document, not connections or dependencies between different documents.
On the other hand, a system of rules is only useful when there are ways to enforce these rules, preferably automatically. What are our options for implementing the super-document layer of a source definition?
Human-readable instructions. These may take the form of standalone documentation or comments embedded either in the schema or in the source templates (2.2.3). For example, if you want to ensure that a photo is provided for each section page's heading, you can write down a human-readable rule to this effect in whatever place is the most convenient for the site's editors (in the schema, in the section page template, or on a separate reference sheet printed out and glued to the wall).
This approach works best with qualified permanent maintenance staff and cannot guarantee automatic conformance. Still, documenting your source definition in some form is necessary in any case, and for simple sites or small teams, it may also suffice for all of your super-document definition needs.
Back-end scripting. You can program the super-document rules into your back-end, the part of the system responsible for interacting with the site's editor. With this approach, checks are made right when a source document is changed and before the stylesheet is run.
Your back-end may be anything from a simple text editor in which you touch up your XML sources, to a specialized XML editor (6.1), to a complete local or distributed content management system (CMS). Obviously, the scripting capabilities of your back-end and its ability to rise from the currently edited document to the super-document context may vary widely. Some rules may be simple to enforce, while others may require writing external programs to be called from within the back-end. Form-based XML editors (6.1.3) make especially good back-end scripting hosts; in them, checks may be activated not only when the entire document is completed but when a particular field is filled in.
While using a "smart" back-end may be convenient, this approach has several disadvantages. First, not all web sites need a complex, scriptable back-end. Many projects will run just fine from a set of static XML sources manually editable with a text or XML editor with limited, if any, scripting capabilities. Second, this solution forces all site maintainers to use the same back-end, which may be suboptimal. Finally, source validation implemented in one back-end may not be easily portable to another.
Build layer checks. You can also program your super-document checks into the build framework that controls stylesheet execution. For example, in programming projects the make utility is often used to perform close analogs of super-document checks (verifying file existence, checking file dates, etc.), and, as we'll see (6.5.1), make can be successfully used for building an XML-based web site.
Obviously, this approach and the back-end scripting described above share disadvantages. Not all projects and not all developers within a project need to use a separate build layer, and checks created for one build system (e.g., make) are not easily portable to another (e.g., Apache Ant).
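The kind of test a build layer performs can be sketched in a few lines. The function below imitates the make-style logic mentioned above (verify that the prerequisites exist, then compare file dates); it is an illustration of the idea rather than a replacement for make, and the file names in the demo are hypothetical.

```python
import os
import tempfile

def is_stale(target, prerequisites):
    """make-style test: rebuild if target is missing or older than any input."""
    missing = [p for p in prerequisites if not os.path.exists(p)]
    if missing:
        # A missing source file is a super-document error, not a rebuild trigger.
        raise FileNotFoundError("missing prerequisite(s): " + ", ".join(missing))
    if not os.path.exists(target):
        return True
    target_time = os.path.getmtime(target)
    return any(os.path.getmtime(p) > target_time for p in prerequisites)

def demo():
    """Create three files with fixed timestamps and test them twice."""
    with tempfile.TemporaryDirectory() as tmp:
        source = os.path.join(tmp, "team.xml")
        master = os.path.join(tmp, "master.xml")
        target = os.path.join(tmp, "team.html")
        for path in (source, master, target):
            with open(path, "w") as f:
                f.write("")
        os.utime(source, (1000, 1000))
        os.utime(master, (1000, 1000))
        os.utime(target, (2000, 2000))   # page built after both sources
        before = is_stale(target, [source, master])
        os.utime(master, (3000, 3000))   # master edited after the build
        after = is_stale(target, [source, master])
        return before, after

print(demo())
```

Note that the page depends on the master document as well as on its own page document, so editing the master makes every page stale.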
Stylesheet checks. Your site's XSLT transformation stylesheet can use the document() function to access any available XML documents (even those stored remotely). Any data from these external documents can be used in arbitrary calculations or comparisons. Also, extension functions (5.3.2) can be called from XSLT to perform other types of checks, such as verifying the presence of files and directories, determining image sizes and formats, etc. Any errors found by the stylesheet are reported during transformation.
This is perhaps the most natural approach to implementing super-document checks. Once set up, validation is automatic in that the checks are guaranteed to be made before you upload your web-ready files onto the server. Error reports can be arbitrarily long and detailed, and can in effect serve as a sort of on-demand documentation for specific rules. Also, this option does not require that you write a separate subsystem for the sole purpose of source validation; you can (and are encouraged to) add checking and reporting as you implement the corresponding stylesheet logic.
The big problem with this method is that it mixes up what should really be kept apart: source semantics and presentation algorithms. For example, imagine you want to render the same source documents in a different format (e.g., PDF or WML). Instead of writing a completely independent new stylesheet for this, you'll end up borrowing the super-document checks from your original stylesheet, because these checks are actually part of the complete source definition that you are reusing, not part of the presentation algorithms that you are rewriting for the new format. Such duplication of code across stylesheets is prone to errors and difficult to maintain.
Schematron schemas. Schematron is one schema language that stands apart from others in that it uses XPath expressions for defining arbitrary constraints on the structure and data of an XML document. As a result, Schematron allows you to implement all the same checks that are possible with an XSLT stylesheet: arbitrary calculations with XML data, both within one source document and across documents, and practically unlimited checks of non-XML data using extension functions.
In other words, Schematron provides all the benefits of the stylesheet checks but without their downsides. The validation layer implemented as a Schematron schema is completely orthogonal to the stylesheet logic (although it may share some code with the stylesheet; see 5.1.1) and is reusable across applications. In fact, with Schematron, the distinction between the document and super-document layers becomes largely irrelevant, since you can elegantly implement all necessary rules and constraints in one schema.
Greet the winner. Summarizing, Schematron comes very close to being the ideal solution for implementing the entire source definition, including both document and super-document layer rules. The only disadvantage to it is that due to its rule-based nature, expressing entire grammars in Schematron may not be as straightforward as with other schema languages (see 2.2.1 for more on this). In the examples in the following chapters, we'll focus on the Schematron source validation techniques.
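To give a feel for what "rule-based" means here, the sketch below imitates the general shape of such a schema in Python: a flat list of rules, each consisting of a context, a test, and a message. This is not Schematron; real Schematron states its tests as XPath expressions inside the schema, whereas this sketch uses Python functions and the limited path syntax of ElementTree. The markup and the two rules are invented for the example.

```python
import xml.etree.ElementTree as ET

# Each rule: (context path, test function, message). All names are illustrative.
RULES = [
    (".//page", lambda el: el.get("id") is not None,
     "every page must have an id attribute"),
    (".//section", lambda el: el.find("page") is not None,
     "every section must contain at least one page"),
]

def validate(xml_text, rules=RULES):
    """Apply every rule to every element matching its context; collect failures."""
    root = ET.fromstring(xml_text)
    failures = []
    for context, test, message in rules:
        for element in root.findall(context):
            if not test(element):
                failures.append(message)
    return failures

MASTER = """
<site>
  <section id="products"><page id="widgets"/><page/></section>
  <section id="empty"/>
</site>
"""

for failure in validate(MASTER):
    print(failure)
```

Each rule is independent of the others, which makes isolated constraints easy to add one at a time. Describing a complete grammar (which children may appear where, and in what order) would take many such rules, which is the drawback noted above.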
Do not forget to super-document. Although combining different super-document enforcement techniques is not usually a good idea, the first of the options described above (human-readable documentation) must accompany any other approach you are using. All the super-document layer rules (as well as the most important document layer rules) must be fully documented so that authors can produce valid documents without having to go through too many trial-and-error iterations.
2.1.2 Organizing source documents
The most obvious approach to translating a web site into XML is, "one XML source document maps to one HTML page." However, following this path pedantically would mean placing all of a web page's information into one XML source document. This is demonstrably wrong; even if present on a particular web page, some bits of information may logically belong elsewhere.
2.1.2.1 Master document
Even in the simplest web site setup, we need at least two source XML documents for each web page. One is the page document storing the material specific to this particular page: text, links, image references, and so on. The other document, which we will call the master document, provides material that is common to more than one page: navigation, logos, copyrights, disclaimers, some of the metadata (such as keywords and descriptions that apply to the entire site or section), parent sections' titles and links, etc.
Certain bits of master document data may be used on all pages of the site, while others may apply only to a section within a site. You can therefore further subdivide your master document into several documents (one for the site and one for each section), or you can still store all information in a single master document. In the latter case, the markup of the master document should make it easy for the stylesheet to locate data corresponding to each section.
Find it on the map. Figure 2.1 depicts the process of transforming a sequence of page documents in XML into a sequence of HTML pages and the place of the master document in this process. Note that the page documents and the master document are validated before being fed to the XSLT processor controlled by the transformation stylesheet.
Figure 2.1. The page documents and the master document are fed to the transformation stylesheet that produces HTML pages.
Site directory. The most important data in a master document is a description of the structure of the site, listing all its pages: a site directory in XML. Individual pages use this information to figure out their place in the context of the site and build their navigation accordingly. A site map page, often found on complex sites, can be generated by the stylesheet directly from the master.
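As a rough illustration of how a page can find its place in the site directory, the sketch below looks up a page in a small master document and walks up to the root to collect a navigation trail. The markup (site, section, and page elements with id and title attributes) is an assumption made for this example, not the master document format developed later in the book, and in the actual site this lookup would be done by the stylesheet rather than by a script.

```python
import xml.etree.ElementTree as ET

# Hypothetical site directory; element and attribute names are made up.
MASTER = """
<site title="Example Co.">
  <page id="index" title="Home"/>
  <section id="products" title="Products">
    <page id="widgets" title="Widgets"/>
    <page id="gadgets" title="Gadgets"/>
  </section>
</site>
"""

def breadcrumbs(master_xml, page_id):
    """Return the titles from the site root down to the given page."""
    root = ET.fromstring(master_xml)
    # ElementTree has no parent pointers, so build a child-to-parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    page = next((p for p in root.iter("page") if p.get("id") == page_id), None)
    if page is None:
        raise KeyError(f"page {page_id!r} is not in the site directory")
    trail = []
    node = page
    while node is not None:
        trail.append(node.get("title"))
        node = parent_of.get(node)  # None once we pass the root
    return list(reversed(trail))

print(breadcrumbs(MASTER, "gadgets"))
```

Because sections are nested elements, the data for each section is found simply by following the tree, which is the kind of markup convenience a single master document should offer.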
It may make sense, just to be logical, to separate the master file into a site directory document and a metadata document holding the rest of the site-wide metadata. However, for most projects, such separation holds little advantage by itself.
What it is not. The master document is not to be confused with the site's front page; no matter how different the latter is from the rest of the pages, it is still just a page that is generated from its own source document. A master document's role is not to generate any specific page but to provide the site directory, common content, and metadata for all pages of the site.
Those familiar with Cocoon (7.2) might wonder how a Cocoon sitemap compares to the master document of a site. They have little in common: The sitemap defines the processing patterns of a site, while the master document is part of the site's content.
A sample master document is examined in Chapter 3 (Example 3.2, page 143).
2.1.2.2 Orthogonal content
Apart from the master document, other source documents (the leaves of the source tree) normally correspond one-to-one to the final pages of the site. However, this is not always true. Certain pieces of content may be orthogonal to the site's hierarchy; that is, they may appear on more than one page regardless of those pages' place in the tree and sometimes even regardless of their content.
Examples of orthogonal content include news blocks (except those that are the main content of their pages), advertisements, sidebars, featured links, "quotes of the day," etc. Some of it borders on metadata, whose proper place is in the master document; however, unlike metadata, orthogonal content is meaningful even outside of its web site context and is usually updated regularly. It can be organized into its own hierarchy, which is usually independent of (but may be in some aspects parallel to) the main site hierarchy.
Compared to the information they are neighboring with on the final web pages, these orthogonal content units may be maintained by different people, obey different update rules, and even use a different markup vocabulary. This last difference, however, should be avoided because it may be quite costly in terms of complexity of the validation and stylesheet code.
Referencing orthogonals. It makes sense to store orthogonal content in separate source documents. To specify what orthogonal units should go to what site pages, you can use any of the following methods:
You can hard-code this into the stylesheet. For example, if you want the same sidebar to appear on all pages of the site except the front page, it is straightforward to program your stylesheet so it retrieves the sidebar source, formats it, and places it on all subpages it generates. Also, this is the only method that works whenever the logic of placement of orthogonal information cannot be expressed declaratively but is algorithmic.
For example, if you want each page to automatically display a textual ad block most closely related to its content (which may change often), static XML cannot express that. Only the code that actually builds the page can dynamically implement this algorithm, for example, by searching each page for keywords and matching these to the keywords from an orthogonal pool of ad blocks.
You can link orthogonal pieces to the main site hierarchy via the master document. For example, you can add an attribute to the elements that represent pages in your master document's tree, specifying the source of the orthogonal unit(s) to be placed on the corresponding page. This is perhaps the most logical approach; it works for an arbitrarily complex (but static) distribution of orthogonal content, making it easy to overview and maintain.
Finally, you can extend your source markup vocabulary to include an element or attribute that, when encountered in a page document, triggers the inclusion of orthogonal information. This approach may be the first to come to your mind, but it is not necessarily optimal: It contaminates your source with low-relevance information (after all, orthogonal content is so called exactly because it has little direct relation to the other content on the page) and makes the structure of the site more difficult to update by decentralizing connections between pages.
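The algorithmic placement mentioned under the first method (picking the ad block most closely related to a page) can be sketched as follows. The scoring here, a simple count of shared keywords, is just one possible matching rule chosen for brevity; the ad identifiers and keyword sets are invented.

```python
import re

# Hypothetical pool of orthogonal ad blocks, each with its own keywords.
AD_BLOCKS = {
    "ad-hosting": {"server", "hosting", "uptime"},
    "ad-fonts": {"typography", "fonts", "design"},
}

def keywords(text):
    """Split text into a set of lowercase words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def pick_ad(page_text, ad_blocks=AD_BLOCKS):
    """Return the id of the ad sharing the most keywords with the page, or None."""
    words = keywords(page_text)
    best_id, best_score = None, 0
    for ad_id in sorted(ad_blocks):          # sorted for a stable tie-break
        score = len(words & ad_blocks[ad_id])
        if score > best_score:
            best_id, best_score = ad_id, score
    return best_id

print(pick_ad("Choosing fonts for web design"))
```

Since the result depends on the current text of each page, no static attribute in the source could record it reliably; it has to be computed whenever the page is built.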
Still another solution is to combine the two last approaches from the above list. You could use the master document to associate some unique identifiers with orthogonal content units stored elsewhere, and then reference these identifiers in page documents to incorporate those units. This gives you both centralized control (you can update all pages that use some unit simply by changing its identifier association in the master) and editing convenience (you only have to edit one page document when you want to add or remove some predefined orthogonal units on that page). This is the approach that we will use in the following chapters (see examples in 3.10, page 140).
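The combined approach boils down to a two-step lookup, sketched below: the page document names a unit by identifier, and the master document says where that unit's source is stored. The block elements and the id, idref, and src attributes are placeholders invented for this sketch; the markup actually used in this book is shown in the examples referred to above.

```python
import xml.etree.ElementTree as ET

# Hypothetical master document fragment: identifier-to-source associations.
MASTER = """
<site>
  <blocks>
    <block id="news" src="blocks/news.xml"/>
    <block id="quote" src="blocks/quote-of-the-day.xml"/>
  </blocks>
</site>
"""

# Hypothetical page document: references sit directly under the root element.
PAGE = """
<page id="team">
  <block idref="news"/>
  <block idref="quote"/>
  <title>Our team</title>
</page>
"""

def orthogonal_sources(master_xml, page_xml):
    """Resolve the page's block references to source locations via the master."""
    master = ET.fromstring(master_xml)
    page = ET.fromstring(page_xml)
    registry = {b.get("id"): b.get("src") for b in master.iter("block")}
    sources = []
    for ref in page.findall("block"):        # direct children of the root only
        idref = ref.get("idref")
        if idref not in registry:
            raise KeyError(f"unknown orthogonal block {idref!r}")
        sources.append(registry[idref])
    return sources

print(orthogonal_sources(MASTER, PAGE))
```

Pointing the news identifier at a different file in the master changes every page that references it, while adding or removing a block on one page means touching only that page document.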
Not where it appears to be. When adding an orthogonal reference to a page document, don't try to position it to correspond to the physical position of the orthogonal block on the formatted page. Remember that the structures of the XML page source and its visual rendition cannot be parallel, if only because one is hierarchical while the other is two-dimensional.
Usually, an attribute or a direct child (first or last) of a page document's root element is a good enough place for linking up orthogonal content. The stylesheet will decide what is the best position for the block on the final page. Only in cases when this visual positioning is excessively difficult to calculate in the stylesheet, or when your pages will display orthogonal content intermingled with native content, should the orthogonal references in the source be moved to a place somehow corresponding to the positions of the formatted orthogonal blocks on the web page.
External entities are not for linking. XML provides external parsed entities ( 18.104.22.168 ) to embed externally stored components of a document. This is not a link between two documents: The entity is treated as an intrinsic part of the document, just as if it were physically part of it.
Parsing, validation, and processing take place after entity references are resolved. They should make no distinction between a document that is stored in a single piece, and the same document with components stored in external parsed entities.
For example, in book projects, entities may be used to store some chapters separately for convenient editing. When the book is parsed and processed, however, the chapters are seen as part of the book, regardless of whether they are physically present or accessed by resolving entity references.
Don't even think about using entity references as the equivalent of links. Link elements exist at the semantic level and the processing for them can vary from one application to another. In our approach to web site design, orthogonal content and site-wide metadata should be handled with links.
Note that some parsers, although DTD-aware, do not support external parsed entities.
2.1.2.3 Storing auxiliary data in the stylesheet
Some of the site's material can also be stored in the transformation stylesheet. This is only advisable for data that does not really belong in the source of the site. That is, if you can create a different (but fully adequate) rendition of the same source without some bits of its content, chances are these bits are not really part of the source proper but may need to be stored in the stylesheet.
Boil-down analysis. For example, suppose your site features a graphic button reading "portfolio." The same button displays a floating tooltip when the mouse pointer is hovering over it, reading "click here to view portfolio." Apparently, this web site element is composed of three components: the "portfolio" label, a reference to an image file (for example, img/portfolio.png), and the text of the tooltip.
Of these three components, only the bare text of the button ("portfolio") clearly deserves to be called its source (and to be stored in the source XML document). The image reference, as well as the image file itself, should ideally be generated by the stylesheet from the label (5.5.2); however, if this is not possible, the reference to a prefabricated image file can be stored in the source. As for the tooltip, the most effective approach is to also generate it from the button label by automatically prefixing the latter with "click here to view." Store the tooltip text in the source XML only if this automation does not work for all labels, or if you want to reword some labels into the tooltips to make them more descriptive. (Remember, however, that "click here" is meaningless outside of HTML, e.g., if the same material is rendered as a VoiceXML interface.)
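The derivation described above fits in a few lines. In this sketch, the label is the only required input; the image reference and the tooltip are computed from it unless the source supplies an explicit override. The function name and the rule for forming the file name are assumptions made for the example, and in the site itself this logic would live in the stylesheet.

```python
def button(label, tooltip=None, image=None):
    """Derive a button's presentation data from its label, allowing overrides."""
    slug = "-".join(label.lower().split())   # "Our Work" -> "our-work"
    return {
        "label": label,
        "image": image or f"img/{slug}.png",
        "tooltip": tooltip or f"click here to view {label}",
    }

# Everything generated from the label alone:
print(button("portfolio"))

# A tooltip stored in the source because the automatic wording does not fit:
print(button("contact", tooltip="click here to get in touch"))
```

A rendition for another medium would keep the label and swap out the derivation rules, which is why only the label belongs in the source.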
2.1.2.4 Other approaches
Everything in one chunk. A different approach to source organization is, "one XML source document maps to many HTML pages." Actually, you can store the source of all pages as well as all site metadata in one big document. The stylesheet can easily extract all necessary data from this source repository and generate all pages of the site from it. This also makes some of the super-document checks possible even with those schema languages that cannot go outside of the current document. However, such a monolithic storage unit is not very convenient to update and maintain, especially if more than one person will be working on the site.
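Here is a minimal sketch of the "one source, many pages" arrangement: a single document holds every page, and a loop produces one HTML page per page element. The markup is invented, the HTML is deliberately bare, and the text is not escaped; a real site would do this work in the stylesheet.

```python
import xml.etree.ElementTree as ET

# Hypothetical monolithic source: site metadata and all pages in one document.
SITE = """
<site title="Example Co.">
  <page id="index"><title>Home</title><p>Welcome.</p></page>
  <page id="team"><title>Our team</title><p>Meet us.</p></page>
</site>
"""

def build_pages(site_xml):
    """Return a mapping of output file names to HTML strings."""
    site = ET.fromstring(site_xml)
    pages = {}
    for page in site.findall("page"):
        title = page.findtext("title")
        # Plain-text paragraphs only; no escaping, to keep the sketch short.
        body = "".join(f"<p>{p.text}</p>" for p in page.findall("p"))
        pages[page.get("id") + ".html"] = (
            f"<html><head><title>{title} | {site.get('title')}</title></head>"
            f"<body><h1>{title}</h1>{body}</body></html>"
        )
    return pages

for name, html in build_pages(SITE).items():
    print(name, len(html))
```

Since all pages and the site-wide data sit in one tree, a rule such as "every page id must be unique" becomes an ordinary single-document constraint.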
Objects in a database. When you have to manage a lot of source documents, it might make sense to store them not in files but in a native XML database, such as Xindice. This will make access to your data much faster, especially in search operations, and let you do some impressive tricks (e.g., evaluate an XPath expression against many documents at once). However, each object in such a database still represents a self-contained XML document, so everything we discussed above regarding organization of source files still applies.
Reusing existing infrastructure. The web site requirements are your primary starting point, but you must also take into account your organization's established electronic document workflow. Converting other formats to XML is discussed in 6.2, but if all of the documents around you are in some XML vocabulary, this will have direct consequences for developing the site's source definition and organizing source documents.