


2.3 The art of source definition

Designing the markup to be used in your site's XML source may sound simple. In reality, however, it may well be the most challenging (and most interesting) part of the entire project. A simple static web site may make do with a dozen element types, but even those may not be easy to select. (In fact, defining one dozen element types may be more difficult than defining two dozen.) You need a solid knowledge of your site's subject area as well as good abstraction capabilities to develop a fully adequate, flexible yet compact source definition.

Below, we'll discuss some of the common errors and misconceptions pertaining to creating a new source definition from scratch. (When you look at the XML source of a finished site, such as our summary examples in 3.10, it might seem obvious, but this simplicity may cost a lot of effort.) Only generic markup issues are covered here; no particular elements of a web page are examined, as this is the subject of Chapter 3.

2.3.1 Semantic analysis

Since our goal is to create a semantic XML vocabulary (1.1.1), the key concept is semantic analysis. Keep asking yourself questions like:

What is it that this fragment of content is supposed to mean? What pertains to its meaning, as opposed to its visual presentation (style) or dynamic features (logic)?
Where are the boundaries of this fragment from the semantic perspective; that is, what other material (adjacent or not) belongs to it?
What material, even if it is rendered as part of this fragment, should be moved away in the source?
What happens if I pull this fragment out of its context or move it to another context? How would this affect its meaning?

This way of thinking requires certain training. Learning to abstract out the meaningful core of a web page (which, at this stage, may only exist as a design draft with dummy content, or as a structureless pile of content with no design) is difficult to begin with. Even more difficult is figuring out where the abstracted meaning belongs; the fact that a fragment of content appears in a specific position on the final web page does not necessarily imply that its corresponding XML element must occupy the corresponding position in the source document. For example, if this fragment is copied over to more than one page, its proper place may not be in this source document at all.

2.3.2 Learn to think hierarchically

That XML is a formalism for serializing (i.e., representing sequentially) hierarchical data is usually well understood in theory. In practice, however, authors often have trouble trying to think hierarchically. The resulting XML is then flatter than necessary, with a complex tree of variously scoped objects reduced to a long row of sibling elements. And when there is some depth to the tree, it is often a reflection of the two-dimensional layout of the rendered page, not of the intrinsic structure of the content.

There are cases when the choice between hierarchical and linear arrangements is not that obvious (for an example, see 2.3.7). However, it is almost universally true that hierarchical structures are easier to navigate, which means the stylesheet processing a hierarchical XML document will be more straightforward to write and maintain.

In part, this is because with linear structures, XPath expressions will more likely include numeric indices (as in /item[5]/label). Such numeric pointers are almost meaningless to a stylesheet reader without seeing a document they apply to. What's worse, it is too easy to break them by modifying the document in such a way that the number of sibling elements in a sequence is changed.
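The fragility of numeric indices is easy to demonstrate. The following is only a sketch: it uses Python's standard xml.etree.ElementTree module (whose limited XPath subset supports positional predicates), and the menu document and its id attributes are hypothetical, invented for the illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical flat source: a row of sibling <item> elements.
menu = ET.fromstring(
    '<menu>'
    '<item id="home"><label>Home</label></item>'
    '<item id="products"><label>Products</label></item>'
    '<item id="customers"><label>Customers</label></item>'
    '</menu>'
)

# A positional pointer: "the third item" happens to be Customers today.
by_index_before = menu.find("item[3]/label").text

# Someone adds a new item at the start of the sequence.
news = ET.Element("item", {"id": "news"})
ET.SubElement(news, "label").text = "News"
menu.insert(0, news)

# The same expression now silently points at a different element...
by_index_after = menu.find("item[3]/label").text

# ...while a lookup that names what it wants is unaffected.
by_id_after = menu.find("item[@id='customers']/label").text

print(by_index_before, by_index_after, by_id_after)
# prints: Customers Products Customers
```

Nothing fails loudly here, which is the real danger: the positional expression keeps returning a value, just not the one its author had in mind.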

2.3.3 Child elements vs. attributes

A puzzle that a great many XML authors run into is whether to represent a particular bit of information as an attribute of some element or as its child element. The three essential points of difference are these:

Attributes are not extensible: You cannot have children or other attributes attached to an attribute (i.e., only character sequences can be stored in an attribute value), but you can always add children or attributes to an element.
Attributes are unique within their elements: You cannot have two attributes with the same name under the same parent element, but elements pose no such restriction.
Attributes are unordered: You cannot rely on the order of an element's attributes to be preserved during processing or transformation of any kind, but the order of child elements is always preserved.

There are several minor points as well:

It is only possible to store one-line text fragments as attribute values. If necessary, a newline can be entered in an attribute value, but it is normalized to a space by the parser.
For DTD users, it may be important that DTDs permit some basic data type checks on attributes, but not on data content within elements.
Attributes cannot be commented separately: You cannot place an XML comment next to an attribute because comments must stay outside of tags. You can place comments inside or outside of an element as you see fit.
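Some of these differences can be observed directly in a parser. The sketch below assumes Python's standard xml.etree.ElementTree module; the note, title, and lang names are hypothetical and carry no meaning beyond the illustration.

```python
import xml.etree.ElementTree as ET

# The same two-line text stored once in an attribute and once in element content.
note = ET.fromstring('<note title="line one\nline two">line one\nline two</note>')

attr_value = note.get("title")  # the newline is normalized to a space
text_value = note.text          # the newline survives in element content

# Two attributes with the same name make the document ill-formed.
try:
    ET.fromstring('<note lang="en" lang="de"/>')
    duplicate_accepted = True
except ET.ParseError:
    duplicate_accepted = False

# Repeated child elements are fine, and their order is preserved.
doc = ET.fromstring("<note><lang>en</lang><lang>de</lang></note>")
langs = [child.text for child in doc.findall("lang")]

print(repr(attr_value), repr(text_value), duplicate_accepted, langs)
```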

Other than that, it is mostly a matter of taste and semantic modeling accuracy. Attributes are commonly used to store metadata (data about data), while elements contain the data itself; although this distinction may in some cases be vague, it is still a useful rule of thumb to remember. For example, a unique identifier of an element is undoubtedly a piece of metadata, so it is always stored in an attribute (conventionally called id).

Another rule of thumb is that attributes tend to be used more often for numeric data and for various formal constructs such as URIs, whereas human-readable text (even if flat) is better stored in elements. The rationale behind this rule is that when stripped of any markup, XML is supposed to still make at least some sense to a human reader, which means that as much as possible of the document text must be in element content and not in attributes.

This means, by the way, that HTML's markup model by which an image description is stored in the alt attribute of an img element is not optimal. If you define a similar element type for your source markup, it is preferable to store the description text in the content of the element:

 <image src="image.png">Company logo</image>

Note that the object element type, added to HTML later than img, uses this markup model.
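To see what "stripped of any markup" means in practice, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The surrounding p element and the sample sentence are assumptions made up for the comparison; strip_markup is a hypothetical helper, not part of any library.

```python
import xml.etree.ElementTree as ET

# HTML-style model: the description lives in an attribute.
attribute_style = ET.fromstring(
    '<p>Welcome to <img src="image.png" alt="Company logo"/>.</p>'
)

# Content-style model: the description is the content of the element.
content_style = ET.fromstring(
    '<p>Welcome to <image src="image.png">Company logo</image>.</p>'
)

def strip_markup(element):
    """Concatenate all character data, discarding tags and attributes."""
    return "".join(element.itertext())

stripped_attribute_style = strip_markup(attribute_style)
stripped_content_style = strip_markup(content_style)

print(repr(stripped_attribute_style))  # 'Welcome to .'
print(repr(stripped_content_style))    # 'Welcome to Company logo.'
```

With the description in an attribute, the stripped sentence has a hole in it; with the description in element content, it still reads as text.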

2.3.4 The art of naming

Source definition means, among other things, thinking up names for your attributes and element types. It is of course a very subjective matter, but some recommendations may still be useful.

Spell out. First of all, avoid abbreviations; use complete words (unless, of course, an abbreviation is more common or more familiar than its spelled-out form). XML is not supposed to be terse; it is supposed to be readable. Saving a few keystrokes now may cost you many lost seconds later as you'll be trying to remember how exactly you abbreviated that name.

Use case. You can use all-uppercase names if you want your markup to really stand apart from the text. However, all-lowercase names look nicer and are easier to type and read. Initial capitals make no sense for single-word names and may be confusing for multiple-word names (such as SectionHeading) because they look somewhat like regular text with its mix of cases. ^[16] XML is case-sensitive, and mixed case ought to be considered harmful if only because it makes your markup a fertile ground for hard-to-catch case errors.

^[16] Even more confusing is the so-called "camel case" where all initials but the first are capped, e.g., camelCase.

Hyphen-ate. Unlike most programming languages, XML permits hyphens in names, and XSLT and many other XML vocabularies use them to separate parts of multiword names (e.g., apply-templates). In my opinion, this convention makes complex names easier to remember and more legible than using initial capitals. (It's also handy that, unlike the underscore sometimes used in other languages, a hyphen can be typed without pressing Shift.)

No matter what naming style you choose, be consistent and do not mix different styles within one application.

Readable markup. Another consideration unique to XML is that some element type names will be used only in combination with the names of their required attributes. This allows you to use creative naming schemes that sound almost like English but are nevertheless strict and unambiguous. For example, if you need an element type to represent internal links with an obligatory attribute providing the link's address, then instead of

 <internal-link address="address"/>

you could use

 <internal link="address"/>

This "reversed grammar," with an adjective for the element type name and a noun for the attribute name, makes perfect sense for this construct and is easy to read and remember. You need to make sure, however, that the link attribute always comes before any other attributes of this element. This is an authoring convention that serves readability only, since an XML processor attaches no meaning to attribute order (2.3.3).

Similarly, if one element type is always a child of another, you can use the context of the parent to make the child's name shorter. For example, you don't need an author-name inside an author; just name will do. Do not worry if you have more than one name element type in your vocabulary; unless you use DTDs for validation, you'll have no problems validating and processing different names differently depending on their context.
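Processing same-named elements differently depending on their context amounts to addressing them by path rather than by name alone. A minimal sketch, assuming Python's standard xml.etree.ElementTree module and a hypothetical book document in which both author and publisher have a name child:

```python
import xml.etree.ElementTree as ET

# Hypothetical source: two unrelated element types share the short name "name".
book = ET.fromstring(
    "<book>"
    "<author><name>Jane Doe</name></author>"
    "<publisher><name>Example Press</name></publisher>"
    "</book>"
)

# The parent supplies the context, so the two names never get confused.
author_name = book.findtext("author/name")
publisher_name = book.findtext("publisher/name")

print(author_name, "/", publisher_name)  # Jane Doe / Example Press
```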

2.3.5 Structure vs. metadata

A common difficulty for beginner XML authors is separating structure of information from metadata. These two concepts do sometimes overlap. You can more easily differentiate between them, however, if you remember that structure tends to remain constant, while metadata is likely to change as the data itself is changed.

The inventory of your element types and attributes must describe the structure of your content, while metadata is information about that content that is stored in elements and attributes of a document. ^[17] In other words, your XML vocabulary must be able to store both your information and information about that information (metadata), but must not itself depend on any particular information it stores. It is normal to have to update your vocabulary as you see more instances of content, but every such update must make the vocabulary more flexible and general, not just patch it for a specific instance.

^[17] More precisely, metadata tends to go into attributes, while the content itself is more often in elements (2.3.3).

Example: translations. Suppose you need to store two versions of a heading in different languages. Here, the language of each version is an example of metadata; if you mistake it for structure, this could result in markup like

 <heading>
   <en>Customers</en>
   <de>Kunden</de>
 </heading>

This will even work, so long as you only need these two languages. If you decide to add a third language, however, you will need to patch your schema to allow a new child element type (e.g., fr ) under heading to store it. The obvious sign of a problem is that with this approach, you cannot make "one fix to end all fixes"; inevitably, each new language will require adding an element type of its own.

The correct solution is to use the structural role of the heading translation for the element type name and move the language metadata into an attribute:

 <heading>
   <translation language="en">Customers</translation>
   <translation language="de">Kunden</translation>
 </heading>

Admittedly, this approach is bulkier, but it is much more consistent and extensible.
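The extensibility gain shows up in the processing code as well. In the following sketch (Python's standard xml.etree.ElementTree module; heading_text is a hypothetical helper written for this illustration), a third language is added to the data without touching either the vocabulary or the lookup function:

```python
import xml.etree.ElementTree as ET

def heading_text(heading, language):
    """Return the heading text in the requested language, or None if absent."""
    for translation in heading.findall("translation"):
        if translation.get("language") == language:
            return translation.text.strip()
    return None

heading = ET.fromstring(
    "<heading>"
    '<translation language="en">Customers</translation>'
    '<translation language="de">Kunden</translation>'
    "</heading>"
)

# Adding a language changes the data only; no new element type is required.
french = ET.SubElement(heading, "translation", {"language": "fr"})
french.text = "Clients"

print(heading_text(heading, "de"), heading_text(heading, "fr"))
# prints: Kunden Clients
```

With the en/de element types of the first variant, the same change would have required a new element type in the schema and a new branch in every piece of code that reads headings.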

2.3.6 Generalizing but not overgeneralizing

Creating XML markup for web site content requires abstracting out its meaningful core, leaving all the presentation details aside. This is so fundamental that, sometimes, XML authors tend to "overgeneralize" and treat pieces of content that are structurally different as metadata-differentiated variations of the same structural unit. For instance, for a heading and a paragraph of text, this could result in markup like

 <block type="heading">Customers</block>
 <block type="paragraph">Our customers are...</block>

Not only is this approach unnecessarily bulky and difficult to read, but it also makes validation more difficult because DTDs (as well as some other schema languages) cannot define content models that depend on an attribute's value. This means a DTD validator cannot enforce different rules for these two instances of block, nor can it check the proper order and/or nesting of block elements with different type values. Better markup of course would be

 <heading>Customers</heading>
 <paragraph>Our customers are...</paragraph>

Troublesome heritage. It seems likely that the overgeneralization tendency is a result of HTML experience. With HTML, you have a very limited set of element types whose intended structural roles rarely match those you need for your content. So, the standard mindset of a hardcore HTML user defaults to searching among what is available and trying to adapt one of the existing element types whenever a new structural unit must be marked up, just as HTML's div and span are often used (with CSS properties) for all kinds of structures in HTML pages. It takes getting used to XML to be able to create vocabularies that are exactly as rich as the content they apply to.

2.3.7 Parallel vs. sequential

Web site content combines both parallel and sequential components. For example, in a site's master document (2.1.2.1) the order of many elements is not important; page sources, on the other hand, are sequential documents whose order of elements is supposed to be mostly preserved in the formatted web page. This distinction is intuitively clear and poses no problems until you have to combine parallel and sequential components within one element.

Anatomy of a menu item. Consider an item element that describes an item of a site's main menu. Let's say the item must provide several bits of data: the button label, the identifier of an image displayed alongside this menu item, and a list of links to appear on the item's drop-down submenu.

Here, the label and the image identifier are parallel pieces of data in that their order in the source file is not important. But the list of submenu items is sequential, because their order will translate into the visible submenu order and is therefore meaningful. It would be an error to treat all these elements as siblings:

 <item>
   <label>Customers</label>
   <image src="customers_photo"/>
   <subitem href="references">References</subitem>
   <subitem href="clients">Clients</subitem>
   <subitem href="contact_us">Contact us</subitem>
 </item>

How, then, should we separate the parallel and sequential data in this example? We could, of course, add two child elements to the item to group all sequential data in one child and all parallel data in the other, but that would be overkill. Indeed, the parallel data (the label and the image identifier) does not so much belong together as it belongs to its parent element, item. Adding an intermediate layer between the item and its parallel children is therefore counterintuitive. The sequential data, on the other hand, clearly represents a whole (a drop-down menu) that must be marked up as such:

 <item>
   <label>Customers</label>
   <image src="customers_photo"/>
   <drop-down>
     <subitem href="references">References</subitem>
     <subitem href="clients">Clients</subitem>
     <subitem href="contact_us">Contact us</subitem>
   </drop-down>
 </item>

Now, the first level of the tree under the item (the label, image, and drop-down elements) holds only parallel data, while the second level (the subitem elements within drop-down) is sequential. This conforms to the general rule: Never make siblings out of parallel and sequential data bits, but store them on different levels of your hierarchy.
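The practical effect of this rule is that parallel children are always fetched by name and sequential children are always walked in document order, with no mixing of the two on one level. A minimal sketch, assuming Python's standard xml.etree.ElementTree module; read_item is a hypothetical helper written for this illustration:

```python
import xml.etree.ElementTree as ET

def read_item(xml_text):
    """Read a menu item: parallel data by name, sequential data in order."""
    item = ET.fromstring(xml_text)
    return {
        "label": item.findtext("label").strip(),
        "image": item.find("image").get("src"),
        "links": [
            (subitem.get("href"), subitem.text.strip())
            for subitem in item.find("drop-down").findall("subitem")
        ],
    }

DROP_DOWN = (
    "<drop-down>"
    '<subitem href="references">References</subitem>'
    '<subitem href="clients">Clients</subitem>'
    '<subitem href="contact_us">Contact us</subitem>'
    "</drop-down>"
)
LABEL = "<label>Customers</label>"
IMAGE = '<image src="customers_photo"/>'

original = read_item("<item>" + LABEL + IMAGE + DROP_DOWN + "</item>")

# Shuffling the parallel first-level children changes nothing...
reordered = read_item("<item>" + DROP_DOWN + IMAGE + LABEL + "</item>")

# ...while the order of subitems is carried through to the result.
link_targets = [href for href, _ in original["links"]]

print(original == reordered, link_targets)
# prints: True ['references', 'clients', 'contact_us']
```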

2.3.8 Existing vocabularies

For specific kinds of documents or objects, many existing XML vocabularies have been created by various user groups and standards bodies. Examples include XLink ^[18] for link semantics, XForms ^[19] for interactive forms, and DocBook ^[20] for document markup. Should you reuse these vocabularies, in whole or in part, or should you develop your own markup? The answer to this question depends on three main factors.

^[18] www.w3.org/TR/xlink

^[19] www.w3.org/MarkUp/Forms/

^[20] www.oasis-open.org/specs/docbook.shtml

The concepts you need to express in your markup. Why was the existing XML vocabulary created in the first place? What are the concepts it formalizes and the abstractions it uses? Are these concepts and abstractions of any utility for your own goal of creating a simple, easy-to-remember semantic vocabulary that strikes the right balance between strictness and flexibility?

The "stairway of abstractions" (1.1.1) is a useful analysis tool. For example, if you consider reusing (a part of) the XSL-FO ^[21] vocabulary for text formatting, you'll realize that it is focused on visual presentation, while your goal in the source is semantic markup. Therefore, for your purposes, XSL-FO is on the wrong step of the abstractions stairway no matter how rich or well thought out it is.

^[21] www.w3.org/TR/xsl

The need for interoperability. Few sites exist in complete isolation. Most need to draw their content from outside sources, and some also need to provide their content to the world in a format different from that of the web pages (one example is an RSS ^[22] news feed). Quite naturally, the requirements of these inputs and outputs will affect your source definition and may, in some cases, justify reusing some of the existing XML vocabularies in your source markup.

^[22] backend.userland.com/rss

On the other hand, if your inputs or outputs are not XML, they are useless for your source definition, and if they are, it is often easier to use XSLT to translate between vocabularies than to try to prune and graft definitions from one vocabulary to another. Remember that the main goal of your XML source is to be a foundation of your unique web site, and no existing vocabulary can be quite suitable for that.

The need for completeness. In those aspects of your source definition that might be covered by established standards, your choice must be based on the relative importance of those aspects for your content and the complexity of constructs they describe. A well-designed, widely accepted standard that went through several revisions and was tested in many projects is much better prepared for all the unexpected real-life problems. So, if some markup aspect is really important to your site and you reasonably expect that it will keep developing and becoming more complex, your best bet is to go with an appropriate existing standard even if at first glance it may seem like overkill.

In most cases, however, universality is a burden rather than an advantage. A minimalistic ad-hoc vocabulary is often much easier to use and maintain in the long run compared to an all-embracing multifarious standard that will bring its own bulky baggage if you incorporate it into your source definition. Remember that you can start simple (2.2.2) but keep adding new stuff to your vocabulary, provided you did not make any serious design mistakes in its core.

Chapter 3 mentions the most notable of the existing XML standards that could be used for some aspects of your source markup. It's up to you whether to borrow from these sources and to what extent.

2.3.9 Namespace strategies

As you are creating your own unique XML vocabulary, you need a unique namespace for your element type and attribute names. For it to be unique, it is natural to use the URL of your web site for the namespace URI. If you will be using mostly your own markup constructs while borrowing few, if any, constructs from existing vocabularies, then it is also natural to use the default namespace (without prefix) for your markup.

As for the stuff you borrow from existing vocabularies, two approaches are possible. You can treat your hired staff with due respect, fully preserving their identities (i.e., their namespaces). Or, you can be mean and basically enslave them by converting them into your own namespace.

In fact, those you convert to your namespace have no formal connection to their native land anymore. So it is not really elements or attributes that you borrow, but only their semantics and names; ^[23] what you do is just build your own house after someone else's designs. Still, what might be the advantages of the "namespaces' melting pot" strategy?

^[23] Local names, to be precise, for a namespace is a part of a fully qualified name.

Unifying everything under your own namespace (and making that namespace the default) greatly simplifies things for those who will be authoring and maintaining your site, especially if they are novices in XML.
Getting rid of the borrowed elements' original markup vocabulary means that you can also get rid of any limitations or inconsistencies in that vocabulary. Ever wished to add an advertisement="{yesno}" attribute to HTML's img element? Now you can. Just copy everything from the original img into your own markup vocabulary, removing or adding stuff according to taste.
Finally, severing the link to the original vocabulary from which you borrowed some bits of markup protects you from any future changes in that vocabulary. No need to worry if a future version of HTML deprecates, and the next one removes, your favorite element type that you heavily use in your markup.

These advantages have a flip side:

Your changes to the borrowed markup constructs may run askew of the principles and structure of their original vocabulary.
Partial borrowing or changing borrowed stuff may prevent you from reusing software or schemas written for the original vocabulary.

If any of these considerations are important in your situation, use the native namespace for borrowed stuff and keep everything as per the original specification.
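What "keeping the native namespace" looks like to a processor can be sketched as follows, using Python's standard xml.etree.ElementTree module. The site namespace URI http://www.example.com/ and the page, heading, and external element types are hypothetical; the XLink namespace URI is the one defined by the XLink specification.

```python
import xml.etree.ElementTree as ET

SITE_NS = "http://www.example.com/"        # hypothetical: your site's URL
XLINK_NS = "http://www.w3.org/1999/xlink"  # native namespace of XLink

# Own markup sits in the default namespace; the borrowed attribute keeps its prefix.
source = f"""<page xmlns="{SITE_NS}" xmlns:xlink="{XLINK_NS}">
  <heading>Customers</heading>
  <external xlink:href="http://www.example.org/">A partner site</external>
</page>"""

page = ET.fromstring(source)
prefixes = {"site": SITE_NS}

# Unprefixed elements still belong to the site's namespace.
root_name = page.tag
heading = page.find("site:heading", prefixes).text

# The borrowed attribute is identified by its own namespace, not by yours.
external = page.find("site:external", prefixes)
href = external.get(f"{{{XLINK_NS}}}href")

print(root_name, heading, href)
```

Authors type unprefixed names for everything that is your own, while software written for the borrowed vocabulary can still recognize its attribute by the fully qualified name.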

