


5.4 Bottom-level structures

The pull-oriented trunk templates in the previous sections had one thing in common: They were rather big, mostly due to a lot of HTML layout code they had to provide. However, as we descend the stylesheet hierarchy, templates become much more lightweight and transparent.

One-to-one mapping. A typical push-oriented branch template matches one source element type and converts it into one corresponding HTML element type: no complex layout, no ifs or for-eaches, no callable templates. For example, paragraphs are converted to p elements, emphasis to em, and so on.

Be careful with shadows. Inside a branch template, there's usually an xsl:apply-templates without a select attribute working as a catch-all for the children of the current element. Use specific patterns in a select only if you are sure that no other elements may legally occur in that position; otherwise you run the risk of losing data. For example, if you only have <xsl:apply-templates select="p"/> inside a section template, any non-p children of a section will be ignored even if you have templates for them. In this situation, anything except p is said to be "shadowed" under a section.
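The shadowing effect is easy to reproduce outside a real project. The following is a minimal sketch, assuming a Java runtime with the standard JAXP API (whose built-in processor handles XSLT 1.0); the class name ShadowDemo, the note element, and the sample document are hypothetical and serve only as illustration. It runs the same source through two stylesheets that differ only in the xsl:apply-templates instruction of the section template.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical demo class; the element names (section, p, note) are illustrative only.
class ShadowDemo {

    static final String SOURCE =
        "<section><p>First paragraph.</p><note>Do not lose me.</note></section>";

    // Builds a tiny XSLT 1.0 stylesheet whose section template contains
    // the given xsl:apply-templates instruction.
    static String stylesheet(String applyTemplates) {
        return "<xsl:stylesheet version='1.0'"
             + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
             + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
             + "<xsl:template match='section'><div>"
             + applyTemplates
             + "</div></xsl:template>"
             + "<xsl:template match='p'><p><xsl:apply-templates/></p></xsl:template>"
             + "<xsl:template match='note'>"
             + "<blockquote><xsl:apply-templates/></blockquote></xsl:template>"
             + "</xsl:stylesheet>";
    }

    // Transforms SOURCE with the stylesheet variant and returns the serialized result.
    static String transform(String applyTemplates) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer(
                new StreamSource(new StringReader(stylesheet(applyTemplates))));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(SOURCE)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Catch-all: the note element reaches its own template.
        System.out.println(transform("<xsl:apply-templates/>"));
        // Restricted select: the note element is shadowed and silently dropped.
        System.out.println(transform("<xsl:apply-templates select='p'/>"));
    }
}
```

In the second run the note template is still present in the stylesheet, yet it never fires, because the section template no longer hands the note element over for processing.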

5.4.1 Processing links

The interesting thing about link templates is unabbreviating link addresses (3.5.3, page 112): it is much more convenient to write links using an abbreviated notation in the source, but we need to construct full URIs for the resulting HTML links.

Our shared XSLT library (5.1.1) contains a few simple unabbreviation functions. In a real stylesheet, you will likely need more such functions, one for each link type (3.5.2, page 109). The link templates calling these functions don't need to be complex, as shown in Example 5.8.
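To make the idea of unabbreviation concrete, here is a rough Java sketch of what such functions might compute. It is not the shared library's code: the class name LinkExpander, the method names, and every URL layout below are assumptions chosen for illustration, loosely mirroring the eg:page-link(), eg:ext-link(), and eg:rfc-link() calls used in the link templates.

```java
// Hypothetical counterparts of the eg:*-link() functions.
// All URL layouts here are assumptions, not the book's actual conventions.
final class LinkExpander {

    private LinkExpander() {}

    // Abbreviated RFC reference ("2396") to a full address (assumed layout).
    static String rfcLink(String number) {
        return "https://www.rfc-editor.org/rfc/rfc" + number.trim();
    }

    // External link: add a scheme only when the author left it out.
    static String extLink(String address) {
        String a = address.trim();
        return a.contains("://") ? a : "http://" + a;
    }

    // Internal page identifier plus language code to a site-relative path
    // (assumed layout: /<lang>/<page>.html).
    static String pageLink(String pageId, String lang) {
        return "/" + lang + "/" + pageId.trim() + ".html";
    }

    public static void main(String[] args) {
        System.out.println(rfcLink("2396"));
        System.out.println(extLink("www.example.org/news"));
        System.out.println(pageLink("team", "en"));
    }
}
```

Each function does one small, predictable string operation, which is why the templates that call them can stay as short as the ones in Example 5.8.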

In the stylesheet, we don't need to check for the existence of the link attribute or for the validity of the resulting link; these checks have already been made by the Schematron schema run before the transformation (5.1.3.2, 5.1.3.3). However, if you do not use Schematron, you may want to add certain link checks and corresponding diagnostics to the stylesheet templates.

5.4.2 Text processing

Most branch templates transform low-level text markup. Is there anything interesting at still lower levels of the source hierarchy?

Now that we've descended to the level of character data, you may think that our work is over. This is not quite true. Even for plain text, HTML presentation may differ from that of the source XML, and a conversion should therefore be taken care of by the stylesheet.

Example 5.8. Link templates for various link types.

 <xsl:template match="link[@type='internal'] | int">
   <a href="{eg:page-link(@link, $lang)}"><xsl:apply-templates/></a>
 </xsl:template>

 <xsl:template match="link[@type='external'] | ext">
   <a href="{eg:ext-link(@link)}"><xsl:apply-templates/></a>
 </xsl:template>

 <xsl:template match="link[@type='rfc']">
   <a href="{eg:rfc-link(@link)}"
      target="_new"><xsl:apply-templates/></a>
 </xsl:template>

A common source of problems is the presentation of special characters, i.e., those outside of the ASCII range. ^[10] Common examples in English texts are the em-dash (—), single and double curly quotes (‘ ’, “ ”), and the apostrophe (same as the closing single curly quote). The problem is that there are many ways in which these characters may be encoded, and the way they are represented in your source XML is not necessarily the best for the HTML output.

^[10] In English; other languages may have different notions of what characters are special and what are not. The ASCII range, though, is considered unspecial pretty much all over the world.

5.4.2.1 Charset conversions

You probably don't need to worry about special characters if all you require is a conversion from one standard character encoding into another. On the Web, it is advisable to represent characters outside of the ASCII range by either mnemonic or numeric character references (2.2.4.3) to protect them against miscommunication of the page's charset that may happen between the server and the client browser. ^[11] Character references refer to Unicode code points and are therefore immune to any reencodings of the source document. These character references should work in all modern browsers, provided the browser can, in principle, display the corresponding character (i.e., has an appropriate font).

^[11] Unless your language uses mostly non-ASCII characters, in which case following this advice will result in too much overhead.

Fortunately, you don't have to do anything special to obtain proper character references in the output. All you have to do is this:

Make sure all the source documents correctly indicate their encoding in the XML declaration. The most commonly used encoding is ISO 8859-1, for which the declaration should read
```
<?xml version="1.0" encoding="iso-8859-1"?>
```

Make sure you specify ASCII as the output encoding in your stylesheet, for example:

 <xsl:output method="html"
             encoding="US-ASCII"
             doctype-public="-//W3C//DTD HTML 4.0 Transitional//EN"/>

If both these requirements are met (and your XSLT processor is standards-compliant), any non-ASCII characters in the input will be converted to numeric character references in the output. For example,

She said, “Mr Filkë is — and always was — my respected teacher.”

becomes ^[12]

^[12] Depending on setup, your XSLT processor may output equivalent hexadecimal numeric character references instead of decimal.

 She said, &#8220;Mr Filk&#235; is &#8212; and always was &#8212; my respected teacher.&#8221;

in HTML, which, in turn, will again render the nice curly quotes, accented characters, and em-dashes in the browser window.
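What the serializer does at this step can be imitated in a few lines of Java. This is only a sketch of the encoding step, assuming decimal references; the class and method names are hypothetical, and a real serializer additionally escapes markup characters such as < and &, which this code deliberately leaves alone.

```java
// Hypothetical helper that mimics the US-ASCII output step of a serializer:
// every code point outside the ASCII range becomes a decimal character reference.
final class AsciiEscaper {

    private AsciiEscaper() {}

    static String toCharacterReferences(String text) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            if (cp < 128) {
                out.append((char) cp);                  // plain ASCII passes through
            } else {
                out.append("&#").append(cp).append(';'); // e.g. U+00EB becomes &#235;
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        String source = "She said, \u201cMr Filk\u00eb is \u2014 and always was "
                      + "\u2014 my respected teacher.\u201d";
        System.out.println(toCharacterReferences(source));
    }
}
```

Iterating over code points rather than char values keeps characters outside the Basic Multilingual Plane intact as a single reference.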

5.4.2.2 Search and replace

It's not always that easy, though; not all content authors use correct ISO 8859-1 characters to start with. This very much depends on what kind of tool they use for XML authoring (6.1), but chances are that the XML you get from the authors will have no fancy characters at all, only plain ASCII quotes (", ') instead of curly quotes and hyphens (-) instead of dashes. Rather than bug the author, you might want to replace these ASCII approximations with proper character references automatically.

The brute force approach. Perhaps the first idea to come to your mind will be writing a simple AWK or Perl script to handle these search-and-replace tasks. However, you'll quickly realize that you don't want to replace all of your quotes and hyphens because a lot of them are part of markup (e.g., quotes around attribute values) and not character data. Moreover, even some parts of character data (such as examples of programming code or XML markup) must be protected from any replacements. To reliably distinguish between those parts of the input stream that are to be processed and those that are not, you basically have to implement a complete XML parser, which makes the entire idea look hardly feasible.

The XSLT 1.0 approach. It's clear, therefore, that the character replacement job can only be handled by an XPath-enabled language. And while we're writing an XSLT stylesheet, why not assign it this task, along with all the other tasks it has to perform?

Unfortunately, XSLT 1.0 is badly suited for this kind of job. Matching regexps and replacing parts of a text string cannot be done except via extensions; in pure XSLT, you can only use recursion to parse the string into a sequence of character tokens to be processed in a for-each loop. Which is possible but, believe me, way too awkward and agonizingly slow.

The XSLT 2.0 approach. XSLT 2.0 and XPath 2.0 are much better equipped for text processing, since the new XPath provides functions for regexp matching and replacing. For example, the eg:letters-only() function from Example 5.5 (page 215) uses these new tools to build a filename from a text string by lowercasing it and removing spaces and punctuation. Thus, it will transform the heading

 What, Where, and When?

into

 whatwhereandwhen
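The same lowercase-and-strip operation is shown here as a Java sketch, for readers who want to see the regexp at work without the stylesheet around it. The class and method names are hypothetical, and the pattern (whitespace plus ASCII punctuation) is an assumption about what "spaces and punctuation" covers; the actual function in Example 5.5 may use a different expression.

```java
import java.util.Locale;

// Hypothetical Java analogue of the filename-building step described above.
final class FileNames {

    private FileNames() {}

    static String lettersOnly(String heading) {
        // Lowercase first, then drop whitespace and ASCII punctuation.
        // The pattern is an assumption; digits and non-ASCII letters are kept.
        return heading.toLowerCase(Locale.ROOT).replaceAll("[\\s\\p{Punct}]+", "");
    }

    public static void main(String[] args) {
        System.out.println(lettersOnly("What, Where, and When?"));
    }
}
```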

The combined approach. For more complex processing, we have to write an extension function and link it to our stylesheet. Let's see how this could be done in Java with Saxon. To process all regular text of a web page, we can write in the stylesheet

 <xsl:template match="p//text() | head//text()">
   <xsl:value-of select="text:typography(.)"/>
 </xsl:template>

This template matches all text() nodes under p and head, passes each node to the typography() method, and outputs the returned string value. The text namespace prefix, as always, points to the class containing this method. The Java source for that class is shown in Example 5.9.

Alternating quotes. The interesting bit is that ASCII has only one symbol for the double quote (") while proper typography requires distinct opening and closing quote characters (“ ”). To work around this, our class stores an internal flag variable, quoteFlag, set alternately to 0 or 1 on each quote replacement. If quoteFlag == 0, we replace the next ASCII quote with an opening quote; otherwise, with a closing quote.

It is worth noting that the quoteFlag variable belongs to the class, not to the method itself, and is therefore persistent between method calls. As a result, this simple mechanism gives correct results even when a single text unit is broken into several text nodes. For instance, if your text contains a fragment within quotes enclosed in an inline element, such as

 She said, "<name>Mr Filk&#235;</name> is - and always was - my respected teacher."

this will be correctly translated into what renders as

She said, “Mr Filkë is — and always was — my respected teacher.”

Example 5.9. The `text` class provides the `typography()` method that replaces some ASCII characters with their improved typographic lookalikes.

 package com.projectname.xslt;

 public class text {

   // Shared by all calls, so quote alternation carries over between text nodes.
   static int quoteFlag = 0;

   public static String typography(String s) {
     // replace space followed by hyphen
     // by no-break space followed by em-dash
     int i = s.indexOf(" -");
     while (i != -1) {
       s = s.substring(0, i)
           + "\u00a0" + "\u2014"
           + s.substring(i + 2);
       i = s.indexOf(" -", i + 2);
     }

     // use right single curly quote instead of the ASCII apostrophe
     s = s.replace('\'', '\u2019');

     // replace " by alternating left and right double curly quotes
     while ((i = s.indexOf("\"")) != -1) {
       if (quoteFlag == 0) {
         s = s.substring(0, i) + "\u201c" + s.substring(i + 1);
         quoteFlag = 1;
       } else {
         s = s.substring(0, i) + "\u201d" + s.substring(i + 1);
         quoteFlag = 0;
       }
     }
     return s;
   }
 }

(with, if necessary, additional formatting for the name) even though this sentence is broken into three text nodes and therefore triggers three calls to text:typography().

Note, however, that this approach is risky and can only be used after testing with a specific processor. This is because XSLT does not require that the document order be preserved when the text() nodes (or any other nodes) are matched against the corresponding templates, and an XSLT processor may therefore freely reorder the text:typography() method calls.
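The order dependency is easy to see in isolation. The sketch below is a hypothetical variant, not the class from Example 5.9: it keeps the flag in an object instead of a static field, so that a test (or a new transformation) can start from a clean state, and it handles only the double-quote replacement. How such an object would be wired into a particular XSLT processor is left open here.

```java
// Hypothetical variant of the quote-alternation logic with per-instance state.
final class QuoteAlternator {

    private boolean open = false;   // true while an opening quote awaits its partner

    // Replaces each ASCII double quote with an opening or closing curly quote,
    // continuing from wherever the previous call left off.
    String apply(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '"') {
                out.append(open ? '\u201d' : '\u201c');
                open = !open;
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    // Forget any unbalanced quote, e.g. before processing a new document.
    void reset() {
        open = false;
    }

    public static void main(String[] args) {
        // The three text nodes of the sample sentence, in document order.
        String[] nodes = {
            "She said, \"", "Mr Filk\u00eb", " is - and always was - my respected teacher.\""
        };
        QuoteAlternator inOrder = new QuoteAlternator();
        StringBuilder result = new StringBuilder();
        for (String node : nodes) {
            result.append(inOrder.apply(node));
        }
        System.out.println(result);

        // Calling in a different order flips the quotes: the last node now gets
        // an opening quote, which is exactly the risk described above.
        QuoteAlternator reordered = new QuoteAlternator();
        System.out.println(reordered.apply(nodes[2]));
    }
}
```

A static flag has one more weakness that the instance version avoids: an unbalanced quote in one document leaves the flag set for the next document processed in the same Java virtual machine.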

Punctuation as style, not content. If some element type, such as quote, always requires quotation marks, you should program your stylesheet to insert them automatically for quote elements and free the author of the burden to supply both markup and punctuation that duplicate each other.

5.4.2.3 Text preparation guidelines

If you decide that you do need search-and-replace text processing similar to what we've just discussed, the text class in Example 5.9 is only a starting point. Your authors may have their own idiosyncrasies regarding ASCII punctuation. For example, some prefer to use double hyphens (--) to represent em-dashes; others may use ASCII backquotes (`) and straight quotes (') for opening and closing quotation marks.

Typographic conventions in output may also vary considerably. For instance, you may or may not have spaces around em-dashes; besides em-dashes, you may need en-dashes (between digits) and longish quotation dashes; the different approaches to the use of single and double quotes, as well as adjacent punctuation characters, are a topic unto themselves. Finally, other languages may impose their own typographic rules, which you must respect even if all you need is a short foreign-language citation.
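A few of these conventions can be handled with one-line replacements. The following Java sketch covers three of the cases mentioned above; the class name and the exact rules are assumptions made for illustration, and each rule should be checked against your own style guide before use. In particular, the digit rule would also convert hyphens in dates or phone numbers, so it must only be applied to running text.

```java
// Hypothetical normalizer for a few author conventions; the rules are illustrative.
final class PunctuationNormalizer {

    private PunctuationNormalizer() {}

    static String normalize(String s) {
        // Double hyphen used as an em-dash substitute.
        s = s.replace("--", "\u2014");
        // Single hyphen between two digits becomes an en-dash (number ranges).
        s = s.replaceAll("(?<=\\d)-(?=\\d)", "\u2013");
        // Backquote ... straight quote pair becomes curly single quotes.
        s = s.replaceAll("`([^`']*)'", "\u2018$1\u2019");
        return s;
    }

    public static void main(String[] args) {
        System.out.println(normalize("pages 10-12 -- see `here'"));
    }
}
```

The double-hyphen rule runs first so that the digit rule never sees half of a double hyphen.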

Standardize. You should make painfully clear to anyone involved in writing or editing web site content what is the accepted standard source representation for any nontrivial characters. Try to keep your guidelines simple and logical, but always be more flexible in your stylesheet code than you are in the guidelines (i.e., try to accommodate as much nonstandard input as possible, so long as it is unambiguous).

If your site is going to contain any significant amount of text and/or be massively updated, take time to develop and publicize your very own Web Site Style Guide (preferably in collaboration with the site's editor and graphic designer). Find a good typographical reference whose recommendations you trust and like. Learn from existing web sites whose typography is above average. Unicode charts ^[13] will not only help you find the codes for the characters you want but will also provide hints on their usage. Don't expect your guide to be ready at the web site launch; the best style guides grow from everyday practice. Last but not least, always test your web typography on all major platforms and browsers.

^[13] www.unicode.org/charts/

Reuse. If you are not new to web design, you have probably accumulated a library of text-processing scripts that you often use for preparing web pages. When migrating to XSLT, you don't have to abandon those scripts just because they may be difficult to reimplement in a functional language. You can still use them as extensions, getting the best of both worlds: the power of XPath in the stylesheet and the efficiency of traditional text-processing algorithms in extensions.

Do not abuse this possibility, however. Try to use XML markup for any semantic aspects of your source, and only resort to extensions when your algorithms are too complex for XSLT or when you don't want to place unreasonable markup requirements on your authors.

5.4.2.4 Adding structure

One common text-processing task is adding markup where no markup exists in the source, that is, marking up fragments of a source document's text based on some patterns or regular expressions. With XSLT 2.0, this task is achievable even without extensions.

For example, suppose we want to uppercase the first two words of every paragraph. The eg:upcase2() function in Example 5.10 achieves this by breaking its argument into a sequence of words and then reassembling it, adding a span with an appropriate CSS property around the first two words.

Example 5.10. Uppercasing the first two words of each `p` (more precisely, of the first text node within each `p` ).

 <xsl:template match="p/text()[1]">
   <xsl:copy-of select="eg:upcase2(.)"/>
 </xsl:template>

 <xsl:function name="eg:upcase2" as="item()*">
   <xsl:param name="str"/>
   <xsl:variable name="seq" select="tokenize($str, '\s+')"/>
   <!-- Positional predicates stand in for item-at(), a draft-era function
        that is not part of the final XPath 2.0 function library. -->
   <span style="text-transform: uppercase;">
     <xsl:for-each select="$seq[position() le 2]">
       <xsl:value-of select="."/>
       <xsl:text> </xsl:text>
     </xsl:for-each>
   </span>
   <xsl:for-each select="$seq[position() gt 2]">
     <xsl:value-of select="."/>
     <xsl:text> </xsl:text>
   </xsl:for-each>
 </xsl:function>

Note that we apply this function to p/text()[1] and not just p/text() because a p may have several child text nodes, and we only want to process the first one. For the same reason, this trick won't work if there is any source markup around the first and/or second word of a p.

