Optimization Techniques

In the three years since XSLT became a W3C Recommendation, a considerable number of implementations have appeared and have had some time to mature. Competitive comparisons of their performance have occasionally been published, which, although they suffer the same problems as all such benchmarks, have certainly spurred vendors to improve their ratings. Implementers will also have had feedback from their own users highlighting performance problems. All the products have improved significantly since their first appearance.

It is reasonable to expect, therefore, that the performance of XSLT processing is by now a fairly widely understood subject, and that the lessons learned by XSLT suppliers will be available for XQuery vendors to study, giving them a head start. Unfortunately, though, very little has been published on XSLT implementation techniques. Several of the products are open-source, so one can study the techniques they use by reading the source code, but that's a far cry from having published articles written by their designers. Apart from anything else, the source code doesn't tell you what techniques were tried but failed, or what ideas are still present in the product even though they actually made little impact.

The notes here come mainly from my experiences with Saxon. [6] The ideas in Saxon were influenced by ideas used in the xt and Xalan products. [7] One of the interesting features of the XSLT world is that there are multiple open-source products all competing with each other, but without the benefit of being able to hide their secrets. Of course there are other products, notably Microsoft's MSXML, [8] that have not revealed the secrets of their design, so it's quite possible that the account given here misses some important ideas.

[6] Saxon: see http://saxon.sf.net/.

[7] On xt, see http://www.blnz.com/xt/index.html. On Xalan, see http://xml.apache.org/.

[8] MSXML: see http://www.microsoft.com/xml.

The reason for writing this section is that I think there are likely to be many similarities between the challenges of XSLT optimization and XQuery optimization. There are also many differences, which I return to at the end of the section; but I think the similarities are important, and many of the experiences with XSLT will also prove relevant to XQuery.

Where Does the Time Go?

It's useful to start the discussion with a breakdown of how an XSLT processor spends its time. Of course, this depends on the particular problem under discussion. The charts in Figures 3.1 and 3.2 show the breakdown for two particular transformation tasks. One is the job of rendering the XML version of the XSLT specification itself as HTML; the second one uses the same stylesheet but applies it to a much smaller document. The figures were obtained using Saxon 6.5.2.

Figure 3.1. Time Breakdown for Rendering a Large Document

graphics/03fig01.gif

Figure 3.2. Time Breakdown for Rendering a Small Document

graphics/03fig02.gif

Figure 3.1 shows the breakdown of time spent on the four key activities in applying a stylesheet to a large document: compiling the stylesheet, building the source document tree, performing the actual transformation, and serializing the result tree as HTML. In this case the stylesheet is around 1,350 lines long, divided into three modules; the source document is 600KB, and the result document 860KB. The total transformation time on my machine was about 8.3 seconds. In Saxon, there are three phases executed serially: compiling, tree-building, and transformation/serialization. The division of time between transformation and serialization is calculated by comparing the effect of not doing the HTML serialization (just feeding the result to a user-written content handler that does nothing with it).

This kind of transformation is generally done as a one-off. There are few benefits to be obtained by compiling the stylesheet in advance and amortizing the compilation cost over multiple transformations of different source documents, since the saving this would achieve is 15 percent at best. The total elapsed time of 8.3 seconds also includes a heavy overhead for "warming up" the Java VM: loading and initializing all the required classes, and getting to the point where the just-in-time Java compilation has reached steady state. The actual stylesheet is performing four or five passes over the source document (one to render the main text, additional passes to build the table of contents, the glossary, and various appendices), and is also doing a considerable amount of keyed access to construct extensive cross-references in the form of hyperlinks. It is striking, therefore, that the cost of parsing the source document and building it into a tree structure is almost as great as the transformation cost proper.

For contrast, Figure 3.2 shows a transformation that is more typical of the "on-demand" XML-to-HTML transformation performed on a web site where the data is stored in XML and rendered each time the content is requested by a user. The stylesheet used is the same; the only difference is that the source document is much smaller: only 8KB this time. This reduces the total transformation time to 1.6 seconds, of which 79 percent is the cost of compiling the stylesheet. The obvious message from this is the vital importance of compiling stylesheets once and reusing them in this scenario.

Another interesting point is that although the stylesheet is the same as in the example of Figure 3.1, the ratio of transformation time to parsing and tree-building is much higher in this case. A possible explanation for this is that the tree-building phase is responsible for acquiring all the memory used for the transformation, and the cost of allocating the required memory increases more than linearly with the size of the source document, because of paging or memory fragmentation. It might also be the case that the transformation phase includes a one-time startup cost that is independent of document size.

What messages are there here for XQuery? Many XQuery systems will work on databases where the data has been preloaded into a database structure. The database structure performs the same role as the in-memory tree used by XSLT processors, but it is only built once, when a document is loaded or replaced in the database; it is not done on each query. This doesn't mean that the time taken to build the tree can be ignored; on the contrary, data loading and updating time has always been a critical factor for databases supporting efficient hierarchic access paths, and this is easy to forget after twenty years in which the relational database with its flat storage structures has dominated the scene. XQuery systems need to make exactly the same trade-off as XSLT processors between the time taken to build and index the tree, and the speed of the access paths offered by the tree once it has been built. In many ways they have tougher decisions to make, because whereas an XSLT processor can build the tree with knowledge of the transformation that will take place, the trees built by an XQuery processor will have to support many different kinds of queries.

The observations about compilation time are also relevant to a query processor. Although 1,000-line queries will probably be less common than 1,000-line stylesheets, complex queries will occur, and the complexities of compiling them will probably be proportionately greater than with XSLT stylesheets because of the crucial importance of finding the optimum access path to the data when dealing with gigabytes of source rather than mere kilobytes. But as with XSLT, there will be one-off queries where nothing can be gained by reusing the compiled query, and there will be transactional queries where this is essential.

Having put the actual execution time of a query or transformation into some kind of perspective, we turn now to the various optimizing techniques that can be used.

Internal Coding Efficiency

Books on relational databases don't go out of their way to tell you something that every implementer of a database system knows: The most important way of getting good performance out of the system is simply to write fast code. Most queries in a database system, and most XSLT transformations, are quite simple, and the smartest optimizer in the world can do very little to improve the performance of a simple query. If xt processes the same stylesheet five times faster than Xalan (and it sometimes does), this isn't because it has a better optimizer; it's mainly because James Clark writes very slick code.

Associated with this is the design of efficient data structures. The design of the source tree structure is critical to XSLT performance. The trade-off between time and space is important (people want to transform documents as large as 100MB), as is the trade-off between the time taken to build the tree and the time taken to navigate it. A simplistic tree representation, using one Java object for each node in the tree, would be hopelessly inadequate on both counts. Saxon uses an adaptive approach in which many of the access paths (for example, pointers from nodes to their preceding siblings) are built only if and when these access paths are actually used.

But this book isn't about how to write good code or design efficient data structures in Java or any other programming language. Having made this point, therefore, I'll move on to other factors more specific to XSLT and XQuery processing.

Pipelining and Lazy Evaluation

The world of functional programming languages such as Scheme and Haskell and the world of database management systems often seem to have little in common. But the theory in both areas has one thing in common: the idea of pipelining. The terminology may be different, but the essential idea is the same.

In the relational model, an SQL query can be translated into an expression in the relational algebra using operations such as restriction, projection, union, sort, and join. These are set-at-a-time operations: each operation takes sets of tuples as its input and produces sets (or sometimes, sequences) of tuples as its output. It would be hideously expensive if all these sets of tuples were actually materialized in memory. Allocating memory dynamically is an expensive operation, and memory is a finite resource: use more of it for intermediate results, and less is available for other purposes such as caching. The same factors apply to functional programming languages based on list manipulation, and equally to XPath, whose semantics are defined in terms of sets of nodes. Materializing these sets is expensive.

The technique used to avoid allocating memory to these intermediate results is called pipelining. The way pipelining works is that an operation such as restriction is implemented to deliver one object at a time, on request from another operation, and it in turn requests one object at a time from its subordinate operators. The tree of operators that makes up an SQL or XPath expression is thus represented at run-time by a sequence of so-called iterators, each of which supports a get-next operation to deliver the next object in a stream. In a relational system, the objects are tuples, and the pipeline is described in terms of the flow of a stream of tuples delivered by one node in the expression tree to its parent node.
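To make the idea concrete, here is a minimal sketch in Java of a pipelined restriction operator. The SequenceIterator interface and FilterIterator class are invented for illustration; real processors such as Saxon define richer interfaces than this.

    import java.util.function.Predicate;

    // Each operator in the expression tree is represented at run-time by an
    // iterator that delivers one item at a time, on request.
    interface SequenceIterator<T> {
        T next();   // returns the next item in the stream, or null when exhausted
    }

    // Restriction (filtering) as a pipeline stage: it pulls items one at a
    // time from its input and never materializes the full result in memory.
    class FilterIterator<T> implements SequenceIterator<T> {
        private final SequenceIterator<T> input;
        private final Predicate<T> predicate;

        FilterIterator(SequenceIterator<T> input, Predicate<T> predicate) {
            this.input = input;
            this.predicate = predicate;
        }

        public T next() {
            T item;
            while ((item = input.next()) != null) {
                if (predicate.test(item)) {
                    return item;    // deliver one matching item per request
                }
            }
            return null;            // end of the stream
        }
    }

An entire expression tree then executes as a chain of such iterators, and the memory consumed is independent of the size of the intermediate results.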

Not all XPath operations can be pipelined. An obvious example is the last() function: to find out the value of the last() function in an expression such as $x[last() idiv 2] (which returns the item in the middle of a sequence), you need to know how many objects the current operation is going to return, and the only way of finding out is to read them all. This breaks the pipeline, and is therefore a costly operation. There are actually two strategies for implementing the last() function. One is to evaluate the containing expression once (in this case, $x) and save the results in memory. The other is to evaluate the containing expression twice, once to count the nodes and once to feed them to the next operator in the pipeline. Sometimes one technique is better, sometimes the other: this is the kind of decision where a good optimizer can make a difference.
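The two strategies might look like this in Java (a sketch only, reusing the SequenceIterator interface from the previous fragment; Saxon's actual implementation differs):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.Supplier;

    class LastStrategies {
        // Strategy 1: materialize the sequence once; last() is then just
        // size(), at the cost of holding the whole sequence in memory.
        static <T> List<T> materialize(SequenceIterator<T> input) {
            List<T> items = new ArrayList<>();
            T item;
            while ((item = input.next()) != null) {
                items.add(item);
            }
            return items;
        }

        // Strategy 2: evaluate the expression twice; the first pass only
        // counts. The caller re-evaluates the expression to stream the items.
        static <T> int countByReevaluation(Supplier<SequenceIterator<T>> expression) {
            SequenceIterator<T> firstPass = expression.get();
            int count = 0;
            while (firstPass.next() != null) {
                count++;
            }
            return count;
        }
    }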

The other common operation that breaks a pipeline is sorting. One technique often used by optimizers is to rewrite expressions so that sorting is done last. This can often reduce the number of objects to be sorted, and it can eliminate the need for multiple sorts. With XPath expressions of the form a/b/c , for example, the results are guaranteed by the semantics of the language to be in document order. But there is no need to sort the intermediate result a/b (or b/c , depending on the evaluation strategy). The entire path expression can be evaluated in a single pipeline and sorted at the last possible moment.

A path expression may often be evaluated without performing a sort at all. XSLT processors such as Saxon have at least three techniques available to avoid sorting:

  • Performing static analysis of the path expression to determine that the results are "naturally sorted": that is, that when evaluated using the "obvious" evaluation strategy, the results will automatically be in document order. The rules for this are quite subtle; it's not immediately obvious, for example, that while a//b and a/b/c are both naturally sorted, a//b/c is not.

  • Evaluating reverse axes in forward order to make the expression naturally sorted where it would otherwise not be.

  • Detecting the context in which a path expression is used, in order to recognize that although the language semantics require the results to be sorted, in some contexts it makes no difference to the result. The obvious examples are where the path expression is used as an argument to functions such as count() , sum() , or boolean() . In XQuery this also arises when a path expression is used within a FLWOR expression that has an ORDER BY clause to specify the order of the results. However, there is a further subtlety here: some functions such as count() require duplicate nodes to be eliminated from the result, whereas others such as boolean() do not.

Although eliminating sorts is useful in its own right, because sorting is expensive, the chief benefit of eliminating sorts is that it improves pipelining.

Closely associated with pipelining is another technique well known in the functional programming literature: lazy evaluation. The principle of lazy evaluation is to avoid doing work until the results are actually needed. There are two benefits. First, you don't need any memory to hold the results. Second, you might discover that the results aren't actually needed anyway.

Listing 3.3 shows a simple example of lazy evaluation at work in XSLT.

Listing 3.3 Example of Lazy Evaluation in XSLT
 <xslt:param name="account-nr"/>

 <xslt:template match="/">
   <xslt:variable name="transactions"
                  select="/*/transaction[account=$account-nr]"/>
   <xslt:choose>
     <xslt:when test="starts-with($account-nr, 'Z')">
       <xslt:value-of select="/*/closing-balance"/>
     </xslt:when>
     <xslt:otherwise>
       <xslt:value-of select="/*/opening-balance +
                              sum($transactions/value)"/>
     </xslt:otherwise>
   </xslt:choose>
 </xslt:template>

The xslt:variable instruction is potentially expensive to evaluate: It involves selecting all the transactions for a given account, which probably means scanning the whole source document. But now look at how the result is used. In one branch of the conditional xslt:choose, the variable isn't used at all. In the other branch, it is only used as the argument to the sum() function. This means that by delaying the evaluation of the variable until it is actually used, we might avoid evaluating it at all; and even if we do need it, we can incorporate it in a pipeline so that the list of transaction elements never needs to be held in memory, and we can certainly avoid sorting the results.

The tricky part of lazy evaluation is that the value of an XPath expression depends on the context in which it appears (the current node, the values of other variables, the namespaces that are in scope), so as well as saving the expression, the relevant parts of the context also need to be saved. This means that an important role of an XPath optimizer is to determine the dependencies of an expression: the parts of the context on which it depends, which need to be saved when doing lazy evaluation.
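A sketch of what a lazily evaluated variable might look like in Java (hypothetical classes; the real machinery in Saxon is more elaborate). The compiled expression is saved together with the part of the context it depends on, and evaluated at most once, on first use:

    import java.util.function.Function;

    // A lazily evaluated variable binding. The expression is not evaluated
    // until value() is first called; if value() is never called (as in the
    // 'Z' branch of Listing 3.3), the evaluation cost is avoided entirely.
    class LazyValue<T> {
        private final Function<Object, T> expression;  // the compiled expression
        private final Object savedContextNode;         // saved context dependency
        private T result;
        private boolean evaluated = false;

        LazyValue(Function<Object, T> expression, Object contextNode) {
            this.expression = expression;
            this.savedContextNode = contextNode;       // capture context at bind time
        }

        T value() {
            if (!evaluated) {
                result = expression.apply(savedContextNode);  // evaluate at most once
                evaluated = true;
            }
            return result;
        }
    }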

Pipelining and lazy evaluation are likely to be just as important to an XQuery processor as to an XSLT/XPath processor.

Expression Rewriting

In relational database theory, the term optimization is used almost synonymously with query rewriting. The job of an optimizer is to take a tree-representation of a query, as produced by the query language parser from the original query text as written, and rearrange it into a different (but equivalent) tree that can be evaluated more efficiently. Many standard techniques are used, such as pushing down restrictions closer to the leaves of the tree. The selection of different operators to implement the relational join functionality has been the staple diet of the research journals for nearly thirty years.

Expression rewriting is also a powerful technique used in an XPath optimizer. There are two variants of the approach. The first rewrites the expression in terms of another valid XPath expression. For example, the expression

 a/b/c | a/b/d 

can be rewritten as

 a/b/(c | d) 

(which is a valid XPath 2.0 expression, though not valid in XPath 1.0).

Another example of such a rewrite is count($x) > 10, which can be rewritten as exists($x[11]). The latter expression is likely to be more efficient, because (with the benefit of pipelining and lazy evaluation) items beyond the eleventh are likely never to be read. Rewrites that move subexpressions out of a loop can also be regarded as falling into this category. For example, the expression

 items[value > $customer/credit-limit] 

can be rewritten as

 let $x := $customer/credit-limit return items[value > $x] 

Again this relies on dependency analysis, that is, knowing that the subexpression $customer/credit-limit has no dependency on the context item or context position, which would make the value vary for different items.
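One way an optimizer might drive this rewrite is to have every compiled expression report its context dependencies as a bitmask, computed bottom-up at compile-time; a subexpression inside a predicate that depends on neither the context item nor the context position can safely be hoisted out of the loop. A Java sketch, with an invented representation rather than any product's actual one:

    // Flags describing which parts of the dynamic context an expression uses.
    class Dependencies {
        static final int CONTEXT_ITEM     = 1;
        static final int CONTEXT_POSITION = 2;
    }

    abstract class Expr {
        // Bitmask of Dependencies.* flags, computed bottom-up at compile-time.
        abstract int dependencies();

        // A predicate subexpression with no dependency on the context item or
        // position yields the same value for every item, so the optimizer can
        // rewrite   items[value > $customer/credit-limit]   as
        //   let $x := $customer/credit-limit return items[value > $x]
        boolean isLoopInvariant() {
            return (dependencies()
                    & (Dependencies.CONTEXT_ITEM | Dependencies.CONTEXT_POSITION)) == 0;
        }
    }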

The second variant of the technique rewrites expressions in terms of internal operators that are not available in the surface language. One of the most powerful examples is the expression

 $x[position() != last()] 

which can be rewritten as

 $x[hasNext()] 

where hasNext() is an internal function that tests whether there are any more items in the pipeline. The beauty of this rewrite is that it avoids breaking the pipeline. In the expression as written, when the first item is read, its position (1) needs to be compared with the number of nodes in the pipeline (perhaps thousands), which means that these nodes have to be counted. In the rewritten expression, each node is simply tested to see whether it is the last, which only requires a single-item lookahead.
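A single-item lookahead iterator is easy to build on top of the pipelined iterator model sketched earlier (again, illustrative names rather than any product's actual code):

    // Wraps an iterator and reads one item ahead, so that at any point we
    // know whether the item just delivered was the last, without counting.
    class LookaheadIterator<T> implements SequenceIterator<T> {
        private final SequenceIterator<T> input;
        private T lookahead;

        LookaheadIterator(SequenceIterator<T> input) {
            this.input = input;
            this.lookahead = input.next();   // prime the one-item lookahead
        }

        // After next() has delivered an item, hasNext() reports whether that
        // item was the last: this is all that $x[hasNext()] needs to test.
        public boolean hasNext() {
            return lookahead != null;
        }

        public T next() {
            T current = lookahead;
            if (current != null) {
                lookahead = input.next();
            }
            return current;
        }
    }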

In the relational database tradition, the most significant rewrites (those with the highest payoff) are those concerned with optimizing joins that occur in the SQL SELECT expression. It is quite likely, therefore, that XQuery vendors will put a lot of emphasis on optimizing the joins that arise in FLWOR expressions, which are the XQuery equivalent of SQL SELECT expressions. There is a risk that this effort could be misplaced, depending on the way users decide to write their queries. In the relational model, all relationships are modeled using primary keys and foreign keys, and equijoins between primary keys and foreign keys are ubiquitous in queries. In the hierarchic world of XML, many relationships are modeled instead by means of the containment relationship: Instead of order-lines holding an order number to indicate the purchase order that they relate to, the order-lines are represented as subelements within an element representing the purchase order. The join that is therefore inevitable in the SQL query is replaced in the XQuery formulation by a path expression. Instead of

 SELECT customer.name, order.date, order-line.value
 FROM customer, order, order-line
 WHERE customer.id = order.customer-id
   AND order-line.order-no = order.order-no

we are likely to see

 for $c in /customers/customer,
     $o in $c/order,
     $ol in $o/order-line
 return ($c/name, $o/date, $ol/value)

Now, of course, some query systems are likely to use tabular storage underneath the covers, and will implement this query as a value-based join even though there are no explicit join keys in the user query. Processors that use a tree-based representation of the data, however, will evaluate the query in exactly the same way as an XSLT processor would: by means of a depth-first traversal of the tree.

Of course, join optimization has a role to play in XQuery, especially where multiple indexes are available at different levels of a hierarchy, giving a wide choice of access paths to implement the same query. Joins that relate data held in different documents can also be very important. But I suspect that the join operation will be a lot less important than it is in relational databases, and that optimizing the hierarchic access paths (effectively, the XPath expressions) will be at least as significant, and perhaps more so.

As far as I'm aware, XSLT processors have until now done little to optimize joins. I think there are several reasons for this. The language design (with its nested xslt:for-each and xslt:apply-templates instructions) does not make join queries easy to detect. At the same time, the fact that the tree is built anew for each transformation means that there isn't a wide choice of access paths to select from. But the biggest reason, I suspect, is that the stylesheets that XSLT implementers have studied to look for optimization opportunities simply don't do many joins. Their access paths predominantly follow the hierarchic relationships intrinsic to the XML model. XQuery implementers are coming to the game armed with a formidable array of optimization tools that proved useful in the relational world. Only time will tell how effective these tools are in an XML world.

Using Type Information

As we have seen, XSLT 1.0 and XPath 1.0 work without a schema for either the source or the result document. The stylesheet authors need to know what the constraints are on the structure of these documents, but they do not need to tell the system what these constraints are.

This changes in XQuery (and indeed in XSLT 2.0 and XPath 2.0), with the ability to import schemas and make assertions about the types of expressions and function arguments: XQuery is a much more strongly typed language than its predecessors. Or at any rate, it has the potential to be, if users decide to take advantage of these capabilities.

One of the arguments made in favor of stronger typing is that knowledge of types will make much more powerful optimizations possible. There is good theoretical justification for this view. Achieving the results in practice, though, is not easy. Here's a simple case. A very common construct in stylesheets is to use an expression such as //item , whose value is the set of all item elements in the document. Evaluating this expression is expensive in most XSLT processors because it involves a full scan of the source tree. With knowledge of the schema, it becomes possible to limit the search to those branches of the tree where item elements can validly appear.

In XSLT, it's debatable whether this is worth doing. There are some practical problems, because it's not at present possible to know statically that the document to be searched using the //item expression is valid against a particular schema. It's also arguable that other approaches to optimizing this expression may be more powerful. For example, the information about which element types appear on which branches of the tree is not only available from the schema, it can also be collected (with greater precision) during parsing of the source document. In XQuery, this kind of schema-based optimization is much more important, because one of the most important tasks of the query optimizer is to identify access paths that take advantage of prebuilt indexes.
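As an illustration of the pruning idea, the schema (or the statistics gathered during parsing) can be summarized as a table saying, for each element name, whether an item descendant can ever appear below it; the search for //item then skips whole branches. A Java sketch with an invented Node class:

    import java.util.List;
    import java.util.Map;

    // A bare-bones tree node, for illustration only.
    class Node {
        final String name;
        final List<Node> children;
        Node(String name, List<Node> children) {
            this.name = name;
            this.children = children;
        }
    }

    class SchemaGuidedSearch {
        // canContainItem maps an element name to true if the schema (or data
        // gathered during parsing) says an 'item' can appear in its subtree.
        static void findItems(Node node,
                              Map<String, Boolean> canContainItem,
                              List<Node> results) {
            if (node.name.equals("item")) {
                results.add(node);
            }
            for (Node child : node.children) {
                // Skip branches where no 'item' element can validly appear.
                if (canContainItem.getOrDefault(child.name, true)) {
                    findItems(child, canContainItem, results);
                }
            }
        }
    }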

An XSLT or XQuery processor that performs static analysis of the types of expressions can make many decisions at compile-time that would otherwise be made at run-time. Saxon does this even with the loosely typed XPath 1.0 language. For example, it can usually distinguish at compile-time whether the predicate in a filter expression such as $x[FILTER] is a Boolean filter or a numeric subscript. Having schema knowledge available at compile-time will certainly increase the ability to make early decisions about the access path. However, many of these optimizations have only a small payoff. In relation to the overall execution cost of a query, for example, deciding at compile-time to do integer arithmetic rather than floating-point is not a big win.
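The kind of early decision involved can be as simple as a three-way classification of the predicate's static type. A sketch, with invented names and strategy labels standing in for the compiled code an actual processor would generate:

    // Possible static types of a filter predicate, as far as the optimizer cares.
    enum PredicateType { NUMERIC, BOOLEAN, UNKNOWN }

    class FilterCompiler {
        // Choose an evaluation strategy for $x[FILTER] at compile-time when
        // the static type of the predicate is known; defer to run-time otherwise.
        static String chooseStrategy(PredicateType type) {
            switch (type) {
                case NUMERIC:  return "positional-access";  // e.g. $x[3]: jump straight to item 3
                case BOOLEAN:  return "filter-scan";        // e.g. $x[@ok]: test each item in turn
                default:       return "runtime-dispatch";   // decide afresh on each evaluation
            }
        }
    }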

So in my view the jury is still out on the importance of schema-based optimization. We will have to wait for real implementation experience before we know the answers. Extrapolating the relational experience is not necessarily going to give the right predictions. My own guess is that strong typing will prove less useful in practice than some of its advocates hope, especially because I think many users will take the easy road of not declaring the expected types of variables, function parameters, and constructed nodes when they can get away without it.


