Physical Mappings of XQuery | Beginning ASP.NET Databases Using VB.NET

The previous sections have presented the functional description of how to integrate XQuery and XML into the SQL processing model of relational systems. From a performance point of view, another interesting aspect is how such combined queries are mapped into a physical operator tree. Since this mapping depends on the physical mapping of the XML datatype and also depends heavily on vendor-specific architectures, this section provides a more general, higher-level discussion of the different strategies for mapping XQuery into physical execution plans.

In the following, we assume that the XQuery expressions are provided as constants; thus, compilation can occur at the same time as the compilation of the SQL expressions. The handling of parameterized queries is left as an exercise for the reader. The general processing model of an SQL statement is as follows (individual steps may vary in actual implementations; e.g., a logical operator tree may contain more or less physical information):

Static compilation phase:
1. Parse the statement into an abstract syntax tree (AST).
2. Normalize the AST (example: perform the implicit conversions of single-cell rowsets into scalar values) and check it for type-correctness.
3. Convert the AST into a logical operator tree representing the relational algebra of the implementation engine. A logical operator tree may indicate a join operation, but does not indicate, for example, which physical join implementation is used.
4. Store the operator tree for later dynamic processing after potentially applying some additional static rewrite rules.
Dynamic execution phase:
1. Compile the logical operator tree into a physical operator tree using dynamic statistics such as costing information. This transformation may reorder operations and will choose among the different access methods (e.g., scan vs. index) and join methods.
2. Execute the physical operator tree.
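The split between the two phases can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names, the `small` threshold, and the crude row estimate are all hypothetical and do not correspond to any particular vendor's engine. The logical tree records that a join happens; only the dynamic phase decides how.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogicalScan:
    table: str


@dataclass(frozen=True)
class LogicalJoin:
    left: object
    right: object
    key: str


@dataclass(frozen=True)
class PhysicalScan:
    table: str
    method: str  # "scan" or "index"


@dataclass(frozen=True)
class PhysicalJoin:
    left: object
    right: object
    key: str
    algorithm: str  # "nested_loop" or "hash"


def compile_static(left_table, right_table, key):
    """Static phase stand-in: a logical join with no physical choices."""
    return LogicalJoin(LogicalScan(left_table), LogicalScan(right_table), key)


def estimated_rows(node, row_counts):
    """Crude cardinality estimate (an assumption made for this sketch)."""
    if isinstance(node, LogicalScan):
        return row_counts[node.table]
    return max(estimated_rows(node.left, row_counts),
               estimated_rows(node.right, row_counts))


def compile_dynamic(node, row_counts, indexed_tables, small=100):
    """Dynamic phase stand-in: pick access and join methods from statistics."""
    if isinstance(node, LogicalScan):
        method = "index" if node.table in indexed_tables else "scan"
        return PhysicalScan(node.table, method)
    left, right = node.left, node.right
    # Reorder the inputs so that the smaller one comes first.
    if estimated_rows(right, row_counts) < estimated_rows(left, row_counts):
        left, right = right, left
    smaller = estimated_rows(left, row_counts)
    algorithm = "nested_loop" if smaller <= small else "hash"
    return PhysicalJoin(
        compile_dynamic(left, row_counts, indexed_tables, small),
        compile_dynamic(right, row_counts, indexed_tables, small),
        node.key,
        algorithm,
    )


logical = compile_static("orders", "customers", "customer_id")
plan = compile_dynamic(logical, {"orders": 100_000, "customers": 50}, {"customers"})
print(plan.algorithm, plan.left)
```

The same logical tree yields a different physical plan when the statistics change, which is why the logical tree is stored and the physical choice is deferred.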

There are two major approaches to compiling and executing XQuery expressions in the context of SQL statements: a decoupled and an integrated approach. The decoupled approach works with a standalone XQuery engine that is added to the relational system. It is usually employed when the XML datatype is modeled as a user-defined datatype that appears as a binary object to the relational processor. In this case, the query function is processed like any other external function in the SQL compilation: The XQuery expression is passed to the XQuery processor outside of the general SQL processing model, the query is processed, and the result is returned to the SQL environment. If XQuery has been extended with access to the relational processing environment (for example, the sql:column() function mentioned above), then the external XQuery engine must be able to call back into the relational processing environment to access the data. Figure 7.3 shows the decoupled approach in the form of a block diagram. It uses the Greek symbols σ and π for selection and projection, respectively.

Figure 7.3. Decoupled XQuery Processing Approach

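A toy version of the decoupled approach might look like the following. It is a sketch under loud assumptions: `ExternalXQueryEngine` and `run_sql_select` are hypothetical names, Python's `xml.etree.ElementTree` with its limited XPath subset stands in for a real XQuery engine, and the `sql:column("...")` callback is simulated by naive textual substitution (values containing quotes would break it). What it shows is that the relational side treats the XML value as an opaque blob and hands each row to the external engine, which calls back for column values.

```python
import re
import xml.etree.ElementTree as ET


class ExternalXQueryEngine:
    """Hypothetical stand-in for a standalone XQuery engine."""

    def __init__(self, column_callback):
        # Callback into the relational environment, as sql:column() requires.
        self._column_callback = column_callback

    def query(self, xml_blob, expression):
        def substitute(match):
            value = self._column_callback(match.group(1))
            return "'" + str(value) + "'"

        # Resolve sql:column("name") by asking the relational side.
        resolved = re.sub(r'sql:column\("([^"]+)"\)', substitute, expression)
        root = ET.fromstring(xml_blob)
        return [element.text for element in root.findall(resolved)]


def run_sql_select(rows, expression):
    """Relational side: evaluate the external function once per row."""
    results = []
    for row in rows:
        engine = ExternalXQueryEngine(lambda name, row=row: row[name])
        results.append(engine.query(row["doc"], expression))
    return results


ROWS = [
    {"wanted_id": 2,
     "doc": b"<order><item id='1'>pen</item><item id='2'>ink</item></order>"},
    {"wanted_id": 1,
     "doc": b"<order><item id='1'>cup</item></order>"},
]

print(run_sql_select(ROWS, './item[@id=sql:column("wanted_id")]'))
```

Because the relational processor never sees inside the expression, it cannot push predicates into the XML evaluation or use relational indexes for it; every row's blob is parsed and queried separately.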

The integrated approach maps the XQuery expressions into logical operator trees that are integrated with the logical operator tree of the SQL statement. This approach requires that the physical design of the XML datatype use one of the relational mappings described above. Note that such a relational physical design can occur as the primary storage, as part of an indexing scheme on an LOB-based storage model, or even as a just-in-time shredding of such an LOB when processing XQueries. The XQuery logical operator tree must consist of operations that the relational query processor understands, such as projections (π), selections (σ), and aggregations. Figure 7.4 provides the block diagram of the integrated approach.

Figure 7.4. Integrated XQuery Processing Approach

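As a rough illustration of the integrated approach, the path expression //A/B can be expressed with ordinary selection, join, and projection over a shredded node table. The table layout (`id`, `parent`, `name`, with `id` assigned in document order) and the helper names are assumptions chosen for this sketch, not a description of any specific product's mapping.

```python
# Shredded node table: one row per element, ids assigned in document order.
NODES = [
    {"id": 1, "parent": None, "name": "root"},
    {"id": 2, "parent": 1, "name": "A"},
    {"id": 3, "parent": 2, "name": "B"},
    {"id": 4, "parent": 2, "name": "C"},
    {"id": 5, "parent": 1, "name": "A"},
    {"id": 6, "parent": 5, "name": "B"},
    {"id": 7, "parent": 1, "name": "B"},  # a B that is not a child of an A
]


def select(rows, predicate):
    """Relational selection."""
    return [row for row in rows if predicate(row)]


def join(left, right, condition):
    """Relational join, written as a plain nested loop."""
    return [(l, r) for l in left for r in right if condition(l, r)]


def project(rows, function):
    """Relational projection."""
    return [function(row) for row in rows]


def path_a_slash_b(nodes):
    """Evaluate //A/B using only relational operators."""
    a_nodes = select(nodes, lambda n: n["name"] == "A")
    b_nodes = select(nodes, lambda n: n["name"] == "B")
    pairs = join(a_nodes, b_nodes, lambda a, b: b["parent"] == a["id"])
    # Sorting by id restores document order for the result sequence.
    return sorted(project(pairs, lambda pair: pair[1]["id"]))


print(path_a_slash_b(NODES))
```

Once the path is expressed this way, the relational optimizer can treat it like any other part of the SQL statement, for example by replacing the nested loop with an index lookup on `parent`.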

Note that the integrated approach is not as simple as it sounds. First, since XQuery has a more complex, nested, and sequence-based data model, the relational algebra must be extended with operations that capture the necessary semantics, such as nest/unnest and document ordering. Furthermore, the XML Schema built-in scalar datatypes and their operations do not normally map one-to-one to the built-in relational scalar datatypes and their operations. For example, relational string comparisons may disregard trailing spaces, whereas XQuery's string comparison does not. Likewise, the relational expression service that provides the operation semantics may deal only with true scalar types, while some XML Schema types are not truly scalar: the XML date and time datatypes, for instance, provide a (value, timezone) pair. Thus relational systems must extend their expression services to deal with the new datatypes and different operational semantics.
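The two mismatches mentioned above can be made concrete with a small Python sketch. The function names are hypothetical, and the padded comparison is only an approximation of how some relational engines compare character data; the point is merely that the two comparison rules disagree on the same inputs, and that a timestamp with a timezone carries more information than a single scalar instant.

```python
from datetime import datetime, timedelta, timezone


def sql_style_equal(a, b):
    """Padded comparison: trailing spaces are ignored (assumed SQL behavior)."""
    return a.rstrip(" ") == b.rstrip(" ")


def xquery_style_equal(a, b):
    """Exact comparison: trailing spaces are significant."""
    return a == b


# The same two strings compare differently under the two rules.
print(sql_style_equal("abc ", "abc"), xquery_style_equal("abc ", "abc"))

# Two values denoting the same instant but carrying different timezones.
utc_noon = datetime(2004, 1, 1, 12, 0, tzinfo=timezone.utc)
plus_one = datetime(2004, 1, 1, 13, 0, tzinfo=timezone(timedelta(hours=1)))


def as_pair(value):
    """View a timestamp as a (local value, timezone offset) pair."""
    return (value.replace(tzinfo=None), value.utcoffset())


same_instant = utc_noon == plus_one
same_pair = as_pair(utc_noon) == as_pair(plus_one)
print(same_instant, same_pair)
```

An expression service that stores only the normalized instant cannot give back the original timezone, which is one reason the relational scalar types cannot simply be reused.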

Finally, integrating XQuery processing into the SQL processing model also means that the cost-based optimizer will have to choose execution plans that preserve the XQuery semantics and understand potentially additional cost factors. One important aspect that affects the execution plan is the preservation of document and input order. An optimizer must either choose only order-preserving join and set algorithms or add specific reorder operators to allow the use of more efficient algorithms that are not order-aware.
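The trade-off can be sketched with a simple hash join, whose output follows the order of the probe input rather than document order, followed by an explicit reorder step. The row layout (an `ord` field holding the document-order position) and the function names are assumptions made for this illustration.

```python
def hash_join(build_rows, probe_rows, key):
    """Hash join: output order follows the probe input, not document order."""
    table = {}
    for row in build_rows:
        table.setdefault(row[key], []).append(row)
    output = []
    for probe in probe_rows:
        for build in table.get(probe[key], []):
            output.append((build, probe))
    return output


def reorder(pairs):
    """Reorder operator: restore document order of (outer, inner) pairs."""
    return sorted(pairs, key=lambda pair: (pair[0]["ord"], pair[1]["ord"]))


# Outer elements and their children; "ord" is the document-order position.
A_ROWS = [{"a_id": 1, "ord": 1}, {"a_id": 2, "ord": 4}]
# The children arrive in a non-document order, e.g. from a value index.
B_ROWS = [
    {"a_id": 2, "ord": 5},
    {"a_id": 1, "ord": 3},
    {"a_id": 1, "ord": 2},
]

joined = hash_join(A_ROWS, B_ROWS, "a_id")
print([b["ord"] for _, b in joined])
print([b["ord"] for _, b in reorder(joined)])
```

Whether the hash join plus a sort is cheaper than an order-preserving algorithm is exactly the kind of additional cost factor the optimizer has to weigh.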

Another area that makes XQuery evaluation more complex than the evaluation of SQL statements is the execution order. The presence of an index often prompts the optimizer to choose a so-called bottom-up evaluation strategy, in which indices are first used to filter the processed tuples before any of the other operations are evaluated. Since the naive execution strategy of XQuery is described top-down, an optimizer may, by reordering the evaluation, produce dynamic errors that the top-down evaluation strategy would have avoided. A simple example of this situation appears in the following XQuery expression:

 for $i in //A
 where $i/@a castable as xs:integer
 return
   for $j in $i/B
   where xs:integer($i/@a) > 10
   return
     $j

A naive, top-down evaluation of the nested for expression would execute the inner for expression only if the value of $i/@a is really castable to xs:integer. However, an optimizer may choose to rewrite the above query to the following equivalent:

 for $i in //A, $j in $i/B
 where $i/@a castable as xs:integer and xs:integer($i/@a) > 10
 return
   $j

and then choose to execute the comparison before the check for castability. If there are then @a attributes that are not castable to xs:integer, the cast in the comparison will fail with a runtime error that the naive evaluation would have avoided.
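The effect can be simulated in Python, with `int()` standing in for the xs:integer cast and a `ValueError` standing in for the dynamic error. The data layout and function names are hypothetical; the two functions mirror the nested query and the rewritten query with the comparison evaluated first.

```python
def castable_as_integer(value):
    """Stand-in for `castable as xs:integer`."""
    try:
        int(value)
        return True
    except ValueError:
        return False


def top_down(a_elements):
    """Naive evaluation: the cast runs only after the castable check passed."""
    result = []
    for a in a_elements:
        if castable_as_integer(a["a"]):
            for b in a["B"]:
                if int(a["a"]) > 10:
                    result.append(b)
    return result


def reordered(a_elements):
    """Rewritten query with the comparison evaluated before the check."""
    result = []
    for a in a_elements:
        for b in a["B"]:
            # int() raises ValueError here for a non-castable attribute.
            if int(a["a"]) > 10 and castable_as_integer(a["a"]):
                result.append(b)
    return result


DATA = [
    {"a": "12", "B": ["b1", "b2"]},
    {"a": "n/a", "B": ["b3"]},
    {"a": "7", "B": ["b4"]},
]

print(top_down(DATA))
try:
    reordered(DATA)
except ValueError as error:
    print("dynamic error:", error)
```

Both functions express the same filter, yet only the second one can fail, and only on data that contains a non-castable attribute.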

Note that this type of behavior can already occur in processing normal SQL statements without XQuery, although less frequently. The XQuery language specification also recognizes the benefits of processing optimizations and thus indeed allows these rewrites. However, an XQuery system that integrates with such an optimizer should provide an optimization hint that forces the naive top-down evaluation strategy, giving the query writer the option to obtain a correct, error-free evaluation.

Issues of Combining SQL, XML Datatype, and XQuery

Combining XQuery and SQL for querying XML datatype instances is indeed powerful. However, some major issues arise with this approach:

XQuery users interested in querying across multiple documents need to understand and use SQL to iterate over the collection of documents, and need to use SQL joins to join two different documents.
SQL users interested in querying into XML documents need to learn the new XQuery language.

Since the languages are closely related, somebody familiar with one language should be able to learn the other easily. However, the need to know two languages is still an additional level of complexity that future relational systems will need to address. Today's relational database systems address the second problem by shredding the XML data into relational tables that can then be queried relationally, thus replacing the need for a second query language with a simpler mapping approach. However, this mapping approach often provides only relational fidelity and thus does not address order preservation and markup scenarios. Thus either the SQL model itself must be extended to deal with the nonrelational aspects of order and heterogeneity (this still means that a query writer would have to learn and understand these extensions), or the SQL model will eventually be subsumed under the XML and XQuery model (which would mean that a large number of people would have to learn this new model). Relational database systems will probably provide a combination of the two approaches, thus providing both an SQL-based and an XQuery-based approach, each addressing a different skill set.