Recipe 9.1. Performing Set Operations on Node Sets

Problem

You need to find the union, intersection, set difference, or symmetrical set difference between two node sets. You may also need to test equality and subset relationships between two node sets.

Solution

XSLT 1.0

The union is trivial because XPath supports it directly:

<xsl:copy-of select="$node-set1 | $node-set2"/>

The intersection of two node sets requires a more convoluted expression:

<xsl:copy-of select="$node-set1[count(. | $node-set2) = count($node-set2)]"/>

This expression selects every node in node-set1 that is also in node-set2. It works because forming the union of node-set2 with a node that is already one of its members leaves the node count unchanged.

Set difference (those elements that are in the first set but not the second) follows:

<xsl:copy-of select="$node-set1[count(. | $node-set2) != count($node-set2)]"/>

This expression selects every node in node-set1 that is not in node-set2. It works because forming the union of node-set2 with a node that is not already one of its members produces a set with one more node.

Symmetrical set difference (the elements that are in exactly one of the two sets) follows:

<xsl:copy-of select="$node-set1[count(. | $node-set2) != count($node-set2)] |
                     $node-set2[count(. | $node-set1) != count($node-set1)]"/>

The symmetrical set difference is simply the union of the differences taken both ways.
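To see the four operations side by side, the following minimal stylesheet is a sketch that applies each expression to one small document. The input structure (an items element whose item children carry optional a and b attributes) and the wrapper element names in the output are assumptions made for illustration only.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="yes"/>

  <!-- Assumed input: an items element with item children; each item
       may carry an a attribute, a b attribute, or both. -->
  <xsl:variable name="node-set1" select="/items/item[@a]"/>
  <xsl:variable name="node-set2" select="/items/item[@b]"/>

  <xsl:template match="/">
    <results>
      <!-- Nodes in either set -->
      <union>
        <xsl:copy-of select="$node-set1 | $node-set2"/>
      </union>
      <!-- Nodes in both sets -->
      <intersection>
        <xsl:copy-of
            select="$node-set1[count(. | $node-set2) = count($node-set2)]"/>
      </intersection>
      <!-- Nodes in node-set1 but not in node-set2 -->
      <difference>
        <xsl:copy-of
            select="$node-set1[count(. | $node-set2) != count($node-set2)]"/>
      </difference>
      <!-- Nodes in exactly one of the two sets -->
      <symmetric-difference>
        <xsl:copy-of
            select="$node-set1[count(. | $node-set2) != count($node-set2)] |
                    $node-set2[count(. | $node-set1) != count($node-set1)]"/>
      </symmetric-difference>
    </results>
  </xsl:template>

</xsl:stylesheet>
```

For an input with three items, where the first has only a, the second has both a and b, and the third has only b, the union contains all three items, the intersection contains the second, the difference contains the first, and the symmetrical difference contains the first and third.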

To test if node-set1 is equal to node-set2:

<xsl:if test="count($node-set1 | $node-set2) = count($node-set1) and
              count($node-set1) = count($node-set2)">

Two sets are equal if their union contains the same number of nodes as each set contains individually.

To test if node-set2 is a subset of node-set1:

<xsl:if test="count($node-set1|$node-set2) = count($node-set1)">

To test if node-set2 is a proper subset of node-set1:

<xsl:if test="count($node-set1 | $node-set2) = count($node-set1) and
              count($node-set1) > count($node-set2)">
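The three tests can be combined in a single named template. The sketch below is one possible arrangement, not part of the recipe itself: the template name compare-sets, its parameter names, and the items/item input structure are all hypothetical. The second branch relies on the first having failed, so a union that adds nothing to ns1 must mean ns2 is strictly smaller.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <!-- Hypothetical helper: reports how ns2 relates to ns1.
       Both parameters default to the empty node set (/..). -->
  <xsl:template name="compare-sets">
    <xsl:param name="ns1" select="/.."/>
    <xsl:param name="ns2" select="/.."/>
    <xsl:choose>
      <!-- Union adds nothing and the sizes match: same nodes -->
      <xsl:when test="count($ns1 | $ns2) = count($ns1) and
                      count($ns1) = count($ns2)">
        <xsl:text>equal</xsl:text>
      </xsl:when>
      <!-- Union adds nothing but the sets are not equal:
           ns2 is strictly contained in ns1 -->
      <xsl:when test="count($ns1 | $ns2) = count($ns1)">
        <xsl:text>proper subset</xsl:text>
      </xsl:when>
      <!-- ns2 contributes at least one node that ns1 lacks -->
      <xsl:otherwise>
        <xsl:text>not a subset</xsl:text>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

  <!-- Assumed input: an items element with item children,
       some of which carry a b attribute. -->
  <xsl:template match="/">
    <xsl:call-template name="compare-sets">
      <xsl:with-param name="ns1" select="/items/item"/>
      <xsl:with-param name="ns2" select="/items/item[@b]"/>
    </xsl:call-template>
  </xsl:template>

</xsl:stylesheet>
```

With these particular arguments the result is "equal" when every item has a b attribute and "proper subset" otherwise, because the second set is selected from the first.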

XSLT 2.0

Set operations on sequences of nodes are directly supported in XPath 2.0 through the union, intersect, and except operators. See Recipe 1.7 for details.
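As a brief illustration, the XSLT 1.0 expressions above might be rewritten with the XPath 2.0 operators union, intersect, and except, as in the following sketch. It assumes an XSLT 2.0 processor and the same hypothetical items/item input used for illustration; the output element names are likewise invented.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="yes"/>

  <!-- Assumed input: an items element with item children; each item
       may carry an a attribute, a b attribute, or both. -->
  <xsl:variable name="node-set1" select="/items/item[@a]"/>
  <xsl:variable name="node-set2" select="/items/item[@b]"/>

  <xsl:template match="/">
    <results>
      <union>
        <xsl:copy-of select="$node-set1 union $node-set2"/>
      </union>
      <intersection>
        <xsl:copy-of select="$node-set1 intersect $node-set2"/>
      </intersection>
      <difference>
        <xsl:copy-of select="$node-set1 except $node-set2"/>
      </difference>
      <!-- Union of the differences taken both ways -->
      <symmetric-difference>
        <xsl:copy-of select="($node-set1 except $node-set2) |
                             ($node-set2 except $node-set1)"/>
      </symmetric-difference>
      <!-- node-set2 is a subset of node-set1 when removing
           node-set1 from it leaves nothing -->
      <xsl:if test="empty($node-set2 except $node-set1)">
        <node-set2-is-subset/>
      </xsl:if>
    </results>
  </xsl:template>

</xsl:stylesheet>
```

These operators compare nodes by identity, just as the count-based expressions do, so two distinct elements with identical content are still treated as different members.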

Discussion

You may wonder what set operations have to do with XML queries. Set operations are ways of finding commonalities and differences between sets of elements extracted from a document. Many basic questions one can ask of data have to do with common and distinguishing traits.

For example, imagine extracting person elements from people.xml as follows:

<xsl:variable name="males" select="//person[@sex='m']"/>
<xsl:variable name="females" select="//person[@sex='f']"/>
<xsl:variable name="smokers" select="//person[@smoker='yes']"/>
<xsl:variable name="non-smokers" select="//person[@smoker='no']"/>

Now, if you were issuing life insurance, you might consider charging each of the following sets of people different rates:

<!-- Male smokers -->
<xsl:variable name="super-risk"
      select="$males[count(. | $smokers) = count($smokers)]"/>
<!-- Female smokers -->
<xsl:variable name="high-risk"
      select="$females[count(. | $smokers) = count($smokers)]"/>
<!-- Male non-smokers -->
<xsl:variable name="moderate-risk"
      select="$males[count(. | $non-smokers) = count($non-smokers)]"/>
<!-- Female non-smokers -->
<xsl:variable name="low-risk"
      select="$females[count(. | $non-smokers) = count($non-smokers)]"/>

You probably noticed that the same answers could have been obtained more directly by using logic rather than set theory:

<!-- Male smokers -->
<xsl:variable name="super-risk"
      select="//person[@sex='m' and @smoker='yes']"/>
<!-- Female smokers -->
<xsl:variable name="high-risk"
      select="//person[@sex='f' and @smoker='yes']"/>
<!-- Male non-smokers -->
<xsl:variable name="moderate-risk"
      select="//person[@sex='m' and @smoker='no']"/>
<!-- Female non-smokers -->
<xsl:variable name="low-risk"
      select="//person[@sex='f' and @smoker='no']"/>

Better still, if you had already extracted the sets of males and females, it would be more efficient to say:

<!-- Male smokers -->
<xsl:variable name="super-risk"
      select="$males[@smoker='yes']"/>
<!-- Female smokers -->
<xsl:variable name="high-risk"
      select="$females[@smoker='yes']"/>
<!-- Male non-smokers -->
<xsl:variable name="moderate-risk"
      select="$males[@smoker='no']"/>
<!-- Female non-smokers -->
<xsl:variable name="low-risk"
      select="$females[@smoker='no']"/>

These observations do not invalidate the utility of the set approach. Notice that the set operations work without knowledge of what the sets themselves contain. Set operations work at a higher level of abstraction. Imagine that you have a complex XML document and are interested in the following four sets:

<!-- All elements that have elements c1 or c2 as children -->
<xsl:variable name="set1" select="//*[c1 or c2]"/>
<!-- All elements that have elements c3 and c4 as children -->
<xsl:variable name="set2" select="//*[c3 and c4]"/>
<!-- All elements whose parent has attribute a1 -->
<xsl:variable name="set3" select="//*[../@a1]"/>
<!-- All elements whose parent has attribute a2 -->
<xsl:variable name="set4" select="//*[../@a2]"/>

In the original example, it was obvious that the sets of males and females (and smokers and nonsmokers) are disjoint. Here you have no such knowledge. The sets may be completely disjoint, may completely overlap, or may share only some elements. There are only two ways to find out what set1 and set3, for example, have in common. The first is to take their intersection; the second is to traverse the entire document again using the logical and of their predicates. In this case, the intersection is clearly the way to go.
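Taking the intersection of set1 and set3 might look like the following sketch. The variable name common and the text report it prints are assumptions added for illustration; only the two select expressions for set1 and set3 come from the definitions above.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <!-- All elements that have elements c1 or c2 as children -->
  <xsl:variable name="set1" select="//*[c1 or c2]"/>
  <!-- All elements whose parent has attribute a1 -->
  <xsl:variable name="set3" select="//*[../@a1]"/>

  <!-- Intersection: members of set1 whose union with set3
       does not enlarge set3 (hypothetical variable name) -->
  <xsl:variable name="common"
      select="$set1[count(. | $set3) = count($set3)]"/>

  <xsl:template match="/">
    <xsl:choose>
      <!-- An empty node set converts to false -->
      <xsl:when test="not($common)">
        <xsl:text>set1 and set3 are disjoint&#10;</xsl:text>
      </xsl:when>
      <xsl:otherwise>
        <xsl:value-of select="count($common)"/>
        <xsl:text> element(s) in both set1 and set3:&#10;</xsl:text>
        <!-- List the name of each shared element -->
        <xsl:for-each select="$common">
          <xsl:value-of select="name()"/>
          <xsl:text>&#10;</xsl:text>
        </xsl:for-each>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

</xsl:stylesheet>
```

The intersection reuses the two node sets that were already computed, so the document is not traversed a third time with a combined predicate.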

EXSLT defines a set module that includes functions performing the set operations discussed here. The EXSLT implementation uses an interesting technique to return the result of its set operations. Instead of returning the result directly, it applies templates to the result in a mode particular to the type of set operation. For example, after EXSLT set:intersection computes the intersection, it invokes <xsl:apply-templates mode="set:intersection"/> on the result. A default template exists in EXSLT with this mode, and it returns a copy of the result as a result-tree fragment. This indirect means of returning the result allows users who import the EXSLT set module to override the default and process the result further.

This technique is useful but limited. It is useful because it potentially eliminates the need to use the node-set extension function to convert the result back into a node set. It is limited because there can be at most one such overriding template per matching pattern in the user stylesheet for each operation, yet you may want to do very different post-processing tasks with the results of intersections invoked from different places in the same stylesheet.
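The mode-based technique can be sketched in a single stylesheet. The code below is an illustration of the idea as described here, not the EXSLT source: the my namespace URI, the template and parameter names, and the name attribute on person are all hypothetical. The more specific person template takes priority over the default node() template, which plays the role of the user's override.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:my="http://example.com/hypothetical/sets"
    exclude-result-prefixes="my">

  <!-- Library side: compute the intersection, then hand each
       resulting node to templates in a dedicated mode. -->
  <xsl:template name="my:intersection">
    <xsl:param name="nodes1" select="/.."/>
    <xsl:param name="nodes2" select="/.."/>
    <xsl:apply-templates
        select="$nodes1[count(. | $nodes2) = count($nodes2)]"
        mode="my:intersection"/>
  </xsl:template>

  <!-- Library side: default behavior is to copy each result node. -->
  <xsl:template match="node() | @*" mode="my:intersection">
    <xsl:copy-of select="."/>
  </xsl:template>

  <!-- User side: override for person elements. Every intersection
       that yields person elements is processed by this one template. -->
  <xsl:template match="person" mode="my:intersection">
    <xsl:value-of select="@name"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

  <!-- Example call: male smokers, assuming person elements
       with sex, smoker, and name attributes. -->
  <xsl:template match="/">
    <xsl:call-template name="my:intersection">
      <xsl:with-param name="nodes1" select="//person[@sex='m']"/>
      <xsl:with-param name="nodes2" select="//person[@smoker='yes']"/>
    </xsl:call-template>
  </xsl:template>

</xsl:stylesheet>
```

The limitation is visible in the override: a second call that intersects two other sets of person elements would also be routed through the same person template, so it could not format its result differently.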

Do not be alarmed if you do not grasp the subtleties of EXSLT's technique discussed here. Chapter 16 discusses these and other techniques for making XSLT code reusable in more detail.