Recipe 3.8. Computing Statistical Functions | XSLT Cookbook: Solutions and Examples for XML and XSLT Developers, 2nd Edition

Recipe 3.8. Computing Statistical Functions

Problem

You need to compute averages, variances, and standard deviations.

Solution

XSLT 1.0

Three types of averages are used by statisticians: the mean (layperson's average), the median, and the mode.

The mean is trivial: simply sum using Recipe 3.6 and divide by the count.
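For a plain node set of numbers, the sum-and-divide step can be sketched as a one-line template. This is only an illustration: the name ckbk:mean is hypothetical (the recipe defines no such template), and the built-in XPath sum() function stands in for the Recipe 3.6 templates.

```xml
<!-- Hypothetical helper; the name ckbk:mean is not part of the recipe -->
<xsl:template name="ckbk:mean">
  <xsl:param name="nodes" select="/.."/>
  <!-- sum() and count() are XPath 1.0 core functions;
       an empty node set gives 0 div 0, which is NaN -->
  <xsl:value-of select="sum($nodes) div count($nodes)"/>
</xsl:template>
```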

The median is the number that falls in the middle of the set of numbers when they are sorted. If the count is even, then the mean of the two middle numbers is generally taken:

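<xsl:template name="ckbk:median">
  <xsl:param name="nodes" select="/.."/>
  <xsl:variable name="count" select="count($nodes)"/>
  <xsl:variable name="middle" select="ceiling($count div 2)"/>
  <xsl:variable name="even" select="not($count mod 2)"/>

  <!-- The middle value in sorted order -->
  <xsl:variable name="lower">
    <xsl:for-each select="$nodes">
      <xsl:sort data-type="number"/>
      <xsl:if test="position() = $middle">
        <xsl:value-of select="."/>
      </xsl:if>
    </xsl:for-each>
  </xsl:variable>

  <!-- The next value in sorted order when the count is even, otherwise
       the middle value again. Positions come from the sorted sequence,
       because axes such as following-sibling follow document order. -->
  <xsl:variable name="upper">
    <xsl:for-each select="$nodes">
      <xsl:sort data-type="number"/>
      <xsl:if test="position() = $middle + $even">
        <xsl:value-of select="."/>
      </xsl:if>
    </xsl:for-each>
  </xsl:variable>

  <!-- The middle value, or the sum of the two middle values -->
  <xsl:variable name="m1" select="$lower + ($even * $upper)"/>

  <!-- The median -->
  <xsl:value-of select="$m1 div ($even + 1)"/>
</xsl:template>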

Handling the even case relies on the Boolean-to-number conversion trick used in several other examples in this book. If the number of nodes is odd, $even converts to 0, so $m1 ends up being equal to the middle node, and you divide by 1 to get the answer. On the other hand, if the number of nodes is even, $even converts to 1, so $m1 ends up being the sum of the two middle nodes, and you divide by 2 to get the answer.
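As a quick check of the even case, the template might be called as follows. The input document and the median wrapper element are assumptions made for illustration, and the ckbk prefix is assumed to be bound to the same namespace as the template.

```xml
<!-- Assumed input document:
     <numbers>
       <number>1</number><number>2</number>
       <number>3</number><number>4</number>
     </numbers> -->
<xsl:template match="/">
  <median>
    <xsl:call-template name="ckbk:median">
      <xsl:with-param name="nodes" select="/numbers/number"/>
    </xsl:call-template>
  </median>
</xsl:template>
```

With four values, $middle is 2 and $even is true, so $m1 is 2 + 3 = 5 and the result is 5 div 2 = 2.5.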

The mode is the most frequently occurring element(s) in a set of elements that need not be numbers. If identical nodes compare with equality on their string values, then the following solution does the trick:

<xsl:template name="ckbk:mode">
  <xsl:param name="nodes" select="/.."/>
  <xsl:param name="max" select="0"/>
  <xsl:param name="mode" select="/.."/>

  <xsl:choose>
    <xsl:when test="not($nodes)">
      <xsl:copy-of select="$mode"/>
    </xsl:when>
    <xsl:otherwise>
      <xsl:variable name="first" select="$nodes[1]"/>
      <xsl:variable name="try" select="$nodes[. = $first]"/>
      <xsl:variable name="count" select="count($try)"/>
      <!-- Recurse with nodes not equal to first -->
      <xsl:call-template name="ckbk:mode">
        <xsl:with-param name="nodes" select="$nodes[not(. = $first)]"/>
        <!-- If we have found a node that is more frequent, then
             pass the count; otherwise, pass the old max count -->
        <xsl:with-param name="max"
             select="($count > $max) * $count + not($count > $max) * $max"/>
        <!-- Compute the new mode as ... -->
        <xsl:with-param name="mode">
          <xsl:choose>
            <!-- the first element in try if we found a new max -->
            <xsl:when test="$count > $max">
              <xsl:copy-of select="$try[1]"/>
            </xsl:when>
            <!-- the old mode union the first element in try if we
                 found an equivalent count to current max -->
            <xsl:when test="$count = $max">
              <!-- Caution: you will need to convert $mode to a -->
              <!-- node set if you are using a version of XSLT -->
              <!-- that does not convert automatically -->
              <xsl:copy-of select="$mode | $try[1]"/>
            </xsl:when>
            <!-- otherwise the old mode stays the same -->
            <xsl:otherwise>
              <xsl:copy-of select="$mode"/>
            </xsl:otherwise>
          </xsl:choose>
        </xsl:with-param>
      </xsl:call-template>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>

If not, then replace the comparisons with an appropriate test. For example, if equality is contingent on an attribute called age, the test would be ./@age = $first/@age.
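For the age example, the change amounts to two select expressions inside ckbk:mode. The fragment below is a sketch that assumes every node in $nodes carries an age attribute; a node without one would never be removed from $nodes, and the recursion would not terminate.

```xml
<!-- Inside ckbk:mode: group by the age attribute rather than the string value -->
<xsl:variable name="first" select="$nodes[1]"/>
<xsl:variable name="try" select="$nodes[@age = $first/@age]"/>

<!-- In the recursive call: keep only nodes whose age differs from the first -->
<xsl:with-param name="nodes" select="$nodes[not(@age = $first/@age)]"/>
```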

The variance and standard deviation are common statistical measures of dispersion, that is, of the spread in the values about the average. The easiest way to compute a variance is to obtain three values: sum = the sum of the numbers, sum-sq = the sum of each number squared, and count = the size of the set of numbers. The variance is then (sum-sq - sum² / count) / (count - 1). You can compute them all in one shot with the following tail-recursive template:

<xsl:template name="ckbk:variance">
  <xsl:param name="nodes" select="/.."/>
  <xsl:param name="sum" select="0"/>
  <xsl:param name="sum-sq" select="0"/>
  <xsl:param name="count" select="0"/>
  <xsl:choose>
    <xsl:when test="not($nodes)">
      <xsl:value-of select="($sum-sq - ($sum * $sum) div $count) div ($count - 1)"/>
    </xsl:when>
    <xsl:otherwise>
      <xsl:variable name="value" select="$nodes[1]"/>
      <xsl:call-template name="ckbk:variance">
        <xsl:with-param name="nodes" select="$nodes[position() != 1]"/>
        <xsl:with-param name="sum" select="$sum + $value"/>
        <xsl:with-param name="sum-sq" select="$sum-sq + ($value * $value)"/>
        <xsl:with-param name="count" select="$count + 1"/>
      </xsl:call-template>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>

You may recognize this template as a variation of ckbk:sum that was extended to compute the other two components that comprise the variance calculation. As such, an XSLT implementation without support for tail recursion runs into trouble on large sets. In that case, you must take an alternate piecewise strategy based on the standard definition of variance: Σ (mean - x_i)² / (count - 1), where the sum runs over every number x_i in the set. First, compute the mean by using the divide-and-conquer or batch forms of sum and dividing by the count. Then use a divide-and-conquer or batch template that computes the sum of the squares of the difference between the mean and each number. Finally, divide the result by count - 1.
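The one-pass form and the definitional form give the same variance. The expansion below shows why; it uses n for count and x̄ for the mean, which are notational choices made here rather than symbols from the recipe.

```latex
\[
\begin{aligned}
\sum_{i=1}^{n} (x_i - \bar{x})^2
  &= \sum_{i=1}^{n} x_i^2 - 2\bar{x}\sum_{i=1}^{n} x_i + n\bar{x}^2 \\
  &= \sum_{i=1}^{n} x_i^2 - \frac{1}{n}\Bigl(\sum_{i=1}^{n} x_i\Bigr)^2
  \qquad \text{since } \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \\[4pt]
\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2
  &= \frac{\text{sum-sq} - \text{sum}^2 / \text{count}}{\text{count} - 1}
\end{aligned}
\]
```

For the values 2, 4, and 6, sum = 12, sum-sq = 56, and count = 3, so the one-pass form gives (56 - 144 / 3) / 2 = 4. The definitional form gives the same result: the mean is 4, the squared differences are 4, 0, and 4, and 8 / 2 = 4.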

Once you can compute the variance, the standard deviation follows as the square root of the variance. See Recipe 3.5 for square root.
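Chaining the two steps might look like the sketch below. Both names are assumptions: ckbk:std-dev is a hypothetical wrapper that this recipe does not define, and the square root template from Recipe 3.5 is assumed to be called ckbk:sqrt and to take a parameter named number. Adjust the call to match the template you actually use.

```xml
<!-- Hypothetical wrapper; ckbk:std-dev is not defined in the recipe -->
<xsl:template name="ckbk:std-dev">
  <xsl:param name="nodes" select="/.."/>
  <!-- Capture the variance as text so it can be passed on -->
  <xsl:variable name="variance">
    <xsl:call-template name="ckbk:variance">
      <xsl:with-param name="nodes" select="$nodes"/>
    </xsl:call-template>
  </xsl:variable>
  <!-- Assumed name and parameter for the Recipe 3.5 square root template -->
  <xsl:call-template name="ckbk:sqrt">
    <xsl:with-param name="number" select="number($variance)"/>
  </xsl:call-template>
</xsl:template>
```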

XSLT 2.0

<!-- Median -->
<xsl:function name="ckbk:median">
  <xsl:param name="nodes" as="xs:double*"/>
  <xsl:variable name="count" select="count($nodes)"/>
  <xsl:variable name="middle" select="ceiling($count div 2)"/>

  <xsl:variable name="sorted" as="xs:double*">
    <xsl:perform-sort select="$nodes">
      <xsl:sort data-type="number"/>
    </xsl:perform-sort>
  </xsl:variable>

  <!-- Average the two middle values when the count is even -->
  <xsl:sequence select="if ($count mod 2 = 0)
                        then ($sorted[$middle] + $sorted[$middle + 1]) div 2
                        else $sorted[$middle]"/>
</xsl:function>

<!-- Mode -->
<xsl:function name="ckbk:mode" as="item()*">
  <xsl:param name="nodes" as="item()*"/>
  <!-- First locate the distinct values -->
  <xsl:variable name="distinct" select="distinct-values($nodes)" as="item()*"/>
  <!-- Get a sequence of occurrence counts of the distinct values -->
  <xsl:variable name="counts"
                select="for $i in $distinct return count($nodes[. = $i])"
                as="xs:integer*"/>
  <!-- Find the max of the counts -->
  <xsl:variable name="max" select="max($counts)" as="xs:integer?"/>
  <!-- Now return those values that have the max count -->
  <xsl:sequence select="$distinct[position() = index-of($counts, $max)]"/>
</xsl:function>

<!-- Variance -->
<xsl:function name="ckbk:variance" as="xs:double">
  <xsl:param name="nodes" as="xs:double*"/>
  <xsl:variable name="sum" select="sum($nodes)"/>
  <xsl:variable name="count" select="count($nodes)"/>
  <!-- (count * sum-sq - sum * sum) / (count * (count - 1)) is the
       1.0 formula with numerator and denominator multiplied by count -->
  <xsl:sequence select="if ($count lt 2)
                        then 0
                        else ($count * sum(for $i in $nodes return $i * $i) -
                              $sum * $sum) div
                             ($count * $count - $count)"/>
</xsl:function>
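Because these are stylesheet functions, they can be called directly from any XPath expression. The fragment below is a sketch of such a call. The numbers/number input and the stats output elements are invented for illustration, and the xs and ckbk prefixes are assumed to be declared on the stylesheet.

```xml
<!-- Assumed input document: <numbers><number>...</number>...</numbers> -->
<xsl:template match="/">
  <!-- Untyped element content is atomized and cast to xs:double
       when it is bound to a parameter declared as xs:double* -->
  <xsl:variable name="values" select="/numbers/number"/>
  <stats>
    <median><xsl:value-of select="ckbk:median($values)"/></median>
    <!-- A multimodal set yields several values, separated by spaces -->
    <mode><xsl:value-of select="ckbk:mode($values)"/></mode>
    <variance><xsl:value-of select="ckbk:variance($values)"/></variance>
  </stats>
</xsl:template>
```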

Discussion

XSLT 1.0

Statistical functions are common tools for analyzing numerical data, and these templates can be a useful addition to your toolkit. However, XSLT was never intended as a tool for statistical analysis. An alternate approach would use XSLT as a front end for converting XML data to comma- or tab-delimited data and then import this data into a spreadsheet or statistics package.
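A front end of that kind can be very small. The stylesheet below is a minimal sketch that assumes the values sit in number elements under a numbers root (a made-up structure) and writes them as a single comma-delimited line.

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Assumed input: <numbers><number>...</number>...</numbers> -->
  <xsl:template match="/">
    <xsl:for-each select="/numbers/number">
      <xsl:value-of select="."/>
      <!-- Comma between values, newline after the last one -->
      <xsl:choose>
        <xsl:when test="position() != last()">,</xsl:when>
        <xsl:otherwise><xsl:text>&#10;</xsl:text></xsl:otherwise>
      </xsl:choose>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

For an input holding 3, 1, and 4, this writes the line 3,1,4, which a spreadsheet or statistics package can import directly.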

XSLT 2.0

Again you can see that the 2.0 solutions are much simpler. However, even more enlightening is that the 2.0 implementation of mode was far easier to derive, code, and debug than the 1.0 solution. I recall spending at least a few hours coming up with the 1.0 solution while writing the first edition of this book and probably an hour more debugging it. The corresponding 2.0 solution took about 15 minutes once I realized how I could take advantage of the distinct-values function.