Section 8.5. Calculating a Median

   

8.5 Calculating a Median

8.5.1 Problem

You want to calculate the median bulb life for all bulbs that have been tested . From the discussion earlier in this chapter, you should recognize that the median bulb life represents the case where the number of bulbs with shorter lives is equivalent to the number of bulbs with longer lives.

8.5.2 Solution

To calculate the median of the light-bulb test results, use the following query:

 SELECT x.Hours median FROM BulbLife x, BulbLife y GROUP BY x.Hours HAVING     SUM(CASE WHEN y.Hours <= x.Hours        THEN 1 ELSE 0 END)>=(COUNT(*)+1)/2 AND    SUM(CASE WHEN y.Hours >= x.Hours        THEN 1 ELSE 0 END)>=(COUNT(*)/2)+1 median       -----------  1086 

8.5.3 Discussion

This query follows the definition of the median very closely and uses the solution published several years ago by David Rozenshtein, Anatoly Abramovich, and Eugene Birger. Their solution is still regarded as one of the classical solutions to the problem of finding the median value in a sample. To understand their solution, it helps to look at the query in two phases. First, you have a GROUP BY query that returns the number of bulbs for each distinct lifetime. The following is a modified version of the first part of the solution query that returns the occurrence count corresponding to each distinct bulb-life value:

 SELECT COUNT(*) occurrences, x.Hours xhours FROM BulbLife x, BulbLife y GROUP BY x.Hours occurrences xhours       ----------- -----------  40          1043 20          1057 20          1073 20          1074 20          1077 20          1079 40          1085 20          1086 20          1087 20          1093 20          1096 20          1099 20          1103 20          1109 20          1114 20          1129 20          1131 20          1167 

Because these results represent a self-join of the BulbLife table with itself, each group represents a number of detail rows equivalent to the number of rows in the sample. The two groups of 40 occurrences each exist because the data contains 2 cases with values of 1043 and 2 cases with values of 1085. The detail for the 1086 group is as follows:

 xhours      yhours       ----------- -----------  1086        1043 1086        1043 1086        1057 1086        1073 1086        1074 1086        1077 1086        1079 1086        1085 1086        1085 1086        1086 1086        1087 1086        1093 1086        1096 1086        1099 1086        1103 1086        1109 1086        1114 1086        1129 1086        1131 1086        1167 

The question now is whether the value 1086 represents the median. To determine that, follow these steps:

  1. Count the cases where the y.hours value is less than or equal to the x.hours value.

  2. Count the cases where the x.hours value is less than or equal to the y.hours value.

  3. Compare the two results. If they are equal, then 1086 is the median value.

The HAVING clause in our solution query performs the counts for steps 1 and 2 using the following two invocations of the SUM function combined with a CASE statement:

 SUM(CASE WHEN y.Hours <= x.Hours     THEN 1 ELSE 0 END) SUM(CASE WHEN y.Hours >= x.Hours     THEN 1 ELSE 0 END) 

In our example, the values for these 2 sums work out to 10 and 11, respectively. Plug these two values in for the two SUM expressions in the HAVING clause, and you have the following:

 10 >= (COUNT(*)+1)/2 AND 11 >= (COUNT(*)/2)+1 

At this point, the two COUNT expressions deserve some additional explanation. They have been carefully crafted to allow us to derive a median, even in cases where we have an even number of values in the sample. Let's step back for a moment, and assume that our sample contained 21 values, instead of the 20 that it does contain. If that were the case, the two COUNT expressions would evaluate as follows:

 (COUNT(*)+1)/2         (COUNT(*)/2)+1 (21+1)/2               (21/2)+1 22/2                   10+1 11                     11 

In SQL Server, 21/2 represents an integer division and, hence, yields an integer value of 10 as the result.

Whenever you have an odd number of values in the sample, the two expressions will yield the same result. Given an even number, however, the first expression will yield a result that is one less than the other. Here is how the HAVING expression works out for the data in our example:

 10 >= (20+1)/2 AND 11 >= (20/2)+1 10 >= 10 AND 11 >= 11 

For the case where x.Hours = 1086, both expressions are true, so 1086 is returned as the median value. In actual fact, because we have an even number of values, there are 2 candidates for the median: 1086 and 1087. The value 1086 has 9 values below it and 10 above it. The value for 1087 has 10 values below it and 9 above it. Due to how we've written the COUNT expressions, our solution query arbitrarily returns the lower value as the median.

It's possible to use a slightly modified version of our solution query to return the financial median. Recall, from earlier in this chapter, that in the case of an even number of values, the financial median represents the mean of the two inner neighbors. With respect to our example, that would be the mean of 1086 and 1087, which works out to 1086.5. Use the following query to calculate the financial median:

 SELECT     CASE WHEN COUNT(*)%2=1          THEN x.Hours          ELSE (x.Hours+MIN(CASE WHEN y.Hours>x.Hours                                 THEN y.Hours                            END))/2.0     END median FROM BulbLife x, BulbLife y GROUP BY x.Hours HAVING     SUM(CASE WHEN y.Hours <= x.Hours        THEN 1 ELSE 0 END)>=(count(*)+1)/2 AND    SUM(CASE WHEN y.Hours >= x.Hours        THEN 1 ELSE 0 END)>=(count(*)/2)+1 

The basic query remains the same, the only difference being in the SELECT statement's column list. If there is an odd number of cases, the median is reported directly as the x.Hours value. However, if the number of cases is even, the smallest y.Hoursvalue that is higher than the chosen x.Hours value is identified. This value is then added to the x.Hours value. The result is then divided by 2.0 to return the mean of those two values as the query's result.

The original query reported only the lesser of the two values that were in the middle of the sample. The added logic in the SELECT clause for the financial-median causes the mean of the two values to be calculated and reported. In our example, the result is the following:

 median               -------------------  1086.5 

After you run the query to obtain a financial median, you'll get a warning like this:

 Warning: Null value eliminated from aggregate. 

This is an important warning, because it's a sign that our query is working correctly. We did not want to write an ELSE clause for the CASE statement inside the MIN function. Because if we wrote such an ELSE clause and had it return a 0, the result of the MIN function would always be 0, and, consequently, the financial-median calculation would always be wrong for any sample with an even number of values.

   


Transact-SQL Cookbook
Transact-SQL Cookbook
ISBN: 1565927567
EAN: 2147483647
Year: 2005
Pages: 152

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net