8.5 Calculating a Median8.5.1 ProblemYou want to calculate the median bulb life for all bulbs that have been tested . From the discussion earlier in this chapter, you should recognize that the median bulb life represents the case where the number of bulbs with shorter lives is equivalent to the number of bulbs with longer lives. 8.5.2 SolutionTo calculate the median of the light-bulb test results, use the following query: SELECT x.Hours median FROM BulbLife x, BulbLife y GROUP BY x.Hours HAVING SUM(CASE WHEN y.Hours <= x.Hours THEN 1 ELSE 0 END)>=(COUNT(*)+1)/2 AND SUM(CASE WHEN y.Hours >= x.Hours THEN 1 ELSE 0 END)>=(COUNT(*)/2)+1 median ----------- 1086 8.5.3 DiscussionThis query follows the definition of the median very closely and uses the solution published several years ago by David Rozenshtein, Anatoly Abramovich, and Eugene Birger. Their solution is still regarded as one of the classical solutions to the problem of finding the median value in a sample. To understand their solution, it helps to look at the query in two phases. First, you have a GROUP BY query that returns the number of bulbs for each distinct lifetime. The following is a modified version of the first part of the solution query that returns the occurrence count corresponding to each distinct bulb-life value: SELECT COUNT(*) occurrences, x.Hours xhours FROM BulbLife x, BulbLife y GROUP BY x.Hours occurrences xhours ----------- ----------- 40 1043 20 1057 20 1073 20 1074 20 1077 20 1079 40 1085 20 1086 20 1087 20 1093 20 1096 20 1099 20 1103 20 1109 20 1114 20 1129 20 1131 20 1167 Because these results represent a self-join of the BulbLife table with itself, each group represents a number of detail rows equivalent to the number of rows in the sample. The two groups of 40 occurrences each exist because the data contains 2 cases with values of 1043 and 2 cases with values of 1085. The detail for the 1086 group is as follows: xhours yhours ----------- ----------- 1086 1043 1086 1043 1086 1057 1086 1073 1086 1074 1086 1077 1086 1079 1086 1085 1086 1085 1086 1086 1086 1087 1086 1093 1086 1096 1086 1099 1086 1103 1086 1109 1086 1114 1086 1129 1086 1131 1086 1167 The question now is whether the value 1086 represents the median. To determine that, follow these steps:
The HAVING clause in our solution query performs the counts for steps 1 and 2 using the following two invocations of the SUM function combined with a CASE statement: SUM(CASE WHEN y.Hours <= x.Hours THEN 1 ELSE 0 END) SUM(CASE WHEN y.Hours >= x.Hours THEN 1 ELSE 0 END) In our example, the values for these 2 sums work out to 10 and 11, respectively. Plug these two values in for the two SUM expressions in the HAVING clause, and you have the following: 10 >= (COUNT(*)+1)/2 AND 11 >= (COUNT(*)/2)+1 At this point, the two COUNT expressions deserve some additional explanation. They have been carefully crafted to allow us to derive a median, even in cases where we have an even number of values in the sample. Let's step back for a moment, and assume that our sample contained 21 values, instead of the 20 that it does contain. If that were the case, the two COUNT expressions would evaluate as follows: (COUNT(*)+1)/2 (COUNT(*)/2)+1 (21+1)/2 (21/2)+1 22/2 10+1 11 11
Whenever you have an odd number of values in the sample, the two expressions will yield the same result. Given an even number, however, the first expression will yield a result that is one less than the other. Here is how the HAVING expression works out for the data in our example: 10 >= (20+1)/2 AND 11 >= (20/2)+1 10 >= 10 AND 11 >= 11 For the case where x.Hours = 1086, both expressions are true, so 1086 is returned as the median value. In actual fact, because we have an even number of values, there are 2 candidates for the median: 1086 and 1087. The value 1086 has 9 values below it and 10 above it. The value for 1087 has 10 values below it and 9 above it. Due to how we've written the COUNT expressions, our solution query arbitrarily returns the lower value as the median. It's possible to use a slightly modified version of our solution query to return the financial median. Recall, from earlier in this chapter, that in the case of an even number of values, the financial median represents the mean of the two inner neighbors. With respect to our example, that would be the mean of 1086 and 1087, which works out to 1086.5. Use the following query to calculate the financial median: SELECT CASE WHEN COUNT(*)%2=1 THEN x.Hours ELSE (x.Hours+MIN(CASE WHEN y.Hours>x.Hours THEN y.Hours END))/2.0 END median FROM BulbLife x, BulbLife y GROUP BY x.Hours HAVING SUM(CASE WHEN y.Hours <= x.Hours THEN 1 ELSE 0 END)>=(count(*)+1)/2 AND SUM(CASE WHEN y.Hours >= x.Hours THEN 1 ELSE 0 END)>=(count(*)/2)+1 The basic query remains the same, the only difference being in the SELECT statement's column list. If there is an odd number of cases, the median is reported directly as the x.Hours value. However, if the number of cases is even, the smallest y.Hoursvalue that is higher than the chosen x.Hours value is identified. This value is then added to the x.Hours value. The result is then divided by 2.0 to return the mean of those two values as the query's result. The original query reported only the lesser of the two values that were in the middle of the sample. The added logic in the SELECT clause for the financial-median causes the mean of the two values to be calculated and reported. In our example, the result is the following: median ------------------- 1086.5 After you run the query to obtain a financial median, you'll get a warning like this: Warning: Null value eliminated from aggregate. This is an important warning, because it's a sign that our query is working correctly. We did not want to write an ELSE clause for the CASE statement inside the MIN function. Because if we wrote such an ELSE clause and had it return a 0, the result of the MIN function would always be 0, and, consequently, the financial-median calculation would always be wrong for any sample with an even number of values. |