5.5 Subquery Case Study: The Top N Performers

   

Certain queries that are easily described in English have traditionally been difficult to formulate in SQL. One common example is the "Find the top five salespeople" query. The complexity stems from the fact that data from a table must first be aggregated, and then the aggregated values must be sorted and compared to one another to identify the top or bottom performers. In this section, you will see how subqueries may be used to answer such questions. At the end of the section, we introduce ranking functions, a feature of Oracle SQL that was specifically designed for these types of queries.

5.5.1 A Look at the Data

Consider the problem of finding the top five salespeople. Let's assume that we are basing our evaluation on the amount of revenue each salesperson brought in during the previous year. The first task, then, would be to sum the dollar amount of all orders booked by each salesperson during the year in question. To do so, we will dip into our data warehouse, in which orders have been aggregated by salesperson, year, month, customer, and region. The following query generates total sales per salesperson for the year 2001:

SELECT s.name employee, SUM(o.tot_sales) total_sales FROM orders o INNER JOIN salesperson s ON o.salesperson_id = s.salesperson_id WHERE o.year = 2001 GROUP BY s.name ORDER BY 2 DESC; EMPLOYEE                       TOTAL_SALES ------------------------------ ----------- Jeff Blake                         1927580 Sam Houseman                       1814327 Mark Russell                       1784596 John Boorman                       1768813 Carl Isaacs                        1761814 Tim McGowan                        1761814 Chris Anderson                     1757883 Bill Evans                         1737093 Jim Fletcher                       1735575 Mary Dunn                          1723305 Dave Jacobs                        1710831 Chuck Thomas                       1695124 Greg Powers                        1688252 Don Walters                        1672522 Alex Fox                           1645204 Barbara King                       1625456 Lynn Nichols                       1542152 Karen Young                        1516776 Bob Grossman                       1501039 Eric Iverson                       1468316 Tom Freeman                        1461898 Andy Levitz                        1458053 Laura Peters                       1443837 Susan Jones                        1392648

It appears that Isaacs and McGowan have tied for fifth place, which, as you will see, adds an interesting wrinkle to the problem.

5.5.2 Your Assignment

It seems that the boss was so tickled with this year's sales that she has asked you, the IT manager, to see that each of the top five salespeople receive a bonus equal to 1% of their yearly sales. No problem, you say. You quickly throw together the following report using your favorite feature, the inline view, and send it off to the boss:

SELECT s.name employee, top5_emp_orders.tot_sales total_sales,   ROUND(top5_emp_orders.tot_sales * 0.01) bonus FROM   (SELECT all_emp_orders.salesperson_id emp_id,      all_emp_orders.tot_sales tot_sales   FROM    (SELECT salesperson_id, SUM(tot_sales) tot_sales     FROM orders     WHERE year = 2001     GROUP BY salesperson_id     ORDER BY 2 DESC    ) all_emp_orders   WHERE ROWNUM <= 5  ) top5_emp_orders INNER JOIN salesperson s ON top5_emp_orders.emp_id = s.salesperson_id ORDER BY 2 DESC; EMPLOYEE                       TOTAL_SALES      BONUS ------------------------------ ----------- ---------- Jeff Blake                         1927580      19276 Sam Houseman                       1814327      18143 Mark Russell                       1784596      17846 John Boorman                       1768813      17688 Tim McGowan                        1761814      17618

The howl emitted by Isaacs can be heard for five square blocks. The boss, looking a bit harried, asks you to take another stab at it. Upon reviewing your query, the problem becomes immediately evident; the inline view aggregates the sales data and sorts the results, and the containing query grabs the first five sorted rows and discards the rest. Although it could easily have been McGowan, since there is no second sort column, Isaacs was arbitrarily omitted from the result set.

5.5.3 Second Attempt

You console yourself with the fact that you gave the boss exactly what she asked for: the top five salespeople. However, you realize that part of your job as IT manager is to give people what they need, not necessarily what they ask for, so you rephrase the boss's request as follows: give a bonus to all salespeople whose total sales ranked in the top five last year. This will require two steps: find the fifth highest sales total last year, and then find all salespeople whose total sales meet or exceed that figure. You write a new query as follows:

SELECT s.name employee, top5_emp_orders.tot_sales total_sales,   ROUND(top5_emp_orders.tot_sales * 0.01) bonus FROM salesperson s INNER JOIN  (SELECT salesperson_id, SUM(tot_sales) tot_sales   FROM orders   WHERE year = 2001   GROUP BY salesperson_id   HAVING SUM(tot_sales) IN    (SELECT all_emp_orders.tot_sales     FROM       (SELECT SUM(tot_sales) tot_sales       FROM orders       WHERE year = 2001       GROUP BY salesperson_id       ORDER BY 1 DESC      ) all_emp_orders     WHERE ROWNUM <= 5)  ) top5_emp_orders ON top5_emp_orders.salesperson_id = s.salesperson_id ORDER BY 2 DESC; EMPLOYEE                       TOTAL_SALES      BONUS ------------------------------ ----------- ---------- Jeff Blake                         1927580      19276 Sam Houseman                       1814327      18143 Mark Russell                       1784596      17846 John Boorman                       1768813      17688 Tim McGowan                        1761814      17618 Carl Isaacs                        1761814      17618

Thus, there are actually six top five salespeople. The main difference between your first attempt and the second is the addition of the HAVING clause in the inline view. The subquery in the HAVING clause returns the five highest sales totals, and the inline view then returns all salespeople (potentially more than five) whose total sales exist in the set returned by the subquery.

Although you are confident in your latest results, there are several aspects of the query that bother you:

  • The aggregation of sales data is performed twice.

  • The query will never contend for Most Elegant Query of the Year.

  • You could've sworn you read about some sort of feature just for handling these types of queries . . .

In fact, there is a feature, an analytic SQL feature, for performing ranking queries that became available with Oracle8i. That feature is the RANK function.

5.5.4 Final Answer

The RANK function is specifically designed to help you write queries to answer questions like the one posed in this case study. Part of a set of analytic functions (all of which will be explored in Chapter 14), the RANK function may be used to assign a ranking to each element of a set. The RANK function understands that there may be ties in the set of values being ranked and leaves gaps in the ranking to compensate. The following query illustrates how rankings would be assigned to the entire set of salespeople; notice how the RANK function leaves a gap between the fifth and seventh rankings to compensate for the fact that two rows share the fifth spot in the ranking:

SELECT salesperson_id, SUM(tot_sales) tot_sales,   RANK( ) OVER (ORDER BY SUM(tot_sales) DESC) sales_rank FROM orders WHERE year = 2001 GROUP BY salesperson_id; SALESPERSON_ID  TOT_SALES SALES_RANK -------------- ---------- ----------              1    1927580          1             14    1814327          2             24    1784596          3              8    1768813          4             15    1761814          5             16    1761814          5             20    1757883          7             11    1737093          8              9    1735575          9             10    1723305         10             17    1710831         11              4    1695124         12              5    1688252         13             12    1672522         14             19    1645204         15             18    1625456         16             21    1542152         17             13    1516776         18              3    1501039         19             22    1468316         20              2    1461898         21              7    1458053         22             23    1443837         23              6    1392648         24

Leaving gaps in the rankings whenever ties are encountered is critical for properly handling these types of queries. (If you do not wish to have gaps in the ranking, you can use the DENSE_RANK function intead.) Table 5-1 shows the number of rows that would be returned for this data set for various top-N queries.

Table 5-1. Rows returned for N = {1,2,3,...,9}

Top-N salespeople

Rows returned

1

1

2

2

3

3

4

4

5

6

6

6

7

7

8

8

9

9


As you can see, the result sets would be identical for both the "top five" and "top six" versions of this query for this particular data set.

By wrapping the previous RANK query in an inline view, you can retrieve the salespeople with a ranking of five or less and join the results to the salesperson table to generate the final result set:

SELECT s.name employee, top5_emp_orders.tot_sales total_sales,   ROUND(top5_emp_orders.tot_sales * 0.01) bonus FROM  (SELECT all_emp_orders.salesperson_id emp_id,      all_emp_orders.tot_sales tot_sales   FROM    (SELECT salesperson_id, SUM(tot_sales) tot_sales,       RANK( ) OVER (ORDER BY SUM(tot_sales) DESC) sales_rank     FROM orders     WHERE year = 2001     GROUP BY salesperson_id    ) all_emp_orders   WHERE all_emp_orders.sales_rank <= 5  ) top5_emp_orders INNER JOIN salesperson s ON top5_emp_orders.emp_id = s.salesperson_id ORDER BY 2 DESC; EMPLOYEE                       TOTAL_SALES      BONUS ------------------------------ ----------- ---------- Jeff Blake                         1927580      19276 Sam Houseman                       1814327      18143 Mark Russell                       1784596      17846 John Boorman                       1768813      17688 Tim McGowan                        1761814      17618 Carl Isaacs                        1761814      17618

If this query looks familiar, that's because it's almost identical to the first attempt, except that the RANK function is used instead of the pseudocolumn ROWNUM to determine where to draw the line between the top five salespeople and the rest of the pack.

Now that you are happy with your query and confident in your results, you show your findings to your boss. "Nice work," she says. "Why don't you give yourself a bonus as well? In fact, you can have Isaacs's bonus, since he quit this morning." Salespeople can be so touchy.



Mastering Oracle SQL
Mastering Oracle SQL, 2nd Edition
ISBN: 0596006322
EAN: 2147483647
Year: 2003
Pages: 154

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net