14.1 Analytic SQL Overview

   

The types of queries issued by Decision Support Systems (DSS) differ from those issued against OLTP systems. Consider the following business queries:

  • Find the top 10 salespeople in each sales district last year.

  • Find all customers whose total orders last year exceeded 20% of the aggregate sales for their geographic region.

  • Identify the region that suffered the worst quarter-to-quarter sales decline last year.

  • Find the best and worst selling menu items by state for each quarter last year.

Queries such as these are staples of DSS, and are used by managers, analysts, marketing executives, etc., to spot trends, identify outliers, uncover business opportunities, and predict future business performance. DSS systems typically sit atop data warehouses, in which large quantities of scrubbed, aggregated data provide fertile grounds for researching and formulating business decisions.

Although all of the previous queries can be easily expressed in English, they have historically been difficult to formulate using SQL for the following reasons:

  • They may require different levels of aggregation of the same data.

  • They may involve intratable comparisons (comparing one or more rows in a table with other rows in the same table).

  • They may require an extra filtering step after the result set has been sorted (i.e., finding the top 10 and bottom 10 salespeople last month).

Although it is possible to generate the desired results using such SQL features as self joins, inline views, and user-defined functions, the resulting queries can be difficult to understand and might yield unacceptably long execution times. To illustrate the difficulty in formulating such queries, we will walk through the construction of this query: "Find all customers whose total orders in 2001 exceeded 20% of the aggregate sales for their geographic region."

For this and other examples in this chapter, we use a simple star schema consisting of a single fact table (called orders) containing aggregated sales information across the following dimensions: region, salesperson, customer, and month. There are two main facets to this query, each requiring a different level of aggregation of the same data:

  • Sum all sales per region last year.

  • Sum all sales per customer last year.

After these two intermediate result sets have been constructed, each customer's total can be compared to the total for their region to see if it exceeds 20%. The final result set will show the customer names along with their total sales, region name, and the percentage of their region's sales.

The query to aggregate sales by region looks as follows:

SELECT o.region_id region_id, SUM(o.tot_sales) tot_sales FROM orders o WHERE o.year = 2001 GROUP BY o.region_id; REGION_ID  TOT_SALES ---------- ----------          5    6585641          6    6307766          7    6868495          8    6854731          9    6739374         10    6238901

The query to aggregate sales by customer would be:

SELECT o.cust_nbr cust_nbr, o.region_id region_id,   SUM(o.tot_sales) tot_sales FROM orders o WHERE o.year = 2001 GROUP BY o.cust_nbr, o.region_id;   CUST_NBR  REGION_ID  TOT_SALES ---------- ---------- ----------          1          5    1151162          2          5    1224992          3          5    1161286          4          5    1878275          5          5    1169926          6          6    1788836          7          6     971585          8          6    1141638          9          6    1208959         10          6    1196748         11          7    1190421         12          7    1182275         13          7    1310434         14          7    1929774         15          7    1255591         16          8    1068467         17          8    1944281         18          8    1253840         19          8    1174421         20          8    1413722         21          9    1020541         22          9    1036146         23          9    1224992         24          9    1224992         25          9    2232703         26         10    1808949         27         10    1322747         28         10     986964         29         10     903383         30         10    1216858

By placing each of the two queries in an inline view and joining them on region_id, you can identify those customers whose total sales exceeds 20% of their region, as in:

SELECT cust_sales.cust_nbr cust_nbr, cust_sales.region_id region_id,   cust_sales.tot_sales cust_sales, region_sales.tot_sales region_sales FROM  (SELECT o.region_id region_id, SUM(o.tot_sales) tot_sales   FROM orders o   WHERE o.year = 2001   GROUP BY o.region_id) region_sales INNER JOIN  (SELECT o.cust_nbr cust_nbr, o.region_id region_id,     SUM(o.tot_sales) tot_sales   FROM orders o   WHERE o.year = 2001   GROUP BY o.cust_nbr, o.region_id) cust_sales   ON cust_sales.region_id = region_sales.region_id WHERE cust_sales.tot_sales > (region_sales.tot_sales * .2);   CUST_NBR  REGION_ID CUST_SALES REGION_SALES ---------- ---------- ---------- ------------          4          5    1878275      6585641          6          6    1788836      6307766         14          7    1929774      6868495         17          8    1944281      6854731         20          8    1413722      6854731         25          9    2232703      6739374         26         10    1808949      6238901         27         10    1322747      6238901

The final step is to join the region and customer dimensions to include the customer and region names in the result set:

SELECT c.name cust_name,   big_custs.cust_sales cust_sales, r.name region_name,   100 * ROUND(big_custs.cust_sales /      big_custs.region_sales, 2)  percent_of_region FROM   (SELECT cust_sales.cust_nbr cust_nbr, cust_sales.region_id region_id,     cust_sales.tot_sales cust_sales,      region_sales.tot_sales region_sales   FROM    (SELECT o.region_id region_id, SUM(o.tot_sales) tot_sales     FROM orders o     WHERE o.year = 2001     GROUP BY o.region_id) region_sales INNER JOIN    (SELECT o.cust_nbr cust_nbr, o.region_id region_id,       SUM(o.tot_sales) tot_sales     FROM orders o     WHERE o.year = 2001     GROUP BY o.cust_nbr, o.region_id) cust_sales     ON cust_sales.region_id = region_sales.region_id   WHERE cust_sales.tot_sales > (region_sales.tot_sales * .2)) big_custs INNER JOIN  customer c   ON big_custs.cust_nbr = c.cust_nbr   INNER JOIN region r   ON big_custs.region_id = r.region_id; CUST_NAME              CUST_SALES REGION_NAME          PERCENT_OF_REGION ---------------------- ---------- -------------------- ----------------- Flowtech Inc.             1878275 New England                         29 Spartan Industries        1788836 Mid-Atlantic                        28 Madden Industries         1929774 Southeast US                        28 Evans Supply Corp.        1944281 Southwest US                        28 Malden Labs               1413722 Southwest US                        21 Worcester Technologies    2232703 Northwest US                        33 Alpha Technologies        1808949 Central US                          29 Phillips Labs             1322747 Central US                          21

Using nothing more exotic than inline views, therefore, it is possible to construct a single query that generates the desired results. Such a solution, however, has the following shortcomings:

  • The query is fairly complex.

  • Two passes through the same rows of the orders table are required to generate the different aggregation levels needed by the query.

Let's see how we can both simplify the query and perform the same work in a single pass through the orders table using one of the new analytic functions. Rather than issuing two separate queries to aggregate sales per region and per customer, we will create a single query that aggregates sales over both region and customer, and then call an analytic function that performs a second level of aggregation to generate total sales per region:

SELECT o.region_id region_id, o.cust_nbr cust_nbr,  SUM(o.tot_sales) tot_sales,  SUM(SUM(o.tot_sales)) OVER (PARTITION BY o.region_id) region_sales FROM orders o WHERE o.year = 2001 GROUP BY o.region_id, o.cust_nbr; REGION_ID   CUST_NBR  TOT_SALES REGION_SALES ---------- ---------- ---------- ------------          5          1    1151162      6585641          5          2    1224992      6585641          5          3    1161286      6585641          5          4    1878275      6585641          5          5    1169926      6585641          6          6    1788836      6307766          6          7     971585      6307766          6          8    1141638      6307766          6          9    1208959      6307766          6         10    1196748      6307766          7         11    1190421      6868495          7         12    1182275      6868495          7         13    1310434      6868495          7         14    1929774      6868495          7         15    1255591      6868495          8         16    1068467      6854731          8         17    1944281      6854731          8         18    1253840      6854731          8         19    1174421      6854731          8         20    1413722      6854731          9         21    1020541      6739374          9         22    1036146      6739374          9         23    1224992      6739374          9         24    1224992      6739374          9         25    2232703      6739374         10         26    1808949      6238901         10         27    1322747      6238901         10         28     986964      6238901         10         29     903383      6238901         10         30    1216858      6238901

The analytic function can be found in line 3 of the previous query and the result has the alias region_sales. The aggregate function (SUM(o.tot_sales)) in line 2 generates the total sales per customer and region as directed by the GROUP BY clause, and the analytic function in line 3 aggregates these sums for each region, thereby computing the aggregate sales per region. The value for the region_sales column is identical for all customers within the same region and is equal to the sum of all customer sales within that region. We can then wrap the query in an inline view, filter out those customers with less than 20% of their region's total sales, and join the region and customer tables to generate the desired result set:

SELECT c.name cust_name,   cust_sales.tot_sales cust_sales, r.name region_name,   100 * ROUND(cust_sales.tot_sales /      cust_sales.region_sales, 2)  percent_of_region FROM   (SELECT o.region_id region_id, o.cust_nbr cust_nbr,     SUM(o.tot_sales) tot_sales,     SUM(SUM(o.tot_sales)) OVER (PARTITION BY o.region_id) region_sales   FROM orders o   WHERE o.year = 2001   GROUP BY o.region_id, o.cust_nbr) cust_sales INNER JOIN region r   ON cust_sales.region_id = r.region_id   INNER JOIN customer c   ON cust_sales.cust_nbr = c.cust_nbr WHERE cust_sales.tot_sales > (cust_sales.region_sales * .2); CUST_NAME              CUST_SALES REGION_NAME          PERCENT_OF_REGION ---------------------- ---------- -------------------- ----------------- Flowtech Inc.             1878275 New England                         29 Spartan Industries        1788836 Mid-Atlantic                        28 Madden Industries         1929774 Southeast US                        28 Evans Supply Corp.        1944281 Southwest US                        28 Malden Labs               1413722 Southwest US                        21 Worcester Technologies    2232703 Northwest US                        33 Alpha Technologies        1808949 Central US                          29 Phillips Labs             1322747 Central US                          21

Using an inline view saves us from having to join the region and customer tables to the orders table; otherwise, we would have to include columns from the region and customer tables in the GROUP BY clause.

Later in this chapter, under "Reporting Functions," we'll get into the details of how the SUM...OVER function works. For now, you can see that Oracle is performing an aggregation of an aggregation rather than revisiting the detail rows twice. Thus, the query runs faster and should also prove easier to understand and maintain once the syntax is familiar.

Unlike built-in functions such as DECODE, GREATEST, and SUBSTR, Oracle's suite of analytic functions can only be used in the SELECT and ORDER BY clauses of a query. This is because analytic functions are only executed after the FROM, WHERE, GROUP BY, and HAVING clauses have been evaluated. After the analytic functions have executed, the query's ORDER BY clause is evaluated to sort the final result set, and the ORDER BY clause is allowed to reference columns in the SELECT clause generated via analytic functions as well as specify analytic functions not found in the SELECT clause.

The remainder of this chapter introduces the Oracle8i Database and Oracle9i Database analytic functions, grouped by functionality.



Mastering Oracle SQL
Mastering Oracle SQL, 2nd Edition
ISBN: 0596006322
EAN: 2147483647
Year: 2003
Pages: 154

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net