Section 6.7. Result Set Obtained by Aggregation



An extremely common situation is the case in which the result set is a dynamically computed summary of the detailed data from one or more main tables. In other words, we are facing an aggregation of data. When data is aggregated, the size of the result set doesn't depend on the precision of the criteria that are provided, but merely on the cardinality of the columns that we group by. As in the first situation of the small result set obtained through precise criteria (and as you'll see again in Chapter 11), aggregate functions (or aggregates) are also often quite useful for obtaining in a single pass over the table results that are not truly aggregated but that would otherwise require self-joins and multiple passes. In fact, the most interesting SQL uses of aggregates are not the cases in which sums or averages are an obvious part of the requirements, but situations in which a clever use of aggregates provides a pure SQL alternative to procedural processing.

I stress in Chapter 2 that one of the keys to efficient SQL coding is a swashbuckling approach to code execution, testing for success after the deed rather than executing preliminary queries to check if, by chance, the really useful query we want to execute may fail: you cannot win a swimming race by tiptoeing carefully into the water. The other key point is to try to pack as much "action" as possible into an SQL query, and it is in respect to this second key point that aggregate functions can be particularly useful.

Much of the difficulty of good SQL programming lies in seeing how a problem can translate, not into a succession of queries to a database, but into very few queries. When, in a program, you need a lot of intermediate variables to hold values you get from the database before reinjecting them into the database as input to other queries, and if you perform against those variables nothing but very simple tests, you can bet that you have the algorithm wrong. And it is a striking feature of poorly written SQL programs to see the high number of lines of code outside of SQL queries that are simply devoted to summing up, multiplying, dividing, and subtracting inside loops what is painfully returned from the database. This is a totally useless and utterly inefficient job: we have SQL aggregate functions for that sort of work.

Aggregate functions are very useful tools for solving SQL problems (and we will revisit them in Chapter 11, when I talk about stratagems); however, it often appears to me that developers use only the least interesting aggregate function of all, namely count( ), the real usefulness of which is, at best, dubious in most programs.

Chapter 2 shows that using count(*) to decide whether to update an existing row or insert a new one is wasteful. You can misuse count(*) in reports as well. A test for existence is sometimes implemented as a mock-Boolean value such as:

     case count(*)
          when 0 then 'N'
          else 'Y'
     end

Such an implementation fetches all the rows that match the condition in order to obtain a precise count, whereas finding a single one is enough to decide whether Y or N must be displayed. You can usually write a much more effective statement by using a construct that either limits the number of rows returned or tests for existence, effectively stopping processing as soon as a row that matches the condition is found.
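To make the difference concrete, here is a small sketch using SQLite through Python (the table contents are invented for illustration; only the orders table and the order_complete column come from this chapter's example). The count(*) version must tally every matching row, whereas the exists version lets the DBMS stop at the first one:

```python
import sqlite3

# Illustrative schema and data -- invented, not from the book
conn = sqlite3.connect(":memory:")
conn.execute("create table orders (order_id integer, order_complete char(1))")
conn.executemany("insert into orders values (?, ?)",
                 [(1, 'Y'), (2, 'N'), (3, 'Y')])

# count(*)-based flag: counts every matching row just to decide Y or N
count_flag = conn.execute(
    "select case count(*) when 0 then 'N' else 'Y' end "
    "from orders where order_complete = 'N'").fetchone()[0]

# exists-based flag: the DBMS can stop at the first matching row
exists_flag = conn.execute(
    "select case when exists (select null from orders "
    "                         where order_complete = 'N') "
    "            then 'Y' else 'N' end").fetchone()[0]

print(count_flag, exists_flag)  # both report that an incomplete order exists
```

Both queries return the same answer; the point is how much work each one asks of the DBMS to get there.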

But when the question at hand is about the most, the least, the greatest, or even the first or the last, it is likely that aggregate functions (possibly used as OLAP functions) will provide the best answer. If you believe that aggregate functions should be used only when counts, sums, maxima, minima, or averages are explicitly required, then you risk seriously underusing them.

Interestingly, aggregate functions are extremely narrow in scope. If you exclude the computation of maximum and minimum values, the only thing they can really do is simple arithmetic; a count( ) is nothing more than adding 1s for each row encountered. Similarly, the computation of avg( ) is just, on one hand, adding up the values in the column it is applied to and, on the other hand, adding 1s, and then dividing.

But it is sometimes wonderful what you can do with simple sums. If you're mathematically inclined, you'll remember how easily you can switch between sums and products by the magic of logarithms and power functions. And if you're logically inclined, you know well how much OR owes to sums and AND to products.
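Both correspondences can be illustrated in a few lines of plain Python (a sketch, not SQL; the numbers are arbitrary):

```python
import math

# A product can be computed as the exponential of a sum of logarithms
values = [2.0, 5.0, 10.0]
product = math.exp(sum(math.log(v) for v in values))
print(round(product))  # 100

# With flags coded as 0/1: OR behaves like a sum (nonzero if any flag is set),
# while AND behaves like a product (nonzero only if every flag is set)
flags = [1, 0, 1]
any_set = sum(flags) > 0          # OR over the flags
all_set = math.prod(flags) == 1   # AND over the flags
print(any_set, all_set)  # True False
```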

I'll show the power of aggregation with a simple example. Assume that we have a number of shipments to make and that each shipment is made of a number of different orders, each of which has to be separately prepared; it is only when each order in a shipment is complete that the shipment itself is ready. The problem is how to detect when all the orders comprising a shipment are complete.

As is so often the case, there are several ways to determine the shipments that are complete. The worst approach would probably be to loop on all shipments and, for each shipment, count in a second loop how many orders have N as the value of the order_complete column, returning the shipment IDs for which the count is 0. A much better solution would be to recognize the test on the nonexistence of an N value for what it is, and use a subquery, correlated or uncorrelated; for instance:

     select shipment_id
     from shipments
     where not exists (select null
                       from orders
                       where order_complete = 'N'
                         and orders.shipment_id = shipments.shipment_id)

This approach is pretty bad if we have no other condition on the shipments table. Following is a query that may be much more efficient if we have a large shipments table and few uncompleted orders:

     select shipment_id
     from shipments
     where shipment_id not in (select shipment_id
                               from orders
                               where order_complete = 'N')

This query can also be expressed as follows, as a variant that an optimizer may like better but that requires an index on the column shipment_id of the table orders:

     select shipments.shipment_id
     from shipments
          left outer join orders
               on orders.shipment_id = shipments.shipment_id
               and orders.order_complete = 'N'
     where orders.shipment_id is null

Another alternative is a massive set operation that will operate on the primary key index of shipments on one hand, and that will perform a full table scan of orders on the other hand:

     select shipment_id
     from shipments
     except
     select shipment_id
     from orders
     where order_complete = 'N'

Be aware that not all DBMS products implement the except operator, which is sometimes known as minus.

But there is still another way to express our query. What we are doing, basically, is returning the identifiers of all shipments for which a logical AND applied to the completion status of every order returns TRUE. This kind of operation happens to be quite common in the real world. As hinted previously, there is a very strong link between AND and multiplication, and between OR and addition. The key is to convert flags such as Y and N to 0s and 1s. This conversion is a trivial operation with the case construct. To get just order_complete as a 0 or 1 value, we can write:

     select shipment_id,
            case when order_complete = 'Y' then 1
                 else 0
            end flag
     from orders

So far, so good. If we always had a fixed number of orders per shipment, it would be easy to sum the calculated column and check if the result is the number of orders we expect. However, what we want here is to multiply the flag values per shipment and check whether the result is 0 or 1. That approach works, because even one incomplete order, represented by a 0, will cause the final result of all the multiplication to also be 0. The multiplication can be done with the help of logarithms (although 0s are not the easiest values to handle with logarithms). But in this particular case, our task is even easier.
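The multiplicative idea can be sketched outside SQL (standard SQL has no product aggregate, which is precisely why logarithms come up in the first place). The sample data here is invented; shipment 2 deliberately contains one incomplete order:

```python
from collections import defaultdict

# (shipment_id, order_complete) pairs -- invented sample data
orders = [(1, 'Y'), (1, 'Y'), (2, 'Y'), (2, 'N'), (2, 'Y')]

# Multiply the 0/1 flags per shipment: a single 0 (incomplete order)
# forces the whole product for that shipment down to 0
products = defaultdict(lambda: 1)
for shipment_id, complete in orders:
    products[shipment_id] *= 1 if complete == 'Y' else 0

ready = sorted(s for s, p in products.items() if p == 1)
print(ready)  # [1] -- shipment 2 has an incomplete order
```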

What we want are the shipments for which the first order is completed and the second order is completed and...the nth order is completed. Logic and the laws of de Morgan[*] tell us that this is exactly the same as stating that we do not have (first order not completed or second order not completed...or nth order not completed). Since their kinship to sums makes ORs much easier to process with aggregates than ANDs, checking that a list of conditions linked by OR is false is much easier than checking that a list of conditions linked by AND is true. What we must take as our true predicate is "the order is not completed" rather than the reverse, and convert the order_complete flag to 1 if it is N, and to 0 if it is Y. In that way, we can easily check that we have 0s (in other words, yeses) everywhere by summing up values: if the sum is 0, then all orders are completed; otherwise, we are at various stages of incompletion.

[*] The India-born Augustus de Morgan (1806-1871) was a British mathematician who contributed to many areas of mathematics, but most significantly to the field of logic. The de Morgan laws state that the complement of the intersection of any number of sets equals the union of their complements and that the complement of the union of any number of sets equals the intersection of their complements. If you remember that SQL is about sets, and that negating a condition returns the complement of the result set returned by the initial condition (if you have no null values), you'll understand why these laws are particularly useful to the SQL practitioner.

Therefore we can also express our query as:

     select shipment_id
     from (select shipment_id,
                  case when order_complete = 'N' then 1
                       else 0
                  end flag
           from orders) s
     group by shipment_id
     having sum(flag) = 0

And it can be expressed in an even more concise way as:

     select shipment_id
     from orders
     group by shipment_id
     having sum(case when order_complete = 'N' then 1
                     else 0
                end) = 0

There is another way to write this query that is even simpler, using another aggregate function, and without any need to convert flag values. Noticing that Y is, from an alphabetical point of view, greater than N, it is not too difficult to infer that if all values are Y then the minimum will necessarily be Y too. Hence:

     select shipment_id
     from orders
     group by shipment_id
     having min(order_complete) = 'Y'

This approach of depending on Y to be greater than N may not be as well grounded mathematically as the flag-to-number conversion, but it is just as efficient.
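As a sanity check, the aggregate formulation and the not exists formulation can be compared on a small SQLite dataset (schema follows the chapter's shipments/orders example; the rows themselves are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table shipments (shipment_id integer primary key)")
conn.execute("create table orders (shipment_id integer, order_complete char(1))")
conn.executemany("insert into shipments values (?)", [(1,), (2,), (3,)])
conn.executemany("insert into orders values (?, ?)",
                 [(1, 'Y'), (1, 'Y'), (2, 'Y'), (2, 'N'), (3, 'Y')])

# The aggregate version: min() is 'Y' only if every flag is 'Y'
with_aggregate = conn.execute(
    "select shipment_id from orders "
    "group by shipment_id having min(order_complete) = 'Y'").fetchall()

# The correlated-subquery version from earlier in the section
with_not_exists = conn.execute(
    "select shipment_id from shipments "
    "where not exists (select null from orders "
    "                  where order_complete = 'N' "
    "                    and orders.shipment_id = shipments.shipment_id)").fetchall()

print(sorted(with_aggregate), sorted(with_not_exists))  # both: shipments 1 and 3
```

Both formulations return shipments 1 and 3, the only ones with no incomplete order.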

Of course we must see how the query that uses a group by and a condition on the minimum value of order_complete compares to the other versions that use subqueries or except instead of an aggregate function. What we can say is that it has to fully sort the orders table in order to group the rows and compute the aggregate for each shipment. As I've specified the problem, this solution involving a non-trivial use of an aggregate function is likely to be faster than the other queries, which hit two tables (shipments and orders), and usually less efficiently.

I have made extensive use of the having clause in the previous examples. As already mentioned in Chapter 4, a common example of careless SQL programming involves the use of the having clause in aggregate statements. Such an example is illustrated in the following (Oracle) query, which attempts to obtain the sales per product per week during the past month:

     select product_id,
            trunc(sale_date, 'WEEK'),
            sum(sold_qty)
     from sales_history
     group by product_id, trunc(sale_date, 'WEEK')
     having trunc(sale_date, 'WEEK') >= add_months(sysdate, -1)

The mistake here is that the condition expressed in the having clause doesn't depend on the aggregate. As a result, the DBMS has to process all of the data in sales_history, sorting and aggregating it, before filtering out the ancient figures as the last step before returning the required rows. This is the kind of mistake that can go unnoticed until sales_history grows really big. The proper approach is, of course, to put the condition in a where clause, ensuring that the filtering occurs at an early stage and that we are afterwards working on a much reduced set of data.
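The same pattern can be reproduced in SQLite (the Oracle-specific trunc and add_months have no SQLite equivalent, so a plain date cutoff stands in for them, and the data is invented): the two queries return the same rows, but only the where version lets the DBMS discard old rows before sorting and aggregating.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table sales_history"
             " (product_id int, sale_date text, sold_qty int)")
conn.executemany("insert into sales_history values (?, ?, ?)",
                 [(1, '2005-01-03', 10), (1, '2005-06-06', 5),
                  (2, '2005-06-07', 8), (2, '2004-12-20', 3)])

cutoff = '2005-06-01'

# Filter in having: everything is aggregated first, old rows discarded last
late_filter = conn.execute(
    "select product_id, sale_date, sum(sold_qty) from sales_history "
    "group by product_id, sale_date "
    "having sale_date >= ?", (cutoff,)).fetchall()

# Filter in where: old rows never reach the sort/aggregate step
early_filter = conn.execute(
    "select product_id, sale_date, sum(sold_qty) from sales_history "
    "where sale_date >= ? "
    "group by product_id, sale_date", (cutoff,)).fetchall()

print(sorted(late_filter) == sorted(early_filter))  # same rows, different cost
```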

I should note that when we apply criteria to views, which are aggregated results, we may encounter exactly the same problem if the optimizer is not smart enough to reinject our filter before aggregation.

You can have slightly more subtle variants of a filter applied later than it should be. For instance:

     select customer_id
     from orders
     where order_date < add_months(sysdate, -1)
     group by customer_id
     having sum(amount) > 0

In this query, the following condition looks at first glance like a reasonable use of having:

     having sum(amount) > 0 

However, this use of having does not really make sense if amount is always a positive quantity or zero. In that event, we might be better off using the following condition:

     where amount > 0 

We have two possibilities here. Either we keep the group by:

     select customer_id
     from orders
     where order_date < add_months(sysdate, -1)
       and amount > 0
     group by customer_id

or we notice that group by is no longer required to compute any aggregate and replace it with a distinct that in this case performs the same task of sorting and eliminating duplicates:

     select distinct customer_id
     from orders
     where order_date < add_months(sysdate, -1)
       and amount > 0

Placing the condition in the where clause allows unwanted rows to be filtered at an earlier stage, and therefore more effectively.

Aggregate as little data as you can.




The Art of SQL
ISBN: 0596008945