Recipe 8.7. Dividing a Summary into Subgroups

Problem

You want to calculate a summary for each subgroup of a set of rows, not an overall summary value.

Solution

Use a GROUP BY clause to arrange rows into groups.

Discussion

The summary statements shown so far calculate summary values over all rows in the result set. For example, the following statement determines the number of records in the mail table, and thus the total number of mail messages that have been sent:

mysql> SELECT COUNT(*) FROM mail; +----------+ | COUNT(*) | +----------+ |       16 | +----------+

Sometimes it's desirable to break a set of rows into subgroups and summarize each group. Do this by using aggregate functions in conjunction with a GROUP BY clause. To determine the number of messages per sender, group the rows by sender name, count how many times each name occurs, and display the names with the counts:

mysql> SELECT srcuser, COUNT(*) FROM mail     -> GROUP BY srcuser; +---------+----------+ | srcuser | COUNT(*) | +---------+----------+ | barb    |        3 | | gene    |        6 | | phil    |        5 | | tricia  |        2 | +---------+----------+

That query summarizes the same column that is used for grouping (srcuser), but that's not always necessary. Suppose that you want a quick characterization of the mail table, showing for each sender listed in it the total amount of traffic sent (in bytes) and the average number of bytes per message. In this case, you still use the srcuser column to place the rows in groups, but the summary functions operate on the size values:

mysql> SELECT srcuser,     -> SUM(size) AS 'total bytes',     -> AVG(size) AS 'bytes per message'     -> FROM mail GROUP BY srcuser; +---------+-------------+-------------------+ | srcuser | total bytes | bytes per message | +---------+-------------+-------------------+ | barb    |      156696 |        52232.0000 | | gene    |     1033108 |       172184.6667 | | phil    |       18974 |         3794.8000 | | tricia  |     2589407 |      1294703.5000 | +---------+-------------+-------------------+

Use as many grouping columns as necessary to achieve as fine-grained a summary as you require. The earlier query that shows the number of messages per sender is a coarse summary. To be more specific and find out how many messages each sender sent from each host, use two grouping columns. This produces a result with nested groups (groups within groups):

mysql> SELECT srcuser, srchost, COUNT(srcuser) FROM mail     -> GROUP BY srcuser, srchost; +---------+---------+----------------+ | srcuser | srchost | COUNT(srcuser) | +---------+---------+----------------+ | barb    | saturn  |              2 | | barb    | venus   |              1 | | gene    | mars    |              2 | | gene    | saturn  |              2 | | gene    | venus   |              2 | | phil    | mars    |              3 | | phil    | venus   |              2 | | tricia  | mars    |              1 | | tricia  | saturn  |              1 | +---------+---------+----------------+

The preceding examples in this section have used COUNT⁠(⁠ ⁠ ⁠), SUM⁠(⁠ ⁠ ⁠), and AVG⁠(⁠ ⁠ ⁠) for per-group summaries. You can use MIN⁠(⁠ ⁠ ⁠) or MAX⁠(⁠ ⁠ ⁠), too. With a GROUP BY clause, they will tell you the smallest or largest value per group. The following query groups mail table rows by message sender, displaying for each the size of the largest message sent and the date of the most recent message:

mysql> SELECT srcuser, MAX(size), MAX(t) FROM mail GROUP BY srcuser; +---------+-----------+---------------------+ | srcuser | MAX(size) | MAX(t)              | +---------+-----------+---------------------+ | barb    |     98151 | 2006-05-14 14:42:21 | | gene    |    998532 | 2006-05-19 22:21:51 | | phil    |     10294 | 2006-05-17 12:49:23 | | tricia  |   2394482 | 2006-05-14 17:03:01 | +---------+-----------+---------------------+

You can group by multiple columns and display a maximum for each combination of values in those columns. This query finds the size of the largest message sent between each pair of sender and recipient values listed in the mail table:

mysql> SELECT srcuser, dstuser, MAX(size) FROM mail GROUP BY srcuser, dstuser; +---------+---------+-----------+ | srcuser | dstuser | MAX(size) | +---------+---------+-----------+ | barb    | barb    |     98151 | | barb    | tricia  |     58274 | | gene    | barb    |      2291 | | gene    | gene    |     23992 | | gene    | tricia  |    998532 | | phil    | barb    |     10294 | | phil    | phil    |      1048 | | phil    | tricia  |      5781 | | tricia  | gene    |    194925 | | tricia  | phil    |   2394482 | +---------+---------+-----------+

When using aggregate functions to produce per-group summary values, watch out for the following trap, which involves selecting nonsummary table columns not related to the grouping columns. Suppose that you want to know the longest trip per driver in the driver_log table. That's produced by this query:

mysql> SELECT name, MAX(miles) AS 'longest trip'     -> FROM driver_log GROUP BY name; +-------+--------------+ | name  | longest trip | +-------+--------------+ | Ben   |          152 | | Henry |          300 | | Suzi  |          502 | +-------+--------------+

But what if you also want to show the date on which each driver's longest trip occurred? Can you just add TRav_date to the output column list? Sorry, that won't work:

mysql> SELECT name, trav_date, MAX(miles) AS 'longest trip'     -> FROM driver_log GROUP BY name; +-------+------------+--------------+ | name  | trav_date  | longest trip | +-------+------------+--------------+ | Ben   | 2006-08-30 |          152 | | Henry | 2006-08-29 |          300 | | Suzi  | 2006-08-29 |          502 | +-------+------------+--------------+

The query does produce a result, but if you compare it to the full table (shown following), you'll see that although the dates for Ben and Henry are correct, the date for Suzi is not:

+--------+-------+------------+-------+ | rec_id | name  | trav_date  | miles | +--------+-------+------------+-------+ |      1 | Ben   | 2006-08-30 |   152 |     Bens longest trip |      2 | Suzi  | 2006-08-29 |   391 | |      3 | Henry | 2006-08-29 |   300 |     Henrys longest trip |      4 | Henry | 2006-08-27 |    96 | |      5 | Ben   | 2006-08-29 |   131 | |      6 | Henry | 2006-08-26 |   115 | |      7 | Suzi  | 2006-09-02 |   502 |     Suzis longest trip |      8 | Henry | 2006-09-01 |   197 | |      9 | Ben   | 2006-09-02 |    79 | |     10 | Henry | 2006-08-30 |   203 | +--------+-------+------------+-------+

So what's going on? Why does the summary statement produce incorrect results? This happens because when you include a GROUP BY clause in a query, the only values that you can select are the grouped columns or summary values calculated from the groups. If you display additional table columns, they're not tied to the grouped columns and the values displayed for them are indeterminate. (For the statement just shown, it appears that MySQL may simply be picking the first date for each driver, regardless of whether it matches the driver's maximum mileage value.)

The general solution to the problem of displaying contents of rows associated with minimum or maximum group values involves a join. The technique is described in Chapter 12. For the problem at hand, the required results are produced as follows:

mysql> CREATE TABLE t     -> SELECT name, MAX(miles) AS miles FROM driver_log GROUP BY name; mysql> SELECT d.name, d.trav_date, d.miles AS 'longest trip'     -> FROM driver_log AS d INNER JOIN t USING (name, miles) ORDER BY name; +-------+------------+--------------+ | name  | trav_date  | longest trip | +-------+------------+--------------+ | Ben   | 2006-08-30 |          152 | | Henry | 2006-08-29 |          300 | | Suzi  | 2006-09-02 |          502 | +-------+------------+--------------+