7.8.1 Problem
You want to calculate a summary for each subgroup of a set of rows, not an overall summary value.
7.8.2 Solution
Use a GROUP BY clause to arrange rows into groups.
7.8.3 Discussion
The summary queries shown so far calculate summary values over all rows in the result set. For example, the following query determines the number of daily driving records in the driver_log table, and thus the total number of days that drivers were on the road:
mysql> SELECT COUNT(*) FROM driver_log;
+----------+
| COUNT(*) |
+----------+
|       10 |
+----------+
But sometimes it's desirable to break a set of rows into subgroups and summarize each group. This is done by using aggregate functions in conjunction with a GROUP BY clause. To determine the number of days driven by each driver, group the rows by driver name, count how many rows there are for each name, and display the names with the counts:
mysql> SELECT name, COUNT(name) FROM driver_log GROUP BY name;
+-------+-------------+
| name  | COUNT(name) |
+-------+-------------+
| Ben   |           3 |
| Henry |           5 |
| Suzi  |           2 |
+-------+-------------+
That query summarizes the same column used for grouping (name), but that's not always necessary. Suppose you want a quick characterization of the driver_log table, showing for each person listed in it the total number of miles driven and the average number of miles per day. In this case, you still use the name column to place the rows in groups, but the summary functions operate on the miles values:
mysql> SELECT name,
    -> SUM(miles) AS 'total miles',
    -> AVG(miles) AS 'miles per day'
    -> FROM driver_log GROUP BY name;
+-------+-------------+---------------+
| name  | total miles | miles per day |
+-------+-------------+---------------+
| Ben   |         362 |      120.6667 |
| Henry |         911 |      182.2000 |
| Suzi  |         893 |      446.5000 |
+-------+-------------+---------------+
Use as many grouping columns as necessary to achieve as fine-grained a summary as you require. The following query produces a coarse summary showing how many messages were sent by each message sender listed in the mail table:
mysql> SELECT srcuser, COUNT(*) FROM mail
    -> GROUP BY srcuser;
+---------+----------+
| srcuser | COUNT(*) |
+---------+----------+
| barb    |        3 |
| gene    |        6 |
| phil    |        5 |
| tricia  |        2 |
+---------+----------+
To be more specific and find out how many messages each sender sent from each host, use two grouping columns. This produces a result with nested groups (groups within groups):
mysql> SELECT srcuser, srchost, COUNT(*) FROM mail
    -> GROUP BY srcuser, srchost;
+---------+---------+----------+
| srcuser | srchost | COUNT(*) |
+---------+---------+----------+
| barb    | saturn  |        2 |
| barb    | venus   |        1 |
| gene    | mars    |        2 |
| gene    | saturn  |        2 |
| gene    | venus   |        2 |
| phil    | mars    |        3 |
| phil    | venus   |        2 |
| tricia  | mars    |        1 |
| tricia  | saturn  |        1 |
+---------+---------+----------+
The preceding examples in this section have used COUNT( ), SUM( ) and AVG( ) for per-group summaries. You can use MIN( ) or MAX( ), too. With a GROUP BY clause, they will tell you the smallest or largest value per group. The following query groups mail table rows by message sender, displaying for each one the size of the largest message sent and the date of the most recent message:
mysql> SELECT srcuser, MAX(size), MAX(t) FROM mail GROUP BY srcuser;
+---------+-----------+---------------------+
| srcuser | MAX(size) | MAX(t)              |
+---------+-----------+---------------------+
| barb    |     98151 | 2001-05-14 14:42:21 |
| gene    |    998532 | 2001-05-19 22:21:51 |
| phil    |     10294 | 2001-05-17 12:49:23 |
| tricia  |   2394482 | 2001-05-14 17:03:01 |
+---------+-----------+---------------------+
You can group by multiple columns and display a maximum for each combination of values in those columns. This query finds the size of the largest message sent between each pair of sender and recipient values listed in the mail table:
mysql> SELECT srcuser, dstuser, MAX(size) FROM mail GROUP BY srcuser, dstuser;
+---------+---------+-----------+
| srcuser | dstuser | MAX(size) |
+---------+---------+-----------+
| barb    | barb    |     98151 |
| barb    | tricia  |     58274 |
| gene    | barb    |      2291 |
| gene    | gene    |     23992 |
| gene    | tricia  |    998532 |
| phil    | barb    |     10294 |
| phil    | phil    |      1048 |
| phil    | tricia  |      5781 |
| tricia  | gene    |    194925 |
| tricia  | phil    |   2394482 |
+---------+---------+-----------+
When using aggregate functions to produce per-group summary values, watch out for the following trap. Suppose you want to know the longest trip per driver in the driver_log table. That's produced by this query:
mysql> SELECT name, MAX(miles) AS 'longest trip'
    -> FROM driver_log GROUP BY name;
+-------+--------------+
| name  | longest trip |
+-------+--------------+
| Ben   |          152 |
| Henry |          300 |
| Suzi  |          502 |
+-------+--------------+
But what if you also want to show the date on which each driver's longest trip occurred? Can you just add trav_date to the output column list? Sorry, that won't work:
mysql> SELECT name, trav_date, MAX(miles) AS 'longest trip'
    -> FROM driver_log GROUP BY name;
+-------+------------+--------------+
| name  | trav_date  | longest trip |
+-------+------------+--------------+
| Ben   | 2001-11-30 |          152 |
| Henry | 2001-11-29 |          300 |
| Suzi  | 2001-11-29 |          502 |
+-------+------------+--------------+
The query does produce a result, but if you compare it to the full table (shown below), you'll see that although the dates for Ben and Henry are correct, the date for Suzi is not:
+--------+-------+------------+-------+
| rec_id | name  | trav_date  | miles |
+--------+-------+------------+-------+
|      1 | Ben   | 2001-11-30 |   152 | <-- Ben's longest trip
|      2 | Suzi  | 2001-11-29 |   391 |
|      3 | Henry | 2001-11-29 |   300 | <-- Henry's longest trip
|      4 | Henry | 2001-11-27 |    96 |
|      5 | Ben   | 2001-11-29 |   131 |
|      6 | Henry | 2001-11-26 |   115 |
|      7 | Suzi  | 2001-12-02 |   502 | <-- Suzi's longest trip
|      8 | Henry | 2001-12-01 |   197 |
|      9 | Ben   | 2001-12-02 |    79 |
|     10 | Henry | 2001-11-30 |   203 |
+--------+-------+------------+-------+
So what's going on? Why does the summary query produce incorrect results? This happens because when you include a GROUP BY clause in a query, the only values you can meaningfully select are the grouping columns and summary values calculated over each group. If you display additional columns, they're not tied to the grouping columns and the values displayed for them are indeterminate. (For the query just shown, it appears that MySQL may simply be picking the first date for each driver, whether or not it matches the driver's maximum mileage value.)
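As a hedged illustration of this rule: servers that support the ONLY_FULL_GROUP_BY SQL mode can be told to reject such a query instead of returning an indeterminate value. The sketch below assumes that your server version lets you set sql_mode for the current session. The exact error message varies by version, so it is not shown here.

```sql
-- Assumption: the server supports session-level sql_mode changes.
-- Note: this assignment replaces any other modes set for the session.
SET SESSION sql_mode = 'ONLY_FULL_GROUP_BY';

-- trav_date is neither a grouping column nor an argument to an
-- aggregate function, so with the mode enabled this statement should
-- be rejected with an error rather than returning an arbitrary date.
SELECT name, trav_date, MAX(miles) AS 'longest trip'
FROM driver_log GROUP BY name;
```

Running with this mode enabled while developing queries is one way to catch the trap early, because the mistake shows up as an error instead of as plausible-looking output.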
The general solution to the problem of displaying contents of rows associated with minimum or maximum group values involves a join. The technique is described in Chapter 12. If you don't want to read ahead, or you don't want to use another table, consider using the MAX-CONCAT trick described earlier. It produces the correct result, although the query is fairly ugly:
mysql> SELECT name,
    -> SUBSTRING(MAX(CONCAT(LPAD(miles,3,' '), trav_date)),4) AS date,
    -> LEFT(MAX(CONCAT(LPAD(miles,3,' '), trav_date)),3) AS 'longest trip'
    -> FROM driver_log GROUP BY name;
+-------+------------+--------------+
| name  | date       | longest trip |
+-------+------------+--------------+
| Ben   | 2001-11-30 | 152          |
| Henry | 2001-11-29 | 300          |
| Suzi  | 2001-12-02 | 502          |
+-------+------------+--------------+
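For comparison, here is a minimal sketch of the join-based approach mentioned above. It is not necessarily the form used in Chapter 12. It assumes a MySQL version that supports subqueries in the FROM clause, and the aliases d and m are chosen for illustration only.

```sql
-- Step 1 (derived table m): find each driver's maximum mileage.
-- Step 2: join back to driver_log to fetch the full row that holds
-- that maximum, so trav_date comes from the matching trip.
SELECT d.name, d.trav_date, d.miles AS 'longest trip'
FROM driver_log AS d
INNER JOIN (
    SELECT name, MAX(miles) AS miles
    FROM driver_log
    GROUP BY name
) AS m
    ON d.name = m.name AND d.miles = m.miles
ORDER BY d.name;
```

For the driver_log rows shown earlier, this should pair each driver with the same date as the MAX-CONCAT query. Unlike that query, it returns one row per matching trip, so a driver with two trips tied for the maximum would appear twice.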