Grouping Data | Beginning Transact-SQL With SQL Server 2000 and 2005

So far, your work with aggregate functions has been for a group of records that return a single value. An operation that returns a single value is known as a scalar result. Although this may be appropriate for very simple data analysis, aggregate functions can be used in far more sophisticated and useful ways. Groups are used to distill rows with common column values into one row. This gives you the opportunity to perform aggregated calculations on each of the groupings. There are some restrictions and it's important to understand the rules regarding groups. Columns returned by a grouped query must either be referenced in the GROUP BY list or use an aggregate function. Other columns can be used for filtering or sorting but these column values cannot be returned in the result set.

GROUP BY

Grouping occurs after records are retrieved and then aggregated. The GROUP BY clause is added to the query after the WHERE and ORDER BY clauses. Consider that the query runs first without the aggregate functions and grouping to determine which rows will be considered for grouping. After these results are read into memory, SQL Server makes a pass through these records, applying groupings and aggregate calculations.

Consider the following example, using the SUM() function:

 SELECT SalesOrderID, SUM(OrderQty) FROM SalesOrderDetail  GROUP BY SalesOrderID

The SalesOrderID value can be returned because it appears in the GROUP BY list. The query will return one distinct row for each SalesOrderID value. For each group of related records, all of the OrderQty values are added together as the result of the SUM() function. The result should include two columns, the SalesOrderID and the sum of the OrderQty for the related detail rows, as shown in Figure 7-3.

image from book
Figure 7-3:

Because detail rows contain multiple quantities, you really can't tell if these rows are aggregated. To get a better view, add another column using the COUNT() function. Also add column aliases to label these values:

 SELECT  SalesOrderID , SUM(OrderQty) As QtySum , COUNT(SalesOrderID) As DetailCount  FROM SalesOrderDetail GROUP BY SalesOrderID

The result, shown in Figure 7-4, shows that all but two of the visible rows were grouped and the SUM() function was applied to the OrderQty column value.

image from book
Figure 7-4:

If you were to view the ungrouped records in this table, you could clearly see what's going on. The result shown in Figure 7-5 is just a simple SELECT query on the SalesOrderDetail table showing the first nine rows.

image from book
Figure 7-5:

Sales order 5002 has two detail rows, whose OrderQty values add up to 3, order 5003 also has two detail rows with a total quantity of 3, and order 5005's quantity adds up to 8.

When grouping on more than one column, every unique combination of grouped values produces a row in the result set. Because the SalesOrderDetail table isn't preordered by the two columns you want to group on, you're explicitly ordering the results:

 SELECT ProductID , SpecialOfferID FROM SalesOrderDetail GROUP BY ProductID, SpecialOfferID ORDER BY ProductID, SpecialOfferID

The query returns a distinct list of ProductID and SpecialOfferID values (including null values), as shown in Figure 7-6.

image from book
Figure 7-6:

Although this may be interesting, it's not particularly useful information. Let's find out how many rows are actually being used to produce this list of distinct values. Extending the same query, add two columns that return the count of the ProductID and the SpecialOfferID values:

 SELECT ProductID, COUNT(ProductID) As ProductIDCount , SpecialOfferID, COUNT(SpecialOfferID) As SpecialOfferIDCount FROM SalesOrderDetail GROUP BY  ProductID, SpecialOfferID ORDER BY ProductID, SpecialOfferID

In the result set shown in Figure 7-7, you can see that you get the same rows.

image from book
Figure 7-7:

What you didn't see in the first result set is that the first row is an aggregation of 150 rows where the ProductID was 3 and the SpecialOfferID value was 6. The second row represents 218 rows. In the third row, 48 individual rows had a ProductID value of 8 and the SpecialOfferID was null. The COUNT() function returns 0 because null values are not considered in aggregate function calculations. 102 rows had the same ProductID value but the SpecialOfferID value was 6.

For a more real-world example, due to the complexity of the AdventureWorks database, it's necessary to create a fairly complex query with several table joins. Again, don't be concerned with the statements in the FROM clause, but do pay attention to the column list after the SELECT statement. You'll come back to this query in the next chapter.

The purpose of this query is to find out what products your customers have purchased. The Individual table contains personal information about human being–type customers (rather than stores that buy wholesale products). You've already seen that sales orders have order details, and a sales order detail line is related to a product:

 SELECT Store.Name AS StoreName , Product.Name AS ProductName , COUNT(SalesOrderDetail.ProductID) AS PurchaseCount FROM Customer INNER JOIN SalesOrderHeader  ON Customer.CustomerID = SalesOrderHeader.CustomerID  INNER JOIN SalesOrderDetail  ON SalesOrderHeader.SalesOrderID = SalesOrderDetail.SalesOrderID  INNER JOIN Product ON SalesOrderDetail.ProductID = Product.ProductID  INNER JOIN Store ON Customer.CustomerID = Store.CustomerID GROUP BY Product.Name, Store.Name ORDER BY Store.Name, Product.Name

Three columns are returned from the query: the StoreName, the ProductName, and the number of product records in each group (PurchaseCount), using the COUNT() function. This returns the number of times a store purchased the same product. You could use the product Name in the COUNT() function but it's usually more efficient to use primary key columns. Note that even though the StoreName and ProductName columns are aliased in the SELECT list, when used in the GROUP BY and ORDER BY statements, the alias name is not used; only the qualified column names are used.

Figure 7-8 shows the first 34 rows in the result set.

image from book
Figure 7-8:

Suppose that the purpose of this query was to locate stores that have purchased more than four of any product. Rather than scrolling through 20,531 rows, you can modify the query for rows with a count greater than four.

HAVING

How do you identify these rows? You can't use the WHERE clause because it is processed prior to grouping and aggregation; therefore, you need some way to filter the rows after the grouping has been completed. This is the job of the HAVING clause. The HAVING clause is limited to those columns and aggregate expressions that have already been specified on the SELECT statement. Typically, you will refer to aggregate values by simply repeating the aggregate function expression in the HAVING clause, just like you did in the SELECT statement. Here is the previous query with this expression added:

 SELECT Store.Name AS StoreName , Product.Name AS ProductName , COUNT(SalesOrderDetail.ProductID) AS PurchaseCount FROM Customer INNER JOIN SalesOrderHeader  ON Customer.CustomerID = SalesOrderHeader.CustomerID  INNER JOIN SalesOrderDetail  ON SalesOrderHeader.SalesOrderID = SalesOrderDetail.SalesOrderID  INNER JOIN Product ON SalesOrderDetail.ProductID = Product.ProductID  INNER JOIN Store ON Customer.CustomerID = Store.CustomerID GROUP BY Product.Name, Store.Name HAVING COUNT(SalesOrderDetail.ProductID) > 4 ORDER BY Store.Name, Product.Name

The result set is now reduced to 2,139 rows, including only the store/product purchases. Examining the results shows something very interesting. In the previous results, you saw that there appeared to be a lot of transactions for the store called A Bike Store. However, when you look at the results in this way, you see that this customer purchased only small quantities of each product. As such, they are not included in these results, shown in Figure 7-9.

image from book
Figure 7-9:

This data makes sense because most of the products listed with higher product counts are lower-priced items that tend to sell quickly.

The HAVING clause works much like the WHERE clause in that you can use any combination of comparison expressions and logical operators. Just be mindful that it is processed after the initial record selection (which is filtered by the WHERE clause) and that you are limited to the columns and aggregate expressions in the SELECT statement.

I'll use an example from the AdventureWorks database so you can follow along and work with the data yourself. Chapter 3 pointed out that the SQL Server Product Team had some fun putting the Employee sample data together. You'll see some more evidence of this. I thought it would be fun to see what gender variations there were for first names such as Terry and Pat. It turns out that our friends at Microsoft took this a little further.

Grouping the Employee table on the FirstName and Gender columns will return all the combinations of these values. I'll aggregate the EmployeeID column using the COUNT() function so you can see how many records there are for each FirstName/Gender combination. Here's the SQL expression for this query:

 SELECT FirstName, Gender, COUNT(EmployeeID) FROM Employee GROUP BY FirstName, Gender ORDER BY FirstName, Gender

I've scrolled down a bit to view some of the rows with a higher count. In Figure 7-10, you can see that there are some interesting anomalies in the results. Of the four employees named Brian, three of them are female.

image from book
Figure 7-10:

I can't be certain but I think that some of the folks on the SQL Server product team intended for this to be an inside joke and probably wondered if anyone would notice (considering that some of the names in this table are actual product team members). There are also five female employees named David. Apparently someone has a sense humor.

Totals and Subtotals

Before getting into a discussion about the techniques available for totaling grouped aggregates and creating subtotal breaks, a discussion of how you will use this data is needed. SQL Server is typically used as the back-end data store for some type of application or data consumer product. Many data presentation and reporting tools exist that will take care of formatting and totaling values from your queries. The technique you choose will largely depend on the tool or application that will consume this data. A number of products in the Microsoft suite can be used to easily present query results in a readable form. These include Excel, Access, and SQL Server Reporting Services.

One important consideration is whether you want the data to be grouped, aggregated, and subtotaled by the database server or by the client application after results have been returned from the database. There is little doubt that it is more efficient to let the database server do the work and send less data across the network. However, consider the case where an application allows users to interact with data, choosing different sorting and grouping options. It might make more sense to send raw data to the application one time so it can be manipulated by the client rather than refreshing the result set and resending a different result each time the user chooses a grouping or sorting option. These are decisions that you and solutions designers might need to make on a larger scale. The purpose here is to discuss the options in SQL Server to do the grouping and subtotaling at the database server.

When I use the terms totaling and subtotaling, I use these in a general sense to mean applying whatever aggregate functions you choose at various group levels. So, for example, if I were using the AVG() function to return the average purchase price per product, and per quarter at the quarter level, I would want to see the average calculation for all of the product price averages. I'm loosely using the term subtotal, even though I expect to see an average calculation rather than a sum or total.

Subgrouping

With more than one column referenced in the GROUP BY clause, some interesting things happen. For the sake of simplicity, a hypothetical table follows with simplified values.

ID1	ID2	MyValue
A	X	2
A	X	1
A	Y	2
A	Y	1
B	X	3
B	Y	2
B	Y	2

In a query for my hypothetical table, I include the first two columns, ID1 and ID2, in the GROUP BY clause and use the SUM() function to total the values in the third column:

 SELECT ID1, ID2, Sum(MyValue) FROM MyFakeTable GROUP BY ID1, ID2

Multiple rows are returned, one for each unique combination of values, as shown in the following table.

ID1	ID2	MyValue
A	X	3
A	Y	3
B	X	3
B	Y	4

What I don't have in this result set is the sum for all occurrences where ID1 is equal to A or where ID2 is equal to Y. To get the aggregate result of a grouped query, you can use the ROLLUP and CUBE statements. These will essentially take the results from the grouped query and apply the same aggregation to either the first column's values or all combinations of values for each column that appears in the GROUP BY column list.

WITH ROLLUP

This is the simplest option for calculating subtotals and totals on the first column in the GROUP BY column list. In the case of my hypothetical example, in addition to calculating the sum of each unique column value, totals would be tallied for the value A and B in the ID1 column only. Using the same query, I've added WITH ROLLUP after the GROUP BY statement:

 SELECT ID1, ID2, SUM(MyValue) FROM MyFakeTable GROUP BY ID1, ID2 WITH ROLLUP

The results would look something like those shown in the following table.

ID1	ID2	MyValue
A	X	3
A	Y	3
B	X	3
B	Y	4
A	(null)	6
B	(null)	7

Null values are used to indicate that the corresponding column was ignored when calculating the aggregate value.

WITH CUBE

The CUBE operator is an expanded version of the ROLLUP operator. Rather than just rolling up the aggregate values for the first column in the GROUP BY list, CUBE performs this rollup for every combination of grouped column values. In the case of the hypothetical tables used in the previous example, three additional rows are added to the result set. Here is the same query using WITH CUBE rather than WITH ROLLUP:

 SELECT ID1, ID2, SUM(MyValue) FROM MyFakeTable GROUP BY ID1, ID2 WITH CUBE

The corresponding result set is shown in the following table.

ID1	ID2	MyValue
A	X	3
A	Y	3
B	X	3
B	Y	4
A	(null)	6
B	(null)	7
(null)	X	6
(null)	Y	7
(null)	(null)	13

Null values in the first column indicate that this is a rollup on the values in the second column. In this case, these rows contain subtotal values for all rows where ID2 is equal to X or Y. The last row has null values in both grouped columns. This indicates that it is a grand total, the sum of all rows. Be advised that CUBE operations are expensive when it comes to server resources. Carefully consider whether it would be more efficient to just send the simplified aggregate data to the application and let it do the equivalent operations to derive the cubed data.

The GROUPING() Function

Let's go back to the AdventureWorks Product table example used earlier. I made a point not to use this table for the ROLLUP and CUBE examples because it would throw a wrench into the works. Go back and take another look at Figure 7-7; note the null values in the SpecialOfferID column. You'll recall that when using the ROLLUP and CUBE operators, a null is used to indicate a rollup or subtotal row where that column's value isn't being considered. What if a column in the GROUP BY list actually contains null values? Here's the earlier example again, with the added ROLLUP operator:

 SELECT ProductID, COUNT(ProductID) As ProductIDCount , SpecialOfferID, COUNT(SpecialOfferID) As SpecialOfferIDCount  FROM SalesOrderDetail GROUP BY ProductID, SpecialOfferID WITH ROLLUP ORDER BY ProductID, SpecialOfferID

In the result set shown in Figure 7-11, additional rows are added with subtotals.

image from book
Figure 7-11:

Take a close look at rows 6, 7, and 8. Which one contains the subtotal for all products where the ProductID is 8? Two rows have nulls in the SpecialOfferID column. One of them contains NULL because some of this column's values are, in fact, NULL. The answer is row 7, because the count of the ProductID is higher than any other. Row 6 has a SpecialOfferID count of 0 because this row represents those rows where the SpecialOfferID is actually NULL. Do you find this confusing? It certainly can be. Imagine the added confusion if you were grouping on more than two columns, or using the CUBE operator rather than ROLLUP. Imagine taking the results of this query to your software developer and asking him to create a custom report with subtotals and totals and then try to explain this grouping criterion. This is where the GROUPING() function comes in.

The GROUPING() function returns a bit value (1 or 0) to indicate that a row is a rollup. This makes it easy to separate the aggregation of null values. Any application or tool that consumes grouped data can easily distinguish the rolled-up subtotal rows from simple grouped rows. Here's the query with two columns added using the GROUPING() function:

 SELECT ProductID , GROUPING(ProductID) As ProdGroup , COUNT(ProductID) As ProductIDCount , SpecialOfferID , GROUPING(SpecialOfferID) As SO_Group , COUNT(SpecialOfferID) As SpecialOfferIDCount FROM SalesOrderDetail GROUP BY ProductID, SpecialOfferID WITH ROLLUP ORDER BY ProductID, SpecialOfferID

In the results shown in Figure 7-12, you can see the two new columns (aliased as ProdGroup and SO_Group). The first row in the result set is the grand total, with rolled-up values for all rows (any combination of ProductID and SpecialOfferID values). This row and each individual rollup row is also flagged with a 1.

image from book
Figure 7-12:

Because this is a ROLLUP query, only the SpecialOfferID values get rolled up into their respective ProductID combination counterparts. If you substitute CUBE for the ROLLUP function, you will see additional Grouping flags for the ProductID.

Note that if the results of these grouped and aggregated queries will be fed to a custom reporting solution or a similar application, application developers will appreciate the output from the GROUPING() function. This can make life much easier for a custom software developer or report designer.

COMPUTE and COMPUTE BY

Regardless of the data you might work with, SQL Server was designed, and is optimized, to return rows and columns — two dimensions. Likewise, software designed to consume SQL data expects to receive two-dimensional data. All of the components, application programming interfaces (APIs), and connection utilities are engineered to work with two-dimensional result sets.

Why am I making such a big deal out of this two-dimensional result set business? The COMPUTE clause is a very simple means for viewing data with totals and subtotals, but it breaks all the rules when it comes to standard results. It's also important to note that this is a proprietary SQL Server feature and isn't recognized by the ANSI SQL specification. My purpose is not to try to talk you out of using these features entirely but to realize its limitations. This is an effective technique for viewing summary data, but its usefulness may be somewhat limited in many real software solutions because it does not return data in a format that is consumable by any application. Its usefulness is limited to Query Analyzer or SQL Server Management Studio.

Suppose that the sales manager calls you on the phone and asks you to tell her what the total sales were for last month. She doesn't need a formal report, and you're not going to develop a custom application for users to do this themselves. She just wants a list of sales orders with the total. Using this technique may be the best choice.

Here's a simple example of the COMPUTE clause:

 SELECT ProductID, SalesOrderID, OrderQty FROM SalesOrderDetail ORDER BY ProductID, SalesOrderID COMPUTE SUM(OrderQty)

The query editor splits the result into two grids because the result doesn't fit into a standard two-dimensional grid, as shown in Figure 7-13.

image from book
Figure 7-13:

I had asked for SQL Server to compute the sum of the OrderQty for the entire result set. This created a grand total for the entire range of data. Because of the formatting restrictions of viewing results in grid view, I'd like to show you the same result in text view.

Try It Out

To switch the view from grid to text, choose Results in Text from the Query menu in Query Analyzer or Results to Text in the Query Editor. Execute the query from the previous example and scroll all the way down to the bottom of the results. That probably took a while considering that there were more than 120,000 rows. To work with a more manageable set of data, modify the query as follows so it only returns the 23 orders:

 SELECT ProductID, SalesOrderID, OrderQty FROM SalesOrderDetail WHERE SalesOrderID > 43650 ORDER BY ProductID, SalesOrderID COMPUTE SUM(OrderQty) By ProductID

This returns a short result set in the form of monospaced text:

ProductID   SalesOrderID OrderQty  ----------- ------------ --------  209         43652        1 209         43654        1 209         43655        1 210         43653        1 210         43656        1 211         43657        1 211         43658        1 213         43651        1 1228        43654        1 1273        43658        1 1277        43652        1 1277        43654        1 1277        43656        1 1277        43657        1 1277        43658        1 1279        43652        1 1279        43654        1 1279        43656        1 1279        43657        1 1279        43658        1 1286        43653        1 1286        43657        1                              sum                          ===========                          22         (23 row(s) affected)

This may be useful if you are interested in the grand total following the entire range of values. If you want to see grouped sections of rows with subtotals, it's a simple matter to add the column name or list of columns to the end of the COMPUTE clause. Modify the previous query to group by the ProductID:

 SELECT ProductID, SalesOrderID, OrderQty FROM SalesOrderDetail WHERE SalesOrderID > 43650 ORDER BY ProductID, SalesOrderID COMPUTE SUM(OrderQty) By ProductID

The result set shows the same list of SalesOrderDetail records with a subtotal break after each ProductID. (I've shortened the result text to save space.)

ProductID   SalesOrderID OrderQty  ----------- ------------ --------  209         43652        1 209         43654        1 209         43655        1                               sum                          ===========                          3 ...         ProductID   SalesOrderID OrderQty  ----------- ------------ --------  1279        43652        1 1279        43654        1 1279        43656        1 1279        43657        1 1279        43658        1                          sum                          ===========                          5         ProductID   SalesOrderID OrderQty  ----------- ------------ --------  1286        43653        1 1286        43657        1                              sum                          ===========                          2         (31 row(s) affected)

The COMPUTE clause is an easy and convenient technique for producing a detailed list of column values with total and grouped subtotal breaks. It doesn't do grouping and aggregation on every row like the GROUP BY clause. Just keep in mind that the output isn't compatible with most standard data consumer software and programming components. If you just need to view or print a quick, ad-hoc report, this may be the easiest way to get there. Otherwise, use the ANSI standard GROUP BY clause with the ROLLUP or CUBE statements.

From a database design standpoint, it is imperative that large tables are indexed on columns that will be used for grouping and aggregation. Few other functions and SQL statements will stress the database engine to the same degree. Consider using a clustered index on a column that is usually used in queries to join to another table or specify the usual sort order. You can find more information about indexing strategies in Professional SQL Server Programming from Wrox Press.

As previously mentioned, cube operations can be especially intensive. As you have run some of these queries, you've probably noticed that it takes a little while to perform calculations and return the aggregated results. It's best to use the ROLLUP and CUBE statements with filtered data. If you do need to perform intensive grouping operations on a large volume of data, try to do this at a time when you won't be competing with other large operations.

Although it usually makes sense to let SQL Server do a majority of the work, sometimes returning a larger volume of data that can be reused in the client application, rather than running numerous query operations, is best. This is especially true with dynamic reporting solutions. Make a point to understand your users' needs and try to strike a balance between the flexibility of an application or reporting solution and the efficiency of the whole system. As a rule, don't return any more data than is necessary. Lastly, make a point to order records explicitly using the ORDER BY clause. This will guarantee the sort order of your queries. Even if records in the table already exist in the correct order, using the ORDER BY clause will not cause any processing overhead.