Grouping Data with Transact-SQL


  • Group data using Transact-SQL.

SQL Server is all about reporting. If you think that the database you're designing is for storing data, you are looking at design from the wrong angle.

Reporting is typically how things come out of the database. If your database is for a payroll system, reporting is how it prints checks. If your database is for a cargo shipping application, reporting is how it prints invoices and how it tells you how many widgets the company will need to make next month if sales continue at the same rate. Reporting is what makes money for companies. Without effective ways of reading and presenting data, there are no payroll checks, shipping invoices, or sales projections.

Reporting is being able to count things, average things, find maximum and minimum values, and do all that over different categories of data. If you've ever been involved in a report design, then you know that most reports are all about grouping different entities together and describing how those entities behave. How many widgets did we sell last month, what was the average cost to produce, and what was the average selling price? Those are the things that drive how business is done.

This section is going to cover big topics. It describes how to aggregate data, how to perform operations on those aggregates, and how to do computations across aggregates. Remember, it's all about the slice and dice.

Aggregate Functions

Aggregate functions are functions that provide summary data about sets. Questions like "How many rows are in that table?" "How many widgets did we sell last week?" and "What is the average price we charged for a widget?" are all answered with aggregate functions. Table 5.1 provides a list and a brief description of each aggregate function. This section also covers what they do and their syntax.

Table 5.1. Aggregate Functions

 Function                                             Description
 ---------------------------------------------------  --------------------------------------------------
 AVG()                                                Average value of the group.
 BINARY_CHECKSUM(), CHECKSUM(), and CHECKSUM_AGG()    Return a number representing the value of the data
                                                      in the group. Useful for detecting data changes.
 COUNT(), COUNT_BIG()                                 Number of objects in the group.
 MAX()                                                Maximum value of the group.
 MIN()                                                Minimum value of the group.
 SUM()                                                Sum of the values in the group.
 STDEV(), STDEVP()                                    Standard deviation of the values in the group.
 VAR(), VARP()                                        Variance of the values in the group.

Aggregate functions take one of three types of arguments. Some take a simple wildcard (*). This means that the operation either doesn't apply to any particular column, or it should apply to all rows. Look at the following statement, for example:

 SELECT COUNT(*) FROM Person 

This example returns the number of rows in the Person table. The number of rows in the table is independent of any individual column. The COUNT() functions and the CHECKSUM() functions work this way: they apply the function across the entire group, regardless of the content of each row.

All the functions take a column name as an argument, and then the aggregate applies only to that column. You could have used this in the preceding query:

 SELECT COUNT(PersonID) FROM Person 

This example returns the number of non-null PersonID values in the table.

Some functions enable you to apply the function to distinct values only. For example, if you want a count of the distinct values in a table, you can use COUNT(DISTINCT LastName), which provides a count of the number of distinct last names. The COUNT, AVG, and SUM functions support this and actually do something with it. MIN and MAX accept it, but it doesn't change the result; it is included only for SQL-92 compliance.
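To make the three argument styles concrete, here is a sketch that assumes the sample Person table from this chapter has a LastName column:

```sql
-- Wildcard: counts every row, regardless of column content
SELECT COUNT(*) FROM Person

-- Column name: counts only rows where LastName is not null
SELECT COUNT(LastName) FROM Person

-- DISTINCT: counts each distinct last name only once
SELECT COUNT(DISTINCT LastName) FROM Person
```

If the table held three rows with last names 'Smith', 'Smith', and NULL, these queries would return 3, 2, and 1, respectively.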

The following sections look at the functions by category: first the statistical aggregate functions, then the data management aggregate functions.

Statistical Aggregate Functions

Statistical aggregate functions perform various operations across their sets. The COUNT() function returns an integer representing the number of rows in the table. The COUNT_BIG() function does the same thing, but it returns a number of type bigint instead.
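For instance, the two count functions can run side by side against the sample Sales table; the only difference is the return type:

```sql
-- COUNT(*) returns an int; COUNT_BIG(*) returns a bigint.
-- The column aliases here are just for illustration.
SELECT COUNT(*) AS RowCountInt, COUNT_BIG(*) AS RowCountBig FROM Sales
```

On a table with fewer than about 2 billion rows the two values are identical; COUNT_BIG() exists for the tables where COUNT() would overflow.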

The AVG() function returns the average value for a given column. It requires a column name, and optionally you can use DISTINCT to get an average for just the distinct values in the table. Here's an example that determines the average size of an order from the sample Sales table:

 SELECT AVG(QtyPurchased) FROM Sales 

This returns a value of 3, which is the sum of the QtyPurchased values (4 + 3 + 1 + 1 + 1 + 10 + 5 + 6, which is 31), divided by the number of records in the table (8). Divide the sum by the count and truncate the decimal places (because the example deals with integers), and you get 3. If you execute this query:

 SELECT AVG(DISTINCT QtyPurchased) FROM Sales 

You get the value 4, which is the sum of the distinct QtyPurchased values (4 + 3 + 1 + 10 + 5 + 6, which is 29) divided by the number of distinct values (6). Truncate the decimal places and you get 4.

MAX() and MIN() are very simple. They return the maximum or the minimum value in the set. Here's an example:

 SELECT MIN(QtyPurchased), MAX(QtyPurchased) FROM Sales 

This returns a rowset with one record and two columns: 1 and 10. The SUM() function is nearly as simple; it just returns the sum of all the values in the group:

 SELECT SUM(QtyPurchased), SUM(DISTINCT QtyPurchased) FROM Sales 

As you can see, this is another one of the functions that can use the DISTINCT keyword. This returns 31 and 29. Finally, dust off your old statistics book, because the next two sets of functions are a bit advanced. The VAR() and VARP() functions calculate the variance of the group. The STDEV() and STDEVP() functions calculate the standard deviation of the group. Variance and standard deviation are used in statistics to measure the spread of a sample set. A sample set has a high variance and a high standard deviation if there is a wide range of values in the set; if there is a small range of values, the variance and standard deviation are smaller. The STDEV() and STDEVP() functions differ in that STDEV() determines the sample standard deviation, while STDEVP() returns the population standard deviation. VAR() and VARP() are similar. See the "Suggested Reading" section at the end of the chapter if you really want more information.
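As a sketch, all four statistical functions can be run against the sample Sales table in one statement (the column aliases are just for readability):

```sql
-- Sample vs. population statistics for QtyPurchased
SELECT STDEV(QtyPurchased)  AS SampleStdDev,
       STDEVP(QtyPurchased) AS PopulationStdDev,
       VAR(QtyPurchased)    AS SampleVariance,
       VARP(QtyPurchased)   AS PopulationVariance
FROM Sales
```

The population figures come out slightly smaller than the sample figures, because the sample versions divide by n - 1 rather than n.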

Data Management Aggregate Functions

Four functions specifically pertain to helping you manage data. Three of them revolve around checksums. A checksum is a number generated from a data value that essentially summarizes the data value. For example, if you have the string "This is a test" , the checksum is 575031869. That doesn't mean much, but if you store that value, and then change the string to "The test is over" , which has a checksum of 270787094, you can quickly tell by comparing the checksums that the value of the string changed. If you're working with large data values, the ability to compare checksums saves an immense amount of time over comparing the strings themselves.

The computer science folks call this a hash function. Basically, it's a way to compress your data for comparison purposes to make it easier to compare. There is a small chance that two different pieces of data will have the same checksum. How small? The checksums return an integer, and the chance that two pseudorandom integers will be the same is roughly 1 in 4 billion. So this is actually a very useful function for detecting changes in rows.

The CHECKSUM() and BINARY_CHECKSUM() functions aren't true aggregate functions. They return a checksum for whatever you ask them to check, typically a row. The difference between CHECKSUM() and BINARY_CHECKSUM() is that CHECKSUM() checks data that's been localized for the server, whereas BINARY_CHECKSUM() checks the raw data. In this case, "localized" means that the string is translated to how it would compare prior to computing the checksum. So, if your server is set up as not case sensitive, which is the default, the CHECKSUM() values for "Widget" and "widget" would be the same, whereas the BINARY_CHECKSUM() values would be different.
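A quick sketch of that difference, assuming a server installed with the default case-insensitive collation:

```sql
-- On a case-insensitive server, the two CHECKSUM() values match
-- because the strings compare as equal; the two BINARY_CHECKSUM()
-- values differ because the raw bytes differ.
SELECT CHECKSUM('Widget')        AS LocalizedUpper,
       CHECKSUM('widget')        AS LocalizedLower,
       BINARY_CHECKSUM('Widget') AS RawUpper,
       BINARY_CHECKSUM('widget') AS RawLower
```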

The CHECKSUM_AGG() function is the aggregate function used to compute checksums over sets. This function enables you to get the aggregated checksum for an entire table. Look at some examples. First, here's an example of using just the CHECKSUM() function on a table:

 SELECT CHECKSUM(*) FROM Sales 

This returns 8 rows, each containing a checksum for the entire row. Now, if you want to use a checksum for the entire table, you can do this:

 SELECT CHECKSUM_AGG(CHECKSUM(*)) FROM Sales 

This gives you a checksum for the entire table. Now, look at a way to use this information. The key thing to remember about using a checksum is that it's a snapshot. To find differences, you've got to have a checksum before and a checksum after the changes occur. One way to do this is to create a table that contains the name and checksum of each other table in your database, like this:

 Create Table ChecksumTracker (
     TableName sysname,
     Checksum int
 )

Then populate the table with the checksum from each table. You can re-run the checksum query periodically and compare the results to see whether the table changed.
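A sketch of that workflow, using the ChecksumTracker table just created; the comparison query is one assumed way to wire it up, not the only one:

```sql
-- Take the initial snapshot for the Sales table
INSERT INTO ChecksumTracker (TableName, Checksum)
SELECT 'Sales', CHECKSUM_AGG(CHECKSUM(*)) FROM Sales

-- Later, compare the stored snapshot against a fresh checksum
SELECT ct.TableName
FROM ChecksumTracker ct
WHERE ct.TableName = 'Sales'
  AND ct.Checksum <> (SELECT CHECKSUM_AGG(CHECKSUM(*)) FROM Sales)
```

If the comparison query returns a row, the Sales table has changed since the snapshot was taken.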

So far, every aggregate function you've seen has operated on a set that has been the entire table. So, how can you create several subsets out of a table and get aggregate information for those subsets?

Using GROUP BY to Aggregate Data

The GROUP BY operator creates subgroups within a query that can be used by aggregate functions. For example, look at the Sales table. Right now, the Sales table has one record for each sale, but you really want to know how many total items each person bought. Here's how GROUP BY solves that problem:

 SELECT PersonID, sum(QtyPurchased) as TotalQtyPurchased
 FROM Sales
 GROUP BY PersonID

Here's the resultset:

 PersonID    TotalQtyPurchased
 ----------- -----------------
 1           7
 3           1
 4           2
 9           15
 10          6

 (5 row(s) affected)

Each person is listed once, and the sum(QtyPurchased) is computed for each person across the aggregate.

There are, of course, some rules. (There are always rules, or what would the test cover?)

  • The items in the SELECT statement that are not part of an aggregate function have to appear in the GROUP BY clause, and they probably should appear in the same order as they do in the SELECT statement.

  • You can have multiple aggregates in one SELECT statement. For example, you can find MIN and MAX values for a particular column in one SELECT statement.

  • Without a GROUP BY clause, a column cannot appear both inside and outside an aggregate function in the same SELECT statement. So although you can do SELECT MIN(value), MAX(value) , you can't do SELECT value, MIN(value), MAX(value) unless value also appears in a GROUP BY clause.
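These rules can be sketched in one statement against the sample Sales table: several aggregates appear side by side, and the one non-aggregated column is listed in the GROUP BY clause:

```sql
-- Multiple aggregates in one SELECT, grouped by PersonID
SELECT PersonID,
       MIN(QtyPurchased) AS SmallestOrder,
       MAX(QtyPurchased) AS LargestOrder,
       SUM(QtyPurchased) AS TotalQty
FROM Sales
GROUP BY PersonID
```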

Here's another example. Imagine that you want to find out how many records each person has in the Sales table. You could do something like this:

 SELECT PersonID, count(*) FROM Sales GROUP BY PersonID 

This returns each person's PersonID and the number of records they have in the Sales table. You can include multiple columns in the SELECT list and get a different aggregate:

 SELECT PersonID, ProductID, count(*)
 FROM Sales
 GROUP BY PersonID, ProductID

This returns each person and product, and the number of different purchases a person made for each product. You can also include an ORDER BY for sorting:

 SELECT PersonID, ProductID, count(*)
 FROM Sales
 GROUP BY PersonID, ProductID
 ORDER BY PersonID, ProductID

You could also order by count(*) and you'd get the expected results; it works just like any other returned column. You can do the normal filtering with a WHERE clause as well, as follows:

 SELECT PersonID, count(*)
 FROM Sales
 WHERE DatePurchased > '11/1/2000'
 GROUP BY PersonID

This filters out any sale with a DatePurchased on or before 11/1/2000, so the count reflects only sales since that time. So how can you find the people who purchased more than one item? To do that, you need to filter on the results of the aggregate function, which is something you cannot do in the WHERE clause. You need to use something a little different.

Using GROUP BY and HAVING to Filter Data

The HAVING keyword appears after GROUP BY in a SELECT statement, and it is used to filter data after the GROUP BY . You can actually use HAVING without using GROUP BY , in which case HAVING acts like a part of the WHERE clause. The HAVING operator is typically used with aggregate functions to filter out rows that don't meet certain criteria. For example:

 SELECT PersonID, count(*)
 FROM Sales
 GROUP BY PersonID
 HAVING count(*) > 1

This filters out all the records from the result of the aggregate that don't meet the criterion of COUNT(*) > 1, so you end up with:

 PersonID
 ----------- -----------
 1           2
 4           2
 9           2

 (3 row(s) affected)

This leaves out all the rows with COUNT(*) equal to 1. A HAVING clause can be more complex and contain expressions separated by AND and OR as appropriate:

 SELECT PersonID, count(*)
 FROM Sales
 GROUP BY PersonID
 HAVING count(*) > 1 AND PersonID > 4

Notice that the HAVING clause in this case doesn't care that PersonID isn't an aggregate function; it just filters out everyone whose PersonID is less than or equal to 4 or who has fewer than 2 sales.
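Because PersonID > 4 doesn't involve an aggregate, it could just as well live in the WHERE clause, which lets SQL Server discard those rows before grouping; this equivalent sketch reflects the usual optimization advice rather than anything this example requires:

```sql
-- Same result: the non-aggregate condition moves to WHERE,
-- and only the aggregate condition stays in HAVING
SELECT PersonID, count(*)
FROM Sales
WHERE PersonID > 4
GROUP BY PersonID
HAVING count(*) > 1
```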

Creating Breaks with COMPUTE and COMPUTE BY

So far, the operators you've looked at do not ever create data rows. That changes now. The COMPUTE and COMPUTE BY operators create summary data across your aggregates and add it into your rowset. Here's an example:

 SELECT PersonID, QtyPurchased
 FROM Sales
 COMPUTE sum(QtyPurchased)

What does this code do? Well, it creates the sum of the QtyPurchased field, like this:

 PersonID    QtyPurchased
 ----------- ------------
 1           4
 1           3
 3           1
 4           1
 4           1
 9           10
 9           5
 10          6

             sum
             ===========
             31

 (9 row(s) affected)

You can also find the summary information for each person by using this query:

 SELECT PersonID, QtyPurchased, DatePurchased
 FROM Sales
 ORDER BY PersonID
 COMPUTE sum(QtyPurchased) BY PersonID

Which yields these results:

 PersonID    QtyPurchased DatePurchased
 ----------- ------------ -----------------------
 1           4            2001-07-22 16:50:38.257
 1           3            2001-07-22 16:50:38.257

             sum
             ===========
             7

 PersonID    QtyPurchased DatePurchased
 ----------- ------------ -----------------------
 3           1            2001-07-22 16:50:38.257

             sum
             ===========
             1

For the sake of brevity, the entire resultset is not shown, but you get the idea. It shows how many items each person purchased and when, and a summary of the total items purchased. Inevitably, there are rules about these sorts of things:

  • If you use COMPUTE without using BY, you can use the COUNT aggregate. Otherwise, you can't.

  • If you use COMPUTE with BY, you need an ORDER BY with the same columns in the same order. You can have additional things in the ORDER BY, but you've at least got to have what is specified in the COMPUTE BY.

  • You can't use COMPUTE or COMPUTE BY with SELECT INTO.

  • You can't use COMPUTE or COMPUTE BY as the SELECT statement that feeds a cursor. (Cursors are covered in Chapter 6.)

WARNING

Rowsets and COMPUTE BY When you're writing an application using ADO or OLE DB, you may have some serious issues with COMPUTE and COMPUTE BY, along with anything else that returns irregular rowsets. This happens for the same reason that cursors and SELECT INTO aren't valid with these operators: the rowsets are irregular.


If you look at the structure of the rowsets, you'll notice that they aren't consistent: they have the extra "stuff" in them for the aggregate. This "stuff" doesn't fit into a table very well; that's why it doesn't work with SELECT INTO. It doesn't work with cursors for the same reason.

The only real difference between COMPUTE and COMPUTE BY is that a COMPUTE BY does an aggregation with each computation; it splits the rowset up with whatever you've specified to break it up by and creates those extra columns once for each aggregate.
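Assuming the server accepts multiple COMPUTE clauses on a single SELECT, as SQL Server 2000 does, the two forms can even be combined to get per-person subtotals plus a grand total:

```sql
-- Subtotal for each person, then a grand total at the end.
-- COMPUTE BY requires the matching ORDER BY; plain COMPUTE does not.
SELECT PersonID, QtyPurchased
FROM Sales
ORDER BY PersonID
COMPUTE sum(QtyPurchased) BY PersonID
COMPUTE sum(QtyPurchased)
```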

Now that you've seen how GROUP BY and COMPUTE BY work, you can apply this knowledge to some other summarization tools: CUBE and ROLLUP specifically. As you'll discover, the shortcomings of COMPUTE and COMPUTE BY can sometimes be overcome with these two operators.



MCSE Training Guide (70-229). Designing and Implementing Databases with Microsoft® SQL Server™ 2000 Enterprise Edition
Year: 2003
Pages: 228
