In addition to the SELECT statement and joins, SQL Server provides other useful search expressions that enable you to tailor your queries. These expressions include LIKE and BETWEEN, along with aggregate functions such as AVG, COUNT, and MAX.
In an earlier example, we used LIKE to find authors whose names start with Ri . LIKE allows you to search using the simple pattern matching that is standard with ANSI SQL, but SQL Server adds capability by introducing wildcard characters that allow the use of regular expressions. Table 7-4 shows how wildcard characters are used in searches:
Table 7-4. Wildcard characters recognized by SQL Server for use in regular expressions.
Wildcard | Searches for |
---|---|
% | Any string of zero or more characters. |
- | Any single character. |
[ ] | Any single character within the specified range (for example, [a-f]) or the specified set (for example, [abcdef]). |
[^] | Any single character not within the specified range (for example, [^a-f]) or the specified set (for example, [^abcdef]). |
The following examples from the SQL Server documentation show how LIKE can be used (Table 7-5).
Table 7-5. Use of LIKE in searches.
Form | Searches for |
---|---|
LIKE 'Mc%' | All names that begin with the letters Mc (McBadden). |
LIKE '%inger' | All names that end with inger (Ringer, Stringer). |
LIKE '%en%' | All names that include the letters en (Bennet, Green, McBadden). |
LIKE '-heryl' | All six-letter names ending with heryl (Cheryl, Sheryl). |
LIKE '[CK]ars[eo]n' | All names that begin with C or K, then ars, then e or o, and then end with n (Carsen, Karsen, Carson, and Karson). |
LIKE '[M-Z]inger' | All names ending with inger that begin with any single letter from M through Z (Ringer). |
LIKE 'M[^c]%' | All names beginning with the letter M that don't have the letter c as the second letter (MacFeather). |
Trailing blanks (blank spaces) can cause considerable confusion and errors. In SQL Server, using trailing blanks requires that you be aware of differences in how varchar and char data is stored and of how the SET ANSI_PADDING option is used.
Suppose that we create a table with five columns and insert data as follows :
SET ANSI_PADDING OFF -- OFF is the default if nothing is set DROP TABLE checkpad GO CREATE TABLE checkpad ( rowid smallint NOT NULL PRIMARY KEY, c10not char(10) NOT NULL, c10nul char(10) NULL, v10not varchar(10) NOT NULL, v10nul varchar(10) NULL ) -- Row 1 has names with no trailing blanks INSERT checkpad VALUES (1, 'John', 'John', 'John', 'John') -- Row 2 has each name inserted with three trailing blanks INSERT checkpad VALUES (2, 'John ', 'John ', 'John ', 'Jo hn ') -- Row 3 has each name inserted with a full six trailing blanks INSERT checkpad VALUES (3, 'John ', 'John ', 'John ', 'John ') -- Row 4 has each name inserted with seven trailing blanks too many INSERT checkpad VALUES (4, 'John ', 'John ', 'John ', 'John ') |
We can then use the following query to analyze the contents of the table. We use the DATALENGTH( expression ) function to determine the actual length of the data, and then convert to VARBINARY() to display the hexadecimal values of the characters stored. (For those who don't have the ASCII chart memorized, a blank space is 0x20.) We then use a LIKE to try to match each column, once with one trailing blank and once with six trailing blanks.
SELECT ROWID, "0xC10NOT"=CONVERT(VARBINARY(10), c10not), L1=DATALENGTH(c10not), "LIKE 'John %'"=CASE WHEN (c10not LIKE 'John %') THEN 1 ELSE 0 END, "LIKE 'John %'"=CASE WHEN (c10not LIKE 'John %') THEN 1 ELSE 0 END, "0xC10NUL"=CONVERT(VARBINARY(10), c10nul), L2=DATALENGTH(c10nul), "LIKE 'John %'"=CASE WHEN (c10nul LIKE 'John %') THEN 1 ELSE 0 END, "LIKE 'John %'"=CASE WHEN (c10nul LIKE 'John %') THEN 1 ELSE 0 END, "0xV10NOT"=CONVERT(VARBINARY(10), v10not), L3=DATALENGTH(v10not), "LIKE 'John %'"=CASE WHEN (v10not LIKE 'John %') THEN 1 ELSE 0 END, "LIKE 'John %'"=CASE WHEN (v10not LIKE 'John %') THEN 1 ELSE 0 END, "0xV10NUL"=CONVERT(VARBINARY(10), v10nul), L4=DATALENGTH(v10nul), "LIKE 'John %'"=CASE WHEN (v10nul LIKE 'John %') THEN 1 ELSE 0 END, "LIKE 'John %'"=CASE WHEN (v10nul LIKE 'John %') THEN 1 ELSE 0 END FROM checkpad |
Figure 7-6 shows the results of this query from the checkpad table.
Figure 7-6. Results with SET ANSI_PADDING OFF.
Without enabling SET ANSI_PADDING ON, SQL Server trims trailing blanks for variable-length columns and for fixed-length columns that allow NULLs. Notice that only column c10not has a length of 10. The other three columns are treated identically and end up with a length of 4 and no padding. For these columns, even trailing blanks explicitly entered within the quotation marks are trimmed . All the variable-length columns as well as the fixed-length column that allows NULLs have only the 4 bytes J-O-H-N (4a-6f-68-6e) stored. The true fixed-length column, c10not , is automatically padded with six blank spaces (0x20) to fill the entire 10 bytes.
All four columns would match LIKE 'John%' (no trailing blanks), but only the fixed-length column c10not matches LIKE 'John %' (one trailing blank) and LIKE 'JOHN %' (six trailing blanks). The reason for this should be apparent when you realize that those trailing blanks are truncated and aren't part of the stored values for the other columns. A column query will get a hit with LIKE and trailing blanks only when the trailing blanks are stored. In this example, only the c10not column stores trailing blanks, so that's where the hit occurs.
SQL Server has dealt with trailing blanks in this way for a long time. However, when the developers began work to pass the National Institute of Standards and Technology (NIST) test suite for ANSI SQL conformance, they realized that this behavior was inconsistent with the ANSI specification. The specification requires that you always pad char datatypes and that you don't truncate trailing blanks entered by the user , even for varchar . We can argue for either side both the long-standing SQL Server behavior and the ANSI specification behavior can be desirable, depending on the circumstances. However, if SQL Server changed its behavior to conform to ANSI, it would break many deployed SQL Server applications. So, to satisfy both sides, the ANSI_PADDING option was added in version 6.5. If we enable SET ANSI_PADDING ON, re-create the table, reinsert the rows, and then issue the same query, we get different results from those shown in Figure 7-6. Figure 7-7 shows these results.
By setting ANSI_PADDING ON, the char(10) columns c10not and c10nul , whether declared NOT NULL or NULL, are always padded out to 10 characters regardless of how many trailing blanks were entered. In addition, the variable-length columns v10not and v10nul contain the number of trailing blanks that were inserted. Trailing blanks are not truncated. Because the variable-length columns now store any trailing blanks that were part of the INSERT (or UPDATE) statement, the columns get hits in queries with LIKE and trailing blanks if the query finds that number of trailing blanks (or more) in the columns' contents. Notice that no error occurs by inserting a column with seven (instead of six) trailing blanks. The final trailing blank simply gets truncated. If you want a warning message to appear in such cases, you can use SET ANSI_WARNING ON.
Figure 7-7. Results with SET ANSI_PADDING ON.
Note that the SET ANSI_PADDING ON option applies to a table only if the option was enabled at the time the table was created (or to the column if it was added via ALTER TABLE). Enabling this option for an existing table does nothing. To determine whether a column has ANSI_PADDING set to ON, examine the status field of the syscolumns table and do a bitwise AND 20 (decimal). If the bit is on, you know the column was created with the option enabled. For example:
SELECT name, 'Padding On'=CASE WHEN status & 20 <> 0 THEN 'YES' ELSE 'NO' END FROM syscolumns WHERE id=OBJECT_ID('checkpad') AND name='v10not' GO |
SET ANSI_PADDING ON also affects the behavior of binary and varbinary fields, but the padding is done with zeros instead of spaces. Table 7-6 below shows the effects of the SET ANSI_PADDING setting when values are inserted into columns with char , varchar , binary , and varbinary data types.
Note than when a fixed-length column allows NULLs, it mimics the behavior of variable-length columns if ANSI_PADDING is off, and it mimics the behavior of fixed-length columns if ANSI_PADDING is on.
The SQL Server ODBC driver and OLE DB Provider for SQL Server automatically set ANSI_PADDING to ON when connecting. This can be configured in ODBC data sources, in ODBC connection attributes, or OLE DB connection properties set in the application before connecting.
Table 7-6. The effects of the ANSI_PADDING setting.
Setting of ANSI_PADDING | char( n ) NOT NULL or binary( n ) NOT NULL | char( n ) NULL or binary( n ) NULL | varchar( n ) or varbinary( n ) |
---|---|---|---|
ON | Pad original value to the length of the column. | Follows same rules as for char(n) or binary(n) NOT NULL. | Trailing blanks (for varchar ) or zeros (for varbinary ) in values inserted aren't trimmed. Values are not padded to the length of the column. |
OFF | Pad original value to the length of the column. | Follows same rules as for varchar or varbinary . | Trailing blanks (for varchar ) or trailing zeros (for varbinary ) are trimmed. |
Regardless of how the column was created, SQL Server provides functions that can help you deal effectively with trailing blanks. The RTRIM( char_expression ) function removes trailing blanks if they exist. You can add trailing blanks to your output by concatenating with the SPACE( integer_expression ) function. An easier way is to use the CONVERT function to convert to a fixed-length char() of whatever length you want. Then variable-length columns will be returned as fully padded, regardless of the ANSI_PADDING setting and regardless of the columns' actual storage, by converting them to char(10) or whatever size you want. The following SELECT statement against the checkpad table illustrates this:
SELECT CONVERT(VARBINARY(10), CONVERT(char(10), v10nul)) FROM checkpad |
Here's the result:
0x4a6f686e202020202020 0x4a6f686e202020202020 0x4a6f686e202020202020 0x4a6f686e202020202020 |
BETWEEN is shorthand for greater-than-or-equal-to AND less-than -or-equal-to. The clause WHERE C BETWEEN B AND D is equivalent to writing WHERE C >= B AND C <= D. This one's not too tricky, but the order of the values to be checked is important when you use BETWEEN. If B , C , and D are increasing values, you can see that WHERE C BETWEEN D AND B wouldn't be true because it would evaluate to WHERE C >= D AND C <= B. Some early versions of SQL Server let you reverse the order, such that the meaning was equivalent to WHERE (C >= B AND C <= D) OR (C >= D AND C <= B). However, that behavior was considered to be a bug and it was corrected in 1993.
Aggregate functions (sometimes referred to as set functions ) allow you to sum-marize a column of output. SQL Server provides six general aggregate functions that are standard ANSI SQL-92 fare. (Therefore, we won't go into much depth about their general use; instead, we'll focus on some aspects specific to SQL Server.) Table 7-7 below summarizes SQL Server's aggregate functions.
Consider the following table, automobile_sales_detail . We'll use this table to demonstrate how aggregate functions simplify finding the answers to complex questions:
rowid model year color units_sold ----- ----- ---- ----- ---------- 1 Chevy 1990 Red 5 2 Chevy 1990 White 87 3 Chevy 1990 Blue 62 4 Chevy 1991 Red 54 5 Chevy 1991 White 95 6 Chevy 1991 Blue 49 7 Chevy 1992 Red 31 8 Chevy 1992 White 54 9 Chevy 1992 Blue 71 10 Ford 1990 Red 64 11 Ford 1990 White 62 12 Ford 1990 Blue 63 13 Ford 1991 Red 52 14 Ford 1991 White 9 15 Ford 1991 Blue 55 16 Ford 1992 Red 27 17 Ford 1992 White 62 18 Ford 1992 Blue 39 |
Table 7-7. SQL Server's aggregate functions.
Aggregate Function | Description |
---|---|
AVG( expression ) | Returns the average (mean) of all the values, or only the DISTINCT values, in the expression. You can use AVG with numeric * columns only. Null values are ignored. |
COUNT( expression ) | Returns the number of non-null values in the expression. When DISTINCT is specified, COUNT finds the number of unique non-null values. You can use COUNT with both numeric and character columns. Null values are ignored. |
COUNT(*) | Returns the number of rows. COUNT(*) takes no parameters and can't be used with DISTINCT. All rows are counted, even those with null values. |
MAX( expression ) | Returns the maximum value in the expression. You can use MAX with numeric, character, and datetime columns but not with bit columns. With character columns, MAX finds the highest value in the collating sequence. MAX ignores any null values. DISTINCT is available for ANSI compatibility, but it's not meaningful with MAX. |
MIN( expression ) | Returns the minimum value in the expression. You can use MIN with numeric, character, and datetime columns, but not with bit columns. With character columns, MIN finds the value that is lowest in the sort sequence. MIN ignores any null values. DISTINCT is a1vailable for ANSI compatibility, but it's not meaningful with MIN. |
SUM( expression ) | Returns the sum of all the values, or only the DISTINCT values, in the expression. You can use SUM with numeric columns only. Null values are ignored. |
You can increase the power of aggregate functions by allowing them to be grouped and by allowing the groups to have criteria established for inclusion via the HAVING clause. The following queries provide some examples.
Show the oldest and newest model years for sale, and show the average sales of all the entries in the table:
SELECT 'Oldest'=MIN(year), 'Newest'=MAX(year), 'Avg Sales'=AVG(units_sold) FROM automobile_sales_detail |
Here's the output:
Oldest Newest Avg Sales ------ ------ --------- 1990 1992 52 |
Do the Example 1 query, but show the values for autos that are Chevys only:
SELECT 'Oldest'=MIN(year), 'Newest'=MAX(year), 'Avg Sales'=AVG(units_sold) FROM automobile_sales_detail WHERE model='Chevy' |
And the result:
Oldest Newest Avg Sales ------ ------ --------- 1990 1992 56 |
Show the same values as in Example 1, but group them based on the model and color of the cars :
SELECT model, color, 'Oldest'=MIN(year), 'Newest'=MAX(year), 'Avg Sales'=AVG(units_sold) FROM automobile_sales_detail GROUP BY model, color ORDER BY model, color |
The output appears below.
model color Oldest Newest Avg Sales ----- ----- ------ ------ --------- Chevy Blue 1990 1992 60 Chevy Red 1990 1992 30 Chevy White 1990 1992 78 Ford Blue 1990 1992 52 Ford Red 1990 1992 47 Ford White 1990 1992 44 |
Show the same values for only those model-year rows with average sales of 65 or less. Also, order and group the output first by color and then by model, but keep the columns in the same order:
SELECT model, color, 'Oldest'=MIN(year), 'Newest'=MAX(year), 'Avg Sales'=AVG(units_sold) FROM automobile_sales_detail GROUP BY color, model HAVING AVG(units_sold) <= 65 ORDER BY color, model |
Here's the output:
model color Oldest Newest Avg Sales ----- ----- ------ ------ --------- Chevy Blue 1990 1992 60 Ford Blue 1990 1992 52 Chevy Red 1990 1992 30 Ford Red 1990 1992 47 Ford White 1990 1992 44 |
NOTE
We include an ORDER BY clause in these queries, even though we specify the criteria of the ORDER BY to be the same criteria that the GROUP BY clause uses. In earlier releases of SQL Server, this wasn't necessary, because GROUP BY always returned the rows in the order of the grouping columns. SQL Server 7 has several alternative strategies for handling GROUP BY that don't require sorting of the results, so it's unpredictable which order the results will be returned in. If you want the results in a particular order, you must specify ORDER BY.
You can use COUNT(*) to find the count of all the rows in the table, but a considerably faster way exists. The rows column in the sysindexes table keeps the current rowcount dynamically for the clustered index. If one exists, indid = 1. If no clustered index exits, sysindexes keeps the count for the table ( indid = 0). A quicker query than using COUNT(*) to find the count of all rows in the table looks something like this:
SELECT rows FROM sysindexes WHERE id=OBJECT_ID ("authors") AND indid < 2 |
This query works only for base tables, not views, and it works only if you apply no selection criteria via a WHERE clause. If you want to respond to a query such as "Show me the count of all rows for which the author lives in California," you'd still need to use COUNT(*). In addition, there's no guarantee that this number will be accurate 100 percent of the time. In previous versions of SQL Server, certain operations (such as SELECT INTO), didn't always correctly update the rows column in sysindexes. SQL Server 7 is much more likely to keep sysindexes up to date, but there's still no guarantee. If you absolutely must know exactly how many rows are in a table in your application, use the COUNT(*) aggregate.
As usual, you must understand the effect of using NULLs to use them appropriately with aggregate functions. Unfortunately, these effects aren't necessarily intuitive or even consistent. Because NULL is considered Unknown, one could certainly argue that using SUM(), MAX(), MIN(), or AVG() on a table containing one or more NULL values should also produce NULL. Obviously, if we have three employees and we don't know the salaries of two of them, it's totally inaccurate to state that the SUM() of the salaries for the three employees is equal to the one known salary. Yet this is exactly how SUM(), MIN(), MAX(), AVG(), and COUNT() work (but not COUNT(*)). Implicitly, these functions seem to tack on the criterion of "for only those values that are not NULL."
You will likely want these operations to work this way most of the time, and pragmatically, it's good that they work like this because they're simpler to use for common cases. But if you don't understand exactly how these functions will affect your code, you might introduce some bugs . For example, if you were to divide SUM() by the result of COUNT(*) as a way to compute the mean value, the result probably wouldn't be what you intended SUM() disregards the NULL values, but COUNT(*) counts the rows that include those NULL values.
We've suggested a few times that using default or dummy values can be a good alternative to using NULL. But we've also tried to make clear that while these alternatives solve some issues, they introduce other problems. You must account for aggregates, and you might want to get back to exactly the semantics that aggregates such as SUM() use with defaults, ignoring the default or dummy values. Aggregating on only one column is pretty straightforward, and your WHERE clause could simply exclude those values you want to disregard. But if you're aggregating on multiple columns, eliminating rows in the WHERE clause might be a poor solution.
For example, suppose that in our employee table, we have both an emp_age column and an emp_salary column. To avoid using NULL, we use 0.00 when we don't know the actual value. We want one query to do aggregations of each column. Yet we don't want to disregard a row's value for emp_age if we don't have an actual value for emp_salary , or vice versa.
Here's our employee table:
emp_id emp_name emp_age emp_salary ------ -------- ------- ---------- 1 Smith 34 26000.00 2 Doe 30 35000.00 3 Jones 45 0.00 4 Clark 0 65000.00 5 Kent 63 0.00 |
Now suppose we want to write a simple query to get the count of all employees, the average and lowest salaries, and the average and lowest ages. Here's our first simplistic query:
SELECT 'Num Employees'=COUNT(*), 'Avg Salary'=AVG(emp_salary), 'Low Salary'=MIN(emp_salary), 'Avg Age'=AVG(emp_age), 'Youngest'=MIN(emp_age) FROM employee |
Here's the result:
Num Employees Avg Salary Low Salary Avg Age Youngest ------------- ---------- ---------- ------- -------- 5 25200.00 0.00 34 0 |
Obviously, this query doesn't correctly answer our question, because the dummy values of 0 have distorted the results for the AVG() and MIN() columns. So we decide to exclude those rows by stating in the WHERE clause that the salary should not be the default value:
SELECT 'Num Employees'=COUNT(*), 'Avg Salary'=AVG(emp_salary), 'Low Salary'=MIN(emp_salary), 'Avg Age'=AVG(emp_age), 'Youngest'=MIN(emp_age) FROM employee WHERE emp_salary > 0 AND emp_age > 0 |
Here's what we get:
Num Employees Avg Salary Low Salary Avg Age Youngest ------------- ---------- ---------- ------- -------- 2 30500.00 26000.00 32 30 |
This query is marginally better because at least the average salary is no longer lower than any actually known salary. The values are based on only the first two employees, Smith and Doe we have five employees, not just two, as this result claims. Two other employees, Jones and Kent, are significantly older than the two who were used to compute the average age. And had Clark's salary been considered, the average salary would be much higher. We are seemingly stuck and can't write a single query to answer the question; we might have to write separate queries to get the count, the salary information, as well as the age information.
Fortunately, the NULLIF() function comes to the rescue. This function is one of many handy Transact-SQL functions. NULLIF() takes two expressions, NULLIF( expression1, expression2 ), and returns NULL if the two expressions are equivalent, and expression1 if they aren't equivalent. The NULLIF() function is roughly the mirror image of another function, ISNULL(), which produces a value if a NULL is encountered (discussed earlier in this chapter). The NULLIF() function is actually a special-purpose shorthand of the ANSI SQL-92 CASE expression (discussed in detail in Chapter 9). Using NULLIF() is equivalent to the following:
CASE WHEN expression1=expression2 THEN NULL ELSE expression1 END |
Hence, NULLIF (emp_salary, 0) produces NULL for those rows in which the salary equals 0. By converting the 0 values to NULL, we can use the aggregate functions just as we'd use them with NULL so that SUM() and MIN() will disregard the NULL entries.
SELECT 'Num Employees'=COUNT(*), 'Avg Salary'=AVG(NULLIF(emp_salary, 0)), 'Low Salary'=MIN(NULLIF(emp_salary, 0)), 'Avg Age'=AVG(NULLIF(emp_age, 0)), 'Youngest''=MIN(NULLIF(emp_age, 0)) FROM employee |
Here's the more accurate result:
Num Employees Avg Salary Low Salary Avg Age Youngest ------------- ---------- ---------- ------- -------- 5 42000.00 26000.00 43 30 |
Finally! This result is exactly what we wanted, and all five employees are represented. Of the three whose salaries we know, the average salary is $42,000.00 and the low salary is $26,000.00. Of the four whose ages we know, the average age is 43 and the youngest is 30.
The previous aggregate examples in this chapter, all using standard SQL-92 syntax, show that by formulating a specific query on the sample table with one of the aggregate functions, we can answer almost any question regarding aggregate sales for some combination of model, year, and color of car. However, answering any of these questions requires a separate, specific query. Of course, it's common to look at data and keep formulating and posing new and different questions. Such questions are the hallmark of data mining, decision-support systems (DSS), online analytic processing (OLAP), or whatever name is being used these days for this age-old need to examine data.
To address this inherent weakness in standard SQL, developers have created several front-end tools that maintain a set of data that allows changes in aggregation to be easily queried, avoiding the need to sweep through data for every new aggregation request. One such tool is Microsoft OLAP Services, which you can install from the same CD that contains your SQL Server 7 software. OLAP tools can add a lot of value, and they make it easy to slice and dice the values pertaining to any "cut" of the data groupings. But even these tools can perform better with help from the database engine.
Some HistoryThe motivation for inventing CUBE came from a paper written by Jim Gray, and others, soon after Jim joined Microsoft as a researcher. After reading his paper, a few of the developers got excited about adding the feature to the 6.5 release, and Don Reichart did a masterful job of implementing this important feature in record time. Subsequently, this feature was submitted to the ANSI SQL committee as a proposed addition to the standard. For more details about CUBE, you can read the paper that Jim Gray submitted to the ACM on the companion CD.
SQL Server has two extensions to GROUP BY CUBE and ROLLUP that allow SQL Server to optionally generate all the aggregate groupings in a single query.
CUBE explodes data to produce a result set containing a superset of groups, with a cross-tabulation of every column to the value of every other column, as well as a special super-aggregate value that can be thought of as meaning ALL VALUES. Here's the CUBE for the 18 rows of the automobile_sales_detail table. Notice that these 18 rows generate 48 rows in the datacube.
units model year color ----- ----- ---- ----- 62 Chevy 1990 Blue 5 Chevy 1990 Red 87 Chevy 1990 White 154 Chevy 1990 ALL 49 Chevy 1991 Blue 54 Chevy 1991 Red 95 Chevy 1991 White 198 Chevy 1991 ALL 71 Chevy 1992 Blue 31 Chevy 1992 Red 54 Chevy 1992 White 156 Chevy 1992 ALL 508 Chevy ALL ALL 63 Ford 1990 Blue 64 Ford 1990 Red 62 Ford 1990 White 189 Ford 1990 ALL 55 Ford 1991 Blue 52 Ford 1991 Red 9 Ford 1991 White 116 Ford 1991 ALL 39 Ford 1992 Blue 27 Ford 1992 Red 62 Ford 1992 White 128 Ford 1992 ALL 433 Ford ALL ALL 941 ALL ALL ALL 125 ALL 1990 Blue 69 ALL 1990 Red 149 ALL 1990 White 343 ALL 1990 ALL 104 ALL 1991 Blue 106 ALL 1991 Red 104 ALL 1991 White 314 ALL 1991 ALL 110 ALL 1992 Blue 58 ALL 1992 Red 116 ALL 1992 White 284 ALL 1992 ALL 182 Chevy ALL Blue 157 Ford ALL Blue 339 ALL ALL Blue 90 Chevy ALL Red 143 Ford ALL Red 233 ALL ALL Red 236 Chevy ALL White 133 Ford ALL White 369 ALL ALL White |
The results of the CUBE make it simple to answer just about any sales aggregation question that we can think of. If we want to get sales information for all 1992 automobiles, regardless of model or color, we can find the corresponding row and see that 284 cars were sold. To find all Chevy sales for all years and colors, it's equally easy to find 508. Such a result set might be used as is for a reference chart or in coordination with an OLAP tool.
The number of rows exploded by a CUBE operation can be surprisingly large. In the example above, 18 rows generated 48 rows of output. However, the result of a CUBE operation won't necessarily produce more rows of output than what appears in the underlying data. The number of rows generated will depend on the number of attributes being grouped and on the actual combinations in your data. Without detailed knowledge of your data, the number of rows produced by a CUBE operation can't be predicted. However, the upper bound can be easily predicted . The upper bound will be equal to the cross-product of the (number of distinct values + 1) value for each attribute. The addition of 1 is for the case of "ALL." The automobile_sales_detail example has three attributes: model, year, and color.
Upper bound of number of rows = (Number of models + 1) * (Number of years + 1) * (Number of colors + 1) = (2 + 1) * (3 + 1) * (3 + 1) = 48 |
This example should make it clear that the upper bound of the number of rows depends not on the number of data rows, but rather on the number of attributes being grouped and the number of distinct values for each attribute. For example, if the table had 18 million instead of 18 data rows but still had no more than two models, three years, and three colors (there's no stated UNIQUE or PRIMARY KEY constraint on the model, year, or color fields), the CUBE operation will again return only 48 rows.
The actual number of rows might be considerably less than the upper bound, however, because CUBE doesn't try to force a 0 or a NULL aggregate for combinations for which no values exist. In the carefully constructed example you just saw, there's complete coverage across every attribute. Each of the two models has an entry for each of the three years and for each of the three colors. But suppose that no sales data appears for the 1990 Chevy and that no red Fords exist for any year. The raw data would look like this:
Rowid Model Year Color Units_Sold ----- ----- ---- ----- ---------- 1 Chevy 1991 Red 54 2 Chevy 1991 White 95 3 Chevy 1991 Blue 49 4 Chevy 1992 Red 31 5 Chevy 1992 White 54 6 Chevy 1992 Blue 71 7 Ford 1990 White 62 8 Ford 1990 Blue 63 9 Ford 1991 White 9 10 Ford 1991 Blue 55 11 Ford 1992 White 62 12 Ford 1992 Blue 39 |
Here's the new cube:
Units Model Year Color ----- ----- ---- ----- 49 Chevy 1991 Blue 54 Chevy 1991 Red 95 Chevy 1991 White 198 Chevy 1991 ALL 71 Chevy 1992 Blue 31 Chevy 1992 Red 54 Chevy 1992 White 156 Chevy 1992 ALL 354 Chevy ALL ALL 63 Ford 1990 Blue 62 Ford 1990 White 125 Ford 1990 ALL 55 Ford 1991 Blue 9 Ford 1991 White 64 Ford 1991 ALL 39 Ford 1992 Blue 62 Ford 1992 White 101 Ford 1992 ALL 290 Ford ALL ALL 644 ALL ALL ALL 63 ALL 1990 Blue 62 ALL 1990 White 125 ALL 1990 ALL 104 ALL 1991 Blue 54 ALL 1991 Red 104 ALL 1991 White 262 ALL 1991 ALL 110 ALL 1992 Blue 31 ALL 1992 Red 116 ALL 1992 White 257 ALL 1992 ALL 120 Chevy ALL Blue 157 Ford ALL Blue 277 ALL ALL Blue 85 Chevy ALL Red 85 ALL ALL Red 149 Chevy ALL White 133 Ford ALL White 282 ALL ALL White |
We now have 12 data rows but still three attributes and the same number of distinct values for each attribute. But instead of 48 rows for the result of the CUBE, the result contains only 39 rows because some combinations of attributes have no data. For example, no row exists for all models of red cars for 1990 because no data met that specific criterion.
If you would benefit from always having the full cube populated when you use CUBE, you might want to add a placeholder row with dummy values for the columns being aggregated so that every combination is represented. Or you could run a query that produces the cross-product of all combinations not existing, and then UNION the result with the cube. (We'll see an example of this later when discussing UNION.) Be careful: as the number of attributes and the cardinality between the values increases , the cube can quickly get large. The full cube based on five attributes, each having 50 distinct values, with at least one matching entry for each combination, generates (50 + 1) 5 , or 345,025,251, rows in the cube. If coverage of all these combinations is nowhere near full, the cube might otherwise have generated "only" a few thousand rows instead of more than 345 million!
So far, we haven't seen the actual query used to produce the cube because you need to understand the basic concepts before yet again dealing with issues of NULL. A simple GROUP BY column WITH CUBE will not, by default, produce super-aggregate rows having a value that outputs as "ALL." The ALL value is really a nonvalue, like NULL (gasp!). In fact, the ALL column is represented by (gulp!) a special kind of NULL, referred to as a grouping NULL .
While implementing this new feature, the developers debated whether to use NULL or invent a new marker value similar in some ways to NULL but clearly still separate from data. The decision wasn't easy. A new data value would be more theoretically pure. The meaning of "ALL" is really different from "unknown" or "missing" that is, it's different from NULL. However, had they introduced another special-meaning marker, it would require the same level of complexity necessary to work with as NULL with special operators like IS ALL and the truth tables would become all the more difficult to work with in expressions such as =, <, IN, and so on. And they did have some precedents that they wanted to avoid.
In many ways, the longtime SQL Server feature COMPUTE BY is quite similar to CUBE. (Actually, it's more similar to the derivative of CUBE ROLLUP which we'll see shortly.) COMPUTE BY, although helpful, is a rarely used feature, largely because COMPUTE BY doesn't generate a standard row format for the result set but rather returns a special alternate result set . Applications must go to significant lengths to handle this special-purpose result, which is nonrelational in that the output itself isn't a table. Most applications didn't or couldn't go the distance; hence, most applications don't support COMPUTE BY. Because of this, the SQL Server development team chose to overload NULL, which applications must already deal with to some extent. (This presents a few wrinkles of its own, which we'll come back to.)
Following is the basic query and a discussion of how "ALL" is produced in the results:
SELECT SUM(units_sold), model, year, color FROM automobile_sales_detail GROUP BY model, year, color WITH CUBE |
Here's the output:
model year color ----------- ----------- ----------- ----------- 62 Chevy 1990 Blue 5 Chevy 1990 Red 87 Chevy 1990 White 154 Chevy 1990 NULL 49 Chevy 1991 Blue 54 Chevy 1991 Red 95 Chevy 1991 White 198 Chevy 1991 NULL 71 Chevy 1992 Blue 31 Chevy 1992 Red 54 Chevy 1992 White 156 Chevy 1992 NULL 508 Chevy NULL NULL 63 Ford 1990 Blue 64 Ford 1990 Red 62 Ford 1990 White 189 Ford 1990 NULL 55 Ford 1991 Blue 52 Ford 1991 Red 9 Ford 1991 White 116 Ford 1991 NULL 39 Ford 1992 Blue 27 Ford 1992 Red 62 Ford 1992 White 128 Ford 1992 NULL 433 Ford NULL NULL 941 NULL NULL NULL 125 NULL 1990 Blue 69 NULL 1990 Red 149 NULL 1990 White 343 NULL 1990 NULL 104 NULL 1991 Blue 106 NULL 1991 Red 104 NULL 1991 White 314 NULL 1991 NULL 110 NULL 1992 Blue 58 NULL 1992 Red 116 NULL 1992 White 284 NULL 1992 NULL 182 Chevy NULL Blue 157 Ford NULL Blue 339 NULL NULL Blue 90 Chevy NULL Red 143 Ford NULL Red 233 NULL NULL Red 236 Chevy NULL White 133 Ford NULL White 369 NULL NULL White |
Notice that rather than ALL, which we had constructed earlier to be a placeholder, the super-aggregate values are represented by NULL here formatted as NULL as in the SQL Server Query Analyzer, although this is application specific. The data in this case had no NULL values, so we could easily tell the difference between NULL meaning "unknown" and NULL meaning "ALL."
If your data uses some NULL values, however, you won't be so lucky. Suppose that we add one row of data that has NULL for both model and year ( assuming these columns allow NULL):
INSERT automobile_sales_detail values (NULL, NULL, 'White', 10) |
The last row in the CUBE above (having 369 as total sales for all white cars) would be seemingly indistinguishable from this actual data row. Fortunately, the GROUPING() function comes to the rescue to differentiate the two. It returns 1 (TRUE) if the element is an ALL value and 0 (FALSE) otherwise. Here is a modified query to show the CUBE with the inclusion of the row having NULL values and with the addition of the GROUPING() function to designate ALL values:
SELECT 'Units Sold'=SUM(units_sold), model, 'ALL Models'=GROUPING(model), year, 'ALL Years'=GROUPING(year), color, 'ALL Colors'=GROUPING(color) FROM automobile_sales_detail GROUP BY model, year, color WITH CUBE |
Here's the result set:
ALL ALL ALL Units Sold model Models year Years color Colors ---------- ----- ------ ------ ----- ----- ------ 10 NULL 0 NULL 0 White 0 10 NULL 0 NULL 0 NULL 1 10 NULL 0 NULL 1 NULL 1 62 Chevy 0 1990 0 Blue 0 5 Chevy 0 1990 0 Red 0 87 Chevy 0 1990 0 White 0 154 Chevy 0 1990 0 NULL 1 49 Chevy 0 1991 0 Blue 0 54 Chevy 0 1991 0 Red 0 95 Chevy 0 1991 0 White 0 198 Chevy 0 1991 0 NULL 1 71 Chevy 0 1992 0 Blue 0 31 Chevy 0 1992 0 Red 0 54 Chevy 0 1992 0 White 0 156 Chevy 0 1992 0 NULL 1 508 Chevy 0 NULL 1 NULL 1 63 Ford 0 1990 0 Blue 0 64 Ford 0 1990 0 Red 0 62 Ford 0 1990 0 White 0 189 Ford 0 1990 0 NULL 1 55 Ford 0 1991 0 Blue 0 52 Ford 0 1991 0 Red 0 9 Ford 0 1991 0 White 0 116 Ford 0 1991 0 NULL 1 39 Ford 0 1992 0 Blue 0 27 Ford 0 1992 0 Red 0 62 Ford 0 1992 0 White 0 128 Ford 0 1992 0 NULL 1 433 Ford 0 NULL 1 NULL 1 951 NULL 1 NULL 1 NULL 1 10 NULL 1 NULL 0 White 0 10 NULL 1 NULL 0 NULL 1 125 NULL 1 1990 0 Blue 0 69 NULL 1 1990 0 Red 0 149 NULL 1 1990 0 White 0 343 NULL 1 1990 0 NULL 1 104 NULL 1 1991 0 Blue 0 106 NULL 1 1991 0 Red 0 104 NULL 1 1991 0 White 0 314 NULL 1 1991 0 NULL 1 110 NULL 1 1992 0 Blue 0 58 NULL 1 1992 0 Red 0 116 NULL 1 1992 0 White 0 284 NULL 1 1992 0 NULL 1 182 Chevy 0 NULL 1 Blue 0 157 Ford 0 NULL 1 Blue 0 339 NULL 1 NULL 1 Blue 0 90 Chevy 0 NULL 1 Red 0 143 Ford 0 NULL 1 Red 0 233 NULL 1 NULL 1 Red 0 10 NULL 0 NULL 1 White 0 236 Chevy 0 NULL 1 White 0 133 Ford 0 NULL 1 White 0 379 NULL 1 NULL 1 White 0 |
Note that the GROUPING() function takes a column name as an argument. A 0 returned for GROUPING( column_name ) means the value returned for that column is a real value. A 1 returned for GROUPING( column_name ) means that the value returned for that column is not real it was returned only because of the use of the CUBE or ROLLUP option.
The GROUPING() function enables you to differentiate between a GROUPING NULL value and a NULL value, but hardly makes reading and analyzing it as intuitive as using ALL. An alternative would be to write a view that outputs a grouping value as ALL. You might be more comfortable turning your programmers and power users loose on such a view rather than relying on them to understand GROUPING NULL vs. NULL value. This example uses the CASE expression, which is discussed in detail in Chapter 9:
CREATE VIEW auto_cube (units, model, year, color) AS SELECT SUM(units_sold), CASE WHEN (GROUPING(model)=1) THEN 'ALL' ELSE ISNULL(model, '????') END, CASE WHEN (GROUPING(year)=1) THEN 'ALL' ELSE ISNULL(CONVERT(char(6), year), '????') END, CASE WHEN (GROUPING(color)=1) THEN 'ALL' ELSE ISNULL(color, '????') END FROM automobile_sales_detail GROUP BY model, year, color WITH CUBE |
Having constructed this view, it's simple for someone to work with, to understand the difference between grouping values (represented by ' ALL' ) and NULL data values (represented by ' ????' ), and to easily further refine the query if necessary to look for certain data. This example shows a simple query:
SELECT * FROM auto_cube |
And the results:
units model year color ----- ----- ---- ----- 10 ???? ???? White 10 ???? ???? ALL 10 ???? ALL ALL 62 Chevy 1990 Blue 5 Chevy 1990 Red 87 Chevy 1990 White 154 Chevy 1990 ALL 49 Chevy 1991 Blue 54 Chevy 1991 Red 95 Chevy 1991 White 198 Chevy 1991 ALL 71 Chevy 1992 Blue 31 Chevy 1992 Red 54 Chevy 1992 White 156 Chevy 1992 ALL 508 Chevy ALL ALL 63 Ford 1990 Blue 64 Ford 1990 Red 62 Ford 1990 White 189 Ford 1990 ALL 55 Ford 1991 Blue 52 Ford 1991 Red 9 Ford 1991 White 116 Ford 1991 ALL 39 Ford 1992 Blue 27 Ford 1992 Red 62 Ford 1992 White 128 Ford 1992 ALL 433 Ford ALL ALL 951 ALL ALL ALL 10 ALL ???? White 10 ALL ???? ALL 125 ALL 1990 Blue 69 ALL 1990 Red 149 ALL 1990 White 343 ALL 1990 ALL 104 ALL 1991 Blue 106 ALL 1991 Red 104 ALL 1991 White 314 ALL 1991 ALL 110 ALL 1992 Blue 58 ALL 1992 Red 116 ALL 1992 White 284 ALL 1992 ALL 182 Chevy ALL Blue 157 Ford ALL Blue 339 ALL ALL Blue 90 Chevy ALL Red 143 Ford ALL Red 233 ALL ALL Red 10 ???? ALL White 236 Chevy ALL White 133 Ford ALL White 379 ALL ALL White |
Working with this view is easy. Grouping values appear as "ALL." Actual NULL values are represented by question marks, appropriate because they're unknown. We can easily further select from the view to drill into any dimension of the cube that we like. For example, to find all Chevy sales, regardless of the year and color:
SELECT * FROM auto_cube WHERE model='Chevy' AND year='ALL' AND color='ALL' |
Here are the results:
units model year color ----- ----- ---- ----- 508 Chevy ALL ALL |
In this view, we elected to output everything as character data; thus, "ALL" was a reasonable choice. If keeping the data numeric had been important, we would've chosen some special value, such as -999999.
If you're looking for a hierarchy or a drill-down report (what you might know as a control-break report if you've ever programmed in COBOL), CUBE can be overkill. It generates many more result rows than you want, and it obscures the requested information with more data. (Don't be confused by thinking that data and information are equivalent.) Using CUBE, you'll get all permutations , including super-aggregates, for all attributes for which corresponding data exists. SQL Server provides the ROLLUP operator to extract statistics and summary information from result sets. ROLLUP returns only the values for a hierarchy of the attributes that you specify.
This behavior is best explained with an example. If we change the CUBE query to use ROLLUP instead, the results are more compact and easier to interpret when we drill into progressive levels of details for sales by model. However, with ROLLUP, we don't get one-stop shopping to answer a question like "How many white cars of any model were sold?"
SELECT 'units sold'=SUM(units_sold), 'model'=CASE WHEN (GROUPING(model)=1) THEN 'ALL' ELSE ISNULL(model, '????') END, 'year'=CASE WHEN (GROUPING(year)=1) THEN 'ALL' ELSE ISNULL(CONVERT(char(6), year), '????') END, 'color'=CASE WHEN (GROUPING(color)=1) THEN 'ALL' ELSE ISNULL(color, '????') END FROM automobile_sales_detail GROUP BY model, year, color WITH ROLLUP |
The results:
units sold model year color ---------- ----- ---- ----- 10 ???? ???? White 10 ???? ???? ALL 10 ???? ALL ALL 62 Chevy 1990 Blue 5 Chevy 1990 Red 87 Chevy 1990 White 154 Chevy 1990 ALL 49 Chevy 1991 Blue 54 Chevy 1991 Red 95 Chevy 1991 White 198 Chevy 1991 ALL 71 Chevy 1992 Blue 31 Chevy 1992 Red 54 Chevy 1992 White 156 Chevy 1992 ALL 508 Chevy ALL ALL 63 Ford 1990 Blue 64 Ford 1990 Red 62 Ford 1990 White 189 Ford 1990 ALL 55 Ford 1991 Blue 52 Ford 1991 Red 9 Ford 1991 White 116 Ford 1991 ALL 39 Ford 1992 Blue 27 Ford 1992 Red 62 Ford 1992 White 128 Ford 1992 ALL 433 Ford ALL ALL 951 ALL ALL ALL |
ROLLUP is similar to COMPUTE BY, which has nice functionality. However, COMPUTE BY is not as relational as ROLLUP it doesn't simply produce results as an ordinary table of values. COMPUTE BY always requires special application programming to deal with its alternate result sets, which are essentially equivalent to super-aggregates. Recall that CUBE and ROLLUP simply use ordinary result sets and that the super-aggregates are represented as a grouping NULL. Generally speaking, CUBE and ROLLUP are more preferable than COMPUTE BY because they're easier to fit into applications; you can use them with views, CASE, subselects, and column aliases; and they're internally better optimized. But for completeness, here's an example using COMPUTE BY as functionally equivalent to ROLLUP:
SELECT units_sold, model, year, color FROM automobile_sales_detail ORDER BY model, year, color COMPUTE SUM(units_sold) BY model, year COMPUTE SUM(units_sold) BY model COMPUTE SUM(units_sold) |
And the results:
units_sold model year color ---------- ----- ---- ----- 10 NULL NULL White Sum =========== 10 Sum =========== 10 units_sold model year color ---------- ----- ---- ----- 62 Chevy 1990 Blue 5 Chevy 1990 Red 87 Chevy 1990 White Sum =========== 154 units_sold model year color ---------- ----- ---- ----- 49 Chevy 1991 Blue 54 Chevy 1991 Red 95 Chevy 1991 White Sum =========== 198 units_sold model year color ---------- ----- ---- ----- 71 Chevy 1992 Blue 31 Chevy 1992 Red 54 Chevy 1992 White Sum =========== 156 Sum =========== 508 units_sold model year color ---------- ----- ---- ----- 63 Ford 1990 Blue 64 Ford 1990 Red 62 Ford 1990 White Sum =========== 189 units_sold model year color ---------- ----- ---- ----- 55 Ford 1991 Blue 52 Ford 1991 Red 9 Ford 1991 White Sum =========== 116 units_sold model year color ---------- ----- ---- ----- 39 Ford 1992 Blue 27 Ford 1992 Red 62 Ford 1992 White Sum =========== 128 Sum =========== 433 Sum =========== 951 |
SQL Server 7 introduces four new aggregate functions, in addition to the standard six described earlier. These functions are used for statistical applications and are considered aggregate functions because they operate on a set of values instead of on one single value. Table 7-8 below describes the four statistical aggregate functions.
You could find the standard deviation and variance of the prices in the titles table with this query:
SELECT stdev(price), var(price) FROM titles |
Alternatively, you could calculate the standard deviation and variance of the prices for each different type of book:
SELECT type, stdev(price), var(price) FROM titles GROUP BY type |
Table 7-8. Statistical aggregate functions.
Aggregate Function | Description |
---|---|
STDEV( expression ) | Returns the statistical standard deviation of all values in the given expression. |
STDEVP( expression ) | Returns the statistical standard deviation for the population for all values in the given expression. |
VAR( expression ) | Returns the statistical variance of all values in the given expression. |
VARP( expression ) | Returns the statistical variance for the population for all values in the given expression. |
We've already seen that the WHERE clause allows us to limit the number of rows that a SELECT statement will return, but using a WHERE clause assumes we have knowledge of the actual data values present. What if we wanted to see only a screenful of rows from a table but had no idea which range of values were present in the table? Prior to SQL Server version 7, we could use the option SET ROWCOUNT n to limit SQL Server to sending only n rows to the client. Every SELECT statement would just stop sending rows back to the client after n rows had been sent.
For example, the following will return only 10 rows from the authors table:
SET ROWCOUNT 10 GO SELECT * FROM authors GO |
This is useful in some situations, but it's quite limited in functionality. The TOP keyword used in the SELECT list allows much greater control over the quantity of data we'd like to see. Here's the syntax for using TOP:
SELECT [ TOP n [PERCENT] [ WITH TIES] ] select_list FROM table_list WHERE conditions ... rest of query |
When the query includes an ORDER BY clause, TOP causes the first n rows ordered by the ORDER BY clause to be output. When the query has no ORDER BY clause, we can't predict which rows will be returned. Compare the output of the following two queries:
SELECT TOP 5 title_id, price, type FROM titles title_id price type -------- --------------------- ------------ BU1032 19.9900 business BU1111 11.9500 business BU2075 2.9900 business BU7832 19.9900 business MC2222 19.9900 mod_cook |
SELECT TOP 5 title_id, price, type FROM titles ORDER BY price DESC title_id price type -------- --------------------- ------------ PC1035 22.9500 popular_comp PS1372 21.5900 psychology TC3218 20.9500 trad_cook PC8888 20.0000 popular_comp BU1032 19.9900 business |
Notice that simply including TOP 5 in the SELECT list is equivalent to setting ROWCOUNT to 5. SQL Server just stops returning rows after the first five. In the example above, however, you should note that other books exist with a price of $19.99, but because we wanted only five, we didn't see them. Does this output really answer the question, "What are the five highest priced books?" If you answered no and really want to see all the books with the same price as the ones listed, you can use the WITH TIES option. This option is only allowable if your SELECT statement includes an ORDER BY:
SELECT TOP 5 WITH TIES title_id, price, type FROM titles ORDER BY price DESC |
This returns the following eight rows:
title_id price type -------- --------------------- ------------ PC1035 22.9500 popular_comp PS1372 21.5900 psychology TC3218 20.9500 trad_cook PC8888 20.0000 popular_comp BU1032 19.9900 business BU7832 19.9900 business MC2222 19.9900 mod_cook PS3333 19.9900 psychology |
If we want to see a certain fraction of the rows, we can use TOP with PERCENT, which will round up to the nearest integer number of rows:
SELECT TOP 30 Percent title_id, price, type FROM titles ORDER BY price DESC |
You can think of SQL Server carrying out this operation by first counting out the number of rows specified by TOP in this case, five. Only after the first five are returned will SQL Server check to see if any other rows have a value equivalent to the last one returned.
This won't necessarily give you books with the five highest prices. If we changed TOP 5 to TOP 8, you'd get the same eight rows back as above, which represent only five different prices.
If we really wanted to see the eight highest prices, we could use DISTINCT with TOP. We couldn't select title_id at the same time, because DISTINCT applies to the whole result row, and every title_id and price combination is distinct in itself. However, we could use a subquery to determine which rows have one of the eight highest prices:
SELECT title_id, price, type FROM titles WHERE price in (SELECT DISTINCT TOP 8 price FROM titles ORDER BY price DESC) |
In this case, the rows wouldn't come back in descending order of price, because the order is controlled by the outer query. You'd have to add another ORDER BY at the end.
The titles table contains 18 rows, and 30 percent of 18 is 5.6. This is rounded up to return these six rows:
title_id price type -------- --------------------- ------------ PC1035 22.9500 popular_comp PS1372 21.5900 psychology TC3218 20.9500 trad_cook PC8888 20.0000 popular_comp BU1032 19.9900 business BU7832 19.9900 business |
We can also use the WITH TIES option with PERCENT. If used in the above query, we'd then get the additional two rows that also have a price of $19.99.
UNION is a conceptually easy topic. Given the same number of columns and compatible datatypes, UNION combines two or more result sets into a single result set. By default, duplicate rows are eliminated, although you can include them by specifying UNION ALL. The datatypes of the result sets to be combined with UNION don't need to be identical, but they must be able to be implicitly converted. If this isn't the case, you can explicitly convert them to identical or compatible types using the handy CONVERT() function.
The following example shows UNION at its most basic level. It also demonstrates that a SELECT statement in SQL Server doesn't need to be from a table or a view but can be a constant or a variable:
SELECT col1=2, col2=1 UNION SELECT xxx=1, yyy=2 UNION SELECT '3','0' UNION SELECT TAN(0), 3*1000/10/100 -- Tangent of 0 is 0. -- So this row produces 0.0, 3 UNION SELECT 1, 2 |
The output follows:
col1 col2 ---- ---- 0.0 3 3.0 0 1.0 2 2.0 1 |
Earlier versions of SQL Server always returned the result of a UNION in sorted order, unless you specified UNION ALL. This occurred because UNION eliminated duplicate rows using a sorting strategy. The ordering was simply a by-product of the sorting to eliminate duplicates. However, SQL Server 7 has several alternative strategies available for removing duplicates, so there's no guarantee of any particular order when using UNION. If order is important to you, you should explicitly use ORDER BY at the end of the last SELECT statement.
In the example above, the datatypes are compatible but not identical. Because TAN(0) returns a numeric datatype, col1 is automatically cast to numeric (since using an integer would lose precision), and the other values are implicitly converted. In addition, the one duplicate row (1.0, 2) is eliminated, and the columns take their monikers from the first result set.
If you used UNION ALL instead of UNION, the duplicate row wouldn't be eliminated and no extra work would be needed to determine which rows were distinct. If there's no need to eliminate duplicates, or if you know that there won't be any duplicates, you can achieve a noticeable performance improvement by using UNION ALL. The following query illustrates the use of UNION ALL and uses ORDER BY to present the results in a particular order. It also converts the result of the tangent operation to a tinyint so that col1 is then implicitly converted to an int .
SELECT col1=2, col2=1 UNION ALL SELECT xxx=1, yyy=2 UNION ALL SELECT '3', '0' UNION ALL SELECT CONVERT(smallint, TAN(0)), 3*1000/10/100 -- Tangent of 0 is 0. So this row produces 0.0, 3 UNION ALL SELECT 1, 2 ORDER BY col2 |
Here's the output:
col1 col2 ---- ---- 3 0 2 1 1 2 1 2 0 3 |
Earlier, we saw that the CUBE operation wouldn't generate missing rows with the value 0. However, you might occasionally want the full cube, with 0 as the SUM() value for missing combinations. We promised you an example of how this could be accomplished using UNION, so here's the example. And here's the key the table can be cross-joined to itself to produce a 0 row for every combination possible given the non-NULL data. This query will use the existing data to produce a list of every possible combination of model, color, and year:
SELECT DISTINCT units_sold=0, A.model, B.color, C.year FROM automobile_sales_detail A CROSS JOIN automobile_sales_detail B CROSS JOIN automobile_sales_detail C |
Here's the output:
units_sold model color year ---------- ----- ----- ---- 0 Chevy Blue 1990 0 Chevy Blue 1991 0 Chevy Blue 1992 0 Chevy Red 1990 0 Chevy Red 1991 0 Chevy Red 1992 0 Chevy White 1990 0 Chevy White 1991 0 Chevy White 1992 0 Ford Blue 1990 0 Ford Blue 1991 0 Ford Blue 1992 0 Ford Red 1990 0 Ford Red 1991 0 Ford Red 1992 0 Ford White 1990 0 Ford White 1991 0 Ford White 1992 |
The real underlying data is the set of 12 rows shown earlier and shown again here. Recall that six combinations have no value that is, there are no Chevys for 1990 and no red Fords:
model year color units_sold ----- ---- ----- ---------- Chevy 1991 Red 54 Chevy 1991 White 95 Chevy 1991 Blue 49 Chevy 1992 Red 31 Chevy 1992 White 54 Chevy 1992 Blue 71 Ford 1990 White 62 Ford 1990 Blue 63 Ford 1991 White 9 Ford 1991 Blue 55 Ford 1992 White 62 Ford 1992 Blue 39 |
Having generated all the dummy rows as such, we can then easily UNION the cross-joined dummy results with the output of the CUBE. But that still wouldn't do because we'd get both a real row and a dummy row whenever a real combination exists. SQL Server gives you several good ways to solve this dilemma. You can use NOT EXISTS to produce the dummy rows only for combinations that don't exist. Another approach, and perhaps the most intuitive one, is to make a view of the cross-joined dummy rows and the actual data, and then perform the CUBE on the view. The 0 values, of course, don't affect the SUM, so this works nicely :
CREATE VIEW fullcube ( units_sold, model, year, color ) AS ( SELECT D.units_sold, D.model, D.year, D.color FROM automobile_sales_detail D UNION ALL SELECT DISTINCT 0, A.model, C.year, B.color FROM automobile_sales_detail A CROSS JOIN automobile_sales_detail B CROSS JOIN automobile_sales_detail C ) |
Having constructed the view, we can then issue our CUBE query against it, just as we did previously against the base table:
SELECT units_sold=SUM(units_sold), model, year, color FROM fullcube GROUP BY model, year, color WITH CUBE |
Here's the output:
units_sold model year color ---------- ----- ---- ----- 0 Chevy 1990 Blue 0 Chevy 1990 Red 0 Chevy 1990 White 0 Chevy 1990 NULL 49 Chevy 1991 Blue 54 Chevy 1991 Red 95 Chevy 1991 White 198 Chevy 1991 NULL 71 Chevy 1992 Blue 31 Chevy 1992 Red 54 Chevy 1992 White 156 Chevy 1992 NULL 354 Chevy NULL NULL 63 Ford 1990 Blue 0 Ford 1990 Red 62 Ford 1990 White 125 Ford 1990 NULL 55 Ford 1991 Blue 0 Ford 1991 Red 9 Ford 1991 White 64 Ford 1991 NULL 39 Ford 1992 Blue 0 Ford 1992 Red 62 Ford 1992 White 101 Ford 1992 NULL 290 Ford NULL NULL 644 NULL NULL NULL 63 NULL 1990 Blue 0 NULL 1990 Red 62 NULL 1990 White 125 NULL 1990 NULL 104 NULL 1991 Blue 54 NULL 1991 Red 104 NULL 1991 White 262 NULL 1991 NULL 110 NULL 1992 Blue 31 NULL 1992 Red 116 NULL 1992 White 257 NULL 1992 NULL 120 Chevy NULL Blue 157 Ford NULL Blue 277 NULL NULL Blue 85 Chevy NULL Red 0 Ford NULL Red 85 NULL NULL Red 149 Chevy NULL White 133 Ford NULL White 282 NULL NULL White |
Then we can define yet another view on top to generate the "ALL" and "????" placeholders. No NULL data exists in this example, so we'll look at only the "ALL" case:
CREATE VIEW auto_cube (units, model, year, color) AS SELECT SUM(units_sold), ISNULL(model, 'ALL'), ISNULL(CONVERT(char(4), year), 'ALL'), ISNULL(color, 'ALL') FROM fullcube GROUP BY model, year, color WITH CUBE |
One advantage of filling out the cube with placeholder rows is that no matter what combination we formulate , we'll get an answer. There are no unknowns and no need to include the GROUPING() function. Had we not constructed the cube, this query would return no rows found instead of a row with 0 units:
SELECT * FROM auto_cube WHERE model='Chevy' AND color='ALL' AND year='1990' units model year color ----- ----- ---- ----- 0 Chevy 1990 ALL |
Just for fun, here's the equivalent of the fullcube view as a derived table:
SELECT units_sold=SUM(T1.units_sold), T1.model, T1.year, T1.color FROM ( SELECT D.units_sold, D.model, D.year, D.color FROM automobile_sales_detail D UNION ALL SELECT DISTINCT 0, A.model, C.year, B.color FROM automobile_sales_detail A CROSS JOIN automobile_sales_detail B CROSS JOIN automobile_sales_detail C ) AS T1 GROUP BY T1.model, T1.year, T1.color WITH CUBE GO |
Now that you've seen these somewhat exotic uses of CROSS JOIN and views to fill out the cube, be aware that if you had wanted the whole cube to be generated, you might have inserted some placeholder rows with a value of 0 in the base table itself (assuming you had access to do so). That way, every combination is accounted for in the base table itself.