Other Search Expressions | Inside Microsoft SQL Server 7.0 (Mps)

In addition to the SELECT statement and joins, SQL Server provides other useful search expressions that enable you to tailor your queries. These expressions include LIKE and BETWEEN, along with aggregate functions such as AVG, COUNT, and MAX.

LIKE

In an earlier example, we used LIKE to find authors whose names start with Ri . LIKE allows you to search using the simple pattern matching that is standard with ANSI SQL, but SQL Server adds capability by introducing wildcard characters that allow the use of regular expressions. Table 7-4 shows how wildcard characters are used in searches:

Table 7-4. Wildcard characters recognized by SQL Server for use in regular expressions.

Wildcard	Searches for
%	Any string of zero or more characters.
-	Any single character.
[ ]	Any single character within the specified range (for example, [a-f]) or the specified set (for example, [abcdef]).
[^]	Any single character not within the specified range (for example, [^a-f]) or the specified set (for example, [^abcdef]).

The following examples from the SQL Server documentation show how LIKE can be used (Table 7-5).

Table 7-5. Use of LIKE in searches.

Form	Searches for
LIKE 'Mc%'	All names that begin with the letters Mc (McBadden).
LIKE '%inger'	All names that end with inger (Ringer, Stringer).
LIKE '%en%'	All names that include the letters en (Bennet, Green, McBadden).
LIKE '-heryl'	All six-letter names ending with heryl (Cheryl, Sheryl).
LIKE '[CK]ars[eo]n'	All names that begin with C or K, then ars, then e or o, and then end with n (Carsen, Karsen, Carson, and Karson).
LIKE '[M-Z]inger'	All names ending with inger that begin with any single letter from M through Z (Ringer).
LIKE 'M[^c]%'	All names beginning with the letter M that don't have the letter c as the second letter (MacFeather).

Trailing Blanks

Trailing blanks (blank spaces) can cause considerable confusion and errors. In SQL Server, using trailing blanks requires that you be aware of differences in how varchar and char data is stored and of how the SET ANSI_PADDING option is used.

Suppose that we create a table with five columns and insert data as follows :

 SET ANSI_PADDING OFF   -- OFF is the default if nothing is set DROP TABLE checkpad GO CREATE TABLE checkpad ( rowid        smallint       NOT NULL PRIMARY KEY, c10not       char(10)       NOT NULL, c10nul       char(10)       NULL, v10not       varchar(10)    NOT NULL, v10nul       varchar(10)    NULL ) -- Row 1 has names with no trailing blanks INSERT checkpad VALUES (1, 'John', 'John', 'John', 'John') -- Row 2 has each name inserted with three trailing blanks INSERT checkpad VALUES     (2, 'John   ', 'John   ', 'John   ', 'Jo hn   ') -- Row 3 has each name inserted with a full six trailing blanks INSERT checkpad VALUES     (3, 'John      ', 'John      ', 'John      ', 'John      ') -- Row 4 has each name inserted with seven trailing blanks too many INSERT checkpad VALUES     (4, 'John       ', 'John       ', 'John       ', 'John       ')

We can then use the following query to analyze the contents of the table. We use the DATALENGTH( expression ) function to determine the actual length of the data, and then convert to VARBINARY() to display the hexadecimal values of the characters stored. (For those who don't have the ASCII chart memorized, a blank space is 0x20.) We then use a LIKE to try to match each column, once with one trailing blank and once with six trailing blanks.

 SELECT ROWID, "0xC10NOT"=CONVERT(VARBINARY(10), c10not), L1=DATALENGTH(c10not), "LIKE 'John %'"=CASE WHEN (c10not LIKE 'John %') THEN 1 ELSE 0 END, "LIKE 'John      %'"=CASE WHEN (c10not LIKE 'John      %')     THEN 1 ELSE 0 END, "0xC10NUL"=CONVERT(VARBINARY(10), c10nul), L2=DATALENGTH(c10nul), "LIKE 'John %'"=CASE WHEN (c10nul LIKE 'John %') THEN 1 ELSE 0 END, "LIKE 'John      %'"=CASE WHEN (c10nul LIKE 'John      %')     THEN 1 ELSE 0 END, "0xV10NOT"=CONVERT(VARBINARY(10), v10not), L3=DATALENGTH(v10not), "LIKE 'John %'"=CASE WHEN (v10not LIKE 'John %') THEN 1 ELSE 0 END, "LIKE 'John      %'"=CASE WHEN (v10not LIKE 'John      %')     THEN 1 ELSE 0 END, "0xV10NUL"=CONVERT(VARBINARY(10), v10nul), L4=DATALENGTH(v10nul), "LIKE 'John %'"=CASE WHEN (v10nul LIKE 'John %') THEN 1 ELSE 0 END, "LIKE 'John      %'"=CASE WHEN (v10nul LIKE 'John      %')     THEN 1 ELSE 0 END FROM checkpad

Figure 7-6 shows the results of this query from the checkpad table.

click to view at full size.

Figure 7-6. Results with SET ANSI_PADDING OFF.

Without enabling SET ANSI_PADDING ON, SQL Server trims trailing blanks for variable-length columns and for fixed-length columns that allow NULLs. Notice that only column c10not has a length of 10. The other three columns are treated identically and end up with a length of 4 and no padding. For these columns, even trailing blanks explicitly entered within the quotation marks are trimmed . All the variable-length columns as well as the fixed-length column that allows NULLs have only the 4 bytes J-O-H-N (4a-6f-68-6e) stored. The true fixed-length column, c10not , is automatically padded with six blank spaces (0x20) to fill the entire 10 bytes.

All four columns would match LIKE 'John%' (no trailing blanks), but only the fixed-length column c10not matches LIKE 'John %' (one trailing blank) and LIKE 'JOHN %' (six trailing blanks). The reason for this should be apparent when you realize that those trailing blanks are truncated and aren't part of the stored values for the other columns. A column query will get a hit with LIKE and trailing blanks only when the trailing blanks are stored. In this example, only the c10not column stores trailing blanks, so that's where the hit occurs.

SQL Server has dealt with trailing blanks in this way for a long time. However, when the developers began work to pass the National Institute of Standards and Technology (NIST) test suite for ANSI SQL conformance, they realized that this behavior was inconsistent with the ANSI specification. The specification requires that you always pad char datatypes and that you don't truncate trailing blanks entered by the user , even for varchar . We can argue for either side both the long-standing SQL Server behavior and the ANSI specification behavior can be desirable, depending on the circumstances. However, if SQL Server changed its behavior to conform to ANSI, it would break many deployed SQL Server applications. So, to satisfy both sides, the ANSI_PADDING option was added in version 6.5. If we enable SET ANSI_PADDING ON, re-create the table, reinsert the rows, and then issue the same query, we get different results from those shown in Figure 7-6. Figure 7-7 shows these results.

By setting ANSI_PADDING ON, the char(10) columns c10not and c10nul , whether declared NOT NULL or NULL, are always padded out to 10 characters regardless of how many trailing blanks were entered. In addition, the variable-length columns v10not and v10nul contain the number of trailing blanks that were inserted. Trailing blanks are not truncated. Because the variable-length columns now store any trailing blanks that were part of the INSERT (or UPDATE) statement, the columns get hits in queries with LIKE and trailing blanks if the query finds that number of trailing blanks (or more) in the columns' contents. Notice that no error occurs by inserting a column with seven (instead of six) trailing blanks. The final trailing blank simply gets truncated. If you want a warning message to appear in such cases, you can use SET ANSI_WARNING ON.

click to view at full size.

Figure 7-7. Results with SET ANSI_PADDING ON.

Note that the SET ANSI_PADDING ON option applies to a table only if the option was enabled at the time the table was created (or to the column if it was added via ALTER TABLE). Enabling this option for an existing table does nothing. To determine whether a column has ANSI_PADDING set to ON, examine the status field of the syscolumns table and do a bitwise AND 20 (decimal). If the bit is on, you know the column was created with the option enabled. For example:

 SELECT name, 'Padding On'=CASE WHEN status & 20 <> 0     THEN 'YES' ELSE 'NO' END FROM syscolumns WHERE id=OBJECT_ID('checkpad') AND name='v10not' GO

SET ANSI_PADDING ON also affects the behavior of binary and varbinary fields, but the padding is done with zeros instead of spaces. Table 7-6 below shows the effects of the SET ANSI_PADDING setting when values are inserted into columns with char , varchar , binary , and varbinary data types.

Note than when a fixed-length column allows NULLs, it mimics the behavior of variable-length columns if ANSI_PADDING is off, and it mimics the behavior of fixed-length columns if ANSI_PADDING is on.

The SQL Server ODBC driver and OLE DB Provider for SQL Server automatically set ANSI_PADDING to ON when connecting. This can be configured in ODBC data sources, in ODBC connection attributes, or OLE DB connection properties set in the application before connecting.

Table 7-6. The effects of the ANSI_PADDING setting.

Setting of ANSI_PADDING	char( n ) NOT NULL or binary( n ) NOT NULL	char( n ) NULL or binary( n ) NULL	varchar( n ) or varbinary( n )
ON	Pad original value to the length of the column.	Follows same rules as for char(n) or binary(n) NOT NULL.	Trailing blanks (for varchar ) or zeros (for varbinary ) in values inserted aren't trimmed. Values are not padded to the length of the column.
OFF	Pad original value to the length of the column.	Follows same rules as for varchar or varbinary .	Trailing blanks (for varchar ) or trailing zeros (for varbinary ) are trimmed.

Regardless of how the column was created, SQL Server provides functions that can help you deal effectively with trailing blanks. The RTRIM( char_expression ) function removes trailing blanks if they exist. You can add trailing blanks to your output by concatenating with the SPACE( integer_expression ) function. An easier way is to use the CONVERT function to convert to a fixed-length char() of whatever length you want. Then variable-length columns will be returned as fully padded, regardless of the ANSI_PADDING setting and regardless of the columns' actual storage, by converting them to char(10) or whatever size you want. The following SELECT statement against the checkpad table illustrates this:

 SELECT CONVERT(VARBINARY(10), CONVERT(char(10), v10nul))     FROM checkpad

Here's the result:

 0x4a6f686e202020202020 0x4a6f686e202020202020 0x4a6f686e202020202020 0x4a6f686e202020202020

BETWEEN

BETWEEN is shorthand for greater-than-or-equal-to AND less-than -or-equal-to. The clause WHERE C BETWEEN B AND D is equivalent to writing WHERE C >= B AND C <= D. This one's not too tricky, but the order of the values to be checked is important when you use BETWEEN. If B , C , and D are increasing values, you can see that WHERE C BETWEEN D AND B wouldn't be true because it would evaluate to WHERE C >= D AND C <= B. Some early versions of SQL Server let you reverse the order, such that the meaning was equivalent to WHERE (C >= B AND C <= D) OR (C >= D AND C <= B). However, that behavior was considered to be a bug and it was corrected in 1993.

Aggregate Functions

Aggregate functions (sometimes referred to as set functions ) allow you to sum-marize a column of output. SQL Server provides six general aggregate functions that are standard ANSI SQL-92 fare. (Therefore, we won't go into much depth about their general use; instead, we'll focus on some aspects specific to SQL Server.) Table 7-7 below summarizes SQL Server's aggregate functions.

Consider the following table, automobile_sales_detail . We'll use this table to demonstrate how aggregate functions simplify finding the answers to complex questions:

 rowid    model    year    color    units_sold -----    -----    ----    -----    ---------- 1        Chevy    1990    Red      5 2        Chevy    1990    White    87 3        Chevy    1990    Blue     62 4        Chevy    1991    Red      54 5        Chevy    1991    White    95 6        Chevy    1991    Blue     49 7        Chevy    1992    Red      31 8        Chevy    1992    White    54 9        Chevy    1992    Blue     71 10       Ford     1990    Red      64 11       Ford     1990    White    62 12       Ford     1990    Blue     63 13       Ford     1991    Red      52 14       Ford     1991    White    9 15       Ford     1991    Blue     55 16       Ford     1992    Red      27 17       Ford     1992    White    62 18       Ford     1992    Blue     39

Table 7-7. SQL Server's aggregate functions.

Aggregate Function	Description
AVG( expression )	Returns the average (mean) of all the values, or only the DISTINCT values, in the expression. You can use AVG with numeric ^* columns only. Null values are ignored.
COUNT( expression )	Returns the number of non-null values in the expression. When DISTINCT is specified, COUNT finds the number of unique non-null values. You can use COUNT with both numeric and character columns. Null values are ignored.
COUNT(*)	Returns the number of rows. COUNT(*) takes no parameters and can't be used with DISTINCT. All rows are counted, even those with null values.
MAX( expression )	Returns the maximum value in the expression. You can use MAX with numeric, character, and datetime columns but not with bit columns. With character columns, MAX finds the highest value in the collating sequence. MAX ignores any null values. DISTINCT is available for ANSI compatibility, but it's not meaningful with MAX.
MIN( expression )	Returns the minimum value in the expression. You can use MIN with numeric, character, and datetime columns, but not with bit columns. With character columns, MIN finds the value that is lowest in the sort sequence. MIN ignores any null values. DISTINCT is a1vailable for ANSI compatibility, but it's not meaningful with MIN.
SUM( expression )	Returns the sum of all the values, or only the DISTINCT values, in the expression. You can use SUM with numeric columns only. Null values are ignored.

You can increase the power of aggregate functions by allowing them to be grouped and by allowing the groups to have criteria established for inclusion via the HAVING clause. The following queries provide some examples.

EXAMPLE 1

Show the oldest and newest model years for sale, and show the average sales of all the entries in the table:

 SELECT 'Oldest'=MIN(year), 'Newest'=MAX(year),     'Avg Sales'=AVG(units_sold) FROM automobile_sales_detail

Here's the output:

 Oldest    Newest    Avg Sales ------    ------    --------- 1990      1992      52

EXAMPLE 2

Do the Example 1 query, but show the values for autos that are Chevys only:

 SELECT 'Oldest'=MIN(year), 'Newest'=MAX(year),     'Avg Sales'=AVG(units_sold)  FROM automobile_sales_detail WHERE model='Chevy'

And the result:

 Oldest    Newest    Avg Sales ------    ------    --------- 1990      1992      56

EXAMPLE 3

Show the same values as in Example 1, but group them based on the model and color of the cars :

 SELECT model, color, 'Oldest'=MIN(year), 'Newest'=MAX(year),     'Avg Sales'=AVG(units_sold)  FROM automobile_sales_detail GROUP BY model, color ORDER BY model, color

The output appears below.

 model    color    Oldest    Newest    Avg Sales -----    -----    ------    ------    --------- Chevy    Blue     1990      1992      60 Chevy    Red      1990      1992      30 Chevy    White    1990      1992      78 Ford     Blue     1990      1992      52 Ford     Red      1990      1992      47 Ford     White    1990      1992      44

EXAMPLE 4

Show the same values for only those model-year rows with average sales of 65 or less. Also, order and group the output first by color and then by model, but keep the columns in the same order:

 SELECT model, color, 'Oldest'=MIN(year), 'Newest'=MAX(year),     'Avg Sales'=AVG(units_sold)  FROM automobile_sales_detail GROUP BY color, model HAVING AVG(units_sold) <= 65 ORDER BY color, model

Here's the output:

 model    color    Oldest    Newest    Avg Sales -----    -----    ------    ------    --------- Chevy    Blue     1990      1992      60 Ford     Blue     1990      1992      52 Chevy    Red      1990      1992      30 Ford     Red      1990      1992      47 Ford     White    1990      1992      44

NOTE
We include an ORDER BY clause in these queries, even though we specify the criteria of the ORDER BY to be the same criteria that the GROUP BY clause uses. In earlier releases of SQL Server, this wasn't necessary, because GROUP BY always returned the rows in the order of the grouping columns. SQL Server 7 has several alternative strategies for handling GROUP BY that don't require sorting of the results, so it's unpredictable which order the results will be returned in. If you want the results in a particular order, you must specify ORDER BY.

An Alternative to Using COUNT(*)

You can use COUNT(*) to find the count of all the rows in the table, but a considerably faster way exists. The rows column in the sysindexes table keeps the current rowcount dynamically for the clustered index. If one exists, indid = 1. If no clustered index exits, sysindexes keeps the count for the table ( indid = 0). A quicker query than using COUNT(*) to find the count of all rows in the table looks something like this:

 SELECT rows     FROM sysindexes     WHERE id=OBJECT_ID ("authors")     AND indid < 2

This query works only for base tables, not views, and it works only if you apply no selection criteria via a WHERE clause. If you want to respond to a query such as "Show me the count of all rows for which the author lives in California," you'd still need to use COUNT(*). In addition, there's no guarantee that this number will be accurate 100 percent of the time. In previous versions of SQL Server, certain operations (such as SELECT INTO), didn't always correctly update the rows column in sysindexes. SQL Server 7 is much more likely to keep sysindexes up to date, but there's still no guarantee. If you absolutely must know exactly how many rows are in a table in your application, use the COUNT(*) aggregate.

Aggregate Functions and NULLs

As usual, you must understand the effect of using NULLs to use them appropriately with aggregate functions. Unfortunately, these effects aren't necessarily intuitive or even consistent. Because NULL is considered Unknown, one could certainly argue that using SUM(), MAX(), MIN(), or AVG() on a table containing one or more NULL values should also produce NULL. Obviously, if we have three employees and we don't know the salaries of two of them, it's totally inaccurate to state that the SUM() of the salaries for the three employees is equal to the one known salary. Yet this is exactly how SUM(), MIN(), MAX(), AVG(), and COUNT() work (but not COUNT(*)). Implicitly, these functions seem to tack on the criterion of "for only those values that are not NULL."

You will likely want these operations to work this way most of the time, and pragmatically, it's good that they work like this because they're simpler to use for common cases. But if you don't understand exactly how these functions will affect your code, you might introduce some bugs . For example, if you were to divide SUM() by the result of COUNT(*) as a way to compute the mean value, the result probably wouldn't be what you intended SUM() disregards the NULL values, but COUNT(*) counts the rows that include those NULL values.

We've suggested a few times that using default or dummy values can be a good alternative to using NULL. But we've also tried to make clear that while these alternatives solve some issues, they introduce other problems. You must account for aggregates, and you might want to get back to exactly the semantics that aggregates such as SUM() use with defaults, ignoring the default or dummy values. Aggregating on only one column is pretty straightforward, and your WHERE clause could simply exclude those values you want to disregard. But if you're aggregating on multiple columns, eliminating rows in the WHERE clause might be a poor solution.

For example, suppose that in our employee table, we have both an emp_age column and an emp_salary column. To avoid using NULL, we use 0.00 when we don't know the actual value. We want one query to do aggregations of each column. Yet we don't want to disregard a row's value for emp_age if we don't have an actual value for emp_salary , or vice versa.

Here's our employee table:

 emp_id    emp_name    emp_age    emp_salary ------    --------    -------    ---------- 1         Smith       34         26000.00 2         Doe         30         35000.00 3         Jones       45         0.00 4         Clark       0          65000.00 5         Kent        63         0.00

Now suppose we want to write a simple query to get the count of all employees, the average and lowest salaries, and the average and lowest ages. Here's our first simplistic query:

 SELECT  'Num Employees'=COUNT(*),  'Avg Salary'=AVG(emp_salary),  'Low Salary'=MIN(emp_salary), 'Avg Age'=AVG(emp_age), 'Youngest'=MIN(emp_age) FROM employee

Here's the result:

 Num Employees    Avg Salary    Low Salary    Avg Age    Youngest -------------    ----------    ----------    -------    -------- 5                25200.00      0.00          34         0

Obviously, this query doesn't correctly answer our question, because the dummy values of 0 have distorted the results for the AVG() and MIN() columns. So we decide to exclude those rows by stating in the WHERE clause that the salary should not be the default value:

 SELECT  'Num Employees'=COUNT(*),  'Avg Salary'=AVG(emp_salary),  'Low Salary'=MIN(emp_salary), 'Avg Age'=AVG(emp_age), 'Youngest'=MIN(emp_age) FROM employee WHERE emp_salary > 0 AND emp_age > 0

Here's what we get:

 Num Employees    Avg Salary    Low Salary    Avg Age    Youngest -------------    ----------    ----------    -------    -------- 2                30500.00      26000.00      32         30

This query is marginally better because at least the average salary is no longer lower than any actually known salary. The values are based on only the first two employees, Smith and Doe we have five employees, not just two, as this result claims. Two other employees, Jones and Kent, are significantly older than the two who were used to compute the average age. And had Clark's salary been considered, the average salary would be much higher. We are seemingly stuck and can't write a single query to answer the question; we might have to write separate queries to get the count, the salary information, as well as the age information.

Fortunately, the NULLIF() function comes to the rescue. This function is one of many handy Transact-SQL functions. NULLIF() takes two expressions, NULLIF( expression1, expression2 ), and returns NULL if the two expressions are equivalent, and expression1 if they aren't equivalent. The NULLIF() function is roughly the mirror image of another function, ISNULL(), which produces a value if a NULL is encountered (discussed earlier in this chapter). The NULLIF() function is actually a special-purpose shorthand of the ANSI SQL-92 CASE expression (discussed in detail in Chapter 9). Using NULLIF() is equivalent to the following:

 CASE     WHEN expression1=expression2 THEN NULL     ELSE expression1 END

Hence, NULLIF (emp_salary, 0) produces NULL for those rows in which the salary equals 0. By converting the 0 values to NULL, we can use the aggregate functions just as we'd use them with NULL so that SUM() and MIN() will disregard the NULL entries.

 SELECT  'Num Employees'=COUNT(*),  'Avg Salary'=AVG(NULLIF(emp_salary, 0)),  'Low Salary'=MIN(NULLIF(emp_salary, 0)), 'Avg Age'=AVG(NULLIF(emp_age, 0)), 'Youngest''=MIN(NULLIF(emp_age, 0)) FROM employee

Here's the more accurate result:

 Num Employees    Avg Salary    Low Salary    Avg Age    Youngest -------------    ----------    ----------    -------    -------- 5                42000.00      26000.00      43         30

Finally! This result is exactly what we wanted, and all five employees are represented. Of the three whose salaries we know, the average salary is $42,000.00 and the low salary is $26,000.00. Of the four whose ages we know, the average age is 43 and the youngest is 30.

Datacube Aggregate Variations

The previous aggregate examples in this chapter, all using standard SQL-92 syntax, show that by formulating a specific query on the sample table with one of the aggregate functions, we can answer almost any question regarding aggregate sales for some combination of model, year, and color of car. However, answering any of these questions requires a separate, specific query. Of course, it's common to look at data and keep formulating and posing new and different questions. Such questions are the hallmark of data mining, decision-support systems (DSS), online analytic processing (OLAP), or whatever name is being used these days for this age-old need to examine data.

To address this inherent weakness in standard SQL, developers have created several front-end tools that maintain a set of data that allows changes in aggregation to be easily queried, avoiding the need to sweep through data for every new aggregation request. One such tool is Microsoft OLAP Services, which you can install from the same CD that contains your SQL Server 7 software. OLAP tools can add a lot of value, and they make it easy to slice and dice the values pertaining to any "cut" of the data groupings. But even these tools can perform better with help from the database engine.

Some History
The motivation for inventing CUBE came from a paper written by Jim Gray, and others, soon after Jim joined Microsoft as a researcher. After reading his paper, a few of the developers got excited about adding the feature to the 6.5 release, and Don Reichart did a masterful job of implementing this important feature in record time. Subsequently, this feature was submitted to the ANSI SQL committee as a proposed addition to the standard. For more details about CUBE, you can read the paper that Jim Gray submitted to the ACM on the companion CD.

SQL Server has two extensions to GROUP BY CUBE and ROLLUP that allow SQL Server to optionally generate all the aggregate groupings in a single query.

CUBE

CUBE explodes data to produce a result set containing a superset of groups, with a cross-tabulation of every column to the value of every other column, as well as a special super-aggregate value that can be thought of as meaning ALL VALUES. Here's the CUBE for the 18 rows of the automobile_sales_detail table. Notice that these 18 rows generate 48 rows in the datacube.

 units    model    year    color -----    -----    ----    ----- 62       Chevy    1990    Blue 5        Chevy    1990    Red 87       Chevy    1990    White 154      Chevy    1990    ALL  49       Chevy    1991    Blue 54       Chevy    1991    Red 95       Chevy    1991    White 198      Chevy    1991    ALL  71       Chevy    1992    Blue 31       Chevy    1992    Red 54       Chevy    1992    White 156      Chevy    1992    ALL  508      Chevy    ALL     ALL  63       Ford     1990    Blue 64       Ford     1990    Red 62       Ford     1990    White 189      Ford     1990    ALL  55       Ford     1991    Blue 52       Ford     1991    Red 9        Ford     1991    White 116      Ford     1991    ALL  39       Ford     1992    Blue 27       Ford     1992    Red 62       Ford     1992    White 128      Ford     1992    ALL  433      Ford     ALL     ALL  941      ALL      ALL     ALL  125      ALL      1990    Blue 69       ALL      1990    Red 149      ALL      1990    White 343      ALL      1990    ALL  104      ALL      1991    Blue 106      ALL      1991    Red 104      ALL      1991    White 314      ALL      1991    ALL  110      ALL      1992    Blue 58       ALL      1992    Red 116      ALL      1992    White 284      ALL      1992    ALL  182      Chevy    ALL     Blue 157      Ford     ALL     Blue 339      ALL      ALL     Blue 90       Chevy    ALL     Red 143      Ford     ALL     Red 233      ALL      ALL     Red 236      Chevy    ALL     White 133      Ford     ALL     White 369      ALL      ALL     White

The results of the CUBE make it simple to answer just about any sales aggregation question that we can think of. If we want to get sales information for all 1992 automobiles, regardless of model or color, we can find the corresponding row and see that 284 cars were sold. To find all Chevy sales for all years and colors, it's equally easy to find 508. Such a result set might be used as is for a reference chart or in coordination with an OLAP tool.

The number of rows exploded by a CUBE operation can be surprisingly large. In the example above, 18 rows generated 48 rows of output. However, the result of a CUBE operation won't necessarily produce more rows of output than what appears in the underlying data. The number of rows generated will depend on the number of attributes being grouped and on the actual combinations in your data. Without detailed knowledge of your data, the number of rows produced by a CUBE operation can't be predicted. However, the upper bound can be easily predicted . The upper bound will be equal to the cross-product of the (number of distinct values + 1) value for each attribute. The addition of 1 is for the case of "ALL." The automobile_sales_detail example has three attributes: model, year, and color.

 Upper bound of number of rows = (Number of models + 1) * (Number of years + 1) *    (Number of colors + 1) = (2 + 1) * (3 + 1) * (3 + 1) = 48

This example should make it clear that the upper bound of the number of rows depends not on the number of data rows, but rather on the number of attributes being grouped and the number of distinct values for each attribute. For example, if the table had 18 million instead of 18 data rows but still had no more than two models, three years, and three colors (there's no stated UNIQUE or PRIMARY KEY constraint on the model, year, or color fields), the CUBE operation will again return only 48 rows.

The actual number of rows might be considerably less than the upper bound, however, because CUBE doesn't try to force a 0 or a NULL aggregate for combinations for which no values exist. In the carefully constructed example you just saw, there's complete coverage across every attribute. Each of the two models has an entry for each of the three years and for each of the three colors. But suppose that no sales data appears for the 1990 Chevy and that no red Fords exist for any year. The raw data would look like this:

 Rowid    Model    Year    Color    Units_Sold -----    -----    ----    -----    ---------- 1        Chevy    1991    Red      54 2        Chevy    1991    White    95 3        Chevy    1991    Blue     49 4        Chevy    1992    Red      31 5        Chevy    1992    White    54 6        Chevy    1992    Blue     71 7        Ford     1990    White    62 8        Ford     1990    Blue     63 9        Ford     1991    White    9 10       Ford     1991    Blue     55 11       Ford     1992    White    62 12       Ford     1992    Blue     39

Here's the new cube:

 Units    Model    Year    Color -----    -----    ----    ----- 49       Chevy    1991    Blue 54       Chevy    1991    Red 95       Chevy    1991    White 198      Chevy    1991    ALL  71       Chevy    1992    Blue 31       Chevy    1992    Red 54       Chevy    1992    White 156      Chevy    1992    ALL  354      Chevy    ALL     ALL  63       Ford     1990    Blue 62       Ford     1990    White 125      Ford     1990    ALL  55       Ford     1991    Blue 9        Ford     1991    White 64       Ford     1991    ALL  39       Ford     1992    Blue 62       Ford     1992    White 101      Ford     1992    ALL  290      Ford     ALL     ALL  644      ALL      ALL     ALL  63       ALL      1990    Blue 62       ALL      1990    White 125      ALL      1990    ALL  104      ALL      1991    Blue 54       ALL      1991    Red 104      ALL      1991    White 262      ALL      1991    ALL  110      ALL      1992    Blue 31       ALL      1992    Red 116      ALL      1992    White 257      ALL      1992    ALL  120      Chevy    ALL     Blue 157      Ford     ALL     Blue 277      ALL      ALL     Blue 85       Chevy    ALL     Red 85       ALL      ALL     Red 149      Chevy    ALL     White 133      Ford     ALL     White 282      ALL      ALL     White

We now have 12 data rows but still three attributes and the same number of distinct values for each attribute. But instead of 48 rows for the result of the CUBE, the result contains only 39 rows because some combinations of attributes have no data. For example, no row exists for all models of red cars for 1990 because no data met that specific criterion.

If you would benefit from always having the full cube populated when you use CUBE, you might want to add a placeholder row with dummy values for the columns being aggregated so that every combination is represented. Or you could run a query that produces the cross-product of all combinations not existing, and then UNION the result with the cube. (We'll see an example of this later when discussing UNION.) Be careful: as the number of attributes and the cardinality between the values increases , the cube can quickly get large. The full cube based on five attributes, each having 50 distinct values, with at least one matching entry for each combination, generates (50 + 1) ⁵ , or 345,025,251, rows in the cube. If coverage of all these combinations is nowhere near full, the cube might otherwise have generated "only" a few thousand rows instead of more than 345 million!

So far, we haven't seen the actual query used to produce the cube because you need to understand the basic concepts before yet again dealing with issues of NULL. A simple GROUP BY column WITH CUBE will not, by default, produce super-aggregate rows having a value that outputs as "ALL." The ALL value is really a nonvalue, like NULL (gasp!). In fact, the ALL column is represented by (gulp!) a special kind of NULL, referred to as a grouping NULL .

While implementing this new feature, the developers debated whether to use NULL or invent a new marker value similar in some ways to NULL but clearly still separate from data. The decision wasn't easy. A new data value would be more theoretically pure. The meaning of "ALL" is really different from "unknown" or "missing" that is, it's different from NULL. However, had they introduced another special-meaning marker, it would require the same level of complexity necessary to work with as NULL with special operators like IS ALL and the truth tables would become all the more difficult to work with in expressions such as =, <, IN, and so on. And they did have some precedents that they wanted to avoid.

In many ways, the longtime SQL Server feature COMPUTE BY is quite similar to CUBE. (Actually, it's more similar to the derivative of CUBE ROLLUP which we'll see shortly.) COMPUTE BY, although helpful, is a rarely used feature, largely because COMPUTE BY doesn't generate a standard row format for the result set but rather returns a special alternate result set . Applications must go to significant lengths to handle this special-purpose result, which is nonrelational in that the output itself isn't a table. Most applications didn't or couldn't go the distance; hence, most applications don't support COMPUTE BY. Because of this, the SQL Server development team chose to overload NULL, which applications must already deal with to some extent. (This presents a few wrinkles of its own, which we'll come back to.)

Following is the basic query and a discussion of how "ALL" is produced in the results:

 SELECT SUM(units_sold), model, year, color FROM automobile_sales_detail GROUP BY model, year, color WITH CUBE

Here's the output:

 model       year        color ----------- ----------- ----------- ----------- 62          Chevy       1990        Blue  5           Chevy       1990        Red  87          Chevy       1990        White  154         Chevy       1990        NULL 49          Chevy       1991        Blue  54          Chevy       1991        Red  95          Chevy       1991        White  198         Chevy       1991        NULL  71          Chevy       1992        Blue  31          Chevy       1992        Red  54          Chevy       1992        White  156         Chevy       1992        NULL  508         Chevy       NULL        NULL 63          Ford        1990        Blue  64          Ford        1990        Red  62          Ford        1990        White  189         Ford        1990        NULL  55          Ford        1991        Blue  52          Ford        1991        Red  9           Ford        1991        White  116         Ford        1991        NULL  39          Ford        1992        Blue  27          Ford        1992        Red  62          Ford        1992        White  128         Ford        1992        NULL 433         Ford        NULL        NULL 941         NULL        NULL        NULL 125         NULL        1990        Blue  69          NULL        1990        Red  149         NULL        1990        White  343         NULL        1990        NULL 104         NULL        1991        Blue  106         NULL        1991        Red  104         NULL        1991        White  314         NULL        1991        NULL 110         NULL        1992        Blue  58          NULL        1992        Red  116         NULL        1992        White  284         NULL        1992        NULL 182         Chevy       NULL        Blue  157         Ford        NULL        Blue  339         NULL        NULL        Blue 90          Chevy       NULL        Red  143         Ford        NULL        Red  233         NULL        NULL        Red  236         Chevy       NULL        White  133         Ford        NULL        White  369         NULL        NULL        White

Notice that rather than ALL, which we had constructed earlier to be a placeholder, the super-aggregate values are represented by NULL here formatted as NULL as in the SQL Server Query Analyzer, although this is application specific. The data in this case had no NULL values, so we could easily tell the difference between NULL meaning "unknown" and NULL meaning "ALL."

If your data uses some NULL values, however, you won't be so lucky. Suppose that we add one row of data that has NULL for both model and year ( assuming these columns allow NULL):

 INSERT automobile_sales_detail values (NULL, NULL, 'White', 10)

The last row in the CUBE above (having 369 as total sales for all white cars) would be seemingly indistinguishable from this actual data row. Fortunately, the GROUPING() function comes to the rescue to differentiate the two. It returns 1 (TRUE) if the element is an ALL value and 0 (FALSE) otherwise. Here is a modified query to show the CUBE with the inclusion of the row having NULL values and with the addition of the GROUPING() function to designate ALL values:

 SELECT 'Units Sold'=SUM(units_sold), model, 'ALL Models'=GROUPING(model), year, 'ALL Years'=GROUPING(year), color, 'ALL Colors'=GROUPING(color) FROM automobile_sales_detail GROUP BY model, year, color WITH CUBE

Here's the result set:

 ALL               ALL             ALL Units Sold   model   Models   year     Years   color   Colors ----------   -----   ------   ------   -----   -----   ------ 10           NULL    0        NULL     0       White   0 10           NULL    0        NULL     0       NULL    1 10           NULL    0        NULL     1       NULL    1 62           Chevy   0        1990     0       Blue    0 5            Chevy   0        1990     0       Red     0 87           Chevy   0        1990     0       White   0 154          Chevy   0        1990     0       NULL    1 49           Chevy   0        1991     0       Blue    0 54           Chevy   0        1991     0       Red     0 95           Chevy   0        1991     0       White   0 198          Chevy   0        1991     0       NULL    1 71           Chevy   0        1992     0       Blue    0 31           Chevy   0        1992     0       Red     0 54           Chevy   0        1992     0       White   0 156          Chevy   0        1992     0       NULL    1 508          Chevy   0        NULL     1       NULL    1 63           Ford    0        1990     0       Blue    0 64           Ford    0        1990     0       Red     0 62           Ford    0        1990     0       White   0 189          Ford    0        1990     0       NULL    1 55           Ford    0        1991     0       Blue    0 52           Ford    0        1991     0       Red     0 9            Ford    0        1991     0       White   0 116          Ford    0        1991     0       NULL    1 39           Ford    0        1992     0       Blue    0 27           Ford    0        1992     0       Red     0 62           Ford    0        1992     0       White   0 128          Ford    0        1992     0       NULL    1 433          Ford    0        NULL     1       NULL    1 951          NULL    1        NULL     1       NULL    1 10           NULL    1        NULL     0       White   0 10           NULL    1        NULL     0       NULL    1 125          NULL    1        1990     0       Blue    0 69           NULL    1        1990     0       Red     0 149          NULL    1        1990     0       White   0 343          NULL    1        1990     0       NULL    1 104          NULL    1        1991     0       Blue    0 106          NULL    1        1991     0       Red     0 104          NULL    1        1991     0       White   0 314          NULL    1        1991     0       NULL    1 110          NULL    1        1992     0       Blue    0 58           NULL    1        1992     0       Red     0 116          NULL    1        1992     0       White   0 284          NULL    1        1992     0       NULL    1 182          Chevy   0        NULL     1       Blue    0 157          Ford    0        NULL     1       Blue    0 339          NULL    1        NULL     1       Blue    0 90           Chevy   0        NULL     1       Red     0 143          Ford    0        NULL     1       Red     0 233          NULL    1        NULL     1       Red     0 10           NULL    0        NULL     1       White   0 236          Chevy   0        NULL     1       White   0 133          Ford    0        NULL     1       White   0 379          NULL    1        NULL     1       White   0

Note that the GROUPING() function takes a column name as an argument. A 0 returned for GROUPING( column_name ) means the value returned for that column is a real value. A 1 returned for GROUPING( column_name ) means that the value returned for that column is not real it was returned only because of the use of the CUBE or ROLLUP option.

The GROUPING() function enables you to differentiate between a GROUPING NULL value and a NULL value, but hardly makes reading and analyzing it as intuitive as using ALL. An alternative would be to write a view that outputs a grouping value as ALL. You might be more comfortable turning your programmers and power users loose on such a view rather than relying on them to understand GROUPING NULL vs. NULL value. This example uses the CASE expression, which is discussed in detail in Chapter 9:

 CREATE VIEW auto_cube (units, model, year, color) AS SELECT SUM(units_sold), CASE    WHEN (GROUPING(model)=1) THEN 'ALL'     ELSE ISNULL(model, '????')     END, CASE    WHEN (GROUPING(year)=1) THEN 'ALL'     ELSE ISNULL(CONVERT(char(6), year), '????')     END, CASE    WHEN (GROUPING(color)=1) THEN 'ALL'     ELSE ISNULL(color, '????')     END FROM automobile_sales_detail GROUP BY model, year, color WITH CUBE

Having constructed this view, it's simple for someone to work with, to understand the difference between grouping values (represented by ' ALL' ) and NULL data values (represented by ' ????' ), and to easily further refine the query if necessary to look for certain data. This example shows a simple query:

 SELECT * FROM auto_cube

And the results:

 units    model    year    color -----    -----    ----    ----- 10       ????     ????    White 10       ????     ????    ALL 10       ????     ALL     ALL 62       Chevy    1990    Blue 5        Chevy    1990    Red 87       Chevy    1990    White 154      Chevy    1990    ALL 49       Chevy    1991    Blue 54       Chevy    1991    Red 95       Chevy    1991    White 198      Chevy    1991    ALL 71       Chevy    1992    Blue 31       Chevy    1992    Red 54       Chevy    1992    White 156      Chevy    1992    ALL 508      Chevy    ALL     ALL 63       Ford     1990    Blue 64       Ford     1990    Red 62       Ford     1990    White 189      Ford     1990    ALL 55       Ford     1991    Blue 52       Ford     1991    Red 9        Ford     1991    White 116      Ford     1991    ALL 39       Ford     1992    Blue 27       Ford     1992    Red 62       Ford     1992    White 128      Ford     1992    ALL 433      Ford     ALL     ALL 951      ALL      ALL     ALL 10       ALL      ????    White 10       ALL      ????    ALL 125      ALL      1990    Blue 69       ALL      1990    Red 149      ALL      1990    White 343      ALL      1990    ALL 104      ALL      1991    Blue 106      ALL      1991    Red 104      ALL      1991    White 314      ALL      1991    ALL 110      ALL      1992    Blue 58       ALL      1992    Red 116      ALL      1992    White 284      ALL      1992    ALL 182      Chevy    ALL     Blue 157      Ford     ALL     Blue 339      ALL      ALL     Blue 90       Chevy    ALL     Red 143      Ford     ALL     Red 233      ALL      ALL     Red 10       ????     ALL     White 236      Chevy    ALL     White 133      Ford     ALL     White 379      ALL      ALL     White

Working with this view is easy. Grouping values appear as "ALL." Actual NULL values are represented by question marks, appropriate because they're unknown. We can easily further select from the view to drill into any dimension of the cube that we like. For example, to find all Chevy sales, regardless of the year and color:

 SELECT * FROM auto_cube  WHERE model='Chevy' AND year='ALL' AND color='ALL'

Here are the results:

 units    model    year    color -----    -----    ----    ----- 508      Chevy    ALL     ALL

In this view, we elected to output everything as character data; thus, "ALL" was a reasonable choice. If keeping the data numeric had been important, we would've chosen some special value, such as -999999.

ROLLUP

If you're looking for a hierarchy or a drill-down report (what you might know as a control-break report if you've ever programmed in COBOL), CUBE can be overkill. It generates many more result rows than you want, and it obscures the requested information with more data. (Don't be confused by thinking that data and information are equivalent.) Using CUBE, you'll get all permutations , including super-aggregates, for all attributes for which corresponding data exists. SQL Server provides the ROLLUP operator to extract statistics and summary information from result sets. ROLLUP returns only the values for a hierarchy of the attributes that you specify.

This behavior is best explained with an example. If we change the CUBE query to use ROLLUP instead, the results are more compact and easier to interpret when we drill into progressive levels of details for sales by model. However, with ROLLUP, we don't get one-stop shopping to answer a question like "How many white cars of any model were sold?"

 SELECT 'units sold'=SUM(units_sold), 'model'=CASE WHEN (GROUPING(model)=1) THEN 'ALL'     ELSE ISNULL(model, '????')     END, 'year'=CASE WHEN (GROUPING(year)=1) THEN 'ALL'     ELSE ISNULL(CONVERT(char(6), year), '????')     END, 'color'=CASE WHEN (GROUPING(color)=1) THEN 'ALL'     ELSE ISNULL(color, '????')     END FROM automobile_sales_detail GROUP BY model, year, color WITH ROLLUP

The results:

 units sold    model    year    color ----------    -----    ----    ----- 10            ????     ????    White 10            ????     ????    ALL 10            ????     ALL     ALL 62            Chevy    1990    Blue 5             Chevy    1990    Red 87            Chevy    1990    White 154           Chevy    1990    ALL 49            Chevy    1991    Blue 54            Chevy    1991    Red 95            Chevy    1991    White 198           Chevy    1991    ALL 71            Chevy    1992    Blue 31            Chevy    1992    Red 54            Chevy    1992    White 156           Chevy    1992    ALL 508           Chevy    ALL     ALL 63            Ford     1990    Blue 64            Ford     1990    Red 62            Ford     1990    White 189           Ford     1990    ALL 55            Ford     1991    Blue 52            Ford     1991    Red 9             Ford     1991    White 116           Ford     1991    ALL 39            Ford     1992    Blue 27            Ford     1992    Red 62            Ford     1992    White 128           Ford     1992    ALL 433           Ford     ALL     ALL 951           ALL      ALL     ALL

ROLLUP is similar to COMPUTE BY, which has nice functionality. However, COMPUTE BY is not as relational as ROLLUP it doesn't simply produce results as an ordinary table of values. COMPUTE BY always requires special application programming to deal with its alternate result sets, which are essentially equivalent to super-aggregates. Recall that CUBE and ROLLUP simply use ordinary result sets and that the super-aggregates are represented as a grouping NULL. Generally speaking, CUBE and ROLLUP are more preferable than COMPUTE BY because they're easier to fit into applications; you can use them with views, CASE, subselects, and column aliases; and they're internally better optimized. But for completeness, here's an example using COMPUTE BY as functionally equivalent to ROLLUP:

 SELECT units_sold, model, year, color FROM automobile_sales_detail ORDER BY model, year, color COMPUTE SUM(units_sold) BY model, year COMPUTE SUM(units_sold) BY model COMPUTE SUM(units_sold)

And the results:

 units_sold    model    year    color ----------    -----    ----    ----- 10            NULL     NULL    White              Sum             ===========             10                  Sum             ===========             10                               units_sold    model    year    color ----------    -----    ----    ----- 62            Chevy    1990    Blue  5             Chevy    1990    Red  87            Chevy    1990    White              Sum             ===========             154                              units_sold    model    year    color ----------    -----    ----    ----- 49            Chevy    1991    Blue  54            Chevy    1991    Red  95            Chevy    1991    White              Sum             ===========             198                              units_sold    model    year    color ----------    -----    ----    ----- 71            Chevy    1992    Blue  31            Chevy    1992    Red  54            Chevy    1992    White              Sum             ===========             156                 Sum             ===========             508                              units_sold    model    year    color ----------    -----    ----    ----- 63            Ford     1990    Blue  64            Ford     1990    Red  62            Ford     1990    White              Sum             ===========             189                              units_sold    model    year    color ----------    -----    ----    ----- 55            Ford     1991    Blue  52            Ford     1991    Red  9             Ford     1991    White              Sum             ===========             116                              units_sold    model    year    color ----------    -----    ----    ----- 39            Ford     1992    Blue  27            Ford     1992    Red  62            Ford     1992    White              Sum             ===========             128                 Sum             ===========             433                 Sum             ===========             951

Statistical Aggregates

SQL Server 7 introduces four new aggregate functions, in addition to the standard six described earlier. These functions are used for statistical applications and are considered aggregate functions because they operate on a set of values instead of on one single value. Table 7-8 below describes the four statistical aggregate functions.

You could find the standard deviation and variance of the prices in the titles table with this query:

 SELECT stdev(price), var(price) FROM titles

Alternatively, you could calculate the standard deviation and variance of the prices for each different type of book:

 SELECT type, stdev(price), var(price) FROM titles GROUP BY type

Table 7-8. Statistical aggregate functions.

Aggregate Function	Description
STDEV( expression )	Returns the statistical standard deviation of all values in the given expression.
STDEVP( expression )	Returns the statistical standard deviation for the population for all values in the given expression.
VAR( expression )	Returns the statistical variance of all values in the given expression.
VARP( expression )	Returns the statistical variance for the population for all values in the given expression.

TOP

We've already seen that the WHERE clause allows us to limit the number of rows that a SELECT statement will return, but using a WHERE clause assumes we have knowledge of the actual data values present. What if we wanted to see only a screenful of rows from a table but had no idea which range of values were present in the table? Prior to SQL Server version 7, we could use the option SET ROWCOUNT n to limit SQL Server to sending only n rows to the client. Every SELECT statement would just stop sending rows back to the client after n rows had been sent.

For example, the following will return only 10 rows from the authors table:

 SET ROWCOUNT 10 GO SELECT * FROM authors GO

This is useful in some situations, but it's quite limited in functionality. The TOP keyword used in the SELECT list allows much greater control over the quantity of data we'd like to see. Here's the syntax for using TOP:

 SELECT [ TOP  n  [PERCENT] [ WITH TIES] ]  select_list  FROM  table_list  WHERE  conditions  ...  rest of query

When the query includes an ORDER BY clause, TOP causes the first n rows ordered by the ORDER BY clause to be output. When the query has no ORDER BY clause, we can't predict which rows will be returned. Compare the output of the following two queries:

QUERY 1

 SELECT TOP 5 title_id, price, type  FROM titles title_id price                 type          -------- --------------------- ------------  BU1032   19.9900               business     BU1111   11.9500               business     BU2075   2.9900                business     BU7832   19.9900               business     MC2222   19.9900               mod_cook

QUERY 2

 SELECT TOP 5 title_id, price, type  FROM titles ORDER BY price DESC title_id price                 type          -------- --------------------- ------------  PC1035   22.9500               popular_comp PS1372   21.5900               psychology   TC3218   20.9500               trad_cook    PC8888   20.0000               popular_comp BU1032   19.9900               business

Notice that simply including TOP 5 in the SELECT list is equivalent to setting ROWCOUNT to 5. SQL Server just stops returning rows after the first five. In the example above, however, you should note that other books exist with a price of $19.99, but because we wanted only five, we didn't see them. Does this output really answer the question, "What are the five highest priced books?" If you answered no and really want to see all the books with the same price as the ones listed, you can use the WITH TIES option. This option is only allowable if your SELECT statement includes an ORDER BY:

 SELECT TOP 5 WITH TIES title_id, price, type  FROM titles ORDER BY price DESC

This returns the following eight rows:

 title_id price                 type          -------- --------------------- ------------  PC1035   22.9500               popular_comp PS1372   21.5900               psychology   TC3218   20.9500               trad_cook    PC8888   20.0000               popular_comp BU1032   19.9900               business     BU7832   19.9900               business     MC2222   19.9900               mod_cook     PS3333   19.9900               psychology

If we want to see a certain fraction of the rows, we can use TOP with PERCENT, which will round up to the nearest integer number of rows:

 SELECT TOP 30 Percent title_id, price, type  FROM titles ORDER BY price DESC

You can think of SQL Server carrying out this operation by first counting out the number of rows specified by TOP in this case, five. Only after the first five are returned will SQL Server check to see if any other rows have a value equivalent to the last one returned.

This won't necessarily give you books with the five highest prices. If we changed TOP 5 to TOP 8, you'd get the same eight rows back as above, which represent only five different prices.

If we really wanted to see the eight highest prices, we could use DISTINCT with TOP. We couldn't select title_id at the same time, because DISTINCT applies to the whole result row, and every title_id and price combination is distinct in itself. However, we could use a subquery to determine which rows have one of the eight highest prices:

 SELECT title_id, price, type  FROM titles WHERE price in (SELECT DISTINCT TOP 8 price     FROM titles     ORDER BY price DESC)

In this case, the rows wouldn't come back in descending order of price, because the order is controlled by the outer query. You'd have to add another ORDER BY at the end.

The titles table contains 18 rows, and 30 percent of 18 is 5.6. This is rounded up to return these six rows:

 title_id price                 type          -------- --------------------- ------------  PC1035   22.9500               popular_comp PS1372   21.5900               psychology   TC3218   20.9500               trad_cook    PC8888   20.0000               popular_comp BU1032   19.9900               business     BU7832   19.9900               business

We can also use the WITH TIES option with PERCENT. If used in the above query, we'd then get the additional two rows that also have a price of $19.99.

UNION

UNION is a conceptually easy topic. Given the same number of columns and compatible datatypes, UNION combines two or more result sets into a single result set. By default, duplicate rows are eliminated, although you can include them by specifying UNION ALL. The datatypes of the result sets to be combined with UNION don't need to be identical, but they must be able to be implicitly converted. If this isn't the case, you can explicitly convert them to identical or compatible types using the handy CONVERT() function.

The following example shows UNION at its most basic level. It also demonstrates that a SELECT statement in SQL Server doesn't need to be from a table or a view but can be a constant or a variable:

 SELECT col1=2, col2=1 UNION  SELECT xxx=1, yyy=2 UNION  SELECT '3','0'  UNION  SELECT TAN(0), 3*1000/10/100 -- Tangent of 0 is 0.                              -- So this row produces 0.0, 3 UNION SELECT 1, 2

The output follows:

 col1    col2 ----    ---- 0.0     3 3.0     0 1.0     2 2.0     1

Earlier versions of SQL Server always returned the result of a UNION in sorted order, unless you specified UNION ALL. This occurred because UNION eliminated duplicate rows using a sorting strategy. The ordering was simply a by-product of the sorting to eliminate duplicates. However, SQL Server 7 has several alternative strategies available for removing duplicates, so there's no guarantee of any particular order when using UNION. If order is important to you, you should explicitly use ORDER BY at the end of the last SELECT statement.

In the example above, the datatypes are compatible but not identical. Because TAN(0) returns a numeric datatype, col1 is automatically cast to numeric (since using an integer would lose precision), and the other values are implicitly converted. In addition, the one duplicate row (1.0, 2) is eliminated, and the columns take their monikers from the first result set.

If you used UNION ALL instead of UNION, the duplicate row wouldn't be eliminated and no extra work would be needed to determine which rows were distinct. If there's no need to eliminate duplicates, or if you know that there won't be any duplicates, you can achieve a noticeable performance improvement by using UNION ALL. The following query illustrates the use of UNION ALL and uses ORDER BY to present the results in a particular order. It also converts the result of the tangent operation to a tinyint so that col1 is then implicitly converted to an int .

 SELECT col1=2, col2=1 UNION ALL SELECT xxx=1, yyy=2 UNION ALL SELECT '3', '0' UNION ALL SELECT CONVERT(smallint, TAN(0)), 3*1000/10/100  -- Tangent of 0 is 0. So this row produces 0.0, 3 UNION ALL SELECT 1, 2 ORDER BY col2

Here's the output:

 col1    col2 ----    ---- 3       0 2       1 1       2 1       2 0       3

Earlier, we saw that the CUBE operation wouldn't generate missing rows with the value 0. However, you might occasionally want the full cube, with 0 as the SUM() value for missing combinations. We promised you an example of how this could be accomplished using UNION, so here's the example. And here's the key the table can be cross-joined to itself to produce a 0 row for every combination possible given the non-NULL data. This query will use the existing data to produce a list of every possible combination of model, color, and year:

 SELECT DISTINCT units_sold=0, A.model, B.color, C.year FROM automobile_sales_detail A CROSS JOIN automobile_sales_detail B CROSS JOIN automobile_sales_detail C

Here's the output:

 units_sold    model    color    year ----------    -----    -----    ---- 0             Chevy    Blue     1990 0             Chevy    Blue     1991 0             Chevy    Blue     1992 0             Chevy    Red      1990 0             Chevy    Red      1991 0             Chevy    Red      1992 0             Chevy    White    1990 0             Chevy    White    1991 0             Chevy    White    1992 0             Ford     Blue     1990 0             Ford     Blue     1991 0             Ford     Blue     1992 0             Ford     Red      1990 0             Ford     Red      1991 0             Ford     Red      1992 0             Ford     White    1990 0             Ford     White    1991 0             Ford     White    1992

The real underlying data is the set of 12 rows shown earlier and shown again here. Recall that six combinations have no value that is, there are no Chevys for 1990 and no red Fords:

 model    year    color    units_sold -----    ----    -----    ---------- Chevy    1991    Red      54 Chevy    1991    White    95 Chevy    1991    Blue     49 Chevy    1992    Red      31 Chevy    1992    White    54 Chevy    1992    Blue     71 Ford     1990    White    62 Ford     1990    Blue     63 Ford     1991    White    9 Ford     1991    Blue     55 Ford     1992    White    62 Ford     1992    Blue     39

Having generated all the dummy rows as such, we can then easily UNION the cross-joined dummy results with the output of the CUBE. But that still wouldn't do because we'd get both a real row and a dummy row whenever a real combination exists. SQL Server gives you several good ways to solve this dilemma. You can use NOT EXISTS to produce the dummy rows only for combinations that don't exist. Another approach, and perhaps the most intuitive one, is to make a view of the cross-joined dummy rows and the actual data, and then perform the CUBE on the view. The 0 values, of course, don't affect the SUM, so this works nicely :

 CREATE VIEW fullcube      (     units_sold,     model,     year,     color     )     AS     (     SELECT D.units_sold, D.model, D.year, D.color     FROM automobile_sales_detail D     UNION ALL     SELECT DISTINCT 0,     A.model, C.year, B.color     FROM      automobile_sales_detail A     CROSS JOIN     automobile_sales_detail B     CROSS JOIN     automobile_sales_detail C      )

Having constructed the view, we can then issue our CUBE query against it, just as we did previously against the base table:

 SELECT units_sold=SUM(units_sold), model, year, color FROM fullcube GROUP BY model, year, color WITH CUBE

Here's the output:

 units_sold    model    year    color ----------    -----    ----    ----- 0             Chevy    1990    Blue 0             Chevy    1990    Red 0             Chevy    1990    White 0             Chevy    1990    NULL 49            Chevy    1991    Blue 54            Chevy    1991    Red 95            Chevy    1991    White 198           Chevy    1991    NULL 71            Chevy    1992    Blue 31            Chevy    1992    Red 54            Chevy    1992    White 156           Chevy    1992    NULL 354           Chevy    NULL    NULL 63            Ford     1990    Blue 0             Ford     1990    Red 62            Ford     1990    White 125           Ford     1990    NULL 55            Ford     1991    Blue 0             Ford     1991    Red 9             Ford     1991    White 64            Ford     1991    NULL 39            Ford     1992    Blue 0             Ford     1992    Red 62            Ford     1992    White 101           Ford     1992    NULL   290           Ford     NULL    NULL   644           NULL     NULL    NULL   63            NULL     1990    Blue 0             NULL     1990    Red 62            NULL     1990    White 125           NULL     1990    NULL   104           NULL     1991    Blue 54            NULL     1991    Red 104           NULL     1991    White 262           NULL     1991    NULL   110           NULL     1992    Blue 31            NULL     1992    Red 116           NULL     1992    White 257           NULL     1992    NULL   120           Chevy    NULL    Blue 157           Ford     NULL    Blue 277           NULL     NULL    Blue 85            Chevy    NULL    Red 0             Ford     NULL    Red 85            NULL     NULL    Red 149           Chevy    NULL    White 133           Ford     NULL    White 282           NULL     NULL    White

Then we can define yet another view on top to generate the "ALL" and "????" placeholders. No NULL data exists in this example, so we'll look at only the "ALL" case:

 CREATE VIEW auto_cube (units, model, year, color) AS SELECT SUM(units_sold), ISNULL(model, 'ALL'), ISNULL(CONVERT(char(4), year), 'ALL'), ISNULL(color, 'ALL') FROM fullcube GROUP BY model, year, color WITH CUBE

One advantage of filling out the cube with placeholder rows is that no matter what combination we formulate , we'll get an answer. There are no unknowns and no need to include the GROUPING() function. Had we not constructed the cube, this query would return no rows found instead of a row with 0 units:

 SELECT * FROM auto_cube WHERE model='Chevy' AND color='ALL' AND year='1990' units    model    year        color -----    -----    ----        ----- 0        Chevy    1990        ALL

Just for fun, here's the equivalent of the fullcube view as a derived table:

 SELECT units_sold=SUM(T1.units_sold), T1.model, T1.year, T1.color FROM (      SELECT D.units_sold, D.model, D.year, D.color      FROM automobile_sales_detail D      UNION ALL      SELECT DISTINCT 0,      A.model, C.year, B.color      FROM       automobile_sales_detail A      CROSS JOIN      automobile_sales_detail B      CROSS JOIN      automobile_sales_detail C       )         AS T1 GROUP BY T1.model, T1.year, T1.color WITH CUBE GO

Now that you've seen these somewhat exotic uses of CROSS JOIN and views to fill out the cube, be aware that if you had wanted the whole cube to be generated, you might have inserted some placeholder rows with a value of 0 in the base table itself (assuming you had access to do so). That way, every combination is accounted for in the base table itself.