Selecting Information from the Data

After data is in the database, it is inevitable that the data will need to be accessed, changed, and reported on. To perform these basic operations, you need to be able to apply the programming constructs of Structured Query Language (SQL)specifically, Microsoft's implementation referred to as Transact-SQL (T-SQL). The most common of these statements performs the basis for getting data out of the system: the SELECT statement.

SELECT statements can be made very complex with the use of options that can join many tables together and with functions that can calculate and summarize data at the same time. SELECT statements also can often be as simple as one line of code that retrieves the requested data. The complete SELECT syntax is very involved with many optional portions. The complete syntax reference can be found in SQL Server Books Online (BOL) under "SELECT, SELECT (described)." Many of the options are used only under special circumstances.

To start out simple, you will often be retrieving all the data from a particular table. Even if the final query is not intended to get all the data, you can often begin the data analysis by examining all the rows and columns of data in a particular table. The following example retrieves the employee data from the Northwind database:

 SELECT * FROM MYSERVER.Northwind.dbo.Employees

Note the asterisk used to obtain all columns from the Employees table. It is also worth noting the use of the four-part name (Server.Owner. Database.Object). This name includes the server name, MYSERVER, database name, Northwind, the owner name, dbo, and the name of the table itself, Employees.

Four-part names are used to perform queries when a one-part name of the table itself does not sufficiently qualify the table being queried. If the query is being executed within the scope of the server itself with the Northwind database in use and you are the owner or the owner is dbo, then the four-part name is not necessary. There are therefore several valid variations on queries for the Employees table. Each of the following will produce the same results:

 SELECT * FROM Employees SELECT * FROM dbo.Employees SELECT * FROM Northwind.dbo.Employees SELECT * FROM Northwind..Employees

Although often a query will go after all the data in a table, there are a considerable number of options available for the query statement. You can choose some of the columns of the table, provide record-level conditions to limit the number of rows returned, put the output into groups, provide group-level conditions to selectively choose the groups, put the output into a sorted order, and produce calculated results. You can also get into some very complex queries through the use of JOIN, UNION, and subquery operations.

The clauses of a SELECT query must be provided in the correct order to have valid syntax. As a mechanism for remembering the order, you can use the following acronym and phrase: SIFWGHOCSome Infinitely Funny Winos Get High On Champagne. SELECT, INTO, FROM, WHERE, GROUP BY, HAVING, ORDER BY, COMPUTE (BY).

The exam objectives require you to know how to access data in different ways, providing for different looks into the data with user-friendly views and other reporting methods from the data source. Most cases don't want the extraction of all the data from the tables. Usually queries want to limit the number of columns and rows displayed from the database. This will provide the functionality to meet most scenarios while providing for improved performance and response times to the user. You need to limit and specifically apply conditions over the data that is transmitted to the user interface.

Narrowing the Scope of Data Retrieval

The first portion of the SELECT statement identifies which columns will come from a table. When specifying column names, be sure to use a comma-delimited list and don't place a comma after the final column name. When specifying column names, you can use an asterisk (*), designating that all columns are to be returned from the specified table or tables. The asterisk can be provided by itself as seen previously or supplied in addition to columns selected. The following query selects a few of the columns from the Employees table:

 SELECT EmployeeID, FirstName, LastName, Title FROM Employees

You can optionally supply column headers to give a user-friendly listing of the data. By default, the column headers that are displayed in the resultset are the same as the columns specified in the column select list, such as FirstName and LastName. But why not change this column header to something more readable? You can change the name of a resultset column by specifying the keyword AS, this being the traditional SQL-92 ANSI standard. Changing the column name with an equals sign (=) or implied assignment is also an alternative syntax choice. The following example illustrates the use of column aliasing (of course, you would normally use only one of the three techniques and the industry standard is SQL-92 ANSI):

 SELECT EmployeeID AS 'Employee ID',      'First Name' = FirstName,      LastName 'Last Name',     Title     FROM Employees SELECT EmployeeID AS 'Employee ID',      'First Name' AS FirstName,      LastName AS 'Last Name',      Title     FROM Employees

Notice that the previous column aliases have been enclosed within single quotation marks. This enclosure needs to be made when the column alias includes spaces. The alias name needs to be enclosed within brackets when the alias is a reserved SQL Server keyword.

Sometimes you need to show two columns as one by combining two columns together. When you do this, you are using a method called string concatenation. Concatenation can be thought of as joining strings together just as you can combine words into phrases. The operator used to perform the concatenation is the plus (+) sign. You can create a singular name column by combining the last name and first name values:

 SELECT TitleOfCourtesy + ' ' + FirstName + ' ' + LastName      AS 'Employee Name',                 Title      FROM Employees

As you can see, multiple strings can be concatenated together, creating more meaningful content. When SELECT features are used in this manner, information is beginning to be pulled out of the raw data. In fact, the SELECT statement with the addition of the INTO clause can create a table drawn from the original table. In this manner the newly created table can contain a subset of columns and/or rows.

The SELECT INTO statement can perform a data insertion and create the table for the data in a single operation. The new table is populated with the data provided by a FROM clause. The SELECT INTO statement creates a new table with a structure identical to that of the columns provided in the query. It then copies all data that meets the WHERE condition into this newly created table. It is possible to combine data from several tables or views into one table, and again a variety of sources can be used. The following example creates a new table within the database that would contain only two columns:

 SELECT TitleOfCourtesy + ' ' + FirstName + ' ' + LastName      AS 'Employee Name',                 Title      INTO HRTable      FROM Employees

The INTO clause creates a table, so it is important that the table does not exist when you're using the command. If the desire is to add data to an existing table, an INSERT INTO operation must be performed. We will look at the INSERT statement later in the chapter. An INTO operation is often used to denormalize data. By definition, data will be duplicated into another location. The second table is often used in circumstances in which it is easier and/or more efficient to work with a subset of the data. This duplication is purposely performed but some queries can result in unwanted duplicates.

Use of the DISTINCT clause eliminates duplicate rows from any resultset. A SELECT query may return records with equal values, but using DISTINCT eliminates duplicates and leaves only singular values in the resultset. You might use the DISTINCT keyword when you need to know only whether a value exists, rather than how many records of a value exist. This DISTINCT option, when specified, selects only the values that are unique throughout a row. See Figure 5.2 for a contrast between listing all data values and listing only DISTINCT data values.

Figure 5.2. Listing DISTINCT cities.

The DISTINCT keyword is optional when calculations are performed using SUM, AVG, and COUNT. When DISTINCT is used, duplicate values are eliminated before the calculation is performed. Limiting the number of rows returned from queries is common and serves a number of purposes. This narrowing of data is referred to as horizontal filtering.

The WHERE clause is used to limit the number of rows in the resultset based on defined restrictions. These restrictions are specified as conditional arguments, such as Salary>10000, LastName LIKE 'G%', or State = 'FL'. SQL Server will always execute the WHERE action first to discover which rows should be looked at before other operations needed by a given SELECT clause. This acts as a layer of filtration in a basic SELECT query.

Similar to the WHERE clause is the HAVING clause. In a SELECT statement these clauses control the rows from the source tables that are used to build the resultset. WHERE and HAVING are filters. They specify a series of search conditions, and only those rows that meet the terms of the search conditions are used to build the resultset. To address how these clauses are used, we must understand the conditions that can be applied within these clauses.

Conditional Filtering of Data

Filtering data is used for determining the data to be selected based on conditional requirements. Essentially, all conditions come down to one of three possible outcomes. If two values are compared, the result is positive, negative, or equal (greater than, less than, or equal to).

Operators play an important part in determining the content of any conditional operations. An operator is a symbol specifying an action performed on one or more expressions. In SQL Server these operators are divided into a few elementary categories, as presented in the next few paragraphs. Each operator category represents a piece of functionality that provides many operators. In some instances there are several dozen operator choices.

Comparison Operators

Comparison operators can be used with character, numeric, or date data expressions. Their purpose is to determine a Boolean result based on the comparison of the expressions. Table 5.1 lists all comparison operators that are valid in SQL Server.

Table 5.1. Comparison Operators
Operator	Meaning
<	Less than
>	Greater than
=	Equal to
<=	Less than or equal to
>=	Greater than or equal to
!=	Not equal to
<>	Not equal to
!<	Not less than
!>	Not greater than

Whereas comparison operators evaluate the differences between two or more values, arithmetic operators aid in processing mathematical functions against values.

Arithmetic Operators

Arithmetic operators perform mathematical operations on two expressions. A complete listing of the available arithmetic operators and their uses is shown in Table 5.2.

Table 5.2. Arithmetic Operators
Operator	Meaning
+ (Add)	Addition
- (Subtract)	Subtraction
* (Multiply)	Multiplication
/ (Divide)	Division
% (Modulo)	Returns the integer remainder of a division

Although these operators usually perform actions against numeric data, the plus (+) and minus (-) operators can also be used to perform arithmetic operations on date values.

Logical Operators

Logical operators are also known as Boolean operators. The three logical operators are AND, OR, and NOT. Their meanings are pretty straightforward: AND adds an additional filter condition to the one specified and returns trUE only when both or all conditions specified are met. The OR logical operator adds another filter condition to the existing condition as well, but it returns trUE when either condition is met. NOT tells SQL to get everything in the query except for what it has specified. An example of the three operators in use is provided in Figure 5.3.

Figure 5.3. Compound conditions using the logical operators AND, OR, NOT.

Brackets can significantly alter the results of operations that use logical operators. Test all conditions and alter the precedence by using brackets. According to the order of operations, bracketed comparisons are performed first. The logical operators are evaluated in the order: () first, then NOT, then AND, and finally OR.

Bitwise and Unary Operators

Bitwise operators are used on int, smallint, or tinyint data. The ~ (bitwise NOT) operator can also use bit data. All bitwise operators perform an operation on one or more specified integer values as translated to binary expressions within T-SQL statements. The bitwise NOT operator changes binary 1s to 0s and 0s to 1s. Bitwise operators perform bit manipulations between two expressions of any of the data types of the integer data type category. The set of operators is shown in Table 5.3.

Table 5.3. Bitwise Operators
Operator	Meaning
&	Bitwise AND
\|	Bitwise OR
^	Bitwise exclusive OR
~	Bitwise NOT

The operands for bitwise operators can be any of the data types of the integer or binary string data type categories, with the exception that both operands cannot be any of the data types of the binary string data type category.

Unary operators (see Table 5.4) perform an operation on only one expression of any of the data types of the numeric data type category. The + (positive) and - (negative) operators can be used on any expression of any of the data types of the numeric data type category. The ~ (bitwise NOT) operator can be used only on expressions of any of the data types of the integer data type category.

Table 5.4. Unary Operators
Operator	Meaning
+	Value is positive
-	Value is negative
~	Returns the ones complement of the number

BETWEEN, IN, and EXISTS

Ranges can also be specified when using the WHERE clause with the help of the BETWEEN keyword. Simply put, BETWEEN provides a range of values within which the data should lie; otherwise, the data does not meet the condition. BETWEEN is inclusive, meaning that the range includes the lower value specified and the upper value specified. The following query therefore would also have the value 20 as a possibility in the results:

 SELECT * FROM Products WHERE UnitPrice BETWEEN 10 AND 20

If the intent was to exclude the value 20, the query would be written like this:

 SELECT * FROM Products WHERE UnitPrice BETWEEN 10.00 AND 19.99

You can also incorporate something known as a list when using the WHERE clause. Essentially, a list specifies the exact values a column may or may not take. If the record does not contain the value for the column specified in the IN list, it is not selected. IN determines whether a given value matches a set of values listed. For example:

 SELECT * FROM Customers WHERE Country IN ('UK', 'USA')

This example limits the values of Country to only UK and USA. Customers who live in the countries mentioned in the IN list will be the only ones listed.

The use of IN is similar to that of EXISTS. When a subquery is introduced with the keyword EXISTS, it functions as an existence test. The WHERE clause of the outer query tests for the existence of rows returned by the subquery. The subquery does not actually produce any data; it returns a value of trUE or FALSE. The following is an example of the use of EXISTS to find only those customers in the United States:

 SELECT DISTINCT Pub_Name FROM Publishers    WHERE EXISTS (SELECT * FROM titles                                    WHERE Pub_id = Publishers.Pub_id                                    AND Country = 'USA')

Existence tests can be useful in solving some problems, but because of the use of subqueries, the query itself is very inefficient and can usually be accomplished more efficiently by using a JOIN operation.

Similar Conditioning Using LIKE

You can retrieve rows that are based on portions of character strings by using the LIKE predicate. The LIKE predicate determines whether a given character string matches a specified pattern. The types of data a LIKE statement can work with are char, varchar, nvarchar, nchar, datetime, smalldatetime, and text. A pattern specified in the LIKE predicate can include regular characters and wildcard characters. During pattern matching, regular characters must exactly match the characters specified in the column value. Wildcard characters can be matched with any character or set of characters according to the wildcard character used, as shown in Table 5.5.

Table 5.5. The Wildcard Characters Allowed in T-SQL
Character	Meaning
[ ]	Any single character within the specified range ([f-j]) or set ([fghij])
_ (underscore)	Any single character
%	Any number of zero or more characters
[ ^ ]	Any single character not in the specified range or set

If your application repeatedly calls the LIKE predicate and performs numerous wildcard searches, you should consider using the MS Search Service if it is installed and in use on the server. Consider the value of the response time over the storage resources that the MS Search Service and full-text search capabilities require. MS Search Service is required to use full-text search. Full-text search enables a variety of powerful wildcard searches. You should avoid LIKE searches that have a % wildcard at both the beginning and the end. The following example of how to use the LIKE clause uses the % wildcard to select all customers whose CustomerID begins with the letter A:

 SELECT CustomerID, ContactName FROM Customers  WHERE CustomerID LIKE 'A%'

You can also use the NOT keyword with the LIKE predicate, which simply retrieves a query that does not contain records matching the specified elements in the LIKE clause. With character matching it is sometimes more efficient to exclude characters with the use of NOT. It is common to use a negative test in particular when looking for values that represent true data, as in NOT NULL.

Selecting Rows Based on NULL Values

A NULL value is a value given to a field that that has no value. Many people confuse NULL values with zero-length strings or the value zero, but such is not the case. NULL is just a fancy word for a value that is unknown. In SQL Server, you can select the desired NULL values or reject them using IS NULL or IS NOT NULL.

Many queries involving NULL tests depend on the ANSI connection settings for the session. When you SET ANSI NULLS ON, a comparison in which one or more of the expressions is NULL returns UNKNOWN as the result of the comparison. A value that is not known cannot be compared against another value. Use the IS NULL or IS NOT NULL instead to test for NULL values. T-SQL does support an extension of the traditional behavior that allows for comparison operators to return TRUE or FALSE as a comparison result. If you SET ANSI NULLS OFF, comparisons to NULL return TRUE when the value being compared against is also a NULL. The comparison returns FALSE for any other value. Regardless of the ANSI NULLS setting, NULL values are always considered equal for the purposes of the ORDER BY, GROUP BY, and DISTINCT keywords.

Putting It All in ORDER

Putting data in order provides for a more meaningful display of data and enables the data to be presented in a manner that meets additional reporting requirements set by most front-end applications. Data can be ordered in either ascending or descending sequence (ASC/DESC). ASC is the default and can be optionally provided with the command. DESC must be provided if a descending sequence is desired. When you are ordering rows that contain NULL values, the NULL-valued records are displayed first, provided that the default sort order is used (ASC).

ORDER BY determines the sequence of data based on column(s) selected and sequencing requested: ascending (ASC) or descending (DESC). Descending orders rows from highest to lowest; ascending orders rows from lowest to highest. Ordering can even be performed based on columns not included in the SELECT list. There is no limit to the number of items in the ORDER BY clause; however, there is a limit of 8,060 bytes.

After data selection has been determined, the actual data sent is ready for an ordering process if one has been defined. The ordering of data is optional, and if it is not present in a request, the data is sent in an order determined at the data source. Ordering data is a necessity when using the TOP condition.

Returning TOP rows

The TOP clause limits the number of rows returned in the resultset to a specified number or percentage at the top of a sorted range.

Select top 50: Returns the top 50 rows

Select top 50 percent: Returns the top 50% of the rows

There is an alternative to TOP. You can also limit the number of rows to return using SET ROWCOUNT N. The difference between these two is that the TOP keyword applies to the single SELECT statement in which it is specified. SET ROWCOUNT stays in effect until another SET ROWCOUNT statement is executed, such as SET ROWCOUNT 0 to turn the option off.

You can optionally specify that the TOP keyword is to use the WITH TIES option, in which case any number of records can possibly be displayed. WITH TIES displays all records that are equivalent to the last matching element. If you are looking for the top 10 employees and there is a tie for 10th between two employees, 11 or more records are displayed. If the tie is for 9th or a higher position, only 10 records are listed.

Of course, after you begin placing data into the desired order, the next thing would be to group the output and perform calculations based on the group. Grouping allows the production of subtotals and also provides more usable output in applications that require grouped output.

Displaying Groups in Output

The GROUP BY clause of the SELECT statement creates groups within the data. These groups can be used to display data in a more orderly fashion or produce more meaningful results through the use of aggregate functions.

The GROUP BY clause specifies the groups into which output is to be shown and, if aggregate functions are included, calculations of summary values are performed for each group. When GROUP BY is specified, either each column in any non-aggregate expression in the select list should be included in the GROUP BY list, or the GROUP BY expression must match exactly the select list expression.

It is important to note that if the ORDER BY clause is not specified, groups returned using the GROUP BY clause are not in any particular order. It is recommended that you always use the ORDER BY clause to specify a particular ordering of the data. Data will still be collected into groups. Examine the following example:

 SELECT Country, Count(DISTINCT City) AS 'Number of Cities'  FROM Customers GROUP BY Country

Countries are collected together and are placed in the order chosen by SQL Server (usually ascending). The number of unique cities will be counted and displayed beside the related country. By supplying the ORDER BY clause as in the following example, you sort data into descending sequence, placing the country with the greatest number of unique cities at the top:

 SELECT Country, Count(DISTINCT City) AS 'Number of Cities'  FROM Customers GROUP BY Country  ORDER BY Count(DISTINCT City) DESC

You may not want all groups to be included in the output. To exclude groups from the recordset, you can utilize the HAVING clause, which operates against the groups of data in the same way that the WHERE clause acts against the individual rows. In the next example, as shown in Figure 5.4, the listing has been narrowed down by the elimination of countries with fewer than three unique cities.

Figure 5.4. GROUP BY, HAVING, and ORDER BY used together.

COMPUTE and COMPUTE BY options of the SELECT statement also fall into the topic of grouping data. These two clauses can serve a purpose similar to that of the GROUP BY clause. GROUP BY will produce a single output in which there is a single row for each group. The output will contain only the grouping columns and aggregate functions, and the select list can contain only the grouping columns and aggregate functions. COMPUTE produces multiple results, the detail rows for each group containing the expressions from the select list and aggregate for the group, or the total aggregate for the statement. A COMPUTE list can contain expressions other than the grouping columns or aggregate functions, and aggregate functions are specified in the COMPUTE clause, not in the select list, as shown in the following example:

 SELECT DISTINCT Country, City  FROM Customers  ORDER BY Country, City  COMPUTE Count(City) BY Country

More Advanced Query Groups

CUBE and ROLLUP, along with GROUPING, can be used to effectively summarize data. The CUBE operator is primarily used in data warehousing applications. It is used to generate a resultset that is a multidimensional cube. Without getting too deep into a data warehousing discussion, a data cube is based on columns that are needed for data analysis. These columns within data cubes are more appropriately referred to as dimensions. The cube is a resultset containing a cross-tabulation of all the possible combinations of the dimensions.

The CUBE operator is specified in the GROUP BY clause of a SELECT statement. The select list contains the dimensions and aggregate expressions. The resultset contains all possible combinations of the values in the dimensions, along with the aggregate values from the underlying rows. Figure 5.5 represents the output from using CUBE.

Figure 5.5. A sample of CUBE output.

The query in Figure 5.5 returns a resultset that contains the units-on-hand subtotal for all possible combinations of category and supplier. Note the rows that contain the NULL values. These rows report subtotals for each category (where the supplier is NULL) and supplier (where the category is NULL). The row containing the two NULL values represents the grand total.

NULL values can present a problem in distinguishing a NULL generated by the CUBE from a NULL in the actual data. You can solve this quandary using the GROUPING function. GROUPING returns 0 if the value came from the data, and 1 if the value is a NULL generated by the CUBE. GROUPING is represented in Figure 5.6.

Figure 5.6. GROUPING with CUBE.

Roll Up the Cube

The ROLLUP operator is similar to CUBE but somewhat more useful in standard applications that do not necessarily pertain to data warehousing. The ROLLUP operator is used in generating reports that contain subtotals and totals. Whereas the CUBE operator generates its results for all combinations of values, the ROLLUP operator generates only a set of values for a hierarchy. The ROLLUP operation does not report on values from the lower elements of the hierarchy. A similar query as previously used in Figure 5.5 produces a much smaller resultset, as shown in Figure 5.7.

Figure 5.7. A sample of ROLLUP output.

You must know the basics of CUBE, ROLLUP, and GROUPING for this exam, but don't get too hung up on the granular details. Data warehousing is a complete topic in its own right. Is has its own certification exam covering the material in depth.

We have peered into a little of what data warehousing is about, but let's bring the focus back to the heart of the design exam. SQL Server is a relational database management system. This section focuses on the "relational" part. It covers how tables relate to one another. In Chapter 2, "Creating a Logical Data Model," you covered how primary keys and foreign keys are used in a database model, and how they impact storing of data. This chapter shows you how to retrieve data from a relational database. This involves using the various join types: INNER, RIGHT, LEFT, and CROSS. You're also going to learn about using a technique called derived tables to simplify query writing.