SQL is based on mathematical principlesspecifically, on set theory and relational algebra. The data is stored in the database as unordered sets of data records. SQL is a set-oriented language, and many of its language elements are directly related to relational algebraic terms, such as permutation, projection, restriction, and join.
A set of data is represented in a DB2 database as a table or a view and is stored in a DB2 table without regard to order. To retrieve data in a particular order, an ORDER BY phrase must be added to a SELECT statement. Similarly, if the data is to be grouped, a GROUP BY phrase must be added to the statement.
Let's review the DB2USER1 database design defined in the previous chapter and manipulate data, using various SQL statements. Each of the three main tablesCANDIDATE, TEST, and TEST_CENTERrepresents a set of records that correspond to a test candidate (person), a test, and a test center (location).
An associative table, known as the TEST_TAKEN table, is used to reflect the relationships among the three main tables. The TEST_TAKEN table is used to schedule the test candidates and to maintain their test scores.
The longest SQL statement is 2MB.
Remember that to execute any operation, the user must have the necessary privileges.
Retrieving the Entire Table
The most basic of all retrieval commands involves the SELECT statement, with no operators other than the name of the table. The following SQL statement retrieves all the candidates who have taken a DB2 Certification exam. The information requested is contained in the table TEST_TAKEN:
SELECT * FROM DB2USER1.TEST_TAKEN
In this example, using an asterisk (*) for column selection, the columns will be returned to the user in the order in which they are defined in the table.
SQL is a data access language that consists of language statements and clauses. Its many optional clauses can be used to modify the output. The output of a SELECT statement is known as a result set, or result table. The results from the SELECT statement are shown next:
CID TCID NUMBER DATE_TAKEN START_TIME FINISH_TIME SCORE PASS_FAIL SEAT_NO --- ---- ------ ---------- ---------- ----------- ----- --------- ------- 111 TX01 500 01/01/2000 11:30:00 12:30:00 65 Y 1 111 TX01 501 02/02/2000 10:30:00 11:45:00 73 Y 1 111 TX01 502 03/03/2000 12:30:00 13:30:00 67 Y 1 222 TR01 500 01/01/2000 14:00:00 15:30:00 55 N 2 222 TR01 502 01/02/2000 09:00:00 10:15:00 53 N 2 222 TR01 502 02/18/2000 10:00:00 11:30:00 75 Y 2 333 TX01 500 03/01/2000 11:30:00 13:00:00 82 Y 2 333 TX01 501 12/29/2000 14:00:00 - - - 1 333 TX01 502 03/02/2000 14:00:00 14:30:00 92 Y 1 9 record(s) selected.
In SQL, the * is used to indicate that all columns of a table are being referenced. In this example, the SQL statement refers to all the columns defined for the DB2USER1.TEST_TAKEN table. If the table is altered and a new column is added to the table definition, the result set contains the new column.
Adding a new column to an existing table will result in default values being populated for the existing rows.
Because the output of the SQL statement using the * character varies according to the table definition, it is recommended that you specify in the SELECT statement all the column names you want to see. We could have obtained the same result as the statement using SELECT * with the following SQL statement:
SELECT CID, TCID, NUMBER,DATE_TAKEN,START_TIME, FINISH_TIME,SCORE,SEAT_NO FROM DB2USER1.TEST_TAKEN;
The * character is used to refer to all the columns defined for a table. The order of the columns in the result table is the same order as specified in the CREATE TABLE or CREATE VIEW statement.
What is returned to the user is known as the result set. If the result set is large, it is advisable to filter the data, using a WHERE predicate.
Because it describes the locationtable or viewof the data, the FROM clause is required for the SELECT SQL statement. Our example references a single table calledN DB2USER1.TEST_TAKEN. The SELECT and FROM clauses are required in all data-retrieval statements. The list of columns following the SELECT keyword is referred to as the select list.
Projecting Columns from a Table
Projection is a relational operation that allows you to retrieve a subset of the defined columns from a table. The next example restricts the output from the SELECT command so that only the candidate ID, test center, and test number attributes from the TEST_TAKEN table are shown:
SELECT CID,TCID,NUMBER FROM DB2USER1.TEST_TAKEN
The output of this SELECT statement follows:
CID TCID NUMBER --- ---- ------ 111 TX01 500 111 TX01 501 111 TX01 502 222 TR01 500 222 TR01 502 222 TR01 502 333 TX01 500 333 TX01 501 333 TX01 502 9 record(s) selected.
The order of the columns in the result table will always match the order in the select list. The order of the columns as they were defined in the CREATE TABLE or CREATE VIEW statement is ignored when a select list is provided in the SQL statement. In this example, the order of the columns is similar to that in the CREATE TABLE statement, as the CID column was defined prior to the TCID and NUMBER columns.
Changing the Order of the Columns
Permutation is the relational operation that allows you to change the order of the columns in your result table. Permutation is used every time you select columns in an order different from the order defined in the CREATE TABLE statement. For example, to display the test center ID prior to the candidate IDs and the test number, you could execute the following:
SELECT TCID,CID,NUMBER FROM DB2USER1.TEST_TAKEN
The result of this SELECT statement specifies a select list in a different order from that definedin the table definition:
TCID CID NUMBER ---- --- ------ TX01 111 500 TX01 111 501 TX01 111 502 TR01 222 500 TR01 222 502 TR01 222 502 TX01 333 500 TX01 333 501 TX01 333 502 9 record(s) selected.
We refer to the output of a SELECT statement as the result table because the output of all SELECT statements can be considered a relational table.
Restricting Rows from a Table
Restriction is a relational operation that filters the resulting rows of a table. Restriction can be accomplished through the use of predicates defined in an SQL WHERE clause. To restrict the result set, we need to add a WHERE clause to the SQL statement.
A predicate is a condition placed on the data. The result of the condition is trUE, FALSE, or UNKNOWN.
The WHERE clause specifies conditions, or predicates, that must be evaluated by DB2 before the result table is returned to the end user. Many valid types of predicates can be used. In the following example, the equality (=) predicate is used to restrict the records to only those candidates who have taken a DB2 Certification test at the test center TR01:
SELECT TCID,CID FROM DB2USER1.TEST_TAKEN WHERE TCID ='TR01'
The WHERE clause also accepts other comparison operators, such as greater than (>), less than (<), greater than or equal to (>=), less than or equal to (<=), and not equal to (<>). This statement is an example of a basic predicate. A basic predicate compares two values. In addition, more complex predicates, such as LIKE, BETWEEN, and IN, are also valid and will be discussed later.
Predicate Evaluation for Distinct Types
If the column length was defined as a distinct type, in order to make a comparison in the WHERE clause valid using this column, a casting function needs to be used. This is not mandatory if the comparison is to a literal or host variable, but is demonstrated as follows:
SELECT NUMBER, LENGTH FROM DB2USER1.TEST WHERE LENGTH=CAST(60 AS DB2CERT.MINUTES)
Predicate evaluation requires that the data types be compatible, that is, the same data type or a compatible data type. We can accomplish the data type conversion, or cast, by using either the CAST expression or a casting function. (Refer to Chapter 15 for more information on casting for distinct types.)
Although not always mandatory, casting is recommended to ensure that compatible data types are compared, including length and scale. This allows DB2 to resolve these issues as indexable or stage 1 rather than as stage 2 predicates. For more information on stage 1 and stage 2 predicates refer to Chapter 6.
Using Multiple Conditions to Restrict Rows
It is possible to combine multiple conditions, or predicates, in a single SQL statement. The predicates can be combined by using Boolean operators, such as the AND or OR operators. These operators allow you to combine multiple conditions in a single SQL statement. The order of the predicate evaluation will not affect the result set, known as set closure.
The next example retrieves the records for the test candidates who took a test at test center TR01 and achieved a score greater than 65. The SQL statement uses multiple predicates. The rows that satisfy the predicates are known as the qualifying rows.
SELECT TCID,CID,SCORE FROM DB2USER1.TEST_TAKEN WHERE TCID='TR01' AND SCORE > 65
Selecting Columns from Multiple Tables
Two operationsCartesian product and joincombine columns from multiple tables in a single SQL statement.
A Cartesian product is a relational operation that merges all the values from one table with all the values from another table. This operation is not used frequently, because the result table can be very large. The number of rows in the result table is always equal to the product of the number of rows in the qualifying rows for each of the tables being accessed.
The DB2 optimizer may choose to use a Cartesian product of unrelated tables if this is deemed to be an efficient method of accessing multiple tables. An example is two single-row tables that are joined with a large table. The cross-product of 1 x 1 = 1; thus, the large-table access is deferred as late as possible, with a potential increase in the restrictive predicates that can be applied without incurring the overhead of a large Cartesian result. This method of table access is typical in processing queries against a star schema data model.
The following example is a Cartesian product of all test numbers and test names from the TEST table, with all candidates from the TEST_TAKEN table. First, select from the TEST table:
SELECT NUMBER, NAME FROM DB2USER1.TEST; NUMBER NAME ------ -------------------------------------------------- 500 DB2 Fundamentals 501 DB2 Administration 502 DB2 Application Development 3 record(s) selected.
Next, select all the candidates from the TEST_TAKEN table:
SELECT CID,TCID FROM DB2USER1.TEST_TAKEN; CID TCID --- ---- 222 TR01 222 TR01 222 TR01 111 TX01 111 TX01 111 TX01 333 TX01 333 TX01 333 TX01 9 record(s) selected.
Then combine the two tables to form a Cartesian product result table:
SELECT DB2USER1.TEST_TAKEN.NUMBER,CID,TCID FROM DB2USER1.TEST_TAKEN,DB2USER1.TEST; NUMBER CID TCID ------ --- ---- 500 111 TX01 501 111 TX01 502 111 TX01 500 222 TR01 502 222 TR01 502 222 TR01 500 333 TX01 501 333 TX01 502 333 TX01 500 111 TX01 501 111 TX01 502 111 TX01 500 222 TR01 502 222 TR01 502 222 TR01 500 333 TX01 501 333 TX01 502 333 TX01 500 111 TX01 501 111 TX01 502 111 TX01 500 222 TR01 502 222 TR01 502 222 TR01 500 333 TX01 501 333 TX01 502 333 TX01 27 record(s) selected.
Two tables are referenced in the FROM clause of this query. The tables are separated by commas. The WHERE clause contains no relationship expression. This type of query results in a Cartesian product.
The result table is a representation of all possible combinations of the input tables. The TEST table has 3 rows, and the TEST_TAKEN table has 9 rows. Therefore, the SELECT statement shown earlier returns 27 rows. Note the first column name in this query. It is necessary to fully qualify the column name by providing the schema name and table name with the column name because this column exists in both the TEST table and TEST_TAKEN table. In this case, we needed to specify that the number column is to be retrieved from the DB2USER1.TEST_TAKEN table and not from the DB2USER1.TEST table.
By adding a predicate to a Cartesian product SQL query, the result table can represent a more useful representation of the data. In the next example, the query returns all of the tests that were taken by the candidate whose ID is 111:
SELECT DB2USER1.TEST_TAKEN.NUMBER,CID,TCID FROM DB2USER1.TEST_TAKEN,DB2USER1.TEST WHERE CID='111' NUMBER CID TCID ------ --- ---- 500 111 TX01 500 111 TX01 500 111 TX01 501 111 TX01 501 111 TX01 501 111 TX01 502 111 TX01 502 111 TX01 502 111 TX01 9 record(s) selected.
Adding a WHERE clause to your query does not always provide the desired result. In the preceding example, you want to know all the tests that were taken by the candidate whose ID is 111, and the query returns nine rows. But we know from previous queries that the candidate took only three tests. The query in this example has a WHERE clause to filter out the candidate whose ID is 111 from the TEST_TAKEN table, but there was no filter on the TEST_CENTER table. Therefore, the result of the query would always be a multiple of the number of testing centers. Usually, when multiple tables are referenced, you should include a cross-table relationship, using a table merge, or join, method, as shown in the following example:
SELECT DB2USER1.TEST_TAKEN.NUMBER,CID,TCID FROM DB2USER1.TEST_TAKEN,DB2USER1.TEST WHERE CID= '111' AND DB2USER1.TEST_TAKEN.NUMBER=DB2USER1.TEST.NUMBER; NUMBER CID TCID ------ --- ---- 500 111 TX01 501 111 TX01 502 111 TX01 3 record(s) selected.
We will examine table join methods further in the next section.
To avoid data redundancy, the database tables should be normalized. Following a normalization process, a number of related tables will exist. To satisfy some of the required queries, the tables must be reconstructed. The tables are reconstructed temporarily, using a table join strategy to produce a single-result table.
The result tables in the previous examples usually provided candidate ID numbers and not the complete name of the test candidates. The candidate IDs are stored in the TEST_TAKEN table, and the full names are stored in the CANDIDATE table. To obtain the name of a candidate, the data must be retrieved from the CANDIDATE table, using a relationship, or join, strategy.
Consider an example that will list the names and phone numbers of candidates who were registered to take a DB2 Certification test in 2000. To accomplish this, we need to select data from two tables: DB2USER1.CANDIDATE and DB2USER1.TEST_TAKEN.
Let's retrieve a list of candidate names, phone numbers, and IDs from the CANDIDATE table. The candidate names were stored in multiple columns to allow for easy retrieval by last name:
SELECT FNAME,INITIAL,LNAME,HPHONE,CID FROM DB2USER1.CANDIDATE;
The output of this example follows. Pay special attention to the values in the CID column. It will be used as the join column in the next example:
FNAME INITIAL LNAME HPHONE CID -------- ------- ---------- ---------- --- Susan M Lawson 1115551234 111 Dan Luksetich 2226543455 222 John H Maenpaa 4442314244 333 Jeremy Peck 5552143244 444 4 record(s) selected.
Now, let's retrieve the ID numbers of those candidates who were registered to take the test in 2000:
SELECT DISTINCT CID FROM DB2USER1.TEST_TAKEN WHERE YEAR(DATE_TAKEN) = 2000; CID --- 222 111 333 3 record(s) selected.
The candidate IDs in the TEST_TAKEN table must correspond to a candidate ID in the CANDIDATE table because of the declarative referential integrity constraints. The parent table in the relationship is the CANDIDATE table, and the child, or dependent, table is the TEST_TAKEN table.
The result table from the preceding query does not include test candidate 444, who did not have a test scheduled for 2000. We need to join the two result tables, based on the candidate ID values. This column is known as the join column.
Query performance can significantly improve if the join columns are appropriately indexed.
The following single query will satisfy the end user requirement:
SELECT DISTINCT FNAME,INITIAL,LNAME,HPHONE FROM DB2USER1.TEST_TAKEN,DB2USER1.CANDIDATE WHERE YEAR(DATE_TAKEN) = 2000 AND DB2USER1.TEST_TAKEN.CID=DB2USER1.CANDIDATE.CID;
A table join requires a predicate that includes an expression based on columns from the tables referenced in the FROM clause. This is known as a join predicate. The FROM clause has not changed from the Cartesian product examples. The only difference is in the join predicate (TEST_TAKEN.CID = CANDIDATE.CID).
An alternative form of the FROM clause for joins involves explicitly coding the JOIN keyword between the tables, rather than using a comma, and coding the join predicates in the ON clause rather than in the WHERE clause. This method of coding is covered in more detail in Chapter 6.
The table names needed to be explicitly stated because a column named CID is in both of the referenced tables. When multiple tables are being accessed in a single query, any selected columns that occur in more than one table must be qualified with the table name.
An error will occur if the columns being referenced are ambiguous or not properly qualified.
A maximum of 225 base tables can be in a single SQL statement; however, you are limited to 15 tables in a join (the FROM clause).
The 15-table limit can be increased by updating the macro SPRMMXT to a value greater than 15.
The kind of join operation shown in the preceding example is also known as an inner join. An inner join displays only the rows that are present in both of the joined tables.
Using Correlation Names
If each of the columns needed to be fully qualified with the table name, such as tableschema.tablename.columnname, the queries would become very large and cumbersome to work with. Fortunately, an easier way exists to qualify the ambiguous columns resulting from a multitable SELECT statement.
The columns can be qualified using a correlation name. A correlation name is a temporary alias for the tables referenced in an SQL statement. We rewrite the previous query using correlation names as follows:
SELECT DISTINCT FNAME,INITIAL,LNAME,HPHONE FROM DB2USER1.TEST_TAKEN TT, DB2USER1.CANDIDATE C WHERE YEAR(DATE_TAKEN) = 2000 AND TT.CID = C.CID;
The correlation name immediately follows the name of the table as stated in the FROM clause. In this example, the correlated name for the TEST_TAKEN table is TT, and the correlated name for the CANDIDATE table is C.
For readability, correlation names can be optionally prefixed with the AS keyword, such as DB2USER1.CANDIDATE AS C.
The correlated names are accessible within the SQL statement only. Following the execution of the SQL statement, the correlation name is no longer defined. Once a correlation name has been defined, it can be referenced in the rest of the query instead of the table name. However, the table name can still be referenced.
Use simple, easy-to-remember correlation names. Table initials are good candidates for correlation names.
Sorting Your Output
We have been retrieving data from one or more tables. The order of the result table has not been specified in any of the SQL statements. Therefore, the data is retrieved in an undetermined order if the SQL statement contains no ORDER BY clause.
The only guaranteed way to return data in the required sequence is with the ORDER BY clause. Any data retrieval that is currently returned in sequence without this clause is based purely on the data retrieval method at the time. Any future access path execution may not return the data in the same sequence.
The following example produces a list of the test candidates in alphabetical order by last name for the candidates who have taken a DB2 Certification test at the TR01 test center:
SELECT LNAME,INITIAL,FNAME FROM DB2USER1.CANDIDATE C,DB2USER1.TEST_TAKEN TT WHERE C.CID=TT.CID AND TCID='TR01' ORDER BY LNAME;
This example contains a new clause: ORDER BY. After the ORDER BY clause, you can list the columns that will specify the sort order and the type of sort.
Appropriate indexing may allow DB2 to avoid sorting the data to match the ORDER BY clause. If the data is already sequenced via the index, DB2 may choose to use the index and avoid sorting the data. DB2 cannot avoid a sort for an ORDER BY involving columns from more than one table. Where possible, include columns from only one table in the ORDER BY to give DB2 greater opportunity for sort avoidance.
The SQL can be modified so that the output is changed to descending order by last name and a secondary order column on the first name in ascending order:
SELECT LNAME,FNAME,HPHONE FROM DB2USER1.CANDIDATE C,DB2USER1.TEST_TAKEN TT WHERE C.CID=TT.CID AND TCID= 'TR01' ORDER BY LNAME DESC, FNAME;
In this example, the DESC keyword that follows the LNAME column indicates that the result table should be in descending order based on the last name. More than one record can have the same last name. This situation is quite common. A second column specified in the ORDER BY clause, fname, has no keyword specifying the sort sequence. Therefore, the default ordering sequence (ascending) is used.
The next example contains three columns: LNAME, FNAME, and HPHONE. You can reference the column that should be used to sort the data by using the column name or by specifying its position in the select list. Using the column position is useful when the column in the select list is made up of derived, or calculated, columns that have no explicit name:
SELECT LNAME,FNAME,HPHONE FROM DB2USER1.CANDIDATE C,DB2USER1.TEST_TAKEN TT WHERE C.CID=TT.CID AND TCID='TR01' ORDER BY 1 DESC, 2;
In this example, the sort order is specified using the column position. Therefore, the query result is exactly the same as in the previous example.
You can also assign an alternative column name using column renaming. The assigned name can then be referenced in the ORDER BY clause, and it appears as the column heading where appropriate.
Although useful, coding a column position rather than a column or renamed column in the ORDER BY clause may lead to a different sequence if another column that alters the column positioning is inadvertently added to the query.
The order by clause must succeed all subselects in the fullselect statement. The only clauses that can succeed the order by clause are the for read-only, optimize-for, isolation, and update clauses.
In some cases, you will need to perform calculations on the data. SQL has some basic built-in mathematical and string functions. Mathematical operations include standard addition, subtraction, multiplication, and division.
The calculation can be defined in the WHERE clause of the SQL statement or the select list. Suppose that you need to calculate a passing rate for a DB2 test. The passing rate is defined as the percentage of candidates passing the test (totalpassed*100/totaltaken). The following SQL statement will accomplish this for us for test number 500:
SELECT NUMBER, TOTALPASSED*100/TOTALTAKEN FROM test WHERE NUMBER='500';
In this example, the second column of the output list is a calculated column. Remember that you must use the column position if you want to use this calculated column for the ORDER BY clause unless you name it, as we now discuss.
Occasionally, the results of a derived column may not display as expected. The example using totalpassed*100/totaltaken will result in a value of 66 being retrieved. As both the totalpassed and totaltaken columns are integers, the final result is also an integer, and the fractional part is discarded. If this is not your desired result, you should use other functions, such as decimal, to change the way the calculation is performed or displayed.
You can specify a column name for any expression. When you provide the derivedcalculatedcolumn with a name, the ORDER BY clause can reference the derived name to allow for a more readable SQL statement.
The following SQL calculates the percentage of people passing the DB2 Certification exams and orders the output in descending order of the passing rate:
SELECT NUMBER,TOTALPASSED*100/TOTALTAKEN AS PASSEDRATE FROM TEST ORDER BY PASSEDRATE DESC;
The AS clause is used to rename the default name of an element in the select list. In this example, we are giving the name of PassedRate to the result of the division of column totalpassed by column totaltaken. The named column is used in the query to specify the column that should be used for sorting the output.
The AS keyword is optional, although it should be added for readability. Without the AS clause, it may be unclear whether the intention was to rename a column or whether a comma was missed between two retrieved columns.
DB2 provides various types of functions. For example, two types of functions are scalar and column functions. (A third type of function, called a table function, is discussed in Chapter 15.)
This section describes some of the SQL functions provided with DB2. SQL functions are categorized by their implementation type. The functions either are built in or are extensions of DB2 and are known as user-defined functions (UDFs).
Built-in functions are defined within the SQL standards and are provided by DB2. These functions can be either scalar or column functions.
Scalar functions are applied to each row of data, and a per row result is provided. A scalar function could be used to retrieve only the first three digits of telephone numbers for each candidate. The function used is called SUBSTR. The arguments for this function include a string data type column, a beginning offset, and length. The output data type and attribute of the function depend on the input data type and attribute. The following example retrieves the telephone area code for the column wphone:
SELECT LNAME, SUBSTR(CHAR(WPHONE),1,3) FROM DB2USER1.CANDIDATE;
The SUBSTR function is a scalar function. SUBSTR returns a character string of three characters. The result string corresponds to the first three characters of the wphone column. This function is known as a string function because it works with any string data type. If we wanted to give the output column a meaningful name, we could provide an alias, as was done for calculated columns.
In the preceding example, the substring starts from the beginning of the string because we indicate 1 as the second parameter of the function. The length of the resulting string is indicated in the third argument. In our example, the length is 3. Note that the data type of the wphone column is phone, so a casting function is used to convert the phone data type to the CHAR data type.
The following query will provide the month when the exam was taken. The input for this function is a DATE string, and the output is an integer:
SELECT FNAME, MONTH (DATE_TAKEN) FROM CANDIDATE C, TEST_TAKEN TT WHERE C.CID=TT.CID;
Column functions provide a single result for a group of qualifying rows for a specified table or view. Many common queries can be satisfied using column functions if they include common tasks, such as finding the smallest value, the largest value, or the average value for a group of data records. The following example obtains the maximum total passed of any of the DB2 Certification exams:
SELECT MAX(TOTALPASSEd) FROM TEST;
If a WHERE clause were added to this example, the maximum would represent the maximum value for the qualifying rows, as the predicate is used to filter the data prior to the application of the MAX function.
This next example calculates the average of the number of seats for all the test centers. Note that the column function AVG is used in this example:
SELECT AVG(noseats) FROM TEST_CENTER;
DB2 provides many more built-in functions. If you are interested in calculating statistical information, you can use statistical functions, such as VARIANCE, STDDEV, or a sampling of these functions.
MAX and MIN functions can be either column or scalar functions, depending on the input arguments.
UDFs are not defined within the SQL standards, because they are extensions of the current SQL. These functions can be developed by a DB2 administrator or application developer. UDFs can be either scalar or table functions but not usually column functions. Once the UDFs have been created, they can be invoked by any end user with the proper privileges. For more information about UDFs, refer to Chapter 15.
Many queries require some level of aggregated data, which in SQL is accomplished through the use of the GROUP BY clause. GROUP BY conceptually rearranges the table represented by the FROM clause into partitions such that within any one group, all rows have the same value for the GROUP BY field.
GROUP BY does not imply ORDER BY.
The following SQL statement obtains the average number of seats for each country:
SELECT COUNTRY, AVG(NOSEATS) FROM TEST_CENTER GROUP BY COUNTRY;
This SQL statement obtains the average number of seats per country, and the GROUP BY clause tells DB2 to group together the rows that have the same values in the columns indicated in the GROUP BY list. In our example, we are grouping countries into subsets. As the subsets are created, DB2 calculates the average of each of those groups, or subsets: in this case, by each country.
When you combine column functions and other elements, such as column names, scalar functions, or calculated columns, you must use the GROUP BY clause. In this case, you must include every element that is not a column function in the GROUP BY list. The only elements that can be omitted in the GROUP BY list are constant values.
The next SQL statement obtains a list that includes the average cut score and minimum total passed for the DB2 Certification exams. The result is then grouped by the type of exam, as follows:
SELECT TYPE, AVG(CUT_SCORE), MIN(TOTALPASSED) FROM TEST GROUP BY TYPE
Appropriate indexing can allow DB2 to avoid a sort to group the data rows to match the GROUP BY clause.
It is possible to sort the output of the previous example by using an ORDER BY clause.
GROUP BY may, but this is not guaranteed to, return data in the same order as an ORDER BY and is based on the access path. The only way to guarantee data sequence is with an ORDER BY.
Restricting the Use of Sets of Data
So far, we have discussed how to restrict output, based on row conditions. With SQL, it is also possible to restrict that output by using column functions and the GROUP BY clause. Suppose that you want a list of all the test centers that have administered more than five DB2 Certification exams. To make it easier to understand, let's first get the number of tests that have been taken in each test center:
SELECT TCID, COUNT(*) FROM TEST_TAKEN GROUP BY TCID
We use the COUNT column function to get the total number of tests that have been taken in each test center. Using an asterisk (*) with the COUNT function indicates that you want the number of rows in a table that meet the criteria established in the SQL statement. In this example, we are grouping by TCID because we have a number of occurrences for all the test centers in the TEST_TAKEN table. The TEST_TAKEN table has an entry for every DB2 Certification exam that has been taken. Finally, the output is restricted to only those test centers that have administered more than four exams:
SELECT TCID FROM TEST_TAKEN GROUP BY TCID HAVING COUNT(*) > 4;
This example introduces the HAVING clause, which is equivalent to the WHERE clause for groups and column functions. The HAVING clause will restrict the result set to the groups that meet the condition specified in it. In our example, only the test centers that have administered more than four DB2 Certification exams will be displayed.
When you execute a query, you might get duplicate rows in the answer set. SQL provides a special clause to remove the duplicate rows from your output. The following SQL statement generates a list of names and phone numbers for all the candidates who have taken a test. In the following example, we eliminate the duplicate rows from our output list, using the DISTINCT clause:
SELECT DISTINCT FNAME,WPHONE,HPHONE FROM CANDIDATE C,TEST_TAKEN TT WHERE C.CID=TT.CID;
The DISTINCT clause can also be used with the COUNT function. When you use DISTINCT inside a COUNT function, the duplicate entries for a particular column will not be counted. The following example allows you to count how many test centers have candidates registered:
SELECT COUNT(DISTINCT TCID) FROM TEST_TAKEN;
This example provides the number of test centers that are registered in the TEST_TAKEN table. Remember that all the candidates who have registered for DB2 Certification exams are stored in this table. Make sure that you understand how COUNT(*), COUNT(colname), and COUNT(DISTINCT colname) differ. They are very similar in syntax but differ in function.
COUNT(*) returns a count of all rows that qualify against the WHERE clause. COUNT(colname) returns a count of all rows that qualify against the WHERE clause, with null occurrences of colname removed. COUNT(DISTINCT colname) counts distinct occurrences of colname, with nulls removed.
Searching for String Patterns
SQL has a powerful predicateLIKEthat allows you to search for patterns in character string columns. Suppose that you want to generate a list of the candidates whose first name starts with the letter G:
SELECT FNAME,LNAME,WPHONE,HPHONE FROM CANDIDATE WHERE FNAME LIKE 'G%' ORDER BY LNAME,FNAME;
In this query, we use a wildcard character with the LIKE predicate. In SQL, the percent character (%) is a substitute for zero or more characters. The search string G% can be substituted with such names as George, Gary, Ginger, and so on. (Because the percent character can substitute zero or more characters, the search string can also be a single letter G.)
The percent character can be used any place in the search string and as many times as you need it. The percent sign is not case sensitive, so it can take the place of uppercase or lowercase letters. However, the constant characters included in your search string are case sensitive.
Another wildcard character used with the LIKE predicate is the underline character (_). This character substitutes one and only one character. The underline character can take the place of any character but cannot be substituted for an empty character.
If the pattern needs to search for occurrences of the wildcard characters % and _ as their values, the ESCAPE clause is used to specify a character that precedes the percent or underscore in the pattern.
The previous SQL can be modified to include all candidates' names and the telephone numbers for those candidates whose name has the letter A as its second letter:
SELECT FNAME,LNAME,WPHONE,HPHONE FROM CANDIDATE WHERE FNAME LIKE '_A%' ORDER BY LNAME,FNAME;
This example uses two wildcard characters that work with the LIKE predicate. The search string in this example can include such names as Paul, Gabriel, or Natalie. (The first character may be any character, the lowercase letter a is the second character in the string, and the string ends with any number of characters.)
When the pattern in a LIKE predicate is a fixed-length host variable, the correct length must be specified for the string to be returned. 'G%' assigned to an 8-byte variable (LIKE :variable) will search for all occurrences of 'G% ' (G, followed by any character, followed by six blank characters). To find rows that begin with a G, 'G%%%%%%%' should be assigned to the fixed-length variable.
Searching for Data in Ranges
SQL also offers a range operator, which is used to restrict rows that are in a particular range of values. Consider the requirement to list those candidates whose scores in the DB2 Certification exam are between 60 and 75:
SELECT DISTINCT FNAME,LNAME,WPHONE,HPHONE FROM CANDIDATE C, TEST_TAKEN TT WHERE C.CID=TT.CID AND SCORE BETWEEN 60 AND 75
The BETWEEN predicate includes the values that you specify for searching your data. An important fact about the BETWEEN predicate is that it can work with character ranges as well.
In addition to the score requirement, this example modifies the SQL to include only those candidates whose last name begins with a letter between B and G:
SELECT DISTINCT FNAME,LNAME,WPHONE,HPHONE FROM CANDIDATE C, TEST_TAKEN TT WHERE C.CID=TT.CID AND SCORE BETWEEN 60 AND 75 AND LNAME BETWEEN 'B' AND 'GZ'
In this example, the second BETWEEN predicate contains character values. We need to specify the GZ value to include all the possible names that start with the letter G. This was done assuming that the letter Z is the last possible value in the alphabet.
The arguments of the BETWEEN clause are not inter-changeable; the first must specify the low value; the second, the high value. BETWEEN 1 AND 2 will return all values within the range (inclusive of 1 and 2). BETWEEN 2 AND 1 will return zero rows.
Searching for Null Values
Null values represent an unknown value for a particular occurrence of an entity. We can use a null value if we don't know a particular value of a column. Let's say that we want a list of all those candidates whose score is not yet input. This condition is represented with a null value:
SELECT FNAME,LNAME,WPHONE,HPHONE FROM CANDIDATE C, TEST_TAKEN TT WHERE C.CID=TT.CID AND SCORE IS NULL;
The IS predicate is used to search for the null value in this example. Remember that the null value means unknown. Because it has no particular value, it can't be compared with other values. You can't use conditional operands, such as equal (=) or greater than (>), with null values.
Searching for Negative Conditions
The BETWEEN, IS, and LIKE predicates always look for the values that meet a particular condition. These predicates can also be used to look for values that don't meet a particular condition.
The NOT predicate can be used to look for the opposite condition, combined with the LIKE, BETWEEN, and IS predicate, to accomplish negative searches, as shown in the following example. This example has a LIKE predicate combined with the NOT predicate. We want a list of those candidates whose last names do not start with the letter S:
SELECT DISTINCT FNAME,LNAME,WPHONE,HPHONE FROM CANDIDATE WHERE LNAME NOT LIKE 'S%' ORDER BY LNAME,FNAME;
The next example has a BETWEEN predicate combined with the NOT predicate. We want a list of those candidates whose score, in any test, is not in the range 60 to 75:
SELECT DISTINCT FNAME,LNAME,WPHONE,HPHONE FROM CANDIDATE C, TEST_TAKEN TT WHERE C.CID=TT.CID AND INTEGER(SCORE) NOT BETWEEN 60 AND 75;
In this example, the NOT predicate will exclude all the values that are in the range 60 to 75.
Negation can also be applied to the null value. This SQL statement produces a report that searches for those candidates who have a seat number assigned. This is expressed with a NOT NULL value:
SELECT DISTINCT FNAME,LNAME,WPHONE,HPHONE FROM CANDIDATE C, TEST_TAKEN TT WHERE C.CID=TT.CID AND SEAT_NO IS NOT NULL;
The NOT operator can also be used to negate the standard comparison operators: =, <, <=, >, and >=.
When dealing with nulls, NOT or negation may not return the opposite of the positive logic. For example, WHERE SEAT_NO = 1 will return only the rows for seat number 1. Any value other than 1 is discarded, as these rows are FALSE, including nulls, as these are UNKNOWN. WHERE SEAT_NO <> 1 excludes rows where the seat number is 1 and also discards nulls because these are UNKNOWN.
Searching for a Set of Values
In SQL, it is possible to establish a restriction condition based on a set of values. Suppose that you need a list of the test centers that have candidates registered for the DB2 Fundamentals test and for the DB2 Application Development test. This can be queried with the following statement:
SELECT DISTINCT NAME,PHONE FROM TEST_CENTER TC, TEST_TAKEN TT WHERE TC.TCID=TT.TCID AND (NUMBER = '500' OR NUMBER ='502');
To simplify building multiple OR conditions when multiple values for the same column are being compared, you can rewrite the statement, using the IN clause:
SELECT DISTINCT NAME,PHONE FROM TEST_CENTER TC, TEST_TAKEN TT WHERE TC.TCID=TT.TCID AND NUMBER IN ('500','502');
The IN clause is used to denote a set of values. In this example, we use a constant set of values.
You can also use the NOT predicate with the IN clause. In this case, the condition will be true when a value is not present in the set of values provided to the IN clause. You can use as many values as you wish in the IN clause, within the defined limits of the size of a SQL statement.
Advanced Selection Functionality
Chapter 6 looks at more of the functionality and power of the SELECT statement. The following topics are covered: