The heart of SQL in a relational system is the ability to perform relational joins. Joins are specified by using more than one table name in the FROM clause of a query. A joined table specifies an intermediate-result table that is the result of either an inner join or an outer join.
Until version 8, the maximum number of tables that could be coded in a single FROM clause was 15, but up to 225 table references can be in an SQL statement. To increase the limit of 15, update the macro SPRMMXT to a value greater than 15. Also, the 15-tables limit does not apply to star join queries. Version 8 of DB2 fully supports 225 tables in a query, including subqueries.
The join operation used most often is the one we have been using in the book exercises, which is known as an inner join. An inner join combines each row of the left table with each row of the right table, keeping only the rows where the join condition is trUE. The result table may be missing rows from either or both of the joined tables. The following example shows an inner join without using the INNER JOIN syntax, which means that the inner join is implicitly defined. The tables in the FROM clause are separated by commas to indicate the join. This type of syntax is valid only for inner joins.
In the example following, we wish to find the scores of all DB2 tests taken by U.S. citizens in the small test centers year by year, in order of ascending year and descending score:
SELECT T.NAME, YEAR(TT.DATE_TAKEN) AS YEAR, TT.SCORE FROM TEST T, TEST_TAKEN TT, TEST_CENTER TC, CANDEDATE C WHERE C.CID = TT.CID AND TC.TCID = TT.TCID AND T.NUMBER = TT.NUMBER AND T.NAME LIKE 'DB2%' AND C.COUNTRY='USA' AND TC.NOSEATS < 10 ORDER BY YEAR(TT.DATE_TAKEN), TT.SCORE DESC
If the inner join used the explicit syntax, it would appear as follows:
SELECT T.NAME, YEAR(TT.DATE_TAKEN) AS YEAR, TT.SCORE FROM TEST T INNER JOIN TEST_TAKEN TT ON T.NUMBER = TT.NUMBER INNER JOIN TEST_CENTER TC ON TC.TCID = TT.TCID INNER JOIN CANDIDATE C ON C.CID = TT.CID WHERE T.NAME LIKE 'DB2%' AND C.COUNTRY='USA' AND TC.NOSEATS < 10 ORDER BY YEAR(TT.DATE_TAKEN), TT.SCORE DESC
For the explicit join syntax, the keyword INNER is optional and, if omitted, implies an inner join. The ON clause determines the join predicates between the tables, and the WHERE clause applies the local or table access predicates. For an inner join, the result is not impacted if the local predicates appear in the WHERE or ON clause. It is recommended that, for consistency with outer joins, join predicates appear only in the ON clause; local predicates, in the WHERE clause. For outer joins, the placement of predicates in either the ON or WHERE clause can dramatically change the result. This is demonstrated in the section on outer joins.
Each join must be followed by its ON clause, which dictates the two tables that are being joined. The joined tables must appear in the FROM clause before they are referenced by join predicates in the ON clause. In the preceding example, the first join is between TEST and TEST_TAKEN. The second join is between TEST_CENTER and TEST_TAKEN as dictated by the ON clause. The third join is between CANDIDATE and TEST_TAKEN.
One exception to the rule that the ON clause must follow each join is when nesting joins within other joins. For simplicity, this syntax is not recommended and can easily be recoded to suit the rule just cited.
Deciding whether to code the explicit join syntax or to separate the joined tables with commas is purely a matter of preference and is not related to performance. However, the explicit join syntax has the benefits of compatibility with the outer join syntax; improved readability of large SQL statements, as each join is followed by its join predicate; and increased likelihood that all join predicates will be coded correctly.
A different kind of join operation is the outer join. The result set of an inner join consists only of those matched rows that are present in both joined tables. What happens when we need to include those values that are present in one or another joined table but not in both of them? In this case, an OUTER JOIN operation is needed. Outer joins are designed to generate an answer set that includes those values that are present in joined tables and those that are not. This type of optional relationship between tables is very common. Figure 6-2 shows the best way to view outer joins, representing the data from each table that is included/excluded in each type of join.
Figure 6-2. Outer joins
Outer joins are of three types: left outer join, right outer join, and full outer join.
Left Outer Join
A LEFT OUTER JOIN operation, or left join, produces an answer set that includes the matching values of both joined tablesinner join rowsand those values present only in the left joined tablerows exclusive to the left table. The left joined table is the one used in the left part of the LEFT OUTER JOIN operator when coding the join operation in the FROM clause.
In a left outer join, the left table is referred to as the preserved row table. If a row is unmatched between the two joined tables, it is still returned, or preserved, from the left table. The opposite, or right, table is the null-supplying table. When a row is unmatched from the left to the right table in the join, the row is supplied with nulls from the right table.
With left and right outer joins, reversing the order of tables in the FROM clause may alter the result.
Suppose that we have been requested to generate a report that includes the first name, the phone number, and the highest score for all the candidates present in the CANDIDATE table. If an inner join is used, the report will include data for only those candidates present in the TEST_TAKEN table.
Without using the outer-join syntax, this request could be solved using some SQL statements already discussed; however, the construction would be complex. Instead, we will use the left outer join to satisfy the request, as the following example shows:
SELECT C.FNAME, C.WPHONE, MAX(TT.SCORE) FROM CANDIDATE C LEFT OUTER JOIN TEST_TAKEN TT ON C.CID=TT.CID GROUP BY FNAME, WPHONE;
The keyword OUTER is optional. Provided that the JOIN is preceded by LEFT, RIGHT, or FULL, OUTER is implied.
Note the syntax used to indicate a left outer join. The LEFT OUTER JOIN operator is used to indicate the left outer join operation. In this example, the answer set includes those candidates not present in the TEST_TAKEN table. The MAX(TT.SCORE) column will show nulls for those candidates.
When coding outer joins, it is important to note the difference between WHERE and ON clause predicates. From the previous example, we may want to retrieve only those candidates with a CID of 111, 333, or 444:
SELECT C.CID, C.FNAME, C.WPHONE, MAX(TT.SCORE) AS MAX_SCORE FROM CANDIDATE C LEFT OUTER JOIN TEST_TAKEN TT ON C.CID=TT.CID WHERE C.CID IN (111, 333, 444) GROUP BY C.CID, C.FNAME, C.WPHONE CID FNAME WPHONE MAX_SCORE --- ----- ---------- --------- 111 SUSAN 1115551234 92 333 DAN 4442314244 73 444 JOHN 5552143244 - 3 record(s) selected.
The WHERE clause predicate in this example is applied to the left, or preserved, row table. DB2 is able, therefore, to apply this predicate before the join to the CANDIDATE table and join only candidates 111, 333, and 444 to the TEST_TAKEN table. CID 444 in this example returns a MAX_SCORE of null because the candidate has not taken any tests and is therefore not found in the TEST_TAKEN table.
Consider the result if the query is written to also exclude those rows with a score less than 75, if found:
SELECT C.CID, C.FNAME, C.WPHONE, MAX(TT.SCORE) AS MAX_SCORE FROM CANDIDATE C LEFT OUTER JOIN TEST_TAKEN TT ON C.CID=TT.CID WHERE C.CID IN (111, 333, 444) AND TT.SCORE >= 75 GROUP BY C.CID, C.FNAME, C.WPHONE CID FNAME WPHONE MAX_SCORE --- ----- ---------- --------- 111 SUSAN 1115551234 92 1 record(s) selected.
Only one row qualifies from this query. The WHERE clause predicate on the preserved row table (WHERE C.CID IN (111, 333, 444)) can be applied before the join to limit the number of rows that must be joined. The additional WHERE clause predicate (AND TT.SCORE >= 75) is applied after the join because it is not known until after the join what the value of tt.score will be. This predicate is applied to each row, with the following result:
Remember that for a row to be returned, the WHERE clause must evaluate to TRUE. Because the nulls introduced by the outer join are discarded by the WHERE clause applied to the null-supplying table, DB2 will rewrite the left join to an INNER JOIN, thus allowing the WHERE clause predicates to be applied to both tables before the joining of the rows occurs:
SELECT C.CID, C.FNAME, C.WPHONE, MAX(TT.SCORE) AS MAX_SCORE FROM CANDIDATE C INNER JOIN TEST_TAKEN TT ON C.CID=TT.CID WHERE C.CID IN (111, 333, 444) AND TT.SCORE >= 75 GROUP BY C.CID, C.FNAME, C.WPHONE
The outer join simplification will be recorded in the JOIN_TYPE column of the plan table. If a left, right, or full join has been simplified to an inner join, the JOIN_TYPE value will be blank.
If you wish to maintain the null-supplied rows, you must include an additional IS NULL predicate, such as:
SELECT C.CID, C.FNAME, C.WPHONE, MAX(TT.SCORE) AS MAX_SCORE FROM CANDIDATE C LEFT OUTER JOIN TEST_TAKEN TT ON C.CID=TT.CID WHERE C.CID IN (111, 333, 444) AND (TT.SCORE >= 75 OR TT.SCORE IS NULL) GROUP BY C.CID, C.FNAME, C.WPHONE CID FNAME WPHONE MAX_SCORE --- ----- ---------- --------- 333 SUSAN 4442314244 92 444 JOHN 5552143244 - 2 record(s) selected.
In this case, DB2 will not simplify the left outer join to an inner join; because the null-supplied row is not discarded by the WHERE clause predicate. Thus, the row for Glen is included in the result set because it is null.
WHERE clause predicates applied to the preserved row table can be applied before the join. Predicates applied to the null-supplying table will cause DB2 to simplify the join unless a predicate of the form OR COL IS NULLor equivalent expression that retains nullsis coded.
We've looked at WHERE clause predicates applied to either the preserved row or null-supplied tables. The next objective is to demonstrate the impact of coding local predicates in the ON clause. The following example displays the output if we specify the candidate predicate (111, 333, or 444) in the ON clause rather than in the WHERE clause:
SELECT C.CID, C.FNAME, C.WPHONE, MAX(TT.SCORE) AS MAX_SCORE FROM CANDIDATE C LEFT OUTER JOIN TEST_TAKEN TT ON C.CID=TT.CID AND C.CID IN (111, 333, 444) GROUP BY C.CID, C.FNAME, C.WPHONE CID FNAME WPHONE MAX_SCORE --- ----- ---------- --------- 111 BILL 1115551234 73 222 GEORGE 2226543455 - 333 SUSAN 4442314244 92 444 GLEN 5552143244 - 555 JIM 6662341234 - 666 KEVIN 7773142134 - 777 BERT 8886534534 - 888 PAUL 9992112212 - 8 record(s) selected.
Because no WHERE clause predicates restrict the result, this query returns all rows of the CANDIDATE table. Columns from the TEST_TAKEN table are supplied only when the ON clause is TRUE. Remember, WHERE clause predicates restrict the result, whereas ON clause predicates determine the criteria for the join. For the rows not matching the full ON clause (CID NOT IN 111, 333note that 444 is not included, because it will not be found in the TEST_TAKEN table), DB2 will preserve the row from the left table and supply nulls from the right table. Rows matching the ON clause (CID IN 111, 333) will match and thus return columns from both tables.
If it can determine before joining the rows that the ON clause will fail, DB2 will automatically supply nulls for the null-supplied columns without attempting the join. This improves performance, as rows are not joined unnecessarily if the final result will not match.
Right Outer Join
A RIGHT OUTER JOIN operation, or right join, produces an answer set that includes the matching values of both joined tables and those values present only in the right joined table. The right joined table is the one used in the right part of the RIGHT OUTER JOIN operator when coding the join operation.
For a right join, the right table is the preserved row table, and the left table is the null-supplied table for unmatched rows. This is the opposite of a left join. An example using a right outer join follows:
SELECT NAME, COUNT(DISTINCT TT.CID) FROM TEST_TAKEN TT RIGHT OUTER JOIN TEST T ON TT.NUMBER = T.NUMBER
This example requests all test names present in the TEST table and the number of candidates who scheduled or took each test. Note that there may be some tests for which no candidate was scheduled. You cannot report such tests using an inner join statement; however, you can do so using a right outer join.
The COUNT(DISTINCT TT.CID) will count unique occurrences of TT.CID after the exclusion of nulls. Thus, nulls introduced by the outer join will not be counted. This is intentional, as the addition of a null will falsely add 1 to the count. The right outer join is required, however, to also report those tests that have not been taken by any candidates.
As with a left join, the order of the tables in the FROM clause for a right join can impact the result. However, a right join can be converted to a left join by reversing the order of the tables in the FROM clause and by changing the RIGHT keyword to LEFT.
DB2 converts all right joins to left joins at bind time. The JOIN_TYPE column of the plan table will show an L for a right or a left join unless the join has been simplified to an inner. For this reason, you should avoid coding right joins, to minimize confusion when evaluating the access path.
Full Outer Join
The FULL OUTER JOIN operation produces an answer set that includes the matching values of both joined tables and those values not present in one or the other of the tables. Thus, it returns the inclusive inner join rows plus the exclusive left and right outer rows.
To show a FULL OUTER JOIN operation, we will create two sample tables: CITY and COUNTRY. They show the relationship between a city and a country (CITY table) and a country and a continent (COUNTRY table). The CITY table is designed to have countries that are not in the COUNTRY table. The COUNTRY table is also designed to have countries that are not in the city table. The contents of both tables are as follows:
TABLE:CITY CITY_NAME COUNTRY_NAME --------------- --------------- Sydney Australia London England Dublin Ireland Firenze Italy Milano Italy Mexico City Mexico Lima Peru Toronto Canada Vienna Austria Hannover Germany 10 record(s) selected. TABLE:COUNTRY COUNTRY_NAME CONTINENT --------------- -------------------------- Australia Australian Continent England European Continent Ireland European Continent Italy European Continent Mexico American Continent Austria European Continent South Africa African Continent Spain European Continent 8 record(s) selected.
We want to show all the countries, cities, and continents that are in the tables. Therefore, we are using a full outer join, as the following example shows:
SELECT CTRY.CONTINENT, CTRY.COUNTRY_NAME, CTY.COUNTRY_NAME, CTY.CITY_NAME FROM COUNTRY CTRY FULL OUTER JOIN CITY CTY ON CTY.COUNTRY_NAME=CTRY.COUNTRY_NAME ORDER BY CTRY.CONTINENT, CTY.COUNTRY_NAME, CTY.CITY_NAME CONTINENT COUNTRY_NAME COUNTRY_NAME CITY_NAME -------------- ------------- -------------- ------------ African Continent South Africa - - American Continent Mexico Mexico Mexico City Australian Continent Australia Australia Sydney European Continent Austria Austria Vienna European Continent England England London European Continent Ireland Ireland Dublin European Continent Italy Italy Firenze European Continent Italy Italy Milano European Continent Spain - - - - Canada Toronto - - Germany Hannover - - Peru Lima 12 record(s) selected.
The rows that have a null value were added by the outer join operation. The COUNTRY_NAME column is shown twice to see those countries present in the COUNTRY table but not in the CITY table, and vice versa.
As shown by the full join example, both tables can preserve rows and also supply nulls for unmatched rows. Because WHERE clause predicates applied to null-supplied tables may cause DB2 to perform simplification of the outer join, predicates that are to be applied before the join to limit the result set must be coded within nested table expressions.
DB2 can simplify a full join to either a left or an inner join if the WHERE clause predicates discard nulls from one or both tables, respectively. This is shown in the JOIN_TYPE column of the plan table as either an L for left, or blank, for an inner join rather than an F for a full join.
Consider the following example, which lists the information for only South Africa and Peru:
SELECT CTRY.CONTINENT, CTRY.COUNTRY_NAME, CTY.COUNTRY_NAME, CTY.CITY_NAME FROM (SELECT CONTINENT, COUNTRY_NAME FROM COUNTRY WHERE COUNTRY_NAME IN ('SOUTH AFRICA', 'PERU')) AS CTRY FULL OUTER JOIN (SELECT COUNTRY_NAME, CITY_NAME FROM CITY WHERE COUNTRY_NAME IN ('SOUTH AFRICA', 'PERU')) AS CTY ON CTY.COUNTRY_NAME=CTRY.COUNTRY_NAME ORDER BY CTRY.CONTINENT, CTY.COUNTRY_NAME, CTY.CITY_NAME CONTINENT COUNTRY_NAME COUNTRY_NAME CITY_NAME ----------------- ------------- ------------- --------- African Continent South Africa - - - - Peru Lima 2 record(s) selected.
Because the country name can return null from either table, it is common to code the COALESCE or VALUE function to ensure that the value is always supplied and is not null. The select list for this example would be as follows:
SELECT CTRY.CONTINENT, COALESCE(CTRY.COUNTRY_NAME,CTY.COUNTRY_NAME) , CTY.CITY_NAME
COALESCE returns the first list element that is not null.
Combining Outer Joins
To this point, we have discussed each outer join operation separately. Now, we will show a more complex example: combining two outer joins in a single query. Let's display all the candidates and all the tests with their respective scores. To create this query, we need two outer joins. The first outer join will obtain all candidates and their scores, including candidates who did not schedule or take any tests. The second outer join will retrieve all the tests present in the TEST table, even if no candidate scheduled or took those tests:
SELECT C.CID, T.NAME, SCORE FROM (CANDIDATE C LEFT OUTER JOIN TEST_TAKEN TT ON TT.CID=C.CID) FULL OUTER JOIN TEST T ON TT.NUMBER = T.NUMBER ORDER BY C.CID
The first outer join is enclosed in parentheses, which are used for readability and to indicate that the left outer join will be resolved first. In reality, the join sequence is determined by the precedence of the join predicates. For example, because the full join in this example joins to a column from the null-supplied table of the previous left join, the left join must precede the full join.
When the join sequence is not dictated by the join predicates, DB2 will evaluate the most cost-effective join sequence.
The left outer join gathers all the candidate IDs. We need only a left outer join here. Because of referential integrity constraints, table TEST_TAKEN can have candidates present only in the CANDIDATE table.
The second part is a full outer join. With this outer join, we take all the tests taken by the candidates and the result of the left outer join, and we join them with all the tests in the TEST table. We need a full outer join this time, even though the TEST_TAKEN table can have only test numbers that are present in the TEST table; the left table of this outer join, which is the result of the first join, may include null values as a test number. The TEST table does not have null values, so we need to use a full outer join. Even if the TEST table did contain nulls, a null will never equal another null unless the predicates of the form IS NULL are included in the join criteria. The results are as follows:
CID NAME SCORE --- ------------------------------ ----- 111 DB2 Fundamentals 65 111 DB2 Administration 73 111 DB2 Application Development 67 222 DB2 Fundamentals 55 222 DB2 Application Development 53 222 DB2 Application Development 75 333 DB2 Fundamentals 82 333 DB2 Administration - 333 DB2 Application Development 92 444 - - 555 - - 666 - - 777 - - 888 - - 14 record(s) selected.
Care must be taken when joining the result of an outer join to subsequent tables. If the join predicate refers to columns from a null-supplied table, the join predicate must also cater for the occurrence of nulls. Where possible, code the join predicate from the preserved row table; if the preceding join was a full join, use the COALESCE function to ensure that the join predicate cannot be null.
Joins versus Denormalization
One issue that needs to be examined when discussing joins is that of denormalization, the process of pulling normalized tables back together into a single table to avoid joins. This is often done in an effort to minimize the joins and to improve performance. Because this is generally applied to tables in a one-to-many relationship, it results in multiple occurrences of the same data, introducing update anomalies, or a single piece of data occurring multiple times. When an attempt to update the data is made, all occurrences must be found, and there is no fixed way to determine the number or location of the multiple occurrences. Some denormalization in some environments, such as read-only data warehouses, may be legitimate, but each situation must be evaluated carefully. For additional information about normalization/denormalization and update anomalies, refer to Chapter 4.