Complex SQL Guidelines


The preceding section provided guidelines for simple SQL SELECT statements. These statements retrieve rows from a single table only. Complex SQL can use a single SQL SELECT statement to retrieve rows from different tables. There are five basic categories of complex SQL statements, namely

  • Joins

  • Subqueries

  • Table Expressions (also known as in-line views)

  • Unions

  • Grouping

Other forms of complex SQL, such as common table expressions and recursion, are covered later in this chapter.

Before delving into each of the five basic categories, let's first discuss a general approach to writing complex SQL. It is a good practice to build complex SQL in steps. Start with simple SQL, test it for accuracy, and slowly add any additional elements required to return the proper results. For example, if you need to access three tables, do not start by trying to build a single SQL statement that references all three tables. Start with one table and code the local predicates for that table. Test it for accuracy. When that works, add any grouping, sorting, and functions that might be needed. Test again. Only when you are retrieving the proper data from the first table should you add an additional table and its join predicates to the statement. Repeating this process for each table allows you to slowly build complex SQL instead of just jumping in and starting off complex, which can be confusing and error-prone.

Retrieving Data From Multiple Tables

There are three basic ways to retrieve data from multiple tables in a single DB2 query: joins, subqueries, and table expressions.

A join references multiple tables in the FROM clause of a SELECT (or uses the JOIN keyword to combine tables). When joining tables, columns from any table in the join can be returned in the result set (simply by specifying the column names in the SELECT -list of the query).

A subquery is when one query references the results of another query. With subqueries, only those columns in the outermost table in the statement can be returned in the result set.

Table expressions are SQL statements that are used in place of a table in another SQL statement. As with joins, columns from any table in the statement can be returned in the result set.

Truthfully, there are four ways to access data from multiple tables in a single query if you include unions . A UNION , however, is fundamentally different from a join, subquery, or table expression because each SQL statement that is unioned returns a portion of the results set.

The following guidelines offer advice for complex SQL using joins, subqueries, table expressions, and unions.

UNION Versus UNION ALL

The UNION operator always results in a sort . When the UNION operator connects two SELECT statements, both SELECT statements are issued, the rows are sorted, and all duplicates are eliminated. If you want to avoid duplicates, use the UNION operator.

The UNION ALL operator, by contrast, does not invoke a sort. The SELECT statements connected by UNION ALL are executed, and all rows from the first SELECT statement are appended to all rows from the second SELECT statement. Duplicate rows might exist. Use UNION ALL when duplicate rows are required or, at least, are not a problem. Also use UNION ALL when you know that the SELECT statements will not return duplicates.
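For example, consider a sketch that combines department codes from the sample tables (a combined list of departments that appear as an employee's work department or as the department responsible for a project). The UNION formulation sorts and removes duplicate department codes; the UNION ALL formulation simply appends the rows from the second SELECT, duplicates and all:

 SELECT WORKDEPT FROM DSN8810.EMP
 UNION
 SELECT DEPTNO   FROM DSN8810.PROJ;

 SELECT WORKDEPT FROM DSN8810.EMP
 UNION ALL
 SELECT DEPTNO   FROM DSN8810.PROJ;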

Use NOT EXISTS Instead of NOT IN

When you code a subquery using negation logic, try to use NOT EXISTS instead of NOT IN to increase the efficiency of your SQL statement. When you use NOT EXISTS , DB2 must verify only nonexistence. Doing so can reduce processing time significantly. With the NOT IN predicate, DB2 must materialize and sort the complete subquery results set.
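For example, to find employees who are not responsible for any project, a NOT IN formulation such as the following sketch against the sample tables

 SELECT EMPNO, LASTNAME
 FROM   DSN8810.EMP
 WHERE  EMPNO NOT IN
        (SELECT RESPEMP
         FROM   DSN8810.PROJ);

can usually be recoded with NOT EXISTS as

 SELECT EMPNO, LASTNAME
 FROM   DSN8810.EMP E
 WHERE  NOT EXISTS
        (SELECT 1
         FROM   DSN8810.PROJ P
         WHERE  P.RESPEMP = E.EMPNO);

Keep in mind that the two formulations are equivalent only when the subquery column cannot return nulls; a null in the NOT IN subquery causes no rows to qualify.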

Order the Elements in Your IN Lists

The order of elements in a SQL IN list can impact performance when an index is not used to process the IN list. DB2 will search the list of elements from left to right until a match is found or the end of the list is reached. For this reason, code the most commonly occurring values first in your IN lists. Doing so will cause DB2 to match as early as possible most of the time, thereby improving performance. For example, consider the following SQL statement:

 

 SELECT  EMPNO, WORKDEPT, EDLEVEL, SALARY
 FROM    DSN8810.EMP
 WHERE   WORKDEPT IN ('A00', 'A01', 'E21');

This SQL statement demonstrates the natural tendency to order lists alphabetically or numerically. The statement is correctly coded only if A00 occurs more often than A01, which occurs more often than E21. But if A01 was most common, followed by E21 and then A00, the statement would be better coded (performance-wise) as

 

 SELECT  EMPNO, WORKDEPT, EDLEVEL, SALARY
 FROM    DSN8810.EMP
 WHERE   WORKDEPT IN ('A01', 'E21', 'A00');

Remove Duplicates from Your IN Lists

When using the IN predicate with a list of values, DB2 sorts the IN-list into ascending sequence and removes duplicates if the column specified is indexed. The IN-list values are then used to probe the index to find the matching rows. However, if there is no index for the column, DB2 will not sort the IN-list and any duplicates will remain. Upon retrieval of a row, the column value is used to search the IN-list (duplicates and all).

Therefore, it makes sense to order the elements of the IN-list, code your most restrictive predicates first, and never code duplicates in the IN-list.

Be Aware of Predicate Transitive Closure Rules

Predicate transitive closure refers to the capability of the DB2 optimizer to use the rule of transitivity (if A=B and B=C, then A=C) to determine the most efficient access path for queries. The optimizer did not always have the capability to use the rule of transitivity.

In older releases of DB2, you produced a more efficient query by providing redundant information in the WHERE clause of a join statement, as in this example

 

 SELECT  A.COL1, A.COL2, B.COL1
 FROM    TABLEA A,
         TABLEB B
 WHERE   A.COL1 = B.COL1
 AND     A.COL1 = :HOSTVAR;

This query could process more efficiently in pre-V2.1 releases of DB2 by coding a redundant predicate, as follows:

 

 SELECT  A.COL1, A.COL2, B.COL1
 FROM    TABLEA A,
         TABLEB B
 WHERE   A.COL1 = B.COL1
 AND     A.COL1 = :HOSTVAR
 AND     B.COL1 = :HOSTVAR;

The need to code redundant predicates for performance no longer exists for equality and range predicates. However, predicate transitive closure is not applied with LIKE or IN predicates. Consider this example:

 

 SELECT  A.COL1, A.COL2, B.COL1
 FROM    TABLEA  A,
         TABLEB  B
 WHERE   A.COL1 = B.COL1
 AND     A.COL1 LIKE 'ABC%';

The preceding can be more efficiently coded as follows:

 

 SELECT  A.COL1, A.COL2, B.COL1
 FROM    TABLEA  A,
         TABLEB  B
 WHERE   A.COL1 = B.COL1
 AND     A.COL1 LIKE 'ABC%'
 AND     B.COL1 LIKE 'ABC%';

Unless you're using an IN or LIKE clause, or you are running on an ancient version of DB2 (pre-V2.3), do not code redundant predicates; doing so is unnecessary and might cause the query to be less efficient.

Use SQL to Determine "Top Ten"

Application developers frequently wish to retrieve a limited number of qualifying rows from a table. For example, maybe you need to list the ten highest selling items from inventory or the top ten most expensive products (that is, the products with the highest price tags). There are several ways to accomplish this prior to DB2 V7 using SQL, but they are not necessarily efficient.

The basic question that arises frequently is how best to return only a portion of the actual result set for a query. This situation most frequently manifests itself in the "Top Ten" problem (for example, returning the top-ten highest salaries in the company).

The first reaction is to simply use the WHERE clause to eliminate non-qualifying rows. But this is simplistic, and often is not sufficient to produce the results desired in an optimal manner. What if the program only requires that the top ten results be returned? This can be a somewhat difficult request to formulate using SQL alone. Consider, for example, an application that needs to retrieve only the ten highest-paid employees from the EMP sample table. You could simply issue a SQL request that retrieves all of the employees in order by salary, but only use the first ten retrieved. That is easy, for example

 

 SELECT   SALARY, EMPNO, LASTNAME
 FROM     DSN8810.EMP
 ORDER BY SALARY DESC;

It is imperative that you specify the ORDER BY clause with the DESC keyword. This sorts the results into descending order, instead of the default, which is ascending. Without the DESC keyword, the "top ten" would be at the very end of the results set, not at the beginning.

But this "solution" does not really satisfy the requirement ”retrieving only the top ten. It merely sorts the results into descending sequence. So the results would still be all employees in the table, but in the correct order so you can view the "top ten" salaries very easily.

The ideal solution should return only the ten employees with the highest salaries and not merely a sorted list of all employees. Consider the following SQL:

 

 SELECT SALARY, EMPNO, LASTNAME
 FROM   DSN8810.EMP  E1
 WHERE  10 > (SELECT COUNT(*)
              FROM   DSN8810.EMP E2
              WHERE  E1.SALARY < E2.SALARY);

The ten highest salaries are returned. You can alter the actual number by changing the literal value 10 to whatever number you want. This particular SQL does not perform very well, but it will work with all versions of DB2.


As of DB2 Version 7, the best way to perform this function is to use a query with the FETCH FIRST n ROWS ONLY clause. So, the query becomes:


 

 SELECT SALARY, EMPNO, LASTNAME
 FROM   DSN8810.EMP
 ORDER BY SALARY DESC
 FETCH FIRST 10 ROWS ONLY;

The ORDER BY clause will sort the data into descending order by SALARY values. The FETCH FIRST 10 ROWS ONLY clause will limit your output to 10 rows only. If duplicates are possible you can eliminate them by adding a DISTINCT clause after the SELECT and before the columns being selected.
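For example, a sketch that lists the ten highest distinct salary values; EMPNO and LASTNAME are dropped from the SELECT-list so that DISTINCT operates on SALARY alone:

 SELECT DISTINCT SALARY
 FROM   DSN8810.EMP
 ORDER BY SALARY DESC
 FETCH FIRST 10 ROWS ONLY;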

Use FETCH FIRST n ROWS ONLY to Limit the Size of SQL Result Sets

The FETCH FIRST n ROWS ONLY clause can limit the number of qualifying rows for any SELECT statement. You can code FETCH FIRST n ROWS , which will limit the number of rows fetched and returned by a SELECT statement. So, for example, to limit your results to 371 rows, you could code

 

 SELECT SALARY, EMPNO, LASTNAME
 FROM   DSN8810.EMP
 FETCH FIRST 371 ROWS ONLY;

The SQLCODE will be set to +100 when you try to FETCH row number 372, or at the end of the result set if there are fewer than 371 rows.

CAUTION

The FETCH FIRST n ROWS ONLY clause is not the same as the OPTIMIZE FOR n ROWS clause. The FETCH FIRST clause will cause your cursor to stop returning rows after n rows have been returned; the OPTIMIZE FOR clause will not (it just gives guidance to the DB2 optimizer for access path formulation).

However, specifying FETCH FIRST causes DB2 to use an implicit OPTIMIZE FOR clause with the same number for n . So, specifying FETCH FIRST 12 ROWS ONLY causes DB2 to use an implicit OPTIMIZE FOR 12 ROWS ; you do not need to code the OPTIMIZE FOR clause.


Keep in mind that this discussion differs from the previous guideline because we do not care about order (or the "top" so many rows), just limiting the number of rows to some preset number. That is why an ORDER BY is not required.

CAUTION

If you do not provide an ORDER BY clause on a query that uses the FETCH FIRST n ROWS ONLY clause, your results will be unpredictable. Remember, there is no inherent order to a DB2 table, so the order in which rows are returned is based on the access path chosen by the DB2 optimizer. So, although the FETCH FIRST n ROWS ONLY clause will limit your result set to n , you are not guaranteed to retrieve the same n rows every time.


Code Appropriate Existence Checking SQL

There are times when a program just needs to know that some given data exists and does not need to actually retrieve and use that data. For these situations you will need to develop the most efficient SQL possible that just checks if the specific data exists. But what is the best way to accomplish an existence check?

Prior to DB2 V7, the best way to check for existence was to use a correlated query against the SYSDUMMY1 table. For example, to check for the existence of an employee with the last name of Jones in the EMP table, the SQL would look like this:

 

 SELECT 1
 FROM   SYSIBM.SYSDUMMY1 A
 WHERE  EXISTS (SELECT 1
                FROM   DSN8810.EMP B
                WHERE  LASTNAME = 'JONES'
                AND    A.IBMREQD = A.IBMREQD);

Sometimes, though, we need more information. Perhaps we need to list all employees who are responsible for at least one project. That can be coded as follows:

 

 SELECT  EMPNO
 FROM    DSN8810.EMP   E
 WHERE   EXISTS
         (SELECT  1
          FROM    DSN8810.PROJ   P
          WHERE   P.RESPEMP = E.EMPNO);

First, notice that we just SELECT the constant 1 . Because the data does not need to be returned to the program, the SQL statement need not specify any columns in the SELECT -list. We simply check the SQLCODE . If the SQLCODE is zero, data exists; if not, the data does not exist.

If you do not use a correlated query with EXISTS , but instead simply issue the SELECT statement, performance can suffer as DB2 scans to find subsequent occurrences of the data, which might be many, especially for a common name such as Jones. Of course, if a unique index exists on the column(s) in question, then you can get by with the simple query.


Which brings us to DB2 V7. In this version IBM added a new clause called FETCH FIRST n ROWS ONLY . This solves our problem for existence checking and should be the new standard after you move to DB2 V7. Going back to our example, the SQL for existence checking now becomes:


 

 SELECT 1
 FROM   DSN8810.EMP
 WHERE  LASTNAME = 'JONES'
 FETCH FIRST 1 ROW ONLY;

We still do not specify columns in the SELECT -list, but we no longer need the correlated query. DB2 will stop after 1 row has been checked, which is the desired result.

Minimize the Number of Tables in a Join

Joining many tables in one query can adversely affect performance. Although the maximum number of tables that can be joined in a single SQL statement is 225, the practical limit is usually fewer.

CAUTION


Prior to DB2 V6, the limit for tables in a SQL statement was 15. DB2 V6 actually increased the limit for tables in a SQL statement from 15 to 225, but in a restricted manner: Each SQL statement can consist of up to 15 Query Blocks, each directly or indirectly identifying 15 base table references. The grand total number of tables for a single SQL statement is 225; however, no query block can exceed 15 base table references, whether direct or indirect.

DB2 V8 removes this restriction; each SQL statement can reference up to 225 tables in total. The limit has been raised to such a high number to accommodate ERP vendors such as Peoplesoft and SAP, whose applications were designed originally for other RDBMS packages, such as Oracle, that have higher limits than DB2. Just because the limit has been increased does not mean you should write queries that access such a large number of tables. The performance of such queries will likely be poor and difficult to manage.


However, setting an artificial limit on the standard number of tables per join is not a wise course of action. In some situations, avoiding large, complex joins in an online environment may be necessary. But the same statement might be completely acceptable in a batch job or as an ad hoc request.

The number of tables to be joined in any application should be based on the following:

  • The total number of rows participating in the join

  • The results you want to obtain from the query

  • The level of performance you want

  • The anticipated throughput of the application

  • The type of application (OLTP versus OLAP or DSS)

  • The environment in which the application will operate (online versus batch)

  • The availability you want (for example, 24x7)

In general, however, always eliminate unnecessary tables from your join statement.

Consider CASE Expressions to Optimize Counting

When you need to produce counts of rows in DB2 tables, consider using a CASE expression instead of using multiple SQL SELECT statements. For example, if you need to count the number of employees who earn different salary ranges, consider the following statement

 

 SELECT SUM(CASE WHEN SALARY BETWEEN 0 AND 20000
                 THEN 1 ELSE 0 END) AS UPTO20
       ,SUM(CASE WHEN SALARY BETWEEN 20001 AND 50000
                 THEN 1 ELSE 0 END) AS FROM20TO50
       ,SUM(CASE WHEN SALARY BETWEEN 50001 AND 80000
                 THEN 1 ELSE 0 END) AS FROM50TO80
       ,SUM(CASE WHEN SALARY > 80000
                 THEN 1 ELSE 0 END) AS OVER80
       ,SUM(CASE WHEN SALARY < 0
                 THEN 1 ELSE 0 END) AS NEGATIVESAL
 FROM DSN8810.EMP;

This SELECT statement efficiently scans through the data and produces a sum for each range that is defined. The multiple CASE expressions return either a 0 or a 1. The SUM function simply adds up these values, and you get totals for employees within each range.

This outperforms multiple SELECT statements using COUNT(*) because DB2 can keep running totals as it passes once through the data using the CASE expressions.

CAUTION

The previous example shows the CASE expressions returning either a 0 or a 1. If you did not include the ELSE portion of each CASE expression, DB2 would return either a 1 or a NULL. This can be more efficient than returning 1 or 0 for very large sets of data, because DB2 must actually add the zeroes but can ignore the NULLs.

However, be careful: if there are any categories to which no rows apply, DB2 will SUM a set of NULLs, which returns NULL, not 0, as the result.


Consider Denormalizing to Reduce Joins

To minimize the need for joins, consider denormalization. Remember, however, that denormalization usually implies redundant data, dual updating, and extra DASD usage. Normalization optimizes data modification at the expense of data access; denormalization optimizes data access at the expense of data modification.

Whenever you denormalize be sure to document the specific reasons for the denormalization, as well as listing each specific change made to the database structure. Denormalization should be undertaken only as a last resort. You can find additional denormalization assistance in Chapter 5.

Reduce the Number of Rows to Be Joined

The number of rows participating in a join is the single most important determinant in predicting the response time of a join. To reduce join response time, reduce the number of rows to be joined in the join's predicates.

For example, when you try to determine which males in all departments reporting to department D01 make a salary of $40,000 or more, you can code the predicates for both SEX and SALARY as follows:

 

 SELECT  E.LASTNAME, E.FIRSTNME
 FROM    DSN8810.DEPT  D,
         DSN8810.EMP   E
 WHERE   D.ADMRDEPT = 'D01'
 AND     D.DEPTNO = E.WORKDEPT
 AND     E.SEX = 'M'
 AND     E.SALARY >= 40000.00;

The predicates on the SEX and SALARY columns can be used to reduce the amount of data that needs to be joined. If you fail to code either of the last two predicates, deciding instead to scan the results and pull out the information you need, more rows qualify for the join and the join is less efficient.

Join Using SQL Instead of Program Logic

Coding a join using SQL instead of COBOL or another high-level language is almost always more efficient. The DB2 optimizer has a vast array of tools in its arsenal to optimize the performance of SQL queries. Usually, a programmer will fail to consider the same number of possibilities as DB2.

If a specific SQL join is causing high overhead, consider the tuning options outlined in this chapter before deciding to implement the join using a program. To further emphasize the point, consider the results of a recent test. A three table join using GROUP BY and the COUNT(*) function similar to the one below was run:

 

 SELECT   A.EMPNO, LASTNAME, COUNT(*)
 FROM     DSN8810.EMP         E,
          DSN8810.EMPPROJACT  A,
          DSN8810.PROJ        P
 WHERE    E.EMPNO = A.EMPNO
 AND      P.PROJNAME IN ('PROJECT1', 'PROJECT7', 'PROJECT9')
 AND      A.PROJNO = P.PROJNO
 AND      A.EMPTIME > 40.0
 GROUP BY A.EMPNO, LASTNAME;

Additionally, an equivalent program was coded using three cursors (one for each join), internal sorting (using Syncsort, DFSORT , or a similar utility), and programmatic counting. Performance reports were run on both, and the SQL statement outperformed the equivalent application program by several orders of magnitude in terms of both elapsed time and CPU time.

Programming your own application joins should always be a very last resort and should not be considered unless you have exhausted all other tuning techniques. In practice, application joins are almost never needed for performance reasons.

Use Joins Instead of Subqueries

A join can be more efficient than a correlated subquery or a subquery using IN . For example, this query joins two tables:

 

 SELECT  EMPNO, LASTNAME
 FROM    DSN8810.EMP,
         DSN8810.PROJ
 WHERE   WORKDEPT = DEPTNO
 AND     EMPNO = RESPEMP;

The preceding example is usually more efficient than the following query, which is formulated as a correlated subquery accessing the same two tables:

 

 SELECT  EMPNO, LASTNAME
 FROM    DSN8810.EMP X
 WHERE   WORKDEPT IN
         (SELECT  DEPTNO
          FROM    DSN8810.PROJ
          WHERE   RESPEMP = X.EMPNO);

The preceding two queries demonstrate how to turn a correlated subquery into a join. You can translate non-correlated subqueries into joins in the same manner. For example, the join

 

 SELECT  EMPNO, LASTNAME
 FROM    DSN8810.EMP,
         DSN8810.DEPT
 WHERE   WORKDEPT = DEPTNO
 AND     DEPTNAME = 'PLANNING';

is usually more efficient than the subquery

 

 SELECT  EMPNO, LASTNAME
 FROM    DSN8810.EMP
 WHERE   WORKDEPT IN
         (SELECT  DEPTNO
          FROM    DSN8810.DEPT
          WHERE   DEPTNAME = 'PLANNING');

Note that these two queries do not necessarily return the same results. If DEPTNO is not unique, the first SELECT statement could return more rows than the second SELECT statement, and some of the values for EMPNO could appear more than once in the results table.

Be aware, however, that with each new release of DB2, subqueries (both correlated and non-correlated) are becoming more and more efficient. Yet, performance concerns aside, standardizing on joins instead of subqueries (when possible) can make development and maintenance easier because fewer query formulations need to be considered during implementation, or modified during maintenance cycles. Additionally, in certain cases DB2 may transform subqueries into joins during optimization.

Join on Clustered Columns

When you join large tables, use clustered columns in the join criteria when possible. This way, you can reduce the need for intermediate sorts. Note that doing so might require clustering of the parent table by primary key and the child table by foreign key.

Join on Indexed Columns

The efficiency of your program improves when tables are joined based on indexed columns rather than on non-indexed ones. To increase the performance of joins, consider creating indexes specifically for the predicates being joined.
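For example, if your queries frequently join projects to the employees responsible for them on the RESPEMP column, an index on that join column might help; the index name here is purely hypothetical:

 CREATE INDEX XPROJ_RESPEMP
     ON DSN8810.PROJ (RESPEMP);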

Use Caution When Specifying ORDER BY with a Join

When the results of a join must be sorted, limiting the ORDER BY to columns of a single table can cause DB2 to avoid a sort. Whenever you specify columns from multiple tables in the ORDER BY clause of a join statement, DB2 invokes a sort.
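For example, in the following sketch the ORDER BY names columns of the EMP table only, which gives DB2 a chance to avoid the sort; adding D.DEPTNAME to the ORDER BY would force one:

 SELECT   E.LASTNAME, E.FIRSTNME, D.DEPTNAME
 FROM     DSN8810.EMP  E,
          DSN8810.DEPT D
 WHERE    E.WORKDEPT = D.DEPTNO
 ORDER BY E.LASTNAME, E.FIRSTNME;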


In DB2 V8, you can alleviate the pain of the multi-table ORDER BY by using a Materialized Query Table, which is covered in detail in Chapter 45, "Data Warehousing with DB2."


Avoid Cartesian Products

Never use a join statement without a predicate. A join without a predicate generates a results table in which every row from the first table is joined with every row from the other table: a Cartesian product. For example, joining a 1,000-row table with another 1,000-row table, without a predicate, results in a table with 1,000,000 rows. No additional information is provided by this join, so a lot of machine resources are wasted. [1]

NOTE

Although you should never specify a Cartesian product in your SQL queries, the DB2 optimizer may decide to use a Cartesian product for a portion, or portions, of a join. For example, when a star join is being used, the DB2 optimizer will choose to implement Cartesian products for portions of the join. This may happen in data warehousing queries and other ad hoc queries where multiple dimension tables are joined to a very large fact table. Because the fact table is usually many times larger than the dimension tables, processing the fact table only once against the Cartesian product of the dimension tables can enhance query performance. As many as six dimension tables (five prior to DB2 V6) can be joined as a Cartesian product for a star join in DB2.

For more information on star joins consult Chapter 45.


Provide Adequate Search Criteria

When possible, provide additional search criteria in the WHERE clause for every table in a join. These criteria are in addition to the join criteria, which are mandatory to avoid Cartesian products. This information provides DB2 with the best opportunity for ranking the tables to be joined in the most efficient manner (that is, for reducing the size of intermediate results tables). In general, the more information you provide to DB2 for a query, the better the chances that the query will perform adequately.

Consider Using Explicit INNER JOINs

Instead of specifying joins by using a comma-delimited list of tables in the FROM clause, use INNER JOIN with the ON clause. Explicit INNER JOIN syntax might help when you're training new programmers in SQL because it provides a join keyword. Likewise, the join predicates must be isolated in the ON clause when you're using an explicit INNER JOIN . This way, reading, tuning, and maintaining the SQL code are easier.
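For example, the PLANNING-department join shown earlier in this section can be sketched with explicit INNER JOIN syntax as follows:

 SELECT  E.EMPNO, E.LASTNAME
 FROM    DSN8810.EMP E
         INNER JOIN DSN8810.DEPT D
         ON E.WORKDEPT = D.DEPTNO
 WHERE   D.DEPTNAME = 'PLANNING';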

Use Explicit OUTER JOINs

Avoid coding an outer join in the old style, which required a simple SELECT , a UNION , and a correlated subselect.

NOTE

Prior to DB2 V4, this was the only way to code an outer join.


Instead, use the SQL outer join syntax, which is easier to code, easier to maintain, and more efficient to execute. An explicit OUTER JOIN uses one pass against the tables and as such usually outperforms an outer join using UNION or UNION ALL. Using explicit OUTER JOIN statements reduces the number of bugs and speeds application development time due solely to the significant reduction in lines of code required. Furthermore, as IBM improves the optimizer over time, techniques designed to make outer joins more efficient will most likely focus only on the new, explicit outer join syntax and not on the old, complex SQL formulation.
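To illustrate the difference, the following is one common sketch of the old-style formulation of a left outer join of employees to departments (the '(no department)' literal is just an illustrative placeholder):

 SELECT  E.EMPNO, E.LASTNAME, D.DEPTNAME
 FROM    DSN8810.EMP E,
         DSN8810.DEPT D
 WHERE   E.WORKDEPT = D.DEPTNO
 UNION ALL
 SELECT  E.EMPNO, E.LASTNAME, '(no department)'
 FROM    DSN8810.EMP E
 WHERE   NOT EXISTS
         (SELECT 1
          FROM   DSN8810.DEPT D
          WHERE  D.DEPTNO = E.WORKDEPT);

The equivalent explicit syntax is a single statement, with DEPTNAME simply returned as NULL for unmatched employees:

 SELECT  E.EMPNO, E.LASTNAME, D.DEPTNAME
 FROM    DSN8810.EMP E
         LEFT OUTER JOIN DSN8810.DEPT D
         ON E.WORKDEPT = D.DEPTNO;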

Exception Reporting

You can use the bottom half of the old style of outer join to report just the exceptions when you don't need a full-blown outer join. Consider this example:

 

 SELECT  D.DEPTNO, D.DEPTNAME, D.MGRNO, '* No Mgr Name *'
 FROM    DSN8810.DEPT  D
 WHERE   NOT EXISTS
         (SELECT  1
          FROM    DSN8810.EMP  E
          WHERE   D.MGRNO = E.EMPNO)
 ORDER BY 1;

This SQL returns only the departments without a manager name.

Favor LEFT Over RIGHT OUTER JOIN s

Favor coding LEFT OUTER JOIN over RIGHT OUTER JOIN . The choice is truly arbitrary, but the manner in which DB2 shows EXPLAIN information makes left outer joins easier to tune. EXPLAIN populates the JOIN_TYPE column to describe the outer join method ( FULL , RIGHT , or LEFT ). The column contains the value F for a FULL OUTER JOIN , L for a LEFT OUTER JOIN or RIGHT OUTER JOIN , or a blank for an INNER JOIN or no join. DB2 always converts right outer joins to left outer joins, so there is no R value for JOIN_TYPE .

The right outer join syntax is only available to make the query easier to read and understand. Because deciphering the PLAN_TABLE data is more difficult for a RIGHT OUTER JOIN than for a LEFT OUTER JOIN , favor coding left outer joins whenever possible.
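For example, these two sketches are logically equivalent because both preserve all EMP rows; DB2 converts the second into the first internally, so the first is simpler to match up with its PLAN_TABLE rows:

 SELECT  E.EMPNO, D.DEPTNAME
 FROM    DSN8810.EMP E
         LEFT OUTER JOIN DSN8810.DEPT D
         ON E.WORKDEPT = D.DEPTNO;

 SELECT  E.EMPNO, D.DEPTNAME
 FROM    DSN8810.DEPT D
         RIGHT OUTER JOIN DSN8810.EMP E
         ON E.WORKDEPT = D.DEPTNO;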

Use COALESCE with FULL OUTER JOIN s

At times, you might need the COALESCE function to avoid nulls in the result columns of OUTER JOIN statements. To understand how COALESCE can be useful in an outer join, consider the following query:

 

 SELECT  EMP.EMPNO, EMP.WORKDEPT, DEPT.DEPTNAME
 FROM    DSN8810.EMP EMP
 FULL OUTER JOIN DSN8810.DEPT DEPT
 ON      EMP.WORKDEPT = DEPT.DEPTNO;

A portion of the results for this query looks like the following:

 EMPNO    WORKDEPT   DEPTNAME
 200330   E21        SOFTWARE SUPPORT
 200340   E21        SOFTWARE SUPPORT
 ------   ---        DEVELOPMENT CENTER


Note that the department code for DEVELOPMENT CENTER is not displayed, even though you know by simple browsing of the DEPT table that the code is D01 . The value is not returned because the query selects the WORKDEPT column from EMP , not the DEPTNO column from DEPT . You can rectify this situation by using the COALESCE function. The COALESCE function notifies DB2 to look for a value in both of the listed columns, one from each table in the outer join (in this case, EMP and DEPT ). If a value is found in either table, it can be returned in the result set. Consider the following example:

 

 SELECT  EMP.EMPNO,
         COALESCE(EMP.WORKDEPT, DEPT.DEPTNO) AS DEPTNUM,
         DEPT.DEPTNAME
 FROM    DSN8810.EMP EMP
 FULL OUTER JOIN DSN8810.DEPT DEPT
 ON      EMP.WORKDEPT = DEPT.DEPTNO;

The results are changed as follows:

 EMPNO    DEPTNUM   DEPTNAME
 200330   E21       SOFTWARE SUPPORT
 200340   E21       SOFTWARE SUPPORT
 ------   D01       DEVELOPMENT CENTER


In this case, the last row of the result set contains the correct department code. The COALESCE function determines that the department code is stored in the DEPT.DEPTNO column and returns that value instead of the null because there is no corresponding WORKDEPT number.

NOTE

The VALUE function is a synonym for the COALESCE function. Favor using COALESCE, though, as it adheres to the ANSI standard.


OUTER JOIN s and Inline Views

Be aware that you might need to combine inline views (also known as nested table expressions) with the COALESCE function to return the appropriate results. Consider adding a local predicate to the preceding example:

 

 SELECT  EMP.EMPNO,
         COALESCE(EMP.WORKDEPT, DEPT.DEPTNO) AS DEPTNUM,
         DEPT.DEPTNAME
 FROM    DSN8810.EMP EMP
 FULL OUTER JOIN DSN8810.DEPT DEPT
 ON      EMP.WORKDEPT = DEPT.DEPTNO
 WHERE   EMP.WORKDEPT = 'D01';

In this case, no rows are returned. The 'D01' department number is aligned with the "DEVELOPMENT CENTER" in the DEPT table as DEPTNO , not in the EMP table as WORKDEPT . The solution is to use an inline view as follows:

 

 SELECT  EMPNO, DEPTNUM, DEPTNAME
 FROM    (SELECT EMPNO,
                 COALESCE(EMP.WORKDEPT, DEPT.DEPTNO) AS DEPTNUM,
                 DEPT.DEPTNAME
          FROM   DSN8810.EMP EMP
          FULL OUTER JOIN DSN8810.DEPT DEPT
          ON     EMP.WORKDEPT = DEPT.DEPTNO) AS OJ_EMP_DEPT
 WHERE   DEPTNUM = 'D01';

This example finds the row for 'D01' because COALESCE is applied to the inline view before the local predicate is applied.

OUTER JOIN s and Predicate Placement

Prior to DB2 V6 inline views were required to achieve optimal outer join performance. This restriction no longer exists. Consider the following OUTER JOIN with a local predicate:

 

 SELECT  EMP.EMPNO, EMP.LASTNAME, DEPT.DEPTNAME
 FROM    DSN8810.EMP EMP
 LEFT OUTER JOIN DSN8810.DEPT DEPT
 ON      EMP.WORKDEPT = DEPT.DEPTNO
 WHERE   EMP.SALARY > 50000.00;

Running under DB2 V6 or later, this query will execute quite efficiently. However, in past releases, if thousands or millions of rows were filtered out by additional predicates, this method of coding outer joins performed quite poorly because the outer join was performed first, before any rows were filtered out. To resolve this problem in V5 and earlier DB2 subsystems, ensure that the local predicate is applied before the outer join takes place, using an inline view as follows:

 

 SELECT  E.EMPNO, E.LASTNAME, DEPT.DEPTNAME
 FROM    (SELECT EMPNO, LASTNAME, WORKDEPT
          FROM   DSN8810.EMP
          WHERE  SALARY > 50000.00) AS E
         LEFT OUTER JOIN DSN8810.DEPT DEPT
         ON E.WORKDEPT = DEPT.DEPTNO;

By moving the local predicate into the FROM clause as an inline view, the local predicate is evaluated before the outer join, thereby reducing the number of rows to be joined and enhancing performance.

If additional local predicates are required, you can specify additional inline views. If you want to return rows only for which a domestic resource has responsibility, you can change the sample query as shown:

 

 SELECT  E.EMPNO, E.LASTNAME, D.DEPTNAME
 FROM    (SELECT EMPNO, LASTNAME, WORKDEPT
          FROM   EMP
          WHERE  SALARY > 50000.00) AS E
         LEFT OUTER JOIN
         (SELECT DEPTNO, DEPTNAME
          FROM   DEPT
          WHERE  MGRNO IS NOT NULL) AS D
         ON E.WORKDEPT = D.DEPTNO;

CAUTION

To reiterate, this tuning technique is applicable only to DB2 V4 and V5. Do not code outer joins with inline views in the manner described for DB2 V6 and later because the query will be more difficult to code, explain, and maintain.


Limit the Columns Grouped

When you use a GROUP BY clause to achieve data aggregation, specify only the columns that need to be grouped. Do not provide extraneous columns in the SELECT list and GROUP BY list. To accomplish data grouping, DB2 must sort the retrieved data before displaying it. The more columns that need to be sorted, the more work DB2 must do, and the poorer the performance of the SQL statement.
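For example, when summarizing salaries by department, a sketch such as the following groups on the department code alone rather than dragging extra descriptive columns into the GROUP BY; the MIN trick shown later in this section can supply descriptive columns if they are needed:

 SELECT   WORKDEPT, SUM(SALARY) AS TOTAL_SALARY
 FROM     DSN8810.EMP
 GROUP BY WORKDEPT;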

GROUP BY and ORDER BY Are Not Equivalent

Although the GROUP BY clause typically sorts data to aggregate, the results are not necessarily ordered by the GROUP BY . If you want to ensure that the results are displayed in a specific order, you must use the ORDER BY clause.

When you specify both GROUP BY and ORDER BY , and the ordering requirements are compatible, DB2 can avoid the redundant sort.

ORDER BY and Columns Selected

As of DB2 V6, and via a retrofit APAR to V5, it became possible to ORDER BY columns not specified in the SELECT -list. However, you cannot eliminate columns from the SELECT -list if they are specified in an ORDER BY if you also are using a column function, UNION , UNION ALL , GROUP BY , or DISTINCT .


Keep in mind that as of DB2 V7, you no longer need to SELECT what you GROUP BY .
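For example, the following sketch is valid as of V7 even though the grouping column does not appear in the SELECT -list:

 SELECT   AVG(SALARY)
 FROM     DSN8810.EMP
 GROUP BY WORKDEPT;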


Use Table Expressions to Your Advantage

Table Expressions, sometimes called inline views , allow the FROM clause of a SELECT statement to contain another SELECT statement. You can write any table expression in the FROM clause.

Why would you want to use an inline view instead of simply creating an actual view prior to issuing the SELECT statement? The first potential benefit is that an inline view expression can be easier to understand. Instead of attempting to query the DB2 Catalog to extract the SQL definition of a view, the SQL is clearly displayed in the body of the SELECT statement. Second, inline views do not require object management because no DB2 object is created. Finally, inline views provide direct SQL support for certain complex queries that required a view prior to DB2 V4.

Inline views are useful, for example, when detail and aggregated information from a single table must be returned by a single query. A prime example is reporting on column length information from the DB2 Catalog. Consider a request to provide column details for each table, and on each row also report the maximum, minimum, and average column lengths for that table. One solution is to create a view. Consider the COL_LENGTH view based on SYSIBM.SYSCOLUMNS , as shown here:

 

 CREATE VIEW COL_LENGTH
    (TABLE_NAME, MAX_LENGTH, MIN_LENGTH, AVG_LENGTH)
 AS SELECT   TBNAME, MAX(LENGTH), MIN(LENGTH), AVG(LENGTH)
    FROM     SYSIBM.SYSCOLUMNS
    GROUP BY TBNAME;

After the view is created, you can issue the following SELECT statement joining the view to the base table, thereby providing both detail and aggregate information on each report row:

 

 SELECT   TBNAME, NAME, COLNO, LENGTH,
          MAX_LENGTH, MIN_LENGTH, AVG_LENGTH
 FROM     SYSIBM.SYSCOLUMNS C,
          authid.COL_LENGTH V
 WHERE    C.TBNAME = V.TABLE_NAME
 ORDER BY 1, 3;

The solution using inline views is to skip the view-creation step and simply execute the following SQL statement:

 

 SELECT TBNAME, NAME, COLNO, LENGTH,
        MAX_LENGTH, MIN_LENGTH, AVG_LENGTH
 FROM   SYSIBM.SYSCOLUMNS C,
        (SELECT TBNAME AS TABLE_NAME,
                MAX(LENGTH) AS MAX_LENGTH,
                MIN(LENGTH) AS MIN_LENGTH,
                AVG(LENGTH) AS AVG_LENGTH
         FROM   SYSIBM.SYSCOLUMNS
         GROUP BY TBNAME) AS V
 WHERE  C.TBNAME = V.TABLE_NAME
 ORDER BY 1, 3;

The same result is returned in a single SQL statement, but without using a view. You must enclose inline view expressions in parentheses and must use a correlation name. You cannot refer to the correlation name for the inline view expression elsewhere in the same FROM clause, but you can use it outside the FROM clause (just like any other table or view name) as the qualifier of a column name.

Consider Table Expressions to Improve Performance

Table expressions are frequently overlooked as a potential solution for resolving problems using only SQL.

One practical use for table expressions is to force the optimizer to choose a specific processing order. For example, consider the following query:

 

 SELECT    D.DEPTNO,
           MIN(D.DEPTNAME) AS DEPT_NAME,
           MIN(D.LOCATION) AS DEPT_LOCATION,
           SUM(E.SALARY) AS TOTAL_SALARY
 FROM      DSN8810.DEPT D,
           DSN8810.EMP  E
 WHERE     D.DEPTNO = E.WORKDEPT
 AND       E.BONUS BETWEEN 0.00 AND 1000.00
 GROUP BY  D.DEPTNO;

In this query, the detail rows that qualify from each table are joined prior to the GROUP BY processing. In general, there will be more EMP rows than DEPT rows because a department comprises multiple employees. Suppose there were 200 DEPT rows joined to 75,000 EMP rows; only after that join is the GROUP BY processed. Instead, you can use table expressions to force the optimizer to process the aggregations on a table-by-table basis:

 

 SELECT  D.DEPTNO, D.DEPTNAME, D.LOCATION, TOTAL_SALARY
 FROM    DSN8810.DEPT D,
         (SELECT WORKDEPT, SUM(SALARY) AS TOTAL_SALARY
          FROM   DSN8810.EMP  E
          WHERE  E.BONUS BETWEEN 0.00 AND 1000.00
          GROUP BY E.WORKDEPT) AS E
 WHERE   D.DEPTNO = E.WORKDEPT;

This will produce the same results with better performance. Use table expressions to pre-filter FULL JOIN tables, to pre-filter null supplying tables of LEFT / RIGHT joins, to separate GROUP BY work, and to generate or derive data.

Consider Table Expressions Instead of Views

Instead of creating views and using them in your application programs, consider using table expressions. By moving the SQL from the view into a table expression the full intent of the SQL becomes more apparent. Additionally, it is easier to debug SQL problems when the SQL is completely visible in the program.

When confronted with a problem in SQL that uses a view, you will have to query the DB2 Catalog to find the SQL that makes up the view. This information is in SYSIBM.SYSVIEWS. The SQL in this table can be difficult to read because it is not formatted. However, if the view SQL were coded directly in the query as a table expression, it would be much easier to understand and debug or tune the query.
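For example, you might retrieve the text of the COL_LENGTH view created earlier in this section with a catalog query along these lines; the column names shown reflect the V7/V8-era SYSIBM.SYSVIEWS layout, and 'AUTHID' stands in for the actual view creator:

 SELECT SEQNO, TEXT
 FROM   SYSIBM.SYSVIEWS
 WHERE  CREATOR = 'AUTHID'
 AND    NAME    = 'COL_LENGTH'
 ORDER BY SEQNO;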

Consider Row Expressions

SQL becomes even more flexible under DB2 V7 with row expressions. Row expressions allow SQL statements to be coded using more than one set of comparisons in a single predicate using a subquery. The net result is that multiple columns can be compared within the scope of a single SQL predicate, possibly against multiple rows on the right side of the predicate. Perhaps the best way to understand this feature is by viewing an example:

 

 SELECT *
 FROM   SAMPLE_TABLE
 WHERE  (COL1, COL2) IN (SELECT COLX, COLY
                         FROM   OTHER_TABLE);

The difference between this statement and a typical SQL statement is quite obvious: Two columns are coded on the left side of the predicate, thereby enabling two columns to be selected in the SELECT statement on the right side of the predicate. Of course, a row expression need not be limited to only two columns; any number of columns can be specified, so long as the number of columns on the left matches the number of columns on the right side of the predicate.

Row expressions bring more flexibility and can greatly simplify certain types of SQL statements. Additionally, they can be more efficient than an equivalent SQL statement using multiple subqueries. Consider

 

 SELECT *
 FROM   SAMPLE_TABLE
 WHERE  COL1 IN (SELECT COLX
                 FROM   OTHER_TABLE)
 AND    COL2 IN (SELECT COLY
                 FROM   OTHER_TABLE);

Although this SQL statement is functionally equivalent to the previous SQL statement, it contains two subquery predicates and requires two scans of OTHER_TABLE . The previous SQL statement can scan OTHER_TABLE once to achieve the same results.

Use Scalar Fullselect to Your Advantage

DB2 Version 8 extends the functionality of SQL to allow a scalar fullselect to be used almost anywhere an expression can be used. The term scalar fullselect basically means a SELECT statement that returns a single value (or a NULL ). An example will help to clarify:

 

 SELECT PROJNO, PROJNAME,
        (SELECT MIN(ACSTDATE) FROM DSN8810.PROJACT),
        DEPTNO
 FROM   DSN8810.PROJ;

The SELECT -list of this statement contains three columns and a scalar fullselect. For this SQL to work, it is important that only one value is returned to the fullselect embedded in the SELECT statement; this is what makes it a scalar fullselect.
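A scalar fullselect can also be used inside an expression. For example, the following sketch compares each salary to the company average; the DIFF_FROM_AVG column name is purely illustrative:

 SELECT EMPNO, LASTNAME, SALARY,
        SALARY - (SELECT AVG(SALARY)
                  FROM   DSN8810.EMP) AS DIFF_FROM_AVG
 FROM   DSN8810.EMP;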

Consider Using Multi-Row Fetches

When you need to retrieve multiple rows, consider deploying a multi-row fetch to transfer more than one row using a single FETCH statement. This capability is new as of DB2 Version 8; basically, it allows you to FETCH multiple rows at one time into an array in your program. By fetching multiple rows at once, your request can become more efficient, especially for distributed requests. More information on multi-row fetching is provided in Chapter 11, "Using DB2 in an Application Program."
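For example, a sketch of the embedded SQL for a rowset cursor; the host-variable array names are hypothetical, and each array must be declared with at least ten elements:

 DECLARE CSR1 CURSOR WITH ROWSET POSITIONING FOR
   SELECT EMPNO, LASTNAME
   FROM   DSN8810.EMP;

 FETCH NEXT ROWSET FROM CSR1
   FOR 10 ROWS
   INTO :HVA-EMPNO, :HVA-LASTNAME;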

Be Flexible and Adapt to Your Circumstances

When coding SQL, there are no rules that always apply. Adopt the basic tenet of almost never saying "always" or "never" and you will have grasped the zen of SQL coding.

Remember the cardinal rule of DBMS development that we discussed earlier: "It depends." To be successful with complex SQL, be sure to understand what can be done using SQL and have a good grasp of the fundamentals. Also, be sure to have the IBM DB2 SQL Reference manual readily available. With patience, a willingness to experiment, and a sound understanding of your available options, you will be able to code efficient SQL statements, regardless of the complexity required.


