6.5 Joins | Oracle High Performance Tuning for 9i and 10g

< Day Day Up >

A join is a combination of rows extracted from two or more tables. Joins can be very specific, for instance an intersection between two tables, or they can be less specific such as an outer join. An outer join is a join returning an intersection plus rows from either or both tables, not in the other table.

This discussion on tuning joins is divided into three sections: join syntax formats, efficient joins, and inefficient joins. Since this book is about tuning it seems sensible to divide joins between efficient joins and inefficient joins.

Firstly, let's take a look at the two different available join syntax formats in Oracle SQL.

6.5.1 Join Formats

There are two different syntax formats available for SQL join queries. The first is Oracle Corporation's proprietary format and the second is the ANSI standard format. Let's test the two formats to see if either format can be tuned to the best performance.

The Oracle SQL proprietary format places join specifications into the WHERE clause of an SQL query. The only syntactical addition to the standard SELECT statement syntax is the use of the (+) or outer join operator. We will deal with tuning outer joins later in this chapter. Following is an example of an Oracle SQL proprietary join formatted query with its query plan, using the Employees schema. All tables are fully scanned because there is joining but no filtering. The Optimizer forces full table reads on all tables because it is the fastest access method to read all the data.

EXPLAIN PLAN SET statement_id='TEST' FOR       SELECT di.name, de.name, prj.name       FROM division di, department de, project prj       WHERE di.division_id = de.division_id       AND de.department_id = prj.department_id;     Query                                 Cost     Rows     Bytes -----------------------------------   ----   ------   ------- SELECT STATEMENT on                     23    10000    640000  HASH JOIN on                           23    10000    640000   HASH JOIN on                           3      100      3600    TABLE ACCESS FULL on DIVISION         1       10       170    TABLE ACCESS FULL on DEPARTMENT       1      100      1900   TABLE ACCESS FULL on PROJECT          13    10000    280000

The next example shows the same query except using the ANSI standard join format. Notice how the query plan is identical.

EXPLAIN PLAN SET statement_id='TEST' FOR       SELECT di.name, de.name, prj.name       FROM division di JOIN department de          USING(division_id)             JOIN project prj USING (department_id);     Query                                 Cost     Rows     Bytes -----------------------------------   ----   ------   ------- SELECT STATEMENT on                     23    10000    640000  HASH JOIN on                           23    10000    640000   HASH JOIN on                           3      100      3600    TABLE ACCESS FULL on DIVISION         1       10       170    TABLE ACCESS FULL on DEPARTMENT       1      100      1900   TABLE ACCESS FULL on PROJECT          13    10000    280000

What is the objective of showing the two queries above, including their query plan details? The task of this book is performance tuning. Is either of the two of Oracle SQL proprietary or ANSI join formats inherently faster? Let's try to prove it either way. Once again the Oracle SQL proprietary format is shown below but with a filter added, finding only a single row in the join.

EXPLAIN PLAN SET statement_id='TEST' FOR       SELECT di.name, de.name, prj.name       FROM division di, department de, project prj       WHERE di.division_id = 5       AND di.division_id = de.division_id       AND de.department_id = prj.department_id;     Query                                      Cost     Rows     Bytes ----------------------------------------   ----   ------   ------- SELECT STATEMENT on                           4      143      9152  TABLE ACCESS BY INDEX ROWID on PROJECT       2    10000    280000   NESTED LOOPS on                             4      143      9152   NESTED LOOPS on                             2        1        36    TABLE ACCESS BY INDEX ROWID on DIVISION    1        1        17     INDEX UNIQUE SCAN on XPKDIVISION                   1    TABLE ACCESS FULL on DEPARTMENT            1       10       190   INDEX RANGE SCAN on XFKPROJECT_DEPARTMENT   1    10000

Next is the ANSI standard equivalent of the previous join, including the filter. Two of the most important aspects of tuning SQL join queries are the ability to apply filtering prior to joining tables and specifying the table with the largest filter applied as being the first table in the FROM clause, especially for very large tables. The question is this: Does the ANSI format allow for tuning of joins down to these levels of detail? Is the ANSI format a faster and more tunable option?

EXPLAIN PLAN SET statement_id='TEST' FOR       SELECT di.name, de.name, prj.name       FROM division di JOIN department de             ON(di.division_id = de.division_id)             JOIN project prj ON(de.department_id =                 prj.department_id)       WHERE di.division_id = 5;

In the previous join query filtering is visibly applied after the specification of the join. Also note that with the addition of filtering the ON clause rather than the USING clause is required. In the following query plan note that the Optimizer has not changed its plan of execution between the Oracle SQL proprietary and ANSI join formats. There is no difference in performance between Oracle SQL proprietary and ANSI standard join formats.

Query                                    Cost     Rows     Bytes --------------------------------------   ----   ------   ------- SELECT STATEMENT on                         4      143      9152 TABLE ACCESS BY INDEX ROWID on PROJECT      2    10000    280000 NESTED LOOPS on                             4      143      9152 NESTED LOOPS on                             2        1        36 TABLE ACCESS BY INDEX ROWID on DIVISION     1        1        17 INDEX UNIQUE SCAN on XPKDIVISION            1 TABLE ACCESS FULL on DEPARTMENT             1       10       190 INDEX RANGE SCAN on XFKPROJECT_DEPARTMENT   1    10000

A more visibly tunable join could be demonstrated by retrieving a single row from the largest rather than the smallest table. Here is the Oracle SQL proprietary format.

EXPLAIN PLAN SET statement_id='TEST' FOR       SELECT di.name, de.name, prj.name       FROM project prj, department de, division di       WHERE prj.project_id = 50       AND de.department_id = prj.department_id       AND di.division_id = de.division_id;

Notice in the following query plan that the cost is the same but the number of rows and bytes read are substantially reduced; only a single row is retrieved. Since the Project table is being reduced in size more than any other table it appears first in the FROM clause. The same applies to the Department table being larger than the Division table.

Query                                    Cost     Rows     Bytes --------------------------------------   ----   ------   ------- SELECT STATEMENT on                         4        1        67  NESTED LOOPS on                            4        1        67   NESTED LOOPS on                           3        1        50    TABLE ACCESS BY INDEX ROWID on PROJECT   2        1        31     INDEX UNIQUE SCAN on XPKPROJECT         1        1    TABLE ACCESS BY INDEX ROWID on     DEPARTMENT                              1      100      1900    INDEX UNIQUE SCAN on XPKDEPARTMENT 100  TABLE ACCESS BY INDEX ROWID on DIVISION    1       10       170   INDEX UNIQUE SCAN on XPKDIVISION                  10

Now let's do the same query but with the ANSI join format. From the following query plan we can once again see that use of either the Oracle SQL proprietary or ANSI join format does not appear to make any difference to performance and capacity for tuning.

EXPLAIN PLAN SET statement_id='TEST' FOR       SELECT di.name, de.name, prj.name       FROM project prj JOIN department de             ON(prj.department_id = de.department_id)             JOIN division di ON(de.division_id =                 di.division_id)       WHERE prj.project_id = 50;     Query                                    Cost     Rows     Bytes --------------------------------------   ----   ------   ------- SELECT STATEMENT on                         4        1        67  NESTED LOOPS on                            4        1        67   NESTED LOOPS on                           3        1        50    TABLE ACCESS BY INDEX ROWID on PROJECT   2        1        31     INDEX UNIQUE SCAN on XPKPROJECT         1        1    TABLE ACCESS BY INDEX ROWID on     DEPARTMENT                              1      100      1900     INDEX UNIQUE SCAN on XPKDEPARTMENT             100   TABLE ACCESS BY INDEX ROWID on DIVISION   1       10       170    INDEX UNIQUE SCAN on XPKDIVISION                 10

Let's take this further and do some time testing. We will use the Accounts schema since the Employees schema does not have much data. We want to retrieve more rows to give a better chance of getting a time difference, thus we will not filter on the largest table first. As can be seen from the following results the timing is identical. Perhaps changing the join orders could make subtle differences but there is no reason why the ANSI join format should be considered less tunable.

SQL> SELECT COUNT(*) FROM (   2 SELECT t.text, st.text, coa.text, gl.dr, gl.cr   3 FROM type t , subtype st, coa, generalledger gl   4 WHERE t.type = 'A'   5 AND coa.type = t. type   6 AND coa.subtype = st.subtype   7 AND gl.coa# = coa.coa#);        COUNT(*) -----------      239848     Elapsed: 00:00:04.06     SQL> SELECT COUNT(*) FROM (   2 SELECT t.text, st.text, coa.text, gl.dr, gl.cr   3 FROM type t JOIN coa ON(t.type = coa.type)   4 JOIN subtype st ON(st.subtype = coa.subtype)   5 JOIN generalledger gl ON(gl.coa# = coa.coa#)   6 WHERE t.type = 'A');        COUNT(*) -----------      239848     Elapsed: 00:00:04.06

6.5.2 Efficient Joins

What is an efficient join? An efficient join is a join SQL query which can be tuned to an acceptable level of performance. Certain types of join queries are inherently easily tuned and thus can give good performance. In general, a join is efficient when it can use indexes on large tables or is reading only very small tables. Moreover, any type of join will be inefficient if coded improperly.

Intersections

An inner or natural join is an intersection between two tables. In Set parlance an intersection contains all elements occurring in both of the sets, or common to both sets. An intersection is efficient when index columns are matched together in join clauses. Obviously intersection matching not using indexed columns will be inefficient. In that case you may want to create alternate indexes. On the other hand, when a table is very small the Optimizer may conclude that reading the whole table is faster than reading an associated index plus the table. How the Optimizer makes a decision such as this will be discussed in later chapters since this subject matter delves into indexing and physical file block structure in Oracle Database datafiles.

In the example below both of the Type and COA tables are so small that the Optimizer does not bother with the indexes and simply reads both of the tables fully.

EXPLAIN PLAN SET statement_id='TEST' FOR       SELECT t.text, coa.text FROM type t JOIN coa          USING(type);     Query                                    Cost     Rows     Bytes --------------------------------------   ----   ------   ------- SELECT STATEMENT on                         3       55      1430  HASH JOIN on                               3       55      1430   TABLE ACCESS FULL on TYPE                 1        6        54    TABLE ACCESS FULL on COA                 1       55       935

With the next example the Optimizer has done something a little odd by using a unique index on the Subtype table. The Subtype table has only four rows and is extremely small.

EXPLAIN PLAN SET statement_id='TEST' FOR       SELECT t.text, coa.text FROM type t JOIN coa          USING(type)             JOIN subtype st USING(subtype);     Query                                    Cost     Rows     Bytes --------------------------------------   ----   ------   ------- SELECT STATEMENT on                         3       55      1650  NESTED LOOPS on                            3       55      1650   HASH JOIN on                              3       55      1540    TABLE ACCESS FULL on TYPE                1        6        54   TABLE ACCESS FULL on COA                  1       55      1045    INDEX UNIQUE SCAN on XPKSUBTYPE          4        8

Once again in the following example the Optimizer has chosen to read the index for the very small Subtype table. However, the GeneralLedger table has its index read because it is very large and the Optimizer considers that more efficient. The reason for this is that the GeneralLedger table does have an index on the COA# column and thus the index is range scanned.

EXPLAIN PLAN SET statement_id='TEST' FOR       SELECT t.text, coa.text       FROM type t JOIN coa USING(type)             JOIN subtype st USING(subtype)                   JOIN generalledger gl ON(gl.coa# =                       coa.coa#);     Query                                    Cost     Rows     Bytes --------------------------------------   ----   ------   ------- SELECT STATEMENT on                        58   752740  31615080  NESTED LOOPS on                           58   752740  31615080   NESTED LOOPS on                           3       55      1980    HASH JOIN on                             3       55      1870     TABLE ACCESS FULL on TYPE               1        6        54     TABLE ACCESS FULL on COA                1       55      1375    INDEX UNIQUE SCAN on XPKSUBTYPE          4        8   INDEX RANGE SCAN on XFK_GL_COA#           1   752740   4516440

The most efficient type of inner join will generally be one retrieving very specific rows such as that in the next example. Most SQL is more efficient when retrieving very specific, small numbers of rows.

EXPLAIN PLAN SET statement_id='TEST' FOR       SELECT t.text, st.text, coa.text, gl.dr, gl.cr       FROM generalledger gl JOIN coa ON(gl.coa# = coa.coa#)             JOIN type t ON(t.type = coa.type)                   JOIN subtype st ON(st.subtype =                       coa.subtype)       WHERE gl.generalledger_id = 100;

Note how all tables in the query plan are accessed using unique index hits.

Query                                  Pos     Cost     Rows      Bytes -----------------------------------   ----   ------   -------   ------- SELECT STATEMENT on                      6        6         1        64  NESTED LOOPS on                         1        6         1        64   NESTED LOOPS on                        1        5         1        55    NESTED LOOPS on                       1        4         1        45     TABLE ACCESS BY INDEX ROWID on GENE  1        3         1        20      INDEX UNIQUE SCAN on XPKGENERALLED  1        2         1     TABLE ACCESS BY INDEX ROWID on COA   2        1        55      1375      INDEX UNIQUE SCAN on XPKCOA         1                 55    TABLE ACCESS BY INDEX ROWID on SUBTY  2        1         4        40     INDEX UNIQUE SCAN on XPKSUBTYPE      1       4   TABLE ACCESS BY INDEX ROWID on TYPE    2        1         6        54    INDEX UNIQUE SCAN on XPKTYPE          1        6

Self Joins

A self join joins a table to itself. Sometimes self-joining tables can be handled with hierarchical queries. Otherwise a self join is applied to a table containing columns within each row which link to each other. The Employee table in the Employees schema is such a table. Since both the MANAGER_ID and EMPLOYEE_ID columns are indexed it would be fairly efficient to join the tables using those two columns.

EXPLAIN PLAN SET statement_id='TEST' FOR       SELECT manager.name, employee.name       FROM employee manager, employee employee       WHERE employee.manager_id = manager.employee_id;

In the query plan the Employee table is fully scanned twice because all the data is read and the Optimizer considers this faster because the Employee table is small.

Query                                 Cost     Rows     Bytes -----------------------------------   ----   ------   ------- SELECT STATEMENT on                      3      110      2970  HASH JOIN on                            3      110      2970   TABLE ACCESS FULL on EMPLOYEE          1      111      1554    TABLE ACCESS FULL on EMPLOYEE         1      111      1443

Equi-Joins and Range Joins

An equi-join uses the equals sign (=) and a range join uses range operators (<, >, <=, >=) and the BETWEEN operator. In general, the = operator will execute an exact row hit on an index and thus use unique index hits. The range operators will usually require the Optimizer to execute index range scans. BTree (binary tree) indexes, the most commonly used indexes in Oracle Database, are highly amenable to range scans. A BTree index is little like a limited depth tree and is optimized for both unique hits and range scans.

Going back into the Accounts schema, this first query uses two unique index hits. The filter helps that happen.

EXPLAIN PLAN SET statement_id='TEST' FOR       SELECT coa.*, gl.*       FROM generalledger gl JOIN coa ON(gl.coa# = coa.coa#)       WHERE generalledger_id = 10;     Query                                   Cost     Rows     Bytes -------------------------------------   ----   ------   ------- SELECT STATEMENT on                        4        1        51  NESTED LOOPS on                           4        1        51   TABLE ACCESS BY INDEX ROWID on     GENERALLEDG                            3        1        26    INDEX UNIQUE SCAN on XPKGENERALLEDGER   2        1   TABLE ACCESS BY INDEX ROWID on COA       1       55      1375    INDEX UNIQUE SCAN on XPKCOA                     55

This second query uses a range index scan on the GeneralLedger table as a result of the range operator in the filter. Do you notice that the join clause inside the ON clause is where the range join operator is placed? Well there isn't really much point in joining ON (gl.coa# >= coa.coa#). I do not think I have ever seen an SQL join joining using a range operator. The result would be a very unusual type of outer join perhaps. Thus there is no need for a query plan.

EXPLAIN PLAN SET statement_id='TEST' FOR       SELECT coa.*, gl.*       FROM generalledger gl JOIN coa ON(gl.coa# >= coa.coa#)       WHERE generalledger_id = 10;

Tip

A Cartesian product is generally useless in a relational database, so is a range join.

6.5.3 Inefficient Joins

What is an inefficient join? An inefficient join is an SQL query joining tables which is difficult to tune or cannot be tuned to an acceptable level of performance. Certain types of join queries are inherently both poor performers and difficult if not impossible to tune. Inefficient joins are best avoided.

Cartesian Products

The ANSI join format calls a Cartesian product a Cross Join. A Cross Join is only tunable as far as columns selected match indexes such that rows are retrieved from indexes and not tables.

The following second query has a lower cost than the first because the selected columns match indexes on both tables. I have left the Rows and Bytes columns in the query plans as overflowed numbers replaced with a string of # characters. This is done to stress the pointlessness of using a Cartesian product in a relational database.

EXPLAIN PLAN SET statement_id='TEST' FOR       SELECT * FROM coa, generalledger;     Query                                 Cost     Rows       Bytes -----------------------------------   ----   ------     ------- SELECT STATEMENT on                  27116  #######   #########  MERGE JOIN CARTESIAN on             27116  #######   #########   TABLE ACCESS FULL on COA               1       55        1375    BUFFER SORT on                    27115   752740    19571240     TABLE ACCESS FULL on      GENERALLEDGER                     493   752740    19571240     EXPLAIN PLAN SET statement_id='TEST' FOR       SELECT coa.coa#, gl.generalledger_id FROM coa,          generalledger gl;     Query                                 Cost     Rows       Bytes -----------------------------------   ----   ------     ------- SELECT STATEMENT on                  11936  #######   #########  MERGE JOIN CARTESIAN on             11936  #######   #########   INDEX FULL SCAN on XPKCOA              1       55         330   BUFFER SORT on                     11935   752740     4516440    INDEX FAST FULL SCAN on     XPKGENERALLEDGER                   217   752740     4516440

Outer Joins

Tuning an outer join requires the same approach to tuning as with an inner join. The only point to note is that if applications require a large quantity of outer joins there is probably potential for data model tuning. The data model could be too granular. Outer joins are probably more applicable to reporting and data warehouse type applications.

An outer join is not always inefficient. The performance and to a certain extent the indication of a need for data model tuning depends on the ratio of rows retrieved from the intersection to rows retrieved outside the intersection. The more rows retrieved from the intersection the better.

My question is this: Why are outer joins needed? Examine the data model first.

Anti-Joins

An anti-join is always a problem. An anti-join simply does the opposite of a requirement. The result is that the Optimizer must search for everything not meeting a condition. An anti-join will generally always produce a full table scan as seen in the first example following. The second example uses one index because indexed columns are being retrieved from one of the tables. Again the Rows and Bytes columns are left as overflowing showing the possibly folly of using anti-joins.

EXPLAIN PLAN SET statement_id='TEST' FOR       SELECT t.text, coa# FROM type t, coa WHERE          t.type != coa.type;     Query                                 Cost     Rows     Bytes -----------------------------------   ----   ------   ------- SELECT STATEMENT on                      7      275      4675  NESTED LOOPS on                         7      275      4675   TABLE ACCESS FULL on TYPE              1        6        54    TABLE ACCESS FULL on COA              1       55       440     EXPLAIN PLAN SET statement_id='TEST' FOR       SELECT coa.coa#, gl.generalledger_id FROM coa,          generalledger gl       WHERE coa.coa# != gl.coa#;     Query                       Pos     Cost     Rows        Bytes ------------------------   ----   ------   -------     ------- SELECT STATEMENT on       27116    27116   #######   #########  NESTED LOOPS on              1    27116   #######   #########   INDEX FULL SCAN on XPKCOA   1        1        55         330    TABLE ACCESS FULL on       GENERALLEDGER            2      493    752740     9032880

Mutable and Complex Joins

A mutable join is a join of more than two tables. A complex join is a mutable join with added filtering. We have already examined a complex mutable join in the section on intersection joins and various other parts of this chapter.

6.5.4 How to Tune a Join

So how can a join be tuned? There are a number of factors to consider.

Use equality first.
Use range operators only where equality does not apply.
Avoid use of negatives in the form of != or NOT.
Avoid LIKE pattern matching.
Try to retrieve specific rows and in small numbers.
Filter from large tables first to reduce rows joined. Retrieve tables in order from the most highly filtered table downwards; preferably the largest table has the most filtering applied.

Tip
The most highly filtered table is the table having the smallest percentage of its rows retrieved, preferably the largest table.
Use indexes wherever possible except for very small tables.
Let the Optimizer do its job.

< Day Day Up >