| < Day Day Up > |
|
A join is a combination of rows extracted from two or more tables. Joins can be very specific, for instance an intersection between two tables, or they can be less specific such as an outer join. An outer join is a join returning an intersection plus rows from either or both tables, not in the other table.
This discussion on tuning joins is divided into three sections: join syntax formats, efficient joins, and inefficient joins. Since this book is about tuning it seems sensible to divide joins between efficient joins and inefficient joins.
Firstly, let's take a look at the two different available join syntax formats in Oracle SQL.
There are two different syntax formats available for SQL join queries. The first is Oracle Corporation's proprietary format and the second is the ANSI standard format. Let's test the two formats to see if either format can be tuned to the best performance.
The Oracle SQL proprietary format places join specifications into the WHERE clause of an SQL query. The only syntactical addition to the standard SELECT statement syntax is the use of the (+) or outer join operator. We will deal with tuning outer joins later in this chapter. Following is an example of an Oracle SQL proprietary join formatted query with its query plan, using the Employees schema. All tables are fully scanned because there is joining but no filtering. The Optimizer forces full table reads on all tables because it is the fastest access method to read all the data.
EXPLAIN PLAN SET statement_id='TEST' FOR SELECT di.name, de.name, prj.name FROM division di, department de, project prj WHERE di.division_id = de.division_id AND de.department_id = prj.department_id; Query Cost Rows Bytes ----------------------------------- ---- ------ ------- SELECT STATEMENT on 23 10000 640000 HASH JOIN on 23 10000 640000 HASH JOIN on 3 100 3600 TABLE ACCESS FULL on DIVISION 1 10 170 TABLE ACCESS FULL on DEPARTMENT 1 100 1900 TABLE ACCESS FULL on PROJECT 13 10000 280000
The next example shows the same query except using the ANSI standard join format. Notice how the query plan is identical.
EXPLAIN PLAN SET statement_id='TEST' FOR SELECT di.name, de.name, prj.name FROM division di JOIN department de USING(division_id) JOIN project prj USING (department_id); Query Cost Rows Bytes ----------------------------------- ---- ------ ------- SELECT STATEMENT on 23 10000 640000 HASH JOIN on 23 10000 640000 HASH JOIN on 3 100 3600 TABLE ACCESS FULL on DIVISION 1 10 170 TABLE ACCESS FULL on DEPARTMENT 1 100 1900 TABLE ACCESS FULL on PROJECT 13 10000 280000
What is the objective of showing the two queries above, including their query plan details? The task of this book is performance tuning. Is either of the two of Oracle SQL proprietary or ANSI join formats inherently faster? Let's try to prove it either way. Once again the Oracle SQL proprietary format is shown below but with a filter added, finding only a single row in the join.
EXPLAIN PLAN SET statement_id='TEST' FOR SELECT di.name, de.name, prj.name FROM division di, department de, project prj WHERE di.division_id = 5 AND di.division_id = de.division_id AND de.department_id = prj.department_id; Query Cost Rows Bytes ---------------------------------------- ---- ------ ------- SELECT STATEMENT on 4 143 9152 TABLE ACCESS BY INDEX ROWID on PROJECT 2 10000 280000 NESTED LOOPS on 4 143 9152 NESTED LOOPS on 2 1 36 TABLE ACCESS BY INDEX ROWID on DIVISION 1 1 17 INDEX UNIQUE SCAN on XPKDIVISION 1 TABLE ACCESS FULL on DEPARTMENT 1 10 190 INDEX RANGE SCAN on XFKPROJECT_DEPARTMENT 1 10000
Next is the ANSI standard equivalent of the previous join, including the filter. Two of the most important aspects of tuning SQL join queries are the ability to apply filtering prior to joining tables and specifying the table with the largest filter applied as being the first table in the FROM clause, especially for very large tables. The question is this: Does the ANSI format allow for tuning of joins down to these levels of detail? Is the ANSI format a faster and more tunable option?
EXPLAIN PLAN SET statement_id='TEST' FOR SELECT di.name, de.name, prj.name FROM division di JOIN department de ON(di.division_id = de.division_id) JOIN project prj ON(de.department_id = prj.department_id) WHERE di.division_id = 5;
In the previous join query filtering is visibly applied after the specification of the join. Also note that with the addition of filtering the ON clause rather than the USING clause is required. In the following query plan note that the Optimizer has not changed its plan of execution between the Oracle SQL proprietary and ANSI join formats. There is no difference in performance between Oracle SQL proprietary and ANSI standard join formats.
Query Cost Rows Bytes -------------------------------------- ---- ------ ------- SELECT STATEMENT on 4 143 9152 TABLE ACCESS BY INDEX ROWID on PROJECT 2 10000 280000 NESTED LOOPS on 4 143 9152 NESTED LOOPS on 2 1 36 TABLE ACCESS BY INDEX ROWID on DIVISION 1 1 17 INDEX UNIQUE SCAN on XPKDIVISION 1 TABLE ACCESS FULL on DEPARTMENT 1 10 190 INDEX RANGE SCAN on XFKPROJECT_DEPARTMENT 1 10000
A more visibly tunable join could be demonstrated by retrieving a single row from the largest rather than the smallest table. Here is the Oracle SQL proprietary format.
EXPLAIN PLAN SET statement_id='TEST' FOR SELECT di.name, de.name, prj.name FROM project prj, department de, division di WHERE prj.project_id = 50 AND de.department_id = prj.department_id AND di.division_id = de.division_id;
Notice in the following query plan that the cost is the same but the number of rows and bytes read are substantially reduced; only a single row is retrieved. Since the Project table is being reduced in size more than any other table it appears first in the FROM clause. The same applies to the Department table being larger than the Division table.
Query Cost Rows Bytes -------------------------------------- ---- ------ ------- SELECT STATEMENT on 4 1 67 NESTED LOOPS on 4 1 67 NESTED LOOPS on 3 1 50 TABLE ACCESS BY INDEX ROWID on PROJECT 2 1 31 INDEX UNIQUE SCAN on XPKPROJECT 1 1 TABLE ACCESS BY INDEX ROWID on DEPARTMENT 1 100 1900 INDEX UNIQUE SCAN on XPKDEPARTMENT 100 TABLE ACCESS BY INDEX ROWID on DIVISION 1 10 170 INDEX UNIQUE SCAN on XPKDIVISION 10
Now let's do the same query but with the ANSI join format. From the following query plan we can once again see that use of either the Oracle SQL proprietary or ANSI join format does not appear to make any difference to performance and capacity for tuning.
EXPLAIN PLAN SET statement_id='TEST' FOR SELECT di.name, de.name, prj.name FROM project prj JOIN department de ON(prj.department_id = de.department_id) JOIN division di ON(de.division_id = di.division_id) WHERE prj.project_id = 50; Query Cost Rows Bytes -------------------------------------- ---- ------ ------- SELECT STATEMENT on 4 1 67 NESTED LOOPS on 4 1 67 NESTED LOOPS on 3 1 50 TABLE ACCESS BY INDEX ROWID on PROJECT 2 1 31 INDEX UNIQUE SCAN on XPKPROJECT 1 1 TABLE ACCESS BY INDEX ROWID on DEPARTMENT 1 100 1900 INDEX UNIQUE SCAN on XPKDEPARTMENT 100 TABLE ACCESS BY INDEX ROWID on DIVISION 1 10 170 INDEX UNIQUE SCAN on XPKDIVISION 10
Let's take this further and do some time testing. We will use the Accounts schema since the Employees schema does not have much data. We want to retrieve more rows to give a better chance of getting a time difference, thus we will not filter on the largest table first. As can be seen from the following results the timing is identical. Perhaps changing the join orders could make subtle differences but there is no reason why the ANSI join format should be considered less tunable.
SQL> SELECT COUNT(*) FROM ( 2 SELECT t.text, st.text, coa.text, gl.dr, gl.cr 3 FROM type t , subtype st, coa, generalledger gl 4 WHERE t.type = 'A' 5 AND coa.type = t. type 6 AND coa.subtype = st.subtype 7 AND gl.coa# = coa.coa#); COUNT(*) ----------- 239848 Elapsed: 00:00:04.06 SQL> SELECT COUNT(*) FROM ( 2 SELECT t.text, st.text, coa.text, gl.dr, gl.cr 3 FROM type t JOIN coa ON(t.type = coa.type) 4 JOIN subtype st ON(st.subtype = coa.subtype) 5 JOIN generalledger gl ON(gl.coa# = coa.coa#) 6 WHERE t.type = 'A'); COUNT(*) ----------- 239848 Elapsed: 00:00:04.06
What is an efficient join? An efficient join is a join SQL query which can be tuned to an acceptable level of performance. Certain types of join queries are inherently easily tuned and thus can give good performance. In general, a join is efficient when it can use indexes on large tables or is reading only very small tables. Moreover, any type of join will be inefficient if coded improperly.
An inner or natural join is an intersection between two tables. In Set parlance an intersection contains all elements occurring in both of the sets, or common to both sets. An intersection is efficient when index columns are matched together in join clauses. Obviously intersection matching not using indexed columns will be inefficient. In that case you may want to create alternate indexes. On the other hand, when a table is very small the Optimizer may conclude that reading the whole table is faster than reading an associated index plus the table. How the Optimizer makes a decision such as this will be discussed in later chapters since this subject matter delves into indexing and physical file block structure in Oracle Database datafiles.
In the example below both of the Type and COA tables are so small that the Optimizer does not bother with the indexes and simply reads both of the tables fully.
EXPLAIN PLAN SET statement_id='TEST' FOR SELECT t.text, coa.text FROM type t JOIN coa USING(type); Query Cost Rows Bytes -------------------------------------- ---- ------ ------- SELECT STATEMENT on 3 55 1430 HASH JOIN on 3 55 1430 TABLE ACCESS FULL on TYPE 1 6 54 TABLE ACCESS FULL on COA 1 55 935
With the next example the Optimizer has done something a little odd by using a unique index on the Subtype table. The Subtype table has only four rows and is extremely small.
EXPLAIN PLAN SET statement_id='TEST' FOR SELECT t.text, coa.text FROM type t JOIN coa USING(type) JOIN subtype st USING(subtype); Query Cost Rows Bytes -------------------------------------- ---- ------ ------- SELECT STATEMENT on 3 55 1650 NESTED LOOPS on 3 55 1650 HASH JOIN on 3 55 1540 TABLE ACCESS FULL on TYPE 1 6 54 TABLE ACCESS FULL on COA 1 55 1045 INDEX UNIQUE SCAN on XPKSUBTYPE 4 8
Once again in the following example the Optimizer has chosen to read the index for the very small Subtype table. However, the GeneralLedger table has its index read because it is very large and the Optimizer considers that more efficient. The reason for this is that the GeneralLedger table does have an index on the COA# column and thus the index is range scanned.
EXPLAIN PLAN SET statement_id='TEST' FOR SELECT t.text, coa.text FROM type t JOIN coa USING(type) JOIN subtype st USING(subtype) JOIN generalledger gl ON(gl.coa# = coa.coa#); Query Cost Rows Bytes -------------------------------------- ---- ------ ------- SELECT STATEMENT on 58 752740 31615080 NESTED LOOPS on 58 752740 31615080 NESTED LOOPS on 3 55 1980 HASH JOIN on 3 55 1870 TABLE ACCESS FULL on TYPE 1 6 54 TABLE ACCESS FULL on COA 1 55 1375 INDEX UNIQUE SCAN on XPKSUBTYPE 4 8 INDEX RANGE SCAN on XFK_GL_COA# 1 752740 4516440
The most efficient type of inner join will generally be one retrieving very specific rows such as that in the next example. Most SQL is more efficient when retrieving very specific, small numbers of rows.
EXPLAIN PLAN SET statement_id='TEST' FOR SELECT t.text, st.text, coa.text, gl.dr, gl.cr FROM generalledger gl JOIN coa ON(gl.coa# = coa.coa#) JOIN type t ON(t.type = coa.type) JOIN subtype st ON(st.subtype = coa.subtype) WHERE gl.generalledger_id = 100;
Note how all tables in the query plan are accessed using unique index hits.
Query Pos Cost Rows Bytes ----------------------------------- ---- ------ ------- ------- SELECT STATEMENT on 6 6 1 64 NESTED LOOPS on 1 6 1 64 NESTED LOOPS on 1 5 1 55 NESTED LOOPS on 1 4 1 45 TABLE ACCESS BY INDEX ROWID on GENE 1 3 1 20 INDEX UNIQUE SCAN on XPKGENERALLED 1 2 1 TABLE ACCESS BY INDEX ROWID on COA 2 1 55 1375 INDEX UNIQUE SCAN on XPKCOA 1 55 TABLE ACCESS BY INDEX ROWID on SUBTY 2 1 4 40 INDEX UNIQUE SCAN on XPKSUBTYPE 1 4 TABLE ACCESS BY INDEX ROWID on TYPE 2 1 6 54 INDEX UNIQUE SCAN on XPKTYPE 1 6
A self join joins a table to itself. Sometimes self-joining tables can be handled with hierarchical queries. Otherwise a self join is applied to a table containing columns within each row which link to each other. The Employee table in the Employees schema is such a table. Since both the MANAGER_ID and EMPLOYEE_ID columns are indexed it would be fairly efficient to join the tables using those two columns.
EXPLAIN PLAN SET statement_id='TEST' FOR SELECT manager.name, employee.name FROM employee manager, employee employee WHERE employee.manager_id = manager.employee_id;
In the query plan the Employee table is fully scanned twice because all the data is read and the Optimizer considers this faster because the Employee table is small.
Query Cost Rows Bytes ----------------------------------- ---- ------ ------- SELECT STATEMENT on 3 110 2970 HASH JOIN on 3 110 2970 TABLE ACCESS FULL on EMPLOYEE 1 111 1554 TABLE ACCESS FULL on EMPLOYEE 1 111 1443
An equi-join uses the equals sign (=) and a range join uses range operators (<, >, <=, >=) and the BETWEEN operator. In general, the = operator will execute an exact row hit on an index and thus use unique index hits. The range operators will usually require the Optimizer to execute index range scans. BTree (binary tree) indexes, the most commonly used indexes in Oracle Database, are highly amenable to range scans. A BTree index is little like a limited depth tree and is optimized for both unique hits and range scans.
Going back into the Accounts schema, this first query uses two unique index hits. The filter helps that happen.
EXPLAIN PLAN SET statement_id='TEST' FOR SELECT coa.*, gl.* FROM generalledger gl JOIN coa ON(gl.coa# = coa.coa#) WHERE generalledger_id = 10; Query Cost Rows Bytes ------------------------------------- ---- ------ ------- SELECT STATEMENT on 4 1 51 NESTED LOOPS on 4 1 51 TABLE ACCESS BY INDEX ROWID on GENERALLEDG 3 1 26 INDEX UNIQUE SCAN on XPKGENERALLEDGER 2 1 TABLE ACCESS BY INDEX ROWID on COA 1 55 1375 INDEX UNIQUE SCAN on XPKCOA 55
This second query uses a range index scan on the GeneralLedger table as a result of the range operator in the filter. Do you notice that the join clause inside the ON clause is where the range join operator is placed? Well there isn't really much point in joining ON (gl.coa# >= coa.coa#). I do not think I have ever seen an SQL join joining using a range operator. The result would be a very unusual type of outer join perhaps. Thus there is no need for a query plan.
EXPLAIN PLAN SET statement_id='TEST' FOR SELECT coa.*, gl.* FROM generalledger gl JOIN coa ON(gl.coa# >= coa.coa#) WHERE generalledger_id = 10;
Tip | A Cartesian product is generally useless in a relational database, so is a range join. |
What is an inefficient join? An inefficient join is an SQL query joining tables which is difficult to tune or cannot be tuned to an acceptable level of performance. Certain types of join queries are inherently both poor performers and difficult if not impossible to tune. Inefficient joins are best avoided.
The ANSI join format calls a Cartesian product a Cross Join. A Cross Join is only tunable as far as columns selected match indexes such that rows are retrieved from indexes and not tables.
The following second query has a lower cost than the first because the selected columns match indexes on both tables. I have left the Rows and Bytes columns in the query plans as overflowed numbers replaced with a string of # characters. This is done to stress the pointlessness of using a Cartesian product in a relational database.
EXPLAIN PLAN SET statement_id='TEST' FOR SELECT * FROM coa, generalledger; Query Cost Rows Bytes ----------------------------------- ---- ------ ------- SELECT STATEMENT on 27116 ####### ######### MERGE JOIN CARTESIAN on 27116 ####### ######### TABLE ACCESS FULL on COA 1 55 1375 BUFFER SORT on 27115 752740 19571240 TABLE ACCESS FULL on GENERALLEDGER 493 752740 19571240 EXPLAIN PLAN SET statement_id='TEST' FOR SELECT coa.coa#, gl.generalledger_id FROM coa, generalledger gl; Query Cost Rows Bytes ----------------------------------- ---- ------ ------- SELECT STATEMENT on 11936 ####### ######### MERGE JOIN CARTESIAN on 11936 ####### ######### INDEX FULL SCAN on XPKCOA 1 55 330 BUFFER SORT on 11935 752740 4516440 INDEX FAST FULL SCAN on XPKGENERALLEDGER 217 752740 4516440
Tuning an outer join requires the same approach to tuning as with an inner join. The only point to note is that if applications require a large quantity of outer joins there is probably potential for data model tuning. The data model could be too granular. Outer joins are probably more applicable to reporting and data warehouse type applications.
An outer join is not always inefficient. The performance and to a certain extent the indication of a need for data model tuning depends on the ratio of rows retrieved from the intersection to rows retrieved outside the intersection. The more rows retrieved from the intersection the better.
My question is this: Why are outer joins needed? Examine the data model first.
An anti-join is always a problem. An anti-join simply does the opposite of a requirement. The result is that the Optimizer must search for everything not meeting a condition. An anti-join will generally always produce a full table scan as seen in the first example following. The second example uses one index because indexed columns are being retrieved from one of the tables. Again the Rows and Bytes columns are left as overflowing showing the possibly folly of using anti-joins.
EXPLAIN PLAN SET statement_id='TEST' FOR SELECT t.text, coa# FROM type t, coa WHERE t.type != coa.type; Query Cost Rows Bytes ----------------------------------- ---- ------ ------- SELECT STATEMENT on 7 275 4675 NESTED LOOPS on 7 275 4675 TABLE ACCESS FULL on TYPE 1 6 54 TABLE ACCESS FULL on COA 1 55 440 EXPLAIN PLAN SET statement_id='TEST' FOR SELECT coa.coa#, gl.generalledger_id FROM coa, generalledger gl WHERE coa.coa# != gl.coa#; Query Pos Cost Rows Bytes ------------------------ ---- ------ ------- ------- SELECT STATEMENT on 27116 27116 ####### ######### NESTED LOOPS on 1 27116 ####### ######### INDEX FULL SCAN on XPKCOA 1 1 55 330 TABLE ACCESS FULL on GENERALLEDGER 2 493 752740 9032880
A mutable join is a join of more than two tables. A complex join is a mutable join with added filtering. We have already examined a complex mutable join in the section on intersection joins and various other parts of this chapter.
So how can a join be tuned? There are a number of factors to consider.
Use equality first.
Use range operators only where equality does not apply.
Avoid use of negatives in the form of != or NOT.
Avoid LIKE pattern matching.
Try to retrieve specific rows and in small numbers.
Filter from large tables first to reduce rows joined. Retrieve tables in order from the most highly filtered table downwards; preferably the largest table has the most filtering applied.
Tip | The most highly filtered table is the table having the smallest percentage of its rows retrieved, preferably the largest table. |
Use indexes wherever possible except for very small tables.
Let the Optimizer do its job.
| < Day Day Up > |
|