Inner Joins | Microsoft SQL Server 2008 Bible

In this section we'll be talking about inner joins, which only return rows with matching values from both joined tables excluding all other rows.

Two syntaxes for inner joins

There are two different syntaxes for table joins in SQL, one introduced in 1992 by the SQL standards committee, and the "old" one, used before this date but which still persists. The new standard syntax began to take a foothold in the late 1990s with introduction of Microsoft SQL Server 7.0, IBM DB2 UDB 7.1, and eventually Oracle joining the club with its 9i version, which supports the standard "new" SQL syntax in addition to the old one.

SQL99

SQL99 defines syntax that uses FROM clause of the SELECT statement to join tables:

... FROM  <table1> [INNER | NATURAL | CROSS] JOIN       <table2> [ON <condition>] | [USING <column_name>,...],... ...

The actual examples are given later in this chapter; for now let's clarify what all the optional clauses stand for.

Inner join options

With an inner join, you have these options:

Keyword INNER is optional; it could be used for clarity to distinguish between inner and outer joins.
Keyword NATURAL is used to specify a natural join between two tables, i.e., join them by column(s) with identical names. You cannot invoke either the ON or USING clause along with the NATURAL keyword. Out of all our "big three" RDBMS, it is available only in Oracle 9i. The natural join is discussed in more detail later in this chapter.
Keyword CROSS is used to produce a cross join, as discussed later in this chapter. The keyword is valid for Oracle 9i and MS SQL Server 2000 syntax but is not recognized by DB2 UDB 8.1.

ON and USING clauses

The ON clause is to specify the join condition (equijoin or nonequijoin, explained later in the chapter); all our "big three" databases have it in their syntax.

When you are specifying an equijoin of columns that have the same name in both tables, a USING clause can indicate the column(s) to be used. You can use this clause only if the join columns in both tables have the same name. The USING clause could be used when the NATURAL join would not work properly, i.e., tables have more identically named columns than you would actually want to use in your join. For example, if hypothetical tables A and B have common fields CUST_ID, PROD_ID, and ORDER_ID, but you only want to join them by CUST_ID and PROD_ID, you could specify

... FROM A JOIN B USING (CUST_ID, PROD_ID) ...

Note

The column names in USING clause should not be qualified by table names.

The USING clause is only supported by Oracle 9i out of our three vendors, but the same results could easily be achieved with ON clause:

... FROM A JOIN B ON A.CUST_ID = B.CUST_ID AND A.PROD_ID = B.PROD_ID ...

Old syntax

The other syntax, which is commonly used by virtually everyone (Oracle only recognized the "standard" syntax starting version 9i, and Microsoft SQL Server introduced it in version 7.0), performs joins in the WHERE clause:

... WHERE [<qualifier>.]<column_name>         <join_condition>        [<qualifier>.]<column_name> [AND  [<qualifier>].<column_name>         <join_condition>       [<qualifier>.]<column_name>],... ...

The join_condition can be the "equal" sign ('=') for equijoin or "not equal," "greater than," "less than," "greater than or equal," or "less than or equal" signs (<>, >, <, >=, <=) for non-equijoin. The explanation of terms "equijoin" and non-equijoin" follows.

Equijoin

Equijoin, which is the most popular type of table join, uses exact matching between two columns. For example, in the ACME database the CUSTOMER table does not contain any information about customer phone numbers; the PHONE table has the field PHONE_CUSTID_FN, which is the foreign key to CUSTOMER table. To display information from both tables, use equijoin as shown in Figure 9-1.

click to expand
Figure 9-1: Simple inner join

SQL99 standard syntax

The SQL99-compliant syntax to produce the output shown on the bottom of Figure 9-1 is given here:

SELECT          cust_id_n,                  cust_name_s,                  phone_phonenum_s,                 phone_type_s FROM            customer JOIN phone ON              cust_id_n = phone_custid_fn

Natural join

Natural join is a special case of equijoin performed by RDBMS on the following assumption: "Always join tables using a column (or multiple columns) if they share the same name(s)." In other words, you don't have to specify the columns explicitly in the ON subclause of the SELECT statement's FROM clause. The ON subclause is omitted completely.

Note

The idea of natural join contradicts the quasi-"Hungarian" notation concept (see Appendix B). In the ACME database, every column is prefixed with its abbreviated table name, so every column name in the database is unique and as such the natural join cannot be used at all. There are different opinions on whether the primary and the foreign keys should share the same name. The possibility to perform natural joins seems to be the only advantage in this case (and only if you are using Oracle), so in our ACME database we decided to use the notation consistently.

Cross-References

The naming conventions for the ACME database are described in Appendix B.

The natural join can make the query writing process easier, but only assuming the database structure and the query itself are simple enough; otherwise undesirable joins and incorrect query output could result.

Fully Qualified Column Names

If two or more tables share one or more column names referenced in your query, you must prefix column names with either names of tables or with table aliases; otherwise the RDBMS will not be able to resolve the ambiguity and an error message will be generated.

It is considered to be a good practice to qualify column references explicitly; however, as we mentioned before, in our ACME sample database each and every column name is unique (i.e., is already fully qualified, in a way), which makes the use of the fully qualified names unnecessary. The following example illustrates the use of the qualified names with the same query as in the previous example:

SELECT          customer.cust_id_n,                  customer.cust_name_s,                  phone.phone_phonenum_s,                 phone.phone_type_s FROM            customer JOIN phone ON              customer.cust_id_n =  phone.phone_custid_fn

We'll use the fully qualified names randomly in our further examples.

Old syntax

The old syntax joins tables in the WHERE clause of the SELECT statement. The syntax has its defenders and opponents. The main argument against it is that the WHERE clause should not mix joins and predicates — its only purpose should be setting the "vertical" limits on the produced resultset. The major point for it is that in queries that join dozens of tables the old syntax is more readable. Another big point is code legacy, especially for Oracle where the standard syntax was not available until version 9i. It is your choice which syntax to use (we would suggest the SQL99-compliant one for all new developments), but it is definitely not a bad idea to be familiar with both.

The following statement is the equivalent to one given in the previous section; the old syntax being the only difference:

SELECT         c.cust_id_n,                c.cust_name_s,                 p.phone_phonenum_s FROM           customer c,                phone p WHERE          c.cust_id_n = p.phone_custid_fn

Nonequijoin

Sometimes you need to join tables based on criteria other than equality. While the most typical use of equijoin deals with primary/foreign key relationships, that is not usually the case with nonequijoins — such a join would usually populate the resulting set in a way that does not make much sense at all. For example, if you replace the = (equals) operator in the query from the last section with <> (not equal), the resulting query will return every possible combination of customers and phone numbers except the ones that actually represent the valid customer/phone listings (Figure 9-2); in other words, the result will be somehow similar to the Cartesian product (CROSS JOIN) discussed later in this chapter.

click to expand
Figure 9-2: Primary/foreign key nonequijoin

In some situations though, nonequijoins appear to be quite useful. For example, imagine ACME, INC. has the following set of business rules. There is no discount for orders under $10,000; a 2 percent discount is given to all orders with total amount between $10,000 and $14,999; the orders with a total amount between $15,000 and $19,999 receive a 3 percent discount; and, finally, the orders that are $20,000 or more are eligible for a 4 percent discount.

One of the ways to implement such business rules is to have a table that stores minimum and maximum order amount along with the corresponding discount percentage as shown in Figure 9-3.

click to expand
Figure 9-3: The contents of DISCOUNT table

Tip

It is always a good practice to store values that potentially could change (for example, the discount for orders over $20,000 might change from 4 to 3.5 percent) in a RDBMS table rather than to hardcode them in the application. Should business rules change, you only need to update the table value(s) — which is fast and painless in comparison with changing the application code, recompiling the application, and so on.

In the following example, the nonequijoin between view V_CUSTOMER_TOTALS (that displays the dollar amount totals per order calculated using the order line amounts) and table DISCOUNT is shown. The query calculates the appropriate discount for each dollar amount (TOTAL_DISCOUNT) by multiplying the order total amount by the appropriate percent discount found based on the rules described above.

SQL99 standard syntax

SELECT  ct.order_number,          ct.total_price,          ct.total_price * d.disc_pct / 100 AS total_discount FROM    v_customer_totals ct    JOIN         discount d    ON   ct.total_price > d.disc_minamount_n    AND         ct.total_price < d.disc_maxamount_n     ORDER_NUMBER                   TOTAL_PRICE TOTAL_DISCOUNT ------------------------------ ----------- -------------- 523774                                6608              0 523778                                6608              0 ... 523777                               10010          200.2 523737                               10010          200.2 523730                             10915.8        218.316 ... 523781                             15402.2        462.066 523741                             15402.2        462.066 523775                             15456.8        463.704 523735                             15456.8        463.704     51 rows selected.

Old syntax

SELECT  order_number,          total_price,          total_price * disc_pct / 100 AS total_discount  FROM    v_customer_totals ct,          discount d WHERE   ct.total_price > d.disc_minamount_n  AND     ct.total_price < d.disc_maxamount_n

Note

Tables can be joined by more than one column. In this example we join the V_CUSTOMER_TOTALS column TOTAL_PRICE with both the DISC_MINAMOUNT_N and DISC_MAXAMOUNT_N columns of the DISCOUNT table to find the appropriate discount percentage for the specified range. Another typical example of multicolumn joins is when you have to join composite primary/foreign keys.

Self-join

The idea of self-join is probably one of the most unintuitive SQL concepts. Even though it sounds very simple — a table is joined with itself rather than with another table — sometimes it causes lots of confusion.

One important thing to understand here is that despite the fact that you are joining the table with itself, you are still dealing with two instances of the same table, or with two identical tables rather than with just one table, so the self-join should be considered as a special case of the multitable join.

Note

The understanding of the difference between a table and its instance is also very important. Two table instances could be thought of as the table data loaded twice into two different memory locations completely separate from each other.

Imagine if ACME, INC. decided to implement some kind of multilevel marketing. It starts selling products to resellers, who in their turn either sell the products to a lower-level reseller or directly to a customer. That means any of ACME's customers now could be a reseller (the one who buys and resells products) or a supplier (the one whom products are bought from), or both. The relationship has been implemented in the RESELLER table. ACME, INC. has NULL in the RESELLER_SUPPLIER_ID column because it's on the top of the tree and so it does not have a supplier. Everybody else has another entity's (from the RESELLER table) primary key in this field.

Figure 9-4 illustrates this concept as well as the idea of multiple table instances.

click to expand
Figure 9-4: Self-join

SQL99 standard syntax

The following query uses self-join to retrieve the reseller id and name along with its supplier id and name (for each entity in RESELLER table) using SQL99-compliant syntax:

SELECT  r.reseller_id_n   AS res_id,           r.reseller_name_s AS res_name,          s.reseller_id_n   AS sup_id,          s.reseller_name_s AS sup_name FROM    reseller r JOIN reseller s   ON    r.reseller_supplier_id = s.reseller_id_n     RES_ID RES_NAME                     SUP_ID SUP_NAME ------ ---------------------------- ------ --------------------------       2 MAGNETICS USA INC.          1      ACME, INC       3 MAGNETOMETRIC DEVICES INC.  1      ACME, INC       4 FAIR PARK GARDENS           2      MAGNETICS USA INC.       5 FAIR AND SONS AIR CONDTNG   2      MAGNETICS USA INC.       6 FABRITEK INC.               2      MAGNETICS USA INC.       7 WILE ELECTRONICS INC.       3      MAGNETOMETRIC DEVICES INC.       8 INTEREX USA                 3      MAGNETOMETRIC DEVICES INC.       9 JUDCO MANUFACTURING INC.    4      FAIR PARK GARDENS      10 ELECTRO BASS INC.           5      FAIR AND SONS AIR CONDTNG     (9 rows affected)

The information is to be interpreted this way: ACME, INC. sells directly to MAGNETICS USA INC. and MAGNETOMETRIC DEVICES INC. MAGNETOMETRIC DEVICES INC. buys products from ACME, INC. and sells them to WILE ELECTRONICS INC. and INTEREX USA, and so on.

Old syntax

The following statement is the old syntax equivalent of the same query:

SELECT  r.reseller_id_n   AS res_id,           r.reseller_name_s AS res_name,          s.reseller_id_n   AS sup_id,          s.reseller_name_s AS sup_name FROM    reseller r,          reseller s WHERE   r.reseller_supplier_id = s.reseller_id_n

Cross join (Cartesian product)

Cross join, or the Cartesian product of two tables, can be defined as another (rather virtual) table that consists of all possible pairs of rows from the two source tables. Returning to our customer phone example in the nonequijoin section of this chapter, the cross join of the CUSTOMER and PHONE tables returns results very similar to what we've got on Figure 8-2 except it would also return the valid customer/phone combinations, excluded from the previously mentioned nonequijoin.

SQL99 standard syntax

This query will return all possible combinations of customer names and phone numbers by performing cross join of CUSTOMER and PHONE tables:

SELECT  cust_name_s,          phone_phonenum_s FROM    customer CROSS JOIN          phone     CUST_NAME_S                                        PHONE_PHONENUM_S              -------------------------------------------------- --------------------          DLH INDUSTRIES                                     (817) 555-5524                FAIR AND SONS AIR CONDTNG                          (817) 555-5524                KILBURN GLASS INDUSTRIES                           (817) 555-5524                BOSWELL DESIGNS CORP.                              (817) 555-5524                WILE ELECTROMATIC INC.                             (817) 555-5524                FABRITEK INC.                                      (817) 555-5524                ...                                                ...             DLH INDUSTRIES                                     (847) 555-2924                FAIR AND SONS AIR CONDTNG                          (847) 555-2924                KILBURN GLASS INDUSTRIES                           (847) 555-2924                BOSWELL DESIGNS CORP.                              (847) 555-2924                WILE ELECTROMATIC INC.                             (847) 555-2924                FABRITEK INC.                                      (847) 555-2924                ...                                                ... DLH INDUSTRIES                                     (814) 555-0323                FAIR AND SONS AIR CONDTNG                          (814) 555-0323                KILBURN GLASS INDUSTRIES                           (814) 555-0323                BOSWELL DESIGNS CORP.                              (814) 555-0323                WILE ELECTROMATIC INC.                             (814) 555-0323                FABRITEK INC.                                      (814) 555-0323                ...                                                ...     3182 rows selected.

Old syntax

The old syntax for cross join simply omits the WHERE clause in a multitable join:

SELECT  cust_name_s,          phone_phonenum_s FROM    customer,          phone

Cross join could be a very costly operation, especially when it happens as the result of a human error and involves large tables with hundreds of thousands or even millions of rows. Since the resulting set of a Cartesian product is every possible combination of all rows in both tables, if the first table is, say, one hundred thousand rows and the second one is two hundred thousand rows, the query returns twenty billion rows (100,000 * 200,000 = 20,000,000,000) — quite enough to cause considerable database server slowdown that would affect all users that are currently logged on.

Thus, the Cartesian product should be handled with care, especially because it is not something that a typical SQL user uses a lot. Cross joins can be useful to generate test data or in educational purposes (for example, to demonstrate the idea that all equijoins and nonequijoins are simply a subset of a cross join that meets certain conditions), but one would hardly need, say, all possible combinations of customer names and phone numbers. In most real-life situations a Cartesian product is simply the result of a human error that would have to be fixed by the DBA (probably by killing the user's session; more on the sessions in Chapter 7).

Joining more than two tables

In a relational database quite often you need to retrieve data from many tables simultaneously within a single query to get all necessary information. Thus, in real life a multitable query could easily mean a dozen-table query or even a hundred-table query. Probably 90 percent of the SQL programming art is the talent to properly join multiple tables based on the knowledge of the internal database objects' structure plus the ability to apply this knowledge.

The concept is simple. The result of joining two tables could be considered as a new virtual table, which, in its turn, could be joined with the next table in the query, producing yet another virtual table, and so on. For example, if you needed to know what customers ordered which products, your query would have to join four tables: CUSTOMER, ORDER_HEADER, ORDER_LINE, and PRODUCT:

SELECT DISTINCT  customer.cust_name_s,                        product.prod_description_s  FROM             customer     JOIN          order_header    ON            customer.cust_id_n =  order_header.ordhdr_custid_fn    JOIN          order_line    ON            order_header.ordhdr_id_n =   order_line.ordline_ordhdrid_fn    JOIN          product     ON            product.prod_id_n =  order_line.ordline_prodid_fn

You could envision this query as a sequence of joins. First, join CUSTOMER and ORDER_HEADER using the appropriate columns; then join the resulting set of rows with the ORDER_LINE table; and, finally, join your result with the PRODUCT table as illustrated in Figure 9-5.

click to expand
Figure 9-5: Four tables join

Also note that even though you might not need any columns either from ORDER_HEADER or from ORDER_LINE tables, you have to use them in your query joins because there is no direct relationship between the CUSTOMER and PRODUCT tables.

Number of joins

One important rule to remember is that there should be at least (n – 1) joins in an n-table query, thus, at least two joins for a three-table query, at least three joins for query that involves four tables, and so on. The words "at least" are important — there could be more than (n – 1) joins in a nonequijoin query or in a query that joins tables using composite primary/foreign keys, but if your multitable query has less than (n – 1) joins, the result will be a Cartesian product.

Note

The most common (and the most dangerous) type of a cross join happens in queries that involve more than two tables. In that situation you don't even have to omit the WHERE clause — it would be sufficient to simply skip a join between two tables out of a dozen. Also, the more tables are joined within a single query, the bigger the resulting set could be — the Cartesian product of five tables, with only one hundred rows each, produces ten billion rows.

SQL99 standard syntax

The query in the previous example uses SQL99-compliant syntax. The query that follows also uses SQL99-compliant syntax to join the CUSTOMER, ORDER_HEADER, and STATUS tables and return customer name, order number, and order status:

SELECT cust_name_s,         ordhdr_nbr_s,         status_desc_s FROM   customer     JOIN         order_header    ON  cust_id_n = ordhdr_custid_fn    JOIN           status    ON  status_id_n = ordhdr_statusid_fn     CUST_NAME_S             ORDHDR_NBR_S      STATUS_DESC_S                                     ----------------------- ------------      -------------                    WILE BESS COMPANY       523731             COMPLETE                                          WILE BESS COMPANY       523732             COMPLETE                                          WILE ELECTRONICS INC.   523733             COMPLETE                                          WILE ELECTRONICS INC.   523734             COMPLETE                                          WILE ELECTRONICS INC.   523735             COMPLETE                                          WILE BESS COMPANY       523736             COMPLETE                                          ...                     ...                ...  WILE BESS COMPANY       523738             CANCELLED                                         WILE BESS COMPANY       523742             CANCELLED WILE BESS COMPANY       523743             CANCELLED                                                                                                                          51 rows selected.

Old syntax

Here is the old syntax equivalent of the previous query:

SELECT  cust_name_s,          ordhdr_nbr_s,          status_desc_s  FROM    customer,          order_header,          status  WHERE   cust_id_n = ordhdr_custid_fn AND     status_id_n = ordhdr_statusid_fn