Section 6.9. Result Set Predicated on Absence of Data

6.9. Result Set Predicated on Absence of Data

It is a common occurrence to look for rows in one table for which there is no matching data in another tableusually for identifying exceptions. There are two solutions people most often think of when having to deal with this type of problem: using either not in ( ) with an uncorrelated subquery or not exists ( ) with a correlated subquery. Popular wisdom says that you should use not exists. Since a correlated subquery is efficient when used to mop up after the bulk of irrelevant data has been cleared out by efficient filtering, popular wisdom has it right when the subquery comes after the strong forces of efficient search criteria, and totally wrong when the subquery happens to be the only criterion.

One sometimes encounters more exotic solutions to the problem of finding rows in one table for which there is no matching data in another. The following example is a real-life case that monitoring revealed to be one of the costliest queries performed against a database (note that question marks are placeholders, or bind variables , for constant values that are passed to the query on successive executions):

     insert into ttmpout(custcode,                         suistrcod,                         cempdtcod,                         bkgareacod,                         mgtareacod,                         risktyp,                         riskflg,                         usr,                         seq,                         country,                         rating,                         sigsecsui)     select distinct custcode,                     ?,                     ?,                     ?,                     mgtareacod,                     ?,                     ?,                     usr,                     seq,                    country,                    rating,                    sigsecsui     from ttmpout a     where a.seq = ?       and 0 = (select count(*)                from ttmpout b                where b.suistrcod = ?                  and b.cempdtcod = ?                  and b.bkgareacod = ?                  and b.risktyp = ?                  and b.riskflg = ?                  and b.seq = ?)

This example must not be understood as an implicit unconditional endorsement of temporary tables! As a passing remark, I suspect that the insert statement was part of a loop. Proper performance improvement would probably be achieved by removing the loop.

An insertion into a table based on a select on the very same table as in the current example is a particular and yet not uncommon case of self-reference, an insertion derived from existing rows and conditional on the absence of the row to be created.

Using count(*) to test whether something exists or doesn't exist is a bad idea: to count, the DBMS must search and find all rows that match. We should use exists in such a case, which stops as soon as the first match is encountered. Arguably, it does not make much difference if the filtering criterion happens to be the primary key. But it may make a very significant difference in other casesand anyway from a semantic point of view there is no reason to say this:

     and 0 = (select count(*) ...)

when we mean this:

     and not exists (select 1 ...)

If we use count(*) as a test for existence, we may be lucky enough to benefit from the "invisible hand" of a smart optimizer, which will turn our query into something more suitable. But this will not necessarily be the case, and it will never be the case if the rows are counted into some variable as an independent step, because then even the smartest of optimizers cannot guess for which purpose we are counting: the result of the count( ) could be a critical value that absolutely has to be displayed to the end user!

In such a case when we want to create new, unique rows derived from rows already present in the table, however, the right construct to use is probably a set operator such as except (sometimes known as minus).

     insert into ttmpout(custcode,                          suistrcod,                          cempdtcod,                          bkgareacod,                          mgtareacod,                          risktyp,                          riskflg,                          usr,                          seq,                          country,                          rating,                          sigsecsui)     (select custcode,             ?,             ?,             ?,             mgtareacod,             ?,             ?,             usr,             seq,             country,             rating,             sigsecsui      from ttmpout      where seq = ?      except      select custcode,             ?,             ?,             ?,             mgtareacod,             ?,             ?,             usr,             seq,             country,             rating,             sigsecsui      from ttmpout      where suistrcod = ?        and cempdtcod = ?        and bkgareacod = ?        and risktyp = ?        and riskflg = ?        and seq = ?)

The big advantage of set operators is that they totally break the time frame imposed by subqueries, whether they are correlated or uncorrelated. What does breaking the time frame mean? When you have correlated subqueries, you must run the outer query, and then you must execute the inner query for each row that passes through all other filtering criteria. Both queries are extremely dependent on each other, since the outer query feeds the inner one.

The picture is slightly brighter with uncorrelated subqueries, but not yet totally rosy: the inner query must be executed, and in fact completed, before the outer query can step in and gather steam (something similar occurs even if the optimizer chooses to execute the global query as a hash join, which is the smart thing for it to do, because to execute a hash join, the SQL engine first has to scan one of the tables involved to build a hash array).

With set operators, on the contrary, whether they are union, intersect or except, none of the components in the query depends on any other. As a result, the different parts of the query can run in parallel. Of course, parallelism is of hardly any benefit if one of the steps is very slow while all the others are very fast; and it will be of no benefit at all if much of the work in one part is strictly identical to the work in another part, because then you are duplicating, rather than sharing, the work between processes. But in a favorable case, it is much more efficient to have all parts run in parallel before the final step, which combines the partial result setsdivide and rule.

There is an additional snag with using set operators: they require each part of the query to return compatible columnsan identical number of columns of identical types. A case such as the following (another real-life case, coming from a billing program) is typically unsuited to set operators:

     select whatever, sum(d.tax)     from invoice_detail d,          invoice_extractor e     where (e.pga_status = 0            or e.rd_status = 0)      and suitable_join_condition      and (d.type_code in (3, 7, 2)           or (d.type_code = 4               and d.subtype_code not in                    (select trans_code                     from trans_description                     where trans_category in (6, 7))))     group by what_is_required     having sum(d.tax) != 0

I am always fascinated by the final condition:

     sum(d.tax) != 0

and the way it evokes yellow brick roads and fantasy worlds where taxes are negative. A condition such as:

     and d.tax > 0

might have been more appropriate in the where clause, as already demonstrated.

In such a case a set operator would be rather awkward, since we would have to hit the invoice_detail tableas we can guess, not a lightweight tableseveral times. However, depending on the selectivity of the various criteria provided, typically if type_code=4 is a rare and therefore selective attribute condition, an exists might be more appropriate than a not in ( ). If, however, trans_description happens to be, at least relatively, a small table, then there is no doubt that trying to improve the query by playing on the existence test alone is a dead end.

Another interesting way to express nonexistenceand often quite an efficient oneis to use outer joins. The purpose of outer joins is basically to return, in a join, all information from one table, including rows for which no match is found in the joined table. As it happens, when we are looking for data that has no match in another table, it is precisely these rows that are of interest to us. How can we identify them? By checking the joined table columns: when there is no match, they are replaced with null values.

Something such as:

     select whatever     from invoice_detail     where type_code = 4       and subtype_code not in                     (select trans_code                      from trans_description                      where trans_category in (6, 7))

can therefore be rewritten:

     select whatever     from invoice_detail          outer join trans_description                  on trans_description.trans_category in (6, 7)                 and trans_description.trans_code = invoice_detail.subtype_code     where trans_description.trans_code is null

I have purposely included the condition on trans_category in the join clause. Whether it should rightly appear in this clause or in the where clause is debatable but, in fact, filtering before the join or after the join is result-neutral (of course, from a performance point of view, it can make a difference, depending on the relative selectivity of this condition and of the join condition itself). However, we have no such latitude with the condition on the null value, since this is something that can only be checked after the join.

Apart from the fact that the outer join may in some cases require a distinct, in practice there should be very little difference between checking the absence of data through an outer join or a not in ( ) uncorrelated subquery, since the column which is used for the join happens to be the very same column that is compared to the result set of the subquery. But SQL is famous for being a language in which the manner of the query expression often has a very real effect on the pattern of execution, even if the theory says otherwise. It all depends on the degree of sophistication of the optimizer, and whether it processes both types of queries in a similar way or not. In other words, SQL is not a truly declarative language, even if the enhancement of optimizers with each new version slowly improves its reliability.

Before closing this topic, watch out for the perennial SQL party-poopersnull values . Although in an in ( ) subquery a null value joining the flow of non-null values does not bother the outer query, with a not in ( ) subquery, any null value returned by the inner query causes the not in ( ) condition to be evaluated as false. It does not cost much to ensure that a subquery returns no null valueand doing so will save you a lot of grief.

Data sets can be compared using various techniques, but outer joins and set operators are likely to be efficient.