Converting Subselects to Join Operations

12.16.1 Problem

You want to use a query that involves a subselect, but your version of MySQL predates 4.1, the first release to support subselects.

12.16.2 Solution

In many cases, you can rewrite a subselect as a join. Alternatively, you can write a program that simulates the subselect, or even have mysql generate SQL statements that simulate it.

12.16.3 Discussion

Assume that you have two tables, t1 and t2, with the following contents:

mysql> SELECT col1 FROM t1;
+------+
| col1 |
+------+
| a    |
| b    |
| c    |
+------+
mysql> SELECT col2 FROM t2;
+------+
| col2 |
+------+
| b    |
| c    |
| d    |
+------+

Now suppose that you want to find values in t1 that are also present in t2, or values in t1 that are not present in t2. These kinds of questions are sometimes answered using subselect queries that nest one SELECT inside another, but MySQL doesn't support subselects before Version 4.1. This section shows how to work around that limitation.

The following query shows an IN( ) subselect that produces the rows in table t1 having col1 values that match col2 values in table t2:

SELECT col1 FROM t1 WHERE col1 IN (SELECT col2 FROM t2);

That's essentially just a "find matching rows" query, and it can be rewritten as a simple join like this:

mysql> SELECT t1.col1 FROM t1, t2 WHERE t1.col1 = t2.col2;
+------+
| col1 |
+------+
| b    |
| c    |
+------+
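
The equivalence can be checked with a short script. The following is only a sketch: it uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption made so that the example runs without a server), and it recreates the t1 and t2 tables shown earlier. It also illustrates one caveat of the join form: if t2 contains duplicate col2 values, the join repeats the matching rows unless you add DISTINCT.

```python
import sqlite3

# SQLite is used here only as a stand-in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (col1 TEXT)")
conn.execute("CREATE TABLE t2 (col2 TEXT)")
conn.executemany("INSERT INTO t1 VALUES (?)", [("a",), ("b",), ("c",)])
conn.executemany("INSERT INTO t2 VALUES (?)", [("b",), ("c",), ("d",)])

subselect = ("SELECT col1 FROM t1 WHERE col1 IN (SELECT col2 FROM t2) "
             "ORDER BY col1")
join = ("SELECT t1.col1 FROM t1, t2 WHERE t1.col1 = t2.col2 "
        "ORDER BY t1.col1")
distinct_join = ("SELECT DISTINCT t1.col1 FROM t1, t2 "
                 "WHERE t1.col1 = t2.col2 ORDER BY t1.col1")

def fetch(stmt):
    """Run a statement and return the first column of each row as a list."""
    return [row[0] for row in conn.execute(stmt)]

# With the sample data, the subselect and the join return the same rows.
subselect_rows = fetch(subselect)
join_rows = fetch(join)
print(subselect_rows == join_rows)

# Add a duplicate col2 value: the plain join now repeats the match,
# whereas SELECT DISTINCT gives the same result as the subselect.
conn.execute("INSERT INTO t2 VALUES ('b')")
dup_join_rows = fetch(join)
distinct_rows = fetch(distinct_join)
dup_subselect_rows = fetch(subselect)
conn.close()
```

If the values in t2 are known to be unique, the plain join is sufficient.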

The converse question (rows in t1 that have no match in t2) can be answered using a NOT IN( ) subselect:

SELECT col1 FROM t1 WHERE col1 NOT IN (SELECT col2 FROM t2);

That's a "find non-matching rows" query. Sometimes these can be rewritten as a LEFT JOIN, a type of join discussed in Recipe 12.6. For the case at hand, the NOT IN( ) subselect is equivalent to the following LEFT JOIN:

mysql> SELECT t1.col1 FROM t1 LEFT JOIN t2 ON t1.col1 = t2.col2
    -> WHERE t2.col2 IS NULL;
+------+
| col1 |
+------+
| a    |
+------+
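
One reason the rewrite works only "sometimes" is the treatment of NULL values. The following sketch, again using Python's sqlite3 module as a stand-in for MySQL (an assumption, chosen so that the example is self-contained), compares the two forms on the sample data and then adds a NULL to t2. Under standard SQL semantics, NOT IN matches nothing once the inner SELECT returns a NULL, whereas the LEFT JOIN form still reports the unmatched row.

```python
import sqlite3

# SQLite is used here only as a stand-in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (col1 TEXT)")
conn.execute("CREATE TABLE t2 (col2 TEXT)")
conn.executemany("INSERT INTO t1 VALUES (?)", [("a",), ("b",), ("c",)])
conn.executemany("INSERT INTO t2 VALUES (?)", [("b",), ("c",), ("d",)])

not_in = "SELECT col1 FROM t1 WHERE col1 NOT IN (SELECT col2 FROM t2)"
left_join = ("SELECT t1.col1 FROM t1 LEFT JOIN t2 ON t1.col1 = t2.col2 "
             "WHERE t2.col2 IS NULL")

def fetch(stmt):
    """Run a statement and return the first column of each row, sorted."""
    return sorted(row[0] for row in conn.execute(stmt))

# With the sample data, both forms find the single non-matching value.
not_in_rows = fetch(not_in)
left_join_rows = fetch(left_join)
print(not_in_rows == left_join_rows)

# After a NULL is added to t2, the two forms no longer agree.
conn.execute("INSERT INTO t2 VALUES (NULL)")
not_in_with_null = fetch(not_in)
left_join_with_null = fetch(left_join)
conn.close()
```

If col2 can contain NULL values, decide which behavior you want before substituting one form for the other.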

Within a program, you can simulate a subselect by working with the results of two queries. Suppose you want to simulate the IN( ) subselect that finds matching values in the two tables:

SELECT * FROM t1 WHERE col1 IN (SELECT col2 FROM t2);

If you expect that the inner SELECT will return a reasonably small number of col2 values, one way to achieve the same result as the subselect is to retrieve those values and generate an IN( ) clause that looks for them in col1. For example, the query SELECT col2 FROM t2 will produce the values b, c, and d. Using this result, you can select matching col1 values by generating a query that looks like this:

SELECT col1 FROM t1 WHERE col1 IN ('b','c','d')

That can be done as follows (shown here using Python):

cursor = conn.cursor ( )
cursor.execute ("SELECT col2 FROM t2")
if cursor.rowcount > 0:             # do nothing if there are no values
    val = [ ]                       # list to hold data values
    s = ""                          # string to hold placeholders
    # construct %s,%s,%s, ... string containing placeholders
    for (col2,) in cursor.fetchall ( ):  # pull col2 value from each row
        if s != "":
            s = s + ","             # separate placeholders by commas
        s = s + "%s"                # add placeholder
        val.append (col2)           # add value to list of values
    stmt = "SELECT col1 FROM t1 WHERE col1 IN (" + s + ")"
    cursor.execute (stmt, val)
    for (col1,) in cursor.fetchall ( ):  # pull col1 values from final result
        print (col1)
cursor.close ( )
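
The placeholder-building loop can also be written more compactly. The following is a minimal sketch; the function name build_in_query is hypothetical, and it assumes a driver that uses %s placeholders, as the code above does. It returns the statement text and the parameter list separately, so the values are still passed through the driver's own quoting rather than being interpolated into the SQL.

```python
def build_in_query(values):
    """Return (stmt, params) for a parameterized IN( ) lookup on t1.col1.

    Hypothetical helper; assumes the driver uses %s placeholders.
    """
    if not values:
        # An empty IN ( ) list is not valid SQL, so refuse to build one.
        raise ValueError("IN( ) requires at least one value")
    # one %s per value, separated by commas
    placeholders = ",".join(["%s"] * len(values))
    stmt = "SELECT col1 FROM t1 WHERE col1 IN (" + placeholders + ")"
    return stmt, list(values)

stmt, params = build_in_query(["b", "c", "d"])
print(stmt)    # SELECT col1 FROM t1 WHERE col1 IN (%s,%s,%s)
print(params)  # ['b', 'c', 'd']
```

The resulting pair would be passed to the cursor as cursor.execute (stmt, params).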

If you expect lots of col2 values, you may want to generate individual SELECT queries for each of them instead:

SELECT col1 FROM t1 WHERE col1 = 'b'
SELECT col1 FROM t1 WHERE col1 = 'c'
SELECT col1 FROM t1 WHERE col1 = 'd'

This can be done within a program as follows:

cursor = conn.cursor ( )
cursor2 = conn.cursor ( )
cursor.execute ("SELECT col2 FROM t2")
for (col2,) in cursor.fetchall ( ):       # pull col2 value from each row
    stmt = "SELECT col1 FROM t1 WHERE col1 = %s"
    cursor2.execute (stmt, (col2,))
    for (col1,) in cursor2.fetchall ( ):  # pull col1 values from final result
        print (col1)
cursor.close ( )
cursor2.close ( )

If you have so many col2 values that you don't want to construct a single huge IN( ) clause, but don't want to issue zillions of individual SELECT statements, either, another option is to combine the approaches. Break the set of col2 values into smaller groups and use each group to construct an IN( ) clause. This gives you a set of shorter queries that each look for several values:

SELECT col1 FROM t1 WHERE col1 IN (first group of col2 values)
SELECT col1 FROM t1 WHERE col1 IN (second group of col2 values)
SELECT col1 FROM t1 WHERE col1 IN (third group of col2 values)
...

This approach can be implemented as follows:

grp_size = 1000                     # number of values to select at once
cursor = conn.cursor ( )
cursor.execute ("SELECT col2 FROM t2")
if cursor.rowcount > 0:             # do nothing if there are no values
    col2 = [ ]                      # list to hold data values
    for (val,) in cursor.fetchall ( ):  # pull col2 value from each row
        col2.append (val)
    nvals = len (col2)
    i = 0
    while i < nvals:
        if nvals < i + grp_size:
            j = nvals
        else:
            j = i + grp_size
        group = col2[i : j]
        s = ""                      # string to hold placeholders
        val_list = [ ]
        # construct %s,%s,%s, ... string containing placeholders
        for val in group:
            if s != "":
                s = s + ","         # separate placeholders by commas
            s = s + "%s"            # add placeholder
            val_list.append (val)   # add value to list of values
        stmt = "SELECT col1 FROM t1 WHERE col1 IN (" + s + ")"
        print (stmt)
        cursor.execute (stmt, val_list)
        for (col1,) in cursor.fetchall ( ):  # pull col1 values from result
            print (col1)
        i = i + grp_size            # go to next group of values
cursor.close ( )
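
The grouping step on its own can be expressed with list slicing. The following is a small sketch; the function name split_into_groups is hypothetical, and the sample values stand in for whatever the inner SELECT returns. Each group it yields could then be used to build one IN( ) clause.

```python
def split_into_groups(values, grp_size):
    """Yield successive slices of values, each at most grp_size long."""
    if grp_size < 1:
        raise ValueError("grp_size must be at least 1")
    for i in range(0, len(values), grp_size):
        # A slice that runs past the end of the list is simply shorter,
        # so the final group needs no special handling.
        yield values[i:i + grp_size]

col2 = ["b", "c", "d", "e", "f", "g", "h"]   # sample values (assumed)
groups = list(split_into_groups(col2, 3))
for group in groups:
    placeholders = ",".join(["%s"] * len(group))
    print("SELECT col1 FROM t1 WHERE col1 IN (" + placeholders + ")")
```

With seven values and a group size of 3, this produces two three-value groups and a final one-value group.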

Simulating a NOT IN( ) subselect from within a program is a bit trickier than simulating an IN( ) subselect. The subselect looks like this:

SELECT col1 FROM t1 WHERE col1 NOT IN (SELECT col2 FROM t2);

The technique shown here works best for smaller numbers of col1 and col2 values, because you must hold at least the values returned by the inner SELECT in memory so that you can compare them to each value returned by the outer SELECT. The example shown here holds both sets in memory. First, retrieve the col1 and col2 values:

cursor = conn.cursor ( )
cursor.execute ("SELECT col1 FROM t1")
col1 = [ ]
for (val, ) in cursor.fetchall ( ):
    col1.append (val)
cursor.execute ("SELECT col2 FROM t2")
col2 = [ ]
for (val, ) in cursor.fetchall ( ):
    col2.append (val)
cursor.close ( )

Then check each col1 value to see whether or not it's present in the set of col2 values. If not, it satisfies the NOT IN( ) constraint of the subselect:

for val1 in col1:
    present = 0
    for val2 in col2:
        if val1 == val2:
            present = 1
            break
    if not present:
        print (val1)

The code shown here performs a lookup in the col2 values by looping through the array that holds them. You may be able to perform this operation more efficiently by using an associative data structure. For example, in Perl or Python, you could put the col2 values in a hash or dictionary. Recipe 10.29 shows an example that uses that approach.
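
As a rough illustration of the associative approach, the following sketch uses a Python set rather than a dictionary (a substitution made here because only membership matters, not associated values). The function name values_not_in is hypothetical, and the two lists stand in for the col1 and col2 values retrieved above.

```python
def values_not_in(col1_values, col2_values):
    """Return the col1 values that do not occur among the col2 values."""
    # Building the set once makes each membership test a hash lookup
    # instead of a scan through the whole col2 list.
    col2_set = set(col2_values)
    return [val for val in col1_values if val not in col2_set]

col1 = ["a", "b", "c"]   # values from SELECT col1 FROM t1
col2 = ["b", "c", "d"]   # values from SELECT col2 FROM t2
result = values_not_in(col1, col2)
print(result)            # ['a']
```

The result preserves the order of the col1 values, as the nested-loop version does.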

Yet another way to simulate subselects, at least those of the IN( ) variety, is to generate the necessary SQL from within one instance of mysql and feed it to another instance to be executed. Consider the result from this query:

mysql> SELECT CONCAT('SELECT col1 FROM t1 WHERE col1 = \'', col2, '\';')
    -> FROM t2;
+------------------------------------------------------------+
| CONCAT('SELECT col1 FROM t1 WHERE col1 = \'', col2, '\';') |
+------------------------------------------------------------+
| SELECT col1 FROM t1 WHERE col1 = 'b';                      |
| SELECT col1 FROM t1 WHERE col1 = 'c';                      |
| SELECT col1 FROM t1 WHERE col1 = 'd';                      |
+------------------------------------------------------------+

The query retrieves the col2 values from t2 and uses them to produce a set of SELECT statements that find matching col1 values in t1. If you issue that query in batch mode and suppress the column heading, mysql produces only the text of the SQL statements, not all the other fluff. You can feed that output into another instance of mysql to execute the queries. The result is the same as the subselect. Here's one way to carry out this procedure, assuming that you have the SELECT statement containing the CONCAT( ) expression stored in a file named make_select.sql:

% mysql -N cookbook < make_select.sql > tmp

Here mysql includes the -N option to suppress column headers so that they won't get written to the output file, tmp. The contents of tmp will look like this:

SELECT col1 FROM t1 WHERE col1 = 'b';
SELECT col1 FROM t1 WHERE col1 = 'c';
SELECT col1 FROM t1 WHERE col1 = 'd';

To execute the queries in that file and generate the output for the simulated subselect, use this command:

% mysql -N cookbook < tmp
b
c

This second instance of mysql also includes the -N option, because otherwise the output will include a header row for each of the SELECT statements that it executes. (Try omitting -N and see what happens.)

One significant limitation of using mysql to generate SQL statements is that it doesn't work well if your col2 values contain quotes or other special characters. In that case, the queries that this method generates would be malformed.[2]

[2] As we go to press, a QUOTE( ) function has been added to MySQL 4.0.3 that allows special characters to be escaped so that they are suitable for use in SQL statements.
