Hack 89. Choose Any Three of Five

In a one-to-many relationship, use GROUP BY to search for the existence of multiple specific rows on the many side.

Imagine a database table for storing information about job candidates (as shown here, and in Table 10-17):

create table candidates
( id integer not null primary key 
, lname varchar(50) not null
, fname varchar(50) null
, summary text not null 
);
insert into candidates 
values
 ( 1, 'Smith','John'
 ,'Senior web developer specializing in FrontPage ...' )
,( 2, 'Jones','Todd'
 ,'Shy DBA looking for remote cubicle to hide in ... ' );

Table 10-17. The candidates table
id	lname	fname	summary
1	Smith	John	Senior web developer specializing in FrontPage...
2	Jones	Todd	Shy DBA looking for a remote cubicle to hide in ...

This is a straightforward table, but it's not easy to search methodically. Add another table, this one to identify each candidate's specific skills (see Table 10-18):

create table candidate_skills
( cid integer not null 
, foreign key cid_fk ( cid ) references candidates ( id )
, skill varchar(37) not null
, primary key ( cid , skill )
, rating tinyint default 3 
);
insert into candidate_skills 
values 
 ( 1, 'FrontPage', 5 )
,( 1, 'HTML', 3 )
,( 1, 'CSS', 3 )
,( 1, 'JavaScript', 2 )
,( 1, 'Usability', 2 )
,( 1, 'DB design', 2 )
,( 1, 'MySQL', 2 )
,( 2, 'SQL Server', 4 )
,( 2, 'Oracle', 3 )
,( 2, 'HTML', 3 )
,( 2, 'DB design', 2 );

Table 10-18. The skills table
cid	skill	rating
1	FrontPage	5
1	HTML	3
1	CSS	3
1	JavaScript	2
1	Usability	2
1	DB Design	2
1	MySQL	2
2	SQL Server	4
2	Oracle	3
2	HTML	3
2	DB Design	2

Each candidate is represented in one row of the parent candidates table and one or more rows (technically zero or more, but no one would seriously enter a candidate with no skills) in the candidate_skills table. The specific candidate in the candidate_skills table is identified by the foreign key cid, which references the candidates table. This is a classic one-to-many relationship.

The primary key of candidate_skills is a composite key consisting of the cid and the skill together, which means that the same skill cannot be entered for the same candidate more than once. The rating column might be a value from 0 (none) through 5 (guru), indicating skill level.

Yes, the skill column could actually be found in its own skills table, and then the candidate_skills table could use a foreign keybut this is not necessary. Having a lookup table for skills will save some space overall (because the candidate_skills table would then have only an integer on each row where it now has a VARCHAR(37)). But then an additional join would be required just to show a candidate's skills. It is a trade-off, but having the skills table is not required in order to have a good design.

One benefit of allowing a free-form skills value in candidate_skills is that it simplifies data entry; it does not require the data entry operator to become a domain expert and be able to realize the difference between, for example, SQL Server 7 and SQL Server 2000. Using free-form skills is acceptable, and allows you to capture the exact skills as submitted.

Now consider the following requirement: find all candidates who have both DB Design and SQL Server skills.

10.13.1. A JOIN Solution

One way to solve this requirement is with JOINs for each specific skill:

select C.lname
 , C.fname
 from candidates as C
inner
 join candidate_skills as CS1
 on CS1.cid = c.id
inner
 join candidate_skills as CS2
 on CS2.cid = C.id
 where CS1.skill = 'DB Design'
 and CS2.skill like 'SQL Server%'

lname fname
Jones Todd

This query joins each candidate row to a candidate_skills row, first for one skill, and then to another candidate_skills row for the other skill. Because these are INNER JOINs, both skills are required, and only candidates with both skills are returned.

You can extend this method for as many joins as desired. Need a third skill? Join to the candidate_skills table a third time. When you need to find an exact set of specific rows in a one-to-many relationship, the JOIN approach is practical and efficient, if cumbersome.

Now consider the following requirementfind all candidates who have at least three of the following five skills: HTML, CSS, JavaScript, DB Design, and MySQL.

How in the world are you going to write joins that will accomplish this? They'd have to be LEFT OUTER JOINs, because to use INNER JOINs as we did earlier would result in only candidates with all five skills. Yet with LEFT OUTER JOINs, you would need to write a fairly complicated WHERE clause to see which joined rows actually were present (and don't contain NULLs), and that there were at least three of them.

There is an easier way.

10.13.2. A GROUP BY Solution

Here's an approach that uses GROUP BY to accomplish this:

select C.lname
 , C.fname
 from candidates as C
inner
 join candidate_skills as CS
 on CS.cid = C.id
 where CS.skill in ( 'HTML'
 , 'CSS'
 , 'JavaScript'
 , 'DB Design'
 , 'MySQL' )
group
 by C.lname
 , C.fname
having count(*) >= 3

lname fname
Smith John

Look first at the FROM clause. You're joining each candidate row to all of its candidate_skills rows, and in the WHERE clause, you accept a skill only if it's one of the five target skills (all other skills get filtered out).

Then the magic happens. Because of the GROUP BY on the candidate name columns, you get one result row per candidate, and you can now use an aggregate function to determine how many candidate_skills rows were retrieved for the candidate. Note that the query does not return the individual rows; they are retrieved, and then evaluated in the GROUP BY clause, but the query returns only an aggregate row. The aggregate COUNT function counts the retrieved rows of skills that were among the five required skills as determined by the WHERE clause, and the HAVING clause ensures that there were at least three of these during grouping; if there were, an aggregate row for that group (the candidate) is returned.

Sweet, eh?

Rudy Limeback

SQL Fundamentals

Joins, Unions, and Views

Text Handling

Date Handling

Number Crunching

Online Applications

Organizing Data

Storing Small Amounts of Data

Locking and Performance

Reporting

Users and Administration

Wider Access

Index