The Original Operators | Database in Depth: Relational Theory for Practitioners

This section gives definitions for the original set of relational operators defined by Codd: restrict, project, join, intersect, union, difference, product, and divide.

Restrict

Let bx be a boolean expression involving zero or more attribute names, such that all of the attributes mentioned are attributes of the same relation r.^[*] Then the restriction of r according to bx:

^[*] The boolean expression is subject to certain other limitations as well. See Exercise 5-6 at the end of the chapter.

      r WHERE bx

is a relation with a heading the same as that of r and a body consisting of all tuples of r for which bx evaluates to TRUE. For example:

   S WHERE CITY = 'Paris'  | SELECT S.*                           | FROM   S                           | WHERE  S.CITY = 'Paris'

As an aside, I remark that restrict is sometimes called select; I prefer not to use this term, however, because of the potential confusion with SQL's SELECT.

Project

Let relation r have attributes X, Y,. . . , Z (and possibly others). Then the projection of r on X, Y, . . . , Z:

      r { X, Y, ..., Z }

is a relation with (a) a heading derived from the heading of r by removing all attributes not mentioned in the set {X,Y, . . . ,Z} and (b) a body consisting of all tuples {X x,Y y, . . . ,Z z} such that a tuple appears in r with X value x, Y value y, . . ., and Z value z. For example:

   S { SNAME, CITY, STATUS }  | SELECT DISTINCT S.SNAME,                              |        S.CITY, S.STATUS                              | FROM   S

To repeat, the result is a relation; thus, "duplicates are eliminated," to use the common phrase, and that DISTINCT in the SQL formulation is really needed, therefore.

By the way, Tutorial D also allows a projection to be expressed in terms of the attributes to be discarded instead of the ones to be kept. Thus, for example, the Tutorial D expressions:

   S { SNAME, CITY, STATUS }

and:

   S { ALL BUT SNO }

are equivalent. This feature can save a lot of writing (think of projecting a relation of degree 100 on 99 of its attributes). An analogous remark applies to SUMMARIZE (BY form only) and GROUP (see the later sections "Extend and Summarize" and "Group and Ungroup," respectively).

In concrete syntax, it turns out to be convenient to assign high precedence to the projection operator. In Tutorial D, for example, we take this expression:

   S JOIN P { PNO, CITY }

to mean:

   S JOIN ( P { PNO, CITY } )

and not:

   ( S JOIN P ) { PNO, CITY }

As an exercise, show the difference between these two interpretations, given our usual sample data.

Join

Let relations r and s have attributes X1, X2, . . . , Xm, Y1, Y2, ..., Yn and Y1, Y2, . . . , Yn, Z1, Z2, . . . , Zp, respectively; in other words, the Y's (n of them) are the common attributes, the X's (m of them) are the other attributes of r, and the Z's (p of them) are the other attributes of s. We can assume without loss of generality that none of the X's has the same name as any of the Z's, thanks to the availability of RENAME. Now let the X's taken together be denoted just X, and similarly for the Y's and the Z's. Then the natural join ^[*] (join for short) of r and s:

^[*] I'll discuss other kinds of joins in the next section.

      r JOIN s

is a relation with (a) a heading that is the (set-theoretic) union of the headings of r and s and (b) a body consisting of the set of all tuples t such that t is the (set-theoretic) union of a tuple appearing in r and a tuple appearing in s. In other words, the heading is {X,Y,Z} and the body consists of all tuples {X x,Y y,Z z} such that a tuple appears in r with X value x and Y value y and a tuple appears in s with Y value y and Z value z. Here again is the example from the introductory section:

   P JOIN S  | SELECT P.PNO, P.PNAME, P.COLOR, P.WEIGHT, P.CITY,             |        S.SNO, S.SNAME, S.STATUS             | FROM   P, S             | WHERE  P.CITY = S.CITY

Now, I'm quite sure you were already familiar with what I've said so far regarding join, but you might not be so familiar with the following points. First of all, I remind you that the SQL standard allows the join of parts and suppliers on cities to be expressed in an alternative style that's a little closer to that of Tutorial D (and this time I deliberately replace that long list of column references in the SELECT clause by a simple "*"):

   SELECT *   FROM   P NATURAL JOIN S

However, not all SQL products actually support this syntax.

Second, and more important, intersection is a special case of join. To be specific, if m = p = 0 (meaning there are no X's and no Z's, and r and s are thus of the same type), r JOIN s degenerates to r INTERSECT s (see the next subsection "Intersect").

Third (also important), cartesian product is a special case of join, too: if n = 0, meaning there are no Y's and r and s thus have no common attributes at all, r JOIN s degenerates to r TIMES s (see the later subsection "Cartesian Product"). Why? Because (a) the set of attributes common to r and s in this case is the empty set of attributes; (b) every possible tuple has the same value for the empty set of attributes (namely, the 0-tuple); thus, (c) every tuple in r joins to every tuple in s, and so we get the cartesian product as stated.

Fourth, it turns out in practice that many queries that require join really require an extended form of that operator called semijoin. (You might not have heard of semijoin before, but in fact it's quite important.) Here's the definition: the semijoin of r with s is the join of r and s, projected back on to the attributes of r. By way of example, consider the query "Get suppliers who supply at least one part":

   S SEMIJOIN SP  | SELECT DISTINCT S.*                  | FROM   S, SP                  | WHERE  S.SNO = SP.SNO

As you can see, this query can be thought of as asking for just those suppliers that match at least one shipment. Tutorial D therefore provides a more user-friendly spelling for SEMIJOIN that allows the query to be expressed thus:

   S MATCHING SP

Now observe what happens to r JOIN s if p = 0 (meaning the heading of s is a subset of that of r-- equivalently, every attribute of s is an attribute of r). As I hope you can see, r JOIN s degenerates to r MATCHING s in this case. Likewise, if m = 0, r JOIN s degenerates to s MATCHING r (note that r MATCHING s and s MATCHING r are different, in general).

Fifth, join is fundamentally a dyadic operator, meaning it takes two operands. However, it's possible, and useful, to support an n-adic or prefix version of the operator (and Tutorial D does), according to which we can write expressions of the form:

   JOIN { r, s, ..., w }

to join any number of relations r, s, . . . ,w. For example, the join of parts and suppliers could alternatively be expressed as follows:

   JOIN { P, S }

What's more, we can use this syntax to ask for "joins" of just a single relation, or even of no relations at all! The join of a single relation, JOIN{r}, is just r itself; this case is perhaps not of much practical importance. Perhaps surprisingly, however, the join of no relations at all, JOIN{}, is very important indeed! and the result is TABLE_DEE. (Recall that TABLE_DEE is the unique relation with no attributes and one tuple.) Why is the result TABLE_DEE? Well, consider the following:

In ordinary arithmetic, 0 is the identity with respect to "+"; that is, for all numbers x, the expressions x + 0 and 0 + x are both identically equal to x. As a consequence, the sum of no numbers is 0. (To see that this claim is reasonable, consider a piece of code that computes the sum of n numbers by initializing the sum to 0 and then iterating over those n numbers. What happens if n = 0?)
In like manner, 1 is the identity with respect to "*"; that is, for all numbers x, the expressions x * 1 and 1 * x are both identically equal to x. As a consequence, the product of no numbers is 1.
In the relational algebra, TABLE_DEE is the identity with respect to JOIN; that is, the join of any relation r with TABLE_DEE is simply r. As a consequence, the join of no relations is TABLE_DEE.

If you're having difficulty with this idea, don't worry about it too much for now. But if you come back to reread this section later, I suggest you try to convince yourself then that r JOIN TABLE_DEE and TABLE_DEE JOIN r are indeed both identically equal to r. It might help if I point out that the joins in question are actually cartesian products (right?).

Intersect

Let relations r and s be of the same type. Then the intersection of those relations, r INTERSECT s, is a relation of the same type, with a body consisting of all tuples t such that t appears in both r and s. For example:

   S { CITY }    | SELECT DISTINCT S.CITY     INTERSECT   | FROM   S     P { CITY }  | INTERSECT                 | SELECT DISTINCT P.CITY                 | FROM   P

(Actually we don't need those DISTINCTs in the SQL version, but I prefer to include them for explicitness. See the upcoming discussion of union.)

As we've already seen, intersect is really just a special case of join. Tutorial D and SQL both support it, however, if only for psychological reasons. Tutorial D also supports an n-adic or prefix form, but I'll skip the details here.

Union

Again, let relations r and s be of the same type. Then the union of those relations, r UNION s, is a relation of the same type, with a body consisting of all tuples t such that t appears in r or s or both. For example:

   S { CITY } UNION P { CITY }  | SELECT DISTINCT S.CITY                                | FROM   S                                | UNION  DISTINCT                                | SELECT DISTINCT P.CITY                                | FROM   P

As with projection, it's worth noting explicitly in connection with union that "duplicates are eliminated." In the SQL version, however, that DISTINCT after the keyword UNION is not strictly needed; although UNION provides the same options as SELECT does (DISTINCT versus ALL), the default for UNION is DISTINCT, not ALL. (It's the other way around for SELECT, as you'll recall from Chapter 3.) As a consequence, the DISTINCTs following the two SELECTs aren't needed, either, though of course they're not wrong. Analogous remarks apply to intersection (also to difference, which we'll get to in a moment).

Tutorial D also supports disjoint union (D_UNION), a version of union that requires its operands to be disjoint, which means they have no tuples in common. For example:

   S { CITY } D_UNION P { CITY }

Given our usual sample data, this expression will produce a runtime error because supplier cities and part cities aren't disjoint. An SQL counterpart might look like this:

   SELECT *   FROM ( SELECT S.CITY FROM S          UNION          SELECT P.CITY FROM P ) AS POINTLESS   WHERE  NOT EXISTS        ( SELECT S.CITY FROM S          INTERSECT          SELECT P.CITY FROM P )

There's a difference, though: if supplier cities and part cities aren't disjoint, the SQL expression won't fail at run time, but will simply return an empty result. I should also mention that not all SQL products allow subqueries (as they're called) to be nested inside the FROM clause in the manner shown, though the SQL standard does.

NOTE

The correlation name POINTLESS in the foregoing expression is indeed pointless notice that it's never referenced! but it's required by the standard's syntax rules.

Tutorial D also supports n-adic or prefix forms of both UNION and D_UNION. I'll skip the details here.

Difference

Again let relations r and s be of the same type. Then the difference between those relations, r MINUS s (in that order), is a relation of the same type, with a body consisting of all tuples t such that t appears in r and not s. For example:

   S { CITY } MINUS P { CITY }  | SELECT S.CITY                                | FROM   S                                | EXCEPT                                | SELECT P.CITY                                | FROM   P

Recall now that there's an operator related to join called semijoin. Well, it turns out there's a semidifference operator, too, but in this case the operators aren't simply "related" rather, regular difference is a special case of semidifference ("all differences are semidifferences," you might say). And if semijoin is in some ways more important than join, a similar remark applies here also, but with even more force in practice, most queries that require difference really require semidifference. Here's the definition: the semidifference between r and s is the difference between r and r MATCHING s; that is, r SEMIMINUS s is equivalent to r MINUS (r MATCHING s). By way of example, consider the query "Get suppliers who supply no parts at all":

   S SEMIMINUS SP  | SELECT S.*                   | FROM   S                   | EXCEPT                   | SELECT S.*                   | FROM   S, SP                   | WHERE  S.SNO = SP.SNO

Tutorial D also provides a more user-friendly spelling that allows this query to be expressed thus:

   S NOT MATCHING SP

To see that regular difference is a special case of semidifference, consider what happens if r and s are of the same type (I'll leave the details as an exercise).

Cartesian Product

I include this operator mainly for completeness; as we've seen, it's really just a special case of join, and as a matter of fact Tutorial D doesn't support it directly but let's suppose it did, just for the sake of discussion. Clearly, then, the operands must have no common attribute names (what would happen otherwise?), and the result heading is the set-theoretic union of the input headings. Here's an example:

   ( S RENAME ( CITY AS SCITY ) )   TIMES   ( P RENAME ( CITY AS PCITY ) )

SQL analog:

   SELECT S.SNO, S.SNAME, S.STATUS, S.CITY AS SCITY        P.PNO, P.PNAME, P.COLOR, P.WEIGHT, P.CITY AS PCITY   FROM   S, P

Divide

As promised in the introduction to this chapter, I'll give a precise definition of this operator here, but for several reasons I don't want to discuss it in much detail. The most important of those reasons is that any query that can be formulated in terms of divide can alternatively, and I think more simply, be formulated in terms of relational comparisons (I'll discuss relational comparisons later in this chapter). You might therefore want to skip the following discussion, at least on a first reading.

Another reason I don't want to get into too much detail is that there are several different "divide" operators anyway! That is, there are, unfortunately, several different operators all called "divide," and I certainly don't want to explain all of them. Instead, I'll limit my attention here to the original and simplest one, which can be defined as follows. Let relations r and s be such that the heading of s is a subset of the heading of r. Then the division of r by s, r DIVIDEBY s,^[*] is shorthand for the following:

^[*] Tutorial D doesn't directly support "the original and simplest" divide, and this is thus not valid Tutorial D syntax.

   r { X } MINUS ( ( r { X } TIMES s ) MINUS r ) { X }

X here is the (set-theoretic) difference between the heading of r and that of s. Thus, for example, the expression:

   SP { SNO, PNO } DIVIDEBY P { PNO }

(given our usual sample data values) yields:

(loosely, supplier numbers for suppliers who supply all parts; I'll explain the reason for that qualifier "loosely" in the section "Relational Comparisons," later in this chapter). Here's an SQL analog:

   SELECT DISTINCT SPX.SNO   FROM   SP AS SPX   WHERE  NOT EXISTS      ( SELECT P.PNO        FROM   P        WHERE  NOT EXISTS         ( SELECT SPY.SNO           FROM   SP AS SPY           WHERE  SPY.SNO = SPX.SNO           AND  SPY.PNO = P.PNO ) )

By the way, have you ever wondered why divide is called divide? The reason is that if r and s are relations with no common attributes, and we form the cartesian product r TIMES s and then divide the result by s, we get back to r;^[*] in other words, cartesian product and divide are inverses of each other, in a sense.

^[*] So long as s isn't empty. What happens if it is?

Which Operators Are Primitive?

As I've more or less said already, not all of the operators we've been discussing are primitive several of them can be defined in terms of others. One possible primitive set (not the only one) is that consisting of restrict, project, join, union, and semidifference. Note: You might be surprised not to see rename in this list. In fact, however, rename isn't primitive, though I haven't covered enough groundwork yet to show why not. What this example shows, however, is that there's a difference between being primitive and being useful! I certainly wouldn't want to be without our useful rename operator, even if it isn't primitive.