Expression Transformation | Database in Depth: Relational Theory for Practitioners

I've now covered all of the algebraic operators I want to discuss in any detail, but there are some related topics that merit attention in this chapter. The first is the possibility of transforming a given relational expression into another, logically equivalent expression. I mentioned this possibility in Chapter 3, in the section "Why Duplicate Tuples Are Prohibited," where I explained that such transformations are one of the things the optimizer does; in fact, such transformations constitute one of the two great ideas at the heart of relational optimization (the other, beyond the scope of this book, is the use of "database statistics" to do what's called cost-based optimizing). In this section, I want to say a little more about the process of expression transformation (or query rewrite, as it's sometimes called). I'll start with a trivial example. Consider the following expression (the query is "Get suppliers who supply part P2, together with the corresponding quantities"):

   ( ( S JOIN SP ) WHERE PNO = PNO('P2') ) { ALL BUT PNO }

Suppose there are 100 suppliers and 1,000,000 shipments, of which 500 are for part P2. If the expression is simply evaluated by brute force, as it were, without any optimization at all, the sequence of events is:

Join S and SP: This step involves reading the 100 supplier tuples; reading the 1,000,000 shipment tuples 100 times each, once for each of the 100 suppliers; constructing an intermediate result consisting of 1,000,000 tuples; and writing those 1,000,000 tuples back out to the disk. (I'm assuming for simplicity that tuples are physically stored as such, and I'm also assuming I can take "number of tuple I/O's" as a reasonable measure of performance. Neither of these assumptions is realistic, but this fact doesn't materially affect my argument.)
Restrict the result of Step 1: This step involves reading 1,000,000 tuples but produces a result containing only 500 tuples, which I'll assume can be kept in main memory.
Project the result of Step 2: This step involves no tuple I/O's at all, so we can ignore it.

The following procedure is equivalent to the one just described, in the sense that it produces the same final result, but is obviously much more efficient:

Restrict SP to just the tuples for part P2: This step involves reading 1,000,000 shipment tuples but produces a result containing only 500 tuples, which can be kept in main memory.
Join S and the result of Step 1: This step involves reading 100 supplier tuples (once only, not once per P2 shipment, because all the P2 shipments are in memory). The result contains 500 tuples (still in main memory).
Project the result of Step 2: Again we can ignore this step.

The first of these two procedures involves a total of 102,000,100 tuple I/O's, whereas the second involves only 1,000,100; thus, it's clear that the second procedure is over 100 times faster than the first. It's also clear that we'd like the implementation to use the second procedure rather than the first! If it does, then it is (in effect) transforming the original expression:

   ( S JOIN SP ) WHERE PNO = PNO('P2')

(I'm ignoring the final projection now, since it isn't really relevant to the argument) into the expression:

   S JOIN ( SP WHERE PNO = PNO('P2') )

These two expressions are logically equivalent, but they have very different performance characteristics, as we've seen. If the system is presented with the first expression, therefore, we'd like it to transform it into the second before evaluating it and of course it can. The point is that the relational algebra, being a high-level formalism, is subject to various formal transformation laws; for example, there's a law that says a join followed by a restriction can be transformed into a restriction followed by a join (I was using that particular law in the example). And a good optimizer will know those laws, and will apply them because, of course, the performance of a query shouldn't depend on the specific syntax used to express that query in the first place. (Of course, it's an immediate consequence of the fact that not all of the algebraic operators are primitive that certain expressions can be transformed into others for example, an expression involving intersect can be transformed into one involving join instead but there's much more to it than that, as we'll quickly see.)

Now, there are many possible transformation laws, and this isn't the place for an exhaustive discussion. All I want to do is highlight a few important cases and key points. First, the law mentioned in the previous paragraph is actually a special case of a more general law, called the distributive law. In general, the monadic operator f distributes over the dyadic operator g if f(g(a,b)) = g(f(a),f(b)) for all a and b. In ordinary arithmetic, for example, SQRT (square root) distributes over multiplication, because:

   SQRT ( a * b ) = SQRT ( a ) * SQRT ( b)

for all a and b (take f as SQRT and g as "*"); thus, a numeric expression optimizer can always replace either of these expressions by the other when doing numeric expression transformation. As a counterexample, SQRT does not distribute over addition, because the square root of a+b is not equal to the sum of the square roots of a and b, in general.

In relational algebra, restrict distributes over intersect, union, and difference. It also distributes over join, provided the restriction condition consists at its most complex of the AND of two separate conditions, one for each of the two join operands. In the case of the example discussed above, this requirement was satisfied in fact, the restriction condition was very simple and applied to just one of the operands and so we were able to use the distributive law to replace the expression with a more efficient equivalent. The net effect was that we were able to "do the restriction early." Doing restrictions early is almost always a good idea, because it serves to reduce the number of tuples to be scanned in the next operation in sequence, and probably reduces the number of tuples in the output from that operation too.

Here are some other specific cases of the distributive law, this time involving projection. First, project distributes over intersect and union, though not difference. Second, it also distributes over join, so long as all of the joining attributes are included in the projection. These laws can be used to "do projections early," which again is usually a good idea, for reasons similar to those given earlier for restrictions.

Two more important general laws are the laws of commutativity and associativity:

The dyadic operator g is commutative if g(a,b) = g(b,a) for all a and b. In ordinary arithmetic, for example, addition and multiplication are commutative, but division and subtraction aren't. In relational algebra, intersect, union, and join are all commutative,^[*] but difference and division aren't. So, for example, if a query involves a join of two relations r and s,the commutative law tells us it doesn't matter which of r and s is taken as the "outer" relation and which the "inner." The system is therefore free to choose (say) the smaller relation as the "outer" one in computing the join.
^[*] Strictly speaking, join is not commutative in SQL, because SQL attaches significance to the left-to-right ordering of columns, and r JOIN s and s JOIN r thus differ in their result column orderings.
The dyadic operator g is associative if g(a,g(b,c)) = g(g(a,b),c) for all a, b, c. In arithmetic, addition and multiplication are associative, but subtraction and division aren't. In relational algebra, intersect, union, and join are all associative, but difference and division aren't. So, for example, if a query involves a join of three relations r, s, and u, the associative and commutative laws together tell us we can join the relations pairwise in any order we like. (They also tell us it's legitimate to define an n-adic or prefix version of the operator, as Tutorial D does.) The system is thus free to decide which of the various possible sequences is most efficient.

Observe that all of these transformations can be performed without any regard for either actual data values or actual storage structures (indexes and the like) in the database as physically stored. In other words, such transformations represent optimizations that are virtually guaranteed to be good, regardless of what the database looks like physically.