I've now covered all of the algebraic operators I want to discuss in any detail, but there are some related topics that merit attention in this chapter. The first is the possibility of transforming a given relational expression into another, logically equivalent expression. I mentioned this possibility in Chapter 3, in the section "Why Duplicate Tuples Are Prohibited," where I explained that such transformations are one of the things the optimizer does; in fact, such transformations constitute one of the two great ideas at the heart of relational optimization (the other, beyond the scope of this book, is the use of "database statistics" to do what's called cost-based optimizing). In this section, I want to say a little more about the process of expression transformation (or query rewrite, as it's sometimes called). I'll start with a trivial example. Consider the following expression (the query is "Get suppliers who supply part P2, together with the corresponding quantities"): ( ( S JOIN SP ) WHERE PNO = PNO('P2') ) { ALL BUT PNO } Suppose there are 100 suppliers and 1,000,000 shipments, of which 500 are for part P2. If the expression is simply evaluated by brute force, as it were, without any optimization at all, the sequence of events is:
The following procedure is equivalent to the one just described, in the sense that it produces the same final result, but is obviously much more efficient:
The first of these two procedures involves a total of 102,000,100 tuple I/O's, whereas the second involves only 1,000,100; thus, it's clear that the second procedure is over 100 times faster than the first. It's also clear that we'd like the implementation to use the second procedure rather than the first! If it does, then it is (in effect) transforming the original expression: ( S JOIN SP ) WHERE PNO = PNO('P2') (I'm ignoring the final projection now, since it isn't really relevant to the argument) into the expression: S JOIN ( SP WHERE PNO = PNO('P2') ) These two expressions are logically equivalent, but they have very different performance characteristics, as we've seen. If the system is presented with the first expression, therefore, we'd like it to transform it into the second before evaluating it and of course it can. The point is that the relational algebra, being a high-level formalism, is subject to various formal transformation laws; for example, there's a law that says a join followed by a restriction can be transformed into a restriction followed by a join (I was using that particular law in the example). And a good optimizer will know those laws, and will apply them because, of course, the performance of a query shouldn't depend on the specific syntax used to express that query in the first place. (Of course, it's an immediate consequence of the fact that not all of the algebraic operators are primitive that certain expressions can be transformed into others for example, an expression involving intersect can be transformed into one involving join instead but there's much more to it than that, as we'll quickly see.) Now, there are many possible transformation laws, and this isn't the place for an exhaustive discussion. All I want to do is highlight a few important cases and key points. First, the law mentioned in the previous paragraph is actually a special case of a more general law, called the distributive law. In general, the monadic operator f distributes over the dyadic operator g if f(g(a,b)) = g(f(a),f(b)) for all a and b. In ordinary arithmetic, for example, SQRT (square root) distributes over multiplication, because: SQRT ( a * b ) = SQRT ( a ) * SQRT ( b) for all a and b (take f as SQRT and g as "*"); thus, a numeric expression optimizer can always replace either of these expressions by the other when doing numeric expression transformation. As a counterexample, SQRT does not distribute over addition, because the square root of a+b is not equal to the sum of the square roots of a and b, in general. In relational algebra, restrict distributes over intersect, union, and difference. It also distributes over join, provided the restriction condition consists at its most complex of the AND of two separate conditions, one for each of the two join operands. In the case of the example discussed above, this requirement was satisfied in fact, the restriction condition was very simple and applied to just one of the operands and so we were able to use the distributive law to replace the expression with a more efficient equivalent. The net effect was that we were able to "do the restriction early." Doing restrictions early is almost always a good idea, because it serves to reduce the number of tuples to be scanned in the next operation in sequence, and probably reduces the number of tuples in the output from that operation too. Here are some other specific cases of the distributive law, this time involving projection. First, project distributes over intersect and union, though not difference. Second, it also distributes over join, so long as all of the joining attributes are included in the projection. These laws can be used to "do projections early," which again is usually a good idea, for reasons similar to those given earlier for restrictions. Two more important general laws are the laws of commutativity and associativity:
Observe that all of these transformations can be performed without any regard for either actual data values or actual storage structures (indexes and the like) in the database as physically stored. In other words, such transformations represent optimizations that are virtually guaranteed to be good, regardless of what the database looks like physically. |