Some Remarks on Physical Design | Database in Depth: Relational Theory for Practitioners

The relational model has nothing to say about physical design. But there are still some things that can usefully be said about physical design in a relational context things that are at least implied by the model, even though they aren't stated explicitly (and even though the details of physical design are, of necessity, somewhat DBMS-specific and vary from system to system).

The first point is that physical design should follow logical design. That is, the "right" approach is to do a clean logical design first, and then, as a follow-on step, to map that logical design into whatever physical structures the target DBMS happens to support. Equivalently, the physical design should be derived from the logical design and not the other way around. Ideally, in fact, the system should be able to derive an optimal physical design for itself, without any need for human involvement at all. (This goal isn't as far-fetched as it might sound. I'll say a little more about it later in this section.)

As for my second point: we saw in Chapter 1 that one reason for excluding physical issues of all kinds from the relational model was to give implementers the freedom to implement the model in any way they liked and here, I think, the widespread lack of understanding of the model has really hurt us. Certainly most SQL products have failed to live up to the model's full potential in this respect; in those products, what the user sees and what's physically stored are essentially the same thing. In other words, what's physically stored is effectively just a direct image of what the user logically sees, as Figure 7-9 suggests. (I realize these remarks are oversimplified, but they're true enough for present purposes.)

Figure 7-9. Direct-image implementation (deprecated)

Now, there are many things wrong with this direct-image style of implementation, far too many to discuss in detail here. But the overriding point is that it provides almost no data independence: if we have to change the physical design (for performance reasons, of course), we have to change the logical design too. In particular, it accounts for the argument, so often heard, to the effect that we have to "denormalize for performance." In principle, logical design has absolutely nothing to do with performance at all; but if the logical design maps one-to-one to the physical design . . . well, the conclusion is obvious. Surely we can do better than this. Relational advocates have argued for years that the relational model doesn't have to be implemented this way. And indeed it doesn't; all being well, a brand-new implementation technology is due to appear soon that addresses all of the problems of the direct-image scheme. That technology is called The TransRelational? Model. Since it is an implementation technology, the details are beyond the scope of this book; you can find a preliminary description in my book An Introduction to Database Systems, Eighth Edition (Addison-Wesley, 2004). All I want to do here is point out a few desirable consequences of having an implementation that does keep the logical and physical levels rigidly and properly separate.

First, we would never need to "denormalize for performance" at all (at the logical level, I mean); all relvars could be in 5NF, or even 6NF, without any performance penalty. The logical design really would have no performance implications at all.

Second, 6NF offers a basis for a truly relational way of dealing with the problem of missing information (I mean, a way that doesn't involve nulls and three-valued logic). If you use nulls, you're effectively making the database state explicitly that there's something you don't know. But if you don't know something, it's much better to say nothing at all! To quote Wittgenstein: Wovon man nicht reden kann, darüber muss man schweigen ("Whereof one cannot speak, thereon one must remain silent"). For example, suppose for simplicity that there are just two suppliers right now, S1 and S2, and we know the status for S1 but not for S2. A 6NF design for this situation might look as shown in Figure 7-10.

Figure 7-10. The status for supplier S2 is unknown

Of course, there's a lot more that could and should be said about this approach to missing information, but this isn't the place. Here I just want to stress the point that with this design, we don't have a "tuple" showing supplier S2's status as "null" we don't have a tuple showing supplier S2's status at all.

Finally, in the kind of system I'm sketching here, it really would be possible for the system to derive the optimal physical design from the logical design automatically, with little or no involvement on the part of any human designer. Space considerations among other things mean I can't provide evidence here to support this claim, but I stand by it.