Domain-Constrained Comparisons | Database in Depth: Relational Theory for Practitioners

Everyone knows (or should know!) that, in the relational model, two values can be tested for equality only if they come from the same domain. In the case of suppliers and parts, for example, the following comparison which might be part of a WHERE clause is obviously valid:

     SP.SNO = S.SNO         /* OK     */

By contrast, the following one is not:

     SP.PNO = S.SNO         /* not OK */

The reason is that part numbers and supplier numbers are different kinds of things, and they correspond to different domains. So the general idea is that the database management system (DBMS)^[*] should reject any attempt to perform any relational operation join, union, divide, or whatever that calls, explicitly or implicitly, for a comparison between values from different domains. For example, here's an SQL query where the user is trying to find suppliers who supply no parts:

^[*] By the way, what's the difference between a DBMS and a database? (This isn't an idle question, because the industry very commonly uses the term database when it means either some DBMS product, such as Oracle, or the particular copy of such a product that happens to be installed on a particular computer. The problem is, if you call the DBMS a database, what do you call the database?)

     SELECT S.SNO, S.SNAME, S.STATUS, S.CITY     FROM   S     WHERE  NOT EXISTS          ( SELECT SP.PNO            FROM   SP            WHERE  SP.PNO = S.SNO )      /* not OK */

(There's no terminating semicolon because this is an expression, not a statement. See Exercise 2-24 at the end of the chapter.)

As the comment says, this query is not OK. The reason is that, in the last line, the user presumably meant to say WHERE SP.SNO = S.SNO, but by mistake probably just a slip of the typing fingers he or she said WHERE SP.PNO = S.SNO instead. And, given that we're indeed talking about a simple typo (probably), it would be a friendly act on the part of the DBMS to interrupt at this point, highlight the error, and ask if the user would like to correct it before proceeding.

Now, I don't know any commercial product that actually behaves in the way I've just suggested; in today's products, depending on how you've set up the database, either the query will simply fail or it'll give the wrong answer. Well . . . not exactly the wrong answer, perhaps, but the right answer to the wrong question. (Does that make you feel any better?)

To repeat, therefore, the DBMS should reject a comparison like SP.PNO = S.SNO if it isn't valid. Codd proposed, however, that there should be a way for the user to force the DBMS to go ahead and do the comparison anyway, even if it isn't valid, on the grounds that sometimes the user will know more than the DBMS does. Now, it's a little hard for me to do justice to this proposal, because I don't agree with it but let me try.

Suppose it's your job to design a database involving, say, customers and suppliers, and you decide therefore to have a domain of customer numbers and another domain of supplier numbers. You build your database that way and load it, and everything works just fine for a year or two. Then, one day, one of your users comes along with a query you never heard before, namely: "Are any of our customers also suppliers to us?" Observe that this is a perfectly reasonable query; observe too that it might involve a comparison between a customer number and a supplier number (a cross-domain comparison), to see if they're equal. And if it does, well, certainly the system mustn't prevent you from doing it; certainly the system mustn't prevent you from asking a reasonable query.

In light of the foregoing, Codd proposed what he called domain check override (DCO) versions of certain of his algebraic operators. A DCO version of join, for example, would perform the join even if the joining attributes were defined on different domains. In SQL terms, we might imagine this proposal being realized by means of a new clause, IGNORE DOMAIN CHECKS, that could be included in an SQL query, as here:

     SELECT ...     FROM   ...     WHERE  CUSTNO = SNO     IGNORE DOMAIN CHECKS

And this new clause would be separately authorizable most users wouldn't be allowed to use it at all; perhaps only the database administrator (DBA) would be allowed to use it.

Before analyzing the DCO idea in detail, I want to look at a simpler example. Consider the following two queries:

     SELECT ...                 |   SELECT ...     FROM   P, SP               |   FROM   P, SP     WHERE  P.WEIGHT = SP.QTY   |   WHERE  P.WEIGHT - SP.QTY = 0

Assuming, reasonably enough, that weights and quantities are defined on different domains, the query on the left is clearly invalid. But what about the one on the right? According to Codd, that one's valid! In his book The Relational Model for Database Management Version 2(Addison-Wesley, 1990), he says that in such a situation "the DBMS [merely] checks that the basic data types are the same"; in the case at hand, the "basic data types" are all numbers, loosely speaking, and so the check succeeds.

To me, this conclusion seems unreasonable. The semantics of an expression should not depend on the arbitrary choice of syntax we use to formulate it! Thus, I believe the expressions P.WEIGHT = SP.QTY and P.WEIGHT - SP.QTY = 0 must either both be valid or both be invalid; the suggestion that they have different semantics is unacceptable. So it seems to me there's something strange about Codd-style domain checks in the first place, before we even get to "domain check override." (In essence, in fact, Codd-style domain checks apply only in the very special case where both comparands are specified as relational attributes and not as anything else, such as an operational expression like P.WEIGHT - SP.QTY.)

Let's look at some even simpler examples. Consider the following comparisons (each of which might appear as part of an SQL WHERE clause, for example):

     S.SNO = 'X4'         P.PNO = 'X4'         S.SNO = P.PNO

I hope you agree that it's at least possible that the first two are valid and the third not. But if so, then I hope you also agree there's something strange going on; apparently, we can have three values a, b, and c such that a = c is true and b = c is true, but as for a = b . . . well, we can't even do the comparison, let alone have it come out true! So what is going on?

I return now to the fact that attributes S.SNO and P.PNO are defined on domains SNO and PNO, respectively, and to my claim that domains are really types; in fact, I said in the introduction that domains SNO and PNO in particular were user-defined types. Now, it's likely that both types are physically represented in terms of the built-in type CHAR but physical representations are part of the implementation, not the model; they're irrelevant to the user, and in fact they're hidden from the user (or should be), as we saw in Chapter 1. In particular, the operators that apply to supplier numbers and part numbers are the operators defined in connection with those types, not the operators that happen to be defined in connection with type CHAR. For example, we can concatenate two character strings, but we probably can't concatenate two supplier numbers (we could do this latter only if concatenation was an operator defined in connection with type SNO).

Now, when we define a type, one operator we must define is what's called a selector operator, which allows us to select, or specify, an arbitrary value of the type in question.^[*] The selector for type SNO, for example (which, as we'll see in Chapter 6, will probably also be called SNO), allows us to select the particular SNO value that has some specified CHAR representation. Here's an example:

^[*] This observation is valid regardless of whether we're in an SQL context (as in the present discussion) or otherwise. I omit the details of what's involved in defining selectors in SQL, since they're a little messy but I'll assume, here and throughout this book, that such operators have indeed been defined.

     SNO('S1')

This expression is an invocation of the SNO selector, and it returns a certain supplier number: namely, the one that's conceptually represented by the character string value 'S1'. Likewise, the expression:

     PNO('P1')

is an invocation of the PNO selector, and it returns a certain part number: namely, the one that's conceptually represented by the character string value 'P1'. As you can see, therefore, the SNO and PNO selectors effectively work by converting a certain CHAR value to a certain SNO value and a certain PNO value, respectively.

Returning now to the comparison S.SNO = 'X4': What happens here is that the system notices that the left and right comparands are of different types (SNO and CHAR, to be specific). Since they're of different types, they certainly can't be equal. However, the system also knows there's an operator the SNO selector that effectively performs CHAR-to-SNO conversions. So it can invoke that operator implicitly to convert the right comparand to a supplier number, thereby effectively replacing the original comparison with this one:

     S.SNO = SNO('X4')

Now we're comparing two supplier numbers, which is legitimate.

In the same kind of way, the system effectively replaces the comparison P.PNO = 'X4' with this one:

     P.PNO = PNO('X4')

But in the case of the comparison S.SNO = P.PNO, there's no conversion operator known to the system (at least, let's assume not) that will convert a supplier number to a part number or the other way around, and so the comparison fails on a type error: the comparands are of different types, and there's no way to make them be of the same type.

Terminology: Implicit conversion is often called coercion in the literature. In the first example, therefore, the character string 'X4' is coerced to type SNO; in the second it's coerced to type PNO.

To continue with the example, another operator you must define when you define a type like SNO or PNO is what's called, generically, a THE_ operator, which effectively converts a given SNO or PNO value to the character string (or whatever else it is) that's used to represent it.^[*] Assume for the sake of the example, not unreasonably, that the THE_ operators for types SNO and PNO are both called THE_CHAR. Then, if we really did want to compare S.SNO and P.PNO, the only sense I can make of that requirement is that we want to see if the character-string representations are the same, which we might do like this:

^[*] Again this observation is valid regardless of whether we're in an SQL context or some other context, though SQL doesn't use the terminology of "THE_ operators" as such. (Actually, it doesn't use the terminology of "selectors" either.)

     THE_CHAR ( S.SNO ) = THE_CHAR ( P.PNO )

In other words: convert the supplier number to a string, convert the part number to a string, and compare the two strings.

As I'm sure you can see, the mechanism I've sketched above effectively provides both (a) the domain checking we want, in the first place, and (b) a way of overriding that checking when we want, in the second place. Moreover, it does all this in a clean, fully orthogonal, non ad hoc manner. By contrast, "domain check override" doesn't really do the job; in fact, it doesn't really make sense at all, because it confuses types and representations (as noted previously, types are a model concept, representations are an implementation concept).

Now, you might have realized that what I'm really talking about is here is what's known in language circles as strong typing. Different writers have slightly different definitions for this term, but basically it means that (a) everything in particular, every value and every variable has a type, and (b) whenever we try to perform some operation, the system checks that the operands are of the right types for the operation in question.^[*] Observe too that this mechanism works for any operation, not just for the comparison operations we've been discussing; the emphasis on comparison operations in discussions of domain checking is sanctified by historical usage but is in fact misplaced. For example, consider the following expressions:

^[*] Or, possibly, are coercible to those right types. For reasons not directly connected with the current topic, however, I would argue that all type conversions should be explicit. Coercions are a well-known source of error.

     P.WEIGHT * SP.QTY     P.WEIGHT + SP.QTY

The first of these is probably valid (it yields another weight: namely, the total weight of the pertinent shipment). The second, by contrast, is probably not valid (what could it mean to add a weight and a quantity?).