Chapter 2 -- Database Structure | Designing Relational Database Systems (Dv-Mps Designing)

Chapter 2

This chapter discusses the first facet of designing relational database models: the structure of the relations themselves. The primary goal of this design phase is simple: to ensure that the model is capable of answering every question that might reasonably be asked of it. Secondarily, you want to minimize redundancy and the problems associated with it.

Redundancy is a bad idea not only because it wastes resources but also because it makes life a lot harder. Take, for example, the recordset shown in Figure 2-1, representing a company's invoices. (Assume, for the moment, that this recordset is a base table that is explicitly stored in the database, not the result of a query.)

click to view at full size.

Figure 2-1. This recordset has redundant data.

You can see that HireDate and TelephoneExtension values are listed several times for each employee. Now, this has a couple of consequences. First, it means that every time you enter a new invoice you have to reenter values into these two fields. And, of course, every time you enter something you have another chance to get it wrong. For example, given the recordset shown in Figure 2-2, how do you know whether Steven Buchanan was hired in 1989 or 1998? Second, the structure prevents you from storing the hire date or phone number for a new employee until that employee has made a sale. Third, if the invoices for a given year are archived and removed from the database, the hire date and phone number information will be lost.

Figure 2-2. Duplicate data can result in inconsistencies.

These kinds of problems, usually called update anomalies, are even worse if the redundant data is stored in more than one relation. Consider the example shown in Figure 2-3. (Again, assume that these are base tables, not query results.) If the phone number of Around the Horn changes, you must change it in the Customer recordset and also remember to update every instance in the Invoice recordset.

click to view at full size.

Figure 2-3. The same duplicate data can exist in multiple recordsets.

It's not so much that this is impossible or even difficult to do. The problem is remembering to do it. And even if you never forget anything, how will you ensure that the maintenance programmer modifying your system six months from now will know there are redundancies of this kind, much less remember (or know how) to handle them appropriately? It's better, much better, to avoid the redundancies and the resulting problems altogether.

You need to make sure that the redundant attributes you're considering really are redundant. Consider the example in Figure 2-4. At first glance, you might think that the UnitPrice attributes in these two relations are redundant. But they actually represent two distinct values. The UnitPrice attribute in the Products relation represents the current selling price. The UnitPrice attribute in the Orders relation represents the price at the time the item was sold. Tofu, for example, is listed at a UnitPrice of $18.60 in the Orders relation and $23.25 in the Products relation. The fact that Tofu currently sells for $23.25 doesn't change the fact that it was sold for $18.60 at some point in the past. The two attributes are defined on the same domain, but they are logically distinct.

click to view at full size.

Figure 2-4. Seemingly identical data might not actually be redundant.

The ability of a data model to answer the questions asked of it is largely determined by its completeness (obviously, no database system can provide data it doesn't contain) and only secondarily by its structure. But the ease with which questions can be answered is almost exclusively the result of the structure. The principle here is that it is easy to combine attributes and relations but very difficult to take them apart, as illustrated in Figure 2-5.

click to view at full size.

Figure 2-5. Concatenating information is easy, but extracting it from composite fields is hard.

To return to the name example from Chapter 1, given the two relations shown in Figure 2-5, the FullName can be easily derived from the top relation with the statement:

 TitleOfCourtesy & " " & FirstName & " " & LastName & _ ", " & Title

But retrieving only the LastName from the FullName field shown in the bottom relation would require manipulating the string:

 Function GetLastname(FullName) As String Dim lastname As String 'strip off the Title lastname = Left(FullName, InStr(FullName, ",") - 1) 'strip off the TitleOfCourtesy lastname = Right(lastname, Len(lastname) _ _ InStr(lastname, " ")) 'strip off FirstName lastname = Right(lastname , Len(lastname ) - _ InStr(lastname , " ")) GetLastname = lastname End Function

This technique is also vulnerable to variations in the contents of the FullName field; the name "Billy Rae Jones" is going to return "Rae Jones", when what you probably wanted was "Jones". Further, producing a list in the format LastName, FirstName could get very ugly.

The second principle involved in creating a data model that can effectively answer the questions asked of it is to avoid situations where answering the question requires evaluating the same information from multiple fields. Take the relations shown in Figures 2-6 and 2-7, for example, both of which model student enrollments. To answer the question "Which students are studying Biology this year?" using the first relation, you would have to search for the value "Biology" in six fields. The SQL SELECT statement would look like this:

 SELECT StudentID FROM Enrollments WHERE Period1 = "Biology" OR Period2 = "Biology" OR Period3 = "Biology" OR Period4 = "Biology" OR Period5 = "Biology" OR Period6 = "Biology";

Using the second structure requires searching only a single field, Class:

 SELECT StudentID FROM Enrollments WHERE Class = "Biology";

They both work, but the second one is obviously easier and less error-prone to code, not to mention easier to think about.

click to view at full size.

Figure 2-6. This structure makes certain questions difficult to answer.

Figure 2-7. This structure has more records, but it's easier to formulate queries against it.

Avoiding redundancy and making it easy to retrieve the data are all you need to know about data modeling; the rest is just an attempt to formalize these two basic principles. But if you've ever done much (or any) data modeling, you'll know that, simple as they might be, these principles can be very slippery to apply. They're like a paper clip: the answer is perfectly obvious once you've seen it, but a little hard to come up with the first time you're faced with a bunch of loose paper and a bit of wire.