1.2. The Importance of Being Normal

Normalization, and especially that which progresses to the third normal form (3NF), is a part of relational theory that most students in computer science have been told about. It is like so many things learned at school (classical literature springs to mind), often remembered as dusty, boring, and totally disconnected from today's reality. Many years later, it is rediscovered with fresh eyes and in light of experience, with an understanding that the essence of both principles and classicism is timelessness.

The principle of normalization is the application of logical rigor to the assemblage of items of data, which may then become structured information. This rigor is expressed in the definition of various normal forms, most typically three, although purists argue that one should analyze data beyond 3NF to what is known in the trade as Boyce-Codd normal form (BCNF), or even to fifth normal form (5NF). Don't panic. We will discuss only the first three forms. In the vast majority of cases, a database modeled in 3NF will also be in BCNF[*] and 5NF.

[*] You can have 3NF but not BCNF if your table contains several sets of columns that are unique (candidate keys, which are possible unique identifiers of a row) and share one column. Such situations are not very common.

You may wonder why normalization matters. Normalization is applying order to chaos. After the battle, mistakes may appear obvious, and successful moves sometimes look like nothing other than common sense. Likewise, after normalization the structures of the various tables in the database may look natural, and the normalization rules are sometimes dismissively considered as glorified common sense. We all want to believe we have an ample supply of common sense; but it's easy to get confused when dealing with complex data. The first three normal forms are based on the application of strict logic and are a useful sanity checklist.

The odds that our creating un-normalized tables will increase our risk of being struck by divine lightning and reduced to a little mound of ashes are indeed very low (or so I believe; it's an untested theory). Data inconsistency, the difficulty of coding data-entry controls, and error management in what become bloated application programs are real risks, as are poor performance and the inability to make the model evolve. These risks have a very high probability of occurring if we don't adhere to normal form, and I will soon show why.

How is data moved from a heterogeneous collection of unstructured bits of information into a usable data model? The method itself isn't complicated. We must follow a few steps, which are illustrated with examples in the following subsections.

1.2.1. Step 1: Ensure Atomicity

First of all, we must ensure that the characteristics, or attributes, we are dealing with are atomic. The whole idea of atomicity is rather elusive, in spite of its apparent simplicity. The word atom comes from ideas first advanced by Leucippus, a Greek philosopher who lived in the fifth century B.C., and means "that cannot be split." (Atomic fission is a contradiction in terms.) Deciding whether data can be considered atomic or not is chiefly a question of scale. For example, a regiment may be an atomic fighting unit to a general-in-chief, but it will be very far from atomic to the colonel in command of that regiment, who deals at the more granular level of battalions or squadrons. In the same way, a car may be an atomic item of information to a car dealer, but to a garage mechanic, it is very far from atomic and consists of a whole host of further components that form the mechanic's perception of atomic data items.

From a purely practical point of view, we shall define an atomic attribute as an attribute that, in a where clause, can always be referred to in full. You can split and chop an attribute as much as you want in the select list (where it is returned); but if you need to refer to parts of the attribute inside the where clause, the attribute lacks the level of atomicity you need. Let me give an example. In the previous list of attributes for used cars, you'll find "safety equipment," which is a generic name for several pieces of information, such as the presence of an antilock braking system (ABS), or airbags (passenger-only, passenger and driver, frontal, lateral, and so on), or possibly other features, such as the centralized locking of doors. We can, of course, define a column named safety_equipment that is just a description of available safety features. But we must be aware that by using a description we forfeit at least two major benefits:


The ability to perform an efficient search

If some users consider ABS critical because they often drive on wet, slippery roads, a search that specifies "ABS" as the main criterion will be very slow if we must search column safety_equipment in every row for the "ABS" substring. As I'll show in Chapter 3, regular indexes require atomic (in the sense just defined) values as keys. One can sometimes use query accelerators other than regular indexes (full-text indexing, for instance), but such accelerators usually have drawbacks, such as not being maintained in real time. Also take note that full-text search may produce awkward results at times. Let's take the example of a color column that contains a description of both body and interior colors. If you search for "blue" because you'd prefer to buy a blue car, gray cars with a blue interior will also be returned. We have all experienced irrelevant full-text search results through web searches.


Database-guaranteed data correctness

Data entry is prone to error. Even more serious than slow searches is the fact that if "ASB" is entered instead of "ABS" into a descriptive string, the database management system will have no way to check whether the string "ASB" is meaningful. As a result, the row will never be returned when a user specifies "ABS" in a search, whether as the main or as a secondary criterion. In other words, some of our queries will return wrong results (either incomplete, or even plain wrong if we want to count how many cars feature ABS). If we want to ensure data correctness, our only means (other than double-checking what we have typed) is to write some complicated function to parse and analyze the safety equipment string when it is entered or updated. It is hard to decide which would be worse: the maintenance hell of such a function, or the performance penalty it would inflict on data loads. By contrast, a mandatory Y/N has_ABS column would not guarantee that the information is correct, but at least declarative check constraints can make the DBMS reject any value other than Y or N.

Partially updating a complex string of data requires first-rate mastery of string functions. Thus, you want to avoid cramming multiple values into a single string.
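
To make the contrast concrete, here is a minimal, hypothetical sketch. It assumes a used_cars table in which safety_equipment is a free-form description and has_ABS is an atomic Y/N column; the names are illustrative only, not part of any actual schema:

     -- With a descriptive string, every search is a substring scan, and a
     -- mistyped "ASB" is silently accepted and never found again:
     select registration_number
       from used_cars
      where safety_equipment like '%ABS%';

     -- With an atomic attribute, the search is a plain comparison that a
     -- regular index can support, and a declarative constraint lets the
     -- DBMS reject anything other than 'Y' or 'N':
     alter table used_cars
       add constraint has_abs_check check (has_ABS in ('Y', 'N'));

     select registration_number
       from used_cars
      where has_ABS = 'Y';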

Defining data atoms isn't always a simple exercise. For example, the handling of addresses frequently raises difficult questions about atomicity. Must we consider the address as some big, opaque string? Or must we break it into its components? And if we decompose the address, to what level should we split it up? Remember the points made earlier about atomicity and business requirements. How we represent an address actually depends on what we want to do with the address. For example, if we want to compute statistics or search by postal code and town, then it is desirable to break the address up into sufficient attribute components to uniquely identify those important data items. The question then arises as to how far this decomposition of the address should be taken.

The guiding principle in determining the extent to which an address should be broken into components is to test each component against the business requirements, and from those requirements derive the atomic address attributes. What these various address attributes will be cannot be predicted (although the variation is not great), but we must be aware of the danger of adopting an address format just because some other organization may have chosen it, before we have tested it critically against our own business needs.

Note that sometimes, the devil is in the details. By trying to be too precise, we may open the door to many distracting and potentially irrelevant issues. If we settle for a level of detail that includes building number and street as atomic items, what of ACME Corp, the address of which is simply "ACME Building"? We should not create design problems for information we don't need to process. Properly defining the level of information that is needed can be particularly important when transferring data from an operational to a decision-support system.

Once all atomic data items have been identified, and their mutual interrelationships resolved, distinct relations emerge. The next step is to identify what uniquely characterizes a row: the primary key. At this stage, it is very likely that this key will be a compound one, consisting of two or more individual attributes. To go on with our used car example, for a customer it's the combination of make, model, version, style, year, and mileage that will identify a particular vehicle, not the current registration number. It isn't always easy to correctly define a key. A good, classic example of attribute analysis is the business definition of "customer." A customer may be identified by a name. However, a name may not be the best identifier. If our customers are companies, the way we identify them may be the source of ambiguities: is it "RSI," "Relational Software," "Relational Software Inc" (with or without a dot following "Inc," with or without a comma after "Relational Software") that identifies this given company? Uppercase? Lowercase? Capitalized initials? We have here all the conditions for storing information inside a database and never seeing it again. The choice of the customer name as identifier is a challenging one, because it demands the strict application of naming standards to avoid possible ambiguities. It may be preferable to identify a customer on the basis of either a standard short name, or possibly by use of a unique code. And one should always keep in mind the impact on related data of Relational Software Inc. changing its name to, say, Oracle Corporation. If we need to keep a history of our relationship, then we must be able to identify both names as representing the same company at different points in time.

As a general rule, you should, whenever possible, use a unique identifier that has meaning rather than some obscure sequential integer. I must stress that the primary key is what characterizes the data, which is not the case with some sequential identifier associated with each new row. You may choose to add such an identifier later, for instance because you find your own company_id easier to handle than the place of incorporation and registration number that truly identify a company. You can even promote the sequential identifier to the envied status of primary key, as a technical substitute (or shorthand) for the true key, in exactly the same way that you'd use table aliases in a query in order to be able to write:

     where a.id = b.id

instead of:

     where table_with_a_long_name.id = table_even_worse_than_the_other.id

But a technical, numerical identifier doesn't constitute a real primary key by the mere virtue of its existence and mustn't be mistaken for the real thing. Once all the attributes are atomic and keys are identified, our data is in first normal form (1NF).
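
To give a rough illustration of what the first normal form means in practice, here is a hedged sketch of a 1NF used_cars table; column names, data types, and the particular choice of attributes are assumptions made for the example, not a definitive design:

     create table used_cars
     ( make            varchar(30) not null,
       model           varchar(30) not null,
       version         varchar(30) not null,
       style           varchar(30) not null,
       year            numeric(4)  not null,
       mileage         numeric(7)  not null,
       has_ABS         char(1)     not null
                       check (has_ABS in ('Y', 'N')),
       body_color      varchar(20),
       interior_color  varchar(20),
       constraint used_cars_pk
         primary key (make, model, version, style, year, mileage)
     );

Every attribute is atomic (note the separate body and interior colors), and the compound primary key is precisely the combination of attributes that, as discussed above, identifies a particular vehicle.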

1.2.2. Step 2: Check Dependence on the Whole Key

I have pointed out that some of the information that we should store to help used car buyers make an informed choice would already be known by a car enthusiast. In fact, many used car characteristics are not specific to one particular car. For example, all the cars sharing make, model, version, and style will have the same seating and cargo capacity, regardless of year and mileage. In other words, we have attributes that depend on only a part of the key. What are the implications of keeping them inside a used_cars table?


Data redundancy

If we happen to have for sale many cars of the same make, model, version, and style (a set of characteristics that we can generically call the car model), all the attributes that are not specific to one particular car will be stored as many times as we have cars of the same model. There are two issues with the storage of redundant data. First, redundant data increases the odds of encountering contradictory information because of input errors (and it makes correction more time-consuming). Second, redundant data is an obvious storage waste. It is customary to hear that nowadays storage is so cheap that one no longer needs to be obsessed with space. True enough, except that such an argument overlooks the fact that there is also more and more data to store in today's world. It also overlooks the fact that data is often mirrored, possibly backed up to other disks on a disaster recovery site where it is mirrored again, and that many development databases are mere copies of production databases. As a result, every wasted byte isn't wasted once, but four or five times in the very best of cases. When you add up all the wasted bytes, you sometimes get surprisingly high figures. Besides the mere cost of storage, sometimes, and more importantly, there is also the issue of recovery. There are cases when one experiences "unplanned downtime," a very severe crash for which the only solution is to restore the database from a backup. All other things being equal, a database that is twice as big as necessary will take twice as long to restore as would otherwise be needed. There are environments in which a long time to restore can cost a lot of money. In a hospital, it can even cost lives.


Query performance

A table that contains a lot of information (with a large number of columns) takes much longer to scan than a table with a reduced set of columns. As we shall see in other chapters, a full table scan is not necessarily the scary situation that many beginners believe it to be; there are many cases where it is by far the best solution. However, the more bytes in the average row, the more pages will be required to store the table, and the longer it takes to scan the table. If you want to display a selectable list of the available car models, an un-normalized table will require a select distinct applied to all the available cars. Running a select distinct doesn't only mean scanning many more rows than we would with a separate car_model table; it also means sorting those rows to eliminate duplicates. If the data is split in such a way that the DBMS engine can operate against only a subset of the data to resolve the query, performance will be significantly better than when it operates against the whole, as the short sketch following this list illustrates.
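
Here is the difference in practice, as a hedged sketch that reuses the hypothetical tables from the earlier examples:

     -- Against the un-normalized table, listing the available models means
     -- scanning every car row and sorting the result to remove duplicates:
     select distinct make, model, version, style
       from used_cars;

     -- Against a separate car_model table, the same list is a plain scan of
     -- a much smaller table, with no duplicate elimination required:
     select make, model, version, style
       from car_model;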

To remove dependencies on a part of the key, we must create tables (such as car_model). The keys of those new tables will each be a part of the key for our original table (in our example, make, model, version, and style). Then we must move all the attributes that depend on those new keys to the new tables, and retain only make, model, version, and style in the original table. We may have to repeat this process, since the engine and its characteristics will not depend on the style. Once we have completed the removal of attributes that depend on only a part of the key, our tables are in second normal form (2NF).
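
The decomposition might look like the following sketch; as before, names and types are illustrative assumptions, and the engine attributes (which depend on even less than the full car_model key) are left out for brevity:

     -- Attributes that depend only on the model move to their own table:
     create table car_model
     ( make              varchar(30) not null,
       model             varchar(30) not null,
       version           varchar(30) not null,
       style             varchar(30) not null,
       seating_capacity  numeric(2),
       cargo_capacity    numeric(5),
       constraint car_model_pk
         primary key (make, model, version, style)
     );

     -- used_cars (replacing the single-table version sketched earlier) keeps
     -- only what is specific to one vehicle and references car_model through
     -- a part of its own key:
     create table used_cars
     ( make            varchar(30) not null,
       model           varchar(30) not null,
       version         varchar(30) not null,
       style           varchar(30) not null,
       year            numeric(4)  not null,
       mileage         numeric(7)  not null,
       has_ABS         char(1)     not null
                       check (has_ABS in ('Y', 'N')),
       body_color      varchar(20),
       interior_color  varchar(20),
       constraint used_cars_pk
         primary key (make, model, version, style, year, mileage),
       constraint used_cars_model_fk
         foreign key (make, model, version, style)
         references car_model (make, model, version, style)
     );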

1.2.3. Step 3: Check Attribute Independence

When all data has been correctly moved into 2NF, we can commence the process of identifying the third normal form (3NF). Very often, a data set in 2NF will already be in 3NF, but nevertheless, we should check the 2NF set. We now know that each attribute in the current set is fully dependent on the unique key. 3NF is reached when we cannot infer the value of an attribute from any attribute other than those in the unique key. For every pair of non-key attributes A and B, the question must be asked: "Given the value of attribute A, can the value of attribute B be determined?"

International contact information provides an excellent example of when you can have an attribute dependent on another non-key attribute: if you know the country, you need not record the international dialing code with the phone number (the reverse is not true, since the United States and Canada share the same code). If you need both bits of information, you ought to associate each contact with, say, an ISO country code (for instance IT for Italy), and have a separate country_info table that uses the country code as primary key and that holds useful country information that your business requires. For instance, a country_info table may record that the international dialing code for Italy is 39, but also that the Italian currency is the euro, and so on. Every pair of attributes in our 2NF data set should be examined in turn to check whether one depends on the other. Such checking is a slow process, but essential if the data is to be truly modeled in 3NF. What are the risks associated with not having the data modeled in 3NF? Basically you have the same risks as from not respecting 2NF.
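
A hedged sketch of what this might look like, with illustrative names and types (the contact_id is just a technical shorthand, in the sense discussed in step 1):

     -- Everything that depends only on the country lives in country_info:
     create table country_info
     ( country_code  char(2)     not null primary key,  -- ISO code, e.g. 'IT'
       country_name  varchar(50) not null,
       dialing_code  varchar(5)  not null,               -- '39' for Italy
       currency      varchar(20) not null                -- 'euro' for Italy
     );

     -- Each contact carries only the country code, not the dialing code:
     create table contacts
     ( contact_id    numeric(10)  not null primary key,
       contact_name  varchar(100) not null,
       phone_number  varchar(30),
       country_code  char(2)      not null
                     references country_info (country_code)
     );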

There are various reasons that modeling to the third normal form is important. (Note that there are cases in which designers deliberately choose not to model in third normal form; dimensional modeling, which will be briefly introduced in Chapter 10, is such a case. But before you stray from the rule, you must know the rule and weigh the risks involved.) Here are some reasons:


A properly normalized model protects against the evolution of requirements.

As Chapter 10 will show, a non-normalized model such as the dimensional one finds its justification in assumptions about how the data is maintained and queried (the same can be said of the physical data structures that you'll see in Chapter 5; but a physical implementation change will not jeopardize the logic of programs, even if it can seriously impact their performance). If the assumptions prove wrong one day, all you can do is throw everything away and rebuild from scratch. By contrast, a 3NF model may require some query adjustments, but it will be flexible enough to accommodate changes.


Normalization minimizes data duplication.

As I have already pointed out, duplicate data is costly, both in terms of disk space and processing power, but it also introduces a much-increased possibility of data becoming corrupt. Corruption happens when one instance of a data value is modified, but the same data held in another part of the database fails to be simultaneously (and identically) modified. Losing information doesn't only mean data erasure: if one part of the database says "white" while another part says "black," you have lost information. Data inconsistency can be prevented by the DBMS if the modeling allows it: if your atomic attributes let you define column constraints, or if you can declare referential integrity constraints. Otherwise, it has to be prevented by additional programming traps. You then have the choice between using triggers and stored procedures that can grow very complex and add significant overhead, or making programs unnecessarily complicated and therefore costlier to maintain. Triggers and stored procedures must be extremely well documented. Data consistency ensured in programs moves the protection of data integrity out of the database and into the application layer. Any other program that needs to access the same data has the choice between duplicating the data integrity protection effort, or happily corrupting the data painfully maintained in a consistent state by other programs. The brief example following this list shows how little is needed when the constraints are declared to the DBMS.
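
As a brief, hypothetical example reusing the tables sketched earlier in this chapter, this is all it takes for the DBMS to guard consistency on its own; both inserts below are rejected outright, with no trigger and no application code involved:

     -- Rejected by the check constraint: has_ABS must be 'Y' or 'N'
     insert into used_cars (make, model, version, style, year, mileage, has_ABS)
     values ('SomeMake', 'SomeModel', 'Base', 'Sedan', 2005, 40000, 'X');

     -- Rejected by the foreign key: no such model exists in car_model
     insert into used_cars (make, model, version, style, year, mileage, has_ABS)
     values ('OtherMake', 'OtherModel', 'Base', 'Sedan', 2005, 40000, 'Y');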

The normalization process is fundamentally based on the application of atomicity to the world you are modeling.



