Let's design a database for our CD collection. Assume that, until now, we've been keeping all the information about our CDs on index cards and that, for each CD, we've written down the artist's name, what country the artist is from, the title of the album, the release date, the label, and a list of the songs on the album. We want to convert this system of cards into an Oracle database. Based on our index cards, we first design a table with the following columns, omitting for now the list of songs: CD_COLLECTION TITLE ARTIST COUNTRY RELEASE_DATE LABEL Our database table is shown in Figure 2-1. Figure 2-1. CD database table.Notice that, to identify each CD uniquely, we can use the combination of the two columns TITLE and ARTIST. This combination is called the primary key for the table. A primary key isn't required by Oracle, and for our current purposes, we could probably get by without one, but a primary key is required by the definition of a relational database. A relational database requires that we must be able to address the intersection of any row and column uniquely.
To create the table, we'll type the following at the sqlplus prompt: CREATE TABLE CD_COLLECTION ( ALBUM_TITLE VARCHAR2(100), ARTIST VARCHAR2(100), COUNTRY VARCHAR2(25), RELEASE_DATE DATE, LABEL VARCHAR2(25), PRIMARY KEY (ALBUM_TITLE, ARTIST) ); The reason that we've omitted the songs from this table is that relational databases don't have a list or array type, so there is no natural way to store them in this table. A table in a relational database should store only items that have a one-to-one relationship with the other items in the tables. If it were just a matter of storing the titles of the songs, it would be easy to fake it by having a series of columns SONG_1, SONG_2, SONG_3, etc. But this quickly gets unmanageable if we later decide to store more information about each song, such as length and composer. Also, there is no absolute way to determine the maximum number of songs we should allow per CD. Instead, we'll need to create a separate table for the songs on the CDs. Let's enter a couple of CDs into this table with the following SQL statements:
Representing the relationship between a whole, such as a CD, to its parts, such as the songs on the CD, is a fundamental part of designing a database and is termed a one-to-many relationship. The process of properly organizing data into tables what the tables are, what the keys are, and what columns are in each table is called normalization. We took the first step toward normalizing our database when we decided not to include songs in the CD_COLLECTION table. NormalizationNormalization is not a simple, all-or-nothing process it's a step-by-step process. As we normalize, we apply a specific new rule at each step to our database. Each of these steps is called a normal form. We'll look at the first three: the first, second, and third normal forms. Each is stricter than the next, and its requirements are in addition to those of the previous forms. Thus, a database in third normal form, by definition, meets the requirements of second and first normal forms, as well. First Normal FormThe first normal form formalizes the principle that we've already stated about the songs on our CDs; we shouldn't have aggregate or repeating types, such as lists or arrays, in a table. Where there is a one-to-many relationship, each type of data should be in its own table the wholes in one and the parts in another. In the case of our CD collection, this means that the songs should be in a separate table of their own. We'll have a table that lists each of the songs (many) on each of the tracks (many) on each (one) of our CDs. This one-to-many relationship is illustrated in Figure 2-2. Figure 2-2. One-to-many relationship.We'll also include the length of the song and the composer, using the following columns: SONGS SONG_TITLE COMPOSER LENGTH TRACK ARTIST TITLE In this table, the primary key consists of three columns, TRACK, ARTIST, and TITLE, all of which are sufficient together to identify each song on each CD uniquely. To tie a set of songs in this table with a CD in the CD_COLLECTION table, we need to include the primary key (which identifies a unique CD) from the CD_COLLECTION table; here, in the SONGS table, this key is called a foreign key. In addition to including columns for ARTIST and ALBUM_TITLE, we explicitly identify the foreign key, in addition to the primary key, when we create the table: CREATE TABLE SONGS ( SONG_TITLE VARCHAR2(100), COMPOSER VARCHAR2(100), LENGTH NUMBER, TRACK NUMBER, ARTIST VARCHAR2(100), ALBUM_TITLE VARCHAR2(100), FOREIGN KEY (ARTIST, ALBUM_TITLE) REFERENCES CD_COLLECTION(ARTIST, ALBUM_TITLE), PRIMARY KEY (SONG_TITLE, ARTIST, ALBUM_TITLE) ); Associating the ARTIST and TITLE columns in this table with the ARTIST and TITLE columns of the CD_COLLECTION table by specifying with a foreign key forms a referential constraint. The database will enforce this constraint to ensure that any songs we enter into this table are associated with a CD that we've entered in the CD_COLLECTION table. This means that we must first enter CD information before we enter song information. Let's insert a couple of songs for each of our CDs:
As we'll see later, SQL will allow us to query the two tables in ways that will put the information back together like it was on our index card. But because we have not thrown information away about the individual songs by lumping them together with the CD-as-a-whole information, we can also do new things that would've been hard to do with index cards, such as determining how many CDs contain a given song or finding all songs by a given composer. Second Normal FormThe second normal form says that a column in a table should not depend on just part of the key. This is actually a problem in our first table with the column COUNTRY. Assume that our CD collection has CDs from around the world. Our primary key for this table is ARTIST and TITLE, but COUNTRY depends only on ARTIST. To straighten this out, we'll need to remove the COUNTRY column from our CD_COLLECTION table and add a new table, an ARTISTS table that lists information, including at least COUNTRY, about each of the artists represented in our CD collection: ARTISTS ARTIST COUNTRY We can probably add more information to this later, such as date of birth, Web site URL, etc. This is already starting to show one of the typical effects of normalization lots of smaller tables. Third Normal FormThe third normal form states that a column in the table must not depend on any other column of the table. To put it another way, all columns in the table must depend directly on the whole key and nothing but the key. There is no example of this in our database now, but it's not hard to see how a situation could arise if we extend our example a bit. Suppose that we wanted to use our database for a music store and that one of the pieces of information we needed to store was the list price. Our first thought might be to add a column LIST_PRICE to the CD_COLLECTION tables: CD_COLLECTION ALBUM_TITLE ARTIST COUNTRY RELEASE_DATE LABEL LIST_PRICE Now suppose that each label sets the list price, the same price, for all of its CDs. Because LIST_PRICE does not vary for each individual CD but instead depends on the non-key column LABEL, the third normal form suggests that we need to remove the LIST_PRICE column and add instead a new table, LIST_PRICES, which lists the price by label: LIST_PRICES LABEL LIST_PRICE This has clear advantages. When a recording label changes its list price, there's only one value to change. Further, consistency is also ensured because it eliminates the possibility of entering wrong values on individual CDs. This is as far as we'll take normalization here. These first three forms cover most of the issues that we need to deal with common database designs. Finding a way to resolve them is usually as straightforward as we've seen. It's usually much harder to fix problems with databases involving fourth and fifth normal forms, where the solutions are more varied, depending largely on the specifics of each situation. If your application attempts to solve a large and complex data-modeling problem, you may wish to consult with someone who has more experience or refer to a book dealing more specifically with data modeling. DenormalizationAs we saw above, there is at least one drawback to normalizing a database: You get more tables, some of which can be quite small. This will make your queries more complex and may require more programming, because the new tables may, for example, require new user interfaces for data input. In our example, if the only information that we are keeping specific to each artist is COUNTRY, we may decide that the programming necessary to maintain a separate table is not worth the effort. The major drawback is that, unless we restrict what we allow as a valid entry in some other way, we may end up with different, inconsistent values for different CDs by the same artist. Maybe that's acceptable. The best way to determine whether you've normalized your data enough (or perhaps gone too far) is to test your data model before investing significant development time in developing an application that uses it. Using SQL*Plus manually or, preferably, using scripts, create the database, add some data, and try out typical queries. In later chapters, we will occasionally revisit this CD collection example, and as we do so, we'll find that we'll need to make changes as we take more considerations into account. This evolution of a database is a common part of the development process and needs to be taken into account when planning a project. |