Designing a Database | Java Oracle Database Development

Let's design a database for our CD collection. Assume that, until now, we've been keeping all the information about our CDs on index cards and that, for each CD, we've written down the artist's name, what country the artist is from, the title of the album, the release date, the label, and a list of the songs on the album. We want to convert this system of cards into an Oracle database.

Based on our index cards, we first design a table with the following columns, omitting for now the list of songs:

 CD_COLLECTION   TITLE   ARTIST   COUNTRY   RELEASE_DATE   LABEL

Our database table is shown in Figure 2-1.

Figure 2-1. CD database table.

graphics/02fig01.gif

Notice that, to identify each CD uniquely, we can use the combination of the two columns TITLE and ARTIST. This combination is called the primary key for the table.

A primary key isn't required by Oracle, and for our current purposes, we could probably get by without one, but a primary key is required by the definition of a relational database. A relational database requires that we must be able to address the intersection of any row and column uniquely.

We'll assume that a database is available for you to use in order to follow along with the examples. If you've installed Oracle and didn't choose to have a database created at installation time, you can use the Oracle Database Configuration Assistant to create one now.

In general, you shouldn't be logged in as administrator except when you need to do database administration work. If you don't have a regular user account already, you should create one and log in as that user. To create a user account, using SQL*Plus, connect to the database as administrator and type the following command at the sqlplus prompt:

 CREATE USER username IDENTIFIED BY password

Next, grant the rights to connect and create and use objects:

 GRANT CONNECT, RESOURCE TO username

Now you can connect to the database as the user you created:

 CONNECT username/password

If you are using a database created by someone else, you should get the administrator of that database to create a user account for you, perhaps in your own tablespace. You should ensure that you have the sufficient rights to create and drop tables.

In this chapter, we will be using SQL*Plus to create tables and enter information into the database using SQL. At this stage, apart from noticing that CREATE TABLE is used to create a table, INSERT is used to put a row of data into it, and SELECT is used to list the contents, you shouldn't worry about the details of SQL.

To create the table, we'll type the following at the sqlplus prompt:

 CREATE TABLE CD_COLLECTION (   ALBUM_TITLE   VARCHAR2(100),   ARTIST        VARCHAR2(100),   COUNTRY       VARCHAR2(25),   RELEASE_DATE  DATE,   LABEL         VARCHAR2(25),   PRIMARY KEY (ALBUM_TITLE, ARTIST) );

The reason that we've omitted the songs from this table is that relational databases don't have a list or array type, so there is no natural way to store them in this table. A table in a relational database should store only items that have a one-to-one relationship with the other items in the tables.

If it were just a matter of storing the titles of the songs, it would be easy to fake it by having a series of columns SONG_1, SONG_2, SONG_3, etc. But this quickly gets unmanageable if we later decide to store more information about each song, such as length and composer. Also, there is no absolute way to determine the maximum number of songs we should allow per CD. Instead, we'll need to create a separate table for the songs on the CDs.

Let's enter a couple of CDs into this table with the following SQL statements:

[View full width]
 INSERT INTO CD_COLLECTION VALUES('Black Sheets of Rain', 'Bob Mould', 'USA','1-JAN-1992',  'Virgin'); INSERT INTO CD_COLLECTION VALUES('Candy Apple Grey', 'Husker Du', 'USA','1-JAN-1986',  'Warner Brothers');

Representing the relationship between a whole, such as a CD, to its parts, such as the songs on the CD, is a fundamental part of designing a database and is termed a one-to-many relationship. The process of properly organizing data into tables what the tables are, what the keys are, and what columns are in each table is called normalization. We took the first step toward normalizing our database when we decided not to include songs in the CD_COLLECTION table.

Normalization

Normalization is not a simple, all-or-nothing process it's a step-by-step process. As we normalize, we apply a specific new rule at each step to our database. Each of these steps is called a normal form. We'll look at the first three: the first, second, and third normal forms. Each is stricter than the next, and its requirements are in addition to those of the previous forms. Thus, a database in third normal form, by definition, meets the requirements of second and first normal forms, as well.

First Normal Form

The first normal form formalizes the principle that we've already stated about the songs on our CDs; we shouldn't have aggregate or repeating types, such as lists or arrays, in a table. Where there is a one-to-many relationship, each type of data should be in its own table the wholes in one and the parts in another.

In the case of our CD collection, this means that the songs should be in a separate table of their own. We'll have a table that lists each of the songs (many) on each of the tracks (many) on each (one) of our CDs. This one-to-many relationship is illustrated in Figure 2-2.

Figure 2-2. One-to-many relationship.

graphics/02fig02.gif

We'll also include the length of the song and the composer, using the following columns:

 SONGS   SONG_TITLE   COMPOSER   LENGTH   TRACK   ARTIST   TITLE

In this table, the primary key consists of three columns, TRACK, ARTIST, and TITLE, all of which are sufficient together to identify each song on each CD uniquely. To tie a set of songs in this table with a CD in the CD_COLLECTION table, we need to include the primary key (which identifies a unique CD) from the CD_COLLECTION table; here, in the SONGS table, this key is called a foreign key. In addition to including columns for ARTIST and ALBUM_TITLE, we explicitly identify the foreign key, in addition to the primary key, when we create the table:

 CREATE TABLE SONGS (   SONG_TITLE      VARCHAR2(100),   COMPOSER        VARCHAR2(100),   LENGTH          NUMBER,   TRACK           NUMBER,   ARTIST          VARCHAR2(100),   ALBUM_TITLE     VARCHAR2(100), FOREIGN KEY (ARTIST, ALBUM_TITLE) REFERENCES CD_COLLECTION(ARTIST, ALBUM_TITLE),   PRIMARY KEY (SONG_TITLE, ARTIST, ALBUM_TITLE) );

Associating the ARTIST and TITLE columns in this table with the ARTIST and TITLE columns of the CD_COLLECTION table by specifying with a foreign key forms a referential constraint. The database will enforce this constraint to ensure that any songs we enter into this table are associated with a CD that we've entered in the CD_COLLECTION table. This means that we must first enter CD information before we enter song information.

Let's insert a couple of songs for each of our CDs:

[View full width]
 INSERT INTO SONGS VALUES('Black Sheets of Rain','Mould', NULL,1, 'Bob Mould', 'Black  Sheets of Rain); INSERT INTO SONGS VALUES('Crystal','Mould', 3.28, 1, 'Husker Du', 'Candy Apple Grey'); INSERT INTO SONGS VALUES('Don''t want to know if you are lonely ','Hart', 3.28, 2,  'Husker Du', 'Candy Apple Grey'); INSERT INTO SONGS VALUES('I don''t know for sure','Mould', 3.28, 3, 'Husker Du', 'Candy  Apple Grey');

As we'll see later, SQL will allow us to query the two tables in ways that will put the information back together like it was on our index card. But because we have not thrown information away about the individual songs by lumping them together with the CD-as-a-whole information, we can also do new things that would've been hard to do with index cards, such as determining how many CDs contain a given song or finding all songs by a given composer.

Second Normal Form

The second normal form says that a column in a table should not depend on just part of the key. This is actually a problem in our first table with the column COUNTRY. Assume that our CD collection has CDs from around the world. Our primary key for this table is ARTIST and TITLE, but COUNTRY depends only on ARTIST. To straighten this out, we'll need to remove the COUNTRY column from our CD_COLLECTION table and add a new table, an ARTISTS table that lists information, including at least COUNTRY, about each of the artists represented in our CD collection:

 ARTISTS   ARTIST   COUNTRY

We can probably add more information to this later, such as date of birth, Web site URL, etc. This is already starting to show one of the typical effects of normalization lots of smaller tables.

Third Normal Form

The third normal form states that a column in the table must not depend on any other column of the table. To put it another way, all columns in the table must depend directly on the whole key and nothing but the key. There is no example of this in our database now, but it's not hard to see how a situation could arise if we extend our example a bit. Suppose that we wanted to use our database for a music store and that one of the pieces of information we needed to store was the list price. Our first thought might be to add a column LIST_PRICE to the CD_COLLECTION tables:

 CD_COLLECTION   ALBUM_TITLE   ARTIST   COUNTRY   RELEASE_DATE   LABEL   LIST_PRICE

Now suppose that each label sets the list price, the same price, for all of its CDs. Because LIST_PRICE does not vary for each individual CD but instead depends on the non-key column LABEL, the third normal form suggests that we need to remove the LIST_PRICE column and add instead a new table, LIST_PRICES, which lists the price by label:

 LIST_PRICES   LABEL   LIST_PRICE

This has clear advantages. When a recording label changes its list price, there's only one value to change. Further, consistency is also ensured because it eliminates the possibility of entering wrong values on individual CDs.

This is as far as we'll take normalization here. These first three forms cover most of the issues that we need to deal with common database designs. Finding a way to resolve them is usually as straightforward as we've seen. It's usually much harder to fix problems with databases involving fourth and fifth normal forms, where the solutions are more varied, depending largely on the specifics of each situation.

If your application attempts to solve a large and complex data-modeling problem, you may wish to consult with someone who has more experience or refer to a book dealing more specifically with data modeling.

Denormalization

As we saw above, there is at least one drawback to normalizing a database: You get more tables, some of which can be quite small. This will make your queries more complex and may require more programming, because the new tables may, for example, require new user interfaces for data input.

In our example, if the only information that we are keeping specific to each artist is COUNTRY, we may decide that the programming necessary to maintain a separate table is not worth the effort. The major drawback is that, unless we restrict what we allow as a valid entry in some other way, we may end up with different, inconsistent values for different CDs by the same artist. Maybe that's acceptable.

The best way to determine whether you've normalized your data enough (or perhaps gone too far) is to test your data model before investing significant development time in developing an application that uses it. Using SQL*Plus manually or, preferably, using scripts, create the database, add some data, and try out typical queries.

In later chapters, we will occasionally revisit this CD collection example, and as we do so, we'll find that we'll need to make changes as we take more considerations into account. This evolution of a database is a common part of the development process and needs to be taken into account when planning a project.