Building Complex Data Structures and Modeling Techniques

Choosing the right data structure is essential for building sophisticated applications, because the way data is organized in a database is the basis of success for the entire application. In this section, you learn how complex data structures can be built efficiently with PostgreSQL.

Creating and Dropping Views

It has been said that eyes are the window to the soul. This might be true. Definitely true is that views can be used to build more complex applications, and they can help you obtain a broader perspective of your data and the data structure. A view can be seen as a virtual table that is a kind of preprocessor for your data. In reality, a view is the result of a SELECT statement that looks like a table.

Views are also used in the system tables; one example of a view in system tables is the pg_indexes view. The following is the definition of that view:

 shop=#  \   d   pg_indexes  View "pg_indexes"  Attribute  Type  Modifier -----------+------+----------  tablename  name   indexname  name   indexdef   text  View definition: SELECT c.relname AS tablename, i.relname AS indexname, pg_get_indexdef(x.indexrelid) AS indexdef FROM pg_index x, pg_class c, pg_class i WHERE ((c.oid = x.indrelid) AND (i.oid = x.indexrelid));

The definition of the view contains only a SELECT statement.

Recall the data structure of the sales table:

 Table "sales"  Attribute      Type                         Modifier -----------+--------------+------------------------------------------------  id         integer       not null default nextval('sales_id_seq'::text)  salestime  timestamp      prodid     bigint         prodprice  numeric(9,2)   amount     integer        name       varchar(50)    email      varchar(50)    state      varchar(50)   Indices: sales_email_key,          sales_pkey

Let's create a view that contains only the total price a user has to pay. You calculate this price by multiplying prodprice and amount . Because this operation is a fundamental one, you can use a view to avoid bugs and make your SQL statements shorter and easier to understand:

  CREATE VIEW sales_price_view AS   SELECT id, salestime, prodid,   prodprice*amount AS totalprice,   name, email, state   FROM sales;

The view is now listed as a relation in the database. It is labeled as view , but it can be used like any other table:

 shop=#  \   d  List of relations        Name          Type     Owner ------------------+----------+----------  customer          table     postgres  customer_id_seq   sequence  postgres  prodcat           table     postgres  prodcat_id_seq    sequence  postgres  prodtext          table     postgres  prodtext_id_seq   sequence  postgres  products          table     postgres  products_id_seq   sequence  postgres  sales             table     postgres  sales_id_seq      sequence  postgres  sales_price_view  view      postgres (11 rows)

We will now perform a SELECT statement. The result contains all records we store in table sales . The view calculates the required results for the prices implicitly:

 shop=#  SELECT id, totalprice FROM sales_price_view;  id  totalprice ----+------------   1       39.99   2       39.99   4       44.99   3      149.97 (4 rows)

Views can also be created on top of other views. The following example shows the creation of a view consisting of components from table prodtext and view sales_price_view . We will use it to add the title of the book to the result:

  CREATE VIEW sales_prodtext AS   SELECT prodtext.proddesc, sales_price_view.prodid,   sales_price_view.totalprice   FROM prodtext, sales_price_view   WHERE sales_price_view.prodid=prodtext.prodid   AND prodtext.lang='english';

If we perform a full table scan on the view, we receive the following result:

 shop=#  SELECT * FROM sales_prodtext;  proddesc               prodid  totalprice -----------------------------------+--------+------------  Python Developer's Handbook        385343       44.99  Python Developer's Handbook        385343      149.97  Linux Hardware Handbook            394568       39.99  Linux: Networking for your Office  106666       39.99 (4 rows)

Like any other table in the system, a view can also be joined easily with other tables. If your hierarchy of views gets complicated, you have to take care of system performance. If a view contains a slow query, all views on top of that view will also become slow. If you are working on a high-availability system with a lot of data and complex queries, you can run into trouble easily. Make sure that you have tested your data structure under real-world load.

Data Integrity with Constraints

Data integrity can be a very tricky thing. But what does integrity mean? Imagine two tables that store information about a product. One product stores the product id and the price of a product, the second table stores some more details about the product. The second table has multiple entries for one product. Picture a situation where a product id is changed in table one. This leads you into trouble when performing joins because the product id in table two stays unchanged. It can be a problem taking care of issues like these in production environments and to change multiple keys in countless tables by hand. In this case, foreign keys are defined to make sure that data integrity is guaranteed . Here is an example where foreign keys would be useful:

 shop=#  SELECT DISTINCT name, email from sales WHERE name='Robert';  name       email --------+---------------  Robert  robert  @no.any (1 row) shop=#  SELECT name, email from customer WHERE name='Robert';  name      email --------+--------------  Robert  rober@  no.any (1 row)

If you compare the results of the two queries, you might recognize that the email address of Robert seems to be wrong in table customer (a t is missing). This is a tricky bug because it isn't likely to be found on first sight. It would be useful to have something that checks the data when the INSERT statement is performed. Foreign keys are usually used to perform the job.

Let's re-create the tables we have used in the sample database (at least a short version of the tables) and add some foreign keys to it (don't forget to drop the old tables before using the CREATE TABLE command):

  CREATE TABLE "customer" (   "id" serial NOT NULL,   "name" character varying(50) NOT NULL,   "email" character varying(50),   "state" character varying(50) NOT NULL,   PRIMARY KEY ("name", "email")   );   CREATE TABLE "sales" (   "id" serial NOT NULL,   "prodid" int8,   "prodprice" numeric(9,2),   "amount" int4,   "name" character varying(50),   "email" character varying(50),   PRIMARY KEY ("id"),   FOREIGN KEY (name, email) REFERENCES customer   );   COPY "customer" FROM stdin;   1       John    john@no.any     Florida   2       Robert  rober@no.any    Georgia   3       Peter   peter@no.any    Alaska   \.

First we create the customer table. Note that we have defined a multicolumn primary key to ensure that a combination of name and email address has to be unique.

Furthermore the primary key will be referred by the sales table. We have created table sales and added the foreign key. Because we have a multicolumn primary key, we have to use the FOREIGN KEY constraint.

Let's try to insert some values into the database:

  INSERT INTO sales(prodid, prodprice, amount, name, email)   VALUES ('385343','20','3','John','john@no.any');

Inserting this record works perfectly well because the name John and the correct email address can be found in the master table. What happens if we want to perform the operation with a slightly different email address:

 shop=#  INSERT INTO sales(prodid, prodprice, amount, name, email) VALUES   ('385343','20','3','John','johnathan@no.any');  ERROR:  <unnamed> referential integrity violation - key referenced from sales not found in customer

You can see that the INSERT failed because the required email address is not found in table customer . If we want to update the name of the record we have already inserted into table sales , we also receive an error, because this leads to a violation of the referential integrity:

 shop=#  UPDATE sales SET name='Alex';  ERROR:  <unnamed> referential integrity violation - key referenced from sales not found in customer

What if we are planning to change the values of the keys we have in our tables? The next example defines the integrity rules of a column precisely.

  CREATE TABLE "customer" (   "id" serial NOT NULL,   "name" character varying(50) NOT NULL,   "email" character varying(50),   "state" character varying(50) NOT NULL,   PRIMARY KEY ("name")   );   CREATE TABLE "sales" (   "id" serial NOT NULL,   "prodid" int8,   "prodprice" numeric(9,2),   "amount" int4,   "name" character varying(50)   REFERENCES customer   ON UPDATE CASCADE,   "email" character varying(50),   PRIMARY KEY ("id")   );   COPY "customer" FROM stdin;   1       John    john@no.any     Florida   2       Robert  rober@no.any    Georgia   3       Peter   peter@no.any    Alaska   \.

We have modified the data structure slightly. The primary key for table customer has only one column now. In table sales , we have defined a foreign key on column name . This key references, as in the earlier example, to the customer tables. The difference is that we use ON UPDATE , which defines the action that has to happen when the primary key of the master table is updated. In the earlier example, we combined the ON UPDATE with the CASCADE command. Using CASCADE means that an update on the primary key of the master table leads to the update of all foreign key columns that refer to it.

Other commands are supported by PostgreSQL. We can also define certain database activities with the help of ON DELETE , which is always used when an entry in the master table is deleted. The following is a short overview of all commands that can be combined with ON UPDATE and ON DELETE :

Command	Action
NO ACTION	This is the default value; UPDATES and DELETES won't be performed to protect referential integrity.
CASCADE	All foreign keys will be updated when the primary key changes (with ON UPDATE). All foreign key records will be deleted when the primary key record is deleted (ON DELETE).
SET NULL	The foreign key is set to NULL when the primary key is updated or deleted.
SET DEFAULT	The foreign key is set to the default value of the column when the primary key is deleted or updated.

Here is an example of how ON UPDATE CASCADE works:

  INSERT INTO sales(prodid, prodprice, amount, name, email)   VALUES ('385343','20','3','John','john@no.any');

The INSERT command works because John is a valid name in table customer . Let's update the name to Paul :

  UPDATE customer SET name='Paul' WHERE name='John';

The update has been performed without any trouble; the following shows what happened inside the two tables:

 shop=#  SELECT * FROM sales;  id  prodid  prodprice  amount  name     email ----+--------+-----------+--------+------+-------------   1  385343      20.00       3  Paul  john@no.any (1 row) shop=#  SELECT * FROM customer WHERE name='Paul';  id  name     email      state ----+------+-------------+---------   1  Paul  john@no.any  Florida (1 row)

You can see that both tables have been updated to Paul as we expected it to be.

Note

ON UPDATE and ON DELETE can both be defined for one column; you need not decide whether ON UPDATE or ON DELETE is more important to yousimply use both for one column.

The `CHECK` Constraint

Some applications demand some sort of input restriction. This can be done with the CHECK constraint, which checks whether a list of conditions is fulfilled before an INSERT command is processed . In the next example, we want to except only orders where the total price is higher than 100 dollars; otherwise , the INSERT command should fail:

  CREATE TABLE "sales" (   "id" serial NOT NULL,   "prodid" int8,   "prodprice" numeric(9,2),   "amount" int4,   "name" character varying(50),   "email" character varying(50),   PRIMARY KEY ("id"),   CHECK (amount*prodprice > 100)   );

We try to insert an order:

 shop=#  INSERT INTO sales(prodid, prodprice, amount, name, email)   VALUES('385343', '39.99', '1', 'John', 'john@no.any');  ERROR:  ExecAppend: rejected due to CHECK constraint

This fails because one book costs only 39.99 dollars. The minimum is much higher, so the database displays an error. Now we try to order three books that cost 39.99 each. The total price is higher than 100 dollars, so the INSERT statement will be processed successfully:

 shop=#  INSERT INTO sales(prodid, prodprice, amount, name, email)   VALUES('385343', '39.99', '3', 'John', 'john@no.any');  INSERT 53120 1

In many cases, one condition might not be enough for an application. PostgreSQL's CHECK constraint can be used with multiple conditions, as in the next example. We simply have to connect the conditions with the help of AND or OR tokens:

  CREATE TABLE "sales" (   "id" serial NOT NULL,   "prodid" int8,   "prodprice" numeric(9,2),   "amount" int4,   "name" character varying(50),   "email" character varying(50),   PRIMARY KEY ("id"),   CHECK (amount*prodprice > 100   AND amount*prodprice < 1000)   );

Now we can insert only values that are higher than 100 but lower than 1000 ; all other attempts to insert data will fail:

 shop=#  INSERT INTO sales(prodid, prodprice,   amount, name, email)   VALUES('385343', '39.99', '300',   'John', 'john@no.any');  NOTICE:  sales_id_seq.nextval: sequence was re-created ERROR:  ExecAppend: rejected due to CHECK constraint

Using Inheritance

Inheritance is one of the key features of object-relational databases. Objects have certain properties that can be inherited by other classes; in PostgreSQL, tables can inherit from other tables.

Tables can be in a parent-child relationship, which means that all information a parent has is also available in the child (see Figure 3.4).

Figure 3.4. A simple parent-child example.

graphics/03fig04.gif

Imagine a class person that stores all information persons normally have ( height , gender , and so on).

  CREATE TABLE persons(id serial, name varchar(50), gender char(1), height int4);  INSERT INTO persons(name, gender, height) VALUES ('George','m','178');

Now we create a class of persons where we have some additional information about this group of persons, such as profession and income.

  CREATE TABLE worker(prof varchar(50), income int4) INHERITS (persons);

We have additional information about workers, which means that we also have all information about "ordinary" persons. The following is the data structure of table worker :

 persons=#  \   d   worker  Table "worker"  Attribute     Type                          Modifier -----------+-------------+--------------------------------------------------  id         integer      not null default nextval('persons_id_seq'::text)  name       varchar(50)   gender     char(1)       height     integer       prof       varchar(50)   income     integer

Let's add a worker:

  INSERT INTO worker (name, gender, height, prof, income) VALUES   ('Carl','m','182');

Derived classeslike parent classescan have children. In the following example, we create a class called cybertec that inherits all information from worker and adds more information, such as photo and email address:

  CREATE TABLE cybertec(photo varchar(50), email varchar(50)) INHERITS (worker);

Of course we will add some data:

  INSERT INTO cybertec(name, gender, height, prof, income, photo, email)   VALUES ('Ewald Geschwinde','m','182', 'technical director', '1',   'www.cybertec.at/epi/ps2.jpg','eg@cybertec.at');   INSERT INTO cybertec(name, gender, height, prof, income, photo, email)   VALUES ('Hans-Juergen Schoenig','m','178', 'director of marketing', '1',   'www.cybertec.at/hans/1.jpg','hs@cybertec.at');

If we want to write a query to retrieve all persons from table persons , this can easily be done:

 persons=#  SELECT * FROM persons;  id   name   gender  height ----+--------+--------+--------   1  George  m          178 (1 row)

One record has been returned by the database, but let's modify the SQL command slightly:

 persons=#  SELECT * FROM persons*;  id          name           gender  height ----+-----------------------+--------+--------   1  George                 m          178   2  Carl                   m          182   3  Ewald Geschwinde       m          182   4  Hans-Juergen Schoenig  m          178 (4 rows)

We have added an asterisk ( * ) to the name of the table, but now the database returns all values from table persons , including all tables derived from persons .

Note

Only the columns that are available in persons are displayed.

If we try the same thing with one of the other tables, we will receive more columns.

The next example shows how you can query all records from table cybertec and table worker . First we select all records available and exclude those that are in table persons :

 persons=#  SELECT name FROM persons* EXCEPT SELECT name FROM persons;  name -----------------------  Carl  Ewald Geschwinde  Hans-Juergen Schoenig (3 rows)

A new table cannot only inherit from one table; it also is possible to define multiple parents for one table. The next example shows how the son inherits from mother and father :

  CREATE TABLE father(name text);   CREATE TABLE mother(address text);   CREATE TABLE son(gender char(1)) INHERITS (father, mother);

The table data structure of son now looks like this:

 persons=#  \   d   son  Table "son"  Attribute   Type    Modifier -----------+---------+----------  name       text      address    text      gender     char(1)

After we have successfully created the tables, we try to delete one of them:

 persons=#  DROP TABLE father;  ERROR:  Relation '53451' inherits 'father'

As you might have thought, no table can be dropped that is the parent of another table. If you want to delete tables in a hierarchy, you have to do it "bottom up," which means children first:

 persons=#  DROP TABLE son;  DROP persons=#  DROP TABLE father, mother;  DROP

The previous examples work for PostgreSQL database releases earlier than 7.1. In 7.1 and later, inherited tables will be accessed automatically, and therefore an asterisk is not needed. If you want to access noninherited tables, you can use ONLY as a keyword.

Modeling Techniques

This section is dedicated to those who want a brief insight into modeling techniques. You learn how to build entity relationship models and explore the normalization rules for databases.

The Entity Relationship Model

In almost any application, a database is used to build a model of reality. Our reality consists of objects that have certain properties and certain relations with other objects. We will call these objects entities.

The entity relationship model is designed to represent real-world objects and processes to build data structures. For every entity, the relation to other entities is shown. Three types of relations are recognized:

Two entities can be in a 1:1 relationship, which means that one entity might have exactly one entity related to it.
A 1:m relationship means that one entity can have multiple entities related to it. An example of this is a product that has multiple categories assigned to it (one product <-> many categories).
Entities can also be in an m:n relationship, which means that m entities are related with n entities.

An entity relationship model tries to model these issues in a graphical way. Every entity is represented as a box , and the relation between the boxes is shown as arrows or lines.

Entity relationship models are often used in real-world scenarios, because it is easier to understand a graphical overview of complex data structures than a huge amount of code.

Normalization

When designing a database from scratch, you normally end with a set of large tables with everything packed into it. This can soon lead to confusing data structures and a lot of redundancy in the tables. Redundancy means that values are stored more often than necessary. To get rid of the problems, a normalization can be performed. With normalization, the data model's anomalies, redundancies, and inconsistencies are reduced in every step of the process.

Normalization is a fundamental issue in database theory, and many modeling techniques are based on E.F. Codd's rules for normalization. In this section, you explore normalization through a short and easy-to-understand example.

The following is a table that has not been normalized:

Name	Address	Training Course
John Gore	Sesamestreet 1; 1010 Vienna	German, Chemistry, Sports
Lisa Bush	John Street 33; 1150 Vienna	Maths, Computer Science, English

In this example, if a teacher had more than three courses, we would soon run into trouble because there is no space left in the field (let's assume that it has a fixed length).

According to Codd's theories , a table is in first normal form when every value is stored separately and when one value does not consist of further values. Here is the same table, but now in first norm form (1 NF):

pid	first name	surname	street	town	cid	training course
15	John	Gore	Sesame Street 1	1010 Vienna	44	German
15	John	Gore	Sesame Street 1	1010 Vienna	45	Chemistry
15	John	Gore	Sesame Street 1	1010 Vienna	46	Sports
6	Lisa	Bush	John Street 33	1150 Vienna	47	Maths
6	Lisa	Bush	John Street 33	1150 Vienna	48	Computer Science
6	Lisa	Bush	John Street 33	1150 Vienna	49	English

We have added an id in order to distinguish the records ( pid for persons and cid for the id of the training course). One problem with the first normal form is that if one teacher changes address, we have to change that in three records. Another problem is that a new teacher can be added only when assigned to a training course. To put it in another way, we would have to remove teachers when they have no more training courses. This might not fit reality; for that reason, we should change our first normal form to a second normal form.

All components that have a separate key are now treated as separate tables. In our case, we create two tablesone for the teachers and one for the training courses.

Here is the table for the teachers:

pid	firstname	surname	street	town
15	John	Gore	Sesame Street 1	1010 Vienna
6	Lisa	Bush	John Street 33	1150 Vienna

Here is the table for the courses:

pid	cid	training course
15	44	German
15	45	Chemistry
15	46	Sports
6	47	Maths
6	48	Computer Science
6	49	English

We can now connect those two tables without storing the name of the teacher in a redundant way. According to Codd's theories, this must not happen because this could lead to anomalies we could have when using the first normal form.

In the mid-seventies, Codd discovered that the second normal form can also lead to anomalies. Assume that we add a column category to the table where we store the courses. Because many courses might belong to the same category, we would still have redundancies in the table and would have to update many rows when one value changes. In this case, we would have to introduce a third table to store the categories. This would be a data model using the third normal form.

In most cases, a data model in third normal form is also called a Boyce-Codd normal form.

In the following cases, a third normal form is not a Boyce-Codd normal form:

Candidate keys in the relation are composite keys, which means that they are not single attributes.
There is more than one candidate key in the relation.
The keys are not disjoint , which means that some attributes in the key are common.

When designing data models, keep Codd's theories about normalization in mind, because you will build more flexible data structures when taking the most important points of the theory into consideration.

It is a good idea to create a highly flexible data model. If your data structure is not flexible enough, you will soon run into trouble and extending your application can become truly painful.

The Waterfall Model

A guide to database design says: "Using waterfall model, maybe is for cowards. Using no model is for kamikazes only!" (quoted from http://www.cdt.luth.se/~jmb/ presents /19990903/index.htm) Many situations have proven that the author of this sentence knows what he is talking about.

The waterfall model is one of the most widespread models available. Academics seem to be especially enthusiastic about it; every student who has to deal with databases and modeling techniques will face the waterfall model at least once.

The waterfall model is a systematic, step-by-step approach to software development, in which the next phase of the project is entered as soon as one phase is completed. According to the basic waterfall model, the whole process is straightforward only.

Nowadays, many slightly different and more sophisticated versions of the waterfall model exist, but the most widespread version still seems to be the original model.

A process described by the waterfall model consists of seven steps (see Figure 3.5):

Feasibility study. Determine whether a project can be done.
Requirement analysis and specification. If the project can be done, find out what the basic requirements are. Requirements are not only hardware; human resources are, in most cases, more important than hardware. Define the key features of the product.
Design and specification. Define every detail of the final product precisely. The waterfall model is like a one-way streetyou can't revise the specification for the rest of the project. Everything has to be clear; the better the specification, the better the final product. In software design, four parts have to be defined: data structure, software architecture, procedural detail, and the user interface, or human computer interface (HCI).
Coding and module testing. While programming the code, extensively test the modules to ensure high quality of the final product. If the specification is good, coding is a small part of the process.
Integration and system testing. Integrate and connect code with other components. Test the system under real-world circumstances (using people who have not participated in the project to this point) to find all bugs.
Delivery. Deliver the product to the customer(s) and implement it on site.
Maintenance. Make ongoing changes to correct errors and meet new demands. Reapply every step during changes.

Figure 3.5. An original waterfall model.

graphics/03fig05.gif

Remember, in the original waterfall model, when you finish one step of the process, there is no way back. If you keep this in mind when designing and implementing an IT system, you will ensure a product of good quality.

As mentioned before, many versions of waterfall models exist. In some versions, you can go one step back in the process if something turns out to be wrong (see Figure 3.6).

Figure 3.6. A waterfall model in which you can go back to a previous state in the process.

graphics/03fig06.gif

Many people and project leaders are using a "let's see later" strategy. This can lead to real disaster. If you start coding immediately, you will most likely find out that you have forgotten some crucial points and features. In this case, you have to rewrite a lot of code, which leads to bugs or hard-to-understand software. You have to know perfectly well what you are doing when writing the first line of code.