Relational Databases | Beginning ASP.NET 2.0 and Databases (Wrox Beginning Guides)

Chapter 2 - Relational Databases

byJohn Kauffman, Fabio Claudio Ferracchiatiet al.?
Wrox Press ?2002

Practically speaking, we tend to think of a relational database as a database in which there can exist relationships, or logical links, between tables. These links associate rows in one table with rows in another table, by matching values in one or more fields from the first table to one or more fields in the second table. In the Northwind database, the Shippers and Orders tables are related by the fields ShipperID and ShipVia, as we can see in the figure below:

click to expand

For each row in the Orders table (that is, for each customer order), there is a row in the Shippers table where the value in the ShipperID field is equal to the value in the ShipVia field of the Orders table. Let's examine this in more detail with an example.

Try It Out - Linking the Shippers and Orders Tables

We're going to take a look at a row in the Orders table, along with its related row in the Shippers table.

Back in Access, double-click on the Orders and Shippers tables to open up views of the data in both:
If we look at the Shippers table, we see that there are a total of three rows. In each row, the ShipperID field has a value that's unique to that row - 1, 2, and 3 in this case. Back in the Orders table, the ShipVia field is logically linked to the ShipperID field, so the RDBMS will only allow values in former that exist in the latter. (As a courtesy, Access is automatically displaying the 'friendlier' name, rather than the ID. This does not change the fact that ShipVia field holds numerical values.)
If you try to change the value of the ShipVia field in any of the rows, Access won't allow you to give it anything other than one of the values found in the ShipperID field of the Shippers table.

Relational DBMSs (RDBMSs) are by far the most popular form of DBMS in today's software industry. The quality, ubiquity, and stability of RDBMSs and the relational model contribute to making relational databases the de facto standard for database developers. Database theory has come a long way since the days of 'flat-file data', but you shouldn't be misled into believing that relational is always good, and nonrelational is always bad.

Since Dr E.F. Codd first formally proposed the relational model in 1970, database research has continued to develop and expand. Newer, post-RDBMS non-relational databases, including object-oriented database management systems, have appeared and have gained considerable popularity. Object-oriented databases are designed to work well with object-oriented programming languages such as Java and C++. They extend the languages with transparently persistent objects (vs. simple data), concurrency control, data recovery, associative queries, and other database capabilities.

Examples of some of the more popular relational databases are Sybase Adaptive Server, Microsoft FoxPro, Microsoft Access, Microsoft SQL Server, Oracle, Informix, Teradata, IBM DB2, MySQL, and GNU SQL Server. Examples of object-oriented databases are Object Design's Objectstore, Objectivity/DB, Versant enJin, Gemstone, 02 Technology, Mjølner, Poet Software, and Ontos.

Before we continue our journey into the realm of relational database theory, we need to be acquainted with SQL (Structured Query Language), the de facto standard programming language for communicating with and manipulating relational databases - or at least, enough to be able to talk about the theory. Once we are familiar with relational database theory, we will have the background for a more in-depth look at SQL.

Brief Introduction to SQL

SQL is a declarative language, as distinct from procedural languages such as C, C++, C#, Visual Basic, and Java. In programs written in a procedural language, the programmer describes a sequence of steps that execution should follow (that is, a procedure). In SQL, programmers describe sets of data and the actions to be applied to those sets. It is up to the RDBMS to determine how and in what order to construct the set, and how to process each member of the set. In fact, the SQL developer has little (if any) control over how the RDBMS accomplishes its task.

Consider this example, which uses a procedural language to display rows from the Orders table that were shipped by Speedy Express. For now, we'll just use pseudo-code:

    Begin Function      For each row in the Shippers table      Loop        If Shippers.CompanyName = 'Speedy Express' Then          X = ShipperID          Exit the loop        End If      End Loop      For each row in the Orders Table      Loop        If Orders.ShipVia = X Then          Display all fields for this row        End If      End Loop    End Function

By contrast, here's what the pseudo-code for a declarative language might look like:

    Begin Function      Show all fields for the set of rows in Orders where the following is true:      The value in Orders.ShipVia = the value in Shippers.ShipperID        and Shippers.CompanyName = 'Speedy Express'    End Function

In the declarative code, we declared two sets of rows:

The set of rows in Shippers where CompanyName has the value 'Speedy Express'. This set contains only one row, since there is only one row in the Shippers table for which CompanyName = 'Speedy Express'. The value of ShipperID in this one-row set is 1.
The set of rows in Orders in which ShipVia equals the value of ShipperID in the one-row set from the Shippers table. Since ShipVia must equal one of the values for ShipperID in the set of Shippers rows, we effectively limit this set to rows from the Orders table in which ShipVia is 1.

In the procedural language, we explicitly looped through all of the rows in both tables, and displayed each of the rows that met our test criteria one by one. In the declarative language, the DBMS was responsible for those details. Later in this chapter, we'll examine SQL in enough detail that we can use the SQL language for our examples, instead of pseudo-code.

Until now, we've been talking about database theory, and we've have been referring to the "rows" in a table. When we're talking about rows in a programming context, we sometimes use the term "rows", but more frequently we use the term "records" to mean the same thing. A set of rows is then a set of records, or a record set.

Note

Rows are records.

A set of rows (or records) is a record set.

Codd's Twelve Rules

In theory, a relational database is more than just data organized into related tables. Before we continue our investigations of the SQL language, let's examine this concept in more detail. The relational database model is based firmly in the mathematical theory of relational algebra and calculus. Dr E.F. Codd proposed the original concept for the model in a 1970 paper entitled A Relational Model of Data for Large Shared Data Banks. Later, Dr Codd clarified his model in a two-part article in Computer World (14th and 21st October, 1985). He defined twelve rules (Codd's Rules) that a DBMS must meet to be considered a relational database management system (RDBMS). We will now examine each of these rules.

Note that many database products are considered "relational" even if they do not strictly adhere to all twelve of these rules.

1. Data is presented in tables.

This rule has already been covered in detail in the sections above.

2. Data is logically accessible.

We don't reference a row by its position in the table. Instead, we must specify a particular row in a table based on the values contained in its fields. In the Shippers table of the Northwind database, we wouldn't try to refer to the second record; rather, we'd refer to the record with a ShipperID of 2.

3. Nulls are treated uniformly as unknown.

When its value is not known, a field may be 'empty'. That empty field is said to have a null value, which is quite different from a zero-length string or a zero integer; these are known to be "blank" or zero respectively. Any calculation that includes a null value will yield NULL. This means that 1 + 1 + NULL = NULL, and (100 > NULL) =NULL (not TRUE or FALSE).

Note

Developers who are new to database development often find that their calculations return a null value when they are expecting data. The reason for this is that at least one of the fields in the calculation contains a null value.

4. The DBMS is self-describing.

RDBMSs contain crucial, self-describing information called metadata; this is data that a database keeps about itself (such as the relationship between the Orders table and the Shippers table that we saw earlier). This metadata is stored in special tables called system tables or system catalogs. You can see these tables in Microsoft Access by selecting the Tables tab to view all of the tables in the Northwind database. Next, click Tools | Options from the menu and select the View tab on the Options dialog box. Check System objects, and click OK. You should now see a number of additional tables whose names begin with MSys (see screenshot below). All Access databases have the same system tables, but the data in these tables will vary to describe the particular database.

click to expand

5. A single language is used to communicate with the DBMS.

SQL is the de facto standard database language. We will discuss some SQL basics in this chapter, while subsequent chapters will use and explain it at length.

6. The DBMS provides alternatives for viewing data.

Views act rather like virtual tables and provide an alternative way of looking at data from one or more tables. In Microsoft Access jargon, views are called queries. You can see the queries in the Northwind database by clicking on the Queries tab (see screenshot below). Note that these queries are not tables, but are simply an alternative view of the data.

click to expand

7. The DBMS supports set-based or relational operations.

Rows are treated as sets for data manipulation operations. A relational database must support basic relational algebra operations (selection, projection, and join) and set operations (union, intersection, division, and difference). Set operations and relational algebra are used to operate on 'relations' (tables) to produce other relations. A database that supports only row-at-a-time (navigational) operations does not meet this requirement, and is not considered relational.

8. Physical data independence.

Applications that access data in a relational database must be unaffected by changes in the way the data is physically stored. For example, the code in an application that reads and writes data in a file-based database typically depends on the file format. (For example, the code references a 'phone number' field that is ten characters wide, is preceded by the 'zip code' field, and is followed by the 'fax number' field). If the layout of the data in the file changes, the application must change accordingly. By contrast, the physical storage and access methods used in a relational database can change without affecting the application's use of the data - it still just sees tables (the logical structure). An application that uses data in a relational database contains only a basic definition of the data (type and length), and does not need to know how the data is physically stored or accessed.

9. Logical data independence.

Logical independence means that changes to the relationships among tables (such as the addition of new tables to the database, the ordering of fields in an existing table, and the addition of new fields to an existing table) can occur without impairing the function of an application using the database. The database's schema (its structure of tables and relationships) can change without you having to recreate the database or the applications that use it.

10. Data integrity is a function of the DBMS.

For a DBMS to be relational, it must have the ability to regulate data integrity internally - as an internal function of the DBMS - rather than in an application program. Data integrity means that when we define fields in two or more tables as being related, the RDBMS guarantees that the values of those fields in each pair of rows always match, regardless of the operations that the application performs. Examples of data that lacks integrity are orphaned rows and duplicate rows.

11. The DBMS supports distributed operations.

Data in a relational database can be stored centrally, or it can be distributed. Users can join data from tables from different databases on the same server, or on different servers (distributed queries), or from other RDBMSs (heterogeneous queries).

12. Data integrity cannot be subverted.

There cannot be other paths into a database that subvert the integrity of the data it contains. In other words, you can't get in through the 'back door' and change the data. The RDBMS must protect data from being modified by machine language intervention.

All together, Codd's rules provide us with a precise definition of relational database management systems, how they behave, and how we can interact with them. We have looked a bit beyond the rules in an attempt to identify the most powerful features of the relational model, so that we can learn to take full advantage of them in our database design and database programming.