1.1 The Relational Model and ANSI SQL | SQL in a Nutshell (In a Nutshell (OReilly))

Relational Database Management Systems (RDBMSs), such as those covered in this book, are the primary engines of information systems worldwide, particularly web applications and distributed client/server computing systems. They enable a multitude of users to quickly and simultaneously access, create, edit, and manipulate data without impeding or impacting other users. They also allow developers to write useful applications to access their resources as well as provide administrators with the capabilities they need to maintain, secure, and optimize organizational data resources.

An RDBMS is defined as a system whose users view data as a collection of tables related to each other through common data values. Data is stored in tables , which are composed of rows and columns . Tables of independent data can be linked (or related ) to one another if they each have unique, identifying columns of data (called keys ) that represent a data value held in common. E. F. Codd first described relational database theory in his landmark paper "A Relational Model of Data for Large Shared Data Banks," published in the Communications of the ACM (Association for Computing Machinery) in June, 1970. Under Codd's new relational data model, data was structured (into tables of rows and columns); manageable using operations like selections, projections, and joins; and consistent as the result of integrity rules like keys and referential integrity. Codd also articulated rules that governed how a relational database should be designed; the process for applying these rules is now known as normalization .

1.1.1 Codd's Rules for Relational Database Systems

Codd applied rigorous mathematical theories , primarily set theory, to the management of data, and he compiled a list of criteria a database must meet to be considered relational. At its core , the relational database concept centers around storing data in tables. This concept is now so common as to seem trivial; however, not long ago designing a system capable of sustaining the relational model was considered a long shot with limited usefulness . Following are Codd's Twelve Principles of Relational Databases :

Information is represented logically in tables.
Data must be logically accessible by table, primary key, and column.
Null values must be uniformly treated as "missing information," not as empty strings, blanks, or zeros.
Metadata (data about the database) must be stored in the database just as regular data is.
A single language must be able to define data, views, integrity constraints, authorization, transactions, and data manipulation.
Views must show the updates of their base tables and vice versa.
A single operation must be available to do each of the following operations: retrieve data, insert data, update data, or delete data.
Batch and end- user operations are logically separate from physical storage and access methods .
Batch and end-user operations can change the database schema without having to recreate it or the applications built upon it.
Integrity constraints must be available and stored in the metadata, not in an application program.
The data manipulation language of the relational system should not care where or how the physical data is distributed and should not require alteration if the physical data is centralized or distributed.
Any row processing done in the system must obey the same integrity rules and constraints that set-processing operations do.

These principles continue to be the litmus test used to validate the "relational" characteristics of a database platform; a database that does not meet all of these rules is not fully relational. While these rules do not apply to applications development, they do determine whether the database engine itself can be considered truly "relational." Currently, most commercial RDBMS products pass Codd's test. Among the platforms discussed in SQL in a Nutshell , Second Edition, only MySQL fails to support all of these requirements.

Understanding Codd's principles assists programmers and developers in the proper development and design of Relational Databases (RDBs). The following sections detail how some of these requirements are met within SQL using RDBs.

1.1.1.1 Data structures (rules 1, 2, and 8)

Codd's rules 1 and 2 state that "information is represented logically in tables" and that "data must be logically accessible by table, primary key, and column." So the process of defining a table for a SQL database does not require that programs instruct the database how to interact with the underlying physical data structures. Furthermore, SQL logically isolates the process of accessing data and physically maintaining that data as required by rule 8, "batch and end-user operations are logically separate from physical storage and access methods."

In the relational model, data is shown logically as a two-dimensional table that describes a single entity (for example, business expenses). Academics refer to tables as entities and to columns as attributes . Tables are composed of rows , or records (academics call them tuples ), and columns (called attributes , since each column of a table describes a specific attribute of the entity). The intersection of a record and a column provides a single value . The column or columns whose values uniquely identify each record can act as a primary key . These days this representation seems elementary, but it was actually quite innovative when it was first proposed.

SQL2003 defines a whole data structure hierarchy beyond simple tables, though tables are the core data structure. Relational design handles data on a table-by-table basis, not on a record-by-record basis. This table-centric orientation is the heart of set programming. Consequently, almost all SQL commands operate much more efficiently against sets of data within or across tables than against individual records. Said another way, effective SQL programming requires that you think in terms of sets of data, rather than of individual rows.

Figure 1-1 is a description of the SQL2003 terminology used to describe the hierarchical data structures used by a relational database: clusters contain sets of catalogs ; catalogs contain sets of schemas ; schemas contain sets of objects , such as tables and views ; and tables are composed of sets of columns and records .

Figure 1-1. SQL2003 dataset hierarchy

For example, in a Business_Expense table, a column called Expense_Date might show when an expense was incurred. Each record in the table describes a specific entity; in this case, everything that makes up a business expense (when it happened , how much it cost, who incurred the expense, what it was for, and so on). Each attribute of an expense, in other words each column, is supposed to be atomic ; that is, each column is supposed to contain one, and only one, value. If a table is constructed in which the intersection of a row and column can contain more than one distinct value, then one of SQL's primary design guidelines has been violated. (Naturally, some of the database platforms discussed in this book allow you to place more than one value into a column, via VARRAY or TABLE datatypes.)

There are rules of behavior specified for column values. Foremost is that column values must share a common domain , better known as a datatype . For example, the value ELMER should not be placed into the Expense_Date field. The value ELMER is a string, not a date, while the Expense_Date field can contain only dates. Therefore, this column would be defined as having a DATE datatype. In addition, SQL2003 allows further controls of such values through the application of constraints and assertions . (Constraints are discussed in detail later in Chapter 2.) A SQL constraint might limit Expense_Date to expenses less than a year old.

Additionally, data access for all individuals and computer processes is controlled at the schema level by an AuthorizationID or user . Permissions to specific sets of data may be granted or restricted to each user.

Moreover, SQL databases also employ character sets and collations . Character sets are the "symbols" or "alphabets" used by the "language" of the data. For example, the American English character set does not contain the special character for ± in the Spanish character set. Collations are sets of sorting rules that operate on a character set. A collation defines how a given data manipulation operation sorts data. For example, an American English character set might be sorted either by character-order , case-insensitive , or by character-order , case - sensitive .

The ANSI standard does not say how sorts should be done, only that platforms must provide common collations found in a particular language.

It is important to know what collation you are using when writing SQL code against a database platform since it can have a direct impact on how queries behave, particularly on the behavior of the WHERE and ORDER BY clauses of SELECT statements. For example, a query that sorts data using a binary collation will return data in a very different order than one that sorts data using, say, an American English collation.

1.1.1.2 NULLS (rule 3)

Most databases allow any of their supported datatypes to store NULL values. Inexperienced SQL programmers and developers tend to think of NULL as zero or blank. In fact, NULL is neither of these. In SQL2003, NULL literally means that the value is unknown or indeterminate. (This question alone ”whether NULL should be considered unknown or indeterminate ”is the subject of academic debate.) This differentiation enables a database designer to distinguish between those entries that represent a deliberately placed zero (for example) and those where either the data is not recorded in the system or where a NULL has been explicitly entered. For an example of this semantic difference, consider a system that tracks payments. A product with a NULL price does not mean that the product is free; instead, a NULL price indicates that the amount is not known or perhaps not yet determined.

There is a good deal of differentiation between the database platforms in terms of how they handle NULL values. This leads to some major porting issues between those platforms relating to NULLs. For example, an empty string (i.e., a NULL string) is inserted as a NULL value on Oracle. With the exception of Sybase, all the other databases covered in this book permit the insertion of an empty string into VARCHAR and CHAR columns.

One side effect of the indeterminate nature of a NULL value is that it cannot be used in a calculation or a comparison. Here are a few brief but very important rules, from the ANSI standard, to remember about the behavior of NULL values when dealing with NULLs in SQL statements:

A NULL value cannot be inserted into a column defined as NOT NULL.
NULL values are not equal to each other. It is a frequent mistake to compare two columns that contain NULL and expect the NULL values to match. (The proper way to identify a NULL value in a WHERE clause or in a Boolean expression is to use phrases such as "value IS NULL" and "value IS NOT NULL".)
A column containing a NULL value is ignored in the calculation of aggregate values such as AVG , SUM , or MAX . COUNT .
When columns that contain NULL values are listed in the GROUP BY clause of a query, the query output contains a single row for NULL values. In essence, the ANSI standard considers all NULLs found to be in a single group.
DISTINCT and ORDER BY clauses, like GROUP BY , also see NULL values as indistinguishable from each other. With the ORDER BY clause, the vendor is free to choose whether NULL values sort high (first in the result set) or sort low (last in the result set) by default.

1.1.1.3 Metadata (rules 4 and 10)

Codd's fourth rule for relational databases states that data about the database (metadata) must be stored in standard tables, just as all other data. Metadata is data that describes the database itself. For example, every time you create a new table or view in a database, records are created and stored that describe the new table. Additional records are needed to store any columns, keys, or constraints on the table. This technique is implemented in most commercial and open source SQL database products. For example, SQL Server uses what it calls "system tables" to track all the information about the databases, tables, and database objects in any given database. It also has "system databases" that keep track of information about the server on which the database is installed and configured.

1.1.1.4 The language (rules 5 and 11)

Codd's rules do not require SQL to be used with a relational database. His rules, particularly rules 5 and 11, only specify how the language should behave when coupled with a relational database. At one time, SQL competed with other languages (like Digital's RDO or Fox/PRO) that might have fit the relational bill, but SQL won out for three reasons. First, SQL is a relatively simple, intuitive, English-like language that handles most aspects of data manipulation. Second, SQL is satisfyingly high-level. A programmer or Database Administrator (DBA) does not have to spend time ensuring that data is stored in the proper memory registers or that data is cached to disk. The Database Management System (DBMS) handles that task automatically. Finally, because SQL is not owned by any single vendor, it was adopted across a number of platforms.

1.1.1.5 Views (rule 6)

A view is a virtual table that does not exist as a physical data repository of data, but is instead constructed, on the fly, from a SELECT statement whenever the view is queried. Views enable you to construct different representations out of the same source data for a variety of audiences without having to alter the way in which the data is stored.

Some vendors support database objects called materialized views . Materialized views are not governed by the same rules as ANSI standard views.

1.1.1.6 Set operations (rules 7 and 12)

Other database manipulation languages, such as the venerable Xbase, perform their data operations quite differently from SQL. These languages require you to tell the program exactly how to treat the data, one record at a time. Since the program cycles down through a list of records, performing its logic on one record after another, this style of programming is frequently called row processing or procedural programming .

In contrast, SQL programs operate on logical sets of data. Set theory is applied in most all SQL statements such as SELECT , INSERT , UPDATE , or DELETE statements. In effect, data is selected from a set called a table. Unlike the row processing style, set processing allows a programmer to tell the database simply what is required, not how each individual piece of data should be handled. Sometimes set processing is referred to as declarative processing , since a programmer declares only what data is wanted, as in "Give me all employees in the southern region who earn more than $70,000 per year," rather than describing the exact procedure used to retrieve or manipulate the data.

Set theory was the brainchild of mathematician Georg Cantor, who developed it at the end of the nineteenth century. At the time, set theory (and his theory of the infinite) was quite controversial . Today, set theory is such a common part of life that it is learned in elementary school.

Examples of set theory in conjunction with relational databases are detailed in the following section.

1.1.2 Codd's Rules in Action: Simple SELECT Examples

Up to this point, the chapter has been about the individual aspects of a relational database platform as defined by Codd and implemented under ANSI SQL. This section presents a high-level overview of the most important SQL statement, SELECT , and some of its most salient points ”namely, the relational operations known as projections , selections , and joins :

Projection: Retrieves specific columns of data.
Selection: Retrieves specific rows of data.
Join: Returns columns and rows from two or more tables in a single result set.

Although at first glance it might appear that the SELECT statement deals only with the relational selection operation, in actuality, SELECT embodies all three operations. (Refer to SELECT Statement.)

The following statement embodies the projection operation by selecting the first and last names of an author, plus his home state, from the authors table:

 SELECT au_fname, au_lname, state FROM   authors

The results from any such SELECT statement are presented as another table of data:

 au_fname             au_lname                            state -------------------- ----------------------------------- -----  Johnson              White                               CA Marjorie             Green                               CA     Cheryl               Carson                              CA     Michael              O'Leary                             CA Meander              Smith                               KS     Morningstar          Greene                              TN     Reginald             Blotchet-Halls                      OR     Innes                del Castillo                        MI

The resulting data is sometimes called a result set , work table , or a derived table , differentiating it from the base table in the database that is the target of the SELECT statement.

It is important to note that the relational operation of projection, and not selection, is specified using the SELECT clause (that is, the keyword SELECT followed by a list of expressions to be retrieved) of a SELECT statement. Selection, the operation of retrieving specific rows of data, is specified using the WHERE clause in a SELECT statement. WHERE filters out unwanted rows of data and retrieves only the requested rows. Using the previous example, the following selects authors from states other than California:

 SELECT au_fname, au_lname, state FROM   authors WHERE  state <> 'CA'

The previous query retrieved all authors; the result of this query is a much smaller subset of records:

 au_fname             au_lname                            state  -------------------- ----------------------------------- -----  Meander              Smith                               KS     Morningstar          Greene                              TN     Reginald             Blotchet-Halls                      OR     Innes                del Castillo                        MI

By combining the capabilities of projection and selection together in a single query, you can use SQL to retrieve only the columns and records that you need at any given time.

Joins are the next , and last, relational operation we're going to talk about in this section. A join relates one table to another in order to return a result set consisting of related data from both tables.

Different vendors allow varying numbers of tables to join in a single join operation. For example, Oracle places no limit on the number of joins, while Microsoft SQL Server allows up to 256 tables in a join operation.

The ANSI standard method of performing joins is to use the JOIN clause in a SELECT statement. An older method, know as a theta join , performs the join analysis in the WHERE clause. The following example shows both approaches. Each statement retrieves employee information from the employee base table as well as job descriptions from the jobs base table. The first SELECT uses the newer , ANSI JOIN clause, while the second SELECT uses a theta join:

 -- ANSI style SELECT a.au_fname, a.au_lname, t.title_id FROM   authors AS a JOIN   titleauthor AS t ON a.au_id = t.au_id WHERE  a.state <> 'CA' -- Theta style SELECT a.au_fname, a.au_lname, t.title_id FROM   authors AS a,        titleauthor AS t  WHERE  a.au_id = t.au_id    AND a.state <> 'CA'

For more information about joins, refer to JOIN Subclause.