Using SQL for Database Access | Beginning XML Databases (Wrox Beginning Guides)

So, you need to know how to access a database using SQL. SQL is the most common form of database access. Additionally, SQL can be used to cater to merging relational databases with XML documents and other XML technology. Lets first briefly describe the basics of the SQL language.

SQL works by reading records from the tables in a relational database.

A table is the structure applied to repetitions of some data item. For example, in order to store customers in a database, you need to create a table for those customers. Those customers have names and addresses. A customer table consists of a field for the customer name and a field for the address of the customer.

A record is the name and address of each customer. Each record is a repetition of the structure of the table (the name and address fields).

SQL consists of a number of sections:

Queries: The SELECT command is used to read data from a relational database. There are various options when using the SELECT command, which can be used to alter a query:
- SELECT command: Retrieve all records from a single table.
- WHERE clause: Retrieve some records from a table by filtering the records using a WHERE clause. The filter can be used to include a subset of a tables rows, or exclude a subset of a tables rows, or both.
- ORDER BY clause: Retrieve records from a table, but change the order that those records are stored in, using the ORDER BY clause.
- GROUP BY clause: Before returning records from a database, summarize them into fewer rows using the GROUP BY clause. You can also use the HAVING clause to filter grouped records.
- JOIN clause: The JOIN clause can be used to query records from more than one table. The JOIN clause allows you to merge records from two tables based on common values. There are various types of joins creating intersections, outer joins, and various other types of joins.
- Subquery : A subquery allows one query to be embedded within another query, such that one query is the calling query, and the other query is executed by the calling query. It is also possible to pass values from the calling query to the subquery. Passing values from calling query to subquery allows the calling query to determine the records that the subquery retrieves from the database, for each record retrieved by the calling query.

UNION clause : The UNION clause can be used to create a composite of two queries where all records are returned from both queries as a merge of the data in two tables. Unlike a join or a subquery, both of which allow matching of values between two tables, a composite query does not allow any matching.
Changing data: INSERT , UPDATE , and DELETE commands can be used to change data in a database:
- INSERT command: Add new records to a table using the INSERT command.
- UPDATE command: Change existing records in a table using the UPDATE command.
- DELETE command: Delete existing records in a table using the DELETE command.

Metadata change commands: Records stored in database tables are the actual data stored in the database, such as the names of customers. The metadata in a relational database is the structure of that data, such as the customer table. The table is the box used to store the customers in. In my example, the customer table contains a field for the customer name and a field for the address of the customer. So the metadata is the data about the data. There are various commands that allow changes to metadata. For example, to create a table for customers you would use a CREATE TABLE command. To change the customer table you could use an ALTER TABLE command. To drop a customer table you would use a DROP TABLE command.

Metadata commands are not covered in this book as they are too far removed from the topic of XML. This book is about XML and databases. XML documents are all about data, not changing the structure of data.

Lets go into some detail on the use of relational database access commands. Lets begin by examining queries.

Queries

A query is a question asked of a database. The answer to the question is that records are returned from tables stored in that database. There are various methods of querying data.

The SELECT Command

The simplest method of writing a query is using a basic SELECT statement, consisting of a SELECT statement and a FROM clause. The SELECT statement determines what is retrieved. The FROM clause determines where data is retrieved from, enhancing the SELECT statement. This is the syntax of the SELECT statement:

    SELECT { [alias.]field  expression  [alias.]* [, field ] }   FROM table [alias];

This query finds all records in a table called COUNTRY , in the demographics database:

   SELECT * FROM COUNTRY;

The * in the SELECT statement fields list finds all fields in a table.

See Appendix B for details of the demographics database, which is used for examples throughout this book.

This is a partial result of the preceding query (both records and fields have been removed from the output):

   COUNTRY_ID  REGION_ID COUNTRY                          CO POPULATION        AREA ---------- ---------- -------------------------------- -- ---------- ----------          1          1 Algeria                          AG   32930091     2381741          2          1 Angola                           AO   12127071     1246699          3          1 Benin                            BN    7862944      110619          4          1 Botswana                         BC    1639833      585371          5          1 Burkina Faso                     UV   13902972      273799          6          1 Burundi                          BY    8090068       25649          7          1 Central African Republic         CT    4303356      622980          8          1 Congo                            CG   62660551     2267599          9          1 Djibouti                         DJ     486530       21979         10          1 Equatorial Guinea                EK     540109       28050         11          1 Ethiopia                         ET   74777981     1119683

You can also get a single field using a query like this:

   SELECT COUNTRY FROM COUNTRY;

And this is a partial result:

   COUNTRY -------------------------------- Afghanistan Albania Algeria American Samoa Angola Argentina Armenia Australia Austria Azerbaijan Bahamas

Or you can specify to retrieve a number of fields:

   SELECT COUNTRY, POPULATION, AREA FROM COUNTRY;

Again, this is the partial result:

   COUNTRY                          POPULATION       AREA -------------------------------- ---------- ---------- Algeria                            32930091    2381741 Angola                             12127071    1246699 Benin                               7862944     110619 Botswana                            1639833     585371 Burkina Faso                       13902972     273799 Burundi                             8090068      25649 Central African Republic            4303356     622980 Congo                              62660551    2267599 Djibouti                             486530      21979 Equatorial Guinea                    540109      28050 Ethiopia                           74777981    1119683

You can even execute an expression, as shown previously in the syntax definition of the SELECT statement. This query finds the population density, or the number of people per square kilometer, for each country and rounds it up to zero decimal places:

   SELECT COUNTRY, ROUND (POPULATION / AREA, 0) "Population Density" FROM COUNTRY;

The result is shown here. The division by zero error is caused because some of the AREA values in the table are zero. Dividing any number by zero is impossible and thus mathematically undefined:

   COUNTRY                          Population Density -------------------------------- ------------------ Algeria                                          14 Angola                                           10 Benin                                            71 Botswana                                          3 Burkina Faso                                     51 Burundi                                         315 Central African Republic                          7 Congo                                            28 Djibouti                                         22 Equatorial Guinea                                19 Ethiopia                                         67 Gabon                                             6 Gambia                                            0 Ghana                                            97 Guinea                                           39 ERROR: ORA-01476: divisor is equal to zero

In the preceding query and query result, the field header is renamed using the quoted string Population Density , appearing after the expression. The quoted value changes the name of the header and not the field name in the query. The AS clause can be used to change a field name, within the current query. The AS clause will be discussed later on in this chapter when it is appropriate.

Before going further it seems sensible to find a simple resolution for the division by zero issue. If you are executing these examples against a database add a function something like this:

   CREATE OR REPLACE FUNCTION ZD(pINT IN INTEGER) RETURN VARCHAR2 IS BEGIN    RETURN NVL(NULLIF(pINT, 0), 1); END; /

The preceding function will compile for an Oracle database. If using SQL Server or a different relational database, you may have to recode for appropriate syntax.

The query result using the ZD function is as follows :

   SELECT COUNTRY, ROUND (POPULATION / ZD(AREA)) "Population Density" FROM COUNTRY;

And a partial query result using the ZD function is as follows:

   COUNTRY                          Population Density -------------------------------- ------------------ Algeria                                          14 Angola                                           10 Benin                                            71 Botswana                                          3 Burkina Faso                                     51 Burundi                                         315 Central African Republic                          7 Congo                                            28 Djibouti                                         22 Equatorial Guinea                                19 Ethiopia                                         67

The SELECT statement can also use what is called an alias . An alias allows renaming a table during the execution of a query. An alias is defined in the FROM clause, and then used in the list of fields selected to refer to the table found by the FROM clause:

   SELECT C.COUNTRY, ROUND (C.POPULATION / C.AREA, 0) "Population Density" FROM COUNTRY C;

The division by zero error should appear again.

In the preceding query, the COUNTRY table is aliased as the letter C . Each of the fields retrieved from the COUNTRY table is then referenced using an alias.field notation. The result of the preceding query will be identical to the previous example. You will learn the real usefulness of using aliases when you read about join queries later on in this chapter.

A join query is a query that joins fields and records together between two tables, returning records from both tables.

The WHERE Clause

The WHERE clause enhances a SELECT statement by filtering out unwanted records returned by a SELECT statement. This is the basic syntax of the WHERE clause:

   SELECT ... FROM ...  [   WHERE [table.alias.] { field  expression }   comparison   [table.alias.] { field  expression }   ];

The SELECT statement and the FROM clause are mandatory. The WHERE clause is optional and is thus enclosed in square brackets as in [ WHERE ˆ ] .

This query finds all countries with a population of over 100 million people:

   SELECT COUNTRY, TO_CHAR(POPULATION, '9,999,999,990') "Population" FROM COUNTRY WHERE POPULATION > 100000000;

The preceding query uses a special TO_CHAR function in order to format the output of the large num bers into a more readable format.

Also in the preceding query, note the use of the single quote (') character. Which types of quotation char acters are used, and in which types of commands, varies from one database engine to another.

This is the result of the preceding query:

   COUNTRY                          Population -------------------------------- -------------- Nigeria                             131,859,731 Mexico                              107,449,525 China                             1,313,973,713 Indonesia                           245,452,739 Japan                               127,463,611 Bangladesh                          147,365,352 India                             1,095,351,995 Pakistan                            165,803,560 United States                       298,444,215 Russia                              142,893,540 Brazil                              188,078,227

The WHERE clause can apply multiple filters to selected records. Multiple filters are placed into the same query by joining the two filters together using logical operators called AND , OR , and NOT . Logical operators are shown in the following syntax:

   SELECT ... FROM ... [    WHERE filter  [   { AND  OR } [ NOT ]   [table.alias.] { field  expression }   comparison   [table.alias.] { field  expression }   [ { AND  OR } [ NOT ] ...   ]  ];

This query finds all countries with a population over 100 million people and a population density of over 100 people per square kilometer. The two different comparisons are joined together using the AND logical operator (or conjunction):

   SELECT COUNTRY, TO_CHAR(POPULATION, '9,999,999,990') "Population",    ROUND(POPULATION / ZD(AREA)) "Population Density" FROM COUNTRY WHERE POPULATION > 100000000 AND ROUND(POPULATION / ZD(AREA) , 0) > 100;

Notice how the expression ROUND (POPULATION / ZD(AREA), 0) is used in the WHERE clause as well as retrieved by the SELECT clause. Also, the division by zero error is removed using the AREA field filter.

This is the result of the preceding query:

   COUNTRY                          Population     Population Density -------------------------------- -------------- ------------------ Nigeria                             131,859,731                145 China                             1,313,973,713                141 Indonesia                           245,452,739                135 Japan                               127,463,611                323 Bangladesh                          147,365,352               1100 India                             1,095,351,995                368 Pakistan                            165,803,560                213

And heres another example using the OR operator in addition to the AND operator:

   SELECT COUNTRY, TO_CHAR(POPULATION, '9,999,999,990') "Population", AREA "Km sq.",    ROUND(POPULATION / ZD(AREA), 0) "Population Density" FROM COUNTRY WHERE ((POPULATION > 100000000 AND (ROUND(POPULATION / ZD(AREA), 0) > 100)) OR (POPULATION < 100000000 AND (ROUND(POPULATION / ZD(AREA), 0) > 500)));

This query finds heavily populated countries based on population density, or just high density (a smaller country with a high population density). For example, Singapore has a relatively low population but is a very small country, more like a city-state. Singapore is very densely populated because all of its people are packed like sardines onto a small island:

   COUNTRY                          Population         Km sq. Population Density -------------------------------- -------------- ---------- ------------------ Mauritius                             1,240,827       1849                671 Nigeria                             131,859,731     910771                145 Barbados                                279,912        430                651 Bermuda                                  65,773         49               1342 Malta                                   400,214        321               1247 China                             1,313,973,713    9326411                141 Indonesia                           245,452,739    1811831                135 Japan                               127,463,611     394744                323  Singapore                             4,492,150        624               7199  Taiwan                               23,036,087      32261                714 Bahrain                                 698,585        619               1129 Bangladesh                          147,365,352     133911               1100 India                             1,095,351,995    2973190                368 Pakistan                            165,803,560     778720                213

There are also types of comparisons you can use as shown in the following query, including <= , >= , BETWEEN n AND m, LIKE 'pattern' , and of course = for equality:

   SELECT REGION_ID, COUNTRY, TO_CHAR(POPULATION, '9,999,999,990') "Population",    AREA "Km sq.", ROUND(POPULATION / ZD(AREA), 0) "Population Density" FROM COUNTRY WHERE ((POPULATION <= 1000000 AND ROUND(POPULATION / ZD(AREA), 0) <= 100) OR (ROUND(POPULATION / ZD(AREA), 0) BETWEEN 0 AND 10)) AND REGION_ID = 1 AND COUNTRY LIKE '%a%';

The preceding query finds heavily populated countries in region 1 (the continent of Africa), as long as the country name contains a letter a somewhere in its name. The result is this:

   REGION_ID COUNTRY                  Population         Km sq. Population Density ---------- ------------------------ -------------- ---------- ------------------          1 Angola                       12,127,071    1246699                 10          1 Botswana                      1,639,833     585371                  3          1 Central African Republic      4,303,356     622980                  7          1 Equatorial Guinea               540,109      28050                 19          1 Gabon                         1,424,906     257669                  6          1 Gambia                                0      10000                  0          1 Ivory Coast                           0          0                  0          1 Libya                         5,900,754    1759540                  3          1 Mali                         11,716,829    1219999                 10          1 Mauritania                    3,177,388    1030400                  3          1 Namibia                       2,044,147     823291                  2          1 Zaire                                 0          0                  0 12 records

Precedence in SQL

Another important factor to consider in SQL at this stage is the concept of precedence. Precedence of execution is the sequence in which the different filters are applied to the records retrieved by the SELECT statement. The previous query is duplicated here but formatted slightly differently, with everything spread out more clearly according to precedence of execution established by use of the round brackets:

   SELECT REGION_ID, COUNTRY, TO_CHAR(POPULATION,     AREA "Km sq.", ROUND(POPULATION / ZD(AREA), 0) "Population Density" FROM COUNTRY WHERE ((POPULATION <= 1000000       AND ROUND           (POPULATION / ZD(AREA)              , 0) <= 100)    OR    (ROUND       (POPULATION / ZD(AREA)          , 0) BETWEEN 0 AND 10)) AND REGION_ID = 1 AND COUNTRY LIKE '%a%';

This is the sequence in which the preceding query WHERE clause comparisons are executed, based on the way that the round brackets determine precedence of execution:

The top-most ROUND function against the <= 100 comparison is executed:

The ROUND function is executed first as determined by expression precedence. At this point, I am dis cussing WHERE clause filtering comparison precedence.
```
   AND ROUND     (POPULATION / ZD(AREA)        , 0) <= 100   
```
The second comparison ( BETWEEN 0 AND 10 ), containing the ROUND function is executed:
```
   ROUND (POPULATION / ZD(AREA)    , 0) BETWEEN 0 AND 10   
```

The POPULATION <= 1000000 is then logically combined with the result of the comparison in Step 1:

   (POPULATION <= 1000000    AND ROUND        (POPULATION / ZD(AREA)           , 0) <= 100)

The results of the comparisons in Steps 3 and 2 are then combined using the OR logical operator, meaning that either Step 2 or Step 3 can be true to produce a true result:

   ((POPULATION <= 1000000       AND ROUND           (POPULATION / ZD(AREA)              , 0) <= 100)    OR    (ROUND       (POPULATION / ZD(AREA)          , 0) BETWEEN 0 AND 10))

Finally the two AND logical operators, executed as the AND REGION_ID = 1 comparison, followed by the AND COUNTRY LIKE '%a%' comparison, are executed to produce a resulting true or false result for every record retrieved by the SELECT statement:
```
   AND REGION_ID = 1 AND COUNTRY LIKE '%a%'   
```

Technically, ZD(AREA) is computed first, but since the ZD function is used everywhere for a population density calculation, I have assumed it a part of the contained expression in this situation. The objective is to give you an overall picture of the process.

Just to demonstrate the use of round brackets and how they apply precedence, the query that follows is the same as the previous query, except that I have removed all the round brackets, which determine the precedence of execution of comparisons in the WHERE clause (excluding those encapsulating the ROUND function as an expression):

   SELECT REGION_ID, COUNTRY, TO_CHAR(POPULATION, '9,999,999,990') "Population",    AREA "Km sq.", ROUND(POPULATION / ZD(AREA), 0) "Population Density" FROM COUNTRY WHERE POPULATION <= 1000000 AND ROUND(POPULATION / ZD(AREA), 0) <= 100 OR ROUND(POPULATION / ZD(AREA), 0) BETWEEN 0 AND 10 AND REGION_ID = 1 AND COUNTRY LIKE '%a%';

In this situation the AND operator has higher precedence than the OR operator and thus all operators are processed as they are encountered (left to right and top to bottom). As shown in the code that follows, the result of the query is completely different. This is a partial result and even this shows extra countries, which are not even in Africa:

   REGION_ID COUNTRY                  Population         Km sq. Population Density ---------- ------------------------ -------------- ---------- ------------------          1 Angola                       12,127,071    1246699                 10          1 Botswana                      1,639,833     585371                  3          1 Central African Republic      4,303,356     622980                  7  1 Djibouti                        486,530      21979                 22  1 Equatorial Guinea               540,109      28050                 19          1 Gabon                         1,424,906     257669                  6          1 Gambia                                0      10000                  0          1 Ivory Coast                           0          0                  0          1 Libya                         5,900,754    1759540                  3          1 Mali                         11,716,829    1219999                 10          1 Mauritania                    3,177,388    1030400                  3          1 Namibia                       2,044,147     823291                  2          1 Zaire                                 0          0                  0  4 Bahamas                               0      10070                  0   6 Cyprus                          784,301       9241                 85   6 Iceland                         299,388     100251                  3   6 Northern Ireland                      0          0                  0   6 Scotland                              0          0                  0   6 Slovak Republic                       0      48800                  0   6 Yugoslavia                            0     102136                  0   10 Greenland                        56,361     341701                  0   11 Comoros Islands                       0          0                  0

In the preceding query, excluding comparison precedence round brackets, there are now 18 records instead of the original 9 records. Djibouti is in Africa but has too low a population, failing the LIKE operator test. The rest of the countries have too low a population, none are in Africa, some fail the LIKE test, and many fail the population density test. However, population density is not a prerequisite because it can be overridden by the low population size comparison. That clearly demonstrates why the round brackets are used to change the precedence of execution, of comparisons, in the WHERE clause.

The only other rule to remember is that the logical operators have an overriding order of precedence. NOT is executed first, followed by AND , and finally by OR . Use of NOT AND or NOT OR requires failure of a test to get a true result.

Precedence in Expressions

The precedence in WHERE clause comparisons determines the sequence in which those comparisons are executed. Mathematically speaking, an expression can be executed in a specific order.

An expression is a single valued, or multiple valued, calculation that when evaluated produces a single result. The number 2 is an expression because its result is 2. The string (5*10)+3 is also an expression because the result is 53. More abstractly, (x*y)+z is also an expression because it equals (x*y)+z.

The expression ((100/((x+y)*z))-5) is an expression in itself, and also contains multiple expressions within it, all producing a single result, where each result is passed back to the calling (or container) expression. This expression can be divided up as follows, where expressions proceed outwards with indentations. It begins with the expression x+y, followed by (x+y)*z, 100/((x+y)*z), (100/((x+y)*z))-5, and finally the round brackets around the entire expression. The round brackets are unnecessary but make the resulting expression look a little easier to read:

   ((100/       ((x+y)          *z))    -5)

For an example of expression precedence, consider the following:

   ((100/((2+3)*2))-5) = ((100/((2+3)*2))-5)                      = ((100/(5*2))-5)                      = ((100/10)-5)                      = (10-5)                      = 5

Now remove all the brackets from the preceding expression and simple rules of mathematical expression apply, where * and / are executed before + and ˆ , but * has equal precedence to /, and + has equal precedence to ˆ . Additionally, the expression is evaluated (calculated) from left to right:

   100/2+3*2-5 = 50+3*2-5             = 50+6-5             = 56-5             = 1

As you see in the preceding code, no round brackets and a different value results. As you saw in queries executed earlier, mathematical expressions are used in SQL, specifically when using the ROUND function in the ROUND (POPULATION / ZD(AREA), 0) calculation. This function executes the division first and then returns a result that is rounded up to the nearest zero. In this query the ROUND function is removed from the fields and expressions retrieved from the table:

   SELECT COUNTRY, TO_CHAR(POPULATION, '9,999,999,990') "Population", AREA "Km sq.",    POPULATION / ZD(AREA) "Population Density" FROM COUNTRY WHERE ((POPULATION <= 1000000 AND ROUND(POPULATION / ZD(AREA), 0) <= 100) OR (ROUND(POPULATION / ZD(AREA), 0) BETWEEN 0 AND 10)) AND REGION_ID = 1 AND COUNTRY LIKE '%a%';

The resulting query contains a population density figure that is more difficult to read at first glance:

   COUNTRY                  Population         Km sq. Population Density ------------------------ -------------- ---------- ------------------ Angola                       12,127,071    1246699         9.72734477 Botswana                      1,639,833     585371         2.80135675 Central African Republic      4,303,356     622980         6.90769527 Equatorial Guinea               540,109      28050         19.2552228 Gabon                         1,424,906     257669         5.52998615 Gambia                                0      10000                  0 Ivory Coast                           0          0                  0 Libya                         5,900,754    1759540         3.35357764 Mali                         11,716,829    1219999         9.60396607 Mauritania                    3,177,388    1030400         3.08364519 Namibia                       2,044,147     823291          2.4828973 Zaire                                 0          0                  0

One more point about expression precedence is that any function, such as the ROUND function, has higher precedence than all the arithmetic operators *, /, +, and ˆ . Also some versions of SQL, for some database engines, will use a POWER function. The POWER function raises one number to the power of another. For example 2 ³ = 8, but could also be executed as POWER (2, 3) = 8, depending on the database engine. Some database engines even have a special exponent operator for executing exponents, sometimes using a ^ character, or even a ^^ character sequence. And thus 2^3 or 2^^3 might both be equal to 8, depending on the database engine in use. In this case, the precedence of arithmetic operators would be ROUND (or any other function), followed by ^ or ^^, followed by * and /, and finally + and ˆ .

Mathematically, exponentiation is actually the conversion of large numbers to exponential notation where the number 1,000,000,000,000 can also be represented as 1*10 ¹² , where 12 is the exponent, and the number 1*10 ¹² is an exponential notational expression of the number 1,000,000,000,000. In all the database engines I have worked with in the past, the term exponent is applied to raising a number to the power of another. The exponent is actually the number that another number is raised by, not the entire expression itself. Thus, in the expression 2 ³ , 2 is the number, and 3 is the exponent.

The ORDER BY Clause

The ORDER BY clause can be used to further enhance a SELECT statement by changing the order in which records are returned. This is the basic syntax of the ORDER BY clause:

   SELECT ... FROM ... [ WHERE ... ]  [ ORDER BY { field  expression [ASC DESC] [ , ... ] } ];

The SELECT statement and the FROM clause are mandatory. The WHERE clause and the ORDER BY clause are both optional and thus enclosed in square brackets as in [ ORDER BY ˆ ] .

The ORDER BY clause always appears last in a query. The ORDER BY clause is applied to the filtered records after the application of the WHERE clause, not before records are filtered.

This is the same query used before, except that now the records are returned in order of decreasing population density, within increasing REGION_ID value:

   SELECT REGION_ID, COUNTRY, TO_CHAR(POPULATION, '9,999,999,990') "Population",    AREA "Km sq.", ROUND(POPULATION / ZD(AREA), 0) "Population Density" FROM COUNTRY WHERE ((POPULATION <= 1000000 AND ROUND(POPULATION / ZD(AREA), 0) <= 100) OR (ROUND(POPULATION / ZD(AREA), 0) BETWEEN 0 AND 10)) AND REGION_ID IN (1, 13) AND COUNTRY LIKE '%a%' ORDER BY REGION_ID ASC, ROUND(POPULATION / ZD(AREA), 0) DESC;

Like the WHERE clause, expressions can be used in the ORDER BY clause, as in ROUND(POPULATION / ZD(AREA), 0) DESC .

A new comparison is introduced in the form of REGION_ID IN (1, 13) . The IN operator allows val idation against a list of values. In this case, the REGION_ID can be 1 (Africa), or 13 (South America).

Here is the result of the preceding query with Equatorial Guinea and Bolivia being the most densely populated countries in Africa and South America, respectively:

   REGION_ID COUNTRY                  Population         Km sq. Population Density ---------- ------------------------ -------------- ---------- ------------------  1 Equatorial Guinea               540,109      28050                 19  1 Angola                       12,127,071    1246699                 10          1 Mali                         11,716,829    1219999                 10          1 Central African Republic      4,303,356     622980                  7          1 Gabon                         1,424,906     257669                  6          1 Botswana                      1,639,833     585371                  3          1 Mauritania                    3,177,388    1030400                  3          1 Libya                         5,900,754    1759540                  3          1 Namibia                       2,044,147     823291                  2          1 Zaire                                 0          0                  0          1 Ivory Coast                           0          0                  0          1 Gambia                                0      10000                  0  13 Bolivia                       8,989,046    1084389                  8  13 Guyana                          767,245     196850                  4         13 French Guiana                   199,509      89150                  2         13 Surinam                               0     161471                  0

The AS Clause in the ORDER BY Clause

The AS clause can be used to rename a field within the scope of a running query. In the previous query the quoted strings Population and Population Density were used to change only the field headers in the output. This next query changes the name of the population density expression ROUND(POPULATION / ZD(AREA), 0) , giving the expression a field name of its own. This allows access to the name of the expression in the ORDER BY clause, rather than duplicating the entire expression:

   SELECT REGION_ID, COUNTRY, TO_CHAR(POPULATION, '9,999,999,990') "Population",    AREA "Km sq.", ROUND(POPULATION / ZD(AREA), 0)  AS DENSITY  FROM COUNTRY WHERE ((POPULATION <= 1000000 AND ROUND(POPULATION / ZD(AREA), 0) <= 100) OR (ROUND(POPULATION / ZD(AREA), 0) BETWEEN 0 AND 10)) AND REGION_ID IN (1, 13) AND COUNTRY LIKE '%a%' ORDER BY REGION_ID ASC,  DENSITY  DESC;

The result of the preceding query is very similar to the previous querys output, as you can see in the code that follows, except that the field name of the calculated expression is now DENSITY as opposed to Population Density :

   REGION_ID COUNTRY                  Population         Km sq.  DENSITY  ---------- ------------------------ -------------- ---------- ----------          1 Equatorial Guinea               540,109      28050         19          1 Angola                       12,127,071    1246699         10          1 Mali                         11,716,829    1219999         10          1 Central African Republic      4,303,356     622980          7          1 Gabon                         1,424,906     257669          6          1 Botswana                      1,639,833     585371          3          1 Mauritania                    3,177,388    1030400          3          1 Libya                         5,900,754    1759540          3          1 Namibia                       2,044,147     823291          2          1 Zaire                                 0          0          0          1 Ivory Coast                           0          0          0          1 Gambia                                0      10000          0         13 Bolivia                       8,989,046    1084389          8         13 Guyana                          767,245     196850          4         13 French Guiana                   199,509      89150          2         13 Surinam                               0     161471          0

You cannot, however, do as in the following query and replace the density expression in the WHERE clause. The reason is very simple. The ORDER BY clause is always performed after all records have been retrieved from the database. Therefore the calculated expression exists as DENSITY at the point of execution of the ORDER BY clause. The WHERE clause on the other hand is executed as records are retrieved from the database. Consequently, the DENSITY field does not actually exist until after the WHERE clause has been applied to database access. The following query will return an error because the WHERE clause filters cannot find the field called DENSITY :

   SELECT REGION_ID, COUNTRY, TO_CHAR(POPULATION, '9,999,999,990') "Population",    AREA "Km sq.", ROUND(POPULATION / ZD(AREA), 0)  AS DENSITY  FROM COUNTRY WHERE ((POPULATION <= 1000000 AND  DENSITY  <= 100) OR (  DENSITY  BETWEEN 0 AND 10)) AND REGION_ID IN (1, 13) AND COUNTRY LIKE '%a%' ORDER BY REGION_ID ASC,  DENSITY  DESC;

Most database engines allow ORDER BY clause specification not only by field names and expression, but also by the position of a field (or expression), in the list of items retrieved by the SELECT statement. So you could change the ORDER BY clause to that shown in the following query, and get the same result as in the previous query:

   SELECT REGION_ID, COUNTRY, TO_CHAR(POPULATION, '9,999,999,990') "Population",    AREA "Km sq.", ROUND(POPULATION / ZD(AREA), 0)  AS DENSITY  FROM COUNTRY WHERE ((POPULATION <= 1000000 AND ROUND(POPULATION / ZD(AREA), 0) <= 100) OR (ROUND(POPULATION / ZD(AREA), 0) BETWEEN 0 AND 10)) AND REGION_ID IN (1, 13) AND COUNTRY LIKE '%a%' ORDER BY 1 ASC, 5 DESC;

In the preceding query and ORDER BY clause, the REGION_ID field appears in the first position in the SELECT items list, and the population density expression appears fifth in the SELECT statement items list. You can even mix positional values in the ORDER BY clause with field names and expressions. Additionally, some databases do not require that ORDER BY clause items actually be retrieved, as in the following example:

   SELECT REGION_ID, REGION FROM REGION ORDER BY ROUND(POPULATION / ZD(AREA), 0) DESC;

In the query that follows, it is clear to see that the density expression calculation is not returned by the query:

   REGION_ID REGION ---------- --------------------------------          9 Near East          4 Caribbean          7 Far East          6 Europe          2 Asia          5 Central America          8 Middle East          1 Africa         13 South America         10 North America         11 Oceania         12 Russian Federation          3 Australasia

The ORDER BY clause is simple compared with the WHERE clause.

The GROUP BY Clause

The GROUP BY clause can be used to further enhance a SELECT statement by summarizing records returned from the SELECT statement. This is the basic syntax of the GROUP BY clause:

   SELECT ... FROM ... [ WHERE ... ]  [ GROUP BY expression [, ... ] ]  [ ORDER BY ... ];

The SELECT statement and the FROM clause are mandatory. The WHERE clause, the ORDER BY clause, and the GROUP BY clause are all optional and thus enclosed in square brackets as in [ GROUP BY ˆ ] .

The ORDER BY clause always appears last in a query. Therefore the ORDER BY clause is applied to both filtered and summarized records, after the application of both the WHERE clause and the GROUP BY clause.

The next query reads the COUNTRY table. It finds global population figures, for all countries, in all regions . This particular query produces total regional populations sorted in decreasing order of regional population:

   SELECT REGION_ID,    TO_CHAR(SUM(POPULATION)/1000000000, '990.9')' billion' "Population" FROM COUNTRY GROUP BY REGION_ID ORDER BY SUM(POPULATION) DESC;

See Appendix B for details of the demographics database, which is used for examples throughout this book.

Here is the result of the preceding query including regions with relatively negligible (zero) populations:

   REGION_ID Population ---------- --------------          7    2.1 billion          9    1.5 billion          1    0.8 billion          6    0.5 billion         13    0.4 billion         10    0.3 billion          8    0.3 billion         12    0.3 billion          5    0.1 billion          2    0.0 billion          4    0.0 billion          3    0.0 billion         11    0.0 billion

The GROUP BY clause can be further extended by using the HAVING clause. The HAVING clause is applied as a filter to the results of the GROUP BY clause, allowing filtering of resulting summarized records. This is the basic syntax of the GROUP BY clause, this time with the HAVING clause added:

   SELECT ... FROM ... [ WHERE ... ]  [ GROUP BY expression [, ... ] [ HAVING condition ] ]  [ ORDER BY ... ];

This query is adapted to remove regions with relatively negligible populations, creating a cut-off point at anything over half a billion people:

   SELECT REGION_ID,    TO_CHAR(SUM(POPULATION)/1000000000, '990.99')' billion' "Population" FROM COUNTRY GROUP BY REGION_ID HAVING SUM(POPULATION)/1000000000 >= 0.5 ORDER BY SUM(POPULATION) DESC;

Both the GROUP BY clause and the HAVING clause can contain expressions as well as simple field names, as in SUM(POPULATION) .

Here is the result of the preceding query showing the regions of the Far East, the Near East, and Africa:

   REGION_ID Population ---------- ---------------          7    2.10 billion          9    1.50 billion          1    0.79 billion

Some database engines allow specialized clauses for the GROUP BY clause. These extensions are gener ally reserved for data warehouse OLAP (Online Application Processing). OLAP functionality is used to produce analytical projection reporting, such as sub- totaling rollups, cubic cross-tabs, and even spreadsheets.

The JOIN Clause

The JOIN clause is used to merge the records of two separate queries together, based on common field values. This is the basic syntax of the JOIN clause:

   SELECT ... FROM table [alias]  [ JOIN table [alias] USING (common-field) ]  [ WHERE ... ] [ GROUP BY ... ] [ ORDER BY ... ];

The common-field element determines that records from two tables are joined together using a field con taining values common to records in both tables.

The following query reads both the REGION and COUNTRY tables, joining the two with the JOIN clause:

   SELECT REGION, COUNTRY FROM REGION JOIN COUNTRY USING (REGION_ID);

The preceding query applies the USING clause to join the REGION and COUNTRY tables based on a REGION_ID value. Every region has a unique value for REGION_ID . Whenever a new record is added to the COUNTRY table, then that country record is allocated a REGION_ID value based on the region the country is in. For example, Cameroon is a country in Africa, Georgia is a country in the Russian Federation, and South Korea is a country in the Far East region.

See Appendix B for details of the demographics database, which is used for examples throughout this book.

Figure B-1 in Appendix B , and the description of Entity Relationship Diagrams (ERDs), also in Appendix B , show that there can be many countries within each region.

Here is a partial result of the preceding query, showing some countries within their respective regions:

   REGION                   COUNTRY ------------------------ ----------------------- Africa                   Cameroon Africa                   Cote Divoire Russian Federation       Georgia Russian Federation       Kazakhstan Far East                 Myanmar Middle East              Palestinian Territories Europe                   Serbia And Montenegro Far East                 South Korea Russian Federation       Ukraine Middle East              United Arab Emirates Russian Federation       Uzbekistan

The JOIN clause can be expanded upon by joining more than two tables using multiple JOIN clauses. The syntax is expanded upon as shown here:

   SELECT ... FROM table [alias]  [ JOIN table [alias] USING (common-field)   [ JOIN table [alias] USING (common-field) ]   ]  [ WHERE ... ] [ GROUP BY ... ] [ ORDER BY ... ];

The following query joins regions to countries, and then countries to states using two JOIN clauses:

   SELECT REGION, COUNTRY, STATE "State or Province" FROM REGION JOIN COUNTRY USING (REGION_ID)    JOIN STATE USING (COUNTRY_ID);

Again, the USING clause is applied to join the REGION and COUNTRY tables using REGION ID , plus a second USING clause is applied to join the COUNTRY and STATE tables together.

Figure B-1 in Appendix B , and the description of Entity Relationship Diagrams (ERDs), also in Appendix B , show that there can be many countries within each region and many states within each country.

Here is a partial result of the preceding query, showing some states within the United States and Canada, in North America (only North America has states in this database):

   REGION                   COUNTRY                  State or Province ------------------------ ------------------------ ---------------- North America            United States            Washington North America            United States            Wisconsin North America            United States            West Virginia North America            United States            Wyoming North America            United States            Nebraska North America            United States            New Jersey North America            United States            Vermont North America            Canada                   British Columbia North America            Canada                   Nova Scotia North America            Canada                   Ontario North America            Canada                   Quebec North America            Canada                   Alberta

And going just a little further, the next query joins regions to countries, and then countries to states, followed by states to cities. This time the query uses four JOIN clauses:

   SELECT REGION, COUNTRY, STATE "State or Province", CITY FROM REGION JOIN COUNTRY USING (REGION_ID)    JOIN STATE USING (COUNTRY_ID)       JOIN CITY USING (STATE_ID);

And once again, the USING clause is applied to join the REGION and COUNTRY tables using the REGION ID field. A second USING clause is applied to join the COUNTRY and STATE tables together using the COUNTRY_ID field. A third USING clause is applied to join the STATE and CITY tables together using the STATE_ID field.

Figure B-1 in Appendix B , and the description of Entity Relationship Diagrams (ERDs), also in Appendix B , show that there can be many cities within each state, within each country, within each region.

Here is a partial result of the preceding query, showing some cities within their respective states, in the United States and Canada, in North America:

   REGION         COUNTRY        State or Province  CITY -------------- -------------- ------------------ --------------- North America  United States  Texas              Waco North America  United States  Utah               Salt Lake City North America  United States  Virginia           Norfolk North America  United States  Washington         Seattle North America  United States  Washington         Spokane North America  United States  Wisconsin          Madison North America  United States  Wisconsin          Milwaukee North America  United States  Wyoming            Cheyenne North America  Canada         Alberta            Burlington North America  Canada         Alberta            Calgary North America  Canada         Alberta            Edmonton

Another variation of JOIN clause syntax is the ON clause, substituting for the USING clause. This is the altered syntax of the JOIN clause:

   SELECT ... FROM table [alias]  [ JOIN table [alias] ON (left-field = right-field)   [ JOIN table [alias] USING (left-field = right-field) ]   ]  [ WHERE ... ] [ GROUP BY ... ] [ ORDER BY ... ];

The USING clause can be applied when there are common fields between two tables, and those fields contain common values. You use the ON clause when two fields that are named the same in two tables do not contain common values. Or, when using aliases in a join query, the field names on the joined tables are effectively different because they are altered by two different table aliases. For example, alias-1.field1 is not the same as alias-2.field1 .

In the following example query, aliases rename fields in tables within the query, effectively giving fields different names. The USING clause would not be able to perform the join because it cannot determine that r.REGION_ID is the same as c.REGION_ID :

   SELECT r.REGION, c.COUNTRY FROM REGION r JOIN COUNTRY c ON (c.REGION_ID = r.REGION_ID);

This is the resulting query:

   REGION                   COUNTRY ------------------------ ------------------------ Africa                   Cameroon Africa                   Cote Divoire Russian Federation       Georgia Russian Federation       Kazakhstan Far East                 Myanmar Middle East              Palestinian Territories Europe                   Serbia And Montenegro Far East                 South Korea Russian Federation       Ukraine Middle East              United Arab Emirates Russian Federation       Uzbekistan

Types of Joins

There are various different types of joins in that two tables can be joined together, finding different sets of records for each join type. These join types are generally called natural joins, inner joins, cross-joins, and various subtypes of outer joins.

A natural join uses the NATURAL keyword to allow the database to guess at the correct field to join two tables on. The syntax looks like this:

   SELECT ... FROM table [alias]  [ NATURAL JOIN table [alias] ... ]  [ WHERE ... ] [ GROUP BY ... ] [ ORDER BY ... ];

A natural join does not require the ON or USING keywords.

The following query joins regions to countries, with a best guess using common field names between the two tables:

   SELECT REGION, COUNTRY FROM REGION NATURAL JOIN COUNTRY;

The preceding query will not return any records. Examine the ERD in Appendix B and you will see that the REGION and COUNTRY tables have three common fields, in the form of the REGION_ID, POPULATION, and AREA fields. Obviously, population and square kilometer values for regions and countries are completely different, and the result is no records found by the preceding query.

The keyword INNER can be used to indicate an inner or intersecting join. All the join queries seen so far in this section are inner joins.

From a mathematical perspective, an intersection is made up of the common elements in two sets of values.

The syntax looks like this where either the USING or ON clause are required:

   SELECT ... FROM table [alias]  [   INNER JOIN table [alias] USING (common-field)   INNER JOIN table [alias] ON (left-field = right-field)   ]  [ WHERE ... ] [ GROUP BY ... ] [ ORDER BY ... ];

As already stated, all working example join queries seen so far in this section are inner joins.

The INNER keyword is optional.

A special type of intersecting join is a self-join. A self-join is simply an intersection between records in the same table. You can join records within the same table. Commonly self-joins are used to join records in a table containing hierarchical data.

A cross-join returns all records from two tables, merging regardless of any common values. In other words, every record in one table is joined with every record in another table. A cross-join completely ignores common fields and common values.

Mathematically speaking, a cross-join is a Cartesian product which is a multiplication or product of all the elements in two sets.

The syntax for a cross-join looks like this:

   SELECT ... FROM table [alias]  [ CROSS JOIN table [alias] ]  [ WHERE ... ] [ GROUP BY ... ] [ ORDER BY ... ];

The following example, as you can see, does not relate regions with countries:

   SELECT REGION, COUNTRY FROM REGION CROSS JOIN COUNTRY;

This is the result of the preceding query. The United Kingdom is not in the Near East region, and Afghanistan is definitely not in North America:

   REGION                   COUNTRY ------------------------ ---------------------  Near East                United Kingdom  Near East                United States Near East                Uruguay Near East                Uzbekistan Near East                Venezuela Near East                Vietnam Near East                Yemen Near East                Yugoslavia Near East                Zaire Near East                Zambia Near East                Zimbabwe  North America            Afghanistan  North America            Albania North America            Algeria North America            American Samoa

Data warehouses sometimes use cross-joins.

An outer join can be used to return both intersecting records, plus records in one or either table, but not in the other table. There are three types of outer joins: a left outer join, a right outer join, and a full outer join.

A left outer join finds all records in the table on the left, plus all records in the table on the right, where records in the left-side table do not have to exist in the right-side table. This is the syntax for a left outer join:

   SELECT ... FROM table [alias]  [   LEFT OUTER JOIN table [alias] USING (common-field)   LEFT OUTER JOIN table [alias] ON (left-field = right-field)   ]  [ WHERE ... ] [ GROUP BY ... ] [ ORDER BY ... ];

The USING or ON clause is required for outer joins.

This query finds all countries regardless of whether they have states or not:

   SELECT COUNTRY, STATE "State or Province" FROM COUNTRY LEFT OUTER JOIN STATE USING (COUNTRY_ID);

This is the result of the preceding query. The United States has states, and Canada has provinces . Other countries do not have states, at least not in this database:

   COUNTRY                  State or Province ------------------------ ----------------- United States            Washington United States            Wisconsin United States            West Virginia United States            Wyoming Canada                   Alberta United States            Nebraska United States            New Jersey United States            Vermont  South Korea   India   Maldive Islands   Papua New Guinea

In an outer join, records without matching field values have non-retrievable field values set to NULL. Thus in the preceding query, State or Province values for the last four countries are all NULL values.

A right outer join finds all records in the table on the right, plus all records in the table on the left, where records in the right-side table do not have to exist in the left-side table. This is the syntax for a right outer join:

   SELECT ... FROM table [alias]  [   RIGHT OUTER JOIN table [alias] USING (common-field)   RIGHT OUTER JOIN table [alias] ON (left-field = right-field)   ]  [ WHERE ... ] [ GROUP BY ... ] [ ORDER BY ... ];

This query simply reverses the sequence of the tables from the previous example query, again finding all countries regardless of whether they have states or not:

   SELECT COUNTRY, STATE "State or Province" FROM STATE RIGHT OUTER JOIN COUNTRY USING (COUNTRY_ID);

The result for the preceding query is identical to the previous query result.

The last type of outer join is a full outer join. A full outer join finds all intersecting records, plus any records in the left-side table not in the right side, and any records in the right-side table not in the left. This is the syntax for a full outer join:

   SELECT ... FROM table [alias]  [   FULL OUTER JOIN table [alias] USING (common-field)   FULL OUTER JOIN table [alias] ON (left-field = right-field)   ]  [ WHERE ... ] [ GROUP BY ... ] [ ORDER BY ... ];

The following query can potentially find all countries regardless of whether they have states or not, and vice versa:

   SELECT COUNTRY, STATE "State or Province" FROM COUNTRY FULL OUTER JOIN STATE USING (COUNTRY_ID);

The database used in this book does not have any relationship that would demonstrate a full outer join, and thus showing a query result is pointless.

Subqueries

A subquery is a query that is embedded within, and called from, another query. The calling query is known as the calling query or the parent query. Subqueries can be nested to an arbitrary depth, depending usually on the specific database engine in use. Subqueries can return a single scalar value, multiple values, or even multiple records. A subquery can be correlated, passing values passed into a child query, which can affect records retrieved by the subquery, for each parent query record retrieved. There are various different ways that subqueries can be used and built:

The IN and EXISTS operators can contain embedded subqueries.
A subquery can be embedded in the FROM clause of a SELECT statement.
A subquery can be embedded in WHERE , ORDER BY and GROUP BY clauses.
Subqueries can be embedded into INSERT , UPDATE and DELETE statements.

INSERT , UPDATE and DELETE statements are used to add to, change and remove records in tables.

In this first case, a subquery finds all countries, as long as they have a related entry in the REGION table:

   SELECT * FROM COUNTRY WHERE REGION_ID IN (SELECT REGION_ID FROM REGION);

The subquery is used in the WHERE clause in the preceding query.

This is a partial result of the preceding query. Not all fields are included in this result:

   COUNTRY_ID  REGION_ID COUNTRY                          CO POPULATION ---------- ---------- -------------------------------- -- ----------          1          1 Algeria                          AG   32930091          2          1 Angola                           AO   12127071          3          1 Benin                            BN    7862944          4          1 Botswana                         BC    1639833          5          1 Burkina Faso                     UV   13902972          6          1 Burundi                          BY    8090068          7          1 Central African Republic         CT    4303356          8          1 Congo                            CG   62660551          9          1 Djibouti                         DJ     486530         10          1 Equatorial Guinea                EK     540109         11          1 Ethiopia                         ET   74777981

This next query embeds multiple layers of subqueries:

   SELECT * FROM CITY WHERE CITY_ID IN    (SELECT CITY_ID FROM STATE WHERE STATE_ID IN       (SELECT STATE_ID FROM STATE WHERE COUNTRY_ID IN          (SELECT COUNTRY_ID FROM COUNTRY WHERE REGION_ID IN             (SELECT REGION_ID FROM REGION))));

This is a partial result of the preceding query:

   CITY_ID COUNTRY_ID   STATE_ID CITY                             POPULATION ---------- ---------- ---------- -------------------------------- ----------        583        127          8 Phoenix                             3800000        584        127            Richmond                            1175000        585        127         37 Rochester                           1050000        586        127          9 San Diego                           3000000        587        127         29 St. Louis                           2800000        588        127         14 Tampa                               2650000        589        127            Virginia Beach                      1675000        590        127         12 Washington                          8050000        591        165            Tashkent                            2400000        592        151            Barquisimeto                        1150000        593        151            Maracaibo                           2150000

A subquery can return a single scalar value, multiple values, or even multiple records. In this example, the subquery returns the REGION_ID value for North America only:

   SELECT * FROM COUNTRY WHERE REGION_ID =    (SELECT REGION_ID FROM REGION WHERE REGION='North America');

This is a partial result, finding only countries in North America:

   COUNTRY_ID  REGION_ID COUNTRY                          CO POPULATION ---------- ---------- -------------------------------- -- ----------        125         10 Canada                           CA   33098932        126         10 Greenland                        GL      56361        127         10 United States                    US  298444215

In the next subquery the IN operator validates using two fields:

   SELECT * FROM CITY WHERE (COUNTRY_ID, STATE_ID) IN    (SELECT COUNTRY_ID, STATE_ID FROM COUNTRY JOIN STATE USING (COUNTRY_ID));

This is a partial result:

   CITY_ID COUNTRY_ID   STATE_ID CITY                             POPULATION ---------- ---------- ---------- -------------------------------- ----------        583        127          8 Phoenix                             3800000        585        127         37 Rochester                           1050000        586        127          9 San Diego                           3000000        587        127         29 St. Louis                           2800000        588        127         14 Tampa                               2650000        590        127         12 Washington                          8050000          1        125          1 Vancouver                           2200000          2        125          2 Halifax          3        125          3 Ottawa                              1150000          4        125          3 Toronto                             5150000          5        125          4 Montreal                            3600000

A subquery can be correlated, using values passed into it from its parent query, which can affect records retrieved by the subquery:

   SELECT * FROM COUNTRY WHERE REGION_ID IN    (SELECT REGION_ID FROM REGION WHERE REGION_ID=COUNTRY.REGION_ID);

In the preceding query, the subquerys WHERE clause ( WHERE REGION_ID=COUNTRY.REGION_ID ) takes a value passed from the calling query into the subquery, determining which records the subquery retrieves for every record retrieved by the calling query. In other words, for every country found, the subquery uses the countrys REGION_ID value to find that countrys region.

This is a partial result:

   COUNTRY_ID  REGION_ID COUNTRY                          CO POPULATION ---------- ---------- -------------------------------- -- ----------          1          1 Algeria                          AG   32930091          2          1 Angola                           AO   12127071          3          1 Benin                            BN    7862944          4          1 Botswana                         BC    1639833          5          1 Burkina Faso                     UV   13902972          6          1 Burundi                          BY    8090068          7          1 Central African Republic         CT    4303356          8          1 Congo                            CG   62660551          9          1 Djibouti                         DJ     486530         10          1 Equatorial Guinea                EK     540109         11          1 Ethiopia                         ET   74777981

Some database engines allow the use of an EXISTS operator, which can be used as a slight variation on the IN operator. The IN operator returns a value or a set of values. The EXISTS operator returns a true or false result (a Boolean). So the previous correlated query can be changed to this:

   SELECT * FROM COUNTRY WHERE EXISTS    (SELECT REGION_ID FROM REGION WHERE REGION_ID = COUNTRY.REGION_ID);

The result of the preceding query is the same as the result for the previous example query.

Some databases allow embedding of a subquery into the FROM clause of a SELECT statement. Unlike subqueries in other SQL statements, a FROM clause embedded subquery can return fields and records to the calling query:

   SELECT r.REGION, c.COUNTRY FROM REGION r,    (SELECT * FROM COUNTRY) c WHERE c.REGION_ID = r.REGION_ID;

This is the result of the query where the name of the region is pulled from the calling query, and the name of the country from the FROM clause embedded subquery:

   REGION                           COUNTRY -------------------------------- -------------------------------- Africa                           Zambia Africa                           Zimbabwe Asia                             Burma Australasia                      Australia Australasia                      New Zealand Caribbean                        Bahamas Caribbean                        Barbados Caribbean                        Bermuda Caribbean                        Costa Rica

A subquery can be one of the expressions in a SELECT statement field selection list:

   SELECT (SELECT r.REGION FROM REGION r WHERE r.REGION_ID = c.REGION_ID)    ,c.COUNTRY FROM COUNTRY c;

The result of the preceding query will be the same as the previous query.

A subquery can even be placed into the ORDER BY clause of a SELECT statement, helping to determine the sorted order of the output of a query. What this query does is to sort regions and countries based on the currency exchange rate (in descending order), without returning the exchange rate in the query (there are other ways to do this of course):

   SELECT r.REGION, c.COUNTRY FROM REGION r JOIN COUNTRY c ON (c.REGION_ID = r.REGION_ID) ORDER BY (SELECT RATE FROM COUNTRY WHERE COUNTRY_ID = c.COUNTRY_ID);

The result is shown here where the same fields are found, as in the previous query output result shown, except the sorted order changes the records returned:

   REGION                           COUNTRY -------------------------------- -------------------------------- Europe                           United Kingdom Europe                           Switzerland North America                    Canada Australasia                      Australia Australasia                      New Zealand Europe                           Germany Far East                         Singapore South America                    Brazil Far East                         Malaysia

The preceding query, like most others in this chapter, is only a partial result of all records found by the query. The resort returns a completely different set of records for the first few records in the query result.

The AS Clause in Subqueries

As with the ORDER BY clause, in some situations, after completion of execution of a subquery, the AS clause can be used to refer to retrieved items using an AS clause field name change. In the following query the subquery changes the names of the COUNTRY_ID field to COUNTRY , and the COUNTRY field to NAME . The renamed COUNTRY and NAME fields are then accessed in the parent query as c.COUNTRY (containing the COUNTRY_ID value), and c.NAME (containing the COUNTRY value):

   SELECT r.REGION, c.COUNTRY, c.NAME FROM REGION r,    (SELECT REGION_ID, COUNTRY_ID AS COUNTRY, COUNTRY AS NAME FROM COUNTRY) c WHERE c.REGION_ID = r.REGION_ID;

This is the result showing the field name changes in the header:

   REGION                              COUNTRY NAME -------------------------------- ---------- -------------------------------- Africa                                   42 Zaire Africa                                   43 Zambia Africa                                   44 Zimbabwe Asia                                     45 Burma Australasia                              46 Australia Australasia                              47 New Zealand Caribbean                                48 Bahamas Caribbean                                49 Barbados Caribbean                                50 Bermuda Caribbean                                51 Costa Rica Caribbean                                52 Cuba Caribbean                                53 Dominican Republic

Subqueries can also be placed within INSERT , UPDATE , and DELETE statements, which allows changes to records in a database. These types of subqueries are discussed in a later section in this chapter, after the INSERT , UPDATE , and DELETE statements have been examined.

The UNION Clause

The UNION clause returns records from two queries. The resulting records can be treated as a single query in that they can have an ORDER BY clause applied to them, and be resorted. Additionally, each query can contain query clauses such as WHERE and GROUP BY clauses.

The only requirement for a UNION clause is that the number of fields in the two queries are the same, and that data types for each field are at least compatible. This query finds all regions and countries, regardless of any relationship between the two tables:

   SELECT REGION_ID "ID-1", NULL "ID-2", REGION "Location" FROM REGION UNION SELECT REGION_ID "ID-1", COUNTRY_ID "ID-2", COUNTRY "Location" FROM COUNTRY ORDER BY 1, 2 NULLS FIRST;

The NULLS FIRST modifier is not available in all database engines. NULLS FIRST returns all null values before any non-null values, thus returning regions at the top of each list of countries, within that particular region.

The NULLS FIRST and NULLS LAST options will not be available in all database engines.

This is an interesting use of the UNION clause because it effectively sorts all countries within all regions without using a join, and with all records in the same query. This is the result:

   ID-1       ID-2 Location ---------- ---------- --------------------------------  2            Asia  2         45 Burma  3            Australasia  3         46 Australia          3         47 New Zealand  4            Caribbean  4         48 Bahamas          4         49 Barbados          4         50 Bermuda          4         51 Costa Rica          4         52 Cuba          4         53 Dominican Republic          4         54 Haiti          4         55 Jamaica          4         56 Martinique          4         57 Puerto Rico  5            Central America  5         58 El Salvador

Try It OutUsing the SELECT Statement

Five of the tables in the demographics database (see Appendix B) are REGION, COUNTRY, LANGUAGE, POPULATION, and POPULATIONBYLANGUAGE. The tables look like this:

   Table: REGION Field                         Null?    Datatype ----------------------------- -------- -------------------- REGION_ID                     NOT NULL INTEGER REGION                        NOT NULL VARCHAR(32) POPULATION                             INTEGER AREA                                   INTEGER

An Oracle Database VARCHAR2 data type is the same as VARCHAR. An Oracle Database NUMBER(38) data type is the same as an INTEGER data type.

   Table: COUNTRY Field             Null?    Type ----------------- -------- ------------ COUNTRY_ID        NOT NULL NUMBER(38) REGION_ID         NOT NULL NUMBER(38) COUNTRY           NOT NULL VARCHAR2(32) CODE              NOT NULL CHAR(2) POPULATION                 NUMBER(38) AREA                       NUMBER(38) FXCODE                     CHAR(3) CURRENCY                   VARCHAR2(32) RATE                       FLOAT(126) Table: LANGUAGE Field             Null?    Datatype ----------------- -------- ------------ LANGUAGE_ID       NOT NULL INTEGER LANGUAGE          NOT NULL VARCHAR(16) Table: POPULATION Field                         Null?    Datatype ----------------------------- -------- ---------- POPULATION_ID                 NOT NULL INTEGER COUNTRY_ID                             INTEGER YEAR                                   INTEGER POPULATION                             INTEGER BIRTHS_PER_1000                        FLOAT DEATHS_PER_1000                        FLOAT MIGRANTS_PER_1000                      FLOAT NATURAL_INCREASE_PERCENT               FLOAT GROWTH_RATE                            FLOAT Table: POPULATIONBYLANGUAGE Field                         Null?    Datatype ----------------------------- -------- ----------- LANGUAGE_ID                   NOT NULL INTEGER POPULATION_ID                 NOT NULL INTEGER MALE                                   INTEGER FEMALE                                 INTEGER

Use a SELECT statement to find records in which the female population is greater than the male population. Show both male and female populations in the year 1976, summed up for each language within each region, and sorted by each language within region. You dont need to display the year. You dont need to display the language. You dont need to return any fields from the COUNTRY and POPULATION tables, but you do need to get the name of the region from somewhere. The only numbers you should find are the male and female population values. This is how:

Create a SELECT statement with a list of fields:

   SELECT REGION, LANGUAGE, FEMALE, MALE

Add the FROM clause with the first table:

   SELECT REGION, LANGUAGE, FEMALE, MALE FROM REGION

Add the second table as a join to the first table:

   SELECT REGION, LANGUAGE, FEMALE, MALE FROM REGION JOIN COUNTRY USING (REGION_ID)

Add a third table to the join:

   SELECT REGION, LANGUAGE, FEMALE, MALE FROM REGION JOIN COUNTRY USING (REGION_ID)    JOIN POPULATION USING (COUNTRY_ID)

Add fourth and fifth tables to the join:

   SELECT REGION, LANGUAGE, FEMALE, MALE FROM REGION JOIN COUNTRY USING (REGION_ID)    JOIN POPULATION USING (COUNTRY_ID)       JOIN POPULATIONBYLANGUAGE USING (POPULATION_ID)          JOIN LANGUAGE USING (LANGUAGE_ID)

A WHERE clause comparison ensures that you find only records in the year 1976, and where the female population exceeds the male population:

   SELECT REGION, LANGUAGE, FEMALE, MALE FROM REGION JOIN COUNTRY USING (REGION_ID)    JOIN POPULATION USING (COUNTRY_ID)       JOIN POPULATIONBYLANGUAGE USING (POPULATION_ID)          JOIN LANGUAGE USING (LANGUAGE_ID) WHERE FEMALE > MALE AND YEAR = 1976

Summarize by language within groups by applying the SUM functions to both the FEMALE and MALE fields, and then aggregating in the GROUP BY clause using the REGION and LANGUAGE fields:

   SELECT REGION, LANGUAGE, SUM(FEMALE), SUM(MALE) FROM REGION JOIN COUNTRY USING (REGION_ID)    JOIN POPULATION USING (COUNTRY_ID)       JOIN POPULATIONBYLANGUAGE USING (POPULATION_ID)          JOIN LANGUAGE USING (LANGUAGE_ID) WHERE FEMALE > MALE AND YEAR = 1976 GROUP BY REGION, LANGUAGE

Sort the record output by language and region:

   SELECT REGION, LANGUAGE, SUM(FEMALE), SUM(MALE) FROM REGION JOIN COUNTRY USING (REGION_ID)    JOIN POPULATION USING (COUNTRY_ID)       JOIN POPULATIONBYLANGUAGE USING (POPULATION_ID)          JOIN LANGUAGE USING (LANGUAGE_ID) WHERE FEMALE > MALE AND YEAR = 1976 GROUP BY REGION, LANGUAGE ORDER BY REGION, LANGUAGE;

Last, you need to qualify the field name in the query because with so many tables in the join, SQL will get confused trying to figure out what to join with, and where to select fields from. So you need to add aliases and change all the USING clauses to more explicit joining ON clauses. You can also make the query look a little nicer by changing field names using the AS clause:

   SELECT r.REGION, l.LANGUAGE, SUM(pl.FEMALE), SUM(pl.MALE) FROM REGION r JOIN COUNTRY c ON (c.REGION_ID=r.REGION_ID)    JOIN POPULATION p ON (p.COUNTRY_ID=c.COUNTRY_ID)       JOIN POPULATIONBYLANGUAGE pl ON (pl.POPULATION_ID=p.POPULATION_ID)          JOIN LANGUAGE l ON (l.LANGUAGE_ID=pl.LANGUAGE_ID) WHERE pl.FEMALE > pl.MALE AND p.YEAR = 1976 GROUP BY r.REGION, l.LANGUAGE ORDER BY r.REGION, l.LANGUAGE;

If you are using the demographics database as provided by the database creation process described in Appendix B, and the downloads from my website at the following URL:

   http://www.oracledbaexpert.com/oracle/beginningxmldatabases/index.html

you should get a result like this:

   REGION                           LANGUAGE            FEMALES      MALES -------------------------------- ---------------- ---------- ---------- Australasia                      English Only        5185387    5096972 Australasia                      French                42750      38236 Australasia                      Russian                4979       3676 North America                    English             7101880    7020890 North America                    Estonian               6205       5770 North America                    Finish                15185      13285 North America                    Flemish                4015       3780 North America                    French              2977060    2910140 North America                    German               239995     236720 North America                    Icelandic              2650       2380 North America                    Japanese               8045       7480 North America                    Lettish                5610       5545 North America                    Russian               12105      11375 North America                    Welsh                  1045       1010 North America                    Yiddish               12325      11115 15 records

How It Works

You created a fairly complex SELECT statement that retrieved four fields from four different tables. The four tables were all joined with explicit field comparisons using the ON clause and inner (intersecting) joins. This means you only found records where all matches existed in all four tables. You also applied two filters in a WHERE clause; plus you summarized two integer values using a SUM function. Most database engines have a built-in SUM function for adding simple numbers together. Last, you altered the query by adding aliases.

Changing Data in a Database

Changes are made to fields, in records, in tables, in a database, using the INSERT , UPDATE , and DELETE statements.

The INSERT statement is used to add new records to tables. This is the basic syntax for an INSERT statement:

   INSERT INTO table [ (field [, ... ]) ] VALUES (expression [ , ... ]);

A new region could be added to the REGION table with the following statement:

   INSERT INTO REGION (REGION_ID, REGION, POPULATION, AREA) VALUES(100,'A New Region', 1000000, 1000);

Some database engines allow omission of the list of fields if the VALUES clause exactly matches the structure of the table:

   INSERT INTO REGION VALUES (101,'Another New Region', 2000000, 2000);

The UPDATE statement is used to change existing records in tables and has the following basic syntax:

   UPDATE table SET field = expression [, ... ] [ WHERE ... ];

The WHERE clause filter determines which records are updated.

If no WHERE clause is present in an UPDATE statement, then all records in the table will be altered with a single UPDATE statement.

In this example, both of the records created in the previous two INSERT statements are updated:

   UPDATE REGION SET REGION = UPPER (REGION) WHERE REGION_ID IN (100, 101);

The DELETE statement is similar in syntax to the UPDATE statement:

   DELETE FROM table [ WHERE ... ];

This statement will DELETE the two records created by the previous INSERT statements and subsequently altered by the single UPDATE statement shown previously:

   DELETE FROM REGION WHERE REGION_ID IN (100, 101);

Subqueries in Database Change Statements

Database change statements are the INSERT , UPDATE , and DELETE statements. Subqueries can be embedded into database change statements in various ways.

A subquery can be embedded into the VALUES clause of an INSERT statement:

   INSERT INTO REGION (REGION_ID, REGION, POPULATION, AREA) VALUES (102, 'East '(SELECT REGION FROM REGION WHERE REGION_ID=1), 1000000, 100);

The result is that there is a new region called East Africa, as shown in this query:

   SELECT * FROM REGION; REGION_ID REGION                           POPULATION       AREA ---------- -------------------------------- ---------- ----------          1 Africa                            789548670   26780325          2 Asia                               47382633     657741          3 Australasia                        24340222    7886602          4 Caribbean                          40417697     268857          5 Central America                   142653392    2360325          6 Europe                            488674441    4583335          7 Far East                         2100636517   15357441          8 Middle East                       294625718    6798768          9 Near East                        1499157105    4721322         10 North America                     331599508   18729272         11 Oceania                             9133256     536238         12 Russian Federation                258037209   21237500         13 South America                     375489788   17545171  102 East Africa                         1000000        100

A subquery can even become the VALUES clause of an INSERT statement:

   INSERT INTO REGION    SELECT COUNTRY_ID, COUNTRY, POPULATION, AREA    FROM COUNTRY WHERE COUNTRY_ID > 150;

The UPDATE and DELETE statements can contain a WHERE clause. Thus the same rules apply as for the SELECT statement, such that a subquery can be embedded in the WHERE clause of UPDATE and DELETE statements.

An interesting use of embedding subqueries into an UPDATE statement is that some database engines allow more than a single field to be updated within the same UPDATE statement. The multiple fields updated must be retrieved using a subquery as in the following example:

   UPDATE REGION SET (POPULATION, AREA) = (SELECT SUM(POPULATION), SUM(AREA)    FROM COUNTRY    WHERE REGION_ID = REGION.REGION_ID    GROUP BY REGION_ID);

The preceding query will quite sensibly sum up all POPULATION and AREA values for each region from the COUNTRY table, and set those values into the same summary fields in the REGION table.

What Is a Transaction?

It is very important to understand what a transaction is with respect to a relational database. In a relational database, a transaction allows you to temporarily store changes. At a later point, you can choose to store the changes permanently using a COMMIT statement. Or you can also completely remove all changes you have made by using a ROLLBACK statement, as long as you have not already committed changes.

You can execute multiple database changes, such as the three INSERT statements shown here:

   INSERT INTO REGION (REGION_ID, REGION, POPULATION, AREA)    VALUES (111, 'Region 111', 100000, 100); INSERT INTO REGION (REGION_ID, REGION, POPULATION, AREA)    VALUES (112, 'Region 112', 200000, 200); COMMIT; INSERT INTO REGION (REGION_ID, REGION, POPULATION, AREA)    VALUES (113, 'Region 113', 300000, 300); ROLLBACK;

In the preceding example, the first two records will be permanently stored in the database by the COMMIT statement. The third record will be removed by the ROLLBACK statement.

Some general concepts of a relational database transaction are as follows:

A transaction is completed when a COMMIT or ROLLBACK statement is issued.
A COMMIT statement stores changes to a database.
A ROLLBACK statement removes pending changes from a database.
A pending change is a change not yet committed (a transaction not as yet terminated by a COMMIT or ROLLBACK statement).
A transaction can exist for one or more database change statements.
Before a change is committed, only the current session (database connection) can see any changes made within that current session. It follows that database changes made in other sessions (other database connections for other users) cannot be seen by the current session until those other sessions issue COMMIT statements.
The term transaction control describes the process of how SQL code can be used to control database changes made by one or more SQL statements.

Changing Database Objects

Database objects such as tables and indexes define how data is stored and accessed in a database. Tables and indexes are known as metadata in that they are the data about the data and not the actual records. So in the demographics database (see Appendix B), the COUNTRY table is metadata and the actual countries are the real data. The COUNTRY table describes the structure of how the countries are stored in the database.

Changing database objects is not really within the scope of this book but it may help to have a very basic picture of some database change statements.

The following CREATE TABLE statement creates a table called MOREREGIONS:

   CREATE TABLE MOREREGIONS(REGION_ID INTEGER PRIMARY KEY NOT NULL,    REGION VARCHAR2(32) UNIQUE NOT NULL,    POPULATION INTEGER NOT NULL,    AREA INTEGER NOT NULL);

The next ALTER TABLE statement is used to reset the POPULATION and AREA fields in the MOREREGIONS table, to nullable:

   ALTER TABLE MOREREGIONS MODIFY(POPULATION INTEGER, AREA INTEGER);

And because the MOREREGIONS table is actually useless for the purposes of the demographics database, the MOREREGIONS table can be dropped with this statement:

   DROP TABLE MOREREGIONS;

Also commonly used in a relational database are database objects called indexes. An index simply creates a physical copy of a small section in a table and creates it in a way that allows for fast searching through the index, such as using a binary tree. In general, an index could be created using a statement like this:

   CREATE INDEX XAK_REGION_POPULATION ON REGION (POPULATION);

Indexes can also be altered and dropped using the ALTER INDEX and DROP INDEX statements.

In general, most relational databases allow metadata change commands that are a lot more comprehensive than just creating, altering, and dropping simple tables and indexes. Any more on this topic is not necessary for the purposes of this book.

Some databases include database object change commands within transactions, and some databases do not. In other words, some databases allow a database object change such as a CREATE TABLE statement to be rolled back and some databases do not. Some databases even automatically commit when executing database object change statements.