Chapter 10: Databases | Programming Interviews Exposed: Secrets to Landing Your Next Job, 2nd Edition (Programmer to Programmer)

With the rise of Web-based applications, more and more programmers find themselves using databases for data storage and manipulation, so don’t be surprised if you are asked questions about your experience with databases. Although different kinds of databases are available, the relational database is by far the most common type, so that’s what this chapter covers.

Database Fundamentals

There are tools available to help you create and manage databases, many of which hide the complexities of the underlying data structures. Ruby on Rails, for example, abstracts all database access and makes most direct access unnecessary, as do component technologies such as Enterprise JavaBeans and many object-oriented frameworks. Still, you need an understanding of how relational databases work to make good design decisions.

Relational Databases

Relational databases originated in the 1960s from the work of E. F. Codd, a computer scientist who designed a database system based on the concepts of relational algebra. However, you don’t need to understand relational algebra or other mathematical concepts to use a relational database.

Data in a relational database is stored in tables, which consist of rows and columns. (A set of tables is referred to as a schema.) Each table has at least one column, but there may be no rows. Each column has a type associated with it, which limits the type of data that can be stored in the column, as well as additional constraints. Although the columns are ordered, the rows aren’t. Any ordering that is required is done when the data is fetched (via a query) from the database.

Most tables have keys, although it’s not a requirement (but it is good design). A key is a column or set of columns that uniquely identifies a particular row in the table. One of the keys is designated to be the primary key. For example, in a table of employees, you would use the employee identification number - guaranteed to be unique for each employee - as the primary key.

A table can be linked to another table using a foreign key. A foreign key is usually a primary key value taken from the other table. Foreign keys ensure that data isn’t deleted prematurely: You can’t delete a row from a table if the foreign key of another table references the row. This is known as referential integrity, and it ensures that related tables are always in a consistent state with respect to each other.

There are different ways to manipulate databases, but the most common way is through the use of a structured query language.

Structured Query Language (SQL)

SQL is the lingua franca of relational database manipulation. It provides mechanisms for most kinds of database manipulations. Understandably, SQL is a big topic, and numerous books are devoted just to SQL and relational databases. Nevertheless, the basic tasks of storing and retrieving data are fairly simple with SQL. Let’s look at some of the highlights of SQL.

Most interview database problems involve writing queries for a database with a given schema, so you won’t usually need to design a schema yourself. This introduction works with the following schema:

 Player (   name   CHAR(20),   number INT(4) ) Stats (   number      INT(4),   totalPoints INT(4),   year        CHAR(20) )

Some sample data for Player is shown in Table 10-1, and a sample Stats table is shown in Table 10-2.

Table 10-1: Player Sample Data
Open table as spreadsheet
name	number
Larry Smith	23
David Gonzalez	12
George Rogers	7
Mike Lee	14
Rajiv Williams	55

Table 10-2: Stats Sample Data
Open table as spreadsheet
number	totalPoints	year
7	59	Freshman
55	90	Senior
23	15	Senior
86	221	Junior
36	84	Sophomore

The first thing to notice about the schema is that neither table has a primary key defined, although the number column in both tables is a natural key because the player number uniquely identifies each player. In fact, the number column in the Stats table is really a foreign key - a reference to the number column in the Player table. You might suggest to the interviewer that the schema could be improved with the following changes:

 Player (   name   CHAR(20),   number INT(4) PRIMARY KEY ) Stats (   number      INT(4) PRIMARY KEY,   totalPoints INT(4),   year        CHAR(20),   FOREIGN KEY number REFERENCES Player )

With these changes, the database takes an active role in ensuring the correctness of the data. For example, you can’t add a row to the Stats table that references a player not listed in the Player table: The foreign key relationship between Stats.number and Player.number forbids this. At this point you could have a very detailed discussion with the interviewer about the advantages and disadvantages of primary keys and foreign keys in database design.

One fundamental SQL statement is INSERT, which is used to add values to a table. For example, to insert a player named Bill Henry with the number 50 into the Player table, you would use the following statement:

 INSERT INTO Player VALUES('Bill Henry', 50)

SELECT is the SQL statement most commonly seen in interviews. A SELECT statement retrieves data from a table. For example, the statement

 SELECT * FROM Player

will return all of the values in the table Player:

 +----------------+--------+ | name           | number | +----------------+--------+ | Larry Smith    |     23 | | David Gonzalez |     12 | | George Rogers  |      7 | | Mike Lee       |     14 | | Rajiv Williams |     55 | | Bill Henry     |     50 | +----------------+--------+

You can specify which columns you want to return like this:

 SELECT name FROM Player

The preceding code returns the following:

 +----------------+ | name           | +----------------+ | Larry Smith    | | David Gonzalez | | George Rogers  | | Mike Lee       | | Rajiv Williams | | Bill Henry     | +----------------+

You may want to be more restrictive about which values you return. For example, if you want to return only the names of the players with numbers less than 10 or greater than 40, you would use the following statement:

 SELECT name FROM Player WHERE number < 10 OR number > 40

That would return the following:

 +----------------+ | name           | +----------------+ | George Rogers  | | Rajiv Williams | | Bill Henry     | +----------------+

Often, you will want to use data from two tables. For example, you may want to print out the names of all players along with the number of points that each player has scored. To do this, you will have to join the two tables on the number field. The number field is called a common key because it represents the same unique value in both tables. The query is as follows:

 SELECT name, totalPoints FROM Player, Stats WHERE Player.number = Stats.number

It returns this:

 +----------------+-------------+ | name           | totalPoints | +----------------+-------------+ | George Rogers  |          59 | | Rajiv Williams |          90 | +----------------+-------------+

The aggregates, MAX, MIN, SUM, and AVG, are another commonly used SQL feature. These aggregates enable you to retrieve the maximum, minimum, sum, and average, respectively, for a particular column. For example, you may want print out the average number of points each player has scored. To do this, you would use the following query:

 SELECT AVG(totalPoints) FROM Stats

It yields this:

 +------------------+ | AVG(totalPoints) | +------------------+ |          93.8000 | +------------------+

Other times, you may want to use the aggregates over a subset of the data. For example, you may want to print out the year along with the average number of points that each year’s players have scored. You will need to use the GROUP BY clause to do this, as in the following query:

 SELECT year, AVG(totalPoints) FROM Stats GROUP BY year

It gives this result:

 +-----------+------------------+ | year      | AVG(totalPoints) | +-----------+------------------+ | Freshman  |          59.0000 | | Junior    |         221.0000 | | Senior    |          52.5000 | | Sophomore |          84.0000 | +-----------+------------------+

Most interview problems focus on using these sorts of insert and select statements. You’re less likely to encounter SQL problems related to other features, such as UPDATE statements, DELETE statements, permissions, security, or optimization. One thing you should definitely understand is transactions.

Database Transactions

The integrity of the data stored in a database is paramount: If the data is ever corrupted, every application that depends on the database may fail or be in error. While referential integrity helps keep the data consistent, the best way to ensure data integrity is to use a database transaction.

A transaction groups a set of related database manipulations together into a single unit. If any operation within the transaction fails, the entire transaction fails and any changes made by the transaction are abandoned (rolled back). Conversely, if all the operations succeed, then all the changes are committed together as a group.

Chapter 8 included a simple example involving the addition and removal of money from a bank account. If you expand the example to involve the transfer of money between two accounts, you’ll see why transactions are so important. A transfer is really two operations: removing money from the first account and then adding it to the second account. If an error occurs immediately after the money is removed from the first account, you want the system to detect the problem and redeposit the withdrawn money into the original account. These are the kinds of problems that transactions solve.

The four properties of a transaction are as follows:

Atomicity - The database system guarantees that either all operations with the transaction succeed or else they all fail.
Consistency - The transaction must ensure that the database is in a correct, consistent state at the start and the end of the transaction. No referential integrity constraints can be broken, for example.
Isolation - All changes to the database within a transaction are isolated from all other queries and transactions until the transaction is committed.
Durability - Once committed, changes made in a transaction are permanent. The database system must have some way to recover from crashes and other problems so that the current state of the database is never lost.

These four properties are generally referred to as ACID. As you might imagine, there is a significant penalty performance to be paid if all four properties are to be guaranteed on each transaction. The isolation requirement can be particularly onerous on a system with many simultaneous transactions, so most systems allow the isolation requirements to be relaxed in different ways in order to provide improved performance.

Note that ACID compliance is not a relational database requirement, but most modern databases support it.