Databases are basically giant containers for storing information. They are the electronic crawlspaces and digital attics of the corporate, academic, and governmental worlds. For example, anything that needs to be saved for later use by payroll, inventory management, or the external auditor is placed in a database.
Just like our crawlspaces and attics at home, the information that is placed in a database needs to be organized and classified. Figure 3-1 shows my attic in its current state. As you can see, it is going to be pretty hard to find those old kids’ clothes for the thrift store clothing drive! I know they are up there somewhere.
Figure 3-1: My attic, with no organization
Without some type of order placed on it, all the stuff in our home storage spaces becomes impossible to retrieve when we need it. The same is true in the world of electronic storage, as shown in Figure 3-2. Databases, like attics, need structure; otherwise, we won’t be able to find anything!
Figure 3-2: An unorganized database
The first step in getting organized is to have a place for everything and to have everything in its place. In order to achieve this, you need to add structure to the storage space, whether it is a space for box storage like my attic or a space for data storage like a database. In order to maintain this structure, you also need to have discipline of one sort or another as you add items to the storage space.
To get my attic organized, I need some shelves, a few labels, and some free time so that I can add the much-needed structure to this storage space. To keep my attic organized, I will also need the discipline to pay attention to my new signs each time I put another box into storage. Figure 3-3 shows my attic as it exists in my fantasy world where I have tons of free time and loads of self-discipline.
Figure 3-3: My attic in my fantasy world
Structure in the database world comes in the form of tables. Each database is divided up into a number of tables. It is these tables that actually store the information. Each table contains only one type of information. Figure 3-4 shows customer information in one table, payment information in another, and invoice header information in a third.
Figure 3-4: A database organized by tables
Invoice Header is used as the name of the third table for consistency with the sample database that will be introduced later in this chapter and used throughout the remainder of the book. The Invoice Header name helps to differentiate this table from the Invoice Detail table that stores the detail lines of the invoice. The Invoice Detail table is not discussed here, but it will be present in the sample database.
Dividing each table into rows and columns brings additional structure to the database. Figure 3-5 shows the Customer table divided into several rows—one row for each customer whose information is being stored in the table. In addition, the Customer table is divided into a number of columns. Each column is given a name: Customer Number, Customer Name, Address, and Phone. These names tell you what information is being stored in each column.
Figure 3-5: A database table organized by rows and columns
With a database structured as tables, rows, and columns, you know exactly where to find a certain piece of information. For example, it is pretty obvious that the customer name for customer number 135384 will be found in the Name column of the second row of the Customer table. We are starting to get this data organized, and it was a lot easier than cleaning out the attic!
Rows in a database are also called records. Columns in a database are also called fields. Reporting Services uses the terms rows and records interchangeably. It also uses the terms columns and fields interchangeably. Don’t be confused by this!
Columns also force some discipline on anyone putting data into the table. Each column has certain characteristics assigned to it. For instance, the Customer Number column in Figure 3-5 may only contain strings of digits (0–9), no letters (A–Z) allowed. It is also limited to a maximum of six characters. In data design lingo, these are known as constraints. Given these constraints, it is not possible to store a customer’s name in the Customer Number column. The customer’s name is likely too long and it contains characters that are not legal in the Customer Number column. Constraints provide the discipline to force organization within a database.
Typically, when you design a database, you create tables for each of the things you want to keep track of. In Figure 3-4, the database designer knew that her company needed to track information for customers, payments, and invoices. Database designers call these things entities. The database designer created tables for the customer, payment, and invoice header entities. These tables are named Customer, Payment, and Invoice Header.
Once the entities have been identified, the database designer determines what information needs to be known about each entity. In Figure 3-5, the designer identified the customer number, customer name, address, and phone number as the things that need to be known for each customer. These are attributes of the customer entity. The database designer creates a column in the Customer table for each of these attributes.
As entities and attributes are being defined, the database designer needs to identify a special attribute for each entity in the database. This special attribute is known as the primary key. The purpose of the primary key is to be able to uniquely identify a single entity, or in the case of a database table, a single row in the table.
There are two simple rules for primary keys. First, every entity must have a primary key value. Second, no two rows in an entity can have the same primary key value. In Figure 3-5, the Customer Number column can serve as the primary key. Every customer is assigned a customer number and no two customers can be assigned the same customer number.
For most entities, the primary key is a single attribute. However, in some cases, two attributes must be combined to create a unique primary key. This is known as a composite primary key. For instance, if you were defining an entity based on Presidents of the United States, first name would not be a valid primary key. John Adams, John Quincy Adams, and John Kennedy all have the same first name. You would need to create a composite key combining first name, middle name, and last name to have a valid primary key.
As the database designer continues to work on identifying entities and attributes, she will notice that two different entities have some of the same attributes. For example, in Figure 3-6 both the customer entity and the invoice header entity have attributes of Customer Name and Address. This duplication of information seems rather wasteful. Not only is the customer’s name and address duplicated between the Customer and Invoice Header tables, but they are also duplicated in several rows in the Invoice Header table itself.
Figure 3-6: Database tables with duplicate data
The duplicate data also leads to another problem. Suppose that Rosenblinker, Inc. changes its name to RB, Inc. Then, Ann in the data processing department changes the name in the Customer table because this is where we store information about the customer entity. However, the customer name has not been changed in the Invoice Header table. Because the customer name in the Invoice Header table no longer matches the customer name in the Customer table, it is no longer possible to determine how many invoices are outstanding for RB, Inc. Believe me, the accounting department will think that this is a bad situation.
In order to avoid these types of problems, database tables are normalized. Normalization is a set of rules for defining database tables so that each table contains attributes from only one entity. The rules for creating normalized database tables can be quite complex. You can hear database designers endlessly debating whether a proper database should be in third normal form, fourth normal form, or one hundred and twenty-seventh normal form. Let the database designers debate all they want. All you really need to remember is this: A normalized database avoids data duplication.
A relation is a tool that the database designer uses to avoid data duplication when creating a normalized database. It is simply a way to put the duplicated data in one place and then point to it from all the other places in the database where it would otherwise occur. The table that contains the data is called the parent table. The table that contains a pointer to the data in the parent table is called the child table. Just like parents and children of the human variety, the parent table and the child table are said to be related.
In our example, the customer name and address are stored in the Customer table. This is the parent table. A pointer is placed in the Invoice Header table in place of the duplicate customer names and addresses it had contained. The Invoice Header table is the child table.
As was mentioned previously, each customer is uniquely identified by their customer number. Therefore, the Customer Number column serves as the primary key for the Customer table. In the Invoice Header table, we need a way to point to a particular customer. It makes sense to use the primary key in the parent table, in this case the Customer Number column, as that pointer. This is illustrated in Figure 3-7.
Figure 3-7: A database relation
Each row in the Invoice Header table now contains a copy of the primary key of a row in the Customer table. The Customer Number column in the Invoice Header table is called the foreign key. It is called a foreign key because it is not one of the native attributes of the invoice header entity. The customer number is a native attribute of the customer entity. The only reason the Customer Number column exists in the Invoice Header table is to create the relationship.
Let’s look back at the name change problem, this time using our new database structure that includes the parent-child relationship. When Rosenblinker, Inc. changes its name to RB, Inc., Ann changes the name in the Customer table as before. In our new structure, however, the customer name is not stored in any other location. Instead, the Invoice Header table rows for RB, Inc. point back to the Customer table row that has the correct name. The accounting department stays happy because it can still figure out how many invoices are still outstanding for RB, Inc.
Database relations can be classified by the number of records that can exist on each side of the relationship. This is known as the cardinality of the relation. For example, the relation in Figure 3-7 is a one-to-many relation. In other words, one parent record can have many children. More specifically, one customer can have many invoices.
It is also possible to have a one-to-one relation. In this case, one parent record can have only one child. For example, let’s say that our company rewards customers with a customer loyalty discount. Because only a few customers will receive this loyalty discount, we do not want to set aside space in every row in the Customer table to store the loyalty discount information. Instead, we create a new table to store this information. The new table is related to the Customer table, as shown in Figure 3-8. Our company’s business rule says that a given customer can only receive one loyalty discount. Because the Loyalty Discount table has only one Customer Number column, each row can link to just one customer. The combination of the business rule and the table design make this a one-to-one relation.
Figure 3-8: A one-to-one relation
It is also possible to have a many-to-many relation. This relation no longer fits our parent/child analogy. It is better thought of as a brother/sister relationship. One brother can have many sisters, and one sister can have many brothers.
Suppose that we need to keep track of the type of business that is engaged in by each of our customers. We can add a Business Type table to our database with columns for the business type code and the business type description. We can add a column for the business type code to the Customer table. We now have a one-to-many relation, where one business type can be related to many customers. This is shown in Figure 3-9.
Figure 3-9: Tracking business type using a one-to-many relation
The problem with this structure becomes apparent when we have a customer that does multiple things. If Landmark, Inc. only produces paper products, there isn’t a problem. We can put the business type code for “Paper Products” in the Customer table row for Landmark, Inc. We run into a bit of a snag, however, if Landmark, Inc. also produces plastics. We could add a second business type code column to the Customer table, but this still limits a customer to a maximum of two business types. In today’s world of national conglomerates, this is not going to work.
The answer is to add a third table to the mix to create a many-to-many relationship. This additional table is known as a linking table. Its only purpose is to link two other tables together in a many-to-many relation. To use a linking table, you create the Business Type table just as before. This time, instead of creating a new column in the Customer table, we’ll create a new table called Customer To Business Type Link. The new table has columns for the customer number and the business type code. Figure 3-10 shows how this linking table relates the Customer table to the Business Type table. By using the linking table, we can relate one customer to many business types. In addition, we can relate one business type to many customers.
Figure 3-10: Tracking the business type using a many-to-many relation
We now have all the tools we need to store our data in an efficient manner. With our data structure set, it is time to figure out how we can access that data to use it in our reports. Data that was split up into multiple tables must be recombined for reporting. This is done using a database tool called a join. In most cases, we will also want the data in the report to appear in a certain order. This is accomplished using a sort.
Suppose we need to know the name and address of the customer associated with each invoice. This is certainly a reasonable desire, especially if we want to send invoices to these clients and actually have those invoices paid. Checking the Invoice Header table, you will see that it contains the customer number, but not the name and address. The name and address is stored in the Customer table.
To print our invoices, we will need to join the data in the Customer table with the data in the Invoice Header table. This join is done by matching the customer number in each record of the Invoice Header table with the customer number in the Customer table. In the language of database designers, we are joining the Customer table to the Invoice Header table on the Customer Number column.
The result of the join is a new table that contains information from both the Customer table and the Invoice Header table in each row. This new table is known as a result set. The result set from the Customer table–to–Invoice Header table join is shown in Figure 3-11. You will note that the result set table contains nearly the same information that was in the Invoice Header table before it was normalized. The result set is a denormalized form of the data in the database.
Figure 3-11: The result set from the Customer table–to–Invoice Header table join
It may seem like we are going in circles, first normalizing the data and then denormalizing it. There is, however, one very important difference between the denormalized form of the Invoice Header table that we started with in Figure 3-6 and the result set in Figure 3-11. The denormalized result set is a temporary table.
It exists only as long as it is needed; then it is automatically deleted. The result set is re-created each time we execute the join; therefore, the result set is always up-to-date.
Let’s go back one more time to Ann, our faithful employee in Data Processing. We will again consider the situation where Rosenblinker, Inc. changes its name to RB, Inc. Ann makes the change in the Customer table, as in the previous example. The next time we execute the join, this change is reflected in the result set. The result set has the new company name because our join gets a new copy of the customer information from the Customer table each time it is executed. The join finds the information in the Customer table based on the primary key, the customer number, which has not changed. Our invoices are linked to the proper companies, so Accounting can determine how many invoices are outstanding for RB, Inc., and everyone is happy!
In the previous section, we looked at a type of join known as an inner join. When you do an inner join, your result set includes only those records that have a representative on both sides of the join. In Figure 3-11, Landmark, Inc. and Phillips Mfg., Inc. are not represented in the result set, because they do not have any Invoice Header table rows linked to them.
Figure 3-12 shows another way to think about joins. Here, the two tables are shown as sets of customer numbers. The left-hand circle represents the set of customer numbers in the Customer table. It contains one occurrence of each and every customer number that is present in the Customer table. The right-hand circle represents the set of customer numbers in the Invoice Header table. It contains one occurrence of each and every customer number that is present in the Invoice Header table. The center region, where the two sets intersect, contains one occurrence of each and every customer number that is present in both the Customer table and the Invoice Header table. Looking at Figure 3-12, you can quickly tell that there are no customer numbers that are present in the Invoice Header table, but not in the Customer table. This is as it should be. We should not have any invoice headers that are assigned to a customer that does not exist in the Customer table.
Figure 3-12: The set representation of the Customer and Invoice Header tables
Figure 3-13 shows a graphical representation of the inner join in Figure 3-11. Only records with customer numbers that appear in the shaded section will be included in the result set. Remember that there are two rows in the Invoice Header table that contain customer number 135384. For this reason, the result set contains three rows—two rows for customer number 135384 and one row for customer number 647382.
Figure 3-13: The set representation of the inner join of the Customer table and the Invoice Header table
The result set in Figure 3-11 allows us to print invoice headers that contain the correct customer name and address. Now let’s look at customers and invoice headers from a slightly different angle. Suppose we have been asked for a report showing all customers and the invoice headers that have been sent to them. If we were to print this customers/invoice headers report from the result set in Figure 3-11, it would exclude Landmark, Inc. and Phillips Mfg., Inc. because they do not have any invoices and, therefore, would not fulfill the requirements.
What we need is a result set that includes all the customers in the Customer table. This is illustrated graphically in Figure 3-14. This type of join is known as a left outer join. The name comes from the fact that this join is not limited to the values in the intersection of both circles. It also includes the values that are to the left of the inner, overlapping sections of the circles.
Figure 3-14: The set representation of the left outer join of the Customer table and the Invoice Header table
We can also perform a right outer join on two tables. In our example, a right outer join would return the same number of rows as the inner join. This is due to the fact that there are no customer numbers to the right of the intersection.
The result set produced by a left outer join of the Customer table and the Invoice Header table is shown in Figure 3-15. Notice the fact that the columns populated by data from the Invoice Header table are empty in rows for Landmark, Inc. and Phillips Mfg., Inc. The columns are empty because these two customers do not have any Invoice Header rows to provide data on the right side of the join.
Figure 3-15: The result set from the left outer join of the Customer table and the Invoice Header table
Joins, whether inner or outer, always involve two tables. However, in Figure 3-10, you were introduced to a many-to-many relation that involved three tables. How do you retrieve data from this type of relation? The answer is to chain together two different joins, each involving two tables.
Figure 3-16 illustrates the joins that are required to reassemble the data from Figure 3-10. Here, the Customer table is joined to the Customer To Business Type Link table using the Customer Number column common to both tables. The Customer To Business Type Link table is then joined to the Business Type table using the Business Type Code column that is present in both tables. The final result set contains the data from all three tables.
Figure 3-16: The result set from the join of the Customer table, the Customer To Business Type Link table, and the Business Type table
In our previous example, we needed to join three tables to get the required information. Other joins may only require a single table. For instance, we may have a customer that is a subsidiary of another one of our customers. In some cases, we’ll want to treat these two separately, so both appear in our result set. This requires us to keep the two customers as separate rows in our Customer table. In other cases, we may want to combine information from the parent company and the subsidiary into one record. To do this, our database structure must include a mechanism to tie the subsidiary to its parent.
To track a customer’s connection to its parent, we need to create a relationship between the customer’s row in the Customer table and its parent’s row in the Customer table. To do this, we add a Parent Customer Number column to the Customer table, as shown in Figure 3-17. In the customer’s row, the Parent Customer Number column will contain the customer number of the row for the parent. In the row for the parent, and in all the rows for customers that do not have a parent, the Parent Customer Number column is empty.
Figure 3-17: The Customer/Parent Customer relation
When we want to report from this parent/subsidiary relation, we need to do a join. This may seem like a problem at first because a join requires two tables, and we only have one. The answer is to use the Customer table on one side of the join and a “copy” of the Customer table on the other side of the join. The second occurrence of the Customer table is given a nickname, called an alias, so we can tell the two apart. This type of join, which uses the same table on both sides, is known as a self-join. Figure 3-18 shows the results of the self-join on the Customer table.
Figure 3-18: The result set from the Customer table self-join
In most cases, one final step is required before our result sets can be used for reporting. Let’s go back to the result set produced in Figure 3-15 for the customers/invoice headers report. Looking back at this result set, you will notice that the customers do not appear to be in any particular order. In most cases, users do not appreciate reports with information presented in this unsorted manner. This is especially true when two rows for the same customer do not appear consecutively, as is the case here.
We need to sort the result set as it is being created in order to avoid this situation. This is done by specifying the columns that should be used for the sort. Sorting by Customer Name probably makes the most sense for the customers/invoice headers report. Columns can be sorted either in ascending order, smallest to largest (A–Z), or descending order, largest to smallest (Z–A). An ascending sort on Customer Name would be most appropriate.
We still have a situation where the order of the rows is left to chance. Because there are two rows with the same customer name, we do not know which of these two rows will appear first and which will appear second. A second sort field is necessary to break this “tie.” All the data copied into the result set from the Customer table will be the same in both of these rows. We need to look at the data copied from the Invoice Header table for a second sort column. In this case, an ascending sort on Invoice Number would be a good choice. Figure 3-19 shows the result set sorted by Customer Name, ascending, and then Invoice Number, ascending.
Figure 3-19: The sorted result set from the left outer join of the Customer table and the Invoice Header table