Semantics | Beginning Transact-SQL With SQL Server 2000 and 2005

The words used to describe data concepts are often different, depending a great deal upon the context of the discussion. Data lives in tables. Usually, a table represents some kind of business entity, such as a Product or Customer, for example. Each item in a table is called a row or record. For our purposes, these mean the same thing. I may use these words interchangeably throughout the book. Envision several rows in an Excel worksheet representing different products. Each product has a manufacturer, supplier, packaging quantity, and price. In Excel, these values would be contained in different cells. In a table, separate values are referred to as a column or field. As far as we're concerned, these words have the same meaning as well. How do you decide how data should be organized into tables and columns? That is the fine art of database design and is often no easy task. To arrive at an optimal database design, you must first have a thorough understanding of the business process and the how data will be used.

So, what is data, really? We often hear the words information and data used to mean the same thing. In reality, they are very different concepts. We, as humans, generally concern ourselves with meaningful information we can use day-to-day. Information has a context — it makes sense to us. If my wife were to give me a call and ask that I stop by the store on the way home from work and pick up eggs and milk, I should have enough information to accomplish this simple task. I have a few informational items to contend with in this scenario: the store, eggs, and milk. If we were to ask some people in the database business about these simple things, we might get some interesting (or not so interesting) answers. For example, my friend Greg, a city geographic information systems (GIS) expert employed by the city government, might point out that in his database, the store is a building with an address, property plot number, city zoning record, water, sewer, and electrical service locations. It has latitude and longitude coordinates, a business license, and tax record. If we were to talk to someone in the grocery business, they might tell us that eggs and milk exist in a products table in their point of sale and inventory management database systems. Each is assigned a product record ID and UPC codes. The product supplier, vendors, shipping companies, and the dairies likely have their own systems and deal with these items in different ways. However, as a consumer, I'm not concerned with such things. I just need to stop by the store and pick up the eggs and milk.

Here's the bottom line: data is just numbers and letters in a database or computer application somewhere. At some point, all of that cryptic data was probably useful information until it was entered into the database. For the database designer or programmer, these values may be meaningful. For the rest of us, it isn't useful at all until it gets translated back into something we understand — information.

Changing Terminology

One of the greatest challenges in our relatively new world of technology is how we use common language to communicate both technical and non-technical concepts. Even when dealing with the same system, terminology often changes as you progress through the different stages of the solution design and construction. These stages are generally as follows:

Conceptual or Architectural Design
Logical Design
Physical Design

Conceptual Design

As you approach the subject of automating business processes through the use of databases and software, one of the first and most important tasks is to gather business requirements from users and other business stakeholders. Beginning with non-technical, business, and user-centric language, you must find terms to describe each unit of pertinent information to be stored and processed. A complete unit of information is known as an entity. Business entities generally represent a whole unit that can stand on its own. For example, a customer and a product are examples of entities. Another conceptual unit of information is an attribute. This unit of information describes a characteristic of an entity. An attribute may be something as simple as the color or price of a product. It could also be something more complex such as the dimensions of a package. The important thing during conceptual design is to deal with the simple and conceptual aspects and not all of the implementation details. This way you leave your options open to consider creative ways to model and manage the data according to your business requirements.

In most processes, different terms may be used to describe the same or similar concepts. For example, in an order processing environment, the terms customer, shopper, and purchaser could mean the same thing. Under closer evaluation, perhaps a shopper is a person who looks for products and a customer is a person who actually purchases a product. In this case, a shopper may become a customer at some point in the process. In some cases, a customer may not actually be a person. A customer could also be an organization. It's important to understand the distinction between each entity and find agreeable terms to be used by anyone dealing with the process, especially non-technical users and business stakeholders. Conceptual design is very free-form and often takes a few iterations to reveal all of the hidden requirements.

Along with the entity and attribute concepts, another important notion is that of an instance. You may have 100,000 customers on record, but as far as your database system is concerned, these customers don't really exist until you need to deal with their information. Sure, these people do exist out in customer land, but your unfeeling database system couldn't care less about customers who are not currently engaged in buying products, spending money, or updating their billing information. Your system was designed to process orders and purchase products — that's it. If a customer isn't involved in ordering, purchasing, or paying, the system pays no attention. When a customer places an order, you start caring about this information and your order processing system needs to do something with the customer information. At this point, your system reaches into the repository of would-be customers and activates an instance of a specific customer. The customer becomes alive just long enough for the system to do something useful with it and then put it back into cold storage, moving on to the next task.

Logical Design

This stage of design is the transition between the abstract, non-specific world of conceptual design and the very specific, technical world of physical design. After gaining a thorough understanding of business requirements in the language of users, this is an opportunity to model the data and the information flow through the system processes. With respect to data, you should be able to use the terms entity, attribute, and instance to describe every unit of data. Contrasted with conceptual design, logical design is more formalized and makes use of diagramming models to confirm assumptions made in conceptual design. Prototyping is also part of the logical design effort. A quick mock-up database can be used to demonstrate design ideas and test business cases. It's important, though, that prototypes aren't allowed to evolve into the production design. As Fredrick P. Brooks said in his book, The Mythical Man Month, "Plan to throw one away. You will do that, anyway. Your only choice is whether to try to sell the throwaway to customers." When you finally happen upon a working model, throw it out and start fresh. This gives you the opportunity to design a functional solution without the baggage of evolutionary design. In logical design, you decide what you're going to build and for what purpose.

In particular, logical database design involves the definition of all the data entities and their attributes. For example, you know that a customer entity should have a name, a shipping location, and a line of credit. Although you realize that the customer's name may consist of a first name, middle initial, and last name, this is unimportant in this stage of design. Likewise, the customer's location may consist of a street address, city, state, and zip code; you also leave these details for the physical design stage. The point during this stage is to understand the need and recognize how this entity will behave with other data entities and their attributes.

Physical Design

One of the greatest reasons to have a formal design process is to find all of the system requirements before attempting to build the solution. Requirements are like water. They're easier to build on when they're frozen. An attempt to define requirements as you go along will inevitably lead to disastrous results. Ask any seasoned software professional. I guarantee their response will be preceded with either a tear or a smile.

Physical design is like drawing the blueprints for a building. It's not a sketch or a rough model. It is the specification for the real project in explicit detail. As your design efforts turn to the physical database implementation, entities may turn into tables and attributes into columns. However, there is not always a one-to-one correspondence between conceptual entities and physical tables. The value of appropriate design is to find similarities and reduce redundant effort. You will likely discover the need for more detail than originally envisioned.

In a recent project, I needed to design a database system to manage a youth activity. The requirements specified both youth and adult entities. Due to the similarities between these entities, I created a single table of members with a flag to indicate the member type as either an adult member or youth member.

Relationships

Although I briefly discussed entity relationships in Chapter 1, I want to devote a little more time expounding on the concepts to add clarity to the current topic of design. The purpose of nearly all database systems is to model elements in our physical world. To do this effectively, you need to consider the associations that exist between the various entities you want to keep track of. This concept of an item or multiple items being related to a different item or multiple items is known as cardinality or multiplicity. To illustrate this concept, just look around you. Nearly everything fits into some kind of collection or set of like objects. The leaves on a tree, the passengers in a car, and the change in your pocket are all examples of this simple principle. These are sets of similar objects in a collection and associated with some kind of container or attached to some type of parent entity. Relationships can be described and discovered using common language. As you describe associations, listen for words such as is, have, and has. For example a customer has orders. Now turn it around: an order has a customer. By looking at the equation from both sides, you've discovered a one-to-many relationship between customers and orders.

Relationships generally can be grouped into three different types of cardinality:

One-to-one
One-to-many
Many-to-many

The one-to-one and one-to-many relationships are fairly easy to define using a combination of foreign key and unique constraints, but many-to-many relationships cannot actually be defined using two tables. To reduce redundancy, minimize data storage requirements, and facilitate these relationships, you apply standard rules of normalization (the rules of normal form), which are described briefly in this section.

Primary Keys

According to the first rule of normal form (1NF), which says that each column contains a single type of information, a single value, and there are no repeating groups of data, it is imperative that each row (or record) be stamped with a unique key value. This key could either be a meaningful value that is useful for other reasons, or a surrogate key, a value generated only for the sake of uniqueness. The uniqueness of a record depends entirely on the primary key. Be very cautious and think twice (or three times) before choosing to use non-surrogate key values. I've designed more than a few database systems where it seemed to make sense to use an intelligent value for the primary key (for example, social security number, address, phone number, product code, and so on) and later wished I had just generated a unique value for the key. Most experienced database folks have horror stories to share about such experiences.

I'll briefly share an experience of my own. A few years ago, I was asked to design a database solution for a large fire department to manage the wellness and immunization records of their employees. They had some existing data and used social security numbers to identify each person in their personnel table. Trying to avoid problems and accommodate future requirements, I asked the project sponsor if every one of their employees would always have an SSN on file. She said that this was absolute — every employee would always have an SSN and that this could always be used as an identifier for an employee. I made the SSN the primary key of the Person table and constructed an entire application around it. A year later the client called me on the phone with a problem. She explained that they had been contracted by the volunteer fire department in a small town to manage their health wellness records and that when she entered new volunteer firefighters the system was throwing an error (something about a primary key violation). I asked about social security numbers and she told me that these were unavailable for volunteer personnel. As I began to remind her of our earlier conversation, she interrupted me and repeated our exchange word-for-word: "You asked me if all of our employees had social security numbers. These aren't our employees." I had not asked if they would be managing personnel records other than their own employees. Lesson learned: Use surrogate keys or have a very good reason not to.

Two common forms of surrogate key values exist. An identity key type is simply an integer value that is automatically incremented by the database system. This will serve as a unique value as long as all data is entered into a single instance of the database. In distributed systems consisting of multiple, disconnected databases, it can be a bit challenging to keep these values unique. The other type of automatically generated key uses a special data type called a unique identifier or globally unique identifier (GUID). This SQL data type is equipped to store a very large numeric value automatically generated by the system. A complex algorithm is used to produce a value, partially random and partially predictable. The result is what I call a big ugly number, guaranteed to be unique — any time and anywhere. The chances of this value being duplicated are astronomically improbable.

Foreign Keys

One purpose for keys is to relate the records in one table to those in another table. A column in the table containing related records is designated as a foreign key. This means that it contains the same values found in the primary key column(s) of the primary table. Unlike a primary key, a foreign key doesn't have to be unique. Using the Customer/Order example, one customer can have multiple orders but one order only has one customer. This describes a one-to-many relationship. The primary key column of the Customer table is related to the foreign key column of the Order table through a relationship known as a foreign key constraint. Later, in Chapter 6, you see how this relationship is defined in Transact-SQL.