Data Modeling and Logical Data Design | MCAD/MCSE/MCDBA 70-229 Exam Cram 2: Designing & Implementing Databases w/SQL Server 2000 Enterprise Edition

Most individuals in the technology arena understand the physical components of a database system. It is easy to recognize data in a columnar format from spreadsheets, tables, data files, and other common data storage techniques. In a structured data storage system, however, much planning goes into the makeup of the storage system before any physical forms are taken. Long before a combination of files containing records and fields along with tables containing rows and columns form a database, the data content is analyzed and a concept of the data, or logical structure is formed. This is an involved process that takes much practice before a database professional can become competent. Logical modeling is an integral part of database implementation. For this reason, logical data modeling encompasses many of the objectives of the exam.

It is difficult to discuss the data modeling exam objectives without looking at some of the physical elements in the database itself. This chapter mentions and defines some of the physical components of a database system. However, in this chapter our focus will be the logical design of the database. In most business problems the physical database itself would not be considered until a logical model had been drawn. For this reason, the physical elements mentioned in this chapter are discussed in full in Chapter 3, "Implementing the Physical Database," so we can concentrate here on drawing logical data models.

Logical data modeling in practice involves identifying important elements of data and recognizing how these elements interact. It is also important to be knowledgeable about the business process being analyzed. This knowledge will aid in determining data flows and processing required to pull meaningful information out of reams of raw data.

For those heading down the path to becoming a database administrator, data modeling plays an important role in the MCDBA 70-229 Exam, but for those attempting the MCSD certification, similar information may be tested in the MCSD Exam 70-300 .NET Solutions Architecture Exam.

In designing a database, to meet a given business need, the logical data model most used is the Entity Relationship Model, or ER Model. In determining the ER model, you define the elements of data used in the business process and how they relate to each other. In looking at any logical implementation, you must understand the nature and use of the data. Knowing how the data is used helps you understand the relationships between the separate data elements.

Source documents, reports, and other samplings from an existing business process, together with information gathered through interviews, will assist in collecting data examples. With this information you can begin designing a data model in accordance with the current business situation. It is important that in gathering information, the data samples collected and drawn up be as complete as possible. Accompanying these data samples, full descriptions of all procedures that interact with the data would also be used as an aid to the development of a model.

Data Modeling

To develop a data model, various application architectures can be considered. During logical modeling, variations in implementations do not need to be known. In meeting a specific business need, the data needs to be modeled in a pristine fashion. If you have a tendency toward one implementation or another, you could hamper this design goal. It is best to design the model as a generic data model before taking that model into future phases. In subsequent stages of development, the database will take on its physical attributes and will be adjusted as applications take shape; yet these stages should not be stepped into prematurely. Focus first on the raw information as a base for the logical model.

Relational database design modeling was first developed by the database engineer Charles Bachman, in 1960, and then in 1976, database design modeling became the ER Model by Peter Chen. An ER Model allows a database to be defined in a simple and organized manner. Other modeling techniques have come and gone over the years, but the ER Model is the preferred technique used by most experienced database developers.

Modeling with the ER Data Model

As mentioned previously, the ER data model is one of the most popular data models in use. An ER data model consists of three main components that are present in any model format: entities, attributes, and relationships.

An entity is a discernible thing about which data is kept. In data modeling, entities can be closely compared to the physical element of a table in a database. An entity is a collection of related data elements or attributes and usually represents a major facet of the business problem. Supplier, product, employee, and order are all good examples of entities.
Attributes are the characteristics given to an entity, such as ProductNumber and FirstName. Attributes of an entity represent a set of properties, each property being a data element within an entity. They will become the columns or fields in the physical database.
Relationships show a logical link between two or more entities. Where two entities have a direct affiliation, a relationship is used to define the connection between the entities. A supplier entity may be related to orders; a product entity may be related to purchases. Relationships will normally establish constraints within the physical database.

Entities represent the primary elements of the ER Model. Each entity will represent a person, place, thing, or concept involved in a business process. It is usually easiest to determine the entities of a process by isolating the important players. When a customer places an order containing one or more products, the process utilizes at least three major entities: customer, order, and product.

An ER Model is drawn out using an Entity Relationship Diagram (ERD). A rough diagram can be constructed using pencil and paper and a few simple shapes. For a more polished look a drawing tool is recommended, such as Microsoft Paint, or better yet a tool specifically used for the modeling process, such as Microsoft Visio. The actual tool used for drawing the diagram can vary, and you will find that different developers have their own preferences and reasons for preferring one tool over another. Most of the diagrams throughout this book have been developed using the two previously mentioned programs, along with SQL Server's own built-in tools. To illustrate the concepts of the ER modeling process, we will be using a fictional company, Northwind Traders. This company will be familiar to many Microsoft Access users and it is also supplied with SQL Server as a sample database.

Entity Selection

To begin the modeling process, you need to first isolate the entities required for the business process. A standard documentation technique is to draw entities as rectangular boxes with enough space to hold many attributes. An entity is the basic division of a database. In the logical design, entities are representative of the tables that will be present when the database development process moves into the physical design phase. Each entity exists as a separate individual data collection, unique from all the other entities.

Entities are generally the people, places, and things that make up a process. They can be qualified as one of three basic types: kernel entities, associative entities, and characteristic entities. These entity types are described further in the following list:

A kernel entity exists on its own; it doesn't define or provide descriptive information for other entities. An example of a kernel entity would be a product listing in an inventory model. The information contained in each kernel entity of a table represents the heart of the database model.
Associative entities are needed to allow multiple kernel entities to be tied together. In the inventory system, a sales entity would be needed to tie a customer kernel entity to the products they have purchased. This same sales entity could be tied to another kernel entity, such as salespeople.
A characteristic entity provides additional information for a respective kernel or associative entity. Information contained in characteristic entities can be updated independently of the related entity. A product entity could have a characteristic parts entity. A given product could be made up of a number of parts. A part that becomes unavailable could affect the product's availability. Changes over time to parts information could be made more easily if a parts entity existed, instead of your having to make changes against the products kernel.

Some entities stand out within a process and are easily recognized, particularly those that represent people or organizations. Entities such as customer, supplier, employee, and shipper are all relatively easy to identify, whereas other entities are more difficult to identify. Careful thought about a business process will help flush them out. Let's look at the process used when a customer orders merchandise and use it as an example.

A customer will order products from one of our salespeople. The order is recorded on an invoice on which each line item represents the quantity of a single product ordered. Any products not currently in stock will be back-ordered. Ordering the needed products from the supplier will fill backorders. When an order is ready, the company can ship it to the customer using one of the available shipping methods. This simple process indicates that a few other entities are needed. Order, product, and order detail will be needed to fulfill the order process and track the data accordingly.

This process will allow for the initial sketch of entities to be drawn, as shown in Figure 2.1.

Figure 2.1. A rough draft of Northwind's entities.

With further knowledge of the business process, other entities may come to light. In the case of the Northwind process, products all fall into separate categories and the product categories are also tracked. This would require the addition of an additional entity, and the beginnings of the ER diagram would look similar to the example shown in Figure 2.2.

Figure 2.2. Northwind's entities in an ER Model.

Entity structuring accommodates the initial stages of database design. When you're designing an appropriate logical model, the data must be organized into these separate elements that will later make up the physical database tables. An entity is characterized by its attributes. Attributes are used to define the data elements of an entity. After the initial entities have been defined, the process of describing each entity through its characteristic properties begins.

Attribute Definition

Identifying attributes is the next step in ensuring a successful data modeling process. In defining attributes you are setting out to define entity composition. Each entity will have descriptive elements that pertain solely to that element. An attribute is a descriptive element or property of an entity. Fields will represent the attributes when the logical design progresses to the physical design stage.

Attributes are characteristics or properties defined within a single entity, and they correspond to real-world properties of a place, thing, or concept.

Attributes such as names and addresses are almost always present for people and organizations. Other attributes provide further information for the entity as required for the business process being defined.

Deciphering attributes from written descriptions and reports is more of a real-world scenario. The exam will provide the attributes; you will be expected to pick appropriate entities.

Try to find out the attributes that fit each of the entities. More attributes may have to be added later as the model becomes more complete. If you missed an attribute or added extra attributes at this time, they will likely be pointed out when the model is normalized. The normalization of a data model will be discussed later in this chapter, in the section "Data Normalization with Normal Forms." Attribute decisions will vary from person to person, depending on your business scenario perspective for which the data is being modeled.

The identification and creation of attributes is a developed skill; there is no true method for defining all attributes of an entity. Each business problem will require a variation of entity content, so the business process itself will lead to a lot of attribute choices.

A few guidelines to use in the identification, creation, and naming of attributes will help ease this process. The first is how you name your entities. A good name makes an attribute look professional and helps in its readability. Appropriate naming conventions are often developed as a corporate or development standard within an organization. Often mechanisms for shortening names or using common prefixing or suffixing is part of a programming team's standard. Here are some good guidelines that help in naming entities. Consistently following these guidelines will help to keep all of your designs up to the same standard:

An entity or an attribute should be named in its singular form, thereby implying that it is only a single instance. An instance is a single occurrence of an entity.
The use of underscores (_), spaces, and other special characters is not a good habit because special characters have particular meanings in some software packages, and the mixture of text and other characters is difficult to type. Try to distinguish a word from another by using mixed case, as in LastName instead of Last_Name.
Entity (and all other object name) identification should be kept small while still providing a description of the object. Names should be kept as small as possible but should still provide a meaningful object title.
Entity names should be unique.
Reserved words, though permitted in the context of SQL names, should be minimized to ease development. Later this will also add to the performance of procedures.

Keep in mind that many of these guidelines refer to all object naming, and when developing the names for attributes, you should still be providing a descriptive name that is concise and unique within the entity. Attribute names should be consistent across entities. For example, if you name an attribute LastName within one entity, you should not name a similar attribute Surname in other entities.

Decomposing an Attribute

Many attributes can be handled as a single attribute. It is also common for some attributes to be broken down into other, smaller attributes. This process is commonly known as decomposing attributes. Decomposing an attribute takes an attribute from its original form and divides it into its components. A good example of this is the breaking down of the Address attribute. An Address attribute can easily be broken down into attributes that store data as shown here:

`Street`	Stores the street address of the user
`City`	Stores where the user lives
`Region`	Stores the state or province the user lives in
`Postal Code`	Stores the user's zip code or other postal code
`Country`	Stores the user's nation

The process of decomposing an attribute helps to develop a normalized structure, as defined later in this chapter. Decomposing is a function of usage as well. If, for example, a person's name is needed only in its full form, then a decision may be made to not break it up into the separate attributes of first name and last name. This is common for a ContactPerson attribute in an entity that relates to a corporation.

Decomposing an attribute provides many benefits, in contrast to generic compound attributes that are not decomposed. Data integrity is improved. Data integrity is a measurement of how well data is kept consistent and flawless throughout the whole ER Model. When attributes are decomposed, different methods of ensuring data integrity can be applied to the broken-down segments rather than the attribute as a whole. It can be difficult to check the validity of an entire address, but when decomposed, the elements can be more easily checked.

Decomposing also aids in the sorting of data for use in specific business processes, such as mass mailing. You will, in most cases, also be improving data retrieval performance when decomposed attributes are used. A generic attribute, Address, contains the street, city, region, postal code, and country. To get just the region of a customer in Washington, you have to select the whole Address attribute and parse it to find Washington, thereby degrading performance because of the redundant data retrieved. If you have four separate attributes, you can select the Region column and get your results more quickly. This same rule applies to updating data. It's easier to update a single part of an Address than to parse and then update the whole Address attribute.

As shown by the example provided in Figure 2.3, the Address has been fully decomposed for all entities, but the name has been decomposed only for the Employee entity.

Figure 2.3. Northwind's entities with attributes.

Key Attributes

The use of an attribute can vary from system to system, but some attributes will be present in most systems to help sort data and perform relationship ties between one entity and another. A key attribute is almost always present within an entity to act as an identifier, much as a person's name identifies that person as being a unique individual. Entities are usually interdependent: Each holds information that relates to other entities. These relationships can be defined by their correlated dependencies. Key attributes are also therefore used for the purpose of relating one entity to another.

Recognizing Key Attributes

After all attributes have been defined and keys have begun to be recognized, the modeling process will be completed with the application of relationships and the normalization of data. These two processes are closely related, as you will see later, but before they can begin, key attributes must be recognized. These are specialized attributes referred to as identifiers. An identifier is an attribute or a set of attributes that defines one unique element of an entity. The use of identifiers allows for the individual selection of records from an entity. As the design progresses to the physical stage, identifiers will become the primary and foreign keys, allowing entities to be tied together through association or relationships. For Example, a product's identifying attribute is usually a unique product ID.

Identification of Primary and Foreign Keys

A primary key is a specialized attribute that is generally defined for each entity. The primary key is almost always defined, though is not necessarily required for all entities in a data model. However, the provision of a primary key does allow for a considerable number of benefits and should be considered in every instance. When defining a primary key, you should keep various factors in mind. The primary key normally defines uniqueness in an entity in that every record of a table has its own unique primary key. Also, when defined, a primary key should not be permitted to be empty. If a primary key is empty, you have a situation in which data integrity is difficult (if not impossible) to maintain.

A primary key should be defined as a single attribute that doesn't allow for duplicates or empty content. The primary key should be as small as possible. It is possible to create a compound primary key that uses multiple attributes or a key that contains a large number of bytes, but in the physical design this will increase the overhead and response time associated with data retrieval. A compound primary key consisting of multiple attributes is used in instances in which a singular attribute by itself does not enforce uniqueness within an entity. As we move through the modeling process, we will discover the need to use such a compound structure in some instances.

In most cases, an attribute is identified for each entity that will act as a key. This attribute could be a customer number, a product number, or simply an attribute to serve no other purpose than to act as a key identifier. When an attribute is added solely for the sake of being an identifier, it is known as a surrogate key. Whether you use a natural or surrogate key, that key will represent an important part in establishing relationships. In most circles surrogate keys are preferred because there is never a need to have this surrogate key change. The process of altering a key value can have repercussions on the business process and can also effect many elements of the database system.

A longtime favorite answer on Microsoft database exams is to use surrogate keys in all entities. As mentioned previously, there are several benefits to doing this.

A foreign key is used to tie one entity to the primary key or unique data value of another entity. The relationship is created for the purpose of creating a dependency between the entities. A single attribute or combination of attributes can act as a foreign key depending on the makeup of the referenced primary key. A foreign key doesn't have to be unique. In fact, foreign keys are often in a many-to-one relationship with a primary key in another entity. Foreign key values* should be copies of the primary key values. No value in the foreign key, except a null value, should ever exist unless the same value exists in the primary key of the referenced entity.

A foreign key works in conjunction with a primary key or some other unique attribute to enforce referential integrity among entities. A relationship is created to enforce referential integrity between these two related entities. Foreign key connections may not be fully recognized for the model until you begin to look at the interrelationships of the entities. After a relationship is defined, the connection developed for the relationship will contain the foreign key pointing to the correlated primary key attribute.

For a foreign key to be created, a primary key must first be properly defined. Once defined, this primary key can be referenced by the foreign key. To quickly summarize the use of primary keys, keep the following in mind:

Primary keys consist of sets of attributes whose values uniquely identify the rows in an entity.
Primary keys give an ID to a row. They make the row unique throughout the entity. This means that rows can easily be located by this identifier.
Primary keys can be used only for columns or attributes that don't allow empty entries. Allowing empty values would mean that a row would not be uniquely identified.
The attribute chosen to hold a primary key must have values unique throughout the entity.

Foreign keys help in the relational process between two entities. When a primary key is created on a parent entity, it is connected to another entity by linking to the other entity's foreign key. For example, in an invoice situation, there are usually two entities: one for invoice general information and the other for invoice details. The invoice details would contain a hook on to the invoice general entity through the use of a foreign key, potentially the invoice number or a surrogate key.

Keys are usually created as part of the table creation process, but they can be added to the table after the initial generation. The syntax for the creation of keys and their association to tables are discussed in Chapter 3.

Before we are ready to draw all relationships into the model, the data model should begin a process of normalization. Some of the relationships can no doubt already be seen; however, when normalization standards are applied to the model, more relationships may be found. Only after the model has been fully normalized will you have a complete view of the entity relationships.