Data Modeling | Microsoft Corporation - Analyzing Requirements and Defining Solutions Architecture. MCSD Training Kit

Data modeling is the process of identifying, documenting, and implementing data requirements for an application. Data modeling involves reviewing existing models and processes to determine if the data models can be reused. Data modeling also involves creating new data models and processes to suit a team's application requirements. The major events in data modeling are typically:

Identifying data and associated processes (for example, the data's logical design).

Defining data types, sizes, and defaults.

Ensuring data integrity by using business rules and validation checks.

Defining operational processes, such as security reviews and backups.

Choosing a data storage technology, such as relational, hierarchical, or indexed.

It's important to understand that data modeling often involves unexpected organizational management. For example, data ownership, along with the implied responsibility of data maintenance, accuracy, and timeliness, is often challenged by new insights into the specific data elements to be maintained by particular organizations. Data design often forces an organization to recognize how enterprise data systems are interdependent. Data design also encourages efficiency, cost savings, and strategic opportunities, which arise because of coordinated data planning.

NOTE
As discussed in Chapter 6, gathering information typically begins with requirements and progresses with Conceptual Design's use cases and usage scenarios and Logical Design's object, user, and service designs.

Identifying Data

Data describes a real-world information resource important to applications. It describes the people, products, items, customers, assets, records, and—ultimately—the data structures that applications categorize, organize, and maintain.

Identifying data is an iterative process. At first, the development team may pinpoint several vague, high-level details about how the application must handle its information. As the team's knowledge of the application's intended business process expands, the team fills in more details. The description for each data item typically includes:

Name.

General description.

Ownership (who is responsible).

Data characteristics.

Logical events, processes, and relationships (how and when the data is created, modified, and used).

Part of the data design process is to specify how each data item should be quantified. Some typical data specifications, or attributes, are:

Location (address, country, warehouse bin)

Physical (weight, dimension, volume, color, material, texture)

Conceptual (name, rank, serial number)

Relational (assemblies consisting of subassemblies, authors who write multiple books)

Value (currency, esteem)

The process of identifying data can involve interviews, analysis of existing data structures, document preparation, and peer reviews. The eventual result is a documented, conceptual view of an application's data that answers the questions of "Who, what, where, when, and how?" Generally, this is an early-stage exploration of how various departments, organizations, and applications need to research data.

Defining Data

As the team learns more about the application's data structures, it can group selected data items together and assign certain information details that describe the data's characteristics and relationships. A general approach to defining data includes:

Defining tables, rows, and columns.

Inserting index keys.

Creating table relations.

Assigning data types.

Defining Tables, Rows, and Columns

Regardless of how an application's data is physically stored, the data is typically organized into multiple tables or files, each having a set of rows or records, and columns or fields, similar to a spreadsheet's rows and columns (see Table 9.1). Each row in the table contains information about a particular person, product, item, customer, asset, or related item.

Table 9.1 Example of a People table

Row	Name	Address	Phone
1	Dan Shelly	100 Microsoft Way	(xxx) xxx-xxxx
2	Tim O'Brien	100 Microsoft Way	(xxx) xxx-xxxx
3	Marilou Moris	100 Microsoft Way	(xxx) xxx-xxxx
4	Jane Clayton	100 Microsoft Way	(xxx) xxx-xxxx

Inserting Index Keys

An index key is a special field that provides an index for fast retrieval. Such a key can be unique or non-unique, depending on whether duplicates are allowed. An index key can be designated as the primary key, making it the unique identifier for each table row. Keys should be used when applications need direct access to certain rows. For example, in Table 9.2 an author's identification number (Au_id) is the table's primary key because Au_id uniquely identifies only one author. A query using Au_id will provide extremely fast retrieval of the author's information.

Table 9.2 Example of an Author table

Authors table
Au_id (key)
Au_name
Au_Address
Au_Phone

Creating Table Relations

A database is usually composed of several tables rather than just one. These tables are often related to one another in various ways. For instance, a Titles table for library books could list the International Standard Book Number (ISBN), title, and year a particular book was published. In a database such as this one, it would also be useful to identify the publisher for each title. Rather than repeating all publisher information for each title in the Titles table, a relationship with a separate Publishers table can be established by including the publisher's identification (Pu_id) as a foreign key in the Titles table, as shown in Table 9.3.

Table 9.3 Example of related Titles and Publishers tables

Titles table	Publishers table
Ti_isbn (key)	Pu_id
Ti_title	Pu_name
Ti_yearpublished	Pu_address
Pu_id (foreign key)	Pu_phone

The relationship as demonstrated in this table is a one-to-many relationship. A single row in the Titles table is related to just one publisher in the Publishers table, but a single row in the Publishers table can be related to one or more records in the Titles table. As shown in Figure 9.2, relationships can be one-to-one, one-to-many, and many-to-many.

click to view at full size

Figure 9.2 One-to-one, one-to-many, and many-to-many data relationships

It's worth noting that although a relationship has been identified as existing between the Titles and Publishers tables, no commitment has been made as to how that relationship will be managed. Depending on the final implementation, the development team might meet this management challenge by using table joins and foreign key constraints available with Microsoft SQL Server, or by writing custom code to read file structures directly and handle referential integrity inside application business objects.

Assigning Data Types

A data type is a named data category characterized by a set of values, a way to denote the set of values, and several implied operations that can interpret and manipulate the set of values. Data types can be:

Intrinsic This data type is provided by a database. For example, SQL Server provides intrinsic data types such as integer, datetime, bit, char, and varchar.

Derived This data type is defined using the Data Modeling Language (DML) provided by a database. A derived data type is built from available intrinsic data types or previously defined derived data types. Developers typically provide a name and structure for the derived data type. With derived data types, consistent use of special data types can be assured for selected columns, variables, and parameters.

Data types are important because they assure that an assigned data value is the correct type and within the acceptable range of values. Different data storage technologies and development languages support a variety of data types, such as character, variable character, integer, double integers, floating point numbers, bit fields, and binary fields.

While assigning data types, the development team should be sure that the data-type range fits the data to be stored, and will continue to fit it in the future. For example, with SQL Server an application can handle a maximum of 255 customer records if "tinyint" is chosen as a data type for customer identification, but more than two billion customer records if "integer" is chosen. As another example, if a single character is chosen as the customer service code, an expansion to two characters could create maintenance havoc. Every component and routine that accesses the service code would have to be examined and potentially updated.

Space can be saved in a database and table join operations can be improved by choosing appropriate data types for fields. As a general rule, the smallest data type appropriate for data in a field should be chosen.

While assigning data types, the development team should consider:

Allowable maximums and minimums.

Default values.

Empty (or NULL) values.

Anticipated growth.

Expected and, when possible, unexpected changes.

In a relational database environment, data types help enforce the business rules. For example, dollars can't be mathematically added to colors. Although no one would ever program this mathematical process intentionally, a relational database identifies the data type mismatch and automatically denies database queries that attempt the impossible.

Ensuring Data Integrity

Data integrity involves actual values stored and used in an application's data structures. An application must exert deliberate control over every process that uses data to ensure continued correct information.

Developers can ensure data integrity through careful implementation of several key concepts, including:

Normalizing data.

Defining business rules for data access.

Providing referential integrity.

Validating data.

Normalizing Data

A database designer must refine and structure data to eliminate unnecessary duplication and provide a rapid search path to all necessary information. This process of refining tables, keys, columns, and relationships to create an efficient database is called normalization. Data normalization applies to relational as well as indexed files.

Normalization is a complex process with many specific rules and different levels of intensity. In its full definition, normalization is the process of discarding repeating groups, minimizing redundancy, eliminating composite keys for partial dependency, and separating non-key attributes. In simple terms, rules for normalization can be summed up as follows: "Each attribute (column) must be a fact about the key, the whole key, and nothing but the key." In other words, each table should describe only one type of entity (such as a person, place, customer order, or product item).

Some benefits of normalization are:

Data integrity There is no redundant, neglected data.

Optimized queries Normalized tables produce rapid, efficient joins.

Faster index creation and sorting Tables have few columns.

Faster updating There are fewer indexes per table.

Improved concurrency resolution Table locks will affect less data.

Most simple databases can be normalized by following a simple rule: Tables containing duplicate information should be divided into separate tables to eliminate redundancy. For example, suppose a bookseller has developed a new application that helps track information about each book. This application might include the following data:

Author name

Author address

Author phone

Title

ISBN

Year published

Publisher name

Publisher address

Publisher phone

The bookseller could create a single table with a field for each data item. However, such a table would contain many redundancies. For example, many authors have written more than one book for a single publisher, so the publisher information for each author would be repeated multiple times. If these fields were included in a single table, confusing duplication of entries would most likely occur. Using normalization principles, the data should be broken into four groups as shown in Table 9.4.

Table 9.4 Normalized Authors, AuthorsTitles, Titles, and Publishers tables

Authors table	AuthorsTitles table	Titles table	Publishers table
Au_id (key)	Au_id (foreign key)	Ti_isbn (key)	Pu_id (key)
Au_name	Ti_isbn (foreign key)	Ti_title	Pu_name
Au_address		Ti_yearpublished	Pu_address
Au_phone		Pu_id (foreign key)	Pu_phone

The keys provide a means of establishing table relationships. For example, the AuthorsTitles table creates a many-to-many relationship between the Authors and Titles tables (many authors might write many titles). Using the AuthorsTitles table, a query can find the ISBN of every book number an author wrote (using Au_id), and also determine which author wrote a certain book (using Ti_isbn).

Rather than create a separate AuthorsTitles table, the Au_id attribute could be added to the Titles table. Such an option is neither superior nor inferior to the previous example; it's simply a design tradeoff in which developers must evaluate their applications' expected query types, potential multi-user concurrency issues, and a possible performance issue of three indexes on one table.

Defining Business Rules for Data Access

Developers can use business rules to provide correct and consistent control of their applications' data access. Furthermore, subsequent applications should be expected to use the business rules set with the initial application, and thereby benefit from the built-in process dependencies and relationships already provided. In general, business rules that perform data access must be designed thoughtfully to provide self-contained, carefully coordinated processes.

Applications require data access business rules under the following circumstances:

Inserting, updating, deleting, and viewing data

Validating data

Controlling data security

Handling multi-source data access

Providing application-based referential integrity

A business rule can be used each time an application inserts, updates, deletes, or views data. Business rules implemented in this manner provide concise control over data that can be updated. For example, if an application applies new sales orders to an invoice file, a business rule should automatically check the customer's credit limit before accepting and inserting sales order line items.

Data integrity is the process of verifying field values and validating related file values. In other words, data integrity verifies that numeric fields are genuinely number-based and within range, and also checks to see if particular relationships exist in their appropriate files. By putting all data validation routines into business rules, applications can guarantee correct data and easily adapt to future requirements.

Applications may require data access security to control access privileges for those permitted to use organizational applications. Business rules are an excellent way to manage data access privileges.

If a particular application needs to trace a complex chain of records as preparation for a decision process, a business rule can be used to simplify multi-source access. Such a business rule would automatically locate all required data stores and repackage them for easy use. For example, suppose an application needs to determine maximum possible payout for a single procedure in a multi-line healthcare claim. Inspecting the current line item involves searching the beneficiary's entire claim history for prior use of that line item's identical procedure. Additionally, lifetime and current year-to-date limits must be checked to determine the allowable amount. This multi-source access presents an excellent opportunity to create a reusable business rule that consistently and correctly handles the checking process.

One of the most common uses for business rules is handling the referential integrity processes for indexed files. Because indexed files such as those indexed using the virtual storage access method (VSAM) are typically controlled by the data storage engines, an application must provide custom code to handle constraints, foreign key deletions, and other common referential integrity issues. Application-based referential integrity can also be appropriate for relational databases, especially in situations where available triggers, constraints, and stored procedures are either inadequate or too complicated.

Providing Referential Integrity

For referential integrity to be maintained, a foreign key in any referencing table must always refer to a valid row in the table referenced. Referential integrity ensures that the relationship between two tables remains synchronized during updates and deletes. For example, suppose that an application contains both a Titles and Publishers table as shown in Table 9.5. Referential integrity requires that these two tables be synchronized. In other words, each publisher identification (Pu_id) in the Titles table must also exist in the Publishers table. The application cannot simply delete the Pu_id row from the Publishers table, because the Pu_id in the Titles table would lack a reference. However, it would be permissible to delete the Pu_id row from the Publishers table, and also delete every row in the Titles table that has an identical Pu_id. Such an action would maintain referential integrity for the two tables.

Table 9.5 Example Titles and Publishers tables

Titles table	Publishers table
Ti_isbn (key)	Pu_id (key)
Ti_title	Pu_name
Ti_yearpublished	Pu_address
Pu_id	Pu_phone

In a similar manner, the application can't simply add a row to the Titles table without a valid Pu_id already in the Publishers table. To do so would require bad data in the Pu_id field. Therefore, the application must ensure a valid Publishers row before inserting the related Titles row.

The actual implementation of referential integrity depends entirely on the data storage engine, as well as the application's design requirements. Historically, applications using mainframe VSAM files would use application code to handle referential integrity. Even if an application uses SQL Server or Oracle, the development team shouldn't necessarily use triggers, foreign keys, constraints, and cascading delete functions to maintain referential integrity. Instead, referential issues might be handled using application-based code.

NOTE
Typically, application-based referential issues can be implemented most efficiently within data access objects and services. Occasionally, referential issues are handled within business layer objects and services.

Validating Data

Data validation guarantees that every data value is correct and accurate. Data validation can be designed into an application by using several different mechanisms: UI code, application code, database constraints, and business rules. Several types of data validation are possible:

Data type validation One of the simplest forms of validating data, this type provides answers to such questions as "Is the string alphabetic?" and "Is the number numeric?" Usually such simple validations can be handled through the application's user interface.

Range checking As an extension of simple type validation, range checking ensures that a provided value is within allowable minimums and maximums. For example, a character data type service code might allow only the Roman alphabetic letters A through Z. All other characters would be invalid. As with data type validation, an application's interface can typically provide necessary range validation, although a business rule could also be created to handle range validation.

Code checking Code checking can be complicated and typically requires a lookup table. For example, when creating an application that calculates sales tax, the development team might need to create a validation table to hold the authorized, regional tax codes. This validation table could be part of a business rule, or it could be implemented directly in the application's database for query lookup.

Complex validation When simple file and lookup validation are insufficient, complex validation can be used. This type of validation is best handled with business rules. For example, a healthcare-related application could read a single claim with a billed amount of $123.57, but the allowable amount may depend on a year-to-date rolling accumulation capped at $1,500 (not to exceed the lifetime policy maximum of $100,000). In this situation, data validation extends beyond the immediate data entry screen to one of careful evaluation of how to pay this claim based on the policy limits and year-to-date accruals.

CAUTION
Be careful when creating applications that require localization, or distribution with different language code sets. Much of the data validation can be moved into a resource file that can be swapped into the application depending on the localization.

In older file structures, data is unfortunately often corrupted. (For instance, numeric fields might be blank or contain non-numeric characters.) As developers build enterprise applications, they may want to build a testing utility to verify that every field in every record of the files their applications use is correct. If not, these applications may provide unpredictable results.

Defining Operational Processes

Regular operation processes to protect data integrity are the backbone of applications. Every application requires ongoing maintenance activities, including:

Maintaining databases.

Providing data backups.

Maintaining Databases

If a development team is responsible for maintaining databases, the team will need to periodically perform a number of database-oriented tasks. For relational databases, these tasks include clearing the log file, checking memory and procedure cache peak loads, compressing file size, and validating links between tables and index pages. For hierarchical databases, these tasks include checking for broken links by walking the data structure thoroughly through all subordinate records. For VSAM files, the team will need to perform a rebuild with additional pre-allocated index positions for future expansion, based on the file's historical growth trends.

NOTE
If the team is using SQL Server, developers can use the Database Consistency Checker (DBCC) to analyze and repair various aspects of a SQL Server database and installation.

Providing Data Backups

Backups are crucial to applications in case developers need to recover data due to corruption or another catastrophic event. Several backup choices are available, including:

Dumping.

Mirroring.

Replicating.

Dumping involves copying data to an external device, such as a disk or tape, using a specialized format. Because dumping provides a snapshot of an entire database, changes made after dumping are lost (unless developers supplement the dump with transaction logs). Dumping is a popular method of handling data recovery. Developers using dumping for data recovery purposes should be sure to minimally perform it nightly.

Most relational databases (including SQL Server) provide mirroring, an alternative to dumping that involves continuously copying all transactions on one device to its duplicate, or mirror, device. Once mirroring is set up, it is generally seamless and maintenance-free. The primary benefit of mirroring is that recovery from a single device failure is virtually instantaneous. Those implementing mirroring should take caution, however; if the network connection to the mirroring computer becomes corrupted, many databases will cease to function. Even when implementing mirroring, developers should still perform periodic dumps to be safe.

NOTE
Another technology for data fault tolerance is clustering, which is similar in concept to mirroring. Clustering connects two separate computers to a single data store. Clustering database servers can eliminate the physical hardware failure points of a single computer but typically do not provide a backup of the data.

By replicating, or using replication, developers can copy all or portions of their databases to any number of remote computers. Besides providing a means to backup data, replication has other benefits as well—developers can distribute their data throughout a network, balancing the workload and minimizing the risk of catastrophic failure. Replication is primarily a tool for keeping data consistent among distributed databases, and is not generally recognized as a way to provide database backups.

Choosing a Data Storage Technology

Developers have many options for defining a database and storing their application's data. Among the most popular are indexed, hierarchical, and relational data storage technologies. These types of data storage differ not only in that they physically manage storage and retrieval of data, but also in conceptual models they present to users and developers.

Indexed databases, such as VSAM, which provide extremely fast data retrieval, are suitable for sequential lists, random retrieval, and complex file relationships. Developers can easily read a VSAM file from beginning to end, or simply retrieve specific records using an index key. Indexed databases typically provide only data storage and retrieval. Developers' application code must handle referential integrity and data validation processes to accommodate what indexed databases can not provide.

A hierarchical database is especially useful for implementing inverted tree structures, such as bills of material or organizational ranking structures. Hierarchical data access is extraordinarily fast, because these data structures are directly linked. One interesting feature of a hierarchical database is that referential integrity is built-in. However, implementation of hierarchical databases often requires an experienced systems programmer to compensate for these databases' deficiencies in modeling complex relationships.

Relational databases have generally become the de facto standard for database storage. Relational databases are preferred because the relational model has a high usability factor, and also provides a standard interface, SQL, that allows numerous database tools and products to work together in a consistent and readable manner. Additionally, relational databases typically provide mechanisms for handling referential integrity, data validation, and a host of administrative processes to set up and maintain application data.

If a team designs an enterprise application, the application's developers may use existing mainframe databases, including relational, VSAM, and AS/400 files. With SNA Server, a team can seamlessly use mainframe and distributed databases on a network. This integration of mainframe and distributed data with Windows-based applications means that developers have many choices for data storage technologies.