Data modeling is the process of identifying, documenting, and implementing data requirements for an application. Data modeling involves reviewing existing models and processes to determine if the data models can be reused. Data modeling also involves creating new data models and processes to suit a team's application requirements. The major events in data modeling are typically:
It's important to understand that data modeling often involves unexpected organizational management. For example, data ownership, along with the implied responsibility of data maintenance, accuracy, and timeliness, is often challenged by new insights into the specific data elements to be maintained by particular organizations. Data design often forces an organization to recognize how enterprise data systems are interdependent. Data design also encourages efficiency, cost savings, and strategic opportunities, which arise because of coordinated data planning.
NOTE
As discussed in Chapter 6, gathering information typically begins with requirements and progresses with Conceptual Design's use cases and usage scenarios and Logical Design's object, user, and service designs.
Data describes a real-world information resource important to applications. It describes the people, products, items, customers, assets, records, and—ultimately—the data structures that applications categorize, organize, and maintain.
Identifying data is an iterative process. At first, the development team may pinpoint several vague, high-level details about how the application must handle its information. As the team's knowledge of the application's intended business process expands, the team fills in more details. The description for each data item typically includes:
Part of the data design process is to specify how each data item should be quantified. Some typical data specifications, or attributes, are:
The process of identifying data can involve interviews, analysis of existing data structures, document preparation, and peer reviews. The eventual result is a documented, conceptual view of an application's data that answers the questions of "Who, what, where, when, and how?" Generally, this is an early-stage exploration of how various departments, organizations, and applications need to research data.
As the team learns more about the application's data structures, it can group selected data items together and assign certain information details that describe the data's characteristics and relationships. A general approach to defining data includes:
Regardless of how an application's data is physically stored, the data is typically organized into multiple tables or files, each having a set of rows or records, and columns or fields, similar to a spreadsheet's rows and columns (see Table 9.1). Each row in the table contains information about a particular person, product, item, customer, asset, or related item.
Table 9.1 Example of a People table
Row | Name | Address | Phone |
---|---|---|---|
1 | Dan Shelly | 100 Microsoft Way | (xxx) xxx-xxxx |
2 | Tim O'Brien | 100 Microsoft Way | (xxx) xxx-xxxx |
3 | Marilou Moris | 100 Microsoft Way | (xxx) xxx-xxxx |
4 | Jane Clayton | 100 Microsoft Way | (xxx) xxx-xxxx |
An index key is a special field that provides an index for fast retrieval. Such a key can be unique or non-unique, depending on whether duplicates are allowed. An index key can be designated as the primary key, making it the unique identifier for each table row. Keys should be used when applications need direct access to certain rows. For example, in Table 9.2 an author's identification number (Au_id) is the table's primary key because Au_id uniquely identifies only one author. A query using Au_id will provide extremely fast retrieval of the author's information.
Table 9.2 Example of an Author table
Authors table |
---|
Au_id (key) |
Au_name |
Au_Address |
Au_Phone |
A database is usually composed of several tables rather than just one. These tables are often related to one another in various ways. For instance, a Titles table for library books could list the International Standard Book Number (ISBN), title, and year a particular book was published. In a database such as this one, it would also be useful to identify the publisher for each title. Rather than repeating all publisher information for each title in the Titles table, a relationship with a separate Publishers table can be established by including the publisher's identification (Pu_id) as a foreign key in the Titles table, as shown in Table 9.3.
Table 9.3 Example of related Titles and Publishers tables
Titles table | Publishers table |
---|---|
Ti_isbn (key) | Pu_id |
Ti_title | Pu_name |
Ti_yearpublished | Pu_address |
Pu_id (foreign key) | Pu_phone |
The relationship as demonstrated in this table is a one-to-many relationship. A single row in the Titles table is related to just one publisher in the Publishers table, but a single row in the Publishers table can be related to one or more records in the Titles table. As shown in Figure 9.2, relationships can be one-to-one, one-to-many, and many-to-many.
Figure 9.2 One-to-one, one-to-many, and many-to-many data relationships
It's worth noting that although a relationship has been identified as existing between the Titles and Publishers tables, no commitment has been made as to how that relationship will be managed. Depending on the final implementation, the development team might meet this management challenge by using table joins and foreign key constraints available with Microsoft SQL Server, or by writing custom code to read file structures directly and handle referential integrity inside application business objects.
A data type is a named data category characterized by a set of values, a way to denote the set of values, and several implied operations that can interpret and manipulate the set of values. Data types can be:
Data types are important because they assure that an assigned data value is the correct type and within the acceptable range of values. Different data storage technologies and development languages support a variety of data types, such as character, variable character, integer, double integers, floating point numbers, bit fields, and binary fields.
While assigning data types, the development team should be sure that the data-type range fits the data to be stored, and will continue to fit it in the future. For example, with SQL Server an application can handle a maximum of 255 customer records if "tinyint" is chosen as a data type for customer identification, but more than two billion customer records if "integer" is chosen. As another example, if a single character is chosen as the customer service code, an expansion to two characters could create maintenance havoc. Every component and routine that accesses the service code would have to be examined and potentially updated.
Space can be saved in a database and table join operations can be improved by choosing appropriate data types for fields. As a general rule, the smallest data type appropriate for data in a field should be chosen.
While assigning data types, the development team should consider:
In a relational database environment, data types help enforce the business rules. For example, dollars can't be mathematically added to colors. Although no one would ever program this mathematical process intentionally, a relational database identifies the data type mismatch and automatically denies database queries that attempt the impossible.
Data integrity involves actual values stored and used in an application's data structures. An application must exert deliberate control over every process that uses data to ensure continued correct information.
Developers can ensure data integrity through careful implementation of several key concepts, including:
A database designer must refine and structure data to eliminate unnecessary duplication and provide a rapid search path to all necessary information. This process of refining tables, keys, columns, and relationships to create an efficient database is called normalization. Data normalization applies to relational as well as indexed files.
Normalization is a complex process with many specific rules and different levels of intensity. In its full definition, normalization is the process of discarding repeating groups, minimizing redundancy, eliminating composite keys for partial dependency, and separating non-key attributes. In simple terms, rules for normalization can be summed up as follows: "Each attribute (column) must be a fact about the key, the whole key, and nothing but the key." In other words, each table should describe only one type of entity (such as a person, place, customer order, or product item).
Some benefits of normalization are:
Most simple databases can be normalized by following a simple rule: Tables containing duplicate information should be divided into separate tables to eliminate redundancy. For example, suppose a bookseller has developed a new application that helps track information about each book. This application might include the following data:
The bookseller could create a single table with a field for each data item. However, such a table would contain many redundancies. For example, many authors have written more than one book for a single publisher, so the publisher information for each author would be repeated multiple times. If these fields were included in a single table, confusing duplication of entries would most likely occur. Using normalization principles, the data should be broken into four groups as shown in Table 9.4.
Table 9.4 Normalized Authors, AuthorsTitles, Titles, and Publishers tables
Authors table | AuthorsTitles table | Titles table | Publishers table |
---|---|---|---|
Au_id (key) | Au_id (foreign key) | Ti_isbn (key) | Pu_id (key) |
Au_name | Ti_isbn (foreign key) | Ti_title | Pu_name |
Au_address | Ti_yearpublished | Pu_address | |
Au_phone | Pu_id (foreign key) | Pu_phone |
The keys provide a means of establishing table relationships. For example, the AuthorsTitles table creates a many-to-many relationship between the Authors and Titles tables (many authors might write many titles). Using the AuthorsTitles table, a query can find the ISBN of every book number an author wrote (using Au_id), and also determine which author wrote a certain book (using Ti_isbn).
Rather than create a separate AuthorsTitles table, the Au_id attribute could be added to the Titles table. Such an option is neither superior nor inferior to the previous example; it's simply a design tradeoff in which developers must evaluate their applications' expected query types, potential multi-user concurrency issues, and a possible performance issue of three indexes on one table.
Developers can use business rules to provide correct and consistent control of their applications' data access. Furthermore, subsequent applications should be expected to use the business rules set with the initial application, and thereby benefit from the built-in process dependencies and relationships already provided. In general, business rules that perform data access must be designed thoughtfully to provide self-contained, carefully coordinated processes.
Applications require data access business rules under the following circumstances:
A business rule can be used each time an application inserts, updates, deletes, or views data. Business rules implemented in this manner provide concise control over data that can be updated. For example, if an application applies new sales orders to an invoice file, a business rule should automatically check the customer's credit limit before accepting and inserting sales order line items.
Data integrity is the process of verifying field values and validating related file values. In other words, data integrity verifies that numeric fields are genuinely number-based and within range, and also checks to see if particular relationships exist in their appropriate files. By putting all data validation routines into business rules, applications can guarantee correct data and easily adapt to future requirements.
Applications may require data access security to control access privileges for those permitted to use organizational applications. Business rules are an excellent way to manage data access privileges.
If a particular application needs to trace a complex chain of records as preparation for a decision process, a business rule can be used to simplify multi-source access. Such a business rule would automatically locate all required data stores and repackage them for easy use. For example, suppose an application needs to determine maximum possible payout for a single procedure in a multi-line healthcare claim. Inspecting the current line item involves searching the beneficiary's entire claim history for prior use of that line item's identical procedure. Additionally, lifetime and current year-to-date limits must be checked to determine the allowable amount. This multi-source access presents an excellent opportunity to create a reusable business rule that consistently and correctly handles the checking process.
One of the most common uses for business rules is handling the referential integrity processes for indexed files. Because indexed files such as those indexed using the virtual storage access method (VSAM) are typically controlled by the data storage engines, an application must provide custom code to handle constraints, foreign key deletions, and other common referential integrity issues. Application-based referential integrity can also be appropriate for relational databases, especially in situations where available triggers, constraints, and stored procedures are either inadequate or too complicated.
For referential integrity to be maintained, a foreign key in any referencing table must always refer to a valid row in the table referenced. Referential integrity ensures that the relationship between two tables remains synchronized during updates and deletes. For example, suppose that an application contains both a Titles and Publishers table as shown in Table 9.5. Referential integrity requires that these two tables be synchronized. In other words, each publisher identification (Pu_id) in the Titles table must also exist in the Publishers table. The application cannot simply delete the Pu_id row from the Publishers table, because the Pu_id in the Titles table would lack a reference. However, it would be permissible to delete the Pu_id row from the Publishers table, and also delete every row in the Titles table that has an identical Pu_id. Such an action would maintain referential integrity for the two tables.
Table 9.5 Example Titles and Publishers tables
Titles table | Publishers table |
---|---|
Ti_isbn (key) | Pu_id (key) |
Ti_title | Pu_name |
Ti_yearpublished | Pu_address |
Pu_id | Pu_phone |
In a similar manner, the application can't simply add a row to the Titles table without a valid Pu_id already in the Publishers table. To do so would require bad data in the Pu_id field. Therefore, the application must ensure a valid Publishers row before inserting the related Titles row.
The actual implementation of referential integrity depends entirely on the data storage engine, as well as the application's design requirements. Historically, applications using mainframe VSAM files would use application code to handle referential integrity. Even if an application uses SQL Server or Oracle, the development team shouldn't necessarily use triggers, foreign keys, constraints, and cascading delete functions to maintain referential integrity. Instead, referential issues might be handled using application-based code.
NOTE
Typically, application-based referential issues can be implemented most efficiently within data access objects and services. Occasionally, referential issues are handled within business layer objects and services.
Data validation guarantees that every data value is correct and accurate. Data validation can be designed into an application by using several different mechanisms: UI code, application code, database constraints, and business rules. Several types of data validation are possible:
CAUTION
Be careful when creating applications that require localization, or distribution with different language code sets. Much of the data validation can be moved into a resource file that can be swapped into the application depending on the localization.
In older file structures, data is unfortunately often corrupted. (For instance, numeric fields might be blank or contain non-numeric characters.) As developers build enterprise applications, they may want to build a testing utility to verify that every field in every record of the files their applications use is correct. If not, these applications may provide unpredictable results.
Regular operation processes to protect data integrity are the backbone of applications. Every application requires ongoing maintenance activities, including:
If a development team is responsible for maintaining databases, the team will need to periodically perform a number of database-oriented tasks. For relational databases, these tasks include clearing the log file, checking memory and procedure cache peak loads, compressing file size, and validating links between tables and index pages. For hierarchical databases, these tasks include checking for broken links by walking the data structure thoroughly through all subordinate records. For VSAM files, the team will need to perform a rebuild with additional pre-allocated index positions for future expansion, based on the file's historical growth trends.
NOTE
If the team is using SQL Server, developers can use the Database Consistency Checker (DBCC) to analyze and repair various aspects of a SQL Server database and installation.
Backups are crucial to applications in case developers need to recover data due to corruption or another catastrophic event. Several backup choices are available, including:
Dumping involves copying data to an external device, such as a disk or tape, using a specialized format. Because dumping provides a snapshot of an entire database, changes made after dumping are lost (unless developers supplement the dump with transaction logs). Dumping is a popular method of handling data recovery. Developers using dumping for data recovery purposes should be sure to minimally perform it nightly.
Most relational databases (including SQL Server) provide mirroring, an alternative to dumping that involves continuously copying all transactions on one device to its duplicate, or mirror, device. Once mirroring is set up, it is generally seamless and maintenance-free. The primary benefit of mirroring is that recovery from a single device failure is virtually instantaneous. Those implementing mirroring should take caution, however; if the network connection to the mirroring computer becomes corrupted, many databases will cease to function. Even when implementing mirroring, developers should still perform periodic dumps to be safe.
NOTE
Another technology for data fault tolerance is clustering, which is similar in concept to mirroring. Clustering connects two separate computers to a single data store. Clustering database servers can eliminate the physical hardware failure points of a single computer but typically do not provide a backup of the data.
By replicating, or using replication, developers can copy all or portions of their databases to any number of remote computers. Besides providing a means to backup data, replication has other benefits as well—developers can distribute their data throughout a network, balancing the workload and minimizing the risk of catastrophic failure. Replication is primarily a tool for keeping data consistent among distributed databases, and is not generally recognized as a way to provide database backups.
Developers have many options for defining a database and storing their application's data. Among the most popular are indexed, hierarchical, and relational data storage technologies. These types of data storage differ not only in that they physically manage storage and retrieval of data, but also in conceptual models they present to users and developers.
Indexed databases, such as VSAM, which provide extremely fast data retrieval, are suitable for sequential lists, random retrieval, and complex file relationships. Developers can easily read a VSAM file from beginning to end, or simply retrieve specific records using an index key. Indexed databases typically provide only data storage and retrieval. Developers' application code must handle referential integrity and data validation processes to accommodate what indexed databases can not provide.
A hierarchical database is especially useful for implementing inverted tree structures, such as bills of material or organizational ranking structures. Hierarchical data access is extraordinarily fast, because these data structures are directly linked. One interesting feature of a hierarchical database is that referential integrity is built-in. However, implementation of hierarchical databases often requires an experienced systems programmer to compensate for these databases' deficiencies in modeling complex relationships.
Relational databases have generally become the de facto standard for database storage. Relational databases are preferred because the relational model has a high usability factor, and also provides a standard interface, SQL, that allows numerous database tools and products to work together in a consistent and readable manner. Additionally, relational databases typically provide mechanisms for handling referential integrity, data validation, and a host of administrative processes to set up and maintain application data.
If a team designs an enterprise application, the application's developers may use existing mainframe databases, including relational, VSAM, and AS/400 files. With SNA Server, a team can seamlessly use mainframe and distributed databases on a network. This integration of mainframe and distributed data with Windows-based applications means that developers have many choices for data storage technologies.