7.4 Data Repositories | Modernizing Legacy Systems: Software Technologies, Engineering Processes, and Business Practices

Most legacy systems use some form of data repository. Many have database management systems that predate the relational models now prevalent . Given the importance and quantity of legacy data, data warehousing is becoming a popular way to make this data available to modern applications. In this section, we discuss both database management systems and data warehouses. Database management systems support mainly on-line transaction processing (OLTP), whereas data warehouses support on-line analytical processing (OLAP).

Database Management Systems

Most database management systems (DBMSs) in use today are based on the relational database model. In the past, systems were based on other database models, such as the hierarchical model and the network model. ^[3] The marketplace is slowly evolving toward object-oriented database models. ^[4]

^[3] The inverted-list model is another common pre “relational database model.

^[4] Relational databases that incorporate object-oriented concepts are referred to as object-relational databases.

Hierarchical Databases

Hierarchical databases are based on parent-child relationships, modeled as a tree. A database schema is represented as multiple occurrences of a single type of tree, as shown in Figure 7-2. The example contains four record types: DEPARTMENT, MANAGER, EMPLOYEE, and TASK. The database is represented by multiple occurrences of the root record type, DEPARTMENT, which is a parent record type for the MANAGER and EMPLOYEE record types. Likewise, TASK is a child record type of the EMPLOYEE record type.

Figure 7-2. Example of a hierarchical database schema

graphics/07fig02.gif

In this model, a parent can have zero or more children, allowing one-to-one and one-to-many relationships. In the example, a department has one manager and many employees . Many-to-many relationships are not supported in this model, because a child can have only one parent. In the example, even though an employee can be assigned to many tasks , it is impossible ”or at least difficult without some hacking ”to assign the same task to more than one employee.

Network Databases

This model is an extension of the hierarchical model. In a hierarchical model, a child record can have only one parent, whereas in the network model, a child can have any number of parents. A database schema is represented by a set of records and a set of links. Each link type connects one parent record type with one child record type, as shown in Figure 7-3. This example shows three record types ”DEPARTMENT, EMPLOYEE, and TASK ”and three link types ”MANAGER, DEPARTMENT-EMPLOYEE, and EMPLOYEE-TASK. Link types can also contain information, as in the DEPARTMENT-EMPLOYEE link record.

Figure 7-3. Example of a network database schema

graphics/07fig03.gif

Parents are usually connected to their child occurrences through a circular linked list that starts at the parent record, traverses all the child occurrences, and returns to the parent record. Many-to-many relationships are supported because a record can participate in any number of link types either as a child or as a parent. In the example, an employee can be linked to many tasks, and a task can be linked to many employees.

Relational Databases

The majority of database systems today are relational. A relational database schema is composed of tables. The data elements describing data to be stored in the tables are called columns . Each entry in a table is called a row. Primary keys are unique identifiers for a row in a table, and foreign keys represent table data that is related to data in another table. Relational database schemas are usually represented in an entity-relationship (E-R) diagram, such as Figure 7-4.

Figure 7-4. Example of a relational database schema

graphics/07fig04.gif

The example shows four tables: EMPLOYEE, DEPARTMENT, TASK, and TASKASSIGNMENT. The primary key for each table is identified as PK. The foreign keys are identified as FK#. One-to-one relationships are represented by a column in a table acting as foreign key to another table. The DEPT# column in the EMPLOYEE table represents a one-to-one relationship between EMPLOYEE and DEPARTMENT. It signifies that an employee works in one department. One-to-many relationships and many-to-many relationships are represented by an additional table that maintains the relationship. The TASKASSIGNMENT table contains EMP# (the EMPLOYEE primary key) and TASK# (the TASK primary key) and stores the employees assigned to the different tasks. ^[5]

^[5] Relational database design usually follows a normalization process to produce optimal table definitions.

The most important aspect of relational databases is SQL support for defining, manipulating, and retrieving data without predefining access paths, as in the previous two models.

Object-Oriented Databases

Object-oriented database management systems (ODBMSs or OODBMSs) integrate database and object-programming language capabilities. For the programming language, the OODB objects appear as programming language objects. The language itself is extended with libraries for database capabilities, such as queries, transparently persistent data, ^[6] concurrency control, and data recovery.

^[6] Transparent persistence is the ability to directly manipulate data stored in a database, using an object-programming language.

An object-oriented database schema is usually represented by a class diagram, as shown in Figure 7-5. Comparing the OO model to the relational model, we find that classes are similar to tables, object instances to rows, and attributes to columns. Data is retrieved from the database as persistent objects and manipulated directly by the object programming language, as if they were in-memory, nonpersistent objects. The data model at the application and database levels are the same. The OODBMS transparently synchronizes the persistent and nonper sistent data. There is no need for embedded SQL or JDBC to retrieve data from the database. The equivalent functionality is part of the object programming language. For all these characteristics, object-oriented databases are becoming popular for object-oriented applications that have high-performance and complex data requirements. OODBs are also being used as data staging areas for object-oriented applications written in C++, Java, or any object-oriented programming language.

Figure 7-5. Example of an object-oriented database schema

graphics/07fig05.gif

Although object-oriented databases have many advantages, several factors have limited the acceptance of OODBMS.

Organizations are wary about adopting OODBMS, owing to limited industry experience.
Many RDBMS vendors have introduced object-oriented features to their core products, creating object-relational database management systems (ORDBMS).
OODBMSs imply tight coupling between the application and the data, as the data model is the same both at the application and database levels. Owing to this lack of a data abstraction layer, moving to a different DBMS requires significant adaptation and testing.
OODBMSs require different development and administration skills. Every DBMS product has its own proprietary extensions. Finding people with specific OODBM skills is more challenging than finding people with RDBMS skills, especially in the area of database administration.
Ad hoc query support is still emerging in OODBMS products. Object-oriented databases are still weak at supporting queries that, for example, require extracting data from objects that do not share a relationship.

Data Warehouses

A data warehouse is a repository that supports management decision making at the enterprise or business-unit level. Data warehouses contain data that presents a coherent picture of business conditions at a single point in time. At the technical level, the development of a data warehouse includes developing systems to extract data from operational systems and installing a warehouse database system that provides flexible access to the data. At the business level, data warehouses often require modifying the decision-making processes so that they efficiently use this data.

The term data warehousing generally refers to combining many different databases across an entire enterprise. A data mart is a database, or collection of databases, that helps managers make strategic decisions about their business. Whereas a data warehouse combines databases across an entire enterprise, data marts are usually smaller and focus on a particular subject or department. Some data marts, called dependent data marts, are subsets of larger data warehouses. Data in a data mart is accessed using a business intelligence (BI) application. An example of data marts and data warehouses as defined by IBM in its Information Aggregation pattern is shown in Figure 7-6 [IBM 01].

Figure 7-6. Data marts and data warehouses

graphics/07fig06.gif

Standards

CODASYL (Conference on Data Systems Languages)

CODASYL ^[7] was founded in 1957 by the U.S. Department of Defense to guide the development of a standard programming language that could be used on many computers. This effort led to the development of COBOL. In 1971, CODASYL's Database Task Group (DBTG) proposed the network data model ^[8] as a standard for database definition and access. The DBTG report contained proposals for a data description language (DDL), a data manipulation language (DML), and the underlying network database structure.

^[7] CODASYL is also referred to as Committee on Data Systems Languages because of the working group that was created after the conference to address the various issues and proposals.

^[8] The network database model is also referred to the CODASYL data model.

SQL (Structured Query Language)

SQL is a database query language that was adopted as an industry standard in 1986. SQL statements are used to retrieve and update data in a database. SQL works with relational database systems, which usually have their own proprietary extensions to the language. This standard has continually evolved over the past 15 years .

SQL-86: The first SQL standard provided basic language constructs for defining and manipulating tables of data.
SQL-89: This version added language extensions for referential-integrity and generalized-integrity constraints.
SQL-92: Also known as SQL2, this standard provided facilities for schema manipulation and data administration, as well as substantial enhancements for data definition and data manipulation.
SQL:1999: Formerly known as SQL3, this standard specifies the SQL:1999 object model, which adds user -defined types to SQL.
SQL:200n: This is the working document of the standard.

Open Database Connectivity

ODBC is an Open Group standard API for accessing a database and was developed by Microsoft. An ODBC database driver links an application to a specific database. Both the application and the DBMS must be ODBC compliant. The application must issue ODBC commands, and the DBMS must respond to them.

ODBC allows programs to use SQL requests to access databases without knowing the proprietary interfaces. ODBC converts the SQL request into a request that the individual database system understands. The benefit of ODBC is that an application's source code does not need to be recompiled for each database it accesses .

Java Database Connectivity

JDBC is an API that allows Java programs to send dynamic SQL statements to a database. JDBC is similar to ODBC but is designed specifically for Java programs. ODBC, by comparison, is language independent. The JDBC standard defines four types of JDBC drivers:

Type 1. This is the JDBC-ODBC bridge that requires software to be installed on client systems.
Type 2. This contains native methods calls (C or C++) and Java methods . It also requires software to be installed on the client.
Type 3. These drivers use a networking protocol and middleware to communicate with a server. The server then translates the protocol to DBMS-specific function calls.
Type 4. These drivers use Java to implement a DBMS vendor networking protocol, such as Oracle's SQL*Net or Ingres's Ingres/Net. Type 4 drivers are pure Java drivers.

Object Data Management Group

ODMG is a standard produced by the Object Data Management Group for persistent object storage. The standard builds on existing database, object, and programming language standards, including those of the Object Management Group (OMG), to simplify object storage and ensure application portability [Barry 98].

Object Query Language

OQL is the query language of the ODMG-93 standard. An SQL-like declarative language with support for objects, OQL can be used either as an embedded function in a programming language or as an ad hoc query language [ODMG 98].

OQL works with programming languages for which ODMG has defined bindings, such as C++, Java, and Smalltalk. The advantage gained by using OQL is that it returns objects matching types in the specific programming language so that these objects can be easily manipulated.

Products

Data Management System 2200

The DMS is a logical data manager based on the specifications recommended by the CODASYL committees for network database processing. DMS 2200 includes a variety of database storage structures, high-level language interfaces, and a selection of accessing techniques, as well as several levels of database recovery and security. DMS 2200 is a highly respected data manager for large, complex databases with demanding performance requirements.

Oracle Database

Oracle is a relational database management system ^[9] designed to support data management, transaction processing, and data warehousing. The latest versions of the Oracle database provide built-in capabilities for Internet development and deployment ”Web applications, portals, database distribution, replication, database management, and high availability. Oracle also includes application development capabilities, such as PL/SQL and Java programmatic interfaces for writing database triggers and stored procedures. Additional features of interest for this case study include a JVM with a native compiler, a CORBA v2.0 ORB, an EJB server, an embedded server-side JDBC driver, an SQLJ ^[10] translator, and XML support.

^[9] Oracle 9i, the latest version of the Oracle database, is considered to be an ORDBMS because it includes SQL with object-relational capabilities.

^[10] SQLJ enables programmers to embed static SQL operations in Java code.

Oracle Discoverer

This ad hoc query, reporting, analysis, and Web-publishing tool is used to access information from data marts, data warehouses, OLTP systems, and non-Oracle data sources. Data can be obtained from all these sources and stored in summary tables for on-the-fly analysis.

Relational Data Management System 2200

RDMS 2200 is a relational database that uses SQL for data definition and manipulation. SQL statements can be embedded in COBOL, FORTRAN, and other programs. RDMS 2200 provides a self-organizing database. All data is presented in simple, two-dimensional tables of horizontal columns divided into rows. Relational tables are easy to access and update. New tables can be created by selecting and combining columns and rows from the same or different tables.

UniAccess for OS 2200

UniAccess, a product from Applied Information Sciences (AIS), provides SQL access from clients to RDMS 2200 data. With UniAccess, RDMS data can be accessed with the same client tools and applications that are used to access data in SQL databases running on other platforms, such as Oracle and Sybase. UniAccess enables Java clients to access RDMS data by using a JDBC-to-ODBC bridge to link to the UniAccess ODBC driver.

Unisys Data Access

This product provides an SQL interface to nonrelational data on Unisys mainframes, including DMS. The product accesses data stored in DMS through ODBC. Because ODBC works with relational databases and DMS is a network database, Data Access creates a relational view of DMS, allowing the ODBC functions to operate .

Universal Data Management System 2200

UDS 2200 provides data management functions within the 2200 environment of ClearPath IX systems. UDS Control, the UDS on-line data manager, provides a common architecture and environment for all UDS data models, including RDMS 2200 and DMS 2200. Both data models can be used concurrently by the same program, and all programs use the same method to commit or roll back changes for all files. UDS Control allows users to share files, controls access to those files, and automatically and uniformly resolves access conflicts. It also allows users to designate recoverable files, regardless of the data management method used, and provides consistent file recovery.