1.2 Evolution of database systems

Database systems have been with us since the 1960s as research vehicles (first-generation products wrapped around the hierarchical and network data models) and since the mid-1980s as fully functional products using the relational data model. Since these early beginnings, database systems have evolved from simple repositories for persistent information into very powerful tools for information management and use.

Database systems have been of interest to the computer systems performance analyst and to computer systems applications developers since the earliest days of commercial computers. Early computer systems lacked extensive on-line data storage (primary memory as well as secondary disk storage), forcing systems architects and developers to rely heavily on externally archived information (typically stored on tape). Initial data storage repositories were constructed using simple direct addressing schemes that tied each stored item to a specific device and a specific location on that device. For example, to extract a piece of information, an application needed to know which device the information was stored on (e.g., disk 01) and its exact address on that device (e.g., sector 05, track 22, offset 255, length 1,024). Each device had its own unique way of storing, accessing, and retrieving information, making it very difficult to port applications from one place to another.

These initial repositories evolved into more robust file management systems, driven by a move toward simplifying the interface between applications and the system. Application developers and operating system designers both wanted to remove the complexity of the typical storage hierarchy from the user/developer side and place it inside the operating system. The initial file systems offered a simple interface through which applications could access persistently stored information logically, by file name, instead of physically, by specific address paths. They gave an application the means to store information persistently for future retrieval and use, with an implementation that relied on coarse semantics: one could open a file, read its record-oriented contents, write a record or an entire file, and close the file. Information within the file had no meaning to the control software of the operating system or to the database system. The file management software knew about entry points to a file, or to a subset of a file, but nothing about the details of the information content within the file. These early file systems and their crude access schemes served the needs of early mainframe machines, where jobs were run in sequence and no sharing between jobs was explicitly required at run time.
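The open/read/write/close style of access described above can be sketched in Python. This is a minimal illustration and not a reconstruction of any historical file system: the fixed-length record layout, the file name `students.dat`, and the helper names are all assumptions made for the example.

```python
import os
import struct
import tempfile

# Hypothetical fixed-length record layout: 4-byte integer id + 20-byte name.
RECORD = struct.Struct("<i20s")


def write_records(path, rows):
    """Open the file by name, write each record in sequence, then close it."""
    with open(path, "wb") as f:
        for rec_id, name in rows:
            # struct pads the name with zero bytes up to the field width.
            f.write(RECORD.pack(rec_id, name.encode("ascii")))


def read_record(path, index):
    """Read the index-th record; the file layer itself only sees raw bytes."""
    with open(path, "rb") as f:
        f.seek(index * RECORD.size)
        raw = f.read(RECORD.size)
    rec_id, name = RECORD.unpack(raw)
    return rec_id, name.rstrip(b"\x00").decode("ascii")


def demo_record_file():
    """Write three records to a temporary file and read back the second."""
    path = os.path.join(tempfile.mkdtemp(), "students.dat")
    write_records(path, [(1, "Ada"), (2, "Grace"), (3, "Edgar")])
    return read_record(path, 1)
```

Only the application knows that the first four bytes of each record are an id and the rest a name. The file layer locates a record by position and returns bytes, which matches the point that file contents had no meaning to the system software.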

The advent of multiuser operating systems, and the evolving need of multiuser applications for concurrent access to information stored in file systems, pushed database systems to evolve from single-user persistent stores into multiuser concurrent database systems. Multiuser and multiprocessing computer systems demanded that information stored within the computer system's file system be available for sharing, and that this sharing happen dynamically. Information storage, access, and retrieval within such evolving systems needed more controls so that information could be shared yet remain correct and consistent from the perspective of all applications using it.

One problem with information sharing within the context of these new systems was security: how do you restrict control of a file to its owner, or to a group of users, while still providing access for others? In concert with this issue was access integrity: how do you keep data intact and correct while multiple users access, modify, add, or delete information? Initially, file systems tried to address these concerns by adding access controls, such as locks and access lists, to file managers, but these did not accomplish the intended goals. Though they were admirable enhancements, they were far too crude to allow applications true sharing of on-line data. Simple file-level locking resulted in longer waits and reduced availability of data for use by other applications. Files needed to be further decomposed into finer-grained elements if finer concurrency of access was to be achieved.

To alleviate these problems, file systems added finer-grained definitions of stored information. For example, files evolved from unstructured data to structured, record-oriented collections of information, where each record had a specific head and tail, as well as semantic meaning for the file system and its organization. At first, this semantic meaning may simply have represented the order of occurrence in a file. These structural semantics led to files being organized around records as the fundamental unit, both for the information applications required and for its storage. Records provided a mechanism from which to construct more complex storage structures, and they became the granularity used to build file organizations as well as access schemes. Because files were now composed of collections of records, it became easy to find a record within a file. Through such means, access controls such as record-locking techniques evolved to control access to these files and the records they contain.
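The difference between file-level and record-level locking can be illustrated with a toy lock table. This is only a sketch under simplifying assumptions: locks are exclusive, a blocked request simply returns False instead of waiting, and the class and method names are hypothetical.

```python
class LockTable:
    """Toy exclusive-lock table; granularity is 'file' or 'record'."""

    def __init__(self, granularity):
        if granularity not in ("file", "record"):
            raise ValueError("granularity must be 'file' or 'record'")
        self.granularity = granularity
        self.holders = {}  # lock key -> owner currently holding the lock

    def _key(self, file_name, record_no):
        # File-level locking ignores the record number entirely.
        if self.granularity == "file":
            return file_name
        return (file_name, record_no)

    def acquire(self, owner, file_name, record_no):
        """Return True if owner now holds the lock, False if it would wait."""
        key = self._key(file_name, record_no)
        holder = self.holders.setdefault(key, owner)
        return holder == owner

    def release(self, owner, file_name, record_no):
        """Drop the lock if, and only if, owner is the current holder."""
        key = self._key(file_name, record_no)
        if self.holders.get(key) == owner:
            del self.holders[key]
```

With `LockTable("file")`, a second user asking for record 2 of a file is refused while the first holds record 1 of the same file. With `LockTable("record")`, both requests succeed and only requests for the same record conflict.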

It was only a matter of time before records, grouped into files, took on further semantic meaning and became the focal point for organizing information. For example, to define a group of students, a set of records could be defined so that each record holds the information needed to define a single student. To organize the students in a way that the application can use them, a file system could allocate one region of a file for storage of these records or could provide a means to link related records in a chain using some control strategy.

This structural concept for information focused around records led to one of the first database system storage concepts and access schemes, referred to as the network database model. The network database model organizes data as linked lists or chains of related information. In the network data model, any information that has a relationship to some other piece of stored information must have a physical link to the related pieces of information. The network database structuring model was formalized into the CODASYL database language standard and was widely implemented but never found acceptance as a true standard. Network database systems became the mainstay of most early information systems until the advent of the relational database system in the 1970s. They began to lose their luster from the mid-to-late 1970s into the early 1980s due to their inherent complexity and limitations. The network model requires information to be physically linked if a logical relationship between information is required. This implied that as the number of logical relationships between information items increased, so did the number of physical links required to capture them.

This added metadata requirement caused the complexity of applications to grow steeply as relationships were added, making this model a poor choice for any system that would grow and change over time. The loss of a single link could render the database useless to the application it was originally developed for. The complexity of the chains constructed within an application over time made the maintenance of such systems very expensive. Another drawback of this model appears when one attempts to access the information stored in it. To access information, the database must be entered at a specific entry point, and the data chains (paths) defined by the encoded relationships between the data items must then be traversed. This does not guarantee that the needed information will be found; a path can be followed to its end without the data being found. There is no way to bypass paths. To find a specific data item, one must traverse the path leading to that item and no other.
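The navigational access pattern just described can be sketched with a singly linked chain of records. This is a simplified illustration and not CODASYL syntax: the `Record` class, the `next` link field, and the helper names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    key: str
    data: str
    next: Optional["Record"] = None  # physical link to the related record


def build_chain(items):
    """Link (key, data) pairs into a chain and return its entry point."""
    head = None
    for key, data in reversed(items):
        head = Record(key, data, head)
    return head


def find(entry_point, key):
    """Follow links from the entry point; return (data or None, hops taken)."""
    hops = 0
    node = entry_point
    while node is not None:
        hops += 1
        if node.key == key:
            return node.data, hops
        node = node.next
    # The end of the path was reached without finding the item.
    return None, hops
```

Looking up the last record in a three-record chain visits all three records, and a lookup for a missing key also walks the whole chain before reporting failure. If any `next` link were lost, every record behind it would become unreachable from the entry point.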

These and other limitations with the network database model led to the gradual demise of the model. An issue to consider with the network model is its legacy. Even though this model has not been the prevalent model of new applications over the last 20 years, there are still many databases constructed from this model due to its early entrance and long use in the information community. It is highly unlikely that all or even a majority of this information will be rehosted in a newer data model such as the relational model. Due to this large volume of legacy information, this model must be understood from its impact on the past, present, and future of information management systems. New systems, if they have a reach beyond their local system, will possibly be required to interact with such legacy systems, necessitating the understanding of their impact on performance.

The network database system's demise began with the publication of Codd's seminal paper on the relational database model in the early 1970s. The fundamental premise of the paper was that all information in the database system can be formed into tables called relations. These relations have a regular structure, where each row of the table has the same format. Relationships between tables are defined using concepts of referential integrity and constraints. The fundamental way one operates on these tables is through relational algebra and calculus techniques. The paper's publication was followed by an experimental system built by IBM called System R and another, developed through university research, called Ingres. These early developments had as their goal the proof of the relational model's theories. The relational model showed much promise on paper, but constructing software to make it real was a daunting task. A fundamental difference between the two models is found in how data is retrieved. The network model is procedural: a user tells the system how to find the needed information. The relational model is nonprocedural: one states what one wants and lets the "system" find the information.
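The nonprocedural style can be illustrated with Python's standard `sqlite3` module. This is a minimal sketch, not one of the historical systems named above: the `student` and `enrollment` tables and their contents are made up for the example.

```python
import sqlite3


def names_enrolled_in(course):
    """State WHAT is wanted; the database decides how to find it."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE enrollment (
            student_id INTEGER NOT NULL REFERENCES student(id),
            course TEXT NOT NULL
        );
        INSERT INTO student VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Edgar');
        INSERT INTO enrollment VALUES (1, 'DB'), (2, 'DB'), (3, 'OS');
        """
    )
    # The relationship between the tables is expressed through key values,
    # not through physical links that the application must follow.
    rows = conn.execute(
        "SELECT s.name FROM student AS s "
        "JOIN enrollment AS e ON e.student_id = s.id "
        "WHERE e.course = ? ORDER BY s.name",
        (course,),
    ).fetchall()
    conn.close()
    return [name for (name,) in rows]
```

The query names no entry point and no path. It states only the condition the result must satisfy, in contrast to the chain traversal that a network database requires.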

This shift in the fundamental way the database finds information was a very significant one, and the industry is still refining its consequences. A fundamental need in the new model was a system service to find information. This system service is called "query processing." The fundamental function of query processing is to determine, given a user's query, how to go about getting the requested piece of information from the relations stored in the database. Query processing led to further improvements in accessing information from the database. One primary improvement was query optimization. The goal of query optimization is to reduce the cost of extracting information from the database, and to do so in real time.
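One way to watch a query processor choose an access path is SQLite's `EXPLAIN QUERY PLAN` statement, used here through Python's standard `sqlite3` module. This is a sketch under assumptions: the table, the index name `idx_student_major`, and the helper names are invented for the example, and the exact wording of the plan text varies between SQLite versions.

```python
import sqlite3


def explain(conn, sql, params=()):
    """Return the human-readable detail text of SQLite's plan for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The detail string is the last column of each plan row.
    return " | ".join(str(row[-1]) for row in rows)


def plans_before_and_after_index():
    """Plan the same query twice: without and with an index on `major`."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT, major TEXT)"
    )
    conn.executemany(
        "INSERT INTO student (name, major) VALUES (?, ?)",
        [("Ada", "CS"), ("Edgar", "Math"), ("Grace", "CS")],
    )
    query = "SELECT name FROM student WHERE major = ?"
    before = explain(conn, query, ("CS",))
    conn.execute("CREATE INDEX idx_student_major ON student (major)")
    after = explain(conn, query, ("CS",))
    conn.close()
    return before, after
```

The query text is identical in both calls. Only the plan changes: once the index exists, the plan text refers to `idx_student_major` instead of describing a scan of the whole table. The user's statement of what is wanted stays fixed while the system picks a cheaper way to obtain it.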

These early relational database systems were instrumental in the development of many concepts for improving concurrency of access in database systems. The concept of concurrent access was not present in early network-based databases. The theory of serializability as a correctness criterion evolved from the relational model and its fundamental theories, motivated by a need to have correct and concurrent access to stored information. Serializability theory and concurrency control led to further improvements in database technology. In particular, concepts for transactions followed, along with theories and concepts for recovery. The fundamental tenet of transactions and transaction processing is that they execute under the control of the "ACID" properties. These properties dictate that transactions be "atomic" (all or nothing), "consistent" (all constraints on data correctness hold), "isolated" (each transaction executes as if it were running alone), and "durable" (the effects of a completed transaction are not alterable except by another transaction's execution). Guaranteeing these properties requires concurrency control and recovery.
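The "all or nothing" property can be demonstrated with Python's standard `sqlite3` module, whose connection object commits a transaction when used as a context manager and rolls it back if an exception escapes. This is a minimal sketch: the `account` table, the names, and the balances are invented for the example, and it illustrates atomicity only, not isolation or durability.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move amount from src to dst as a single atomic transaction."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute(
            "UPDATE account SET balance = balance - ? WHERE name = ?",
            (amount, src),
        )
        (balance,) = conn.execute(
            "SELECT balance FROM account WHERE name = ?", (src,)
        ).fetchone()
        if balance < 0:
            # Abort: the debit above must not survive on its own.
            raise ValueError("insufficient funds")
        conn.execute(
            "UPDATE account SET balance = balance + ? WHERE name = ?",
            (amount, dst),
        )


def demo_atomic_transfer():
    """Run one valid and one failing transfer; return the final balances."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.executemany(
        "INSERT INTO account VALUES (?, ?)", [("alice", 100), ("bob", 50)]
    )
    conn.commit()
    transfer(conn, "alice", "bob", 30)
    try:
        transfer(conn, "alice", "bob", 500)
    except ValueError:
        pass  # the failed transfer was rolled back as a whole
    balances = dict(conn.execute("SELECT name, balance FROM account"))
    conn.close()
    return balances
```

The second transfer debits the source account before discovering the shortfall, yet the final balances show no trace of that debit, because the rollback undoes every change made inside the failed transaction.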

The relational model and relational databases led the way during the 1980s in innovations and growth within the database industry. Most of the 1980s was spent refining the theories of correctness for databases and for their fundamental operation: the transaction. In addition to these fundamental improvements, the 1980s saw improvements in the modeling capability of the relational model.

This period was followed by another, which we'll call the object-oriented period. During this period, the late 1980s and early 1990s, applications developers wanted the data types provided by the database to match those of their applications more closely, which drove the need for more semantic richness in data specification and in the operations on these data. The object-oriented databases of this period met this need. The problem with these early object-oriented databases was that they did not possess some of the fundamental concepts developed during the evolution and growth of the relational database systems.

The late 1990s and the beginning of the twenty-first century saw the merger of the relational model with the object-oriented database model, forming the object-relational database model. This model was embraced by the U.S. and international standards bodies as one worth refining and supporting for growth. The major national and international vendors have embraced this model as the next great database evolution and are all presently building products around the newly adopted standard with some of their own extensions.

It appears after this revolution that the next major change in the database arena will probably come in the area of transactions and transaction processing. The conventional model wrapped around the concept of a flat or single-tiered transaction execution segment controlled strictly by the ACID properties may be altered. There is much research and development looking at a variety of alternative execution models and theories of correctness that may lead us into the next decade of database improvements.
