2.10 Components of a database system's architecture

A database system is composed of much more than just the data definition language, data manipulation language, and data control language. These simply represent the interface into the actual database system. The core of a database management system is the collection of services that provide the persistence of data in the database and the functionality to guarantee the consistency and correctness of data and the adherence to ACID properties by transactions (Figure 2.30). The ACID properties include the atomic, consistent, independent, and durable execution of a transaction on the database. We will discuss these in more detail later in this chapter.


Figure 2.30: Architecture to support a database system.

The architecture of a database system comprises a set of services built on top of basic operating system services, system file storage services, and primary memory buffer management services. The file manager is the database's interface to persistent stored information. The information managed for the database by the file system includes the internal, conceptual, and external schemas for the stored information (the metadatabase); the actual database; and the database log file. The log files include before images (prior buffer values), after images, redo records (actions of committed transactions), undo records (actions of uncommitted transactions), commit records, abort records, and transaction begin records.

Through the basic features of process management, interprocess communications, synchronization, buffer management, and file management, the database system's services can be constructed. These services include catalog management, integrity management, transaction management, concurrency control, lock management, deadlock management, recovery management, security management, query processing, communications management, and log management. On top of the database services, the user's applications operate through the input/output view manager and the data manipulation manager. In the following paragraphs we will briefly review each of these services; afterward, we will examine some of them in greater detail.

2.10.1 Catalog manager

The catalog manager maintains information about the database's contents: the metadata. These metadata form the schema for the database. The database administrator, using the data control language and data definition language interfaces, can alter the schema. As an example, in SQL this portion of the database keeps the definitions for all relations, constraints, security assertions, and mappings to physical storage.
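As a concrete illustration (ours, not the book's), SQLite exposes its catalog as an ordinary table, sqlite_master, that can itself be queried; the sample table below is hypothetical. A minimal Python sketch of asking the catalog manager about the schema:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE FLIGHT (Fno TEXT, Date TEXT, SeatsSold INTEGER)")

    # The catalog answers questions about the database's own structure.
    for name, sql in conn.execute(
            "SELECT name, sql FROM sqlite_master WHERE type = 'table'"):
        print(name, "->", sql)
    conn.close()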

2.10.2 Integrity manager

The integrity manager aids in maintaining the accuracy, correctness, and validity of the database's data items. For example, the integrity manager may check that a data item is of the proper type, through a mechanism that determines when to do the check, how to do the check, and how to recover from, reject, or fix the condition when it is encountered. The integrity manager may also check that a data item falls within a predefined domain of correct values, such as DOMAIN FIXED (5) or Weight GREATER THAN 0 AND Weight LESS THAN 2000; such predicates test the ranges of values a data item may span. Integrity checks can also span multiple entities or relations. For example, a referential integrity check in SQL can verify that a relationship among many objects has a property that must hold for them to be considered valid; one such check is that the SUM of all account balances at a bank must equal the bank's balance. An important aspect of integrity management is deciding when to perform the specified checks: there is a different cost if the checks are done at database definition time, on access to a data item, on update of a data item, on an event such as a timer, or on the commit of a transaction. The tradeoff is accuracy and validity of the data versus performance, since checks done at run time slow down the database's processing throughput.
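The Weight rule above maps naturally onto a declarative CHECK constraint that is verified on every insert and update. A minimal sketch using SQLite; the table and column names are our own:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""CREATE TABLE CARGO (
        Id INTEGER PRIMARY KEY,
        Weight INTEGER CHECK (Weight > 0 AND Weight < 2000))""")

    conn.execute("INSERT INTO CARGO (Weight) VALUES (500)")       # within the domain
    try:
        conn.execute("INSERT INTO CARGO (Weight) VALUES (2500)")  # out of range
    except sqlite3.IntegrityError as err:
        print("rejected:", err)                                   # violation caught
    conn.close()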

2.10.3 Transaction manager

The transaction manager controls and coordinates the execution of transactions within the database. For now, just assume that a transaction is a collection of operations on the database that are bound together into a single runtime unit. The transaction manager must initiate transactions (scheduling); synchronize transaction execution with the database, other transactions, and the operating system; coordinate intertransaction communications; perform commit (completion) and abort (failure) processing; check transaction constraints and handle exception conditions; and manage transaction recovery (error handling). A transaction typically has the following form:

    TRANSACTION T (Optional Input Parameters)
        Specification Part
        BEGIN
            BODY of T
            COMMIT or ABORT of T
            RECOVERY PART of T
        END
    END TRANSACTION T

The initial statement names the transaction, allowing it to be precompiled and stored for later execution. The initial statement also leaves space for transferring input parameters to the transaction, such as the location of the data to be operated on. The specification part of the transaction is where local variables for the transaction's workspace are declared, along with preconditions and postconditions on transaction execution, recovery conditions, the isolation level, access modes, and the diagnostics size to allocate. The body contains the executable code of the transaction. The commit and abort statements indicate the success or failure of the transaction. Finally, the recovery part specifies user- or system-supplied recovery or condition handlers for error processing and transaction completion processing.
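As a rough sketch of how this template maps onto a host language, the Python function below plays the role of TRANSACTION T, with the body, commit/abort, and recovery parts marked in comments; the table and connection setup are hypothetical:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE FLIGHT (Fno TEXT PRIMARY KEY, SeatsSold INTEGER)")
    conn.execute("INSERT INTO FLIGHT VALUES ('F100', 41)")
    conn.commit()

    def transaction_t(conn, flight_no):        # optional input parameters
        try:
            # BODY of T: read, compute, write
            (seats,) = conn.execute(
                "SELECT SeatsSold FROM FLIGHT WHERE Fno = ?",
                (flight_no,)).fetchone()
            conn.execute("UPDATE FLIGHT SET SeatsSold = ? WHERE Fno = ?",
                         (seats + 1, flight_no))
            conn.commit()                      # COMMIT of T
        except Exception:
            conn.rollback()                    # ABORT of T
            raise                              # RECOVERY PART: defer to a handler

    transaction_t(conn, "F100")
    print(conn.execute("SELECT SeatsSold FROM FLIGHT").fetchone())  # (42,)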

2.10.4 Concurrency control manager

The concurrency control manager coordinates interleaved access to the database by concurrently running transactions. The goal of concurrency control is to coordinate execution so that the view, or effect, from the database's perspective is the same as if the concurrently executing transactions had been executed in some serial order. This scheme is referred to as the serializable execution of transactions. Concurrency control's serializability theory has two basic modes. The simpler concerns the serializable execution of the read and write sets of conflicting transactions and is based on locking, timestamp ordering, or optimistic read and write conflict resolution. The second concurrency control concept is more complex and uses semantic knowledge of a transaction's execution to aid in coordination. The major difference is that the granularity of the serialization operator is not the individual read or write but rather complex functions and procedures, as well as complex data objects. The criterion of correct execution, nevertheless, remains serializability across concurrent transactions.
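For the simpler, read/write mode, conflict serializability can be tested by building a precedence graph over a history of interleaved operations and checking it for cycles. A small sketch under our own (hypothetical) history format of (transaction, operation, item) triples:

    def serializable(history):
        # Two operations conflict if they belong to different transactions,
        # touch the same item, and at least one of them is a write.
        edges = set()
        for i, (t1, op1, x1) in enumerate(history):
            for t2, op2, x2 in history[i + 1:]:
                if t1 != t2 and x1 == x2 and "W" in (op1, op2):
                    edges.add((t1, t2))        # t1 must precede t2 serially

        def reaches(src, dst, seen=()):        # is there a path src -> ... -> dst?
            return any(a == src and (b == dst or (b not in seen and
                       reaches(b, dst, seen + (b,)))) for a, b in edges)

        # A cycle in the precedence graph means no serial order exists.
        return not any(reaches(t, t) for t, _ in edges)

    # R1(x) W2(x) W1(x): T1 -> T2 and T2 -> T1 form a cycle
    print(serializable([(1, "R", "x"), (2, "W", "x"), (1, "W", "x")]))  # False
    print(serializable([(1, "R", "x"), (1, "W", "x"), (2, "W", "x")]))  # True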

2.10.5 Lock manager

The lock manager controls access to the database lock table. The lock table maintains the status of locks (read lock, write lock, share lock, semantic lock, etc.) for each item of the database that has been accessed. The lock manager isolates users from the lock table: rather than access it directly, database and user code acquire and release lock status through the lock and unlock primitives the lock manager provides. A read lock is granted when a transaction attempts to read a data item, provided no other transaction holds a conflicting write lock. A write lock can only be granted if no other transaction holds a read or write lock on the data item. Locks in a database can be viewed like semaphores in an operating system; they are a means to guarantee exclusive use of an item under the database's control.
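A lock manager's core is the compatibility rule just described: read locks share, write locks exclude. A minimal sketch of a lock table that grants, shares, upgrades, or refuses locks (a real implementation would also queue blocked requests rather than simply refuse):

    class LockTable:
        def __init__(self):
            self.locks = {}                    # item -> [mode, set of holders]

        def lock(self, txn, item, mode):       # mode is "R" (read) or "W" (write)
            held = self.locks.get(item)
            if held is None:
                self.locks[item] = [mode, {txn}]
                return True
            held_mode, holders = held
            if mode == "R" and held_mode == "R":
                holders.add(txn)               # read locks are compatible
                return True
            if holders == {txn}:               # sole holder may upgrade R -> W
                held[0] = "W"
                return True
            return False                       # conflict: the request must wait

        def unlock(self, txn, item):
            mode, holders = self.locks[item]
            holders.discard(txn)
            if not holders:                    # last holder gone: drop the entry
                del self.locks[item]

    table = LockTable()
    print(table.lock("T1", "x", "R"))          # True: read lock granted
    print(table.lock("T2", "x", "R"))          # True: shared with T1
    print(table.lock("T2", "x", "W"))          # False: T1 also holds a read lock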

2.10.6 Deadlock manager

When a locking protocol is in use, a lock held by one transaction can block a lock request from another transaction. If there is no circular wait for locks, then each blocked request will ultimately be granted. If there is a circular wait, deadlock occurs. Deadlock is the condition in which two or more transactions each hold resources while waiting for resources held by the others, so that the waits form a cycle. Since no transaction in the cycle can move forward, the system cannot get any useful work done. The deadlock manager must detect when a deadlock condition holds and decide how to handle it. Typically, one of the involved transactions is aborted and its locks released, thus allowing the other transactions to go on.
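Detection amounts to finding a cycle in the waits-for graph, where an edge from A to B records that transaction A waits for a lock held by B. A small sketch with a hypothetical graph encoding:

    def find_deadlock(waits_for):
        # waits_for maps a transaction to the set of transactions it waits on.
        def visit(txn, path):
            if txn in path:
                return path[path.index(txn):]  # the cycle of mutually blocked txns
            for blocker in waits_for.get(txn, ()):
                cycle = visit(blocker, path + [txn])
                if cycle:
                    return cycle
            return None

        for txn in waits_for:
            cycle = visit(txn, [])
            if cycle:
                return cycle                   # pick a victim from this cycle to abort
        return None

    # T1 waits for T2, T2 waits for T1: a circular wait, hence deadlock.
    print(find_deadlock({"T1": {"T2"}, "T2": {"T1"}}))   # ['T1', 'T2']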

2.10.7 Recovery manager

The recovery manager must ensure that the database is always in a state from which it can be recovered consistently and correctly. This is done by ensuring that the database contains all of the effects of committed transactions and none of the effects of aborted or still-running transactions. The recovery manager uses the concept of a checkpoint (a snapshot of the present state of the database) and a log file (a file of operations on the database) to aid in recovery. For conventional databases, recovery attempts to bring the database back to a previous consistent state and initiate processing from there. To do so, the recovery manager uses both undo, in which the past views seen by uncommitted or active transactions are restored, and redo, in which committed transactions whose results were never written to the database have their new states restored to the persistent store. These undo and redo records are applied to a checkpoint state to bring the database to some intermediate, acceptable, consistent state. A second form of recovery attempts to move the database forward by applying compensating transactions (to change committed effects into acceptable forms based on semantic needs), by applying extrapolations (to compute new, acceptably correct and consistent future states), and by applying condition handlers to user or system semantic actions at a variety of levels within the database.
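A toy sketch of the undo/redo idea: starting from a checkpoint, redo the writes of committed transactions and undo (restore the before images of) everything else. The log record format here is our own invention:

    def recover(checkpoint, log):
        db = dict(checkpoint)
        committed = {txn for kind, txn, *rest in log if kind == "commit"}
        for kind, txn, *rest in log:           # REDO committed work, oldest first
            if kind == "write" and txn in committed:
                item, before, after = rest
                db[item] = after
        for kind, txn, *rest in reversed(log): # UNDO uncommitted work, newest first
            if kind == "write" and txn not in committed:
                item, before, after = rest
                db[item] = before
        return db

    log = [("write", "T1", "x", 0, 5), ("commit", "T1"),
           ("write", "T2", "y", 9, 7)]         # T2 never committed
    print(recover({"x": 0, "y": 9}, log))      # {'x': 5, 'y': 9}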

2.10.8 Security manager

The security manager has the task of limiting access to the database and protecting it from unauthorized modification and malicious intrusion. To perform these control actions, the security manager requires that users be identified, authenticated, and authorized for access to and control over the data items they request. Identification is similar to a typical login capability, where the security manager asks users to identify themselves. To ensure that not just anybody gains access, the database may also ask a user to authenticate his or her identity; this can be done with a password or by a variety of fairly elaborate mechanisms. Once the user is allowed access, he or she is further restricted in what can be viewed and altered. Authorization performs the function of limiting access to a desirable predefined level: for example, read only, write only, alter capability, view restriction, and numerous other restrictions.

2.10.9 Query processing support manager

The query processor of a database system determines how to answer a user's request for information in the most efficient manner. The idea is that a query can be answered by a database system in a variety of ways. The most straightforward is the brute-force approach. This, however, is typically the most expensive in terms of time and resources consumed. For example, the cost to join two tables is the cost of scanning each item of the first against each item of the second, or on the order of N times N (N squared) if we assume they are the same size. On the other hand, if we could reduce the size of each by a factor of 2, the cost would drop to one-quarter. This is easily accomplished if we perform a select on each table before the join. If N is large, this reduction becomes significant and has a meaningful effect on the database's performance. To reduce the cost of queries we apply heuristics concerning the order of access to relations and their combinations, relation reductions via selections and projections, preprocessing (sorting), iteration order, relation operator precedence ordering, and numerous other factors.
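The arithmetic behind this heuristic is easy to check. A sketch with an assumed table size and selectivity:

    n = 10_000                        # rows in each relation (assumed)
    join_first = n * n                # brute-force nested-loop join: N * N

    selectivity = 0.5                 # assume each select keeps half the rows
    select_first = (n * selectivity) * (n * selectivity)

    print(join_first)                 # 100000000
    print(int(select_first))          # 25000000: one-quarter of the work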

2.10.10 Communications manager

The communications manager has the role of traffic cop in the database. This service must coordinate the transfer of database data as well as status information to aid in the processing of data. Communications may be between database services, different databases, different processors, different transactions, or within a transaction. Mechanisms such as simple message-passing schemes, client/server protocols, and others have been implemented.

2.10.11 Log manager

The log manager coordinates the transfer of information from the active database into secondary persistent storage, both to aid the recoverability of the database and to mitigate the problem of the operating system paging out information prematurely. The log maintains a history of data flow into and out of the database, as well as of actions that can affect the database's state. This includes transaction before images, after images, undo records, and redo records.
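The essential discipline is that log records reach persistent storage before the changes they describe are acknowledged. A minimal write-ahead sketch in Python; the record layout is hypothetical:

    import json, os

    def log_write(log_file, txn, item, before, after):
        record = {"txn": txn, "item": item, "before": before, "after": after}
        log_file.write(json.dumps(record) + "\n")
        log_file.flush()
        os.fsync(log_file.fileno())   # force-write: the record survives a crash

    with open("db.log", "a") as log_file:
        log_write(log_file, "T1", "Seats-Sold", 41, 42)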

Transaction management

The transaction manager has the job of providing a bounded framework within which to guarantee that the database stays consistent while concurrent operations execute on it. Without concurrency, the database manager can guarantee this with no problem, but such a system is neither interesting to study nor practical in the real world. The real world of database processing typically deals with a large database and a high degree of multiprocessing (concurrently executing transactions).

The execution of a transaction is similar to making a contract: both sides negotiate for a while, and then they either come to a consensus and sign the contract, or they both walk away. A transaction is thus all or nothing. A transaction must complete totally or must not complete at all. Now that's a concept.

The transaction is meant to be used as a consistent and reliable unit of work for the database system. A transaction interacts with the application's environment and the database's concurrency control protocols to perform its intended function (Figure 2.31a). A transaction is required to guarantee four properties when executing on a consistent database. These four properties are called the transaction's ACID properties; they include atomic, consistent, independent, and durable executions of transactions on the database.

Figure 2.31a: Database transaction.

An ACID transaction guarantees that the database the transaction begins with and the database it finishes with are consistent, that the data are durable, that the transaction acted alone on the database, and that the transaction completely finished its actions on the database (Figure 2.31b).


Figure 2.31b: ACID transaction.

The transaction ACID properties are as follows:

Atomic: The atomic property implies that a transaction is an indivisible unit of execution that either completely performs its designed function or else leaves the database as if the transaction had never begun; that is, if an atomic transaction does not commit, the database state it leaves is the same state it began with. If an atomic transaction completes, the database state it leaves contains all of the changes the transaction computed and no others.

Consistent: Consistent execution of a transaction requires that it transform one consistent database state into another consistent database state. The basic concept behind this property is that the database comprises a set of data items with constraints defined on them. For the database to be considered consistent at any point in time, all of these constraints on data items must evaluate to true; none of them can be violated in a consistent database state. A valid transaction, which initially sees a consistent database, must, upon commit, leave a database that is still consistent.

Independent: Independence, sometimes referred to as the isolation property of transactions, requires that each transaction accessing shared data act alone, without being affected by other concurrently running transactions. This property indicates that a transaction's effect on the database is as if it, and it alone, were executing on the database. The function of this property is to remove any dependence of a transaction's execution on any other transaction's execution.

Durable: The durability of a transaction's execution requires that once a transaction is committed, its effects remain permanent in the database. This property implies that the changes a transaction makes to the database do not disappear when the transaction terminates. Data produced by a transaction and written to the database become permanent; once written, a data item can only be altered by another transaction that reads and/or writes over it.

These transaction properties must hold for all transactions that execute within a database management system if the consistency, correctness, and validity of the database are to be maintained. They must hold even when transactions execute concurrently with one another, and, if adhered to, they guarantee a correct and consistent database even in the face of failures or errors. The problems arise when we try to devise the policies and mechanisms for transaction execution that actually guarantee these properties.
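A small demonstration of the atomic and durable properties in action, using SQLite simply as a convenient ACID engine (the account table and the simulated crash are our own):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE ACCOUNT (Name TEXT PRIMARY KEY, Balance INTEGER)")
    conn.execute("INSERT INTO ACCOUNT VALUES ('A', 100), ('B', 0)")
    conn.commit()

    def transfer(conn, amount):
        conn.execute("UPDATE ACCOUNT SET Balance = Balance - ? WHERE Name = 'A'",
                     (amount,))
        raise RuntimeError("simulated crash mid-transfer")  # second half never runs
        conn.execute("UPDATE ACCOUNT SET Balance = Balance + ? WHERE Name = 'B'",
                     (amount,))

    try:
        transfer(conn, 100)
    except RuntimeError:
        conn.rollback()                        # ABORT: all or nothing

    print(list(conn.execute("SELECT * FROM ACCOUNT")))  # [('A', 100), ('B', 0)]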

Transaction basics

A transaction is a collection of applications code and database manipulation code bound into an indivisible unit of execution; an example is shown in the following code segment:

    BEGIN-TRANSACTION Name
        Applications Code
        DB-Code
        Applications Code
        DB-Code
        DB-Code
           .
           .
        Applications Code
    END TRANSACTION Name

A transaction is framed by the BEGIN TRANSACTION and END TRANSACTION markers, which delineate its boundaries. For example, suppose we have the following three relations describing an airline reservation system:

    FLIGHT(Fno, Date, Source, Destination, Seats-Sold, Capacity)
    CUSTOMER(Cname, Address, Balance)
    FlightCust(Fno, Date, Cname, Special)

The first relation holds the flight information: the flight number, the date of the flight, the city of origin, the destination city, the number of seats sold for the flight, and the capacity of the plane. The second relation describes the customers who will be flying; it gives their names, addresses, and the balances owed on their tickets. The third relation captures the relationship between flights and customers: it indicates which passengers are flying on which flight on what date, and any special requirements for those passengers; for example, maybe someone wants a McDonald's Happy Meal or a vegetarian meal.

To write a simple transaction on these database relations that makes a reservation for a customer, we could compose the following pseudocoded relational Structured Query Language request, or query:

    BEGIN TRANSACTION Reservation
    BEGIN
        Input (FlightNo, date, customer, specl)
        EXEC SQL UPDATE FLIGHT
                 SET Seats-Sold = Seats-Sold + 1
                 WHERE Fno = 'FlightNo' AND Date = 'date';
        EXEC SQL INSERT INTO FlightCust (Fno, Date, Cname, Special)
                 VALUES (FlightNo, date, customer, specl);
        OUTPUT ("Transaction Completed")
    END
    END TRANSACTION Reservation;

This transaction takes keyboard input for the flight number, the date of the flight, the customer's name, and any special requirements the customer may have. These are read into the transaction variables FlightNo, date, customer, and specl, respectively. We update the count of seats sold for this flight by incrementing the value by one and writing it back to the relation. The transaction then inserts the contents of the input variables into the proper places within the FlightCust relation through the VALUES clause. To be complete we should also update the CUSTOMER relation; this is left as an exercise for the reader. This is a simple transaction; however, as it stands it will not by itself guarantee the transaction ACID properties.

To guarantee the transaction ACID properties, we need some additional features in this simple transaction model. To meet the needs of atomic execution, we require a means of determining the conditions for termination of a transaction, correct or otherwise. The first concept required for correct execution and termination is the commit. Commit indicates the correct and atomic termination of a transaction; it includes the processing necessary to ensure proper updating and marking of the database. The second concept, the abort, is necessary for transactions that fail or stop execution for some reason. Abort conditions include erroneous operations, conflicts in accessing stored information, and the inability to meet the ACID requirements on transaction processing. An abort requires that all of the effects of a transaction be removed from the database before any other transaction has a chance to see them. These two features are necessary to facilitate atomic execution, although they are not sufficient by themselves.

The commit action is necessary in order to synchronize the actions of other elements of the database management system and to make changes to the database permanent. For example, this command may be used to cause the database buffers and activity log to be flushed (force-written) to the permanent storage subsystem, thereby making the changes durable, as shown in the following code segment:

    BEGIN TRANSACTION Reservation
    BEGIN
        Input (FlightNo, date, customer, specl)
        EXEC SQL SELECT Seats-Sold, Capacity
                 FROM FLIGHT
                 WHERE Fno = 'FlightNo' AND Date = 'date';
        IF Seats-Sold < Capacity THEN
            BEGIN
                EXEC SQL UPDATE FLIGHT
                         SET Seats-Sold = Seats-Sold + 1
                         WHERE Fno = 'FlightNo' AND Date = 'date';
                EXEC SQL INSERT INTO FlightCust (Fno, Date, Cname, Special)
                         VALUES (FlightNo, date, customer, specl);
                OUTPUT ("Transaction Completed")
                COMMIT Reservation;
            END
        ELSE
            ABORT Reservation;
    END
    END TRANSACTION Reservation;

This altered transaction now allows us either to go on with the transaction if it has a chance to succeed, or abort the transaction if we cannot complete it. In this example we would abort the transaction if we did not have a seat remaining in the plane to give to this customer. If there is a seat, we sell this customer a seat and commit the transaction.
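For comparison, here is the same reservation logic run against a live engine, with the capacity test and the commit/abort decision made explicit. This is our sketch, not the book's code; SQLite is assumed, and the hyphenated column Seats-Sold is renamed Seats_Sold to be a legal identifier:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE FLIGHT "
                 "(Fno TEXT, Date TEXT, Seats_Sold INTEGER, Capacity INTEGER)")
    conn.execute("CREATE TABLE FLIGHTCUST "
                 "(Fno TEXT, Date TEXT, Cname TEXT, Special TEXT)")
    conn.execute("INSERT INTO FLIGHT VALUES ('F100', '2002-06-01', 99, 100)")
    conn.commit()

    def reserve(conn, flight_no, date, customer, special):
        seats_sold, capacity = conn.execute(
            "SELECT Seats_Sold, Capacity FROM FLIGHT WHERE Fno = ? AND Date = ?",
            (flight_no, date)).fetchone()
        if seats_sold < capacity:              # a seat remains: sell it
            conn.execute("UPDATE FLIGHT SET Seats_Sold = Seats_Sold + 1 "
                         "WHERE Fno = ? AND Date = ?", (flight_no, date))
            conn.execute("INSERT INTO FLIGHTCUST VALUES (?, ?, ?, ?)",
                         (flight_no, date, customer, special))
            conn.commit()                      # COMMIT Reservation
            print("Transaction Completed")
        else:
            conn.rollback()                    # ABORT Reservation

    reserve(conn, "F100", "2002-06-01", "Smith", "vegetarian meal")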

Transaction formalization

A transaction, T_i, is composed of a set of operations, O_ij ∈ {Read, Write}, where O_ij is operation j of transaction i on data items from the database D.

Let OS_i = ∪_j O_ij represent the set of all operations j of transaction i.

Finally, let N_i ∈ {Abort, Commit} represent the termination condition of a transaction: either commit or abort.

A transaction is modeled as a partial ordering over its operations and end conditions. The partial ordering is represented by P = {S, <<}, which indicates that the partial order P is composed of a set of operations, denoted S, and an ordering relation, denoted <<, that holds between the elements of S.

With these definitions we can formally describe a transaction, Ti, as a partial ordering of its composite operations, as follows:

(2.15) T_i = {S_i, <<_i}

where

(2.16) S_i = OS_i ∪ {N_i}

(2.17) for any two operations O_ij, O_ik ∈ OS_i, if O_ij = R(x) and O_ik = W(x) for the same data item x, then either O_ij <<_i O_ik or O_ik <<_i O_ij

(2.18) ∀ O_ij ∈ OS_i: O_ij <<_i N_i

What all this says is that a transaction is made up of reads, writes, and a commit or an abort operation, and that there is an explicit ordering within a transaction: whenever a read and a write conflict on the same data item, a strict ordering must hold between them, and all operations of the transaction must precede its commit or abort. This is an important concept for developing correctness criteria for transaction executions, especially when concurrency comes into play. The transaction ordering must not be violated if the transaction is to perform its intended operation.
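To make the notation concrete, here is a small worked example of ours (not from the text). Consider a transaction T_1 that reads a seat count x, writes it back incremented, writes a customer record y, and commits:

    S_1 = {R_1(x), W_1(x), W_1(y), C_1}
    R_1(x) <<_1 W_1(x)
    R_1(x) <<_1 C_1,  W_1(x) <<_1 C_1,  W_1(y) <<_1 C_1

The conflicting pair R_1(x) and W_1(x) must be ordered, and every operation precedes the commit C_1; but W_1(y) need not be ordered with respect to R_1(x) or W_1(x), which is precisely what makes the ordering partial.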

Transaction processing in a database system strives to guarantee the ACID properties while delivering a high degree of data availability, no loss of updates, avoidance of cascading aborts, and recoverability of the database and its transactions. A high degree of data availability is realized through reduced blocking of read and write requests. No loss of updates is guaranteed by correct commit processing. The avoidance of cascading aborts is provided by robust recovery protocols. Finally, recovery is provided by redundancy and the rules governing commit.

2.10.12 Database and system mismatch

The operating system migrates storage from primary memory to secondary storage based on the operating system's own view of when this should be done; demand paging and limited storage dictate that it be performed on a page-fault basis. The database, however, may not want a page written back to secondary storage, due to concurrency control and atomicity concerns. The database may wish to hold pages in memory until transaction commit time and only then flush them to secondary storage. This would allow the database to avoid undoing transactions on failure; it could simply abort and restart them.

Related to this are I/O management and device management. The database may wish to order accesses based on the queries being presented to it, in order to maintain ACID execution, whereas the operating system will simply order the accesses to deliver the greatest throughput of data back to the CPU. The order in which the operating system returns information may be counterproductive to the database. The database may wait so long for needed data that, when the data finally arrive, the operating system pages out the database software to make room for them, or it evicts the very data the new information is to be processed against. In either case, this is not conducive to optimal database processing.

The problem with the operating system for this type of problem is the I/O buffer management policies and mechanisms. The database wants to use and optimize buffers to maximize transaction throughput, while the operating system wants to maximize average process response.

The operating system's control of the processor may block essential functions that the database must perform. For example, the database requires that the log of database actions be flushed to secondary storage at specific points and in an uninterruptible manner in order to guarantee recovery and correct execution. Likewise, to keep the database as consistent as possible, the database must be able to flush committed data to the persistent store when necessary, as an atomic operation. The operating system, in its wish to be fair, may time-out a database function that is performing exactly this operation. On a related issue, if a database is sorting and processing two large data files against each other, it may wish to maintain direct control over how and when data traverse the boundary between storage and processor. Without direct control over the allocation and deallocation mechanisms, the database could be removed from one resource while still holding another, causing a loss of continuity in the intended operation.

The operating system's locking mechanism works well for simple file management, and for the majority of applications this is sufficient. But a database needs finer control over locking, down to the granularity of a single data item. The reason is to allow more concurrency and less blocking: to increase data availability by locking only what is being used, not an entire file. To achieve this, databases are forced to use direct addressing and direct file management features so that they can control locking below the file level. However, in some operating systems the database still suffers under the control of the operating system's lock manager, regardless of the mode used.

An operating system's interprocess communication mechanisms may be too expensive to use within a database system. Many operating systems use a form of message passing involving interrupt processing, which can carry a high cost in overhead. A database may wish to provide simpler IPC mechanisms using shared memory or semaphores, especially since a database is only another process within the operating system.

Scheduling in an operating system aims to minimize overall average response time and to share resources fairly, and it deals only with the selection of a process to place onto the executing hardware. A database, on the other hand, has a multilevel scheduling problem: not only must it select which transaction to place into service at any point in time, but it must also schedule which operation to perform on the underlying database to meet concurrency control requirements. An operating system's scheduler will not and does not provide such a service.

A database requires the copying, backup, and recovery services of the underlying infrastructure to aid in constructing database recovery protocols. The problem is that many other features of an operating system may get in the way and hinder the smooth operation of database recovery. The database wishes to dictate how and when it forces information out to persistent storage, in order to minimize the work (UNDO and REDO) that must be done to recover the database to a known consistent state. The operating system, on the other hand, will do this based on its own need to reallocate storage among executing processes. The operating system will not take into account that the least recently used page may actually be the next page the database needs; it will simply choose that page and force it out immediately.

To make the operating system and database interface more compatible, it is desirable that the operating system use semantic information, which the database can provide, to make sound, informed decisions. This is not to say that the database should override or dictate the moves of the operating system; rather, the two should act in a cooperative fashion to satisfy the system-oriented needs of a database, which are more diverse than those of a typical application. See [1] for further information on database systems.


