Section 5.2. The Conflicting Goals

5.2. The Conflicting Goals

There are often two conflicting goals when trying to optimize the physical layout of data for a system that expects a large number of active users, some of them reading and others writing data. One goal is to try to store the data in as compact a way as possible and to help queries find it as quickly as possible. The other goal is to try to spread the data, so that several processes writing concurrently do not impede one another and cause contention and competition for resources that cannot be shared.

Even when there is no concurrency involved, there is always some tension when designing the physical aspect of a database, between trying to make both queries and updates (in the general sense of "changes to the data") as fast as possible. Indexing is an obvious case in point: people often index in anticipation of queries using the indexed columns as selection criteria. However, as seen in Chapter 3, the cost of maintaining indexes is extremely high and inserting into an index is often much more expensive than inserting into the underlying table alone.

Contention issues affect any data that has to be stored, especially in change-heavy transactional applications (I am using the generic term change to mean any insert, delete, and update operation). Various storage units and some very low layers of the operating system can take care of some contention issues. The files that contain the database data may be sliced, mirrored, and spread all over the place to ensure data integrity in case of hardware failure, as well as to limit contention.

Unfortunately, relying on the operating system alone to deal with contention is not enough. The base units of data that a DBMS handles (known as pages or blocks depending on the product) are usually, even at the lowest layers, atomic from a database perspective, especially as they are ultimately all scanned in memory. Even when everything is perfect for the systems engineer, there may be pure DBMS performance issues.

To get the best possible response time, we must try to keep the number of data pages that have to be accessed by the database engine as low as possible. We have two principal means of decreasing the number of pages that will have to be accessed in the course of a query:

Trying to ensure a high data density per page
Grouping together those pieces of data most likely to be required during one retrieval process

However, trying to squeeze the data into as few pages as possible may not be the optimum approach where the same page is being written by several concurrent processes and perhaps also being read at the same time. Where that single data page is the subject of multiple read or write attempts, conflict resolution takes on an altogether more complex and serious dimension.

Many believe that the structure of a database is the exclusive responsibility of the database administrator. In reality, it is predominantly but not exclusively the responsibility of that very important person. The way in which you physically structure your data is extremely dependent on the nature of the data and its intended use. For example, partitioning can be a valuable aid in optimizing a physical design, but it should never be applied in a haphazard way. Because there is such an intimate relationship between process requirements and physical design , we often encounter profound conflicts between alternative designs for the same data when that data is shared between two or more business processes. This is just like the dilemma faced by the general on the battlefield, where the benefits of using alternative parts of his forces (infantry, cavalry, or artillery) have to be balanced against the suitability of the terrain across which he has to deploy them. The physical design of tables and indexes is one of those areas where database administrators and developers must work together, trying to match the available DBMS features in the best possible way against business requirements.

The sections to follow introduce some different strategies and show their impact on queries and updates from a single-process perspective, which, in practice, is usually the batch program perspective.

Reads and writes don't live in harmony: readers want data clustered; and concurrent writers want data scattered.