Distributed Database Design Issues


When you're designing databases in a distributed environment, follow the standard database design rules of thumb provided in Chapter 5, "Data Definition Guidelines." However, you might need to take a more rigorous approach regarding denormalization. For more information, refer to the exhaustive discussion of denormalization in Chapter 5.

Denormalization can be a useful technique in a distributed environment. In the following sections, I discuss several methods of distributed denormalization. Along the way, I make references to the denormalization types already discussed to clarify the distributed denormalization concepts.

Fragmentation

Fragmentation is a specialized form of distributed denormalization that resembles split tables. To implement fragmentation, a table must be split into parts, or fragments. Each fragment is then stored at a different location. Fragmentation can enhance performance because each fragment can be stored at the location that accesses it most frequently.

As with split tables, fragmentation avoids data duplication. Each fragment must contain a logical subset of the data.

Multiple fragments can be created from a single source table. The methodology used to determine where and how to split the table depends on the data access needs of the distributed applications that must access the data.

Two types of fragmentation can be implemented: horizontal and vertical. Horizontal fragmentation splits the data by rows, whereas vertical fragmentation splits the data by columns. Tables are horizontally fragmented using ranges of values to create distinct fragments. Tables can be vertically fragmented by assigning specific columns to specific fragments.
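The two fragmentation strategies can be sketched concretely. The example below uses SQLite as a local stand-in for distributed DB2 subsystems, with a hypothetical ORDERS table; the table and column names are illustrative, not from the book:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# A hypothetical source table to be fragmented.
cur.execute("CREATE TABLE ORDERS (ORDER_KEY INTEGER PRIMARY KEY, REGION TEXT, AMOUNT INTEGER)")
cur.executemany("INSERT INTO ORDERS VALUES (?, ?, ?)",
                [(1, "EAST", 100), (2, "WEST", 250), (3, "EAST", 75), (4, "WEST", 310)])

# Horizontal fragmentation: split the rows using a value-based predicate;
# each fragment would be stored at the site that accesses it most.
cur.execute("CREATE TABLE ORDERS_EAST AS SELECT * FROM ORDERS WHERE REGION = 'EAST'")
cur.execute("CREATE TABLE ORDERS_WEST AS SELECT * FROM ORDERS WHERE REGION = 'WEST'")

# Vertical fragmentation: split the columns, repeating the key column in
# every fragment so the fragments can later be joined back together.
cur.execute("CREATE TABLE ORDERS_V1 AS SELECT ORDER_KEY, REGION FROM ORDERS")
cur.execute("CREATE TABLE ORDERS_V2 AS SELECT ORDER_KEY, AMOUNT FROM ORDERS")

print(cur.execute("SELECT COUNT(*) FROM ORDERS_EAST").fetchone()[0])  # 2
```

In a real DB2 environment each fragment would be a separate table at a separate location; the split logic itself is the same.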

Vertical fragmentation requires a certain amount of data duplication because the key column(s) must be stored at each site to defragment the data. Without the redundant key stored at each location, joining the tables back together to reproduce the original, unfragmented data would be impossible.

Ensure Lossless Joins and Unions

You must take care to ensure that fragmentation is accomplished such that defragmenting the tables does not result in additional data or a loss of data.

For horizontal fragmentation, rows must be wholly contained within one, and only one, fragment. In other words, the result of selecting all rows from every fragment and combining them together using UNION ALL must provide the same result as a SELECT of all rows from the original, unfragmented table:

 

 SELECT   *
 FROM     FRAGMENT1
 UNION ALL
 SELECT   *
 FROM     FRAGMENT2
 UNION ALL
 SELECT   *
 FROM     FRAGMENTn;

Of course, this statement cannot be successfully executed until DB2 supports distributed request capability.
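The lossless property can be verified mechanically at a single site. The sketch below (SQLite standing in for DB2, with a hypothetical EMP table split on a key range) confirms that the UNION ALL of the fragments reproduces the original rows exactly:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE EMP (EMPNO INTEGER PRIMARY KEY, DEPT TEXT)")
cur.executemany("INSERT INTO EMP VALUES (?, ?)",
                [(10, "A"), (20, "A"), (30, "B"), (40, "B")])

# Each row lands in one, and only one, fragment (split on a range of EMPNO).
cur.execute("CREATE TABLE FRAGMENT1 AS SELECT * FROM EMP WHERE EMPNO <= 20")
cur.execute("CREATE TABLE FRAGMENT2 AS SELECT * FROM EMP WHERE EMPNO > 20")

original = sorted(cur.execute("SELECT * FROM EMP").fetchall())
recombined = sorted(cur.execute(
    "SELECT * FROM FRAGMENT1 UNION ALL SELECT * FROM FRAGMENT2").fetchall())
print(original == recombined)  # True: the fragmentation is lossless
```

If the range predicates overlapped (or left a gap), the comparison would fail, flagging a fragmentation scheme that adds or loses data.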

For vertical fragmentation, only the key columns can be duplicated in multiple fragments. The key columns must reside in every fragment. Even when no data is actually associated with a particular key for a particular fragment, a row must be stored in the fragment for that key to facilitate defragmentation. Nulls (or default values) can be used to indicate that the other columns contain no valid data for the particular key at that particular location.

Simply stated, the result of joining all fragments together should provide the same result as selecting from the original, unfragmented table:

 

 SELECT   F1.KEY, F1.COL1, F2.COL2, Fn.COLn
 FROM     FRAGMENT1  F1,
          FRAGMENT2  F2,
          FRAGMENTn  Fn
 WHERE    F1.KEY = F2.KEY
 AND      F2.KEY = Fn.KEY;

If certain keys are not included in every fragment, an outer join must be used instead. Although DB2 provides native outer join support, always propagating the keys across locations remains the wiser approach.
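The placeholder-row requirement can be illustrated with a small sketch (SQLite again standing in for DB2; the customer table and its columns are hypothetical). Key 2 has no real data in the second fragment, so a NULL placeholder row is stored there anyway, allowing a plain inner join to reassemble every original row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Two vertical fragments of a hypothetical customer table; the key column
# is duplicated in both so the original rows can be rebuilt by a join.
cur.execute("CREATE TABLE FRAGMENT1 (CUST_KEY INTEGER PRIMARY KEY, NAME TEXT)")
cur.execute("CREATE TABLE FRAGMENT2 (CUST_KEY INTEGER PRIMARY KEY, BALANCE INTEGER)")
cur.executemany("INSERT INTO FRAGMENT1 VALUES (?, ?)", [(1, "ACME"), (2, "ZENITH")])
# Customer 2 has no balance data, yet a placeholder row with NULL is stored;
# without it, the inner join below would silently drop customer 2.
cur.executemany("INSERT INTO FRAGMENT2 VALUES (?, ?)", [(1, 500), (2, None)])

rows = cur.execute("""
    SELECT F1.CUST_KEY, F1.NAME, F2.BALANCE
    FROM   FRAGMENT1 F1, FRAGMENT2 F2
    WHERE  F1.CUST_KEY = F2.CUST_KEY
""").fetchall()
print(rows)  # both customers come back; NULL marks the missing balance
```

Deleting the placeholder row would force every defragmenting query to use an outer join instead.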

Replication

Another type of distributed denormalization is replication, which is similar in implementation to mirror tables.

When data is replicated, redundant data is stored at multiple distributed locations. Because replication causes copies of the data to be stored across the network, performance can be enhanced (because distributed access is eliminated or reduced).

Replication can be implemented simply by copying entire tables to multiple locations. Alternatively, replicated data can be a subset of the rows and/or columns. The general rule of thumb is to copy only what is needed to each remote location.
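The "copy only what is needed" rule can be sketched as follows, with two SQLite connections standing in for two distributed sites and a hypothetical ACCOUNT table; only the rows and columns the remote site's applications actually use are replicated:

```python
import sqlite3

# Two connections stand in for two distributed sites.
master = sqlite3.connect(":memory:")
remote = sqlite3.connect(":memory:")

master.execute("CREATE TABLE ACCOUNT (ACCT_KEY INTEGER PRIMARY KEY, REGION TEXT, BALANCE INTEGER)")
master.executemany("INSERT INTO ACCOUNT VALUES (?, ?, ?)",
                   [(1, "EAST", 100), (2, "WEST", 200), (3, "EAST", 300)])

# Replicate only what the remote site needs: the EAST rows, and only the
# columns read there (a subset of both rows and columns).
remote.execute("CREATE TABLE ACCOUNT_REPLICA (ACCT_KEY INTEGER PRIMARY KEY, BALANCE INTEGER)")
subset = master.execute(
    "SELECT ACCT_KEY, BALANCE FROM ACCOUNT WHERE REGION = 'EAST'").fetchall()
remote.executemany("INSERT INTO ACCOUNT_REPLICA VALUES (?, ?)", subset)

print(remote.execute("SELECT COUNT(*) FROM ACCOUNT_REPLICA").fetchone()[0])  # 2
```

Note that the replica here can be tuned independently of the master, for example with its own indexes, as discussed below.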

Furthermore, each replica should contain accurate, up-to-date information. Whenever possible, you should update all replicated copies at the same time. This way, you can eliminate the administrative burden of having to know the state of each replica. Additionally, replication transparency is ensured when the data is accurate at each location.

To achieve optimal performance, you should always read from the closest replica. A replica may not exist at every location. By always reading from the closest replica (which supports the current requirements), you can enhance performance by reducing the communication path.

You can tune replicas independently of one another. Different clustering strategies, different indexes, and different table space parameters might be appropriate at different locations.

Finally, do not create more replicas than are required. The more replicas, the more complicated the process of updating them.

Snapshots

Similar to mirror tables, snapshot tables are read-only copies of tables. Snapshot tables also are similar to replicas, but the data currency requirements for each snapshot table can differ. Data in snapshot tables usually represents a "point in time" and is not accurate up to the second.

Decision-support applications typically use snapshot tables. Snapshots are most useful for optimizing performance when data does not have to be entirely accurate.

As with the other types of distributed denormalization, snapshots tend to optimize performance when they are stored at the location that accesses them most frequently.

You can create multiple snapshot tables, each representing a different "point in time." The number of snapshots required depends on the nature of the data and the needs of the applications that must access them.

To achieve optimal performance, always read from the closest snapshot. A snapshot may not exist at every location. By always reading from the closest snapshot (which supports the current requirements), you can enhance performance by reducing the communication path.

Be sure to send all updates to the system of record, which is the master table (or tables) that always contains accurate, up-to-date information. Application updates should never be made to snapshots, only to the system of record. The snapshot tables need to be refreshed periodically with data from the system of record. You should develop a reliable, systematic method of refreshing snapshot data.
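A systematic refresh method can be sketched as below, with two SQLite connections standing in for the system of record and a remote snapshot site (the SALES table and the refresh routine are hypothetical). Updates go only to the system of record; the snapshot lags behind until its next refresh, and a timestamp records how current it is:

```python
import sqlite3
import datetime

sor = sqlite3.connect(":memory:")   # system of record (master)
snap = sqlite3.connect(":memory:")  # remote, read-only snapshot site

sor.execute("CREATE TABLE SALES (DAY TEXT, TOTAL INTEGER)")
sor.executemany("INSERT INTO SALES VALUES (?, ?)", [("MON", 10), ("TUE", 20)])

snap.execute("CREATE TABLE SALES_SNAP (DAY TEXT, TOTAL INTEGER)")
snap.execute("CREATE TABLE SNAP_META (REFRESHED_AT TEXT)")

def refresh_snapshot():
    """Replace the snapshot wholesale from the system of record and record
    when it was taken, so users know the point in time it represents."""
    rows = sor.execute("SELECT DAY, TOTAL FROM SALES").fetchall()
    snap.execute("DELETE FROM SALES_SNAP")
    snap.executemany("INSERT INTO SALES_SNAP VALUES (?, ?)", rows)
    snap.execute("DELETE FROM SNAP_META")
    snap.execute("INSERT INTO SNAP_META VALUES (?)",
                 (datetime.datetime.now().isoformat(),))

refresh_snapshot()
# All updates go to the system of record, never to the snapshot.
sor.execute("INSERT INTO SALES VALUES ('WED', 30)")
print(snap.execute("SELECT COUNT(*) FROM SALES_SNAP").fetchone()[0])  # 2 (stale)
refresh_snapshot()
print(snap.execute("SELECT COUNT(*) FROM SALES_SNAP").fetchone()[0])  # 3 (current)
```

The stored timestamp supports the notification requirements listed below: users can be told the date and time for which the snapshot is accurate.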

By their very nature, snapshot tables do not contain up-to-the-second information. Ad hoc users, programmers, and anyone else requiring access to snapshot tables need to be informed of the following:

  • The data is not current; for current data, the system of record should be accessed.

  • The date and time for which the data is accurate.

  • The next scheduled refresh date and time.



DB2 Developers Guide (5th Edition). ISBN: 0672326132. Year: 2004. Pages: 388.
