Lesson 1: Introduction to Distributed Data | Microsoft SQL Server 7 Data Warehousing Technical Support Training Kit

This lesson starts with a description of the distributed data environment and addresses factors in deciding how to distribute data. It then summarizes the various methods of distributing data.

After this lesson, you will be able to:

List the various methods for distributing data

Describe the characteristics of Microsoft SQL Server replication

Estimated lesson time: 20 minutes

The Need for Distributed Data

A distributed data environment is one that can include multiple copies of the same information on multiple servers. For example, in an international company using a sales-tracking application, data originates in one country and is distributed to servers in other countries to make querying more efficient.

If a distributed data environment evolves rather than being created intentionally, it may become necessary to implement a solution that makes the environment more manageable. For example, in a company s early development, various divisions typically have their own independent servers, which may run different database management systems. The challenge is to make the data more accessible to the whole company.

When you create a distributed data environment, you design a solution that

Brings the data closer to the user

Allows sites to operate independently (autonomously)

Separates online transaction processing (OLTP) from read-intensive applications, such as data marts and data warehousing

Can reduce conflicts

Because data can be distributed throughout the network, the information is made available to many users, and conflicts are reduced during user requests.

Considerations for Distributing Data

Two principal strategies exist for implementing distributed data: distributed transactions and replication. With both strategies, it is possible to keep current multiple copies of data. It also is possible to design a distributed environment that includes aspects of each strategy. Figure 7.1 compares the major differences between replication and distributed transactions; these differences are described in the following discussion.

click to view at full size

Figure 7.1 Comparing replication with distributed transactions

Distributed Transactions

Distributed transactions guarantee that all copies of your data are consistent all the time. This consistency is usually based on the two-phase commit protocol. For a transaction to be committed, each server that is included in a distributed transaction must be online and able to complete its part of the transaction. Using distributed transactions is less flexible than replication because the failure of a transaction at one site means failure at all sites. You should use this approach only when data must be synchronized at all times.

Distributed transactions are typically used for, but not limited to, applications that make simultaneous updates to data in more than one database. For example, distributed transactions might be used in an invoicing application that adds an invoice to a branch-invoicing database and updates a central stock database, which records stock levels at a warehouse used by all branches.

Replication

With replication, recent copies of data are duplicated and distributed from a source database to a destination database, usually on a separate server. Autonomous sites are supported, allowing more scalability because sites can be online intermittently. Databases participating in replication can be located on a large server servicing hundreds of users but can also be located on a single user s computer, making replication useful for a wide variety of applications.

Replication is typically used for, but not limited to, data that is duplicated from one database to another. For example, replication might be used for a product catalog that is maintained at a central office and replicated to branch offices.

Factors in Deciding How to Distribute Data

When you decide how to distribute data, you must consider such factors as latency, site autonomy, transactional consistency, and database update conflicts. Figure 7.1 illustrates some of these factors.

Latency

Latency is the time delay that occurs between the updates to two or more sets of distributed data. You need to consider what degree of latency is acceptable for your database application. Distributed transactions require near-zero latency, while replication may cause or allow latency of a few seconds to a number of days.

Site Autonomy

Site autonomy refers to the degree to which sites can operate independently. Distributed transactions require that sites be permanently connected over reliable and ample-bandwidth connections, whereas some forms of replication can allow sites to be completely disconnected for long periods.

Transactional Consistency

A transaction is a series of data modifications that must be completed entirely or not at all. Distributed transactions enforce complete, immediate transactional consistency. Some forms of replication maintain transactional consistency, although there is a delay from the initial update; other forms of replication do not guarantee transactional consistency.

Data Update Conflicts

If data is being updated at different sites, conflicts can occur. For multiple sites, distributed transactions give the same degree of multiuser consistency as a single server. Some forms of replication prevent conflicts by allowing data to be updated at only one site. Alternatively, you can design your application so that each participating site works with data that is strictly segregated, or partitioned, from other sites. For example, you might design your order entry system so that a given sales representative has a unique territory code, preventing orders from conflicting with those of other sales representatives.

Methods of Distributing Data with SQL Server

As illustrated in Figure 7.2, the methods of distributing data provide varying degrees of transactional latency and autonomy. It is important to select the method that best suits your particular business needs and environment.

click to view at full size

Figure 7.2 Comparison of methods for distributing data

Distributed Transactions

Using distributed transactions guarantees that all sites have the same data at the same time. Microsoft Distributed Transaction Coordinator (MS DTC) facilitates distributed transactions in SQL Server by using a protocol known as two-phase commit to guarantee that a transaction completes at all participating sites at the same time.

Replication

Various types of replication exist for distributing data in SQL Server. They are summarized here and presented in detail in Lesson 3.

Transactional replication

Only changed data is distributed. The sequence of transactions is maintained. Conflicts do not arise, because there is only one location at which data is changed.

Snapshot replication

A picture of the entire current data (changed and unchanged) at a source server replaces data at a destination server on a periodic basis or on demand.

Merge replication

Multiple sites make changes to data independently of one another. The changes are periodically merged together at the source server. Conflicts can occur (these can be resolved in a number of ways), so this type of replication does not guarantee transactional consistency.

Lesson Summary

Data is commonly distributed in an organization. SQL Server provides support for distributed transactions and replication to support these distributed data needs. Distributed transactions are implemented using a protocol known as two-phase commit, which guarantees that a transaction completes at all participating sites at the same time. Replication duplicates and distributes recent copies of data from a source database to a destination database. With replication, distribution of changed data may be delayed.