Other Multiple Machine Database Implementations


Many high-demand implementations require the use of multiple servers to handle the workload. Various options exist for using SQL Server on multiple servers. Partitioning, log shipping, replication, federated servers, and clustering are all potential implementations for using multiple servers and obtaining load balancing.

Partitioning and federated servers provide load balancing for a single application and work together to give users better response times. This type of implementation takes the place of a multiple-server cluster in which all machines share the workload through load balancing. In a federation, each server is completely independent; it is the application and database design that implements the load balancing.

Microsoft failover clustering provides for availability by enabling a secondary server to take over activities if a primary server fails. For further information on failover clustering, see SQL Server Books Online, "Creating a Failover Cluster."

Replication places potentially updatable copies of the same data on multiple servers so that applications that allow for site independence can, at the same time, keep all copies of the data synchronized. Most models have a degree of latency or delay between the initial updates and the moment when all data is in agreement.

Standby Servers and Log Shipping

To achieve availability and performance, redundancy is the starting point. Data redundancy can be obtained at the table and database level through denormalization, and at the server level through the use of log shipping, replication, and/or partitioning. We investigated the denormalization of a database design in the preceding chapter. In the physical environment, implementation choices occur at the machine level. Each technique has a set of requirements that must be closely adhered to.

A warm backup server or standby server is a lower-cost implementation that is often selected as an alternative to replication or clustering. The premise is to back up the production server on a regular basis, restoring it to a second machine that can be put into production in the event of failure in the first computer. A standby server can also assist in taking some of the workload from the production machine if it is used as a read-only query server.

In SQL Server 2000 you can use the Maintenance Plan Wizard to implement a standby server configuration. The wizard prompts you through the configuration of backups and regularly scheduled log shipments to the standby machine.

Log shipping is a way to have two separate servers that contain the same database and to split the query load away from the updating of data. The log information is periodically copied to a read-only standby server that can then be used for query purposes, thereby offloading some of the work from the main production machine. For further information on log shipping implementations, see SQL Server Books Online, "Maintenance Planning Wizard."
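The mechanics behind log shipping can be illustrated with plain T-SQL backup and restore statements. This is a minimal sketch of a manual implementation; the database name, share, and file paths are illustrative only, and a wizard-configured plan automates and schedules these steps for you:

```sql
-- On the production server: back up the transaction log on a schedule.
BACKUP LOG SalesDB
    TO DISK = '\\StandbyServer\LogShip\SalesDB_log.trn'

-- On the standby server: restore the log WITH STANDBY so the database
-- remains readable between restores (usable as a read-only query server).
RESTORE LOG SalesDB
    FROM DISK = '\\StandbyServer\LogShip\SalesDB_log.trn'
    WITH STANDBY = 'C:\MSSQL\Backup\SalesDB_undo.dat'
```

The STANDBY option writes uncommitted work to the named undo file, which is what allows the standby database to be queried between log restores while still accepting further restores.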

Standby server implementations utilizing log shipping are one form of multiple-machine implementation that creates redundant data in multiple databases, and they are relatively simple to implement. Other multiple-machine implementations require much more precise configuration. SQL Server offers a technology referred to as data replication to move data in an orderly fashion from one database to another, whether on the same server or on separate servers. Replication allows for various topologies that control the flow of data between machines.

The exam requires you to choose the appropriate technology and also asks about particular aspects of replication. It is imperative, therefore, that you have a complete understanding of what replication offers in contrast to other aspects of SQL Server.


Replicating Data

Replication is the process by which data and database objects are distributed from SQL Server to other database engines that reside across your enterprise; this data can be distributed to other SQL Server databases or even non-SQL Server databases such as Oracle. Replication is performed using the Publisher/Distributor/Subscriber metaphor, in which the Publisher, Distributor, and Subscriber roles form the basis of the process.

To understand replication, you must become familiar with the Publisher/Distributor/Subscriber metaphor, which defines the different roles SQL Server can play throughout the replication process. Each role provides functionality that aids in that process. The Publisher is the server that makes the data available so that it can be sent to the Subscriber. The Distributor is the intermediary that collects the data from the Publisher and transports it to the Subscriber. The Subscriber is the server that ultimately receives the data.

Publisher

In replication terminology, the Publisher is the server that produces the original data for the use of the replication process. The Publisher can produce many publications or sets of data to be distributed to other subscribing machines. One Publisher can produce data to be replicated to many Subscribers. Also, many Publishers may produce data to be distributed to just a single, central Subscriber. The former is implemented as a standard Central Publisher/Distributor/Subscriber replication model, and the latter is referred to as the Central Subscriber replication model.

Distributor

In SQL Server terminology, the Distributor is the server that contains the distribution database, data history, and transactions; as its name implies, it sends data to Subscribers. The Distributor can be implemented on the same physical server as the Publisher or Subscriber, though it doesn't need to be. It can reside on a separate server across the world connected via a network. The placement of the Distributor and its characteristics depends on the type of replication used.

Subscriber

A Subscriber in SQL Server terms is the server that receives replicated data through the Distributor. A Subscriber receives publications by subscribing to them; it cannot, however, subscribe to individual articles that represent only a portion of a publication. (It is also possible to use stored procedures to set up subscriptions, in which case a single article can be subscribed to.) Subscribers can choose from the publications available at the Publisher and don't necessarily need to subscribe to all of them. Other replication terminology is also important to the process and falls into line with the subscription metaphor. When a subscription is set up, a publication made up of one or more articles is configured.

Articles and Publications

Simply put, articles are data structures made up of selected columns and/or rows from a table. An article could also be the entire table, although it is recommended that the content of an article be kept to just the minimum amount of data needed for the specifics of an implementation. One or more articles are bundled into a publication to be used for replication. Articles must be grouped into a publication before they are ready to be replicated. In short, then, a publication is a collection of one or more articles and is capable of being replicated.
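The relationship between publications and articles can be seen in the replication stored procedures that create them. This is a minimal sketch assuming the database has already been enabled for publishing and a Distributor has been configured; the publication, article, and object names are illustrative only:

```sql
USE SalesDB

-- Create a transactional publication to hold one or more articles.
EXEC sp_addpublication
    @publication = 'SalesPub',
    @repl_freq   = 'continuous',   -- 'snapshot' would make this a snapshot publication
    @status      = 'active'

-- Add a table as an article within the publication.
EXEC sp_addarticle
    @publication   = 'SalesPub',
    @article       = 'CustomersArt',
    @source_owner  = 'dbo',
    @source_object = 'Customers'
```

Note that the article cannot be replicated on its own; it becomes available to Subscribers only as part of the publication that contains it.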

Replication Strategies

Replication, with its many benefits, serves as a backbone for many businesses. Placing the same data on multiple servers, with each server closer to the user's location, can reduce the use of bandwidth and provide the user with faster update operations and retrieval of data. Without replication, businesses would be incapable of carrying out robust branch operations across the globe. DBAs favor replication for the following reasons:

  • In using replication, businesses are capable of having data copied from server to server in a multisite enterprise, providing flexibility and more efficient use of networking resources.

  • Replication allows for a greater concurrent use of data, allowing more people to work with the data at the same time.

  • Copies of the database can be distributed, bringing data closer to the end user and providing a form of load balancing.

  • Replication is perfect for the traveling salesman or roaming disconnected user. It enables mobile users who work on laptops to be updated with current database information when they do connect and to upload data to a central server.

Replication techniques can be applied to three replication models, as well as several different physical models. The physical aspects and models have no direct correlation. The replication model supplies the functionality, whereas the physical aspects lay out the placement and roles of individual servers.

Merge, snapshot, and transactional replication all involve essentially the same basic elements to begin with. However, each model has idiosyncrasies of its own that require some thought during the design of the actual implementation.

The application of the initial snapshot that begins the entire replication process can be compressed and/or saved to a CD so that some of the necessary communications can be offloaded. Doing so makes more efficient use of network bandwidth in slow-link or dial-up environments in particular.

Subscriptions can be set up as either a push subscription or a pull subscription. A push subscription is one in which the Publisher initiates a subscription and provides the basis for scheduling the replication process. The pull subscription is one in which the Subscriber initiates the subscription and provides the basis and timing on which data is to be obtained. In either case, whoever initiates a subscription selects the appropriate articles to be transmitted.
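The push/pull distinction shows up in which server runs the subscription setup. The following is a hedged sketch; the server, database, and publication names are assumptions for illustration, and each call presumes the publication already exists:

```sql
-- Push subscription: executed at the Publisher, which initiates the
-- subscription and provides the basis for scheduling delivery.
EXEC sp_addsubscription
    @publication       = 'SalesPub',
    @subscriber        = 'BRANCH01',
    @destination_db    = 'SalesDB',
    @subscription_type = 'push'

-- Pull subscription: executed at the Subscriber, which initiates the
-- subscription and controls when data is obtained.
EXEC sp_addpullsubscription
    @publisher    = 'HQSERVER',
    @publisher_db = 'SalesDB',
    @publication  = 'SalesPub'
```

Push subscriptions centralize administration at the Publisher/Distributor; pull subscriptions shift the agent workload and scheduling control out to each Subscriber.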

Replication can be set up in various ways. The different scenarios in which replication is set up each provide specific benefits and qualities. SQL Server can serve as Publisher, Subscriber, or Distributor. The individual roles can all be set up on a single machine, although in most implementations there is at least a separation between a Publisher/Distributor and the subscribing machine(s). In some other scenarios, the Subscriber of the data from one machine may republish the data to still other Subscribers. The possibilities are endless. You can choose to implement any of the following common scenarios, which cover the basics of a physical replication model. In real-world scenarios, however, the actual physical model used could have any combination of these elements, based on a business's individual needs:

  • Central Publisher, multiple Subscribers

  • Multiple Publishers, multiple Subscribers

  • Multiple Publishers, single Subscriber

  • Single Publisher, remote Distributor

Central Publisher, Multiple Subscribers

In this form of replication, the data originates at the publishing server, and that original data is sent to multiple Subscribers via the Distributor. Depending on the form of replication used, changes to the data at the destination servers can enable updates to be propagated back to the Publisher and other Subscribers, or it can be treated as read-only data, in which updates occur only at the Publisher. This type of scenario is typically used when a company has a master catalog at the headquarters and has many different Subscribers located elsewhere.

An advantage of this configuration is that multiple copies of the data from a central server are available for user processing on multiple machines. Data can be distributed to the locations where it is needed. With this form of replication, data can be brought closer to the user, and the load on a single server can be reduced. Expensive bandwidth can also be utilized in an improved manner.

Multiple Publishers, Multiple Subscribers

In this replication, every server publishes a particular set of rows that relate to it and subscribes to the publications that all the other servers are publishing so that each of them can receive data and send data. This form is typically used in a distributed warehouse environment with inventory spread out among different locations, or any other situation in which the data being held at each location-specific server needs to be delivered to the other servers so that each location has a complete set of data.

For this form of replication, the correct database design is crucial to having each server publish and subscribe to the correct information. The table structure for the data involved in the publication is usually implemented with a compound primary key or unique index, although it is possible to use an identity column or another algorithm that enables each location to be uniquely identified within the entire table. One portion of the key is an identifier for the location, whereas the second element is the data identifier.
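A table designed along these lines might look like the following sketch; the table and column names are illustrative, but the compound primary key shows the location-plus-data-identifier pattern described above:

```sql
-- Each row is uniquely identified across all sites by combining a
-- location identifier with a local data identifier, so rows published
-- by different sites can never collide at a Subscriber.
CREATE TABLE Inventory
(
    LocationID int NOT NULL,   -- identifies the publishing site
    ItemID     int NOT NULL,   -- identifies the item at that site
    QtyOnHand  int NOT NULL,
    CONSTRAINT PK_Inventory PRIMARY KEY (LocationID, ItemID)
)
```

Each site then publishes only its own rows (for example, those matching its LocationID) and subscribes to the rows published by the other sites.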

Multiple Publishers, Single Subscriber

In the "multiple Publishers, single Subscriber" scenario, a server subscribes to publications on some or all of a number of other publishing servers. This is needed when overall data is required at only one site, possibly the headquarters. Data can be collected from widely dispersed areas, and the central location, the Subscriber, would end up with a master database from all the Publishers combined.

Single Publisher, Remote Distributor

Replication does not need to include a Distributor residing on the same server or even within close proximity to the Publisher. Instead, the machine handling the distribution can be implemented as a totally separate segment. This is practical when you need to free the publishing server from having to perform the distribution task and minimize costs that can be incurred over long-distance or overseas network connections. Data can also be replicated faster and delivered to many Subscribers at a much lower cost while minimizing the load on the Publisher. In situations in which the connection between the Publisher and the Subscriber is over a slow link or over high-cost connections, a remote Distributor should be used to lower the cost and increase data transfer rates.

The individual roles of each server are implemented in all types of replication; the physical configuration does not dictate the type of replication used. The next section examines the types of replication and their implementation.

Types of Replication

Each model provides different capabilities for distributing data and database objects. There are many considerations for selecting a replication type or determining whether replication is a suitable technique for data distribution: transactional consistency, the Subscriber's capability (or lack of capability) to update data, latency, administration, site autonomy, performance, security, update schedule, and available data sources. Each of these considerations is defined in relation to replication, and they are discussed through the next several sections.

Other data distribution techniques that don't involve replication can offer a different set of features but may not provide for the flexibility offered by replication. To determine which replication type is best suited to your needs, consider the three primary factors: site autonomy, transactional consistency, and latency. These three considerations are illustrated in Figure 4.4, which also compares and contrasts data distribution techniques.

Figure 4.4. Data distribution techniques.


When you are in the position of selecting among the data distribution techniques, there are essentially three questions to ask: How soon must data agree at all replication locations (latency)? How much site independence is required in processing changes to the data (autonomy)? How consistent must data remain after changes are made (transactional consistency)? The answer to each of these questions determines the approach used.

Site Autonomy in Replication

Site autonomy measures the effect of your site's operation on another site. A site having full autonomy is completely independent of all other sites, meaning that it can function without even being connected to any other site. High site autonomy can be achieved in SQL Server replication where it would not be possible using other data distribution techniques. Not all replication configurations achieve autonomy; such high site autonomy can be seen best with merge replication.

Site autonomy directly affects transactional consistency. To achieve an environment that both is autonomous and has a high degree of transactional consistency, the data definition must provide a mechanism to differentiate one site from the other. A compound primary key, for example, in a "central Subscriber" or "multiple Publisher, multiple Subscriber" scenario, allows autonomy while achieving transactional consistency. If an implementation enables each site to update the same data, it will always have some degree of transaction inconsistency or at least a delay before consistency is achieved.

Transactional Consistency in Replication

Transactional consistency is a measure of changes made to data, specifically whether those changes remain in place without being rolled back. Changes can be rolled back because of conflicts, which affects user changes and other user activities. In replication you have multiple distinct copies of your data, and if you allow updates to each copy, it is possible for different copies of a piece of data to be changed differently. If this situation is allowed, as is the case in some forms of replication, you have imperfect (low) transactional consistency. If you prevent two copies from changing independently, as is the case with a distributed transaction, you have the highest level of transactional consistency.

In a distributed transaction, the application and the controlling server work together to control updates to multiple sites. Two-phase commits implemented in some forms of replication also help. The two phases used are preparation and committal. Each server is prepared for the update to take place, and when all sites are ready, the change is committed at the same time on all servers. After all sites have implemented the change, transactional consistency is restored.
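A distributed transaction can be expressed directly in T-SQL. This is a minimal sketch; the linked server name REMOTESRV and the table are assumptions for illustration, and MS DTC must be running to coordinate the two-phase commit:

```sql
-- Both updates either commit together at both servers or roll back
-- together; MS DTC performs the prepare phase on each server, then
-- commits on all of them at once.
BEGIN DISTRIBUTED TRANSACTION

UPDATE SalesDB.dbo.Accounts
    SET Balance = Balance - 100
    WHERE AccountID = 1

UPDATE REMOTESRV.SalesDB.dbo.Accounts
    SET Balance = Balance + 100
    WHERE AccountID = 1

COMMIT TRANSACTION
```

Because no server commits until all servers have prepared, the two copies can never disagree, which is what gives this approach its high transactional consistency at the cost of low site autonomy.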

Latency in Replication

Latency can be thought of as how long data in the Subscriber has to wait before being updated from the copy of the data on the Publisher. Several factors contribute to latency, but it is essentially the length of time it takes changes to travel from the Publisher to the Distributor and then from the Distributor to the Subscriber. If there is no need for the data to be the same, at the same time, in all Publishers and Subscribers, then the latency resident within a replication strategy will not negatively affect an application. A two-phase commit, as SQL Server implements through the use of immediate updating, can minimize latency on updates coming from the Subscriber, but it has no effect on updates sent from the Publisher. Latency can be affected by the workload on the Publisher and Distributor, the speed and congestion of the network, and the size of the updates being transported.

Each type of replication offers advantages and disadvantages. You must select the type based on the requirements of the business application. The three types of replication (snapshot, transactional, and merge) move data using different principles.

Use of Snapshot Replication

Snapshot replication distributes data and database objects by copying the entire contents of the published items via the Distributor and passing them on to the Subscriber exactly as they appear at a specific moment in time, without monitoring updates. A snapshot is stored on the Distributor, which encapsulates data of published tables and database objects; this snapshot is then taken to the Subscriber database via the Distribution agent.

Snapshot replication is advantageous when replicated data is infrequently updated and modified. A snapshot strategy is preferable over others when data is updated in batches. This does not mean that only a small amount of data is updated, but rather that data is updated in large quantities at distant intervals. Because data is replicated at a specific point in time and not replicated frequently, this type of replication is good for online catalogs, price lists, and the like, in which the application can tolerate data that is not completely current.

Snapshot replication offers high levels of site autonomy. It also offers a great degree of transactional consistency because transactions are enforced at the Publisher. Transactional consistency also depends on whether you are allowing Updating Subscribers, and what type (immediate or queued).

Transactional Replication

Transactional replication moves transactions captured from the transaction log of the publishing server's database and applies them to the Subscriber's database. The transactional replication process monitors data changes made on the Publisher.

Transactional replication captures incremental modifications that were made to data in the published table. The committed transactions do not directly change the data on the Subscriber but are instead stored on the Distributor. These transactions held in distribution tables on the Distributor are sent to the Subscriber. Because the transactions on the Distributor are stored in an orderly fashion, each Subscriber acquires data in the same order as is in the Publisher.

When replicating a publication using transactional replication, you can choose to replicate an entire table or just part of a table using a method referred to as filtering. You can also select all stored procedures on the database or just certain ones that are to be replicated as articles within the publication. Replication of stored procedures ensures that the definitions they provide are in each setting where the data is to be found. Processes that are defined by the stored procedures can then be run at the Subscriber. Because the procedures are being replicated, any changes to these procedures are also replicated. Replication of a stored procedure makes the procedure available for execution on the local server.
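Filtering can be configured through replication stored procedures after an article is defined. This is a hedged sketch; the publication, article, filter, and column names are illustrative, and both calls assume the article already exists in the publication:

```sql
-- Horizontal filter: replicate only the rows matching a WHERE-style
-- clause (here, a single region's customers).
EXEC sp_articlefilter
    @publication   = 'SalesPub',
    @article       = 'CustomersArt',
    @filter_name   = 'flt_Customers_East',
    @filter_clause = 'Region = ''East'''

-- Vertical filter: drop a column from the article so only the
-- remaining columns are replicated.
EXEC sp_articlecolumn
    @publication = 'SalesPub',
    @article     = 'CustomersArt',
    @column      = 'CreditLimit',
    @operation   = 'drop'
```

Horizontal filtering restricts which rows travel to Subscribers; vertical filtering restricts which columns do, keeping each article to the minimum data the implementation needs.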

Merge Replication

Merge replication is the process of transferring data from the Publisher to the Subscriber, enabling the Publisher and Subscriber to update data while they are connected or disconnected, and then merging the updates after both are reconnected, providing virtual independence. Merge replication therefore allows the most flexibility and adds the most autonomy to the replication process. It is also the most complex type of replication because it enables the Publisher and Subscriber to work virtually independently, combining or merging their updated results whenever they reconnect.

The Snapshot agent and the Merge agent carry out the process of merge replication. The Snapshot agent performs the initial synchronization of the databases. The Merge agent then applies the snapshot; after that, its job is to apply incremental data changes and resolve any conflicts according to the rules configured.

Conflicts are likely to occur with merge replication. A conflict occurs when more than one site updates the same record, that is, when the Publisher and Subscriber have both changed the record since they last shared a common version. When a conflict occurs, SQL Server must choose a single value to use. It resolves the conflict based on either site priority or a custom conflict resolver; you could, for instance, give a user in the HR department higher priority than one from the sales department. A detected conflict is resolved as soon as the conflict resolver is executed. Conflicts can be record based or column based; the default is column based.
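Site priority for conflict resolution is assigned when the merge subscription is created. This is a minimal sketch under assumed names (publication, server, and database are illustrative); a global subscription's priority is a value from 0 to 100, and in priority-based resolution the higher value wins:

```sql
-- Give this Subscriber's changes a priority of 75; when it and a
-- lower-priority site have both changed the same record since their
-- last common version, this site's change is kept.
EXEC sp_addmergesubscription
    @publication           = 'HRMergePub',
    @subscriber            = 'HRSERVER',
    @subscriber_db         = 'HRDB',
    @subscription_type     = 'push',
    @subscriber_type       = 'global',
    @subscription_priority = 75.0
```

Assigning distinct priorities to sites (for example, HR above sales) is the simplest way to make conflict resolution deterministic without writing a custom resolver.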

What has been presented in this chapter outlines the basics of replication, and although we have discussed many of the principles and strategies, the actual configurations can be quite involved. As you have seen, there are many design considerations to weigh, and the administration across the replication sites can be considerable.

If you are to successfully implement replication, you must get much further into the depths of the agents and implementations. These techniques fall under the scope of the administration process and are thus outside the scope of this book. For additional assistance or information, you can find a considerable amount of material in SQL Server Books Online, as well as in Microsoft's Web resources.

The only remaining physical design to discuss would be the implementation of distributed transactions. Transactions in general and their implementation over distributed servers are discussed in Chapter 6, "Programming Business Logic."



    EXAM CRAM 2 Designing and Implementing Databases with SQL Server 2000 Enterprise
    MCAD/MCSE/MCDBA 70-229 Exam Cram 2: Designing & Implementing Databases w/SQL Server 2000 Enterprise Edition
    ISBN: 0789731061
    Year: 2005
    Pages: 154
