Chapter 8: Distributed Database Architecture | Microsoft SQL Server 2005: The Complete Reference: Full Coverage of all New and Improved Features

Database servers of the world unite. I hardly think that a zillion database machines are getting ready to take over the world and crush mankind. But no data store need be an island with SQL Server 2005 in the mix, because the system has all the necessary features that enable it to interconnect with other database servers (including non-SQL Server machines). These include other brands of database management systems, even Microsoft Access, cousin (now long distance) Sybase, and (ahem) Oracle and IBM. This chapter focuses on SQL Server’s powerful replication, log shipping, and database mirroring features, which allow you to create a sophisticated distributed data network and non-clustered fail-over architecture. We will also look at one third-party solution to replicate your SQL Server data from one geographical location to another.

This chapter delves into what can be an extremely complex subject if not done with common sense. It will thus focus on what you need to know to get the skeletal replication, log shipping, or database mirror configuration in place as soon as possible, and how to best prepare your organization for such a topology. Once you have a basic understanding of what you are in for, and what SQL Server’s distributed service consists of, you will have a leg up on getting ready to build a distribution network foundation on which to build more complex distributed services. But first, let’s kick off with some theory

Redundancy is considered an element of high availability, the subject of this and the next chapter. Redundancy provides some level of high availability without the cost of full-blown clustering. Redundancy in SQL Server can be achieved using stand-by servers, replication, log shipping, and database and disk mirroring techniques.

Replicated Databases

SQL Server’s replication features make the platform one of the most powerful distributed database solutions on the planet today. Ever since the birth of data network, IT managers have had the need to distribute data around their enterprises and organizations. There are many reasons to distribute data among a collection of servers. You might have some special needs, but here is a collection that almost all DBAs can identify with:

Low to Medium Availability Making data available whenever and wherever users and data consumers need it.
Redundancy Replication allows multiple sites to hold the same data. These sites may all work autonomously and access the servers closest to them on their network segments, but in mission-critical applications a redundant server holding the same information means that more than one copy of the database is accessible in the event of a disaster. This redundancy is useful when you have multiple sites spread across a wide geographical area and the users need to read the same data, especially for reporting applications. Replication is ideal for standby solutions, and as mentioned previously, low- to medium-availability scenarios. (Other choices for high availability in SQL Server 2005 include log shipping, database mirrors, and fail-over clustering, all of which are discussed in Chapter 9.)
Data migration Replication allows you to maintain multiple copies of the same data, which means that one server can cater to the OLTP read/write connections while another is configured for read-only or read-intensive applications. These might be data transformation services pulling data into staging tables for warehouses, or performing direct, “drill”-intensive work on the data, such as in online analytical processing (OLAP).
Autonomy/store forward Replication allows users, such as field agents and engineers, to work with copies of data while offline. Then when they reconnect, new data is propagated to other databases. Such a scenario could involve entering data into PDAs, mobile PCs, portable computers, and other data entry devices. When these devices are connected to the network, replication automatically makes sure the local data is replicated to the system.
Scale out Replication also allows you to scale out your data tier. In other words, data can be browsed in highly distributed applications, such as when data needs to be browsed from the Web from anywhere on the planet.
Data mining Replication allows you to increasingly aggregate read performance in analysis solutions.
Partitioning Replication also helps reduce conflicts based on multiple user data modifications and queries, because data can be distributed throughout the network and you can partition data according to the needs of different business units, departments, or users.

While availability and redundancy are your typical reasons for maintaining a distributed database network, several other factors require further discussion. Replication plays an important part in data mining and data warehousing operations. It is a useful tool that you can use to get data from point A to point B. Then when it is at point B it can be extracted, loaded into a staging database and prepared for transformation to an analysis database.

You can thus use replication to continuously update data marts and data warehouses. This is a far better solution to making periodic bulk extractions and inserts, which are resource intensive and potentially disruptive of day-to-day operations. In my book, it’s better to have data replication taking place all the time than to spend evenings and weekends doing weekly extractions, loads, and scrubbing, or wasting time figuring out how to get huge amounts of batch transfers into the data warehouse before morning.

Understand that the replication services allow replicating data to be transformed. They do not enable the replication of the Analysis Services objects, such as dimensions or CUBEs. They are used to help you set up a uniform data distribution network that moves data from OLTP (transaction) databases to data warehouses or data mart staging databases where data is scrubbed. These databases are used for drill- or query-intensive operations such as reporting, decision support, or analysis. SQL Server Integration Services (SSIS)-formerly Data Transformation Services (DTS)-can be used with replication services for an end-to-end data transformation and migration service. (SSIS is not part of the core database engine; it is thus not extensively covered in this book.)

The data used in decision support or OLAP services is predominantly read-only (used for queries and analysis). For analysis, database loading snapshot replication or transactional replication are often the types of replication used. But the idea is not to directly get data from the live OLTP system into the database. It makes more sense to replicate to an off-limits (read-only) database and then from there extract the data into the staging area, where you can treat it and scrub it before it is inserted into the data warehouse, where it will remain unchanged. For example, you might want to install surrogate keys into the database, and this is not something you want to do to an OLTP system that is a replication partner with other servers. SQL Server’s snapshot replication technology allows the data and database objects to be copied and distributed exactly as they appear at a specific moment in time on the source server.

Replication is also an ideal solution that caters to the problem of data accessibility while you are disconnected from the main data sources. For example, you are disconnected when you are traveling or working at remote locations, in the car or at a home office that is not connected to the corporate data center all the time. Business users often need to use laptops or handheld computers when traveling, and they always need to access data, often on demand, such as in airports or hotel rooms. This is usually achieved by using a modem to dial into a corporate data center or connect to it via an intranet or the Internet connection, many of which now exist at most American airports.

When working online, you can use replication to receive data from the central server in the corporate data center when you connect to a WAN or LAN, or over the Internet. Then you can make changes to data immediately, or you can modify data offline and propagate the changes to the source databases and to other locations when you reconnect to the network.

Data modifications made at remote servers are performed asynchronously at the original server and then sent to other servers. Transactional replication (using the queued updating option we will later discuss) and merge replication are the types of replication most often used for mobile or disconnected users.

For example, let’s say that my company has sales representatives in various regions who need to take current customer and order information with them and update it as they travel. The corporate office decides to publish data from the CUSTOMERS, ORDERS, and ORDER DETAILS tables stored in a central OLTP database and filter the data by the region where each sales representative works. The laptop or handheld computers used by the sales representatives will be the Subscribers to the data, and the sales representatives will be able to update the data as necessary when away from the office and offline. With replication, when the sales representatives reconnect to the network, they can synchronize their data changes with changes made at other locations.

Replicating is also useful over the Internet or the Web, which allows remote, disconnected, and anonymous users to access data when they need it and wherever they need it. For example, your Web site might allow users to browse items for sale, and you might have a large quantity of such hits to the main server. Using replication, you can make sure that data is available for read purposes on other servers. Since the browsing can take place at any server, it allows the site, using load-balancing technology, to handle more traffic.

Another use of replication and Web-based applications is to allow individual Subscribers to download or upload data changes with an application that uses an Internet browser, or by using a connection to the corporate network or share where the data resides. Obviously, you’ll have collateral items on the IS list to cater to, such as integration with third-party firewalls or with ISA Server and IIS. You might also need to configure the file transfer protocol (FTP) to transfer the data over the Internet or HTTP and Web integration (a new feature for SQL Server 2005). One network solution you’ll need to consider with replication services is the configuration of virtual private networks (VPNs) and virtual local area networks (VLANs).

Establishing Replication Requirements

I found that the best way to get into replication is to investigate your enterprise needs and then formulate a requirements document-a needs synthesis. Let’s suppose your company has regional offices around the world. Some regional offices will only be reading the data, while other offices are responsible for keeping up-to-date information on the customers and orders in their particular regions.

Your regional office might only need to read the data and not make any changes. If that’s the case, the central office can filter the data to create the appropriate partitions, which would be based on region information or some other criteria, and then publish that data to regional servers. The snapshot replication type or the transactional replication could be used to achieve your data distribution goals. It all depends on what you are trying to achieve.

Another example: A regional office or data center makes changes to the data and needs the autonomy of the data on its site. So the data can be filtered and replicated to that region. Then the regional office can make changes to its data as needed. When the changes need to be propagated to the central or corporate data center or other regional data centers, the remote data center can synchronize with the central data center such that the changes now made in the remote data center can propagated automatically to the central data center, which in turn can synchronize with other remote data centers. If the central data center needs to distribute the corporate data to other regional sales forces, it can republish the data to the necessary sites.

So you have several options for scheduling distribution of the data and modifying the data at the different remote data centers. If you maintain a continuous and reliable wide area network, multiple data centers can update the data and propagate the changes to the corporate offices immediately. The data is then propagated to other data centers within seconds (via the service of the immediate updating feature). In the case of a remote data center that is offline for a limited amount of time, data modifications can be stored in a queue until the connection is reestablished. This is called queued updating. So let’s review the components of the replication service before we activate the components of our plan.