Planning for Fault Tolerance


SQL Server 2005 solutions typically represent mission-critical systems that are vital to the functioning of an organization. They store data that simply cannot be lost. Consequently, DBAs are typically responsible for designing appropriate fault tolerance for such SQL Server 2005 solutions.

Planning for fault tolerance is all about identifying the various points of failure in your SQL Server 2005 database solution and providing redundancy at these potential points of failure, which include the following:

  • Server hardware

    • Processors

    • Memory

    • Controllers

    • Disk drive storage

    • Power supply unit (PSU)

  • Network

    • Network cards

    • Hubs

    • Switches

    • Routers

    • Cabling

  • Power supply

In certain cases of catastrophic failure, providing fault tolerance at these levels is insufficient. Thus, you still need a properly implemented backup solution. The problem with backup solutions is that they can take a long time to restore; thus, you can implement different standby solutions to reduce downtime:

  • Cold standby   A cold standby involves restoring SQL Server 2005 backups onto a new SQL Server 2005 instance. Consequently, it has the longest downtime and greatest DBA intervention. We hope your DBAs respond well to pressure!

  • Warm standby   A warm standby solution involves a second instance of SQL Server 2005 running that can be switched to after a manual process from the DBA. Warm standby solutions include replication and log shipping.

  • Hot standby   A hot standby solution involves a second instance of SQL Server 2005 running that can be switched to automatically with a minimal disruption of service. Database mirroring and clustering are examples of hot standby solutions.

Irrespective of whether you have a cold, warm, or hot standby solution, you should have some form of redundancy at the disk drive level. So, we’ll cover RAID before examining the various high availability solutions in detail.

RAID

Disk drives are the “least reliable” hardware component in a computer, mainly because they have more mechanical parts. Disk drives also ultimately store your databases. Consequently, you need to provide fault tolerance at the disk drive level.

Redundant arrays of inexpensive disks (RAID) technology allows you to provide redundancy at the data storage level through the use of multiple disk drives configured as a single disk array.

Note 

You can implement RAID solutions at the hardware or software level. Hardware solutions tend to be superior because they have superior performance, less overhead on the operating system, and more features.

You should already be comfortable with RAID, but we’ll quickly go through the more commonly implemented levels of RAID.

RAID-0

RAID-0 is simply a striped set of disk drives, as shown in Figure 3.1. When implementing RAID-0, consider the following:

  • At least two disk drives are required.

  • There is no loss of disk capacity due to redundancy.

  • RAID-0 provides improved performance for both read and write operations because multiple disk drives can be used simultaneously.

image from book
Figure 3.1: RAID-0

Warning 

RAID-0 provides no fault tolerance whatsoever.

RAID-1

RAID-1 provides fault tolerance by storing information on two separate disk drives, as shown in Figure 3.2. RAID-1 is commonly referred to as mirroring. When implementing RAID-1, consider the following:

  • Two disk drives are required.

  • There is a 50 percent loss of disk capacity due to redundancy.

  • Depending on the implementation, RAID-1 can provide improved read performance. Write performance is the same as a single drive.

  • RAID-1 protects against a single disk failure.

  • It takes a long time to recover from a disk drive failure because the entire volume of data has to be rebuilt on the replacement disk drive.

image from book
Figure 3.2: RAID-1

RAID-5

RAID-5 is a striped set of disk drives, with parity distributed across the disk drives, as shown in Figure 3.3. When implementing RAID-5, consider the following:

  • At least three disk drives are required.

  • One disk drive’s worth of capacity is lost due to redundancy.

  • RAID-5 provides improved read performance.

  • RAID-5 typically provides degraded write performance because five operations have to be performed for each write operation (two reads, two writes, and an XOR operation). This performance overhead can be mitigated by hardware solutions.

image from book
Figure 3.3: RAID-5

Tip 

RAID-5 solutions with a larger number of disk drives offer superior performance because you are reading from more disk drives simultaneously. So if you are considering implementing a RAID-5 solution with three disk drives, you might be better off simply using RAID-1.

RAID-10 (1+0)

RAID-10 is a striped set of a mirror, as shown in Figure 3.4. When implementing RAID-10, consider the following:

  • An even number of disk drives has to be used.

  • At least four disk drives are required.

  • RAID-10 has the same fault tolerance as RAID-1.

  • RAID-10 has the same overhead as RAID-1.

  • RAID-10 can potentially handle more than one disk drive failure.

image from book
Figure 3.4: RAID-10

RAID 0+1

RAID 0+1 is simply a mirror of a striped set, as shown in Figure 3.5. When implementing RAID 0+1, consider the following:

  • At least four disk drives are required.

  • An even number of disk drives has to be used.

  • RAID 0+1 has the same fault tolerance as RAID-5.

  • RAID 0+1 has the same overhead as RAID-1.

  • RAID 0+1 does not have as high a reliability as RAID-10.

image from book
Figure 3.5: RAID 0+1

Note 

The Windows operating system supports only RAID-0, RAID-1, and RAID-5.

As you can see, with the exception of RAID-0 and RAID-10, all RAID levels protect against only one disk drive failure. Consequently, RAID-6 (which you can think of as an extension to RAID-5) has been defined, which is to protect against two disk drive failures. RAID-6 uses two independent parity schemes to provide such redundancy.

Note 

We have covered only the basic RAID solutions. Other RAID levels are available, such as RAID-6, as are other vendor-specific implementations.

High Availability

SQL Server 2005 has a number of high-availability solutions. Microsoft has branded these technologies as “SQL Server 2005 Always-On Technologies”; you can find more information about them at http://www.microsoft.com/sql/alwayson/default.mspx.

In the following sections, you’ll look at the different high-availability technologies for SQL Server 2005 in more detail.

Clustering

The failover clustering technology in SQL Server 2005 provides a high-availability solution by protecting against server failure, including the hardware, operating system, and SQL Server 2005 instance. An automatic failover in a cluster typically takes less than a minute and a half.

Note 

Clustering requires specialized hardware, so it tends to be an expensive solution.

A SQL Server 2005 failover cluster consists of a number of components, as shown in Figure 3.6:

  • A shared disk array, which is used to store the database files

  • A quorum, which contains cluster-specific configuration information

  • A virtual server, which is accessed by the database users

  • A heartbeat, which is used by the cluster to determine whether a node is alive or not responding

  • An active node, which the virtual server is effectively running on

  • A passive node, which is acting as a backup node in the case of the active node failing

  • A private network, which is used for a heartbeat signal between the active and passive node

  • A public network, which is used to connect to the virtual server by the database users

image from book
Figure 3.6: Failover cluster

When implementing a clustering solution, you can use a number of different clustering configurations beyond the single-instance cluster shown in Figure 3.6:

  • Multi-instance cluster   A multi-instance cluster typically has two virtual servers installed in a cluster. The database data and log files for each virtual server are typically installed on a shared storage resource that is dedicated to that virtual server. The primary node for each virtual server runs on separate hardware, as shown in Figure 3.7.

    image from book
    Figure 3.7: Multi-Instance cluster

  • N+1 cluster   An N+1 cluster has two or more virtual servers installed in a cluster along with one passive node (+1). The database data and log files for each virtual server are installed on a shared storage resource that is dedicated to that virtual server. In the case of a primary node for a virtual server failing or being taken offline, the passive node takes control of the shared storage resource for that virtual server. The other virtual servers, those using the previously passive node, are not affected by this failover.

    Note 

    An N+1 cluster can be more cost effective than configuring multiple single-instance clusters, because fewer servers are required.

  • N+M cluster   An N+M cluster has two or more virtual servers installed in a cluster together with two or more passive nodes (M). Again, the database data and log files for each virtual server are installed on a shared storage resource that is dedicated to that virtual server. Again, a separate SQL Server 2005 license is required for each virtual server.

The advantage of an N+M cluster is that you have multiple passive nodes that can be utilized in case of multiple failovers, so the load of multiple primary node failures can be spread across the passive nodes.

N+M clusters are typically used when it is important to guarantee levels of performance. Alternatively, they are used where your passive nodes do not have the appropriate level of hardware to handle multiple failovers.

Database Mirroring

Database mirroring is a new high-availability technology for SQL Server 2005 that provides redundancy at the database level. With database mirroring, transaction log records are sent directly from the principal database to the mirror in real time.

Note 

Databases can be mirrored only if they are using the FULL recovery model.

This technique keeps the mirror database up-to-date with a principal database with no loss of committed data. If the principal server fails, the secondary server can take over. In the case of a database failure, with automatic failover configured, database mirroring can failover in less than 30 seconds.

SQL Server 2005 instances play three different roles in a database mirroring solution:

  • The principal server contains the principal database that is used by database users.

  • The secondary server is acting as a hot-standby solution with the mirror database.

  • The witness server is used in the automatic failover process.

Note 

The failover process can be automatic only when a witness server has been configured; otherwise, it has to be initiated manually.

Database mirroring is implemented at the SQL Server 2005 engine level through software, so you don’t need the specialized hardware that clustering requires.

Database mirroring allows only one mirror database per principal database, and usually database users are not able to access the mirror database.

Note 

Database mirroring relies heavily upon reliable network infrastructure, so it is not suited to WAN links that are unreliable or have low bandwidth.

Log Shipping

Log shipping relies on proven technologies, such as transaction log backups, file copying, and the SQL Server Agent to maintain a secondary copy of a SQL Server 2005 database in the event of a disaster.

Note 

Log shipping does not automatically failover from the primary server to the secondary server when the primary database fails.

When configuring log shipping, you will have to decide upon an appropriate schedule for shipping the logs and how to respond to connected users based on your business requirements, acceptable degree of latency, and operational behavior.

Log shipping was designed primarily as a warm standby solution, so the secondary database on the secondary server is configured by default in NORECOVERY mode. This allows additional transaction logs to be loaded but prevents users from accessing the secondary database.

STANDBY mode offers two configuration choices:

  • The default does not disconnect users from the secondary database when the next shipped log is ready to be loaded. Shipped transaction logs accumulate until all users are disconnected.

  • The Disconnect Users in the Database When Restoring Backups check box, shown in Figure 3.8, disconnects users from the secondary database whenever the next shipped transaction log is ready to be loaded. The disconnection occurs based on the schedule that you have configured for the restore job.

    image from book
    Figure 3.8: Disconnect Users in the Database When Restoring Backups check box

The main advantage of log shipping over database mirroring is that you can have multiple secondary servers to which you have log shipped your primary database’s transaction logs. There also is no dependency on the network infrastructure, as with database mirroring, so you can schedule your transaction logs to be shipped periodically throughout the business day or after hours, as required.

Replication

Replication has been around since SQL Server 6.0 and has consistently improved and offered more options through the different versions.

It is important to understand how the different replication technologies work in SQL Server 2005 to determine the appropriate replication topology for your high-availability solution. The three main considerations are the latency that your database solution can afford, whether database users will need to modify data, and, if so, whether you need to deal with update conflicts.

Note 

Environmental and operational factors will influence the performance of your replication topology, so it is a good idea to set up a test environment to see how it performs within your organization.

Figure 3.9 shows the various components of the replication architecture.

image from book
Figure 3.9: Replication architecture

Understanding the Replication Components

It’s important to understand these various components and the roles they play in replication:

  • Article   An article is basically the smallest unit of replication. It can be a table, view, or stored procedure. It represents the data or a subset of the data from a SQL Server entity.

  • Distributor   The distributor is a SQL Server instance that stores replication’s metadata in a system database typically called distribution. A distributor can be a remote SQL Server instance or can reside on the publisher.

  • Publication   A publication is a collection of one or more articles from a database that form the basis of a unit of replication.

  • Publisher   A publisher is a SQL Server instance that makes its database(s) available for replication.

  • Subscriber   A subscriber is a SQL Server instance that receives replicated data.

  • Subscription   A subscription is a request for replicated data. A subscription can use either a push model or a pull model.

Understanding the Replication Agents

Several replication agents are involved in a replication topology depending on the type of replication configured:

  • Distribution Agent (distrib.exe)   Used in transactional and snapshot replication. It basically is monitoring the distribution database and replicating transactions to the various subscribers.

  • Log Reader Agent (logread.exe)   Used in transactional replication. It runs on the distributor and basically monitors published databases for incoming transactions that it copies as required from the published database’s transaction log to the distribution database.

  • Merge Agent (replmerge.exe)   Used by merge replication to reconcile conflicts depending on its configuration.

  • Queue Reader Agent (qrdrsvc.exe)   Used by transactional replication with the queued updating option.

  • Snapshot Agent (snapshot.exe)   Used by all replication types. It is responsible for the initial schema and snapshot of data that forms the basis of replication.

Understanding the Replication Types

SQL Server replication is extremely flexible, making it a great choice to build complex automated replication scenarios. Replication can occur continuously, or it can be scheduled to run periodically, which makes more sense where you need to replicate across wide area network (WAN) links after hours.

SQL Server 2005 supports four types of replication.

MERGE REPLICATION

With merge replication, data modifications are kept track of via triggers on both the publisher and the subscriber. When the publisher and subscriber try to synchronize, they send each list of modified rows and attempt to merge the changes to get a consistent view of the data. In this type of replication, data modification conflicts can occur, so you need to configure some form of conflict resolution.

TRANSACTIONAL REPLICATION

With transactional replication, you are streaming all the DML operations as required from the publisher to the subscriber. As indicated, a transactional replication has a hierarchy, transactions are replicated from the publisher to the subscriber, and you should not update the subscriber.

However, you have the ability to update subscriptions without data conflicts by using one of the following options:

  • Immediate updating   With immediate updating subscriptions, the subscriber and publisher are updated in a single distributed transaction using the Microsoft DTC. There is a minimal chance of a conflict with this option, but it requires reliable network connections.

  • Queued updating   With queued updating subscriptions, you queue the DML operations, which means you can potentially have a conflict because you effectively allow for simultaneous modification of the same data. Consequently, you have to configure some conflict resolution; the options are as follows:

    • Publisher wins (default).

    • Publisher wins, and subscription gets reinitialized.

    • Subscriber wins.

SNAPSHOT REPLICATION

With snapshot replication, you are sending the entire contents of the articles that make up the publication, instead of incremental changes, across the network according to the schedule. Snapshot replication is not commonly used, but in certain cases it makes more sense to send a snapshot, let’s say nightly, than a stream of all the DML statements that occurred within the database throughout that day.

PEER-TO-PEER TRANSACTIONAL REPLICATION

Peer-to-peer transactional replication is a new feature in SQL Server 2005 that takes advantage of the existing transactional replication technology. It is configured through a number of wizards that help you manage the setup and configuration of your peer-to-peer transactional replication solution. Peer-to-peer transactional replication operates in near real time, so the amount of latency and potential data loss is relatively low.

Note 

Peer-to-peer transactional replication is available only in SQL Server 2005 Enterprise Edition.

With peer-to-peer transactional replication, you need to ensure that the database schemas on all of the peers are identical. Because peer-to-peer transactional replication uses the same continuous synchronization technique available in existing transactional replication technology, there is some inherent latency. If one of your SQL Server 2005 instances goes down, it is possible that not all of its transactions will make it to the other servers.

Note 

Peer-to-peer transactional replication has no built-in conflict detection and resolution technology, as with merge replication. The technology is designed to work so that DML operations for any given data are made at only one database, which is then synchronized with its peers.

Peer-to-peer transactional replication works best when your DML operations are mutually exclusive to each site and is designed to scale out your geographic workload by automatically replicating data between these remote sites, as shown in Figure 3.10. In this example, there are help desks based in Sydney, Cuzco, and Nizhny Novgorod that are used to service a global organization. Each help desk site operates in a window of time mutually exclusive from the other two, so there are no DML conflicts.

image from book
Figure 3.10: Peer-to-peer transactional replication



MCITP Administrator. Microsoft SQL Server 2005 Optimization and Maintenance Study Guide (70-444)
MCITP Administrator: Microsoft SQL Server 2005 Optimization and Maintenance (Exam 70-444) Study Guide
ISBN: 0470127457
EAN: 2147483647
Year: 2004
Pages: 146

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net