Decisions, Decisions

When selecting the right availability technology or technologies for your environment, trust becomes a huge issue: can your company, whether it is a small business or a global powerhouse, bet your server or solution availability, and possibly even the survival of the company, on the investment? That is a lot of pressure, but it is also one of the main reasons that this book emphasizes proper planning: the wrong decision could be costly on many levels.

The Decision Process

The decision process itself is, like achieving availability itself, deceptively simple: analyze your requirements, compare those against the technologies being evaluated, and select the technology that will provide you the best availability based on all factors. This, of course, is easier said than done.

Step 1: Analyzing Requirements

Look back at the guiding principles that came out of the questions posed in Chapter 1. Those guide the rest of the decision process, as they detail the overall availability goals for the solution. The technology should meet those goals, but there will obviously be some compromise involved. No one solution can be 100 percent perfect. The choice or choices made should meet your most important criteria. If perfection is the goal, the end implementation will never happen, and it will be costly on many fronts.

One crucial decision that comes up at this stage is the budget for the project. The amount of money you have to spend greatly affects what type of availability solution you can implement. Do not plan a solution, even down to rollout plans and support documents, only to find out you cannot afford it. As casinos tell gamblers, "Bet with your head, not over it." For example, suppose a regional financial company provides online services 24 hours a day, including all aspects of banking (automated teller machines, bill payments, account balances, investments, and so on). Any moment of downtime could cost millions of dollars in lost transactions, an unacceptable situation on the surface. Geographically dispersed clusters seem like an ideal solution, right? They provide an automatic switch that log shipping does not, should be transparent to end users, and protect you from having one data center as a single point of failure.

Unfortunately, most geographically dispersed clusters, whether pure hardware or a combination of hardware and software, cost, at a bare minimum, hundreds of thousands of dollars. Realistically, the cost is probably in the millions. Add to that the cost of the network to ensure the cluster response time and overall network performance. A wide area network (WAN) of this type is very expensive to implement and requires specialized hardware as well. This combined price tag rules out even many enterprise customers. The sooner you realize that the ultimate in availability might be out of your reach, the better off you will be in planning a solution that provides the best availability you can afford and manage. This is not to say that geographically dispersed clustering is not the solution for you; it might very well be, and it might work well. However, if you cannot afford it, it is not worth putting the company out of business and yourself out of a job just to say you achieved the ultimate in high availability.

With the cost factor out of the way, it is easier to focus on how to make the right decision for your environment, as long as you stick to the guiding principles. Other sample questions to ponder during this phase are:

  • What is the skill level of the staff that will deploy, administer, and maintain the solution? Do they have any experience with the technologies being considered? Do you have the proper staff in place, or do you need to hire? Is there budget to hire more staff?

  • What is the deployment time frame? Will the technology choice meet that goal?

  • Do the applications using the technologies being considered work well with the technology of choice? If the technology is not compatible or does not produce the desired results, it will be a painful experience for everyone involved, making it a bad choice. Consult application vendors or developers to see what their disaster recovery scenarios include for their applications on your platform of choice.

  • If you are building an application and complete solution from the ground up, do the application developers and testers have experience with the technologies, or do they need training? How long will that training take, and how will it impact the schedule?

  • Can you afford to have duplicate equipment for development, testing, staging, and production environments?

  • What hardware will be used? Is it new? Repurposed? Will it actually work and be a supported platform for the technology?

Step 2: Comparing Technologies

Now that you have analyzed your internal situation, it is time to evaluate the technologies. First, pick the version of Windows that meets your current and future growth needs from a memory, processor, and feature standpoint. That said, your decision might change depending on the SQL Server technologies being considered. The operating system choice is usually fairly straightforward, but the SQL Server choice is not. For a direct comparison of the SQL Server-based technologies, see the section "A Comparison of the SQL Server Technologies" later in this chapter. Consult other chapters in the book for more information, attend conferences, ask consultants or salespeople, ask for trial versions, and perform actual proofs of concept. Go into the actual selection process with as much knowledge as you can to make the right decision. Hopefully this book will assist in that process, but it would be naïve to think that it is the only source of information on high availability on the Microsoft platform. It is better to take your time than to make the wrong decision.
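One common way to structure this kind of evaluation is a weighted scoring matrix. The sketch below is purely illustrative: the criteria names, weights, and scores are hypothetical placeholders, not values from this book, and should be replaced with figures derived from your own requirements analysis.

```python
# Hypothetical sketch: a weighted scoring matrix for comparing availability
# technologies. Criteria, weights, and scores are illustrative placeholders.

CRITERIA_WEIGHTS = {  # relative importance of each criterion (sums to 1.0)
    "automatic_failover": 0.30,
    "transactional_recoverability": 0.25,
    "cost_fit": 0.25,
    "staff_experience": 0.20,
}

# Scores from 0 (poor fit) to 5 (excellent fit) for each technology.
SCORES = {
    "failover_clustering": {
        "automatic_failover": 5, "transactional_recoverability": 5,
        "cost_fit": 2, "staff_experience": 3,
    },
    "log_shipping": {
        "automatic_failover": 1, "transactional_recoverability": 4,
        "cost_fit": 5, "staff_experience": 4,
    },
    "transactional_replication": {
        "automatic_failover": 1, "transactional_recoverability": 2,
        "cost_fit": 4, "staff_experience": 2,
    },
}

def weighted_score(tech: str) -> float:
    """Return the weighted score for one technology."""
    return sum(SCORES[tech][c] * w for c, w in CRITERIA_WEIGHTS.items())

def rank_technologies() -> list:
    """Rank all technologies from best to worst fit."""
    return sorted(((t, weighted_score(t)) for t in SCORES),
                  key=lambda pair: pair[1], reverse=True)
```

With these made-up numbers, clustering ranks first because automatic failover is weighted heavily; changing the weights to emphasize cost would change the outcome, which is exactly the point of writing the requirements down first.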

Step 3: Selecting Technology

With all of the relevant questions asked and answered, and everything thought through as much as possible, it's time to make your choice. Again, this is easier said than done, given that you have multiple technologies from which to choose.

A Comparison of the SQL Server Technologies

This section compares and contrasts failover clustering, log shipping, and replication. Backup and restore is woven into the discussion where relevant.

Choosing Between Log Shipping and Transactional Replication

On the surface, log shipping and transactional replication are deceptively similar. They both offer the ability to send transactions from one server to another in an automated fashion to create a warm standby. They both involve a manual switch to a secondary server, both can be used for reporting, and so on. The subtle differences ultimately become the key points.

The biggest considerations are the ones loosely described in the earlier sections about log shipping and replication:

  • How many transactions are generated per hour? Will the solution be able to keep the secondary in sync quickly enough to allow low latency between the two?

  • How much downtime is acceptable to the end users? Will the switch process invalidate any availability SLA you have in place?

  • How much data can you realistically afford to lose in an extreme emergency? Will the solution meet your transactional requirements?

  • Will the solution be too expensive from a hardware, software, or administrative standpoint?

  • Do you have enough network bandwidth between the primary and secondary?

  • How large are the entities (transaction log files or transactions) being sent across the wire?

  • How long will it take to apply the transaction or transaction log to the secondary? Seconds? Minutes? Hours?

  • What is the capacity of the secondary server? Can it handle the current production load?

  • Is it a goal to eventually switch back to the original primary after a failure? If so, how will you perform this task?

The answers must be in line with your business requirements.
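Two of the questions above, acceptable data loss and acceptable latency, can be roughed out with simple arithmetic once you know your log backup schedule. The sketch below is illustrative only; the function names and example numbers are assumptions, and real figures must come from measuring your own workload and network.

```python
# Illustrative sketch: rough worst-case data-loss and latency estimates for a
# log shipping schedule. All parameter values are hypothetical examples.

def worst_case_data_loss_minutes(backup_interval_min: float) -> float:
    """In the worst case, every transaction since the last log backup is
    lost, so the exposure window equals one full backup interval."""
    return backup_interval_min

def secondary_latency_minutes(backup_interval_min: float,
                              copy_time_min: float,
                              restore_time_min: float) -> float:
    """The secondary lags by the backup interval plus the time to copy the
    log backup across the network and restore it on the standby."""
    return backup_interval_min + copy_time_min + restore_time_min

# Example: 15-minute log backups, 2 minutes to copy, 3 minutes to restore.
loss = worst_case_data_loss_minutes(15)    # up to 15 minutes of data at risk
lag = secondary_latency_minutes(15, 2, 3)  # secondary trails by ~20 minutes
```

If 20 minutes of lag or 15 minutes of potential data loss violates your SLA, that is a signal to shorten the backup interval, improve the network, or consider a different technology.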

Transactional replication stores the transactions both at the publisher and the distributor. At the publisher, they are kept in the publication database, and if for some reason they are not published, they still exist there. If they have been replicated, they also exist in the distribution database. Should the publication or distribution databases fail, the log reader process cannot read the transactions.

Log shipping stores the transactions only in the database (and subsequently in the transaction log). Depending on what transaction log backups are available, and assuming you can still get to the transaction log on the primary server if it has been updated after the last transaction log backup, you might still be able to get the remaining transactions. This means that log shipping, unlike transactional replication, offers a higher probability of transactional recoverability.

Replication might require more server overhead than log shipping. Both log shipping and replication will have input/output (I/O) overhead, but remember that replication has replication agents doing work that consumes system resources. If you are sending over every transaction in a highly utilized online transaction processing (OLTP) database, this cannot be discounted. Log shipping has overhead, but because it is run less frequently, the overhead might be less. Log shipping and replication both use the SQL Server msdb database, but neither really uses it more than the other.

Latency between the primary server and the secondary server is also a concern. Although transactional replication might have a lower latency, it also requires a synchronized backup strategy, which log shipping does not. See the topic "Strategies for Backing Up and Restoring Transactional Replication" in SQL Server Books Online for more information. When the replication option Sync With Backup is set on both the publisher and distributor to ensure that both can be restored to the same point, you will get roughly the same latency as log shipping.

For transactional replication, it is paramount that the distribution database is available, so making that SQL Server available impacts the end solution. If you are using the built-in version of log shipping found in SQL Server 2000 Enterprise Edition, the monitor needs to be made available to report status. Not having all pieces available could affect the solution. The backup and restore situation with replication comes into sharper focus if for some reason you need to restore replication for disaster recovery, because if Sync With Backup is not set, replication has to be started from scratch, which could be troublesome.

With transactional replication, there might be an issue if the publisher, distributor, and subscriber are out of sync in terms of the SQL Server version. Generally, a distributor must be of a version equal to or greater than its publishers, and a publisher must be of a version equal to or greater than its subscribers. There are sometimes exceptions to the rule in terms of subscribers. Log shipping should function no matter what version of SQL Server 2000 is installed, but all SQL Servers should be at the same level to ensure compatibility and similar behavior after the role change.

Because the distribution database contains the transactions that will be sent to the warm standby, a failure of the distribution database will impact transactional recoverability if the transactions are lost, as well as the ability to recover the standby after a failure. You should weigh which is more important to your business, as log shipping will give you better transactional recoverability. In a catastrophic failure, both log shipping and replication would need to be set up again.

If using the standby as a reporting server is crucial to your business and you are choosing only one technology, replication might be the better option. Log shipping's requirement for exclusive database access, combined with frequent transaction log loads, makes it less than ideal for read-only access to that particular database. However, you also have the ability to stack transaction logs and apply them later. Obviously, this leaves your log shipped secondary further out of sync, with higher latency, but it can then be used for reporting. If disaster recovery is more important than read-only access, log shipping definitely has the edge in this category.

The process of bringing either the replicated database or the log shipped database online must be considered. Technically, the replicated database is online, but does it have all the elements to assume the production role? Does the log shipped database have all the transaction logs applied? How long does the entire process take? Is it tested? How will you notify the end users? The complexity will vary a little, but historically, the process is much more clearly defined from an availability standpoint for log shipping. Transactional replication and its lower latency might mean that a replicated standby is closer to the source for disaster recovery purposes, but the additional work needed to bring that standby fully into the production role can make log shipping a much more attractive option. You still need to weigh what is more important to your business.
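The role-change questions above lend themselves to a back-of-the-envelope downtime estimate. The sketch below is a hypothetical model, not a measurement; the component timings are assumptions that should be replaced with values observed in your own failover tests.

```python
# Hypothetical sketch: estimating total role-change time (how long end users
# wait) for a log shipped standby. Component timings are assumed examples.

def role_change_minutes(pending_logs: int,
                        restore_per_log_min: float,
                        recovery_min: float,
                        redirect_clients_min: float) -> float:
    """Total downtime = applying the queued transaction logs, recovering the
    database, and redirecting clients to the new primary."""
    return (pending_logs * restore_per_log_min
            + recovery_min
            + redirect_clients_min)

def meets_sla(downtime_min: float, sla_min: float) -> bool:
    """Check an estimated downtime against the availability SLA."""
    return downtime_min <= sla_min

# Example: 4 unapplied logs at 3 minutes each, 5 minutes of database
# recovery, 10 minutes to repoint applications.
downtime = role_change_minutes(4, 3, 5, 10)  # 27 minutes
```

Running the numbers before a disaster, and then validating them with a tested role change, is what turns "how long does the entire process take?" from a guess into a plan.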

Does introducing log shipping or transactional replication add more complexity because there are more moving parts and dependencies? Be certain your availability solution does not end up becoming a barrier to availability! Remember the maintenance of these added points of failure, too. All servers need to be well maintained. Managing and monitoring log shipping is generally easier than it is for replication.

Choosing Between Failover Clustering and Log Shipping or Replication

If you base your decision purely on the choice between automatic or manual switching to a standby server, SQL Server 2000 failover clustering wins hands down. The problems for most people considering it are the aforementioned wasting of resources and the interruption in service that results from the automatic switch. These two factors, coupled with the fact that the hardware is a more specialized solution (even down to driver levels), might scare some away, but they should not. In reality, although SQL Server 2000 failover clustering is a little more complex in some ways than a log shipping or replication solution, in other ways it is not. Failover clustering has moving parts, but you do not need to worry about things like Log Shipping Monitors or making sure that the distributor is up and running to keep the process going. As long as the underlying server cluster is properly configured, there should be no issues with the failover cluster built on top of it.

Now, log shipping or replication might be considered by some to be "poor man's clustering" if Network Load Balancing is thrown into the mix to abstract the role change so that the back-end SQL Servers appear to act as one. In reality, however, it is not a server cluster or a failover cluster, so if you really do need the functionality provided by failover clustering, please consider it.

Using the SQL Server Availability Technology Comparison Table

Here is a guide to the terminology used in Table 3-3:

  • Standby type How Microsoft classifies the solution. The ratings are hot, warm, and cold. Hot represents a solution that provides excellent uptime given all of its features and considerations. Warm provides good availability. Cold is not the best, but it can be achieved. For example, backup and restore would be considered cold.

  • Failure detection Processes invoked by the technology can detect a failure and react to it.

  • Automatic switch to standby No manual intervention is required to switch to the standby.

  • Masks disk failure Does not mean that it completely masks a failure to a local or shared disk subsystem. It means that in the switching process, you will not encounter the same disk failure and can still recover on the standby.

  • Masks SQL process failure If something goes wrong in the process, the technology can handle the problem.

  • Metadata support The level of data definition language (DDL) support offered by the technology.

  • Transactionally consistent The standby server is transactionally consistent.

  • Transactionally current The standby server is at exactly the same point as the primary server.

  • Perceived downtime The time that an interruption in service might occur.

  • Client transparency The client will not notice a problem.

  • Hardware requirements The technology requires specialized hardware.

  • Distance limitations The technology is constrained by distance.

  • Complexity The technology is significantly more complex than implementing a normal SQL Server-based solution.

  • Standby accessible for read-only access The standby server is available for reporting or other read-only functionality.

  • Performance impact The technology impacts overall server performance.

  • Impact to backup strategy A backup and restore plan needs to take into account additional complexities due to the technology.

  • Full-text support The technology supports Full-Text Search in a disaster recovery scenario.

    Table 3-3: SQL Server Availability Technology Comparison Table

    Availability Feature | Failover Clustering | Log Shipping | Transactional Replication
    Standby type | Hot | Warm | Warm
    Failure detection | Yes | No | No
    Automatic switch to standby | Yes | No, but Network Load Balancing might help | No, but Network Load Balancing might help
    Masks disk failure | No, shared disk can possibly be a single point of failure | Yes | Yes
    Masks SQL process failures | Yes | Yes | Yes
    Masks other process failures | Yes | No | No
    Metadata support | Yes, the entire server is failed over | Yes, but only metadata in the log shipped database; nothing outside the database | Only selected objects
    Transactionally consistent | Yes | Yes | Yes (transactional only)
    Transactionally current | Yes, the database in a failover is always at the point it was at the time of failure | No, only as current as the last transaction log applied | No, only as good as the last transaction applied
    Perceived downtime | Usually under two minutes, plus database recovery | Time to switch servers plus any transaction logs that need to be applied | Time to switch servers
    Client transparency | Yes, if referring to the server name or IP address; the application must still reconnect or be made cluster aware, so there might be minimal impact | No | No
    Hardware requirements | Specialized solutions | No | No
    Distance limitations | Normal networking limitations | None | None (limited by network bandwidth, though)
    Complexity | Potentially more | Some | More
    Standby accessible for read-only access | No, only one copy of the data | Possibly, depending on transaction log load times | Yes
    Performance impact | Minimal to none | Minimal: I/O and overhead of log shipping jobs | Minimal: I/O and overhead of replication agents
    Impact to backup strategy | Minimal | Minimal to none, depending on whether transaction logs are already backed up | Yes, because in a disaster recovery scenario the publisher and distributor would need to be in sync
    Full-text support | Yes, full-text functions normally after failover | No, full-text indexes need to be rebuilt after the database is brought online; if the database from the primary was left at the same point the secondary was brought online, you might be able to attach the indexes | Yes, but an index must be configured to be built on the standby as the transactions are inserted

What Should You Use?

The general rule of thumb is that failover clustering provides the highest availability. That statement has merit from a technology standpoint, whether or not a geographically dispersed cluster is involved, because you get an automatic process that switches resources from one computer to another without human intervention.

If you cannot afford SQL Server 2000 Enterprise Edition and the versions of Windows that support clustering, failover clustering is obviously not an option. It might be right for your situation, but if you cannot afford the solution, you need to think about alternatives.

Log shipping is a great primary or secondary solution, and it complements failover clustering well. It is a trusted, time-tested method that is geared toward making a standby server. The issue winds up being the delta of time between the primary server and the secondary server. Log shipping is not an ideal reporting solution, but it is a good availability solution; if reporting is needed, you could combine replication with log shipping to provide that functionality. Transactional replication might work for some, but in terms of availability, it should be considered after failover clustering or log shipping.
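The rule of thumb in the last few paragraphs can be sketched as a simple decision function. This is a deliberately minimal illustration of the ordering described above; the requirement flags are hypothetical, and a real decision weighs many more factors from the requirements analysis.

```python
# Minimal sketch of the rule of thumb: clustering first if you need and can
# afford automatic failover, replication if a readable standby is crucial,
# otherwise log shipping. The boolean inputs are hypothetical simplifications.

def suggest_technology(needs_automatic_failover: bool,
                       can_afford_clustering: bool,
                       needs_reporting_standby: bool) -> str:
    """Map a few high-level requirements to a starting recommendation."""
    if needs_automatic_failover and can_afford_clustering:
        return "failover clustering"
    if needs_reporting_standby:
        # Replication keeps the standby readable; log shipping can be
        # combined with it for disaster recovery.
        return "transactional replication (consider adding log shipping)"
    return "log shipping"
```

Treat the output as a starting point for evaluation, not a verdict; the budget, staffing, and compatibility questions from earlier in the chapter still apply.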

Again, no matter what, remember your backup and restore strategy. In a complete disaster recovery scenario, your backups might be the only remaining option.

This section outlines some basic guidelines. They will help, although ultimately additional considerations based on the technical details of each technology will help you decide which choice is the best fit for your particular environment.




Microsoft SQL Server 2000 High Availability
ISBN: 0735619204
Year: 2006
Pages: 137
