What, your server is down again for maintenance? In today's economy, an hour's worth of planned or unplanned downtime could mean millions in lost sales or unproductive employees. In this chapter, we'll talk about some of the options you have with SQL Server to prevent downtime. Whether through log shipping or through clustering SQL Server, you can make sure that your system has minimal downtime. The chapter will introduce clustering Windows and SQL Server, and the next chapter will cover how to cluster step by step.
You may have reduced your total cost of ownership of your SQL Server with consolidation, but you now have a big problem. Now that you have ten products running on one server, if that one SQL Server goes down, all ten products go down. This could represent anything from a nuisance to millions of dollars of penalties for each hour of outage.
High availability needs are always driven by the business. Once the business evaluates the cost of downtime, it will set the requirements for the technicians. Business requirements must drive the decision because high availability solutions are quite expensive. The business manager's knee-jerk reaction is almost always to say that no downtime is acceptable. This is where the technician must bite his tongue and show the price tag of achieving that.
Downtime is measured in nines. For example, if two nines of uptime is acceptable, this equates to 99 percent uptime (or 3.7 days of downtime yearly). Table 10-1 shows you the complete picture of what 90-99.999 percent uptime equates to in annual downtime.
| Classification | Percentage Uptime | Annual Downtime |
|---|---|---|
| One nine | 90% | 36.5 days |
| Two nines | 99% | 3.7 days |
| Three nines | 99.9% | 8.8 hours |
| Four nines | 99.99% | 53 minutes |
| Five nines | 99.999% | 5 minutes |
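The downtime figures follow directly from the uptime percentage: annual downtime is simply (1 − uptime) multiplied by the minutes in a year. A quick sketch of the arithmetic:

```python
# Annual downtime implied by each level of uptime ("nines").
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

for nines, uptime in [(1, 0.90), (2, 0.99), (3, 0.999), (4, 0.9999), (5, 0.99999)]:
    downtime_min = (1 - uptime) * MINUTES_PER_YEAR
    if downtime_min >= 24 * 60:
        print(f"{nines} nine(s): {downtime_min / (24 * 60):.1f} days of downtime per year")
    elif downtime_min >= 60:
        print(f"{nines} nine(s): {downtime_min / 60:.1f} hours of downtime per year")
    else:
        print(f"{nines} nine(s): {downtime_min:.0f} minutes of downtime per year")
```

Running this reproduces the table above: five nines leaves barely five minutes of slack per year, which is why each additional nine is so much harder to buy.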
If one nine is acceptable, an aggressive backup and recovery plan may be all you need. At four nines of availability, though, you will need a high availability solution such as log shipping or failover clustering.
It's worthwhile to schedule planned weekly, monthly, or quarterly outages where maintenance such as database upgrades can occur. In the case of hardware and operating system upgrades, you can fail over the server to the backup server in a cluster and perform the maintenance on the system.
I already mentioned the price of high availability. That price prevents many small businesses from implementing such solutions, leading them to accept a lower level of availability. Why is it so expensive? Let's take an average mid-grade server and failover clustering (I'll explain clustering later in this chapter). Immediately, failover clustering requires two machines. On average, a dual-processor server with 4GB of RAM costs about $28,000 with Windows 2000 Advanced Server installed. This is based on a price I received from a vendor in early 2003.
Because you're clustering, you'll need to double your equipment price, bringing the total to $56,000. Then, since you're clustering SQL Server, you must run Windows Advanced Server (already included in the price above) and purchase the Enterprise Edition of SQL Server, which is four times more expensive per processor than Standard Edition. The SQL Server total for our fictional server is $20,000 a processor times two processors, or $40,000.
For clustering, you will also need some sort of shared drive system, ranging from low-end shared SCSI to a Storage Area Network (SAN). Conservatively, the price can range anywhere from $15,000 (shared SCSI) to $500,000 (SAN). That brings our server's total package, without the drives, to $96,000; plan on the drives adding a hefty price on top of that.
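Using the chapter's early-2003 ballpark figures, the arithmetic behind the $96,000 base price works out as follows. The numbers are illustrative list prices from the text, not current pricing:

```python
# Rough cost tally for a two-node failover cluster, using the early-2003
# list prices quoted in the text (illustrative only).
server_with_windows = 28_000      # one server with Windows 2000 Advanced Server
nodes = 2
sql_enterprise_per_cpu = 20_000   # SQL Server Enterprise Edition, per processor
licensed_cpus = 2

hardware = server_with_windows * nodes                   # $56,000
sql_licensing = sql_enterprise_per_cpu * licensed_cpus   # $40,000
total_before_storage = hardware + sql_licensing          # $96,000

print(f"Total before shared storage: ${total_before_storage:,}")
# Shared storage then adds anywhere from ~$15,000 (shared SCSI)
# to ~$500,000 (SAN) on top of this.
```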
The cost of high availability is not linear. Hard drive redundancy can generally be accomplished with RAID and good backups. As soon as you hit four nines, the cost jumps significantly because that's when failover clusters become almost mandatory.
Reducing every point of failure is essential in any high availability solution. Chances are you can't remove all points of failure completely, but it's important to document those that could endanger your system. Some components you want to ensure have redundancy are:
- **Power supplies** Redundant power supplies for both the disk arrays and the server prevent downtime if a power supply fails.
- **Memory** Redundant memory can prevent downtime if a memory bank fails. A bad memory chip typically shows itself through server lockups. You have several options, from redundant memory banks to ECC (error-correcting code) memory.
- **Fans** Redundant fans ensure that your server doesn't overheat if a fan fails.
- **Storage devices** A proper redundant disk subsystem, such as RAID, ensures that your data is protected in case of a drive failure.
- **Network cards** Redundant network cards ensure that if one card fails, communication with your SQL Server can continue.
The key with redundancy is to ensure that you have proper monitoring on each of your key devices. Most mid-priced servers ship with software agents that write to the NT event log each time a device fails.
There are a lot of misconceptions about what Windows can do with its high availability solution. Microsoft's high availability solution is not a scalability solution. For example, clustering your SQL Server does not give you a load-balanced system where half your queries go to one node and the other half to the second node. High availability also will not increase the overall capacity of your system; it only adds redundancy to your environment.
Your options with high availability vary widely based on what type of availability you need and how much you're willing to spend. Typically, you have one of the options shown in Table 10-2. The table also spotlights a few of the features of each of the high availability options. We'll spend this chapter and the next two chapters discussing these options.
| Feature | Failover Clustering | Log Shipping | Transactional Replication |
|---|---|---|---|
| Standby type | Hot | Warm | Warm |
| Failure detection | Automatic | No | No |
| Automatic failover | Yes | No | No |
| Learning curve | High | Moderate | Moderate to high |
| Cost | High, due to equipment and Enterprise Edition licensing | Moderate, due to Enterprise Edition licensing | Low, since it doesn't require Enterprise Edition |
| Data always current | Yes, since the drive is shared | No, only as current as the last transaction log backup | No, only as current as the last synchronization |
| Performance impact | None | Impact from file copying over the network | Impact from the constantly running Log Reader agent |
| Failover time | Generally 10-45 seconds | Seconds, longer for a more thorough recovery | Seconds, longer for a more thorough recovery |
| Location of servers | Nearby (<100 miles) | Can be dispersed | Can be dispersed |
**Note** There are some third-party solutions that automate the use of transactional replication for failover purposes. Using transactional replication for this type of setup is tricky and should be considered only for smaller databases.
In Enterprise Edition of SQL Server you can 'ship' transaction logs to a read-only server for backup purposes. This backup server can be geographically distant or in the same server room. I'll talk much more about this option later in this chapter.
Transactional replication lets SQL Server queue up each transaction and push it to another server, whether that server is in the same server room or geographically distant. Publishing transactions to the second server can be performed on a scheduled basis or on a real-time basis. Transactional replication is only recommended for smaller databases since it can be CPU intensive if you're actively writing to the database. I'll discuss much more about replication in Chapter 12.
Failover clustering is my favorite way to handle the high availability problem. It's also the most expensive option and has the steepest learning curve. Clustering gives you a backup server: if a hardware failure or planned outage occurs on the primary server, all of the server's resources fail over to the secondary server. This can happen automatically or manually, and the failover process takes about 30-45 seconds. Typically, users will only have to refresh their application to reconnect to the new server. Clustering does require shared data storage, but the servers don't have to be in the same room, as long as they have a fast connection between them and access to the shared disk.
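Because a failover briefly drops existing connections, well-behaved client code simply retries the connection until the virtual server comes back online. A minimal retry sketch, where `connect` is a hypothetical stand-in for your real database driver's connect call:

```python
import time

def connect_with_retry(connect, retries=6, delay=10):
    """Retry a database connection to ride out a 30-45 second cluster failover.

    `connect` is any zero-argument callable that returns an open connection
    or raises on failure (a placeholder for your real driver's connect call).
    """
    last_error = None
    for attempt in range(retries):
        try:
            return connect()
        except Exception as error:  # real code would catch the driver's specific error type
            last_error = error
            time.sleep(delay)  # wait for the virtual server name to come back online
    raise last_error
```

With six retries ten seconds apart, the client quietly rides out a typical failover window; only an outage longer than a minute surfaces as an error to the user.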