The Planning Process | Microsoft SQL Server 2000 High Availability

Step 1: Breaking Down the Requirements

The bullet points below indicate how you should approach a set of requirements. Do not try to define technology solutions at this stage. As database administrators (DBAs) or system engineers , which many of you reading this book probably are, your natural tendency is to lead with technology. Your goal, however, first and foremost, is to understand the requirements before attaching any technology. Any risks that are drawn up at this stage do not necessarily need answers; they will be answered in time as you move through the process.

Here is what you know so far:

There will be a minimum of two databases: one for reporting and one for your main data. The main data will be OLTP, and the reporting data will be read-only with regular data loading.
Near-real-time reporting looks to be an absolute requirement.
With 700 GB of total disk space needed, following the 80/20 split, there will be 560 GB of data and 140 GB of log. However, that really means 1120 GB of data and 280 GB of log because you have at least two systems hosting copies of the data.
You will have both directly connected and Internet-based users, therefore you will need at least two different security models for the solution.
Because the one system is mainly writes and the other is mainly reads with write updates every few minutes, their disk characteristics are similar.

Here are some questions you must answer:

You have two security models: authenticating users through the Web and directly in the application. Will they be handled the same way in the code?
How will the database logic be implemented ”are you using direct code or stored procedures? This affects your memory requirements.
What is your network bandwidth? Can it handle whatever mechanism you employ for synchronizing the databases?
What type of staffing will be required? What type of expertise is needed, and is it already in-house?
What are the service level agreements (SLAs) beyond the general four nines? Is it four nines from 5 A.M. to 9 P.M. Eastern Standard Time since you cover both North America and Europe? What are the standard support contracts for your company?
Do you need to create a warm standby in addition to the reporting server? Is it possible to make them the same?

At the same time, draw up a list of initial risks:

How will the disks be protected?
Are there any barriers to adopting a technology such as failover clustering?
Where should the servers be placed for ultimate availability?
Will $500,000 be enough? Will trade-offs need to be made?
What is the business impact of downtime?
Are there proper monitoring systems in place to handle the new systems?
What type of maintenance will be needed, and how will it interface with current plans for other systems?
How many Web servers are needed, and do you set up a demilitarized zone (DMZ) environment? Where will each server be placed? (This is outside the database scope, but in an overall project, you need to consider this.)
How will you load test the system? How will you test it in general? Will there be the same equipment to develop and test on? If not, what is the risk from a production standpoint? What is the budget for doing a load test (including dedicated systems and software), and will it be done in-house or externally?

Step 2: Considering Technologies

With enough basic information in hand, you can take the known factors, questions, and risks and turn them into technology decision points. This must be done before you look at purchasing hardware. One of the biggest mistakes that many people or companies make, even without capacity planning information, is to purchase hardware based on a recent system put into production or some basic corporate standard. On paper, whatever the chief purchasing officer buys might seem as if it has the capacity you need, but does it really?

Tackling the database synchronization issue, you have a few choices available to you:

Log shipping This would solve both the reporting and availability and standby issues, but with frequent transaction loads, you really could not use it for reporting.
Replication This would solve the reporting problem, but does the data you need to send fit in the row size limits for replication? If not, that would rule out replication. You would need to consult the developers and development DBAs for these numbers . What method would you choose? If you chose replication, how would you construct your replication architecture?
Data Transformation Services (DTS) packages Creating a DTS package is definitely possible, but how would you monitor such a solution and ensure that it is working? What logic would you put into it? How would it impact the OLTP database? Do your developers have the expertise to code it? Would it affect the development timeline?

From the main OLTP database standpoint, how frequently is data changing? That is, are there many inserts , updates, and deletes occurring or is your data changing by only a small percentage? This impacts your maintenance plans and whether you implement log shipping or replication. How will you protect this database? Failover clustering seems like a logical choice because of automatic failover, but it might be overkill depending on the rest of the solution, or you might not be able to afford it. Remember that cost always lurks in the shadows.

As a first stab, you decide to use failover clustering for protecting the main OLTP database and the distributor ”they will be housed in one instance, transactional replication to populate the reporting server, and log shipping for protecting the entire solution in the event of a whole site failure. Using log shipping means that you not only have another data center but also that you are taking the risk that the log shipped database will serve as both the reporting database and the OLTP database. That might impact performance and would need to be tested .

You also decide to implement this entire solution on Microsoft Windows Server 2003 because it will provide the longest supportability and you know the system will stay in production for at least two years .

Step 3: Designing the Architecture

There are a few phases to designing your architecture: designing the hardware and then dealing with people and processes.

Note	Because people and processes are extremely important in your design, they are covered extensively and in much detail in other chapters. They are also presented at a high level in this chapter, and all considerations are reviewed.

Designing the Hardware

First, examine your disk requirements. On the primary OLTP system, you decide that it will serve as both the Publisher and the Distributor. You need to plan for the OLTP database, a distribution database, msdb usage, transaction logs, MS DTC, and a quorum disk, as shown in Table 11-1. You are looking at more than 700 GB of disk space for each copy of the data.

Table 11-1: Planning the Disk Subsystem
Requirement	Disk Space Planned	Separate LUN
Publisher/main database ”data portion	600 GB	Yes
Publisher/main database ”log portion	200 GB	Yes
Distributor	50 GB	Yes
MS DTC	1 GB	Yes
Quorum	1 GB	Yes

Notice that extra space was allocated to all databases. This allows for some extra room should the estimates provided actually turn out to be lower than what is needed or for additional growth.

Now it is time to address the problem of how to deal with the reporting server and warm standby. Should you use the warm standby as a replication primary? No, because you have clustering in place. It will be used purely for high availability, so you need 800 GB of disk space to match the primary. On the replicated server, you need the 800 GB as well. It is beginning to look as if you will incur a large cost by purchasing enough disk space to host the three copies of the data.

Next , examine your memory requirements. Because you can expand memory as time goes on (although that will incur some downtime), you decide to reduce costs and start with 4 GB of memory and then see how the application behaves before you expand to 8 GB. The /3GB boot.ini switch (described in Chapter 14, Administrative Tasks to Increase Availability ) will be used, allowing SQL Server to use up to 3 GB of total memory, leaving 1 GB for the operating system. The same is true for processor power ”you will start with four processors, but you will buy the fastest available to give you the longest system life. You should also specify that each system be the same brand and, if possible, the same model for ease of management and maintenance.

People and Processes

At this point, start thinking about how you will monitor, maintain, and administer these systems in a production environment. Choosing the technologies involved will dictate how you staff the project from a development and operational standpoint as well as what expertise is needed. If you do not have the expertise in-house, you will need to hire or train people. Along the same lines, is the architecture decided on actually viable for your environment? Will lack of expertise impede its adoption? Is it too complex? Will it meet all growth, performance, availability, reliability, and scalability goals? How will that affect SLAs? Do SLAs need to be rewritten? What backup schemes and other maintenance will need to be revised? Is there monitoring in place? If so, how can it be leveraged? If not, what is needed?

Step 4: Choosing Hardware and Costs

Although it does not matter in what order you configure the systems, this example starts with the cluster. Disks will more than likely represent your greatest cost. Do you go with larger, slower disks or smaller, faster disks? Start by looking at the number of disks needed for an optimal deployment.

Note	The next section goes into depth on cost and disk configuration. Although the amount of detail might be painful, it is necessary.

Server Cluster

Quorum disk Two dedicated disks (must be the same size as all other disks, even though you are using a small portion).
MS DTC Same as the quorum requirements.
Publisher/main data At 600 GB, in terms of raw space, that is 17 36-GB disks or 9 72-GB disks. However, that is raw space only; Windows needs a bit more for formatting. Pad that by at least one drive, meaning 18 and 10, respectively. Now, if you implement RAID 5, add another disk for parity, which translates into 19 and 11, and if you do mirrored stripes or striped mirrors, double the disk amounts to 36 and 20. Now triple these numbers because you have the data on three separate systems.
Publisher/main log At 200 GB, this is approximately six 36-GB disks and three 72-GB disks in terms of raw space. Again, add another disk for Windows formatting, making it seven or four. You do need to mirror your log disk, so double the number, making it 14 or 8.
Distributor (data and log) At 50 GB, you could satisfy this with two 36-GB disks or one 72-GB disk. With RAID 5, that is three or two disks, and if you mirror and stripe, four disks (as that is the minimum for mirroring and striping, otherwise it is just mirroring).
System databases This means mainly msdb and tempdb. Here, plan on at least the same space as the Distributor. Remember, however, that tempdb usage will vary. Because you do not have any testing numbers, you are guessing about the usage you need.

Now you must configure your cluster based on the numbers you determined. Because there is only one main cluster system on your pick list, it makes your decision a bit easier.

You already decided to use the best processor. This will be a two- node cluster. That translates into the 900-MHz model at $81,200 per node (all dollar amounts given are in U.S. currency) or $162,400 for the two cluster nodes alone. You might have to rethink the decision to cluster, as this represents 32.5 percent of your entire budget.
You need two network cards for each node and, at $250 per card, that is $1,000. You are now up to $163,400.
You will use two controller cards per disk enclosure at $5,000 each. Combined with the disk enclosure, which holds 12 disks, each will be $17,000. The cost of the entire disk subsystem will be added in later.
One host bus adapter (HBA) is already included, and you will not have redundant disk paths. This is an availability risk, but you are trying to keep your costs down, and you determine that having only one card combined with clustering is an acceptable risk for this solution.
You need a fibre hub for your disks, which is $3,000, bringing your new total to $166,400.

On the CD
For your disk situation, the math can be found on the CD in the file HW_Cost.xls on the Cluster Disk Costs tab.

As you can see in the file on the CD, HW_Cost.xls, there are many permutations of how you can configure the disk subsystem, with the cheapest being around $94,000 and the most expensive at $140,000. These costs include the number of disk enclosures and controller cards for the enclosures. Assume you will use the option on line 52 ”all 72-GB disks, with RAID 5 used for the Distributor and the system disks, but some form of striping and mirroring for the OLTP system. You are making the trade-off on Distributor write performance and system disks to save cost here and put more into your main OLTP database. This will give you decent speed, large disks, and a smaller amount of disk enclosures. The cost is $115,800, which brings you to a total cost for the cluster of $282,200. That is 56.4 percent of your overall budget.

Tip

Remember, in a real-world situation outside of this book example, take into account the cost of racks and other infrastructure- related items such as determining if you have enough room in the data center to house them or the proper cooling system for the large number of disks and the heat they will generate.

Reporting Server and Log Shipped Secondary

Because you use racks in your data center, only stand-alone system option number two is valid. That translates into the following:

$45,000 for the base system that comes with 2 GB of memory
$3,400 to add another 2 GB of memory
$1,600 for the fibre adapter (HBA) for the server
$1,200 for the fibre hub
$7,000 for the fibre switch
$3,500 for the extra warranty

This calculates to a base cost of $61,700. Add an additional $250 for the log shipped secondary, as you want to use a private network to send the transaction logs, bringing the cost to $61,950 for that system. Those two elements added to the cluster now cost $123,900 for a new total of $406,100. Next comes the disk configuration for the two systems, which can be found on the Rack System “ Non Cluster tab of the Excel spreadsheet. Remember that you will have separate storage area networks (SANs) for each system, which means that you need separate controllers, hubs, and so on. You could use the same SAN, but that would be an availability risk. The cost of the hub and switch was already factored into the price of the systems. Again, the decision is made to use 72-GB disks with mirroring and striping on the data and RAID 5 for the system databases. The cost per system is $93,600, for a total of $187,200. You are now at $593,300, nearly $100,000 over your budget. Because you are over budget, there is no way you can even consider buying test systems without rethinking things because this solution is not viable.

Rethinking the Strategy

Because the reporting server is in the same data center as the cluster, you can utilize the same SAN ( assuming it is on the multicluster list). That would take the server price down to $50,000 for that system (including 4 GB of memory and a fibre adapter). That would put the cluster node costs at $164,400 “$50,000 for the replication server and $61,950 for the log shipped server. Combining the disk space for the SAN choosing the same option (which can be found on the One SAN “ Cluster and Repl tab of the Excel spreadsheet) costs $224,400, and the disk for the log shipped server is $93,600. All of this totals $594,350, which means you still have a cost problem. You now decide to switch to an all RAID 5 “based disk solution on the cluster/replication SAN, using 36-GB 10,000 rpm disks, which costs $168,200. That will reduce the total to $538,150, which is better but still over budget. You decide to go with a RAID 5 data solution for log shipping, which brings the SAN cost down to $61,200, reducing the overall cost to $505,750. You are exposed here because you are filling all 36 slots with used disks, leaving no room for hot spares .

Although you are still over budget, you go to management to explain the trade-offs you had to make and to get approval for the additional $5,750, which, over the life of the system, works out to be just under $240 per month for the two years. Although you have what seems to be a winning design, management questions the need for the log shipping server and says that if they approve any additional money to go over budget, it should be used for a testing system. This puts your availability in a disaster recovery scenario in jeopardy.

You need to factor in the disk enclosure space. Because you need to use the 36-GB 10,000 rpm disks, that is $11,400 for the disks. Because you are using 82 slots and each enclosure has 12 slots, you must add two more disk enclosures for a total of $34,000. That brings the system cost back up to $539,750. If the log shipped system ever becomes the primary, you will need to do backups on that as well. You will need to use 36-GB disks, which will cost $10,925 with RAID 5. Because you have filled up the slots on the three disk enclosures, you will need another two, which adds up to $26,400. Adding this $37,325 to the total, you are now at $577,075.

At this point, you have the following problems:

You are over budget.
You have no test system.
There is not a lot of backup room for every database as it grows, reducing the amount of retention you can have on the active system. You have not budgeted disk space or cost for dedicated backup drives , which is a big problem.
You have what you consider nonoptimal disk configurations for your data (log is fine because it is using RAID 1), possibly exposing you to availability and performance problems.
You have no hot spares for disks.

Unfortunately, there is no good solution without significantly increasing budget. If you eliminate the log shipped server, which dramatically puts you in harm s way in a disaster recovery scenario, you can subtract $160,475, bringing you under budget for the system as it is configured. However, you might lose $160,475 or more in revenue and leads if the system goes down. Will saving money on test systems further increase your risk for potential problems in production?

At this point, you need to make some hard decisions. What is most important to the core solution? Obviously the main OLTP system is required. Up for debate are replication and log shipping as well as the reporting functionality as a whole. If you eliminate the replicated reporting server, go back to using mirrors and stripes for the main production system and allow reporting to take place on it. This would result in $164,400 for the nodes and using 36.4-GB 15,000 rpm disks and a form of mirroring and striping for the data. You decide to use faster, smaller disks to get a performance benefit because you will have more users accessing the system. You no longer need a distribution database, so you need the items shown in Table 11-2.

Table 11-2: New Server Configuration
Function	# of Disks	Cost
Quorum RAID 1	2	$2,000
MS DTC RAID 1	2	$2,000
System databases RAID 5	3	$3,000
OLTP data	36	$36,000
OLTP log	14	$14000
Backups	40	$40,000
Total	97	$97,000

Taking into account the nine disk enclosures at $17,000 per unit, that is a cost of $153,000 for a total disk subsystem price of $250,000. The cluster will now cost $414,400. However, because you are adding load to the system, you add another 4 GB ($8,500) of memory per node, bringing the total cost to $431,400. This solution should satisfy your main concerns about the reporting and main OLTP capabilities. The one question here is your use of tempdb. If it is heavily used, you will need more or separate disks, but you now have the drive space for it, not to mention the room for hot spares, so you have also increased your expandability and availability.

You now have $68,600 left in your budget. This is not enough money for either a test system or a log shipped secondary for disaster recovery. At this point, you need to look at the following options:

Request the additional $91,875 for the log shipped secondary to cover yourself if something catastrophic happens, sacrificing the test system, although the log shipped secondary is not to the same specifications as the production system.
Request the additional $91,875 for the test system to cover yourself if something catastrophic happens, sacrificing the log shipped secondary. Although this test system is not to the same specifications as the production cluster, it would be better than having no test system at all.
Request another $431,400 for a proper test system, no matter what decision you make about the log shipped secondary. This ensures that your production and development environments will be equal. If this is done, the entire solution will cost $1,000,000 or double the original budget.
Document your exposures and hope for the best.

The reality is that the last point ”documenting your risks ”is most likely to be the case, and the remaining $68,600 will go into other aspects such as cost overruns in development. The money might not be spent, but multiple people weighing the risk or reward of doing so will make that decision.