Chapter 12: Disaster Recovery Techniques for Microsoft SQL Server


You can make individual parts of a system fault tolerant. You can design in redundancy for the systems, making the solution highly available. But what happens when a catastrophe strikes and your fault-tolerant, redundant, highly available systems do not weather the proverbial storm? This is where disaster recovery comes into play.

Many will argue that high availability and disaster recovery are not related, and some mistake high availability for disaster recovery. The truth is that high availability planning must include the entire scope of what it takes to keep an application, business, or other enterprise up and running. Technology that makes your solutions fault tolerant, such as failover clustering, might not be sufficient for your disaster recovery needs. For example, although a technology like log shipping can be employed as a primary method of availability, it can also be used in a disaster recovery scenario. People and processes are crucial to all aspects of high availability, and they also form the backbone of most aspects of disaster recovery. This chapter guides you through planning and preparation for disaster recovery, as well as what you can do from a SQL Server perspective to recover your systems.

Planning for Disaster Recovery

There is no easy, add-on solution for disaster recovery. Disaster recovery planning does not start late in the cycle; it starts from the first day of any project, when you are first assessing risk. Contrary to popular belief, each disaster recovery plan is unique, even for the same technology, because companies are different and their tolerances are different. When you seriously plan for disaster recovery and have your local high availability story well defined, two overriding questions govern the rest of the discussion:

  • What are our site disaster risks, and which ones are acceptable?

  • What are our dependencies from the hardware layer right on through to the end user applications?

    Important

    This book is mainly focused on SQL Server, but disaster recovery encompasses all aspects of a particular solution or system, from people to Microsoft Windows and everything in between. No stone can be left unturned in putting together a disaster recovery plan.

Run Book

There are several benefits to keeping historical records of your environment, including every change made to your servers. The first, obviously, is to help you rebuild the environment in the event of a disaster. For this reason a complete document, sometimes known as a run book, is crucial to disaster recovery planning and execution. A run book can contain a subset of the configuration information or a very high level of detail, depending on what suits your needs. The run book should also include contact information, disaster recovery plans, and so on.

The run book can be referred to on a daily basis as an easy way to get information about a system. If you keep contact and related system information in it, it can also help you identify systems that send or receive data from your systems and the people who need to be contacted in emergencies (or those who might be willing to answer some questions if, for example, you're researching elusive performance problems).

A run book does not have to be an actual printed book, although you should be sure to print several copies on a regular basis. One copy (whether electronic or printed) should go out with the backup tapes so that the system can be restored in the event of a disaster at the main site. In addition to printing the run book, you could also put the information on a Web site that is accessible only by authorized users. This could be on an intranet or a secure extranet. For example, you could have a Web page that shows a grid of all the latest service packs, hotfixes, system hardware configuration, and so on. You could also pull that information into a SQL Server database or Extensible Markup Language (XML) and base some self-healing maintenance on it as well. You should also document the organization and procedures for the configuration management of a specific product, project, system, support group, or service.
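
As an illustration of that last idea, the following Transact-SQL sketch uses SQL Server 2000's FOR XML clause to emit patch-level data that such a Web page could consume. The RunBook database and ServerPatchLevel table are hypothetical, a structure you would create and maintain yourself:

-- A minimal sketch, assuming a hypothetical RunBook.dbo.ServerPatchLevel
-- table that you populate as part of your change-control process.
SELECT  server_name,
        service_pack,
        last_hotfix,
        last_updated
FROM    RunBook.dbo.ServerPatchLevel AS ServerPatchLevel
FOR XML AUTO    -- each row becomes an XML element the intranet page can render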

Items to Place in a Run Book

The following categorized list can be used as the basis for a run book or customized operations guide for your database system. This list is based on Microsoft SQL Server 2000, but in concept it applies equally to other versions, and it outlines the configuration items you would use to maintain a high standard of operations support. The list has been reviewed and approved by operations and field personnel within Microsoft, and it represents the current opinions and best practices of those groups.

On the CD

A sample run book can be found on the CD-ROM in the file Run_Book.doc. This can be used and modified to fit your environment; it serves as a good starting point. In addition, many of the worksheets and templates included for use with other chapters in this book can be used as the starting point for information for your run book.

Important

A few of the items that follow note that you must document passwords. First, make sure this is not in violation of any corporate policy. However, if you cannot document passwords needed for vital accounts during a restore, you must ensure that people who know them can be on site during a recovery operation.

SQL Server Administrative Information

There is a minimum set of information that you must keep to fulfill your responsibilities as database system administrator:

  • Maintain information on maintenance plans: all related scripts, how information is transferred for analysis (if it is), how alerts and errors are handled, and remote servers.

  • Maintain information about database backup files such as their location and type (full database, differential, file, filegroup, log) and how current these are, per location. Also record the times that files are backed up to tape (or other removable storage), where those tapes are stored, and notes on related directories, data, or files that must also be backed up. Remember to include the password for each backup if one is set. (A query sketch for pulling this backup history from msdb appears after this list.)

  • From a command line, run Sqldiag.exe and store the result to a text file. Do this on a regular basis, and save the files historically.

    Important

    Make sure you know what to look for in the output of SQLDIAG. If you or your staff is not familiar with it, script the relevant portions separately.

  • Store your Data Transformation Services (DTS) packages as files to enable an easy modular restore or transfer. Make notes regarding logins and passwords for the packages, and any alternate data sources.

  • Create a script of all system or application users and passwords (including the sa account). Also script application roles and their passwords, and any linked or remote servers.

  • Record your software serial numbers, keep a copy of the CDs (including all service packs and hotfixes), and note their location on network file shares.

  • Keep a record of all related interconnected systems, including contacts, configuration information, network topology, and documentation of data interfaces.

  • Record any custom-made database administration objects that you depend on to maintain or administer the server.

  • Record hardware and software vendor support phone numbers and account numbers as well as any login/password information for related Web sites.

  • Record contact information for your remote site, if you have one.

  • Record people to notify in the event that you must re-create standard SQL Server users or reset their passwords.

  • Use a tool like Microsoft Visual SourceSafe to manage script versions (for example, schema, installation and rollback scripts, maintenance scripts, and perhaps even the run book). Scripting is especially important for encrypted objects.

  • Write down contact information. Remember to list these by each person's role, so that if the person changes jobs, you can still find the correct contact. Ideally, you should record a group name (department, e-mail discussion list, and so on).
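
To support the backup-file item earlier in this list, here is a minimal sketch that pulls recent backup history from the msdb system database. The seven-day window is an assumption; adjust it to match your retention policy:

-- Sketch: document recent backups (type, finish time, and device) in the run book.
-- Type codes: D = full database, I = differential, L = log, F = file or filegroup.
SELECT  b.database_name,
        b.type,
        b.backup_finish_date,
        m.physical_device_name
FROM    msdb.dbo.backupset AS b
JOIN    msdb.dbo.backupmediafamily AS m
        ON b.media_set_id = m.media_set_id
WHERE   b.backup_finish_date >= DATEADD(dd, -7, GETDATE())   -- assumed window
ORDER BY b.database_name, b.backup_finish_date DESC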

Analysis Services Administrative Information

There is a minimum set of information you must record if you are using Analysis Services in your solutions:

  • Information on maintenance plans: all related scripts, how information is transferred for analysis (if it is), how alerts and errors are handled, and any remote servers.

  • Setup information about the Analysis Services server. Where is the metadata stored (Microsoft Access database, SQL Server database, or SQL Server Meta Data Services)? Where are the data files located? What are the other configuration parameters for Microsoft SQL Server 2000 Analysis Services?

  • As described in the previous sections, store DTS packages that are used to populate Analysis Services cubes as files.

  • Any custom-made DBA objects that you depend on to maintain or administer the server.

Application System Information

There is a minimum set of information you must record about the non-SQL Server portions of your solutions:

  • List all the applications that must be in place for the system to run (either on the server itself or another system, if that is required). Include custom-built software.

  • Document the application's security architecture, including the type of logins used, any fixed passwords, and role permissions, and note the process for changing passwords on any multiuser logins.

  • Contacts for the application: include developers and anyone else (analysts, testers, managers) who should be involved if a system change is made to the application or to any related system or process.

Database Components

There is a minimum set of information you must record about your databases:

  • Script out all database schemas, collations, jobs, and custom error messages. Anything that can be saved in script form should be scripted and stored historically. User-defined functions, user-defined datatypes, custom roles, triggers, indexes, and object permissions are all easy to miss in the scripting process. Be careful if you intend for your disaster recovery scenario to encompass these objects. (An inventory query to cross-check the scripted objects appears after this list.)

  • Information related to distributed databases or partitions (if applicable), such as data-dependent routing tables and distributed transaction marks.

  • Linked server connections.
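
Because triggers, user-defined functions, and similar objects are easy to miss, the following sketch lists them from the SQL Server 2000 system catalog so that the output can be reconciled against your scripted objects:

-- Sketch: inventory the user objects most commonly missed during scripting.
SELECT  name,
        CASE xtype
            WHEN 'TR' THEN 'Trigger'
            WHEN 'FN' THEN 'Scalar function'
            WHEN 'IF' THEN 'Inline table-valued function'
            WHEN 'TF' THEN 'Table-valued function'
            WHEN 'P'  THEN 'Stored procedure'
            WHEN 'V'  THEN 'View'
        END AS object_type
FROM    sysobjects
WHERE   xtype IN ('TR', 'FN', 'IF', 'TF', 'P', 'V')
ORDER BY xtype, name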

Server Configuration

There is a minimum set of information you must record about your servers to use during disaster recovery:

  • Operating system version, with service pack level and hotfixes.

  • Exact list of hardware in use, and how that hardware is configured.

  • Processor information (for example, speed and class), RAM (amount, speed, and type), and BIOS (manufacturer and version).

  • Dates or version numbers of firmware.

  • Physical and logical disk configuration, including redundant array of independent disks (RAID) levels and disk controller information (including write cache settings), disk type and size, any special options used (that is, allocation units, formerly known as block size), and the use of each disk (explanation).

  • Notes on anything unusual regarding this server, hardware, configuration, or location; for example, if you have disks on different shelves plugged into the same array controller.

SQL Server Configuration

There is a minimum set of information you must record about each SQL Server instance's configuration for use in a disaster recovery scenario:

  • SQL Server installation configuration, including the installation and service pack levels, and any hotfixes that were applied. (A query sketch for capturing this appears after this list.)

  • SQL Server instance names, IP addresses, ports, configuration options, database file locations, service logins and passwords, e-mail accounts, and enabled network protocols and their order.

  • File share information: shares to which the service login must have permission, whether reached by Universal Naming Convention (UNC) names or through other protocols. This can be a potential nightmare in an off-site disaster recovery scenario, so you need to mitigate it for your environment.

  • Configuration information for any other software that runs on the same server. Make sure complete installation and configuration documentation are available and that correct support personnel (or job titles) are listed as contacts. Also list support numbers and Web sites for each piece of software.

  • Note any client tools that must be installed for remote database connections (for example, to heterogeneous data sources). Note configuration information such as what data access components must be loaded on clients.

  • Document any data source names (DSNs) that exist on the server.

  • The setup of SQL Server 2000 failover clustering, replication, and log shipping, along with a description of the configuration and topology.

  • Multi-instance configuration.

  • Notes on anything unusual regarding this server. For example, special features in use such as XML support for Internet Information Services (IIS), Active Directory directory service support, and so on.
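
As referenced in the first bullet of this list, the following sketch captures an instance's identity and patch level using the SERVERPROPERTY function available in SQL Server 2000. Save the output historically, just as with the Sqldiag.exe output:

-- Sketch: record instance identity and service pack level for the run book.
SELECT  SERVERPROPERTY('MachineName')     AS machine_name,
        SERVERPROPERTY('InstanceName')    AS instance_name,   -- NULL for a default instance
        SERVERPROPERTY('ProductVersion')  AS product_version,
        SERVERPROPERTY('ProductLevel')    AS product_level,   -- RTM, SP1, SP2, and so on
        SERVERPROPERTY('Edition')         AS edition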

Network Information and Configuration

There is a minimum set of information you must record about your network topology for use in a disaster recovery scenario:

  • Switch, hub, router, and gateway information.

  • Topology graphics or diagrams prepared in tools such as Microsoft Visio.

  • Management and configuration utilities for switches, routers, and gateways, and the settings for each; how-to instructions or manuals; and scripts.

  • Any redundant path information based on multipath routing.

  • Notes on anything unusual regarding the network.

Storage Information and Configuration

There is a minimum set of information you must record about your disk subsystem configurations for use in a disaster recovery scenario:

  • Switch information.

  • Fabric topology graphics or diagrams prepared in tools such as Visio.

  • Management and configuration utilities for switches, the settings for each, how-to instructions or manuals, scripts, and other items.

  • Any redundant path information based on multipathing.

  • Impact of clustering on the fabric, such as zoning or partitioning.

  • Notes on anything unusual regarding the fabric.

Other Necessary Information

The following items should also be recorded; they do not fall under any of the preceding categories:

  • Complete contact information (home phone, e-mail address, cell phone, pager number, whatever you have) for anyone who might be involved in a disaster recovery. This information might include the following:

    • All of your company's IT resources necessary for the recovery. This would include the technical operations manager, DBAs, system administrators, networking people, and so on.

    • All factory support personnel and contract numbers. For example, your Microsoft Premier Product Support Services (PSS) support representative, the hardware vendor sales representative, and your agreement contract numbers.

    • Off-site data storage facility. You need access to the librarian software to prepare a list of which tapes you need to recall.

    • In addition to the off-site data storage facility, you might also need the name of a high-speed transcontinental courier service. For example, if there is a major blackout in the eastern United States, you might need to recover in the Midwest. When time is critical and overnight services are not fast enough, there are options. Airlines have services that can often provide four-hour bonded courier point-to-point delivery. Research this option in advance and have current contact numbers, drop-off times, and prices available.

    • The vendor information for your hot site location.

    • Senior managers who will be responsible for the recovery and their alternates.

    • Telecommunications managers responsible for networking.

    • Other senior management as needed. For example, general manager, chief information officer, and chief financial officer.

    • Work schedules.

  • Application dependencies.

  • Backup and restore plans.

  • Disaster recovery plans.

  • Service level agreements (SLAs).

  • Like plants, people need care to survive. Two main concerns are sustenance and a place to sleep. You might want to include the following:

    • Phone numbers and addresses of hotels and rental agencies in the area of your recovery site. You should also have backup choices in case they are full.

    • Location of cots, couches, or sleeping bags for on-site personnel.

    • Phone numbers of restaurants that deliver.

    • Names and locations of 24-hour convenience stores in the area where you can get the basics after other stores are closed.

    • Check and see what type of bathroom facilities are available on site, too. For example, are there showers nearby?

    • Note the policy about food and where it can be eaten, and document the break room locations.

      Important

      The run book is only useful if it is meticulously kept up to date and stored on a separate system (which must also be treated as highly available). If you can automate the process, do so.

      Tip

      The run book should be stored both on site and off site, with both copies kept up to date. At least one copy should be a printed document or an encapsulated program or both (not a backup tape or a database that needs a separately installed GUI). Use a versioning tool such as Visual SourceSafe to keep track of all changes made to the run book. Also keep a copy of all installation media off site in the event systems need to be rebuilt.

SLAs, Risk, and Disaster Recovery

Think back to Chapter 2, The Basics of Achieving High Availability, and the discussion there of SLAs. As crucial as they are to high availability, they are even more crucial to disaster recovery. This is where the difference between minutes, hours, or even days of downtime, planned or unplanned, comes into play. SLAs, in the form of support contracts, must exist for every physical and software component of a system to ensure that your SLAs with all parties are met. If you cannot meet your end-user SLA because, for example, your disk subsystem was down and you did not have a proper SLA in place for it, that could be a problem. In some cases this might not be possible, but you need to try. SLAs typically encompass issues such as physical security, money, uptime, and response times.

Obviously, the cost of buying the right SLA must be considered. Does the risk justify the financial cost of addressing it? For example, your company's off-site data storage facility might have limited hours. This is probably not a wise choice, and you would need to find a better solution. In the meantime, management decides to put their off-site backups in a bank safe deposit box for easy access. This is a cost-effective solution because it costs only $100 per year. However, only certain individuals can get at the tapes, and only during banking hours. Alternatively, you might decide to let the DBA on the night rotation take home a set of tapes. This is not only a risk, it might also violate corporate security policy.

When you are purchasing your SLAs, do not take only high availability into account. Remember disaster recovery, too.

Planning Step 1: Assessing Risk and Defining Dependencies

Disaster recovery, like local high availability, is about mitigating risk. Risk comes in several varieties, such as risk to the business, risk to human life, and risk to technology. Unfortunately, assessing risk is a complex task, and each company's risks and tolerance for those risks will be different. The following steps show you the thought process to follow and how to assess risk in your company.

Example Scenario

You have implemented a customized Microsoft Solution for Internet Business (MSIB). This encompasses IIS, SQL Server 2000, Microsoft BizTalk, and Microsoft Commerce Server. This particular solution involves two SQL Server 2000 failover clusters, each with two instances of SQL Server on them.

First, identify the physical servers that are part of this solution and their names. This extends not only to servers providing application functionality, but also to domain controllers and so on. Missing even one server could prove fatal. Taking an inventory of your environment, you see that you have the following:

  • DOMCONT1

  • DOMCONT2

  • STAGING1

  • FIREWALLDOMCONT1

  • FIREWALLDOMCONT2

  • ISASRV1

  • STAGINGSQL1

  • SQL1ANODE1

  • SQL1ANODE2

  • CLU2NODE1

  • CLU2NODE2

  • MSIB1

  • MSIB2

  • BIZTALK1

Next, take an inventory of all of the software configured on each server and make a detailed list, including versions. A final list should also include service packs and hotfixes. Here is a preliminary list for the example environment:

  • Microsoft BizTalk Server 2002

  • Microsoft Commerce Server 2002

  • Content Management Server 2001

  • Internet Information Server 5.0

  • Microsoft Message Queue

  • SQL Server 2000

  • Microsoft Windows 2000

  • Microsoft Windows Server 2003 (for domain controllers)

  • Custom applications

Now it is time to gather information about the instances of SQL Server. The instance names are as follows:

  • SQL1A

  • SQL1B\INST2

  • SQL2A

  • SQL2B\INST2

  • STAGINGSQL1

Table 12-1 lists the databases.

Table 12-1: Databases

Database                  SQL Instance    Use
------------------------  --------------  -------------------------
CustomDB3                 SQL2A           CustomDB3
CustomDB8                 SQL2A           CustomDB3
CustomDB4                 SQL2A           CustomDB3
CUSTOMDB1                 SQL1B\INST2     CUSTOMDB1
Company_CUSTOMDB1         STAGINGSQL1     CUSTOMDB1
ETSBizTalkReference       SQL1A           BizTalk
CustomDB2                 SQL1A           BizTalk
InterchangeBTM            SQL1B\INST2     BizTalk
InterchangeDTA            SQL1B\INST2     BizTalk
InterchangeSQ             SQL1A           BizTalk
MCMS_Data                 SQL2A           Content Management Server
MSCS_Admin                SQL2B\INST2     Commerce Server
MSCS_Admin                STAGINGSQL1     Commerce Server
MSCS_BDPermissions        SQL2B\INST2     Commerce Server
MSCS_BDPermissions        STAGINGSQL1     Commerce Server
MSCS_Campaigns            SQL2B\INST2     Commerce Server
MSCS_Campaigns            STAGINGSQL1     Commerce Server
MSCS_CatalogScratch       SQL2B\INST2     Commerce Server
MSCS_CatalogScratch       STAGINGSQL1     Commerce Server
MSCS_EndeavorCE           STAGINGSQL1     Commerce Server
MSCS_ProductCatalog       SQL2B\INST2     Commerce Server
MSCS_ProductCatalog       STAGINGSQL1     Commerce Server
MSCS_Profiles             SQL2B\INST2     Commerce Server
MSCS_Profiles             STAGINGSQL1     Commerce Server
MSCS_TransactionConfig    SQL2B\INST2     Commerce Server
MSCS_TransactionConfig    STAGINGSQL1     Commerce Server
MSCS_Transactions         SQL2B\INST2     Commerce Server
MSCS_Transactions         STAGINGSQL1     Commerce Server
Inventory                 SQL2A           CustomDB3
Inventory_Asia            SQL2A           CustomDB3
Inventory_SA              SQL2A           CustomDB3
InventoryArchive          SQL2A           CustomDB3
InventoryArchive2         SQL2A           CustomDB3
CustomDB9                 SQL2A           CustomDB3
CustomDB2                 SQL2A           CustomDB3
CustomDB11                SQL2A           CustomDB3
CustomDB5                 SQL2A           CustomDB3
CustomDB7                 SQL2A           CustomDB3
XLANG                     SQL1A           BizTalk

Table 12-2 lists database dependencies and their impact on availability.

Table 12-2: Database Dependencies and Impact on Solution Availability

CustomDB6
  Impact: CustomDB3 (North America) will not work if this database is not running.
  Dependencies: CustomDB1, CustomDB8, Inventory, CustomDB9

CustomDB1 (production)
  Impact: This is a crucial database for the solution. It will, over time, also contain tables for other applications. It is a central point of integration.
  Dependencies: All CustomDB3 databases

CustomDB7
  Impact: These are the database lookups for BizTalk Server (BTS) maps. There is some tolerance here if CustomDB1, BTM, DTA, and XLANG fail.
  Dependencies: N/A

CustomDB5
  Impact: Risk is low for losing data because it can be recovered elsewhere, but it is part of the whole CustomDB2 process.
  Dependencies: N/A

InterchangeBTM
  Impact: Contains the definition of BTS objects. This will not change often, but it is crucial for BTS being up.
  Dependencies: N/A

InterchangeDTA
  Impact: This is the logging database for BTS. The use of the text/image field is configurable, and right now it is turned on. From a high level there is a low risk to the solution if this database fails, because it is used mainly for archiving and reporting purposes. However, it is actually a high risk to the solution, because if BTS cannot log, BizTalk will not be able to work. You will not lose transactions if this happens, though.
  Dependencies: N/A

InterchangeSQ
  Impact: Shared Queue database. This is very important, as it contains the state of anything received and recorded. If you lose this database, you lose the transaction. The CustomDB2 system dials every 60 minutes; CustomDB3 data spikes between 9 A.M. and 12 P.M. for North America and Latin America, 10 hours earlier for Asia, and 6 hours earlier for EMEA.
  Dependencies: N/A

MCMS_Data
  Impact: Content Management Server will be down if this database is not available, and the entire solution will be down because it contains all the site content.
  Dependencies: Commerce Server databases (from a restore standpoint)

MSCS_Admin (production)
  Impact: Will affect Commerce Server, as it is the administrative definition database.
  Dependencies: MCMS_Data (from a restore standpoint)

MSCS_BDPermissions (production)
  Impact: Like MSCS_Admin, this is a configuration database mainly used with BizDesk. BizDesk will be affected if this is down, but the site will not go down.
  Dependencies: MCMS_Data (from a restore standpoint)

MSCS_Campaigns (production only)
  Impact: This will not affect the solution because it is not in use.
  Dependencies: N/A

MSCS_CatalogScratch (production only)
  Impact: Used for virtual catalog generation. There is no need to back it up or worry about it because it only affects BizDesk.
  Dependencies: N/A

MSCS_ProductCatalog (production only)
  Impact: Although the site will not shut down if this database is unavailable (it is the master product catalog), you cannot navigate the solution. You will still be able to register users as long as other databases are up.
  Dependencies: MCMS_Data (from a restore standpoint), MSCS_Transactions (GUID orphans), MSCS_ProductCatalog (staging)

MSCS_ProductCatalog (staging)
  Impact: Very important to the solution, as it contains the copy of the catalog that is pushed out to the production server.
  Dependencies: N/A

MSCS_Profiles (production)
  Impact: If down, no one will be able to buy anything on the site or register users.
  Dependencies: MCMS_Data (from a restore standpoint)

MSCS_TransactionConfig (production)
  Impact: This contains the definition of all pipeline configurations. It is updated with each code push. It is needed for BTS operation, but can be re-created as long as staging is available. If it is down, users will not be able to buy anything on the site.
  Dependencies: MSCS_TransactionConfig (staging)

MSCS_Transactions (production)
  Impact: This is the heart of BTS; it contains all completed transactions, and it will definitely affect the solution. If this database is down, users will not be able to buy anything on the site.
  Dependencies: MCMS_Data, MSCS_ProductCatalog

Inventory_USA
  Impact: This has a minimal impact; CustomDB3 will still be up and running if this database is not available.
  Dependencies: CustomDB1, CustomDB8, CustomDB6, CustomDB9

Inventory_Asia
  Impact: This has a minimal impact; CustomDB3 will still be up and running if this database is not available.
  Dependencies: CustomDB1, CustomDB10, CustomDB12

Inventory_SA
  Impact: This has a minimal impact; CustomDB3 will still be up and running if this database is not available.
  Dependencies: CustomDB1, CustomDB11, CustomDB13

InventoryArchive
  Impact: Minimal; it contains only archived data for CustomDB3.
  Dependencies: CustomDB1

InventoryArchive2
  Impact: Minimal; it contains only archived data for CustomDB3.
  Dependencies: CustomDB1

CustomDB9
  Impact: CustomDB3 will not work if this database is down. Customers and partners are in here, and it contains all item definitions.
  Dependencies: CustomDB1, CustomDB8, CustomDB6, Inventory_USA

CustomDB2
  Impact: CustomDB3 (Asia) will not work if this database is not running.
  Dependencies: CustomDB1, Inventory_Asia, CustomDB12

CustomDB11
  Impact: CustomDB3 (South America) will not work if this database is not running.
  Dependencies: CustomDB1, Inventory_SA, CustomDB13

CustomDB5
  Impact: CustomDB3 (Asia) will not work if this database is not running.
  Dependencies: CustomDB1, Inventory_Asia, CustomDB10

CustomDB7
  Impact: CustomDB3 (South America) will not work if this database is not running.
  Dependencies: CustomDB1, Inventory_SA, CustomDB11

XLANG
  Impact: Similar to InterchangeSQ, but stores state for orchestration schedules. This is very important to keeping BTS up.
  Dependencies: N/A

Once the dependencies are known, you can establish the facts about each product that will influence disaster recovery planning.

Known Facts About Servers

The following is a list of facts for each server product.

BizTalk

  • CustomDB1, CustomDB2, BizTalk, and Commerce Server are all related.

  • If Internet Information Server (IIS), Message Queuing (also known as MSMQ), or Microsoft Distributed Transaction Coordinator (MS DTC) are down, BizTalk will not function.

  • If BizTalk is down, the site will still technically be up, but not fully functional, which essentially means it is down.

  • If BizTalk and MSMQ are down, the site will not be functional at all.

  • If InterchangeDTA is down, you will not be able to report from BizTalk (and BTS will be down because it cannot log). If you restore it to an older point in time, remember that CustomDB1, CustomDB3, and the other databases are interdependent: a transaction might exist elsewhere that does not exist here, and you will need to create a process to get DTA back in sync.

Commerce Server

  • CustomDB1, CustomDB2, BizTalk, and Commerce Server are all related.

  • Commerce Server is dependent on IIS.

  • If you lose your MSCS_ProductCatalog database and need to restore it to an earlier backup, you will invalidate MSCS_Transactions because you will now have orphaned GUIDs.

  • If a problem occurs in Content Management Server or Commerce Server, you will have to restore all Commerce Server databases and Content Management Server databases to the same point if (and only if) the products that exist in one place do not exist in the other. This also means that you will probably need another server to restore the older backups on and reconcile the difference, making it a time-consuming process. If you can restore the catalogs back to the same point in time, this should not be a concern. Otherwise, you will get orphaned postings if products were not added to Content Management Server.

  • Without Commerce Server, there is no solution, because it controls all buying functionality.

  • If Commerce Server order status (that is, MSCS_Transactions) cannot be updated due to Commerce Server being down, the transaction from BTS should roll back if there is a schedule in XLANG, and it might have to be pushed back out. This needs to be addressed.

Content Management Server

  • Content Management Server is dependent on IIS.

  • If a problem occurs in Content Management Server or Commerce Server, you have to restore all Commerce Server databases and Content Management Server databases to the same point if (and only if) the products that exist in one place do not exist in the other. This also means that you probably need another server to restore the older backups on and reconcile the difference, making it a time-consuming process. If you can restore the catalogs back to the same point in time, this should not be a concern. Otherwise, you will get orphaned postings if products were not added to Content Management Server.

  • Without Content Management Server, there is no solution, as it is the presentation server for the MSIB site. All permissions and roles are also stored; the site will not work without authentication, and even if you could authenticate, you need the site content (Content Management Server contains 90 percent of the content).

  • Staging Content Management Server is the master for site content and feeds the production Content Management Server, so staging's availability is crucial for the production and external portion of the solution.

  • The Content Management Server site definition object (SDO) file created by staging updates the production Content Management Server daily. You can take backups from staging and then apply them to production, but not vice versa. The SDO itself is more crucial than the database, because it is the mechanism by which you can re-create objects on the site. To ensure that the company is protected, the following should probably happen, in this order:

    1. Back up the Content Management Server database currently in production (if not done yet; this would be the previous day's configuration).

    2. Back up the staging SQL database prior to generation of the daily SDO file (a backup sketch appears after this list).

    3. Generate the new SDO file, immediately back it up, and then apply it to the production environment.

    4. Back up the newly updated Content Management Server database in production.

    5. If an error is found due to a bad SDO push or inaccurate data in the SDO, the SDO should be repushed to production when it is fixed.

    6. Content Management Server has the ability to support release dates for certain content, so what appears to be an error might not be.
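
For step 2 in the preceding sequence, a minimal backup sketch follows. The database name is taken from Table 12-1; the staging database name in your environment, as well as the disk path and backup name, may differ:

-- Sketch: back up the Content Management Server database before generating
-- the daily SDO file (path and backup name are hypothetical).
BACKUP DATABASE MCMS_Data
    TO DISK = 'D:\Backups\MCMS_Data_preSDO.bak'
    WITH INIT, NAME = 'MCMS_Data pre-SDO backup'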

CustomDB1, CustomDB2, and CustomDB3

  • CustomDB1, CustomDB2, BizTalk, and Commerce Server are all related.

  • CustomDB2 uses CustomDB4, which keeps track of all activity for CustomDB2 trading partners. There are file pointers for every interchange received, and they remain until archived. It is hit every 45 minutes when CustomDB2 goes out to get transactions; this activity lasts for approximately 10 minutes. Risk of losing some data is low. Three systems feed CustomDB2: an EDI Value Added Network (which has its own archiving and can go back four days), the back-end SAP system (which can go back one day), and BizTalk (which also comes from SAP).

  • CustomDB1 has two transactional tables. If the application cannot identify customers, it goes elsewhere.

  • CustomDB1 is hit most heavily in the morning, and that might be a logical time to back the CustomDB1 database up (instead of in the middle of the night).

  • CustomDB1 is a central point of integration and, as time goes on, other applications will use it (such as telcom and asset), so it is important to the solution and the company.

  • If a transaction is not committed, everything will roll back and go into the retry queue (which sits in MSMQ).

  • The CustomDB1 and CustomDB3 databases should be restored in a recovery situation at the same time.

  • CustomDB3 has triggers on tables, which feed MSMQ, which then feeds CustomDB1. If CustomDB1 needs to be rolled back to an older backup, you will be missing entries from CustomDB3, so you need to establish a process to roll forward entries from CustomDB3.

  • SAP provides a flat file extract as a daily feed, once in the morning. The flow is to the staging table, to MSMQ, and finally into CustomDB1. If you need to regenerate data from CustomDB1, it is a difficult process because no transaction ID exists, so you need to devise a disaster recovery process for this interaction.

  • CustomDB3 is broken out by region: North America, Asia, and South America. All backups of dependent databases (see Table 12-2) must be coordinated, which means that you need separate SQL Server Agent jobs that are scheduled to run at the same time to ensure that all dependent databases can be restored to the same point (see the marked-transaction sketch after this list).

  • CustomDB8 might be loaded once or twice a week depending on a reseller sending data. From a recoverability standpoint, the company can go back about a week because the Microsoft Excel spreadsheets are available and the spreadsheet names are stored in the database.

  • You might need a separate server to restore older databases to reconcile differences if a problem happens on one or more databases, especially the dependent ones.
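
One way to make the coordinated restores described above possible is SQL Server 2000's marked transactions, sketched below. The SyncMarker tables are hypothetical one-row tables you would add to each dependent database; the mark lands in the log of every database the transaction touches:

-- Sketch: place a named log mark in each dependent database so they can all
-- be restored to the same logical point with RESTORE ... WITH STOPATMARK.
BEGIN TRANSACTION BackupSync WITH MARK 'Nightly coordination point'
    UPDATE CustomDB1.dbo.SyncMarker SET last_mark = GETDATE()   -- hypothetical table
    UPDATE CustomDB3.dbo.SyncMarker SET last_mark = GETDATE()   -- hypothetical table
COMMIT TRANSACTION

-- During recovery (after restoring the full backups WITH NORECOVERY):
-- RESTORE LOG CustomDB1 FROM DISK = 'D:\Backups\CustomDB1_log.bak'
--     WITH STOPATMARK = 'BackupSync'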

Other Servers

  • Without the DOMCONT1 and DOMCONT2 domain controllers, the SQL Server clusters will not work, and their absence will most likely affect all other functionality as well (not just from an SQL Server perspective), such as MSMQ.

  • MSMQ public queues are dependent on the domain controllers for lookup purposes.

  • Without the Internet Security and Acceleration (ISA) Server domain controllers, ISA will not work.

  • Without IIS, no one will be able to connect to the solution, and Content Management Server, Commerce Server, and BizTalk will be unable to function.

  • If MSMQ is down, BizTalk will not function.

  • If MS DTC is down, BizTalk will not function.

  • Without ISA and the firewall, the solution will not be protected (but it should function).

  • SAP technically has no impact on the site if it is down, but it is part of the solution in one way or another and needs to be brought within its scope. Every message for SAP will be in MSMQ until it can be submitted. Obviously, if MSMQ fails, you cannot send the messages. The data is in Commerce Server, so it could be re-created.

Risks and Unknowns

The following is a list of risks and unknowns that must be accounted for:

  • You do not have a backup until you have done a restore. You need to restore all backups made (possibly on a rotational basis) so that you know the backups being made are usable. This requires hardware that might or might not currently exist. (A verification sketch appears after this list.)

  • You need to determine the retention time of backup files. Currently it is limited by storage space, but that will affect the solution's potential availability if, in an emergency, you find that all of the needed backups are bad.

  • Schedule fire drills and test the disaster recovery plans and process so that it feels like the real thing. Making disaster recovery plans without testing them is not good enough.

  • The company needs to create disaster recovery plans based on this document and the Excel spreadsheets containing the backup information.

  • What will happen if the hardware load balancer fails? How is that being protected?

  • What is the impact of a network failure? How is that being mitigated?

  • Is the company stocking hardware, such as additional disks, host bus adapters (HBAs), network cards, and so on, in the event a component fails in one server?

  • There is no process to move the staging MSCS_ProductCatalog data to the production environment. You need to test that and initiate a way to re-create it.

  • You must revisit the SLAs to ensure that the steps put in place for disaster recovery reflect the time it takes to do the process.

  • When figuring out exact steps, document where you would have to start the whole process by reinstalling the operating system versus other methods (such as third-party software).

  • Include in this disaster recovery document any relevant support information, such as links to the diagrams for the disk configurations. That will speed up any disaster recovery process.
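
For the first bullet in this list, the sketch below shows both the quick media check and the outline of a true test restore to a scratch server. The file paths and logical file names are assumptions; query them with RESTORE FILELISTONLY first:

-- Sketch: RESTORE VERIFYONLY confirms the backup media is complete and
-- readable, but it is not a substitute for an actual test restore.
RESTORE VERIFYONLY FROM DISK = 'D:\Backups\CustomDB1_full.bak'

-- The real proof is restoring under a different name on scratch hardware:
RESTORE DATABASE CustomDB1_Verify
    FROM DISK = 'D:\Backups\CustomDB1_full.bak'
    WITH MOVE 'CustomDB1_Data' TO 'E:\Verify\CustomDB1_Verify.mdf',
         MOVE 'CustomDB1_Log'  TO 'E:\Verify\CustomDB1_Verify.ldf'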

Planning Step 2: Putting the Plan Together

Once you have assembled your lists of known facts, risks, and dependencies and have all pertinent information in hand (such as SLAs), you can put the plan together. This plan should include the exact steps it will take to recover the system and who is responsible for what task (job roles mainly; a list maintained in the run book should tell you who is currently responsible for the action). Other information that you need to put into your disaster recovery plan includes the following:

  • Define a go/no-go point for executing the plan. Many companies like to try to solve the problem on their own before engaging any type of support such as Microsoft's PSS, original equipment manufacturers, and service providers. Keep in mind that seconds and minutes matter with SLAs. If you want to try to diagnose the problem on your own, set a reasonable time frame (say, 15 minutes) to solve the problem. If you cannot do it within that time, pick up the phone. Also keep in mind that if you have, say, a four nines SLA (meaning less than 52 minutes of downtime a year), do you really want to try to diagnose the problem yourself? Putting in explicit instructions avoids what is known as the hero syndrome. A clear understanding of what skills and knowledge you have in house can make it clear which failures can be addressed internally and which ones must be escalated immediately.

  • Define a chain of command. Chaos can quickly ensue with too many chefs and not enough bottle washers, or plenty of bottle washers but no chef to direct them. Roles and responsibilities should be clear. Even if the people performing the job change, everyone should know how they fit into the works and who reports to whom.

    Important

    A chain of command can solve a bottleneck when dealing with support. Often, when support professionals are contacted, they require a seemingly large amount of information, potentially including a copy of your database. Constructing barriers that prevent the support professional from helping you only increases your downtime. Remember that reputable support organizations are not there to peruse your potentially sensitive data or steal your trade secrets. They want to help you get up and running. Work out any corporate security issues that relate to external organizations prior to contacting support as part of your SLA. Conversely, if you will never be able to work with a support organization because you cannot divulge certain information, you might want to rethink your strategy. Is it better to be down or up?

  • Define a communications plan. With a chain of command, you must establish clear communications. Communications plans are based partially on being able to reach everyone who needs to be involved, so make sure that any type of contact information is updated in the run book. This also goes for any type of work schedules, on-call schedules, and so on.

  • Define schedules for the recovery plan itself. A recovery is often a 24-hour operation. It is neither practical nor smart to expect one person to oversee or work a particular aspect of the plan from the beginning to the end. People become fatigued after a period of time and might make poor decisions. The correct solution is to define shifts and staff them appropriately.

  • Define milestones. These are events on the critical path that, when complete, represent forward movement. Milestones make it easy to gauge your progress. Some examples follow.

    • All recovery personnel have been notified.

    • Personnel have shown up or been called in and are ready.

    • Call numbers for the site and personnel are available both in print and in e-mail.

    • The backup server is physically ready.

    • Operating systems and service packs are installed.

    • SQL Server is installed.

    • Application software is installed.

    • Tapes have arrived from off-site facilities.

    • The restoration of the database has begun.

    • Periodic updates are provided on the restore operation's percentage complete.

    • The database restore is completed.

    • Logs have been reviewed.

    • Results of various diagnostics attest to the health of the database (for example, DBCC CHECKDB).

    • Applications have been started and connected to the database.

    • Application test reports confirm validity of the data.

    • Management has made the go/no-go decision.

  • Define acceptance criteria and a testing suite for a go/no-go decision. The confidence of recovery personnel in the successful accomplishment of milestones is key. For example, if the restore finished but DBCC CHECKDB failed, then clearly there is a lack of confidence in the recovery. (A sketch of that health check follows.) Problems influencing a decision to go live can be independent of the database recovery, of course. If SQL Server is completely operational, yet the network is not functional at the time, then clearly the users are not going to be coming online.
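
A minimal sketch of the database health check called out in the milestones and acceptance criteria above (the database name is illustrative):

-- Sketch: post-restore integrity check; any reported error should be
-- treated as a no-go until investigated.
DBCC CHECKDB ('CustomDB1') WITH NO_INFOMSGS, ALL_ERRORMSGS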

When All Else Fails, Go to Plan B

Do not be naive and assume that your plan is perfect. There could be something beyond your control (such as data corruption or inability to get a new disk drive) that might mean you are down for a longer time. Always have a trick up your sleeve and create a backup plan so that if your primary plan fails, you can still get up and running sooner rather than much, much later.

Tip

You might even want to execute Plan B in parallel with Plan A so that there is no lag time in the event Plan A fails.

More Info

For links to SQL Server-specific disaster recovery Knowledge Base articles that contain additional information, consult Knowledge Base article 307775, INF: Disaster Recovery Articles for Microsoft SQL Server. You might also want to print some of these articles and incorporate them into your test plans or run book.



