At some point, you might be called on to execute the plans you have put into place. Successful execution means more than just getting the correct bits on the disk; it also depends on the process and organization that get the team to that point. If you have not tested your plan, and so cannot be confident that it will work or how long it will take, you could be in for a rude awakening. It cannot be stressed enough how much preparation shortens downtime. Keep in mind, however, that even with careful planning and testing, you can encounter a situation that is beyond your control, and you could still miss the agreed-on timeframes detailed in your SLAs. These are the realities of disaster recovery, which is by nature a difficult and pressure-filled operation. Do not plan for disaster recovery, or go into executing a plan, thinking that everything will go as planned; expect the unexpected.
During the execution of the actual recovery, everyone involved should be making notes about what is working and what is not, or where things are inaccurate. These notes can be logged directly into the run book text as each step is executed. Notations such as "This did not work for the following reason" and "This is what had to be done to resolve the issue" allow process and procedure improvement. Without such notes, recall after the fact will not be complete or accurate, so any postmortem used to improve the process and procedures will not be as productive as possible, and could even cause future harm if incorrect information is acted on. The postrecovery meeting should assess what went well and what did not, and the feedback should be reflected in an updated disaster recovery plan.
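One way to capture these notations consistently is to record them as structured entries while each run book step is executed. The following Python sketch is only an illustration under stated assumptions: the `StepNote` class, its fields, and the `postmortem_items` helper are hypothetical names, not part of any run book tool, and the sample steps are invented for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class StepNote:
    """One annotation made against a run book step (hypothetical structure)."""
    step: str                 # run book step being executed
    author: str               # team member writing the note
    worked: bool              # did the step work as written?
    problem: str = ""         # why it failed, if it did
    resolution: str = ""      # what had to be done to resolve the issue
    logged_at: datetime = field(default_factory=datetime.now)

    def render(self) -> str:
        """Format the note as a line that can be pasted into the run book text."""
        head = f"[{self.logged_at:%Y-%m-%d %H:%M}] {self.author}, step '{self.step}':"
        if self.worked:
            return f"{head} worked as written."
        return (f"{head} did not work for the following reason: {self.problem} | "
                f"This is what had to be done to resolve the issue: {self.resolution}")


def postmortem_items(notes: List[StepNote]) -> List[StepNote]:
    """Return only the notes that describe a failed or inaccurate step."""
    return [note for note in notes if not note.worked]


if __name__ == "__main__":
    # Sample entries are invented for illustration only.
    notes = [
        StepNote(step="Restore master and msdb", author="Jane", worked=True),
        StepNote(step="Verify SQL Server Agent jobs", author="Bill", worked=False,
                 problem="Job owner account was missing on the rebuilt server.",
                 resolution="Recreated the account and re-enabled the jobs."),
    ]
    for note in postmortem_items(notes):
        print(note.render())
```

Because each entry is timestamped when it is written, the postrecovery meeting can work from what was recorded at the time rather than from memory.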
The following is a simple example of a basic recovery plan execution.
Name: Tim
Role: Operations Manager
Recovery responsibility: Coordinate resources, interface with management and business users. Responsible for go/no-go decision. Sends out status e-mails.
Rotation: Watch beginning of recovery. Notified as needed during the restore. End communication with business users.
Names: Jane and Bill
Role: Database administrators
Recovery responsibility: Install SQL Server software. Restore SAP, msdb, and master databases. Ensure that SQL Server Agent jobs are running properly. Run DBCC CHECKDB. Review error log and make sure it is clean. Make sure that database network libraries are configured properly. Install SQL Server client on the application servers.
Rotation: Jane will start the restore. Bill will come on at 2:00 A.M., finish up the restore, and check the DBCC results and logs.
Role: OS administrator
Recovery responsibility: Install operating system. Apply service packs and security patches. Confirm configurations: disk letters and configuration are as expected, disk is formatted for 64K blocks. Make sure the firmware is correct, network cards are set to proper duplex, and so on. Check event viewer and error logs.
Rotation: Build the server on the front end. Be on site for any other issues as they arise.
Role: SAP basis administrator
Recovery responsibility: Build SAP instance. Confirm profile parameters. Ensure connectivity to the database. Check that network library configurations are correct. Make sure batch jobs are running appropriately. Confirm the transport paths are in place and operational. Bring SAP up so it is open to the users.
Rotation: Will configure SAP and check the status at the end. Configure SAP at 11:30 P.M. and check at 6:00 A.M.
9:28 P.M. A power surge crashes the storage area network and corrupts the database. The server itself crashes, and the database is now unavailable. Operations pages the on-call production manager.
9:30 P.M. Operations manager receives notification and dials in to check the system. Recovery team called. Disaster recovery plan activated.
9:45 P.M. Tim calls the CIO and notifies him or her that a business interruption has occurred. Tim posts the first e-mail to business users about the status and plan as well as notification checkpoints.
10:15 P.M. The team starts showing up. Food and beverage orders are taken. Pizza is ordered for the team and coffee is put on.
10:50 P.M. Off-site backup tapes are recalled.
11:00 P.M. Recovery begins. Backup server is confirmed to be on the latest software and firmware revisions. Service packs and security patches are applied. Status e-mail is sent out.
11:30 P.M. SQL Server installation begins. Service packs are applied. When this process is complete, a status e-mail is sent out.
11:36 P.M. SAP is installed and configured on the central instance. Application servers are recovered and the profiles are adjusted for the new database server.
11:45 P.M. Tapes arrive. Contents are confirmed. Another status e-mail is sent.
Midnight Database restore begins. Master and msdb are restored. Another status e-mail is sent.
12:30 A.M. SAP database restoration begins. Another status e-mail is sent.
2:00 A.M. Restore is 39 percent complete. Estimated time to completion is 3.5 hours more. A status e-mail is sent out with an estimated completion of 5:30 A.M.
4:00 A.M. Restore is 78 percent complete as seen through RESTORE with STATS. Another status e-mail is sent out.
5:30 A.M. Restore is complete. DBCC CHECKDB is started. SAP is brought up and locked in single-user mode. SAP diagnostics are begun. Another status e-mail is sent out.
6:00 A.M. SAP diagnostics are complete, and SAP is confirmed to be okay.
7:18 A.M. DBCC CHECKDB complete and clean. Status e-mail is sent out to technical team.
7:20 A.M. Go-live decision is made by the operations manager in conjunction with the CIO. A go is given to start up production at half past the hour.
7:30 A.M. SAP is started and error logs are checked. Business users are notified. Work can commence.
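The progress readings in the timeline above (a restore that begins at 12:30 A.M. and reports 39 percent at 2:00 A.M. and 78 percent at 4:00 A.M.) can be turned into a rough completion estimate for the status e-mails. The following Python sketch shows one simple way to do that. It assumes progress is linear, which a real restore does not guarantee; the function name `estimate_completion` and the calendar date used are hypothetical.

```python
from datetime import datetime


def estimate_completion(started: datetime, now: datetime,
                        percent_complete: float) -> datetime:
    """Extrapolate a finish time, assuming the restore progresses linearly."""
    if not 0 < percent_complete <= 100:
        raise ValueError("percent_complete must be greater than 0 and at most 100")
    if now < started:
        raise ValueError("now must not be earlier than started")
    elapsed = now - started
    # Remaining time scales with the share of work still outstanding.
    remaining = elapsed * (100 - percent_complete) / percent_complete
    return now + remaining


if __name__ == "__main__":
    # The date is arbitrary; only the times of day come from the timeline.
    restore_started = datetime(2024, 1, 2, 0, 30)
    readings = [
        (datetime(2024, 1, 2, 2, 0), 39.0),
        (datetime(2024, 1, 2, 4, 0), 78.0),
    ]
    for taken_at, percent in readings:
        eta = estimate_completion(restore_started, taken_at, percent)
        print(f"{taken_at:%I:%M %p}: {percent:.0f}% complete, "
              f"linear estimate {eta:%I:%M %p}")
```

With the timeline's figures, straight-line extrapolation gives roughly 4:20 A.M. from the 2:00 A.M. reading and roughly 5:00 A.M. from the 4:00 A.M. reading. Both are earlier than the 5:30 A.M. completion that was communicated and achieved, which suggests treating such a calculation as a lower bound and padding the estimate sent to business users.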
Every company's recovery is different, because personnel and capabilities differ; hence, each schedule will be different. The point is that one person cannot be expected to do it all, so roles and responsibilities need to be parceled out appropriately.