Developing a Recovery Plan

The skill and experience of support personnel are crucial to getting failed systems back online with minimal disruption to your business. Support staff need to be trained to troubleshoot problems and to implement recovery procedures when problems occur.

In preparing a recovery plan, start by imagining some typical scenarios. Your plan needs to answer the following questions:

  • Does each operator know how to restart the computer when the disk containing the operating system fails?
  • Do you have a set of Windows 2000 Setup floppy disks that you can use to start a disabled Windows 2000 computer? Do you have a Windows 2000 startup floppy disk?
  • When was the last time you tested the Windows 2000 Setup floppy disks or the Windows 2000 startup floppy disk?
  • If a controller fails, how long will it take to replace the controller? Is the hardware configuration information immediately available?
  • If a hard-disk drive fails, do you know how to replace it? If the drive was part of a fault-tolerant disk set, can you replace the disk and quickly fix the mirrored or RAID-5 volume?

Efficient recovery from system failures requires practice. Schedule drills several times a year that simulate computer crashes and disk failures.

Computers that have recently been taken out of service or are being prepared for production service can be used for training, or you can configure computers specifically for testing and training. Use training sessions and drills to update and document recovery procedures.

Testing Your System for Possible Problems

Testing is an important component of your contingency planning. Use testing to predict failure situations and to practice recovery procedures. Be sure to stress test all functionality.

The following list identifies some of the component failures that you need to test for:

  • Individual computer components, such as hard disks, controllers, processors, and RAM.
  • External components such as routers, bridges, switches, cables, and connectors.

The following are some useful situations to simulate in your stress tests:

  • Heavy network loads.
  • Heavy disk I/O to the same disk.
  • Heavy use of file, print, and application servers.
  • Large numbers of users logging on simultaneously.
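
As one way to simulate heavy disk I/O to the same disk, the following Python sketch starts several threads that each write a file to one directory and then read it back. It is a minimal illustration and not a tool from Windows 2000 or from this guide; the names run_disk_stress and stress_worker and all parameter values are assumptions chosen for the example.

```python
import os
import threading
import time


def stress_worker(directory, worker_id, blocks, block_size, verified):
    """Write one file block by block, force it to disk, then read it back."""
    path = os.path.join(directory, f"stress_{worker_id}.bin")
    data = os.urandom(block_size)
    with open(path, "wb") as f:
        for _ in range(blocks):
            f.write(data)
        f.flush()
        os.fsync(f.fileno())  # push the data past the OS cache to the disk
    bytes_read = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(block_size), b""):
            bytes_read += len(chunk)
    verified[worker_id] = bytes_read


def run_disk_stress(directory, workers=4, blocks=64, block_size=64 * 1024):
    """Run all workers against the same directory and time the whole run.

    Returns (elapsed_seconds, {worker_id: bytes_read_back}).
    """
    verified = {}
    threads = [
        threading.Thread(
            target=stress_worker,
            args=(directory, i, blocks, block_size, verified),
        )
        for i in range(workers)
    ]
    started = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - started, verified
```

Point the directory argument at a folder on the volume you want to load, and raise the worker and block counts until the test resembles your expected peak. The sketch leaves its stress_N.bin files in place so that you can inspect them; delete them after the test.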

Testing Recovery Procedures

Once you have created a set of Windows 2000 Setup floppy disks, a Windows 2000 startup floppy disk, and an Emergency Repair Disk (ERD), and have backed up the system state data, use the floppy disks, the ERD, safe mode, and the Recovery Console to practice recovering from problems. Practicing reinforces the habit of backing up system state data and user data regularly, and it shows you how long each procedure takes.

Your testing needs to help you determine the best recovery procedure for a particular situation. Determine when to use the set of Windows 2000 Setup floppy disks, the Windows 2000 startup floppy disk, safe mode, and the Recovery Console to restart your computer and when to use the ERD and Backup to replace files.

Your test computer needs to allow you to conduct the following tests:

  • Look at the MBRs, partition tables, and boot sectors.
  • Find the backup boot sector on an NTFS partition.
  • Deliberately destroy and recover MBRs and boot sectors.
  • Delete Windows 2000 system files and restore them by using the ERD.
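
To make the first test in the list above more concrete, the following Python sketch decodes the partition table from a 512-byte master boot record (MBR) image. It assumes the classic MBR layout: four 16-byte partition entries starting at byte offset 446 and the signature bytes 0x55 0xAA at offset 510. The function names and the sample sector are made up for illustration. Reading sector 0 from a real disk requires administrative access and is deliberately left out.

```python
import struct

SECTOR_SIZE = 512
PARTITION_TABLE_OFFSET = 446   # the table follows the boot code
ENTRY_SIZE = 16                # four entries of 16 bytes each
BOOT_SIGNATURE = b"\x55\xaa"   # last two bytes of a valid MBR

# Entry layout: status, 3 CHS bytes (skipped), type, 3 CHS bytes (skipped),
# first sector (LBA), sector count. All integers are little-endian.
ENTRY_FORMAT = "<B3xB3xII"


def parse_mbr(sector):
    """Return the non-empty partition entries found in a 512-byte MBR."""
    if len(sector) != SECTOR_SIZE:
        raise ValueError("an MBR sector is exactly 512 bytes")
    if sector[510:512] != BOOT_SIGNATURE:
        raise ValueError("missing 0x55AA signature; sector may be damaged")
    partitions = []
    for index in range(4):
        start = PARTITION_TABLE_OFFSET + index * ENTRY_SIZE
        status, ptype, first_lba, sectors = struct.unpack(
            ENTRY_FORMAT, sector[start:start + ENTRY_SIZE]
        )
        if ptype == 0:  # type 0 marks an unused entry
            continue
        partitions.append({
            "index": index,
            "active": status == 0x80,  # 0x80 flags the bootable partition
            "type": ptype,
            "first_lba": first_lba,
            "sectors": sectors,
        })
    return partitions


def build_sample_mbr():
    """Build a synthetic MBR with one active partition of type 0x07."""
    sector = bytearray(SECTOR_SIZE)
    sector[446:462] = struct.pack(ENTRY_FORMAT, 0x80, 0x07, 63, 4192902)
    sector[510:512] = BOOT_SIGNATURE
    return bytes(sector)
```

Calling parse_mbr(build_sample_mbr()) returns a single entry for the sample partition. Working on a saved copy of the sector, as this sketch does, also gives you a known-good image to compare against after you deliberately damage and recover the MBR on a test computer.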

Be sure to test recovery procedures before bringing a new computer or server into production. Every operator needs to have both primary and refresher training in recovering from the most common causes of unexpected downtime. Testing needs to include:

  • Testing the UPS on the computer running Windows 2000 Server and on hubs, routers, and other network components.
  • Testing the disaster plan.
  • Restoring from your backups.
  • Testing your ability to rebuild mirrored and RAID-5 volumes, whether you use the software fault-tolerant volumes in Windows 2000 Server or hardware RAID arrays.

If a network adapter or other network component fails on the domain controller, the server operator needs to be familiar with the procedure for promoting a member server to be a domain controller, and demoting the failed server. Someone who is familiar with the procedure for reinstalling and reconfiguring the network adapter also needs to be available.

If a data volume fails, the operator must be able to restore the data from backup quickly and efficiently. Test the restore procedure frequently, both to maintain the skill of the operator and to check the quality of the backup tapes. The only reliable way to check the quality of backup tapes is to do a full restore, which confirms that the data is readable, complete, and up to date.
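
One way to check the result of a full restore is to compare file checksums between the original data and the restored copy. The following Python sketch is an illustration only and is not part of Windows 2000 Backup; the function names and the choice of SHA-256 are assumptions. It expects a manifest built from the source folder at backup time and a folder that holds the restored files.

```python
import hashlib
import os


def hash_file(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of one file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its checksum."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            manifest[os.path.relpath(full_path, root)] = hash_file(full_path)
    return manifest


def compare_restore(source_manifest, restored_root):
    """Report files that are missing, unexpected, or changed after a restore."""
    restored = build_manifest(restored_root)
    return {
        "missing": sorted(set(source_manifest) - set(restored)),
        "extra": sorted(set(restored) - set(source_manifest)),
        "changed": sorted(
            path for path in source_manifest
            if path in restored and source_manifest[path] != restored[path]
        ),
    }
```

Build the manifest when the backup is taken and store it separately from the tapes. A restore drill passes this check when all three lists in the report are empty. Files that users changed after the backup will legitimately differ from the live source, which is why the comparison uses the manifest from backup time.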

If your backup procedures involve the use of other computers running Windows 2000 Server or Windows 2000 Professional, verify that those backup and restore procedures work as expected.

Documenting Recovery Procedures

You need to develop step-by-step procedures for recovering from a variety of potential failures. You can use these procedures for:

  • Testing a new computer before putting the computer into a production environment.
  • Training new administrators and operators.
  • Creating an operations handbook, including procedures for setting up new user accounts, conducting backups, maintaining ERDs, and completing other common administrative tasks.

Update your documentation when you make configuration changes to your computers or network, especially when you install a new operating system or change the utilities that you use to maintain your system.

© 1985-2000 Microsoft Corporation. All rights reserved.



Microsoft Corporation Staff, IT Professional Staff - Microsoft Windows 2000 Server Operations Guide
Year: 2002
Pages: 404
