Lesson 1: Planning for Failure

In Chapter 11, "Troubleshooting Windows 2000 During and After Migration," you looked at potential problems with the migration and how to mitigate them. In this lesson, you'll investigate other components that could potentially bring your business to a standstill and how to plan for them to keep your business running.

After this lesson, you will be able to

  • Understand the issues to be addressed when considering how to plan for large-scale failure (specific migration actions that might fail and affect the migration program itself) and small-scale failure (impacts on the migration process affecting the system as seen and used).

Estimated lesson time: 15 minutes

Potential Points of Failure

A quality system isn't necessarily one that never has problems. Hard disks fill up or wear out, memory chips develop parity errors, and software can get corrupted. Hence, another way to measure quality relates to how a system handles problems when they do arise.

At the time of this writing, developments are taking place in On Forever technology. On Forever technology aims to enable you to keep your servers running without ever taking them down. Part of this technology includes the ability to support redundant components that would switch over when normal components go wrong, and to support replacing failed components while the system is still running. Hot-swappable disk drive arrays allow a defective hard disk to be replaced without turning off your server. Microsoft has worked closely with IBM on hot-swappable hardware adapters for Microsoft Windows 2000 and has begun an initiative into hot-swappable memory.

In the meantime, you'll still have to cope with hardware problems. Some organizations go so far as to redefine problems as "opportunities." While the term opportunities might not always be appropriate, a certain level of problems is regarded as inevitable, so you must have a strategy in place to deal with them.

Some phases of the migration are so critical that contingency plans should be in place in case those stages fail. For example, if the upgrade of the PDC in a Microsoft Windows NT domain fails, it can leave the domain without a PDC. A contingency plan is therefore crucial.

Other aspects of the migration might not be as important but will still need to be dealt with. For example, if a user is unable to run an application correctly because his or her host or the application's host domain is now operating under Windows 2000, there must be a way to note this failure, deal with it in a timely manner, and assess and manage the impact of the problem on other users. Sometimes the problem might be minor, but the user might be critical. For instance, if the user is the CEO of the company, you might now have a critical concern.

Practice 1: Identifying Vulnerabilities

To help you focus on the potential areas of failure, speculate on what you think can go wrong. Using the following table as a starting point, list as many potential points of failure in the system as possible and how you might be able to resolve them. Before continuing with the rest of this lesson, take a look at some of the potential points of failure listed in Appendix A, "Questions and Answers."

Category | Failure Point | Potential Resolution


Instituting a Fault Reporting Process

In Chapter 2, "Project Planning," you learned the importance of communication during the migration. Stakeholders in the migration must be kept informed of the reasons behind it and the progress of the various phases. A well-defined fault-reporting mechanism is required at all levels. Everyone should be fully aware of how faults are reported and have confidence in the system that will deal with them.

Users must be able to report faults with applications and with gaining access to the system and its resources. System managers and administrators must be able to raise issues concerning problems with user and resource management. Finally, those performing the migration itself must have mechanisms whereby they can communicate and resolve problems among themselves.

Fault Escalation

When faults are reported, there must be clearly defined ways of doing the following:

  • Determining responsibility for an issue
  • Ensuring that the ownership of problems is identified

It shouldn't be possible for problems to be "bounced" between two or more divisions without reconciliation. For example, a user shouldn't be referred by the network division to the operating system division and back again.

Instead, there must be an escalation path for problems so that staff at appropriate levels can become involved in their resolution. The path should contain steps that assign each type of problem or situation to those who can best deal with it.

The way faults are reported and escalated should be well documented, and everyone involved must be made aware of the processes. The underlying philosophy should be that the entire organization is involved in finding and resolving faults. An example of this process is shown in Figure 12.1.

Figure 12.1 Flowchart depicting problem-solving escalation path
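
The requirement that a fault always has exactly one owner, and moves up a defined path rather than bouncing between divisions, can be sketched in a few lines of Python. This is only an illustration: the level names and time limits below are hypothetical assumptions and are not taken from Figure 12.1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationLevel:
    owner: str        # hypothetical team or role that owns the fault
    max_hours: float  # time this level may hold the fault before it moves up


# Hypothetical escalation path; substitute your organization's own levels.
ESCALATION_PATH = [
    EscalationLevel("Help desk", 4),
    EscalationLevel("System administrators", 8),
    EscalationLevel("Migration team", 24),
    EscalationLevel("Migration project manager", float("inf")),
]


def current_owner(hours_open: float) -> str:
    """Return the single owner responsible for a fault open this long."""
    remaining = hours_open
    for level in ESCALATION_PATH:
        if remaining < level.max_hours:
            return level.owner
        # This level's time is used up, so the fault moves to the next level.
        remaining -= level.max_hours
    return ESCALATION_PATH[-1].owner


if __name__ == "__main__":
    for hours in (2, 5, 12, 100):
        print(f"Open {hours} hours: owned by {current_owner(hours)}")
```

Because the function always returns exactly one owner, a fault can never be left unowned, and it can only move upward through the list.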

Counting the Cost

At first glance, the cost of the extra hardware, software, and personnel to manage the systems might seem excessive relative to the amount of downtime you could experience. Many corporations rely on their backups as the only means of recovery in the event of a failure; however, a lot of processes are involved in a backup regime, which you'll explore in the next lesson. You will also have staff who can't use applications or access data while the server is being rebuilt and the backup restored, which incurs costs in itself.

For example, it's possible to create an environment that has 99.999 percent guaranteed uptime if you follow the strict guidelines from Microsoft and your hardware vendor. The amount of unexpected downtime is then 0.001 percent, which is equivalent to just over five minutes per year. At 99.9 percent uptime, the 0.1 percent of downtime is equivalent to about 8.8 hours per year. Depending on the server that's unavailable for that time, your corporation could incur a substantial hit on profits. Therefore, calculate the cost of an e-commerce Web server that is not available for eight hours or the cost of users not being able to work for a day to determine the degree of fault tolerance you should establish.
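
The downtime arithmetic is easy to check for any availability figure. The following Python sketch assumes a 365-day year (8,760 hours) and multiplies it by the fraction of time the system is down; the function name is illustrative only.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years are ignored


def annual_downtime_hours(availability_percent: float) -> float:
    """Return the hours per year a system is down at the given availability."""
    downtime_fraction = (100.0 - availability_percent) / 100.0
    return HOURS_PER_YEAR * downtime_fraction


if __name__ == "__main__":
    for availability in (99.0, 99.9, 99.99, 99.999):
        hours = annual_downtime_hours(availability)
        print(f"{availability}% uptime: {hours:.2f} hours "
              f"({hours * 60:.1f} minutes) of downtime per year")
```

On these assumptions, 99.9 percent uptime corresponds to 8.76 hours of downtime per year, and 99.999 percent to about 5.3 minutes.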

Practice 2: Checking the Uptime of Your Server

In this practice, you'll investigate a utility called Uptime that's available in the Microsoft Windows 2000 Server Resource Kit. Uptime will give you statistics on the number of shutdowns, blue screens, and application problems. These statistics will then allow you to calculate the cost of a system not being available.

  1. On MIGKIT1, open a command prompt and change to the Tools folder.
  2. Type Uptime /s > upreport.txt to run Uptime.exe and save its output to a file named Upreport.txt.
  3. Open the Upreport.txt file in Notepad.
  4. Scroll to the bottom of the report and note the statistics given on the total uptime and system availability.
  5. Close Notepad and from the command prompt, type Uptime /?.

    Note the other options available, specifically the heartbeat option for helping collect information on downtime.
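
Once you have noted the total uptime and downtime from the report, you can turn them into an availability percentage and a rough cost. The Python sketch below doesn't read Upreport.txt, because it makes no assumption about the report's layout; you type the figures in by hand. All the numbers shown (hours, number of users, hourly cost) are hypothetical placeholders.

```python
def availability_percent(uptime_hours: float, downtime_hours: float) -> float:
    """Return availability as a percentage of the whole observation period."""
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("the observation period must be longer than zero")
    return 100.0 * uptime_hours / total


def downtime_cost(downtime_hours: float, affected_users: int,
                  hourly_cost_per_user: float) -> float:
    """Estimate the cost of staff being unable to work during downtime."""
    return downtime_hours * affected_users * hourly_cost_per_user


if __name__ == "__main__":
    # Hypothetical figures; replace them with the values from your own report.
    uptime = 2150.0    # hours the server was available
    downtime = 10.0    # hours the server was unavailable
    print(f"Availability: {availability_percent(uptime, downtime):.3f}%")
    print(f"Estimated cost: {downtime_cost(downtime, 200, 30.0):,.2f}")
```

Comparing this estimate with the price of redundant hardware gives a simple basis for deciding how much fault tolerance is justified.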

Lesson Summary

In this lesson, you learned that it's important to plan for failure in a migration. Specific steps in the migration must be identified as upgrade failure points and contingency plans must be made. The way these problems are to be addressed, along with a mechanism for escalating more intransigent problems, must be part of the migration planning and communication process.

MCSE Training Kit (Exam 70-222): Migrating from Microsoft Windows NT 4.0 to Microsoft Windows 2000 (MCSE Training Kits)
ISBN: 0735612390
Year: 2001
Pages: 126
