In Chapter 11, "Troubleshooting Windows 2000 During and After Migration," you looked at potential problems with the migration and how to mitigate them. In this lesson, you'll investigate other components that could bring your business to a standstill and how to plan for them to keep your business running.
After this lesson, you will be able to
Estimated lesson time: 15 minutes
A quality system isn't necessarily one that never has problems. Hard disks fill up or wear out, memory chips develop parity errors, and software can get corrupted. Hence, another way to measure quality relates to how a system handles problems when they do arise.
At the time of this writing, developments are taking place in On Forever technology. On Forever technology aims to enable you to keep your servers running without ever taking them down. Part of this technology is the ability to support redundant components that take over when primary components fail, and to replace failed components while the system is still running. Hot-swappable disk drive arrays allow a defective hard disk to be replaced without turning off your server. Microsoft has worked closely with IBM on hot-swappable hardware adapters for Microsoft Windows 2000 and has begun an initiative on hot-swappable memory.
In the meantime, you'll still have to cope with hardware problems. Some organizations go so far as to redefine problems as "opportunities." While the term opportunities might not always be appropriate, a certain level of problems is regarded as inevitable, so you must have a strategy in place to deal with them.
Some phases of the migration are so critical that plans should be in place in case those stages fail. For example, if the upgrade of the PDC in a Microsoft Windows NT domain fails, it can leave the domain without a PDC. A contingency plan is therefore crucial.
Other aspects of the migration might not be as important but will still need to be dealt with. For example, if a user is unable to run an application correctly because his or her host or the application's host domain is now operating under Windows 2000, there must be a way to note this failure, deal with it in a timely manner, and assess and manage the impact of the problem on other users. Sometimes the problem might be minor, but the user might be critical. For instance, if the user is the CEO of the company, you might now have a critical concern.
To help you focus on the potential areas of failure, speculate on what you think can go wrong. Using the following table as a starting point, list as many potential points of failure in the system as possible and how you might be able to resolve them. Before continuing with the rest of this lesson, take a look at some of the potential points of failure listed in Appendix A, "Questions and Answers."
Category | Failure Point | Potential Resolution |
---|---|---|
Hardware | ______________________ ______________________ ______________________ | ______________________ ______________________ ______________________ |
Software | ______________________ ______________________ ______________________ | ______________________ ______________________ ______________________ |
Network | ______________________ ______________________ ______________________ | ______________________ ______________________ ______________________ |
Other | ______________________ ______________________ ______________________ | ______________________ ______________________ ______________________ |
In Chapter 2, "Project Planning," you learned the importance of communication during the migration. Stakeholders in the migration must be kept informed of the reasons behind it and the progress of the various phases. A well-defined fault-reporting mechanism is required at all levels. Everyone should be fully aware of how faults are reported and have confidence in the system that will deal with them.
Users must be able to report faults with applications and with gaining access to the system and its resources. System managers and administrators must be able to raise issues concerning problems with user and resource management. Finally, those performing the migration itself must have mechanisms whereby they can communicate and resolve problems among themselves.
When faults are reported, there must be clearly defined ways of doing the following:
It shouldn't be possible for problems to be "bounced" between two or more divisions without reconciliation. For example, a user shouldn't be referred by the network division to the operating system division and back again.
Instead, there must be an escalation path for problems so that staff at appropriate levels can become involved in their resolution. The path should contain steps that assign each type of problem or situation to those who can best deal with it.
The way faults are reported and escalated should be well documented, and everyone involved must be made aware of the processes. The underlying philosophy should be that the entire organization is involved in finding and resolving faults. An example of this process is shown in Figure 12.1.
Figure 12.1 Flowchart depicting problem-solving escalation path
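One way to picture an escalation path that prevents "bouncing" is a fault record that always has exactly one owner and can only move upward. The following is a minimal sketch, not a prescribed implementation: the level names in `ESCALATION_PATH`, the `FaultReport` class, and its fields are all hypothetical and would be replaced by your organization's documented procedure.

```python
from dataclasses import dataclass, field

# Hypothetical escalation levels, lowest first. A real path would come from
# the organization's own documented fault-reporting procedure.
ESCALATION_PATH = [
    "help desk",
    "system administrator",
    "migration team",
    "project manager",
]


@dataclass
class FaultReport:
    reporter: str
    description: str
    level: int = 0                               # index into ESCALATION_PATH
    history: list = field(default_factory=list)  # (previous owner, reason) pairs

    @property
    def owner(self) -> str:
        """The single group currently responsible for this fault."""
        return ESCALATION_PATH[self.level]

    def escalate(self, reason: str) -> str:
        """Move the fault one level up; it never moves sideways or back down."""
        if self.level == len(ESCALATION_PATH) - 1:
            raise RuntimeError("already at the top of the escalation path")
        self.history.append((self.owner, reason))
        self.level += 1
        return self.owner
```

Because the only way to change the owner is `escalate`, a report can't be passed back and forth between two groups, and `history` records why each hand-off happened.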
At first glance, the cost of the extra hardware, software, and personnel to manage the systems might seem excessive relative to the amount of downtime you could experience. Many corporations rely on their backups as the only means of recovery in the event of a failure; however, a backup regime involves many processes, which you'll explore in the next lesson. In addition, staff can't use applications or access data while the server is being rebuilt and the backup restored, which itself incurs costs.
For example, it's possible to create an environment with 99.9 percent guaranteed uptime if you follow the strict guidelines from Microsoft and your hardware vendor. The remaining 0.1 percent of unexpected downtime is equivalent to just over eight hours per year (0.001 × 8,760 hours ≈ 8.76 hours). (At 99.999 percent uptime, the figure falls to roughly five minutes per year.) Depending on which server is unavailable for that time, your corporation could take a substantial hit on profits. Therefore, to determine the degree of fault tolerance you should establish, calculate the cost of an e-commerce Web server being unavailable for eight hours or the cost of users not being able to work for a day.
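The relationship between an uptime percentage and the downtime it allows can be checked with a few lines of arithmetic. The sketch below assumes a 365-day year (8,760 hours); the function names and the hourly cost figure are hypothetical and are there only to illustrate the calculation.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def annual_downtime_hours(uptime_percent: float) -> float:
    """Hours per year a system is unavailable at the given uptime percentage."""
    if not 0.0 <= uptime_percent <= 100.0:
        raise ValueError("uptime_percent must be between 0 and 100")
    return HOURS_PER_YEAR * (100.0 - uptime_percent) / 100.0


def annual_downtime_cost(uptime_percent: float, cost_per_hour: float) -> float:
    """Yearly cost of downtime, given an assumed cost per hour of outage."""
    return annual_downtime_hours(uptime_percent) * cost_per_hour


if __name__ == "__main__":
    assumed_cost_per_hour = 1_000.0  # hypothetical figure; substitute your own
    for uptime in (99.0, 99.9, 99.99, 99.999):
        hours = annual_downtime_hours(uptime)
        cost = annual_downtime_cost(uptime, assumed_cost_per_hour)
        print(f"{uptime}% uptime: {hours:.2f} hours "
              f"({hours * 60:.1f} minutes) per year, cost {cost:,.0f}")
```

Comparing the yearly downtime cost at two uptime levels with the cost of the extra hardware, software, and personnel needed to move from one to the other gives a rough basis for deciding how much fault tolerance is worth paying for.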
In this practice, you'll investigate a utility called Uptime that's available in the Microsoft Windows 2000 Server Resource Kit. Uptime will give you statistics on the number of shutdowns, blue screens, and application problems. These statistics will then allow you to calculate the cost of a system not being available.
Note the other options available, specifically the heartbeat option for helping collect information on downtime.
In this lesson, you learned that it's important to plan for failure in a migration. Specific steps in the migration must be identified as upgrade failure points and contingency plans must be made. The way these problems are to be addressed, along with a mechanism for escalating more intransigent problems, must be part of the migration planning and communication process.