As soon as a failure or degradation of your directory service is detected, a person or an automated process needs to do something about it. In this section we describe some general principles of problem evaluation and resolution, along with some things you should be aware of when remedying problems with a directory service.

Planning Your Course of Action

When you note a failure of the directory service, take a moment to plan your course of action. Although it is certainly important to get the directory service running correctly as soon as possible, you should also be sure to proceed in a methodical fashion. Before taking any action, ask yourself the following questions:
As soon as you know who is affected by the failure, provide some feedback to let them know the scope of the problem and that you are beginning to work on a resolution. Include an estimated time for repair, if possible. Notifying your end users is important; although it may seem like something you don't have time for, end users deal with outages much better when they aren't kept in the dark.

Minimizing the Effect

After you've achieved a basic understanding of the scope of the problem and who and what it affects, try to minimize the effects of the outage. Doing this can buy you additional time to understand the problem. For example, if you maintain several replicas of the failed server, you may be able to reconfigure your dependent services so that they access the replicas. In some cases, failover to a replica is handled automatically by the directory software.

It may also be possible to shut down the dependent services temporarily to minimize the effects. For example, if you have a batch update process that merges changes from a foreign data source, and this process places a heavy load on the directory, you might consider waiting to run that process until the service has been repaired. Directory-enabled electronic message transfer agents (MTAs) that handle mail from external sites are a good example of a dependent service that can be shut down to reduce directory load. Although no inbound mail would be received if the MTA were shut down, external mail servers would queue the mail and periodically attempt to deliver it. MTAs that handle outbound mail should be kept online, if possible, so that users can send mail.

Understanding the Root Cause

After you've isolated the failing components and informed the affected parties, you need to understand the root cause of the problem as thoroughly as possible. If you don't eventually figure out why the failure occurred, it's quite likely that you'll have the same problem in the future.
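As an aside on the replica failover mentioned earlier, here is a minimal Python sketch of how a dependent service might choose a directory server from a preference-ordered list. It is an illustration under stated assumptions, not the behavior of any particular directory product: the host names are hypothetical, and the probe only checks that a TCP connection to the LDAP port (389 by default) succeeds, which does not prove that the server is answering queries correctly.

```python
import socket
from typing import Callable, Iterable, Optional, Tuple

Server = Tuple[str, int]  # (host, port)


def is_reachable(server: Server, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the server can be opened."""
    host, port = server
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def pick_server(
    servers: Iterable[Server],
    probe: Callable[[Server], bool] = is_reachable,
) -> Optional[Server]:
    """Return the first server, in preference order, that passes the probe."""
    for server in servers:
        if probe(server):
            return server
    return None


if __name__ == "__main__":
    # Hypothetical host names: the master first, then its replicas.
    candidates = [
        ("ldap-master.example.com", 389),
        ("ldap-replica1.example.com", 389),
        ("ldap-replica2.example.com", 389),
    ]
    chosen = pick_server(candidates)
    print("No directory server reachable" if chosen is None else f"Using {chosen}")
```

Passing the probe as a parameter keeps the selection logic separate from the health check, so a stricter check (for example, a real directory search) could be substituted without changing `pick_server`.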
Troubleshooting a complex, distributed system is tricky business, and we can't possibly hope to explain it all in this text. We do, however, outline a basic step-by-step strategy you might follow when analyzing directory service failures:
These steps often lead you directly to the root cause. For example, if the directory service was unable to read database entries from a failing disk drive, you would probably find a close correlation between the time the directory failed and the time the operating system began to log failures.

If the root cause is still not obvious, you may want to involve your directory vendor's technical support staff. The log files and other evidence you collect should help the vendor understand and troubleshoot the problem. Another option you may have, depending on your software, is to enable more verbose log output and put the directory server back into service. This action would probably degrade service somewhat, but the additional log output might be very revealing, especially to your vendor's support staff. Be sure to reconfigure the server for normal logging after you've reproduced the problem with verbose logging enabled.

If the problem is intermittent and hard to reproduce, you may find that a test bed environment allows you to investigate the problem more thoroughly without affecting your production service. A test bed is also a great place to test new software versions and configuration changes before rolling them out into your production environment.

Correcting the Problem

The process of directory problem resolution is covered in detail in Chapter 20, Troubleshooting. After you've corrected the problem, use the opportunity to improve your directory service. Ask yourself the following questions:
Documenting What Happened

Producing a detailed problem/resolution report is a great idea for several reasons. First, it serves as a learning tool for your fellow system administrators. Second, if the root cause is a bug in the directory server software, a detailed problem report can be extremely helpful to your vendor. Components of a good problem report include the following:
After a problem has been resolved, it's a good idea to conduct a postmortem and ask some questions of yourself. Would troubleshooting have been easier if additional documentation were available? Did you have all the network maps for your organization? Was it clear which applications used the directory and which servers were used by those applications? Did you have all the necessary administrative passwords to gain access to the affected systems? Did you have the telephone and pager numbers of everyone you needed to contact? Always think about how to make your troubleshooting process more effective.
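One way to make future troubleshooting more effective is to script the log correlation described in the root-cause discussion, where the time the directory failed is compared with the times at which the operating system logged errors. The following Python sketch shows the idea under simplifying assumptions: log entries have already been parsed into (timestamp, message) pairs, and the ten-minute window and the sample messages are invented for illustration, not taken from a real system.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple

LogEntry = Tuple[datetime, str]  # (timestamp, message)


def entries_near(
    failure_time: datetime,
    entries: Iterable[LogEntry],
    window: timedelta = timedelta(minutes=10),
) -> List[LogEntry]:
    """Return the entries logged within `window` of the failure, oldest first."""
    nearby = [e for e in entries if abs(e[0] - failure_time) <= window]
    return sorted(nearby, key=lambda e: e[0])


if __name__ == "__main__":
    # Invented sample data standing in for parsed operating system logs.
    os_log = [
        (datetime(2024, 1, 1, 9, 0), "system boot"),
        (datetime(2024, 1, 1, 11, 55), "disk read error"),
        (datetime(2024, 1, 1, 12, 5), "disk read error"),
    ]
    directory_failed_at = datetime(2024, 1, 1, 12, 0)
    for when, message in entries_near(directory_failed_at, os_log):
        print(when.isoformat(), message)
```

Entries that cluster around the failure time are candidates for the root cause, not proof of it; they indicate where to look next and what evidence to include in a problem report.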