Understanding and Deploying LDAP Directory Services > 18. Monitoring > Taking Action
Taking Action

As soon as a failure or degradation of your directory service is detected, someone or something needs to do something about it. In this section, we describe some general principles of problem evaluation and resolution, along with some things you should be aware of when remedying problems with a directory service.

Planning Your Course of Action

When you note a failure of the directory service, take a moment to plan your course of action. Although it is certainly important to get the directory service running correctly as soon as possible, you should also be sure to proceed in a methodical fashion. Before taking any action, ask yourself the following questions:
As soon as you know who is affected by the failure, provide some feedback to let them know the scope of the problem and that you are beginning to work on a resolution. Include an estimated time for repair, if possible. Notifying your end users is important; although it may seem like something you don't have time for, end users deal with outages much better when they aren't kept in the dark.

Minimizing the Effect

After you've achieved a basic understanding of the scope of the problem and who and what it affects, try to minimize the effects of the outage. This can buy you additional time to understand the problem. For example, if you maintain several replicas of the failed server, you may be able to reconfigure your dependent services so that they access the replicas. In some cases, failover to a replica is handled automatically by the directory software. It may also be possible to temporarily shut down dependent services to minimize the effects. For example, if you have a batch update process that merges changes from a foreign data source, and this process places a heavy load on the directory, you might consider waiting to run that process until the service has been repaired. Directory-enabled mail transfer agents (MTAs) that handle mail from external sites are a good example of a dependent service that can be shut down to reduce directory load. Although no inbound mail would be received if the MTA were shut down, external mail servers would queue the mail and periodically attempt to deliver it. MTAs that handle outbound mail should be kept online, if possible, so that users can send mail.

Understanding the Root Cause

After you've isolated the failing components and informed the affected parties, you need to understand the root cause of the problem as thoroughly as possible. If you don't eventually figure out why the failure occurred, it's quite likely that you'll have the same problem in the future.
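The idea of redirecting dependent services to a surviving replica can be sketched in a few lines. This is an illustrative sketch only: the replica host names and the health probe are hypothetical, not part of any particular directory product, and real deployments often get this behavior from the directory client library or a load balancer instead.

```python
# Sketch: pick the first reachable replica so dependent services can be
# repointed while the primary directory server is down.
# Host names below are hypothetical examples.
import socket

REPLICAS = ["ldap-replica1.example.com", "ldap-replica2.example.com"]
LDAP_PORT = 389

def is_healthy(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def pick_replica(hosts, probe):
    """Return the first replica that passes the health probe, or None."""
    for host in hosts:
        if probe(host):
            return host
    return None

# Usage (would probe real hosts in practice):
#   target = pick_replica(REPLICAS, lambda h: is_healthy(h, LDAP_PORT))
```

A TCP connect is only a crude liveness check; a real probe would issue an actual directory operation, such as an anonymous search of the root DSE, to confirm the replica is answering queries and not merely accepting connections.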
Troubleshooting a complex, distributed system is tricky business, and we can't possibly hope to explain it all in this text. We do, however, outline a basic step-by-step strategy you might follow when analyzing directory service failures:
These steps often lead you directly to the root cause. For example, if the directory service was unable to read database entries from a failing disk drive, you would probably find a close correlation between the time the directory failed and the time the operating system began to log failures. If the root cause is still not obvious, you may want to involve your directory vendor's technical support staff. The log files and other evidence you collect should help the vendor understand and troubleshoot the problem. Another option you may have, depending on your software, is to enable more verbose log output and put the directory server back into service. This would probably result in some degradation of service, but the additional log output may be very revealing, especially to your vendor's support staff. Be sure to reconfigure the server for normal logging after you've reproduced the problem with verbose logging enabled. If the problem is intermittent and hard to reproduce, you may find that a testbed environment allows you to investigate the problem more thoroughly without affecting your production service. A testbed is also a great place to test new software versions and configuration changes before rolling them out into your production environment.

Correcting the Problem

The process of directory problem resolution is covered in detail in Chapter 19, "Troubleshooting." After you've corrected the problem, use the opportunity to improve your directory service. Ask yourself the following:
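The log-correlation step described above, matching the directory failure time against operating system log events, can be sketched as follows. Log formats vary widely by product and platform, so the event timestamps and messages here are purely illustrative.

```python
# Sketch: find OS log events recorded within a window around the
# directory server's failure time, to surface likely root causes
# (e.g., a disk error logged just before the directory died).
from datetime import datetime, timedelta

def events_near(failure_time, os_events, window_minutes=5):
    """Return (timestamp, message) pairs within +/- window_minutes
    of the failure time."""
    window = timedelta(minutes=window_minutes)
    return [(t, msg) for t, msg in os_events
            if abs(t - failure_time) <= window]

# Illustrative data, not real log output:
failure = datetime(2002, 3, 14, 2, 17)   # when the directory failed
os_log = [
    (datetime(2002, 3, 14, 2, 15), "disk0: I/O error, block 1724"),
    (datetime(2002, 3, 13, 23, 2), "routine daemon restart"),
]

suspects = events_near(failure, os_log)
# Only the disk error falls inside the window, pointing at the
# failing drive as a probable root cause.
```

In practice the hard part is parsing each log's timestamp format and reconciling clock skew between machines, which is one reason synchronized clocks (for example, via NTP) make troubleshooting distributed systems far easier.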
Documenting What Happened

Producing a detailed problem/resolution report is a great idea for several reasons. First, it serves as a learning tool for your fellow system administrators. Second, if the root cause is a bug in the directory server software, a detailed problem report can be extremely helpful to your vendor. Components of a good problem report include
After a problem has been resolved, it's a good idea to conduct a postmortem and ask yourself a number of questions. Could troubleshooting have been made easier if additional documentation were available? Did you have all the network maps for your organization? Was it clear which applications used the directory and which servers were used by those applications? Did you have all the necessary administrative passwords to gain access to the affected systems? Did you have the telephone and pager numbers of everyone you needed to contact? Always think about how to make your troubleshooting process more effective.
2002, O'Reilly & Associates, Inc.