Understanding and Deploying LDAP Directory Services > 18. Monitoring > Taking Action
Taking Action

As soon as a failure or degradation of your directory service is detected, someone or something needs to do something about it. In this section, we describe some general principles of problem evaluation and resolution, along with some things you should be aware of when remedying problems with a directory service.

Planning Your Course of Action

When you note a failure of the directory service, take a moment to plan your course of action. Although it is certainly important to get the directory service running correctly as soon as possible, you should also be sure to proceed in a methodical fashion. Before taking any action, ask yourself the following questions:
As soon as you know who is affected by the failure, provide some feedback to let them know the scope of the problem and that you are beginning to work on a resolution. Include an estimated time for repair, if possible. Notifying your end users is important; although it may seem like something you don't have time for, end users deal with outages much better when they aren't kept in the dark.

Minimizing the Effect

After you've achieved a basic understanding of the scope of the problem and who and what it affects, try to minimize the effects of the outage. This can buy you additional time to understand the problem. For example, if you maintain several replicas of the failed server, you may be able to reconfigure your dependent services so that they access the replicas. In some cases, failover to a replica is handled automatically by the directory software. It may also be possible to temporarily shut down dependent services to minimize the effects. For example, if you have a batch update process that merges changes from a foreign data source, and this process places a heavy load on the directory, you might consider waiting to run that process until the service has been repaired. Directory-enabled mail transfer agents (MTAs) that handle mail from external sites are a good example of a dependent service that can be shut down to reduce directory load. Although no inbound mail would be received if the MTA were shut down, external mail servers would queue the mail and periodically attempt to deliver it. MTAs that handle outbound mail should be kept online, if possible, so that users can send mail.

Understanding the Root Cause

After you've isolated the failing components and informed the affected parties, you need to understand the root cause of the problem as thoroughly as possible. If you don't eventually figure out why the failure occurred, it's quite likely that you'll have the same problem in the future.
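The idea of redirecting dependent services to a surviving replica can be sketched in a few lines. This is an illustrative sketch only: the replica host names and the health probe are hypothetical, not part of any particular directory product, and real deployments often get this behavior from the directory client library or a load balancer instead.

```python
# Sketch: pick the first reachable replica so dependent services can be
# repointed while the primary directory server is down.
# Host names below are hypothetical examples.
import socket

REPLICAS = ["ldap-replica1.example.com", "ldap-replica2.example.com"]
LDAP_PORT = 389

def is_healthy(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def pick_replica(hosts, probe):
    """Return the first replica that passes the health probe, or None."""
    for host in hosts:
        if probe(host):
            return host
    return None

# Usage (would probe real hosts in practice):
#   target = pick_replica(REPLICAS, lambda h: is_healthy(h, LDAP_PORT))
```

A TCP connect is only a crude liveness check; a real probe would issue an actual directory operation, such as an anonymous search of the root DSE, to confirm the replica is answering queries and not merely accepting connections.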
Troubleshooting a complex, distributed system is tricky business, and we can't possibly hope to explain it all in this text. We do, however, outline a basic step-by-step strategy you might follow when analyzing directory service failures:
These steps often lead you directly to the root cause. For example, if the directory service was unable to read database entries from a failing disk drive, you would probably find a close correlation between the time the directory failed and the time the operating system began to log failures. If the root cause is still not obvious, you may want to involve your directory vendor's technical support staff. The log files and other evidence you collect should help the vendor understand and troubleshoot the problem. Another option you may have, depending on your software, is to enable more verbose log output and put the directory server back into service. This would probably result in some degradation of service, but the additional log output may be very revealing, especially to your vendor's support staff. Be sure to reconfigure the server for normal logging after you've reproduced the problem with verbose logging enabled. If the problem is intermittent and hard to reproduce, you may find that a testbed environment allows you to investigate the problem more thoroughly without affecting your production service. A testbed is also a great place to test new software versions and configuration changes before rolling them out into your production environment.

Correcting the Problem

The process of directory problem resolution is covered in detail in Chapter 19, "Troubleshooting." After you've corrected the problem, use the opportunity to improve your directory service. Ask yourself the following:
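The log-correlation step described above, matching the directory failure time against operating system log events, can be sketched as follows. Log formats vary widely by product and platform, so the event timestamps and messages here are purely illustrative.

```python
# Sketch: find OS log events recorded within a window around the
# directory server's failure time, to surface likely root causes
# (e.g., a disk error logged just before the directory died).
from datetime import datetime, timedelta

def events_near(failure_time, os_events, window_minutes=5):
    """Return (timestamp, message) pairs within +/- window_minutes
    of the failure time."""
    window = timedelta(minutes=window_minutes)
    return [(t, msg) for t, msg in os_events
            if abs(t - failure_time) <= window]

# Illustrative data, not real log output:
failure = datetime(2002, 3, 14, 2, 17)   # when the directory failed
os_log = [
    (datetime(2002, 3, 14, 2, 15), "disk0: I/O error, block 1724"),
    (datetime(2002, 3, 13, 23, 2), "routine daemon restart"),
]

suspects = events_near(failure, os_log)
# Only the disk error falls inside the window, pointing at the
# failing drive as a probable root cause.
```

In practice the hard part is parsing each log's timestamp format and reconciling clock skew between machines, which is one reason synchronized clocks (for example, via NTP) make troubleshooting distributed systems far easier.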
Documenting What Happened

Producing a detailed problem/resolution report is a great idea for several reasons. First, it serves as a learning tool for your fellow system administrators. Second, if the root cause is a bug in the directory server software, a detailed problem report can be extremely helpful to your vendor. Components of a good problem report include
After a problem has been resolved, it's a good idea to conduct a postmortem and ask yourself a number of questions. Could troubleshooting have been made easier if additional documentation were available? Did you have all the network maps for your organization? Was it clear which applications used the directory and which servers were used by those applications? Did you have all the necessary administrative passwords to gain access to the affected systems? Did you have the telephone and pager numbers of everyone you needed to contact? Always think about how to make your troubleshooting process more effective.
2002, O'Reilly & Associates, Inc.