
Troubleshooting and Resolving Problems

When presented with a problem, you should follow a methodical, step-by-step approach to troubleshooting. The steps we suggest are covered in this section.

Step 1: Assess the Problem and Inform Affected Persons

When you first become aware of a problem, you should perform a quick initial assessment of the evidence and try to understand the scope of the problem. At this stage, the underlying cause of the problem may not be obvious. You may have only a few unrelated problem reports from your help desk that may or may not clearly point to the cause. When the underlying cause isn't clear, try to think of all the possible causes and investigate each one.

Depending on the size of your organization and how your computing support is provided, you may need to contact other people to understand the possible causes. For example, if you are in charge of running the LDAP servers, another group is responsible for the physical network, and yet another group is responsible for the network routing, you may need to get in touch with all these groups to understand if the problem is the fault of a system you administer. If you work in this type of organization, it's extremely valuable to cultivate good working relationships with the groups that provide services that you depend on and the groups that use the services you provide. Having clear escalation processes and up-to-date contact information is also crucial to this type of distributed problem solving.

For example, if you receive notification that users cannot access Web pages that require authentication, you don't really know if the problem lies in the Web server itself, the directory, or some infrastructure piece such as the network between them. Is the Web server accessible? If you access it yourself, can you observe the same problem the end users encountered? Is the directory server running? Is there some problem with the directory server data? Make a list of possible causes.
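
As a simple illustration, the following Python sketch performs this kind of quick reachability check against the servers involved. The host names and ports are placeholders for your own environment, and a successful TCP connection shows only that the service is listening on that port, not that it is answering requests correctly.

  # Quick reachability checks for the initial assessment. The host names and
  # ports are placeholders; a successful connection only proves the port is
  # open, not that the service behind it is healthy.
  import socket

  SERVICES = {
      "Web server (HTTPS)": ("www.example.com", 443),
      "Directory server (LDAP)": ("ldap.example.com", 389),
  }

  def port_is_open(host, port, timeout=5.0):
      """Return True if a TCP connection to host:port succeeds within timeout."""
      try:
          with socket.create_connection((host, port), timeout=timeout):
              return True
      except OSError:
          return False

  for name, (host, port) in SERVICES.items():
      status = "reachable" if port_is_open(host, port) else "NOT reachable"
      print(f"{name} at {host}:{port} is {status}")

Running a check like this against each system on your list of possible causes quickly narrows the scope of the problem before you start deeper analysis.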

After you've completed this initial assessment, inform the affected persons. Let them know that you are aware of the problem and are working on resolving it. If you have a pretty good idea of how long it will take to solve the problem, provide an estimate (perhaps padding it a bit to give yourself time to proceed methodically). To the greatest extent possible, inform your users and help desk which services are affected by the problem and what the symptoms might be. For example, if the directory is inaccessible because of a network failure, what will users of the corporate address book application see? What will happen to incoming and outbound electronic mail? Providing information to the help desk will help them understand the problem and more effectively communicate with the end users.

Step 2: Contain the Damage

After you've thought through the possible causes, are there any causes that might result in long-term data loss or corruption? For example, if the data in the directory has been damaged and entries for some employees are missing, what will happen to electronic mail addressed to those people? Most email server software will bounce the message (return it to the sender) with an explanation that no such address exists. Or if a disk drive appears to be failing and an automatic update script is about to run and put a heavy update load on the directory, what will happen to the data updates? Will they be lost?

If there is any possibility of such damage, it's often best to shut down the affected parts of the directory. Remember, delayed results or service unavailability are better than incorrect results or data loss.

Step 3: Put the System Back into Service by Applying a Short-Term Fix

After you've completed your initial assessment and contained any possible damage, it's time to start putting the directory back into service with a short-term fix, if appropriate. For example, if your master directory server has a bad logic board, and it will be several days before you can obtain a replacement part, perhaps an appropriate short-term fix is to place another machine in service. Or if you've determined that an automatic directory update process has erroneously removed entries from the directory, an appropriate short-term fix might be to restore the directory from a backup tape and shut off the automatic update process until you can analyze and resolve the problem with it.

The short-term fix will often leave you with somewhat reduced capacity. For example, you might put a replacement machine into service that isn't as fast as the machine it replaces. A failed directory server might simply be taken out of service until it can be repaired, assuming that you have a sufficient number of replicas to handle the remaining client load without serious degradation. This is fine as long as the situation is eventually resolved and the full capacity of the directory is restored.

Step 4: Fully Understand the Problem and Devise a Long-Term Fix

As soon as your directory is running with your short-term fix, it's time to examine all the evidence and fully comprehend what happened. For some types of problems, you will already understand the problem fully: Perhaps a power supply failed and has to be replaced, or a bug in update software needs to be fixed. In these cases, little further analysis is necessary.

In other cases, however, you might not fully understand what happened. Perhaps the directory server machine became unresponsive and had to be rebooted, or maybe replica directory servers got out of synchronization and had to be rebuilt from the master server. If you aren't 100% certain why the problem occurred in the first place, you should spend some time analyzing the failure. The following evidence can help with this step:

  • Directory server usage and error log files

  • Directory audit log files showing the changes made to the directory

  • Log files generated by the operating system

  • Log files from dependent applications such as Web servers, messaging servers, and so on

  • Output from your network management system (NMS) software

  • Problem reports from end users

  • Problem reports from your maintenance staff

Creating a timeline of these events is often very helpful in understanding the chronology and cause-effect relationships. For example, you may note that the directory server began reporting errors writing to its files at the same time the operating system began to log errors with the SCSI bus on the host. This might lead you to suspect a problem with the cable attaching the disk drive to the server, other SCSI peripherals attached to the SCSI bus, or the disk itself.
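
If you want to line up these events mechanically, a short script can merge the various log files into one chronological list. The sketch below assumes each log line begins with an ISO 8601 timestamp such as 2002-05-14T03:12:45; real directory, operating system, and application logs use different formats, so each source would need its own small parser, and the file names shown are placeholders.

  # Merge several log files into a single timeline so that related events
  # (for example, directory write errors and SCSI errors from the OS) line up.
  # Assumes each line begins with an ISO 8601 timestamp; lines without one
  # are skipped. File names are placeholders.
  from datetime import datetime
  from pathlib import Path

  LOG_FILES = ["directory-errors.log", "system-messages.log", "webserver-error.log"]

  def parse_line(source, line):
      """Return (timestamp, source, message), or None if no timestamp is found."""
      try:
          stamp = datetime.fromisoformat(line[:19])   # e.g. 2002-05-14T03:12:45
      except ValueError:
          return None
      return (stamp, source, line[19:].strip())

  events = []
  for path in map(Path, LOG_FILES):
      for line in path.read_text(errors="replace").splitlines():
          event = parse_line(path.name, line)
          if event:
              events.append(event)

  for stamp, source, message in sorted(events):
      print(f"{stamp.isoformat()}  [{source}]  {message}")

Even a rough merged timeline like this makes it much easier to spot which event came first and which events are merely symptoms of an earlier failure.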

Assuming that the underlying problem isn't something you can fix right away, you need to develop your long-term fix. This might be straightforward, such as scheduling downtime to replace a failed power supply when the replacement part arrives; or it might be quite complicated if it involves fixing a bug in an automatic update procedure, for example.

Complicated long-term fixes should ideally be tested before deployment. Maintaining a lab environment, in which you can test fixes before applying them to your production environment, is a great way to address this need. For large or mission-critical directories, it's an absolute requirement.

Step 5: Implement the Long-Term Fix and Take Steps to Prevent the Problem from Recurring

When your long-term fix has been identified and tested, it's time to deploy it. If the long-term fix requires any directory downtime, schedule it well in advance, ideally during evening or weekend hours when the directory sees low usage. For example, if you've put a replacement server in place while waiting for replacement parts for your primary server, you'll need some scheduled downtime to put the repaired primary server back into service.

Tip

One way to help ensure that changes to your directory environment are as nondisruptive as possible is to implement a change control policy. Such a policy describes the lead time required before a change can be implemented and who must be notified. An example of a change control policy might include the following:

  • At least 24 hours before any change happens, a change control notice must be submitted to all relevant persons, including administrators of dependent systems. This notice should describe the change being made, the reason for the change, and the anticipated effects on dependent services.

  • During this 24-hour period, the change may be vetoed if one of the affected administrators believes it will conflict with a previously scheduled change. If vetoed, the change must be rescheduled for a later time.

  • Emergency changes must, of course, be permitted.

Such a policy will help ensure that all affected staff are prepared for the maintenance or outage and know whom to contact if the outage causes problems with dependent systems.



In some cases, you may be able to avoid the downtime entirely. For example, suppose your software supports multimaster replication and you need to exchange a server for its replacement. You could put the repaired server into place, make it a read-write replica, and then remove the temporary server from service. Making the transition in this manner is entirely transparent to end users.

When you have everything up and running, ask yourself if there is some way to either prevent the problem from happening again or mitigate the negative consequences. For example, if an outage was caused by the failure of a server, could the impact have been lessened by having more replicas of the data available? If the outage was caused by a bug in the server software or operating system, have appropriate patches been deployed to all servers? Think of each incident as providing you with valuable input for improving the quality of your directory service.

Step 6: Arrange to Monitor for the Problem

If your monitoring system did not detect the problem, is there some way to update your monitoring strategy? If you are able to update your monitoring system to catch the problem in the future, you should be able to improve the response time. If the problem is a result of a bug in OS or directory server software, but a patch is not yet available, monitoring for the condition and taking some action such as rebooting a server can improve the reliability of the directory until a patch is available and can be installed.
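
As a rough illustration, a small watchdog script can check for the known failure condition and raise an alert when it appears. The sketch below only verifies that the LDAP port is accepting connections; the host name is a placeholder, and the notification hook is deliberately left as a stub, since in practice you would feed this check into your NMS or have it take whatever corrective action (such as a restart) your server software supports.

  # Watchdog for a known, not-yet-patched problem: periodically verify that
  # the directory server is accepting connections and raise an alert if not.
  # The host name is a placeholder, and the notification below is a stub --
  # in practice you would page the on-call administrator or raise an alarm
  # through your NMS, and possibly restart the server automatically.
  import socket
  import time

  HOST, PORT = "ldap.example.com", 389
  CHECK_INTERVAL = 60   # seconds between checks

  def ldap_port_responds(host, port, timeout=5.0):
      try:
          with socket.create_connection((host, port), timeout=timeout):
              return True
      except OSError:
          return False

  while True:
      if not ldap_port_responds(HOST, PORT):
          # Stub notification; replace with paging, email, or an SNMP trap.
          print(f"ALERT: directory server {HOST}:{PORT} is not accepting connections")
      time.sleep(CHECK_INTERVAL)

A check like this is only a stopgap, but it buys you faster response times until the underlying bug is patched and the extra monitoring can be folded into your regular monitoring system.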

Step 7: Document What Happened

Finally, take the time to produce a report of the problem, the steps you followed to determine the cause, and the resolution. Also record the total time the service was interrupted. A collection of these reports can be an extremely valuable resource for new technical staff. These reports can also be useful when communicating with management, especially if you want to make the case for additional funding for things such as increased server capacity or improved monitoring support.


