Troubleshooting and Resolving Problems

When presented with a problem, you should follow a methodical, step-by-step approach to troubleshooting. The steps we suggest are covered in this section.

Step 1: Assess the Problem, and Inform Affected Persons

When you first become aware of a problem, perform a quick initial assessment of the evidence and try to understand the scope of the problem. At this stage the underlying cause of the problem may not be obvious. You may have only a few unrelated problem reports from your Help Desk that may or may not clearly point to the cause. When the underlying cause isn't clear, try to think of all the possible causes and investigate each one.

Depending on the size of your organization and how your computing support is provided, you may need to contact other people to understand the possible causes. For example, if you are in charge of running the LDAP servers, another group is responsible for the physical network, and yet another group is responsible for the network routing, you may need to get in touch with all these groups to understand whether the problem is the fault of a system you administer. If you work in this type of organization, it's valuable to cultivate good working relationships with the groups that provide services you depend on and the groups that use the services you provide. Having clear escalation processes and up-to-date contact information is also crucial to this type of distributed problem solving.

For example, if you receive notification that users cannot access Web pages that require authentication, you don't really know whether the problem is with the Web server itself, the directory, or an infrastructure piece such as the network between them. Is the Web server accessible? If you access it yourself, can you observe the same problem the end users encountered? Is the directory server running? Is there a problem with the directory server data? Make a list of possible causes.

In completing this initial assessment, a diagram of your application's components and the network devices they depend on is helpful. Figure 20.1 shows a typical Web application.

Figure 20.1. A Typical Web Application

Suppose that users are able to reach your application's login page but receive an error message after providing their user name and password. Also assume that user authentication is implemented in your application layer by a Java 2 Enterprise Edition bean that runs on your application server and uses the Java Naming and Directory Interface (JNDI) to verify user credentials against the LDAP servers.
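
In this scenario, the bean's credential check essentially amounts to an LDAP bind performed through JNDI. The following minimal sketch (not taken from any particular product) shows roughly what such a check might look like; the host name and DN pattern are placeholders, and a real implementation would more likely search for the user's entry first rather than construct the DN directly.

import java.util.Hashtable;

import javax.naming.Context;
import javax.naming.NamingException;
import javax.naming.directory.InitialDirContext;

public class LdapAuthenticator {

    // Placeholder server URL and DN pattern; substitute the values used in your deployment.
    private static final String LDAP_URL = "ldap://ldap.example.com:389";
    private static final String DN_PATTERN = "uid=%s,ou=People,dc=example,dc=com";

    // Returns true if the supplied credentials allow a successful LDAP bind.
    public static boolean authenticate(String userId, String password) {
        Hashtable<String, String> env = new Hashtable<>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
        env.put(Context.PROVIDER_URL, LDAP_URL);
        env.put(Context.SECURITY_AUTHENTICATION, "simple");
        env.put(Context.SECURITY_PRINCIPAL, String.format(DN_PATTERN, userId));
        env.put(Context.SECURITY_CREDENTIALS, password);
        try {
            // Creating the context performs the bind; success means the directory accepted the credentials.
            new InitialDirContext(env).close();
            return true;
        } catch (NamingException e) {
            // Bad credentials, an unreachable server, and directory errors all surface here.
            return false;
        }
    }
}

If a check like this suddenly begins failing for all users, the exception it catches is often your first clue as to whether the fault lies with the credentials, the network, or the directory itself.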

As a first step, you could verify whether your Web servers have connectivity to the application servers. If other pages on your site are implemented with support from the application layer, then try to view one of those pages. If they work properly, you can assume that your Web servers are able to talk to the application servers. Assume that you've completed this step and have verified that your application servers are up and are accessible from the Web tier.

When you're sure that the problem lies in the interaction between the application servers and the directory servers, look at the components involved in providing LDAP service to your application tier. Those components, and some troubleshooting techniques for diagnosing them, are as follows:

  • The Java code that implements the authentication logic on the application servers. Application server logs can be helpful in determining the proper functioning of application-layer components.

  • The directory servers themselves. Are the server processes running? Are they responding to LDAP requests? (A quick check of this kind is sketched just after this list.) A cursory examination of the directory server logs may prove useful.

  • The load balancer (LB2 in Figure 20.1) that provides failover for the directory servers. Is the load balancer functional? Is it properly relaying requests to the directory servers? Load balancers are complex devices and can easily be misconfigured.

  • The physical network(s) that connect the application servers, load balancer LB2, and the directory servers. The ping utility is helpful in determining whether the network is at fault. If network performance is degraded because of faulty wiring, you may be able to observe errors on the network interfaces of your servers. The netstat -i command is useful for observing interface errors.
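
As an illustration of the second item in the preceding list, one quick way to confirm that a directory server is accepting LDAP requests is to read its root DSE (the entry with an empty distinguished name) anonymously. The sketch below assumes the Sun JNDI LDAP provider and a hypothetical host name; the connect-timeout property keeps a hung server from stalling the check.

import java.util.Hashtable;

import javax.naming.Context;
import javax.naming.NamingException;
import javax.naming.directory.Attributes;
import javax.naming.directory.InitialDirContext;

public class RootDseCheck {

    public static void main(String[] args) {
        Hashtable<String, String> env = new Hashtable<>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
        env.put(Context.PROVIDER_URL, "ldap://ldap.example.com:389"); // placeholder host
        // Provider-specific property: fail fast instead of hanging on a dead server.
        env.put("com.sun.jndi.ldap.connect.timeout", "5000");
        try {
            InitialDirContext ctx = new InitialDirContext(env); // anonymous bind
            // The root DSE is the entry with an empty DN; if we can read it, the server is answering.
            Attributes rootDse = ctx.getAttributes("", new String[] { "namingContexts", "vendorName" });
            System.out.println("Server is responding: " + rootDse);
            ctx.close();
        } catch (NamingException e) {
            System.err.println("Directory server did not respond: " + e);
        }
    }
}

Running a check like this against each directory server behind load balancer LB2, and then against the address the load balancer presents, helps you tell whether the servers themselves or the load balancer is at fault.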

After you complete this initial assessment, inform the affected persons. Let them know that you are aware of the problem and are working to resolve it. If you have a good idea of how long it will take to solve the problem, provide an estimate (perhaps padding it a bit to give yourself time to proceed methodically). To the greatest extent possible, inform your users and Help Desk which services are affected by the problem and what the symptoms might be. For example, if the directory is inaccessible because of a network failure, what will users of the corporate address book application see? What will happen to incoming and outbound e-mail? Providing information to the Help Desk staff will help them understand the problem and communicate more effectively with the end users.

Step 2: Contain the Damage

After you've thought through the possible causes, can you identify any that might result in long-term data loss or corruption? For example, if the data in the directory has been damaged and entries for some employees are missing, what will happen to e-mail addressed to those people? Most e-mail server software will bounce the message (return to sender) with an explanation that no such address exists. Or if a disk drive appears to be failing and an automatic update script is about to run and put a heavy update load on the directory, what will happen to the data updates? Will they be lost?

If there is any possibility of such damage, it's often best to shut down the affected parts of the directory. Remember, delayed results or service unavailability are better than incorrect results or data loss.

Step 3: Put the System Back into Service by Applying a Short-Term Fix

After you've completed your initial assessment and contained any possible damage, it's time to start putting the directory back into service with a short-term fix, if appropriate.

For example, if your master directory server has a bad logic board and it will be several days before you can obtain a replacement part, perhaps an appropriate short-term fix is to place another machine in service. Or if you've determined that an automatic directory update process has erroneously removed entries from the directory, an appropriate short-term fix might be to restore the directory from a backup tape and shut off the automatic update process until you can analyze and resolve the problem with it.

The short-term fix often leaves you with somewhat reduced capacity. For example, you might put a replacement machine into service that isn't as fast as the machine it replaces. A failed directory server might simply be taken out of service until it can be repaired, if you have a sufficient number of replicas to handle the remaining client load without serious degradation. This temporary solution is usually acceptable as long as the situation is eventually resolved and the full capacity of the directory is restored.

Step 4: Fully Understand the Problem, and Devise a Long-Term Fix

As soon as your directory is running with your short-term fix, it's time to examine all the evidence and fully comprehend what happened. For some types of problems, you will already understand the problem fully: Perhaps a power supply failed and has to be replaced, or a bug in update software needs to be fixed. In such cases, no further analysis is necessary.

In other cases, however, you might not fully understand what happened. Perhaps the directory server machine became unresponsive and had to be rebooted, or maybe replica directory servers got out of sync and had to be rebuilt from the master server. If you aren't 100 percent sure why the problem occurred in the first place, spend some time analyzing the failure. The following evidence can help with this step:

  • Directory server usage and error log files

  • Directory audit log files showing the changes made to the directory

  • Log files generated by the operating system, including system logs (usually found in /var/log or /var/adm on Unix systems) or the Windows event log

  • Log files from dependent applications such as Web servers, messaging servers, and so on

  • Output from your network management system (NMS) software

  • Problem reports from end users

  • Problem reports from your maintenance staff

Creating a timeline of these events will often help you understand the chronology and cause-and-effect relationships. For example, you may note that the directory server began reporting errors writing to its files at the same time the operating system began to log errors with the SCSI bus on the host. This correlation might lead you to suspect a problem with the cable attaching the disk drive to the server, other SCSI peripherals attached to the SCSI bus, or the disk itself.
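
If the logs you are correlating all begin each line with a timestamp, even a very small tool can merge them into a single chronological view. The sketch below is only illustrative: it assumes each line starts with an ISO 8601 timestamp (for example, 2002-06-15T03:12:45), so the parsing would need to be adapted to the formats your servers actually use.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class LogTimeline {

    public static void main(String[] args) throws IOException {
        List<String[]> events = new ArrayList<>(); // each entry: { timestamp, source file, full line }
        for (String file : args) {
            for (String line : Files.readAllLines(Path.of(file))) {
                // Assumes each line begins with an ISO 8601 timestamp such as 2002-06-15T03:12:45.
                if (line.length() >= 19) {
                    events.add(new String[] { line.substring(0, 19), file, line });
                }
            }
        }
        // Sorting by timestamp interleaves the events from every log into one chronology.
        events.sort(Comparator.comparing((String[] e) -> e[0]));
        for (String[] e : events) {
            System.out.println(e[1] + ": " + e[2]);
        }
    }
}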

Tip

Most servers log events chronologically and mark each event with a timestamp. Correlating logs across multiple servers is much easier if the servers' clocks are all in sync. Use a time synchronization protocol such as the Network Time Protocol (NTP) to keep your server clocks synchronized.


If the underlying problem isn't something you can fix right away, you need to develop a long-term fix. The long-term fix might be straightforward, such as scheduling downtime to replace a failed power supply when the replacement part arrives; or it might be complicated, such as fixing a bug in an automatic update procedure.

Complicated long-term fixes should ideally be tested before deployment. Maintaining a lab environment in which you can test fixes before applying them to your production environment is a great way to address this need. For large or mission-critical directories, it's an absolute requirement.

Step 5: Implement the Long-Term Fix, and Take Steps to Prevent the Problem from Recurring

When your long-term fix has been identified and tested, it's time to deploy it. If the long-term fix requires any directory downtime, schedule it well in advance, ideally during hours of low directory use. For example, if you've put a replacement server in place while waiting for replacement parts for your primary server, you'll need some scheduled downtime to put the repaired primary server back into service.

In some cases you may be able to avoid downtime entirely. For example, suppose that your software supports multimaster replication and you need to exchange a server for its replacement. You could put the repaired server into place, make it a read/write replica, and then remove the temporary server from service. A transition carried out in this manner is entirely transparent to end users.

When you have everything up and running, ask yourself whether there is a way either to prevent the problem from happening again or to mitigate the negative consequences. For example, if an outage was caused by the failure of a server, could the impact have been lessened if more replicas of the data had been available? If the outage was caused by a bug in the server software or operating system, have appropriate patches been deployed to all servers? Think of each incident as providing you with valuable input for improving the quality of your directory service.

Tip

One way to help ensure that changes to your directory environment are as nondisruptive as possible is to implement a change control policy. Such a policy describes the lead time required before a change can be implemented and who must be notified. An example of a change control policy might include the following:

  • At least 24 hours before any change happens, a change control notice must be submitted to all relevant persons, including administrators of dependent systems. This notice should describe the change being made, the reason for the change, and the anticipated effects on dependent services.

  • During this 24-hour period, the change may be vetoed if one of the affected administrators believes it will conflict with a previously scheduled change. If vetoed, the change must be rescheduled for a later time.

  • Emergency changes must, of course, be permitted.

Such a policy will help ensure that all affected staff are prepared for the maintenance or outage and know whom to contact if the outage causes problems with dependent systems.


Step 6: Arrange to Monitor for the Problem

If your monitoring system did not detect the problem, can your monitoring strategy be updated? If you are able to update your monitoring system to catch the problem in the future, you should be able to improve the response time. If the problem is the result of a bug in the operating system or directory server software but a patch is not yet available, monitoring for the condition and taking an action such as rebooting a server can improve the reliability of the directory until a patch is available and can be installed.
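
As a sketch of this kind of stopgap monitoring, the following loop probes a directory server once a minute and raises an alert when the probe fails. The host name and interval are placeholders, and a production monitor would notify an operator or restart the server process rather than simply print a message.

import java.util.Date;
import java.util.Hashtable;

import javax.naming.Context;
import javax.naming.NamingException;
import javax.naming.directory.InitialDirContext;

public class DirectoryMonitor {

    public static void main(String[] args) throws InterruptedException {
        while (true) {
            if (!probe("ldap://ldap.example.com:389")) { // placeholder host
                // A real monitor would page an operator or restart the server process here.
                System.err.println("ALERT " + new Date() + ": directory server failed its probe");
            }
            Thread.sleep(60_000); // probe once a minute
        }
    }

    // Attempts an anonymous bind; returns false if the server cannot be reached in time.
    static boolean probe(String url) {
        Hashtable<String, String> env = new Hashtable<>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
        env.put(Context.PROVIDER_URL, url);
        env.put("com.sun.jndi.ldap.connect.timeout", "5000"); // fail fast on a hung server
        try {
            new InitialDirContext(env).close();
            return true;
        } catch (NamingException e) {
            return false;
        }
    }
}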

Step 7: Document What Happened

Finally, take the time to produce a report of the problem, the steps you followed to determine the cause, and the resolution. A workflow diagram describing these steps may be useful in the future, especially if you need to create a decision tree to aid in troubleshooting. In addition, record the total time the service was interrupted. A collection of these reports can be a valuable resource for new technical staff. These reports can also be useful when you're communicating with management, especially if you want to make the case for additional funding for things such as increased server capacity or improved monitoring support.
