Types of Problems | Understanding and Deploying LDAP Directory Services (2nd Edition)

As mentioned earlier in this chapter, there are four main classifications of directory problems. In this section we provide an overview of the various issues that can arise with each type of problem.

Directory Outages

The first type of problem we explore is the directory outage. In an outage, part or all of your directory service becomes unavailable. Outages can occur when one or more of the directory servers have become unreachable because of a network problem, or because the directory server software or hardware has failed in some way. When an outage occurs, users receive no service at all.

Causes

Causes of directory outages fall into two broad categories: hardware failures and software failures. Hardware that can fail includes network components such as routers, switches, network interface cards, and network cabling; and server components such as CPU boards, disk drives, memory, power supplies, and so on. An outage can also result from a power outage. Software failures can include failures of the operating system itself or of the directory server software. Other software running on the server can also malfunction and cause the directory server software to fail or become unresponsive.

Implications

When a directory outage occurs, your users and directory-enabled applications receive no directory service at all. Because LDAP is a client/server protocol, outages are generally noticed as a failure to connect to the directory service. Outages produce three different symptoms: connection timeouts, refused connections, and hung connections.

In a connection timeout, the client attempts to open a connection to the server but receives no response at all. In this case the client waits a fixed period of time for a response. Eventually the client times out and reports an error; however, this timeout period is typically on the order of minutes, longer than most users are willing to wait. Connection timeouts can occur if the directory server becomes unreachable because of a network failure, if the server itself is powered off, or if the operating system has crashed.

With a refused connection, the directory server's operating system is operating correctly and the server is reachable on the network, but no server process is listening for incoming LDAP connections. In this case the operating system's TCP/IP stack returns a "connection refused" message to the client. Refused connections typically happen when the directory server process has terminated abnormally because of a software bug. They can also occur if the server machine has restarted, perhaps because of a power failure, but the directory server software is not configured to start automatically upon system reboot.

With a hung connection, the server is reachable and a directory server process is running, but the service malfunctions. In this case the connection is still accepted, but no further data flows from the server to the client. Hung connections can result from software bugs or hardware failure, such as a hung SCSI (Small Computer System Interface) bus on the server.
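
The three symptoms can be told apart from the client side with a simple probe. The following Python sketch is one possible way to do so, not a vendor tool: the function name classify_outage and its return labels are invented for illustration, the five-second timeout is an arbitrary choice, and the probe assumes that a healthy server answers an anonymous LDAPv3 bind request with at least one byte.

```python
import socket

# Anonymous LDAPv3 simple bind request (message ID 1), BER-encoded.
ANONYMOUS_BIND = bytes.fromhex("300c020101600702010304008000")


def classify_outage(host, port=389, timeout=5.0):
    """Probe once and return 'ok', 'timeout', 'refused', 'hung',
    'closed', or 'unreachable' (labels are illustrative only)."""
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
    except socket.timeout:
        return "timeout"        # no response at all to the connection attempt
    except ConnectionRefusedError:
        return "refused"        # host is up, but no process listens on the port
    except OSError:
        return "unreachable"    # e.g. no route to host or name lookup failure
    with conn:
        try:
            conn.sendall(ANONYMOUS_BIND)
            reply = conn.recv(1)
        except socket.timeout:
            return "hung"       # connection accepted, but no data came back
        except OSError:
            return "closed"     # connection was reset while probing
    return "ok" if reply else "closed"
```

A call such as classify_outage("ldap.example.com") could be run from a monitoring host; a short timeout makes the probe report a problem long before an interactive client would give up.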

Resolution

If hardware failure is the underlying cause of an outage, the usual course of action is to replace and restart the failed hardware. For example, if the cause is determined to be a bad network cable, the obvious remedy is to plug in a new cable.

In some cases, however, outages caused by hardware failure can be more challenging to correct, either because the needed replacement parts are not available or because the directory data was damaged (which can happen if a disk drive fails). Ideally you should maintain a stock of spare parts or purchase an on-site service agreement with a guaranteed maximum response time. Depending on the availability requirements and any service-level agreements you might have, even a short outage may be unacceptable. In this case you need to look into providing a highly available directory service, as discussed in Chapter 17, Backups and Disaster Recovery.

When software failure is the root of a directory outage, more vigilance is required. Simply restarting the failed software component may bring the service back online, but if the underlying condition that triggered the software failure is still present, the service may soon fail in exactly the same way. For example, suppose that the directory server's cache is configured to be much too large and the server becomes unresponsive because of excessive paging activity. Restarting the server would probably make the service perform acceptably for a while, but as soon as the cache filled, the system might again experience the problem.

When the cause of a software outage is unclear, try to note as much detail about the state of the service while the problem is occurring. Is the server process running? If so, how much memory is it using? Is it consuming any CPU time? Is there disk activity? Do the operating system or directory service log files contain any useful information? If the cause of the problem is still unclear, it may be appropriate to restart the service and watch carefully to see whether the service degrades slowly or fails suddenly. For example, you might notice that the service fails whenever a particular query arrives at the server from a particular directory-enabled application. Information like this can be valuable when you're working with your vendor's technical support organization.

Performance Problems

Another common type of directory problem is poor performance, which can manifest itself in various ways. The overall performance of your directory may be poor, or a specific type of directory operation such as a delete operation might be slow. Performance problems may be consistent, or they may be intermittent. Troubleshooting these problems requires careful analysis and attention to detail.

Causes

Misconfigured software is the most common cause of directory performance problems. A misconfigured directory server might not perform optimally, or it might not function at all. For example, most directory server software uses a RAM-based cache to improve performance. If this cache is too small, the directory's performance can suffer, perhaps to the point where the server seems to be hung. To correct this problem, increase the size of the cache. On the other hand, if the cache configuration is too large, the server's virtual memory system may experience excessive paging, thereby slowing performance. Most operating systems offer a utility for observing virtual memory system paging activity (perfmon on Windows NT and vmstat on Solaris, for example).

Another common misconfiguration problem results from not maintaining appropriate indexes for the types of searches your server handles. Netscape Directory Server 6, for example, permits searches on any attribute. However, search performance is poor on unindexed attributes because the server might need to look through every entry in the database to locate matching entries. If your server takes a long time to respond to search requests and consumes a large amount of CPU time, you may want to check whether clients are using unindexed attributes in search filters.

If you notice that your Netscape Directory Server 6 is performing poorly, you can examine the server access log for clues. The access log is found in the logs directory beneath the server root and is named access.

First look in the access log for the string "notes=U" in the RESULT log entry for search operations. This string indicates that the server was unable to use an index to service the search operation.

Note that Netscape Directory Server 6 writes two lines to the server log for each client operation it services. The first log entry is written when an operation is received from a client, and the second is written when the result message is sent to the client. Because the server can handle many client operations concurrently, log entries from other client operations may be interleaved in the log, and the two entries for a given client operation may not be adjacent.

To correlate a client operation initiation log entry with its corresponding result message, you need to examine the connection and operation information. Each incoming connection to the directory server is assigned an increasing connection number, which is logged as conn=c in the access log, where c is the connection number. Each operation serviced for a given connection is assigned an increasing operation number, which is logged as op=o, where o is the operation number. The following example shows what Netscape Directory Server might log in response to a client's search operation:

 [11/Jan/2002:08:05:37 -0800] conn=24 op=8 SRCH base="dc=example,dc=com" scope=2 filter="(description=*engineer*)" attrs=ALL
 other log entries...
 [11/Jan/2002:08:05:44 -0800] conn=24 op=8 RESULT err=0 tag=101 nentries=7 etime=7 notes=U

Notice that the client has issued a search with the filter (description=*engineer*), and that the corresponding RESULT log message indicates that no index was available to service the search operation (because notes=U was logged). Netscape Directory Server 6 requires a substring index on the description attribute to service this search efficiently. In this case, if clients frequently perform substring searches on the description attribute, consider adding such an index to the server's configuration.
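
Pairing entries by hand becomes tedious in a busy log. The Python sketch below shows one way to automate the correlation, assuming only the log layout shown in the example above (conn=, op=, SRCH with a quoted filter, and RESULT with an optional notes=U). The function name unindexed_searches is invented for illustration.

```python
import re

ENTRY = re.compile(r"conn=(\d+) op=(\d+) (SRCH|RESULT)\b(.*)")
FILTER = re.compile(r'filter="([^"]*)"')


def unindexed_searches(lines):
    """Yield (conn, op, filter) for each search whose RESULT carries notes=U."""
    pending = {}  # (conn, op) -> filter text taken from the SRCH entry
    for line in lines:
        match = ENTRY.search(line)
        if not match:
            continue
        conn, op, kind, rest = match.groups()
        key = (int(conn), int(op))
        if kind == "SRCH":
            found = FILTER.search(rest)
            pending[key] = found.group(1) if found else ""
        elif key in pending:
            # RESULT for a search seen earlier; entries need not be adjacent.
            search_filter = pending.pop(key)
            if "notes=U" in rest.split():
                yield key[0], key[1], search_filter
```

Passing an open access log file to unindexed_searches and counting the filters it yields gives a quick picture of which attributes are being searched without an index.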

Also examine the elapsed time for each operation, recorded in RESULT messages as the string etime=n, where n is the elapsed time for the operation in seconds. In the previous example, notice that the search operation took seven seconds to complete, which is longer than expected. Most directory operations should complete in less than two seconds. If you find that some operations are taking longer to complete, you should verify that the proper indexes are being maintained if the operation is a search, or that the server is not overloaded if the operation is an update. The following Unix command will display a summary of elapsed times for all operations in the access log:

 grep RESULT access | awk '{print $9}' | sort | uniq -c

The output from the command might look like this:

 32924 etime=0
   112 etime=1
     1 etime=2
     1 etime=3

This output indicates that 32,924 operations completed in less than one second, 112 operations completed in one second, and only two operations took longer than one second. This type of distribution is what you should expect to see on a server that is performing well. If you run this command periodically against your access log, you can quickly determine if any operations are running slowly, warranting further investigation.
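
If you want to run this check periodically from a script rather than by eye, the same summary can be computed in a few lines. The Python sketch below is a minimal example under the assumption that RESULT entries carry etime=n as described above. The names etime_histogram and slow_fraction are invented, and the default threshold of two seconds simply follows the rule of thumb given earlier.

```python
import re
from collections import Counter

ETIME = re.compile(r"\bRESULT\b.*\betime=(\d+)")


def etime_histogram(lines):
    """Count RESULT entries by elapsed time in whole seconds."""
    counts = Counter()
    for line in lines:
        match = ETIME.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


def slow_fraction(counts, threshold=2):
    """Fraction of operations whose etime is at least `threshold` seconds."""
    total = sum(counts.values())
    slow = sum(n for seconds, n in counts.items() if seconds >= threshold)
    return slow / total if total else 0.0
```

For the sample output shown above, two of the 33,038 operations reach the two-second mark, so slow_fraction returns a value well under one percent. A scheduled job could alert when the fraction rises noticeably above its usual level.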

Your software may also encounter a specific limit within either the server software or the operating system. For example, most versions of Unix have a hard limit on the number of file descriptors that a single Unix process may use (one file descriptor is used for every open TCP/IP connection and every open file). This limit is usually in the thousands of connections, but it may be altered via Unix commands.
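
On Unix systems the current limit can be inspected programmatically. The Python sketch below uses the standard resource module. Note that it reports the limit of the process that runs it, which matches the directory server's limit only if both are started from the same environment. The function names and the reserve of 64 descriptors are illustrative guesses, not vendor recommendations.

```python
import resource


def descriptor_limits():
    """Return the (soft, hard) per-process limit on open file descriptors."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def has_headroom(expected_connections, reserve=64):
    """True if the soft limit covers the expected connections plus a reserve
    for log files, database files, and other descriptors (reserve is a guess)."""
    soft, _hard = descriptor_limits()
    if soft == resource.RLIM_INFINITY:
        return True
    return soft >= expected_connections + reserve
```

If has_headroom returns False for your expected peak number of client connections, raising the limit before the server starts is usually preferable to discovering the cap under load.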

Another type of limitation you may encounter is the size of the TCP listen queue. This parameter controls how many incoming TCP connections can simultaneously be in the process of being opened. TCP connections are usually established quickly and are removed from the listen queue as soon as they are completely established. However, if many clients attempt to connect to a server at the same time, or if a problem prevents the connection from being quickly established, the listen queue can fill up and new incoming connections may appear to hang temporarily until the queue drains.

Older versions of the Unix operating system had a small listen queue (often fixed at only five incoming connections) that was inadequate for high-volume network servers. Newer operating system versions are configurable, and you may need to significantly increase the size of the listen queue. HTTP servers are particularly susceptible to this problem because Web browsers tend to open many short-lived connections. It's often necessary to increase the listen queue to 64 connections or more on a heavily used Web server. LDAP clients typically use longer-lived connections, but you still may need to increase the listen queue size on an LDAP server.
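
From the server program's point of view, the listen queue size is the value it passes when it starts listening. The operating system's tuning parameter then acts as an upper bound on that request. The Python sketch below illustrates the application side only; the function name open_listener and the backlog of 128 are arbitrary choices for illustration.

```python
import socket


def open_listener(host="127.0.0.1", port=0, backlog=128):
    """Bind a TCP socket and request a listen queue of `backlog` connections."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))
    # The operating system may silently cap the requested queue size.
    sock.listen(backlog)
    return sock
```

Connections that arrive while the server is busy wait in this queue until the server accepts them, which is why a queue that is too small shows up as clients that appear to hang.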

If your server software runs inside a single process, it is subject to an operating system connection limit. This means that there is a cap on the maximum simultaneous number of connections that can be handled by a single process. If there are so many clients that your server encounters this limit, you will notice that the server refuses connections (clients receive a "connection refused" error from the TCP/IP stack). The proper action in this case is to increase the limit, add additional replicas to handle the load, or reduce the number of client connections (if the client software opens connections unnecessarily or neglects to close connections when finished).

Some vendors provide utility software that can help you identify problems with operating system tuning parameters, such as an incorrectly sized listen queue, and can recommend more appropriate settings. For example, Netscape Directory Server 6 is bundled with a utility named dsktune that analyzes your operating system. It verifies that all required patches are installed, checks that the per-process limit on file descriptors is set appropriately, and confirms that several important TCP parameters are set to values that will yield optimum performance.

Finally, you may encounter performance-degrading software bugs in the directory server software or the operating system. Software vendors increasingly are using the World Wide Web to distribute patches and publish knowledge bases full of information about their products. In addition, Usenet newsgroups are a tremendous resource for learning about known bugs, workarounds, and patches.

If you find a previously unknown bug, make sure that you report it to the software vendor in as clear a fashion as possible so that the bug can be fixed. For information on submitting a good bug report, see Chapter 19, Monitoring.

Implications

The implications of performance problems can range from slight degradation of the service to outright failure. The symptoms can affect all users equally, or they might affect only a subset of directory users. For example, whereas a misconfigured cache can result in poor performance for all users and directory-enabled applications, a missing attribute index might result in poor search performance for only users and applications who search on the unindexed attribute (unless many users do so, in which case the server's overall performance may suffer).

Resolution

When you're resolving performance problems, it's important to proceed logically and deliberately. Take notes that describe the problems exactly as reported or observed, and then try to reproduce the problem yourself. For example, if your users complain that address book searches in Netscape 7 take a long time, try the same search yourself. Ask the users how the search dialog is configured and perform the same search. If you can reproduce the problem, you're well on your way to understanding the root cause. If you can't reproduce the problem, ask yourself what's different about your environment and the user's environment. Are you connected to the same server? Are you authenticated or bound anonymously? Are you closer to the server within the network? Does the server access log contain any telltale clues? Try to eliminate each difference, one by one, until you can duplicate the problem.

Remedying the problem may be simple if a small configuration change is required, or it may be complicated if a bug has been discovered in your software. If no fix is available, is there a workaround? Can you mitigate the effects of the problem by reconfiguring your directory in some way? For example, installing more memory, adding another CPU, or adding an additional replica may provide the additional capacity you need if you encounter a limit on your software.

No matter what the remedy, take the time to document the problem, the cause, and your workaround and/or long-term fix. Software problems can be complex, and the more you can share troubleshooting knowledge with your peers, the more effective your organization will be at providing a high-quality service. These details can also be useful to the operating system or directory server software vendor if a software bug is the root cause.

Tip

It's helpful to understand and be able to interpret the types of information that your server software can log. For example, Netscape Directory Server writes detailed information to its access log (refer to Chapter 19, Monitoring, for information on the access log and techniques for analyzing it).

The access log can help you understand problems. For example, if a user complains of slow search performance, you can search the logs for the IP address of the user's machine. When you find the log entries corresponding to the user's session, you can see exactly the type of search base and search filters that the user's client software presented to the server. This information may help you determine whether the problem is the result of a misconfiguration in the user's software or a problem with the server itself.
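
A short script can do this lookup for you. The Python sketch below rests on an assumption not shown in this chapter: that the server logs a line containing "connection from <client address> to <server address>" together with conn=c when a connection is opened. Check your own access log before relying on it. The function name searches_from is invented for illustration.

```python
import re

# Assumed layout of the entry logged when a client connects.
CONNECT = re.compile(r"conn=(\d+) .*connection from (\S+) to ")
SEARCH = re.compile(r'conn=(\d+) op=\d+ SRCH base="([^"]*)".*filter="([^"]*)"')


def searches_from(lines, client_ip):
    """Return (base, filter) pairs for searches issued from one client address."""
    connections = set()   # connection numbers opened by this client
    found = []
    for line in lines:
        opened = CONNECT.search(line)
        if opened:
            if opened.group(2) == client_ip:
                connections.add(opened.group(1))
            continue
        search = SEARCH.search(line)
        if search and search.group(1) in connections:
            found.append((search.group(2), search.group(3)))
    return found
```

The returned search bases and filters can then be compared with what the user's client is supposed to send.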

Problems with Directory Data

Directory data problems may be the result of missing, extra, or incorrect information. In the worst case, the database files that your directory server software uses may become corrupted because of software bugs, operating system bugs, or operator errors. In our experience with actual directory deployments, this is the most common type of problem.

Data problems are often a consequence of another problem, such as misconfigured software. In other cases, data problems can result from incorrect actions on the part of data administrators. Problems with data itself can also be a cause of many other problems. For example, if access control attributes have been erroneously changed or removed, users and applications may not be able to access needed directory entries.

Causes

When incorrect data appears in your directory, someone or some process must have put it there. For example, a confused departmental administrator might remove the entry for an active employee instead of a terminated employee. On a larger scale, an automated update process that reconciles database records from a human resources database might, as a result of a bug, place incorrect employee information in the directory or remove needed information.

Typical monitoring software won't detect this type of problem unless the data is so damaged that the server crashes or cannot even start up. You will usually learn of this type of problem via end-user reports unless you proactively monitor data quality.

To be more proactive, you might consider developing tools to monitor the quality of data in your directory or build data validation tools into the software that you can use to synchronize your directory with external data sources. Such tools can detect problems with data before they are noticed by users. More information on data quality monitoring can be found in Chapter 18, Maintaining Data.
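
As a rough idea of what such a validation step might look like, the Python sketch below checks individual entries and refuses a synchronization run that would delete an implausibly large share of the directory. Everything specific here is hypothetical: the required attributes, the 5 percent deletion threshold, and the function names would all depend on your own schema and update process.

```python
REQUIRED = ("cn", "sn", "mail")   # hypothetical required attributes


def validate_entry(entry):
    """Return a list of problems in one entry (dict of attribute -> list of values)."""
    problems = []
    for attr in REQUIRED:
        values = entry.get(attr) or []
        if not any(value.strip() for value in values):
            problems.append(f"missing {attr}")
    for address in entry.get("mail") or []:
        if address.strip() and address.count("@") != 1:
            problems.append(f"malformed mail: {address}")
    return problems


def safe_to_apply(current_dns, incoming_dns, max_delete_fraction=0.05):
    """False if the update would delete more than the allowed share of entries."""
    current = set(current_dns)
    if not current:
        return True
    deleted = current - set(incoming_dns)
    return len(deleted) / len(current) <= max_delete_fraction
```

A synchronization job could call safe_to_apply before making any change and stop for human review when it returns False, which guards against the buggy bulk update described above.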

Implications

When incorrect data ends up in your directory, dependent applications can start behaving incorrectly. For example, if your directory is used to authenticate users accessing your internal Web servers, but some users have been incorrectly removed from the directory, they will be unable to access the protected resources. Or if a user's directory entry has been removed, e-mail destined for that user may be returned to the sender. In general, if the directory appears to be operating correctly but one or more of your users are having problems, check the contents of the relevant directory entries.

Even more subtle errors can occur if the directory holds information about network resources such as file servers and printers. If the directory entries corresponding to these devices are removed or damaged in some way, the services provided by those devices may become unavailable.

If database files become corrupted, symptoms may be either obvious or subtle. All the entries in the directory may disappear (which is easy to notice), or certain entries may simply not be returned when certain types of searches are performed. Robust server software prevents these types of inconsistencies from arising as a part of normal operation, but operator mistakes can cause any number of unanticipated problems. For example, although Netscape Directory Server uses a transactional database that ensures data consistency even in the case of operating system crashes, manually removing one of the attribute index files and restarting the server can cause unexpected and unwanted results.

If the corruption is subtle, it may go unnoticed for some time. When dealing with corrupted data, always be open to the possibility that the damage actually occurred some time ago and has only now been noticed.

Resolution

If you determine that you have a problem with the data in your directory, the first thing to do is determine the extent of the damage. To do this, of course, you need to have some idea of what should actually be in the directory. A good starting point is to look at the directory contents. Do you see approximately the correct number of entries in your directory? If you see too few entries, indicating that entries have been erroneously deleted, it may be prudent to shut down certain dependent services. For example, if you know that the entries for an entire department are missing from the directory, you should probably shut down the servers that handle e-mail for those people. If the mail server cannot locate a user's e-mail address in the directory, it may incorrectly conclude that there is no such user and return the mail to the sender.

Sometimes the safest thing is to shut down the affected servers. Directory-enabled applications generally notice that the directory is unavailable, report a meaningful error, and retry the operation later. However, if the data is incorrect or missing but the directory is not shut down, applications may behave incorrectly.

As soon as you know the extent of the damage, you need to set about repairing it. How you do this depends on the damage and the knowledge you have about the correct contents of the directory. For example, if only a single user's entry has been deleted, it's probably most appropriate simply to re-create the entry. On the other hand, if your directory's entire contents have been wiped out by a buggy automated update process, you need to restore your data from a comprehensive source such as a set of backup files or tapes. We suggest that any update scripts you develop yourself include the capability to log their actions to a file so that you can analyze the scripts' actions later if necessary.
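
The logging suggestion is easy to follow with standard tools. The Python sketch below shows one possible shape for an update script's audit trail. The logger name, the log line format, and the apply callback (which stands in for whatever actually changes the directory) are placeholders, not part of any directory product.

```python
import logging


def make_audit_logger(path):
    """Create a logger that appends one timestamped line per action to `path`."""
    logger = logging.getLogger("directory_update")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(path)
    handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
    logger.addHandler(handler)
    return logger


def apply_changes(changes, apply, logger):
    """Apply (action, dn) changes one by one, logging each outcome."""
    for action, dn in changes:
        try:
            apply(action, dn)
        except Exception as exc:
            logger.error("FAILED %s %s: %s", action, dn, exc)
        else:
            logger.info("OK %s %s", action, dn)
```

With a record like this, you can later establish exactly which entries a script touched and when, which shortens the analysis after a damaging run.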

After you restore your directory, you need to understand how the damage happened. Did incorrectly configured access control allow removal of an entry? Did a data merge process go awry? You need to examine log files and other records to determine when and how the damage occurred. This step is important for preventing similar problems. As we've learned in our own deployment experience, problems rarely resolve themselves permanently!

Security Problems

The final category of problems is related to security. The most serious type of security problem is unauthorized access to directory data. An attacker may attempt to compromise the security of the directory with the intent of reading or damaging sensitive directory data or rendering the service useless for other users via a denial-of-service attack. The topics of directory access control and security are covered in detail in Chapter 12, Privacy and Security Design. The steps outlined there should protect you against many common types of break-ins, but how do you notice and respond to security problems if they occur in spite of your best efforts?

A good way to detect compromised security is to be on the alert for telltale signs that the directory has been tampered with. Such signs, often subtle, might include access to your directory from an unexpected location on the Internet (if your directory is accessible from the Internet at all), or a report from a user that his or her directory entry has been altered unexpectedly. This is often just a consequence of a normal automated update, but it can also signal that the user's password or other credentials have been compromised. If you suspect unauthorized access, directory server logs can help track down the date and time when the tampering occurred and the origin of the LDAP connection.

A denial-of-service attack, on the other hand, has one purpose: to render the directory unusable. The attacker seeks to consume all available resources, perhaps by issuing thousands of repeated unindexed searches against the directory, or by exploiting a known bug in the directory or operating system software and causing a crash.

Causes

If your directory is directly accessible from the Internet, it may be subject to attacks from any place in the world. Or if your directory provides authentication and personalization services for a Web-based application that is Internet-accessible, an attack on the Web application may overwhelm your directory server.

On the other hand, if your directory is accessible only from inside your corporate network, you are less susceptible to attack from the outside, but you are not immune. Given enough time and motivation, disgruntled employees can certainly wreak havoc on a directory server.

What motivates people to mount denial-of-service attacks or attempt to break in to your directory? In some cases the sheer challenge of breaking in is sufficient. In other cases a disgruntled employee may be angry enough to attempt to compromise your directory as a way of achieving revenge.

Implications

A denial-of-service attack can render a directory unresponsive by consuming excessive resources, or it can take down a directory by exploiting known bugs. A security breach is serious, especially if your directory is used for authentication and access control for your critical business resources. The implications of compromised security are highly dependent on the type of data stored in your directory.

Resolution

Discovering a denial-of-service attack is usually simple because the affected servers and dependent services become unresponsive or unavailable. A well-designed monitoring system will note sudden peaks in server load. When you suspect a denial-of-service attack, be sure to save any relevant log files from the time of the attack. The logs may be useful in identifying the source of the attack launched against your servers. Be aware, however, that distributed denial-of-service attacks enlist compromised systems, or "zombies," to do their dirty work. The original source of the attack may not be reflected in your logs.
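
Saved access logs can also be scanned after the fact for load peaks. The Python sketch below counts RESULT entries per minute, using the timestamp layout of the Netscape Directory Server access log shown earlier in this chapter, and flags minutes that far exceed the median. The function names and the factor of 10 are arbitrary illustrations, not a tested detection rule.

```python
import re
from collections import Counter
from statistics import median

# Captures day/month/year:hour:minute from a timestamp such as
# [11/Jan/2002:08:05:37 -0800]
MINUTE = re.compile(r"^\s*\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}):\d{2} ")


def operations_per_minute(lines):
    """Count completed operations (RESULT entries) in each minute of the log."""
    counts = Counter()
    for line in lines:
        stamp = MINUTE.match(line)
        if stamp and " RESULT " in line:
            counts[stamp.group(1)] += 1
    return counts


def load_peaks(counts, factor=10):
    """Return minutes whose count exceeds `factor` times the median minute."""
    if not counts:
        return []
    typical = median(counts.values())
    return [minute for minute, n in counts.items() if n > factor * typical]
```

The minutes returned by load_peaks indicate which parts of the log deserve a closer look, for example to see which connections issued the bulk of the operations.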

If you determine that the attack originated from a location inside your company, you probably have some recourse to stop the attacker. Be careful when tracing the attack back to the original source, however; the fact that a particular machine was the origin of the attack doesn't necessarily mean that the owner was the attacker. A skilled hacker will cover his footprints carefully and avoid using his own desktop machine to originate an attack.

Tip

If the origin of the attack is outside your company, you might be able to use a firewall product such as Cisco's PIX Firewall to block access to your directory from the originating network.

It's also entirely possible for a user to accidentally cause a heavy load on the directory without knowing it. For example, if a user attempts to search the directory for an unindexed attribute but grows impatient and submits the search several times, the directory may become less responsive. In other cases, clever users may write scripts that collect directory data for legitimate purposes, but do so inefficiently or too frequently. These are not malicious attacks, and you need to be aware of this possibility when you're tracing the problem to its source.

Reconfiguring the directory to index the attribute or setting smaller administrative limits may be an effective way to protect the directory from such inadvertent denial-of-service attacks. Setting reasonable size limits (which control the number of entries returned in response to a search request) and time limits (which control the maximum amount of time the directory will spend responding to a search request) can help make your directory more robust. However, not all directory software can be configured in this manner, so be sure to consult your documentation.
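
To see why these limits bound the damage a single request can do, consider the toy model below. It is a Python sketch of the general idea only, not how any particular server implements its limits. The function limited_search and its default values are invented, while the outcome labels borrow the names of the standard LDAP result codes sizeLimitExceeded and timeLimitExceeded.

```python
import time


def limited_search(candidates, matches, size_limit=500, time_limit=5.0,
                   clock=time.monotonic):
    """Scan candidate entries, stopping early when either limit is hit.

    Returns (entries, outcome), where outcome is 'success',
    'sizeLimitExceeded', or 'timeLimitExceeded'.
    """
    results = []
    deadline = clock() + time_limit
    for entry in candidates:
        if clock() > deadline:
            return results, "timeLimitExceeded"
        if matches(entry):
            if len(results) >= size_limit:
                return results, "sizeLimitExceeded"
            results.append(entry)
    return results, "success"
```

Even an unindexed search that would otherwise examine every entry is cut off once one of the limits is reached, so the cost of a careless or repeated query stays bounded.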

If you suspect that the security of your directory has been compromised in some way, immediate action is required. First have a plan for who needs to be contacted. In some cases, you may have on-site computer security experts who will respond to the situation and escalate it as necessary. In other cases, you may be the in-house expert, and you may need to escalate the situation to your internal security department or even local, state, or federal law enforcement agencies.

In an extremely high-security environment, it may be appropriate to shut off access to the compromised services, including the directory itself and any dependent services such as authenticated intranet Web access. However, shutting off services abruptly will almost certainly tip off any attacker that his actions have been detected. If you can learn the network address from which the hacker is connecting, it may be possible to observe the actions closely and understand the extent of the damage. Gathering more evidence may help you catch the intruder and may prove useful if you decide to involve law enforcement agencies. Always keep a detailed log of evidence when you suspect the presence of an intruder. Compromised security is a serious problem and is covered in more detail in Chapter 12, Privacy and Security Design.