Look Before You Leap


Chapter 1, "The Four Basics of eDirectory Troubleshooting," mentions that a typical troubleshooting process consists of five steps:

  1. Gather information.

  2. Develop a plan of attack.

  3. Execute the plan.

  4. Evaluate the results. Go back to step 1 if necessary.

  5. Document the solution.

When faced with a DS problem, it is important that you follow a consistent set of troubleshooting procedures, applied the same way every time, to resolve the problem. The following sections examine each of these five troubleshooting steps, as well as some other parts of the troubleshooting process.

Identifying the Problem and Its Scope

Before you attempt to deal with a problem, you need to first determine what the problem is and what the scope of the problem is. You need to know what you are getting yourself into so you can properly allocate time and resources for the task.

NDS/eDirectory problems can manifest in a number of ways. They can appear directly in the form of error or warning messages in utilities or on the server console. They can also show up indirectly as part of an error message when you attempt to perform a DS operation. Sometimes a DS error shows up as a side effect of other system or network problems. For instance, a -625 error is a DS error that indicates communication failure. However, the cause of this error is not DS itself but the subsystem responsible for data communication, which DS has no control over. Therefore, you need to be able to differentiate between the symptoms and the problem.

TIP

Two key points to keep in mind at this initial phase of the troubleshooting process are to keep an open mind about the possible causes of the error and to consider the whole system, not just the DS portion, as a unit. Not all error messages you see indicate situations that you need to deal with right away or concurrently. Often, error conditions snowball: one initial error can result in dozens of other warning and error messages that may or may not point back to the initial error condition.


NOTE

I once saw a COBOL program compilation error listing that was over 15 pages long (the program itself was only 3 pages long). It all resulted from a single missing period about five lines from the beginning!


Users tend to cry wolf fairly easily. Therefore, before jumping to any conclusions, you should not panic; rather, you should confirm the error condition by looking for concrete evidence of a problem or by trying to duplicate it. When you have established that there is indeed a problem, you must determine its extent.

You need to determine two facets of the problem. First, you need to determine the size of the impact, such as how many people the error is affecting (including whether the CEO is one of the affected users) and the potential dollar cost of the error to the company. Second, you need to determine in which subcomponent the error condition occurs. You can use these two facets to help prioritize your work schedule and troubleshooting efforts.
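The two facets above can be combined into a rough triage score. The following is a minimal sketch of that idea; the subsystem names, weights, and the doubling rule for executive users are invented for illustration and do not come from any Novell tool.

```python
# Hypothetical triage helper: rank reported problems by impact size and
# the subsystem implicated. All weights here are illustrative assumptions.

SUBSYSTEM_WEIGHT = {
    "communication": 3,   # communication outages tend to cascade widely
    "time_sync": 2,
    "replica_sync": 2,
    "other": 1,
}

def triage_score(users_affected, executive_affected, subsystem):
    """Return a sortable priority score; higher means fix sooner."""
    score = users_affected * SUBSYSTEM_WEIGHT.get(subsystem, 1)
    if executive_affected:    # e.g., the CEO is among the affected users
        score *= 2
    return score

problems = [
    ("login failures, branch office", triage_score(40, False, "communication")),
    ("stale attribute on one object",  triage_score(1, False, "replica_sync")),
    ("CEO cannot authenticate",        triage_score(1, True, "communication")),
]
for name, score in sorted(problems, key=lambda p: -p[1]):
    print(f"{score:4d}  {name}")
```

A scheme like this is only a tie-breaker for your work queue; the point is that both facets, impact size and subsystem, feed the prioritization.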

Our experience indicates that the DS errors you are most likely to encounter fall into the following three categories:

  • Communication

  • Time synchronization

  • Replica synchronization

During the discovery process, you should check each of these areas in the order listed. You should make note of the following information, as it will help you determine the likely cause of the error:

  • The symptom or steps required to reproduce the problem

  • The exact error code and text of the error message

  • Whether the error is seen on only one server or happens on others; if the error shows up on other servers, is there anything common between these servers, such as being in the same replica ring?

  • The version of NDS/eDirectory running on the server(s) that reported the error

  • The version of the operating system and patch level of the affected server(s)

It is important to quickly ascertain whether the error condition is occurring at the server, partition, replica, or object level. For instance, assume that the error code you receive is -601 when you're trying to read the Title attribute of a User object from Server A. Do you get the same error when you try to access the Title attribute from another server in the replica ring? If you do not get the error when you're trying to access this attribute from another server, the likely source of the problem is Server A itself. Otherwise, all the servers in the replica ring may be suspect.
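The reasoning in the Server A example can be captured in a few lines of logic: collect the result of the same read against each server in the replica ring, then see whether the failure is isolated. This is a sketch only; the server names and the -601 code simply mirror the example in the text.

```python
# Sketch: decide whether an error is server-local or ring-wide, given the
# result of attempting the same read against each server in the replica ring.
# A code of 0 means the read succeeded; nonzero is the DS error returned.

def localize_fault(results):
    """results maps server name -> error code from the test read."""
    failing = [s for s, code in results.items() if code != 0]
    if not failing:
        return "no fault reproduced"
    if len(failing) == 1:
        return f"likely local to {failing[0]}"
    return "suspect the whole replica ring: " + ", ".join(sorted(failing))

# Only Server A fails the read, so its local DIB is the prime suspect.
print(localize_fault({"ServerA": -601, "ServerB": 0, "ServerC": 0}))
```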

Determining the Cause of the Problem

eDirectory reports problems or generates errors if a certain condition within the network prevents one or more eDirectory background processes from starting or completing. In most cases, your only indication of a DS error is an error code, which you see only when you perform a health check or enable the DSTrace screen. The following is a list of common causes of DS errors:

  • Physical communication problems between servers or between servers and client workstations

  • Time synchronization problems

  • NDS/eDirectory version incompatibility

  • Replica inconsistency

  • Improperly moved or removed servers

  • An improperly changed server network address

  • Synchronization problems (perhaps due to schema mismatch)

  • Performance issues (such as low memory)

  • Human error

To determine or narrow down the cause of the problem, you need to first ascertain whether it is an actual eDirectory problem or some other type of error that is manifesting itself as an eDirectory issue. When you are certain that the problem is indeed an eDirectory problem, you need to analyze the information you have gathered about the error condition and narrow it down to the particular background process that caused it. Then, using the various resources at your disposal, such as online error code listings that show possible causes and the Novell Knowledge Base, you should try to gain a handle on the real cause of the problem.
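A first-pass mapping from error code to suspect subsystem can be kept as a simple lookup. The sketch below covers a handful of codes; the descriptions are paraphrased from Novell's published error-code listings and should be verified against them rather than treated as authoritative.

```python
# Hedged lookup of a few well-known NDS/eDirectory error codes and the
# subsystem they usually implicate. Verify against Novell's error-code
# listings; this table is illustrative, not exhaustive.

ERROR_CATEGORY = {
    -601: "directory data (no such entry)",
    -625: "communication (transport failure)",
    -659: "time synchronization",
    -672: "access control (no access)",
}

def classify(code):
    return ERROR_CATEGORY.get(code, "unknown; consult the error-code listings")

print(classify(-625))
print(classify(-999))
```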

Formulating Possible Solutions to the Problem

You can almost always use more than one method or tool to fix a given eDirectory problem. When you search the TIDs for a possible solution to a particular problem, you shouldn't be surprised to find multiple documents that suggest different ways to resolve the problem. However, before you implement any of the solutions, you should do the following:

  • List all the possible solutions

  • List the possible consequences of each action (such as doing X will take 2 hours to complete, and doing Y will only take 45 minutes but involves shutting down seven servers)

  • Check the latest available NDS/eDirectory updates and operating system patches to see whether the problem is one of the addressed issues

When installing a patch, you should always choose to back up the files that will be replaced when the installer prompts you. If the installer does not offer such an option, you should abort the update, perform a backup of your system files (including those on the DOS partition, in the case of NetWare servers), and then restart the patch update. Furthermore, it is always good practice to keep a library of the older patches that you have installed in case the new patch doesn't fix your problem or causes more problems; this way, you can roll back to an older patch that you know works.

TIP

One of the possible solutions that you should always consider is doing nothing. NDS/eDirectory is designed to self-heal in many instances. Often, an error you encounter occurs when you try to perform some major operation as soon as a change (such as adding a replica) is made. Generally, the rule of thumb is to give DS an hour or two to "settle down" after such a change, and the error you see will resolve itself. Of course, you don't always have the luxury of letting DS sit for a few hours to see whether the problem goes away, especially when there are a score of people standing around (and one of them is the CEO), asking, "What's happening?" (Refer to Chapter 4, "Don't Panic," for a discussion on how to deal with such situations.) Nonetheless, you should not easily discount the value of the DS self-healing process.


WARNING

Depending on the DS error in question, some of the recommended fixes can be quite drastic. For instance, a TID may recommend that you forcefully remove a replica or even DS from the server as one of the steps to resolve object renames. You need to appreciate the full consequence of each step taken in your error-resolving process before taking it, and you need to have a back-out plan in place. If you are uncomfortable with any of the steps, you should consult with someone who is more experienced or open an incident with Novell so you have someone to back you up.


Ranking the Solutions

When you have a list of possible solutions to the problem you're experiencing, you need to rank them based on the following criteria:

  • The odds that the action will resolve the error condition

  • The risks associated with the corrective action (for example, if the action fails, it could make the problem worse or result in additional complications, such as extended system downtime)

  • Additional fixes that may be required as a result of the action (For instance, if the fix calls for you to remove the server from the tree and then add it back in, file system trustee rights may be lost, and you might need to allow extra time to restore the trustee assignments.)

  • The ease with which the fix can be applied (that is, how long will it take?)

  • How the users will be further affected during the time when the corrective action is taking place (for example, will the users lose access to resources they now have access to?)
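The five criteria above can be folded into a simple scoring function to produce an initial ordering. The weights and the candidate fixes below are invented for the sketch; any real ranking would use your own estimates and would still be sanity-checked by a human.

```python
# Illustrative ranking of candidate fixes using the criteria listed above:
# odds of success, risk, time to apply, and user impact. Weights are assumptions.

def rank(solutions):
    """Each solution: (name, odds_of_success 0-1, risk 0-1, hours, user_impact 0-1).
    Higher score = try first: favor likely, low-risk, quick, low-impact fixes."""
    def score(s):
        name, odds, risk, hours, impact = s
        return odds - risk - 0.05 * hours - impact
    return sorted(solutions, key=score, reverse=True)

candidates = [
    ("wait for DS to self-heal",    0.5, 0.0, 2.0, 0.0),
    ("local database repair",       0.7, 0.1, 0.5, 0.1),
    ("remove and re-add replica",   0.9, 0.6, 4.0, 0.5),
]
for name, *_ in rank(candidates):
    print(name)
```

Note how the drastic fix ranks last despite its high odds of success, because its risk and user impact dominate; that mirrors the point of the criteria list.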

If there is a medium to high degree of risk that a corrective action, should it fail, will lead to more harm to the system, a sign-off from upper management is often warranted. You should present to management the rationale for selecting the action, but also lay out for them the possible consequences and rollback options.

TIP

If some of your solutions seem to contradict each other, you should consult with a knowledgeable co-worker or a friend. You should also make use of the expertise available to you, free of charge, at various online newsgroups, especially the Novell-sponsored ones available at http://support-forums.novell.com.


TIP

Although it is usually a good idea to solve a problem yourself so you can learn from the experience, Novell Support provides access to certain tools that are not available to the general public, and using these tools could save you hours of work. When you compare the cost of your downtime with the cost of an incident call, you will see that there are circumstances when it is more expedient and cost-effective to open an incident with Novell and have Novell dial in to your system to perform the fix. You can always monitor Novell's progress and learn more about DS troubleshooting at the same time.


If at all possible, you should test solutions in a lab environment before attempting them on a production network. At the very least, you should always have a test server available so you can dry-run a procedure before attempting it on a production server.

Implementing a Solution

When you have decided on a course of action and received the necessary management approval, you need to make a backup of the Directory Information Base (DIB) files on the affected servers before implementing the fix. This will provide you with a back-out option in case something goes wrong.

You can find information on how to do this in the "Backing Up a Server's DS Database" section of Chapter 13, "eDirectory Health Checks."

If you are running eDirectory 8.7 or higher, you can use either the hot backup or the cold backup feature of eDirectory Backup eMTool, as discussed in Chapter 8, "eDirectory Data Recovery Tools," to take a snapshot of the current view of the replicas on the servers that you will be working with.

TIP

Before you perform any replica ring-related repairs, it is best to first do a local database repair. This ensures that the data in the local DIB is in good order before it is allowed to be replicated to other servers.


The effects of your corrective actions may not show up immediately. You should allow some time for the various eDirectory background processes to perform their tasks. For example, assume that you were unable to merge two partitions because of stuck obituaries. After implementing your fix, you might not be able to merge the partitions right away because you need to allow the obituary processes to advance the flags through the different stages and clear out the dead objects. Because this generally involves communication between a couple of servers, at least a few minutes will be required before the stuck obits are cleared out. After that, you can perform the partition merge.
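The reason obituary cleanup takes multiple passes can be sketched as a small state machine. The stage names below follow the flag values commonly seen in DSTrace output, but treat both the names and the acknowledgment rule as illustrative rather than a precise description of the obituary process.

```python
# Rough sketch of why obituary cleanup takes time: each obituary flag is
# advanced one stage per pass of the obit process, and only after the other
# servers holding the replica have acknowledged the previous stage.
# Stage labels are illustrative, patterned on DSTrace flag values.

STAGES = ["INITIAL (0000)", "NOTIFIED (0001)", "OK_TO_PURGE (0002)", "PURGEABLE (0004)"]

def advance(stage_index, all_servers_acknowledged):
    """Move the obituary forward one stage when the ring has acknowledged."""
    if all_servers_acknowledged and stage_index < len(STAGES) - 1:
        return stage_index + 1
    return stage_index

stage = 0
for sync_pass in range(4):
    stage = advance(stage, all_servers_acknowledged=True)
    print(f"pass {sync_pass + 1}: {STAGES[stage]}")
```

Because each acknowledgment round requires server-to-server communication, several synchronization passes, and therefore several minutes at minimum, elapse before the obits become purgeable.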

TIP

Depending on the scope of impact of the DS error, you should keep your user community, especially upper management, informed of what is happening and your progress. This way, they can arrange their work and get a few things accomplished while waiting for access to the tree to be restored.


Verifying the Solution to the Problem

After the eDirectory processes report "All processed = YES" (and you might want to wait until you see this reported a couple of times, to be on the safe side), you should verify that the problem has indeed been resolved. You can accomplish this in one of two ways:

  • Attempt the initial action that caused the error to surface and see whether you can now perform it successfully

  • Perform an abbreviated DS health check (as outlined in Chapter 13) to ensure that no more errors are being reported

If the problem persists, you should return to your list of possible solutions and consider taking another action. You might need to go back and reexamine and reconsider the possible causes of the problem to ensure that you are on the correct track.

Documenting the Steps in the Solution

You should keep a logbook that contains information about your servers, network configuration, and any special NDS/eDirectory information (such as any special naming convention used for designating administrative users or ACL assignments made to protect admin users). This logbook should also contain maintenance data about your servers and network, such as what operating system or NDS/eDirectory patch was applied and when. This allows you to determine whether the DS or other error condition may be a result of one of these updates. It is also a good idea to keep a copy of the patches installed as part of the logbook so you have ready access to older patches in case you need to roll back an update.

TIP

As the saying goes, "If you didn't write it down, it didn't happen." This is why keeping the logbook is so critical.


In your logbook, you should also document the DS errors that have occurred, their causes, and the solutions you have identified. Even if an error was not resolved by a given solution, you should note it for future reference and label it as an "unlikely" solution. Whatever you do, you shouldn't totally discount it from consideration the next time around: the condition producing the same or a similar error message or code may be different then, so a solution that didn't work the first time might very well work the next.

Depending on the type of error, you are likely to encounter it again sometime down the road. For instance, no matter how careful you are about your tree, stuck obits will continue to surface from time to time. Having the steps on error resolution clearly documented helps to ensure that if you need to solve a given problem, you have a "play book" that you can follow.

Avoiding Repeating the Problem

It is estimated that more than 90% of the software-related problems you encounter are due to human error. They can be caused by unintentional actions taken by an inexperienced administrator or procedures not being properly followed. They can also be due to a number of other causes. For example, someone might simply turn off the server and remove it from the network instead of first removing DS from the server before removing it from the network.

Once you have identified the source of a DS problem, you should review existing published procedures to ensure that any oversights or omissions that resulted in the error are promptly amended, and you should be sure that the information is passed on to your co-workers.

TIP

New support staff should be adequately trained before being permitted to have physical access to the servers, where most harm may occur. At the very least, you should have each new staff member pair up with an experienced member of your staff when performing certain tasks for the first couple of times.


At the end of a problem-resolution session, depending on the scope of the error involved, you should hold a meeting with your support staff to go over lessons learned and discuss how the problem can be avoided in the future. You should not, however, use such a meeting to assign blame of any sort.



Novell's Guide to Troubleshooting eDirectory
ISBN: 0789731460
Year: 2003
Pages: 173