Manual Partition and Replication Processes | Novell's Guide to Troubleshooting eDirectory

Partitioning and replication operations all require communications with the server that holds the Master replica or replicas. The following sections examine the common operations used in manipulating partitions and replicas.

WARNING

Although many of these operations function even if some of the servers in the replica ring are unavailable, you should not perform them until connectivity has been restored and verified. Even though the impact on users is not noticeable if everything proceeds normally, any partition-related operation (including adding or removing a replica) should be considered a major change to the tree.

Prior to initiating a partition or replica operation, it is always a good idea to perform a basic health check to verify communication with all the servers that will be involved in the operation. You can easily do this by using NDS iMonitor, which has a couple of options that are useful in this situation: the Agent Synchronization option, shown in Figure 6.13, and the Partition Continuity option, shown in Figure 6.14.

Figure 6.13. The NDS iMonitor Agent Synchronization status screen.

Figure 6.14. The NDS iMonitor partition Continuity status screen.

NOTE

You can find detailed NDS/eDirectory health check recommendations in Chapter 13, "eDirectory Health Checks."

WARNING

Do not confuse the Agent Synchronization link under the Links listing with the Agent Synchronization link found on the Agent Configuration page. The latter is used to change synchronization-related settings.

As you can see in Figure 6.13, the Agent Synchronization screen shows a quick overview of synchronization status. This status is obtained by reading a single server and seeing the status that the server recorded for the last synchronization. If the last synchronization status was All Processed=YES, the synchronization is determined to have been successful and the errors count is zero. If the status was All Processed=NO, the synchronization failed and the errors count shows the number of errors.

NOTE

The following information is shown on the Agent Synchronization status screen:

Partition: The names of the partitions located on this server.
Errors: The number of errors encountered during the last synchronization cycle.
Last Successful Sync: The amount of time since all replicas of the partition were successfully synchronized from this server.
Maximum Ring Delta: The oldest send delta of any server in the replica ring. This value is the same as the highest send delta in the replica status list.
Replica's Perishable Data Delta: The amount of data on the partition that has not yet been successfully replicated since the server last synchronized that partition.

This basic check shows high-level problems in synchronization, but in order to really determine the status, you should check each server in the replica ring. The Partition Continuity screen (refer to Figure 6.14) provides this information. Synchronization errors between one server and another are apparent here. If there is a synchronization problem reported between servers, errors are reported at the end of the page.

After you have verified that the involved replicas are properly synchronized, you can proceed with a partitioning/replication operation. Before looking at the various operations, though, let's first review the states a replica can be in, because they can help you determine what stage you are at during a partitioning or replication operation.

Replica States

When working with partitions and replicas, you need to be familiar with the various states they can be in. Table 6.5 lists the possible states that a partition or replica can go through and their values (in both hexadecimal and decimal).

Table 6.5. Replica States

REPLICA STATE	DECIMAL VALUE	HEXADECIMAL VALUE	DESCRIPTION	CAN BE ABORTED?
On	0	0	Replica is on	N/A
New Replica	1	1	Replica is new	No
Dying Replica	2	2	Replica is dying (that is, being removed)	No
Partition Locked	3	3	Replica is locked in preparation for a move	No
Change Replica Type	4	4	Replica is currently having its type changed	Yes
Change Replica Type	5	5	Replica is almost finished changing types	No
Transition On	6	6	Replica is changing to the On state	No
Transition Move	7	7	Replica is changing to the Move state	Yes
Transition Split	8	8	Replica is changing to the Split state	Yes
Create	32	20	Replica is being created	Yes
Create	33	21	Replica is almost created	No
Split	48	30	Replica is preparing to split	Yes
Split	49	31	Replica is almost finished splitting	No
Join	64	40	Replica is preparing to join, or merge	Yes
Join	65	41	Replica is almost finished joining, or merging	No
Move	80	50	Replica is preparing to move a container	Yes
Move	81	51	Replica is almost finished moving a container	No

The values that appear in some DSRepair log files and DSTrace screens can provide insight into the current state of operations. A detailed explanation of some of the states follows:

On: Indicates the normal state of a replica.
New Replica: Indicates that the replica is in the process of forming. This state should last no more than a few minutes.
Dying Replica: Indicates that the replica is in the process of being deleted. This replica should disappear completely in a few hours.
Transition On: Indicates that the replica is in the process of going on but is currently in transition. This state is typical during a replica installation. The replica is not fully on until the installation is complete.
Transition Move: Indicates that the replica is in the process of going to the Move state but is currently in transition. This state is typical during a Move Partition operation.
Transition Split: Indicates that the replica is in the process of going to the Split state but is currently in transition. This state is typical during a Split Partition operation.
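When a numeric state value turns up in a DSRepair log or on a DSTrace screen, a small lookup keyed on the values in Table 6.5 can save a trip back to the table. The following Python sketch is illustrative only: the names ReplicaState, REPLICA_STATES, and describe_state are hypothetical, the "(almost finished)" labels are shorthand for the table descriptions, and the value 0 for the On state is an assumption of this sketch.

```python
from typing import NamedTuple, Optional


class ReplicaState(NamedTuple):
    name: str
    abortable: Optional[bool]  # None means "not applicable"


# Decimal values transcribed from Table 6.5.
# Assumption: 0 is used here for the normal On state.
REPLICA_STATES = {
    0: ReplicaState("On", None),
    1: ReplicaState("New Replica", False),
    2: ReplicaState("Dying Replica", False),
    3: ReplicaState("Partition Locked", False),
    4: ReplicaState("Change Replica Type", True),
    5: ReplicaState("Change Replica Type (almost finished)", False),
    6: ReplicaState("Transition On", False),
    7: ReplicaState("Transition Move", True),
    8: ReplicaState("Transition Split", True),
    32: ReplicaState("Create", True),
    33: ReplicaState("Create (almost finished)", False),
    48: ReplicaState("Split", True),
    49: ReplicaState("Split (almost finished)", False),
    64: ReplicaState("Join", True),
    65: ReplicaState("Join (almost finished)", False),
    80: ReplicaState("Move", True),
    81: ReplicaState("Move (almost finished)", False),
}


def describe_state(value: str, base: int = 16) -> str:
    """Decode a state value given as text (hexadecimal unless base says otherwise)."""
    code = int(value, base)
    state = REPLICA_STATES.get(code)
    if state is None:
        return f"unknown state {code} (0x{code:X})"
    if state.abortable is None:
        abort = "abort not applicable"
    elif state.abortable:
        abort = "can be aborted"
    else:
        abort = "cannot be aborted"
    return f"{state.name}: {abort}"
```

For example, describe_state("30") decodes hexadecimal 30 (decimal 48) as the Split state, and describe_state("49", base=10) decodes decimal 49 as the later Split state that can no longer be aborted.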

TIP

The two important states to watch for are whether the replica is on or successfully deleted. If the replica is stuck in a split, join, or move state, the value itself is not necessarily important, except as an indication that the operation is not yet complete. You should then determine why the operation has not completed; a common cause is a communication failure between servers.

NetWare 4.1 and higher enable you to abort a partition operation that is in progress. A pending partitioning or replication operation can be aborted through DSRepair, as shown in Figure 6.15.

Figure 6.15. Using DSRepair to abort an in-progress partition operation.

NOTE

Not all partition operations can be canceled successfully (refer to Table 6.5). DSRepair might judge that a given operation couldn't be canceled due to potential damage to the tree. In such cases, the operation continues as scheduled.

Now that you have a good grounding in the replica states, the following sections examine the various partitioning/replication operations, starting with the Split Partition operation.

The Split Partition Operation

The Split Partition operation is the process that is used to create a new (child) partition. When you install the first DS server in a tree, the [Root] partition is created automatically; any other partitions created are split off the [Root] partition.

NOTE

Partition operations are performed using either NDS Manager or ConsoleOne. Unless you have access only to NetWare 4.x servers, ConsoleOne is the preferred utility because it also supports filtered replicas, whereas NDS Manager does not. Even in a NetWare 4 environment, you can download the latest version of ConsoleOne from http://download.novell.com and use it.

The information reported by the DSTrace screen is fairly minimal. Watching the operation entails enabling the Partition DSTrace flag (+PART in NetWare and Unix). This enables the trace information for all partitioning operations. Listing 6.17 shows the information presented during a Split Partition operation.

Listing 6.17. A Split Partition Operation

 SPLITTING -- BEGIN STATE 0
 (20:28:39)
 *** DSALowLevelSplit <[Root]> and <XYZCorp> ***
 Successfully split all partitions in ring.
 ADDED 010000B6 and 0C0000BC to partition busy list.
 SPLITTING -- END STATE 0
 *CNTL: This server is the new master for [0C0000BC]<XYZCorp>
 *CNTL: SetNewMaster for [0C0000BC]<XYZCorp> succeeded.
 Turning replicas on after changing replica type.

While a Split Partition operation is being performed, further partitioning and replication operations for that partition are suspended. Further operations will result in a -654 error (ERR_PARTITION_BUSY) until the replicas are turned on (that is, become usable). This operation is indicated in the last line of Listing 6.17.
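A script that queues partition operations can treat -654 as "try again later" rather than as a failure. The following Python sketch shows one way to do that under stated assumptions: operation is a hypothetical callable that returns 0 on success or a negative DS error code, and the attempt count and delay are arbitrary illustrative defaults, not recommended values.

```python
import time

ERR_PARTITION_BUSY = -654


def run_when_not_busy(operation, attempts=5, delay_seconds=60.0, sleep=time.sleep):
    """Call operation() until it returns something other than -654.

    operation is a hypothetical callable returning 0 on success or a
    negative DS error code. The last code seen is returned, so the caller
    still sees -654 if the partition never became available.
    """
    code = ERR_PARTITION_BUSY
    for attempt in range(attempts):
        code = operation()
        if code != ERR_PARTITION_BUSY:
            return code
        if attempt < attempts - 1:
            sleep(delay_seconds)  # wait for the replicas to be turned on
    return code
```

Any other error code is returned immediately, because retrying does not help when the cause is something other than a busy partition.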

TIP

Although you can perform operations on multiple partitions concurrently, it is best to work on them one at a time, especially if the partitions share common servers.

Notice that the two *CNTL lines in Listing 6.17 indicate that the server the trace was done on became the Master replica for the new partition. This is to be expected because this server holds the Master replica of the parent partition. When you perform a Split Partition operation, the servers that end up with replicas are the same as the ones that hold replicas of the parent partition. After the replicas are turned on, you can further manipulate them by adding replicas, removing replicas, or changing replica types.

The Merge Partition Operation

Merging a partition (also referred to as joining a partition) is the reverse of splitting a partition. The Merge Partition operation merges parent and child partitions into a single partition. As Listing 6.18 shows, two operations actually take place during a join: a join up operation and a join down operation. The join up operation is the process of joining the child partition with the parent; the join down operation is the process of the parent joining with the child partition.

Listing 6.18. DSTrace Messages from a Join Operation

 (20:28:08)*** DSAStartJoin <XYZCorp> to <[Root]> ***
 JOINING DOWN -- BEGIN STATE 0
 JOINING DOWN -- END STATE 0
 JOINING UP -- BEGIN STATE 0
 JOINING UP -- END STATE 0
 JOINING DOWN -- BEGIN STATE 1
 PARENT REPORTING CHILD IS STILL IN STATE 1
 JOINING UP -- BEGIN STATE 1
 JOINING UP -- END STATE 1
 JOINING DOWN -- BEGIN STATE 1
 JOIN: Reassigning unowned replica changes for [010000B6]
   <[Root]> succeeded, total values reassigned 1
 (20:28:12) *** DSALowLevelJoin <[Root]> and <XYZCorp> ***
 ADDED 010000B6 to partition busy list.
 JOINING DOWN -- END STATE 1

The Merge Partition operation results in a single partition where there were two; however, the replicas for each of the old partitions have to be dealt with in such a way that bindery services on all servers are not disrupted. When you're merging partitions together, it is very important to determine where the new partition's replicas are going to be. For example, if you have eight servers involved in the Merge Partition operation, you will end up with eight replicas of the new partition. This might not be desirable, so you will want to examine where these new replicas will be and what services would be affected on each server if you were to remove the replica from this server.
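To plan where the merged partition's replicas will land, it can help to list the combined replica ring before starting. The following Python sketch assumes, as the eight-server example suggests, that every server holding a replica of either the parent or the child ends up with a replica of the merged partition; the function name and the server names are hypothetical.

```python
def merged_replica_ring(parent_ring, child_ring):
    """Return the sorted list of servers expected to hold a replica after a merge.

    Assumption: a server holding a replica of either partition ends up
    with a replica of the merged partition.
    """
    return sorted(set(parent_ring) | set(child_ring))


def newly_added_servers(parent_ring, child_ring):
    """Servers that hold only one of the two partitions before the merge.

    These are the servers whose replica load grows, so they are the first
    candidates to review when deciding which replicas to remove afterward.
    """
    return sorted(set(parent_ring) ^ set(child_ring))
```

For instance, with a parent ring of SRV1, SRV2, and SRV3 and a child ring of SRV3 and SRV4, the merged ring has four servers, and SRV1, SRV2, and SRV4 are the ones that gain data.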

The Move Partition Operation

Moving a partition is similar to moving an object; in fact, the operation uses the same code within the DS module. The biggest difference is that the Move Partition operation also generates Create Replica operations, which in turn result in object Synchronization operations. The Move Partition operation is fairly complex, more so than the other operations discussed in this chapter. Before you commence, you must make sure you have no synchronization problems in the partitions involved.

WARNING

A total of three existing partitions can be affected by a move operation: two parent partitions and the partition being moved. It is important that you verify the synchronization status of all three partitions before initiating a Move Partition operation.

There are two rules to remember when moving partitions:

Moving a partition cannot violate containment rules for the partition root object.
The partition being moved must not have any child partitions.

Figure 6.16 shows an example of a violation of the first rule. This Move Partition operation is invalid because containment rules are violated: O=XYZCorp cannot be moved to under O=DIV1 because an organization cannot contain another organization.

Figure 6.16. Illegal partition moves.

By extension of the second rule, it is not possible to move a partition so that it becomes subordinate to a child partition. As Figure 6.16 shows, it is also not permissible to move the OU=East.O=XYZCorp partition under the OU=West.O=XYZCorp partition because a child partition, OU=IT.OU=East.O=XYZCorp, exists under OU=East.O=XYZCorp.
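The two rules can be expressed as a simple pre-flight check. The following Python sketch is a simplification under stated assumptions: the CONTAINMENT table is a hand-written stand-in for the real schema class definitions, the function name check_move is hypothetical, and names are compared as exact, case-sensitive strings in the dotted typeful form used in the examples above.

```python
# Illustrative stand-in for schema containment: which classes each container
# class may hold. A real check would read the class definitions from the schema.
CONTAINMENT = {
    "Organization": {"Organizational Unit"},
    "Organizational Unit": {"Organizational Unit"},
}


def check_move(root_dn, root_class, dest_class, partition_roots):
    """Return a list of rule violations for moving the partition rooted at root_dn.

    partition_roots is the list of all partition root DNs in the tree.
    An empty result means neither rule is violated.
    """
    problems = []

    # Rule 1: the destination must be allowed to contain the partition root.
    if root_class not in CONTAINMENT.get(dest_class, set()):
        problems.append(f"{dest_class} cannot contain {root_class}")

    # Rule 2: the partition being moved must not have child partitions.
    suffix = "." + root_dn
    children = [p for p in partition_roots if p.endswith(suffix)]
    if children:
        problems.append("child partitions exist: " + ", ".join(children))

    return problems
```

With the partitions from Figure 6.16, moving OU=East.O=XYZCorp under an organizational unit is flagged because of the OU=IT.OU=East.O=XYZCorp child partition, and moving O=XYZCorp under another organization is flagged for containment.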

The following sections focus on some things you need to watch out for when moving a partition.

Important Considerations for Partition Moves

NetWare 5 and higher introduce several objects into the tree at the time of installation, depending on which additional services are installed on the server. In addition, other Novell products (including eDirectory) or third-party products may also create dependencies on a server's context in the tree. When you're moving a partition, it is useful to determine which objects will be affected if a server is in the partition being moved, because references to objects within that partition might not be updated automatically. In this section we'll look at a few NetWare-specific examples.

NetWare 6 installs Secure Authentication Services (SAS), used for security services such as Secure Sockets Layer (SSL) communication, as part of the basic core component. This add-on creates an object in the tree (named SAS Service - servername) and references the server that hosts the service. When a partition containing this service is moved, you need to re-create the object by unloading the SAS.NLM module, loading SASI.NLM (the SAS installation utility), and logging in with sufficient rights to re-create the SAS object in the tree.

TIP

Refer to TID 10063314 for information on how to create the SAS Service object manually on different operating system platforms.

The Novell Distributed Print Services (NDPS) broker service also has dependencies on the server location: A Broker object is created in the tree in the server's context. When the server object is moved, shut down, and brought back up in the new location, the broker service will not start properly.

WARNING

Of particular significance is NetWare's license service. If you relocate a partition that contains license information for NetWare 5 servers, you will need to reassign the license files to the servers. This requires reinstalling the license service on the server or servers that have moved as a result of the Move Partition operation. NetWare 6, on the other hand, does not suffer from this problem because its licensing model changed to be user based.

Many other add-on services can be affected by the Move Partition operation. The best thing to do is check all your non-User objects and see which of them reference servers. Moving a server object (and a partition, by extension) is not a trivial operation and has widespread impact in most production environments.

The Process Involved in the Move Partition Operation

The Move Partition operation consists of two parts: the Move Partition request and the Finish Partition Move request. The Move Partition request is sent by the client to schedule the move. This process performs several verification operations, including the following:

Ensuring that the user has Create object rights to the destination container to which the partition is being moved.
Verifying that there is not an object in the destination container that has the same name as the partition root object being moved.
Verifying that the affected replicas are all available to perform partition operations.
Ensuring that the Transaction Tracking Service (TTS) is available and enabled on all NetWare servers that are running pre-NDS 8 and are involved in the Move Partition operation. NDS operations are dependent on TTS, and if TTS is not available, NDS cannot function. (eDirectory, on the other hand, does not have this limitation because of the FLAIM database it uses.)

When the preceding tasks are completed, the servers handle the Finish Partition Move request. This process has two functions:

Moving the partition root object and all subordinate objects from one context to another valid context
Notifying the server that holds the Master replica of the partition that the partition has moved

This process also performs several verification operations, including verifying that the partition root object being moved and all subordinate objects do not have an Inhibit Move obituary on them. If there is such an obituary within the partition, the process aborts with a -637 error (ERR_PREVIOUS_MOVE_IN_PROGRESS).

A second verification process involves testing to see whether the partition root object to be moved is the [Root] object for the tree. It is not possible to move the [Root] object, and attempting to do so will result in a -641 error (ERR_INVALID_REQUEST).

NOTE

The Finish Partition Move process also tests to see whether you are attempting to move an object that is not a partition root. If you are, you get a -641 error. The standard Novell-supplied administration utilities do not allow such a move, but the check is there to stop third-party utilities with inadequate safeguards from attempting it.

A further test verifies that the servers involved in the move are running at least NDS 4.63. There is no reason you should still be running NDS 4.63 or an older version, but the DS code performs this check as a precaution. Novell changed the DS code involved in partition moves, and mixing versions older than NDS 4.63 with newer versions causes a move to fail with a -666 error (ERR_INCOMPATABLE_DS_VERSION).

NOTE

The DS engine on the Master replica of the partition being moved generates a list of the servers that need to be informed about the Move Partition operation. This list includes the servers containing real copies of the partition root object as well as all the servers listed in the BackLink attribute for the partition root object (that is, servers holding external references of the partition root object). Each server object is then checked to see whether there is a DS Revision attribute. The value of this attribute is then checked to see whether it meets the minimal version requirement for the operation, which is 463.

If a server in the list happens to be an Unknown object or an external reference object that is not backlinked, there is a good possibility that no DS Revision attribute exists. In that case, the DS Revision value is 0. This value does not meet the minimal version requirement, and the operation fails with a -666 error.
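The version test described in this note amounts to a threshold comparison in which a missing attribute counts as too old. The following Python sketch illustrates that logic; the function name and the dictionary input are hypothetical, and treating a missing DS Revision attribute as the value 0 is an assumption of the sketch.

```python
MIN_DS_REVISION = 463
ERR_INCOMPATABLE_DS_VERSION = -666  # spelling follows the error name used in the text


def check_ds_revisions(servers):
    """Check every server that must be informed about the move.

    servers maps a server name to its DS Revision value, or to None when
    the server object has no DS Revision attribute (for example, an
    Unknown object). Assumption: a missing attribute is treated as 0.
    Returns (error_code, names_of_servers_below_the_minimum).
    """
    too_old = sorted(
        name
        for name, revision in servers.items()
        if (revision or 0) < MIN_DS_REVISION
    )
    return (ERR_INCOMPATABLE_DS_VERSION if too_old else 0), too_old
```

Listing the offending servers, rather than only returning the error code, points directly at the Unknown or non-backlinked objects that need to be cleaned up before the move is retried.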

Next, DS checks to verify that the containment rules are not being violated by the move. The DSA finds a server with a copy of [Root] and asks for the class definition for the destination parent object's class; if the partition root object being moved is in the containment list of the destination, the move is allowed to proceed. Otherwise, a -611 error (ERR_ILLEGAL_CONTAINMENT) is generated, and the process aborts.

Another verification is done to ensure that the partition root object's DN and the DNs of all subordinate objects do not exceed the maximum length of 256 Unicode characters (512 bytes). If any of the objects affected has a DN that exceeds this length, a -610 error (ERR_ILLEGAL_DS_NAME) is returned.

NOTE

In the check of the objects that are subordinate to the partition root object, the actual returned code may be a -353 error (ERR_DN_TOO_LONG). This error code means the same thing as -610 but is reported by the client library instead of the server.
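Because a move can lengthen every DN in the partition, it is worth estimating the new names before scheduling the operation. The following Python sketch rewrites each DN for the new location and counts 2-byte units against the 256-character limit; the function names are hypothetical, and DNs are assumed to be plain dotted strings such as CN=Bob.OU=East.O=XYZCorp.

```python
MAX_DN_UNITS = 256  # 256 Unicode characters, 2 bytes each, gives 512 bytes
ERR_ILLEGAL_DS_NAME = -610


def dn_units(dn):
    """Count 2-byte units in dn, matching the 512-byte limit."""
    return len(dn.encode("utf-16-le")) // 2


def new_dn(object_dn, old_root_dn, new_root_dn):
    """Rewrite object_dn for a partition root moving from old_root_dn to new_root_dn."""
    if object_dn == old_root_dn:
        return new_root_dn
    if not object_dn.endswith("." + old_root_dn):
        raise ValueError(f"{object_dn} is not under {old_root_dn}")
    prefix = object_dn[: len(object_dn) - len(old_root_dn)]
    return prefix + new_root_dn


def check_dn_lengths(object_dns, old_root_dn, new_root_dn):
    """Return (error_code, list_of_new_DNs_that_would_be_too_long)."""
    moved = [new_dn(dn, old_root_dn, new_root_dn) for dn in object_dns]
    too_long = [dn for dn in moved if dn_units(dn) > MAX_DN_UNITS]
    return (ERR_ILLEGAL_DS_NAME if too_long else 0), too_long
```

An object whose DN fits comfortably today can still fail the check when the destination context is deeper than the current one, which is why the sketch measures the rewritten name rather than the existing one.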

A further step in the Move Partition process is the submission of a third process to the destination server: an NDS Start Tree Move request. This request actually performs the move operation and is responsible for moving both the partition root object and all the child objects to the new context.

When the move is complete and the partition root object being moved has been locked to prevent other partition operations from occurring, the Replica Synchronization and Backlinker processes are scheduled. When they are successfully scheduled, the partition root object is unlocked.

Moving a partition also causes the creation and deletion of SubRef replicas, which are needed to provide connectivity between partitions, as discussed in Chapter 2. The old SubRef replicas will be deleted from the servers that hold them, and new SubRef replicas will be created as necessary to provide connectivity to the new context.

The Rename Partition Operation

The Rename Partition operation is very similar to the Rename Object operation, except that the obituaries issued are different: rather than OLD_RDN and NEW_RDN obituaries, the operation issues Tree_OLD_RDN and Tree_NEW_RDN obituaries. Renaming a partition is really a special case of the Object Rename operation because the only object directly affected is the partition root object.

The Rename Partition operation is one operation that can hold up any other type of partition or replication operation. NDS/eDirectory checks for this condition before attempting the Add Replica, Delete Replica, Split Partition, Join Partition, and Change Partition Type operations.

The Create Replica Operation

Creating a replica, also known as an Add Replica request, requires communication with each server in the replica ring for the partition being affected. An inability to communicate with a server in the replica ring results in a -625 error (ERR_TRANSPORT_FAILURE) or a -636 error (ERR_UNREACHABLE_SERVER).

NOTE

If a server has a SubRef replica and you want to promote it to be a real replica on the server, the operation you need to use is the Create Replica operation, not the Change Replica Type operation. This is because a SubRef replica is not a real copy of the partition; rather, it contains just enough information for NDS operations such as tree-walking. Therefore, the only way to change its type is to place a copy of the real replica on that server.

WARNING

You should never change a SubRef replica type except in a DS disaster recovery scenario, and you should do that only as a very last resort. Refer to the section "Replica Ring Inconsistency" in Chapter 11, "Examples from the Real World" for more information about this process.

Creating a replica of a partition involves making changes to the local partition database and then performing a synchronization of all objects in the partition to the server receiving the new replica. Problems can occur for two reasons:

Communication cannot be established or maintained with a server in the replica ring.
The server being examined to determine the location of the Master replica does not have a replica attribute. In this case, a -602 error (ERR_NO_SUCH_VALUE) is returned, and the operation is aborted.

The Delete Replica Operation

The Delete Replica operation is similar in requirements to the Create Replica operation: it requires that all servers in the replica ring be reachable. The server holding the Master replica of the partition processes the request.

The verification routines ensure that the replica being removed is a Read/Write or Read-Only replica. If the replica in question is the Master replica, a -656 error (ERR_CRUCIAL_REPLICA) is returned.

NOTE

Some utilities give you the option of making another replica the Master replica before you delete the current one, instead of returning the -656 error and aborting the operation.
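The Master replica check, together with the utility behavior described in the preceding note, can be sketched as follows. This Python fragment models a replica ring as a plain dictionary of server names to replica types; the function name, the new_master parameter, and the server names are hypothetical, and the sketch does not model the locking or synchronization steps of the real operation.

```python
ERR_CRUCIAL_REPLICA = -656


def delete_replica(ring, server, new_master=None):
    """Remove server's replica from ring (server name -> replica type), in place.

    Deleting the Master replica returns -656 unless new_master names
    another server in the ring, which is then designated Master first.
    Returns 0 on success.
    """
    if ring[server] == "Master":
        if new_master is None or new_master == server or new_master not in ring:
            return ERR_CRUCIAL_REPLICA
        ring[new_master] = "Master"  # reassign the Master role before deleting
    del ring[server]
    return 0
```

Calling delete_replica(ring, "SRV1") on a ring where SRV1 holds the Master replica leaves the ring untouched and returns -656, whereas passing new_master="SRV2" reassigns the Master role and then removes SRV1.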

A lock is placed on the partition during the operation. Unlike in other operations, this lock is left in place for a number of steps, including an immediate synchronization that is scheduled to ensure that all objects in the replica being deleted have been synchronized. This ensures that information in the objects stored in that replica does not get lost if it is newer than the information in other replicas.

The Change Replica Type Operation

Compared to the other operations we have looked at in this chapter, the Change Replica Type operation is relatively simple. This operation is easiest to perform from the ConsoleOne utility. Figure 6.17 shows the ConsoleOne dialog box, Change Replica Type, that is used during this operation.

Figure 6.17. The Change Replica Type dialog box.

NOTE

As discussed earlier in this chapter, changing a SubRef replica to a Master, Read/Write, or Read-Only replica is treated as a Create Replica operation. You should not confuse it with the "force promotion" process discussed in Chapter 11 that is used for DS disaster recovery. ConsoleOne does not present you with a Change Replica Type option if the selected replica is a SubRef or Master replica.

In Figure 6.17 you can see the replica types available for changing the selected server's replica type. Because the selected server currently holds a Read/Write replica of the partition, you have a number of options to choose from.

NOTE

Even though you have an option to change the replica type to a Read/Write replica, if you select that option, the OK button is disabled because the replica is already a Read/Write replica.

Changing a Read/Write or Read-Only replica to a Master replica actually causes two changes to be made. First, the Master replica is changed to a Read/Write replica. Second, the Read/Write or Read-Only replica becomes the Master replica. This is done for you automatically because there cannot be two Master replicas for a given partition.
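The two-step change can be illustrated with a small model of the replica ring. The following Python sketch uses a dictionary of server names to replica types; the function name and the type strings are hypothetical, and the sketch only mirrors the bookkeeping described above, not the actual DS implementation.

```python
def change_replica_type(ring, server, new_type):
    """Change one server's replica type in ring (server name -> type), in place."""
    current = ring[server]
    if current == "Subordinate Reference":
        # Promoting a SubRef is handled as a Create Replica operation instead.
        raise ValueError("a SubRef replica requires a Create Replica operation")
    if current == "Master":
        # The Master changes type only as a side effect of promoting another replica.
        raise ValueError("promote another replica to Master instead")

    if new_type == "Master":
        # First change: the existing Master becomes a Read/Write replica.
        for other, kind in ring.items():
            if kind == "Master":
                ring[other] = "Read/Write"
    # Second change: the selected replica takes on its new type.
    ring[server] = new_type
```

After change_replica_type(ring, "SRV3", "Master") on a ring where SRV1 was the Master, SRV1 holds a Read/Write replica and SRV3 is the only Master.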

NOTE

When running eDirectory 8.5 and higher, you can also change a replica's type to either Filtered Read/Write or Filtered Read-Only. However, note that before you set up any replication filters, which are server-specific, only the following objects (if they exist within the partition) will be placed in a Filtered replica:

Container objects (and their subordinate container objects), such as organizations and organizational units
NCP Server objects and their SAS objects, but not their other associated objects, such as the SSL objects
The Security container and its (leaf and container) subordinate objects
The Admin User object if it exists in the partition in question, but not other User objects

These objects allow you to authenticate to the target server as Admin and set up replication filters at a later time.

The Change Replica Type operation generally occurs very quickly because no replicas need to be created or deleted in order to change the replica. The replica ring is updated on all servers that hold replicas (including SubRef replicas), and the server or servers affected have a change made in their partition entry tables to reflect the change in replica type.

TIP

If you receive a -637 error ("move in progress") during a Change Replica Type operation, you should check for possible stuck obits in that partition.
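For quick reference while reading DSTrace output, the error codes mentioned in this section can be collected into a single lookup. The following Python sketch contains only the codes and names that appear in the text above; the dictionary and function names are hypothetical.

```python
# Error codes and names as they appear in this section.
DS_ERRORS = {
    -353: "ERR_DN_TOO_LONG",
    -602: "ERR_NO_SUCH_VALUE",
    -610: "ERR_ILLEGAL_DS_NAME",
    -611: "ERR_ILLEGAL_CONTAINMENT",
    -625: "ERR_TRANSPORT_FAILURE",
    -636: "ERR_UNREACHABLE_SERVER",
    -637: "ERR_PREVIOUS_MOVE_IN_PROGRESS",
    -641: "ERR_INVALID_REQUEST",
    -654: "ERR_PARTITION_BUSY",
    -656: "ERR_CRUCIAL_REPLICA",
    -666: "ERR_INCOMPATABLE_DS_VERSION",
}


def error_name(code):
    """Return the symbolic name for a DS error code mentioned in this section."""
    return DS_ERRORS.get(code, f"unknown error {code}")
```

For example, error_name(-654) returns ERR_PARTITION_BUSY, and any code not covered in this section is reported as unknown rather than guessed.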