Server and Data Recovery

Hardware failures and human errors are unavoidable. There may be occasions when you need to recover lost DS data, such as group membership information or user Home Directory attribute values. There may be other times when the hard drive hosting NDS/eDirectory fails and you need to recover as quickly and as completely as possible. The sections that follow cover these scenarios:

Group membership recovery
Home Directory attribute recovery
Recovering from a server crash
Loss of all replicas except for subrefs in a replica ring

Group Membership Recovery

In one situation that we have seen, an administrator accidentally deleted group memberships from a large number of users. The administrator was attempting to add a number of users to a group by using the UImport utility. Unfortunately for the site in question, the control file specified REPLACE VALUE=Y, so the new group membership was added but each user's original group memberships were deleted. Because these group memberships were used to assign rights in the file system and to determine which applications were available to each user, this became a big problem very quickly.

Fortunately, the change was made off-hours, so the immediate impact was minimal. Of even more importance was the backup of the DS tree that had been made several weeks earlier. Although it is true that in many cases backups of DS are not of much use, in this case, the backup did contain a large percentage of the users in the tree and the information necessary to rebuild the majority of the users' group memberships.

The following tools were used for this recovery:

The backup of DS made several weeks earlier
A server not connected to the production network
NList
UImport
Two awk scripts

The first step in this recovery was the restoration of the old DS group information. The backup product used was only capable of restoring to a server named the same as the server the backup was taken from. In order to accommodate this limitation, we took a lab server from our isolated network and renamed it. Next, the DS tree was restored to that server. To ensure that the dependencies for group memberships were restored properly, we restored the data twice.

While the DS tree was being restored on the isolated network, two awk scripts were developed. The first script was designed to create a batch file to list the group memberships for each user listed in the original UImport data file. Because the number of users affected was about 100 out of 5,000, it did not make sense to restore group membership information for all users. Instead, a text file with the desired list of users and their contexts was created, using the following format:

 .UserID1.Context1
 .UserID2.Context1
 .UserID3.Context2

The following awk script was used to parse the preceding information into a batch file:

 BEGIN { print "del grpinfo.txt" }
 {
      count = split($0, object, ".")
      printf("cx ")
      for (x=3; x<= count; x++)
           printf(".%s", object[x])
      printf("\n")
      printf("nlist user = " object[2])
      printf(" show \"group membership\" >> grpinfo.txt\n")
 }

The resulting batch file looks like this:

 del grpinfo.txt
 cx .Context1
 nlist user = UserID1 show "group membership" >> grpinfo.txt
 cx .Context1
 nlist user = UserID2 show "group membership" >> grpinfo.txt
 cx .Context2
 nlist user = UserID3 show "group membership" >> grpinfo.txt
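
For readers who do not have awk at hand, the same transformation can be sketched in Python. This is only an illustration of the logic, not the script used at the site; the function name `build_batch` and its `output_file` parameter are our own inventions.

```python
def build_batch(user_names, output_file="grpinfo.txt"):
    """Return batch-file lines that dump group memberships for each user.

    Each entry in user_names is a leading-dot name such as
    ".UserID1.Context1" (user ID first, then its context).
    """
    lines = [f"del {output_file}"]
    for name in user_names:
        parts = name.strip().split(".")
        # parts[0] is the empty string before the leading dot,
        # parts[1] is the user ID, and the rest is the context.
        if len(parts) < 3 or parts[0] != "":
            raise ValueError(f"expected .user.context, got {name!r}")
        user, context = parts[1], parts[2:]
        # Change to the user's context, then list the memberships.
        lines.append("cx ." + ".".join(context))
        lines.append(
            f'nlist user = {user} show "group membership" >> {output_file}'
        )
    return lines


if __name__ == "__main__":
    users = [".UserID1.Context1", ".UserID2.Context1", ".UserID3.Context2"]
    print("\n".join(build_batch(users)))
```

Rejecting names that lack the leading dot is a deliberate choice in this sketch: a malformed line in the user list is caught before any batch file is generated.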

When the DS restore finished, the batch file was run to generate a file called GRPINFO.TXT, showing all group memberships for the User objects in question. This GRPINFO.TXT file was in the following format:

 Object Class: User
 Current context: Context1
 User: userID1
         Group Membership: Group1.Admin.Groups.Admin...
         Group Membership: Group2.XYZCorp
 One User object was found in this context
 One User object was found.
 Object Class: User
 Current context: Context1
 User: userID2
         Group Membership: Group3.Admin.Groups.Admin...
         Group Membership: Group4.XYZCorp
 One User object was found in this context
 One User object was found.
 Object Class: User
 Current context: context2
 User: userID3
         Group Membership: Group1.Admin.Groups.Admin...
         Group Membership: Group2.XYZCorp
 One User object was found in this context

This file was then parsed, using a second awk script, to create the final data file used for the new run of UImport. This data file is in a format that is usable by UImport:

 ".userID1.context1",".Group1.Admin.Groups.Admin"
 ".userID1.context1",".Group2.XYZCorp"
 ".userID2.context1",".Group3.Admin.Groups.Admin"
 ".userID2.context1",".Group4.XYZCorp"
 ".userID3.context2",".Group1.Admin.Groups.Admin"
 ".userID3.context2",".Group2.XYZCorp"

You should note a couple of things about the data file. First, the user ID contains a leading dot. This is done so that the script can be run from any context and the input is still valid. Second, there are multiple entries for a given user ID, but UImport handles these entries just fine.

The challenge is in parsing the trailing dots on the group memberships in GRPINFO.TXT and coming up with a script that works reliably to perform the conversion. The following is the awk script that does this:

 /Current context:/ { cx = $3 }
 /User:/ {cn = $2}
 /Group Membership:/ {
      printf("\".%s.%s\",", cn, cx)
      gsub(/\tGroup Membership: /, "")
      grptmp = $0
      num = split(cx, tmpcx, ".")
      counter = 1
      while (substr(grptmp, length(grptmp)) == ".")
      {
           counter++
           sub(/\.$/,"",grptmp)
      }
      printf("\".%s", grptmp)
      for (y=counter;y<=num;y++)
      {
           printf(".%s", tmpcx[y])
      }
      printf("\"\n")
 }

This script counts the trailing dots on the group name and, for each one, drops one leading component of the current context. It then appends whatever remains of the context to the group name, which results in the correct context for the group.
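
The trailing-dot rule is easier to check in isolation. The following Python sketch mirrors the awk logic; the names `resolve_name` and `convert` are ours, and the parser assumes one field per line with leading whitespace that can safely be stripped.

```python
def resolve_name(relative_name, current_context):
    """Resolve a relative name against the current context.

    Each trailing dot on relative_name drops one leftmost component
    of current_context; the rest of the context is then appended.
    """
    context_parts = [p for p in current_context.split(".") if p]
    name = relative_name.strip()
    dots = len(name) - len(name.rstrip("."))  # count trailing dots
    name = name.rstrip(".")
    remaining = context_parts[dots:]          # empty if dots >= parts
    return "." + ".".join([name] + remaining)


def convert(lines):
    """Turn NList-style output lines into UImport data records."""
    context = user = None
    records = []
    for line in lines:
        text = line.strip()
        if text.startswith("Current context:"):
            context = text.split(":", 1)[1].strip()
        elif text.startswith("User:"):
            user = text.split(":", 1)[1].strip()
        elif text.startswith("Group Membership:"):
            group = text.split(":", 1)[1].strip()
            records.append(
                f'".{user}.{context}","{resolve_name(group, context)}"'
            )
    return records


if __name__ == "__main__":
    sample = [
        "Object Class: User",
        "Current context: Context1",
        "User: userID1",
        "\tGroup Membership: Group1.Admin.Groups.Admin...",
    ]
    print("\n".join(convert(sample)))
```

Keeping the name resolution in its own function lets you test it against a handful of known group names before you trust the output of a full conversion run.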

After the new data file was created, we created a control file that defines two fields: one for the user login ID and one for the group membership being processed. By watching the run of UImport, we were able to determine which user IDs had been moved or deleted. Even though the fix did not cover all the users, enough users were fixed to prevent a major outage the following day. In total, out of 100 users, only about 10 had to be modified manually.

NOTE

This example serves as a reminder that a disaster recovery solution need not be a 100% solution; if you can automate a large portion of the work in a reasonable amount of time, any remnants can be handled by hand or on a case-by-case basis.

REAL WORLD: Programmatically Adding a User to a Group

If you are developing your own application to add users to a group, instead of using an existing application such as UImport, you should be aware of a few things. Adding a user to a group involves a total of four major changes in DS:

Add the user's DN to the group's Member attribute.
Add the user's DN to the group's Equivalent to Me attribute.
Add the group's DN to the user's Group Membership attribute.
Add the group's DN to the user's Security Equals attribute.

The current DS module does not automatically make these four changes happen simultaneously. Therefore, if you are writing a program to accomplish this task, you must make all four of these changes in your program's code. If you use the NWUsrGrp ActiveX control in the Novell Developer Kit (NDK), it performs the four necessary steps for you when it adds a user to a group or deletes a user from a group. However, if you use the NWDir or NWIDir controls, you need to code the four steps as part of your program logic.
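
To make the bookkeeping concrete, here is a minimal Python sketch that applies the four changes to a toy in-memory model. The `Directory` class and its methods are hypothetical stand-ins for whatever directory API your program uses; they are not NDK calls.

```python
from collections import defaultdict


class Directory:
    """Toy in-memory directory used only for illustration."""

    def __init__(self):
        # attrs[object_dn][attribute_name] -> set of DN values
        self.attrs = defaultdict(lambda: defaultdict(set))

    def add_value(self, dn, attribute, value):
        self.attrs[dn][attribute].add(value)

    def remove_value(self, dn, attribute, value):
        self.attrs[dn][attribute].discard(value)


def membership_changes(user_dn, group_dn):
    """Return the four (object, attribute, value) changes."""
    return [
        (group_dn, "Member", user_dn),
        (group_dn, "Equivalent To Me", user_dn),
        (user_dn, "Group Membership", group_dn),
        (user_dn, "Security Equals", group_dn),
    ]


def add_user_to_group(directory, user_dn, group_dn):
    for dn, attribute, value in membership_changes(user_dn, group_dn):
        directory.add_value(dn, attribute, value)


def remove_user_from_group(directory, user_dn, group_dn):
    for dn, attribute, value in membership_changes(user_dn, group_dn):
        directory.remove_value(dn, attribute, value)
```

Driving both the add and the remove from one list of changes keeps the two operations symmetric, so a later code change cannot update three attributes and forget the fourth.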

Home Directory Attribute Recovery

It is fairly common for the Home Directory attributes of User objects to be lost after certain DS-related issues are fixed. As discussed earlier in this chapter, in the "Unknown Objects" section, when an object that is referenced by any DS attribute is removed from the tree, that DS attribute's value is automatically cleared. Because Home Directory is a single-valued attribute, clearing its value means deleting the attribute.

TIP

The procedures discussed here can also be used to update existing Home Directory values when you physically move the folders from one volume or server to another.

The Home Directory attribute uses the SYN_PATH syntax and references a Volume object in its value. If, for any reason, that Volume object is removed from the tree, the Home Directory attribute is cleared. You can repopulate this value fairly easily by using one of the following methods:

Generate a text file that contains the username and home directory information and then use UImport to update the User objects. The text file would look something like this:
```
 ".userID.context",".volume_object.context:\path"
```
Generate a text file that contains the username and home directory information and then use Import Convert Export (ICE) to update the User objects via LDAP. The LDIF file would look something like this:
```
 version: 1
 dn: cn=username,ou=context,o=context
 changetype: modify
 replace: ndshomedirectory
 ndshomedirectory: cn=vol_object,ou=context,o=context#0#\users
 -
```
The preceding two solutions require you to create a separate record for each user because the path of the home directory is unique for every user. An easy alternative is to use Homes (www.novell.com/coolsolutions/tools/1568.html), with which you can simply select a starting context and set the Home Directory attribute for all users inside a container (see Figure 11.18).

Figure 11.18. Setting home directory information by using Homes.
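
If you prefer the LDIF route, generating one record per user is easy to script. The following Python sketch is one way to do it, under two assumptions of ours: each home directory is named after the user's login name, and all of them live under one base path on a single volume. The function name and the sample DNs in the usage example are hypothetical. Each record carries a `replace:` line, which the LDIF format requires in a modify record to name the attribute being changed.

```python
def home_directory_ldif(users, volume_dn, base_path="\\users"):
    """Build LDIF text that sets ndshomedirectory for each user.

    users is an iterable of (user_dn, login_name) pairs; the home
    directory is assumed to be <base_path>\\<login_name>.
    """
    lines = ["version: 1"]
    for user_dn, login_name in users:
        lines += [
            "",                        # blank line separates records
            f"dn: {user_dn}",
            "changetype: modify",
            "replace: ndshomedirectory",
            f"ndshomedirectory: {volume_dn}#0#{base_path}\\{login_name}",
            "-",
        ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample_users = [
        ("cn=jsmith,ou=sales,o=xyzcorp", "jsmith"),
        ("cn=mjones,ou=sales,o=xyzcorp", "mjones"),
    ]
    print(home_directory_ldif(sample_users, "cn=FS1_VOL1,ou=sales,o=xyzcorp"))
```

Review the generated file by eye and import it against a test container before running it against production users.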

Recovering a Crashed SYS Volume or Server

One of the most-asked questions in any network is, "How do I correctly recover from a crashed server?" For those of you who have worked with NetWare 3, you know it's quite straightforward: Install a new server, restore the bindery from a backup, and restore your file system. In the case of a single-server NDS/eDirectory network, the process is pretty much the same as that with NetWare 3: Install a new server, restore DS from a backup, and then restore your file system. Because of the distributed nature of DS, however, things are a little more interesting when you have a multiserver NDS/eDirectory network.

To successfully recover from a lost server in a multiserver environment, it is essential that you maintain a regular backup of the server-specific information (SSI) files for all the DS servers on your network. (Chapter 8 discusses the situation for eDirectory.) It would also be helpful if you have up-to-date documentation about your DS tree, such as where NCP Server and Volume objects are located. You should also have a record of the partitions and a list of servers where the Master and various other replicas are stored. Finally, you should have the correct license file(s) for the crashed server.

NOTE

The process for recovering from a crashed hard drive where NDS/eDirectory resides (such as the SYS volume on NetWare) is the same as having a dead server because your DS is gone.

NOTE

For more information about SSI files and their purposes, see Chapter 8.

The following are the steps you need to take to restore a crashed server or a SYS volume in a multiserver DS environment when you don't have a current set of SSI data available:

Don't panic!
Reconfigure time synchronization configuration in the tree, if necessary.
Create a Computer object in the tree to act as a placeholder for server references.
Use SrvRef (see ftp://ftp.dreamlan.com/srvref.zip) to replace server references in the tree (see Figure 11.19).

Figure 11.19. Replacing server references.
Delete from the tree the old NCP Server object for the failed server. Do not delete the associated Volume objects, however. Leave them intact to preserve references that other objects (such as Directory Map objects) may have to these objects as well as any DS trustee assignments made.
If the failed server held a Master replica of any partition, go to another server in the replica ring that has either a Read/Write or Read-Only replica and use DSRepair to promote that replica to a Master. Repeat this step for every master replica stored on the failed server. Then clean up the replica rings to remove the downed server from the lists. (See the "Replica Ring Inconsistency" section, earlier in this chapter, for details.)

TIP

After your replica ring cleanup, you should spot-check the DSTrace output on a number of servers to see whether the replica rings are okay and that everything is synchronizing correctly. You do not want to install a server into a tree that's not fully synchronized.
Rebuild the crashed server by using existing documentation. Ensure that the same server name, volume names, IPX/IP addresses, and so on are used. Install the server into a separate temporary tree.
If you are just recovering a lost SYS volume, load DSREPAIR.NLM with the -XK6 switch (which deletes all volume trustees) and then perform a Check Volume Objects and Trustees operation. When prompted to make the change on the SYS volume, answer No; for all other volumes, answer Yes. See TID #10013535 for details on this step. (This is Step 22 in TID #10013535.)
Remove NDS from the rebuilt server.
Reconfigure the time synchronization setting on the rebuilt server, if necessary.
Install the rebuilt server back into the production tree, using the same context the original server was installed in.
Use SrvRef to restore server references in the tree.
Restore data and trustee information to the server. You should be careful when restoring the SYS volume data so that you don't overwrite any new support pack files with older ones. If you've made modifications to your AUTOEXEC.NCF file, you should ensure that the older copy from your backup does not overwrite it.
Reestablish replica information by using ConsoleOne. (You might want to wait until after-hours and after the data restoration has completed.)
Reinstall licenses, if necessary.
Reinstall any server-based applications, such as BorderManager.
Reissue any SSL certificates for the recovered server, as necessary.
Delete the temporary Computer placeholder object from the tree.

When restoring files to a volume that was nearly full during the backup, you might run into insufficient disk space issues. This is especially true when volume compression is used. Although SMS-compliant backup software can back up and restore a compressed file in its compressed format, that's not the default in most backup software; therefore, chances are good that you'll restore previously compressed files in their uncompressed format. And because compression is a background operating system process, files are not compressed until the compression start time is reached. You can, however, flag files as immediate compress, but that's an extra manual step you have to take. And afterward, you have to remember to undo the flag or else the files will always be compressed again after access, causing unnecessarily high server utilization.

Another volume-related issue that you can get caught with during a restoration is suballocation. Again, because it is a background process, files are not suballocated as they are restored; therefore, if you're restoring many (small) files, you can run out of disk space before the complete restoration is done.

To work around these two disk space problems, it is best that you try to maintain at least 15% to 20% free disk space on each volume. Even better, you should make certain that the replacement drive capacity is larger.
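
A quick way to apply this guideline is to check each volume's free space against a threshold before you start the restoration. The Python sketch below assumes you have already collected the total and free byte counts for each volume by whatever means your platform offers; the function names and the 15% default are ours.

```python
def free_fraction(total_bytes, free_bytes):
    """Return the fraction of the volume that is free (0.0 to 1.0)."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return free_bytes / total_bytes


def volumes_below_threshold(volumes, minimum_free=0.15):
    """Return the names of volumes with less free space than required.

    volumes maps a volume name to a (total_bytes, free_bytes) pair.
    """
    return sorted(
        name
        for name, (total, free) in volumes.items()
        if free_fraction(total, free) < minimum_free
    )
```

Any volume that this check reports is a candidate for a larger replacement drive or for restoring in stages.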

After the restoration of the file system is complete, you should restart the server yet one more time to ensure that the restoration didn't overwrite any important system files. Then you should perform a spot-check on some of the restored directories and files to check for correct trustee assignments, file ownerships, and so on. You should also spot-check DS objects to ensure that you don't have any Unknown or renamed objects.

Subordinate References Only in the Replica Ring

The steps discussed in the section "Recovering a Crashed SYS Volume or Server" work well when you have replicas on other servers to recover DS information from; however, there is also the (very) unlikely situation where you lose one partition within the tree and, for some reason, no replica of that partition exists. What can you do? First of all, take a deep breath and don't panic! Depending on the partition location within the tree structure, all may not be lost.

Consider the sample DS tree shown in Figure 11.20. Two of the servers in this tree contain the following replicas:

Figure 11.20. If `FS2` is lost, a hole exists in the DS tree between `OU=B` and `OU=E`.

| Server `FS1` | Server `FS2` |
| --- | --- |
| Master of `[Root]` | (none) |
| Master of B | (none) |
| SubRef of C | Master of C |
| Read/Write of E | Master of E |

NOTE

Because Server FS1 has a copy of B (the parent) but not C (the child), DS automatically placed a SubRef replica of C on server FS1.

If Server FS2 is lost due to hardware failure and no other servers hold a replica of C, you lose the only full replica of the C partition. (SubRef replicas are not full replicas, and they contain only enough information to locate other replicas and track synchronization.) When this happens, you have a hole in the DS tree between OU=B and OU=E. You can't use any of the procedures discussed earlier in this chapter to recover the C partition because no other full replica exists.
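
If you keep a record of replica placement, you can check for this exposure before a server ever fails. The following Python sketch models the replica table for this example as a plain dictionary and reports which partitions would be left without a full replica if a given server were lost. The data layout and function names are ours; the sketch does not read anything from a live tree.

```python
# Replica types that hold object data; a SubRef does not.
FULL_REPLICA_TYPES = {"Master", "Read/Write", "Read-Only"}


def partitions_without_full_replica(replicas, lost_server):
    """Return partitions with no full replica left once a server is lost.

    replicas maps a server name to {partition: replica_type}.
    """
    all_partitions = set()
    surviving_full = set()
    for server, held in replicas.items():
        for partition, replica_type in held.items():
            all_partitions.add(partition)
            if server != lost_server and replica_type in FULL_REPLICA_TYPES:
                surviving_full.add(partition)
    return sorted(all_partitions - surviving_full)


def masters_on(replicas, server):
    """Return partitions whose Master replica is on the given server."""
    return sorted(
        partition
        for partition, replica_type in replicas.get(server, {}).items()
        if replica_type == "Master"
    )


if __name__ == "__main__":
    replicas = {
        "FS1": {"[Root]": "Master", "B": "Master",
                "C": "SubRef", "E": "Read/Write"},
        "FS2": {"C": "Master", "E": "Master"},
    }
    print(partitions_without_full_replica(replicas, "FS2"))
    print(masters_on(replicas, "FS2"))
```

Running the same check for every server in the tree shows which partitions depend on a single machine and therefore need another Read/Write replica.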

In this scenario, where a SubRef replica of the lost partition exists, it is possible to rebuild the links to the lost portion of the DS tree and then perhaps restore the objects from a recent backup. The following procedure explains how you may recover from the loss of a single partition in a multipartition tree when you have no full replicas of that partition:

WARNING

The following procedure may not work for all cases and, therefore, you should consider acquiring the assistance of Novell Support to rebuild the links to the missing partition in your tree. At the very least, you should test the procedure in a lab environment before ever using it in a production environment.

Don't panic! Don't attempt any DS recovery or repair procedures.
Follow the steps outlined in the section "Replica Ring Inconsistency," earlier in this chapter, to clean up the replica rings for other partitions that have replicas on this crashed server, and make sure your other partitions are synchronizing without errors.
If more than one server has a SubRef replica of the lost partition, choose one to work with. The best choice would be a server that has the least number of replicas on it.
On the server chosen in step 3, load DSRepair with the -A command-line switch and promote the SubRef replica to a Master by using the steps outlined earlier in this chapter, in the "Replica Ring Inconsistency" section. This changes the SubRef replica into a real replica; however, because a SubRef replica doesn't contain any object information, the recovered replica will be empty.

Depending on your replica placement of this lost partition, SubRef replicas of this partition on other servers may be upgraded to Read/Write replicas.
Use DSTrace to check that this partition is synchronizing correctly. If it is not, you should consider opening an incident with Novell for further assistance.
When the replica ring is synchronizing, use your most recent backup to perform a selective restoration of the DS objects that were in the lost partition. Take note of any objects in other parts of the tree that may have turned into Unknown objects due to loss of their mandatory attributes. You may need to do a selective restoration on those objects or re-create them.

Re-create any bindery objects and DS objects (such as print queues) that depend on object IDs. Reassign DS object trustee assignments, if necessary.

If you don't have a SubRef replica to work with, you need to first make sure no one attempts any repair operations because they could make a bad situation worse. Then you should open a call with Novell Support for assistance.
