Coexistence Techniques | Understanding and Deploying LDAP Directory Services (2nd Edition)

There are multiple ways to achieve directory coexistence. In this section the various techniques are described, and information is provided about when to use which technique. Depending on your requirements, you may need to use only one of these techniques or all of them.

A directory system that implements any of these techniques is sometimes broadly called a metadirectory. The idea behind the term is that the directory serves as an aggregation point for information that lives in other directories and data sources.

Migration

Data migration is the most rudimentary form of coexistence. We are hard-pressed to even describe it as coexistence because it refers to a one-time event rather than an ongoing process. Nevertheless, it is a good starting point for our discussion.

Data migration is simply a way of populating your directory from a data source. Implicit in the migration concept is the fact that the source is not used after its data is copied into the directory. A good time to use data migration might be when switching from one e-mail system to another. If the old e-mail system has its own application-specific directory containing user and group information, and the new e-mail system uses your enterprise directory, migration is a good way to get the data out of the old system and installed in the new system with minimal disruption and inconvenience to your users.

Figure 23.1 illustrates data migration. You must take care to ensure that once you begin migrating the data, the old system does not accept updates because any such updates will be lost.

Figure 23.1. Data Migration

Most directory products come with migration tools. Some, such as Netscape's LDAP Data Interchange Format (LDIF) and Directory Services Markup Language (DSML) import tools, rely on you to provide a text file of information in a standard format. The system from which you migrate is often able to produce a text file, in either that format or one that is easily convertible. Other tools are specific to a particular application. For example, most directory-enabled mail server software includes tools that migrate data from legacy and competitors' e-mail packages to an LDAP directory service.
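As an illustration of the "easily convertible" text-file route, the following Python sketch turns a comma-separated export from a legacy system into LDIF. The column names (uid, cn, sn, mail), the base DN, and the choice of the inetOrgPerson object class are our assumptions for the example; your old system's export will differ. Values that are not safe to write as plain LDIF text are base64-encoded, as the LDIF format requires. The sketch does not escape special characters inside the DN, so treat it as a starting point rather than a finished tool.

```python
import base64
import csv
import io

BASE_DN = "ou=People,dc=example,dc=com"  # assumed suffix, not from the text


def ldif_line(attr, value):
    """Format one attribute line, base64-encoding values that are not LDIF-safe."""
    safe = (
        value.isascii()
        and value.isprintable()
        and not value.startswith((" ", ":", "<"))
        and not value.endswith(" ")
    )
    if safe:
        return f"{attr}: {value}"
    encoded = base64.b64encode(value.encode("utf-8")).decode("ascii")
    return f"{attr}:: {encoded}"


def csv_to_ldif(csv_text):
    """Convert CSV rows with hypothetical columns uid, cn, sn, mail to LDIF text."""
    entries = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Note: the uid is placed in the DN without DN escaping (a simplification).
        lines = [ldif_line("dn", f"uid={row['uid']},{BASE_DN}")]
        for object_class in ("top", "person", "organizationalPerson", "inetOrgPerson"):
            lines.append(f"objectClass: {object_class}")
        for attr in ("uid", "cn", "sn", "mail"):
            if row.get(attr):  # skip empty or missing columns
                lines.append(ldif_line(attr, row[attr]))
        entries.append("\n".join(lines))
    # LDIF separates entries with a blank line.
    return "\n\n".join(entries) + "\n"
```

The resulting text can then be handed to whatever LDIF import tool your directory product provides.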

One-Way Synchronization

A big step up from data migration in terms of complexity and functionality is one-way data synchronization. Synchronization is simply the process of copying data from one data source to another on an ongoing basis. In one-way synchronization, your directory service is periodically populated from a data source. The reverse is also possible: An external database can be populated periodically from the directory. In one-way synchronization, external changes to the data are allowed only at the synchronization source, not at the destination. Changes to the data are propagated either by replacement of the entire content of a portion of the directory information tree or by application of only those changes that have occurred since the last update.

The advantage of doing a total replacement is primarily simplicity: You just delete the old data and replace it with the new data. Some older database and directory systems do not have a facility for tracking changes, making it difficult to generate incremental updates and easier just to perform a total replacement. The disadvantage of the total update method is poor performance: For large data sets, it can take a long time to completely re-create the entire directory each time a change is made. Your directory service may even need to be offline during this process.

An incremental synchronization process typically performs much better. If only 5 percent of the data changes between updates, an incremental update touches roughly one-twentieth as many entries as a total replacement. Another advantage of incremental updates is that they are usually done over LDAP while the directory is up and running, a practice that reduces service disruptions and ensures that the regular directory access control checks are used to control access to the data. Total replacement has one more problem: The extra changes it makes in the directory service may be accompanied by unpleasant side effects. For example, if each person entry is deleted and re-added once each day as part of a total update process, external systems that watch for "new person entry" events may be confused or do more work than is necessary.

Even if the end system with which you synchronize does not support the generation of incremental changes, consider implementing this capability yourself. For example, you could save the last full extract from the system and compare it to the next full extract, thus making it possible to perform an incremental update.
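A minimal Python sketch of this compare-two-extracts idea follows. It assumes each extract has already been loaded into a dictionary keyed by a stable record identifier (for example, an employee ID), with each record itself a dictionary of attribute values. Those shapes, and the function name, are our assumptions for illustration.

```python
def diff_extracts(previous, current):
    """Compare two full extracts keyed by a stable record ID.

    Each extract maps record ID -> dict of attribute values.
    Returns (adds, modifies, deletes):
      adds     - new records, keyed by ID
      modifies - changed attributes per ID (None means the attribute was removed)
      deletes  - sorted list of IDs that disappeared
    """
    adds = {key: current[key] for key in current.keys() - previous.keys()}
    deletes = sorted(previous.keys() - current.keys())
    modifies = {}
    for key in current.keys() & previous.keys():
        old, new = previous[key], current[key]
        changed = {
            attr: new.get(attr)
            for attr in old.keys() | new.keys()
            if old.get(attr) != new.get(attr)
        }
        if changed:
            modifies[key] = changed
    return adds, modifies, deletes
```

Only the three result sets need to be applied to the directory, so entries that did not change are never touched.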

One-way synchronization is often used to extract data from your directory on an ongoing basis and populate other directories. Central control over the data may be maintained while at the same time the data is made available for use in a variety of other systems. One-way synchronization is also often used to extract information from corporate data sources, such as a human resources database, in order to populate a read-only directory. Replicating data from a central source gives directory clients access to the data they need while leveraging the corporate data management procedures you already have in place.

Most directory coexistence plans require several different one-way synchronization relationships. For example, a user's name and job title might be pulled from the human resources (HR) database, whereas the telephone number might be pulled from the phone system database. The directory service itself might send its user ID and e-mail address attributes to several application-specific directories for use in e-mail address books. Figure 23.2 shows an example of multiple one-way synchronization relationships.

Figure 23.2. Multiple One-Way Synchronization Relationships

Two-Way Synchronization

In two-way synchronization, data element changes are propagated in both directions between your directory and another data source. Changes to the data element may be made at either location, and the changes are propagated to all data repositories participating in the synchronization effort. Figure 23.3 shows a graphical view of two-way synchronization.

Figure 23.3. Two-Way Synchronization

The advantage of two-way synchronization over one-way synchronization is that it provides maximum flexibility. There is no need to select a single owner of the data and make other repositories read-only. Instead, every repository can continue to be a read/write source for the data.

An example of data that you might want to maintain in multiple data sources and synchronize in both directions is user password data. It would be nice if user passwords were synchronized across your enterprise directory, NOS directories, and various application directories and if password changes were propagated to all systems. Of course the security implications of synchronizing passwords among multiple systems might prevent you from doing so.

The disadvantages of two-way propagation are its complexity, its occasional unpredictability, and the political difficulties inherent in implementing it. When changes are allowed to be made in more than one system, it is relatively easy for conflicts to arise. A change made to a data element in one location may conflict with a change made at roughly the same time to the same data element in another location. For example, a person's telephone number may be replaced in both systems with a different value. This fact will cause some people who maintain data sources to become very nervous, and they may not like the idea of two-way propagation at all. To maintain a consistent view of the data in all locations, conflicts must be avoided entirely or resolved in a predictable and efficient way.

Resolving update conflicts can be difficult. Even a simple approach, such as serializing access on the basis of synchronized time, requires an additional service to keep the times synchronized, and ties can still occur. Other solutions, perhaps involving a policy-based conflict resolution strategy, can be simpler to implement but often result in unexpected behavior from the user's point of view. If a user does not understand the conflict resolution policy in place on his system, he may be surprised if a change he makes is overwritten by someone else's change.
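To make the two approaches concrete, here is a small Python sketch that combines them: the most recent change wins on the basis of (assumed) synchronized timestamps, and a simple policy breaks ties. The Change record, the source names, and the ranking of sources are hypothetical and chosen only for illustration; they do not describe any particular product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    source: str       # which repository made the change
    value: str        # the new attribute value
    timestamp: float  # seconds since the epoch, from a synchronized clock


# Hypothetical policy: when timestamps tie, the source with the lower rank wins.
SOURCE_RANK = {"hr": 0, "directory": 1}


def resolve(a, b):
    """Return the winning change: latest timestamp first, then source rank on a tie."""
    if a.timestamp != b.timestamp:
        return a if a.timestamp > b.timestamp else b
    return a if SOURCE_RANK[a.source] <= SOURCE_RANK[b.source] else b
```

Note that the losing change is silently discarded, which is exactly the behavior that can surprise a user who does not know the policy.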

Although some circumstances may require two-way synchronization, the added complexity and potential user confusion are usually reason enough to avoid it. If you think you have a specific need for two-way synchronization, be creative and try to think of a way to avoid it. For example, you might deploy a centralized service that allows password changes in only one location and uses one-way synchronization to quickly push those changes out to all other data sources. This kind of system would probably be a minor inconvenience to users, but it would make life a lot easier for everyone in the long term.

Alternatively, consider how you could intercept calls to change passwords at other data sources and reroute them to the centralized service. This approach would give the illusion of two-way synchronization without the same complexity.

N-Way Join

When synchronizing data from multiple sources, usually you want to match up related information from each data source. For example, if your coexistence policy calls for pulling employees' names, job titles, and manager information from the corporate human resources database and telephone numbers from the telephone operations database, you would like to end up with one entry in your directory service that contains all this information for a single person. You do not want to end up with two entries for each person: one from the HR database and one from the telephone operations database. The process of matching up entries from disparate databases is called joining, and to many people this capability is what defines a true metadirectory system. Sometimes the term N-way join is used to emphasize the fact that data may be joined from an arbitrary number of different data sources.

Note

Several vendors of directory coexistence software sell packages that include the term metadirectory in the product name. Some examples are Microsoft Metadirectory Services, Critical Path's CP Meta-Directory Server, and Sun ONE Meta-Directory. These products vary in their capabilities, but all of them support N-way join.

To join two entries in different data sources, you need to have a data element that is common to both sources. We refer to this data element as the join attribute. Figure 23.4 illustrates the general concept of an attribute-based join.

Figure 23.4. An Attribute-Based Join

For example, if each of your source databases contains a field for U.S. Social Security number (SSN), you can use it to determine which entries correspond to the same people in both databases. Using a unique identifier such as an SSN or an employee ID number is much better than using a potentially nonunique identifier, such as a person's name. But SSNs are typically viewed as sensitive information; see the Privacy and Security Considerations section later in this chapter for a detailed discussion of SSN-related concerns.
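The following Python sketch shows one way an attribute-based join on a unique identifier might look. It assumes each data source has been read into a list of dictionaries and that the join attribute (here an employee number, used instead of an SSN) is unique within each source. The function name and record layout are ours.

```python
def join_on(attribute, left, right):
    """Join two lists of records (dicts) on a shared, unique attribute.

    Returns (joined, unmatched_left, unmatched_right). When both records
    carry the same field, the value from the left record is kept.
    """
    right_by_key = {rec[attribute]: rec for rec in right if rec.get(attribute)}
    joined, unmatched_left = [], []
    for rec in left:
        match = right_by_key.pop(rec.get(attribute), None)
        if match is None:
            unmatched_left.append(rec)
        else:
            joined.append({**match, **rec})
    # Whatever is still in the lookup table was never matched.
    return joined, unmatched_left, list(right_by_key.values())
```

The two unmatched lists are the entries that would need further attention, manual or otherwise.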

A unique identifier may not be available, or you may not be able to use it (for example, employee IDs are also viewed as sensitive information in some companies). Sometimes you will not have anything better than a person's name to use for your join attribute. Such a limitation can reduce the efficiency of your synchronization procedure, create the need for manual synchronization and repair, and cause incorrect joining of information.

Overcoming these problems is one of the biggest challenges in providing a metadirectory service that has accurate data.

A join on first and last names may typically match no better than 50 percent of the names in your databases, and some of the matches may be false positives. The numbers get significantly worse as the number of entries in your databases increases (for example, the chance of having two people with the name "Babs Jensen" in a database of 100,000 people is significantly greater than the chance of having two people with the name "Babs Jensen" in a database of 100 people).
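To see why size matters, consider a rough model (our simplifying assumption, not a measured result): each person independently carries a particular name with probability p. The chance that at least two of n people share that name is then 1 minus the chance that nobody has it, minus the chance that exactly one person has it. The Python sketch below computes this; the name frequency of 1 in 50,000 used in the usage note is purely hypothetical.

```python
def prob_duplicate(n, p):
    """Probability that at least two of n people share one specific name.

    Assumes each person has the name independently with probability p.
    """
    nobody = (1 - p) ** n
    exactly_one = n * p * (1 - p) ** (n - 1)
    return 1 - nobody - exactly_one
```

With p set to 1/50,000, calling prob_duplicate(100, p) yields a probability on the order of a few in a million, whereas prob_duplicate(100_000, p) yields better than even odds.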

The result of inadequate matching is usually a lot of manual work. An administrator typically goes through the unmatched entries by hand, comparing other information to try to determine a match. A worse outcome could be that the wrong match is made automatically or by a careless administrator. In this situation, Person A's information can appear in Person B's entry. Depending on the type of information, how the directory service is used, and what people are involved, the consequences of this kind of error can range from annoying to serious.

When a good join attribute is not available, you can use multiple criteria to join data. Although each data value may not uniquely match the value in another data source, a combination of values might. For example, you might require matches on both name and city in order to reduce false positive matches involving names such as "John Smith." You can also use this technique to produce more matches than any one join attribute would. Choose a set of join attributes and assign a weight to each one. Then choose a threshold at or above which the data is assumed to match. For example, you might assign 2 points to surname, 1 point to first name, and 1 point to city, and use a threshold of 3 points. In this case if surname and city match, or if surname and first name match, you have a join. But a match of first name and city is not enough to conclude that you have a valid join.
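The weighting scheme just described can be sketched in a few lines of Python. The weights and threshold come from the example above; the attribute names (sn, givenName, and l for city) and the case-insensitive comparison are our assumptions.

```python
WEIGHTS = {"sn": 2, "givenName": 1, "l": 1}  # surname, first name, city
THRESHOLD = 3


def match_score(a, b, weights=WEIGHTS):
    """Sum the weights of the attributes whose values agree (case-insensitive)."""
    score = 0
    for attr, weight in weights.items():
        left, right = a.get(attr), b.get(attr)
        if left and right and left.strip().lower() == right.strip().lower():
            score += weight
    return score


def is_join(a, b, threshold=THRESHOLD):
    """Treat two records as the same person when the score reaches the threshold."""
    return match_score(a, b) >= threshold
```

Tuning the weights and the threshold is how you trade false positives against the amount of manual matching left over.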

If you have more than one data source that is authoritative for the same data element, you will probably need to use different join rules for each data source. For example, there may be one centralized personnel database that stores information about employees, but information about contractors may be managed independently by each department. The optimal join criteria must be selected for each data source.

Joining entries is an important capability that is needed to create an efficient and accurate directory coexistence process. When you evaluate directory coexistence software, be sure to consider its ability to provide this feature. Also investigate the software's advanced abilities, such as joining across multiple attributes and the extent to which you can tune the joining algorithm. In some environments it may make sense to sacrifice accuracy for a reduction in manual administration. Other environments may not be able to tolerate inaccurate joins.

Virtual Directory

Another directory coexistence technique involves use of a virtual directory system. A relatively new addition to the directory world, the virtual directory uses a different approach from that of most synchronization techniques. Instead of copying data from one data source to another, a virtual directory provides a real-time directory view of selected data from multiple data sources. You can think of this as virtual synchronization: The virtual directory generally does not store its own persistent copy of the data (although it may cache the data to improve performance).

The implementation is simple in concept: A virtual directory looks to the outside world like a regular LDAP server, but it holds no persistent data of its own. When it receives a request, it reformats and reroutes that request to the necessary data sources. The answers received are collated, reformatted, and sent back to the requestor. Figure 23.5 shows the general architecture of a virtual directory system.

Figure 23.5. The General Architecture of a Virtual Directory System
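The route-and-collate idea can be sketched in Python as follows. This is a toy model under stated assumptions: each data source is represented by a lookup function that takes an entry key and returns a dictionary, and a table says which source owns which attribute. A real virtual directory would speak LDAP to its clients and translate search filters for each source, which this sketch does not attempt. All names here are hypothetical.

```python
class VirtualDirectory:
    """Toy front end that stores nothing itself and fans requests out to sources."""

    def __init__(self, sources, attribute_owner):
        self.sources = sources                  # source name -> lookup(key) callable
        self.attribute_owner = attribute_owner  # attribute -> source name

    def read(self, key, attributes):
        """Fetch the requested attributes for one entry from their owning sources."""
        # Group the requested attributes by the source that owns them.
        wanted = {}
        for attr in attributes:
            wanted.setdefault(self.attribute_owner[attr], []).append(attr)
        entry, errors = {}, {}
        for name, attrs in wanted.items():
            try:
                record = self.sources[name](key)
            except Exception as exc:  # one failing source should not hide the rest
                errors[name] = str(exc)
                continue
            # Collate: keep only the attributes this source was asked for.
            entry.update({a: record[a] for a in attrs if a in record})
        return entry, errors
```

Returning the per-source errors alongside the partial entry leaves it to the caller to decide whether an incomplete answer is acceptable.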

A virtual directory system has several advantages:

The question of who updates the multiple copies of data is neatly solved. The virtual directory simply routes update requests to the appropriate data sources; no copies of the data are made.
The propagation delays inherent in a synchronization-based approach are avoided. No data is copied, and each query is mapped onto the source data store in real time.
The virtual directory allows you to dispense with messy and costly data management procedures designed to synchronize data.

The virtual directory scheme also has several drawbacks:

In practice, a virtual directory system is complex to implement and deploy. The algorithms required to map queries among a set of data sources, collate results, deal with failures that may have occurred in one source database but not another, and so on can be difficult to determine and implement.
Performance is likely to suffer compared to that of a centralized, synchronized directory approach. The source databases are not likely to be as fast as your general-purpose enterprise or extranet directory to begin with, and adding extra network round-trips, real-time query and data mapping, and result and error processing can reduce performance even further.
The "real-time" response of the resulting system may be only as reliable as the least reliable data source that is included. For this reason, virtual directory deployments often combine traditional synchronization techniques in which data is copied (for slow or difficult-to-access data sources) with the real-time access techniques described in this section. Doing so makes access to the virtual directory more reliable; the trade-off is that the information that is synchronized will not be as up-to-date as the information that is accessed on the fly.

The pros and cons of a virtual directory add up to a few conclusions. Virtual directories are best suited for environments in which your goal is to provide access from a few well-known applications to an existing database or set of databases. Knowing the applications, and therefore the kinds of requests they generate, allows you to significantly reduce the scope of your virtual directory deployment project. Another important consideration is performance: A virtual directory will be slower than a high-performance directory server that holds a copy of all data locally. You should be sure to evaluate virtual directory system performance early to ensure that it meets the performance needs of your applications.

Although pioneering directory deployers have created their own custom virtual directory by using facilities such as Netscape's Directory Server plug-in API, a variety of off-the-shelf software is now available. The best-known product is Radiant Logic's RadiantOne Virtual Directory Server. Figure 23.6 shows an example of a deployment that uses this product.

Figure 23.6. A Deployment That Uses RadiantOne Virtual Directory Server

Data Translation

Regardless of the approach used to move data to and from your directory service, some form of data translation is usually necessary. Data translation involves any of the following:

Reformatting data elements to meet data source requirements. It may be impossible to store data pulled from one data source in another source without reformatting the data. For example, in LDAP it is recommended that postalAddress attribute values consist of no more than six lines (separated by dollar signs) of no more than 30 characters each. If you need to move the postal address information from your directory service to a human resources database that supports only three lines of 40 characters each, some reformatting is required as part of your synchronization system.
Reformatting data elements to improve consistency. Sometimes data is translated as a convenience to end users or applications. For example, even if the names that are synchronized into your directory service from an HR database consist entirely of uppercase letters, you may want to present the data in mixed case in your directory service (this kind of translation is difficult to do perfectly because of the wide variation in how names are constructed). Another example of this kind of reformatting is adding a prefix to telephone number values that are incomplete. Some data sources may store only a telephone extension (typically four or five digits), but you may want the values in your directory to follow the international standard for phone numbers, which includes the country code, area code, and local number.
Combining two or more data elements. Sometimes your data sources may not have a data value you need, but they may contain two or more other data elements that can be used to construct the value. For example, separate street number, street name, city, state, and zip code fields can be combined to form one LDAP postalAddress value.
Removing duplicate data. Some data sources contain duplicate records (for example, two entries for the same person) or duplicate values, and these should be removed as part of your directory coexistence process.
Removing obsolete or incomplete data. Some data sources contain outdated or incomplete records that you will not want to include in your directory service. For example, a human resources database may contain information about former employees or contractors who no longer have an active contract with your company. Another example is a database of Web site visitors that contains many incomplete records (a very likely scenario if Web site visitors are not forced to enter all the information when they sign up).
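Two of the translations in the preceding list, combining address fields into a postalAddress value and completing a bare telephone extension, might be sketched in Python as follows. The two-line address layout and the telephone prefix passed in by the caller are our assumptions; the right values depend on your own data and numbering plan. The sketch escapes backslashes and dollar signs inside an address line, because the dollar sign is the line separator in the postalAddress syntax.

```python
def escape_postal_line(line):
    """Escape backslash and dollar sign inside one postalAddress line."""
    # Backslashes must be handled first so the escapes added for '$' survive.
    return line.replace("\\", "\\5C").replace("$", "\\24")


def build_postal_address(street_number, street_name, city, state, zip_code):
    """Combine separate address fields into one '$'-separated postalAddress value."""
    lines = [f"{street_number} {street_name}", f"{city}, {state} {zip_code}"]
    return "$".join(escape_postal_line(line.strip()) for line in lines)


def complete_phone_number(value, prefix):
    """Prepend a site-specific prefix to a bare four- or five-digit extension.

    Values that are not bare extensions are returned unchanged.
    """
    digits = value.strip()
    if digits.isdigit() and len(digits) in (4, 5):
        return prefix + digits
    return value
```

For instance, complete_phone_number("1212", "+1 408 555 ") expands the extension to a full international number, while a value that already includes a country code passes through untouched.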

Most off-the-shelf metadirectory and virtual directory software packages provide data translation features. However, the set of possible translation requirements is very large, so you may have to create custom code to handle some of the translations you need.