Checking Data Quality

The purpose of data maintenance is to ensure that the data in your directory service has the highest possible quality. Data quality has several aspects, but we focus primarily on the accuracy and timeliness of data. Naturally, you will want to check the quality of your data, both to monitor how well your data maintenance procedures are working and to get an idea of the kind of service you're providing to the users of your directory.

Bad data can creep into your directory service from various directions, including the following:

  • Bad source data. If there is bad data in the source from which you're populating your directory, the data in your directory will also be bad. If you detect such a situation, use the opportunity to improve the quality of the source data. If that approach doesn't work, you might consider filtering the data as it comes in to remove things such as nonprintable characters. If the data in the source is just plain wrong, try to find out why and correct the problem.

  • User or administrator error. People make mistakes. Whenever users or administrators are responsible for entering data, you run the risk of human error. Increased education and training, as well as directory data validation filters, can help correct this kind of problem.

  • Systematic error. Systematic error can be introduced by a flaw in the automated procedure used to populate the directory or by a bug in a program that processes the data. Fixing these kinds of problems can dramatically increase the quality of your directory's data.

Methods of Checking Quality

There are several methods for checking the quality of the data in your directory. Three common ones are:

  1. Source of truth. If you have a source of truth for the data you want to check (typically one or more of your source databases), you can simply compare its data with the data in your directory. This may be easier said than done, of course. You might dump the directory data and source data to files and write a script or program to compare the two files and then report any differences. Or you might write a program that reads information directly from the directory and the source database and then does the comparison online. This is likely to be expensive, however you do it. A lower-cost approach may be to incorporate this check into the regular data synchronization procedure you have developed for the source.

  2. Spot checks. A second method is to perform spot checks of the directory and rely on statistical inference to tell you about the overall quality of your directory data. You can write a program that selects entries from the directory at random and compares them to the corresponding entries in the source-of-truth database; a minimal sketch of this approach appears after this list. This method is much less expensive than doing a complete comparison. You'll need to decide for yourself how many entries to check to have confidence that you're getting an accurate and representative sample of data.

  3. User survey. A third method is to survey users about data quality or to monitor user complaints about incorrect data. This method works well only for data that users care about and can judge the accuracy of. It is also a statistical method, so you'll need to do some educated guessing to estimate the overall quality of your data.
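
To make the spot-check idea concrete, here is a minimal sketch in Python using the ldap3 library. The host name, attribute list, sample size, and the load_source_of_truth() helper are hypothetical placeholders for your own environment:

    # Spot-check sketch: compare a random sample of directory entries against
    # a source-of-truth database. All names below are placeholders.
    import random
    from ldap3 import Server, Connection, BASE

    def load_source_of_truth():
        # Hypothetical stand-in: in practice, read from the authoritative
        # database that feeds your directory.
        return {
            "uid=jsmith,ou=people,dc=example,dc=com": {
                "mail": "jsmith@example.com",
                "telephoneNumber": "+1 555 0100",
            },
        }

    source = load_source_of_truth()
    conn = Connection(Server("ldap://directory.example.com"), auto_bind=True)

    SAMPLE_SIZE = 100                       # tune for the confidence you need
    CHECKED_ATTRS = ["mail", "telephoneNumber"]

    for dn in random.sample(sorted(source), min(SAMPLE_SIZE, len(source))):
        if not conn.search(dn, "(objectClass=*)", search_scope=BASE,
                           attributes=CHECKED_ATTRS):
            print(f"MISSING from directory: {dn}")
            continue
        entry = conn.entries[0]
        for attr in CHECKED_ATTRS:
            dir_value = entry[attr].value if attr in entry.entry_attributes else None
            if dir_value != source[dn].get(attr):
                print(f"MISMATCH {dn} / {attr}: "
                      f"directory={dir_value!r} source={source[dn].get(attr)!r}")

A few hundred randomly chosen entries are usually enough to estimate an overall error rate; enlarge the sample if you need finer-grained confidence.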

Complete comparisons and spot checks can also be used to test the syntactic validity of information, a check that is possible even when no source-of-truth database exists or is accessible. For example, you could read all (or a sample) of the e-mail address attributes in the directory and determine whether they are syntactically valid.
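
As a sketch of such a syntactic sweep (again in Python with ldap3, with a hypothetical host and base DN), the regular expression below is a deliberately rough sanity test, not full RFC 5322 address syntax:

    # Syntactic validity sweep over mail attributes. The regex is a rough
    # sanity check only; addresses it rejects should be reviewed by hand.
    import re
    from ldap3 import Server, Connection, SUBTREE

    MAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

    conn = Connection(Server("ldap://directory.example.com"), auto_bind=True)
    conn.search("dc=example,dc=com", "(mail=*)",
                search_scope=SUBTREE, attributes=["mail"])

    suspect = 0
    for entry in conn.entries:
        for address in entry.mail.values:
            if not MAIL_RE.match(address):
                suspect += 1
                print(f"Suspect address {address!r} in {entry.entry_dn}")
    print(f"{suspect} syntactically suspect addresses found")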

Implications of Checking Quality

It's important to consider how the methods you use to check data quality affect the operation of your directory service. Be sure to choose a method that does not significantly reduce directory performance. Depending on the method you select, you may have to trade off how often you check for quality against the accuracy of your checking methods. The main concerns are methods that place an excessive load on the directory or cause it to be unavailable.

For example, consider a method that requires reading, over LDAP, all the entries in your directory. Your directory might have the capacity to respond to this kind of request without degrading performance for other users, but then again it might not. If you use a method like this, you can run the check at night or at another off-peak time when the directory has plenty of extra capacity to respond to the data-checking requests. However, such an arrangement may be difficult if your directory operates in a global environment in which there is no off-peak time. In that case, another approach is to create a dedicated directory replica that does nothing but process these data verification tasks.
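
One way to keep such a full read cheap is to page through the directory slowly. The sketch below assumes ldap3's simple paged results extension and a hypothetical replica host; the page size and delay are tuning knobs, not recommendations:

    # Low-impact full crawl using the simple paged results control, with a
    # pause between pages so other clients are not starved.
    import time
    from ldap3 import Server, Connection, SUBTREE

    conn = Connection(Server("ldap://replica.example.com"), auto_bind=True)
    pages = conn.extend.standard.paged_search(
        "dc=example,dc=com", "(objectClass=person)",
        search_scope=SUBTREE, attributes=["cn", "mail"],
        paged_size=200, generator=True)

    checked = 0
    for item in pages:
        if item.get("type") != "searchResEntry":
            continue
        checked += 1
        # ... per-entry quality checks go here ...
        if checked % 200 == 0:
            time.sleep(1.0)   # throttle: at most one page per second
    print(f"Checked {checked} entries")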

Consider also a method that requires you to dump your directory's data to a file. Some directory server software allows you to perform this operation without taking the service down, but some does not. If you are planning to use this method, make sure that the software you choose supports online dumps or that your service can tolerate the downtime. Remember that you have replication to help with the availability problem, so consider taking down a replica to produce the extract instead of taking down the master server. You can also produce your own extract over LDAP, but be careful not to degrade performance, as discussed earlier.
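
If you do produce your own extract over LDAP, something like the following sketch can work. It assumes ldap3's response_to_ldif() helper and, as before, a replica host so that the master is untouched:

    # Produce an LDIF extract over LDAP rather than dumping the database
    # offline. For a large directory, combine this with the paged search
    # shown earlier instead of holding the whole response in memory.
    from ldap3 import Server, Connection, SUBTREE

    conn = Connection(Server("ldap://replica.example.com"), auto_bind=True)
    conn.search("dc=example,dc=com", "(objectClass=*)",
                search_scope=SUBTREE, attributes=["*"])

    with open("directory-extract.ldif", "w") as out:
        out.write(conn.response_to_ldif())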

Correcting Bad Data

Whatever method you use to check the quality of your data, be sure to investigate the cause whenever you encounter an error. Identifying the cause will help you correct problems with the system that produced the bad data. Although this kind of investigation can be time-consuming and expensive, it's usually well worth it. You'll often find that many errors are caused by the same underlying problem. Fixing that problem can dramatically increase the quality of your data.

Bad data may be caused by many underlying problems, some of which were already discussed briefly. Systematic errors in programs or procedures should be treated as bugs and corrected. Bad data introduced through human error might be the result of inadequate training or documentation for either users or administrators; increasing the quality and coverage of this training and documentation can cause corresponding improvements in the quality of your data. Human error can also be the result of poor software design. Spend time with users and administrators responsible for updating the directory, and observe the steps they take when maintaining data. Observing others will help you spot flaws in the software and procedures they use.

Finally, even if you can't eliminate poor data coming into your directory, you can mitigate the damage by installing data validation filters. As mentioned earlier, these filters can be installed in directory clients that users and administrators use to update the directory, or they can be installed in the directory service itself.
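
As a client-side illustration, this sketch validates values before a modify operation is ever sent; the rules and credentials shown are illustrative assumptions, not a complete schema:

    # Client-side validation filter: reject obviously malformed values
    # before they reach the directory. Rules and credentials are examples.
    import re
    from ldap3 import Server, Connection, MODIFY_REPLACE

    RULES = {
        "mail": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
        "telephoneNumber": re.compile(r"^\+?[0-9 ()\-]{7,20}$"),
    }

    def validated_modify(conn, dn, attr, value):
        # Apply the validation rule for attr (if any) before sending.
        rule = RULES.get(attr)
        if rule and not rule.match(value):
            raise ValueError(f"{attr} value {value!r} failed validation; not sent")
        return conn.modify(dn, {attr: [(MODIFY_REPLACE, [value])]})

    conn = Connection(Server("ldap://directory.example.com"),
                      user="cn=admin,dc=example,dc=com", password="secret",
                      auto_bind=True)
    validated_modify(conn, "uid=jsmith,ou=people,dc=example,dc=com",
                     "mail", "jsmith@example.com")

Server-side equivalents vary by product; many directory servers can enforce similar checks through schema syntax enforcement or server plug-ins.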
