Checking Data Quality

The purpose of data maintenance is to ensure that the data in your directory service has the highest possible quality. Quality of data has several aspects, but we will focus primarily on the accuracy and timeliness of data. Naturally, you will want to check the quality of your data both to monitor how well your data maintenance procedures are working and to get an idea of the kind of service you are providing to the users of your directory.

Bad data can creep into your directory service from a number of directions, including the following:

  • Bad source data.   If there is bad data in the source from which you are populating your directory, the data in your directory will also be bad. If you detect this, use the opportunity to improve the quality of the source data. If that doesn't work, you might consider filtering the data as it comes in to remove things such as nonprintable characters. If the data in the source is just plain wrong, try to find out why and correct the problem.

  • User or administrator error.   People make mistakes. Any time users or administrators are responsible for entering data, you run this risk. Increased education and training, as well as directory data validation filters, can help correct this kind of problem.

  • Systematic error.   This kind of error can be introduced by a flaw in the automated procedure used to populate the directory or by a program that has a bug in it. Fixing these kinds of problems can dramatically increase your directory's quality.
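
The incoming-data filter mentioned under bad source data can be sketched as follows. This is only a minimal illustration in Python, not a feature of any particular directory product: the function names `strip_nonprintable` and `clean_record` are hypothetical, and a record is assumed to be a dictionary mapping attribute names to lists of string values.

```python
import unicodedata


def strip_nonprintable(value: str) -> str:
    """Drop control and other non-printable characters, then trim whitespace.

    Every character whose Unicode category starts with "C" (control, format,
    unassigned, and so on) is removed. Note that this includes tabs and
    newlines, which is an assumption about what the directory should store.
    """
    cleaned = "".join(
        ch for ch in value if not unicodedata.category(ch).startswith("C")
    )
    return cleaned.strip()


def clean_record(record: dict[str, list[str]]) -> dict[str, list[str]]:
    """Filter every value in a record, dropping values that end up empty."""
    cleaned = {}
    for attr, values in record.items():
        kept = [v for v in (strip_nonprintable(v) for v in values) if v]
        if kept:
            cleaned[attr] = kept
    return cleaned


if __name__ == "__main__":
    raw = {"cn": ["Babs\x00 Jensen\x07"], "description": ["\x01\x02"]}
    print(clean_record(raw))
```

A filter like this only hides the symptom. It is still worth tracing the stray characters back to the source system, as recommended above.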

Methods of Checking Quality

There are several methods you can use to check the quality of data in your directory. The following are three common methods:

  • Source of truth.   If you have a source of truth for the data you'd like to check (typically, one or more of your source databases), you can simply compare its data with the data in your directory. This may be easier said than done, of course. You might dump the directory data and source data to files and write a script or program to compare the two files and then report any differences. Or, you might write a program that reads information directly from the directory and the source database and then does the comparison online. This is likely to be expensive however you do it. A lower-cost approach may be to incorporate this check into the regular data synchronization procedure you have developed for the source.

  • Spot check.   A second method is to perform spot checks of the directory and rely on statistical inferences to tell you about the overall quality of your directory data. You can write a program to select entries from the directory at random and compare them to the corresponding entries in the source of truth database. This method is much less expensive than doing a complete comparison. You'll need to decide for yourself how many entries to check to have confidence that you're getting an accurate and representative sample of data.

  • User survey.   A third method is to survey users to ask them about data quality or monitor user complaints about incorrect data. This method works well only for data that users care about and can judge the accuracy of. This method is also statistical in nature, so you'll need to do some educated guessing to derive the overall quality of your data.
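
The first two methods can share most of their code. The following Python sketch assumes that both the source of truth and the directory have already been loaded into dictionaries keyed by a common identifier (for example, an employee number), each mapping to a dictionary of single-valued attributes. That layout, and the names `compare_records` and `spot_check`, are assumptions made for illustration only.

```python
import random

# Hypothetical layout: {key: {attribute: value}}
Records = dict[str, dict[str, str]]


def compare_records(source: Records, directory: Records, attrs, keys=None):
    """Return (key, attribute, source_value, directory_value) mismatches.

    Only keys present in the source are examined, so entries that exist
    solely in the directory are not reported by this sketch.
    """
    differences = []
    for key in (keys if keys is not None else sorted(source)):
        src = source[key]
        entry = directory.get(key)
        if entry is None:
            differences.append((key, None, "present", "missing"))
            continue
        for attr in attrs:
            if src.get(attr) != entry.get(attr):
                differences.append((key, attr, src.get(attr), entry.get(attr)))
    return differences


def spot_check(source: Records, directory: Records, attrs, sample_size, seed=None):
    """Compare a random sample and return (differences, fraction of bad entries)."""
    rng = random.Random(seed)
    keys = rng.sample(sorted(source), min(sample_size, len(source)))
    differences = compare_records(source, directory, attrs, keys)
    bad_keys = {diff[0] for diff in differences}
    error_rate = len(bad_keys) / len(keys) if keys else 0.0
    return differences, error_rate
```

Calling `compare_records` without `keys` performs the full comparison, while `spot_check` limits the work to `sample_size` randomly chosen entries and returns the observed error rate for that sample. How large the sample must be before that rate is trustworthy remains your decision.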

The exhaustive and spot-check approaches can also be used to check the syntactic validity of information, even when no source of truth database exists or is accessible. For example, you could read all (or a sampling) of the email address attributes in the directory and determine whether they are syntactically valid.
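
As a rough illustration of such a syntax check, the Python sketch below flags `mail` values that do not match a deliberately simplified pattern. The pattern is an assumption chosen for brevity and is far stricter and simpler than the full Internet mail address grammar. Entries are assumed to arrive as `(dn, attributes)` pairs, with each attribute mapped to a list of values.

```python
import re

# Simplified pattern: local part, "@", one or more dot-separated labels,
# and an alphabetic top-level domain. Not a complete address grammar.
EMAIL_RE = re.compile(r"^[A-Za-z0-9._%+-]+@(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,}$")


def invalid_emails(entries):
    """Yield (dn, value) for every mail value that fails the pattern."""
    for dn, attrs in entries:
        for value in attrs.get("mail", []):
            if not EMAIL_RE.match(value):
                yield dn, value


if __name__ == "__main__":
    sample = [
        ("uid=babs,dc=example,dc=com", {"mail": ["babs@example.com"]}),
        ("uid=sam,dc=example,dc=com", {"mail": ["sam@example", "no-at-sign"]}),
    ]
    for dn, value in invalid_emails(sample):
        print(f"{dn}: suspicious mail value {value!r}")
```

Treat the hits as candidates for review rather than proven errors, because a simplified pattern can reject addresses that are in fact valid.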

Implications of Checking Quality

It's important to consider the implications of your data quality checking methods for the operation of your directory service. Be sure to choose a method that does not significantly reduce directory performance. Depending on the method you choose, you may have to make a trade-off between how often you check for quality and the accuracy of your checking methods. The main concerns in this area are methods that cause an excessive load on the directory or cause the directory to be unavailable.

For example, consider a method that requires reading over LDAP all the entries in your directory. Your directory might have the capacity to respond to this kind of request without degrading performance for other users, but then again it might not. If you use a method like this, you can run the check at night or another off-peak time when the directory has plenty of extra capacity to respond to the data-checking requests. This may be difficult if your directory operates in a global environment in which there is no off-peak time. Another approach then is to create a dedicated directory replica that does nothing but process these data-verification tasks.
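
When neither an off-peak window nor a dedicated replica is available, another option is to pace the checking program itself. The Python sketch below is one possible approach, not a recommendation from any server vendor. It assumes that `entries` is a lazy iterable (for instance, search results that are fetched on demand), and the name `throttled_batches` and its parameters are hypothetical.

```python
import time
from itertools import islice


def throttled_batches(entries, batch_size=100, pause_seconds=1.0, sleep=time.sleep):
    """Yield lists of up to batch_size entries, pausing after each batch.

    The pause happens before the next batch is pulled from the iterable,
    so a lazily evaluated search is read in bursts rather than all at once.
    The sleep function is a parameter so that it can be replaced in tests.
    """
    iterator = iter(entries)
    batch = list(islice(iterator, batch_size))
    while batch:
        yield batch
        sleep(pause_seconds)
        batch = list(islice(iterator, batch_size))


if __name__ == "__main__":
    for batch in throttled_batches(range(10), batch_size=4, pause_seconds=0.1):
        print("checking", batch)
```

The batch size and pause length are tuning knobs. Whether pacing the client actually reduces load on the server depends on how your server and client library deliver search results, so measure before relying on it.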

Consider also a method that requires you to dump your directory's data to a file. Some directory server software allows you to perform this operation without taking the service down, but some does not. If you are planning to use this method, be sure the software you choose supports online production of the necessary extracts or that your service can tolerate the downtime. Remember that you have replication to help with the availability problem, so consider taking down a replica to produce the extract instead of taking down the master server. Also, consider producing your own extract over LDAP, but be careful not to degrade performance, as discussed earlier.

Correcting Bad Data

Whatever method you use to check the quality of your data, be sure to investigate the cause any time you encounter an error. This will help you correct problems with the system that produced the bad data. Although this kind of investigation can be time-consuming and expensive, it's usually well worth it. You'll often find that many errors are caused by the same underlying problem. Fixing that problem can dramatically increase the quality of your data.

Many underlying problems can cause bad data, some of which were already discussed briefly. Systematic errors in programs or procedures should be treated as bugs and corrected. Bad data introduced through human error might be the result of inadequate training or documentation for either users or administrators; increasing the quality and coverage of this training and documentation can cause corresponding quality increases in your data. Human error can also be the result of poor software design. Spend time with users and administrators responsible for updating the directory and observe the steps they take when maintaining data. This can often point out flaws in the software and procedures they use.

Finally, even if you can't eliminate poor data coming into your directory, you can mitigate the damage by installing data-validation filters. As mentioned earlier, these filters can be installed in directory clients that users and administrators use to update the directory, or they can be installed in the directory service itself.
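
A client-side validation filter might look something like the following Python sketch. The attribute names are standard LDAP attribute types, but the rules attached to them, the `VALIDATORS` table, and the `validate_update` function are assumptions invented for this example. Your own rules will depend on your schema and conventions.

```python
import re

# Hypothetical rules: each maps an attribute name to a function that
# returns True when a proposed value is acceptable.
VALIDATORS = {
    "telephoneNumber": lambda v: re.fullmatch(r"\+?[0-9 ()-]{7,20}", v) is not None,
    "cn": lambda v: v.strip() != "" and v.isprintable(),
}


def validate_update(changes):
    """Return the (attribute, value) pairs that fail their validator.

    `changes` maps attribute names to lists of proposed values. Attributes
    without a registered validator are accepted unchanged.
    """
    rejected = []
    for attr, values in changes.items():
        check = VALIDATORS.get(attr)
        if check is None:
            continue
        rejected.extend((attr, value) for value in values if not check(value))
    return rejected


if __name__ == "__main__":
    proposed = {"cn": ["Babs Jensen"], "telephoneNumber": ["call me"]}
    problems = validate_update(proposed)
    if problems:
        print("update refused:", problems)
```

An update tool would call `validate_update` before sending a modification to the directory and refuse, or ask the user to correct, any change that returns a non-empty list.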



Understanding and Deploying LDAP Directory Services,  2002 New Riders Publishing
2002, O'Reilly & Associates, Inc.



Understanding and Deploying LDAP Directory Services
Understanding and Deploying LDAP Directory Services (2nd Edition)
ISBN: 0672323168
EAN: 2147483647
Year: 1997
Pages: 245
