Maintenance

Maintenance of the Big State directory was relatively simple, but not without its problems. Some of the more tedious and time-consuming aspects of system maintenance are covered in the following sections.

Data Backup and Disaster Recovery

Three approaches are taken to provide backup and disaster recovery for the directory service. First, each machine providing service is backed up to magnetic tape regularly. These tapes are saved for six weeks and rotated periodically. This kind of backup is used primarily to guard against any of the single systems becoming corrupted. The system's configuration can be restored from the backup tape, and its directory can be repopulated using one of the techniques described later in this section.

Second, directory replicas are also used as a backup service. If a secondary replica becomes corrupted, directory data can be restored from the master replica. If the master directory server becomes corrupted, it can be restored from the database of the most up-to-date replica server. This kind of procedure often provides more timely backups than the magnetic tape method. The tapes are typically updated with an incremental backup once per day, whereas directory replicas are constantly updated and kept in sync with the master.

Third, the directory data is dumped to a text file and transferred to a different, secure machine every night. If the master and all replica directories become corrupt, they can be restored from this saved file, losing only the intervening changes. This kind of backup is helpful in recovering from such problems as an out-of-control administrative procedure that mistakenly deletes a bunch of entries from the directory. It is also helpful in recovering data mistakenly deleted or modified by a user.
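In practice, the nightly dump can be driven by a small scheduled script that exports the full directory contents to LDIF and copies the file to the secure machine. The following Python sketch illustrates the idea using the standard OpenLDAP ldapsearch command; the host names, bind credentials, and paths are assumptions for illustration, not Big State's actual configuration.

#!/usr/bin/env python3
"""Nightly directory dump: export all entries to LDIF and copy the file
off-host.  A minimal sketch; host names, credentials, and paths are
hypothetical."""

import datetime
import subprocess

LDAP_URI = "ldap://directory.bigstate.edu"      # assumed master server
BASE_DN = "dc=bigstate,dc=edu"                  # assumed directory suffix
BIND_DN = "cn=backup,dc=bigstate,dc=edu"        # assumed read-only account
BACKUP_HOST = "vault.bigstate.edu"              # assumed secure machine

def dump_and_ship(password: str) -> None:
    stamp = datetime.date.today().isoformat()
    dump_file = f"/var/backups/directory-{stamp}.ldif"

    # Export every entry under the suffix as plain LDIF.
    with open(dump_file, "w") as out:
        subprocess.run(
            ["ldapsearch", "-x", "-LLL",
             "-H", LDAP_URI, "-D", BIND_DN, "-w", password,
             "-b", BASE_DN, "(objectClass=*)", "*"],
            stdout=out, check=True)

    # Copy the dump to a different, secure machine.
    subprocess.run(["scp", dump_file, f"{BACKUP_HOST}:/srv/ldif/"], check=True)

if __name__ == "__main__":
    dump_and_ship(password="change-me")  # in practice, read from a protected file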

20-20 Hindsight: Backup Procedures

Maintaining three separate backup and disaster recovery procedures may seem like overkill, but Big State found that each procedure was the best choice for different situations, and sometimes the procedures can be used in combination. For example, a disk failure on a production machine is best handled by restoring the machine's file systems from the tape backup, and then restoring its directory data from one of the replica directories. This kind of creative thinking made the job of recovering from failures and catastrophes much easier. Using the right combination of procedures helps to minimize recovery time and increase system uptime.

Maintaining Data

One of the most trouble-prone and time-consuming tasks associated with the Big State directory service is data maintenance. There are several procedures related to this task:

  • Bulk data source loading.   This is the main way the directory is populated with data. Periodic data dumps are obtained from the personnel and student source databases and merged with the existing directory data to produce an updated directory. The merge is performed by a program called "munge," used in conjunction with a set of step-by-step procedures (a simplified sketch of such a merge appears after this list). The merge is complicated and error-prone, and it is made more difficult by the fact that the directory must be restricted to read-only mode during much of the procedure.

  • Directory content administration.   These kinds of changes are primarily made by help desk and other support staff in response to user requests. This includes requests to move an entry from one department to another and to change certain attributes that users themselves are not allowed to change. For example, a title attribute maintained in the directory often doesn't contain exactly what a user considers his or her title to be. In this instance, the user can only request that the title be changed.

  • Other automated data source loading.   This includes various other automated links to data sources on campus. For each source, a separate data loading procedure was developed and must be maintained. One early source of trouble was different feeds fighting with one another: one feed would add data to the directory, only to have it overwritten by another (this typically happened to users with accounts on two or more local systems). The solution was to add source-tracking capabilities to the automated data source loading procedures.
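The heart of a bulk-loading procedure such as munge is a merge step: compare the incoming feed with what is already in the directory and compute the adds, modifications, and deletes needed to bring the two into line, honoring the source-tracking rule so that one feed does not clobber another's data. The following Python sketch illustrates the idea only; the record key, attribute names, and the "sourcedby" tracking attribute are assumptions, and the real munge program is considerably more involved.

"""Sketch of a source-merge step in the spirit of the munge procedure.
Record layout, attribute names, and the "sourcedby" tracking attribute
are assumptions for illustration only."""

from typing import Dict, List, Tuple

Entry = Dict[str, str]   # attribute name -> value; entries are keyed by uid

def merge(source: str,
          feed: Dict[str, Entry],
          directory: Dict[str, Entry]) -> Tuple[Dict[str, Entry], Dict[str, Entry], List[str]]:
    """Compare a data feed against the current directory contents and return
    the adds, modifications, and deletes needed to bring them into sync."""
    adds: Dict[str, Entry] = {}
    mods: Dict[str, Entry] = {}
    dels: List[str] = []

    for uid, new_entry in feed.items():
        old = directory.get(uid)
        if old is None:
            # New person: record which feed owns the entry before adding it.
            adds[uid] = dict(new_entry, sourcedby=source)
            continue
        # Source tracking: only touch entries this feed is responsible for,
        # so two feeds do not fight over the same entry.
        if old.get("sourcedby") != source:
            continue
        changed = {attr: value for attr, value in new_entry.items()
                   if old.get(attr) != value}
        if changed:
            mods[uid] = changed

    # Entries owned by this feed that no longer appear in it are deleted.
    for uid, old in directory.items():
        if old.get("sourcedby") == source and uid not in feed:
            dels.append(uid)

    return adds, mods, dels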

The Big State directory is not unusual in the occasionally poor quality of its data. This problem leads to user complaints, which in turn lead to manual work by directory maintenance staff to correct the problems. These manual tasks can become quite burdensome and expensive.

Monitoring

Big State has an extensive monitoring system in place that focuses on network devices such as routers, hubs, and server network interfaces. As is the case in many organizations, the group that provides this monitoring is distinct from the group that deployed the directory. The Big State directory designers wanted to leverage this existing monitoring infrastructure as much as possible when monitoring the directory system.

The Big State monitoring system provides a mechanism for calling out to user-developed code to perform certain tests. The directory deployment team worked with the monitoring system maintenance staff to incorporate plug-in programs, written by the directory team, that perform a number of directory tests. These allowed directory alerts to be displayed on the monitoring system's trouble board and dealt with by monitoring system staff when appropriate.
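Such a plug-in test is typically a small program that connects to the directory, performs a known operation, times it, and reports a status the monitoring system understands. The following Python sketch, written against the ldap3 module and using conventional monitoring exit codes (0 for OK, 1 for warning, 2 for critical), is an illustration only; the host name, search base, and thresholds are assumed values rather than Big State's actual plug-in code.

#!/usr/bin/env python3
"""Directory health probe suitable for calling from a monitoring system.
A sketch only: host, base DN, and thresholds are assumed values."""

import sys
import time

from ldap3 import Server, Connection, BASE

HOST = "directory.bigstate.edu"       # assumed server name
BASE_DN = "dc=bigstate,dc=edu"        # assumed suffix
WARN_SECONDS = 2.0                    # "slow" threshold
CRIT_SECONDS = 10.0                   # treat anything slower as effectively down

def main() -> int:
    start = time.monotonic()
    try:
        conn = Connection(Server(HOST, connect_timeout=5), auto_bind=True)
        conn.search(BASE_DN, "(objectClass=*)", search_scope=BASE)
        conn.unbind()
    except Exception as exc:
        print(f"CRITICAL: directory unreachable: {exc}")
        return 2
    elapsed = time.monotonic() - start
    if elapsed > CRIT_SECONDS:
        print(f"CRITICAL: search took {elapsed:.1f}s")
        return 2
    if elapsed > WARN_SECONDS:
        print(f"WARNING: directory is slow ({elapsed:.1f}s)")
        return 1
    print(f"OK: search completed in {elapsed:.2f}s")
    return 0

if __name__ == "__main__":
    sys.exit(main())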

The directory deployment team developed and documented procedures to help the monitoring staff know what to do in case of a directory alert. Initially, these procedures usually specified paging or emailing a directory team member, depending on the severity of the event. As both the monitoring and directory teams became more comfortable with the service, the procedures were updated to allow the monitoring staff to troubleshoot certain problems. Responses to some alerts were even automated: directory team members were paged automatically in the event of a serious failure, such as the directory becoming unreachable or the replication or email queues growing inordinately large.

Another aspect of the Big State monitoring system is log analysis. The directory team developed log analysis software to produce daily and weekly summaries of directory-related activities. These include the number and types of operations the directory servers themselves handle, as well as statistics on important directory-enabled applications. For example, the periodic reports detailing usage of the phone book and email applications are invaluable for predicting capacity problems, justifying funding expenditures, and general public relations.
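A daily summary of this kind can be produced by a short log-analysis script that tallies operations by type. The sketch below assumes a simplified access-log format in which the operation type appears as a whitespace-separated token on each line; real server logs require a format-specific parser.

"""Produce a daily summary of directory operations from an access log.
A sketch only: it assumes a simplified log format in which the operation
type (BIND, SEARCH, MODIFY, ...) appears as a whitespace-separated field."""

import collections
import sys

OPERATIONS = {"BIND", "SEARCH", "COMPARE", "ADD", "MODIFY", "DELETE"}

def summarize(log_path: str) -> None:
    counts = collections.Counter()
    with open(log_path) as log:
        for line in log:
            for field in line.split():
                if field in OPERATIONS:
                    counts[field] += 1
                    break
    total = sum(counts.values())
    print(f"Total operations: {total}")
    for op, n in counts.most_common():
        print(f"  {op:8s} {n:10d}")

if __name__ == "__main__":
    summarize(sys.argv[1])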

20-20 Hindsight: Monitoring

The Big State monitoring system worked fairly well, but it tended to produce a lot of false alarms. This was partly caused by incorrect threshold parameters in the monitoring software. For example, sometimes the directory would be reported as down when it was in fact just slow. These threshold parameters were tuned to improve the situation.

More seriously, the monitoring system revealed a great deal of variance in the level of service provided to directory clients. This was partly a capacity problem, easily remedied with the introduction of bigger and faster directory server machines. But it was also an indication of problems with the directory software itself. These problems were harder to fix and required relatively expensive staff time to debug and improve the directory server software.

Troubleshooting

Big State developed a number of troubleshooting procedures for dealing with directory problems. Following are some of the more interesting problems that led to the development of these procedures:

  • Infinite email loops.   This is one of the nastier problems encountered by the directory team. Users are allowed to create mail groups that can contain addresses that point anywhere at all. Sometimes they create a condition in which email sent to a group is sent to another address that forwards it back to the group, creating an infinite loop. Most mail software has no trouble handling such simple loops, stopping them after some maximum hop count is reached. Unfortunately, loops can get more complicated and harder to detect. For example, consider a group with two members, each of which points back to the group itself. This situation results in an exponentially increasing number of emails flooding the system.

  • Data feed disasters.   On a few occasions, the data feed from the staff and student source databases has contained erroneous information. A typical mistake is the exclusion of a whole class of employees. Without careful safeguards, this situation could have resulted in the wholesale deletion of many directory entries.

  • Replication failures.   The replication software has proved to be rather unstable. Unanticipated changes to the master server or small configuration differences between master and replica often cause changes to fail to replicate. This situation requires manual intervention by a knowledgeable directory administrator and often involves directory downtime. In the most serious situations, either the master or replica server crashes while attempting to replicate a change.

Several troubleshooting procedures developed from these problems. Perhaps the most important is related to problems with the email service. Step one in dealing with any email-related problem is to turn off email service. No harm is done by this because undeliverable mail is queued, usually for up to three days. Taking this step avoids the more serious error of bouncing email and buys the directory maintenance team valuable time to figure out and correct the problem.

The mail loop problem has never been fully resolved, but a number of steps were taken to mitigate it. First, the mail delivery software was modified to detect and reject the simplest forms of mail loops. Second, an automated process was developed to trawl the directory for other situations likely to cause mail loops and to alert administrators to suspected problems. Third, the directory monitoring software was improved to detect mail loops more quickly and accurately when they occur. Finally, better tools were developed to recover from mail loop disasters; these make it easier to hunt down and delete the loop-generating messages clogging the mail system.
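At its core, trawling for likely mail loops is cycle detection over the forwarding graph defined by groups and forwarding addresses. The following Python sketch illustrates the idea; building the graph from directory entries is omitted, and the data structures and example addresses are assumptions.

"""Look for mail-forwarding cycles of the kind that cause infinite loops.
A sketch only: the forwarding graph is assumed to have been built already
by reading groups and forwarding addresses out of the directory."""

from typing import Dict, List, Set

def find_loops(forwards: Dict[str, List[str]]) -> List[List[str]]:
    """Return every forwarding chain that leads back to an address already
    on the chain (a mail loop)."""
    loops = []

    def walk(address: str, chain: List[str], seen: Set[str]) -> None:
        if address in seen:
            loops.append(chain + [address])
            return
        seen = seen | {address}
        for target in forwards.get(address, []):
            walk(target, chain + [address], seen)

    for address in forwards:
        walk(address, [], set())
    return loops

# Example: a group whose two members both forward back to the group.
if __name__ == "__main__":
    graph = {
        "staff-group": ["alice-fwd", "bob-fwd"],
        "alice-fwd": ["staff-group"],
        "bob-fwd": ["staff-group"],
    }
    for loop in find_loops(graph):
        print(" -> ".join(loop))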

Other tools were developed to help detect data feed disasters before serious damage can occur. These include changes to the munge program to look for large, unexpected changes in the user population during the monthly data merge. Large changes are reported to administrators, who can confirm that they are legitimate.
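A safeguard of this kind boils down to a sanity check applied before the merge: compare the incoming feed with the current directory population and refuse to proceed if the implied change is implausibly large. The Python sketch below illustrates the check; the ten percent threshold is an arbitrary illustrative value, not the figure Big State used.

"""Refuse to apply a data feed that differs too much from the current
directory population.  A sketch: the 10% threshold is an arbitrary
illustrative value."""

def feed_looks_sane(feed_ids: set, directory_ids: set,
                    max_change_fraction: float = 0.10) -> bool:
    """Return False if the feed would add or delete an implausibly large
    share of the existing entries, so an administrator can review it."""
    if not directory_ids:
        return True  # first load; nothing to compare against
    missing = len(directory_ids - feed_ids)   # entries the feed would delete
    added = len(feed_ids - directory_ids)     # entries the feed would add
    return (missing + added) / len(directory_ids) <= max_change_fraction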

Tools were also developed to detect and recover from replication failures. The directory monitoring system has been augmented with tests that alert administrators if replication appears stuck (for example, if the replication queue grows too large). Other tools make recovering from a replication failure easier, including scripts that automate the process of creating one directory replica from another. Clearing "bad" changes out of a replication log remains a manual process, although directory administrators have more recently developed surgical techniques for repairing entries damaged by replication errors without causing service interruptions.
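One protocol-independent way to detect stuck replication is a "canary" check: write a timestamp to a designated entry on the master and confirm that it appears on each replica within a reasonable window. The Python sketch below, written against the ldap3 module, illustrates this generic technique rather than Big State's actual queue-size tests; the host names, entry DN, attribute, and timeout are all assumptions.

"""Canary check for stuck replication: write a timestamp on the master and
confirm each replica sees it.  A sketch only; DNs, attribute name, and
credentials are assumed, and this is a generic technique, not the
queue-size test described in the text."""

import time

from ldap3 import Server, Connection, MODIFY_REPLACE

MASTER = "directory-master.bigstate.edu"                # assumed host names
REPLICAS = ["directory-1.bigstate.edu", "directory-2.bigstate.edu"]
CANARY_DN = "cn=replication-canary,dc=bigstate,dc=edu"  # assumed test entry
ATTR = "description"                                    # attribute used as the canary
TIMEOUT = 300                                           # seconds before declaring trouble

def check(bind_dn: str, password: str) -> list:
    stamp = str(int(time.time()))

    # Write the canary value on the master.
    master = Connection(Server(MASTER), bind_dn, password, auto_bind=True)
    master.modify(CANARY_DN, {ATTR: [(MODIFY_REPLACE, [stamp])]})
    master.unbind()

    # Poll each replica until the value arrives or the timeout expires.
    stuck = []
    deadline = time.time() + TIMEOUT
    for host in REPLICAS:
        replica = Connection(Server(host), auto_bind=True)
        while True:
            replica.search(CANARY_DN, "(objectClass=*)", attributes=[ATTR])
            if replica.entries and stamp in replica.entries[0][ATTR].values:
                break
            if time.time() > deadline:
                stuck.append(host)   # alert an administrator about this replica
                break
            time.sleep(10)
        replica.unbind()
    return stuck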


