General Characteristics of Data Elements

All data elements share some general characteristics:

Format
The size of each data value
The number of distinct data values
Data ownership and restrictions
Consumers
Frequency of changes in values (dynamic versus static)
Range of applicability (shared versus application-specific)
Relationships with other data elements

Each characteristic mentioned in the preceding list is discussed in more detail in the following sections.

Tip

Before you design your directory schema (a topic we will tackle in Chapter 8), you should characterize each element using the guidelines included in the following sections. Add this information to the list of data elements you created when you examined your application's needs.

Format

Data elements can be grouped according to the natural format of the information. For example, people's names are always textual data, but telephone numbers consist primarily of digits. Table 7.3 shows some of the more common data formats and provides sample data elements for each.

If your textual data is written in more than one character set or language, be sure to note that as well. As you will see in Chapter 8, Schema Design, each LDAP attribute type is assigned a syntax and a set of matching rules that precisely define the rules for interpreting the stored values. For example, the cn (common name) attribute is of the syntax DirectoryString with a caseIgnoreMatch matching rule. This means that Unicode characters can be stored in a cn attribute and that the case of letters that make up a name is not significant in comparisons of one cn value with another.
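The effect of a case-insensitive matching rule on cn values can be sketched in a few lines of Python. This is only a rough approximation: the helper name case_ignore_equal is hypothetical, and real directory servers apply the full string preparation rules defined for LDAP (RFC 4518) rather than this simplified case folding and space collapsing.

```python
def case_ignore_equal(a: str, b: str) -> bool:
    """Rough approximation of a caseIgnoreMatch-style comparison.

    Assumption: ignoring letter case plus leading, trailing, and repeated
    internal whitespace is close enough for illustration. Real servers
    follow the LDAP string preparation rules instead.
    """
    def normalize(value: str) -> str:
        # Collapse runs of whitespace to one space, then fold case.
        return " ".join(value.split()).casefold()

    return normalize(a) == normalize(b)


print(case_ignore_equal("Babs Jensen", "babs  JENSEN"))  # True
print(case_ignore_equal("Babs Jensen", "Babs Jenson"))   # False
```

Noting the expected matching behavior next to each textual data element makes it easier to pick an attribute syntax later.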

The Size of Each Data Value

Knowing the approximate size of each value in bytes will help with the more directory-specific design work that we will tackle in subsequent chapters. Although it is sometimes difficult to assign hard limits for the size of a data element, it is usually relatively easy to come up with a range that encompasses the typical data values that will be used. For example, the elements of a North American telephone number combine to require approximately 16 characters of storage (1 character for the North American country code of 1, 3 for the area code, 7 for the local number, and 5 for optional spaces and punctuation).

At the other end of the scale, if you choose to store images in your directory service, the size of each value will be much larger (for example, 50KB or more). Be sure to check your directory server software to see whether it places any restrictions on the size of the data values.

Table 7.3. Common Data Formats

General Format	Common Variations	Sample Data Elements
Text string	Case sensitive, case insensitive	Person's name, printer's name, URL
Multiline text string	Case sensitive, case insensitive	Postal address, description
Phone number	Local, international	Work phone number, fax number
Numeric	Integer, floating point	Employee number, cost
Multimedia	Image, sound, movie	Photograph, music sample
Binary	(Many variations)	Digital certificate, preferences data

When considering how large text data values will be, do not forget to take into account the character set if you have any data that is not plain ASCII. As explained in Chapter 2, Introduction to LDAP, the UTF-8 encoding of the Unicode character set is used to represent international data in LDAP directories. UTF-8 is a variable-length encoding: Each UTF-8 character requires between 1 and 4 bytes, and each ASCII character requires only 1 byte.
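To see how much the character set affects storage estimates, you can compare the character count of a value with its UTF-8 byte count. The following Python sketch does this for a few sample values; the helper name utf8_size and the sample names are illustrative assumptions, not data from a real directory.

```python
def utf8_size(value: str) -> int:
    """Return the number of bytes needed to store value as UTF-8."""
    return len(value.encode("utf-8"))


# Sample values (hypothetical): plain ASCII, accented Latin, and Japanese.
samples = [
    "Babs Jensen",
    "Jos\u00e9 N\u00fa\u00f1ez",  # accented letters take 2 bytes each
    "\u5c71\u7530\u592a\u90ce",   # each of these characters takes 3 bytes
]

for value in samples:
    print(f"{value}: {len(value)} characters, {utf8_size(value)} bytes")
```

For pure ASCII data the two counts are equal, so the byte estimate matters only for elements that can hold non-ASCII text.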

Tip

In some cases it makes more sense to store a pointer to a data value in the directory service instead of storing the data value itself. This pointer-based approach is especially useful when the data value is large but related to another data element that you plan to store in your directory.

For example, if you are storing information about all your department's current projects, you may want to include HTTP URLs that point to the detailed project plans. The project plans themselves should remain on Web servers because they are probably large, complex documents, but users and applications will still be able to locate these documents by consulting the directory service.

The Number of Distinct Data Values

For each data element, you should answer the question, How many data values will this element typically have? For example, a person will usually have only one user ID but may have several phone numbers. This information will help in directory capacity planning and may also be useful if you replicate or synchronize with other data stores that may have different characteristics. It will be important to know, for example, if a data store can store only one value for a person's name, whereas your directory can store many values.
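The mismatch between a multivalued directory and a single-valued data store can be made concrete with a small Python sketch. The entry contents, the function names, and the policy of keeping the first value are all assumptions chosen for illustration; your own synchronization rules may differ.

```python
from typing import Dict, List, Tuple

Entry = Dict[str, List[str]]

# Hypothetical directory entry: one user ID, two phone numbers.
entry: Entry = {
    "uid": ["bjensen"],
    "telephoneNumber": ["+1 408 555 1212", "+1 408 555 9999"],
}


def value_counts(e: Entry) -> Dict[str, int]:
    """Count how many values each data element currently holds."""
    return {attr: len(values) for attr, values in e.items()}


def to_single_valued(e: Entry) -> Tuple[Dict[str, str], Dict[str, List[str]]]:
    """Flatten an entry for a store that allows one value per element.

    Assumed policy: keep the first value and report the rest as dropped.
    """
    kept: Dict[str, str] = {}
    dropped: Dict[str, List[str]] = {}
    for attr, values in e.items():
        if not values:
            continue
        kept[attr] = values[0]
        if len(values) > 1:
            dropped[attr] = values[1:]
    return kept, dropped


print(value_counts(entry))
kept, dropped = to_single_valued(entry)
print(kept)
print(dropped)
```

Recording which elements would lose values in such a conversion tells you where a synchronization process needs an explicit rule.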

Data Ownership and Restrictions

Dealing with data ownership issues is one of the more challenging aspects of directory design. When you're thinking about access control, privacy, and security issues, it is important to know exactly which people and applications should be allowed to view or modify a data element. Data ownership also affects whether you will allow a data element to be changed by directory clients and whether and how a data element is kept in sync with other data sources. Often various restrictions will be imposed on a data value by the data owner or another interested group.

In addition, data owners may associate business rules with critical data elements, and these rules may restrict who can view or update the data element. For example, the human resources department probably has a well-defined process that restricts who can assign employee numbers to new hires, and it may also impose restrictions on who can see employee numbers.

Some questions to ask yourself include the following: Who should be notified when this data element is modified? If this data element is stored in more than one data source, which system has final authority over it? Who has money or another stake riding on the accuracy of this data value?

Consumers

The consumers of a data element are the directory-enabled applications and external data sources that use it. When you're planning directory replication and topology, and managing the relationships between your directory service and other data sources, it is helpful to know about the consumers of each data element.

For example, a message transfer agent (MTA) is a piece of application software whose job is to route e-mail messages to their correct destinations. When processing e-mail, LDAP-enabled MTAs may look at an attribute in a user's entry called mailHost, which gives the host name of the server that holds the user's e-mail (see Figure 7.2).

Figure 7.2. An MTA and the `mailHost` Attribute

The MTA is thus an important consumer of the mailHost attribute, so it may be important for the MTA always to get the most up-to-date copy of the mailHost attribute that is available. E-mail client software might also use the same mailHost attribute to determine the location of a user's mail drop, in which case mailHost becomes a shared attribute.

Similarly, users may be allowed to change their home telephone numbers in the directory service, and these changes may be propagated to a personnel system that stores its data in an Oracle database. Figure 7.3 shows this scenario.

Figure 7.3. Home Phone Number Propagated to Personnel Database

In this scenario the personnel system is a consumer of the home telephone number, but it is also a data source for other, non-LDAP applications that may access the phone number directly from it. If the home phone number is not a piece of data critical to the personnel system, it may be OK for updates between it and the directory service to be fairly infrequent. However, if the phone number is critical to both the personnel system and the directory service, a process that accomplishes frequent synchronization may need to be developed.

If you compile detailed information about all the consumers of a data element, you can aggregate all the information into an estimate of how often a given data element will be accessed. Again, this information is useful when you're doing capacity planning for your directory service.
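As a sketch of this aggregation, the following Python fragment sums per-consumer read estimates for one data element and converts the total to an average rate. The consumer names and the daily read counts are invented placeholders; substitute figures from your own survey of consumers.

```python
from typing import Dict

SECONDS_PER_DAY = 24 * 60 * 60

# Hypothetical estimates of reads per day, by consumer, for mailHost.
mailhost_reads_per_day: Dict[str, int] = {
    "MTA": 50_000,
    "e-mail clients": 12_000,
}


def total_reads_per_day(reads_by_consumer: Dict[str, int]) -> int:
    """Add up the estimated daily reads from every consumer."""
    return sum(reads_by_consumer.values())


def average_reads_per_second(reads_by_consumer: Dict[str, int]) -> float:
    """Convert the daily total into an average per-second rate."""
    return total_reads_per_day(reads_by_consumer) / SECONDS_PER_DAY


total = total_reads_per_day(mailhost_reads_per_day)
print(f"mailHost: {total} reads/day")
print(f"mailHost: {average_reads_per_second(mailhost_reads_per_day):.2f} reads/second on average")
```

An average hides daily peaks, so treat the per-second figure as a lower bound when sizing servers.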

Frequency of Changes in Values: Dynamic or Static?

It is also helpful to know which data elements are dynamic (that is, have values that change often) and which are static (that is, have values that change infrequently). You will need this information for design of your directory server topology, as well as for capacity planning. For example, suppose you use a replicated directory service that allows writes for a given entry to occur on only one server (a single-master system). If you have many attributes whose values change often, you may need to partition your data to avoid overwhelming any one master server with the write traffic.

One way to characterize the dynamic or static nature of attributes is by estimating the ratio of reads to writes for each data element. For example, a user ID may be written once when a student joins a university but read dozens of times each day as e-mail is delivered; this attribute is static because its read-to-write ratio is extremely high (thousands of reads for every write over the course of a year). In contrast, if a Web browser stores a user's personal bookmarks in the directory service, they may be changed once a day or more often; these attributes are dynamic because they have a read-to-write ratio that may be close to 1:1.
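The read-to-write estimate can be reduced to a small calculation. In the Python sketch below, the function names, the yearly read and write counts, and the cutoff of 10 reads per write are all assumptions made for illustration; no standard threshold separates static from dynamic elements.

```python
def read_write_ratio(reads: int, writes: int) -> float:
    """Return reads per write; an element that is never written gets infinity."""
    if writes == 0:
        return float("inf")
    return reads / writes


def classify(reads: int, writes: int, threshold: float = 10.0) -> str:
    """Label an element static or dynamic.

    Assumption: a ratio at or above `threshold` counts as static.
    The default of 10 is arbitrary; choose a value that suits your servers.
    """
    return "static" if read_write_ratio(reads, writes) >= threshold else "dynamic"


# Hypothetical yearly figures for the two examples in the text.
user_id = {"reads": 30 * 365, "writes": 1}       # read dozens of times a day, written once
bookmarks = {"reads": 400, "writes": 365}        # changed about once a day

print("user ID:", classify(**user_id))
print("bookmarks:", classify(**bookmarks))
```

Running the same calculation for every element in your list gives a quick view of where write traffic will concentrate.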

Range of Applicability: Shared or Application-Specific?

Some data elements are used by many applications; others may be used by only one application. Shared data elements require careful planning so that the needs of all the applications are met adequately. On the other hand, if a data element is used by only one application and the data values are large or accessed frequently, consider keeping the values outside your directory service to avoid performance problems when accessing the values. In your data policy statement you may want to include some guidelines for making this kind of decision (as discussed earlier in this chapter).

Tip

One thing to watch for if you conclude that a data element is application-specific is that, over time, new applications may come online that also use the data element. When in doubt, assume that a data element will be shared.

Note that even if data elements are not shared by more than one directory-enabled application, it may make sense to store them together to ease manageability or improve the availability of the information (through directory replication). For example, it may be desirable to delete a person's e-mail-related data elements when the person is deleted from the corporate phonebook. The easiest approach is to store the e-mail-related elements in the directory service along with all the person's contact and other information. That way, you won't need to delete the information in both the directory and the mail system to delete a user's record.

Relationships with Other Data Elements

When you're selecting a schema and laying out the namespace for your directory service, it is useful to know how your data elements are related. Because directory entries typically represent real-world objects, it is important to know which data elements relate to the same kind of object.

For example, if you have an entry in your directory service for each printer attached to your network, you will want to make it easy for an application to find all the printer-related data elements. You can accomplish this by choosing a schema that defines an all-inclusive printer object (see Chapter 8, Schema Design, for more information on schemas).

Some relationships between data elements are subtler and may be easily overlooked. For example, if you use the directory service to determine who used a printer in the previous 24 hours, you will need to relate information about some users to the printer's entry. You could address this need by including in each printer entry a set of data elements that point to the entries of the people who have recently used the printer.

A Data Element Characteristics Example

Suppose that you work at a large university with a great variety of installed e-mail systems. E-mail is often a major factor affecting the expenditure of information technology dollars, and you want to show your boss the value of your new directory service. To do this, you decide to develop as your first directory-enabled application a service that reroutes all e-mail entering the university from the Internet to the correct system. Figure 7.4 shows the basic setup.

Figure 7.4. The Business Card E-Mail Service

By configuring your Domain Name System (DNS) correctly, you arrange for all e-mail messages sent to string@bigu.edu to arrive on the machine called redirector.bigu.edu. This machine runs a copy of the sendmail software that has been configured to perform mail routing using an LDAP directory service. Using criteria constructed from string, the MTA searches the directory service running on ldap.bigu.edu for a user's entry. For example, if a message is addressed to babs.jensen@bigu.edu, a search with an LDAP filter of (cn=babs jensen) is performed. If an entry is found, the e-mail message is re-sent to the user's mail delivery address, which is typically a mail server in the individual's department or school within the university.
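The step from an incoming address to a search filter can be sketched as follows. This Python fragment is not how sendmail implements the lookup; it only illustrates the mapping described above, under the assumption that periods in the local part stand for spaces in the cn value. The function names are hypothetical, and the escaping follows the string representation of LDAP search filters (RFC 4515).

```python
# Characters that must be escaped inside an LDAP filter value (RFC 4515).
_FILTER_ESCAPES = {
    "\\": r"\5c",
    "*": r"\2a",
    "(": r"\28",
    ")": r"\29",
    "\0": r"\00",
}


def escape_filter_value(value: str) -> str:
    """Escape special characters so the value is safe inside a filter."""
    return "".join(_FILTER_ESCAPES.get(ch, ch) for ch in value)


def filter_for_address(address: str) -> str:
    """Build a cn filter from the local part of an e-mail address.

    Assumption: periods in the local part are treated as spaces.
    """
    local_part, _, _domain = address.partition("@")
    name = local_part.replace(".", " ")
    return f"(cn={escape_filter_value(name)})"


print(filter_for_address("babs.jensen@bigu.edu"))  # (cn=babs jensen)
```

Escaping matters here because the local part comes from untrusted Internet mail; an unescaped asterisk, for example, would turn the exact match into a wildcard search.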

We christen this service the Business Card E-mail Service, or BCES, because now people can safely put one centrally managed e-mail address on their business cards that will not change even if they switch departments within the university.

One of the first design questions we need to answer is, What data elements do we need and what are their essential characteristics? Table 7.4 provides one potential answer (the real answer depends on exactly which version of the sendmail software is used and how it is configured).

Note that some characteristics are missing from Table 7.4. We don't include any information on how dynamic each data element is because we believe that none of the data elements hold data values that change often. Also, because we are focused on just one application, we have not yet needed to think about which data elements will be shared with other directory-enabled applications.

Table 7.4. Sample Analysis of Data Element Characteristics

Element (Example)	Format	Size / Number of Values	Owner	Consumers	Related to
Full name ( `John Jones` )	Text	<128 chars / 1 or a few values	Personnel dept.	Users; BCES	User's entry
User ID ( `jjones` )	Text	<8 chars / 1 value	IS dept.	BCES	User's entry
E-mail address ( `jjones@bigu.edu` )	Text (Internet mail address)	Many chars / 1 or a few values	IS dept.	Users; BCES	User's entry
Delivery address ( `jjones@math.bigu.edu` )	Text (Internet mail address)	Many chars / 1 value	User and system admins.	BCES	User's entry
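If you keep your list of data elements in machine-readable form, questions such as "which elements does each owner control?" become simple queries. The Python sketch below records the rows of Table 7.4 in a small data structure; the class name DataElement, its field names, and the helper function are hypothetical choices, not part of any LDAP schema.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class DataElement:
    """One row of a data element characteristics table."""
    name: str
    example: str
    data_format: str
    size: str
    values: str
    owner: str
    consumers: Tuple[str, ...]
    related_to: str


# The four rows of Table 7.4.
ELEMENTS: List[DataElement] = [
    DataElement("Full name", "John Jones", "Text", "<128 chars",
                "1 or a few", "Personnel dept.", ("Users", "BCES"), "User's entry"),
    DataElement("User ID", "jjones", "Text", "<8 chars",
                "1", "IS dept.", ("BCES",), "User's entry"),
    DataElement("E-mail address", "jjones@bigu.edu", "Text (Internet mail address)",
                "Many chars", "1 or a few", "IS dept.", ("Users", "BCES"), "User's entry"),
    DataElement("Delivery address", "jjones@math.bigu.edu", "Text (Internet mail address)",
                "Many chars", "1", "User and system admins.", ("BCES",), "User's entry"),
]


def elements_owned_by(owner: str) -> List[str]:
    """List the names of the elements a given owner is responsible for."""
    return [e.name for e in ELEMENTS if e.owner == owner]


print(elements_owned_by("IS dept."))
print([e.name for e in ELEMENTS if len(e.consumers) > 1])
```

The same structure can grow extra fields, such as read and write estimates, as you characterize more of each element.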

Analyzing Data Elements

After you have compiled a fairly complete list of data elements, examine each data element you plan to include in your own directory to determine which characteristics it shares with others. The goal is to eliminate redundant elements and to develop a clear understanding of how each element will be used so that you can ensure that it makes sense to include it in your directory. By doing this analysis up front, you will save time during the schema and namespace design stages and avoid deployment problems.

For example, suppose that a certain dynamic data element will be modified on average many times each minute, but the data element is used by only one application. Because most directory implementations are optimized for read operations (and update operations are relatively slow), it may make more sense to avoid storing such a dynamic, application-specific data element in your directory service altogether.