Determining and Managing Information Context | Data Protection and Information Lifecycle Management


Although any number of attributes can provide context to data, the most important from an ILM perspective are:

Classification
Content
Relationships
State
Location(s)

Other attributes may also be important, depending on the organization and its information management needs. In many cases, they will be components of the attributes stated here.

The Anatomy of an E-Mail

A good example to consider is an e-mail object. An e-mail has a number of constituent components that make it an e-mail. First, it can be classified as an e-mail. That may be because a person can recognize it as one or because it has a MIME type of message/rfc822. It has content that can be examined for e-mail formatting and relationships related to an e-mail, such as an attachment. The object may also reside in a directory that is only for e-mails and may have a file format specific to e-mail systems. Finally, there might be state information, such as the time the object was created, headers, or similar descriptors (Figure 8-2). The object is recognized as an e-mail because it has the context of an e-mail.
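As a rough illustration of the anatomy described above, the sketch below uses Python's standard email package to pull the same kinds of context out of a raw message. The function name email_context, the dictionary keys, the header test used for classification, and the sample addresses and path are all assumptions made for this example; they are not part of any ILM product.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def email_context(raw: bytes, location: str) -> dict:
    """Collect classification, content, relationships, state, and location."""
    msg = message_from_bytes(raw, policy=policy.default)
    body = msg.get_body(preferencelist=("plain",))
    # Crude heuristic (an assumption): an object with From and Date headers
    # is treated as an e-mail.
    looks_like_email = msg["From"] is not None and msg["Date"] is not None
    return {
        "classification": "E-mail" if looks_like_email else "Unknown",
        "content": body.get_content() if body is not None else "",
        # Attachments are one kind of relationship an e-mail can have.
        "relationships": [part.get_filename() for part in msg.iter_attachments()],
        # Headers such as the date and subject serve as state information.
        "state": {"date": str(msg["Date"]), "subject": str(msg["Subject"])},
        "location": location,
    }


# Build a small sample message so the example is self-contained.
sample = EmailMessage()
sample["From"] = "alice@example.com"
sample["To"] = "bob@example.com"
sample["Subject"] = "Q3 report"
sample["Date"] = "Mon, 04 Oct 2004 09:00:00 +0000"
sample.set_content("Report attached.")
sample.add_attachment(
    b"figures", maintype="application", subtype="octet-stream", filename="q3.xls"
)

context = email_context(sample.as_bytes(), "/mail/inbox/1042.eml")
```

Here every attribute from the list at the start of this section ends up as one entry in a single record, which is roughly what an ILM system would store as metadata about the object.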

Figure 8-2. Anatomy of an e-mail

Classification

Classification is a quick way of identifying what a piece of information is. This is something that humans do quite well but machines do not. For ILM, classification is the most important attribute and will drive most actions within an ILM policy.

Classes may be broad, such as Financial, Marketing, and Personnel. They may also be very specific, such as First Quarter Financial Reports. If classes are too broad, actions will be limited to only those that can take place among many different types of objects. If classes are too specific, the organization will drown in policy documents.

Classifying structured data is easy. The classes are determined by the schema. Unstructured data, on the other hand, can be very difficult to classify. Humans can do this by looking at the data ("Yep, that's our third-quarter financial report"), but computers are terrible at it.

To classify unstructured data, rules-based context is overlaid on the data and stored as metadata. Various attributes of the data are examined to provide a class for the information. The existence of an object in a particular directory or folder, along with keywords found in the content of the object, may be used by a rules-based system to determine its class.
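A minimal sketch of such a rules-based classifier follows. It assumes each rule pairs a folder with a set of keywords, and that the first matching rule wins. The Rule type, the RULES table, the folder names, and the keywords are hypothetical; a real rule set would be derived from the organization's ILM policies.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class Rule:
    """One hypothetical classification rule: a folder plus required keywords."""
    info_class: str
    folder: str        # the object must live somewhere under this folder
    keywords: tuple    # every keyword must appear in the content


# Illustrative rules only.
RULES = [
    Rule("Financial", "/finance", ("quarter", "revenue")),
    Rule("Personnel", "/hr", ("employee",)),
]


def classify(path: str, content: str, rules=RULES) -> str:
    """Return the class of the first matching rule, or 'Unclassified'."""
    text = content.lower()
    location = PurePosixPath(path)
    for rule in rules:
        in_folder = PurePosixPath(rule.folder) in location.parents
        has_keywords = all(word in text for word in rule.keywords)
        if in_folder and has_keywords:
            return rule.info_class
    return "Unclassified"
```

Calling classify("/finance/2004/q3.doc", "Third quarter revenue rose") returns "Financial" under these rules. The returned class would then be stored as metadata alongside the object rather than inside it.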

Another way to classify unstructured data is through human intervention. When information is created, the person creating the information, or a designated person, can choose a class for it. Even in this case, a set of rules on how to determine a piece of information's class will be needed. Otherwise, classification will be inconsistent and useless.

State

State describes content and metadata context at a specific point in time. Changes in some component of the context indicate a change in state. ILM policies may demand that these changes in state trigger actions. The specific metadata that defines state in an ILM system is described by policies. Within ILM policies, state is the catalyst for actions. If a state change occurs, an action, prescribed by policies, must also occur.

Age

Age is a concept central to all lifecycles. To say that something has a lifecycle is to indicate that it is changing over time, or aging.

For ILM, age does not really exist by itself. Time is a component of state, and aging is a function of the differences between two different states. Two different timeframes, such as two different dates, represent two different states. When ILM policies are written, actions should be triggered by changes in state, not just changes in time. Other elements of a piece of information's context may also have changed in that same interval and will affect decisions.
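One way to picture this is a policy function that receives two states rather than a bare age. The sketch below is only an illustration: the State fields, the 365-day threshold, the class name "Financial", and the action names are assumptions, not part of any standard.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class State:
    """Hypothetical snapshot of an object's context at one point in time."""
    as_of: date
    info_class: str
    location: str


def archive_action(previous: State, current: State,
                   max_age_days: int = 365) -> Optional[str]:
    """Return an action name if the change in state calls for one, else None.

    Age is computed as the difference between two states. It is only one
    input: the class recorded in the later state also affects the decision.
    """
    age_days = (current.as_of - previous.as_of).days
    if age_days < max_age_days:
        return None
    if current.info_class == "Financial":
        return "move-to-archive"
    return "review-for-deletion"
```

Because the function sees the whole later state, an object that was reclassified during the interval is handled according to its new class, which a purely time-based trigger would miss.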

Tracking State and History

Time is a necessary element of state, even if the timeframe is only now. It is possible to define only a current state, although it is more useful to define state in other timeframes as well. By tracking state over time, it is possible to accumulate a history of the information. The timeframe "now" defines a current snapshot, while other timeframes define history.

This is a powerful tool for managing information. By tracking state, it is possible to compare the current state against an expected state. Changes in state will help determine whether the information:

Has been copied, deleted, or moved
Has had a constituent component modified or whether a new version has been created
Has had related information changed
Has been transformed into another type of information
Has aged past a defined point
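The comparison of a current state against an expected state can be sketched as a function over two snapshots. Everything named here (the Snapshot fields, the finding labels, the max_age_days parameter) is hypothetical. Treating a change of class on the same object as a transformation is also a simplification made for the example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical record of one object's state at one point in time."""
    as_of: date
    info_class: str
    locations: frozenset   # every place a copy is stored
    content_hash: str      # digest of the content
    related: frozenset     # ids of related objects


def compare(expected: Snapshot, current: Snapshot, max_age_days: int) -> list:
    """List which kinds of change separate the two snapshots."""
    findings = []
    if expected.locations and not current.locations:
        findings.append("deleted")
    elif current.locations > expected.locations:
        findings.append("copied")       # old locations kept, new ones added
    elif current.locations != expected.locations:
        findings.append("moved")
    if current.content_hash != expected.content_hash:
        findings.append("modified")
    if current.related != expected.related:
        findings.append("related information changed")
    if current.info_class != expected.info_class:
        findings.append("transformed")
    if (current.as_of - expected.as_of).days > max_age_days:
        findings.append("aged past limit")
    return findings
```

An empty list means the current state matches the expected one. Any other result names the changes that policies may need to act on.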

Information Transformations

Information is frequently transformed from one type into another. The act of copying the contents of an e-mail into a word processor document does not change the content. Instead, it transforms the information from one class of information (e-mail) into another class (document). This represents a change in state. Depending on how ILM policies are designed, the document may now be considered a new document with a relationship to the e-mail or a new version of what the e-mail represents.

In either case, this transformation will be detected if changes in state are tracked. The state of the e-mail will have changed, because either a new relationship will be added to the current state that was not in the previous state or a new branch of the e-mail's state will be created.

Tracking this transformation is important for complying with ILM policies, especially those regarding information retention and destruction. If a policy exists that requires all information of a particular class to be destroyed at some point in time, transformed documents may need to be destroyed as well. The same is true for retaining information.
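If each transformation is recorded as a relationship from the new object back to its source, finding everything a destruction or retention policy must also cover becomes a simple graph walk. The sketch below assumes a hypothetical derived_from mapping of object ids; how such relationships are actually stored depends on the ILM system.

```python
from collections import defaultdict, deque


def derived_objects(root: str, derived_from: dict) -> set:
    """Return every object transformed, directly or indirectly, from root.

    derived_from maps a derived object's id to the id of its source.
    """
    children = defaultdict(list)
    for child, source in derived_from.items():
        children[source].append(child)

    found = set()
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in children[node]:
            if child not in found:   # guards against cycles in the records
                found.add(child)
                queue.append(child)
    found.discard(root)
    return found


# Hypothetical history: an e-mail pasted into a document, later saved as a PDF.
history = {"doc-1": "email-1", "pdf-1": "doc-1", "doc-2": "email-2"}
to_destroy = derived_objects("email-1", history)
```

With this history, a policy that destroys email-1 would also need to consider doc-1 and pdf-1, but not doc-2.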

Content

The important part of any information is its content. Content is the "stuff" of information: the words in the document, the numbers in the spreadsheet, and the picture in the image. In a computer system, content is stored as data.

Much of the context of information can be derived from the content. By examining a document, clues can be found that help discern whether it is a letter to a friend or a technical manual. Humans are very efficient at performing this task, whereas computers are not. Knowledge management systems have developed very sophisticated inference engines to do what we do naturally. Inference engines examine the content of a document to determine its meaning, usually for purposes of classification. Through the use of statistical analysis and rules-based systems, context can be rendered from the document. These systems are rudimentary compared with what human inferences can do. They often miscategorize information and need human editors to make corrections.

Search engines are similar to inference engines in that they scan content for clues as to its meaning. Unlike inference engines, search engines are more of a tool to help humans make content decisions. Often based on keywords, a search engine can provide a list of possible targets. The human then decides whether each target meets the criteria for classification.

For ILM purposes, humans can do the job of deriving context from content. A person can make the decision as to what the content means. Unfortunately, this is inefficient. It is not too difficult to ask end-users to make a decision as to what the content means for newly created information. It is a daunting job to have people go through existing information and determine context from content.

Hashing Data

In many cases, examination of the actual content is important only when classifying data. After that, it is necessary to note only when content changes. The size of the data may not change even though the content changes. File system dates are unreliable, because they can be changed even when no content has actually been altered. Instead, a hash of the data may be taken and compared with hash values taken at later dates. A hash is a short, fixed-length value generated by running the content through a hashing algorithm; any change to the content produces a different value. Hashes are also commonly used in digital signatures and other security functions. The most common algorithms for producing file hashes have been MD5 and SHA-1, although both are now considered weak against deliberate collisions, and a newer algorithm such as SHA-256 is a safer choice where tampering is a concern. For a demonstration of hashing, look at security programs such as GnuPG (www.gnupg.org) and Pretty Good Privacy (www.pgp.com). They generate hashes based on text in a document. These hashes are then used to ensure that the document has not changed in transmission.
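The idea of comparing hashes over time can be sketched with Python's standard hashlib module. The helper names content_hash and has_changed are made up for this example, and the default of SHA-256 is an assumption; "md5" or "sha1" can be passed instead to match the algorithms named above.

```python
import hashlib


def content_hash(data: bytes, algorithm: str = "sha256") -> str:
    """Return the hex digest of the content under the named algorithm."""
    return hashlib.new(algorithm, data).hexdigest()


def has_changed(data: bytes, recorded_hash: str, algorithm: str = "sha256") -> bool:
    """Compare the content's current hash with one recorded earlier."""
    return content_hash(data, algorithm) != recorded_hash


# Two versions of the same size: the size alone would not reveal the edit.
original = b"Q3 revenue: 100"
edited = b"Q3 revenue: 900"

recorded = content_hash(original)          # stored as part of the state
unchanged = has_changed(original, recorded)
modified = has_changed(edited, recorded)
```

Here unchanged is False and modified is True, even though both versions have the same length. Only the recorded hash needs to be kept as part of the object's state, not a copy of the content itself.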
