Metadata Basics | The MicrosoftВ Data Warehouse Toolkit: With SQL ServerВ 2005 and the MicrosoftВ Business Intelligence Toolset

One of the most common definitions of metadata we hear is Metadata is data about data. This is vague to the point of uselessness. It doesnt help you understand what metadata is or why you should care. We think about metadata as all the information that describes the contents, structures, and operations of the DW/BI system . Metadata defines the contents of the warehouse, the structures that hold those contents, and the processes that brought those contents into being. In this section, we talk about the purpose of metadata, describe the common types of metadata found in the DW/BI environment, and discuss the concept of the metadata catalog.

The Purpose of Metadata

Metadata serves two main purposes: defining and describing the objects and processes in a system.

Some metadata is used to define a process, object, or behavior. When you change the metadata, you change the process. A simple example is the start time of a SQL Server Agent job. Change the value of the start time element and you change the start time of the process. This idea of using metadata to define a process outside the code was an early, practical use of metadata. Separating a programs code from its parameters and definitions allows the developer (and in some cases, the user ) to change the parameters and definitions without having to edit and recompile the code. This concept has been around for decades in forms like table-driven programs and configuration files. In a similar fashion, metadata is a core concept of object-oriented programming. The object-oriented concept of properties is essentially metadata, and changing the properties of an object will change how the object behaves or appears, without your having to edit and rebuild the object itself.

Other metadata is used to describe an object or process. This kind of descriptive metadata is essentially documentation. If you change the process, but dont change the description, the process itself still works, but your understanding of the process based on its description is now incorrect. Some of the common properties of an object, like its name or description, do not affect its appearance or behavior; they simply describe it in some way. Its this idea of describing that leads to metadata as documentation.

The DW/BI industry often refers to two main categories of metadata: technical and business . Weve added a third category called process metadata . As well see in the descriptions of these categories that follow, technical metadata is primarily definitional, while business and process metadata are primarily descriptive. Be careful with these categories because there is some overlap. Its best to not get too dogmatic when you are dealing with metadata.

Technical metadata defines the objects and processes that make up the warehouse itself from a technical perspective. This includes the system metadata that defines the data structures themselves , like tables, fields, data types, indexes, and partitions in the relational engine, and databases, dimensions, measures, data mining models, and partitions in Analysis Services. In the ETL process, technical metadata defines the sources and targets for a particular task, the transformations (including business rules and data quality checks), what tasks make up a job, what jobs happen when, and so on.

This description of technical metadata is cause for some confusion, because some of this technical metadata can also be used as business metadata. Certainly, business users are interested in the tables and columns that are available for query. Security- related metadata also is interesting to both camps. From the technical perspective, the system must be able to identify users, determine what groups they belong to, and assess the validity of their specific request. From the user perspective, knowing what data is available for access can save a lot time and frustration. There are always users who are interested in the ETL transformation rules, especially when the data doesnt make sense to them. On the other hand, some technical metadata elements are of no interest to business users. Few users are interested in the definition of the partition function on a given fact table.
Business metadata describes the contents of the data warehouse in more user-accessible terms. It tells us what data we have, where it comes from, what it means, and what its relationship is to other data in the warehouse. The name and description fields in Analysis Services are good examples of business metadata. Business metadata often serves as documentation for the data warehouse. As such, it may include additional layers of categorization that simplify the users view by sub-setting tables into business-oriented groups, or omitting certain columns or tables. Data source views and the AttributeHierarchyDisplayFolder properties in Analysis Services serve this purpose. When users browse the metadata to see whats in the warehouse, they are primarily viewing business metadata.
Process metadata describes the results of various operations in the warehouse. In the ETL process, each task logs key data about its execution, like start time, end time, CPU seconds used, disk reads, disk writes , rows processed , and so on. Similar process metadata is generated when users query the warehouse. This data is initially valuable for troubleshooting the ETL or query process. After people begin using the system, this data is a critical input to the performance monitoring and improvement process. It can also be valuable to monitor user access both as a demonstration of the popularity of the warehouse and for security purposes.

Note that process metadata is really measurement or transaction data for the business processes of building and using the data warehouse. If this is your organizations core businessan information provider that collects sales data for an industry and sells access to that data to many customers, for examplethen what we would normally call process metadata becomes the business process data for the fact and dimension tables in the data warehouse. Business folks at this information provider would be most interested in analyzing this process data. It tells them who is using their products, what products theyre using, what service level they are receiving, and so on. If you think about it, process metadata is also the business process data for the DW/BI team.

The Metadata Repository

All of these metadata elements need a place to live. Ideally, each tool would keep its metadata in a shared repository where it could be easily reused by other tools and integrated for reporting and analysis purposes. This shared repository would follow standards for how metadata is stored so the repository could be easily accessed by any tool that needs metadata, and new tools could easily replace old tools by simply reading in their metadata.

For example, if you had a shared, centralized repository in your warehouse, you would use your ETL tool to design a package to load your dimensions. The ETL tool would save that package in the repository in a set of structures that at least allow inquiry into the content and structure of the package. If you wanted to know what transforms were applied to the data in a given dimension table, you could query the repository. Some of the metadata that defines the core structures may need to be saved in proprietary XML files, but that metadata could still be retrieved and displayed if it was surrounded by the appropriate descriptive metadata.

Unfortunately, this wonderful, integrated, shared repository is relatively rare in the DW/BI world today, and when it does exist, it must be built and maintained with significant effort. Most DW/BI systems are like the Tower of Babel, and Microsoft SQL Server 2005 is no exception, with each component in the toolset keeping its own metadata in its own structures and formats.

It may provide some comfort to know that managing metadata is a challenge that is not unique to the Microsoft platform. For decades, people in the software industry have realized that managing metadata is a problem. There have been, and continue to be, major efforts within many companies to build a central metadata repository. At best, these are unstable successes. The amount of effort it takes to build and maintain the central repository ends up being more than most companies are willing to pay. At the same time, several major software companies have tried to address the problem from a product perspective. Most of these products are large-scale, enterprise repositories that are built to handle every kind of system complexity. Implementers have a hard time navigating the product complexity, so most of the functionality remains unused. Again, the cost and effort often brings the project stumbling to its knees. Weve often seen initial success at implementation followed by a slow (or rapid) divergence from reality until the repository falls into complete disuse.