Data Capabilities


The capabilities defined in this section provide integration at the data level.

Schema Definition

Schemas describe the structure and format of data. Each schema contains property information about the records and fields within that structure, and this information is vital to ensuring that the data can be interpreted correctly. Fortunately, in many cases the schema definition already exists and you only need to maintain the definition, rather than redefine the structure of the data.

Schema definitions are presented in a number of different forms in different applications. These forms include well-formed XML, document type definitions (DTDs), Electronic Data Interchange (EDI), and structured document formats. Many predefined schemas in each of these formats may be appropriate for your environment, depending on the applications that you use. For example, hundreds of schemas are associated with the dozens of EDI documents that are defined within each of the multiple versions of the two EDI document standards (X12 and EDIFACT).

You may find it useful to preload schemas from standards organizations, if the schemas are likely to be used in your environment. Sources of such schemas include:

  • RosettaNet

  • Open Applications Group (OAG)

Other frequently used schema definitions are available from the e-marketplace, including:

  • Gas Industry Standards Board (GISB) — used by the energy industry

  • Transora — used in e-commerce

  • WorldWide Retail Exchange — used in business-to-business scenarios

  • Covisint — used in the automotive industry

  • Exostar — used in the aerospace and defense industries

  • ChemConnect — used in the chemicals, feed stocks, and plastics industries

  • FreeMarkets — used in supply management

  • E2open — used in process management

In current applications, one of the most commonly used schema definition formats is XML. However, because it is fairly likely that your environment will include applications that do not use XML, you generally should support multiple schema definition languages. Doing so also helps ensure that you can support future technologies.

It is also possible to automatically generate XML schemas from data structures such as relational databases, fixed and variable format files, application programming interface specifications, COBOL copybooks, and so on. As well as being much easier to produce, automatically generated schemas are less prone to errors than schemas defined by hand.
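For illustration, the following Python sketch (not part of the original guidance) shows how a minimal XML schema might be generated from relational table metadata; the sample table definition and the SQL-to-XSD type mapping are assumptions made for the example.

  # Sketch: generate a minimal XSD from a (hypothetical) relational table definition.
  # The type mapping and the sample table below are illustrative assumptions.
  SQL_TO_XSD = {
      "VARCHAR": "xs:string",
      "INTEGER": "xs:integer",
      "DECIMAL": "xs:decimal",
      "DATE": "xs:date",
  }

  def table_to_xsd(table_name, columns):
      """columns is a list of (name, sql_type) pairs taken from the database catalog."""
      lines = ['<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">',
               '  <xs:element name="%s">' % table_name,
               '    <xs:complexType>',
               '      <xs:sequence>']
      for name, sql_type in columns:
          xsd_type = SQL_TO_XSD.get(sql_type, "xs:string")
          lines.append('        <xs:element name="%s" type="%s"/>' % (name, xsd_type))
      lines += ['      </xs:sequence>',
                '    </xs:complexType>',
                '  </xs:element>',
                '</xs:schema>']
      return "\n".join(lines)

  print(table_to_xsd("purchaseOrder",
                     [("orderDate", "DATE"), ("customerName", "VARCHAR"), ("total", "DECIMAL")]))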

Often, templates are used for defining common data or document formats, such as purchase orders, invoices, and shipping notices. The following is an example of a purchase order schema definition:

  <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
             targetNamespace="http://tempuri.org/po.xsd"
             xmlns="http://tempuri.org/po.xsd"
             elementFormDefault="qualified">

    <xs:annotation>
      <xs:documentation xml:lang="en">
        Purchase order schema for Example.com.
        Copyright 2000 Example.com. All rights reserved.
      </xs:documentation>
    </xs:annotation>

    <xs:element name="purchaseOrder" type="PurchaseOrderType"/>

    <xs:element name="comment" type="xs:string"/>

    <xs:complexType name="PurchaseOrderType">
      <xs:sequence>
        <xs:element name="shipTo" type="USAddress"/>
        <xs:element name="billTo" type="USAddress"/>
        <xs:element ref="comment" minOccurs="0"/>
        <xs:element name="items" type="Items"/>
      </xs:sequence>
      <xs:attribute name="orderDate" type="xs:date"/>
    </xs:complexType>

    <xs:complexType name="USAddress">
      <xs:sequence>
        <xs:element name="name" type="xs:string"/>
        <xs:element name="street" type="xs:string"/>
        <xs:element name="city" type="xs:string"/>
        <xs:element name="state" type="xs:string"/>
        <xs:element name="zip" type="xs:decimal"/>
      </xs:sequence>
      <xs:attribute name="country" type="xs:NMTOKEN" fixed="US"/>
    </xs:complexType>

    <xs:complexType name="Items">
      <xs:sequence>
        <xs:element name="item" minOccurs="0" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="productName" type="xs:string"/>
              <xs:element name="quantity">
                <xs:simpleType>
                  <xs:restriction base="xs:positiveInteger">
                    <xs:maxExclusive value="100"/>
                  </xs:restriction>
                </xs:simpleType>
              </xs:element>
              <xs:element name="USPrice" type="xs:decimal"/>
              <xs:element ref="comment" minOccurs="0"/>
              <xs:element name="shipDate" type="xs:date" minOccurs="0"/>
            </xs:sequence>
            <xs:attribute name="partNum" type="SKU" use="required"/>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>

    <!-- Stock Keeping Unit, a code for identifying products -->
    <xs:simpleType name="SKU">
      <xs:restriction base="xs:string">
        <xs:pattern value="\d{3}-[A-Z]{2}"/>
      </xs:restriction>
    </xs:simpleType>

  </xs:schema>

Schema Recognition

The Schema Recognition capability checks that the schema can be read and understood, and that it is well-formed. For XML, the well-formedness rules dictate that a document have a single root element and that elements nest completely or not at all.

Because it is likely that multiple formats will be used for different schemas, it is important that your schema recognition capability be able to accept multiple schema formats. After a schema is recognized and registered, subsequent data can be received and checked for compliance to the corresponding schema.

The Schema Recognition capability also performs schema validation. Schema validation is the verification that the schema definition conforms to a predefined document structure (for example, a DTD).
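As a sketch of how schema recognition and validation might be implemented, the following Python example checks a document for well-formedness and then validates it against a registered schema. It assumes the third-party lxml library, and the schema and document file names are placeholders.

  # Sketch: schema recognition and validation using the third-party lxml library.
  # The schema and document file names are placeholders.
  from lxml import etree

  def recognize_and_validate(schema_path, document_path):
      # Parsing the schema fails if it is not well-formed or not a valid XML Schema.
      schema = etree.XMLSchema(etree.parse(schema_path))
      try:
          document = etree.parse(document_path)   # fails if the document is not well-formed
      except etree.XMLSyntaxError as err:
          return False, "Document is not well-formed: %s" % err
      if not schema.validate(document):
          return False, "Document does not conform to the schema: %s" % schema.error_log
      return True, "Document is well-formed and conforms to the schema"

  ok, message = recognize_and_validate("po.xsd", "po.xml")
  print(message)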

Mapping

Data is often rendered in different formats in source and target applications. The Mapping capability defines the relationship between the two applications and specifies how the data must be transformed from the source to the target.

As you define your mapping capability, you should consider both the development (or design-time) component, and the run-time component. For development, the mapping capability should enable developers to specify the mapping logic in a declarative, nonprocedural way — for example, using the mouse to delineate the mapping between a field in the source data structure and the position of that data in the target data structure. At run time, the mapping capability should be able to access reference data (for table lookups) and perform other data enrichment functions, such as sorting and filtering multiple input records before creating an output data set.

Two forms of mapping can be performed:

  • Field mapping

  • Semantic mapping

The following paragraphs discuss each mapping type in turn.

Field Mapping

Field mapping is a data translation process in which the records and fields in the source specification are related to their corresponding occurrences in the destination specification. When data is moved between different applications, several levels of transformation may be required.

In some cases, each field that is moved from the source to the target representation has to be converted. For example, the source application may allow mixed case while the target application accepts only uppercase, or the source application may allow additional characters (such as dashes or spaces) for formatting purposes that the target application does not accept.

Often, the data can be converted by applying a straightforward conversion algorithm (for example, temperatures can be converted from Celsius to Fahrenheit through a simple calculation). In other cases, though, the transformation may require a table lookup and replacement (for example, converting the gross weight of a product to net weight may require a lookup based on the product code).
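The following Python sketch illustrates these kinds of field-level conversions. The field names, conversion rules, and lookup table are illustrative assumptions, not part of any specific product.

  # Sketch: field-level conversions of the kinds described above.
  # The field names and the lookup table are illustrative assumptions.
  NET_RATIO_BY_PRODUCT = {"PROD-001": 0.80, "PROD-002": 0.95}   # net/gross weight ratio

  def to_uppercase(value):
      return value.upper()                              # mixed case -> uppercase only

  def strip_formatting(value):
      return value.replace("-", "").replace(" ", "")    # remove dashes and spaces

  def celsius_to_fahrenheit(celsius):
      return celsius * 9.0 / 5.0 + 32.0                 # straightforward conversion algorithm

  def gross_to_net_weight(product_code, gross_weight):
      return gross_weight * NET_RATIO_BY_PRODUCT[product_code]   # table lookup and replacement

  source_record = {"name": "Contoso Ltd", "phone": "425-555-0100", "temp_c": 20.0,
                   "product": "PROD-001", "gross_kg": 12.5}
  target_record = {
      "NAME": to_uppercase(source_record["name"]),
      "PHONE": strip_formatting(source_record["phone"]),
      "TEMP_F": celsius_to_fahrenheit(source_record["temp_c"]),
      "NET_KG": gross_to_net_weight(source_record["product"], source_record["gross_kg"]),
  }
  print(target_record)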

Semantic Mapping

As data is moved from one application to another, the data structure itself may need to be modified. This modification is known as semantic mapping.

As an example of semantic mapping, consider the different ways of representing addresses. Table A.1 shows two separate applications storing the same address information, but in a different structure.

Table A.1: Different Ways of Storing Address Information

  Application 1      Example              Application 2        Example
  Address 1          One Microsoft Way    Number               One
  Address 2                               Street               Microsoft Way
  Town               Redmond              City                 Redmond
  State/Province     WA                   State and ZIP Code   WA 98052
  ZIP/Postal Code    98052

Moving data between these two representations requires reorganizing the data elements, but care must be taken so that none of the semantic value of the data (in other words, the meaning of the data) is lost as a result of the reorganization.
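The following Python sketch illustrates this kind of semantic mapping by restructuring an address from the Application 1 layout to the Application 2 layout shown in Table A.1. The simple split of the first address line is an assumption; a real mapping needs more robust parsing to preserve the meaning of the data.

  # Sketch: semantic mapping of the address structures from Table A.1.
  def to_application_2(address):
      # Split the first address line into a number and a street; a real mapping
      # needs more robust parsing so that no meaning is lost.
      number, _, street = address["Address 1"].partition(" ")
      return {
          "Number": number,
          "Street": street,
          "City": address["Town"],
          "State and ZIP Code": "%s %s" % (address["State/Province"],
                                           address["ZIP/Postal Code"]),
      }

  application_1_address = {
      "Address 1": "One Microsoft Way",
      "Address 2": "",
      "Town": "Redmond",
      "State/Province": "WA",
      "ZIP/Postal Code": "98052",
  }
  print(to_application_2(application_1_address))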

Data Validation

Data requirements for an enterprise change over time. You may have preexisting systems in your environment that produce data that is not of the quality that the business now requires. Or a change in business context may mean that data that was once correct is now incorrect. The Data Validation capability can verify that data meets the criteria you set and is therefore suitable for transfer between applications. The complexity of the validation depends on the enterprise business rules and on the capabilities of the application integration tool itself; Table A.2 gives some examples of what to expect.

Table A.2: Example Data Validation Rules

  Data validation rule    Example
  Syntax validation       Alphabetic characters appear only in alphabetic fields.
  Semantic validation     A date field actually holds a date.
  Format validation       A date field requires dates in the U.K. date format (24/03/2001).
  Range validation        A field requires a number in the range from 10 to 10000.
  Dependency validation   A field must contain a certain value when another field contains a certain value.
  Mandatory validation    A field must contain a value.
  Size validation         A field must be 20 characters long.
  Value set validation    A value in the field must be M or F (as in Male or Female).
  Count validation        There cannot be more than one Employee ID for each employee.
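As an illustration, the following Python sketch expresses a few of the rules from Table A.2 as simple checks. The field names and limits are assumptions made for the example.

  # Sketch: a few of the validation rules from Table A.2 expressed as simple checks.
  # The field names and limits are illustrative assumptions.
  from datetime import datetime

  def range_valid(value, low=10, high=10000):
      return low <= value <= high                       # range validation

  def mandatory_valid(value):
      return value is not None and value != ""          # mandatory validation

  def format_valid_uk_date(value):
      try:
          datetime.strptime(value, "%d/%m/%Y")          # format validation (U.K. dates)
          return True
      except ValueError:
          return False

  def value_set_valid(value, allowed=("M", "F")):
      return value in allowed                           # value set validation

  def size_valid(value, length=20):
      return len(value) == length                       # size validation

  record = {"quantity": 250, "name": "Contoso Ltd", "ship_date": "24/03/2001", "gender": "F"}
  print(range_valid(record["quantity"]), mandatory_valid(record["name"]),
        format_valid_uk_date(record["ship_date"]), value_set_valid(record["gender"]))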

Data Transformation

The Data Transformation capability renders the content of input data elements into the corresponding elements of the output data, as specified in the map. Data Transformation works in conjunction with the Mapping capability to ensure not only that data is sent to the correct location, but also that it is in the right form when it arrives. The Data Transformation capability performs a number of important tasks, including:

  • Data aggregation/disaggregation

  • Data enrichment

  • Data summarization

  • Data filtering

The following paragraphs discuss each of these tasks in turn.

Data Aggregation/Disaggregation

Some data must be merged together before it can be sent. Data may need to be combined from multiple applications, or from a single application over time. The Data Transformation capability uses the Mapping capability to identify the data to be merged and composes new data out of elements of the input data. The map tells the Data Transformation capability where the data elements come from.

Data aggregation may be required to perform both transactional (one-record-at-a-time) transformation and batch, set-oriented transformation. For real-time or near real-time transactional transformation, the Data Transformation capability may wait for the arrival of messages from several applications and then combine that data to create a single output message. Alternatively, the Data Transformation capability may receive a message and then proactively request the additional data elements from other applications.

Data aggregation is often constrained by the Rules Processing capability, which monitors the composition process and examines the integrity of the composed message.

Data disaggregation is the opposite of data aggregation. Specific data may need to be broken up into several pieces of output data. Semantic mapping specifies the output data, and data disaggregation decomposes the input data into the output data.
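The following Python sketch illustrates aggregation and disaggregation with two hypothetical source messages; the message layouts and field names are assumptions made for the example.

  # Sketch: aggregating fields from two hypothetical source messages into a single
  # output message, and breaking a combined message back apart (disaggregation).
  def aggregate(order_message, customer_message):
      # The hard-coded assignments stand in for the map that identifies where
      # each output element comes from.
      return {
          "orderId": order_message["id"],
          "items": order_message["items"],
          "customerName": customer_message["name"],
          "customerTier": customer_message["tier"],
      }

  def disaggregate(combined_message):
      order_part = {"id": combined_message["orderId"], "items": combined_message["items"]}
      customer_part = {"name": combined_message["customerName"],
                       "tier": combined_message["customerTier"]}
      return order_part, customer_part

  combined = aggregate({"id": "PO-17", "items": ["832-AA"]},
                       {"name": "Contoso Ltd", "tier": "Gold"})
  print(combined)
  print(disaggregate(combined))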

Data Enrichment

When input data is being formatted into output data, the input data may not hold all of the information that is required for output data. The Data Transformation capability enables you to specify where to acquire the information to enrich the output data. In many cases, this information is acquired from a data store. The Data Transformation capability may use the Data Access capability to look up this information source.
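The following Python sketch illustrates enrichment by looking up additional product details for an order line. The in-memory dictionary stands in for a data store reached through the Data Access capability, and the field names are illustrative.

  # Sketch: enriching an order line with product details looked up in a reference store.
  # The in-memory dictionary stands in for a store reached through the Data Access capability.
  PRODUCT_STORE = {"832-AA": {"productName": "Lawnmower", "USPrice": 148.95}}

  def enrich(order_line):
      details = PRODUCT_STORE[order_line["partNum"]]    # look up the missing information
      enriched = dict(order_line)
      enriched.update(details)                          # add it to the output data
      return enriched

  print(enrich({"partNum": "832-AA", "quantity": 1}))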

Data Filtering

The Data Transformation capability provides a mechanism with which users can filter out information from specific data. There may be occasions when it is difficult to filter the information at the source, or when the filter should be applied only to certain targets. Applying the filter to certain targets is very important when you want to withhold sensitive information that a business user does not want to send to certain partners. Data filtering is also used to disseminate predefined cross-sections of data from source applications to target applications across multiple channels. Using data filtering can prevent you from having to change the source application, which may save significant effort and maintenance costs.
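The following Python sketch illustrates per-target filtering. The partner names and the lists of withheld fields are assumptions made for the example.

  # Sketch: filtering out sensitive fields before sending data to a particular partner.
  # The partner names and withheld fields are illustrative assumptions.
  FILTER_RULES = {
      "partnerA": {"costPrice", "internalNotes"},   # fields withheld from this partner
      "partnerB": set(),                            # this partner receives everything
  }

  def filter_for_target(record, target):
      blocked = FILTER_RULES.get(target, set())
      return {field: value for field, value in record.items() if field not in blocked}

  record = {"partNum": "832-AA", "USPrice": 148.95,
            "costPrice": 92.10, "internalNotes": "renegotiate supplier contract"}
  print(filter_for_target(record, "partnerA"))
  print(filter_for_target(record, "partnerB"))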

Data Access

As data is being passed from application to application, often different collections of the data are required by each application. One important consideration is how you take the data that already exists and optimize the process of accessing that data. There are three choices to consider for data access:

  • Dynamic data access

  • Staged data access

  • File/database access

The following paragraphs discuss each of these data access choices in turn.

Dynamic Data Access

Integration solutions are often best served by dynamically creating business objects, where a business object contains all of the required data to be accessed by a business process. The Data Access capability creates these business objects when source data is published or when a request arrives from a composed or straight-through processing (STP) application.
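The following Python sketch illustrates building such a business object on demand. The adapter functions that stand in for calls to the source applications, and the field names, are hypothetical.

  # Sketch: building a business object on demand from several source applications.
  # The adapter functions below are hypothetical stand-ins for application calls.
  def get_order(order_id):
      return {"id": order_id, "customerId": "C-042", "items": ["832-AA"]}

  def get_customer(customer_id):
      return {"id": customer_id, "name": "Contoso Ltd", "creditLimit": 50000}

  def build_order_business_object(order_id):
      # Assemble everything the business process needs into a single object,
      # created when the request arrives.
      order = get_order(order_id)
      customer = get_customer(order["customerId"])
      return {"order": order, "customer": customer}

  print(build_order_business_object("PO-17"))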

Staged Data Access

On some occasions, dynamically creating data for access isn't the best approach. For example, when the data collection process has an unacceptable impact on the response times of the applications that are being accessed (specifically, the response times for the regular users of the source applications), the integration solution has to be adjusted so that the impact on these users is less troublesome.

Similarly, response times for the STP or composed application may be unacceptable. For example, suppose that you are building a composed application and that collecting the data required to meet users' needs will involve sequential access to ten applications, each with a response time of two seconds. The two-second response time for each individual application is quite acceptable to most users. However, the cumulative response time of 20 seconds is not acceptable.

These performance issues can be addressed by placing the data into a separate database, called a staged data set. A platform for staging data (using some intermediate data format) is useful for those integration scenarios where the data can't be accessed directly and immediately from the source applications.

Putting data into a staged data set is a straightforward process when the data is stable and when the access from the integration programs is read-only. However, if the source applications update the data on a regular basis, those updates have to be propagated from the source applications to the staged data set within a short time frame. Otherwise, the data will become stale and less useful to the integration programs. The reverse problem is also a possibility. If the integration programs are going to change the data in the staged data set, then those changes have to be propagated to the source applications in a short time frame.
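The following Python sketch illustrates a staged data set held in a local database that is refreshed from source data and then read by the integration programs. It uses the standard sqlite3 module; the table and field names are assumptions made for the example.

  # Sketch: a staged data set in a local database, refreshed from source data and
  # read by the integration programs. Table and field names are illustrative.
  import sqlite3

  def refresh_staged_data(staging_db, source_rows):
      # Propagate the latest source data into the staged data set so it does not become stale.
      staging_db.execute(
          "CREATE TABLE IF NOT EXISTS staged_orders (id TEXT PRIMARY KEY, total REAL)")
      staging_db.executemany(
          "INSERT OR REPLACE INTO staged_orders (id, total) VALUES (?, ?)", source_rows)
      staging_db.commit()

  def read_staged_data(staging_db):
      # Integration programs read here instead of querying each source application in sequence.
      return list(staging_db.execute("SELECT id, total FROM staged_orders"))

  db = sqlite3.connect(":memory:")
  refresh_staged_data(db, [("PO-17", 148.95), ("PO-18", 99.00)])
  print(read_staged_data(db))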

File/Database Access

In addition to accessing data from applications, you often need to retrieve data stored in flat or structured files, as well as in databases, for processing. Most file and database access mechanisms are programmatic.
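The following Python sketch illustrates programmatic access to a flat file and a database using standard library modules; the record layout is an assumption made for the example.

  # Sketch: programmatic access to flat-file and database data using standard modules.
  # The record layout is an illustrative assumption.
  import csv
  import io
  import sqlite3

  # Flat-file access: read a delimited file into records (io.StringIO stands in for a real file).
  flat_file = io.StringIO("partNum,quantity\n832-AA,1\n926-AA,2\n")
  records = list(csv.DictReader(flat_file))

  # Database access: load the same records into a table and query them back.
  db = sqlite3.connect(":memory:")
  db.execute("CREATE TABLE items (partNum TEXT, quantity INTEGER)")
  db.executemany("INSERT INTO items VALUES (?, ?)",
                 [(r["partNum"], int(r["quantity"])) for r in records])
  print(records)
  print(list(db.execute("SELECT partNum, quantity FROM items")))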



