Technical
Solution
SQL Server Integration Services is the tool we
will use to bring in our dimension data and our sales and budget
facts from the source systems. Integration Services will also be
used to transform the data into the shape we want before it is
loaded into our data warehouse.
At a high level, all you want to do is copy some
data from one or more data sources to the data warehouse. However,
we are at the point now where you need to specify in detail just
how to do that, and there is more to this than simply copying the
data.
The sales and budget facts depend on the
dimension data, so we must load the dimensions first. For
simplicity, we have elected to reload all the data in our dimension
and fact tables rather than trying to determine what has changed.
Our data volume is small enough that we can get away with this. We
address how you can load only the changes to the dimensions and
facts in Chapter 8.
We need to accomplish several things in our ETL
process. We need to load the dimensions, and then the sales facts
can be loaded. Annual forecasts are prepared toward the end of each
year, and these facts are loaded independently of the other
dimensions or facts. When the data loading is completed, we need to
process the dimensions and cubes in Analysis Services.
Getting Started
with Integration Services
We are going to create an Integration Services
project to load our dimensions and facts into the data warehouse.
The sequence of
tasks
and transforms you define in Integration
Services are saved in
packages
,
which can reside in SQL Server or the file system. We like to
create modular packages that have a single purpose, instead of
creating very large packages that do everything. Later, we will add
a package for loading the sales data, and another to load the
budgets
. We strongly
encourage
the use of integrated source
control, such as Visual Source Safe 2005 or Visual Studio Team
System, to prevent multiple developers from inadvertently modifying
the same package.
Data Sources and
Destinations
An Integration Services project usually needs
one or more sources of data and a destination for the transformed
data. In our project, we want to populate our data warehouse with
data from a number of data sources: the dimension data and the
sales from our on-line transaction processing (OLTP) systems, and
the budgets from spreadsheets. Integration Services uses connection
managers to connect data with tasks and transforms in Integration
Services. Connection managers specify a database in an instance of
SQL Server or other relational database such as Oracle or DB2, a
file, a spreadsheet, or one of several other data sources. We find
it useful to define these as soon as we create a package. We'd like
to clarify a potential area of confusion about Data Source objects.
These are not connection managers, but you can create a connection
manager from a data source object. The advantage of a Data Source
object is that it is visible to all packages in a project, and any
connection managers based on a Data Source object are synchronized.
If you change where a Data Source object points, all connection
managers based on that data source will change to the new location.
Note that this only applies during design. It is the connection
manager that is exposed for configuration at runtime. This is
discussed later in the section on deployment.
You don't need to have all your data in one
source, or even one type of sourceit can come from any number of
different sources. If you don't see a data provider for your
particular data source, there is usually one available from third
parties. A quick search for "ole db providers" using your favorite
search engine should locate what you need.
For our customer, we'll need a connection to the
data warehouse on SQL Server, a connection to each of the sales
databases, and a connection to the file location where the Excel
budget data is kept.
If you are looking down the road to deploying
your solution to another environment, rest easy. You can
dynamically change where a connection manager points to at runtime
by assigning a variable to the appropriate property of the
connection manager, such as the ServerName property. You can also
set up an external runtime configuration that will set any property
of a connection manager. We'll look at how to do that in a later
section on deployment.
|
We're
assuming
at this point that you have set
up your data warehouse in the previous chapter and that your source
data is accessible via the network from the machine running SQL
Server 2005. Our example uses data from a SQL Server database:
|
1.
|
In the BI Development Studio, select New
Project from the File menu, and choose Integration Services Project
from the Business Intelligence Projects subfolder.
|
|
2.
|
Name
the project Manufacturing and click OK.
Click
Next
to skip the first page.
|
|
3.
|
Create a new data source by right-clicking
Data Sources in the Solution Explorer.
|
|
4.
|
Skip over the welcome page. Click New on the
How to Define the Connection page.
|
|
5.
|
On the Connection Manager form, choose the
provider (Native OLE DB\SQL Native Client for our example).
|
|
6.
|
Type or select the server name. Remember to
specify the \Instance name if you are referring to a named instance
of SQL Server.
|
|
|
|
|
7.
|
Choose the method to log on to the server. (We
prefer Windows authentication.)
|
|
8.
|
Select a database to connect to that contains
one or more of the source tables for the dimensions and click OK.
Rename the connection manager to SalesTracking.
|
|
9.
|
Repeat this process to create a data source
for the data warehouse. Note that in Step 8, you will choose the
data warehouse database, rather than the source database, and
rename it to DataWarehouse.
|
|
10.
|
At the bottom of the Control Flow tab,
right-click in the Connection Managers area, select New Connection
from Data Source, and choose the Data Source you just created.
|
Now we are ready to start defining our ETL
process.
|
Tip: Be Consistent in Naming Connection
Managers
When it comes time to deploy your solution from
development to other environments, you will want to redirect your
data sources and destinations. The configuration for each
connection manager is usually stored by name in a SQL Server table
or an XML file. If you use the same name in each Integration
Services package for connection managers that refer to the same
database, you will only have to change a few rows in a
configuration table or an attribute in a few XML configuration
files. Note that connection manager
names
are case sensitive.
Loading the
Dimensions
We are going to create one package for each
dimension we want to load. All these packages will be part of the
project we just created.
Integration Services explicitly separates the
design of the process flow from the design of the data flow. This
separation
simplifies
the creation of ETL processes, and makes the
process you implement much more understandable to others who might
have to maintain the application. You first define the sequence of
events that need to take place in order to populate the data
warehouse. This is called the
Control
Flow
. Separately, you define how each table in the data
warehouse is
populated
from the data sources and what
transformations the data must go through before it is ready for the
data warehouse. This is called the
Data
Flow
. Each package has one control flow. You can have
multiple data flows in a control flow, and you will have a data
flow for each entity you are populating in the data warehouse.
Defining the
Control Flow
We'll start by loading the Sales Territory
dimension table. In this section, we create an Integration Services
package you can use as a model for loading dimension tables from
any source. We hasten to point out that we will deal with what are
called slowly changing dimensions (SCDs) in a separate chapter
because of the extra complexity needed to handle them. Right now,
we're just going to ease into loading what we will assume to be
static dimensions. This is appropriate for simple dimensions such
as status, condition, or gender. It is also appropriate for
situations where you don't care about the historical values of a
dimension and will be reloading all your data when you update the
data warehouse, or for when you are just working on a proof of
concept. In our experience, most applications have an implicit
requirement to maintain history, so you should plan on
accommodating
this in your designs.
Because our control flow for loading a dimension
consists of just copying data from a source to a destination,
simply drag a Data Flow task from the Toolbox onto the surface of
the Control Flow tab, as shown in Figure 4-4. Click the label to
rename the Data Flow task to Load Sales Territories.
Finally, you are ready to build the piece of the
puzzle that you originally set out to do: moving some data into the
data warehouse.
Defining the Data
Flow
Data flow processing is initiated when a Data
Flow task is executed in a control flow. In a Data Flow task, for
one entity in the data warehouse, you define the flow of data from
source to destination and the transforms needed along the way. The
other streams required to populate other tables in the data
warehouse will be handled by other Data Flow tasks. You can apply
as many transformations as needed to prepare the data for the data
warehouse. In this initial example, we just copy a subset of the
columns
from the source to the data warehouse.
|
Now we'll continue on with designing a package
to load a dimension. In this example, we copy a subset of the
columns of the Sales Territory table from our OLTP source system
into a table in our data warehouse.
|
1.
|
Drag a Data Flow task from the Toolbox onto
the surface of the Control Flow tab, as shown in Figure 4-4. Click
the label to rename the Data Flow task to Load Sales
Territories.
|
|
2.
|
Double-click the new Data Flow task to
open
the data flow design surface. Drag an OLE DB Source from the
Toolbox onto the pane and change the name to Sales Territory.
|
|
3.
|
Drag an OLE DB Destination data flow
destination from the Toolbox onto the pane and change the name to
DW SalesTerritory.
|
|
4.
|
Click Sales Territory and drag the green arrow
onto DW Sales Territory. You can't configure the data destination
properly until it has data
flowing
into it. You will notice in the
Sales Territory data flow source that there is a small red circle
with an X in it. This is because it is missing some information.
You can hover the mouse over the X to see what the problem is. We
need to tell it where to find the source data, which table it is
in, and which columns we want to copy.
|
|
5.
|
Double-click the Sales Territory OLE DB Source
to open the OLE DB Source Editor.
|
|
|
|
|
6.
|
Set the OLE DB Connection Manager setting to
SalesTracking.
|
|
7.
|
Set the data access mode to Table or view, and
select [Sales].[Sales Territory] from the list of tables. Click
Columns.
|
|
8.
|
We only want to copy TerritoryID, Name,
CountryRegionCode, and Group, so check those columns. Uncheck the
others.
|
|
9.
|
Double-click the DW Sales Territory SQL Server
destination to open the OLE DB Source Editor.
|
|
10.
|
Set the Connection Manager to
DataWarehouse.
|
|
11.
|
Select [dbo].[dimSalesTerritory] from the list
of tables.
|
|
12.
|
Click Mappings.
|
|
13.
|
Drag and drop each input column onto the
corresponding destination column. The TerritoryID input column is
mapped to the business key, not to the SalesTerritoryKey.
SalesTerritoryKey is an internally generated surrogate key. (The
editor does try to map the columns based on the name, but none of
our names match. If there are any automatic mappings, it is a good
idea to review them.)
|
|
14.
|
Click OK to save the configuration. This would
be a good time to save the whole solution, too.
|
Your data flow should now look like Figure
4-5.
|
Tip: SQL Server Destinations Only Work on a
Local Server
We did not choose the SQL Server Destination in
the Quick Start because it will only work when the database is on
the same server that the package is running on. If you know you
will always be running on the same server, you should definitely
choose the SQL Server Destination, because it is faster than the
OLE DB Destination.
Testing Your
Package
To try out your package, simply click the green
Play button (or press F5, or choose Debug/Start). The data flow
between the source and destination will show you how many rows were
sent to the destination. You should see the task
turn
green when it
has completed. The package is still in debug mode. To continue to
work with the package, you need to stop debugging by pressing
Shift+F5 or choosing Debug/Stop debugging.
You're so excited that it actually worked, you
call the boss over to see it go, and press Play again. After what
seems to be an interminable delay, the source task turns green, but
the destination turns
red.
It
didn't work the second time. You claim it is just the demonstration
effect, but deep down you know something is missing.
Where do you look to see what went wrong? Click
the tab labeled Progress to see a detailed list of the results of
each step, as shown in Figure 4-6. Scroll down to find the row with
an exclamation mark and hover the mouse over the line to view the
entire error message. The message autohides after a short time. You
can right-click it, copy it to the clipboard, and paste it into
Notepad to make it easier to read the message.
In the message, you'll see that there is a
violation of a unique key constraint. The business key is required
to be unique, and that's a good thing. We just tried to load the
same data twice. One of the constraints prevented an accidental
duplication of the dimensional data, but we need to be able to run
this task more than once! We need to change the control flow so
that the table is emptied before we reload.
|
We strongly encourage you to implement
constraints (such as uniqueness) at the beginning of the project.
This is the time when the system is most exposed to data errors, so
why not catch them early? It helped us detect a problem very early
in a recent example in this chapter. You can
spend
a considerable
amount of time wondering why the answers are wrong when it's really
the data in your warehouse that's wrong. How to validate the data
is discussed further in Chapter 7, "Data Quality."
|
Making Your
Package Repeatable
The problem with the task we just created is
that the data in the destination is not overwritten. Instead, we
are inserting additional data even if it's the same data. A simple
fix for this is to delete the data at the destination just before
we insert the new data. All we need to do is issue a Truncate Table
SQL statement. We can do this by modifying the control flow, adding
a simple Execute SQL task before the Load Sales Territory task.
|
In this example, we truncate the destination
table by adding a SQL statement to the control flow. This will
prevent the same rows from being added twice to the table:
|
1.
|
Drag an Execute SQL task onto the Control Flow
tab, above the Data Flow task.
|
|
2.
|
Rename this task Empty Sales Territories.
|
|
3.
|
Double-click the Empty Sales Territories task
on the Control Flow tab to bring up the Execute SQL Task
Editor.
|
|
4.
|
On the General tab of the editor, verify that
the ResultSet property is set to None, the Connection Type is
OLE-DB, and the SQLSourceType is Direct input.
|
|
5.
|
Click in the Connection property and choose
the DataWarehouse connection manager from the list.
|
|
6.
|
In the SQL Statement property, enter the
following SQL command:
TRuncate table
dbo.DimSalesTerritory
|
|
7.
|
Click Parse Query to check for syntax errors.
Note that this does not check to see whether the tables and columns
exist! Your query may still fail at runtime if you made a spelling
mistake in the table name.
|
|
8.
|
Click OK to save the definition of this
task.
|
|
|
|
|
9.
|
Set the execution order of the two tasks by
dragging the green arrow from the Empty Sales Territories task onto
the Load Sales Territories task. This causes the Load Sales
Territories task to wait for the successful completion of the Empty
Sales Territories task before beginning execution. You can
right-click the green arrow and set the condition for starting to
failure of the previous task or any completion status.
Your new control flow will now look like Figure 4-7.
|
|
Now you can bring back the boss and click Play
with confidence. The table will be emptied first, and then
repopulated with the current contents of the source, so there will
be no problem rerunning this task.
Inspecting the
Data Flow
Sometimes, your transforms don't work as you
expected, and the problem is more complex that
forgetting
to empty
a table. If you want to see what data is flowing between two data
flow components, you can add a Data Viewer to monitor the output of
a task. Simply right-click the flow line you want to examine, and
click Add. Choose Grid to show a table of the rows being passed
between the data flow
components
.
The Data Viewer shows you a batch of rows, not a
continuous stream of data. The size of the batch depends on the
size
of the rows. The data flow pauses at the end of a batch,
allowing you to inspect the data in that batch. Press the Play
button (green arrow) to continue.
Completing the
Dimensions
You've now got a general idea how a simple data
flow is built. You can now go ahead and build packages for loading
the rest of the dimensions. Use the same Integration Services
project for these dimensions; just create a new package by
right-clicking the Packages node in the Solution Explorer pane.
After you have the dimensions in place, you can load the facts.
Loading the Fact
Table
Unlike our dimension tables, the source records
for the Shipments facts come from more than one table. The sales
data is in two major tables: the SalesHeader and the SalesDetail
tables. We also need to include the ProductCostHistory table so
that we'll know how much the items cost when they were shipped. The
header contains the business keys for sales-person, territory, the
dates, and the customer. The detail table contains the business
keys for product and special offer, as well as the measures for
quantity shipped, unit price, and discount. To answer questions
such as "Where do we ship each of our products?" and "What was our
cost for the products we shipped?," we need to have columns from
all three source tables in our Shipments fact table.
The other transformation that we need to perform
is to translate the business keys in the incoming source records to
the corresponding surrogate keys used by the dimensions.
Working with
Multiple Source Tables
You have several choices about how to create
this
denormalized
view of your shipments that joins the three
source tables together. You could create a view in the original
data source and directly query this view through the data source in
your data flow. But often, you can't make any changes or additions
to the source database, so you need to have an alternative method
for loading this data. You can either load each table independently
and
denormalize
later with a view, or you use a SQL query in your
data source that joins the tables you need. For demonstration
purposes, we're going to load from multiple tables using a SQL
query into a denormalized table.
In our previous example where we loaded a
dimension, the OLE DB data source Data Access Mode property was set
to Table or view. To specify a query that we will build
ourselves
,
you need to set the Data Access Mode to SQL Command. If you click
the Query Builder button, you will see a
fairly
familiar graphical
query designer. Use the Add Table icon (plus sign over a grid) to
select the three tables we need (SalesOrderHeader,
SalesOrderDetail, and ProductCostHistory) and create the joins by
dragging SalesOrderID from the Header to the Detail table, and
ProductID from the Detail to the CostHistory table. Finally,
qualify which time frame in CostHistory we are referring to by
adding a where clause:
WHERE OrderDate >= ProductCostHistory.StartDate and OrderDate < ProductCostHistory.EndDate
Check off the columns we want to use. (We won't
talk you through checking off each one; you can decide what you
need.) Now you can click OK twice, and you have a multi-table data
stream in your data flow (see Figure 4-8).
Looking Up the
Dimension Keys
The next step is to translate the business keys
in the data stream from our data source into the surrogate keys
used by the dimension tables in the data warehouse. The process is
simple: Using the business key, look up the corresponding row in
the dimension table. Sounds like a join, so why not use a join? We
want to avoid hardcoding any database names in a query. Another
reason is to ensure data quality by ensuring that we have a
matching dimension member. We discuss this in Chapter 7. Another
reason is to avoid
hardcoding
any references to databases on other
servers. You would not be able to change the server name to use the
package in another environment without editing the package. This
only invites maintenance
headaches
.
Integration Services has a Lookup transform that
can translate business keys into surrogate keys, and also meets
both criteria of allowing the dimension table to be in a separate
database from the source fact table, and it will assist us with
data quality. You use one Lookup transform for each business key
you are translating, so you will end up with a series of lookups in
the data flows of the packages where you are populating a fact
table. For each lookup transform, we need to identify which
business key we want to translate into a surrogate key, which
database and table contains the dimension table that maps these two
keys, and finally which column in the dimension table is the
surrogate key we want to return to the data stream.
To configure a Lookup transform to translate
business keys into surrogate keys, drag a Lookup transform from the
Toolbox onto the data flow surface, and connect the source data
stream to it. Rename it to something descriptive such as Lookup
Product. Double-click the transform to edit it. You need to specify
a connection manager that points to the database containing the
reference table, which is one of our dimension tables in the data
warehouse. Next, select the reference table from the Use a table or
view list.
On the Columns tab, specify which columns in the
data flow are to be used to uniquely specify a row in the reference
table. Just drag one or more columns from the available input
columns onto the Available Lookup Columns in the reference table.
This is shown in Figure 4-9.
Finally, check one ore more column from the
Available Lookup Columns that you want to have joining the data
flow. This is usually the surrogate key column in the dimension
table.
A word of caution applies to Lookup transforms.
The matching columns or join criteria is case sensitive, unlike a
join in SQL. You might need to create a derived column to set the
case of the business keys to a known statefor example, all
uppercase.
Add a Lookup transform for each of the other
dimensions using the same pattern we used for the Product
dimension.
To complete the data flow, drag an OLE DB Data
Destination from the Toolbox onto the dataflow surface. Rename this
destination to Sales Data. Reuse the Data Warehouse connection
manager and set the table to your sales data table. (If you haven't
created this table in the data warehouse yet, you can easily do so
by clicking the New button beside the table field.)
You should end up with a data flow that looks
like Figure 4-10.
Loading the Budget
Information from Excel
The previous examples loaded data from
relational tables. Now, let's look at loading data from an Excel
spreadsheet. We want to be able to import the budget figures into
the data warehouse so that we can compare them with the actual
results.
The company creates sales forecasts for each of
the four
quarters
in a year, for each region and product. These
forecasts are at the product level because they are used for plant
capacity planning as well as revenue forecasts.
Budgeting will need its own package. You add a
new package to a project by right-clicking on Packages in the
solution explorer, and choosing New Package. Rename the package to
Load Budgets. You will follow the same pattern of adding a data
flow task to the control flow, along with any other tasks you need,
such as one to empty the budget table in the data warehouse.
For the data flow, you start with an Excel data
source. When you edit the data source and click New to create the
connection manager, you will see that the dialogue is different
than for a database source. You browse to the Excel file containing
your budget, and specify which sheet in the workbook you want to
use.
|
Users like to see some things on a spreadsheet
that don't work well when you are trying to import data. For
spreadsheets that we will be importing into the data warehouse, we
only have a single row of headings because the Excel connection
cannot handle more than one row of headings.
Empty rows left at the bottom of the spreadsheet
are another thing that can cause problems. They will appear as rows
with null values. To get around this, in the Excel data source we
can add a filter "where Region is not null." To implement this, we
needed to switch the data access mode from Table or view to SQL
Command, and build a SQL query by clicking Build.
|
Matching Source
and Destination Data Types
One important thing you need to do is set the
true data types of each column. Excel chooses a type for each
column based on an examination of the first eight rows. That choice
doesn't always match the data warehouse requirements. For example,
numeric columns come in as float, whereas general comes in as
nvarchar(255). We recommend using a Data Conversion transform right
after the Excel data source to set the data types to what you
really need.
Reshaping the
Data
Fact tables, such as the budget forecasts we are
working with, need to have exactly one occurrence of each measure
per row, along with one member of each dimension. This isn't always
the way people like to see the information presented to them on a
spreadsheet.
Looking back at the spreadsheet we are
importing, shown in Figure 4-2, you see that the rows aren't
normalized. Four forecasts for each region and product category
repeat on each row; and the time dimension is not a value in the
row, it is a column name. We really need the forecast on one row
for each quarter. This will allow us to create a cube to compare
the forecasts with the actual revenue. Fortunately, we can use an
Unpivot transform to transpose these repeating fields onto a row of
their own. In our case, one row from the spreadsheet will become
four rows after the
Unpivot
transform.
|
To use the Unpivot transform, you need to
specify which columns should be repeated on each row (passed
through) and which columns are the repeated columns of the row and
should generate a row for each of the columns. You also need to
provide a value for a new column that denotes which column is on
the new row:
|
1.
|
Drag an Unpivot transform onto the data flow
pane, below the Excel source.
|
|
2.
|
Connect the Excel Source to the Unpivot
transform.
|
|
3.
|
Double-click the Unpivot transform to open the
editor to set the configuration.
|
|
4.
|
Check Q1, Q2, Q3, and Q4 for the columns to
unpivot. These names will also be used in the Pivot Values, which
will be an additional column used to identify which column a new
row represents. Think of the names as an additional component in
the compound key for this row. We change them later in Step 7.
|
|
5.
|
Uncheck the Description column from the Pass
Through check box. We don't need the description, because we have
the description code. It was just there to help the users
understand which product they were budgeting for. Columns Q1 to Q4
are automatically passed through to the output stream, on separate
rows.
|
|
6.
|
Set the Destination Column for each row to
Units. This is the name of the new column that the value in Q1 to
Q4 will be placed in. You could use different names for each
column, but that isn't appropriate for normalizing this
spreadsheet.
|
|
7.
|
Set the Pivot Value column to 1, 2, 3, and 4,
corresponding to Q1 to Q4.
|
|
8.
|
Set the Pivot Value Column Name to Quarter.
This column will take on the Pivot Values corresponding to columns
Q1 to Q4 we set in Step 7, based on the column that is on the
current row.
|
|
9.
|
Click OK to save the configuration.
|
|
{% if main.adsdop %}{% include 'adsenceinline.tpl' %}{% endif %}
In the Unpivot transformation, we created a new
column for Quarter. Whenever a transform creates a new column,
check that the column has a data type compatible with the column it
is mapped to in the next step in the data flow. You can check and
set the data type for a new column created in a transform by using
the Input and Output Properties tab on the Advanced Editor for that
transform. Start the Advanced Editor by right-clicking the
transform. The Unpivot transform configuration should look like
Figure 4-11.
We have a couple of small tasks remaining before
we can save the data into the warehouse. We need to use the Lookup
task to find the Territory surrogate key for the region in the
budget, and another Lookup task to find the product record to give
us the product surrogate key, unit price, and unit cost. Finally,
we want to compute the total price and cost for this product for
this region and quarter.
Adding New Columns
for Total Cost and Price
To create the two new columns for the total cost
and total price, we'll use a Derived Column transform. This
transform enables you to specify a name for the new column, define
an expression for the value of the column, and to set the data
type. Drag a derived column transform onto the data flow panel and
give it a name like Compute cost and price.
Double-click the Derived Column transform to
open it for editing. Enter a new column name for the derived
column, such as TotalCost.
Provide the formula for the column. In our
example, drag Budget Units from the available columns into the
Expression field, enter an asterisk (*), and then drag in Product
Cost. Check that the data type is correct. It should be Currency.
You can create multiple new columns in one Derived Column
transform, so just repeat this procedure for BudgetRevenue. Figure
4-12 shows how you would configure a derived column transform to
create the new columns.
Saving the Budget
into the Data Warehouse
The output of the Compute cost and price
transform is now ready to be saved to the data warehouse. This
output contains all the columns we've defined. You have done
something similar when you saved the SalesTerritory dimension into
the data warehouse. All you need to do is drop an OLE DB
Destination onto the data flow, drag the green arrow from the
Derived transform onto the destination, and map the columns from
the stream onto the destination table. Your data flow for budget
loading should now look like Figure 4-13.
Loading Multiple
Sets of Data
In our example, we have individual budget
spreadsheets submitted from each region, so there will many
spreadsheets and we don't want to have to change our package to
specify the name of each spreadsheet. You can use the For Each Loop
Container task to enumerate the file names in a directory that
match a specific pattern. If we establish a convention that all our
budgets will have filenames that start with Budget and have .XLS
extensions, and we put them in a fixed directory, we can easily
load every budget regardless of any other distinguishing
parts
of
the name. For example, the name for the Canadian budget for fiscal
year 2006 would be Budget-CAD-2006.XLS. This fits the pattern and
distinguishes it from the Southwest budget.
Using
Variables
in
a Package
The For Each loop gives us back a list of
filenames; but how do we work with each filename, one at a time? We
need a way to take one filename off the list and pass it to a data
source so that it works on a file from the list, not the one we
coded into the package at design time. This is what variables are
for.
Variables are objects you can store values in
for use in calculations, expressions, and passing information
between tasks and other packages. Variables have scope. Scope
defines how widely known the variable will be. Package scope means
that the variable will be known to every task in the package, and
also inside every data flow invoked by the package. If the Data
Flow tab had been selected rather than the Control Flow tab, the
variable would have had the scope only of the Data Flow task and
would not be visible to the For Each task. This
prevents
accidental
name collisions, but can also cause you a bit of confusion when you
try to find the variable if you have
accidentally
created a
variable with the wrong scope.
|
1.
|
Go to the Control Flow tab, select Variables
from the Integration Services menu to show the Variables window,
and click the Add Variable icon. Give the variable a name, such as
BudgetFileName. Select String as the data type.
|
|
2.
|
You want to specify an initial Value for this
variable that points to a valid Excel spreadsheet so that you can
work with the data flow in design mode. Without a value, the Excel
data source will complain; so, specify a full
path
such as
C:\Budget-CAD-2006.xls.
|
|
3.
|
Drag and drop a new For Each Loop Container
onto the Control Flow tab, and double-click it to configure the
properties.
|
|
4.
|
On the Collection section, specify the folder
name that contains your budget spreadsheets and the filenames to
search for (such as Budget*.xls).
|
|
5.
|
On the Variable Mappings section, select the
variable created in Step 1 (BudgetFileName) from the list and
specify 0 as the Index to map. As the For Each loop iterates
through the filenames, each will be assigned one at a time to the
variable. Click OK.
|
|
|
|
|
6.
|
Drag your existing data flow task on top of
the For Each Loop Container. If the Data Flow task has a
preceding
task, such as one to empty the table, you must disconnect the two
tasks and make the connection to the For Each Loop Container. This
will let Integration Services know that the data flow needs to be
executed for every file that is retrieved in the loop.
|
|
7.
|
You need to configure the Excel connection to
pick up the filename from the variable. Right-click the Excel
Connection Manager in the Connection Managers area at the bottom of
the package designer and open the Properties window. Click the
ellipsis button (...) next to the Expressions property.
|
|
8.
|
Select the ExcelFilePath property and drag the
variable BudgetFileName into the expression area. Click OK.
|
Your control flow for loading the budgets should
now look like what is shown in Figure 4-14.
|
|