Chapter 1: Getting Started


This book is about applications. Specifically, this book is about applying the functionality of SQL Server 2005 Integration Services (SSIS) to help you envision, develop, and implement your data processing needs. The discussions throughout the book spotlight how SSIS can help you accomplish your data integration and processing requirements.

Core to the data processing that SSIS does best is extraction, transformation, and loading (ETL). Over the years, ETL has taken on a range of different meanings, from the general perspective of moving data from somewhere to somewhere else, to the specific application of data warehousing ETL. In fact, ETL has its roots in business intelligence (BI) and data warehouse processing.

This chapter provides important background information on the generalized ETL that DBAs need, as well as basic data warehousing ETL concepts. In addition, this chapter includes a practical review of SSIS functionality and lays the foundation for the book’s examination of how to apply SSIS to your individual data integration and processing requirements.

Choosing the Right Tool for the Job

If you have any inclination toward home remodeling, chances are you enjoy walking through the tools area of your local home improvement store. Hundreds of different tools are manufactured to perform a variety of functions, some of them fairly esoteric.

Any novice handyman can attest to the adage that the right tool for the job makes the job easier. The same concept applies when it comes to handling data. There’s no doubt that, for a given situation, a specific tool may exist to handle that function. Think about all the different types of data processing needs that you have across your organization:

  • Data synchronization between systems

  • Data extraction from ERP systems

  • Ad hoc reporting

  • Replication (both homogeneous and heterogeneous)

  • PDA data synchronization

  • Legacy system integration

  • Vendors and partner data files integration

  • Line of business data

  • Customer and employee directory synchronization

  • Data warehouse ETL processing

As you may know, when it comes to data processing, there are a lot of tools out there. Some are created for specific situations (such as folder synchronization tools), whereas others are designed to perform a variety of functions in different situations. So the question often posed is this: which tool can best meet the business and logical requirements of the task at hand?

Consider the host of tools found in the ever-evolving Microsoft toolset. You can use Transact SQL (TSQL) to hand-code a load, Host Integration Server to communicate with a heterogeneous data source, BizTalk to orchestrate messages in a transactional manner, or SSIS to load data in batches. Each of these tools plays a role in the data world.

Although there can be overlaps, each tool has a distinct focus and target purpose. When you become comfortable with a technology, there’s always the tendency to want to apply that technology beyond its intended “sweet spot” when another tool would be better for the job. You’ve no doubt heard the phrase “when all you have is a hammer, everything looks like a nail.” For example, a C# developer may be tempted to custom-build an application for something that SSIS could accomplish in an hour of development time. The challenge everyone faces is one of time and capacity; there is no way everyone can be an expert across the board. Therefore, developers and administrators alike should be diligent about researching tools and technologies that complement each other in different situations.

For example, many organizations use BizTalk for a host of purposes beyond the handling of business-to-business communication and process workflow automation. These same organizations may be perplexed as to why BizTalk doesn’t scale to meet the needs of the organization’s terabyte data warehousing ETL. The easy answer is that the right tool for bulk BI processing is an ETL tool such as SSIS. In fact, as shown in Figure 1-1, SSIS provides an excellent platform for leveraging its high-performance data pipeline.

Figure 1-1: SSIS high-performance data pipeline

The process outlined in Figure 1-1 may look simple enough, but what SSIS provides is the technology to make that process efficient and scalable, along with the functionality to handle data errors.

This chapter reviews ETL concepts in more detail, and gets you started with an SSIS example. Before you dive into the expert-level details found in the ensuing chapters, this reminder of ETL concepts and SSIS features will help solidify the background needed for the application of SSIS.

Be Careful About Tool Selection

In some client environments, an ETL tool may be chosen without consideration for the availability of industry skills, support, or even the learning curve. Even though the tool could perform “magic,” it usually doesn’t come with a pocket magician, just the magic of emptying your corporate wallet. In many cases, thousands of dollars have been spent to purchase an ETL tool that takes too long to master, implement, and support. Beyond the standard functionality questions you should ask about a tool, be sure to also consider the following:

  • Your internal skill sets

  • The trend of industry use of the tool

  • How easy it is to learn

  • The ease of supporting the tool

This book focuses on the three most common categories of SSIS usage:

  • Data warehouse ETL

  • Data integration

  • SSIS administration

Before going any further, it makes sense to consider the purpose and background of each of these types of ETL.

Data Warehousing ETL

Some of you may be well-versed in data warehousing and related ETL concepts, but for those who are not, here is a high-level overview of data warehousing. Data warehousing focuses on decision support, or enabling better decision making through organized accessibility of information. As opposed to a transactional system such as a point of sale (POS), Human Resources (HR), or customer relationship management (CRM) system that is designed to process rapid transactions and capture information, a data warehouse is tuned for reporting and analysis. In other words, instead of focusing on the entry of information, data warehousing is focused on the extraction and reporting of information to show trending, summary, and data history.

Databases designed for data warehousing are created in a structure called a dimensional model, which involves two types of tables. Dimension tables hold informational data or attributes that describe entities. Fact tables capture metrics or numeric data that describe quantities, levels, sales, or other statistics. A data warehouse may involve many dimension tables and fact tables. Figure 1-2 shows the relationships between several dimension tables and one fact table in a structure often called a star schema.
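To make the dimensional model concrete, here is a minimal, illustrative star schema built with Python’s sqlite3 module. The table and column names (DimProduct, DimDate, FactSales, and so on) are invented for the example rather than taken from any particular warehouse, but the shape is the same as in Figure 1-2: descriptive attributes live in the dimension tables, and numeric measures live in the fact table, which references the dimensions by key.

```python
import sqlite3

# Hypothetical star schema; names and data are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DimProduct (
        ProductKey  INTEGER PRIMARY KEY,
        ProductName TEXT,
        Category    TEXT
    );
    CREATE TABLE DimDate (
        DateKey  INTEGER PRIMARY KEY,
        FullDate TEXT
    );
    CREATE TABLE FactSales (
        ProductKey  INTEGER REFERENCES DimProduct(ProductKey),
        DateKey     INTEGER REFERENCES DimDate(DateKey),
        SalesAmount REAL,
        Quantity    INTEGER
    );
""")
conn.execute("INSERT INTO DimProduct VALUES (1, 'Widget', 'Hardware')")
conn.execute("INSERT INTO DimDate VALUES (20051031, '2005-10-31')")
conn.execute("INSERT INTO FactSales VALUES (1, 20051031, 19.99, 2)")

# Reporting queries join the fact table to its dimensions and aggregate.
row = conn.execute("""
    SELECT p.Category, SUM(f.SalesAmount)
    FROM FactSales f
    JOIN DimProduct p ON p.ProductKey = f.ProductKey
    GROUP BY p.Category
""").fetchone()
print(row)  # ('Hardware', 19.99)
```

Notice that the fact table itself holds no descriptive text, only keys and measures; that separation is what makes the star schema efficient to aggregate and simple to query.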

Figure 1-2: Star schema

The focus of this book is not on the design of the dimension tables and fact tables, but rather on getting data into these structures from other repositories. Processing ETL for data warehousing involves extracting data from source systems or files, performing transformation logic on the data to correlate, cleanse, and consolidate, and then loading a data warehouse environment for reporting and analysis (see Figure 1-3).

Figure 1-3: The ETL system
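The extract, transform, and load steps shown in Figure 1-3 can be sketched in a few lines of Python. This is only a conceptual sketch, with invented sample rows and cleansing rules; in SSIS the equivalent work happens in source adapters, data flow transformations, and destination adapters.

```python
# Conceptual ETL flow; the source rows and rules are invented for the example.
source_rows = [
    {"customer": " alice ", "amount": "120.50"},
    {"customer": "BOB",     "amount": "75.00"},
    {"customer": " alice ", "amount": "120.50"},   # duplicate row
]

def extract(rows):
    # A real system would read from a source database or flat file.
    yield from rows

def transform(rows):
    seen = set()
    for row in rows:
        name = row["customer"].strip().title()     # cleanse the attribute
        key = (name, row["amount"])
        if key in seen:                            # consolidate duplicates
            continue
        seen.add(key)
        yield {"customer": name, "amount": float(row["amount"])}

def load(rows, warehouse):
    # A real system would bulk-insert into the warehouse fact table.
    warehouse.extend(rows)

warehouse = []
load(transform(extract(source_rows)), warehouse)
print(warehouse)
# [{'customer': 'Alice', 'amount': 120.5}, {'customer': 'Bob', 'amount': 75.0}]
```

The generator-based chaining mirrors what an ETL pipeline does: rows stream from extraction through transformation to the destination without being fully staged in between.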

For those who are already versed in ETL concepts and practice, you’ll know that when it comes to developing a data warehouse ETL system, moving from theory to practice often presents the biggest hurdle. Did you know that ETL typically takes up between 50 and 70 percent of a data warehousing project? That is quite a daunting statistic. What it means is that even though presenting the data is the end goal and the driving force for business, the largest portion of developing a data warehouse is spent not on the presentation and organization of the data, but in the behind-the-scenes processing to get the data ready.

Data Integration

You can also use SSIS to synchronize data between systems, or to replicate data. For example, you may want to create a business-to-business portal site, and you may need the site to interface with the source data on the mainframe. In this case, you may get the data delivered in nightly extracts from the mainframe and load it into your SQL Server table. Another very common ETL task that DBAs face is receiving files from File Transfer Protocol (FTP) servers (or on network shares) that must be processed and loaded into another system. This type of process involves moving files and then processing the data, which may involve de-duping, combining files, cleaning bad data, and so on.
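The file-processing scenario above, combining several delivered files and de-duplicating the rows, can be sketched as follows. The file contents here are in-memory stand-ins for extracts pulled from an FTP server or network share, and the "last file wins" rule for duplicates is an assumption made for the example.

```python
import csv
import io

# Stand-ins for two extract files delivered on successive nights.
file_a = "id,email\n1,a@example.com\n2,b@example.com\n"
file_b = "id,email\n2,b@example.com\n3,c@example.com\n"

# Combine the files and de-dupe on the id column.
combined = {}
for content in (file_a, file_b):
    for row in csv.DictReader(io.StringIO(content)):
        combined[row["id"]] = row     # later delivery wins on duplicate ids

print(sorted(combined))  # ['1', '2', '3']
```

In SSIS this same pattern would typically be handled with a Foreach Loop over the delivered files feeding a data flow, but the logic, merge then de-dupe on a key, is the same.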

Part of the job of integrating data may include data extraction. Data extraction is moving data out of a source, and, although it sounds easy enough, the challenges include extracting only the changes and optimizing the extraction so it scales. Chapter 3 provides more information on applying SSIS to data sources and extraction techniques.
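One common way to extract only the changes is a high-water-mark approach: remember the latest modified timestamp from the previous run and pull only rows changed since then. The sketch below assumes the source rows carry a modified timestamp; the data and column names are hypothetical.

```python
# Hypothetical source rows with a last-modified timestamp per row.
rows = [
    {"id": 1, "modified": "2005-10-01T08:00:00"},
    {"id": 2, "modified": "2005-10-02T09:30:00"},
    {"id": 3, "modified": "2005-10-03T11:15:00"},
]

def extract_changes(rows, last_watermark):
    # Pull only rows changed since the last successful extract,
    # then advance the watermark for the next run.
    changed = [r for r in rows if r["modified"] > last_watermark]
    new_watermark = max((r["modified"] for r in changed), default=last_watermark)
    return changed, new_watermark

changed, watermark = extract_changes(rows, "2005-10-01T23:59:59")
print([r["id"] for r in changed], watermark)
# [2, 3] 2005-10-03T11:15:00
```

The watermark itself would be persisted between runs (for example, in a control table) so that each extract resumes where the last one left off.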

In addition, if you think you have a perfect data source, think again! Chances are you will be dealing with missing data, mistyped data, NULL values, and just plain dirty data. Refer to Chapters 5 and 7 to learn how to handle real-world data situations. Data quality issues span a range of challenges, and you will need to plan your data cleansing to ensure you can accommodate your final goal of data processing.
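As a taste of the kind of cleansing logic Chapters 5 and 7 cover in depth, the sketch below applies a few typical rules: trimming whitespace, treating empty strings and the literal 'NULL' as missing, and repairing a mistyped numeric field. The rules and field names are illustrative assumptions, not a prescription.

```python
# Illustrative cleansing rules; the fields and defaults are invented.
def cleanse(record):
    cleaned = dict(record)
    # Trim strings and treat '' and 'NULL' as missing values.
    for key, value in cleaned.items():
        if isinstance(value, str):
            value = value.strip()
            if value.upper() in ("", "NULL"):
                value = None
        cleaned[key] = value
    # Repair a mistyped numeric field where possible.
    try:
        cleaned["amount"] = float(cleaned["amount"])
    except (TypeError, ValueError):
        cleaned["amount"] = None    # in practice, route to an error table
    return cleaned

print(cleanse({"name": "  Acme Corp ", "amount": "12.5", "region": "NULL"}))
# {'name': 'Acme Corp', 'amount': 12.5, 'region': None}
```

In an SSIS data flow, rows that fail rules like these would typically be redirected to an error output rather than silently nulled, so nothing is lost and bad data can be reviewed.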

SSIS Administration

If you are a DBA responsible for SSIS packages (whether you created them or were given responsibility), then chances are you will have the responsibility of monitoring the package execution and ensuring that the transactions are correctly implemented to keep the database in a consistent state. Some of you will also be responsible for package deployment with a team of people, and securing packages so that only the right people have the ability to access and execute packages.

Although not every environment will be upgrading from SQL Server 2000 DTS, many DBAs do face this situation. DTS adoption was broad because of its ease of use and execution. And now that SSIS is here and proven, you may need to take your packages and move them to SSIS. This is easier said than done, given the architectural changes in the products. Chapter 11 provides more information on DTS migration to help get you there without losing your mind.

Yet another aspect of SSIS is database administration. In fact, in SQL Server 2005, SSIS is also used to help manage your database environment. For example, SQL Server maintenance plans (such as database backups, index defragmentation, database consistency checking, and so on) use SSIS behind the scenes to coordinate the administration operations.

Optimizing and scaling SSIS is another common responsibility for both DBAs and developers alike. Chapter 12 targets scaling SSIS, including ways to optimize destinations and how to take advantage of SSIS functionality to make faster, more scalable packages.



Expert SQL Server 2005 Integration Services (Programmer to Programmer)
ISBN: 0470134119